Summary:
Strides appear to cause a huge memory regression in some of our internal
training workflows. This diff stems the bleeding, while we figure out exactly
what happened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12236
Reviewed By: dzhulgakov
Differential Revision: D10134319
fbshipit-source-id: 1547c89a65c05473c409c0977c19c99dcaefb89c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12122
We are deprecating support for C extensions. Please use C++ extensions in the future.
Reviewed By: Yangqing
Differential Revision: D10060541
fbshipit-source-id: 4f7149e06a254bd7af463fd7aa9740f65369963a
Summary:
Original commit changeset: f5614a5d2607
D9986213 is causing Multifeed Aggregator a [huge performance difference](https://our.intern.facebook.com/intern/ads/analyze_canary/412951953278781781/) and is blocking the aggregator push since last Friday night: https://fburl.com/feedtools/b6izvwjz
We need to land this revert ASAP to unblock aggregator push.
Reviewed By: orionr
Differential Revision: D10123245
fbshipit-source-id: d83da8e00a1250f5d09811a0a587c127e377aab2
Summary:
Allow trailing commas in assign statements or tuples, which also allows single element tuples.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11723
Differential Revision: D10052162
Pulled By: eellison
fbshipit-source-id: 344d908a3ad942a23ebd9f341794bc9734226aa8
Summary:
We're waiting for the libtorch links to show up on the website. I had a fake link in the docs so far which is misleading. This PR changes it to a temporary markdown file until the web people fix the site tomorrow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12212
Differential Revision: D10121872
Pulled By: goldsborough
fbshipit-source-id: f1bd1315f7333b9168e99983f3f6b679c9b0c52a
Summary:
- fix https://github.com/pytorch/pytorch/issues/12120
- add `torch.argsort`, `torch.pdist`, `broadcast_tensors` to *.rst files
- add parameter dim to `torch.unique` doc
- fix table and args for `torch.norm`
- test plan: make html and check docs in browser
gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12126
Differential Revision: D10087006
Pulled By: weiyangfb
fbshipit-source-id: 25f65c43d14e02140d0da988d8742c7ade3d8cc9
Summary:
Currently when tracing optimizations are performed twice. This means that optimizing passes, like the fusion pass, are also called twice. This is unnecessary and this PR turns off optimizations when checking the trace (since the trace is independent of optimizations). This should improve performance and debugging.
Thanks to apaszke, who proposed this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12172
Reviewed By: ezyang
Differential Revision: D10109250
Pulled By: apaszke
fbshipit-source-id: 8b3385eae143446820f1b61ca7576d7c07f9b248
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11955
Adding a `filter<T>(NNModule)` function to easily get inputs/outputs of a DAI-style NNModule.
Reviewed By: duc0
Differential Revision: D9997696
fbshipit-source-id: 818c4f2e3093e0d02b35e6632b426e8d3189c21e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11954
Adding an alternative to external_input and external_output for use in some distributed settings
Reviewed By: aazzolini
Differential Revision: D9997121
fbshipit-source-id: 1b5cc03fd3051368a3edc69e7bc472386f5746b5
Summary:
In current upstream float scalars are always written into kernels with:
`out << std::scientific << v << "f";`
When the floats are special values like NaN, +inf, or -inf this produces nonsense that causes compilation to fail. This fix updates the conversion of float scalars to device-specific special values. The appropriate macros are added to the CPU and CUDA resource strings. Note that a NAN macro was not necessary on the CPU since math.h defines NAN.
To verify this fix I updated the test_clamp_fusion test in test_jit.py. I wanted to test -inf, too, but -inf is not currently accepted by the interpreter.
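For reference, a minimal sketch of the emission logic described above (the infinity macro names here are illustrative; the actual names come from the CPU/CUDA resource strings this fix touches):
```cpp
#include <cmath>
#include <sstream>
#include <string>

// Emit a float scalar for a generated kernel: finite values keep the old
// std::scientific path, while non-finite values map to macros defined in the
// resource strings so the generated code still compiles.
std::string emitFloatScalar(float v) {
  std::ostringstream out;
  if (std::isnan(v)) {
    out << "NAN";  // math.h already provides NAN on the CPU side
  } else if (std::isinf(v)) {
    out << (v > 0 ? "POS_INFINITY" : "NEG_INFINITY");  // illustrative macro names
  } else {
    out << std::scientific << v << "f";
  }
  return out.str();
}
```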
Edit:
Forgot to mention, this partially addresses issue #12067.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12070
Reviewed By: ezyang
Differential Revision: D10044704
Pulled By: soumith
fbshipit-source-id: 8f4a930862d66a7d37d985e3f6a6fb724579e74c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11891
maybe_wrap_dim is a slightly more general function, which is able
to, under some circumstances, treat 0 as a "valid" dimension even
when the tensor is a scalar. canonical_axis_index_ never accepts
this behavior, so it always passes false.
Reviewed By: jerryzh168
Differential Revision: D9968320
fbshipit-source-id: 13c98fff0880d7bfcd00911a76c8aa10d37bd183
Summary:
This allows us to call the type argument with name other than `self_ty`. ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11903
Differential Revision: D10105029
Pulled By: SsnL
fbshipit-source-id: 0fbdc728123ebc1154d080628cb41a085ba3e6d7
Summary:
This functionality replaces the Scalar-Tensor builtin operators
with builtin functions.
Builtin functions are used in place of operators where one operator
can be defined as a composition of others. This simplifies later
optimization passes by allowing us to have fewer operators.
In the future, builtin functions can be used for other purposes.
For example, we can define derivative functions as code rather than
building graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12141
Reviewed By: ezyang
Differential Revision: D10088065
Pulled By: zdevito
fbshipit-source-id: a2acb06346e649c4c8a2fe423b420871161c21cf
Summary:
This fixes a broadcasting error with the `StudentT` distribution
- [x] added a regression test
- [x] strengthened parameter broadcasting tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12148
Differential Revision: D10099226
Pulled By: soumith
fbshipit-source-id: 0c5eb14180d158f8fff28ceb9e7cd3471c2bb803
Summary:
goldsborough Modify the docs to match the changes made in #4999
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12158
Differential Revision: D10103964
Pulled By: SsnL
fbshipit-source-id: 1b8692da86aca1a52e8d2e6cea76a5ad1f71e058
Summary:
This PR replaces the use of `std::FILE` with `istream`/`ostream` for JIT serialization.
It uses this mechanism to add the possibility to serialize to/from binary buffers, in addition to files, both in `libtorch` and from Python.
`getExportImportCopy` in `test_jit.py` has been updated so that both file and buffer codepaths are exercised during tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11932
Differential Revision: D10084303
Pulled By: apaszke
fbshipit-source-id: b850801b3932922fa1dbac6fdaed5063d58bc20d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11349
Special case BatchGather and BatchGatherGradient for block_size=1. This makes BatchGather 3-4X faster and BatchGatherGradient 10X for this case.
Reviewed By: jspark1105, ilia-cher
Differential Revision: D7218043
fbshipit-source-id: ea12042239a8adc92b9efcbd0b66e354fb43f4c7
Summary:
This PR implements the design that we discussed. Changes:
- Added a World token IValue and type. The IValue is basically a dummy struct for now, in the future we may extend it (say, add thread-local state).
- Effectful ops explicitly declare they are mutable by having World tokens as inputs and outputs in their schema.
- Purely functional ops that use mutable values will get "fenced" and the world token will be threaded through the fences
- AnnotateEffects pass which wires up all the world tokens together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10700
Reviewed By: eellison
Differential Revision: D9547881
Pulled By: michaelsuo
fbshipit-source-id: ebbd786c31f15bf45e2ddb0c188438ff2f5f3c88
Summary:
Previously, doRead/doWrite were functions that could return partial reads/writes,
and we checked for this case inconsistently in the call sites of serialization.cpp.
Now, these functions do NOT return the amount of bytes read/written, and instead
handle the necessary checking loop themselves.
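As a sketch of the pattern (this is not the actual serialization.cpp helper, just the shape of the checking loop that now lives inside doRead/doWrite):
```cpp
#include <cstdio>
#include <stdexcept>

// Keep reading until the requested number of bytes has arrived, instead of
// returning a partial count and hoping every caller remembers to loop.
void readFully(std::FILE* f, void* buf, size_t nbytes) {
  auto* dst = static_cast<char*>(buf);
  while (nbytes > 0) {
    size_t got = std::fread(dst, 1, nbytes, f);
    if (got == 0) {
      throw std::runtime_error("unexpected end of file while reading");
    }
    dst += got;
    nbytes -= got;
  }
}
```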
Fixes #12042. Maybe.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12143
Differential Revision: D10097027
Pulled By: ezyang
fbshipit-source-id: fd222ab8a825bed352153648ad396acfe124a3e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12090
This is a slight pessimization because we need to do a
full recompute of is_contiguous(), even though a modification
of dim-0 is guaranteed to preserve contiguity.
Reviewed By: jerryzh168
Differential Revision: D10050905
fbshipit-source-id: b99233e21c9f4275b0db6e76740462e5430ce152
Summary:
Turn negative indices positive for reduce sum, since caffe2 does not support them. Also fix the GE/LE symbolic operand order, which was wrong.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12123
Reviewed By: houseroad
Differential Revision: D10095467
Pulled By: wanchaol
fbshipit-source-id: eb20248de5531c25040ee68b89bd18743498138d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11971
- Switched TensorImpl::data<T>() to use Storage::unsafe_data<T>() to work
around an outstanding bug in the Storage::data<T>() implementation
where it only works on Ts which are valid ScalarType
- Qualify a bunch of identifiers which still live in caffe2:: namespace
- strides returns an IntList now
- s/update_strides/update_to_contiguous_strides/
- Correctly compute type_id_ for the Storage only constructor from Caffe2.
This is special cased to only work for CPU and CUDA dense tensors.
- Fix some signed-unsigned comparisons in Caffe2 code (OSS build for
ATen/core has more restrictive warning tests.)
Reviewed By: jerryzh168
Differential Revision: D9995559
fbshipit-source-id: 9c74032e011189e1c7e9a98d20f2bd1e25ad2e5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11970
Adds an ATen-core-headers target, which caffe2_cpu_internal depends
on, and makes ATen-core depend on caffe2_headers. If you link against
ATen-core, you must ALSO link against caffe2_cpu_internal; if you
link against caffe2_cpu_internal, you must ALSO link against ATen-core,
otherwise you'll have undefined symbols.
Then, we merge the template data<T>() method with the Caffe2 implementation,
demonstrating that includes to Caffe2 (core) from ATen/core are working.
Reviewed By: jerryzh168
Differential Revision: D9967509
fbshipit-source-id: 3d220c38b2c3c646f8ff2884fdcc889fa9276c7a
Summary:
Export symbols for pybind and other libs after caffe2 rebase
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11975
Differential Revision: D10042615
Pulled By: yinghai
fbshipit-source-id: 6de562d99403099113093716834abc51bf726e94
Summary:
The speed-up of a single operation is up to 6X on BDW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11686
Reviewed By: yinghai
Differential Revision: D9828129
Pulled By: wesolwsk
fbshipit-source-id: 7dbacea90609e18438f6fe1229c641937d0696c8
Summary:
As reported in Issue #10726, the jit compiler, when running on ppc64le, may produce an isomorphic output but fail a diff test against the expected output file. The expected output file is created from a test that was run on x86_64. This ensures that if ppc64le test output is different, the output is instead compared to an expected output file created when the test is run on a ppc64le system.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11579
Differential Revision: D10080890
Pulled By: soumith
fbshipit-source-id: 7249bf6b5dfa7c853368a3688a982bc9ed642bc9
Summary:
This does 7 things:
- add c10/util/Registry.h as the unified registry util
- clean up some APIs such as the export condition
- fully remove aten/core/registry.h
- fully remove caffe2/core/registry.h
- remove a bogus aten/registry.h
- unify all macros
- set up registry testing in c10
Also, an important note that we used to mark the templated Registry class as EXPORT - this should not happen, because one should almost never export a template class. This PR fixes that.
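For context, a minimal usage sketch, assuming the unified macros keep the same shape as the old caffe2 registry macros (the Animal/Dog types are illustrative):
```cpp
#include <c10/util/Registry.h>
#include <memory>

struct Animal {
  virtual ~Animal() = default;
  virtual const char* sound() const = 0;
};

// Declare and define a registry mapping string keys to Animal factories.
C10_DECLARE_REGISTRY(AnimalRegistry, Animal);
C10_DEFINE_REGISTRY(AnimalRegistry, Animal);

struct Dog : Animal {
  const char* sound() const override { return "woof"; }
};
C10_REGISTER_CLASS(AnimalRegistry, Dog, Dog);

int main() {
  std::unique_ptr<Animal> a = AnimalRegistry()->Create("Dog");
  return (a && a->sound()) ? 0 : 1;
}
```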
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12077
Reviewed By: ezyang
Differential Revision: D10050771
Pulled By: Yangqing
fbshipit-source-id: 417b249b49fed6a67956e7c6b6d22374bcee24cf
Summary:
Apply weight decay for Adam in-place instead of via copy.
Synced offline with soumith, who mentioned that it should be OK. This is also consistent with other optimizers, e.g. eee01731a5/torch/optim/sgd.py (L93)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12107
Reviewed By: soumith
Differential Revision: D10071787
Pulled By: jma127
fbshipit-source-id: 5fd7939c79039693b225c44c4c80450923b8d673
Summary:
You can't GetDeviceType an undefined tensor, so test for this case
first. This allows you to safely move tensors out of blobs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12125
Reviewed By: smessmer
Differential Revision: D10080075
Pulled By: ezyang
fbshipit-source-id: bb99b089b6daa9d4db99015208f939d7ce4d4a79
Summary:
Migrate all tests in aten to use gtest, except for basic.cpp.
Since the features of gtest differ from Catch, some of the tests have been rewritten with similar meaning.
The basic test has a version conflict with valgrind according to CI, so that test case still uses Catch.
It will be resolved by a different PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11846
Differential Revision: D10080860
Pulled By: zrphercule
fbshipit-source-id: 439d4cf33fb6ccbe79b797860342853c63e59081
Summary:
I wrote some high level docs for the larger PyTorch C++ universe and the C++ frontend specifically. Happy for reviews, but let's please also land this ASAP so I can point users at something that looks more ready baked than the C++ docs landing page (https://pytorch.org/cppdocs) does right now.
ezyang soumith
CC ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12079
Differential Revision: D10080785
Pulled By: goldsborough
fbshipit-source-id: 3028de41373f307468eb1e3802aa27871c93b2e3
Summary:
We generate specialized list operations for int, float, and Tensor lists so that small lists of integers like the arguments to conv do not involve tons of boxing code.
This PR adds a fallback GenericList for List types that contain any other type. It does so by adding type variables to `jit::Type`, and machinery for matching/replacing the type variables during `tryMatchSchema` and operator lookup.
It also modifies the builtin list ops to include a fallback that works on a GenericList object that simply holds IValues. This is distinguished from IValue's tuple type so that conversion to/from Python still happens losslessly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12040
Differential Revision: D10037098
Pulled By: zdevito
fbshipit-source-id: 0c5f2864d12e7d33554bf34cc29e5fb700dde150
Summary:
An efficient atomicAdd for halfs has been added in `cuda_fp16.h` in CUDA 10:
```
__CUDA_FP16_DECL__ __half atomicAdd(__half *address, __half val);
```
Through this change, PyTorch will be able to utilize efficient atomicAdd when building with CUDA 10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12108
Differential Revision: D10053385
Pulled By: soumith
fbshipit-source-id: 946c90691a8f6bdcf6d6e367a507ac3c9970b750
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12065
Had compilation issues using CAFFE_DURATION in some contexts, specifically due to namespace resolution. Since this is a macro, it should fully qualify the names it uses.
Reviewed By: heslami
Differential Revision: D10036132
fbshipit-source-id: b8d55dfe5e991ca702ce5b7483f0ffc699882c85
Summary:
Couple questions:
1) I used the log1p implementation in #8969 as a guide, especially for testing. I'm not sure what the `skipIfROCM` annotation is for, so I'm unsure if I need it for my test.
2) I implemented the branching logic in the narrow function itself; is this the right place to do so? I noticed that there are a number of places where sparse-specific logic is handled with just an if statement in this file. Or should I implement a separate dispatch in native_functions.yml as in log1p?
And of course, happy to make any other updates/changes that I may have missed as well. This is my first PR to the project.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11342
Differential Revision: D9978430
Pulled By: weiyangfb
fbshipit-source-id: e73dc20302ab58925afb19e609e31f4a38c634ad
Summary:
The earlier tests had around 80 warnings; now there are 6 warnings, and those are due to the JIT.
The changes remove the wrapping of a Tensor by a Tensor constructor, which emits warnings due to the changes in https://github.com/pytorch/pytorch/pull/11061.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12038
Differential Revision: D10033392
Pulled By: apaszke
fbshipit-source-id: b1faf368e650d062d7983f9932511bee4702a893
Summary:
This unifies our versions across setup.py, libtorch, and libcaffe2. CMake has a default version (bumped to 1.0.0) that can be overridden by setup.py. The versions are also printed as a part of cmake/Summary.cmake to make sure they are correct.
cc Yangqing ezyang soumith goldsborough pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12053
Differential Revision: D10041878
Pulled By: orionr
fbshipit-source-id: a98a01771f6c008d1016ab63ab785c3a88c3ddb0
Summary:
This PR does a few things:
Previously test_jit.py only tested autograd on backward graphs.
This is because we borrow from test_autograd and construct graphs with a small
number of nodes. Because the number of nodes is small (typically 1-2), those graphs
do not end up containing autodiff subgraphs, so autodiff never gets tested.
This PR enables autodiff testing by doing the following:
- added disableDebugAutodiffSubgraphInlining fn to graph_executor to disable
autodiff subgraph inlining.
- (implementation) added autodiffSubgraphNodeThreshold and autodiffSubgraphInlineThreshold.
These are set to their default values (2, 5) but disableDebugAutodiffSubgraphInlining()
sets both to 1, disabling subgraph inlining and allowing 1-node autodiff subgraphs.
- The relevant backward jit tests disable autodiff subgraph inlining so they
will test the autodiff versions of the operators instead of autograd whenever
an autodiff variant exists.
- We don't run the tests that do inline autodiff subgraphs anymore.
This has no impact on testing correctness because the assumption is
that autograd functions are correct and are tested in test_autograd.py
This allows the graph fuser to work better because a lot of these ops were previously not autodiff-compatible but fusible. As a more concrete example, LSTM backward contains a lot of tensor-scalar operations; these autodiff formulas help its double backward pass.
Included:
- arithmetic overloads
- abs, acos, asin, atan, ceil, cos, cosh, exp, expm1, floor, fmod, frac, log, log10, log1p, log2, reciprocal, remainder, round, sin, sinh, tan, trunc, rsqrt
TestJitGenerated tests autodiff for all of the added operations.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11832
Differential Revision: D10031256
Pulled By: zou3519
fbshipit-source-id: 9daf9900a5ad187743609cd0fbbd10b15411ad93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11548
This removes getting/setting the DestroyCall of a Blob,
paving the way to removing DestroyCall from Blob entirely and using the destructor stored in TypeMeta instead.
Use sites have been fixed in diffs stacked below this.
Reviewed By: dzhulgakov
Differential Revision: D9775191
fbshipit-source-id: 97d72d0c62843849057f295c27f391e63c99c521
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11414
caffe2::Blob can be stored in an IValue. This is a precondition for caffe2 to switch from Blob to IValue.
Reviewed By: ezyang
Differential Revision: D9731326
fbshipit-source-id: 462a39d2d9ab6f85b99b1670848c6976a3de417c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11924
Previous diffs removed Blob -> caffe2 dependencies; now we can move it to ATen/core.
This is pre-work for allowing storing Blob in IValue.
Reviewed By: ezyang
Differential Revision: D9980641
fbshipit-source-id: 32082a673ec94c42c20b2298adced8bb7ca94d07
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12021
TestPilot runs stress tests in parallel. These fail for serialized tests because extracting (and subsequent deletion) of binary data during the process isn't threadsafe. Extract zips into tempfile to avoid this problem.
Also remove some accidentally checked in zips of a test that we didn't end up including for now.
Reviewed By: houseroad
Differential Revision: D10013682
fbshipit-source-id: 6e13b850b38dee4106d3c10a9372747d17b67c5a
Summary:
TSIA. Right now we should basically use C10_EXPORT and C10_IMPORT for explicitly marking dllexport and dllimport, as a continued effort of the C10 unification.
This is a codemod by mechanically doing the following change:
CAFFE2_{EXPORT,IMPORT} -> C10_{EXPORT,IMPORT}
AT_CORE_{EXPORT,IMPORT} -> C10_{EXPORT,IMPORT}
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12019
Reviewed By: ezyang, teng-li
Differential Revision: D10016276
Pulled By: Yangqing
fbshipit-source-id: a420d62c43d1110105fc88f9e9076e28a3203164
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12058
Methods on TensorImpl have to be written very carefully, because
when you have a VariableImpl subclass of TensorImpl, usually the
local fields on the TensorImpl are not valid; instead, you have to
forward to the "wrapped" tensor. Functions which are virtualized
are probably handled correctly by Variable, but functions which
are NOT cannot be handled correctly and shouldn't be called if you
have a Variable. This diff adds checks to determine if this is
the case or not.
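An illustrative sketch of the guard pattern (not the real TensorImpl, just the idea: non-virtual accessors read local fields, which are meaningless on a Variable wrapper, so they check they are not being called on one):
```cpp
#include <cassert>
#include <cstdint>

struct TensorImplSketch {
  explicit TensorImplSketch(bool is_variable) : is_variable_(is_variable) {}
  virtual ~TensorImplSketch() = default;

  // Virtual: a Variable subclass can override this and forward to its wrapped tensor.
  virtual int64_t dim() const { return dim_; }

  // Non-virtual: reads a local field, so it must never run on a Variable.
  int64_t numel() const {
    assert(!is_variable_ && "numel() reads a local field; invalid on a Variable");
    return numel_;
  }

 protected:
  bool is_variable_ = false;
  int64_t dim_ = 0;
  int64_t numel_ = 0;
};
```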
Reviewed By: jerryzh168
Differential Revision: D10034589
fbshipit-source-id: 650b2036ca9a044c0ab4abdf6f825521a64e1fc2
Summary:
This PR establishes a baseline so that we can build IDEEP ops in the new work flow. From this baseline, we need to
- Merge the CMakefile of MKLDNN from caffe2 and Pytorch
- Get rid of `USE_MKL=ON`.
Build command from now on:
```
EXTRA_CAFFE2_CMAKE_FLAGS="-DUSE_MKL=ON -DINTEL_COMPILER_DIR=/opt/IntelComposerXE/2017.0.098" python setup.py build_deps
```
gujinghui
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12026
Differential Revision: D10041199
Pulled By: yinghai
fbshipit-source-id: b7310bd84a494ac899d8e25da368b63feed4eeaf
Summary:
LLVM trunk emits an error diagnostic when attempting to compile caffe2. The
identifiers following the `template` keywords are not templates, so the use of
the keyword does not make sense in this context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12037
Reviewed By: ezyang
Differential Revision: D10024531
Pulled By: modocache
fbshipit-source-id: da4b9ba405d9f7fd633ab8c1a61c77da9c1a1f89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12036
Sometimes you have a TypeIdentifier, and no way to get to
the TypeMeta. Still nice to be able to read out the name.
This should be obsoleted by smessmer's patches.
Reviewed By: gchanan, mingzhe09088
Differential Revision: D10024554
fbshipit-source-id: 42cdceefd5c59be0441254665f66f5edc829f422
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11902
Previously, they were going through THTensor_getStoragePtr which
incurred a null pointer check on storage. Now they use unsafe_data
method which doesn't do this check.
I don't know if this actually makes things go faster, but I get
an added bonus of reducing code duplication, so we should take
this change anyway :)
Reviewed By: SsnL
Differential Revision: D9977654
fbshipit-source-id: f45c74828213a0439480755ad0b2d7f8858cb327
Summary:
Users generally expect ./configure to find libraries
installed in /usr/local and /usr, so search for nccl
there too.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12063
Differential Revision: D10036248
Pulled By: ezyang
fbshipit-source-id: d331ddd2ccc8ac9846fb54222db284b1ec371659
Summary:
The gpu_unary_kernel function was not handling arrays that
cannot use 32-bit indexing. This function was only called directly
by CUDA division by a scalar. Other arithmetic operations go through
gpu_binary_kernel, which already properly handled large arrays.
This bug sometimes manifested as a crash and sometimes as an incorrect
answer.
Fixes #11788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12023
Differential Revision: D10034017
Pulled By: colesbury
fbshipit-source-id: b17300f327de54035746bf02f576766007c9b144
Summary:
This PR is a minor change, just adds a simple `magma_queue_destroy` function to the implementation of `Gesv`.
Also, I have replaced calls for obtaining handles with those already written in ATen.
```
THCState_getCurrentSparseHandle(at::globalContext().getTHCState()) --> getCurrentCUDASparseHandle()
THCState_getCurrentBlasHandle(at::globalContext().getTHCState()) --> getCurrentCUDABlasHandle()
```
Differential Revision: D10032204
Pulled By: soumith
fbshipit-source-id: ccd11989ecdc357313f0b661a2468f75d3aecb0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12043
Re-trying D9979976, this time with all call sites fixed.
D9979976 got reverted because there was a call site that wasn't covered by sandcastle it seems.
I fixed it and used 'grep' to ensure there aren't any more call sites in fbsource.
Reviewed By: ezyang
Differential Revision: D10026392
fbshipit-source-id: cd341514a8e53a40147ea0ee3e52f63bb6444157
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12020
- make it less verbose to create random blobs in python unit test by adding some test helper methods
- move str_compare test helper method to test_util.py
Reviewed By: ZolotukhinM
Differential Revision: D10003637
fbshipit-source-id: cb79d2ad508341f750a1bb8f564e87d055c65652
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12034
We need ATen and Caffe2 to line up, and the rule is
that if you have any private/protected members, you
should declare it as a class. Class we go.
(There are some other obvious candidates for this treatment,
but I've kept this patch just to Tensor)
Reviewed By: gchanan, mingzhe09088
Differential Revision: D10024467
fbshipit-source-id: 17cfe2741ba9c3f56cb87d6f5d1afd3c61a8e4fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12046
This /sounds/ like a good idea in theory, but a feature
like this must be implemented very carefully, because if
you just plop the Git version in a header (that is included
by every file in your project, as macros.h is), then every
time you do a 'git pull', you will do a FULL rebuild, because
macros.h is going to regenerate to a new version and of course
you have to rebuild a source file if a header file changes.
I don't have time to implement it correctly, so I'm axing
the feature instead. If you want git versions in, e.g.,
nightly builds, please explicitly specify that when you feed
in the version.
Reviewed By: pjh5
Differential Revision: D10030556
fbshipit-source-id: 499d001c7b8ccd4ef15ce10dd6591c300c7df27d
Summary:
This makes a few changes wrt Type, with the ultimate goal of removing Type from the public Methods/Functions. In particular:
1) Removes factory functions from Type, into TypeExtendedInterface.
2) sparse_coo_tensor is now a first class at:: namespace function, with TensorOptions overloads.
3) We move from Type-based sparse_coo_tensor dispatch to function-based.
Note we still require a number of changes to get rid of Type in the public interface, in particular TensorOptions needs to support CUDA vs non-CUDA dispatch. That is coming in a future patch.
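A small sketch of the resulting call pattern (argument values are illustrative; the point is that `at::sparse_coo_tensor` is a free function taking `TensorOptions`, with no `Type` object in sight):
```cpp
#include <ATen/ATen.h>

at::Tensor make_sparse_example() {
  // 2 x nnz indices and matching values, as a COO sparse tensor expects.
  auto indices = at::zeros({2, 3}, at::kLong);
  auto values = at::ones({3});
  return at::sparse_coo_tensor(indices, values, {4, 4},
                               at::TensorOptions().dtype(at::kFloat));
}
```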
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12025
Reviewed By: ezyang
Differential Revision: D10017205
Pulled By: gchanan
fbshipit-source-id: 00807a37b09ed33f0656aaa165bb925abb026320
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12035
This brings it in line with Caffe2's naming
Reviewed By: mingzhe09088
Differential Revision: D10024485
fbshipit-source-id: a6feef82a56b5eb3043b0821ea802ba746e542a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12033
These are reasonably sensible default values. One key
pick is -1 for numel: this is because in Caffe2, a tensor
may be in an "un-allocated" state with no storage; this is
historically represented in Caffe2 with numel_ == -1.
Reviewed By: mingzhe09088
Differential Revision: D10024439
fbshipit-source-id: a167d727a7665daac7e7a1e98c0c89d8f1da6fa6
Summary: Original commit changeset: 2ea17724e223
Differential Revision: D10026321
Ninja: stable broken
fbshipit-source-id: faf87cb7cc0f78c2c10d4aa6fceea279cd27acd6
Summary:
As pytuple should be a constant type (since obj is constant), potential errors would occur without
this const qualifier, e.g., when compiling against PyPy. Although PyPy is not supported yet, it
would still be useful to remove this compilation issue (one of the very few compilation
issues) to allow hackers to play with it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11857
Differential Revision: D10024149
Pulled By: soumith
fbshipit-source-id: aa7e08e58f6369233a11477113351dccd3854ba8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11923
This is pre-work to allow moving Blob to ATen/core, which cannot depend on caffe2 anymore.
(1) Removing the Blob -> Tensor dependency allows us to move Blob to ATen/core and use it inside IValue without having to wait for the Tensor merge to be complete.
(2) In the final Blob design, we want it to be a very small class that doesn't have any special treatment for Tensor (or to be more correct, doesn't allow storing Tensor anymore), so this is anyhow the direction we want to go.
This changes call sites that will have to be moved to IValue later, but they cannot be moved to IValue directly, because for that, IValue first needs to be able to store Blob, which in turn first needs this diff and some other changes coming up in future diffs.
Codemods:
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)\\.IsTensorType\\(" "BlobIsTensorType(\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)->IsTensorType\\(" "BlobIsTensorType(*\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)\\.GetMutableTensor\\(" "BlobGetMutableTensor(\\1, "
$ codemod --extensions h,hpp,c,cpp,cc "([a-zA-Z0-9_]+)->GetMutableTensor\\(" "BlobGetMutableTensor(*\\1, "
It is, however, not only these codemods, because regex-based refactoring was only able to match a small number of the call sites. To catch more, I would've needed an AST-aware tool like clangr, which I didn't figure out how to use.
Reviewed By: ezyang
Differential Revision: D9979976
fbshipit-source-id: 2ea17724e223b5b73b44f99362727759ca689e61
Summary:
The source build doc section **LAPACK GPU** only lists magma-cuda80.
The magma-cuda version should reflect the installed version of CUDA.
- Verified on ubuntu with magma-cuda92 with build and test
- Verified 91 is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12000
Differential Revision: D10024158
Pulled By: soumith
fbshipit-source-id: a34c85a5e87b52657f1e6f7b21d235306ab7b2aa
Summary:
The MPI async work class returned a temporary as reference, which is
invalid (hat tip to colesbury for noticing it). This change fixes that and
uses a std::exception_ptr to hold on to the exception if applicable, and
then returns the reference by throwing it and returning it, like the
existing code path.
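A minimal sketch of the pattern (names are illustrative, not the actual ProcessGroupMPI work class):
```cpp
#include <exception>
#include <stdexcept>

// Keep the exception alive in a std::exception_ptr instead of returning a
// reference to a temporary; surface it by rethrowing when the caller asks.
class AsyncWorkSketch {
 public:
  void markFailed(std::exception_ptr ep) { error_ = std::move(ep); }

  const std::exception& exception() const {
    if (error_) {
      std::rethrow_exception(error_);
    }
    throw std::logic_error("exception() called on successful work");
  }

 private:
  std::exception_ptr error_;
};
```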
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11947
Differential Revision: D10019928
Pulled By: pietern
fbshipit-source-id: 5a8ed0e894615a09224ca5e48c8b3104275a3019
Summary:
When running /test/onnx/test_models.py, we see deprecation warnings in the test points for `super_resolution` and `squeezenet` models. This change updates those models to use the recommended methods, instead of the deprecated ones.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11827
Reviewed By: houseroad
Differential Revision: D10023998
Pulled By: ezyang
fbshipit-source-id: ee4e14304678c532ebd574e7bd143e3b311995ab
Summary:
Because we emit a lot of them in our symbolic AD. This brings down the backward time of an LSTM I'm testing from 14.2ms to 12.5ms (a 15% improvement).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11801
Differential Revision: D9916815
Pulled By: apaszke
fbshipit-source-id: 2d9cb886c424ccd43b9f996aad89950d3bddf494
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11688
As a first step to removing the static context (merging it with the allocator), we'll create a
global registry for context constructors, and remove the CreateContext function from tensor.
Reviewed By: ezyang, dzhulgakov
Differential Revision: D9779821
fbshipit-source-id: 8b239ea50af7a0556fde2382f58f79194f0e3dc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12018
Tried to use the file and ran into a small bug; this fixes it.
Differential Revision: D10013231
fbshipit-source-id: 4cf8c29cf9e2cedd7a28fa0cc0196e5144a54bf2
Summary:
Currently the C++ API and C++ extensions are effectively two different, entirely orthogonal code paths. This PR unifies the C++ API with the C++ extension API by adding an element of Python binding support to the C++ API. This means the `torch/torch.h` included by C++ extensions, which currently routes to `torch/csrc/torch.h`, can now be rerouted to `torch/csrc/api/include/torch/torch.h` -- i.e. the main C++ API header. This header then includes Python binding support conditioned on a define (`TORCH_WITH_PYTHON_BINDINGS`), *which is only passed when building a C++ extension*.
Currently stacked on top of https://github.com/pytorch/pytorch/pull/11498
Why is this useful?
1. One less codepath. In particular, there has been trouble again and again due to the two `torch/torch.h` header files and ambiguity when both ended up in the include path. This is now fixed.
2. I have found that it is quite common to want to bind a C++ API module back into Python. This could be for simple experimentation, or to have your training loop in Python but your models in C++. This PR makes this easier by adding pybind11 support to the C++ API.
3. The C++ extension API simply becomes richer by gaining access to the C++ API headers.
soumith ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11510
Reviewed By: ezyang
Differential Revision: D9998835
Pulled By: goldsborough
fbshipit-source-id: 7a94b44a9d7e0377b7f1cfc99ba2060874d51535
Summary:
Changes the result type of half type and any integer type to return half
type (instead of float or double).
This is based on top of #11808. The first new commit is "Make promoteType(half, integer) -> half". I'll rebase on top of master once that PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11941
Differential Revision: D10014122
Pulled By: colesbury
fbshipit-source-id: 16a5eb3406a5712069201d872d8736d0599e9411
Summary:
Or even taking them as inputs. This prevents optimizations to happen
either inside the differentiable subgraphs, or in the surrounding graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11809
Differential Revision: D10009680
Pulled By: apaszke
fbshipit-source-id: face638566228e470a6deec48dc2aa3a1cce26d4
Summary:
Old per-API+arch headers reside in
/opt/android_ndk/r*/platforms/android-*/arch-*/usr/include/
New Unified headers reside in
/opt/android_ndk/r*/sysroot/usr/include/
Unified headers are not exactly drop-in replacements for the old ones. Old headers had some nested includes that are absent in the unified versions, so we need to explicitly include them.
Reviewed By: mzlee
Differential Revision: D9952200
fbshipit-source-id: 6515e1d1ab576069db499c3fb23a69d507279c8c
Summary:
Some of these symbols are used by device_test.cc.
d0db23e95a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11965
Reviewed By: bwasti
Differential Revision: D10002439
Pulled By: bddppq
fbshipit-source-id: 4ae95b9c888b3c7685d0ffdbcbfa3441bcf90091
Summary:
This PR is a large codemod to rewrite all C++ API tests with GoogleTest (gtest) instead of Catch.
You can largely trust me to have correctly code-modded the tests, so it's not required to review every one of the 2000+ changed lines. However, additional things I changed were:
1. Moved the cmake parts for these tests into their own `CMakeLists.txt` under `test/cpp/api` and calling `add_subdirectory` from `torch/CMakeLists.txt`
2. Fixing DataParallel tests which weren't being compiled because `USE_CUDA` wasn't correctly being set at all.
3. Updated README
ezyang ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11953
Differential Revision: D9998883
Pulled By: goldsborough
fbshipit-source-id: affe3f320b0ca63e7e0019926a59076bb943db80
Summary:
- fixes https://github.com/pytorch/pytorch/issues/10723
- migrate PReLU to ATen and deprecate legacy PReLU
- performance:
CPU with weight.numel() = 1
```
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
100 loops, best of 100: 9.43 ms per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
10 loops, best of 100: 24.4 ms per loop
>>> m = nn.PReLU()
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 695 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.47 ms per loop
```
CPU with weight.numel() = channels
```
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 603 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 13.3 ms per loop
>>> m = nn.PReLU(100)
>>> x = torch.randn(100, 100, 100, requires_grad=True)
>>> %timeit -r 100 y = m(x)
1000 loops, best of 100: 655 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 y.backward(retain_graph=True)
100 loops, best of 100: 2.45 ms per loop
```
CUDA with weight.numel() = 1
```
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 187 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.01 ms per loop
>>> m = nn.PReLU().cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 195 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.28 ms per loop
```
CUDA with weight.numel() = channel
```
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
1000 loops, best of 100: 174 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.27 ms per loop
>>> m = nn.PReLU(100).cuda()
>>> x = torch.randn(100, 100, 100, requires_grad=True).cuda()
>>> %timeit -r 100 torch.cuda.synchronize(); y = m(x); torch.cuda.synchronize();
10000 loops, best of 100: 181 µs per loop
>>> y = m(x).sum()
>>> %timeit -r 100 torch.cuda.synchronize(); y.backward(retain_graph=True); torch.cuda.synchronize();
100 loops, best of 100: 2.26 ms per loop
```
The huge performance regression in CPU when weight.numel() = 1 is addressed by replacing at::CPU_tensor_apply* with parallelized kernels.
ezyang SsnL zou3519 soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11758
Differential Revision: D9995799
Pulled By: weiyangfb
fbshipit-source-id: d289937c78075f46a54dafbde92fab0cc4b5b86e
Summary:
This eliminates the need for any heuristics regarding stack size limits.
This is a re-do #11534 with a fix to properly handle cases where
multiple edges exist between a pair of functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11611
Differential Revision: D9991198
Pulled By: resistor
fbshipit-source-id: fecd2c5cac7e78f82a0f20cf33268bb1617bb4a0
Summary:
Previously, aten::view returned a Dynamic type when attr::size is a prim::ListConstruct.
See [this for a repro](https://gist.github.com/zou3519/cbd610472ba3369f556fa612a7d93b28).
This prevented a pre-multiplied lstm input graph from being fusible (aten::view is necessary
to do premultiplication).
If aten::view is passed an output of a prim::ListConstruct node, then shape prop should
be able to figure out its TensorType because we statically know the number of inputs to
prim::ListConstruct. This PR implements that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11877
Differential Revision: D9972356
Pulled By: zou3519
fbshipit-source-id: cb87786f6e7f222d4b8f07d8f2a9de34859cb6a5
Summary:
This is pretty important because a common situation of passing LSTM hidden states as a tuple completely trashes performance of a network.
Cleans up all our propagation/undef specialization passes, at a cost of increased complexity of `ArgumentSpec` and `GraphExecutor`. An alternative would be to simply flatten all tuple inputs to a graph ahead of time, but that might just end up being confusing in the future (you never know if you're working with a graph that can have tuple or not).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11863
Differential Revision: D9992814
Pulled By: apaszke
fbshipit-source-id: 0a565a3b23e32f8fa72c0534e07c1ce6187739fc
Summary:
Previously, we didn't cast any 0-dim tensors used in CUDA operations. We
can only avoid the casts for 0-dim CPU tensors used in CUDA operations.
Fixes #11795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11808
Differential Revision: D9922406
Pulled By: colesbury
fbshipit-source-id: 940b8a8534770aa5cd70d5d09b96be0f0f8146ff
Summary:
- Disable addmm fusion. The reason for this is explained in the comment.
- Tiny change in `stack.h` that lets us avoid constructing an unnecessary temporary `IValue` on the (C++) stack (it will only get created on the interpreter stack directly).
- Fixed a correctness issue in requires grad propagation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11654
Reviewed By: colesbury
Differential Revision: D9813739
Pulled By: apaszke
fbshipit-source-id: 23e83bc8605802f39bfecf447efad9239b9421c3
Summary:
Spuriously added in #11261
I had a PR to catch these automatically (#11279), but it had some issues
passing on some CI environments but not others (e.g. for
`test_nn_group_norm`), any ideas?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11916
Differential Revision: D9992065
Pulled By: driazati
fbshipit-source-id: 05cfa8ed9af939e8ffd5827847ee7bfe0be799b2
Summary:
`-O0` is problematic for compiling sleef kernels since they consist of a bunch of vector intrinsics. In `-O0`, the compiler spills *every* intermediate value to the stack. In one example (TestEndToEndHybridFrontendModels.test_snli in test_jit.py) the function `Sleef_tanhf8_u10avx2` would spill 30kB of AVX registers onto the stack and run two orders of magnitude slower than in opt mode, causing the test to take minutes rather than seconds. I've verified that this behavior is not present with `-O1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11942
Differential Revision: D9994658
Pulled By: jamesr66a
fbshipit-source-id: cdd9474c6ae3aa9898d5715ac19a900f5f90468a
Summary:
Deleted this section by mistake in last PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11938
Reviewed By: SsnL
Differential Revision: D9993258
Pulled By: brianjo
fbshipit-source-id: 2552178cebd005a1105a22930c4d128c67247378
Summary:
- fix PR https://github.com/pytorch/pytorch/pull/11061 by moving `detach_()` and `set_requires_grad()` to `torch.tensor_ctor()` and `tensor.new_tensor`, and also removed warnings and `args_requires_grad` from `internal_new_from_data`
- with this patch, the returned tensor from `tensor_ctor()` and `new_tensor` will be detached from source tensor, and set requires_grad based on the input args
- `torch.as_tensor` retains its behavior as documented
gchanan apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11815
Differential Revision: D9932713
Pulled By: weiyangfb
fbshipit-source-id: 4290cbc57bd449954faadc597c24169a7b2d8259
Summary:
We do this by being more NaN tolerant.
Fixes: #9062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11933
Differential Revision: D9991129
Pulled By: soumith
fbshipit-source-id: c99b04462c1bee90d00eeabb0c111de12f855f4d
Summary:
The sample code in the docstring of `torch.jit.createResolutionCallback` is not working:
`createResolutionCallback()` gets the frame of `bar`. In order to get the frame of `baz`, one needs to use `createResolutionCallback(1)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11921
Differential Revision: D9989123
Pulled By: soumith
fbshipit-source-id: a7166defdccbbf6979f7df4c871298e6b9a2b415
Summary:
There are two parts:
- Optional tensors cannot be dispatch tensors because dispatch
tensors cannot be optional.
- While the kernel dealt with undefined grad_outs, the logistics
around it did not fully accommodate grad_hy being undefined.
Fixes: #11800
Thank you, mttk for the reproduction!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11872
Differential Revision: D9978527
Pulled By: apaszke
fbshipit-source-id: e622c288d2eac93bd8388e141fb773f2588e2b8f
Summary:
I also fix a bug that crept in while we had incorrect semantics where UndefinedTensorImpl was a CPU tensor, and thus some moves which shouldn't have been legal didn't crash. Moving out the Tensor* also moved out the Tensor* in the blob, and it's not supported to store an undefined tensor in a blob.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11738
Reviewed By: gchanan
Differential Revision: D9847859
fbshipit-source-id: db6be0f76a8e6526a89fd0e87b6a23b9cc820c8d
Summary:
This PR fixes #11913.
In order to test for this, the model is serialized twice in `getExportImportCopy`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11915
Differential Revision: D9984697
Pulled By: soumith
fbshipit-source-id: ae0250c179000c03db1522b99410f6ecb9681297
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11817
Blob::Serialize() and Blob::Deserialize() are now free functions SerializeBlob(), DeserializeBlob() instead.
This takes away access to Blob internals from them and makes future refactorings easier.
Reviewed By: ezyang
Differential Revision: D9882726
fbshipit-source-id: 3251ebd4b53fc12f5e6924a6e4a8db3846ab3729
Summary:
This PR serves two purposes:
1. Design an abstraction over a serialization scheme for C++ modules, optimizers and tensors in general,
2. Add serialization to the ONNX/PyTorch proto format.
This is currently a rough prototype I coded up today, to get quick feedback.
For this I propose the following serialization interface within the C++ API:
```cpp
namespace torch { namespace serialize {
class Reader {
public:
virtual ~Reader() = default;
virtual void read(const std::string& key, Tensor& tensor, bool is_buffer = false) = 0;
virtual void finish() { }
};
class Writer {
public:
virtual ~Writer() = default;
virtual void write(const std::string& key, const Tensor& tensor, bool is_buffer = false) = 0;
virtual void finish() { }
};
}} // namespace torch::serialize
```
There are then subclasses of these two for (1) Cereal and (2) Protobuf (called the "DefaultWriter" and "DefaultReader" to hide the implementation details). See `torch/serialize/cereal.h` and `torch/serialize/default.h`. This abstraction and subclassing for these two allows us to:
1. Provide a cereal-less serialization forward that we can ship and iterate on going forward,
2. Provide no-friction backwards compatibility with existing C++ API uses, mainly StarCraft.
The user-facing API is (conceptually):
```cpp
void torch::save(const Module& module, Writer& writer);
void torch::save(const Optimizer& optimizer, Writer& writer);
void torch::read(Module& module, Reader& reader);
void torch::read(Optimizer& optimizer, Reader& reader);
```
with implementations for both optimizers and modules that write into the `Writer` and read from the `Reader`
ebetica ezyang zdevito dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11619
Differential Revision: D9984664
Pulled By: goldsborough
fbshipit-source-id: e03afaa646221546e7f93bb8dfe3558e384a5847
Summary:
- Finishes unifying Half type in pytorch and caffe2
- As a side effect, aten_op works for fp16 now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11676
Reviewed By: weiyangfb
Differential Revision: D9829019
Pulled By: li-roy
fbshipit-source-id: b8c9663873c10fe64c90ef180dc81af2e866674e
Summary:
This PR contains changes for:
1. Performance enhancements for group conv using MIOpen
2. Performance enhancements by removing unnecessary computations while running pooling through MIOpen
3. Added check for bwdData computation while running MIOpen convGradient operator
4. Fix in MIOpen poolingGradient operator to compute window size for global pooling case
5. Minor code cleanup in MIOpen spatial batch norm operator
Differential Revision: D9979050
Pulled By: bddppq
fbshipit-source-id: fabc7a44a2f9ca0307d99564d1ce8fe1de9a6fbb
Summary:
Python never closes a shared library it `dlopen`s. This means that calling `load` or `load_inline` (i.e. building a JIT C++ extension) with the same C++ extension name twice in the same Python process will never re-load the library, even if the compiled source code and the underlying shared library have changed. The only way to circumvent this is to create a new library and load it under a new module name.
I fix this, of course, by introducing a layer of indirection. Loading a JIT C++ extension now goes through an `ExtensionVersioner`, which hashes the contents of the source files as well as build flags, and if this hash changed, bumps an internal version stored for each module name. A bump in the version will result in the ninja file being edited and a new shared library and effectively a new C++ extension to be compiled. For this the version name is appended as `_v<version>` to the extension name for all versions greater zero.
One caveat is that if you were to update your code many times and always re-load it in the same process, you may end up with quite a lot of shared library objects in your extension's folder under `/tmp`. I imagine this isn't too bad, since extensions are typically small and there isn't really a good way for us to garbage collect old libraries, since we don't know what still has handles to them.
Fixes https://github.com/pytorch/pytorch/issues/11398
ezyang gchanan soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11725
Differential Revision: D9948244
Pulled By: goldsborough
fbshipit-source-id: 695bbdc1f1597c5e4306a45cd8ba46f15c941383
Summary:
This patch adds fused forward and backward for clamp to the jit.
This is one item of #11118 . If it's OK, I'd be happy to also add some more of #11118 .
The patch depends on #11150, which I merged into master as a base. I'll rebase it when that or #10981 is merged.
This is my first serious jit patch; thank you, ngimel and the others, for your guidance. All errors are my own.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11574
Differential Revision: D9943090
Pulled By: apaszke
fbshipit-source-id: c40954b8c28c374baab8d3bd89acc9250580dc67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11893
This is needed to run binaries compiled with CUDA support on on CPU-only machines.
Reviewed By: teng-li
Differential Revision: D9972872
fbshipit-source-id: 7e4107925b3cd4d2fcf84ae532e800ab65f4b563
Summary:
Fixes#11518
Upstream PR submitted at https://gitlab.kitware.com/cmake/cmake/merge_requests/2400
On some embedded platforms, the NVIDIA driver verbosely logs unexpected output to stdout.
One example is Drive PX2, where we see something like this whenever a CUDA program is run:
```
nvrm_gpu: Bug 200215060 workaround enabled.
```
This patch does a regex on the output of the architecture detection program to only capture architecture patterns.
It's more robust than before, but not fool-proof.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11851
Differential Revision: D9968362
Pulled By: soumith
fbshipit-source-id: b7952a87132ab05c724b287b76de263f1f671a0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11710
Added a test to check that output and gradient values are correctly
calculated when combine_spatial_bn is true on the data parallel model.
Reviewed By: enosair
Differential Revision: D9833660
fbshipit-source-id: 14d29fbebefa9dc303ffae06f9899ea4bde23025
Summary:
Two improvements to C++ extensions:
1. In verbose mode, show the ninja build output (the exact compile commands, very useful)
2. When raising an error, don't show the `CalledProcessError` that shows ninja failing, only show the `RuntimeError` with the captured stdout
soumith fmassa ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11724
Differential Revision: D9922459
Pulled By: goldsborough
fbshipit-source-id: 5b319bf24348eabfe5f4c55d6d8e799b9abe523a
Summary:
Following through on warning that indexing 0-dim tensor would be an
error in PyTorch 0.5 and to use `item()` instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11679
Reviewed By: soumith
Differential Revision: D9833570
Pulled By: driazati
fbshipit-source-id: ac19f811fa7320d30b7f60cf66b596d6de684d86
Summary:
This fixes the numerical problem in log_softmax cpu code when inputs are big but their differences are small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11866
Differential Revision: D9946799
Pulled By: soumith
fbshipit-source-id: 11fe8d92b91ef6b7a66f33fbce37ec2f0f0929be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11816
The file isn't in the std:: namespace, so is_same
must be qualified.
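A one-line illustration of the fix (illustrative code, not the file in question): outside namespace std, the trait has to carry its namespace.
```cpp
#include <type_traits>

// Unqualified `is_same` does not name the standard trait here; it must be std::is_same.
template <typename T>
constexpr bool is_int_v = std::is_same<T, int>::value;

static_assert(is_int_v<int>, "int is int");
static_assert(!is_int_v<long>, "long is not int");
```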
Reviewed By: smessmer
Differential Revision: D9923774
fbshipit-source-id: 126532e27f08b5616ca46be1293d5d837920f588
Summary:
+ https://github.com/pytorch/pytorch/issues/10236 : torch.bernoulli's out kwarg is broken
fixed in moving `bernoulli_out` to ATen
+ https://github.com/pytorch/pytorch/issues/9917 : BUG torch.bernoulli(p.expand(shape)) is broken
fixed in moving all `bernoulli` ops in ATen to use the modern apply utils methods
+ https://github.com/pytorch/pytorch/issues/10357 : torch.bernoulli inconsistent gpu/cpu results
fixed by adding CUDA asserts
In order to use `curand_uniform4`, I made some changes to `CUDAApplyUtils.cuh`. Specifically, I introduced an optional template parameter `int step` to the `CUDA_tensor_applyN` methods, representing that we want to process `step` values at each time for each of the `N` tensors.
The calling convention for `step = 1` (default) isn't changed. But if `step > 1`, the given lambda `op` must take in `int n` as its first argument, representing the number of valid values, because there may not be `step` valid values at the boundary. E.g., here is what the `bernoulli(self, p_tensor)` call looks like:
```cpp
// The template argument `4` below indicates that we want to operate on four
// elements at a time. See NOTE [ CUDA_tensor_applyN helpers ] for details.
at::cuda::CUDA_tensor_apply2<scalar_t, prob_t, 4>(
    ret, p,
    [seeds] __device__(
        int n, scalar_t& v1, scalar_t& v2, scalar_t& v3, scalar_t& v4,
        const prob_t& p1, const prob_t& p2, const prob_t& p3, const prob_t& p4) {
      curandStatePhilox4_32_10_t state;
      curand_init(
          seeds.first,
          blockIdx.x * blockDim.x + threadIdx.x,
          seeds.second,
          &state);
      float4 rand = curand_uniform4(&state);
      switch (n) {
        case 4: {
          assert(0 <= p4 && p4 <= 1);
          v4 = static_cast<scalar_t>(rand.w <= p4);
        }
        case 3: {
          assert(0 <= p3 && p3 <= 1);
          v3 = static_cast<scalar_t>(rand.z <= p3);
        }
        case 2: {
          assert(0 <= p2 && p2 <= 1);
          v2 = static_cast<scalar_t>(rand.y <= p2);
        }
        case 1: {
          assert(0 <= p1 && p1 <= 1);
          v1 = static_cast<scalar_t>(rand.x <= p1);
        }
      }
    }
);
```
Benchmarking on `torch.rand(200, 300, 400)` 20 times, each time with 20 loops:
post patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
6.841588497161865 +- 0.05413117632269859
torch.bernoulli(xc)
0.05963418632745743 +- 0.0008014909108169377
x.bernoulli_()
0.4024486541748047 +- 0.0021550932433456182
xc.bernoulli_()
0.02167394384741783 +- 2.3818030967959203e-05
```
pre-patch
```
➜ ~ numactl --cpunodebind 1 --membind 1 -- taskset -c 12,13,14,15,16,17,18,19,20,21,22,23 env CUDA_LAUNCH_BLOCKING=1 python bern.py
torch.bernoulli(x)
12.394511222839355 +- 0.0966421514749527
torch.bernoulli(xc)
0.08970972150564194 +- 0.0038722590543329716
x.bernoulli_()
1.654480218887329 +- 0.02364428900182247
xc.bernoulli_()
0.058352887630462646 +- 0.003094920190051198
```
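For reference, a minimal sketch of the three call patterns the linked issues cover (the last one assumes a CUDA device is available):
```python
import torch

p = torch.full((3, 4), 0.5)
out = torch.empty_like(p)
torch.bernoulli(p, out=out)        # out= kwarg (issue 10236)

q = torch.tensor(0.3).expand(3, 4)
samples = torch.bernoulli(q)       # expanded, non-contiguous probabilities (issue 9917)

x = torch.rand(3, 4, device="cuda")
x.bernoulli_()                     # in-place CUDA path (issue 10357)
```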
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10273
Differential Revision: D9831294
Pulled By: SsnL
fbshipit-source-id: 65e0655a36b90d5278b675d35cb5327751604088
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11875
Seems like the refactor to predictor_config dropped some functionality that is now blocking other teams.
rFBS2b30208263c14ce7039f27c618a3b232bf11ee33 is the change that was missed.
Hoping to land this quickly :)
Reviewed By: jonmorton
Differential Revision: D9948324
fbshipit-source-id: 1628f7c51c06319fa7ca5dc9d59799135bb82c5f
Summary:
A missing environment variable raised a missing key error. Now it
raises a more descriptive error of the actual problem, for example:
ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set
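Roughly, the check amounts to something like the following (a sketch, not the actual torch.distributed code):
```python
import os

def require_env(var):
    # Sketch: fail with a descriptive message instead of a bare KeyError.
    if var not in os.environ:
        raise ValueError(
            "Error initializing torch.distributed using env:// rendezvous: "
            "environment variable %s expected, but not set" % var)
    return os.environ[var]

# e.g. world_size = int(require_env("WORLD_SIZE"))
```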
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11782
Differential Revision: D9888962
Pulled By: pietern
fbshipit-source-id: 5947e7a7bf7aa45f13bbd7b5e997529f26cc92d6
Summary:
This PR adds empty sparse tensor tests to `test_sparse.py`, and also fixes various places in internal code to make the tests pass.
**[NOTE] API CHANGE:**
- `coalesce` on sparse tensor will always be performed out-of-place now (meaning the original tensor will never be affected)
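A small sketch of the new out-of-place behavior:
```python
import torch

i = torch.tensor([[0, 0, 1]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (2,))  # has a duplicate index, so uncoalesced

c = s.coalesce()            # returns a new, coalesced tensor
print(s.is_coalesced())     # False: the original tensor is untouched
print(c.is_coalesced())     # True
```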
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11228
Differential Revision: D9930449
Pulled By: yf225
fbshipit-source-id: 7c62439b216a6badf7938a10741c358ff18a556d
Summary:
To illustrate the benefits of this commit, I'll use the time/iter I got from one of the JIT benchmarks on my machine.
| Run | Time |
|----------------------------------------------|-------------------------|
| No profiler | 45ms |
| With profiler | 56ms |
| Use `clock_gettime` instead of `std::chrono` | 48ms |
| Touch all pages on block allocation | 48ms (less jitter) |
| Use `const char*` instead of `std::string` | 47ms (even less jitter) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11773
Differential Revision: D9886858
Pulled By: apaszke
fbshipit-source-id: 58f926f09e95df0b11ec687763a72b06b66991d0
Summary:
I noticed I was including `torch/nn/pimpl.h` in the optimizer library just to access `TORCH_ARG`, even though that file includes a lot of irrelevant code. Let's save some re-compilation time by refactoring this macro into a separate logical file. #small-wins
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11787
Differential Revision: D9924447
Pulled By: goldsborough
fbshipit-source-id: 5acd4ba559ffb2a3e97277e74bb731d7b1074dcf
Summary:
currently grad assignment for half type fails with a misleading RuntimeError
```
RuntimeError: torch.cuda.sparse.HalfTensor is not enabled.
```
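A minimal repro sketch (assumes a CUDA device is available):
```python
import torch

# Before this fix, assigning a grad to a half-precision CUDA tensor raised
# the misleading sparse-HalfTensor error quoted above.
p = torch.zeros(4, dtype=torch.half, device="cuda", requires_grad=True)
p.grad = torch.zeros_like(p)
```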
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11781
Differential Revision: D9931884
Pulled By: soumith
fbshipit-source-id: 03e946c3833d1339a99585c9aa2dbb670f8bf459
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11779
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11731
This Predictor provides a threadsafe interface and also
cleans up activations after each run, so in a multi-model setup
the activation space doesn't explode.
Reviewed By: highker
Differential Revision: D9842374
fbshipit-source-id: bfe253ae5fc813e73a347c5147ff6b58d50781ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11704
Add plan name to the logging in RunPlan
Reviewed By: Tianshu-Bao
Differential Revision: D9802416
fbshipit-source-id: 45c359dba0a5d992e303b3cdcf34624881a631d8
Summary:
Currently one of our GPU perf tests `test_gpu_speed_mnist` reports NaN after this commit (https://github.com/pytorch/pytorch/pull/8018), and we didn't have the logic in place to raise an error when this happens. This PR fixes the problem and will also update the baseline properly even if its previous value is NaN.
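The intended behavior is roughly the following (a sketch, not the actual benchmark harness):
```python
import math

def check_and_update_baseline(baseline, measured):
    # Sketch: fail loudly on NaN measurements, and allow a NaN baseline
    # to be overwritten by the next valid measurement.
    if math.isnan(measured):
        raise RuntimeError("perf test reported NaN")
    if baseline is None or math.isnan(baseline):
        return measured
    return baseline
```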
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11588
Differential Revision: D9831798
Pulled By: yf225
fbshipit-source-id: b95eee38d69b3b8273f48b8ac7b7e0e79cf756ed
Summary:
Adds some pretty-printing capability to the IR graph to make debugging easier/more human readable, see `torch/csrc/jit/test_jit.cpp:925` and onwards for example outputs. Results aren't perfect yet but it's a start.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10319
Reviewed By: zdevito
Differential Revision: D9558402
Pulled By: driazati
fbshipit-source-id: 1d61c02818daa4c9bdca36d1477d1734cfc7d043
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11818
To do this, I have to move the static context registry into ATen/core.
I take the opportunity to convert it into an unordered_map.
Reviewed By: Yangqing
Differential Revision: D9924348
fbshipit-source-id: 8d92b9e8b4246ce608eba24ecef7ad5f8b9b6582
Summary:
This should not be there since logging does not depend on protobuf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11814
Reviewed By: ezyang
Differential Revision: D9923819
Pulled By: Yangqing
fbshipit-source-id: 4d4edaea1a2e317f5db6e92c35d58c85dd35c5fb
Summary:
The PR fixes #10873.
The context is that the aten::add and aten::sub ST overloads don't have alpha, so the ONNX symbolic does not match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10972
Reviewed By: jamesr66a
Differential Revision: D9724224
Pulled By: wanchaol
fbshipit-source-id: eb5d1b09fa8f1604b288f4a62b8d1f0bc66611af
Summary:
For example, outputs of control blocks often have Dynamic type, and when we try to export them to ONNX we get an invalid proto, since `elem_type` is not populated on the TypeInfoProto. This makes it so at least we can get past the checker, since having a dynamic typed output from a control block should still be semantically valid
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11810
Differential Revision: D9922754
Pulled By: jamesr66a
fbshipit-source-id: 5c66113cc302a9d9b8b9f5a8605473d3c6ad5af1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11805
Some of our headers in Caffe2 pollute the macro namespace with things like MAX,
MIN, CHECK, so I renamed these in places where this is a problem.
This patch courtesy of gchanan, extracted out of #11721
Reviewed By: Yangqing
Differential Revision: D9917757
fbshipit-source-id: 17fc692ca04b208dcb8ae00731ed60e393284f7c
Summary:
I have no idea how to run distributed tests locally so I'll let CI do this. Hopefully everything still works with `IntEnum`.
cc mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11715
Reviewed By: pietern
Differential Revision: D9889646
Pulled By: SsnL
fbshipit-source-id: 1e2a487cb6fe0bd4cc67501c9d72a295c35693e2
Summary:
Followup to [the serialized test framework](https://github.com/pytorch/pytorch/pull/10594)
Round 1 for refactoring tests, starting alphabetically. I added some functionality, so I wanted to send out some of these initial changes sooner.
I'm skipping all tests that don't explicitly call assertReferenceChecks. Some tests directly call np.allclose, and others are simply TestCase (rather than HypothesisTestCase).
1. Start alphabetically producing serialized outputs for test functions, annotating those we want to include with `serialized_test_util.given`. So far I've only added one test per operator, but this already does seem to add quite a few tests.
2. Add functionality to allow us to generate outputs using pytest by adding pytest argument options. This allows us to skip adding a `__main__` function to quite a few tests.
3. Catch any exceptions generating the gradient operator and skip serializing/reading it, since certain operators don't have gradients.
4. Add functionality to better handle jagged array inputs, which numpy doesn't handle very well. We simply explicitly do the conversion to dtype=object.
5. Make only one file per test function, rather than 4, to reduce the number of files in the github repo.
I also noticed that there is some hypothesis handling that makes `serialized_test_util.given` not compatible with adding more hypothesis decorators on top. For example, there are tests that do
```
settings(...)
given(...)
def test_my_stuff(...)
```
But there is a hypothesis handler that explicitly checks that `given` is called below `settings`, so we cannot refactor this to `serialized_test_util.given`. I've just avoided decorating these kinds of tests for now, I hope that's alright.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11350
Reviewed By: houseroad
Differential Revision: D9693857
Pulled By: ajyu
fbshipit-source-id: a9b4279afbe51c90cf2025c5ac6b2db2111f4af7
Summary:
This PR adds empty sparse tensor tests to `test_sparse.py`, and also fixes various places in internal code to make the tests pass.
**[NOTE] API CHANGE:**
- `coalesce` on sparse tensor will always be performed out-of-place now (meaning the original tensor will never be affected)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11228
Differential Revision: D9755189
Pulled By: yf225
fbshipit-source-id: e9d36f437db1a132c423d3a282ff405a084ae7cc
Summary:
Adds vararg support for meshgrid and adds checks for all the tensor arguments to have the same dtype and device.
Fixes: [#10823](https://github.com/pytorch/pytorch/issues/10823), #11446
The earlier pull request closed without any changes because I had some rebasing issues, so I made another pull request to close out #10823. Sorry for the inconvenience.
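A short sketch of the two call styles after this change:
```python
import torch

x = torch.arange(3)
y = torch.arange(4)

gx, gy = torch.meshgrid(x, y)      # new vararg form
gx2, gy2 = torch.meshgrid([x, y])  # list form still works
# All arguments must share the same dtype and device, or an error is raised.
```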
Differential Revision: D9892876
Pulled By: ezyang
fbshipit-source-id: 93d96cafc876102ccbad3ca2cc3d81cb4c9bf556
Summary:
Simple change to allow ModuleList subclasses's `__getitem__(slice)` to return class of subclass rather than ModuleList
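For illustration (the `MyList` subclass is hypothetical):
```python
import torch.nn as nn

class MyList(nn.ModuleList):
    pass

m = MyList([nn.Linear(2, 2) for _ in range(4)])
print(type(m[1:3]))  # with this change: MyList, not nn.ModuleList
```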
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11694
Differential Revision: D9892824
Pulled By: ezyang
fbshipit-source-id: b75e9c196487f55cb93f0dab6c20d850e8e759ff
Summary:
Related to #11624 adding maxes to the function def of embedding_bag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11784
Differential Revision: D9892598
Pulled By: ezyang
fbshipit-source-id: e6372ccf631826ddf1e1885b2f8f75f354a36c0b
Summary:
Since we're making parts of the JIT public as part of loading script modules, they should be on the cppdocs website.
Orthogonal: We decided not to export things like `IValue` into the `torch` namespace, so `RegisterOperators` shouldn't be there either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11712
Differential Revision: D9837578
Pulled By: goldsborough
fbshipit-source-id: 4c06d2fa9dd4b4216951f27424c2ce795febab9c
Summary:
tset_potri -> test_potri, even though it has been like this for a long time
More a curiosity than grave functionality...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11770
Reviewed By: ezyang
Differential Revision: D9884767
Pulled By: soumith
fbshipit-source-id: 9bedde2e94ade281ab1ecc2293ca3cb1a0107387
Summary:
I read this a couple of times before figuring out that it's also the entry point for MPI_COMM_WORLD.
Reordered statements and added comment to clarify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11764
Differential Revision: D9882834
Pulled By: pietern
fbshipit-source-id: a9282d55368815925fd695a2541354e5aec599da
Summary:
The PR aims to resolve issues related to BUILD_PYTHON and BUILD_TEST after FULL_CAFFE2 is removed on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11385
Reviewed By: orionr
Differential Revision: D9884906
Pulled By: mingzhe09088
fbshipit-source-id: fc114c0cbff6223f1ec261161e4caecc1fef5dd6
Summary:
```
I'm using AT_HOST_DEVICE outside of Half.h in an upcoming PR. Since this
changes code without making any semantic changes, I wanted to make this
change in a separate PR.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10945
Differential Revision: D9539821
Pulled By: colesbury
fbshipit-source-id: 0daae40ea78b077a543f7bfeec06b225634540de
Summary:
We now have CI machines with different number of amd gpus.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11771
Differential Revision: D9889837
Pulled By: bddppq
fbshipit-source-id: dacf728a282f209e3f2419da186e59528a08ca6a
Summary:
The second part of T32009899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11556
Differential Revision: D9888224
Pulled By: zrphercule
fbshipit-source-id: cb0d0ba5d9c7ad601ee3bce0d932ce9cbbc40908
Summary: Cleaning up converter.cc and allowing networks that have "pass through" inputs (that are also outputs but aren't actually consumed by the network)
Reviewed By: duc0
Differential Revision: D9759435
fbshipit-source-id: 1ddfcc60a1b865a06682e4022230dfecc4b89ec3
Summary:
It is currently tedious to change code generation because it takes two steps: change the code gen, then gen.py fails because of file mismatch. This just adds an environment option to generate directly into the source tree.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11759
Differential Revision: D9867259
Pulled By: gchanan
fbshipit-source-id: 3cf8024d9e302f382cf8b8a44cb843fb086f8597
Summary:
This fixes #8515, which was mostly issues in the tests themselves. As long
as `math` is imported in the scope in which the script runs it resolves
to a `prim::Constant` with value `inf` correctly. This PR adds this to
the `test_jit.py` tests involving `inf` and adds a test to demonstrate
`inf` in a non-generated test.
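A minimal sketch of the kind of scripted function this enables, assuming `math` is imported at module scope:
```python
import math
import torch

@torch.jit.script
def is_below_inf(x):
    # `math.inf` resolves to a prim::Constant with value inf.
    return x < math.inf

print(is_below_inf(torch.tensor([1.0, 2.0])))
```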
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11302
Differential Revision: D9684336
Pulled By: driazati
fbshipit-source-id: 73df2848dfdb45ab50690a7c88df8fda269a64eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11748
For avx512, we need to align at a multiple of 64B not 32B
Regardless of avx512, it's in general a good idea to be cache line aligned.
Reviewed By: ilia-cher
Differential Revision: D9845056
fbshipit-source-id: b1d3ed67749c0c1a64acd5cc230a1279e8023512
Summary:
I'm just doing the honors and bumping the version to 1.0.0.
1.0 preview and RC releases will have the 1.0.0.dev{date} tag
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11717
Reviewed By: SsnL
Differential Revision: D9840857
Pulled By: soumith
fbshipit-source-id: 4c9c2e01dccb3c521dab26c49e1569d970a87ace
Summary:
Previously, it was a necessity to include TensorMethods.h after Tensor.h in order to get the tensor method definitions.
We abstracted this away from users by making sure ATen.h did this correctly; but we don't have any equivalent for ATen/core.
In order to solve this dependency issue, we now forward declare Tensor in the Type declaration, which breaks the dependency cycle.
Type.h now includes Tensor.h (for backwards compatibility) and Tensor.h now includes TensorMethods.h, so there is no longer include dependency restrictions.
We could get rid of TensorMethods.h completely now, but that would involve coordinating a code generation change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11720
Reviewed By: ezyang
Differential Revision: D9841488
Pulled By: gchanan
fbshipit-source-id: 1668199095e096c1790e646b5dc9f61ec1b33c0a
Summary:
A couple fixes I deem necessary to the TorchScript C++ API after writing the tutorial:
1. When I was creating the custom op API, I created `torch/op.h` as the one-stop header for creating custom ops. I now notice that there is no good header for the TorchScript C++ story altogether, i.e. when you just want to load a script module in C++ without any custom ops necessarily. The `torch/op.h` header suits that purpose just as well of course, but I think we should rename it to `torch/script.h`, which seems like a great name for this feature.
2. The current API for the CMake we provided was that we defined a bunch of variables like `TORCH_LIBRARY_DIRS` and `TORCH_INCLUDES` and then expected users to add those variables to their targets. We also had a CMake function that did that for you automatically. I now realized a much smarter way of doing this is to create an `IMPORTED` target for the libtorch library in CMake, and then add all this stuff to the link interface of that target. Then all downstream users have to do is `target_link_libraries(my_target torch)` and they get all the proper includes, libraries and compiler flags added to their target. This means we can get rid of the CMake function and all that stuff. orionr AFAIK this is a much, much better way of doing all of this, no?
3. Since we distribute libtorch with `D_GLIBCXX_USE_CXX11_ABI=0`, dependent libraries must set this flag too. I now add this to the interface compile options of this imported target.
4. Fixes to JIT docs.
These could likely be 4 different PRs but given the release I wouldn't mind landing them all asap.
zdevito dzhulgakov soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11682
Differential Revision: D9839431
Pulled By: goldsborough
fbshipit-source-id: fdc47b95f83f22d53e1995aa683e09613b4bfe65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11706
This is necessary to handle use-cases when Storage is not set (because the
tensor in question doesn't have a notion of storage.)
Reviewed By: orionr
Differential Revision: D9833361
fbshipit-source-id: e90a384019f44f57682b687d129b54e85b6fabb9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11701
There's one extra multiply from TypeMeta::itemsize() which needs
to be characterized. For all existing Caffe2 uses, storage_offset
is zero.
Reviewed By: li-roy
Differential Revision: D9831230
fbshipit-source-id: 353678edf76d2ccc297a73475a34f6ab2a20d1e1
Summary:
This will allow us to break the dependency cycle between Tensor and Type, because currently Type has defaulted Tensor (reference) arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11675
Reviewed By: ezyang
Differential Revision: D9819720
Pulled By: gchanan
fbshipit-source-id: a9577ac34a358120075129ab0654e7862d1dace6
Summary:
This way it shows up in all current and future setup.py commands, as otherwise we'd have to override every one to have them all call copy_protos. This is needed because the nightly packages still do not include caffe2_pb2, because setup.py bdist does not go through setup.py install or setup.py develop.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11726
Reviewed By: orionr
Differential Revision: D9844075
Pulled By: pjh5
fbshipit-source-id: 57b469e48010aacd0c08c214ba8a7e5d757feefa
Summary:
We use these annotations during function declarations, not definitions. See the description of compiler error [C2491](https://msdn.microsoft.com/en-us/library/62688esh.aspx) for more details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11367
Reviewed By: ezyang
Differential Revision: D9697923
Pulled By: orionr
fbshipit-source-id: 1e539c02957851386f887e6d0510ce83117a1695
Summary:
This PR vectorizes the CPU grid sample 2d forward and backward kernels. Specifically:
1. add `.data()` in `TensorAccessor`
2. support non-void return value for declaring CPU kernel stub
3. add `bool at::geometry_is_contiguous(IntList sizes, IntList strides)`
4. The following vectorized CPU primitives are added:
+ `gather<scale>(baseaddr, vindex)`: `result[i] = baseaddr[vindex[i] * scale]`
+ `mask_gather<scale>(src, baseaddr, vindex, mask)`: `result[i] = mask[i] ? baseaddr[vindex[i] * scale] : src[i]`.
+ comparison ops
+ binary logical ops
+ `min(a, b)`
+ `cast<dst_t, src_t>(src_vec)`: changing dtype but keeping the bit representation
+ `blendv(a, b, mask)`: `result[i] = mask[i] ? b[i] : a[i]`.
+ ctor with multiple values (i.e., `setr`)
+ `arange(start = 0, step = 1)`: constructs a vector with values specified by the arange parameters
+ `convert_to_int_of_same_size(vec)`: convert floating point vector to corresponding integral type of same size
+ `interleave2(a, b)` & `deinterleave2(x, y)`: interleave or deinterleaves two vectors. E.g., for `interleave`:
```
inputs:
{a0, a1, a2, a3, a4, a5, a6, a7}
{b0, b1, b2, b3, b4, b5, b6, b7}
outputs:
{a0, b0, a1, b1, a2, b2, a3, b3}
{a4, b4, a5, b5, a6, b6, a7, b7}
```
5. Grid sample CPU kernel implementations are described in the following note (also in `GridSampleKernel.cpp`):
```
NOTE [ Grid Sample CPU Kernels ]
Implementation of vectorized grid sample CPU kernels is divided into three
parts:
1. `ComputeLocation` struct
Transforms grid values into interpolation locations of the input tensor
for a particular spatial dimension, based on the size of that dimension
in the input tensor and the padding mode.
```
```cpp
template<typename scalar_t, GridSamplerPadding padding>
struct ComputeLocation {
using Vec = Vec256<scalar_t>;
// ctor
ComputeLocation(int64_t size);
// Given grid values `in`, return the interpolation locations after
// un-normalization and padding mechanism (elementwise).
Vec apply(const Vec &in) const;
// Similar to `apply`, but also returns `d apply(in) / d in`
// (elementwise).
// this is often used in gradient computation.
std::pair<Vec, Vec> apply_get_grad(const Vec &in) const;
};
```
```
2. `ApplyGridSample` struct
Owns N `ComputeLocation` structs, where N is the number of spatial
dimensions. Given N input grid vectors (one for each spatial dimension)
and spatial offset, it gets the interpolation locations from
`ComputeLocation`s, applies interpolation procedure, and then writes to
the output (or grad_input & grad_grid in backward).
```
```cpp
template<typename scalar_t, int spatial_dim,
GridSamplerInterpolation interp,
GridSamplerPadding padding>
struct ApplyGridSample {
// ctor
ApplyGridSample(const TensorAccessor<scalar_t, 4>& input);
// Applies grid sampling (forward) procedure:
// 1. computes interpolation locations from grid values `grid_x` and
// `grid_y`,
// 2. interpolates output values using the locations and input data
// in `inp_slice`, and
// 3. writes the first `len` values in the interpolated vector to
// `out_slice` with spatial offset being `offset`.
//
// This assumes that `grid_x` and `grid_y` all contain valid grid
// values \in [-1, 1], even at indices greater than `len`.
//
// The `*_slice` argument names refer to samples within a batch (i.e.,
// with the batch dimension sliced out).
void forward(TensorAccessor<scalar_t, 3>& out_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
// Applies grid sampling (backward) procedure. Arguments semantics
// and strategy are similar to those of `forward`.
void backward(TensorAccessor<scalar_t, 3>& gInp_slice,
TensorAccessor<scalar_t, 3>& gGrid_slice,
const TensorAccessor<scalar_t, 3>& gOut_slice,
const TensorAccessor<scalar_t, 3>& inp_slice,
int64_t offset, const Vec& grid_x, const Vec& grid_y,
int64_t len) const;
}
```
```
3. `grid_sample_2d_grid_slice_iterator` function
Among the tensors we work with, we know that the output tensors are
contiguous (i.e., `output` in forward, and `grad_input` & `grad_grid` in
backward), we need to randomly read `input` anyways, and `grad_output`
usually comes from autograd and is often contiguous. So we base our
iterating strategy on the geometry of grid.
The `grid_sample_2d_grid_slice_iterator` function provides an abstraction to
efficiently iterate through a `grid` slice (without the batch dimension).
See comments of that function on the specific cases and strategies used.
```
```cpp
template<typename scalar_t, typename ApplyFn>
void grid_sample_2d_grid_slice_iterator(
const TensorAccessor<scalar_t, 3>& grid_slice,
const ApplyFn &apply_fn);
// `apply_fn` is a function/lambda that can be called as if it has
// declaration:
// void apply_fn(const Vec256<scalar_t>& grid_x,
// const Vec256<scalar_t>& grid_y,
// int64_t spatial_offset, int64_t len);
```
```
`apply_fn` will be called multiple times, and together cover the entire
output spatial space. Therefore, e.g., to implement forward 2d grid
sample, we can do
```
```cpp
ApplyGridSample<scalar_t, 2, interp, padding> grid_sample(input_accessor);
for (int n = 0; n < input_accessor.size(0); n++) {
grid_sample_2d_grid_slice_iterator(
grid_accessor[n],
[&](const Vec256<scalar_t>& grid_x, const Vec256<scalar_t>& grid_y,
int64_t spatial_offset, int64_t len) {
grid_sample.forward(out_accessor[n], input_accessor[n],
spatial_offset, grid_x, grid_y, len);
});
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10980
Differential Revision: D9564867
Pulled By: SsnL
fbshipit-source-id: 5b7c3c7ea63af00eec230ae9ee1c3e6c6c9679b4
Summary:
Changes `max` to `fmaxf` in the `LabelCrossEntropy` kernel so it works correctly on HIP.
bddppq petrex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11733
Differential Revision: D9846783
Pulled By: bddppq
fbshipit-source-id: c1b394d2ba7ee0e819f7bf3b36b53d1962de5522
Summary:
Fixes #11452.
Based on the discussion with SsnL and soumith, we want to bring back Upsample as a module instead of introducing a new nn.interpolate module for now. If anyone wants to downsample, they should use `nn.functional.interpolate` instead.
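In other words, a short sketch of the intended usage split:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)

up = nn.Upsample(scale_factor=2, mode="nearest")  # module form stays for upsampling
y_up = up(x)

y_down = F.interpolate(x, scale_factor=0.5)       # downsampling: use the functional
```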
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11568
Differential Revision: D9804359
Pulled By: ailzhang
fbshipit-source-id: 2b232d55fc83c2b581bf336f1ee8d1cf1c1159ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11355
There is no reason to implement refcounting manually in this case.
Given the correct NullType, toIntrusivePtr() and moveToIntrusivePtr() will do the right thing.
Reviewed By: ezyang
Differential Revision: D9694918
fbshipit-source-id: 8aae4d66aec32ca5f85c438d66339bd80b72b656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11353
Before, there was one extra member in the union that had to be at least as large as the largest other member, because it was used for copying.
Now, this isn't needed anymore and we copy the union directly.
Reviewed By: ezyang
Differential Revision: D9694326
fbshipit-source-id: 42b2f7d51ac5d4ea5ebafea3a598b018e10fed68
Summary:
Current behavior is that each process (main and workers) will print trace from `KeyboardInterrupt`. And the main process will also print
```
RuntimeError: DataLoader worker (pid 46045) exited unexpectedly with exit code 1. Details are lost due to multiprocessing. Rerunning with num_workers=0 may give better error trace.
```
due to our SIGCLD handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11718
Differential Revision: D9840844
Pulled By: SsnL
fbshipit-source-id: 1a05060bb02907fef5aac3f274d2c84f9f42d187
Summary:
Otherwise each build produces 65MB of warnings log, which makes the CI hard to debug.
iotamudelta Jorghi12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11698
Differential Revision: D9840356
Pulled By: bddppq
fbshipit-source-id: b69bf6a5c38a97b188221f9c084c608ffc9b37c8
Summary:
1. Documents the Sequential module in the C++ API at both a high level (why does this exist) and a low level (how to use it)
2. Changes the Sequential tests to a style that makes them easier to convert to gtest. No code changes.
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11648
Differential Revision: D9834526
Pulled By: goldsborough
fbshipit-source-id: 39f2f5c6cbbf8ed5a1b69986978c8ef127036de1
Summary:
This PR splits the CPU and CUDA fusion compilers, putting them into a new jit/fusers/ directory with jit/fusers/common for common components. In particular:
- A fusion interface is created that allows "fusion handles" to be requested
- The CPU and CUDA fusers implement this interface, with dispatch determined by device
- The fusion compilers, fusion function specializations and resource strings are split
- CPU-specific classes like TempFile and DynamicLibrary are in the CPU fuser
- Common classes likes TensorDesc and the base fusion function class are in jit/fusers/common
- There is still some specialization in jit/fusers/common, but these specializations are small(-ish)
- Updates the build system to remove the dummy interface on Windows and minimize the use of macros
This structure should allow in-flight PRs to easily rebase while providing a clear interface to the fusers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10981
Reviewed By: soumith
Differential Revision: D9701999
Pulled By: apaszke
fbshipit-source-id: 3b6bec7b97e0444b2a93caa38d9b897f2e68c1b3
Summary:
Fixes#11663
`TensorIterator` was replacing the op tensors with type casted tensors
which ended up producing side effects in binary ops like `a.float() * b`
where `a` and `b` are `LongTensor`s.
colesbury ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11708
Differential Revision: D9834016
Pulled By: driazati
fbshipit-source-id: 4082eb9710b31dfc741161a0fbdb9a8eba8fe39d
Summary:
Often, we find ourselves looking at some long-running kernel or emit_nvtx range on an nvvp profile and trying to connect it to the offending line in a training script. If the op is in the forward pass that's easy: ops are enqueued explicitly from the Python side, so tracking it down with manual nvtx ranges supplemented by the built-in emit_nvtx ranges is straightforward. If the op is in the backward pass, it's much more difficult. From the Python side, all you can do is wrap loss.backward() in an nvtx range, and if you also use emit_nvtx, the automatic ranges provide only local information. Right now, the only consistent way to connect backward-pass kernels to their associated forward-pass lines of Python is to understand your script line by line, and know exactly where in the backward pass you are.
This PR augments the existing nvtx machinery to bridge the gap between forward and backward, allowing connection of backward-pass Function apply calls to the forward-pass operations that required/created those Functions.
The method is simple and surgical. During the forward pass, when running with emit_nvtx, the nvtx range for each function in VariableType is tagged with the current sequence number. During the backward pass, the nvtx range associated with each Function's operator() is tagged with that Function's stashed sequence number, which can be compared to "current sequence numbers" from the forward pass to locate the associated op.
Double-backward is not a problem. If a backward pass with create_graph = True is underway, the relationship between backward and double-backward is conceptually the same as the relationship between forward and backward: The functions in VariableType still spit out current-sequence-number-tagged ranges, the Function objects they create still stash those sequence numbers, and in the eventual double-backward execution, their operator() ranges are still tagged with the stashed numbers, which can be compared to "current sequence numbers" from the backward pass.
Minor caveats:
- The sequence number is thread-local, and many VariableType functions (specifically, those without a derivative explicitly defined in derivatives.yaml) don't create an associated function object (instead delegating that to sub-functions further down the call chain, perhaps called from within at::native functions that route back through VariableType by calling at::function_name). So the correspondence of stashed sequence numbers in Function operator() ranges with numbers in forward-pass ranges is not guaranteed to be 1 to 1. However, it's still a vast improvement over the current situation, and I don't think this issue should be a blocker.
- Feel free to litigate my use of stringstream in profiler.cpp. I did it because it was easy and clean. If that's too big a hammer, let's figure out something more lightweight.
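For reference, a minimal sketch of running with emit_nvtx (assumes a CUDA device and that the script is launched under nvprof/nvvp):
```python
import torch
from torch.autograd import profiler

model = torch.nn.Linear(10, 10).cuda()
x = torch.randn(2, 10, device="cuda")

with torch.cuda.profiler.profile():
    with profiler.emit_nvtx():
        loss = model(x).sum()  # forward ranges are tagged with current sequence numbers
        loss.backward()        # backward ranges carry the stashed sequence numbers
```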
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10881
Differential Revision: D9833371
Pulled By: apaszke
fbshipit-source-id: 1844f2e697117880ef5e31394e36e801d1de6088
Summary:
This is causing codegen problems in caffe2, when we try to remove the circular Tensor/Type declarations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11673
Differential Revision: D9819341
Pulled By: gchanan
fbshipit-source-id: f2c2cd96e8a16f6de6aa4889e71b8a78e12e9256
Summary:
We'll have separate docs for the C++ frontend, right now this file is just misleading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11703
Differential Revision: D9832847
Pulled By: goldsborough
fbshipit-source-id: 2e8b30ccf6b5cba9d0526e6261160f7c6211a35c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11597
We should always CHECK pointers which we plan to dereference
if they are inputs to the function. Nobody knows how the function will
be called in the future.
Reviewed By: yinghai
Differential Revision: D9800002
fbshipit-source-id: 7fd05f4717f2256d1b09a9e75475b12de6685b03
Summary:
…cuda())
While I was at it, I audited all other ways I know how we might get a CUDA
type from PyTorch and fixed more constructors which don't work.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11533
Differential Revision: D9775786
Pulled By: ezyang
fbshipit-source-id: cd07cdd375fdf74945539ec475a48bf08cbc0c17
Summary:
There's no reason they need to be in Type.h and this moves us along the path of not having circular dependencies (so we can get rid of TensorMethods.h).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11650
Reviewed By: ezyang
Differential Revision: D9812271
Pulled By: gchanan
fbshipit-source-id: 8b70db9a5eb0a332398ab2e8998eeaf7d2eea6d7
Summary:
This adds a small check in `Dirichlet` and `Categorical` `__init__` methods to ensure that scalar parameters are not admissible.
**Motivation**
Currently, `Dirichlet` throws no error when provided with a scalar parameter, but if we `expand` a scalar instance, it inherits the empty event shape from the original instance and gives unexpected results.
The alternative to this check is to promote `event_shape` to be `torch.Size((1,))` if the original instance was a scalar, but that seems to add a bit more complexity (and changes the behavior of `expand` in that it would affect the `event_shape` as well as the `batch_shape` now). Does this seem reasonable? cc. alicanb, fritzo.
```python
In [4]: d = dist.Dirichlet(torch.tensor(1.))
In [5]: d.sample()
Out[5]: tensor(1.0000)
In [6]: d.log_prob(d.sample())
Out[6]: tensor(0.)
In [7]: e = d.expand([3])
In [8]: e.sample()
Out[8]: tensor([0.3953, 0.1797, 0.4250]) # interpreted as events
In [9]: e.log_prob(e.sample())
Out[9]: tensor(0.6931) # wrongly summed out
In [10]: e.batch_shape
Out[10]: torch.Size([3])
In [11]: e.event_shape
Out[11]: torch.Size([]) # cannot be empty
```
Additionally, based on review comments, this removes `real_vector` constraint. This was only being used in `MultivariateNormal`, but I am happy to revert this if we want to keep it around for backwards compatibility.
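With the check in place, the scalar construction above is rejected up front; a sketch of the expected behavior:
```python
import torch
import torch.distributions as dist

d = dist.Dirichlet(torch.tensor([1.0, 1.0]))  # 1-d concentration: fine
# dist.Dirichlet(torch.tensor(1.0))           # scalar parameter: now raises at construction
```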
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11589
Differential Revision: D9818271
Pulled By: soumith
fbshipit-source-id: f9bbba90ed6f04e0b5bdfa169e70ca20b280fc74
Summary:
This PR:
- adds a `.expand` method for `TransformedDistribution` along the lines of #11341.
- uses this method to simplify `.expand` in distribution classes that subclass off of `TransformedDistribution`.
- restores testing of `TransformedDistribution` fixtures.
- fixes some bugs wherein we were not setting certain attributes in the expanded instances, and adds tests for `.mean` and `.variance` which use these attributes.
There are many cases where users directly use `TransformedDistribution` rather than subclassing off it. In such cases, it seems rather inconvenient to have to write a separate class just to define a `.expand` method. The default implementation should suffice in these cases.
cc. fritzo, vishwakftw, alicanb
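A minimal usage sketch of the default `.expand` on a directly-constructed `TransformedDistribution`:
```python
import torch
import torch.distributions as dist
from torch.distributions.transforms import ExpTransform

base = dist.Normal(torch.tensor(0.0), torch.tensor(1.0))
log_normal = dist.TransformedDistribution(base, [ExpTransform()])

expanded = log_normal.expand(torch.Size([3]))  # no subclass needed
print(expanded.batch_shape)                    # torch.Size([3])
```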
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11607
Differential Revision: D9818225
Pulled By: soumith
fbshipit-source-id: 2c4b3812b9a03e6985278cfce0f9a127ce536f23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11576
Previously, they were spattered throughout the codebase.
We now follow this convention:
- LegacyTypeDispatch gives you Type
- Context gives you TypeExtendedInterface
- Tensor::type() gives you Type
- at::getType() gives you TypeExtendedInterface
I change some sites to use getType() over type().
Reviewed By: SsnL
Differential Revision: D9790187
fbshipit-source-id: 5e2577cb590a5bbf5df530f3763d3b3c0b4625ca
Summary:
This adds tests in tests/test_distributions.py to ensure that all methods of `Distribution` objects are jittable.
I've replaced a few samplers with jittable versions:
- `.uniform_()` -> `torch.rand()`
- `.exponential_()` -> `-(-torch.rand()).log1p()`
- `.normal_()` -> `torch.normal(torch.zeros(...), torch.ones(...), ...)`
Some jit failures remain, and are marked in test_distributions.py
- `Cauchy` and `HalfCauchy` do not support sampling due to missing `.cauchy_()`
- `Binomial` does not support `.enumerate_support()` due to `arange` ignoring its first arg.
- `MultivariateNormal`, `LowRankMultivariateNormal` do not support `.mean`, `.entropy`
- [x] Currently some tests fail (I've skipped those) due to unavailability of `aten::uniform` and `aten::cauchy` in the jit. Can someone suggest how to add these? I tried to add declarations to `torch/csrc/ir.cpp` and `torch/csrc/passes/shape_analysis.cpp`, but that resulted in "Couldn't find operator" errors.
- [x] There are still lots of `TracerWarning`s that something doesn't match something. I'm not sure whether these are real.
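For instance, the exponential replacement listed above is just inverse-transform sampling; a small sketch:
```python
import torch

def jittable_exponential(shape, rate=1.0):
    # -log(1 - U) / rate with U ~ Uniform(0, 1), mirroring the
    # `.exponential_()` -> `-(-torch.rand()).log1p()` substitution.
    u = torch.rand(shape)
    return -(-u).log1p() / rate
```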
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11560
Differential Revision: D9816327
Pulled By: apaszke
fbshipit-source-id: 72ec998ea13fc4c76d1ed003d9502e0fbaf728b8
Summary:
Currently torch.norm() runs sequentially on the CPU. This PR parallelizes and vectorizes torch.norm() on the ATen CPU path, providing roughly a two-order-of-magnitude performance boost.
Performance was benchmarked on a Xeon Skylake 8180 (2x28 cores @ 2.5GHz) using the following script:
```python
import torch
from time import time
count = 1000
size = 1000*1000
def test_norm(p=2):
a = torch.randn(size)
tstart = time()
for i in range(count):
torch.norm(a, p)
tend = time()
print("norm on size %d tensor p = %d: %f s" % (size, p, (tend-tstart)))
for p in range(4):
test_norm(p)
```
without this optimization,
```
(intel-pytorch) [mingfeim@mlt-skx065 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 1.071235 s
norm on size 1000000 tensor p = 1: 1.069149 s
norm on size 1000000 tensor p = 2: 1.068212 s
norm on size 1000000 tensor p = 3: 69.735312 s
```
and with this optimization,
```
(pytorch-tf) [mingfeim@mlt-skx053 unit_tests]$ python test_norm.py
norm on size 1000000 tensor p = 0: 0.127507 s
norm on size 1000000 tensor p = 1: 0.011867 s
norm on size 1000000 tensor p = 2: 0.011907 s
norm on size 1000000 tensor p = 3: 0.014470 s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11565
Differential Revision: D9804484
Pulled By: ezyang
fbshipit-source-id: 52899f30ac26139d00684d07edfb47cb9b25d871
Summary:
Previously, we would pretty much assume that all floating point tensors do require grad, which might result in some unnecessary compute.
I don't really like the fact that `TensorType` uses `tensor.is_variable() && tensor.requires_grad()` to infer the value of `requires_grad`, but changing constants to keep variables turns out to be pretty hard. I got halfway there, but it would still need some more work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11586
Reviewed By: ezyang
Differential Revision: D9813648
Pulled By: apaszke
fbshipit-source-id: 77f77756d18ff7632fca3aa68ce855e1d7f3bdb8
Summary:
I'm reading the doc of `torch.nn.functional.pad` and it looks a bit confusing to me. Hopefully this PR makes it clearer.
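For context, the main point of confusion is the ordering of the pad tuple; a quick example:
```python
import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 3, 3)

# The pad tuple is read from the last dimension backwards:
# (left, right) pads only the last dim,
# (left, right, top, bottom) pads the last two dims.
y = F.pad(x, (1, 1, 2, 0))
print(y.shape)  # torch.Size([1, 1, 5, 5])
```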
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11623
Differential Revision: D9818255
Pulled By: soumith
fbshipit-source-id: 4f6b17b0211c6927007f44bfdf42df5f84d47536
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11659
This is less error-prone and less code.
Reviewed By: smessmer
Differential Revision: D9814536
fbshipit-source-id: 028510e31e2fa7a9fa11c1398b0743c5cd085dd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11657
Previously, we had a constructor in TensorImpl for every constructor in Tensor.
This was unnecessary and wordy: Tensor is the user-visible class, so it deserves
the constructors, but TensorImpl is internal and doesn't need it. So
I replaced TensorImpl with a single, Storage accepting constructor, and then
rewrote Tensor to use that constructor.
Reviewed By: jerryzh168
Differential Revision: D9813742
fbshipit-source-id: 7501b54fe5f39180f1bc07573fd7c1640b0f4e89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11656
The mis-capitalization really sticks in my craw. I know why (we
already have a static function named GetDeviceType), but let's
name it differently.
```
codemod -d . --extensions cc,cpp,cu,cuh,h,py,hpp,TARGETS GetDevicetype device_type
```
Reviewed By: jerryzh168
Differential Revision: D9813544
fbshipit-source-id: fe462f4bc40b03e74921f8cf5ebd9cfc52e7e636
Summary:
The isCompleted function is changed to be non-const to accommodate
setting some internal status on the work object in the case of
completion. Previously, it was only checking a member field, but for the
MPI backend it calls MPI_Test to poll for completion of an asynchronous
request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11630
Reviewed By: SsnL
Differential Revision: D9808008
Pulled By: pietern
fbshipit-source-id: 18b70825b1fb4d561a552fa75e9475a522852cd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11396
std::move and std::forward in C++11 aren't constexpr (they are in C++14).
This caused a build issue orionr was working on.
It should be fixed by this diff
Reviewed By: orionr
Differential Revision: D9724805
fbshipit-source-id: 0d9047dce611385d659cc71a6c04cc7a6a40a5ae
Summary:
Requires https://github.com/onnx/onnx/pull/1377
This PR makes it so that slices with dynamic boundary values can be exported from pytorch and run in caffe2 via ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11255
Differential Revision: D9790216
Pulled By: jamesr66a
fbshipit-source-id: 6adfcddc5788df4d34d7ca98341077140402a3e2
Summary:
Currently, because of some setup.py logic, `ninja` caching of the `generate_code.py` build step was broken. This resulted in `generate_code.py` running every single time builds were happening, regardless of whether inputs changed.
This updated logic fixes the input caching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11644
Reviewed By: orionr
Differential Revision: D9814348
Pulled By: soumith
fbshipit-source-id: 2012960908d0f600488d410094095cfd72adc34f
Summary:
This also removes the usage of torch.onnx.symbolic_override in instance_norm. Fixes #8439.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10792
Differential Revision: D9800643
Pulled By: li-roy
fbshipit-source-id: fa13a57de5a31fbfa2d4d02639d214c867b9e1f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11642
This is just a preparatory change to help with future
refactoring:
- I want to reduce the number of includes that tensor_impl.h
depends on, but
- I need to keep tensor.h providing all Caffe2 headers, because
users may be relying on tensor.h transitively providing those
headers.
Introducing a level of indirection lets me do both at the same time.
Reviewed By: jerryzh168
Differential Revision: D9810823
fbshipit-source-id: 8dfaac4b8768051a22898be8fcaf787ecc57eb13
Summary:
Before this PR it would warn that "dropout is non deterministic and can
cause problems when checking trace", so I disabled the trace checking.
cc zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11639
Differential Revision: D9812493
Pulled By: zou3519
fbshipit-source-id: fab86928a5fba8b218b47543533aaf7c82a10b4a
Summary:
Arg parser allowed additional positional args to be parsed into keyword-only params.
Fixes a couple cases:
- The positional argument happens to be of the right type, and it just works silently. Now, we fail as expected.
- The positional argument fails later down the line. Now, we fail at the appropriate time and get a better error message.
Pre-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
tensor([6, 0], device='cuda:1')
```
Post-fix:
```
>>> torch.cuda.LongTensor((6, 0), 1, 1, 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new() received an invalid combination of arguments - got (tuple, int, int, int), but expected one of:
* (torch.device device)
* (torch.Storage storage)
* (Tensor other)
* (tuple of ints size, torch.device device)
* (object data, torch.device device)
```
Pre-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros(): argument 'dtype' (position 2) must be torch.dtype, not int
```
Post-fix:
```
>>> a = torch.tensor(5)
>>> a.new_zeros((5,5), 0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: new_zeros() takes 1 positional argument but 2 were given
```
fixes#8351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10499
Differential Revision: D9811093
Pulled By: li-roy
fbshipit-source-id: ce946270fd11b264ff1b09765db3300879491f76
Summary:
In order to comply with Python's rules on implicit casting of
non-booleans to booleans, this PR removes implicit casting in favor of
explicit casts via `bool()`
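A minimal sketch of the kind of change this requires in user scripts:
```python
import torch

@torch.jit.script
def flip_if_negative_sum(x):
    # A tensor condition must now be cast explicitly with bool().
    if bool(x.sum() < 0):
        return -x
    return x
```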
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11503
Differential Revision: D9780869
Pulled By: driazati
fbshipit-source-id: c753acaca27f4e79dddf424c6b04674f44a6aad9
Summary:
- Just a simple fix to support `fill_`
- And a fix for indexing in `pytorch-complex`
Differential Revision: D9804061
Pulled By: ezyang
fbshipit-source-id: 631129b3fa220a9670770b3766f14a8e03633bdf
Summary:
Add guards against using sparse tensor by checking the conversion from IValue -> PyObject & PyObject -> IValue.
This diff also changes the behavior in constant propagation to not run python ops even if all ops are constant because of possible mutation to global state. This came up in trying to run get_sparse(), and I'm including it here to make it easier to land.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11550
Differential Revision: D9804712
Pulled By: eellison
fbshipit-source-id: 9fe7daf721c6d6e48df4925c0f9c775873bcdc77
Summary:
Clean up some generated tests after we have newly nice features like var args.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11403
Differential Revision: D9800545
Pulled By: wanchaol
fbshipit-source-id: e9973b113f78dc38cf99a81b6ede3fa3485f1cfa
Summary:
* Many ops in the LSTM part of the model don't have an implementation in ideep/MKL, and it doesn't make sense to copy back and forth for the few available ops because the majority of the RNN will run on the CPU
* Thus the strategy is to enable MKL only for the resnet18 part of the model, then switch to the default CPU engine for the LSTM part
* The net may contain some external_inputs falsely added during the ONNX->Caffe2 conversion. A canary in the service shows their existence could lead to a service crash (presumably because these blobs somehow get shared between threads). They're now manually removed, which seems to be enough to avoid the crash.
Reviewed By: viswanathgs
Differential Revision: D8888763
fbshipit-source-id: da7761bcb7d876ff7bbb6640ae4b24712c0b1de6
Summary:
After discussions in #11584, this is a new PR for just the test skip and hgemm integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11593
Differential Revision: D9798527
Pulled By: ezyang
fbshipit-source-id: e2ef5609676571caef2f8e6844909fe3a11d8b3e
Summary:
I am working on unifying the C++ extensions and C++ API, and one constraint for this is that we will want to be able to build the C++ API without cereal, since we won't want to ship it with the Python `torch` package.
For this I introduce a `TORCH_WITH_CEREAL` option to CMake. If on, the C++ API will be built with cereal and thus serialization support. If off, serialization functions will throw exceptions, but the library will otherwise still compile the same. __This option is on by default, so for regular C++ API users nothing will change__. However, from C++ extensions, we'll be able to turn it off. This effectively means we won't be searching for any cereal headers from C++ API headers, which wouldn't be installed in the Python package.
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11498
Differential Revision: D9784803
Pulled By: goldsborough
fbshipit-source-id: 5d0a1f2501993012d28cf3d730f45932b483abc4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11470
In order to reduce build sizes, we are identifying files that can be split up into smaller units, allowing us to only include the ops we need.
Reviewed By: orionr, ajtulloch
Differential Revision: D9725819
fbshipit-source-id: def1074a33dffe99bd6a7e6e48aa9e5be3d04a6a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11587
To help debug the issue in T33295362, we add some checks in the function.
Possible crashing sites in `GetTensorInfo`:
1. tc is nullptr, which is checked.
2. tc->capacity_nbytes() hits a nullptr; this is unlikely because the storage is not a pointer and computing capacity_nbytes doesn't involve pointers (it's numel * itemsize()).
3. tc->ExtractDeviceOption hits a nullptr. One possibility is that raw_data() is nullptr, because tc->ExtractDeviceOption will use it. This is checked.
4. The Tensor itself, which is not a reference. This is also checked.
Reviewed By: salexspb
Differential Revision: D9793484
fbshipit-source-id: 3fc72746fc310a23ae45553bbe0d269a4b9edb72
Summary:
Documents the `AnyModule` class in the C++ API.
Also changed the API to be friendlier by default. Calling `AnyModule::forward` used to return an `AnyModule::Value` which you had to call `.get<T>()` on to cast to a concrete type. I changed the name of that `forward` method to `any_forward` and instead made `forward` templated on a `ReturnType` template parameter which you can supply to do the `.get<T>` cast for you automatically. I default this parameter to `torch::Tensor` so that it can often be omitted. So where you used to have to write
```cpp
any_module.forward(...).get<int>();
any_module.forward(...).get<torch::Tensor>();
```
you now write
```cpp
any_module.forward<int>(...);
any_module.forward(...);
```
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11580
Differential Revision: D9798626
Pulled By: goldsborough
fbshipit-source-id: 060b4ea28facaffc417f53b80b846a9dff9acb73
Summary:
This eliminates the need for any heuristics regarding stack size limits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11534
Differential Revision: D9779866
Pulled By: resistor
fbshipit-source-id: 96753eead7904bbdc2869fb01f7bd42141032347
Summary:
This PR contains a C++ implementation of weight norm. The user-side exposure of weight norm through torch.nn.utils.weight_norm is unchanged.
If running on the GPU, and the norm is requested over the first or last dimension of the weight tensor, the forward pass is carried out using the fused kernels I wrote for our Fairseq GTC hero run, which offer superior performance to primitive ops and superior numerical stability when running in FP16. In the common case that the backward pass is not itself constructing a graph (ie not attempting to set up double backward) the backward pass will be carried out using another fused kernel. If the backward pass is constructing a graph, an alternate code path is taken, which does the math using differentiable primitive ops. In this way, the implementation allows double backward, even if the fused kernel was used in forward (although in this case, you don't benefit from the performance and stability of the fused backward kernel).
If running on the CPU, or if norming over an interior dim, the forward pass is carried out using double-differentiable primitive ops.
Figuring out how to generate all the right plumbing for this was tricky, but it was a fun experience learning how the autogenerator works and how the graph is constructed. Thanks to colesbury for useful guidance on this front.
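For reference, the unchanged user-side entry point looks like this:
```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

# The norm is computed over `dim`; on GPU, norming over the first or last
# dimension of the weight hits the fused kernels described above.
layer = weight_norm(nn.Linear(20, 40), name="weight", dim=0)
out = layer(torch.randn(8, 20))
```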
I do have a few lingering questions:
- Should I unify my return statements (ie by default-constructing Tensors outside if blocks and using operator= within)?
- What is the significance of `non_blocking` when calling e.g. `auto norms = saved_norms.to(saved_g.type().scalarType(), non_blocking=True/False);`? I am currently omitting `non_blocking`, so it defaults to False, but I didn't see any associated synchronizes on the timeline, so I'm wondering what it means.
- Is there an "official" mapping from at::ScalarTypes to corresponding accumulate types, as there are for the PODs + Half in [AccumulateType.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/AccumulateType.h)? I looked for an equivalent mapping for ScalarTypes, didn't find one, and ended up rigging it myself (` at::ScalarType AccType = g.type().scalarType() == at::ScalarType::Half ? at::ScalarType::Float : g.type().scalarType();`).
- Are sparse tensors a concern? Should I include another check for sparse tensors in the `_weight_norm` entry point, and send those along the fallback CPU path as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10842
Differential Revision: D9735531
Pulled By: ezyang
fbshipit-source-id: 24431d46532cf5503876b3bd450d5ca775b3eaee
Summary:
This changes the way module import works so that when a module
is reloaded in python it becomes a ScriptModule and not a _C.ScriptModule
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11552
Differential Revision: D9782751
Pulled By: zdevito
fbshipit-source-id: 9576850b75494b228ce3def94c0d371a4a44b11d
Summary:
Also adds two additional tests that check for memory leaks while the relevant graph executors are alive:
- (minimal test): Create a ScriptModule, keep it alive, and test that it does not leak memory while it is alive
- (large test) Do MNIST training with a traced MNIST module and test that no memory is leaked while the traced module (with graph executor) is alive
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11544
Reviewed By: apaszke
Differential Revision: D9778479
Pulled By: zou3519
fbshipit-source-id: 2d6cdea81dd1264f2c0396b662f70fdafecb3647
Summary:
Fixes:
```
/bin/ld: warning: libnccl.so.1, needed by /data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so, not found (try using -rpath or -rpath-link)
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclAllReduce'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclBcast'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclCommInitAll'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclGetErrorString'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclReduceScatter'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclAllGather'
/data/users/ezyang/pytorch-tmp/build/lib/libcaffe2_gpu.so: undefined reference to `ncclReduce'
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11575
Differential Revision: D9789956
Pulled By: ezyang
fbshipit-source-id: 63e48763cc233be9d137cec721b239159b511a24
Summary:
This PR:
1. Documents `BatchNorm`,
2. Makes a number of API changes after reconsidering some quirks:
1. The default value for the `stateful` parameter used to be `false`, but the most common usage of `BatchNorm` in the wild is certainly stateful, and the default in Python is also statefulness. So we change the default to stateful.
2. The `pure_forward` function used to use the internal running mean and variance variables instead of the ones supplied to that function call when `stateful` was true, which certainly seems odd. When you call `pure_forward` you would certainly expect the values you pass explicitly to be used. This is now fixed.
3. Adds tests for `BatchNorm`, finally.
ebetica apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11484
Reviewed By: pjh5
Differential Revision: D9779618
Pulled By: goldsborough
fbshipit-source-id: 59ba760e085c01454b75644b24b22317b688e459
Summary:
- Incorporates the MKL addition by mingfeima. Thank you! (but all errors are my own)
- Native CPU implementation: defer to matrix multiplication for
small batches and parallelize over batch dimension for large
batches.
- Add bmm test for CUDA just to be sure.
This is a partial fix for #10661, getting down to a factor ~5.
Considerable overhead is incurred for the setup in einsum. It might
be more efficient to eventually define optimized contraction
functions for arbitrary and several dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11292
Differential Revision: D9784941
Pulled By: ezyang
fbshipit-source-id: f6dded2c6f5e8f0461fb38f31f9a824992a58358
Summary:
Make the test work with a CPU-only build; this also fixes test failures that had been occurring for a long time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11567
Differential Revision: D9785740
Pulled By: teng-li
fbshipit-source-id: 61c43b758c1ee53117e30de8074583e6faea863a
Summary:
This makes torch.distributed work for CPU-only builds.
Also added one more CI test case to cover MPI CPU build.
All CI tests should cover this change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11513
Differential Revision: D9784546
Pulled By: teng-li
fbshipit-source-id: 0976a6b0fd199670926f0273e17ad7d2805e42e7
Summary:
Also, fix a performance bug in `ensureUnique`. Previously it formatted the warning string even though we weren't tracing, so all that work would *always* happen in the hot path and be for nothing.
A sample of what the new warnings look like:
```
tmp.py:4: TracerWarning: Converting a tensor to a Python integer might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
int(x)
tmp.py:5: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
torch.tensor([1.])
tmp.py:6: TracerWarning: There are 2 live references to the data region being modified when tracing in-place operator add_. This might cause the trace to be incorrect, because all other views that also reference this data will not reflect this change in the trace! On the other hand, if all other views use the same memory, but are disjoint (e.g. are outputs of torch.split), this might still be safe.
torch.split(y, 2, dim=1)[0].add_(2)
```
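For context, the `ensureUnique` fix above boils down to deferring string formatting until a warning will actually be emitted. A minimal sketch of that pattern in Python (names and arguments are illustrative, not the actual tracer internals):
```python
import warnings

def warn_on_shared_modification(op_name, num_live_refs, is_tracing):
    # Check the cheap conditions first; only build the warning string
    # when we are actually tracing and the data is shared.
    if not is_tracing or num_live_refs <= 1:
        return
    warnings.warn(
        "There are {} live references to the data region being modified "
        "when tracing in-place operator {}.".format(num_live_refs, op_name))
```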
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11545
Differential Revision: D9782975
Pulled By: apaszke
fbshipit-source-id: 5b3abd31366e59c69e0b7ff278042b5563deb5a9
Summary:
This fixes the build when CuDNN was not found on the system.
From the `git blame`, it looks like the bug has been around for 2 years :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11562
Differential Revision: D9784589
Pulled By: soumith
fbshipit-source-id: b33153436dced0a503c9833cdf52f7093f3394b4
Summary:
This adds a Note on making experiments reproducible.
It also adds Instructions for building the Documentation to `README.md`. Please ping if I missed any requirements.
I'm not sure what to do about the submodule changes. Please advise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11329
Differential Revision: D9784939
Pulled By: ezyang
fbshipit-source-id: 5c5acbe343d1fffb15bdcb84c6d8d925c2ffcc5e
Summary:
Ping ezyang
This addresses your comment in #114. Strangely, when running the doc build (`make html`) none of my changes actually show up; could you point out what I'm doing wrong?
Once #11329 is merged it might make sense to link to the reproducibility note everywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11434
Differential Revision: D9751208
Pulled By: ezyang
fbshipit-source-id: cc672472449564ff099323c39603e8ff2b2d35c9
Summary: The PyTorch C++ API has `torch.nn.init` equivalents that the RNNG can use to initialize the state of its StackRNNs. This gets rid of the `fanInOut_` methods on `Parser` and tidies up `xavierInitialState` a little.
Reviewed By: wowitsmrinal
Differential Revision: D9472595
fbshipit-source-id: c202116f32383d3b4bba064c2c0d2656311e1170
Summary:
This PR does two things:
1. Replaces the implementation of the `Dropout` module with a call to the ATen function,
2. Replaces `Dropout2d` with a new `FeatureDropout` module that shall take the place of `Dropout2d` and `Dropout3d`. I contemplated calling it `Dropout2d` and making `Dropout3d` an alias for it, but similar to our decision for `BatchNorm{1,2,3}d` (c.f. https://github.com/pytorch/pytorch/pull/9188), we can deviate from Python PyTorch in favor of the ideal-world solution, which is to have a single module, since both actually just call `feature_dropout`.
I also replaced the implementation of `dropout3d` with a call to `dropout2d` in Python. The code is the same and it's easier for developers to parse than having to manually match the tokens to make sure it's really 100% the same code (which it is, if I matched the tokens correctly).
ebetica ezyang SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11458
Differential Revision: D9756603
Pulled By: goldsborough
fbshipit-source-id: fe847cd2cda2b6da8b06779255d76e32a974807c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11520
Previously, we had Type which was a catch all interface for all
functions and methods we could possibly want to do dynamic dispatch
on. However, we want to check in a non-autogenerated Tensor class
to ATen/core, and to do this, we must also check in a non-autogenerated
Type class which we can do dispatch on. In principle, we could
put the full Type interface in ATen/core, but this would be
a bad developer experience, since any time you add a new free
function, you'd have to regenerate the checked in Type header.
For a better dev experience, we split Type into a two parts,
Type, which will be checked in (though not in this diff), and
TypeExtendedInterface, which will NOT be checked in. Type contains
just enough methods to let Tensor be defined, and leaves the
rest to TypeExtendedInterface.
Some complications:
- We (very unfortunately) have overloaded virtual methods. Because
of C++'s rules, we cannot move one overload without doing some
extra work to make sure that overload in a superclass and an
overload in a subclass resolve together. I've chosen to resolve
this problem simply by moving ALL overloads of a method which
occurs in Tensor to Type.
- There are some places where we take a type() object and call
a method on it, which is not a Tensor base method. I've eliminated
some where possible, but in other cases calling the method on type
is the ONLY way to invoke it; in that case, I've just inserted
a cast. Further refactoring is necessary.
Reviewed By: gchanan
Differential Revision: D9771708
fbshipit-source-id: c59d39fe919cd6f42be6dca699d474346ea3c614
Summary:
The previous error was caused by mpi_test not depending on MPI_CXX_LIBRARIES. This might solve the problem.
Not tested locally - waiting for CI test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11416
Reviewed By: mingzhe09088
Differential Revision: D9771694
Pulled By: Yangqing
fbshipit-source-id: 53e7b4f64eadc88313bc4dd9b8e3f7931cda6e91
Summary:
This works around #11535 by avoiding `arange(n, out=x)` and `eye(n, out=x)` in `torch.distributions`. I've confirmed that the `.enumerate_support()` methods are now jittable.
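A rough sketch of the kind of rewrite this involves (illustrative only, not the exact diff in `torch.distributions`):
```python
import torch

n = 4

# Pattern being avoided: writing into a preallocated tensor via out=,
# which the tracer currently handles poorly (#11535).
buf = torch.empty(n, n)
torch.eye(n, out=buf)

# JIT-friendlier pattern: allocate the result directly and cast as needed.
values = torch.eye(n, dtype=torch.float32)
```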
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11542
Differential Revision: D9777805
Pulled By: apaszke
fbshipit-source-id: fa38f2f1acfc0a289f725fd8c92478573cfdbefb
Summary: Printing for complex numbers requires loading and storing between `Py_complex` and `std::complex`. This patch aims to support this for the plugin.
Differential Revision: D9771808
Pulled By: ezyang
fbshipit-source-id: 024865f1945d63ddb5efc775a35438c8ea06408e
Summary:
This whitelists train/eval functions in script modules, and tests that nested nn.Modules still work.
This also changes the code for calling python functions from script to allow non-tensor inputs/outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11505
Differential Revision: D9765466
Pulled By: zdevito
fbshipit-source-id: 1177bff931324422b69e18fa0bbaa82e3c98ec69
Summary:
ezyang delivering my promise to you :)
Basically, now aten tests can use gtest as part of our test harness unification effort. I also converted one test (atest.cpp) to show how one can do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11429
Reviewed By: ezyang
Differential Revision: D9762934
Pulled By: Yangqing
fbshipit-source-id: 68ec3a748403c6bd88399b1e756200985a4e07e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11413
LengthsTileOp was implemented using a sequence of device memcopies initiated on the CPU. This was very slow. I changed it to use a kernel. TUM benchmark QPS improved from 13k QPS to 20k QPS as a result.
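For readers unfamiliar with the operator, a minimal sketch of LengthsTile's semantics in PyTorch terms (the actual change is a Caffe2 CUDA kernel; this is only the reference behavior):
```python
import torch

def lengths_tile_reference(data, lengths):
    # LengthsTile repeats row i of `data` lengths[i] times along dim 0.
    return torch.repeat_interleave(data, lengths, dim=0)

data = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
lengths = torch.tensor([2, 3])
print(lengths_tile_reference(data, lengths))  # 5 rows: two copies of row 0, three of row 1
```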
Reviewed By: manojkris, xianjiec
Differential Revision: D9724988
fbshipit-source-id: 2f98c697730982734d7c6a26d0b6967310d49900
Summary:
Provide a TensorAccessor-Like interface for CUDA as discussed in #8366.
Compared to TensorAccessor
- the CUDATensorAccessor copies the sizes and strides while on the host (I didn't implement a host indexing function, though) to enable transfer to the device; on the device, `[]` works like it does for TensorAccessors,
- instantiation is from TensorAccessors in order to allow using `.accessor<..>`. The drawback is that you cannot use `auto` for the variable declaration, but the alternative would be a cuda-specific `.accessor`-like function,
- there is a PtrTraits argument to enable `__restrict__`,
Example for the intended use:
```
...
template <typename scalar_t>
__global__ void
apply_homography_2d_kernel(cuda::CUDATensorAccessor<scalar_t, 4> dest_a,
                           cuda::CUDATensorAccessor<scalar_t, 4> src_a,
                           cuda::CUDATensorAccessor<float, 2> transform) {
  ...
}
template <typename scalar_t>
Tensor apply_homography_2d_template(Tensor& res, const Tensor& image, const Tensor& transform) {
  ...
  cuda::CUDATensorAccessor<scalar_t, 4> image_a(image.accessor<scalar_t, 4>());
  cuda::CUDATensorAccessor<scalar_t, 4> res_a(res.accessor<scalar_t, 4>());
  cuda::CUDATensorAccessor<float, 2> transform_a(transform.accessor<float, 2>());
  auto stream = at::cuda::getCurrentCUDAStream();
  apply_homography_2d_kernel<scalar_t>
      <<<grid, block, 0, stream>>>(res_a, image_a, transform_a);
  return res;
}
...
```
I could use a hint on where to put a test for this (e.g. doing a plain vanilla matrix multiplication with a custom kernel) and comparing with the ATen mm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11373
Differential Revision: D9735573
Pulled By: ezyang
fbshipit-source-id: 482b218a0d514e19a8b692bbc77c0e37082cfded
Summary: Considering these increase the size of the message stack, I didn't touch the code outside `ATen/native`
Differential Revision: D9754283
Pulled By: soumith
fbshipit-source-id: 04198ec4fd0c4abae09eeba92c493a783408537a
Summary:
This PR adds the "merge to master" step before the build step in CircleCI, so that all PR commits are built against master instead of against the PR's branch. Note that all PRs still need to rebase to master to pick up this new config, so it won't apply to old PR branches retroactively.
To check in CI: make sure it's performing the git merge to master appropriately in "Merge Onto Master" step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11443
Differential Revision: D9775628
Pulled By: yf225
fbshipit-source-id: 8083db6b098d234a44ae4481f40a486e9906f6f8
Summary:
Disable all CircleCI jobs until we are ready to move forward with them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11523
Differential Revision: D9774462
Pulled By: yf225
fbshipit-source-id: c5724e71eb68bac4df958b4f7bcc380050668b3c
Summary:
Need to link CUDA statically for benchmarking purpose.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10596
Reviewed By: llyfacebook
Differential Revision: D9370738
Pulled By: sf-wind
fbshipit-source-id: 4464d62473e95fe8db65b0bd3b301f262bf269bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11486
I discovered these by narrowing the interface on Type, and then
fixing call sites outside of core plumbing code which depended
on these methods being provided.
Reviewed By: cpuhrsch
Differential Revision: D9757935
fbshipit-source-id: 3abda0c98919a448a326a757671d438964f6909f
Summary:
I noticed warnings from within pybind11 being shown when building C++ extensions. This can be avoided by including non-user-supplied headers with `-isystem` instead of `-I`
I hope this works on Windows.
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11459
Differential Revision: D9764444
Pulled By: goldsborough
fbshipit-source-id: b288572106078f347f0342f158f9e2b63a58c235
Summary:
This speeds up incremental builds by doing the following changes:
- Uses `rsync` instead of `cp` (when `rsync` is found) which is a bit smarter in doing "maybe copy"
- Introduces a `rebuild` mode which does not rerun `cmake` in `build_pytorch_libs.sh`.
*Note: `rebuild` should only be used if you don't add / remove files to the build, as `cmake` is not rerun*
Current no-op rebuild speedup:
- 1m 15s -> 20s
There are some lingering bugs. No-op rebuilds rerun `cmake` for two rebuilds (likely the cmake logic depends on the install folder, hence kicking off a rebuild).
So what you see is:
```
python setup.py rebuild develop # first time - ~5 mins
python setup.py rebuild develop # second time - ~3 mins
python setup.py rebuild develop # third time - ~2 mins
python setup.py rebuild develop # fourth time - ~20 seconds
python setup.py rebuild develop # fifth time - ~20 seconds
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11487
Differential Revision: D9769087
Pulled By: soumith
fbshipit-source-id: 20fbecde33af6426149c13767e8734fb3be783c5
Summary:
ATen has had a separate build target in the past, but with our move to a root-level CMakeLists.txt file this makes less sense and is harder to maintain. Also, as we blend code between Caffe2 and ATen this will become even less maintainable.
Talked to ezyang about this, but also cc zdevito, Yangqing, and soumith. If this is too difficult, I will revert, but want to see if we can simplify for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11488
Differential Revision: D9770266
Pulled By: orionr
fbshipit-source-id: c7ba52a1676d84e2d052dad4c042b666f49451cd
Summary:
This adds a `.expand` method for distributions that is akin to the `torch.Tensor.expand` method for tensors. It returns a new distribution instance with batch dimensions expanded to the desired `batch_shape`. Since this calls `torch.Tensor.expand` on the distribution's parameters, it does not allocate new memory for the expanded distribution instance's parameters.
e.g.
```python
>>> d = dist.Normal(torch.zeros(100, 1), torch.ones(100, 1))
>>> d.sample().shape
torch.Size([100, 1])
>>> d.expand([100, 10]).sample().shape
torch.Size([100, 10])
```
We have already been using the `.expand` method in Pyro in our [patch](https://github.com/uber/pyro/blob/dev/pyro/distributions/torch.py#L10) of `torch.distributions`. We use this in our models to enable dynamic broadcasting. This has also been requested by a few users on the distributions slack, and we believe will be useful to the larger community.
Note that currently, there is no convenient and efficient way to expand distribution instances:
- Many distributions use `TransformedDistribution` (or wrap another distribution instance; e.g. `OneHotCategorical` uses a `Categorical` instance) under the hood, or have lazy parameters. This makes it difficult to collect all the relevant parameters, broadcast them and construct new instances.
- In the few cases where this is even possible, the resulting implementation would be inefficient since we will go through a lot of broadcasting and args validation logic in `__init__.py` that can be avoided.
The `.expand` method allows for a safe and efficient way to expand distribution instances. Additionally, this bypasses `__init__.py` (using `__new__` and populating relevant attributes) since we do not need to do any broadcasting or args validation (which was already done when the instance was first created). This can result in significant savings as compared to constructing new instances via `__init__` (that said, the `sample` and `log_prob` methods will probably be the rate determining steps in many applications).
e.g.
```python
>>> a = dist.Bernoulli(torch.ones([10000, 1]), validate_args=True)
>>> %timeit a.expand([10000, 100])
15.2 µs ± 224 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit dist.Bernoulli(torch.ones([10000, 100]), validate_args=True)
11.8 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
cc. fritzo, apaszke, vishwakftw, alicanb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11341
Differential Revision: D9728485
Pulled By: soumith
fbshipit-source-id: 3b94c23bc6a43ee704389e6287aa83d1e278d52f
Summary:
This enabled `torch.einsum` both in tracing and in script mode. It's used all over Pyro at the moment, and is needed for any use of the JIT in there.
Fixes #11157.
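A minimal example of the usage this enables (illustrative; shapes and names are made up):
```python
import torch

def outer(a, b):
    return torch.einsum('i,j->ij', a, b)

# einsum can now appear inside a traced (or scripted) function
traced = torch.jit.trace(outer, (torch.randn(3), torch.randn(4)))
print(traced(torch.randn(3), torch.randn(4)).shape)  # torch.Size([3, 4])
```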
zdevito fritzo neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11506
Differential Revision: D9764787
Pulled By: apaszke
fbshipit-source-id: 9b5251b9e7c5897034602bd07ff67b425d33326c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11418
Several improvements that aim to make the APIs more straightforward to use
- Get rid of the helper methods subgraph and nonTerminal. Users should now create an NNMatchGraph directly via the graph's createNode and createEdge APIs
- Get rid of operatorSubgraph helper method
- invertGraphTraversal flag applies to both the match graph and the scanned graph. This allows users to create the match graph in the same direction as the scanned graph, thus reducing confusion.
- additional parameters of matchNode (count, includeInSubgraph, nonTerminal) are removed from the constructors and moved into setter methods. (We no longer enforce that MatchNode is immutable but this helps improve code clarity).
- Tests are updated to reflect the changes
Follow up changes:
- Possibly clean up the tests further. This change aims to minimally modify the unit tests.
- Add a validity check that enforces the current limitation of the match graph (a single source node), and throws if the match graph does not satisfy the criteria.
- Have the single source node be detected automatically, so callers just need to pass in the matchGraph instead of the source node reference.
Differential Revision: D9732565
fbshipit-source-id: ae8320e2bc89b867f6bb4b1c1aad635f4b219fa1
Summary:
The old `torch.distributed` will go to `torch.distributed.deprecated`
The old DDP will go to `torch.nn.parallel.deprecated`
Now `torch.nn.parallel.DDP` will use c10d DDP
Now `torch.distributed` will use C10d frontend API
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11405
Reviewed By: pietern
Differential Revision: D9733733
Pulled By: teng-li
fbshipit-source-id: d6a3f3e73f8d3a7fcb1f4baef53c78063b8cbb08
Summary:
Add flags for LMDB and LevelDB, default `OFF`. These can be enabled with
```
USE_LMDB=1 USE_LEVELDB=1 python setup.py build_deps
```
Also add a flag to build Caffe2 ops, which is default `ON`. Disable with
```
NO_CAFFE2_OPS=1 python setup.py build_deps
```
cc Yangqing soumith pjh5 mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11462
Reviewed By: soumith
Differential Revision: D9758156
Pulled By: orionr
fbshipit-source-id: 95fd206d72fdf44df54fc5d0aeab598bff900c63
Summary:
Skip torch tests as well when NO_TEST=1 environment variable is set. Also remove the separate ATen code path for not being built with Caffe2, since it will always be built with Caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11415
Reviewed By: soumith
Differential Revision: D9758179
Pulled By: orionr
fbshipit-source-id: e3e3327364fccdc57a703aeaad8c4f30452973fb
Summary:
There's a bunch of legacy code where people are explicitly instantiating Variable, and these call-sites have thus far been untraceable (appearing as prim::Constant nodes with the tensor value at the time of tracing). This makes it so that the new variable inherits the traced Value* from the tensor it's being constructed from
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11463
Differential Revision: D9756529
Pulled By: jamesr66a
fbshipit-source-id: da99c6a7621957a305f2699ec9cb9def69b1b2d7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10291
This new operator will do the following:
Given a LENGTHS vector and n_splits, output a "split" LENGTHS vector where:
1. Each length in input vector is split into n_splits values (thus output vector should have LENGTHS.size(0) * n_splits elements)
2. The new lengths in output should be evenly split, and if the length is not divisible by n_splits, then order new values in descending order. (e.g. n_splits = 3, length = 5 -> 2 2 1)
3. If n_splits > some element in the array, its split elements will contain 0s. (e.g. n_splits = 3, length = 2 -> 1 1 0; see the sketch below)
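A minimal Python sketch of the splitting rule described above (the operator itself is a Caffe2 op; this is only illustrative):
```python
def split_length(length, n_splits):
    # Split `length` into n_splits parts as evenly as possible, in
    # descending order, padding with zeros when n_splits > length.
    base, rem = divmod(length, n_splits)
    return [base + 1] * rem + [base] * (n_splits - rem)

def split_lengths(lengths, n_splits):
    # Output has len(lengths) * n_splits entries.
    out = []
    for length in lengths:
        out.extend(split_length(length, n_splits))
    return out

print(split_lengths([5, 2], 3))  # [2, 2, 1, 1, 1, 0]
```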
Reviewed By: bddppq, chocjy
Differential Revision: D9013119
fbshipit-source-id: 82bf3371ec08c41fc3379177f0007afc142e0d84
Summary:
This PR is stacked on https://github.com/pytorch/pytorch/pull/10610, and only adds changes in one file `.jenkins/pytorch/test.sh`, where we now build the custom op tests and run them.
I'd also like to take this PR to discuss whether the [`TorchConfig.cmake`](https://github.com/pytorch/pytorch/blob/master/cmake/TorchConfig.cmake.in) I made is robust enough (we will also see in the CI) orionr Yangqing dzhulgakov what do you think?
Also ezyang for CI changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10611
Differential Revision: D9597627
Pulled By: goldsborough
fbshipit-source-id: f5af8164c076894f448cef7e5b356a6b3159f8b3
Summary:
Many constructors like `torch.zeros` or `torch.randn` didn't support
size tracing correctly, which is fixed by this pass. The same issue has been
fixed in legacy tensor constructors.
Additionally, new tensor constructors, which do not participate in
tracing (most notably `torch.tensor`, `torch.as_tensor` and
`torch.from_numpy`) raise a warning when they are used.
Finally, entering a traceable operation disables the tracing in its body.
This is needed because
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11288
Reviewed By: ezyang
Differential Revision: D9751183
Pulled By: apaszke
fbshipit-source-id: 51444a39d76a3e164adc396c432fd5ee3c8d5f7f
Summary:
NCCL1 uses `int` as its numerical type for fields like `count`, which makes broadcasting tensors larger than `2 << 31 - 1` impossible, and raises opaque error `invalid arguments`. NCCL2 greatly increase the limit on many platforms by using `size_t`. This patch statically detects this type, and raises properly if the broadcast tensor exceeds the limit.
No test because I don't think our test suite should broadcast big tensors.
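A rough sketch of the kind of check this adds, written in Python for illustration (the real check lives in the C++ NCCL bindings):
```python
NCCL1_MAX_COUNT = 2 ** 31 - 1  # NCCL1 uses a 32-bit int for element counts

def check_nccl_broadcast_size(tensor):
    # Fail with a clear message instead of NCCL's opaque "invalid arguments".
    if tensor.numel() > NCCL1_MAX_COUNT:
        raise RuntimeError(
            "Broadcast of {} elements exceeds the NCCL1 count limit of {}"
            .format(tensor.numel(), NCCL1_MAX_COUNT))
```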
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11466
Differential Revision: D9754753
Pulled By: SsnL
fbshipit-source-id: 73506450cae047e06b5b225b39efdb42d5d26685
Summary:
Normalizing by the world size before the reduction is less likely to cause overflow in FP16 training.
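A minimal sketch of the ordering change (illustrative; assumes `torch.distributed` has been initialized):
```python
import torch.distributed as dist

def all_reduce_gradient(grad, world_size):
    # Dividing by world_size *before* the all_reduce keeps the summed
    # values small, which is less likely to overflow in FP16 training.
    grad.div_(world_size)
    dist.all_reduce(grad)
    return grad
```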
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11109
Differential Revision: D9594708
Pulled By: myleott
fbshipit-source-id: 93ab53cb782ee1cbe1264e529b333490a0940338
Summary:
I'm 80% sure that this fixes the math bug. But I can't repro locally so I don't know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11472
Differential Revision: D9755328
Pulled By: SsnL
fbshipit-source-id: 130be664d3c6ceee3c0c166c1a86fc9ec3b79d74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11294
The Tensor(ptr, retain) constructor is error prone and circumvents the intrusive_ptr safety.
This diff removes that and pushes the responsibility to callers.
Step by step, manual refcounting can be pushed back and possibly eliminated in the end.
Reviewed By: ezyang
Differential Revision: D9663476
fbshipit-source-id: 7f010e5e47b137a9575960201c5bf5d552c5c2f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11260
This is needed to make something like this work:
intrusive_ptr<TensorImpl, UndefinedTensorImpl> a = make_intrusive<SparseTensorImpl>(...);
Reviewed By: ezyang
Differential Revision: D9652089
fbshipit-source-id: 19c65e98460ccb27bc69e36d7e558cb9d6e67615
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11258
The two intrusive_ptr constructors in Tensor can be combined into one implementation that does both, moving and copying.
Reviewed By: ezyang
Differential Revision: D9652088
fbshipit-source-id: 5efca02654ba305c99c20bbeb83551469d17a51d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11238
- when moving an IValue, free the old value instead of keeping it allocated
- making classes final
- moving std::string
- making ConstantList const
Reviewed By: ezyang
Differential Revision: D9644700
fbshipit-source-id: ab7228368e4f00f664ba54e1242b0307d91c5e7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11167
Narrow the Blob API as preparation for merging Blob/IValue
- get rid of templated IsType and Operator::InputIsType / OutputIsType
- Use 'using' instead of 'typedef' for DestroyCall (just for readability)
Reviewed By: ezyang
Differential Revision: D9623916
fbshipit-source-id: 952f0b0cf5a525094b02e8d2798dd57a56a9e1d8
Summary:
Checking assertExportImport for all of the generated test jit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10982
Differential Revision: D9636935
Pulled By: eellison
fbshipit-source-id: f3f1ce77d454848098f2ac7e0fa18bf8564890be
Summary:
`Process.start()` actually takes some time, as it needs to start a
process and pass the arguments over via a pipe. Therefore, we
only add a worker to the self.workers list after it has started (see the
sketch below), so that we do not call `.join()` on it if the program dies
before it starts, in which case `__del__` would try to join it and get:
AssertionError: can only join a started process.
Example trace when such error happens:
```py
[unrelated]
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 500, in __iter__
return _DataLoaderIter(self)
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 292, in __init__
w.start()
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
KeyboardInterrupt
Exception ignored in: <function _DataLoaderIter.__del__ at 0x7fa704d5aa60>
Traceback (most recent call last):
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 398, in __del__
self._shutdown_workers()
File "/private/home/ssnl/miniconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 392, in _shutdown_workers
w.join()
File "/private/home/ssnl/miniconda3/lib/python3.7/multiprocessing/process.py", line 139, in join
assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process
```
No test because hard to reliably trigger.
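A minimal sketch of the ordering change (a hypothetical simplification, not the actual DataLoader code):
```python
import multiprocessing

def start_workers(num_workers, worker_loop, args):
    workers = []
    for _ in range(num_workers):
        w = multiprocessing.Process(target=worker_loop, args=args)
        w.daemon = True
        w.start()          # start first ...
        workers.append(w)  # ... and only then remember the worker,
    # so that shutdown never calls .join() on a process that was never
    # started (which raises "can only join a started process").
    return workers
```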
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11432
Reviewed By: ezyang
Differential Revision: D9735430
Pulled By: SsnL
fbshipit-source-id: a8912d9bb4063f210d6236267b178173810e2351
Summary:
as discussed with ezyang and slayton58, this might be a nice convenience to be able to use code in extensions just as in ATen.
also split off `tracing_state.h` from `torch/jit/tracer.h` (fixes #11204) to be able to use the utility functions
pytorchbot it's not a jit patch per se.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11425
Differential Revision: D9735556
Pulled By: ezyang
fbshipit-source-id: 466c92bbdb1d7d7a970eba1c26b7583fe9756139
Summary:
A recent build regression is that we need a system GoogleTest for builds to pass.
This was because, when building with Gloo, Gloo tries to build its own tests, which look for a system gtest [here](https://github.com/facebookincubator/gloo/blob/master/cmake/Dependencies.cmake#L72-L80) (because we're not using the full cmake build and making it aware of third_party/GoogleTest, but instead we are building it in isolation using tools/build_pytorch_libs.sh).
Traditionally, we didn't ask Gloo to build its tests, but because we added `-DBUILD_TEST=1` by default to all builds (while refactoring variable names), we accidentally started asking Gloo to build its tests.
This PR overrides the Gloo flags and asks it to not build tests (like it used to).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11431
Differential Revision: D9736387
Pulled By: soumith
fbshipit-source-id: 59e84edae780123b793bdaea5fd9ac46156cd0af
Summary:
This PR parallelizes `masked_fill` on CPU; currently it runs sequentially.
The following script is used to benchmark and verify this PR. On a Xeon Skylake 8180 (2 sockets * 28 cores),
it runs `4.20` sec without the PR and `0.11` sec with the PR.
```python
import torch
import random
from time import time
size = 10 * 1000 * 1000
count = 100
def test_masked_fill():
    dst = torch.randn(size)
    dst_ = dst.clone()
    mask = torch.rand(size).mul(2).floor().byte()
    val = random.random()
    tstart = time()
    for i in range(count):
        dst.masked_fill_(mask, val)
    tend = time()
    print("masked_fill_: %f" % (tend-tstart))
    for i in range(size):
        if mask[i]:
            if dst[i] != val:
                print("fail")
        else:
            if dst[i] != dst_[i]:
                print("fail1")
    print("test_masked_fill: PASS")
test_masked_fill()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11359
Differential Revision: D9735578
Pulled By: ezyang
fbshipit-source-id: d437ad7c6dace1910d0c18d6d9ede80efb44fae4
Summary:
Added AVX optimizations for pdist using Vec256. This brings single threaded performance up to speed with scipy, but the current implementation greatly hurts performance without AVX enabled. Is there a way to special case out AVX on dispatch and call the non Vec256 code? Or is the way I used Vec256 completely wrong?
Single threaded comparison to scipy
============================
This is the time to compute the pdist of a 2048 x 2048 float matrix with only one thread for various values of p between torch and scipy. p = 3 is the code path for arbitrary p, and so is much slower than the other values.
p | torch | scipy
-----|-----------|------
0 | 6.27 s ± 393 ms | 7.23 s ± 498 ms
1 | 5.49 s ± 201 ms | 43.4 s ± 1.09 s
2 | 5.74 s ± 474 ms | 53.8 s ± 3.52 s
∞ | 5.59 s ± 292 ms | 47.4 s ± 2.03 s
3 | really slow | gave up
Result by AVX support
================
This is the time to compute the distance and gradient of a 2048 x 2048 float matrix with all threads by AVX support. `before` is the old code, `default` is no AVX support, etc. Interestingly the AVX optimizations provided a great benefit over the old unoptimized code, but drastically hurt performance when compiled without AVX optimizations. p = 3 is the code path for arbitrary p, and so is much slower than the other values.
Results for p = 0
----------------
avx | dist | grad
----|------|-----
before | 514 ms ± 87.5 ms | 191 µs ± 35 µs
default | 3.47 s ± 183 ms | 201 µs ± 24.6 µs
avx | 123 ms ± 18.2 ms | 281 µs ± 130 µs
avx2 | 103 ms ± 11.4 ms | 216 µs ± 74.4 µs
Results for p = 1
----------------
avx | dist | grad
----|------|-----
before | 426 ms ± 35 ms | 6.21 s ± 187 ms
default | 2.6 s ± 123 ms | 5.62 s ± 273 ms
avx | 104 ms ± 6.37 ms | 833 ms ± 44.3 ms
avx2 | 106 ms ± 3.59 ms | 924 ms ± 86.2 ms
Results for p = 2
-----------------
avx | dist | grad
----|------|-----
before | 425 ms ± 45.4 ms | 6.31 s ± 125 ms
default | 3.04 s ± 187 ms | 3.55 s ± 242 ms
avx | 110 ms ± 3.66 ms | 896 ms ± 21.8 ms
avx2 | 113 ms ± 4.68 ms | 934 ms ± 25.2 ms
Results for p = ∞
------------------
avx | dist | grad
----|------|-----
before | 501 ms ± 39.5 ms | 6.64 s ± 321 ms
default | 2.15 s ± 92.9 ms | 8.43 s ± 355 ms
avx | 104 ms ± 5.52 ms | 835 ms ± 36.7 ms
avx2 | 100 ms ± 3.41 ms | 864 ms ± 67 ms
Results for p = 3
-----------------
avx | dist | grad
----|------|-----
before | 22.6 s ± 413 ms | 11.1 s ± 242 ms
default | 24.9 s ± 1 s | 11.2 s ± 293 ms
avx | 2.69 s ± 148 ms | 5.63 s ± 88.4 ms
avx2 | 2.48 s ± 31.8 ms | 5.61 s ± 114 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11230
Differential Revision: D9735503
Pulled By: erikbrinkman
fbshipit-source-id: a9da619249e4ca2625b39ca1ca7f5543c3086bfb
Summary:
If pybind is build with cmake and installed, we should use config file instead of the Findpybind11 shipped with caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11423
Differential Revision: D9735557
Pulled By: ezyang
fbshipit-source-id: 28a39e579fa045060aa1a716e5fd7dbcf7b89569
Summary:
Fixes the issue discussed in #10838. `hidden_size` should be the last dimension regardless if we're in ONNX or PyTorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11368
Differential Revision: D9734814
Pulled By: soumith
fbshipit-source-id: 7f69947a029964e092c7b88d1d79b188a417bf5f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11420
Surprisingly tricky! Here are the major pieces:
- We grow an even more ludicrous macro
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_COMPLEX_HALF
which does what it says on the tin. This is because I was
too lazy to figure out how to define the necessary conversions
in and out of ComplexHalf without triggering ambiguity problems.
It doesn't seem to be as simple as just Half. Leave it for
when someone actually wants this.
- Scalar now can hold std::complex<double>. Internally, it is
stored as double[2] because nvcc chokes on a non-POD type
inside a union.
- overflow() checking is generalized to work with complex.
When converting *to* std::complex<T>, all we need to do is check
for overflow against T. When converting *from* complex, we
must check (1) if To is not complex, that imag() == 0
and (2) for overflow componentwise.
- convert() is generalized to work with complex<->real conversions.
Complex to real drops the imaginary component; we rely on
overflow checking to tell if this actually loses fidelity. To get
the specializations and overloads to work out, we introduce
a new Converter class that actually is specializable.
- Complex scalars convert into Python complex numbers
- This probably fixes complex tensor printing, but there is no way
to test this right now.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: cpuhrsch
Differential Revision: D9697878
Pulled By: ezyang
fbshipit-source-id: 181519e56bbab67ed1e5b49c691b873e124d7946
Summary:
vishwakftw Your patch needed some updates because the default native function dispatches changed from `[function, method]` to `[function]`. The CI was run before that change happened so it still shows green, but the internal test caught it.
I did some changes when rebasing and updating so I didn't just force push to your branch. Let's see if this passes CI and internal test. If it does, let me know if you want me to force push to your branch or use this PR instead.
Note to reviewers: patch was already approved at #10068 .
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11421
Differential Revision: D9733407
Pulled By: SsnL
fbshipit-source-id: cf2ed293bb9942dcc5158934ff4def2f63252599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11331
In the previous commit, we added a bare-bones LegacyTypeDispatch in ATen/core.
This is not sufficient for the use cases we need: we not only need to be able to
get a Type, but we also need to be able to *initialize* the Types if its the first time
we have retrieved a CPU/CUDA/Complex type. I hemmed and hawed about how
to do this; the strategy this PR takes is to introduce a new "hooks" interface
specifically for initializing CPU/CUDA/Complex (which still lives in Context). We then
move all "user-friendly" functions to LegacyTypeDispatch.
Here were some other options which I considered, but don't work:
- Assume that Type is already initialized, because we only intend to call Type
from Tensor methods, where we already have a Tensor. This does not work
because Caffe2 created tensors will not have gone through the standard
Type codepath, and will have skipped initialization.
- Move CUDAHooks and ComplexHooks to ATen/core. Besides being sucky,
this isn't even a complete fix, because I still need to initialize CPU hooks
(so you *still* need another hooks interface).
Reviewed By: cpuhrsch
Differential Revision: D9666612
fbshipit-source-id: ac7004b230044b67d13caa81fdfaf3c6ab915e3f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11274
We don't want to put all of Context into ATen/core, but one
particular part cannot be avoided: the type registry, because
implementations of TensorMethods will need to get a Type,
and then do a virtual call on it.
I needed to do a little bit of (temporary) footwork to get this
in without also moving Type, because unique_ptr<Type> expects
to be able to see the destructor of Type (but it's forward declared
right now). So instead I put the destructor as an explicit functor. We
can get rid of this once Type actually moves in ATen/core
Reviewed By: cpuhrsch
Differential Revision: D9657449
fbshipit-source-id: 940931493bf4f1f6a8dad03f34633cacdd63dd0b
Summary:
to 300 seconds to be safe. There used to be no timeout in THD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11409
Differential Revision: D9731709
Pulled By: teng-li
fbshipit-source-id: 0ce011dcca507cbf063176ad4995405c77dd0cdd
Summary:
Currently the gradient is copied into .grad if it is None. This PR aims to remove the copy when it is not absolutely needed.
It is generally an improvement in speed and memory usage. And here is a case where it may help a lot:
Normally, people do optimizer.zero_grad() every minibatch before backward. It will translate into a memset, and later a point-wise add.
When there is some large weight in the network, one optimization people can always do is set parameter.grad to None instead of zero_grad. This will remove memset and change point-wise add to a memcpy.
Here is result running following script on V100 GPU. It is 100 iterations of forward/backward/zero_grad on single 1-billion word benchmark size embedding.
`Zero grad: 2.123847723007202`
`None grad: 1.3342866897583008`
With the backend change of this PR, the unnecessary memcpy is removed, thus further speed up is achieved.
`Zero grad: 2.124978542327881`
`None grad: 0.4396955966949463`
[benchmark.txt](https://github.com/pytorch/pytorch/files/2341800/benchmark.txt)
Some details on the code change:
.detach() is used because we need to get rid of new_grad being a view without copying data. This should be safe in first-order only mode.
The data needs to be contiguous, otherwise `grad_variable.data() += new_grad.data();` below will fail.
Only the last variable that has reference to the temp gradient will grab its buffer.
ngimel, mcarilli and mruberry helped on finalizing this PR.
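A minimal sketch of the two patterns compared in the timings above (model and sizes are illustrative; requires a CUDA device):
```python
import torch

emb = torch.nn.Embedding(1000000, 256).cuda()
opt = torch.optim.SGD(emb.parameters(), lr=0.1)
idx = torch.randint(0, 1000000, (1024,), device='cuda')

for _ in range(100):
    # Pattern A: zero_grad() -> memset now, pointwise add during backward.
    opt.zero_grad()
    # Pattern B: set .grad to None instead; with this PR the backward pass
    # can hand over the gradient buffer directly, avoiding the extra copy.
    # for p in emb.parameters():
    #     p.grad = None
    loss = emb(idx).sum()
    loss.backward()
    opt.step()
```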
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11165
Differential Revision: D9728874
Pulled By: soumith
fbshipit-source-id: b8fb822a2dff6e812bbddd215d8e384534b2fd78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11247
Previously, the default for a declaration in native_functions.yaml
was ['function', 'method'], i.e., generate both a method and
function for every binding. We now believe this is inappropriate:
the majority of new kernels added to PyTorch should live as
free functions, NOT methods. Thus, we change the default accordingly.
I also took the opportunity to de-method some "internal" functions
that had a leading underscore. While, strictly speaking, this is a
BC breaking change, I believe it is highly unlikely anyone was using
these directly.
Reviewed By: yf225
Differential Revision: D9648570
fbshipit-source-id: 8b94647b824e0899d6d18aa5585aaedc9d9957d2
Summary:
This is mainly to pick up the change 20074be19a to avoid polluting the CMAKE_DEBUG_POSTFIX variable. cc orionr .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11388
Reviewed By: orionr
Differential Revision: D9720931
Pulled By: Yangqing
fbshipit-source-id: 18a60d0409e74316f74d364f4fe16bf0d0198413
Summary:
Moves the code for the complex registration code into an out-of-line C++ extension to de-noise the test_cpp_extensions.py file. Let's keep it nice and tidy so we can point our users at it for usage examples.
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11397
Differential Revision: D9725335
Pulled By: goldsborough
fbshipit-source-id: 290618f2ee711b1895cdb8f05276034dfe315c6d
Summary:
~~This PR fixes #8525 by renaming `split_with_sizes` to `split` so that 2 `aten::split` ops are
generated (previously `aten::split(self, int, int)` and `aten::split_with_sizes(self, int[], int)` were generated)~~
~~`split_with_sizes` was made in PR #5443, but I don't see a reason for it to have
a different name than `split` rather than just overload `split`.~~
This PR fixes #8525 by adding `register_special_ops.cpp` to mirror Python's dispatching from `split` to `split` and `split_with_sizes` in [tensor.py](https://github.com/pytorch/pytorch/blob/master/torch/tensor.py#L279).
It also fixes #8520 by adding an `int[]` wherever it sees `torch.Size`
In a follow up PR this could also be used to fix some of the other `unknown builtin op` test errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11051
Differential Revision: D9582443
Pulled By: driazati
fbshipit-source-id: d27201f85937d72e45e851eaa1460dd3dd1b61a9
Summary:
This seems to be causing different versions of OpenMPI being picked up
by different parts of the build. Not a good practice to include absolute
paths anyway, so let's try removing it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11386
Reviewed By: teng-li
Differential Revision: D9724349
Pulled By: pietern
fbshipit-source-id: 3dfef91c81f2e97e5125284aff9e7e98f8761917
Summary:
Continuing pjh5's work to remove FULL_CAFFE2 flag completely.
With these changes you'll be able to also do something like
```
NO_TEST=1 python setup.py build_deps
```
and this will skip building tests in caffe2, aten, and c10d. By default the tests are built.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11321
Reviewed By: mingzhe09088
Differential Revision: D9694950
Pulled By: orionr
fbshipit-source-id: ff5c4937a23d1a263378a196a5eda0cba98af0a8
Summary:
In addition to documentation, this cleans up a few error message formats.
It also adds infra to find which operators are supported by the JIT automatically, which is then used in the generation of the docs.
The wording and formatting of the docs is not yet polished, but having this will allow our document writers to make faster progress.
Followup PRs will polish the docs and fix formatting issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11357
Differential Revision: D9721277
Pulled By: zdevito
fbshipit-source-id: 153a0d5be1efb314511bcfc0cec48643d78ea48b
Summary:
Add a barrier() to wait for all PG created before destroy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11391
Differential Revision: D9727383
Pulled By: teng-li
fbshipit-source-id: 689d62c978e642b68f4949dcf29982e34869ada4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11382
We found this cudnn bug in S163230 that causes accuracy loss. We fixed this in D9601217, but due to the reimplementation of spatialBN the fix was overwritten. Let's land this fix again.
Reviewed By: kuttas
Differential Revision: D9702347
fbshipit-source-id: 11547e9edaf7b2ba7f4aa7263ffb4f0281bbf078
Summary:
The next function I'm moving to C++ is `sync_params`. It is stacked on top of https://github.com/pytorch/pytorch/pull/9729, so some changes will go away when it lands and I rebase.
I also split code into a `.h` and `.cpp` file for better code organization.
pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9805
Differential Revision: D9688604
Pulled By: goldsborough
fbshipit-source-id: 4467104d3f9e2354425503b9e4edbd59603e20a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11336
Move `context_base.h` header to `ATen/core` and the implementations are in `caffe2/core/context_base.cc`
Reviewed By: ezyang
Differential Revision: D9670493
fbshipit-source-id: ce5bf2b3b4c80e9b62819f4332ce68af82720055
Summary:
This PR cleans up the `at::Tensor` class by removing all methods that start with an underscore in favor of functions in the `at::` namespace. This greatly cleans up the `Tensor` class and makes it clearer what is the public and non-public API.
For this I changed `native_functions.yaml` and `Declarations.cwrap` to make all underscore methods `variant: function` (or add such a statement to begin with), and then fixed all code locations using the underscore methods.
ezyang colesbury gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11152
Differential Revision: D9683607
Pulled By: goldsborough
fbshipit-source-id: 97f869f788fa56639c05a439e2a33be49f10f543
Summary:
Add the gpu kernel version.
The parallelism I went with performs poorly when there are a large number of vectors that are all short, as I don't allocate the thread pool to wrap in that case.
Test Plan
---------
```
python -m unittest test_torch.TestTorch.test_pdist_{empty,scipy} test_nn.TestNN.test_pdist{,_zeros,_empty_row,_empty_col,_cpu_gradgrad_unimplemented,_cuda_gradgrad_unimplemented} test_jit.TestJitGenerated.test_nn_pdist
```
Current performance specs are a little underwhelming; I'm in the process of debugging.
size | torch | torch cuda | scipy
-----|-------|------------|------
16 x 16 | 9.13 µs ± 3.55 µs | 9.86 µs ± 81.5 ns | 15.8 µs ± 1.2 µs
16 x 1024 | 15 µs ± 224 ns | 9.48 µs ± 88.7 ns | 88.7 µs ± 8.83 µs
1024 x 16 | 852 µs ± 6.03 µs | 7.84 ms ± 6.22 µs | 4.7 ms ± 166 µs
1024 x 1024 | 34.1 ms ± 803 µs | 11.5 ms ± 6.24 µs | 273 ms ± 6.7 ms
2048 x 2048 | 261 ms ± 3.5 ms | 77.5 ms ± 41.5 µs | 2.5 s ± 97.6 ms
4096 x 4096 | 2.37 s ± 154 ms | 636 ms ± 2.97 µs | 25.9 s ± 394 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11102
Differential Revision: D9697305
Pulled By: erikbrinkman
fbshipit-source-id: 2b4f4b816c02b3715a85d8db3f4e77479d19bb99
Summary:
This is so that TensorImpl does not have to depend on Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11337
Differential Revision: D9684421
Pulled By: gchanan
fbshipit-source-id: d2af93420ca6d493429c251cfe5a34e9289c4484
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11273
This one might strike you as a bit surprising, but it's necessary
to expose this interface in ATen/core, because we need to be
able to get a true Variable type from Variable tensors, and
to do that we need to go through the hooks interface.
Reviewed By: gchanan
Differential Revision: D9656548
fbshipit-source-id: 28bb5aee6ac304e8cd5fa1e4c65452c336647161
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11270
Still need to deduplicate this with caffe2/core/registry.h,
but this will be a bit tricky because the current formulation
of the macro is namespace sensitive (i.e., the macro for classes
defined in at:: namespace won't work if you call from caffe2::
namespace).
Reviewed By: gchanan
Differential Revision: D9654871
fbshipit-source-id: 2207d1f2cc6d50bd41bf64ce0eb0b8523b05d9d9
Summary:
After submitting PR #9726, PR #10581 created a different CUDAEvent class. The CUDAEvent proposed in #9726 was similar to the c10d::CUDAEvent class with additional testing and functionality. In particular, it was movable but not copyable. The CUDAEvent created by #10581 is refcounted and copyable. This PR retains the refcounting of the latter PR while fixing several bugs, adding tests, and extending the functionality to support testing and usage like in PR #8354. In particular, this PR:
- Adds set_device() to CUDAContext
- Adds three CUDAEvent tests to stream_test.cpp
- Fixes three bugs:
- Refcounting was broken. Destroying any of the RAIIs holding a particular CUDAEvent would destroy the event UNLESS it was the last RAII (the check was backwards).
- Moving an event would cause a segfault.
- Events were not destroyed on the device they were created on. See PR #9415 (pietern)
- Adds the happened() and recordOnce() functions
- Changes the record() functions to not be const
- Adds additional assertions to verify correctness
This PR does not:
- Make c10d use the ATen CUDAEvent (this is appropriate for a separate PR)
Whether events should be refcounted is an interesting question. It adds some atomic operations and makes event creation eager. Making events movable but not copyable (like the c10d events) avoids these costs and allows events to be lazily constructed. Lazy construction is preferable when working with containers (like std::array or std::vector) and because the event's device can be set automatically to the first stream it's recorded on. With eager construction the user is required to understand that events have a device and acquire the device of the stream the event will be recorded on upfront. This can be seen here:
542aadd9a7/aten/src/ATen/native/cudnn/RNN.cpp (L1130-L1132)
and that file is the only one which currently uses the ATen CUDAEvent.
Refcounting does allow single writer multi-reader scenarios, although these scenarios can be also be supported by providing indirect access to the underlying CUDAEvent. I believe all current and planned usage scenarios do not require refcounting, and if desired I can update this PR to remove refcounting and make the ATen event movable but not copyable like the c10d event. I think not refcounting is preferable because it can improve performance, ease usability, and simplify the code (as seen with two of the above bugs).
I have decided to separate this from PR #8354 since while it's required for PR #8354 the changes are, clearly, of independent interest. PR #8354 has a new dependency on this one, however. I am closing PR #9726 in favor of this PR.
apaszke ezyang pietern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11293
Differential Revision: D9665836
Pulled By: soumith
fbshipit-source-id: a1513fa4f9761e2f304d126e402f6b6950e1c1d2
Summary:
This adds an optional `expand=True` kwarg to the `distribution.enumerate_support()` method, to get a distribution's support without expanding the values over the distribution's `batch_shape`.
- The default `expand=True` preserves the current behavior, whereas `expand=False` collapses the batch dimensions.
e.g.
```python
In [47]: d = dist.OneHotCategorical(torch.ones(3, 5) * 0.5)
In [48]: d.batch_shape
Out[48]: torch.Size([3])
In [49]: d.enumerate_support()
Out[49]:
tensor([[[1., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0.],
         [1., 0., 0., 0., 0.]],
        [[0., 1., 0., 0., 0.],
         [0., 1., 0., 0., 0.],
         [0., 1., 0., 0., 0.]],
        [[0., 0., 1., 0., 0.],
         [0., 0., 1., 0., 0.],
         [0., 0., 1., 0., 0.]],
        [[0., 0., 0., 1., 0.],
         [0., 0., 0., 1., 0.],
         [0., 0., 0., 1., 0.]],
        [[0., 0., 0., 0., 1.],
         [0., 0., 0., 0., 1.],
         [0., 0., 0., 0., 1.]]])
In [50]: d.enumerate_support().shape
Out[50]: torch.Size([5, 3, 5])
In [51]: d.enumerate_support(expand=False)
Out[51]:
tensor([[[1., 0., 0., 0., 0.]],
        [[0., 1., 0., 0., 0.]],
        [[0., 0., 1., 0., 0.]],
        [[0., 0., 0., 1., 0.]],
        [[0., 0., 0., 0., 1.]]])
In [52]: d.enumerate_support(expand=False).shape
Out[52]: torch.Size([5, 1, 5])
```
**Motivation:**
- Currently `enumerate_support` builds up tensors of size `support + batch_shape + event_shape`, but the values are *repeated* over the `batch_shape` (adding little in the way of information). This can lead to expensive matrix operations over large tensors when `batch_shape` is large (see the example above), often leading to OOM issues. We use `expand=False` in Pyro for message passing inference, e.g. when enumerating over the state space in a Hidden Markov Model. This creates sparse tensors that capture the Markov dependence, and allows for the possibility of using optimized matrix operations over these sparse tensors. `expand=True`, on the other hand, will create tensors that scale exponentially in size with the length of the Markov chain.
- We have been using this in our [patch](https://github.com/uber/pyro/blob/dev/pyro/distributions/torch.py) of `torch.distributions` in Pyro. The interface has been stable, and it is already being used in a few Pyro algorithms. We think that this is more broadly applicable and will be of interest to the larger distributions community.
cc. apaszke, fritzo, alicanb
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11231
Differential Revision: D9696290
Pulled By: soumith
fbshipit-source-id: c556f8ff374092e8366897ebe3f3b349538d9318
Summary:
This actually ended up being a lot more involved than I thought. The basic
problem is that in some of our build environments, thread local state is not
supported. The correct way to test if this is the case is using the
(undocumented) CAFFE2_FB_LIMITED_MOBILE_CAPABILITY macro.
On mobile, OptionGuard is not available, and you have to do everything
by hand. There's a static_assert to check if you accidentally use
OptionGuard in this case, and to give you a better error message when you do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11244
Reviewed By: gchanan
Differential Revision: D9646190
fbshipit-source-id: cf4016f79b47705a96ee9b6142eb34c95abb2bd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11323
If you do pass it this, you'll get a pointer to
UndefinedTensor; probably not what you want!
Reviewed By: Yangqing
Differential Revision: D9676205
fbshipit-source-id: 0bd3c22c2c40ac2958f95fc7a73b908af291cf22
Summary:
We need to remove nomnigraph from the list of public libraries in order to support libtorch extensions. Easiest way to do this is to include it into the Caffe2 source like all other caffe2/core/ code.
However, because the headers are in a different place, we need to include them for linked libraries (pybind, tests, etc).
On an upside, this means that nomnigraph is now default hidden visibility too.
FYI peterjc123 xkszltl goldsborough bwasti Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11303
Reviewed By: pjh5
Differential Revision: D9694932
Pulled By: orionr
fbshipit-source-id: 5db3eb20bc5ddc873ce9151236b74663fbb33ed8
Summary:
* purge hcSPARSE now that rocSPARSE is available
* integrate a custom hcc and HIP
* hcc brings two important compiler fixes (fixes hundreds of unit tests)
* HIP brings a smart dispatcher that allows us to avoid a lot of static_casts (we haven't yet removed the automatic static_casts but this catches some occurrences the script did not catch)
* mark 5 unit tests skipping that have regressed w/ the new hcc (we don't know yet what is at fault)
* optimize bitonic sort - the comparator is always an empty struct - therefore passing it by value saves at least 3 bytes. It also removes an ambiguity around passing references to `__global__` functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11198
Differential Revision: D9652340
Pulled By: ezyang
fbshipit-source-id: f5af1d891189da820e3d13b7bed91a7a43154690
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10888
Add cuda version of SpatialBNOp also optimize SpatialBN on CPU
Reviewed By: houseroad
Differential Revision: D9512435
fbshipit-source-id: 6f828c88d56d30dc9a2f98a297a161c35cc511b1
Summary:
Fixed a few bugs that were not tested in the c10d frontend APIs, including
get_rank, get_world_size, and destroy_process_group of a given group.
These APIs are added to the CI tests.
Also added all the group related tests, including full-group, and partial groups (existing ones), since both will hit different code paths.
Also removed experimental APIs for c10d that were initially used in DDP, since we don't use them anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11318
Reviewed By: pietern
Differential Revision: D9675896
Pulled By: teng-li
fbshipit-source-id: a2eac2c57933effa2d139855f786e64919a95bfc
Summary:
On the way to #10774
This PR adds advanced indexing with tensors.
The approach is to desugar advanced indexing into an at::index op.
This is exactly how normal pytorch does it.
[(I used this code as reference)](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_variable_indexing.cpp)
Supporting sequences is a little tricky because JIT script doesn't have
an easy way to turn arbitrary n-dimensional python lists into a tensor
(it would be easy if we supported `torch.tensor`), so that'll come
in a future PR.
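As a rough illustration (a sketch, not taken from this PR's test suite), this is the kind of script code that now works:
```python
import torch

@torch.jit.script
def select_rows(x, idx):
    # advanced indexing with a tensor, desugared into an aten::index call
    return x[idx]

x = torch.arange(12).reshape(4, 3)
idx = torch.tensor([0, 2])
print(select_rows(x, idx))  # rows 0 and 2 of x
```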
cc jamesr66a zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10862
Differential Revision: D9659449
Pulled By: zou3519
fbshipit-source-id: 56d293720d44c0fd27909e18327ab3985ddfced6
Summary:
In #9466 I got rid of storage views and eliminated all places where
they were used... OR SO I THOUGHT. In actuality, under certain
conditions (specifically, if you trained a CUDA multiprocessing model
shared over CUDA IPC and then serialized your parameters), you could
also serialize storage slices to the saved model format. In #9466,
I "fixed" the case when you loaded the legacy model format (really,
just unshared the storages--not strictly kosher but if you aren't
updating the parameters, shouldn't matter), but NOT the modern model format, so
such models would fail.
So, I could have applied the legacy model format fix too, but
hyperfraise remarked that he had applied a fix that was effectively
the same as unsharing the storages, but it had caused his model to
behave differently. So I looked into it again, and realized that
using a custom deleter, I could simulate the same behavior as old
storage slices. So back they come.
In principle, I could also reimplement storage views entirely using
our allocators, but I'm not going to do that unless someone really
really wants it.
Fixes #10120.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11314
Reviewed By: ailzhang
Differential Revision: D9671966
Pulled By: ezyang
fbshipit-source-id: fd863783d03b6a6421d6b9ae21ce2f0e44a0dcce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11190
As discussed with Alexander Sidorov, params_bytes refers to the number of bytes we're reading for parameters, not the size of the parameters; the two only differ for sparse operators.
Reviewed By: mdschatz
Differential Revision: D9628635
fbshipit-source-id: 9e2aed0cf59388928dc69b8534cf254f0347c9c8
Summary:
This is an experimental build on top of what orionr and mingzhe09088 built.
Essentially, the idea is that we will need separate *_API versions for different shared libraries. If this theory is right, I'll try to clean up the design a bit and document it properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11266
Reviewed By: orionr
Differential Revision: D9682942
Pulled By: Yangqing
fbshipit-source-id: c79653199e67a1500c9174f39f8b0357324763f3
Summary:
We shouldn't use system Eigen in any cases when building with setup.py. If people want to use system Eigen (not from third_party) they can build with CMake for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11334
Reviewed By: pjh5
Differential Revision: D9689450
Pulled By: orionr
fbshipit-source-id: baf616b9f195692942151ad201611dcfe7d927ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11098
Added a test for testing CPU version across multiple devices.
Reviewed By: enosair, BIT-silence
Differential Revision: D9584520
fbshipit-source-id: 0d8c85e6d402bc7b34d5f8f16ef655ff9b61b49e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11338
The `min_` and `max_` values of the filler are stored as `double`s, but when we are filling a tensor of a specific type, their values can exceed that type's limits, resulting in a crash. This diff checks the type limits first and clips `min_`/`max_` if they are out of range.
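For illustration, a rough Python sketch of the clipping rule (the actual fix is in the C++ filler code; `clip_fill_range` is a hypothetical helper):
```python
import torch

def clip_fill_range(dtype, min_v, max_v):
    # min_/max_ arrive as doubles; clamp them to what `dtype` can represent.
    is_float = dtype in (torch.half, torch.float, torch.double)
    info = torch.finfo(dtype) if is_float else torch.iinfo(dtype)
    lo, hi = float(info.min), float(info.max)
    return max(min_v, lo), min(max_v, hi)

print(clip_fill_range(torch.uint8, -5.0, 300.0))  # (0.0, 255.0)
```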
Reviewed By: highker
Differential Revision: D9684455
fbshipit-source-id: 6da98a03c57f3296abaddc7c5cfc1c836c611eb0
Summary:
This will allow users to set customized timeout option for the store.
Tested by my own debug print to make sure that C++ actually used the timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11265
Differential Revision: D9666164
Pulled By: teng-li
fbshipit-source-id: 4eb6441783da106a3fd59b95457e503e83e4640f
Summary:
This lets you compile builtin functions from C++ without having a dependence on Python
```cpp
auto module = torch::jit::compile(R"JIT(
def my_script_method(x, y):
    return torch.relu(x) + y
)JIT");
IValue result = module->run_method("my_script_method", 1, 2);
```
goldsborough zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10847
Differential Revision: D9543461
Pulled By: driazati
fbshipit-source-id: 6160dae094030ca144a0df93cb9f26aa78c8cf27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11315
Rename the unit test file to make it consistent with the fb cpp style guideline: "The unittest for MyFoo.cpp should be named MyFooTest.cpp."
Reviewed By: yinghai
Differential Revision: D9671519
fbshipit-source-id: 44ed6794f6e479d190916db8064eee692e3ad876
Summary:
1. Add documentation to Linear and improve documentation for RNNs
2. Fix preprocessing in C++ docs by adding correct include path
3. Make myself and ebetica codeowner of docs/cpp to improve development speed
ebetica ezyang soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11313
Differential Revision: D9683615
Pulled By: goldsborough
fbshipit-source-id: 84ea32f9ea6b4060744aabbf5db368776a30f0b5
Summary:
Turns out that a net.type of '' is not acceptable to CreateNet, but an unset net.type is acceptable.
Fix that in this diff. Also, this is related to T33613083.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11286
Reviewed By: Maratyszcza, wat3rBro
Differential Revision: D9659920
Pulled By: harouwu
fbshipit-source-id: d68f24b754e18e1121f029656d885c48ab101946
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11291
In S163230, we've found that the CuDNN 7 upgrade causes an accuracy drop when training convolutional networks such as ResNeXt-101 (~0% accuracy) and video R(2+1)D (65 --> 63%).
Our current theory is that this accuracy loss is caused by the new "CUDNN_BATCHNORM_SPATIAL_PERSISTENT" mode in the spatialBN operator. In Caffe2, we've made this mode the default. According to the CuDNN manual (https://fburl.com/z996mr13), this mode may introduce some limitations on the input data range and cause overflow (which outputs NaN). NaN is probably not the case, because we're seeing a few percent of accuracy drop but not gradient explosion or failure. However, this "performance-optimized" code path may introduce accuracy loss (which is not caught by our unit test case because the input data range is [-0.5, 0.5]).
Reviewed By: kuttas, stephenyan1231
Differential Revision: D9601217
fbshipit-source-id: 73c2690c19cb1f02ea4e5e2200f50128df4f377b
Summary:
this is a fix that's needed for building extensions with a
pre-packaged pytorch. Consider the scenario where
(1) pytorch is compiled and packaged on machine A
(2) the package is downloaded and installed on machine B
(3) an extension is compiled on machine B, using the downloaded package
Before this patch, stage (1) would embed absolute paths to the system
installation of mkl into the generated Caffe2Config.cmake, leading to
failures in stage (3) if mkl was not at the same location on B as on
A. After this patch, only a reference to the wrapper library is
embedded, which is re-resolved on machine B.
We are already using a similar approach for cuda.
Testing: built a package on jenkins, downloaded locally and compiled an extension. Works with this patch, fails without.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11298
Differential Revision: D9683150
Pulled By: anderspapitto
fbshipit-source-id: 06a80c3cd2966860ce04f76143b358de15f94aa4
Summary:
Now that we're building everything together, making all distributed flags conditional of USE_DISTRIBUTED being set.
cc pietern cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11221
Reviewed By: Yangqing
Differential Revision: D9664267
Pulled By: orionr
fbshipit-source-id: a296cda5746ad150028c97160f8beacba955ff73
Summary:
Fixes #8560.
Unblocks #10715.
The assert (nDim <= uncompressedDims) was being triggered for a scalar
tensor because we compute nDim to be 1 for a scalar tensor but
uncompressedDim = 0.
This PR changes it so that we compute nDim to be 0 for a scalar tensor. This
works because indexing in a kernel depends on nDim. If nDim = 0, then
offset is always 0, which is what we want.
Some other (small) changes were necessary to make this work:
- One cannot define a 0-length array `IndexType arr[0]` so the code
guards against that
- Needed to change some of the maxTensorInfoSize logic to handle the
case when uncompressedDim == 0.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10952
Differential Revision: D9544607
Pulled By: zou3519
fbshipit-source-id: 2b873f47e2377125e1f94eb1b310a95cda51476c
Summary:
Distributed Data Parallel CPU module for c10d. This is basically the same code as Distributed Data Parallel CPU module for THD, since c10d now has the exact same front-end interface as torch.distributed.
We will keep both in the first release and remove the THD one once c10d is stable enough.
Test fully covered just as THD too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11168
Differential Revision: D9674963
Pulled By: teng-li
fbshipit-source-id: ecf52a7189374ca7930c2be305218167fdd822a7
Summary:
Linting `torch/csrc/` (non-recursive) and `torch/csrc/autograd` (non-recursive).
Fixed things like:
- `typedef` vs `using`
- Use `.empty()` instead of comparing with empty string/using `.size() == 0`
- Use range for loops instead of old style loops (`modernize-`)
- Remove some `virtual` + `override`
- Replace `stdint.h` with `cstdint`
- Replace `return Type(x, y)` with `return {x, y}`
- Use boolean values (`true`/`false`) instead of numbers (1/0)
- More ...
ezyang apaszke cpuhrsch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11050
Differential Revision: D9597505
Pulled By: goldsborough
fbshipit-source-id: cb0fb4793ade885a8dbf4b10484487b84c64c7f2
Summary: Closing the gap a bit on API, allowing users to go NetDef -> nomnigraph -> NetDef in python now
Reviewed By: duc0
Differential Revision: D9670495
fbshipit-source-id: 6497518ffc05a186deb0d657e06317980d39ddd5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11256
- in deleteNode method, remove optional deleteEdge flag as it's not used
- in deleteEdge method, remove optional removeRef flag as it's not used
- in replaceNode method, remove optional newHead_ parameter as it's not used - also simplifying the implementation by just calling replaceInEdges and replaceOutEdges
- remove importNode & importEdge as they're not in use
- add getEdgeIfExists that is like getEdge() but returns nullptr instead of throwing when the edge does not exist
- reduce verbosity in the basic graph unit test and add more test cases for ReplaceEdges
Differential Revision: D9650913
fbshipit-source-id: 6c18b37bef0d2abe1b57fb4fc47bfdbcee387694
Summary:
I'm setting up an automatic sync job for cppdocs and need two fixes to the cpp docs config:
1. Right now the cppdocs use the `torch` package to figure out the version. For C++ docs all I really need from the built package are the generated Tensor.h and Functions.h files. I can actually generate those directly via `aten/src/ATen/gen.py`, so I can skip building PyTorch altogether and save 10 minutes in the sync job! For this I need to avoid using the torch package in the docs.
2. Internal proxy issues prevent using the git link for sphinx_rtd_theme. We can just use the pip package for the cppdocs (not for the normal PyTorch docs)
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11300
Differential Revision: D9667193
Pulled By: goldsborough
fbshipit-source-id: 5567e0b3d3bdce03f5856babdb4ff76bcee91846
Summary:
This PR adds all PyTorch and Caffe2 job configs to CircleCI.
Steps for the CircleCI mini-trial:
- [ ] Make sure this PR passes Jenkins CI and fbcode internal tests
- [x] Approve this PR
- [ ] Ask CircleCI to turn up the number of build machines
- [ ] Land this PR so that the new `.circleci/config.yml` will take effect
Several Caffe2 tests are flaky on CircleCI machines and hence skipped when running on CircleCI. A proper fix for them will be worked on after a successful mini-trial.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11264
Differential Revision: D9656793
Pulled By: yf225
fbshipit-source-id: 7832e90018f3dff7651489c04a179d6742168fe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11254
Previously we used DeviceType from caffe2.proto directly, but it's an `enum` and has an implicit conversion to int, which provides no type safety; e.g. we have to explicitly check that a device type is valid in event.h:
```
template <int d>
struct EventCreateFunctionRegisterer {
explicit EventCreateFunctionRegisterer(EventCreateFunction f) {
static_assert(d < MaxDeviceTypes, "");
Event::event_creator_[d] = f;
}
};
```
at::DeviceType is an `enum class`, does not have an implicit conversion to int, and provides better type safety guarantees. In this diff we have done the following refactor (taking CPU as an example):
1. caffe2::DeviceType → caffe2::DeviceTypeProto
2. caffe2::CPU → caffe2::PROTO_CPU
3. caffe2::DeviceType = at::DeviceType
4. caffe2::CPU = at::DeviceType::CPU
codemod -d caffe2/caffe2 --extensions h,cc,cpp 'device_type\(\), ' 'device_type(), PROTO_'
+ some manual changes
In short, after this diff, in c++, caffe2::CPU refers to the at::DeviceType::CPU and the old proto caffe2::CPU will be caffe2::PROTO_CPU.
On the Python side, we have a temporary workaround that aliases `caffe2_pb2.CPU = caffe2_pb2.PROTO_CPU` to make the change easier to review; this will be removed later.
Reviewed By: ezyang
Differential Revision: D9545704
fbshipit-source-id: 461a28a4ca74e616d3ee183a607078a717fd38a7
Summary:
Persistent rnns provide much better performance on V100 with half input data for a variety of cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11248
Differential Revision: D9665687
Pulled By: ezyang
fbshipit-source-id: 2bd09a7eb1f5190aadb580977b0ba956e21a7dd5
Summary:
- In Python 2, use of `/` (regardless of int/float/Tensor) causes a compiler error if
`from __future__ import division` is not imported in the file.
- The / operator is universally set to do "true" division for integers
- Added a `prim::FloorDiv` operator because it is used in loop unrolling.
The error when users use '/' in Python 2 without importing from __future__
occurs while building the JIT AST.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11016
Differential Revision: D9613527
Pulled By: zou3519
fbshipit-source-id: 0cebf44d5b8c92e203167733692ad33c4ec9dac6
Summary:
The existing tests had every rank run send to every other rank and only
then switch to recv mode. This only works if the send operations are
non-blocking and the passed tensors are immediately copied to some kind
of send buffer. Instead, every send must be matched with a recv on the
other side, because from the API perspective they may block.
E.g. imagine a 1GB tensor being sent to every other rank. It can only go
through if there is a recv on the other side, or it will deadlock.
This change reflects this in the send/recv unit tests.
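For illustration, a minimal sketch (not the actual test code) of the matched pattern, assuming an initialized process group:
```python
import torch
import torch.distributed as dist

def exchange_with_all(rank, world_size):
    for other in range(world_size):
        if other == rank:
            continue
        t = torch.full((4,), float(rank))
        buf = torch.zeros(4)
        if rank < other:
            dist.send(t, dst=other)    # peer with the higher rank is already in recv
            dist.recv(buf, src=other)
        else:
            dist.recv(buf, src=other)  # recv first so the peer's send can complete
            dist.send(t, dst=other)
```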
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11275
Differential Revision: D9658197
Pulled By: pietern
fbshipit-source-id: fb6a3fc03b42343a9dfeed0def30d94914e76974
Summary:
Found these when compiling the new master with gcc 7.3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11257
Differential Revision: D9656612
Pulled By: SsnL
fbshipit-source-id: 7acb19e13204c010238dab7bc6973cc97b96f9a4
Summary:
This PR adds a .travis.yml check for our C++ documentation. The goal is to avoid any documentation/comments in our C++ code that would break the doxygen output and possibly ruin the C++ documentation site (currently https://pytorch.org/cppdocs).
For this, we:
1. Run doxygen and record any warnings,
2. Filter out some known bogus warnings,
3. Count the remaining warnings,
4. Fail the check if (3) is non-zero.
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11124
Differential Revision: D9651011
Pulled By: goldsborough
fbshipit-source-id: 30f776d23bb6d6c482c54db32828b4b99547e87b
Summary:
Allows multiplication of e.g. numpy.float32 with tensors.
This came up with #9468
If you want this and after the other patch is done, I'll add tests (but that would be conflicting, so I prefer to wait).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9659
Differential Revision: D8948078
Pulled By: weiyangfb
fbshipit-source-id: c7dcc57b63e2f100df837f70e1299395692f1a1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10874
Fixes the log message "WARNING:data_workers:Warning, data loading lagging behind: name=0" where instead of source name the size of a queue is reported
Reviewed By: panshen1, Novitial
Differential Revision: D9506606
fbshipit-source-id: 03717cfa9b991afb335ef877378afa3b52fd8f22
Summary:
`__repr__` currently fails for distributions with lazy attributes in PyTorch master, throwing a `KeyError`. This fixes the issue.
**Additionally:**
- Added `logits` to `arg_constraints` for distributions that accept either `probs` or `logits`. This is both to have `__repr__` display the `logits` param when available, and to be able to do validation checks (e.g. NaN checks) when the logit parametrization is used. fritzo, alicanb - I think there were reasons why we had not done so in the first place, but I am unable to recall now. It passes all the tests, but let me know if there is something that I am missing at the moment.
- There are certain distributions, e.g. `OneHotCategorical`, which won't show any parameters because they use a `categorical` instance under the hood, and neither `logits` nor `probs` from `arg_constraints` is present in the instance's `__dict__`. This isn't addressed in this PR.
cc. vishwakftw, fritzo, nadavbh12, apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11263
Differential Revision: D9654959
Pulled By: apaszke
fbshipit-source-id: 16f5b20243fe8e2c13e9c528050d4df0b8ea6e45
Summary:
This PR adds a hooks interface for registering types for complex
scalar types, and a sample implementation of the hook in
test_cpp_extensions.
The hook registration is patterned off of the existing CUDA hooks.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11216
Differential Revision: D9654840
Pulled By: ezyang
fbshipit-source-id: 7b97646280d584f8ed6e14ee10a4abcd04cf2987
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11127
It's invalid to capture `predicate` by reference since it's a local variable; capture it by value instead.
Differential Revision: D9600115
fbshipit-source-id: 92e0130d0a74908380b75ade5c3492df49e25941
Summary:
Also, make `torch.isclose` work with integral tensors and refactor `_check_trace` a bit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11246
Differential Revision: D9652701
Pulled By: apaszke
fbshipit-source-id: fb0bdbfd1952e45e153541e4d471b423a5659f25
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11123
This adds an operator that fills a tensor with uniform(min, max) values.
The implementation uses the fp32 generator and converts to fp16.
If performance becomes an issue we could resort to intrinsics.
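A rough Python sketch of the approach (the actual operator is a Caffe2 C++ op; this just illustrates sampling in fp32 and converting to fp16):
```python
import torch

def uniform_fill_fp16(shape, min_val, max_val):
    # Sample with the fp32 generator, then convert the result to fp16.
    x32 = torch.empty(shape, dtype=torch.float32).uniform_(min_val, max_val)
    return x32.half()

print(uniform_fill_fp16((2, 3), -1.0, 1.0).dtype)  # torch.float16
```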
Reviewed By: jspark1105, chocjy
Differential Revision: D9598142
fbshipit-source-id: 5aeab99acf7c3596fa6c33611d9d2c484f7c1145
Summary:
Keep net type info when generating the model's complete net. This preserves the performance optimization option.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11032
Reviewed By: wat3rBro
Differential Revision: D9564125
Pulled By: harouwu
fbshipit-source-id: c6546af9b1d4ff5eddf6124e24a5da1b8baf47df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11215
I found these by deleting the implicit conversion of Type to
TensorOptions and then fixing sites. This isn't a complete
refactor, because I ran out of steam after fixing this many
and decided to keep the implicit conversion. Still, why
waste a perfectly good refactor?
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9634750
fbshipit-source-id: 4d8fb778e13e6e24b888b1314a02709b2cb00b62
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11205
Our short term plan for supporting out of tree complex development requires an
external library to add a custom subclass of Type without access to the
code generation facilities in ATen. This commit reorganizes Type so
as to minimize the amount of boilerplate you have to write when making
a subclass of Type.
In particular, it:
- Creates new CPUTypeDefault/CUDATypeDefault classes, which you are
intended to inherit from, providing default CPU/CUDA implementations
that are layout/dtype agnostic.
- Adds new getCPUAllocator() and getCUDAAllocator() functions, as
a more public API to get your hands on Allocator
- Adds allocator() and getDeviceFromPtr(), abstracting the device
specific parts of storage() methods; these methods are now
implemented in base TypeDefault.
- Delete the static typeString() method, which is now dead.
- Move is_cuda/is_sparse/is_distributed to TypeDefault.
Reviewed By: SsnL
Differential Revision: D9631619
fbshipit-source-id: 40b600d99691230e36e03eb56434c351cbc2aa3a
Summary:
Just pulling this out of https://github.com/pytorch/pytorch/pull/10611
Make sure we can support environments where we don't have libprotobuf installed when we link-local protobuf.
cc goldsborough Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11161
Differential Revision: D9650282
Pulled By: orionr
fbshipit-source-id: 447b5e54cd2639973b4b10f58590d1c693a988d4
Summary:
Will use USE_DISTRIBUTED for both c10d and THD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11237
Differential Revision: D9647825
Pulled By: teng-li
fbshipit-source-id: 06e0ec9b5e2f8f38780fc88718f8499463e9e969
Summary:
This was lingering after #10731.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11240
Differential Revision: D9645437
Pulled By: pietern
fbshipit-source-id: d02c33354b094be3bb0872cf54a45721e20c4e7d
Summary:
This PR resolved the following compilation errors on devgpu:
/home/mingzhe0908/pytorch/build/lib/libcaffe2_gpud.so: undefined reference to `caffe2::CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_Tan()'
/home/mingzhe0908/pytorch/build/lib/libcaffe2_gpud.so: undefined reference to `caffe2::CAFFE2_PLEASE_ADD_OPERATOR_SCHEMA_FOR_MaxPool3D()'
....
The same error had been happening with the caffe2 debug-mode build before build_caffe2 was removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11233
Reviewed By: orionr
Differential Revision: D9645527
Pulled By: mingzhe09088
fbshipit-source-id: 68a45aa7fd815cac41b7fd64cfd9838b3226345a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11060
Adding synthetic data generation to the filler.h file (the exact distribution to be replaced later on).
Reviewed By: highker
Differential Revision: D9417594
fbshipit-source-id: 5d66dfbcb254a5961c36b7d3a081332c7372dac7
Summary:
there are multiple views of the tensor live.
Also adds recording for copy_ because this is the critical in-place
op where these views will cause LHS indexing to fail.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11129
Differential Revision: D9600195
Pulled By: zdevito
fbshipit-source-id: bfd8f5befa47377e36d704dbdb11023c608fe9a3
Summary:
TSIA. apaszke pointed out that it might be better to use third party folder in default, since system Eigen may often be out of date and does not have the version we need to compile successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11020
Differential Revision: D9562548
Pulled By: Yangqing
fbshipit-source-id: d8ab8a6ebe1f3d9eec638ef726cf5dc4dcf777b5
Summary:
Adding short-circuit evaluation for AND and OR. The second expression of an AND or OR gets lifted into an if branch, which is conditionally evaluated.
BatchOps was using the expression `dims = dims1 or dims2`, where dims is often an empty tensor. This now throws an error, because dims1 gets cast to a boolean, and you can't convert an empty tensor to a scalar. It now matches the behavior of PyTorch in Python.
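Roughly, the control flow after lifting looks like the plain-Python sketch below for `x = a or b`:
```python
def or_desugared(a, b):
    # After lifting, `a or b` becomes an if/else; in the JIT the whole second
    # expression is moved into the else branch (here `b` is just a plain value).
    if bool(a):   # raises for an empty tensor, matching eager PyTorch
        return a
    return b
```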
One thing that came up is if the second expression in an and/or in python gets returned, it does not get coerced to a boolean.
`tensor == (False or tensor)`
`tensor == (True and tensor)`
We do not currently support this.
edit: wording
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11116
Differential Revision: D9618168
Pulled By: eellison
fbshipit-source-id: 93b202be2f222d41f85d38d9c95f04d1749e8343
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11189
Replaces it with an operator TensorOptions() method on
Type, reestablishing the implicit conversion. I originally
wanted to get rid of the implicit conversion entirely, but
there were a *lot* of use-sites, so I added it back to avoid
a huge codemod. In this patch, I only had to fix sites that
used the optional device_index API.
Reviewed By: cpuhrsch
Differential Revision: D9628281
fbshipit-source-id: 5fe2a68eefb77a3c9bb446f03a94ad723ef90210
Summary:
Example:
```sh
python run_test.py -i sparse -- TestSparse.test_factory_size_check -f
```
With this, the `--verbose` option is redundant (one can call `python run_test.py -- -v` instead of `python run_test.py -v`). But since this is (probably) a frequently used flag, I didn't remove the existing easier-to-use option.
cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11209
Differential Revision: D9632215
Pulled By: SsnL
fbshipit-source-id: ff522802da11ef0a0714578be46e4a44f6343d44
Summary:
We don't generate a corresponding Type implementations for them,
so this doesn't do anything at the moment.
We don't plan on supporting complex32 in the near future, but
it is added to reserve the name and number in case we do at
some point in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11173
Reviewed By: SsnL
Differential Revision: D9627477
Pulled By: ezyang
fbshipit-source-id: f49a44ab1c92d8a33130c249ac7b234f210a65e6
Summary:
In the state dict loading code, the error message referring to the shape of the loaded parameters and the shape of the parameters in the initialised model had the formatting in the wrong order. Swapped them round to fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11200
Differential Revision: D9631160
Pulled By: SsnL
fbshipit-source-id: 03d9446303bd417fef67027b10d7a27de06486be
Summary:
Disables two of the unit tests in test_cuda that got introduced after test_cuda was enabled that fail on ROCm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11191
Differential Revision: D9628702
Pulled By: ezyang
fbshipit-source-id: 4c298c728f42bb43d39b57967aa3e44385980265
Summary:
This is necessary to allow us to use the complex header
which defines real (and is very sad if real is macro'ed).
We should also fix accreal, ureal, Real and REAL, but
only 'real' is the real blocker.
```
codemod -d aten/src/TH --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THC --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THNN --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
codemod -d aten/src/THCUNN --extensions c,cc,cpp,cu,cuh,h,TARGETS,py,hpp '\breal\b' scalar_t
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11163
Reviewed By: SsnL
Differential Revision: D9619906
Pulled By: ezyang
fbshipit-source-id: 922cb3a763c0bffecbd81200c1cefc6b8ea70942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11013
Previously, the parent class Type also contained a large number
of implementations, for things like broadcasting and native
functions that didn't need dispatch. We'd like to be able
to reference this interface from Tensor even when none of these
implementations are available.
To do this, we convert Type into a truly pure virtual interface,
and move all of the implementations to TypeDefault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11181
Differential Revision: D9561478
Pulled By: ezyang
fbshipit-source-id: 13c49d80bc547551adf524b1cf1d691bfe311133
Summary:
* improve docker packages (install OpenBLAS to have at-compile-time LAPACK functionality w/ optimizations for both Intel and AMD CPUs)
* integrate rocFFT (i.e., enable Fourier functionality)
* fix bugs in ROCm caused by wrong warp size
* enable more test sets, skip the tests that don't work on ROCm yet
* don't disable asserts any longer in hipification
* small improvements
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10893
Differential Revision: D9615053
Pulled By: ezyang
fbshipit-source-id: 864b4d27bf089421f7dfd8065e5017f9ea2f7b3b
Summary:
This places all constants in the entry block of the graph, and de-duplicates them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10231
Differential Revision: D9601501
Pulled By: resistor
fbshipit-source-id: daa10ed8c99e9894830d6f3e5d65c8d3ab5ea899
Summary:
Previously when we had a slicing expression like `x[0:5, 0]`, where the sliced tensor was of size `5` in dimension 0, we would skip dispatching the actual slice call as an optimization.
This caused incorrect behavior under tracing, as we would not record the slice op and thus if we encountered an input with a different shape while running the trace, we would get incorrect results.
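For illustration, a small hypothetical repro of the traced-slice issue (not taken from the PR's tests):
```python
import torch

def f(x):
    return x[0:5, 0]

# Traced with an input whose dim 0 is exactly 5, so the slice used to be elided.
traced = torch.jit.trace(f, torch.randn(5, 3))

# With the slice recorded, a larger input is sliced correctly instead of
# silently returning all 8 rows.
print(traced(torch.randn(8, 3)).shape)  # torch.Size([5])
```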
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11156
Differential Revision: D9622252
Pulled By: jamesr66a
fbshipit-source-id: 822f2e8f01504e131f53bd9ef51c171c7913a7cc
Summary:
This makes it so `detach` and `detach_` are traceable and also adds a pass to erase them before ONNX export
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11038
Differential Revision: D9588038
Pulled By: jamesr66a
fbshipit-source-id: 263dd3147e24fcb0c716743f37fdb9f84c0015e7
Summary:
Will need to be accessible by caffe2
This also removes a bunch of unnecessary includes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11154
Reviewed By: ezyang
Differential Revision: D9618681
Pulled By: cpuhrsch
fbshipit-source-id: 838a87b75d9c3959e145fd5fca13b63bc5de7bd3
Summary:
```
In file included from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/cast.h:13:0,
from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/attr.h:13,
from third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/pybind11.h:43,
from caffe2/torch/csrc/utils/pybind.h:6,
from caffe2/torch/csrc/jit/pybind.h:5,
from caffe2/torch/csrc/jit/script/init.h:3,
from caffe2/torch/csrc/jit/script/init.cpp:1:
third-party-buck/gcc-5-glibc-2.23/build/pybind11/889256a/include/pybind11/pytypes.h:118:19: note: declared here
In file included from caffe2/torch/csrc/jit/pybind.h:12:0,
from caffe2/torch/csrc/jit/python_ir.cpp:4:
caffe2/torch/csrc/jit/pybind_utils.h: In function 'torch::jit::IValue torch::jit::argumentToIValue(const torch::jit::FunctionSchema&, size_t, pybind11::handle)':
caffe2/torch/csrc/jit/pybind_utils.h:138:226: warning: 'pybind11::str pybind11::detail::object_api<Derived>::str() const [with Derived = pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>]' is deprecated: Use py::str(obj) instead [-Wdeprecated-declarations]
```
apaszke zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11107
Differential Revision: D9598040
Pulled By: goldsborough
fbshipit-source-id: 4a055353ac08d54a2bbca49573ff099310de3666
Summary:
ATen's doc/ folder is manually maintained and can thus cause confusion with the generated file. We now have proper online documentation for ATen, which is superior to ATen doc/. Let's delete ATen/doc.
ezyang apaszke soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11158
Differential Revision: D9618782
Pulled By: goldsborough
fbshipit-source-id: 0ef14f84947601a0589aa4a41e5c8619783426fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11035
Instead, inline its definition into Tensor. We need
to do this so we can avoid needing to getType() from
TensorImpl.
Reviewed By: cpuhrsch
Differential Revision: D9564516
fbshipit-source-id: 19fdaa2b93419e21572b9916714aee4165cb3390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11031
The eventual plan is to get rid of TensorImpl::type()
entirely; but first we need a function to call.
Reviewed By: cpuhrsch
Differential Revision: D9564206
fbshipit-source-id: b59a9ccfaed44199f185eff392835cec89ccda8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11023
I'd like TensorOptions to not know anything about Context, so I can
move it to ATen/core without pulling in Context. To do this, the
type() method has to go, since it consults the context to get a Type.
Reviewed By: cpuhrsch
Differential Revision: D9562467
fbshipit-source-id: 61a18a76eb042a5e70b64b963501e9d68c25d4f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11144
We can move them now that TensorMethods no longer references Tensor.
Reviewed By: cpuhrsch
Differential Revision: D9613800
fbshipit-source-id: 99ad1dd7d77eb319000769230b7016294cf1980f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11027
Using swap() as a primitive, copy and move assignment become much easier.
Reviewed By: ezyang
Differential Revision: D9563753
fbshipit-source-id: e74faf39b596f097de758bfe038639565807040a
Summary:
This completely removes BUILD_CAFFE2 from CMake. There is still a little bit of "full build" stuff in setup.py that enables USE_CUDNN and BUILD_PYTHON, but otherwise everything should be enabled for PyTorch as well as Caffe2. This gets us a lot closer to full unification.
cc mingzhe09088, pjh5, ezyang, smessmer, Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8338
Reviewed By: mingzhe09088
Differential Revision: D9600513
Pulled By: orionr
fbshipit-source-id: 9f6ca49df35b920d3439dcec56e7b26ad4768b7d
Summary:
Added MPI group support. This makes all previous MPI group test cases pass.
Also, different PGs' MPI ops are serialized to relax the required MPI thread level support; this is required.
The build is fixed too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11128
Differential Revision: D9602188
Pulled By: teng-li
fbshipit-source-id: 1d618925ae5fb7b47259b23051cc181535aa7497
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11101
I'd like to invert the dependency between Tensor and TensorOptions
(such that Tensor includes TensorOptions); to do this, I'd prefer
there to not be a Tensor constructor. Eventually, all references
of Tensor will disappear from TensorOptions.h
Reviewed By: cpuhrsch
Differential Revision: D9585627
fbshipit-source-id: dd4a28b2c06b1e55f629762915f03c2b6c34d840
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11122
these changes add fixes to device.cc that are appropriate to create the intra-device-copies for opencl
Reviewed By: bwasti
Differential Revision: D9553292
fbshipit-source-id: e59f17916b5df30a504adee0718f9cecfe28f35a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11021
We can now store a boolean saying if we want a Variable or not,
and context can use VariableHooks to get a VariableType if we
request one.
Reviewed By: cpuhrsch
Differential Revision: D9562312
fbshipit-source-id: 84653cd789622764132252406a5ea1a83eee3360
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11096
To discourage willy-nilly use, and make it clearer that it
is not a Variable
Reviewed By: cpuhrsch
Differential Revision: D9583699
fbshipit-source-id: 4fbde0c01ae3deb2c7ef8c125a9028f089b203ae
Summary:
Generate serialized test inputs/outputs/backward graphs of tests inside `caffe2/python/operator_test` that call assertSerializedOperatorCheck(). Tests should be decorated with serialized_test.collect_tests.given_and_seeded to run hypothesis tests that are actually random plus a single fixed-seed hypothesis test.
To use:
1. Refactor your test to be a SerializedTestCase
1a. Decorate it with given_and_seeded
1b. Call testWithArgs in main
2. Run your test with -g to generate the output. Check it in.
3. Subsequent runs of the test without generating the output will check against the checked in test case.
Details:
Run your test with `python caffe2/python/operator_test/[your_test].py -g`
Outputs are in `caffe2/python/serialized_test/data`. The operator test outputs are in a further subdirectory `operator_test`, to allow for other tests in the future (model zoo tests?)
Currently, we've only refactored weighted_sum_test to use this, but in the next diff, we'll refactor as many as possible. The directory structure may also change as usually there are multiple tests in a single file, so we may create more structure to account for that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10594
Reviewed By: ezyang
Differential Revision: D9370359
Pulled By: ajyu
fbshipit-source-id: 2ce77389cd8bcc0255d3bccd61569833e545ede8
Summary:
**Review last commit only.** Stacked on top of #10949.
This commit fixes a number of issues connected to caching
differentiability status of graphs inside graph executors,
and changes the rules for optimization of differentiable subgraphs.
Previously every one of those was instantiated as a separate graph
executor, but now they are simply heavier-optimized graph regions,
and graph executors are only instantiated for their backward.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10977
Differential Revision: D9600626
Pulled By: apaszke
fbshipit-source-id: dad09a0f586e396afbd5406319c1cd54fbb8a3d3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11095
We used getType to mean a lot of things.
- getVariableTypeFromBaseType: given a base Type (non-Variable type)
compute the Variable Type which corresponds to it.
- getVariableType: like at::getType, but returns the Variable type
rather than the plain type.
This rename makes it clearer at the use-site what things are what,
and will make a subsequent rename of at::getType easier.
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9583630
fbshipit-source-id: 2667ec98e7607bc466920c7415a8c651fd56dfca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11080
- Add a new TensorOptions(Device, ScalarType) constructor,
which serves roughly the same role as getType used to.
We shouldn't get too wild with these constructors, but
since this particular one was widely used by getType,
it seems worth adding.
- Change DLPack DeviceType conversion to at::DeviceType,
rather than at::Backend. While I'm at, add a few more
conversions that at::DeviceType understands.
- Add a new overload of from_blob which understands strides.
Reviewed By: gchanan, cpuhrsch
Differential Revision: D9578734
fbshipit-source-id: 28288ec053aae8765e23925ab91023398d632d6b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11077
getType now supports retrieving variable types, so make it clearer
when a getType function does NOT give you a variable type.
```
codemod -d . --extensions cc,cpp,cu,cuh,h getTypeOpt getNonVariableTypeOpt
```
Reviewed By: gchanan
Differential Revision: D9578398
fbshipit-source-id: 3ee502ac5c714849917f11ddc71de8eacfdaa9d3
Summary:
Operators like aten::chunk used to return a number of tensors, but
now return a list. To make it easier to do shape prop through
aten::chunk and fuse it, I've also introduced prim::ConstantChunk,
which behaves like the previous implementation (has a variable length
output list).
The downside of this PR is that the introduction of more lists to the IR causes the LSTM and MiLSTM graphs to be considered as non-differentiable by the graph executor. I verified that they are still optimized correctly, and my next patch (which changes how specialization/differentiation works) will restore those.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10949
Reviewed By: zdevito
Differential Revision: D9556823
Pulled By: apaszke
fbshipit-source-id: 33e63b17fc7247cac6cfc05eb7eb9bf069b499ee
Summary:
Update to the latest observer usage syntax and add an example of HistogramObservers.
Reviewed By: jspark1105
Differential Revision: D6878439
fbshipit-source-id: c9521f2daecfc7f0c17de6a944dce58e568e3dbe
Summary:
How did we get so many uses of `NULL` again?
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11047
Differential Revision: D9566799
Pulled By: goldsborough
fbshipit-source-id: 83469f352ac69aa65bdaf1a1a21f922d892e0db3
Summary:
I've been seeing a lot of warnings about multiple declarations of this. Hopefully this fixes it.
cc Yangqing mingzhe09088 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11045
Reviewed By: mingzhe09088
Differential Revision: D9582756
Pulled By: orionr
fbshipit-source-id: 6171485609a2f2f357d6e1c44e26b4ecfcdb4ce6
Summary:
This is needed because the JIT declares some custom autograd functions.
colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11082
Differential Revision: D9580456
Pulled By: apaszke
fbshipit-source-id: 6bf00c1188a20b2ee6ecf60e5a0099f8263ad55a
Summary:
This was done because it is surprising for a decorator to run a function
rather than wrap it, and it did not simplify the syntax for tracing modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11069
Reviewed By: jamesr66a
Differential Revision: D9583192
Pulled By: zdevito
fbshipit-source-id: b914b7ab4c73c255086465a6576eef3a22de1e13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11026
ezyang fixed a bug with moving or copying an intrusive_ptr into itself.
This diff adds test cases for it.
Reviewed By: ezyang
Differential Revision: D9563464
fbshipit-source-id: 3a3b3f681124730d2500b276c0135c3bba7875ae
Summary:
This PR creates a stream pool per issue #9646. When a new stream is requested, the device it's requested on lazily creates two pools, one low priority and one high priority, of 32 streams each. Streams are returned from these pools round-robin: stream 0 is returned, then stream 1, ... then stream 31, then stream 0 again (see the sketch after the change notes). This PR also takes the opportunity to clean up the stream API, reducing its complexity and verbosity.
Change notes:
- There are now 3 sets of streams per device, the default stream, the low priority streams, and the high priority streams. These streams live in lazily initialized pools and are destroyed on shutdown.
- All stream refcounting has been removed (the pools pattern replaces it).
- Setting a stream now sets it on its device. Streams are associated with a device and the previous
requirement to specify that device was unnecessary.
- There is no exposure for setting the flags on a stream. This may also seem like a regression but the flag was always set to cudaStreamNonBlocking.
- Streams are now low or high priority whereas previously the priority could be set with an integer. In practice, however, the range for priorities is -1 to 0 on the latest hardware. -1 is high priority, 0 is low priority (aka default priority). Low vs. high actually clarifies this behavior if people were trying finer separations. (E.g., if someone tried streams with priorities 0, 1, and 2, they would actually all have priority 0, historically, and the intended behavior would not be respected.)
- Unused THCStream and THCState stream-related functions were removed.
- A new test of pooling behavior was added in stream_test.
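A simplified Python sketch (not the real CUDA implementation) of the per-device round-robin pool idea:
```python
import itertools

POOL_SIZE = 32

class StreamPool:
    """Hands out a fixed set of pre-created streams in round-robin order."""

    def __init__(self, device):
        # Stand-ins for lazily created cudaStream_t handles on `device`.
        self.streams = ["stream(%s, %d)" % (device, i) for i in range(POOL_SIZE)]
        self._next = itertools.cycle(range(POOL_SIZE))

    def get(self):
        # Returns stream 0, 1, ..., 31, then wraps back to 0.
        return self.streams[next(self._next)]

pool = StreamPool("cuda:0")
print(pool.get(), pool.get())  # stream(cuda:0, 0) stream(cuda:0, 1)
```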
fyi: colesbury, apaszke, goldsborough
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9938
Reviewed By: SsnL
Differential Revision: D9569036
Pulled By: ezyang
fbshipit-source-id: 12ed673fe373170d0cf4d65cb570de016c53ee7d
Summary:
The warnings are erroneous as far as I can see, so tweak things to avoid them.
The (unsigned int) cast is to avoid passing -1 to a size_t parameter.
This was triggered only in gcc8's LTO build, giving:
caffe2/aten/src/TH/generic/THTensor.cpp: In function ‘THFloatTensor_squeeze1d’:
lto1: error: ‘__builtin_memset’ specified size 18446744073709551608
exceeds maximum object size 9223372036854775807 [-Werror=stringop-overflow=]
In function ‘newImpl’,
inlined from ‘operator new’ at common/memory/OperatorOverride.cpp:86:23,
inlined from ‘allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/ext/new_allocator.h:111:0,
inlined from ‘allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/alloc_traits.h:436:0,
inlined from ‘_M_allocate’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/stl_vector.h:172:0,
inlined from ‘_M_default_append’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/vector.tcc:571:0,
inlined from ‘resize’ at third-party-buck/platform007/build/libgcc/include/c++/7.3.0/bits/stl_vector.h:671:0,
inlined from ‘THTensor_resizeDim’ at caffe2/aten/src/TH/THTensor.hpp:123:0,
inlined from ‘THFloatTensor_squeeze1d.part.198’ at caffe2/aten/src/TH/generic/THTensor.cpp:429:0,
inlined from ‘THFloatTensor_squeeze1d’:
common/memory/OperatorOverride.cpp:86:23: error:
argument 1 value ‘18446744073709551608’ exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
void* ptr = malloc(size);
Reviewed By: soumith
Differential Revision: D9568621
fbshipit-source-id: 4569a4be897d669caa3f283f4b84ec829e8d77ad
Summary:
Also add single grad whitelist to the jit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10782
Reviewed By: ezyang
Differential Revision: D9583378
Pulled By: erikbrinkman
fbshipit-source-id: 069e5ae68ea7f3524dec39cf1d5fe9cd53941944
Summary:
I've had `torch/lib/python3.6` show up as part of the build for some time now. It's not ignored which means I need to be extra careful about checking in files, or I end up with a thousand of them in my index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11083
Differential Revision: D9580453
Pulled By: apaszke
fbshipit-source-id: 369e4fe87962696532d111b24f2a4a99b9572bf2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10826
Add strides, and make sure the strides are consistent with sizes, and is_contiguous, for all the Caffe2 functions.
is_contiguous means strides_[dim-1] = 1 and strides_[i] = strides_[i+1] * max(size_[i+1], 1);
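A small Python sketch of that contiguity rule (illustration only; the actual change is in the Caffe2 C++ tensor code):
```python
def contiguous_strides(sizes):
    # strides[dim-1] = 1 and strides[i] = strides[i+1] * max(sizes[i+1], 1)
    strides = [1] * len(sizes)
    for i in range(len(sizes) - 2, -1, -1):
        strides[i] = strides[i + 1] * max(sizes[i + 1], 1)
    return strides

print(contiguous_strides([2, 3, 4]))  # [12, 4, 1]
print(contiguous_strides([2, 0, 4]))  # [4, 4, 1]; zero-sized dims treated as size 1
```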
Reviewed By: ezyang
Differential Revision: D9354480
fbshipit-source-id: 3643871b70f1111b7ffdd9fdd9fe9bec82635963
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10976
The app can run in XCode with the benchmark metrics collected.
It can also run when building with buck
Reviewed By: llyfacebook
Differential Revision: D9546755
fbshipit-source-id: 60ad0112946f8cf57138417f6838a58ed6d2c90f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11046
As suggested by jerryzh168, temporary fix for a new constraint that was added D9350686 is to remove this assert. Long term jerryzh168 is going to work out a better way of handling this.
Reviewed By: jerryzh168
Differential Revision: D9566323
fbshipit-source-id: e4630c7cbe0cc68a084974ea7048654811fae01f
Summary:
Currently our `skipIfLapack` uses a try-catch block and regex-matches the error message, which is highly unreliable. This PR adds `hasLAPACK` and `hasMAGMA` on the ATen context, and exposes the flags to Python.
Also fixes a refcounting bug with `PyModule_AddObject`. The method steals a reference, but we didn't `Py_INCREF` in some places before calling it with `Py_True` or `Py_False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11024
Differential Revision: D9564898
Pulled By: SsnL
fbshipit-source-id: f46862ec3558d7e0058ef48991cd9c720cb317e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11048
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for this is that a ZMQ socket can't send an empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOuput is at user layer. The hack shouldn't be exposed to user layer, polluting user workspaces.
Instead, we should move the creating of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent and no side-effect to users.
Reviewed By: mraway
Differential Revision: D9566744
fbshipit-source-id: 18292dd64a6d48192c34034200a7c9811d2172af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11052
Delete the test case that constructs a Predictor from a MetaNetDef, since that constructor
has been deprecated. The broken PR is for constructing a predictor from a DB instance.
Reviewed By: highker
Differential Revision: D9566935
fbshipit-source-id: 5511883953a2d3f6eb0a4f1c5518a1bc4b3ffbdc
Summary:
Not subclassed except by Tensor. Also required to align further with
caffe2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11036
Reviewed By: ezyang
Differential Revision: D9565640
Pulled By: cpuhrsch
fbshipit-source-id: ff7203a2c95d3f3956282b4f2d8dda6c2b93f4a6
Summary:
Things like torch.zeros now appear in traces rather than constants.
To continue to support our current level of ONNX export, we run
constant prop to turn these back into constants where possible before
export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10935
Differential Revision: D9527427
Pulled By: zdevito
fbshipit-source-id: 552a8bcc01b911251dab7d7026faafdd7a3c758a
Summary:
Initial version of `unique` supporting a `dim` argument.
As discussed in [this issue](https://github.com/pytorch/pytorch/issues/9997) I added the `dim` argument to `torch.unique` with the same behavior like [numpy](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.unique.html).
Since the implementation is based on `std/thrust::unique`, the `tensor` always needs to be sorted, so the `sorted` argument in `torch.unique` has no effect, just as in the CUDA version of the plain `torch.unique`.
To check the performance and equal behavior between `torch.unique` and `np.unique`, I've used [this gist](https://gist.github.com/ptrblck/ac0dc862f4e1766f0e1036c252cdb105).
Currently we achieve the following timings for an input of `x = torch.randint(2, (1000, 1000))`:
(The values are calculated by taking the average of the times for both dimension)
| Device | PyTorch (return_inverse=False) | Numpy (return_inverse=False) | PyTorch (return_inverse=True) | Numpy (return_inverse=True) |
| --- | --- | --- | --- | --- |
| CPU | ~0.007331s | ~0.022452s | ~0.011139s | ~0.044800s |
| GPU | ~0.006154s | - | ~0.105373s | - |
Many thanks to colesbury for the awesome mentoring and the valuable advices on the general implementation and performance issues!
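For reference, a short usage sketch of the new `dim` argument:
```python
import torch

x = torch.tensor([[1, 2, 2],
                  [1, 2, 2],
                  [3, 4, 4]])

rows, inverse = torch.unique(x, dim=0, return_inverse=True)
# rows:    tensor([[1, 2, 2],
#                  [3, 4, 4]])
# inverse: tensor([0, 0, 1])
```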
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10423
Differential Revision: D9517289
Pulled By: soumith
fbshipit-source-id: a4754f805223589c2847c98b8e4e39d8c3ddb7b5
Summary: When conversion fails, dump more information to help fix up the netdef
Reviewed By: hyuen, yinghai
Differential Revision: D9558667
fbshipit-source-id: 8917cc61c9be6285697e4f8395a9dbc7135f618e
Summary:
1. Support ops needed for inference of Faster-RCNN/Mask-RCNN needed in Detectron, mostly direct fallbacks.
2. Use CPU device to hold 0-dim tensors and integer tensors in both fallback op and blob feeder, needed by Detectron models.
3. Ignore 0-dim tensor in MKL-DNN concat operator.
4. Generate dynamic library of Detectron module for CPU device.
This PR obsoletes #9164.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10157
Differential Revision: D9276837
Pulled By: yinghai
fbshipit-source-id: dc364932ae4a2e7fcefdee70b5fce3c0cee91b6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10955
Add GPU version of HardSigmoid Op to Caffe2. Updated test file to
include GPU tests.
Reviewed By: enosair
Differential Revision: D9499353
fbshipit-source-id: fcb51902063d0c3e4b10354533a8a42cf827c545
Summary:
This probably fixes the logging test error that orionr is encountering - haven't tested locally but wanted to send out a PR to kick off CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10983
Reviewed By: ezyang
Differential Revision: D9552607
Pulled By: Yangqing
fbshipit-source-id: 9ac019031ffd9c03972144df04a836e5dcdafe02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10920
Update the black box predictor and the related code to use the
constructor with PredictorConfig.
Reviewed By: highker
Differential Revision: D9516972
fbshipit-source-id: fbd7ece934d527e17dc6bcc740b4e67e778afa1d
Summary:
The PR includes:
(1) torch.distributed.c10d, which now includes the complete backward compatible frontend API for `torch.distributed`
(2) `env://` init method functionality
(3) Minor change to `test_distributed.py`, which is now a test for `torch.distributed.c10d`.
(4) The old `test_distributed.py` is now moved to `test_distributed_thd`
(5) Miscellaneous bug fixes.
(6) DDP CPU test is removed since c10d doesn't have this support yet, but this is a very easy test after moving DDP CPU's dependency to torch.distributed.c10d.
(7) CI config to test MPI, NCCL, and Gloo backend of c10d
**Now all the distributed test including c10d DDP can pass with the c10d frontend API**
TODO: (in a separate PR)
MPI subgroup support, once this is added, CI group test will be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10871
Differential Revision: D9554514
Pulled By: teng-li
fbshipit-source-id: fb686ad42258526c8b4372148e82969fac4f42dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10987
Some code style updates to make it consistent with fb cpp style.
Reviewed By: yinghai
Differential Revision: D9550130
fbshipit-source-id: 6aef9878676c08e7d384383c95e7ba8c5c9a1bce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11003
Need an interface to re-write the graph after the net is built and after adding gradient ops.
Reviewed By: aazzolini, harouwu
Differential Revision: D9557827
fbshipit-source-id: 2e082f0321c0776e488a29e18047d950948e7c37
Summary:
The goal here is to separate out the base Type into core; as it was done previously we need all derived Types to be defined when we compile the base Type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10947
Reviewed By: gchanan
Differential Revision: D9540025
Pulled By: ezyang
fbshipit-source-id: 49f0b5acb3c378348ef3a55780abb73e4ae27edd
Summary:
Fixes #10851.
Speeds up profiling results dramatically.
For the following script:
```
import torch
import time

ITER = 2000

x = torch.randn(1, 1, requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    y = x
    for i in range(ITER):
        y = 3 * y - 2 * y
    y.backward()

start = time.time()
print("Done running. Preparing prof")
x = str(prof)
print("Done preparing prof results")
end = time.time()
print("Elapsed: {}".format(end - start))
```
I get 7s before / 0.13s after these changes.
cc apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10969
Differential Revision: D9556129
Pulled By: zou3519
fbshipit-source-id: 26b421686f8a42cdaace6382567d403e6385dc12
Summary:
Breaking this out of https://github.com/pytorch/pytorch/pull/8338
mingzhe09088's fix of the docstrings for Windows builds. Unfortunately some versions of Windows seem to try to parse the `#` inside the string as a preprocessor directive. We might need to change this to something else later, but want to get this landed first.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10998
Reviewed By: mingzhe09088
Differential Revision: D9557480
Pulled By: orionr
fbshipit-source-id: c6a6237c27b7cf35c81133fd9faefead675a9f59
Summary:
Breaking out of https://github.com/pytorch/pytorch/pull/8338
This test fails once we start building with `-DUSE_GLOG=OFF` since the non-glog logging case doesn't support flushing or streaming to the right location. For now, we just disable this test in that case.
cc Yangqing mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10999
Reviewed By: mingzhe09088
Differential Revision: D9557488
Pulled By: orionr
fbshipit-source-id: 8b306f210411dfc8ccc404bdccf77ddcd36a4830
Summary:
PyTorch exporting tests and end-to-end cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10924
Reviewed By: Ac2zoom
Differential Revision: D9548210
Pulled By: houseroad
fbshipit-source-id: 2381d1ad92a4e07f97060eb65c9fd09f60ad3de6
Summary:
This is part of splitting Context from what needs to go in ATen/core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10951
Differential Revision: D9540369
Pulled By: gchanan
fbshipit-source-id: 73b0e8c4493785fbab368a989f46137c51f6ea0b
Summary:
Fix #10345, which only happens in the CUDA case.
* Instead of returning some random buffer, we fill it with zeros.
* update torch.symeig doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10645
Reviewed By: soumith
Differential Revision: D9395762
Pulled By: ailzhang
fbshipit-source-id: 0f3ed9bb6a919a9c1a4b8eb45188f65a68bfa9ba
Summary:
This fixes multiple bugs in the handling of negative indices in both slicing and gather operations. These were uncovered by Elias Ellison's diff D9493614, which made it so that we actually emit negative indices when we see them in PyTorch code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10973
Reviewed By: jhcross
Differential Revision: D9546183
Pulled By: jamesr66a
fbshipit-source-id: 6cb0e84e8ad399e47e24a96c44025f644c17b375
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10739
I wanted to assert that the blobs in the workspace of the new session after loading checkpoint are exactly the same as the blobs in the workspace of the old session before saving to a checkpoint.
But I found that when calling `task.get_step()`, a dummy task output blob, `task:output/ConstIntFill:0`, is added. Also a dummy net `task:output` was also added along with it. See https://fburl.com/937lf2yk
This makes it hard to assert "Equal", forcing me to assert "LessThan" or "GreaterThan".
Adding a dummy TaskOutput when the user specifies no TaskOutput is a hack.
The reason for this is that the ZMQ socket can't send an empty blob list.
As a result, if the Task on the Worker had no output,
the master would never stop waiting and hang forever. See https://fburl.com/rd7fhy6p and imagine `socket.recv(net, 0)`.
TaskOutput is at the user layer. The hack shouldn't be exposed to the user layer, polluting user workspaces.
Instead, we should move the creation of the dummy blob to some deeper layer,
and remove the dummy blob in the workspace afterwards to avoid polluting user workspaces.
After this change, the workaround becomes totally transparent, with no side effects for users.
Reviewed By: mraway
Differential Revision: D9413150
fbshipit-source-id: 51aaf3201e26570b4fcf5738e9b9aa17c58777ac
Summary:
TODO: integrate into torch.onnx.export -- separate PR
*Problem:* We have a facility to trace PyTorch operations on Python code, but there are several failure modes where the trace is not representative of the actual underlying computation:
* The tracer encountered dynamic control flow
* Some computation escaped the tracer, and appeared as a Constant tensor node in the graph
* Some stateful function was traced, e.g. someone did an optimization in Python by memoizing function outputs
*Objective*: In an ideal world, this whole process would be automated and the user can trust that the system will magically capture the intended semantics from the program. Realistically speaking, we will likely have to settle with a human-in-the-loop error reporting system, allowing for the user to identify problems and modify the source code to allow for tracing.
*Stage 1* (this PR): Output-level checking & graph diff. torch.jit.trace gains a kwarg 'check_inputs', which is a list of tuples of input arguments. We will iterate through the list and trace the function again for each set of check inputs. We'll also interpret the original trace with these inputs and compare output values and graphs, printing a diff of the graph if there is a difference.
Examples:
```
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(4, 5),)])
def foo(x):
    y = torch.arange(0, x.shape[0]).float()
    return x + y.unsqueeze(1)
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
- %1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
? ^
+ %1 : Dynamic = prim::Constant[value= 0 1 2 3 [ CPULongType{4} ]]()
? +++ ^
%2 : int = prim::Constant[value=0]()
%3 : Dynamic = aten::_cast_Float(%1, %2)
%4 : int = prim::Constant[value=1]()
%5 : Dynamic = aten::unsqueeze(%3, %4)
%6 : int = prim::Constant[value=1]()
%7 : Dynamic = aten::add(%0, %5, %6)
return (%7);
}
Node diff:
- %1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
? ^
+ %1 : Dynamic = prim::Constant[value= 0 1 2 3 [ CPULongType{4} ]]()
? +++ ^
Trace source location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Check source location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(281): check_trace
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(408): wrapper
dank.py(3): <module>
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
Node:
%1 : Dynamic = prim::Constant[value= 0 1 2 [ CPULongType{3} ]]()
Source Location:
dank.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Comparison exception:
Not equal to tolerance rtol=1e-07, atol=0
(shapes (3,), (4,) mismatch)
x: array([0, 1, 2])
y: array([0, 1, 2, 3])
```
==
```
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])
def foo(x):
    y = x.data
    return x + y
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Tensor-valued Constant nodes differed in value across invocations. This often indicates that the tracer has encountered untraceable code.
Node:
%1 : Dynamic = prim::Constant[value=<Tensor>]()
Source Location:
dank.py(6): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
dank.py(3): <module>
Comparison exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.397137, 0.956105, 0.169478, 0.560292, 0.392568, 0.108441,
0.97645 , 0.34412 , 0.951246, 0.793061, 0.557595, 0.770245],
dtype=float32)
y: array([0.243178, 0.315964, 0.972041, 0.0215 , 0.927751, 0.457512,
0.951092, 0.97883 , 0.048688, 0.118066, 0.779345, 0.271272],
dtype=float32)
```
==
```
import torch
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(4, 4),)])
def foo(x):
    for _ in range(x.size(0)):
        x = torch.neg(x)
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
%1 : Dynamic = aten::neg(%0)
%2 : Dynamic = aten::neg(%1)
%3 : Dynamic = aten::neg(%2)
+ %4 : Dynamic = aten::neg(%3)
- return (%3);
? ^
+ return (%4);
? ^
}
```
==
```
import torch
def foo(x):
    if not hasattr(foo, 'cache'):
        foo.cache = torch.neg(x)
    return x + foo.cache
traced = torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])(foo)
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
ERROR: Graphs differed across invocations!
Graph diff:
graph(%0 : Dynamic) {
- %1 : Dynamic = aten::neg(%0)
+ %1 : Dynamic = prim::Constant[value=<Tensor>]()
%2 : int = prim::Constant[value=1]()
%3 : Dynamic = aten::add(%0, %1, %2)
return (%3);
}
Node diff:
- %1 : Dynamic = aten::neg(%0)
+ %1 : Dynamic = prim::Constant[value=<Tensor>]()
Trace source location:
test.py(5): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(402): wrapper
test.py(8): <module>
Check source location:
test.py(6): foo
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(281): check_trace
/Users/jamesreed/onnx-fairseq/pytorch/torch/jit/__init__.py(408): wrapper
test.py(8): <module>
```
The following two examples show instances where program semantics are lost in the Python -> trace transformation, and repeated invocation does not give us useful debug information. Further design is underway for catching these scenarios.
```
import torch
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(3, 4),)])
def foo(x):
    for i in range(3):
        x[i, :] = torch.zeros(4)
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
Exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.830221, 0.915481, 0.940281, 0.555241], dtype=float32)
y: array([0., 0., 0., 0.], dtype=float32)
```
==
```
import torch
@torch.jit.trace(torch.rand(3, 4), check_inputs=[(torch.rand(5, 6),)])
def foo(x):
    x.view(-1).add_(-x.view(-1))
    return x
```
```
torch.jit.TracingCheckError: Tracing failed sanity checks!
ERROR: Traced function outputs do not match the Python function outputs.
Exception:
Not equal to tolerance rtol=1e-07, atol=0
(mismatch 100.0%)
x: array([0.734441, 0.445327, 0.640592, 0.30076 , 0.891674, 0.124771],
dtype=float32)
y: array([0., 0., 0., 0., 0., 0.], dtype=float32)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10841
Differential Revision: D9499945
Pulled By: jamesr66a
fbshipit-source-id: 1f842a32d0b0645259cc43b29700b86d99c59a45
Summary:
This PR adds argument checking for script method invocation from C++. For this I had to:
1. The schema of a method is currently not serialized in script modules, so we now store the function schema in the `doc_string` field of the ONNX proto. Upon loading of a serialized script module, we parse the schema into the structured C++ form and assign it to the loaded method,
2. Inside `Method::operator()`, we now verify the number and types of arguments.
CC zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10872
Differential Revision: D9521219
Pulled By: goldsborough
fbshipit-source-id: 5cb3d710af6f500e7579dad176652c9b11a0487d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10929
Some Workspace class methods were missing on the Python side. Adding them lets the new checkpoint framework be written with more control over the workspace and a cleaner implementation.
Added (a minimal usage sketch follows):
- ws.feed_blob(name, arr)
- ws.remove_blob(name)
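A minimal usage sketch under stated assumptions: only `feed_blob`/`remove_blob` come from this diff; the way the workspace handle is obtained (`workspace.C.Workspace()`) is an assumption for illustration.
```
import numpy as np
from caffe2.python import workspace

# Assumed way to get a pybind Workspace handle; the two methods
# used below are the ones added in this diff.
ws = workspace.C.Workspace()
ws.feed_blob("x", np.ones((2, 3), dtype=np.float32))
# ... build and run nets against this workspace ...
ws.remove_blob("x")
```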
Reviewed By: mraway
Differential Revision: D9486867
fbshipit-source-id: ea02d2e3a39d716a5a3da0482f57d4ac4c893763
Summary: Adds basic nomnigraph python bindings for quickly playing with the graphs.
Reviewed By: duc0
Differential Revision: D9441936
fbshipit-source-id: fd70f8ea279b28c766e40f124008800acd94bddd
Summary:
The previous NCCL all-gather didn't work as expected. This is a fully working async version, tested on both the C++ and Python frontends.
Multi-node:
```
tengli@learnfair042:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=0 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 0
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
tengli@learnfair117:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ TMPFILE="/private/home/tengli/temp/tengli-test" RANK=1 WORLD_SIZE=2 ./ProcessGroupNCCLTest
Multi-node world size: 2 rank: 1
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```
CI test:
```
test_set_get (__main__.FileStoreTest) ... ok
test_set_get (__main__.PrefixFileStoreTest) ... ok
test_set_get (__main__.PrefixTCPStoreTest) ... ok
test_allreduce_ops (__main__.ProcessGroupGlooTest) ... ok
test_broadcast_ops (__main__.ProcessGroupGlooTest) ... ok
test_allgather_ops (__main__.ProcessGroupNCCLTest) ... ok
test_allreduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_broadcast_ops (__main__.ProcessGroupNCCLTest) ... ok
test_reduce_ops (__main__.ProcessGroupNCCLTest) ... ok
test_common_errors (__main__.RendezvousFileTest) ... ok
test_nominal (__main__.RendezvousFileTest) ... ok
test_common_errors (__main__.RendezvousTCPTest) ... ok
test_nominal (__main__.RendezvousTCPTest) ... ok
test_unknown_handler (__main__.RendezvousTest) ... ok
test_set_get (__main__.TCPStoreTest) ... ok
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10932
Differential Revision: D9542067
Pulled By: teng-li
fbshipit-source-id: 25513eddcc3119fd736875d69dfb631b10f4ac86
Summary:
Running `--accept` on a test doesn't tell you explicitly which sub-test is being updated; this PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10559
Differential Revision: D9353977
Pulled By: driazati
fbshipit-source-id: a9d4014386ff0fe388a092f3dcf50f157e460f04
Summary:
Changes the approach for resolving builtin ops so that the following works
```
add = torch.add

@script
def foo(x):
    return add(x, x)
```
This handles cases when people alias torch and torch.nn.functional to
shorter names.
This works by building a table of id -> builtin name for the known builtin
ops in torch, torch.nn.functional, and for any user-defined
op created by accessing torch.ops.foo.bar
This allows us to clean up many SugaredValue types in the compiler.
Notes:
* we now consider any attributes on python modules to be constants
(e.g. math.pi, and torch.double).
* fixes a bug where we incorrectly allowed attribute lookup on arbitrary
python objects. It is now restricted to modules only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10927
Differential Revision: D9527522
Pulled By: zdevito
fbshipit-source-id: 0280422af08b4b0f48f302766d5a9c0deee47660
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10804
Make ShareData and ShareExternalPointer create a new storage when the old one is used by multiple tensors.
When we need to modify a field of the storage, we'll create a new storage instead.
Reviewed By: ezyang
Differential Revision: D9350686
fbshipit-source-id: 68d2b6b886b0367b0fc4fabfd55b9a480e7388ca
Summary:
Currently we assume the cuDNN includes and libraries are found under the `CUDA_HOME` root, but this is not always true. So we now support a `CUDNN_HOME`/`CUDNN_PATH` environment variable that points to a location with its own `/include` and `/lib64` folders.
This means cudnn extensions now also get support on the FAIR cluster.
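As a sketch (paths and file names are hypothetical; only the `CUDNN_HOME`/`CUDNN_PATH` variables come from this change), a cpp extension build could point at a standalone cuDNN install like this:
```
import os

# Hypothetical cuDNN install directory with its own include/ and lib64/
os.environ["CUDNN_HOME"] = "/opt/cudnn/v7.1"

from torch.utils.cpp_extension import load

# my_cudnn_ext.cpp is a placeholder source name for this sketch
ext = load(name="my_cudnn_ext", sources=["my_cudnn_ext.cpp"], verbose=True)
```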
soumith fmassa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10922
Differential Revision: D9526856
Pulled By: goldsborough
fbshipit-source-id: 5c64a5ff7cd428eb736381c24736006b21f8b6db
Summary:
Since we don't need `torch.autograd.Variable` anymore, I removed `torch.autograd.Variable` from `onnx.rst`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10810
Differential Revision: D9500960
Pulled By: zou3519
fbshipit-source-id: 1bc820734c96a8c7cb5d804e6d51a95018db8e7f
Summary:
More support for tuples has uncovered a bug in constant prop, where
it assumed it could create constant nodes for tuples, even though we
cannot easily create a single prim::Constant to represent a tuple.
This fix checks when we cannot represent an IValue as a prim::Constant
and then stops propagating the node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10923
Reviewed By: orionr
Differential Revision: D9523417
Pulled By: zdevito
fbshipit-source-id: 745058c4388d9a5e0fc1553eaa2731e31bc03205
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10824
API additions:
- Tensor(c10::intrusive_ptr<TensorImpl,UndefinedTensor>&&)
- Tensor(const c10::intrusive_ptr<TensorImpl,UndefinedTensor>&)
- Tensor::operator=(Tensor&&) && (for completeness sake)
- TensorBase::unsafeGetTensorImpl()
- TensorBase::unsafeReleaseTensorImpl()
- TensorBase::getIntrusivePtr()
- TensorImpl::type_id()
- Tensor::set_data()
- Tensor::is_same(Tensor)
- Tensor::use_count()
- Tensor::type_id()
- Tensor::scalar_type()
- WeakTensor::is_same(WeakTensor)
- intrusive_ptr::weak_use_count()
- weak_intrusive_ptr::weak_use_count()
- c10::raw::intrusive_ptr::{incref,decref,make_weak}
- c10::raw::weak_intrusive_ptr::{incref,decref,lock}
API changes:
- Tensor::pImpl is no longer public (and now named tensor_impl_)
- Most methods accessed this way are now accessible on Tensor
maybe_zero_dim() and set_wrapped_number() being prominent exceptions
(they are now accessed through unsafeGetTensorImpl())
- Type is no longer friend of Tensor
- TensorBase::reset(TensorImpl*) is deleted
- TensorBase::reset(TensorImpl*, bool should_retain) is deleted
- TensorBase::swap(TensorBaseImpl&) is deleted; use std::swap instead
- TensorBase::get() is deleted; use unsafeGetTensorImpl() instead
- TensorBase::detach() is deleted; use unsafeReleaseTensorImpl() instead
- TensorBase::retain() is deleted; use _raw_incref() instead
- TensorBase::release() is deleted; use _raw_decref() instead
- WeakTensor lost most of its methods (it no longer inherits from
TensorBase)
- TensorImpl::storage() is now a const method
- Tensor(TensorBase) constructor removed, instead
we go through getIntrusivePtr(). I'm not sure about
this change; I happened to have accidentally removed the
TensorBase constructor and decided to fix call sites,
but I could go the other way.
- detail::set_data() is deleted; use Tensor::set_data() instead
- c10::raw_intrusive_ptr_target removed; use the functions in c10::raw instead.
(The reason for this change, is that it is invalid to cast an intrusive_ptr_target*
to a raw_intrusive_ptr_target* to take advantage of the methods. But there is
no reason the incref/decref methods shouldn't also work on intrusive_ptr_target;
it is primarily an API consideration. We can be more standards compliant by
keeping them as functions, which are universally applicable.)
- intrusive_ptr::reclaim() and weak_intrusive_ptr::reclaim() now work on
pointers of the NullType. (This counts as a bug fix, because the documentation
specified that pointers produced by release() are valid to reclaim(), and
a release() on a null intrusive_ptr produces the NullType::singleton())
Bug fixes:
- Dispatch code for mutable references incorrectly returned
a reference to a value argument (which would immediately
go out of scope). They now correctly return a tensor by
value.
- intrusive_ptr copy/move assignment did not work correctly when
an object was assigned to itself. We now check for this case and
no-op if so. (This bug manifested itself as a Tensor mysteriously
becoming an UndefinedTensor after lines of code like
'x = x.mul_(y)')
Other changes:
- The checked cast functions in Utils.h have now been
renamed and detemplatized into checked unwrap functions.
- Added type_id() and scalar_type() methods to Tensor
- pImpl is no longer public
- Documented what the && overloads are doing
- All occurrences of 'new TensorImpl' (and similar spellings, like 'new THTensor')
have been expunged. This is NO LONGER a valid way to create a new
tensor, and if you do this, upon your first incref, you will catch an ASSERT
failure saying that only tensors created by intrusive_ptr::release() are valid
to reclaim(). Use c10::make_intrusive instead in this situation.
- IValue is adjusted to use intrusive_ptr instead of Retainable, and all
other sub-classes of Retainable were modified to use intrusive_ptr.
When doing this, I had to make the constructors of sub-classes like
ConstantList public, so that c10::make_intrusive could invoke them. Fortunately,
if you incorrectly stack allocate a ConstantList, and then try to get an
intrusive_ptr to it, it will fail, as stack allocated ConstantLists have refcount 0.
- IValue very narrowly sidesteps the problem of handling NullType, as it
considers intrusive_ptr<TensorImpl> identical to intrusive_ptr<TensorImpl, UndefinedTensor>
which is not always true. This was always the case, but there's now a comment
explaining what's going on.
Some MSVC bugs were uncovered during the preparation of this patch.
They are documented as comments in the code.
Reviewed By: gchanan
Differential Revision: D9481140
fbshipit-source-id: 14a8ea0c231ed88b5715fb86d92730926f9f92fc
Summary:
The goal of this PR is to enable miopen engine(for hip devices) for recurrent operator and also enable corresponding unit test.
bddppq petrex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10840
Differential Revision: D9518980
Pulled By: bddppq
fbshipit-source-id: 214661e79a47c5dc6b712ef0fba986bd99db051f
Summary:
Previously, when tracing slicing & select, negative indices would get normalized, fixing the index to the size of the traced tensor. This makes the behavior the same as script, so aten::select with negative indices is emitted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10560
Differential Revision: D9493614
Pulled By: eellison
fbshipit-source-id: ce7a8bae59863723247208d86b9f2948051ccc6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10877
change default value of DeviceOption.numa_node_id to 0 and use has_numa_node_id() to check existence
Reviewed By: ilia-cher
Differential Revision: D9473891
fbshipit-source-id: 91ac6a152f445644691023110c93d20a3ce80d43
Summary:
* Fix the necessary pathways so that tuples and lists can be inputs to the script.
* prevent linear algebra functions from being run in shape prop because
they frequently will error out for nonsense data.
* favor schema-driven python input conversion where possible.
remaining cases where we directly create Stacks without schema are
only for debugging
* Make the error messages when calling script/trace functions more pythonic
* Simplify FlattenTuples -- now that tuples are supported we can choose to only flatten tuples when needed. This may have to be revisited pending onnx test results, but is necessary for making tuple io work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10812
Differential Revision: D9477982
Pulled By: zdevito
fbshipit-source-id: ed06fc426e6ef6deb404602a26c435a7fc40ea0c
Summary:
The scalar situation has gotten a lot better and now we can
remove all instances of FIXME_zerol().
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10900
Differential Revision: D9514206
Pulled By: zou3519
fbshipit-source-id: e4e522f324126c5454cd6de14b832d2d1f6cb0ce
Summary:
PackedSequence is never supposed to be created by the user, but unfortunately some community repos are already doing this (e.g., [here](7c191048ce/torchmoji/model_def.py (L218-L229))). Some change we made broke the calling pattern `PackedSequence(data=x, batch_sizes=y)`. This patch adds back support for that.
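For reference, a minimal sketch of that calling pattern (the `pack_padded_sequence` call is only there to produce valid `data`/`batch_sizes`):
```
import torch
from torch.nn.utils.rnn import pack_padded_sequence, PackedSequence

packed = pack_padded_sequence(torch.randn(5, 2, 3), lengths=[5, 3])

# The community calling pattern this patch keeps working:
same = PackedSequence(data=packed.data, batch_sizes=packed.batch_sizes)
```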
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9864
Differential Revision: D9011739
Pulled By: SsnL
fbshipit-source-id: 0e2012655d7f4863ec54803550df30874ec35d75
Summary:
Please review the expects carefully to make sure there are no regressions. I tried to go over them one by one when they changed, but it's sometimes easy to miss finer details.
Summary of changes:
- Renamed `TensorType` to `CompleteTensorType`. Added a new `TensorType` which records only the scalar type, number of dimensions, and device of a value. The argument behind the rename is to encourage people to use `CompleteTensorType` less, as most passes will only have limited information available. To make the transition easier `complete_type->cast<TensorType>()` works, and makes our passes work with both kinds of specialization if they don't need the extra detail.
- Renamed `ArgumentSpec` to `CompleteArgumentSpec`. Added a new `ArgumentSpec`, which matches argument only at the level of the new `TensorType`.
- Shape analysis can process graphs with both `CompleteTensorType` and `TensorType`.
- Fuser was a part that heavily relied on full shape information being available. Now, we simply try to fuse the largest possible graphs, and have to do run-time checks to make sure they match the code we generate. If they don't, we fall back to regular interpretation. The shape checks are implementing using an optimized method exploiting algebraic properties of shapes with broadcasting, and the relations of broadcasting with pointwise ops. A full written proof of correctness of the shape checking algorithm is included in a comment in `graph_fuser.cpp`.
zdevito ezyang mruberry ngimel csarofeen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10844
Differential Revision: D9498705
Pulled By: apaszke
fbshipit-source-id: 0c53c2fcebd871cc2a29c260f8d012276479cc61
Summary: Update all the callers for the new interface
Reviewed By: highker
Differential Revision: D9323167
fbshipit-source-id: a39335ceb402db0719f5f2314085ba9a81380308
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10239
Make Conv + BN fusion also work for 3D convolutions
Reviewed By: duc0
Differential Revision: D9176314
fbshipit-source-id: 6604aa569c5c3afdb4480a5810890bc617e449c4
Summary:
This disables the symbolic override hacks and makes tracing emit the recently added ATen ops for RNNs (`aten::lstm`, `aten::gru`, ...). I managed to reuse pretty much all of the translation code for their symbolics.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10638
Differential Revision: D9385830
Pulled By: apaszke
fbshipit-source-id: ff06ef7b1ae7c3b7774825e0991bc3887e1ff59b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10759
Adding a basic registry pattern to pybindstate so that we can have separate 'cc' files register module updates. This is substantially cleaner than using multiple pybind modules (which have been known to cause bugs)
Reviewed By: bddppq
Differential Revision: D9441878
fbshipit-source-id: af9e9e98385e92b58ca50e935678328c62684d8e
Summary:
**Summary**: This PR is a followup of mruberry's https://github.com/pytorch/pytorch/pull/9318/. It tries to achieve the following:
- Specializing std common math functions for `at::Half` type.
- Create `CUDANumerics.cuh` to contain necessary parts from `THCNumerics.cuh`.
- Update `THCNumerics.cuh` with new usage and comments to demonstrate the best practice for developers and hence, making way for its deprecation.
- Remove legacy/redundant code path.
- Remove unused CUDA HALF macros (see separate PR https://github.com/pytorch/pytorch/pull/10147)
**Comments**: `CUDANumerics.cuh` contains mathematical functions that are either not in the std namespace or are specialized for compilation with CUDA NVCC or CUDA NVRTC. This header is derived from the legacy `THCNumerics.cuh`. Following are some rationale behind why some functions were kept while others were removed:
- All arithmetic can now be done in ATen using binary cuda kernel or CUDA tensor pointwise apply (check https://github.com/pytorch/pytorch/pull/8919 and `CUDAApplyUtils`). `at::Half` comparisons rely on implicit conversion to float.
- Functions that are c/c++ standard compliant, have been specialized for user defined types, for instance, the std namespace has been opened up for `at::Half`, that defines math function definitions for `at::Half`. Check `Half-inl.h`
- Some standard compliant functions are specialized here for performance reasons. For instance, `powi` is used for `pow` calculation on integral types. Moreover, `abs`, `isinf`, `isnan` are specialized to save one API call vs when used with std. Although this is subject to change, depending on if we really care about saving one API call.
- Numeric limits such as `max/min` are removed since they call standard defines. Moreover, numeric limits for
`at::Half` are present in `Half-inl.h`. I understood that HIP has some issue with `std::numeric_limits`, and this is the related GitHub issue I found: https://github.com/ROCm-Developer-Tools/HIP/issues/374. AlexVlx mentions that the issue can be avoided by launching `std::numeric_limits` in `__device__`. Since we are launching lambdas with device contexts, I don't see a reason why `std::numeric_limits` won't compile on HIP if launched with device context within a kernel, unless I am not aware of the real reason why max/min was there in THCNumerics in the first place. (Haven't ever tried a build with HIP.)
Here are some reference PRs that was handy in refactoring TH into ATen:
- https://github.com/pytorch/pytorch/pull/6786
- https://github.com/pytorch/pytorch/pull/5475
- https://github.com/pytorch/pytorch/pull/9401
- https://github.com/pytorch/pytorch/pull/8689
- https://github.com/pytorch/pytorch/pull/8919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10301
Differential Revision: D9204758
Pulled By: soumith
fbshipit-source-id: 09f489c1656458c02367b6cd31c3eeeca5acdc8a
Summary:
This is along the way of removing Tensor as a member of the tagged union in Scalar. This simplifies ordering dependencies, because currently Scalar and Tensor both depend on each other (so we introduce a TensorBase). Also, this API isn't particularly useful publicly: we can't autograd through Scalars, so you still need a Tensor overload basically everywhere anyway.
I'm undecided what the final API should be here. We could keep a Tensor constructor on Scalar, but have it generate a local scalar; this is convenient but given this API used to be non-synchronizing, it may not be the best.
For now, I'm just using _local_scalar, which is clear, although we should get rid of the prefix _ if that's the API we intend to promote.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10852
Reviewed By: ezyang
Differential Revision: D9496766
Pulled By: gchanan
fbshipit-source-id: 16f39b57536b9707132a5a4d915650c381bb57db
Summary:
The schema_ field is a private and internal cache for nodes, and no
methods meant to manipulate it should be publicly visible. This call
wasn't even necessary at its call site, since removeInput will reset the
schema by itself.
zdevito jamesr66a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10822
Reviewed By: zdevito
Differential Revision: D9498683
Pulled By: apaszke
fbshipit-source-id: 42e1743e3737cb7d81f88e556204487d328c0e47
Summary:
When matching schema, first try to match without adding TensorToNum conversions. Then make another pass where TensorToNum conversions are allowed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10180
Differential Revision: D9438153
Pulled By: eellison
fbshipit-source-id: 80541b5abd06e9d4187e89dda751f44dab6f58c5
Summary:
Since ONNX opset version >5, Reshape changed semantics to take a shape tensor as input instead of relying on the `shape` attribute to decide what shape to reshape to. The ONNXIFI op has been postponing this change as some of the backends, such as TensorRT, were not ready. Now that the backends have adopted the new semantics, we can remove the legacy mode and output opset version 7 ONNX models.
This change also flushes out some bugs and new requirements.
- Converting shape info into int64 tensor
- Fix a bug when we output the shape tensor in the mapped workspace instead of the original workspace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10848
Reviewed By: houseroad
Differential Revision: D9495121
Pulled By: yinghai
fbshipit-source-id: a6f44a89274c35b33fae9a429813ebf21d9a3d1a
Summary:
Currently on PyTorch AMD, memory accesses on the TensorInfo struct contained in the Operators passed into the kernelPointwiseApply kernel lead to hangs on the HCC runtime. Permuting the argument order such that the operator is first alleviates this issue and the kernel hangs disappear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10829
Reviewed By: ezyang
Differential Revision: D9492561
Pulled By: Jorghi12
fbshipit-source-id: d0f0e2ab7180e55846db909f2744b8c8b110205e
Summary:
We no longer use nanopb in PyTorch (or Caffe2) so removing. All protobuf manipulation should go through standard protobuf, which is statically linked inside libcaffe2.so by default.
cc zdevito pjh5 ezyang Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10772
Reviewed By: pjh5
Differential Revision: D9465894
Pulled By: orionr
fbshipit-source-id: 8cdf9f1d3953b7a48478d381814d7107df447201
Summary:
In prep for making FULL_CAFFE2 default, users shouldn't be required to have protobuf installed.
cc pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10771
Reviewed By: pjh5
Differential Revision: D9474458
Pulled By: orionr
fbshipit-source-id: 3e28f5ce64d125a0a0418ce083f9ec73aec62492
Summary:
This is a small part of the effort to remove Tensor as a tagged member in Scalar because it is inconsistent with how we normally do overloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10828
Differential Revision: D9485049
Pulled By: gchanan
fbshipit-source-id: 103f5cc03bb7775cd2d3a0a5c0c5924838055f03
Summary:
Part of #10774.
This PR does the following:
- Support ast.ExtSlice in the frontend. This is done by returning a
list of ast.Index and ast.Slice.
- Support multidimensional indexing with ints and slices
The general approach is to desugar multidimensional indexing into
at::slice, at::select operations. This is exactly how normal pytorch
does indexing (by desugaring it into at::slice, at::select, and other ops).
I used [this code](https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_variable_indexing.cpp) as reference.
We should be able to copy the rest of this to implement the missing
indexing features in script (indexing with ellipses, tensors, sequences, etc).
After I'm done implementing the missing indexing features in future prs, I can try to
templatize python_variable_indexing.cpp so that it can work with both JIT
script and normal pytorch indexing, but right now I'm not sure if that's
a good idea or not.
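A minimal sketch (assuming the standard `torch.jit.script` decorator) of the indexing this enables:
```
import torch

@torch.jit.script
def crop(x):
    # ints and slices across several dimensions are desugared into
    # aten::select / aten::slice, mirroring eager-mode indexing
    return x[0, 1:3]

print(crop(torch.arange(12).view(3, 4)))
```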
cc zdevito jamesr66a apaszke wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10787
Differential Revision: D9481402
Pulled By: zou3519
fbshipit-source-id: 78c9fa42771a037d157879e23e20b87401cf1837
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10797
A few operators enforce in-place output (e.g., running mean/var for SpatialBN). Functional right now doesn't follow the inplace_enforced_ rules in OpSchema, and therefore RunNetOnce() will fail on OpSchema->Verify(). Edit the output_names in Functional to follow the rules and pass the check.
Reviewed By: jerryzh168
Differential Revision: D9470582
fbshipit-source-id: 168efeccecc32184bd1d02f3fefe8e61faa4e0f4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10835
The last constructor diff caused a performance regression in cold runs.
This one tries to fix it.
Reviewed By: highker
Differential Revision: D9489617
fbshipit-source-id: a77c2e2c903a73e2ad9806b4f9c209cdb751442f
Summary:
Added PrefixStore support.
This will make groups backward compatible.
Tests are included too.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./FileStoreTest
Using temporary file: /tmp/testoglRl4
Using temporary file: /tmp/testepZIpB
Test succeeded
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ ./TCPStoreTest
Test succeeded
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10762
Differential Revision: D9484032
Pulled By: teng-li
fbshipit-source-id: 85754af91fe3f5605087c4a2f79ae930a9fd1387
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10605
Make isSubgraphMatch return a subgraph and a map from MatchNodes to graph nodes in the result, which makes it easier to write graph fusion logic. Also include some more helper methods for the NN subgraph matcher.
Reviewed By: bwasti
Differential Revision: D9374931
fbshipit-source-id: 3a273295eec81a43027ec3a9e835d27f00853df9
Summary:
apaszke recently ported RNNs from Python into ATen, which means we can replace our implementation in the C++ API (written by ebetica) with the ATen implementation, which cleans up a lot of code (+99, -323). Thanks apaszke!
I also added the `bidirectional` and `batch_first` options to the C++ API RNN options, just because why not.
apaszke ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10761
Differential Revision: D9443885
Pulled By: goldsborough
fbshipit-source-id: b6ef7566b9ced2b2f0b2e1f46c295b6f250c65a8
Summary:
* first integration of MIOpen for batch norm and conv on ROCm
* work around a ROCm compiler bug exposed by elementwise_kernel through explicit capture of variables in the densest packing
* work around a ROCm compiler bug exposed by having `extern "C" __host__` as a definition and just `__host__` in the implementation through the hipify script
* use fabs() in accordance with C++11 for double absolute, not ::abs() which is integer-only on ROCm
* enable test_sparse set on CI, skip tests that don't work currently on ROCm
* enable more tests in test_optim after the elementwise_bug got fixed
* enable more tests in test_dataloader
* improvements to hipification and ROCm build
With this, resnet18 on CIFAR data trains without hang or crash in our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10612
Reviewed By: bddppq
Differential Revision: D9423872
Pulled By: ezyang
fbshipit-source-id: 22c0c985217d65c593f35762b3eb16969ad96bdd
Summary:
Things like `zeros(1,2,3, dtype=torch.int)` are now supported in the script by altering tryMatchSchema to auto-construct the list `[1,2,3]` when it sees inlined members of the list as the last positional arguments.
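A minimal sketch of what this allows in script (hedged; written against the example in this summary):
```
import torch

@torch.jit.script
def make_zeros():
    # the trailing ints 1, 2, 3 are auto-collected into the size list [1, 2, 3]
    return torch.zeros(1, 2, 3, dtype=torch.int)

print(make_zeros().shape)
```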
I suggest reading the commits individually, since the first two incrementally change how we do tryMatchSchema to get it ready for adding vararg list conversion, while the third actually does the modification.
Closes #10632, closes #8516.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10250
Differential Revision: D9478235
Pulled By: zdevito
fbshipit-source-id: 0c48caf7a6184e463d9293d97015e9884758ef9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10779
The idea is to let classes opt-in to providing these methods
by default.
Reviewed By: jerryzh168
Differential Revision: D9466076
fbshipit-source-id: b6beee084cc71d53ce446cdc171d798eeb48dc12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10766
Added a `Workspace::ForEach(...)` API for accessing the global set of
existing Workspace instances. This is used in the signal handler to print blob
info on the thread receiving a fatal signal.
Reviewed By: mraway
Differential Revision: D9147768
fbshipit-source-id: a94d0b5e6c88390a969ef259ecb8790173af01a4
Summary:
This seems to save a few percent in binary size in libcaffe2_gpu.so, but
the effect may not be real. In fact, deleting some functions can cause
the binary size to increase (perhaps due to alignment issues).
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10707
Differential Revision: D9409009
Pulled By: colesbury
fbshipit-source-id: 282931e562e84e316a33ac6da4788c04c2984f08
Summary:
To prepare THCState for refactoring into ATen, this PR removes unused THCState code paths. In particular, it:
- Removes the UVA Allocator
- Removes the THDefaultDeviceAllocator
- Respects the 1 BLAS and 1 sparse handle per device reality
- Removes kernel p2p access
- Removes setting p2p access
- Removes the GCHandler code path
- Removes many unused THCState_... functions
- Removes THCThreadLocal.h/.cpp
It does not change the preexisting external behavior of any remaining function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9735
Differential Revision: D9438558
Pulled By: SsnL
fbshipit-source-id: dde9acbec237a18bb6b75683e0526f7ff1c9a6ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10509
This diff enables CUDA implementation of LARS operator in caffe2.
Reviewed By: enosair
Differential Revision: D9318356
fbshipit-source-id: 365b9f01e3afd4d9d3ba49155e72e728119f40c5
Summary:
When 0-sized dimension support is added, we expect an empty sparse tensor to be a 1-dimensional tensor of size `[0]`, with `sparseDims == 1` and `denseDims == 0`. Also, we expect the following invariants to be preserved at all times:
```
_sparseDims + _denseDims = len(shape)
_indices.shape: dimensionality: 2, shape: (_sparseDims, nnz)
_values.shape: dimensionality: 1 + _denseDims. shape: (nnz, shape[_sparseDims:])
```
This PR fixes various places where the invariants are not strictly enforced when 0-sized dimension support is enabled.
Tested and `test_sparse.py` passes locally on both CPU and CUDA with the `USE_TH_SIZE_ZERO_DIM` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9279
Differential Revision: D8936683
Pulled By: yf225
fbshipit-source-id: 12f5cd7f52233d3b26af6edc20b4cdee045bcb5e
Summary:
This uses zou3519's new `torch.broadcast_tensors()` #10075 to make `Categorical.log_prob()` and the `*Normal.__init__()` methods jittable. Previously `.log_prob()` was failing due to calls to `torch._C.infer_size()` with errors like
```
    def log_prob(self, value):
        if self._validate_args:
            self._validate_sample(value)
>       value_shape = torch._C._infer_size(value.size(), self.batch_shape) if self.batch_shape else value.size()
E       RuntimeError: expected int at position 0, but got: Tensor
```
After this change I'm able to jit many more of Pyro's tests.
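A minimal sketch of the jittable pattern, not the actual distributions code: broadcasting everything up front with `torch.broadcast_tensors` instead of computing shapes via `torch._C._infer_size`.
```
import math
import torch

def normal_log_prob(value, loc, scale):
    # broadcast all operands instead of doing manual shape arithmetic
    value, loc, scale = torch.broadcast_tensors(value, loc, scale)
    var = scale ** 2
    return (-((value - loc) ** 2) / (2 * var)
            - scale.log()
            - math.log(math.sqrt(2 * math.pi)))

print(normal_log_prob(torch.zeros(4), torch.zeros(2, 4), torch.ones(2, 4)))
```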
Reviewed By: ezyang
Differential Revision: D9477487
Pulled By: apaszke
fbshipit-source-id: 5f39b29c6b8fa606ad30b02fefe2dfb618e883d6
Summary:
When emitting if branches, check that the types of each value returned are equivalent. As with reassignment of values, tensors are not forced to be the same shape or subtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10281
Differential Revision: D9466566
Pulled By: eellison
fbshipit-source-id: 746abdeb34a0f68806b8e73726ad5003b536911c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9483
The interface is updated to accept the config to construct the predictor.
Reviewed By: highker
Differential Revision: D8872999
fbshipit-source-id: 3ca54d644970823fc33c0ade9a005e12f52e2b24
Summary:
This will make the pybind version of the MPI process group work. The issue is that the scope of the tensor list won't be available to the MPI worker thread, so we pass the vector by value instead.
Also added a recv_anysource pybind to make it work. The front-end API will wrap one level up with an int for this function, so taking a tensor should be the easiest way for now.
Also added an abort pybind and fixed the flaky test.
```
tengli@devfair033:~/new_pytorch/pytorch/torch/lib/build/c10d/test$ mpirun -np 8 ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10606
Differential Revision: D9474393
Pulled By: teng-li
fbshipit-source-id: cca236c333656431e87d0d3573eeae9232c598b0
Summary:
Augassign (i.e., `x += 1`) gets desugared to an assignment of a binop (`x = x + 1`).
Right now we assert that the RHS of the binop is a tensor,
but it really doesn't have to be because we support scalar/scalar ops and also
list-list ops (i.e., `[1, 2] + [2, 3]`).
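A minimal sketch (hedged) of a case the relaxed check now accepts, with a non-tensor RHS:
```
import torch

@torch.jit.script
def bump(x):
    n = 0
    n += 1  # scalar/scalar augassign: the RHS is not a tensor
    return x + n

print(bump(torch.zeros(3)))
```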
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10730
Differential Revision: D9465110
Pulled By: zou3519
fbshipit-source-id: 7b118622701f09ce356aca81b8db743d9611097b
Summary:
Multiple failing external and internal CI signals were ignored when this commit
was landed. goldsborough please fix the test failures and resubmit this change as a
new PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10785
Reviewed By: ezyang
Differential Revision: D9466791
Pulled By: jamesr66a
fbshipit-source-id: b260e93bac95d05fd627c64e620b6aefb5045949
Summary:
ONNX doesn't support this. Instead, flatten the inputs to the ListConstruct op and inline it into the subsequent usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10713
Differential Revision: D9458508
Pulled By: jamesr66a
fbshipit-source-id: 0b41e69320e694bb2f304c6221864a39121e4694
Summary:
I included "legacy" includes in the old spots for Backend, Generator, Layout; it seemed unlikely that the other ones had direct user includes.
This is another step on the path to move Type/Tensor to ATen/core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10740
Reviewed By: ezyang
Differential Revision: D9435888
Pulled By: gchanan
fbshipit-source-id: 89f4f0f445d4498a059d3a79069ba641b22bbcac
Summary:
Don't regex against strings that may have come from the backtrace.
Better to just not regex at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10702
Reviewed By: ezyang
Differential Revision: D9406154
Pulled By: jsrmath
fbshipit-source-id: 9b17abee2a6e737a32c05f1e3963aef4b6638a47
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10053
Tensor in PyTorch 1.0 will have:
Tensor -> TensorImpl -> Storage -> StorageImpl
In this diff we split Storage from Tensor in order to align with this design.
We'll have Tensor -> Storage -> StorageImpl after this diff.
Reviewed By: ezyang
Differential Revision: D9384781
fbshipit-source-id: 40ded2437715a3a2cc888ef28cbca9a25b1d5350
Summary:
I've tested locally that this works to build static and non-static binaries with and without CUDA.
In terms of ongoing testing, I am working on incorporating this into the release package generation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10754
Differential Revision: D9457423
Pulled By: anderspapitto
fbshipit-source-id: aa1dcb17c67c0f0c493a9cf93aca4a6e06b21666
Summary: The code in Operator::SyncDevice had some duplicate logic and using FinishDeviceComputation sufficed in this case.
Reviewed By: yinghai
Differential Revision: D9348288
fbshipit-source-id: d8d874bab491e6d448fcd5fa561a8b99d502753b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10362
This diff implements a manual export from PyText's CRF module to the caffe2 CRF layer.
Note that most of the changes in caffe2/python/crf.py are just formatting changes; the only relevant change is the new class CRFUtils.
Reviewed By: hikushalhere
Differential Revision: D9234126
fbshipit-source-id: 1a67d709034660e8b3d5ac840560b56de63e3f69
Summary:
```
Use intrusive_ptr in Storage; replace unique_ptr<Storage> with Storage
This patch does two major changes:
- It replaces the use of Retainable in Storage with a new implementation
based on intrusive_ptr. This will be necessary because Caffe2 will
be using this class to implement intrusive_ptrs, and we need to
line these up for the merge. One good thing about the new implementation is
that the default copy/move constructors/assignment operators and destructor
work automatically, instead of needing to be hardcoded into Storage/Tensor.
- It replaces all places where we returned std::unique_ptr<Storage> with
Storage, collapsing an unnecessary double indirection that is no longer
necessary now that we have correctly working copy/move constructors.
I didn't initially want to do step (2), but it was very important to
eliminate all bare uses of new Storage and new StorageImpl, and this making
the API change was the most straightforward way to do this.
HOW TO FIX YOUR CODE IN THE NEW API
- You no longer need to dereference the result of tensor.storage() to pass
it to set. So, instead of:
x.set_(*y.storage());
just write:
x.set_(y.storage());
- If you were accessing methods on StorageImpl via the pImpl() method, you
must use the dot operator to run pImpl(). Even better; just drop pImpl,
we now have method forwarding. So, instead of:
storage->pImpl()->data();
just do:
storage->data();
// storage.pImpl()->data() works too but is not as recommended
- storage->getDevice() is no more; instead use storage->device().index()
MISC CODE UPDATES
- retain, release, weak_retain, weak_release and weak_lock are now
reimplemented using the "blessed API", and renamed to make it
clearer that their use is discouraged.
- nvcc OS X and general OS X portability improvements to intrusive_ptr
- A new comment in intrusive_ptr describing how stack allocated
intrusive_ptr_targets work differently than heap allocated ones
from c10::make_intrusive
CAVEAT EMPTOR
- THStorage_weakRetain used to work on strong pointers, but it NO LONGER
works with intrusive_ptr. You must reclaim the strong pointer into a
real strong pointer, construct a weak pointer from it, and then release
the strong and weak pointers. See StorageSharing.cpp for an example.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10488
Reviewed By: gchanan
Differential Revision: D9306134
Pulled By: ezyang
fbshipit-source-id: 02d58ef62dab8e4da6131e1a24834a65c21048e2
Summary:
The optimized code for `linear()`, which uses `addmm` when a bias is given, was duplicated three times across ATen and the C++ API. Let's just have `at::linear` and use that everywhere.
apaszke ezyang (who mentioned this in #10481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10755
Differential Revision: D9443881
Pulled By: goldsborough
fbshipit-source-id: a64862d1649b5961043d58401625ec267d97d9f3
Summary:
zdevito et al came to the conclusion that the ONNX spec does not mandate the widening conversion of integral types when serializing tensor data into raw_data, as opposed to serializing the data into int32_data. PyTorch recently made this change in the export code, which caused import in caffe2 to break because it did not match semantics. This fixes that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10718
Differential Revision: D9423712
Pulled By: jamesr66a
fbshipit-source-id: 479fbae67b028bf4f9c1ca1812c2c7b0c6cccd12
Summary:
Fixes `__getattr__` to adhere to its Python API contract, and wraps the `range()` call in a list since it no longer returns one in Python 3.
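For context, a minimal sketch of the two contracts involved (a hypothetical class, not the code touched by this PR):
```
class Params(object):
    def __init__(self, known):
        self._known = dict(known)

    def __getattr__(self, name):
        try:
            return self._known[name]
        except KeyError:
            # raising AttributeError (not KeyError, not returning None)
            # keeps hasattr() and getattr() defaults working
            raise AttributeError(name)

p = Params({"ids": list(range(5))})  # list(...) because range() is lazy on Python 3
print(getattr(p, "missing", "default"))  # -> "default"
```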
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10525
Reviewed By: ezyang
Differential Revision: D9441360
Pulled By: tomdz
fbshipit-source-id: d489c0e7cefecc4699ca866fd55ddbfa629688d4
Summary:
This PR adds support for using custom ops in ScriptModules, the last step for our custom op strategy. You can now write
```
import torch
torch.ops.load_library('libcustom_ops.so')

class Model(torch.jit.ScriptModule):
    def __init__(self):
        super(Model, self).__init__()

    @torch.jit.script_method
    def forward(self, input):
        return torch.ops.custom.op(input) + 1

model = Model()
model.forward(torch.ones(5))  # Works
model.save("model.pt")  # Works
model = torch.jit.load("model.pt")  # Works
```
You can then load the `model.pt` in C++ and execute its `forward` method!
Missing for this was the fact that the script compiler didn't know to convert `ops.custom.op` into a `BuiltinFunction` which then emits a function call. For this I came up with the following strategy inside `torch/csrc/jit/scrip/init.cpp`:
1. When we access `torch.ops`, we return a `CustomOpValue` (subclass of `PythonValue`), whose purpose is only to return a `CustomOpNamespaceValue` (subclass of `PythonValue`) whenever something under it is accessed.
2. `CustomOpNamespaceValue` will then for each field accessed on it return a `BuiltinFunction`.
This doesn't reduce performance for any calls that are not to `torch.ops` (as opposed to inspecting every function call's name the call site, for example).
I also had to fix `BuiltinFunction` to not assume the namespace is always `aten::`.
A lot of other changes are just tidying up the Python and C++ test harness before I integrate it in CI.
zdevito dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10610
Differential Revision: D9387832
Pulled By: goldsborough
fbshipit-source-id: c00f431db56c7502a66fe1f813fe78067f428ecb
Summary:
This should resolve "error C2280: 'std::unique_ptr<caffe2::ObserverBase<caffe2::OperatorBase>,std::default_delete<_Ty>> &std::unique_ptr<_Ty,std::default_delete<_Ty>>::operator =(const std::unique_ptr<_Ty,std::default_delete<_Ty>> &)': attempting to reference a deleted function" from Visual Studio.
It should also make the error message more human-readable in case something is really messed up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10593
Reviewed By: orionr
Differential Revision: D9436397
Pulled By: mingzhe09088
fbshipit-source-id: 31711667297b4160196134a34365da734db1c61d
Summary:
Let's run CI tests to see what fails given the changes that just landed in https://github.com/pytorch/pytorch/pull/10624
cc mingzhe09088 ezyang Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10692
Reviewed By: mingzhe09088
Differential Revision: D9423617
Pulled By: orionr
fbshipit-source-id: 3bda1f118d13f8dd8e823727c93167cae747d8cf
Summary:
Set the build environment before installing sccache in order to make sure the docker images have the links set up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10640
Reviewed By: yf225
Differential Revision: D9399593
Pulled By: Jorghi12
fbshipit-source-id: a062fed8b7e83460fe9d50a7a27c0f20bcd766c4
Summary:
This is part of moving the (base) Type to ATen/core; some Type methods have default arguments of type THNN Reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10703
Differential Revision: D9406060
Pulled By: gchanan
fbshipit-source-id: 789bb3387c58bd083cd526a602649105274e1ef6
Summary:
This will make the common case more natural (no need to do `_construct_empty_tensor_list()`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10705
Differential Revision: D9411622
Pulled By: michaelsuo
fbshipit-source-id: 2d91fbc5787426748d6e1c8e7bbeee737544dc96
Summary:
The broadcast is used by default when the opset version is greater than 6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10108
Reviewed By: bddppq
Differential Revision: D9176934
Pulled By: houseroad
fbshipit-source-id: b737bd87b0ddc241c657d35856d1273c9950eeba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10651
EnsureCPUOutputOp will copy the input from another Context to CPU, but currently there is no guarantee that the Copy will be executed.
Differential Revision: D9390046
fbshipit-source-id: af3ff19cf46560264cb77d2ab8821f0cc5be74f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10551
Renaming from "subtree" -> "subgraph" to improve clarity of subgraph matcher APIs since it now supports DAG
This is pure renaming, no functionalities change.
Reviewed By: bwasti
Differential Revision: D9348311
fbshipit-source-id: 4b9267845950f3029dfe385ce3257d3abb8bdad4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10549
Support DAG matching in nomnigraph. This is done by maintaining a map from nodes in the MatchGraph to nodes in the input graph, and additionally enforcing that the same node in the MatchGraph must match to the same node in the input graph (with the exception of multiplicity, i.e., when count != 1 on the MatchGraph node).
In a follow up diff, I'll rename the API that refers to subtree as subgraph to improve clarity.
Reviewed By: bwasti
Differential Revision: D9347322
fbshipit-source-id: 171491b98c76852240a253279c2654e96dd12632
Summary:
Some more `ATEN_API` additions for hidden visibility.
Running CI tests to see what fails to link.
cc Yangqing mingzhe09088 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10624
Reviewed By: mingzhe09088
Differential Revision: D9392728
Pulled By: orionr
fbshipit-source-id: e0f0861496b12c9a4e40c10b6e0c9e0df18e8726
Summary:
Minor fix for the cuDNN cache. Previously, if an RNN function was called on GPU 0 and then called in eval mode on GPU 1, we would skip event re-initialization, which caused an "incorrect resource handle" error when trying to record the event.
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10662
Reviewed By: soumith
Differential Revision: D9393629
Pulled By: apaszke
fbshipit-source-id: e64c1c1d2860e80f5a7ba727d0b01aeb5f762d90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9888
Limiter cannot be shared or copied; just pass it to the first reader.
Reviewed By: xianjiec
Differential Revision: D9008871
fbshipit-source-id: e20cd785b26b1844e156efc3833ca77cfc3ffe82
Summary:
Trigonometry functions are newly added to ONNX in a recent PR https://github.com/onnx/onnx/pull/869
This PR makes PyTorch support exporting graphs with trigonometric functions.
It might need to wait until it is ready to change
```python
_onnx_opset_version = 6
```
to
```python
_onnx_opset_version = 7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7540
Differential Revision: D9395041
Pulled By: bddppq
fbshipit-source-id: bdf3e9d212b911c8c4eacf5a0753bb092e4748d2
Summary:
There is no reason that a user should need an extra import to use DistributedSampler.
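A quick usage sketch, assuming the sampler is now re-exported under torch.utils.data:
```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

dataset = TensorDataset(torch.arange(100).float())
# num_replicas/rank would normally come from the initialized process group.
sampler = DistributedSampler(dataset, num_replicas=2, rank=0)
loader = DataLoader(dataset, batch_size=10, sampler=sampler)
```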
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10671
Differential Revision: D9395189
Pulled By: SsnL
fbshipit-source-id: 8f41d93813c8fb52fe012f76980c6a261a8db9b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10478
- Removed Backend constructor from Device, and fixed all
use-sites to use DeviceType::CPU instead of kCPU, or
use a new function backendToDeviceType to perform
the conversion.
- New method device_type() on Type; it gives you the
underlying device type, e.g., CPU for SparseCPU.
- We add backward compatibility for kCPU/kCUDA uses,
by introducing a new special type which is implicitly
convertible to both DeviceType and Backend. As long as
you don't define a function that's overloaded on both
DeviceType and Backend (but not on BackendOrDeviceType),
the implicit conversions will ensure that uses
of at::Device(at::kCPU) keep working. We fixed use-sites in
the library, but did NOT fix sites in the test code, so that
we can exercise this BC code.
Reviewed By: Yangqing
Differential Revision: D9301861
fbshipit-source-id: 9a9d88620500715c7b37e655b4fd761f6dd72716
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026
This is done through the following:
1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes if an input (of the original graph) will be chunked.
3) When launching a kernel, use `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes in an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
- Expect test and correctness test to see if a single chunk is fused
by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
middle, end) and tensors (contiguous, non-contiguous, edge case
(splitSize = 1) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
correctness test.
cc zdevito apaszke
LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.
After changes:
```
thnn cudnn jit
8.8468 6.5797 9.3470
```
Before changes:
```
thnn cudnn jit
9.9221 6.6539 11.2550
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10178
Differential Revision: D9382661
Pulled By: zou3519
fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
Summary:
Take 2 of #10543
The problem was that, between commit and merge, one more entry point (`tools/build_libtorch.py`) was added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10659
Differential Revision: D9393540
Pulled By: soumith
fbshipit-source-id: 8ebfed600fc735fd1cb0489b161ec80e3db062e0
Summary:
Fixes #10096
If the only thing preventing a simple mappable operator from being fused
into a fusion group is that its Tensor inputs are not of the same shape as the
output, then the graph fuser inserts explicit expand nodes for those
inputs.
This helps the graph fuser not miss out on any fusion opportunities
involving simple mappable operations that have Tensor inputs. This PR
doesn't do anything for the scalar case; that can be addressed later.
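As an illustration, a sketch of the kind of scripted function that benefits (hypothetical example; the broadcasted bias input previously blocked fusion):
```python
import torch

@torch.jit.script
def scale_shift(x, bias):
    # bias is broadcast against x; the fuser now inserts an explicit
    # expand for it so the whole expression can live in one fusion group.
    return (x * x + bias).relu()
```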
Test Plan
- Simple expect test case
- Added expect tests for a raw LSTMCell. The expands help speed up the
forwards pass by allowing more operations to be fused into the LSTMCell's single
FusionGroup.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10325
Differential Revision: D9379308
Pulled By: zou3519
fbshipit-source-id: 86d2202eb97e9bb16e511667b7fe177aeaf88245
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10630
`onnxTensorDescriptorV1.name` points to the string buffer. We use a vector of strings to serve as the storage. This means we cannot reallocate the vector because that may invalidate the `onnxTensorDescriptorV1.name` pointers. Solution is to reserve a large enough vector so that it won't reallocate.
Reviewed By: bddppq, houseroad
Differential Revision: D9381838
fbshipit-source-id: f49c5719aafcc0829c79f95a2a39a175bcad7bfe
Summary:
This is on the way to resolving #9940.
Fixes #10501
This PR modifies graph fuser to fuse operations that have constant
scalar arguments. These constant scalar arguments are directly inlined
into the kernel body.
The context for this is that LSTM backward (in particular, sigmoid
backward) has many add(x, 1.) operations. This PR should be sufficient for
LSTM backward to get fused by the graph fuser.
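For illustration, a hypothetical scripted expression of the kind this enables fusing (sigmoid backward is roughly grad * y * (1 - y), which contains ops with the scalar constant 1.):
```python
import torch

@torch.jit.script
def sigmoid_backward_like(grad, y):
    # The scalar constant 1. is inlined directly into the fused kernel body.
    return grad * y * (1. - y)
```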
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10511
Differential Revision: D9378896
Pulled By: zou3519
fbshipit-source-id: 6a7a2987f5b6e8edaaf4b599cd200df33361650f
Summary:
This is still not the final PR, but it removes all blockers for actually using the RNN functions directly in the JIT. Next patch should be final, and will actually remove the symbolic_override code, and change it to proper symbolics for those ATen functions. Turns out the symbolic code can be also cleaned up a bit, and I'll do that too.
zdevito ezyang
colesbury (for minor DispatchStub.h changes)
There was no way to handle those in the JIT for now, and they turned
out to be completely unnecessary. It should make the Python and C++
module code much simpler too, since all the logic is now centralized
in the native functions.
The downside is that RNN modules no longer own their dropout buffers,
which are shared per-device instead (with appropriate locking and
synchronization). This might appear as a perf regression at first, but
in reality it's highly unlikely that anyone will want to run cuDNN RNNs
on the same GPU in parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10581
Reviewed By: colesbury
Differential Revision: D9365541
Pulled By: apaszke
fbshipit-source-id: 3ef8677ee5481bae60c74a9117a2508665b476b5
Summary:
The PR is the first step to integrate the torch.nn library with the JIT. It adds tests for the nn functional interfaces in trace/script mode, and tries to find the differences between the torch.nn.functional ops and the ATen ops, to see the work needed in order to support the full set of nn functionals in script mode.
Some statistics in summary:
- In total, 84 useful functions in torch.nn.functional (the number does not include helper funcs and deprecated funcs in torch.nn.functional).
- 7 functions/ops do not support higher-order gradients, so they are excluded from the test.
- 36 functions differ from the ATen op for various reasons. Among those 36 functions, a bunch (roughly 10-15) are just naming differences or simple transformations using other ops inside the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10409
Differential Revision: D9350694
Pulled By: wanchaol
fbshipit-source-id: 8fce6f30d8d25ace5a544a57b219fe61f5a092f8
Summary:
Inlining if branches which have constant inputs. If an if node gets inlined, the set of mutated variables returned by its ancestors may have changed. In the following example the block should
return a mutated set of (a) and not (a, b).
```
if cond:
if True:
a = a - 1
else:
b = b - 1
```
To calculate this we recursively update mutated variables in if branches from the leaf nodes up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10084
Reviewed By: michaelsuo
Differential Revision: D9340429
Pulled By: eellison
fbshipit-source-id: b0dd638a5cace9fdec3130460428fca655ce4b98
Summary:
https://github.com/pytorch/pytorch/pull/10100 recently added external input/output handling in nomnigraph. This PR makes the following adjustments:
0. Relax some of the conditions on external inputs.
1. Update NNModule inputs/outputs when pruning the input/output.
2. Avoid copying external input/output, as nomnigraph already takes care of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10598
Reviewed By: bwasti
Differential Revision: D9371730
Pulled By: yinghai
fbshipit-source-id: 9273be5041dc4cc8585587f47cb6721e518a06a8
Summary:
Custom Python installations that have no aliases to `python` or `python3` can't be found by CMake's `findPythonInterp` without an extra CMake argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10543
Differential Revision: D9378844
Pulled By: ezyang
fbshipit-source-id: 022e20aab7e27a5a56b8eb91b6026151116193c7
Summary:
Fix "error LNK2019: unresolved external symbol" from "CAFFE_KNOWN_TYPE" in tests where we should use dllexport instead of AT_CORE_API(=dllimport).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10602
Differential Revision: D9377394
Pulled By: Yangqing
fbshipit-source-id: 993062a461ffce393f2321c5391db5afb9b4e7ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10282
This diff removes the unused/deprecated features from the code base.
Reviewed By: manojkris
Differential Revision: D9169859
fbshipit-source-id: d6447b7916a7c687b44b20da868112e6720ba245
Summary:
This is the last step in the custom operator implementation: providing a way to build from C++ and Python. For this I:
1. Created a `FindTorch.cmake` taken largely from ebetica with a CMake function to easily create simple custom op libraries
2. Created a `torch/op.h` header for easy inclusion of necessary headers,
3. Created a test directory `pytorch/test/custom_operator` which includes the basic setup for a custom op.
1. It defines an op in `op.{h,cpp}`
2. Registers it with the JIT using `RegisterOperators`
3. Builds it into a shared library via a `CMakeLists.txt`
4. Binds it into Python using a `setup.py`. This step makes use of our C++ extension setup that we already have. No work, yey!
The pure C++ and the Python builds are separate and not coupled in any way.
zdevito soumith dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10226
Differential Revision: D9296839
Pulled By: goldsborough
fbshipit-source-id: 32f74cafb6e3d86cada8dfca8136d0dfb1f197a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10599
Not spawning threads with spin-lock synchronization is bad because they will switch to `condvar` wait, which increases wake-up latency next time they are needed.
Reviewed By: ajtulloch
Differential Revision: D9366664
fbshipit-source-id: 3b9e4a502aeefaf0ddc4795303a855d98980b02e
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
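A short sketch of the new API, mirroring ``parameters()``/``named_parameters()``:
```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

# Registered buffers, e.g. BatchNorm running statistics.
for buf in model.buffers():
    print(buf.shape)

# named_buffers() also yields the qualified name of each buffer.
for name, buf in model.named_buffers():
    print(name, tuple(buf.shape))
```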
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554
Reviewed By: SsnL
Differential Revision: D9367762
Pulled By: jma127
fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
Summary:
A continuation of https://github.com/pytorch/pytorch/pull/10504 for GPU, torch, etc. builds.
I was testing with
```
FULL_CAFFE2=1 python setup.py build_deps | tee ~/log.txt
cat ~/log.txt | egrep 'undefined refer' | sort | less
```
I'll rebase on master when Yangqing's changes in 10504 land, but putting up for some testing.
cc mingzhe09088 anderspapitto ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10507
Reviewed By: Yangqing
Differential Revision: D9359606
Pulled By: orionr
fbshipit-source-id: c2a3683b3ea5839689f5d2661da0bc9055a54cd2
Summary:
Resubmit #10416 with fixed tests. This removes the implicit GPU-to-CPU conversion when calling numpy, to keep the behavior consistent with other methods.
It requires users to move the tensor back to CPU with cpu() before calling numpy functions on it.
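For example, a sketch of the now-required pattern:
```python
import torch

x = torch.randn(3, device="cuda")
# x.numpy() now raises instead of silently copying to the host;
# move the tensor to CPU explicitly first.
arr = x.cpu().numpy()
```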
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10553
Differential Revision: D9350212
Pulled By: ailzhang
fbshipit-source-id: 9317d8fea925d4b20ae3150e2c1b39ba5c9c9d0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10494
Adding the AllreduceBcube routines, as they are now available in gloo.
Reviewed By: wesolwsk
Differential Revision: D8269473
fbshipit-source-id: 6a3a32291bbf1fbb328b3ced0f2a753dc5caf4e5
Summary:
The ONNXIFI backend will absorb the constant weight in Conv, so we should not add it as an input. This is just a test artifact. Note that the Onnxifi transformer will do the right thing when cutting the graph to absorb the weights.
rdzhabarov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10575
Reviewed By: houseroad
Differential Revision: D9357339
Pulled By: yinghai
fbshipit-source-id: a613fa3acafa687295312f5211f8e9d7f77b39cd
Summary:
delete build_caffe2.sh, replace with build_libtorch.py as suggested by peter (and copy-pasted from his draft PR). This ensures that all consumers of the torch CMake file go through as unified a path as possible.
In order to change the surrounding infrastructure as little as possible, I made some tweaks to enable build_pytorch_libs.sh to generate the test binaries relative to the current directory, rather than hardcoding to pytorch/build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10508
Differential Revision: D9354398
Pulled By: anderspapitto
fbshipit-source-id: 05b03df087935f88fca7ccefc676af477ad2d1e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10546
Have you ever written an operator<< overload in the caffe2 namespace
in a core Caffe2 header, and then been stunned when some completely
unrelated code started breaking? This diff fixes this problem!
The problem looks like this:
1. You're building against a really old version of glog (think 0.3.2,
or something like that)
2. This version of glog defines operator<< overloads for std containers
in the global namespace
3. You add a new overload in your current namespace (e.g., caffe2).
Congratulations: this overload is *preferentially* chosen over
the global namespace one for all calls to << in that namespace.
And since it doesn't actually have std::vector overloads, unrelated
Caffe2 code breaks.
Newer versions of glog have a fix for this: they have the line:
namespace std { using ::operator<<; }
in their header. So let's help old versions of glog out and do this ourselves.
In our new world order, operator<< overloads defined in the global namespace
won't work (unless they're for std containers, which work because of ADL).
So this diff also moves all those overloads to the correct namespace.
Reviewed By: dzhulgakov
Differential Revision: D9344540
fbshipit-source-id: 6246ed50b86312668ebbd7b039fcd1233a3609cf
Summary:
This PR removes the `using Tensor = autograd::Variable;` alias from `torch/tensor.h`, which means `torch::Tensor` is now `at::Tensor`. This PR fixes up some last uses of `.data()` and tidies up the resulting code. For example, I was able to remove `TensorListView` such that code like
```
auto loss = torch::stack(torch::TensorListView(policy_loss)).sum() +
torch::stack(torch::TensorListView(value_loss)).sum();
```
is now
```
auto loss = torch::stack(policy_loss).sum() + torch::stack(value_loss).sum();
```
CC jgehring
ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10516
Differential Revision: D9324691
Pulled By: goldsborough
fbshipit-source-id: a7c1cb779c9c829f89cea55f07ac539b00c78449
Summary:
Fixed the NCCL test, which is not run in CI. We should enable it soon.
```
~/new_pytorch/pytorch/test$ python test_c10d.py
...............
----------------------------------------------------------------------
Ran 15 tests in 13.099s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10557
Reviewed By: ailzhang
Differential Revision: D9353286
Pulled By: teng-li
fbshipit-source-id: 5a722975beaa601203f51c723522cc881f2d2090
Summary:
Properly annotated all APIs for the CPU front end. Checked with cmake using
cmake -DUSE_ATEN=ON -DUSE_CUDA=OFF -DBUILD_ATEN=ON
and resulting libcaffe2.so has about 11k symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10504
Reviewed By: ezyang
Differential Revision: D9316491
Pulled By: Yangqing
fbshipit-source-id: 215659abf350af7032e9a4b0f28a856babab2454
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10439
Update Im2Col-related functions in preparation for group conv in NHWC order.
Reviewed By: houseroad
Differential Revision: D9285344
fbshipit-source-id: 1377b0243acb880d2ad9cf73084529a787dcb97d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10528
Adding 2 features to core and model_helper:
- reroute_tensor, which supports op insertion at the net level
- model_helper complete net and cut net, used for full graph analysis
Differential Revision: D9330345
fbshipit-source-id: 56341d3f500e72069ee306e20266c8590ae7985a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10053
Tensor in Pytorch 1.0 will have
Tensor -> TensorImpl -> Storage -> StorageImpl
In this diff we split Storage from Tensor in order to align with this design.
We'll have Tensor -> Storage -> StorageImpl after this diff
Reviewed By: dzhulgakov
Differential Revision: D9076734
fbshipit-source-id: ea9e1094ecf8c6eaeaa642413c56c6a95fb3d14e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10526
Resubmitting these changes. Previously they caused issues with multifeed, which I fixed with D9280622
Reviewed By: yinghai
Differential Revision: D9327323
fbshipit-source-id: ec69428039b45c6221a5403b8fe9a83637857f04
Summary:
Implemented via a wrapper, thank you Richard for the suggestion!
Fixes: #9929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10067
Differential Revision: D9083388
Pulled By: soumith
fbshipit-source-id: 9ab21cd35278b01962e11d3e70781829bf4a36da
Summary:
This should make ASAN tests run faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9902
Differential Revision: D9032986
Pulled By: yf225
fbshipit-source-id: 3d2edec2d7ce78bc995d25865aa82ba6d3f971d0
Summary:
Pull in a FP16 fix for a compilation bug when using the Intel Compiler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10548
Differential Revision: D9349469
Pulled By: Maratyszcza
fbshipit-source-id: 43e6dc5c3c18319d31eca23426770c73795feec5
Summary:
In my environment, it looks like setup.py hangs when running
```
FULL_CAFFE2=1 python setup.py build_deps
```
Removing this fixes things, but we might also want to look at `tests_require`, which came over from `setup_caffe2.py`.
cc pjh5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10530
Differential Revision: D9349597
Pulled By: orionr
fbshipit-source-id: 589145eca507dfaf16386884ee2fbe60299660b4
Summary:
This PR removes a couple of macros throughout TH* as part of the refactoring effort for ATen. Removing these macros should avoid confusion among developers who are trying to move things from TH* to ATen. This PR is part of the THCNumerics deprecation that I have been working on, following up on mruberry's https://github.com/pytorch/pytorch/pull/9318. I am separating these two commits to see if removal of these macros doesn't upset the pytorch public CI, as well as internal builds.
- Commit 1248de7baf removes the code paths guarded by `CUDA_HALF_INSTRUCTIONS` macro. Since the macro was removed in commit 2f186df52d, `ifdef CUDA_HALF_INSTRUCTIONS` would return false and hence the code path that is kept after this change is for the false case of `ifdef CUDA_HALF_INSTRUCTIONS`
- Commit 520c99b057 removes the code paths guarded by `CUDA_HALF_TENSOR` macro. Since Pytorch now provides support for only CUDA 8.0 and above, `CUDA_HALF_TENSOR` is always true since CUDA 8.0 satisfies `CUDA_HAS_FP16` and hence, the code path that is kept after this change is for the true case of `ifdef CUDA_HALF_TENSOR`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10147
Differential Revision: D9345940
Pulled By: soumith
fbshipit-source-id: c9392261dd432d304f1cdaf961760cbd164a59d0
Summary:
This is the first of two changes that are supposed to improve how we handle RNNs in the JIT. They still get traced as `PythonOp`s, but now it will be much easier to actually expose them to the JIT as e.g. `aten::lstm`, and ignore the Python interpreter entirely. This needs some symbolic adjustments that will be part of a second PR.
Even when we fix symbolics, there will still be a bit of a problem with statefulness of the cuDNN API (we need a mutable cache for the dropout state, but our IR has no way of representing that).
zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10481
Reviewed By: ezyang
Differential Revision: D9341113
Pulled By: apaszke
fbshipit-source-id: 0ae30ead72a1b12044b7c12369d11e5ca8ec30b5
Summary:
In the shortcut for n_sample=1, when category 0 has 0 weight,
we should not map the (uniform) sample 0 to category 0.
The conversion uniform->multinomial was apparently written to work on
a (0,1] range (like curand uses), but PyTorch uses a [0,1) range.
Fixes: #4858. Thank you, Roy Fejgin for reporting.
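An illustrative repro sketch of the case being fixed:
```python
import torch

# Category 0 has zero weight, so it must never be sampled,
# even through the n_sample=1 shortcut path.
weights = torch.tensor([0.0, 1.0, 1.0])
for _ in range(1000):
    assert torch.multinomial(weights, num_samples=1).item() != 0
```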
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9960
Reviewed By: soumith
Differential Revision: D9341793
Pulled By: ailzhang
fbshipit-source-id: 6b1a96419a7bc58cc594f761f34c6408ff6354cf
Summary:
Since we can't specify a version number to `choco install curl`, we should not assume that `7.57.0` is the curl version in the Windows AMI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10476
Differential Revision: D9303129
Pulled By: yf225
fbshipit-source-id: 198544be68330860fbcf93c99bc995f4e280bda7
Summary:
Support broadcasting in _kl_categorical_categorical
this makes it possible to do:
```
import torch.distributions as dist
import torch
p_dist = dist.Categorical(torch.ones(1,10))
q_dist = dist.Categorical(torch.ones(100,10))
dist.kl_divergence(p_dist, q_dist)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10533
Differential Revision: D9341252
Pulled By: soumith
fbshipit-source-id: 34575b30160b43b6c9e4c3070dd7ef07c00ff5d7
Summary:
Two tests in the 'nn' test bucket may fail when the torch.half
(float16) data type is used. The assertions used in the tests
intend to allow slight floating point imprecision in the results,
but the tolerances used for the comparisons are too strict for
the half type.
Relax the tolerances so that slight float16 imprecision won't
cause test failures.
The affected tests are:
- test_variable_sequence_cuda
- test_Conv2d_groups_nobias
For more information, see issue:
https://github.com/pytorch/pytorch/issues/7420
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10519
Differential Revision: D9343751
Pulled By: soumith
fbshipit-source-id: 90aedf48f6e22dd4fed9c7bde7cd7c7b6885845a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10512
SubtreeMatchCriteria now becomes a graph of MatchNode
MatchNode consists of NodeMatchCriteria, nonTerminal and count. This is a cleaner internal representation of the data structure and will bring us much closer to DAG matching.
Note that I still keep the debugString method because convertToDotGraph doesn't currently work with Subgraph.
Reviewed By: bwasti
Differential Revision: D9321695
fbshipit-source-id: 58a76f007a9a95d18cf807d419c2b595e9bc847f
Summary:
Optimize max and min reduction for the ATen CPU path; the current code path from the TH module runs sequentially on CPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10343
Differential Revision: D9330799
Pulled By: ezyang
fbshipit-source-id: 5b8271e0ca3e3e73f88a9075aa541c8756001b7c
Summary:
I've implemented affine grid generation for volumetric (5d) inputs. The implementation is based off of the spatial implementation, extended by one dimension. I have a few questions about my implementation vs. the existing one that I will add inline.
I have some extensive test cases for the forward pass here: https://gist.github.com/elistevens/6e3bfb20d8d0652b83bd16b3e911285b However, they use `pytest.fixture` extensively, so I'm not sure the best way to incorporate them into the pytorch test suite. Suggestions? I have not tested backwards at all.
Diff probably best viewed with whitespace changes ignored.
Thanks for considering!
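An illustrative call for the volumetric case (a sketch, assuming the 5-D overload mirrors the spatial affine_grid signature):
```python
import torch
import torch.nn.functional as F

N, C, D, H, W = 2, 1, 4, 5, 6
# One 3x4 affine matrix per batch element (identity here).
theta = torch.eye(3, 4).unsqueeze(0).expand(N, 3, 4)
grid = F.affine_grid(theta, size=(N, C, D, H, W))
print(grid.shape)  # expected: (N, D, H, W, 3)
```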
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8322
Differential Revision: D9332335
Pulled By: SsnL
fbshipit-source-id: 1b3a91d078ef41a6d0a800514e49298fd817e4df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10395
Order switch ops (NCHW2NHWC and NHWC2NCHW) were only supporting 2D images.
This diff generalizes them to 1D and 3D, and also adds a unit test we didn't have.
Reviewed By: protonu
Differential Revision: D9261177
fbshipit-source-id: 56e7ec54c9a8fb71781ac1336f3f28cf024b4bda
Summary:
We can't rely on the ATen fallback pathway here because we need to parse out the constant attributes explicitly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10513
Reviewed By: dzhulgakov
Differential Revision: D9322133
Pulled By: jamesr66a
fbshipit-source-id: 52af947e6c44532ef220cb4b94838ca838b5df06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10390
Fixed a bug in box_with_nms_limit where it may produce more bounding boxes than specified.
* The original code first finds the threshold for the boxes at the 'detectons_per_im' position, and filters out boxes scoring lower than the threshold.
* In some cases where multiple boxes have the same threshold, the op will return more boxes than 'detectons_per_im'.
Reviewed By: wat3rBro
Differential Revision: D9252726
fbshipit-source-id: 63f40829bcd275cb181692bc7547c384cee01499
Summary:
Background: we run pytorch in embedded C++ pipelines, running in C++ GUIs in https://github.com/Kitware/VIAME and without this addition, the call was failing with the below error, but only on certain windows platforms/configurations:
OSError: [WinError6] The handle is invalid
At:
C:\Program Files\VIAME\Python36\site-packages\torch\cuda_init_.py(162):_lazy_init
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): <lambda>
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(182): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(176): _apply
C:\Program Files\VIAME\Python36\site-packages\torch\nn\modules\module.py(249): cuda
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\arrows\pytorch\pytorch_resnet_f_extractor.py(74):_init_
C:\Program Files\VIAME\lib\python3.6None\site-packages\kwiver\processes\resnet_descriptors.py(132): _configure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10379
Differential Revision: D9330772
Pulled By: ezyang
fbshipit-source-id: 657ae7590879004558158d3c4abef2ec11d9ed57
Summary:
Breaking out of #8338
This PR is a workaround for a bug with CUDA9.2 + GCC7.
Here is the error this PR fixed:
.../pytorch/caffe2/operators/elementwise_ops.h: In constructor ‘caffe2::BinaryElementwiseWithArgsOp<InputTypes, Context, Functor, OutputTypeMap>::BinaryElementwiseWithArgsOp(const caffe2::OperatorDef&, caffe2::Workspace*)’:
.../pytorch/caffe2/operators/elementwise_ops.h:106:189: error: ‘GetSingleArgument<bool>’ is not a member of ‘caffe2::BinaryElementwiseWithArgsOp<InputTypes, Context, Functor, OutputTypeMap>’
BinaryElementwiseWithArgsOp(const OperatorDef& operator_def, Workspace* ws)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10510
Reviewed By: orionr
Differential Revision: D9319742
Pulled By: mingzhe09088
fbshipit-source-id: ce59e3db14539f071f3c20301e77ca36a6fc3f81
Summary:
Previously, it was easy to do `x[0].accessor<float, 2>()`. However, x[0] is a temporary, so the accessor would point to invalid strides/sizes and probably segfault. With this change, such unsafe code is a compile error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10518
Reviewed By: goldsborough
Differential Revision: D9329288
Pulled By: ebetica
fbshipit-source-id: d08763bee9a19a898b9d1ea5ba648f27baa1992f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10514
Fix the bug which breaks the Windows build in fused_rowwise_random_quantization_ops.h.
Reviewed By: ezyang, jspark1105
Differential Revision: D9322291
fbshipit-source-id: a6a27e87423b6caa973414ffd7ccb12076f2e1e4
Summary:
setup.py is the official install script; setup_caffe2.py is not used anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10520
Reviewed By: yinghai
Differential Revision: D9325548
Pulled By: bddppq
fbshipit-source-id: 3dda87f3dff061b574fd1d5c91859044f065ee33
Summary:
After this, all combinations of {String frontend, Python AST Frontend}{Python 3-style type annotations, MyPy-style type comments}{Script method, Script function} should properly accept type annotations.
Possible TODOs:
- Clean up the functions marked HACK
- Clean up the Subscript tree-view to better match the Python AST versions
- Can we use this for Python functions? That's the only place annotations.get_signature() is still needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10279
Differential Revision: D9319726
Pulled By: jamesr66a
fbshipit-source-id: b13f7d4f066b0283d4fc1421a1abb9305c3b28fa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10267
isSubtreeMatch now returns a SubtreeMatchResult which contains a match flag and a debugMessage string that contains the reason why a subtree is not matched (if requested).
Reviewed By: bwasti
Differential Revision: D9182429
fbshipit-source-id: 530591fad592d02fb4c31fc398960a14ec90c86a
Summary:
Provided Python bindings for these four ops. Also provided an NCCL binding test.
Based on https://github.com/pytorch/pytorch/pull/10058
Please only review init.cpp and the test file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10159
Reviewed By: yf225
Differential Revision: D9323192
Pulled By: teng-li
fbshipit-source-id: b03822009d3a785ec36fecce2fc3071d23f9994e
Summary:
Added
- Reduce (both NCCL and MPI)
- AllGather (both NCCL and MPI)
- Gather (MPI)
- Scatter (MPI)
for c10d process groups. This basically finalizes all supported ops for C10d to match THD.
All ops are tested as well.
```
mpirun -np 8 ./ProcessGroupMPITest
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
Test successful
```
```
./ProcessGroupNCCLTest
Allreduce test successful
Broadcast test successful
Reduce test successful
Allgather test successful
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10058
Reviewed By: yf225
Differential Revision: D9316312
Pulled By: teng-li
fbshipit-source-id: 6a6253268d34332327406b1f87335d1402f7133f
Summary:
After talking to users of the C++ API we found that having the tensor type be `autograd::Variable` causes more complications than having it be `at::Tensor`. It used to be a problem because `at::Tensor` didn't have the "autograd API" of variable (e.g. `detach()` or `grad()` methods), but those methods are now on `at::Tensor`. As such, we want to make a last big breaking change to have the tensor type be `at::Tensor`, while factory methods like `torch::ones` will return `Variable`s disguised as `at::Tensor`. This will make many things easier, like calling functions in ATen that take vectors of tensors.
This PR makes a small step in this direction by updating the optimizer classes to not use `.data()` on `Variable` to access the underlying `at::Tensor`. Using `.data()` is effectively a hack to work around our modification rules for tensors that require grad. The proper way of doing things is to use `with torch.no_grad` or equivalently `NoGradGuard` in C++ to guard in-place operations.
The next step can then simply redefine `torch::Tensor` to be `at::Tensor`. This transition should be smooth, since all methods available on `Variable` are at this point available on `at::Tensor`.
For this PR I:
1. Modified the implementations of optimizers to not use `.data()`. This means the implementations are now different from PyTorch, which still uses the legacy method of using `.data`.
2. To properly verify (1), I added more fine-grained test cases to our optimizer tests, e.g. `SGD` with and without `weight_decay`, then with `nesterov` etc. Generally more tests = more happy!
3. Minor cleanup of the optimizer codebase
ebetica apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10490
Differential Revision: D9318229
Pulled By: goldsborough
fbshipit-source-id: fb386700f37840542bc5d323f308ea88fe5ea5c5
Summary:
Now, running `python test/onnx/test_operators.py --no-onnx` won't introduce any ONNX Python dependency. (No onnx/protobuf Python packages need to be installed.)
The major changes:
- output pbtxt from C++ exporter directly, so the floating format may be slightly different. (This should be fine, since it's just to guard ONNX exporting.)
- ONNX python packages are only imported if we run the ONNX related checks. Those checks are disabled when using `--no-onnx` flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10151
Reviewed By: jamesr66a
Differential Revision: D9130706
Pulled By: houseroad
fbshipit-source-id: ea28cf5db8399929179698ee535137f209e9ce6f
Summary:
There are three classes `RNNCell`, `LSTMCell`, `GRUCell` inherited from `RNNCellBase`, all defining the identical initialization function `reset_parameters`. Let's move it to the common base.
Another option is to have different initialization for RNN, LSTM and GRU. Maybe those weights whose output is processed with sigmoid (i.e. gain=1) should be initialized differently from those going to tanh (gain=5/3)?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10399
Differential Revision: D9316978
Pulled By: SsnL
fbshipit-source-id: a2d9408f0b5c971a3e6c3d42e4673725cf03ecc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10244
Use CAFFE_ENFORCE_EQ(x, y) instead of CAFFE_ENFORCE(x == y) in conv_op_impl.h for error messages with more information.
Reviewed By: viswanathgs
Differential Revision: D9177091
fbshipit-source-id: cf8d10afec1ce6793d3ae0b62f05648722a4130b
Summary:
It just calls into `ninja install`. For iterative work on
libtorch.so/_C.so,
`python setup.py rebuild_libtorch develop` should provide quick iteration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10036
Differential Revision: D9317869
Pulled By: anderspapitto
fbshipit-source-id: 45ea45a1b445821add2fb9d823a724fc319ebdd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10389
Added some unit test for box_with_nms_limit_op.
Reviewed By: wat3rBro
Differential Revision: D9237860
fbshipit-source-id: 2d65744bd387314071b68d2a0c934289fc64a731
Summary:
Test only for existence for now. I had to skip a lot of them, so there is a FIXME in the test.
Also I'm not testing torch.* because of a namespace issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10311
Differential Revision: D9196341
Pulled By: SsnL
fbshipit-source-id: 9c2ca1ffe660bc1cc664474993f8a21198525ccc
Summary:
- Exposed get_debug_graph for ScriptModule (gets the debug graph for its
forward Method)
- Added forward/backward expect tests for lstm and milstm cells. These
are intended to prevent regressions
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10506
Differential Revision: D9316590
Pulled By: zou3519
fbshipit-source-id: 3c2510d8363e9733ccbc5c7cc015cd1d028efecf
Summary:
This commit adds the ability to insert a node with inputs, using the schema to check the inputs are valid types, fill in any default values, and perform standard implicit conversions. Since it is schema based, it will discover and use the right overload.
Constructors to `NamedValue` enable it to be constructed using `IValue` constants so it is possible to use constant values in the input list as well:
```
g.insert(aten::add, {v, 3});
```
Keyword arguments are also supported:
```
g.insert(aten::add, {v}, {{"other", t}, {"scalar", 1}});
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10198
Differential Revision: D9307252
Pulled By: zdevito
fbshipit-source-id: 644620aa85047d1eae1288383a619d50fec44d9b
Summary:
AffineChannel is being used by public Detectron models, e.g. Mask-RCNN and Faster-RCNN. This PR folds this op into convolution the same way as BN to speed up inference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10293
Differential Revision: D9276789
Pulled By: yinghai
fbshipit-source-id: fbf6dd2c1be05f5713f760752e7245b1320a122b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10100
nomnigraph has until this point tried to ignore external input and output, as they aren't very well defined (does order matter?), but for DCE and some of Keren's work they are becoming necessary. I went ahead and added this to the core nomnigraph converter.
Reviewed By: yinghai
Differential Revision: D9105487
fbshipit-source-id: a2e10e3cc84515611d6ab7d4bc54cf99b77729c0
Summary:
Fixes #10456
The graph fuser was fusing groups containing prim::FusedConcat (the producer) with other ops (the consumer) if the consumer was fusable. For example,
```
import torch
@torch.jit.script
def fn(x, y, z):
    x1 = x + y
    y1 = x - y
    w = torch.cat([x1, y1])
    return w + z
x = torch.randn(2, 2, dtype=torch.float, device='cpu')
y = torch.randn(2, 2, dtype=torch.float, device='cpu')
z = torch.randn(4, 2, dtype=torch.float, device='cpu')
fn(x, y, z)
fn.graph_for(x, y, z)
```
produced the following graph:
```
graph(%x : Float(2, 2)
%y : Float(2, 2)
%z : Float(4, 2)) {
%3 : int = prim::Constant[value=1]()
%y1 : Float(2, 2) = aten::sub(%x, %y, %3)
%8 : int = prim::Constant[value=0]()
%14 : Float(4, 2) = prim::FusionGroup_0[device=-1](%z, %y1, %x, %y)
return (%14);
}
with prim::FusionGroup_0 = graph(%1 : Float(4, 2)
%5 : Float(2, 2)
%7 : Float(2, 2)
%8 : Float(2, 2)) {
%11 : int = prim::Constant[value=1]()
%9 : int = prim::Constant[value=1]()
%x1 : Float(2, 2) = aten::add(%7, %8, %9)
%w : Float(4, 2) = prim::FusedConcat[dim=0](%x1, %5)
%2 : int = prim::Constant[value=1]()
%3 : Float(4, 2) = aten::add(%w, %1, %2)
return (%3);
}
```
This is a problem because it violates two invariants:
1) all inputs to the FusionGroup must have the same size
2) prim::FusedConcat's output must not be used inside the FusionGroup
This PR fixes this problem by checking if the output to a FusionGroup came from a prim::FusedConcat node when deciding whether to fuse the consumer and producer.
If the producer is a value that came from a prim::FusedConcat node in a FusionGroup, then consumer & producer do not get fused.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10466
Differential Revision: D9296686
Pulled By: zou3519
fbshipit-source-id: ed826fa9c436b42c04ca7d4d790cece804c162bd
Summary:
A bootcamper was confused by the word "locally" and thought it meant on his macbook as opposed to his FB dev machine. Besides the confusion for the FB context, the word "locally" isn't really necessary at all
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10495
Reviewed By: soumith
Differential Revision: D9311480
Pulled By: goldsborough
fbshipit-source-id: 2779c7c60f903a1822a50d140ed32a346feec39e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10426
We were seeing linker errors for TanhGradientOperator in multifeed. Since we only use the float specialization, we might as well define it that way.
Reviewed By: yinghai
Differential Revision: D9280622
fbshipit-source-id: d2ffb698c73a84bb062de5e1f3bda741330e4228
Summary:
This operator implements b-bit (1/2/4/8) stochastic quantization of a floating-point
matrix in a row-wise fashion. 8/b floating-point values are packed into a byte
and returned in a uint8 tensor. PR: https://github.com/pytorch/pytorch/pull/8629
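A rough sketch of the row-wise idea (illustrative only; the operator's exact scaling and 8/b-per-byte bit packing are omitted):
```python
import numpy as np

def stochastic_quantize_row(row, bits=4):
    # Scale the row into [0, 2^bits - 1] and round stochastically, so the
    # quantized values are unbiased estimates of the scaled originals.
    levels = (1 << bits) - 1
    lo, hi = row.min(), row.max()
    scaled = (row - lo) / max(hi - lo, 1e-12) * levels
    floor = np.floor(scaled)
    q = floor + (np.random.rand(*row.shape) < (scaled - floor))
    return q.astype(np.uint8), lo, hi
```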
Reviewed By: harouwu
Differential Revision: D8493264
fbshipit-source-id: 01f64066568a1e5a2b87c6d2134bd31cdf119c02
Summary:
* some small leftovers from the last PR review
* enable more unit test sets for CI
* replace use of hcRNG w/ rocRAND (docker image was already updated w/ newer rocRAND)
* use rocBLAS instead of hipBLAS to allow convergence w/ Caffe2
* use strided_batched gemm interface also from the batched internal interface
* re-enable Dropout.cu as we now have philox w/ rocRAND
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10406
Reviewed By: Jorghi12
Differential Revision: D9277093
Pulled By: ezyang
fbshipit-source-id: 7ef2f6fe4ead77e501ed7aea5c3743afe2466ca2
Summary:
```
This removes PyObjectFinalizer. We were seeing SIGSEGV at exit in some
programs that use multiprocessing. The backtrace pointed to
StorageRef.__del__ being called from subtype_dealloc. My guess is that
the Python interpreter was shutdown before all C++ Storage objects were
deallocated. Deallocating the C++ Storage called the finalizer which
called back into Python after it was no longer safe to do so.
This avoids a callback from C++ into Python during Storage finalization.
Instead, dead Storage objects (expired weak references) are collected
periodically when shared_cache exceeds a limit. The limit is scaled with
2x the number of live references, which places an upper bound on the
amount of extra memory held by dead Storage objects. In practice, this
should be very small.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10407
Differential Revision: D9272400
Pulled By: colesbury
fbshipit-source-id: ecb14d9c6d54ffc91e134c34a4e770a4d09048a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10278
Translation to Backend happens immediately before we go into the
Type universe; otherwise we use TensorTypeId.
I allocated TensorTypeId corresponding exactly to existing ATen
Backend. Only CPUTensorId and CUDATensorId are relevant in the
Caffe2 universe.
Reviewed By: gchanan
Differential Revision: D9184060
fbshipit-source-id: 9d3989c26f70b90f1bbf98b2a96c57e2b0a46597
Summary:
This PR provides 4 fixes / features:
1. torch::nn::Cloneable inherits virtually from torch::nn::Module. We want to pass around a module with new functions, and the best way to do this is to do a diamond inheritance pattern, i.e.
```c++
struct MySuperModuleImpl : virtual public torch::nn::Module {
  virtual void myFunction() = 0;
};
template <typename Derived>
struct MySuperModule : public torch::nn::Cloneable<Derived>, public MySuperModuleImpl {};
struct MyModule : public MySuperModule<MyModule> {
  void myFunction() override;
};
```
This way, we can simply pass around MySuperModuleImpl around instead of torch::nn::Module.
2. Optimizer options are public now, since there's no way to decay the LR or modify it during training otherwise
3. Serialization functions created autograd history and called copy_! Bad!
4. Optimizers did not create buffers after add_parameters was called.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9837
Reviewed By: goldsborough
Differential Revision: D9199746
Pulled By: ebetica
fbshipit-source-id: 76d6b22e589a42637b7cc0b5bcd3c6b6662fb299
Summary:
Explanation copied from code:
// Motivation about the gflags wrapper:
// (1) We would need to make sure that the gflags version and the non-gflags
// version of Caffe2 are going to expose the same flags abstraction. One should
// explicitly use caffe2::FLAGS_flag_name to access the flags.
// (2) For flag names, it is recommended to start with caffe2_ to distinguish it
// from regular gflags flags. For example, do
// CAFFE2_DEFINE_BOOL(caffe2_my_flag, true, "An example");
// to allow one to use caffe2::FLAGS_caffe2_my_flag.
// (3) Gflags has a design issue that does not properly expose the global flags,
// if one builds the library with -fvisibility=hidden. The current gflags (as of
// Aug 2018) only deals with the Windows case using dllexport, and not the Linux
// counterparts. As a result, we will explicitly use CAFFE2_EXPORT to export the
// flags defined in Caffe2. This is done via a global reference, so the flag
// itself is not duplicated - under the hood it is the same global gflags flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10444
Differential Revision: D9296726
Pulled By: Yangqing
fbshipit-source-id: a867d67260255cc46bf0a928122ff71a575d3966
Summary:
On Windows, the FindRocksDB script doesn't detect a rocksdb installation built by cmake.
And it doesn't include/link the RocksDB dependencies either, like:
* `Snappy`
* `Shlwapi.lib`
* `Rpcrt4.lib`
This PR tries to detect the package in config mode first before using the private find module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7315
Differential Revision: D9287587
Pulled By: Yangqing
fbshipit-source-id: 314a36a14bfe04aa45013349c5537163fb4c5c00
Summary:
There's no need to hack.
Using `CUDA_LINK_LIBRARIES_KEYWORD` is the normal way.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10437
Differential Revision: D9287579
Pulled By: Yangqing
fbshipit-source-id: d3d575ea8c3235576ba971e4b7493ddb435f92f3
Summary:
Building caffe2 and pytorch separately ends up with duplicated symbols, as they now share some basic libs, and it's especially bad for the registry. This PR fixes our CI and builds them in one shot with shared symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10427
Reviewed By: bddppq
Differential Revision: D9282372
Pulled By: yinghai
fbshipit-source-id: 0514931ea88277029a68fa5368ff4336472f132e
Summary:
Optimize the max_pooling operation for the inference path by setting the "inference" flag for the underlying MKL-DNN, saving the computation and storage of max indices, which are only needed for training. To keep the API compatible, training mode is still the default, and inference mode is set in the optimizeForIdeep path.
Tests show the speed-up of a single max_pooling operation is up to 7X on BDW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10156
Differential Revision: D9276755
Pulled By: yinghai
fbshipit-source-id: ad533d53aabb8ccb3b592da984d6269d9b794a8a
Summary:
This should just work now that sizes/strides are unified between TH and ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10414
Differential Revision: D9274681
Pulled By: gchanan
fbshipit-source-id: 69eb766f4e3a5b6c57b15837cffdef513b6d7817
Summary:
```
Correctly share CUDA Parameters, requires_grad and hooks.
Previously, the following was true:
- If you put a Parameter for a CUDA tensor
in multiprocessing queue (or otherwise tried to transfer it),
this failed, saying that we cannot pickle CUDA storage.
This is issue #9996.
- If you put a leaf Tensor that requires_grad=True through the
multiprocessing queue, it would come out the other end as
requires_grad=False (It should have come out the other end
as requires_grad=True). Similarly, backwards hooks were
lost.
- If you put a non-leaf Tensor that requires_grad=True through
the multiprocessing queue, it would come out the other end
as requires_grad=False.
The root cause for the first issue was that implementation of
reductions for Parameter used the superclass implementation
(tensor) in __reduce_ex__, but this always picks up the
non-ForkingPickler reduction, which doesn't work with CUDA tensors.
So, we registered a new ForkingPickler specifically for Parameter,
and adjusted the code to correctly rewrap a Tensor in a Parameter
if it was originally a parameter.
While working on this, we realized that requires_grad and backwards
hooks would not be preserved in the ForkingPickler reduction
implementation. We fixed the reducer to save these parameters.
However, Adam Paszke pointed out that we shouldn't allow sending
requires_grad=True, non-leaf Tensors over a multiprocessing
queue, since we don't actually support autograd over process
boundary. We now throw an error in this case; this may cause
previously working code to fail, but this is easy enough to fix;
just detach() the tensor before sending it. The error message says
so.
Fixes #9996.
```
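A sketch of the new requirement for non-leaf tensors (hypothetical minimal example):
```python
import torch
import torch.multiprocessing as mp

def consumer(q):
    t = q.get()
    print(t.requires_grad)  # False here, since we detached before sending

if __name__ == "__main__":
    q = mp.Queue()
    x = torch.randn(2, 2, requires_grad=True)
    y = x * 2            # non-leaf tensor that requires grad
    q.put(y.detach())    # sending y directly now raises; detach() it first
    p = mp.Process(target=consumer, args=(q,))
    p.start()
    p.join()
```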
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10220
Differential Revision: D9160746
Pulled By: ezyang
fbshipit-source-id: a39c0dbc012ba5afc7a9e646da5c7f325b3cf05c
Summary:
Closes #9702.
cc jph00
Commit structure:
1. Change the index calculation logic. I will explain using 1-D for simplicity.
Previously we have (in pseudo code):
```
// 1. get the float locations from grid
scalar_t x = from_grid()
// 2. find the integral surrounding indices
int x_left = floor(x)
int x_right = x_left + 1
// 3. calculate the linear interpolate weights
scalar_t w_left = x_right - x
scalar_t w_right = x - x_left
// 4. manipulate the integral surrounding indices if needed
// (e.g., clip for border padding_mode)
x_left = manipulate(x_left, padding_mode)
x_right = manipulate(x_right, padding_mode)
// 5. interpolate
output_val = interpolate(w_left, w_right, x_left, x_right)
```
This is actually incorrect (and also unintuitive) because it calculates the
weights before manipulating out-of-boundary indices. Fortunately, this
isn't manifested in both of the current supported modes, `'zeros'` and
`'border'` padding:
+ `'zeros'`: doesn't clip
+ `'border'`: clips, but for out-of-bound `x` both `x_left` and `x_right` are
clipped to the same value, so weights don't matter
But this is a problem with reflection padding, since after each time we reflect,
the values of `w_left` and `w_right` should be swapped.
So in this commit I change the algorithm to (numbers corresponding to the
ordering in the above pseudo-code)
```
1. get float location
4. clip the float location
2. find the integral surrounding indices
3. calculate the linear interpolate weights
```
In the backward, because of this change, I need to add new variables to track
`d manipulate_output / d manipulate_input`, which is basically a multiplier
on the gradient calculated for `grid`. From benchmarking this addition doesn't
cause obvious slow downs.
2. Implement reflection padding. The indices keep being reflected until
they fall within the boundary.
Added variant of `clip_coordinates` and `reflect_coordinates` to be used in
backward. E.g.,
```cpp
// clip_coordinates_set_grad works similarly to clip_coordinates except that
// it also returns the `d output / d input` via pointer argument `grad_in`.
// This is useful in the backward pass of grid_sampler.
scalar_t clip_coordinates_set_grad(scalar_t in, int64_t clip_limit, scalar_t *grad_in)
```
For example, if `in` is clipped in `'border'` mode, `grad_in` is set to `0`.
If `in` is reflected **odd** times in `'reflection'` mode, `grad_in`
is set to `-1`.
3. Implement nearest interpolation.
4. Add test cases
5. Add better input checking
Discussed with goldsborough for moving `operator<<` of `at::Device`,
`at::DeviceType` and `at::Layout` into `at` namespace. (Otherwise
`AT_CHECK` can't find them.)
6. Support empty tensors. cc gchanan
+ Make empty tensors not acceptable by cudnn.
+ Add `AT_ASSERT(kernel block size > 0)` if using `GET_BLOCKS`
+ Cache `numel` in `TensorGeometry`
I was going to use `numel` to test if cudnn descriptor should accept a
tensor, but it isn't used eventually. I can revert this if needed.
7. Add more test cases, including on input checking and empty tensors
8. Remove an obsolete comment
9. Update docs. Manually tested by generating docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10051
Differential Revision: D9123950
Pulled By: SsnL
fbshipit-source-id: ac3b4a0a36b39b5d02e83666cc6730111ce216f6
Summary:
I am using this to test a CI job to upload pip packages, and so am using the Caffe2 namespace to avoid affecting the existing pytorch packages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9544
Reviewed By: orionr
Differential Revision: D9267111
Pulled By: pjh5
fbshipit-source-id: a68162ed29d2eb9ce353d8435ccb5f16c3b0b894
Summary:
This was used as a convenient way for us to convert c1 models. Now that conversion is more or less done, we should probably require any users who need to convert c1 models to explicitly install c1. This PR removes the explicit c1 proto (which was copied from c1) in favor of explicit installation.
Note that caffe_translator will still work properly; the only difference is that users now need to install c1 separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10380
Differential Revision: D9267981
Pulled By: Yangqing
fbshipit-source-id: a6ce5d9463e6567976da83f2d08b2c3d94d14390
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10360
It seems `lengths_host_.CopyFrom(lengthsInput, &context_);` is asynchronous w.r.t. the host while `lengths_host_.CopyFrom(lengthsInput);` is synchronous.
However, according to jerryzh168, `lengths_host_.CopyFrom(lengths, &context_); context_.FinishDeviceComputation();` is the safest way to guarantee synchronization.
Reviewed By: jerryzh168
Differential Revision: D9197923
fbshipit-source-id: 827eb63d9d15c1274851e8301a793aed39d4fa6b
Summary:
As in the title. I also did a small refactor that let us lose almost 400 LOC. This is a first step in moving the RNN code to C++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10305
Reviewed By: ezyang
Differential Revision: D9196227
Pulled By: apaszke
fbshipit-source-id: 54da905519aade29baa63ab1774a3ee1db5663ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10163
- Remove dependency on caffe2/core/common.h for ATen/core/typeid.h
Unfortunately, Windows seems to rely on typeid.h including this
header, so it is still included from the forwarding header
caffe2/core/typeid.h
- Deduplicate Demangle/DemangleType with their ATen equivalents
Reviewed By: smessmer
Differential Revision: D9132432
fbshipit-source-id: 21f2c89e58ca1e795f1b2caa316361b729a5231b
Summary:
Copy of #10191 because these changes didn't land with the diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10394
Differential Revision: D9260816
Pulled By: li-roy
fbshipit-source-id: 7dc16919cfab6221fda1d44e98c5b900cfb40558
Summary:
Before we had 0-dim tensors in TH, we were flexible in what we accepted with respect to the difference between size [] and size [1] tensors in backwards functions, because they were identical in TH. So we had backwards definitions that were technically incorrect but happened to work. This often masks shape issues, adds greatly to code complexity, and thus IMO isn't worth keeping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10382
Differential Revision: D9244618
Pulled By: gchanan
fbshipit-source-id: 2c29c53a8ffe8710843451202cad6b4323af10e8
Summary:
This makes clamp and relu faster (fixes #10276).
The extra copying was introduced when clamp moved to ATen and
the _th_clamp_ wrapper was used to forward to TH/THC;
we remove that and add _th_clamp(_out) instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10352
Reviewed By: ezyang
Differential Revision: D9233590
Pulled By: SsnL
fbshipit-source-id: 4f86a045498e5e577fb22656c71f171add7ed0ac
Summary:
If an `at::test` function is added, gcc can't figure out the `std::thread(test, -1)` resolution.
It is not a problem for the current code. I bumped into this when playing with native functions, but I think it is good to just prevent it from happening in the future by removing `using namespace at;`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10381
Differential Revision: D9241614
Pulled By: SsnL
fbshipit-source-id: 972ac3cecff3a50602b3fba463ae1ebd3f53d036
Summary:
When only part of the outputs of unbind are used in a backward,
the gradients for the others are undefined. This sets those
to zero in to_tensor_list.
Fixes: #9977
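A minimal sketch of the situation (only one unbind output participates in the backward):
```python
import torch

x = torch.randn(3, 4, requires_grad=True)
a, b, c = x.unbind(0)
# Only `a` is used; the gradient slices for `b` and `c` are now
# filled with zeros instead of being left undefined.
a.sum().backward()
print(x.grad)
```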
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9995
Differential Revision: D9239610
Pulled By: soumith
fbshipit-source-id: eb8d1b3f2b4e615449f9d856e10b946910df9147
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10334
Keep kEps in one place to make sure they are consistent
Reviewed By: xianjiec
Differential Revision: D9202280
fbshipit-source-id: 35d173ce1d1a361b5b8cdbf1eac423e906e7c801
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10218
SubtreeMatchCriteria now supports:
- nonTerminal flag: if this is set, it means we only match the root of the subtree and do not care about the children. Example use case: matching an "input" node without caring how the input is produced.
Additional tests for this new logic are added to subgraph_matcher_test.cc.
Subgraph matching APIs for NNGraph are also added.
(A further enhancement to make the SubgraphMatching API construct a Subgraph object and return more diagnostic information will come later.)
Reviewed By: bwasti
Differential Revision: D9156092
fbshipit-source-id: 3f28ac15d9edd474b3e0cd51fd7e6f973299d061
Summary:
The current Dockerfile builds PyTorch using the default Python within Miniconda, which happens to be Python 3.6.
This patch allows users to specify which Python should be installed in the default Miniconda environment used by the PyTorch Dockerfile. I have tested the build for Python 2.7, 3.5, 3.6 and 3.7. Python 2.7 additionally required typing and cython.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10317
Differential Revision: D9204401
Pulled By: ezyang
fbshipit-source-id: 11355cab3bf448bbe8369a2ed1de0d409c9a2d6e
Summary:
This PR adds tracing infrastructure for custom operators. It also simplifies the tracer overall, and changes the codegen to do more metaprogramming there instead of via C++ (which was necessary for the custom op tracing).
To give an example of the tracer/metaprogramming change, what used to look like this in `VariableType.cpp`:
```
jit::tracer::PreTraceInfo trace_info;
if (jit::tracer::isTracing()) {
  trace_info = jit::tracer::preRecordTrace(jit::aten::index_select, "self", self, "dim", dim, "index", index);
}
```
is now simply the inlined version of `preRecordTrace`, minus C++ metaprogramming:
```
torch::jit::Node* node = nullptr;
if (jit::tracer::isTracing()) {
  auto& graph = jit::tracer::getTracingState()->graph;
  node = graph->create(jit::aten::index_select_out, /*outputs=*/0);
  jit::tracer::recordSourceLocation(node);
  jit::tracer::addInputs(node, "result", result);
  jit::tracer::addInputs(node, "self", self);
  jit::tracer::addInputs(node, "dim", dim);
  jit::tracer::addInputs(node, "index", index);
  graph->appendNode(node);
}
```
zdevito apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10212
Differential Revision: D9199615
Pulled By: goldsborough
fbshipit-source-id: cd4b603c1dc01340ead407228e109c99bdba2cfc
Summary:
While waiting for dropout to be fully ported to ATen, here's a performance fix for the most common dropout case. Dropout is still a Python function; I just added an efficient path to it. I could not make inplace work, because the generator always emits `return self` for inplace functions and I need to return both the original tensor and the mask, so inplace goes through the existing path. Even with the non-inplace version, since the mask is now a ByteTensor, memory used is only a little larger than for inplace dropout, thanks to the savings on the mask.
Once dropout is moved to ATen, these kernels can still be used for an efficient implementation.
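A rough Python sketch of the idea only (the PR itself adds CUDA kernels; the helper name here is made up):
```python
import torch

def dropout_with_mask(x, p=0.5, training=True):
    # Sample a byte mask once; the same mask can be reused for backward.
    if not training or p == 0.0:
        return x, torch.ones_like(x, dtype=torch.uint8)
    mask = (torch.rand_like(x) > p).to(torch.uint8)
    scale = 1.0 / (1.0 - p)
    return x * mask.to(x.dtype) * scale, mask

y, mask = dropout_with_mask(torch.randn(4, 4), p=0.5)
```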
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9666
Reviewed By: SsnL
Differential Revision: D8948077
Pulled By: ezyang
fbshipit-source-id: 52990ef769471d957e464af635e5f9b4e519567a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10228
Sometimes, for all items in the minibatch in test mode, input length will be
equal to max time steps. This avoids having to pass in an external tensor.
Differential Revision: D9174378
fbshipit-source-id: 22f7d5c311c855d9c3ac59f2a5e773279bd69974
Summary:
This PR extends the existing type and shape metadata tracing and verification done in autograd with device information. This expansion of tracing is required for #8354, is likely useful in other scenarios, and is a healthy sanity check, just like type and shape tracing.
The precise changes are:
- TypeAndShape -> InputMetadata, now includes device()
- Creating InputMetadata is simplified to just require a tensor, and callers were updated to use this simpler invocation wherever possible
- The gradient accumulator of a variable is now reset when set_data() is called if either the type or device changes, and this reset now locks to avoid contention with acquiring the gradient accumulator
- Mismatched devices during backward() will throw a runtime error, just like mismatched type and shape
- (Bonus!) Two uninitialized pointers in THCReduce are now initialized (to nullptr) to prevent build warnings
fyi colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9796
Reviewed By: goldsborough
Differential Revision: D9119325
Pulled By: ezyang
fbshipit-source-id: 76d1861b8d4f74db0575ff1f3bd965e18f9463de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10274
Good C++ libraries don't claim un-namespaced identifiers
like DISABLE_COPY_AND_ASSIGN, so re-prefix this.
Follow-up fix: codemod Caffe2 to use the new macro and delete
the forwarding definition.
Reviewed By: mingzhe09088
Differential Revision: D9181939
fbshipit-source-id: 857d099de1c2c0c4d0c1768c1ab772d59e28977c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10268
Running torch.distributed.init_process_group fails with more than ~64 processes, with various errors like connection refused or connection reset by peer. After some digging, it looks like the root cause is that all workers have to connect to master via TCP (both in Zeus init and in DataChannelTCP - look for `connect()`), and the listening socket only has a backlog of 64.
I increased the backlog to 1024, which seems like enough for reasonable purposes (the hard limit is 65535 in /proc/sys/net/core/somaxconn). There's probably a more correct way to do this that involves retries when the connection is refused.
Reviewed By: soumith
Differential Revision: D9182216
fbshipit-source-id: 2f71c4995841db26c670cec344f1e3c7a80a7936
Summary:
Previously, `tensor[i:]` was transformed to `tensor[i:-1]`. This incorrectly leaves off the last element. Noticed this when implementing slicing for list types.
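The off-by-one in plain Python, with arbitrary list values:
```python
x = [10, 20, 30, 40]
assert x[1:] == [20, 30, 40]    # what tensor[i:] should mean
assert x[1:-1] == [20, 30]      # what the old transformation produced
```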
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10286
Differential Revision: D9193292
Pulled By: michaelsuo
fbshipit-source-id: df372b815f9a3b8029830dd9e8769f9985a890e7
Summary:
I changed the name of this builtin to match Python's native style, but forgot to change the compiler error to match.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10265
Differential Revision: D9192963
Pulled By: michaelsuo
fbshipit-source-id: 225ca4cd50fbbe3b31c369deeb3123a84342aab1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10264
Since we now have DISABLE_COPY_AND_ASSIGN macro in the file,
CoreAPI is no longer an accurate name.
Reviewed By: dzhulgakov
Differential Revision: D9181687
fbshipit-source-id: a9cc5556be9c43e6aaa22671f755010707caef67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10263
Auxiliary changes that were needed:
- Add DISABLE_COPY_AND_ASSIGN to CoreAPI.h (maybe we should rename this file
now)
Reviewed By: dzhulgakov
Differential Revision: D9181321
fbshipit-source-id: 975687068285b5a94a57934817c960aeea2bbafa
Summary:
When we directly use -std=c++11, it propagates to downstream applications.
Problems:
1. gcc flags propagating to nvcc.
2. nvcc flags propagating to nvcc (which throws an error like a redeclaration of the std flag).
This PR fixes these propagation issues!
Similar problems:
https://github.com/FloopCZ/tensorflow_cc/pull/92
https://github.com/CGAL/cgal/issues/2775
Requires: CMake 3.12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10098
Differential Revision: D9187110
Pulled By: ezyang
fbshipit-source-id: 0e00e6aa3119c77a5b3ea56992ef3bbfecd71d80
Summary:
This PR for the ROCm target does the following:
* enable some unit tests on ROCm
* fix a missing static_cast that breaks BatchNorm call on ROCm
* fix BatchNorm to work on ROCm w/ ROCm warp sizes etc
* improve the pyhipify script by introducing kernel scope to some transpilations and other improvements
* fix a linking issue on ROCm
* for more unit test sets: mark currently failing tests as broken (to be fixed)
* enable THINLTO (phase one) to parallelize linking
* address the first failing of the elementwise kernel by removing non-working ROCm specialization
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10266
Differential Revision: D9184178
Pulled By: ezyang
fbshipit-source-id: 03bcd1fe4ca4dd3241f09634dbd42b6a4c350297
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10261
1. Reserve
Currently, Reserve allocates new memory while also preserving the old data in the tensor,
and Resize relies on this behavior at some call sites, e.g. https://github.com/pytorch/pytorch/blob/master/caffe2/operators/reservoir_sampling.cc#L103, where we should be using Extend.
We want to bring the semantics of Reserve more in line with std::vector, i.e. we want it to be
an optimization for memory allocation and remove the semantics of preserving the data. We'll remove the guarantee that data is preserved after Reserve, and Extend will be the only API that preserves old data when we do in-place extension of memory. This also helps with the later refactoring to split Storage from Tensor.
Also, we'll only pass the outer dimension to Reserve, which means the later dimensions should be set before we call Reserve.
2. Extend/Shrink
Previously, Extend actually meant ExtendBy and Shrink meant ShrinkTo. I would like to add an ExtendTo for convenience and change Shrink to ShrinkTo.
Old code calling Extend is still there; although it actually means ExtendBy, I think it still makes sense to keep it.
3. Usage Patterns
The expected usage patterns right now is:
```
t->Resize({0, 32, 32, 32});
t->template mutable_data<T>(); // set meta_
t->Reserve(100);
auto* t_data = t->template mutable_data<T>();
// feed data to tensor using t_data
for (int i = 0; i < 100; ++i) {
  t->Extend(1, 50, &context_);
  // you can continue to use t_data if you have reserved enough space
  // otherwise, you should call t->template mutable_data<T> again to
  // get the new data pointer since Extend will allocate new memory even
  // though the original data is preserved.
}
```
Reviewed By: ezyang
Differential Revision: D9128147
fbshipit-source-id: e765f6566d73deafe2abeef0b2cc0ebcbfebd096
Summary:
the new entrypoint is `./tools/build_pytorch_libs.sh caffe2`
this will also speed up CI builds a bit, since we will no longer be compiling all of libtorch twice
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9836
Differential Revision: D9182634
Pulled By: anderspapitto
fbshipit-source-id: 0b9a20ab04f5df2d5c4e7777e4dc468ab25b9ce2
Summary:
Turns out some people are using this via the C-API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10259
Differential Revision: D9180135
Pulled By: gchanan
fbshipit-source-id: 68f59beabf7f8093e67581d7e7ebfe8dff9e6b69
Summary:
Visual Studio Code and Visual Studio store their configurations in `FOLDER/.vscode` and `FOLDER/.vs`.
But "setup.py clean" deletes these folders because they are listed in the `.gitignore` file.
To prevent this, add a "BEGIN NOT-CLEAN-FILES" marker to the `.gitignore` file; "setup.py clean" now ignores lines after this marker.
Discussed in #10206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10233
Differential Revision: D9175515
Pulled By: ezyang
fbshipit-source-id: 24074a7e6e505a3d51382dc5ade5c65c97deda37
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10197
Support generic features in DPER2.
For now, since we only have one generic type (1), we directly add the parsed feature record to the embedding feature.
New feature types with specific structure will require corresponding code changes.
Reviewed By: itomatik
Differential Revision: D8788177
fbshipit-source-id: 9aaa6f35ece382acb4072ec5e57061bb0727f184
Summary:
Fixes #10032
When capturing an output, GraphExecutorAutogradFunction creates
SavedVariable with is_output=False and owns it:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/graph_executor.cpp#L87
Constructing SavedVariable with is_output=False makes it own a copy of
the shared_ptr<GraphExecutorAutogradFunction>, which causes a reference
cycle:
6456b944fd/torch/csrc/autograd/saved_variable.cpp (L27)
The solution in this PR is to construct the SavedVariable with
is_output=True if the captured value is an output.
Test Plan
Turn on CUDA memory checking for JitTestCase. If the test's name
includes "cuda" or "gpu", the CUDA memory checking test runs.
cc zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10222
Reviewed By: ezyang
Differential Revision: D9162995
Pulled By: zou3519
fbshipit-source-id: aeace85a09160c7a7e79cf35f6ac61eac87cbf66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10175
Previously, we had at::Device::Type and caffe2::DeviceType (from protobuf),
intended to help us distinguish between CPU, CUDA, etc. devices.
This replaces at::Device::Type entirely with at::DeviceType, which in turn
is a direct, 'enum class' version of the protobuf generated caffe2::DeviceType
'enum'. We can't eliminate the 'enum' because this would a pretty drastic
API change (enum is interconvertible with integers, enum class is not) but
we can make the two line up exactly and share code for, e.g., printing.
Reviewed By: Yangqing
Differential Revision: D9137156
fbshipit-source-id: 566385cd6efb1ed722b25e6f7849a910b50342ab
Summary:
- New concept of a message stack; you can add messages
using AppendMessage
- New concept of a caller; it's just a way to pass along
some arbitrary extra information in the exception
Coming soon is changing Caffe2 to use at::Error instead of
EnforceNotMet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10183
Differential Revision: D9139996
Pulled By: ezyang
fbshipit-source-id: 6979c289ec59bc3566a23d6619bafba2c1920de9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10224
It doesn't work with Caffe2; use AT_CORE_API from ATen/core/CoreAPI.h
instead.
Reviewed By: smessmer
Differential Revision: D9162467
fbshipit-source-id: 3c7d83c1ccb722ebac469296bdd7c3982ff461e5
Summary:
The basic game plan is to stop accessing the type_ field directly,
and instead using the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I fixed this by
reordering the static initializers. Fixes #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10210
Differential Revision: D9150697
Pulled By: ezyang
fbshipit-source-id: 89e2006c88688bcfab0dcee82dc369127c198c35
Summary:
- fixes #9141, #9301
- use logsigmoid at multilabel_soft_margin_loss to make it more stable (NOT fixing legacy MultiLabelSoftMarginCriterion)
- return (N) instead of (N, C) to match the behavior of MultiMarginLoss
- Note that with this PR, the following behavior is expected:
```
loss = F.multilabel_soft_margin_loss(outputs, labels, reduction='none')
loss_mean = F.multilabel_soft_margin_loss(outputs, labels, reduction='elementwise_mean')
loss_sum = F.multilabel_soft_margin_loss(outputs, labels, reduction='sum')
loss.sum() == loss_sum # True
loss.mean() == loss_mean # True
```
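As a side note on the logsigmoid point above, a small sketch of why it is more stable (values chosen only to show the underflow):
```python
import torch
import torch.nn.functional as F

x = torch.tensor([-200.0, 0.0, 200.0])
naive = torch.log(torch.sigmoid(x))   # sigmoid underflows at -200, so log gives -inf
stable = F.logsigmoid(x)              # stays finite: roughly [-200.0, -0.6931, 0.0]
print(naive, stable)
```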
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9965
Differential Revision: D9038402
Pulled By: weiyangfb
fbshipit-source-id: 0fa94c7b3cd370ea62bd6333f1a0e9bd0b8ccbb9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10217
It's only used in debug printing and is not that reliable anyway. If we want to implement it later, we should do it with proper accounting for shared storages.
Reviewed By: jerryzh168
Differential Revision: D9155685
fbshipit-source-id: 48320d41a0c4155645f3ba622ef88730a4567895
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9360
This implements a first set of c10 operators, namely the ones needed for the multithread predictor benchmark.
All implementations are CPU-only and experimental. They're not meant to be used in production.
They can be used, however, to test calling simple c10 MLPs from Caffe2 or PyTorch when working on these integration paths.
Reviewed By: dzhulgakov
Differential Revision: D8811698
fbshipit-source-id: 826789c38b2bfdb125a5c0d03c5aebf627785482
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9369
This adds the capability for caffe2 to call c10 operators and adds a dummy c10 sigmoid op as a proof of concept.
I used this test script to make sure it works:
```
from caffe2.python import workspace, model_helper
import numpy as np
data1 = np.random.rand(16, 100).astype(np.float32)
workspace.FeedBlob("data1", data1)
m = model_helper.ModelHelper(name="my net")
sigmoid1 = m.net.C10Sigmoid_DontUseThisOpYet("data1", "sigmoid1")
sigmoid2 = m.net.Sigmoid("data1", "sigmoid2")
workspace.RunNetOnce(m.param_init_net)
workspace.CreateNet(m.net)
data1 = np.random.rand(16, 100).astype(np.float32)
workspace.FeedBlob("data1", data1)
workspace.RunNet(m.name, 1)
print(workspace.FetchBlob("data1"))
print(workspace.FetchBlob("sigmoid1"))
print(workspace.FetchBlob("sigmoid2"))
```
(and check that both sigmoid outputs are the same)
Reviewed By: ezyang
Differential Revision: D8814669
fbshipit-source-id: eeb0e7a854727f1617a3c592a662a7e5ae226f40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10214
Seems we're passing weak pointers over C API boundaries. Need this API there too.
Reviewed By: ezyang
Differential Revision: D9154505
fbshipit-source-id: c9889689b87dad5d918f93ba231e01704b8d2479
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10130
Update some include paths to make them internally consistent
Reviewed By: ezyang
Differential Revision: D9119906
fbshipit-source-id: b44e5cab8e8e795ee18afe9ffc6caf1f2b413467
Summary:
This PR adds a way to infer the JIT/script schema of a function from its signature, and then create an operator from the schema and implementation. The implementation function is wrapped into another function, which pops values from the stack into an argument tuple, then invokes the function and pushes the return value back onto the stack, sometimes unpacking the return value if it is a tuple.
Currently the method is called `createOperator`. We may want to think of a nicer way of registering ops in tandem with `RegisterOperators`. It might be very cumbersome to add a template constructor to `Operator`, so maybe we can come up with a chaining method on `RegisterOperators` like `RegisterOperators(schema, func).op(schema.func).op(schema, func)` -- it has to work at startup time (for a static variable) though. We can solve this in another PR.
zdevito apaszke smessmer dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10048
Differential Revision: D9125975
Pulled By: goldsborough
fbshipit-source-id: de9e59888757573284a43787ae5d94384bfe8f9a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10192
- release_resources() method must be non-const because it modifies the object
- for intrusive_ptr<const MyClass>, this needs to be const_cast :(
Reviewed By: ezyang
Differential Revision: D9143808
fbshipit-source-id: 9203ff7a7ff3bec165931279371c6e75d4f0ca8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10133
This is useful for C APIs where we want to give owning pointers to/from other languages.
Reviewed By: ezyang
Differential Revision: D9121493
fbshipit-source-id: f903f5830f587b2ba69c0636ddcf1a066bbac2e0
Summary:
The PR allows int→float and float→int casts. Currently we only allow `tensor→int` and `tensor→float` casts.
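A minimal sketch of the newly allowed casts (written with today's script annotation syntax; the function itself is made up):
```python
import torch

@torch.jit.script
def casts(x: int, y: float) -> float:
    a = float(x)   # int -> float, newly allowed
    b = int(y)     # float -> int, newly allowed
    return a + float(b)

print(casts(3, 2.7))   # 5.0
```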
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10168
Differential Revision: D9141163
Pulled By: wanchaol
fbshipit-source-id: 5e5591a98b4985a675641dfc9a385b2a0bf8e208
Summary:
Previously, `foo = [bar, baz]` would construct a TupleType of fixed arity. This would cause code like:
```
foo = [2]
if True:
foo = [2, 2]
```
to fail to compile, since `(int)` is not the same as `(int, int)`.
This PR changes things so that list literals construct ListTypes, which can be resized.
Potentially breaking changes introduced:
- Empty list literals are now disallowed, `_constructEmptyFooList()` builtins are required to replace them.
- Iterable variable unpacking where the rhs is a list is now disallowed. (Tuples still work)
- Lists must have a single type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10193
Differential Revision: D9147166
Pulled By: michaelsuo
fbshipit-source-id: bbd1b97b0b6b7cb0e6f9d6aefa1ee9c731e63039
Summary:
* Changes `insertConstant(g, val)` to `g.insertConstant(val)`.
* Moves SourceRange to its own file to enable it.
* Cleans up dead attribute code in schema matching and graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10177
Differential Revision: D9137789
Pulled By: zdevito
fbshipit-source-id: 8a73cfb01a576f02e7e4dce019be9c0a0002989d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9897
Add an IntrusivePtr class to do intrusive refcounting with a shared_ptr-like interface.
Reviewed By: ezyang
Differential Revision: D9018619
fbshipit-source-id: 5de8706aab8eea2e30bead0f59bd6a7ca4d20011
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10173
With D9024330, the `Extend` function is no longer a template, which makes
the `template` keyword here invalid. For some reason the current version of LLVM
doesn't catch this, but the latest one does.
Reviewed By: jerryzh168
Differential Revision: D9133462
fbshipit-source-id: 54ac9aad01f81b9b4e7b6e2864b8961478d2d860
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9905
This diff improves the LARS operator in Caffe2 by applying clipping to the computed learning rate.
Reviewed By: pjh5
Differential Revision: D9020606
fbshipit-source-id: b579f1d628113c09366feac9406002f1ef4bd54f
Summary:
This PR adds strings to the AST and implements them for print statements. Strings are lifted as attributes onto the print node. They must be arguments to print itself, not arguments to an object that is passed to print. If they are encountered elsewhere, a NYI exception will be thrown.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9324
Reviewed By: jramseyer
Differential Revision: D8807128
Pulled By: eellison
fbshipit-source-id: 984401ff458ed18d473c6d1bd86750e56c77d078
Summary:
This is part of the process of removing THLongStorage to represent sizes/strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10146
Differential Revision: D9126611
Pulled By: gchanan
fbshipit-source-id: b0d995a4c51dfd54bf76dcfee9a69f37f9d01652
Summary:
In this changeset:
* improvements to `hipify-python.py`
* marking unit tests broken for ROCm
* reducing the number of jobs for the build to avoid out-of-memory issues
* switch to Thrust/cub-hip master for the CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9653
Differential Revision: D9117791
Pulled By: ezyang
fbshipit-source-id: a6c3c7b81f2bda9825974bf9bf89a97767244352
Summary:
Enabled support for generating random numbers in the fusion compiler. Currently a Philox RNG implementation from TensorFlow is used, as NVRTC couldn't resolve the curand.h header correctly. The two implementations should have exactly the same behavior according to our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9795
Differential Revision: D8999029
Pulled By: SsnL
fbshipit-source-id: f0d2616a699a942e2f370bdb02ac77b9c463d7b8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10081
Add a new utility that makes it easier to write graph transformations. Callers now only need to take care of the actual transformation logic. The subgraph matching is simplified because callers only need to specify a simple construct for the subtree matching criteria.
The utility is SubgraphMatcher::replaceSubtree.
Some notes:
- replaceSubtree takes a subtree matching criteria and a lambda that takes a subtree root. It doesn't handle any transformations itself. Callers are responsible for the transformation part, including deleting all nodes in the matched subtree(s). We could enhance this to also handle the deletion part if it turns out to be useful.
- Only subtree matching is supported for now, but we can add general DAG subgraph support later if needed.
Reviewed By: bwasti
Differential Revision: D9073297
fbshipit-source-id: 465a0ad11caafde01196fbb2eda2d4d8e550c3b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9860
For 3D group convolution, in the case of CUDNN 7 and NCHWD order, filter dim is (M, C/group_, k_h, h_w, k_d).
According to the CUDA doc (https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#grouped-convolutions), the existing implementation is incorrect and will crash 3D video model training with group convolution.
In the implementation, `filter.dims(1)` is already `C/group_`, so we don't need to divide it by `group_` again.
Reviewed By: BIT-silence
Differential Revision: D9008807
fbshipit-source-id: 2f0d6eb47f4e16d7417a7e3baeba709e3254154f
Summary:
Implement IR transformation for control flow
- `prim::Constant`: clone to new graph directly
- `prim::NumToTensor`: create a `BatchTensor` from output tensor with `batch_size = 1`
- `prim::TensorToNum`: clone to new graph
- `prim::ListConstruct`: clone to new graph
- `prim::If`: execute both `if_block` and `else_block` and combine results from them using `cond`
- `prim::Loop`:
- for loop
- while loop: change while `cond` to `cond_any`, use `cond` to update outputs
test case: hand-written LSTM, greedy search, beam search
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9392
Differential Revision: D8822369
Pulled By: ChunliF
fbshipit-source-id: 8f03c95757d32e8c4580eeab3974fd1bc429a1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10152
- Moved from namespace c10::guts to at
- I fixed the use sites, since there were only three of them
- Macro renamed from C10_ to AT_
Reviewed By: smessmer
Differential Revision: D9123652
fbshipit-source-id: bef3c0ace046ebadb82ad00ab73371f026749085
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10139
We want CaffeTypeId to be interconvertible with at::ScalarType, and
this means we should have the numbers line up exactly. Fortunately
this is not too hard to do.
Reviewed By: smessmer
Differential Revision: D9123058
fbshipit-source-id: 7e9bd59ca25a552afe9d2d0a16cedc4f6311f911
Summary:
This exposes expand_outplace to Python. Fixes #8076. Fixes #10041.
I didn't name it torch.broadcast because numpy.broadcast does something
slightly different (it returns an object with the correct shape
information).
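A small usage sketch, assuming the exposed function surfaces as torch.broadcast_tensors:
```python
import torch

a = torch.randn(3, 1)
b = torch.randn(1, 4)
x, y = torch.broadcast_tensors(a, b)
print(x.shape, y.shape)   # both torch.Size([3, 4])
```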
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10075
Differential Revision: D9125816
Pulled By: zou3519
fbshipit-source-id: ebe17c8bb54a73ec84b8f76ce14aff3e9c56f4d1
Summary:
Previously, the parser was emitting list literals for tuples, but the IR was representing list literals internally with TupleTypes.
For implementing most list operations, I think it will be helpful to distinguish between lists (dynamic size, homogeneous types) and tuples (fixed arity, heterogeneous types).
This diff modifies the parser logic to emit tuple literals. This frees us to represent lists as ListType in the IR, while still properly mapping tuple literals to TupleTypes.
A following diff will actually switch over list literals to emit ListTypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10128
Differential Revision: D9121305
Pulled By: michaelsuo
fbshipit-source-id: e0cad07ae8bac680f7f8113d10e5129d5a1a511d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10138
Note that `TensorCPU` and `TensorGPU` are both refined to be `Tensor` now; basically they are the same thing. So a check like `blob.IsType<TensorCPU>()` is no longer safe, as `TensorGPU` can pass the check too.
We need to systematically weed out such usage in our codebase... jerryzh
Reviewed By: houseroad
Differential Revision: D9115273
fbshipit-source-id: 13b293c73691002eac34e095cdcd96c27183e875
Summary:
This rewrites checked_convert to use stringstreams, eliminating the use of to_string which is not available on Android stdc++.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10137
Reviewed By: smessmer
Differential Revision: D9122340
fbshipit-source-id: b7c1bff70e36217305f2b3333c51543ef8ff3d9c
Summary:
This will be needed soon because I want to move Half.h into
ATen/core, and then I cannot have a TH dependency.
I also took the liberty of making the code more strict-aliasing
safe (this is not actually useful, since we will never build Torch
with strict aliasing) by replacing pointer casts between
float and unsigned with a memcpy instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10134
Differential Revision: D9121920
Pulled By: ezyang
fbshipit-source-id: 3b1f86a7c5880e8ac1a589a51f0635bb72e1fd40
Summary:
…e_/is_variable_
The basic game plan is to stop accessing the type_ field directly,
and instead use the stored backend_, scalar_type_ and
is_variable_ to look up the appropriate Type from Context.
Storage of backend_ and scalar_type_ are new.
At some future point in time, I'd like to look at this code
carefully to see if I can get everything in this codepath inlining.
I didn't do it in this patch because there are circular include
problems making things difficult.
Some other details:
- Added Device::backend() which does what it says on the tin
- SparseTensorImpl is temporarily hard-coded to root in at::Context
for the appropriate context. If/when we put this in shared code,
we'll have to break this dep too, but for now it should be OK.
- There's a stupid problem with globalContext() deadlocking if
you didn't actually initialize it before loading libtorch.so
(which is bringing along the variable hooks). I didn't fix
it in this PR; it's tracked in #9784
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9787
Reviewed By: cpuhrsch
Differential Revision: D8980971
Pulled By: ezyang
fbshipit-source-id: 2b4d867abfdc3999a836a220c638c109053145a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9740
- Remove implicit ArrayRef -> vector conversion
- Fix 4 call sites that accidentally did an implicit expensive vector conversion but wouldn't have needed to
- Remove explicit vector conversion from 4 call sites that also didn't need to do that
Reviewed By: ezyang
Differential Revision: D8961693
fbshipit-source-id: 980da9f988083c0072497f9dbcbbf6f516fa311c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9610
Mostly making some stuff in ArrayRef constexpr to give it better perf.
Reviewed By: ezyang
Differential Revision: D8926785
fbshipit-source-id: af6d4b05fbc69d20855a80f3edc2b501577a742b
Summary:
in particular, make not building tests actually work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10091
Differential Revision: D9121366
Pulled By: anderspapitto
fbshipit-source-id: d7d38cf759aa46bff90d3b4f695c20f29039ae75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10107
This header is needed for ATen/core stuff
This diff also fixes an issue in C++17.h when run in C++17 enabled compilers.
Reviewed By: ezyang
Differential Revision: D9095209
fbshipit-source-id: d45947956019a7095875f48746b88c414e8865bc
Summary:
zdevito explained that the attributed versions of `Operator`s are no longer necessary. This PR does two things:
1. Removes all code associated with attributed operators,
2. Adds a second kind of state to `Operator` where it is constructed with an `Operation` directly instead of an `OperationCreator`. This will be useful to test custom operators which don't require a node (you can just retrieve it directly).
Now rebased on top of https://github.com/pytorch/pytorch/pull/9801
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10080
Differential Revision: D9113668
Pulled By: goldsborough
fbshipit-source-id: 1276a191c7cf89da1c38488769f2105ce2664750
Summary:
Extracted from https://github.com/pytorch/pytorch/pull/8338
Updating Eigen submodule to fix an issue we saw with BUILD_ATEN and BUILD_CAFFE2 removal.
cc mingzhe09088 ezyang smessmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10095
Reviewed By: mingzhe09088
Differential Revision: D9109877
Pulled By: orionr
fbshipit-source-id: 90e36c298d8a22398558d70dc5f68a95a7687d6b
Summary:
It's not a particularly pretty process right now, but it may as well
be documented. I'm not aware of an ideal location for this, so I'm
just dropping it in the docs/ folder for now as recommended by
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10087
Differential Revision: D9119681
Pulled By: anderspapitto
fbshipit-source-id: cd4afb642f3778c888d66a501bc697d0b0c88388
Summary:
This also makes Backtrace more portable, by disabling its functionality for
mobile builds as well.
It also handles Caffe2 static Windows builds by introducing a new variable,
AT_CORE_STATIC_WINDOWS, which must be set if you're building
ATen on Windows as part of a static library.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10092
Reviewed By: gchanan, smessmer
Differential Revision: D9094393
Pulled By: ezyang
fbshipit-source-id: 93281f9302bd378605a26589ae308faf1dac7df4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10035
This is an initial diff which refactors some of the components in the Seq2SeqModelCaffe2EnsembleDecoder class.
Reviewed By: jmp84
Differential Revision: D9026372
fbshipit-source-id: 449635208f24494209ae2fb78a19fca872970ea8
Summary:
This PR depends on the tests added in #9670. It moves the first, tiny function from the c10d DDP to C++: `dist_broadcast_coalesced`. Let me know if ` torch/csrc/distributed/c10d/ddp.h` will be a good place to put these rewritten functions.
pietern apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9729
Differential Revision: D8985308
Pulled By: goldsborough
fbshipit-source-id: dc459fe9040273714044152063585e746974752f
Summary:
I opened an issue explaining some of my frustrations with the current state of schedulers.
While most points that I raised in [that issue](https://github.com/pytorch/pytorch/issues/8741#issuecomment-404449697) need to be discussed more thoroughly before being implemented, there are some that are not so difficult to fix.
This PR changes the way the LambdaLR scheduler gets serialized:
> The lr_lambda functions are only saved if they are callable objects (which can be stateful).
> There is no point in saving functions/lambdas as you need their definition before unpickling and they are stateless.
This has the big advantage that the scheduler is serializable, even if you use lambda functions or locally defined functions (aka a function in a function).
Does this functionality need any unit tests?
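For illustration, a minimal sketch of the kind of stateful callable this enables (the class name and warmup rule are made up for the example):
```python
import torch
from torch.optim.lr_scheduler import LambdaLR

class WarmupLambda:
    """A stateful callable; unlike a lambda, its state survives pickling."""
    def __init__(self, warmup_steps):
        self.warmup_steps = warmup_steps

    def __call__(self, epoch):
        return min(1.0, float(epoch + 1) / self.warmup_steps)

params = [torch.zeros(1, requires_grad=True)]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = LambdaLR(optimizer, lr_lambda=WarmupLambda(warmup_steps=5))
state = scheduler.state_dict()   # the callable object's state can be saved
```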
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9927
Differential Revision: D9055505
Pulled By: soumith
fbshipit-source-id: 6c1cec588beedd098ec7d2bce6a9add27f29e48f
Summary:
There is a regression in softmin in 0.4.1 that was not present in 0.4.0. The behavior of softmin(x) should match softmax(-x); however, it is instead implemented (in v0.4.1) as -softmax(x). These are not the same. The fix is trivial because the bug is due to operator precedence.
This is a major regression that broke my training. I'm not sure how a unit test did not catch this.
```
x = torch.tensor([1, 2, 3.5, 4])
print(F.softmin(x, dim=0)) # this has the wrong output in 0.4.1 but correct in 0.4.0
print(F.softmax(-x, dim=0)) # this is what softmax should be
print(F.softmax(x, dim=0))
print(-F.softmax(x, dim=0)) # this is how softmin is implemented incorrectly
```
In 0.4.1 this produces
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
tensor([0.6668, 0.2453, 0.0547, 0.0332])
tensor([0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
In 0.4.0 this produces the correct values
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.6668, 0.2453, 0.0547, 0.0332])
tensor([ 0.0278, 0.0755, 0.3385, 0.5581])
tensor([-0.0278, -0.0755, -0.3385, -0.5581])
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10066
Differential Revision: D9106995
Pulled By: soumith
fbshipit-source-id: 7332503c6077e8461ad6cd72422c749cf6ca595b
Summary:
This is a cleanup and refactoring.
In its original form (changeset 6fdf915c057a) this diff caused a 5% regression
on ads CPU. The root cause was an omission of link_whole = True, causing
symbols to be stripped in mode/opt and forcing the converter to fall back,
which caused patterns to go unmatched in the graph transform logic. This version of
the diff tests for link_whole by including a C++ test of the transform.
Reviewed By: yinghai
Differential Revision: D9040511
fbshipit-source-id: 3e19b89989aa68b021762d12af2d0b4111280b22
Summary:
The `.bat` files' EOL is LF, so the build fails on some Windows machines.
To fix this, add a `.gitattributes` file and set the batch files' EOL to CRLF.
Discussion is in #9677.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9813
Differential Revision: D9026486
Pulled By: soumith
fbshipit-source-id: 341eaa677c35f8476a7eda1bac9827385072eb29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10044
The test was subtly broken! This transform wasn't writing to the correct blob and the test did not catch that because it was looking at the old version.
Thanks kerenzhou for catching this.
Reviewed By: Jokeren
Differential Revision: D9075520
fbshipit-source-id: c31ff0afcd78dd2dc7ffc240e2e89eeda87f1fb4
Summary:
This should prevent slow startup times, and will not report as many
errors during static initialization, which are hard to debug.
ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9801
Reviewed By: goldsborough
Differential Revision: D8986603
Pulled By: zdevito
fbshipit-source-id: 440d43ab5e8cffe0b15118cb5fda36391ed06dbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10059
Without a virtual dtor, it could induce incorrect sized deallocation, messing up the memory. And unfortunately, sized deallocation cannot be detected by ASAN yet.
Reviewed By: jerryzh168
Differential Revision: D9080526
fbshipit-source-id: c136cf653134e75b074326be2bc03627da42446f
Summary:
The affected files are all files that are planned to be moved
to ATen/core; the includes are for headers which are NOT slated
for movement.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10085
Differential Revision: D9093746
Pulled By: ezyang
fbshipit-source-id: 2beeffdae26d03d631d2d51b40bf6303759a2f50
Summary:
This lays out initial support for taking and returning a richer set
of types than only tensors. Floats and ints are already valid, lists are
straightforward to add, tuples need some discussion.
Based on top of #9948. Review only the last commit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9969
Reviewed By: zdevito
Differential Revision: D9076973
Pulled By: apaszke
fbshipit-source-id: 5a1fe912ea6b79ab2bfd0dcce265eb05855b5ff0
Summary:
_pointwise loss has some Python special-casing; we converted reduction to ATen enums too early.
Fixes #10009
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10018
Differential Revision: D9075489
Pulled By: li-roy
fbshipit-source-id: 4bf2f5e2911e757602c699ee1ec58223c61d0162
Summary:
This PR fixes #9418.
OpenMPI 1.10 segfaults in MPI_Bcast with a CUDA buffer, and it's a retired OpenMPI version.
I've tested on 2.1.1 and 3.0.0 and they work well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10015
Reviewed By: soumith
Differential Revision: D9088103
Pulled By: ailzhang
fbshipit-source-id: fc0a45e5cd016093ef0dbb9f371cbf67170d7045
Summary:
The CPU and CUDA variants are a direct transposition of Graves et al.'s description of the algorithm, with the
modification that it is in log space.
There is also a binding for the (much faster) CuDNN implementation.
This could eventually fix #3420
I still need to add tests (TestNN seems much more elaborate than the other testing) and fix the bugs that invariably turn up during testing. Also, I want to add some more code comments.
I could use feedback on all sorts of things, including:
- Type handling (cuda vs. cpu for the int tensors, dtype for the int tensors)
- Input convention. I use log probs because that is what the gradients are for.
- Launch parameters for the kernels
- Errors and omissions and anything else I'm not even aware of.
Thank you for looking!
In terms of performance it looks like it is superficially comparable to WarpCTC (but I have not systematically investigated this).
I have read that CuDNN is much faster than other implementations because it does *not* use log space, but also because the gathering step is much, much faster (I avoided trying tricky things there, as they seem to contribute to warpctc's fragility). I might think some more about which existing torch function (scatter or index...) I could learn from for that step.
Average timings for the kernels from nvprof for some size:
```
CuDNN:
60.464us compute_alphas_and_betas
16.755us compute_grads_deterministic
Cuda:
121.06us ctc_loss_backward_collect_gpu_kernel (= grads)
109.88us ctc_loss_gpu_kernel (= alphas)
98.517us ctc_loss_backward_betas_gpu_kernel (= betas)
WarpCTC:
299.74us compute_betas_and_grad_kernel
66.977us compute_alpha_kernel
```
Of course, I still have the (silly) outer blocks loop rather than computing consecutive `s` in each thread which I might change, and there are a few other things where one could look for better implementations.
Finally, it might not be unreasonable to start with these implementations, as the performance of the loss has to be seen in the context of the entire training computation, so this would likely dilute the relative speedup considerably.
My performance measuring testing script:
```
import timeit
import sys
import torch
num_labels = 10
target_length = 30
input_length = 50
eps = 1e-5
BLANK = 0#num_labels
batch_size = 16
torch.manual_seed(5)
activations = torch.randn(input_length, batch_size, num_labels + 1)
log_probs = torch.log_softmax(activations, 2)
probs = torch.exp(log_probs)
targets = torch.randint(1, num_labels+1, (batch_size * target_length,), dtype=torch.long)
targets_2d = targets.view(batch_size, target_length)
target_lengths = torch.tensor(batch_size*[target_length])
input_lengths = torch.tensor(batch_size*[input_length])
activations = log_probs.detach()
def time_cuda_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, culog_alpha = torch._ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()
def time_cudnn_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo, cugra = torch._cudnn_ctc_loss(*args)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()
def time_warp_ctc_loss(grout, *args):
    torch.cuda.synchronize()
    culo = warpctc.ctc_loss(*args, blank_label=BLANK, size_average=False, length_average=False, reduce=False)
    g, = torch.autograd.grad(culo, args[0], grout)
    torch.cuda.synchronize()
if sys.argv[1] == 'cuda':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets_2d.cuda(), input_lengths.cuda(), target_lengths.cuda(), BLANK]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cuda_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'cudnn':
    lpcu = log_probs.float().cuda().detach().requires_grad_()
    args = [lpcu, targets.int(), input_lengths.int(), target_lengths.int(), BLANK, True]
    grout = lpcu.new_ones((batch_size,))
    torch.cuda.synchronize()
    print(timeit.repeat("time_cudnn_ctc_loss(grout, *args)", number=1000, globals=globals()))
elif sys.argv[1] == 'warpctc':
    import warpctc
    activations = activations.cuda().detach().requires_grad_()
    args = [activations, input_lengths.int(), targets.int(), target_lengths.int()]
    grout = activations.new_ones((batch_size,), device='cpu')
    torch.cuda.synchronize()
    print(timeit.repeat("time_warp_ctc_loss(grout, *args)", number=1000, globals=globals()))
```
I'll also link to a notebook that I used for writing up the algorithm in simple form and then test the against implementations against it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9628
Differential Revision: D8952453
Pulled By: ezyang
fbshipit-source-id: 18e073f40c2d01a7c96c1cdd41f6c70a06e35860
Summary:
I previously did some transformations, e.g. _nDimension/_dim -> nDimensionLegacyAll, nDimension -> nDimensionLegacyNoScalars.
But this didn't touch dim(), which needs to be updated to support scalars. Instead of doing an (ugly) move, I audited the call sites and updated the cases that could be size 1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10023
Differential Revision: D9068996
Pulled By: gchanan
fbshipit-source-id: c63820767dd1496e908a5a96c34968482193f2c5
Summary:
We missed the upsample symbolic when bumping up the opset to 7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10001
Reviewed By: bddppq
Differential Revision: D9067212
Pulled By: houseroad
fbshipit-source-id: 3e285d2800a32cb04fa82f8e7f261bdd010a8883
Summary:
ATenCore.h is a dummy header to just test that this is working at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10019
Reviewed By: smessmer
Differential Revision: D9067262
Pulled By: ezyang
fbshipit-source-id: 58bab9c0aa83b56335e36b719b9b6505400d8dee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9890
Minor cleanups for Graph.h to make it more consistent with our style guide
Also fix opt/device.cc and binary_match_test.cc to not access subgraph.nodes_ which is now private
Reviewed By: bwasti
Differential Revision: D9017108
fbshipit-source-id: 9f5cba4a2cd2a452a955005f4704f6c120bbc1d5
Summary:
Adding a constant propagation pass to the JIT. I have added examples to the expect files.
There are a couple of special cases which have not been implemented here: IF nodes with constant conditions can be inlined with the correct block, and WHILE nodes can be removed if the condition is false. I have added a test for each case in the test_jit.py file as expected failures.
To be consistent with DCE, Python ops & C++ ops are treated as not having side effects.
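A minimal sketch of the kind of folding such a pass performs (the function here is made up):
```python
import torch

@torch.jit.script
def folded(x):
    y = 2 + 3          # a constant subexpression the pass can fold
    return x + y

# After constant propagation the graph should carry the literal 5 directly
# instead of an add over two constants.
print(folded.graph)
```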
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8808
Reviewed By: wanchaol
Differential Revision: D8906770
Pulled By: eellison
fbshipit-source-id: 10ad796d89f80b843566c9ddad6a0abd1f3dc74c
Summary:
This causes numpy to yield to the torch functions,
e.g. instead of numpy array/scalar __mul__ converting the tensor to
an array, it will now arrange for the Tensor __rmul__ to be called.
Fixes case 2 of #9468
It also makes cases 3 and 4 equivalent but does not fix them.
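A small sketch of the intended behavior (values are illustrative):
```python
import numpy as np
import torch

t = torch.arange(3.0, requires_grad=True)
s = np.float64(2.0)

out = s * t            # numpy now defers, so Tensor.__rmul__ runs
print(type(out))       # torch.Tensor, with autograd history intact
```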
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9651
Differential Revision: D8948079
Pulled By: ezyang
fbshipit-source-id: bd42c04e96783da0bd340f37f4ac3559e9bbf8db
Summary:
More clang tidy cleanups in `torch/csrc`. This time:
1. `hicpp-use-equals-default` recommends `= default` instead of `{}` for constructors/destructors. This is better practice because it expresses the intent better (https://stackoverflow.com/questions/6502828/what-does-default-mean-after-a-class-function-declaration)
2. `readability-inconsistent-declaration-parameter-name` enforces that parameter names in the declaration match parameter names in the definition. This is just generally useful and can prevent confusion and bugs.
Also updated my script a little bit.
apaszke ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9737
Differential Revision: D9069069
Pulled By: goldsborough
fbshipit-source-id: f7b3f3a4eb4c9fadc30425a153566d3b613a41ae
Summary:
These could use some autograd tests, which are coming in a later PR, but using them in autograd is probably pretty rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9947
Reviewed By: ezyang
Differential Revision: D9032778
Pulled By: gchanan
fbshipit-source-id: fa5a6509d3bac31ea4fae25143e82de62daabfbd
Summary:
Not sure if anybody is interested, but I managed to run inference with a `GRU` in `wasm` using ATen compiled with Emscripten. It was quite trivial to fix the configuration.
It also passes most of the tests, especially all scalar tensor tests.
The command line to configure was, but could be simplified:
```
emconfigure cmake -DAT_LINK_STYLE=STATIC -DCAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO=OFF -DCMAKE_C_FLAGS="-Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0" -DCMAKE_CXX_FLAGS="-Wno-implicit-function-declaration -DEMSCRIPTEN -s DISABLE_EXCEPTION_CATCHING=0" -DCMAKE_INSTALL_PREFIX=/home/sugar/aten-wasm ../
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9803
Differential Revision: D9004610
Pulled By: ezyang
fbshipit-source-id: db26c59f27162ed80f6aee2973c4cb9252d3d1e4
Summary:
Fixes #9818.
It seems the original Python doesn't add `[PYTHONPATH]\Library\bin` to `PATH`. We try to add it before the DLL loading process.
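A rough sketch of the idea (the path construction and placement here are assumptions, not the actual patch):
```python
import os
import sys

# Prepend the conda environment's Library\bin to PATH before extension
# DLLs are loaded; on non-Windows setups the directory simply won't exist.
lib_bin = os.path.join(sys.exec_prefix, 'Library', 'bin')
if os.path.isdir(lib_bin):
    os.environ['PATH'] = lib_bin + os.pathsep + os.environ.get('PATH', '')
```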
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9920
Differential Revision: D9040825
Pulled By: soumith
fbshipit-source-id: c07fff71b2aea254a396042ab677696f6829aac7
Summary:
Minor addition to the docstring of `torch.optim.Adam`, adding the default argument description for the `amsgrad` argument to the docstring for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9971
Differential Revision: D9040820
Pulled By: soumith
fbshipit-source-id: 168744a6bb0d1422331beffd7e694b9d6f61900c
Summary:
This was introduced in #9826, following the corresponding CUDA file context_gpu.cu; tests passed in the PR, at which point master was 94439d7df. However, during the long landing process, a new master commit aebf3b4 came in that removed `CAFFE_KNOWN_TYPE(Tensor<HIPContext>)` from the context_hip.cc file, which broke the HIP BlobStatGetter. We did NOT run tests again during the merge, so when #9826 later landed on master the ROCm tests started breaking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9973
Differential Revision: D9040671
Pulled By: bddppq
fbshipit-source-id: f3b16cabaf681fc0535ca733db0b48430868f922
Summary:
Supersedes #8925
This PR fixes #8502: it fixes the gradients problem for clamp when passing None to the function, and adds support for NoneLiteral and NoneType in script to enable the clamp tests. Now we can have corner cases like:
```python
@torch.jit.script
def func():
    x = torch.randn(3, 3, requires_grad=True)
    y = torch.clamp(x, None, 0)  # max = 0
    y = torch.clamp(x, min=None, max=0)
```
In both JIT and ATen, we use Scalar(NAN) as a sentinel value when passing None to clamp; this is the current way we support the None type in JIT and solve the gradient problem when the user explicitly passes None into clamp.
On the JIT side, we create a tensor(NAN) and an undefined tensor if we encounter None when matching the function schema, and later in the interpreter it is translated to Scalar(NAN) if needed.
Ideally we wouldn't need clamp_min and clamp_max in ATen native/autograd and could support only clamp after this change, but since a bunch of other operators (e.g. Activation.cpp, Loss.cpp) use clamp_min in several places, we will keep those functions available; all Python invocations, however, will only call clamp instead of clamp_min/max (which calls the underlying th_max/th_min in clamp).
zdevito jamesr66a
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9596
Reviewed By: zdevito
Differential Revision: D8940839
Pulled By: wanchaol
fbshipit-source-id: c543a867b82e0ab8c99384773b173fdde2605d28
Summary:
This is a follow-up to https://github.com/pytorch/pytorch/pull/9794 that contains only the serialization library and exposes a cleaner API. This should later be incorporated into the module export code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9900
Reviewed By: zdevito
Differential Revision: D9021057
Pulled By: jamesr66a
fbshipit-source-id: 01af74a7fdd1b90b2f5484644c3121d8ba9eb3b3
Summary:
If we have this "spatial" attribute and its value equals to 1, we could just remove this attribute and convert this op to caffe2 SpatialBN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9492
Differential Revision: D8988165
Pulled By: houseroad
fbshipit-source-id: a9218dc9cd5fab43deb371f290f81285f5283231
Summary:
We only support the special case; the original dim is not supported by ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9660
Reviewed By: bddppq
Differential Revision: D8965507
Pulled By: houseroad
fbshipit-source-id: 021dffdf0489c2d3a50bfd1e0c4cfd00d4a3d776
Summary:
The goal of this PR is to update the hip files to reflect relevant changes in cuda source files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9826
Differential Revision: D9032840
Pulled By: bddppq
fbshipit-source-id: 504e55c46308eebfee3c9a7beea1f294fe03470f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9747
Currently the ctc_greedy_decoder op initializes the `merge_repeated` argument only if it has been provided by the user. Change it to initialize in all cases.
Reviewed By: houseroad
Differential Revision: D8963635
fbshipit-source-id: 18955c7c26a77d9d7f5137e4dec085252ffabfeb
Summary:
```
This adds TensorIterator, a helper class for computing element-wise
operations that's intended to replace the CPU and CUDA apply utils
functions.
CPU kernels are implemented as functions that operate on strided 1-d
tensors, compared to CPUApplyUtils, which operated on individual elements. This
allows the kernels to handle vectorization, while TensorIterator handles
parallelization and non-coalesced dimensions.
GPU kernels continue to operate on elements, but the number of
specializations is reduced. The contiguous case remains the same. The
non-contiguous case uses a single (reduced) shape for all operands and
the fast integer division from THCIntegerDivider. To avoid extra
specializations for indexing with 64-bits, large operations are split
into smaller operations that can be indexed with 32-bits.
Major semantic changes:
- No more s_add, s_mul, s_div, or s_sub. Broadcasting is handled by
TensorIterator. The autograd engine performs the reduction assuming
standard broadcasting if the gradient shape does not match the
expected shape. Functions that do not use standard broadcasting rules
should either continue to trace the expand calls or handle the
reduction in their derivative formula.
- Use ONNX v7, which supports broadcasting ops.
Performance impact:
- Small increased fixed overhead (~0.5 us)
- Larger overhead for wrapped numbers (~2.5 us)
- No significant change for ops on contiguous tensors
- Much faster worst-case performance for non-contiguous GPU tensors
- Faster CPU bias addition (~2x)
- Faster GPU bias addition (~30% faster)
Future work:
- Decrease overhead, especially for wrapping numbers in Tensors
- Handle general inter-type operations
- Extend to unary ops and reductions
- Use buffering for compute-bound operations on non-contiguous tensors
(pull in from CPUApplyUtils)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8919
Differential Revision: D8677600
Pulled By: colesbury
fbshipit-source-id: 61bc9cc2a36931dfd00eb7153501003fe0584afd
Summary: Minor fix for a bug introduced by D9004285
Reviewed By: anderspapitto
Differential Revision: D9028762
fbshipit-source-id: 9b9c5eef30e61d7ae19784e0418fa29bad2b5564
Summary:
I hope this helps me for the windows build failure in #9628 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9904
Differential Revision: D9026715
Pulled By: soumith
fbshipit-source-id: bb97d41d060823f5a37bfc9a1659815b8b9f4eab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9939
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Previously, the Caffe2 Tensor class was fixed at compile time to bind to a particular device/context. With this change, we make it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify a device type in order to create a Tensor - there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the tensor constructors, e.g. Tensor(DeviceType type),
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context) have changed: the second context is passed in so we can call the templated Copy function. Previously it could be a different context than the source and target; now we enforce that, if it is provided, it has the same device type as src.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter Blob::GetMutableTensor that verifies both that the Blob contains a Tensor and that it is of the correct type.
4. The Tensor type is no longer default-constructible (as we don't have unknown-device tensors), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
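A minimal sketch of the new calling convention described in points 1 and 3 above (names follow this summary; the exact enum spellings and signatures in the diff may differ):
```cpp
#include "caffe2/core/blob.h"
#include "caffe2/core/tensor.h"

void Example(caffe2::Blob* blob) {
  // The device type is now a required runtime argument; there are no
  // uninitialized / unknown-device tensors.
  caffe2::Tensor cpu_tensor(caffe2::CPU);

  // 'Get-or-construct': verifies the Blob holds a Tensor of the right device
  // type, constructing one if necessary.
  caffe2::Tensor* t = blob->GetMutableTensor(caffe2::CPU);
  (void)cpu_tensor;
  (void)t;
}
```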
Reviewed By: ezyang, houseroad
Differential Revision: D9024330
fbshipit-source-id: e0b8295d2dc6ebe2963383ded5af799ad17164ba
Summary:
This was showing up in the n-dimensional empty tests as flaky because it's reading uninitialized cuda memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9907
Differential Revision: D9021413
Pulled By: gchanan
fbshipit-source-id: 31542b7597919df9afd6e528bb108a4a3e8eaf60
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9895
The primary goal here was to remove THTensor::_dim, which isn't part of the API moving forward.
Instead, we provide 3 options for getting the dimensionality (this is temporary although non-trivial to remove!):
```
nDimension corresponds to the "true" ATen dimension. TODO: implement.
nDimensionLegacyNoScalars corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors.
nDimensionLegacyAll corresponds to the ATen dimension, except scalars are viewed as 1-dimensional tensors
and tensors with a dimension of size zero are collapsed to 0-dimensional tensors.
```
So in this patch, nDimension -> nDimensionLegacyNoScalars and _dim/_nDimension goes to nDimensionLegacyAll.
These are just codemods.
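To make the three views concrete, here is a self-contained sketch of the intended semantics (illustrative code only, not the TH implementation):
```cpp
#include <cstdint>
#include <vector>

// True ATen dimension: scalars are 0-d, size-[0] tensors are 1-d.
int64_t nDimension(const std::vector<int64_t>& sizes) {
  return static_cast<int64_t>(sizes.size());
}

// Scalars are reported as 1-dimensional; everything else is the ATen dim.
int64_t nDimensionLegacyNoScalars(const std::vector<int64_t>& sizes) {
  return sizes.empty() ? 1 : nDimension(sizes);
}

// Scalars look 1-dimensional, and any tensor with a size-0 dimension
// collapses to 0 dimensions (the old TH convention).
int64_t nDimensionLegacyAll(const std::vector<int64_t>& sizes) {
  for (int64_t s : sizes) {
    if (s == 0) return 0;
  }
  return sizes.empty() ? 1 : nDimension(sizes);
}
```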
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9835
Reviewed By: ezyang
Differential Revision: D8999338
Pulled By: gchanan
fbshipit-source-id: a4d676ac728f6f36ca09604a41e888d545ae9311
Summary:
Hello! I just found a small spelling mistake while reading this source code, so I'm submitting a PR for it. Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9868
Reviewed By: gchanan, ezyang
Differential Revision: D9016030
Pulled By: soumith
fbshipit-source-id: fc3877177be080adbdbda99a169e401691292ebb
Summary:
Based on top of #9763 (first 3 commits belong to that PR). The first commits from this PR are "Stop using attributes ..."
I tried to separate the changes into fairly meaningful commits. I can't split them up into smaller PRs, because everything starts working and all tests pass only after the whole sequence, but hopefully this will make reviewing somewhat easier.
Known issues/regressions/future tasks:
- `aten::lerp` and `aten::clamp` are no longer fusable
- `CreateAutodiffSubgraphs` needs a rewrite
- It is much more strict now, and will miss a lot of opportunities, especially when viewing ops are involved. Our previous approach was "ignore the assumption on shape availability in gradient formulas to determine differentiability, and hope that shape prop will be robust enough to actually deliver them before we differentiate", which obviously doesn't scale well to more complex cases. We should either work on reducing the size dependency of grad formulas (feasible e.g. for `view`/`reshape`, unfeasible for `squeeze`/`unsqueeze`), or make `CreateAutodiffSubgraphs` integrate some kind of "I could integrate this node into an AD subgraph, but will I be able to infer the shape of its input" reasoning (kind of like a limited shape prop, that doesn't infer anything, and only tells if it *could* infer something).
- It sometimes creates constant-only (or constants + one node) graphs, which is useless
- Broken `aten::add` in auto-batching, because it gained a non-tensor input. I changed the test for pointwise operations to use `aten::mul` instead, but I needed to disable the LSTM cell test. I'm not sure how scalar constants should be implemented in this case, because I don't fully understand our format. cc: ChunliF
- Graph import does some hacks to recover the type of constants. This code should be removed once we gain the ability to export the IR along with value types.
- There's still a fair amount of dead code that can be removed. I didn't want to make this diff any bigger, and removing it is an easy task.
- Graph fuser could be improved to use signature matching (possibly using `OperatorSet`) instead of basing on node kinds.
- Manual constant propagation for the `ListConstruct` node in `torch/onnx/utils.py` should be replaced with a proper constant propagation pass (or we should ensure that the one we have handles at least this case before we remove this code).
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9807
Reviewed By: ezyang
Differential Revision: D9004285
Pulled By: apaszke
fbshipit-source-id: fe88026a765f6b687354add034c86402362508b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9901
Added support for UINT8 datatype for additional data (prefetching and
output) by ImageInputOp
Reviewed By: ashwinb
Differential Revision: D9018964
fbshipit-source-id: f938a8a072c15c0ee521b2f16788c024b08cd37f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9855
Support production models with predictor benchmark
Two new flags are added:
`--update_prod`: pull production data (netdef, input types, input dims) from Hive and store locally
`--use_prod`: run benchmark with local production data with the same workload as in production.
By default, 300 models will be loaded.
production vs benchmark
avg net run time:
(collected by prod: https://fburl.com/scuba/6lb91zfx and bench: https://fburl.com/ngjj1dc8)
**prod: `408us` vs bench: `543us`**
(With prod data distribution, this should be even closer)
framework overhead (as of 2018-07-22):
prod:
```
9.111% BlackBoxPredictor::Run
4.602% SimpleNet::Run
2.377% Operator::Run
1.786% BlackBoxPredictor::AllocateMemory
1.372% Observable::StartAllObservers
1.358% Observable::StartObserver
1.206% Blob::GetMutable
```
bench:
```
8.577% BlackBoxPredictor::operator()
3.276% SimpleNet::Run
1.954% Operator::Run
1.697% BlackBoxPredictor::AllocateMemory
1.477% Tensor::ShareData
1.230% Blob::GetMutable
1.034% Observable::StartObserver
```
Reviewed By: yinghai
Differential Revision: D8942996
fbshipit-source-id: 27355d7bb5a9fd8d0a40195261d13a97fa24ce17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9581
Mostly to simplify code. Should also improve performance but order switch ops
don't take much time anyway.
Reviewed By: viswanathgs
Differential Revision: D8909766
fbshipit-source-id: 17a302d5bf4aba2755d88223fc01a41fd72c5919
Summary:
Follow up task of #9584.
Commit 1:
- change expect/cast to return shared pointers instead of raw pointer
- isSubtypeOf accept TypePtr instead. Use `x->isSubtypeOf(NumberType::get())` rather than `x->isSubtypeOf(*NumberType::get())`
Commit 2:
- to address enable_shared_from_this pitfalls, we make the constructor private and expose the factory method to make sure user can only create it using our factory method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9786
Reviewed By: zdevito
Differential Revision: D8980441
Pulled By: wanchaol
fbshipit-source-id: e5c923fc57a701014310e77cf29985b43bb25364
Summary:
This PR fixes #9743.
It adds backward-compatibility support for loading checkpoints saved with 0.3.* that contain 1-dim tensors which are now 0-dim tensors in 0.4+.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9781
Differential Revision: D8988196
Pulled By: ailzhang
fbshipit-source-id: a7a1bc771d597394208430575d5a4d23b9653fef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9891
Add an argument to benchmark binary to specify the seconds to sleep before the run and after the warmup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9880
Reviewed By: llyfacebook
Differential Revision: D9014254
Pulled By: sf-wind
fbshipit-source-id: d5566186c8ed768f1e170e9266c5f2d6077391e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9847
CTCBeamSearchDecoder and CTCGreedyDecoder do not currently support IDEEP
execution. Add fallback operators to allow IDEEP execution of models that use
these operators.
Reviewed By: yinghai
Differential Revision: D9006234
fbshipit-source-id: fc539ba67b07d1f960d28564d8adde0be8690649
Summary:
And let Gemm conversion to inspect the input `C` to try converting to FC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9870
Reviewed By: houseroad
Differential Revision: D9013198
Pulled By: bddppq
fbshipit-source-id: b4c509cfccca238262e1c406b004e66cef256321
Summary:
This is blocking the IR operator unification, because I need to be able to pass scalars to backward functions.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9763
Reviewed By: zou3519
Differential Revision: D8978457
Pulled By: apaszke
fbshipit-source-id: 570b4c3409322459cb0f2592069730a7d586ab20
Summary:
I don't think this file is used anywhere, I guess we'll find out!
(Weirdly this failed lint on one of my PRs even though it shouldn't).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9843
Differential Revision: D9003949
Pulled By: gchanan
fbshipit-source-id: 26d580d1e7cdd30e82e5f4176244e51fd7cd616d
Summary:
Pull Request resolved: https://github.com/facebookresearch/weakly-supervised-action-detection/pull/13
Pull Request resolved: https://github.com/pytorch/translate/pull/166
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9125
Closes https://github.com/pytorch/pytorch/pull/9125
Use inheritance for polymorphism, and remove template parameter
This is to change the templating in call sites, the core implementations will change later
Previously, the Caffe2 Tensor class was fixed at compile time to bind to a particular device/context. With this change, we make it a runtime property (stored inside the tensor), but preserve the same semantics. For example, one has to specify a device type in order to create a Tensor - there are no uninitialized tensors. More specifically, the changes are:
1. We added an extra argument *DeviceType* to most of the tensor constructors, e.g. Tensor(DeviceType type),
2. The semantics of the constructor Tensor(const Tensor<SrcContext>& src, ContextForCopy* context) have changed: the second context is passed in so we can call the templated Copy function. Previously it could be a different context than the source and target; now we enforce that, if it is provided, it has the same device type as src.
3. To preserve the 'get-or-construct' semantics of Blob, we added a specialized getter Blob::GetMutableTensor that verifies both that the Blob contains a Tensor and that it is of the correct type.
4. The Tensor type is no longer default-constructible (as we don't have unknown-device tensors), so some of the code handling STL containers needs to change.
Note: Some changes are postponed just to keep this diff a bit smaller. Please see `TODO`s.
Reviewed By: xw285cornell
Differential Revision: D8121878
fbshipit-source-id: 4a5e9a677ba4ac82095df959851a054c81eccf81
Summary:
The PR contains:
Fixes for running MIOpen conv operator in a multi worker scenario, along with a performance fix
Fixing a typo in MIOpen pool op and adding some extra checks for MIOpen spatial BN op
bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9842
Differential Revision: D9012512
Pulled By: bddppq
fbshipit-source-id: 270e1323c20fbfbc4b725f9a4ff34cd073ddaaa8
Summary:
I split it into two parts, _local_scalar and _local_scalar_dense (unchecked)
so I could reuse the sparse logic in both paths.
_local_scalar became a method on Tensor to work around a circular
include problem.
This is resurrected copy of #9652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9762
Differential Revision: D8972348
Pulled By: ezyang
fbshipit-source-id: 2232dbfc8e1286b8a4a1c67d285c13a7771aad4c
Summary:
We think this will band-aid some of the new Caffe2 test failures.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9830
Differential Revision: D9008052
Pulled By: ezyang
fbshipit-source-id: 84f1c0faea429d758d760965d6cbfe9e4c72eb19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9831
Follow up to D8980903 - replace dataIterator with nodeIterator where the data isn't used.
Reviewed By: pjh5
Differential Revision: D8998351
fbshipit-source-id: c333847ecd8b6d8075352322845839b94a63aecc
Summary:
https://github.com/pytorch/pytorch/pull/9755 broke this, but it was only tested if size zero dims were turned on (it can still happen even if that isn't turned on, because we support size [0] tensors).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9825
Differential Revision: D8997303
Pulled By: gchanan
fbshipit-source-id: 911dce112f73fad0f3980a7f4f9423df0f2d923d
Summary:
This was used to build Caffe2 Docker version 170.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9828
Differential Revision: D8997808
Pulled By: ezyang
fbshipit-source-id: f48938b2b71bc86578c9d9b46c281ed05478724e
Summary:
…o dim.
Manifest:
1) The scalar boolean is now in THTensor, although it isn't hooked up at the TH level yet.
2) setScalar is gone, everything now goes through the maybeScalar equivalent (which is renamed)
3) all "scalars" in this context now refer to "zero_dim" in order to differentiate this concept from the "Scalar" class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9783
Differential Revision: D8978911
Pulled By: gchanan
fbshipit-source-id: f09254be4bebad0e4c510fefe4158b4f7e92efe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9790
- Add way to check if a NodeRef is in a graph
- Make a nodeIterator (similar to dataIterator) but only iterate through nodes.
Reviewed By: bwasti
Differential Revision: D8980903
fbshipit-source-id: b20504a46715858752e25242303125a15a709b88
Summary:
Temporarily need this to prevent sccache from breaking when I move the sccache install into the Dockerfile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9810
Differential Revision: D8991684
Pulled By: Jorghi12
fbshipit-source-id: 14cd0278f53a72372f9bbe27b228980f8d3c1d4a
Summary:
The tutorials url with http is not valid, replacing it with https.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9812
Differential Revision: D8991344
Pulled By: ezyang
fbshipit-source-id: c12faa57905b50eadc320f9938c39c4139bd093b
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/6890. (backward pass for non-symmetric eigen-decomposition is not implemented in other packages, e.g. autograd, mxnet, tensorflow, presumably because the eigenvalues can be imaginary for the general case, and AFAIK we cannot support complex numbers).
This patch adds a backward function for the symmetric eigen-decomposition function `torch.symeig`. The formula used is taken from [here](http://eprints.maths.ox.ac.uk/1079/1/NA-08-01.pdf). Unit tests are added to verify correctness.
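For reference, one commonly used form of this gradient (a sketch assuming distinct eigenvalues, not copied from the patch; the patch follows the linked note and additionally symmetrizes the result since $A$ is symmetric) is
$$
\bar{A} = U\left(\operatorname{diag}(\bar{\lambda}) + F \circ \left(U^{\top}\bar{U}\right)\right)U^{\top},
\qquad
F_{ij} = \begin{cases} \dfrac{1}{\lambda_j - \lambda_i}, & i \neq j,\\[4pt] 0, & i = j, \end{cases}
$$
where $A = U \operatorname{diag}(\lambda) U^{\top}$, $\bar{\lambda}$ and $\bar{U}$ are the incoming gradients, and $\circ$ denotes the elementwise product.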
There is still one outstanding issue, which is how to handle the case where the `symeig` is called with `eigenvectors=False`. In this case, the eigenvectors are returned as a zero tensor, but the backward computation for the eigenvalues depends on the eigenvectors. There was a previous attempt to implement this in https://github.com/pytorch/pytorch/pull/2026, where apaszke mentioned that the `eigenvectors` argument should be overridden so that they are saved for the backwards pass. The forward code is autogenerated, though, and it isn't clear to me how that would be done. I'd appreciate any guidance. For now, there is a unit test that will fail until that issue is resolved.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8586
Reviewed By: ezyang
Differential Revision: D8872760
Pulled By: SsnL
fbshipit-source-id: 76614495d0f9c118fec163a428f32e5480b4d115
Summary:
The primary use-site of typeString was checked_cast_tensor.
I did a little more than I needed in this patch, to set
the stage for actually deleting the tensor type.
Specifically, I modified checked_cast_tensor to explicitly
take Backend and ScalarType, the idea being that once we
remove the tensor subclasses, we will delete the T template
parameter.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9764
Differential Revision: D8969196
Pulled By: ezyang
fbshipit-source-id: 9de92b974b2c28f12ddad13429917515810f24c6
Summary:
This implements the two-parameter Weibull distribution, with scale $\lambda$ and shape $k$ parameters as described on [Wikipedia](https://en.wikipedia.org/wiki/Weibull_distribution).
**Details**
- We implement it as a transformed exponential distribution, as described [here](https://en.wikipedia.org/wiki/Weibull_distribution#Related_distributions) and spelled out after the example below.
- The `weibull_min` variance function in scipy does not yet support a vector of distributions, so our unit test uses a scalar distribution instead of a vector.
Example of the bug:
```
>>> sp.stats.expon(np.array([0.5, 1, 2])).var() # fine
array([1., 1., 1.])
>>> sp.stats.weibull_min(c=np.array([0.5, 1, 2])).var() # buggy
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 490, in var
return self.dist.var(*self.args, **self.kwds)
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 1242, in var
res = self.stats(*args, **kwds)
File "/usr/local/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py", line 1038, in stats
if np.isinf(mu):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```
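For reference, the transformed-exponential construction mentioned in the first bullet is: if $E \sim \mathrm{Exponential}(1)$, then
$$X = \lambda\, E^{1/k} \sim \mathrm{Weibull}(\lambda, k),$$
since $P(X > x) = P\!\left(E > (x/\lambda)^{k}\right) = e^{-(x/\lambda)^{k}}$, which is exactly the Weibull survival function.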
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9454
Differential Revision: D8863574
Pulled By: SsnL
fbshipit-source-id: 1ad3e175b469eee2b6af98e7b379ea170d3d9787
Summary:
I got some tensor->variable conversion exceptions from `torch/csrc/autograd/variable.h`, which used the `TORCH_ASSERTM` macros instead of `AT_CHECK`, so they didn't have backtraces. This was such a substantial loss for debugability that I decided to update the whole codebase to use the backtrace-enabled ATen macros instead of `TORCH_ASSERT` and `JIT_ASSERT`, the latter having been an alias of the former.
ezyang apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9575
Differential Revision: D8924566
Pulled By: goldsborough
fbshipit-source-id: 7a4013b13eec9dbf024cef94cf49fca72f61d441
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9770
Add zero ops to operators that do not have a valid schema
Reviewed By: hlu1
Differential Revision: D8957472
fbshipit-source-id: d8d0a351183e88ace2e050a87c1e1c363af67e33
Summary:
Before I can rewrite portions of the c10d DDP in C++ I need proper tests in place to make sure I am not breaking anything as I port code. There were no tests for the c10d DDP in place so I wrote some.
I refactored the c10d tests to derive some tests cases from a general `MultiGPUTestCase` and followed lots of patterns from `test_distributed.py` w.r.t. how tests are skipped (such that the main process doesn't initialize CUDA, which I found is a super important detail!!!).
I am largely unfamiliar with this code so feel free to scrutinize. The DDP test code itself is also largely taken from `test_distributed.py` but more inlined which I find easier to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9670
Differential Revision: D8977724
Pulled By: goldsborough
fbshipit-source-id: 186eab38a72384d7992a2ec5c89f304ad42d5944
Summary:
Fixes: #9754
Maybe this could also make its way into 0.4.1; it is a severe debugging headache if you hit this...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9755
Reviewed By: ezyang
Differential Revision: D8967178
Pulled By: zou3519
fbshipit-source-id: 151ed24e3a15a0c67014e411ac808fb893929a42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9643
Current map interface assumes float data type, which is not always correct.
Reviewed By: kennyhorror
Differential Revision: D8455784
fbshipit-source-id: b94a31267760f7f97c15aa4b03008affc347fd10
Summary:
When building iOS apps with a caffe2 dependency, we were seeing the `caffe2/caffe2/mobile/contrib/ios/mpscnn/mpscnn.mm:33:17: error: method 'copyWithZone:' in protocol 'NSCopying' not implemented [-Werror,-Wprotocol]`. This fixes it by implementing a shallow copy with that method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9748
Reviewed By: jerryzh168
Differential Revision: D8954332
Pulled By: williamtwilson
fbshipit-source-id: 0cd44408257c0bd3f4ffb80312ea9d13d13e5ff3
Summary:
This can hardly be called an improvement (we now print
CPUFloatType instead of CPUFloatTensor) but it was the
simplest way I could think of to devirtualize this function in
the short term. We probably need some sort of native function
that gives string information about a tensor.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Approved in #9710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9758
Differential Revision: D8966935
Pulled By: ezyang
fbshipit-source-id: a4641affe0a6153f90cdd9f4f2a1100e46d1a2db
Summary:
Not in the same format. Skip at the moment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9751
Reviewed By: yinghai
Differential Revision: D8965636
Pulled By: houseroad
fbshipit-source-id: 81d39c2f5625c14c0e1ee11408b5f7267b53798f
Summary:
ebetica made me aware that `nn::Module::clone()` always clones to the current device (usually CPU) instead of preserving the device of each parameter. This PR changes the signature of `clone` from
`shared_ptr<Module> clone()`
to
`shared_ptr<Module> clone(optional<Device> device = nullopt)`
with semantics of:
1. If a `device` is given, all parameters/buffers are moved to that device,
2. If no `device` is supplied (default), parameters/buffers retain their device.
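A minimal usage sketch of the new signature (the module, device spelling, and `torch::Device` construction here follow the current C++ API and are assumptions for this point in the tree):
```cpp
#include <torch/torch.h>

void clone_example() {
  torch::nn::Linear linear(4, 5);

  // (2) No device given: the clone's parameters stay on their current device.
  auto same_device_clone = linear->clone();

  // (1) Device given: the clone's parameters/buffers are moved to that device.
  auto cuda_clone = linear->clone(torch::Device(torch::kCUDA, 0));

  (void)same_device_clone;
  (void)cuda_clone;
}
```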
ezyang apaszke ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9609
Differential Revision: D8957367
Pulled By: goldsborough
fbshipit-source-id: 0d409ae645ed2b8d97d6fc060240de2f3d4bc6c8
Summary:
I renamed the variable in the `Embedding` module from `weight` to `table` a few months ago, because it seemed like a more meaningful name. Turns out it's not such a good idea because it deviates from PyTorch, which unnecessarily breaks C++->Python translated code.
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9720
Differential Revision: D8955647
Pulled By: goldsborough
fbshipit-source-id: 77228b07d2b733866e8cdecaa6d0686eef4cc3ea
Summary:
The underlying reason why this is even an issue is that the conversion
into and out of the 'fictional' onnx operators is done in an unhygienic
order. This doesn't address that, but it does fix the one observable
case where this produces an incorrect result, and unblocks some other
work being done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9657
Differential Revision: D8940824
Pulled By: anderspapitto
fbshipit-source-id: ea827a24c85447fe4ae470336a746329598eee84
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9718
This patch switches the interpreter to use IValue's primitive numbers rather than tensors for computing on integers and floats. In addition to preparing the interpreter for first-class support of other types, this cleans up the handling of primitive numbers, making it possible to just use the normal operator overloading dispatch to find the right implementation for numbers. As a result of this change, a lot of other functionality needed to be updated since it was the first time we use non-tensors in a lot of places in the code base.
Notes:
* Fixes code_template.py so that multi-line strings are indented correctly when used on a standalone line
* Cast operators (`int(x)`) now are functional. Some tests have additional conversions to integers because
we no longer allow implicit tensor -> integer conversions following the same convention as in python
* prim::ListConstruct/createList has been added to the interpreter for creating lists and this has
replaced aten::stack for integer lists
* gen_jit_dispatch.py has been refactored so that non-tensor types use operators on IValues to extract
the primitives
* IValue gains a .to<T> method that is the equivalent of tensor_as but for IValue instead of at::Tensor
* `constant_as<T>` is switched over to using IValues's `.to<T>` method, to make conversion from constant->IValue->C++ type
more consistent. This functionality combined with `toIValue(Value*)` replaces the `tensor_as` and `as_tensor` family of functions.
* conditional expressions (if, loop) and operators related to them are now computed on integers rather than tensors
* IValue gains constructors for constructing from at::Scalar and converting to it. However, IValue itself will always store
the scalars as a double or int64.
* To align with python 3 syntax, TK_INT, TK_FLOAT, and TK_BOOL have been removed from the parser, and int/float/bool are just treated as special identifiers in the compiler,
along with print. These are represented as special sugared values with a `call` method implemented. For int/float/bool this implements casting behavior.
* Dropped shared_from_this from Type/Module. They were not needed and they making debugging harder because they internally throw/catch exceptions.
* Shape propagation has been updated to support running nodes that include floating point primitive types, this required some refactoring of internal functions.
* TensorToNum and NumToTensor have actual implementations as operators now
* register_prim_ops now contains implementations of math operators for float/int primitive types, and for mixed (prim <+> tensor) versions. This removes the need for special handling in compiler.cpp
* Primitive math is now entirely handled by letting the compiler choose the right overloads. This removes tons of special casing in the compiler.
* incorporates eellison's change to allow casting from return values. Due to the addition of primitive support, the code needed slight modifications, so I just pre-merged it here.
* stack.h gains generic vararg versions of push/pop that know how to convert to/from C++ types:
```
at::Tensor a;
at::Scalar b;
pop(stack, a, b);
at::Tensor c = a + b;
push(stack, c);
```
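A hedged usage sketch of the IValue conveniences described above (the header path and namespace are assumptions for this point in the tree; method names follow the summary):
```cpp
#include <torch/csrc/jit/ivalue.h>

void ivalue_example() {
  torch::jit::IValue iv(3.0);    // primitive numbers are stored directly as double/int64
  double d = iv.to<double>();    // .to<T> replaces the tensor_as family for constants
  at::Scalar s = iv.toScalar();  // conversion to/from at::Scalar
  (void)d;
  (void)s;
}
```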
apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9584
Reviewed By: apaszke
Differential Revision: D8910546
Pulled By: zdevito
fbshipit-source-id: 0f3e60d4d22217f196a8f606549430e43b7e7e30
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9667
MKL-DNN doesn't support 64-bit integers (cfee61bf81/include/mkldnn_types.h (L62-L75)), so force-converting from `TensorCPU<long>` to an `s32` Ideep tensor will cause memory issues. This diff gives an alternative solution, where we just fall through to TensorCPU. The reasoning is that since MKL-DNN doesn't support 64-bit integer tensors, downstream ops have to be in CPUContext, so there is no reason to force-convert to an ideep tensor and back.
Reviewed By: pjh5
Differential Revision: D8943544
fbshipit-source-id: f514903cda27e34b8887271c9df56c8220895116
Summary:
This is a modification of the strategy from https://github.com/pytorch/pytorch/pull/8919 and https://github.com/pytorch/pytorch/pull/9579.
```
Previously, the CPU architecture-specific kernels self-registered with
the DispatchStub. When linking as part of a static library, this requires
the flag --whole-archive to be passed to the linker to ensure that the
object files for the kernels are included. Caffe2 and TensorFlow use that
strategy.
We ran into some issues with --whole-archive blowing up the binary size
of some downstream projects in Facebook. This PR avoids --whole-archive
for CPU kernels. The downside is that the generic code needs to be aware
of whether kernels are compiled with AVX and with AVX2 (via
HAVE_AVX_CPU_DEFINITION and HAVE_AVX2_CPU_DEFINITION).
The CUDA kernels still self-register with DispatchStub because the CPU
library is not aware of whether the CUDA library will be available at
runtime.
There are a few major changes to DispatchStub
- The environment variable ATEN_CPU_CAPABILITY overrides the CPU
capability detection code (Previous ATEN_DISABLE_AVX/AVX2)
- DispatchStub is defined in the generic native code instead of the
CPU_CAPABILITY_DEFAULT kernel.
```
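A simplified, self-contained sketch of the selection logic described above (not ATen's actual DispatchStub code; the function names are made up, while HAVE_AVX2_CPU_DEFINITION and ATEN_CPU_CAPABILITY come from the summary):
```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

using add_fn = void (*)(float* out, const float* in, int64_t n);

static void add_kernel_default(float* out, const float* in, int64_t n) {
  for (int64_t i = 0; i < n; ++i) out[i] += in[i];
}

#ifdef HAVE_AVX2_CPU_DEFINITION
// Would live in a file compiled with -mavx2; referenced explicitly by the
// generic code instead of relying on static self-registration, so no
// --whole-archive is needed.
void add_kernel_avx2(float* out, const float* in, int64_t n);
#endif

add_fn choose_add_kernel() {
  // ATEN_CPU_CAPABILITY overrides runtime CPU capability detection.
  const char* cap = std::getenv("ATEN_CPU_CAPABILITY");
  const bool force_default = cap && std::strcmp(cap, "default") == 0;
#ifdef HAVE_AVX2_CPU_DEFINITION
  if (!force_default) {
    return add_kernel_avx2;  // real code would also check cpuinfo at runtime
  }
#endif
  return add_kernel_default;
}
```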
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9664
Differential Revision: D8943350
Pulled By: colesbury
fbshipit-source-id: 329229b0ee9ff94fc001b960287814bd734096ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9717
D8722560 was landed with some build errors, unfortunately the c10 code isn't part of contbuild yet.
Fixing them.
Differential Revision: D8954141
fbshipit-source-id: 2a082fb8041626e45ccd609f37a8ef807f6dad8a
Summary:
This is to simplify the data format during benchmarking. After this change, we can use the same benchmarking harness data conversion method to parse data from multiple binaries.
This change should be coordinated with the PR: https://github.com/facebook/FAI-PEP/pull/63
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9555
Reviewed By: pjh5
Differential Revision: D8903024
Pulled By: sf-wind
fbshipit-source-id: 61cabcff99f0873729142ec6cb6dc230c685d13a
Summary:
This pull request implements the low-rank multivariate normal distribution, where the covariance matrix has the form `W @ W.T + D`. Here D is a diagonal matrix and W has shape n x m with m << n. It uses the "matrix determinant lemma" and the "Woodbury matrix identity" to save computational cost.
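For reference, with diagonal $D \in \mathbb{R}^{n \times n}$ and $W \in \mathbb{R}^{n \times m}$, the two identities are
$$\det\!\left(W W^{\top} + D\right) = \det\!\left(I_m + W^{\top} D^{-1} W\right)\det(D)$$
and
$$\left(W W^{\top} + D\right)^{-1} = D^{-1} - D^{-1} W \left(I_m + W^{\top} D^{-1} W\right)^{-1} W^{\top} D^{-1},$$
so the log-density only needs to factor an $m \times m$ matrix instead of an $n \times n$ one.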
Along the way, I also revised the MultivariateNormal distribution a bit. Here are the other changes:
+ `torch.trtrs` works with cuda tensor. So I tried to use it instead of `torch.inverse`.
+ Use `torch.matmul` instead of `torch.bmm` in `_batch_mv`. The former is faster and simpler.
+ Use `torch.diagonal` for `_batch_diag`
+ Reimplement `_batch_mahalanobis` based on `_batch_trtrs_lower`.
+ Use trtrs to compute term2 of KL.
+ `variance` relies on `scale_tril` instead of `covariance_matrix`
TODO:
- [x] Resolve the fail at `_gradcheck_log_prob`
- [x] Add test for KL
cc fritzo stepelu apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8635
Differential Revision: D8951893
Pulled By: ezyang
fbshipit-source-id: 488ee3db6071150c33a1fb6624f3cfd9b52760c3
Summary:
…unctions.
This also unifies the error checking between scatter/scatterAdd on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9658
Differential Revision: D8941527
Pulled By: gchanan
fbshipit-source-id: 750bbac568f607985088211887c4167b67be11ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9683
This pops off `refcount_`, `storage_`, `storage_offset_`; there are now no more direct accesses to these fields and we can make them private (with appropriate friending).
Stacked on #9561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9591
Reviewed By: SsnL
Differential Revision: D8922246
Pulled By: ezyang
fbshipit-source-id: dfae023d790e29ce652e2eab9a1628bbe97b318d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9665
In data_parallel_model, we isolate the synchronizing barrier init net into its own net, separate from the param_init_net, so that we have finer-grained control over the barrier net.
Reviewed By: andrewwdye
Differential Revision: D8375389
fbshipit-source-id: ce0c8c1c8e4bd82b7078a1b07abaced3f149d578
Summary:
**REVIEW LAST COMMIT ONLY**
As discussed in our yesterday's meeting. Nodes can be now matched to particular overloads using the `matches(...)` function:
```cpp
n->matches("aten::type_as(Tensor self, Tensor other) -> Tensor")
```
This also changes the shape prop and peephole passes to use those functions for matching. This fixes a few bugs, makes them much more robust, and prepares us for removal of attributes.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9567
Reviewed By: zdevito
Differential Revision: D8938482
Pulled By: apaszke
fbshipit-source-id: eb2382eeeae99692aada2d78d5d0c87c8ef1545e
Summary:
This PR contains the change for explicit conversion between ushort and __half required for ROCm 1.8.2 support
bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9663
Differential Revision: D8943937
Pulled By: bddppq
fbshipit-source-id: 16102f9dbc68ed4ece2e8fc244825c3992c24901
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9637
Adding a method to run plan in background. The intended use is to run BlueWhale's data reading & preprocessing net in background while the GPU is training.
Reviewed By: MisterTea
Differential Revision: D8906439
fbshipit-source-id: b1c73ca7327e2d87a8f873924e05ab3d161a3f1e
Summary:
ezyang noticed that the CUDAStream files lived under ATen/ despite being CUDA-specific, and suggested porting them to ATen/cuda and exposing them with a new CUDAContext. This PR does that. It also:
- Moves ATen's CUDA-specific exceptions for ATen/cudnn to ATen/cuda for consistency
- Moves getDeviceProperties() and getCurrentCUDASparseHandle() to CUDAContext from CUDAHooks
The separation between CUDAContext and CUDAHooks is straightforward. Files that are in CUDA-only builds should rely on CUDAContext, while CUDAHooks is for runtime dispatch in files that can be included in CPU-only builds. A comment in CUDAContext.h explains this pattern. Acquiring device properties and CUDA-specific handles is something only done in builds with CUDA, for example, so I moved them from CUDAHooks to CUDAContext.
This PR will conflict with #9277 and I will merge with master after #9277 goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9435
Reviewed By: soumith
Differential Revision: D8917236
Pulled By: ezyang
fbshipit-source-id: 219718864234fdd21a2baff1dd3932ff289b5751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9636
Make sure that the blobs are registered to the net
Reviewed By: pjh5
Differential Revision: D8924883
fbshipit-source-id: f09422a2d4d5ba8bf6cfbfd00172097b5ab1fcd6
Summary:
In the repr function of the LPPoolNd(..) class, there was a missing '=' (`kernel_size{kernel_size}`).
Link to line in the code: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/pooling.py#L694
Original:
return 'norm_type={norm_type}, kernel_size{kernel_size}, stride={stride}, ' \
'ceil_mode={ceil_mode}'.format(**self.__dict__)
Fixed:
return 'norm_type={norm_type}, kernel_size={kernel_size}, stride={stride}, ' \
'ceil_mode={ceil_mode}'.format(**self.__dict__)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9629
Differential Revision: D8932913
Pulled By: soumith
fbshipit-source-id: 9030dff6b14659b5c7b6992d87ef53ec8891f674
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9598
The "max_length" should be passed to UnPackSegmentsOp if "max_length" is given when calling PackSegmentsOp.
Reviewed By: jerryzh168
Differential Revision: D8919799
fbshipit-source-id: 8c97aa717b69177b8a5d5d56892817d488853840
Summary:
This PR adds machinery to cache the schema in an IR node, and allows lookups of (possibly) constant inputs by their names (instead of position). The new methods are:
- `at::optional<T> get<T>(Symbol name)` - if the argument called name is a constant, then casts it to type `T` and returns it. If it's not constant returns `nullopt`. Raises an error if there's no argument with that name.
- `at::optional<IValue> get(Symbol name)` - like above, but packs the result in an IValue
- `Value* getValue(Symbol name)` - retrieves a `Value*` for an argument (no need to know its position).
All above functions currently inspect the attributes as well, but that's only so that I could start using them in other places in the JIT without disrupting our current functionality. I wanted this diff to be a preparation that doesn't change the semantics too much, and so both the tracer and script create nodes with attributes. The next PR will put that to a stop, and hopefully the changes we need to make to other components will be simpler thanks to what I did here.
One more thing I'd like to do before actually stopping creating the non-attributed nodes is to have a convenient way of creating a schema programmatically, matching nodes against it, and creating them without having to pack inputs into flat argument lists (which is quite error prone).
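A hedged usage sketch of the lookup-by-name methods listed above (the attribute symbols and the node being inspected are just for illustration; method names follow the summary):
```cpp
#include <torch/csrc/jit/ir.h>

void inspect_node(torch::jit::Node* node) {
  using namespace torch::jit;
  // If the "alpha" argument of this node is a constant, read it out directly.
  if (auto alpha = node->get<at::Scalar>(attr::alpha)) {
    // *alpha holds the constant value.
    (void)*alpha;
  }
  // Look an argument up by name instead of by position.
  Value* other = node->getValue(attr::other);
  (void)other;
}
```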
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9505
Reviewed By: ezyang
Differential Revision: D8915496
Pulled By: apaszke
fbshipit-source-id: 39d14fc9a9d73d8494f128367bf70357dbba83f5
Summary:
This fix will prevent errors like (found in `bincount`)
```
RuntimeError: %s not implemented for '%s'bincounttorch.FloatTensor
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9625
Differential Revision: D8932945
Pulled By: soumith
fbshipit-source-id: 794e3b58d662779402ab318e274661826a5db8b2
Summary:
fixes #4176 cc vishwakftw
I didn't do `:math:` and `\neg` because I am using double ticks so they render more similarly with `:attr:`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9630
Differential Revision: D8933022
Pulled By: SsnL
fbshipit-source-id: 31d8551f415b624c2ff66b25d886f20789846508
Summary:
As in the title. Lets us simplify a lot of code.
Depends on #9363, so please review only the last commit.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9414
Reviewed By: zdevito
Differential Revision: D8836496
Pulled By: apaszke
fbshipit-source-id: 9b3c3d1f001a9dc522f8478abc005b6b86cfa3e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9501
Added a new stat value to log static states like CPU and memory usage.
Reviewed By: pjh5
Differential Revision: D8872254
fbshipit-source-id: 469e94cab99029a3da55f8986dddeadac076e2a8
Summary:
This PR adds the functional version of `DataParallel` (i.e. `data_parallel`) to the C++ frontend.
For this, I had to:
1. Add "differentiable" versions of scatter and gather, which perform their inverse operation in the backward pass, to C++. I've added them under `torch/csrc/autograd/functions/comm.{h,cpp}`. I had to move some utilities from `VariableType.cpp` into `torch/csrc/autograd/functions/utils.h`, and changed them a bit to fix the `const_cast`s for which there were `TODO`s,
2. Implement the `replicate`, `parallel_apply` and the combining `data_parallel` functions in C++.
`replicate` is implemented based on our existing `clone()` interface, along with the ability to set the current device via `at::OptionsGuard` (so nice).
`parallel_apply` is implemented using `at::parallel_for` (CC cpuhrsch) and [follows the code from PyTorch](https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/parallel_apply.py).
Added lots of tests for these things.
apaszke ezyang ebetica colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9234
Differential Revision: D8865182
Pulled By: goldsborough
fbshipit-source-id: 4f1fecf2b3f3bc1540c071dfb2d23dd45de433e4
Summary:
In our pimpl system, default constructing a module holder default constructs the contained module. This means `Linear linear;` is ill-formed, since `Linear` doesn't have a default constructor. Instead we require `Linear linear = nullptr;` to get the empty state of the `Linear`. This PR makes the error message for the ill-formed case nicer.
I had to change the forwarding constructors of most of our modules for this, but that's a minor adjustment.
E.g.
```
Linear linear;
In file included from /home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:5:0,
from /home/psag/pytorch/pytorch/test/cpp/api/module.cpp:3:
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/pimpl.h: In instantiation of ‘torch::nn::ModuleHolder<Contained>::ModuleHolder() [with Contained = torch::nn::LinearImpl]’:
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/dropout.h:45:1: required from here
/home/psag/pytorch/pytorch/torch/csrc/api/include/torch/nn/pimpl.h:46:5: error: static assertion failed: You are trying to default construct a module which has no default constructor. Use = nullptr to give it the empty state (like an empt
y std::shared_ptr).
static_assert(
```
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9565
Differential Revision: D8903666
Pulled By: goldsborough
fbshipit-source-id: 5e6b788921a27a44359db89afdc2b057facc5cec
Summary:
This is a few files taken from https://github.com/pytorch/pytorch/pull/8919. They're unchanged from the latest versions of that PR.
```
This is part of https://github.com/pytorch/pytorch/pull/8919. It's
separated to make it easier to merge the PR in pieces.
There are a few major changes to DispatchStub
- The environment variable ATEN_CPU_CAPABILITY overrides the CPU
capability detection code (Previous ATEN_DISABLE_AVX/AVX2)
- DispatchStub is defined in the generic native code instead of the
CPU_CAPABILITY_DEFAULT kernel.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9579
Differential Revision: D8909000
Pulled By: colesbury
fbshipit-source-id: fdeb606270b06acdab3c01dba97ec9d81584ecc0
Summary:
* THTensor now stores `sizes_` and `strides_` which is a `std::vector<int64_t>`
* Anywhere a "public" API function made use of a int64_t* of sizes, I opted to just finagle it out of the tensor using THTensor_getSizePtr rather than try to rewrite all of these sites to use ArrayRef. They should use ArrayRef eventually, but not yet.
* There are new utility functions for resizing sizes/strides in one go (THTensor_resizeDim), or replacing sizes and strides with completely new values (THTensor_setSizesAndStrides)
* Anywhere you said `t->size[n] = 0`, we now say `THTensor_setSizeAt(t, n, 0)`, ditto for strides
* Anywhere you said `t->size[n]`, we now say `t->size(n)` (coming soon: ditto for strides)
Previous review of just the `std::vector` change in #9518, but I'm planning to merge this all in one go.
Note for gchanan: review from commit "ci" and after
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9561
Reviewed By: cpuhrsch
Differential Revision: D8901926
Pulled By: ezyang
fbshipit-source-id: 483cf275060ab0a13845cba1ece39dd127142510
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9594
When the input vector is a zero vector, the previous GPU code gives NaN in the backward pass. We fix this.
Reviewed By: pjh5
Differential Revision: D8849732
fbshipit-source-id: 87b1fb1ee05dfdb0d43bcbe67e36f15896fe1706
Summary:
The underlying reason why this is even an issue is that the conversion
into and out of the 'fictional' onnx operators is done in an unhygienic
order. This doesn't address that, but it does fix the one observable
case where this produces an incorrect result, and unblocks some other
work being done.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9593
Differential Revision: D8919125
Pulled By: anderspapitto
fbshipit-source-id: a88ca979c3b9d439863e223717d3697180c26121
Summary:
This is mainly straightforward, with two exceptions:
1) cublasSgemv, cublasDgemv appear to have a bug where (x,0).mv(0) does not handle beta, whereas cublasSgemm, cublasDgemm do for case where (x,0).mm(0,y). This is handled by manually calling zero / mul.
2) I fixed a bug in btrifact that was broken even when dealing with non-empty tensors. Basically, if out.stride(0) was 1, because the underlying BLAS call expects column-major matrices, to get a column-major tensor, out.transpose_(0, 1) would be called. But this is just wrong, as if the batch dimension (0) doesn't match the size of the columns (1), you don't even have a tensor of the correct shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9573
Reviewed By: ezyang
Differential Revision: D8906144
Pulled By: gchanan
fbshipit-source-id: de44d239a58afdd74d874db02f2022850dea9a56
Summary:
0. Fixes #9479
1. rewrites `as_strided` as a native function. This is fine because `set_` does the scalar check.
2. allow using `self` in `python_default_init`. Previously `python_variable_methods.cpp` had `self` as an input `PyObject *` and used `self_` as the unpacked tensor, but `python_torch_functions.cpp` just used `self` as the unpacked tensor, making it impossible to use `self` in `python_default_init`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9538
Differential Revision: D8894556
Pulled By: SsnL
fbshipit-source-id: ca7877b488e12557b7fb94e781346dcb55d3b299
Summary:
The goal of this PR is to add an infrastructure; to convert(hipify) CUDA ops into [HIP](https://github.com/ROCm-Developer-Tools/HIP) ops , at **compile** time.
Note that HIP ops, which are portable c++ code, can run on AMD and NVIDIA platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9322
Differential Revision: D8884707
Pulled By: bddppq
fbshipit-source-id: dabc6319546002c308c10528238e6684f7aef0f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9509
generate_proposals_op_util_nms.h conditionally requires OpenCV in some cases,
and earlier this was checking just CV_MAJOR_VERSION macro, but that is
undefined unless opencv.hpp is included. Adding `-DCAFFE2_USE_OPENCV` to
TARGETS when opencv is included in external_deps to check for this correctly.
Thanks jinghuang for flagging this issue!
Differential Revision: D8880401
fbshipit-source-id: 65abbcf4ffe3feffc0ee2560882cb8eb0b7476f9
Summary:
This is the first step of refactoring the Predictor. In this diff the config struct
is introduced and the internal data structure of Predictor has been updated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9434
Differential Revision: D8843262
Pulled By: fishbone
fbshipit-source-id: 23f5e4751614e3fedc9a04060d69331bfdecf864
Summary:
Prior to this diff, there have been two ways of compiling the bulk of the torch codebase. There was no interaction between them - you had to pick one or the other.
1) with setup.py. This method
- used the setuptools C extension functionality
- worked on all platforms
- did not build test_jit/test_api binaries
- did not include the C++ api
- always included python functionality
- produced _C.so
2) with cpp_build. This method
- used CMake
- did not support Windows or ROCM
- was capable of building the test binaries
- included the C++ api
- did not build the python functionality
- produced libtorch.so
This diff combines the two.
1) cpp_build/CMakeLists.txt has become torch/CMakeLists.txt. This build
- is CMake-based
- works on all platforms
- builds the test binaries
- includes the C++ api
- does not include the python functionality
- produces libtorch.so
2) the setup.py build
- compiles the python functionality
- calls into the CMake build to build libtorch.so
- produces _C.so, which has a dependency on libtorch.so
In terms of code changes, this mostly means extending the cmake build to support the full variety of environments and platforms. There are also a small number of changes related to the fact that there are now two shared objects - in particular, windows requires annotating some symbols with dllimport/dllexport, and doesn't allow exposing thread_local globals directly.
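The Windows symbol annotation mentioned above typically looks like the following (a generic sketch; PyTorch's actual macro and define names may differ):
```cpp
// In a shared header:
#ifdef _WIN32
#  ifdef BUILDING_LIBTORCH            // defined when compiling libtorch itself
#    define TORCH_API __declspec(dllexport)
#  else                               // consumers such as the _C extension import it
#    define TORCH_API __declspec(dllimport)
#  endif
#else
#  define TORCH_API
#endif

// In libtorch:
TORCH_API int torch_answer();         // symbol is visible across the DLL boundary
```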
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8792
Reviewed By: ezyang
Differential Revision: D8764181
Pulled By: anderspapitto
fbshipit-source-id: abec43834f739049da25f4583a0794b38eb0a94f
Summary:
THCStream was recently moved to ATen by mruberry: https://github.com/pytorch/pytorch/pull/8997. This PR now introduces a guard class that replaces `AutoStream` from `torch/csrc/` and also uses this new stream interface.
I had to extend the `CUDAStream` interface with unchecked calls, so that we can reset the stream without throwing an exception in the guard's destructor.
colesbury apaszke ezyang
Fixes https://github.com/pytorch/pytorch/issues/7800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9277
Differential Revision: D8865183
Pulled By: goldsborough
fbshipit-source-id: 67c9bc09629d92fa5660286b5eec08fde9108cd7
Summary:
….txt setting
In the ROCm branches we will experiment with turning this on.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9543
Differential Revision: D8897990
Pulled By: ezyang
fbshipit-source-id: ae9d25d1b79ee421d49436593edf8c7e49b3a4e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9438
Current implementation of create_from_proto doesn't work as expected: it
duplicates networks and execution steps by copying original PlanDef first and
adding each step one-by-one later.
Reviewed By: pjh5
Differential Revision: D8850316
fbshipit-source-id: 9b02836d6e6ee1c91cfdd3b4c4804f14137dc22b
Summary:
The purpose of this config is to make sure that CircleCI builds
don't fail when I turn them on for pytorch/pytorch.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9537
Differential Revision: D8894497
Pulled By: ezyang
fbshipit-source-id: 22f43c84a9b8a54cd47a6572ba068f70a73f043a
Summary:
Fix RoIAlignOp GPU implementation for RoIs without batch index
According to https://caffe2.ai/docs/operators-catalogue.html#roialign, RoIs is "2D input of shape (R, 4 or 5)"
Pass RoIs 2nd dimension as kernel parameter and adjust kernel accordingly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9230
Reviewed By: houseroad
Differential Revision: D8886798
Pulled By: malfet
fbshipit-source-id: 52a8b4df85f7e350e36c842ee4428f3a1cba2588
Summary:
Fix gatherTopK template
This change makes it possible to instantiate gatherTopK() with an IndecesType other than caffe2::TIndex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9231
Reviewed By: houseroad
Differential Revision: D8886778
Pulled By: malfet
fbshipit-source-id: d5fb1f8814710cd81bc0cf65e0f96fd9fd8317da
Summary:
…CPU LAPACK routines.
Note that the LAPACK functions in general require a different approach, because direct calls with size zero dims do not work.
Here I just selected a reasonable subset of LAPACK routines to support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9522
Reviewed By: ezyang
Differential Revision: D8888180
Pulled By: gchanan
fbshipit-source-id: 16b9013937806d375d83d1c406815765fda00602
Summary:
A 0-dimensional tensor is now returned when squeezing a tensor with a single element.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9529
Differential Revision: D8893103
Pulled By: soumith
fbshipit-source-id: 658189ecfff283b2b7281feb16a397692d6dbd8f
Summary:
This PR contains the ROCm contributions of last week:
* documentation of pyHIPIFY data format originating from #8812 reviewing comments by ezyang
* removal of most patch files from the `amd_build` directory and integration into the code base
* enabling of previously disabled_features that do compile now
* improvement to the static_cast feature in pyHIPIFY (it will only apply static_cast to kernel arguments, not launch arguments)
* addition of two workarounds to pyHIPIFY for ROCm/HIP shortcomings: a) `__forceinline__` does not imply `static`, hence change to `__inline__`, b) `std::[exp,log,pow]` math functions cannot be selected in device code, use `::[exp,log,pow]` instead. Both of these workarounds will be removed once the issues are fixed upstream. Neither of these issues have surfaced on the CI but were reproduced internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9432
Differential Revision: D8887441
Pulled By: ezyang
fbshipit-source-id: 71cf5c6b13772a66d10be369a45ebf06e4e268e1
Summary:
This command (suggested by albanD when I raised a related question in the PyTorch Slack) is super useful to me. I have used it several times and it has worked like a charm (without it, I would have to delete the entire pytorch folder and clone everything again). So I guess it is nice to have in the CONTRIBUTING doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9524
Differential Revision: D8890126
Pulled By: soumith
fbshipit-source-id: c1798ff1ab2423627fcd8e0662a66c4e85cb2413
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9520
Add random data filler to predictor bench to support production nets
Reviewed By: salexspb
Differential Revision: D8712757
fbshipit-source-id: 2c732b2ba71ab210f9222adf94d08442ca71dc03
Summary:
- I ran into this a couple of days ago and thought it might be useful to note it here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9504
Reviewed By: soumith
Differential Revision: D8887396
Pulled By: weiyangfb
fbshipit-source-id: d2061cf379ce140d6e43ef6c18241f7ce00dbab6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9458
The goal is to support count_include_pad in Caffe2 ONNX backend. This commit contains the first step - support 4-D tensor cases.
AveragePool with count_include_pad can be expressed as PadImage + AveragePool.
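As a concrete example: for a 3x3 average pool with padding 1 evaluated at a corner of the image, only 4 input elements fall inside the image. With count_include_pad the window sum is divided by 9 (the full kernel size), which is exactly what zero-padding the input first (PadImage) and then running an AveragePool with no padding computes; without count_include_pad, the sum is divided by 4.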
Reviewed By: houseroad
Differential Revision: D8852180
fbshipit-source-id: 4db00e9771be7a000a2d92850dfd066d9c9c38bf
Summary:
If this is good, I could write some tests to ensure collision doesn't occur within a given range.
Closes #7228
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9246
Differential Revision: D8872608
Pulled By: ezyang
fbshipit-source-id: 0ed29a73188f4167b42756f59a5c9a3d5cb37326
Summary:
It implements per-channel alpha_dropout. It also creates corresponding function classes and unifies the process of dropout and alpha_dropout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9073
Differential Revision: D8727008
Pulled By: ezyang
fbshipit-source-id: 9d509f9c5db4e98f7b698cdfc4443505a4d2b331
Summary:
This is enabled by the allocator patch; previously we could not
deduplicate THStorage_free/THCStorage_free; now we can.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9495
Reviewed By: SsnL
Differential Revision: D8875497
Pulled By: ezyang
fbshipit-source-id: 387198dff446eb9f84d2d6187066fae1d595dea7
Summary:
ebetica asked for a way to add parameters to `Optimizer`s after they are created.
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9472
Differential Revision: D8872176
Pulled By: goldsborough
fbshipit-source-id: 39a4032c519a6d3b458dd3596361b04afea10365
Summary:
…ors (CPU).
This includes (mainly) CPU fixes; CUDA fixes are a little more involved because you can't use an empty grid.
This also includes a fix for index_copy, which checked that self.size(dim) == src.size(0), which isn't correct (the same dimension should be compared).
Finally, also includes a fix for CUDA flip (although it's not tested yet), to get the stride using multiplication rather than division to avoid divide-by-0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9482
Reviewed By: ezyang
Differential Revision: D8873047
Pulled By: gchanan
fbshipit-source-id: 86523afd3d50277834f654cd559dfbc7875cdffe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9480
Ops like Reshape sometimes take a second input tensor of longs holding the new
shape (it can also be specified as an arg). If this input tensor is passed in via
an external input (which ONNX does sometimes), LoadOp fails with an exception.
Such ops are executed by IDEEPFallbackOp anyway, so this should be fine.
Reviewed By: yinghai
Differential Revision: D8872671
fbshipit-source-id: 659a02416c374e373ce041a7d65a174be828702d
Summary:
It was only used to toggle refcounting, but we ALWAYS
refcount tensors.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9494
Differential Revision: D8875169
Pulled By: ezyang
fbshipit-source-id: 3a8618fb288334e62942bbaf388f3c9e473e7524
Summary:
This issue was fixed in 976f9253a5425918eda7cf865b097cf42b5da8d7
Fixes #5311.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9498
Differential Revision: D8875605
Pulled By: ezyang
fbshipit-source-id: 449ffe975d35c959f92874437ba9be37d4d3a1f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9497
Fixes #7883 by using `rfft`.
It's worth noting that this is BC breaking. And it's impossible to detect the change because the two signatures before and after this change support a common subset of calling patterns, e.g., `stft(Tensor, int, int)`. (Some other calling patterns will raise an error.)
soumith and I plan to change the current `stft` interface because it is a bit messy and non-standard. rafaelvalle suggested that `librosa` is a good reference API to align with. After discussing with soumith and ezyang, and given that `stft` has only been out for one release, I decided to change the signature directly. Also, my understanding is that most researchers in this field will welcome this change as `librosa` seems to be the gold standard here. (It doesn't yet support all `pad_mode` values, but those will become available if added to `F.pad`.)
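For illustration, a sketch of a librosa-style call under the new signature (argument names follow the librosa convention mentioned above):
```python
import torch

signal = torch.randn(2, 16000)        # (batch, samples)
window = torch.hann_window(400)
# n_fft / hop_length / win_length / window mirror librosa.stft's arguments.
spec = torch.stft(signal, n_fft=400, hop_length=160,
                  win_length=400, window=window)
# Result is real-valued with a trailing dimension of 2 (real/imaginary parts).
print(spec.shape)
```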
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9308
Reviewed By: ezyang
Differential Revision: D8806148
Pulled By: SsnL
fbshipit-source-id: f6e8777d0c34d4a4d7024e638dc9c63242e8bb58
Summary:
test_cuda.py uses the routine 'number' to prepare many test cases.
number should return a floating point value for float-type tensor
types, or integer otherwise. But number's test to classify the type
is incorrect, so it always returns the integer value.
(type(t).__name__ is always 'torch.tensortype' so never matches
'Double', 'Float', or 'Half'.)
Update number to use the existing is_floating() helper to make the
check.
The change to number causes a few tests to fail for HalfTensor. Relax
the tolerance for those in line with other HalfTensor test cases. The
failing tests--for addcdiv and fill--were not previously relaxed for
HalfTensor and so were held to the over-strict 1e-5 default tolerance.
Finally, update a couple other tests for HalfTensor type to use the
existing is_half() helper.
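For illustration, a hypothetical sketch of the corrected helper (`is_floating` is a stand-in for the existing test-suite helper, not its exact implementation):
```python
import torch

def is_floating(t):
    # Stand-in for the test_cuda helper: True for floating-point tensor types.
    return t in (torch.cuda.HalfTensor, torch.cuda.FloatTensor, torch.cuda.DoubleTensor)

def number(floating, integer, t):
    # Return the float sample for floating-point tensor types, the int sample otherwise.
    return floating if is_floating(t) else integer

print(number(2.5, 2, torch.cuda.FloatTensor))  # 2.5
print(number(2.5, 2, torch.cuda.LongTensor))   # 2
```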
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9475
Reviewed By: yf225
Differential Revision: D8872112
Pulled By: ezyang
fbshipit-source-id: 016e3e15adb23f6606bd4c08218954c1396699db
Summary:
This change makes README.md compatible with both the GitHub and VSTS markdown engines. Images can be shrunk if necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9296
Differential Revision: D8874931
Pulled By: soumith
fbshipit-source-id: 0c530c1e00b06fc891301644c92c33007060bf27
Summary:
I noticed that `Sequential::clone()` does not work. This is because `Sequential` does not use `reset()` which is normally where modules have to initialize and register its submodules. Further, this is because of the way `Sequential` allows its modules to be passed in the constructor, which doesn't work with `reset()` (since it does "late" initialization).
I've added some better error messages inside `Cloneable::clone()` which makes this kind of mistake clearer for other users, and tests for `Sequential::clone()`.
I also had to give `AnyModule` a deep `clone()` method.
ebetica ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9372
Differential Revision: D8865189
Pulled By: goldsborough
fbshipit-source-id: b81586e0d3157cd3c4265b19ac8dd87c5d8dcf94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9403
In the BBoxTransform and GenerateProposal ops, clip_boxes makes sure the bbox fits
within the image. For rotated boxes, this doesn't always make sense, as there
could be multiple ways to clip a rotated box within an image boundary.
Moreover, clipping to a horizontal box means we potentially leave out pixels of
interest. Therefore, we clip only boxes with angle almost equal to 0 (with a
specified `angle_thresh` tolerance).
Reviewed By: pjh5
Differential Revision: D8828588
fbshipit-source-id: 39c1eafdb5d39d383780faa0a47e76149145e50c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9153
Closes https://github.com/pytorch/pytorch/pull/9153
Modified the values reported by the benchmarking platform to include tensor_shape and op_args. These values use a different naming scheme from values like flops and latency.
Reviewed By: sf-wind
Differential Revision: D8729791
fbshipit-source-id: f050200be01c6d0794bf5faaa6e8cef12a00affe
Summary:
Storage views were previously used to implement CUDA IPC sharing,
but they weren't necessary. The new strategy is described in
Note [CUDA IPC and the caching allocator].
This also fixes an unrelated bug, where we weren't actually using
the Tensor forking pickler, because we didn't register a pickler
for torch.Tensor.
Fixes #9447. Fixes #46.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
CC apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9466
Reviewed By: apaszke
Differential Revision: D8859698
Pulled By: ezyang
fbshipit-source-id: 3362cb92f6ae4aa37084c57d79b31004bd0b4a97
Summary:
IValue is short for interpreter value. It is used frequently so a short name is important.
This will allow us to implement more non-tensor types in an efficient way and remove
many hacks from the compiler.
This PR is limited. It only introduces IValue and changes the interpreter to use it.
Follow up PRs will:
* Change the way aten_ops consume non-tensor types so that integer lists
are no longer represented as Tensors.
* Introduce TensorList as a fundamental type and remove all vararg handling in gen_jit_dispatch
* Change the compiler to implement math on primitive numbers rather than converting to tensors.
jamesr66a apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9368
Reviewed By: ezyang
Differential Revision: D8817598
Pulled By: zdevito
fbshipit-source-id: 29dce80611ce5f6384234de9d12a67861d2b112f
Summary:
Add `WeakTensor` - a `Tensor` counterpart which doesn't keep the data (or any other expensive resources) alive. They can be `.lock()`ed and return `at::optional<Tensor>` if they're still alive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9363
Reviewed By: ezyang
Differential Revision: D8815434
Pulled By: apaszke
fbshipit-source-id: 1b3e96503c1285d78ef124c585e65c7630f3253e
Summary:
The tests were too flaky, and the procedure for legitimately
updating versions of software too onerous, to warrant continually
testing these.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9459
Reviewed By: zou3519
Differential Revision: D8852357
Pulled By: ezyang
fbshipit-source-id: 24e99cd00b4252cdeec2a1d9af92456b4a54912a
Summary:
If the type_as operator takes in two values with the same type, remove that operator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9316
Reviewed By: zdevito
Differential Revision: D8808355
fbshipit-source-id: 2d5710a6380b22f4568fc38a439061b5340c4eb1
Summary:
`test_neg` sometimes fails internally because `random_()` can generate an out-of-range value for CharTensor. This PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9431
Reviewed By: SsnL
Differential Revision: D8843284
Pulled By: yf225
fbshipit-source-id: bf516cceb8f780e133fa54f7364c77821eb7c013
Summary:
This PR removes `distributions.utils._log_sum_exp` in favor of `torch.logsumexp`. Also fixes some warnings with the `reduce` arg in `binary_cross_entropy_with_logits`.
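A minimal example of the replacement:
```python
import torch

x = torch.randn(8, 5)
# Numerically stable log-sum-exp, replacing distributions.utils._log_sum_exp.
lse = torch.logsumexp(x, dim=-1)
# Naive (less stable) reference for comparison.
ref = x.exp().sum(dim=-1).log()
print(torch.allclose(lse, ref, atol=1e-6))  # True
```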
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9173
Reviewed By: SsnL
Differential Revision: D8764174
Pulled By: ezyang
fbshipit-source-id: b9c4136dbf0182e8ae77082e6448d23a430d5cb6
Summary:
See Note [Supervisor deleter] for how SupervisedPtr works.
This design is not the obvious one, but there were a lot of
constraints feeding into it:
- It must support the reallocation usage-pattern, where, given
an existing Storage, we allocate a new region of memory,
copy the existing data to it, and then deallocate the old
region of memory.
- Creation of a deleter for memory MUST avoid dynamic allocations
in the common case. We've done some benchmarking in Caffe2
where dynamic allocation for deleters is ruinously expensive,
and it's really hard to avoid these performance tarpits in
very general function wrappers like std::function or
folly::Function (while benchmarking this, we discovered that
folly::Function's move constructor was way more expensive
than it should be).
- We need to be able to deallocate data that comes from external
sources, e.g., dlpack and numpy tensors. Most notably,
you often cannot deallocate these with merely the void*
data pointer; you need some extra, out-of-band information
(e.g., the managing struct) to deallocate it. Sometimes,
you may even want to resize data living in an external source!
- The "core" allocators need to support being wrapped in a Thrust
allocator, so you need to be implement the following two functions:
char* allocate(size_t);
void deallocate(char*, size_t);
- We need to support tensors which contain non-POD, non-trivially
copyable data; specifically tensors of std::string. This is
an upcoming requirement from Caffe2. It's dirty AF, but
it's really useful.
- It should use C++ standard library types like std::unique_ptr
(which is hugely problematic because std::unique_ptr doesn't
call the deleter when the pointer is null.)
Here is the billing of changes:
- Built-in support for realloc() has been DROPPED ENTIRELY.
Instead, you're expected to allocate and then copy from
the old memory to the new memory if you want to do a
reallocation. This is what you'd generally have expected
to occur; and axing realloc() from the design lets us avoid
some tricky correctness issues with std::realloc(), namely
the fact that we must refuse the realloc if the type of the
elements is not trivially copyable. If it really matters,
we can add this back, but there really needs to be a good
explanation WHY you need fast resizing reallocations (by and
large, people don't resize their storages, and it should
be acceptable to have a performance degradation when they
do).
- TH_STORAGE_FREEMEM is no more; instead, if you want a
storage which doesn't free its result, you just give it
an empty deleter.
- What we used to call an "allocator" (really, a combined
object for allocating/deleting) has been split into two
concepts, an allocator, and a smart pointer (SupervisedPtr)
which knows how to delete data.
- Unlike previously, where THAllocator/THCDeviceAllocator
could have a per-tensor context storing extra information
(e.g., a pointer to the metadata you need to actually
free the tensor), there is no context in the allocator or
the deleter of the smart pointer; instead, the smart
pointer directly holds an owning reference to the
metadata necessary to free the data. This metadata
is *freshly manufactured* upon every allocation, which
permits us to resize tensors even in the absence of
built-in support for realloc().
- By default, allocators don't support "raw" allocations
and deallocations with raw pointers. This is because
some allocations may return a different context every
time, in which case you need to reconstruct the context
at delete time (because all you got was a void*, not
a unique_ptr that carries the deleter).
- The diff between at::Allocator and THCDeviceAllocator is a
bit larger:
- It used to return a cudaError_t. Now, allocators
are expected to check the error status immediately and throw
an exception if there was an error. It turns out that this
is what was immediately done after all occurrences of
allocate/release, so it wasn't a big deal (although some
subsidiary interfaces had to themselves be converted to
not return cudaError_t).
There is one notable exception to this, and it is how
we handle CUDA OOM: if this occurs, we attempt to return
unused memory to the system and try again. This is now
handled by a catch-all try-catch block. The cost of
catching the exception is probably the least of your worries
if you're about to OOM.
- It used to take the CUDA stream to perform the allocation
on as an argument. However, it turned out that at all call
sites, this stream was the stream for the current device.
So we can push this into the allocator (and the choice,
in the future, could be made explicitly by twiddling
thread local state.)
- It held two extra methods, emptyCache and cacheInfo, specifically
for interacting with some state in THCCachingAllocator.
But this "generality" was a lie, since THCCachingAllocator
was the only allocator that actually implemented these
methods, and there is actually a bunch of code in THC
which assumes that it is the caching allocator that is
the underlying allocator for CUDA allocations. So I
folded these two methods into this interface as
THCCachingAllocator_emptyCache and THCCachingAllocator_cacheInfo.
- It held its context directly inside the THCDeviceAllocator
struct. This context has been moved out into whatever
is holding the at::Allocator*.
- The APIs for getting at allocators/deleters is now a little different.
- Previously there were a bunch of static variables you could get
the address of (e.g., &THDefaultAllocator); now there is a
function getTHDefaultAllocator().
- Some "allocators" didn't actually know how to allocate (e.g.,
the IPC "allocator"). These have been deleted; instead, you
can wrap the produced pointers into SupervisedPtr using
an appropriate makeSupervisedPtr() static method.
- Storage sharing was a lot of work to wrangle, but I think I've
tamed the beast.
- THMapAllocator and its "subclasses" have been refactored to
be proper, honest to goodness C++ classes. I used the enum
argument trick to get "named" constructors. We use inheritance
to add refcounting and management (in libshm). What we previously
called the "Context" class (Context has been dropped from the name)
is now the supervisor for the data.
- Sometimes, we need to pull out the file descriptor from a
tensor. Previously, it was pulled out of the allocator context.
Now, we pull it out of the supervisor of the SupervisedPtr,
using the static method fromSupervisedPtr(), which uses the
deleter as the typeid, and refines the type if it matches.
- I renamed the std::function deleter into
InefficientStdFunctionSupervisor, to emphasize the fact that it does
a dynamic allocation to save the std::function deleter.
TODO:
- Windows libshm is in shambles and needs to be fixed.
Perhaps for the future:
- newFromFd is now unconditionally calling cudaPointerGetAttributes
even though this is unnecessary, because we know what the device
is from higher up in the callstack. We can fix this by making
newWithDataAndAllocator also take an explicit device argument.
- Consider statically distinguishing between allocators that
support raw_allocate/raw_deallocate, and those which don't.
The Thrust constraint applies only to the CUDA device allocator;
you never need to allocate CPU memory this way
- Really want to get rid of storage views. Ugh.
Nontrivial bugs I noticed when preparing this patch:
- I forgot to placement-new unique pointers and attempted to
assign them directly on uninitialized memory; very bad! Sam
Gross has encouraged me to replace this with a proper constructor
but I keep putting it off, because once everything goes in
StorageImpl there really will be a proper constructor.
- I rewrote a number of APIs to use newWithDataAndAllocator
instead of newWithAllocator, calling the allocator at the
call site (because they required "allocation context" which
we no longer give to "allocators"). When I did this, I forgot
to insert the multiplication with sizeof(real) to scale from
numels to number of bytes.
- The implementation of swap on storages was missing swaps for
scalarType and backend. It was benign (because the only case
we call swap is when these are the same), but I fixed it anyway.
- I accidentally returned a nullptr unique_ptr with no deleter,
even though there was a legitimate one. This matters, because
some code still shoves its hands in the deleter context to
get extra metadata about the function.
- I used std::move() on a unique_ptr, and then did a boolean
test on the pointer afterwards (always false!).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9358
Reviewed By: SsnL
Differential Revision: D8811822
Pulled By: ezyang
fbshipit-source-id: 4befe2d12c3e7fd62bad819ff52b054a9bf47c75
Summary:
This PR adds a device_ member to CUDAEvent. This is necessary because if we create a cudaEvent on one device but destroy it from another, an additional context is created on that device. So this device information is needed to guard the cudaEventDestroy. (cc: ngimel is this expected behavior? I can provide a simple cu script to repro this).
c10d tests are probably not in CI yet; please let me know how the tests are run and I can double-check.
Thanks pietern apaszke for help debugging!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9415
Reviewed By: apaszke
Differential Revision: D8839688
Pulled By: ailzhang
fbshipit-source-id: b950ba37d57b9e3c5fe71726ec92f6a9601c4d0e
Summary:
Fixes: #9419
This assumes that anyone who knows localScalar can also grep for the
error message or get a traceback.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9443
Reviewed By: soumith
Differential Revision: D8850718
Pulled By: ezyang
fbshipit-source-id: a106fee718fef97064e861810a49ca05f536f27e
Summary:
Fixes: #9421
I don't think it is easy to deal with non-contiguous arrays in CUDA topk, so I'm adding a check.
The argument number is a bit confusing when it shows up in PyTorch, but it is consistent with the other checks. (Not sure whether it would make sense to eliminate argument numbers from the TH/THC error messages given that they're probably off more than once...)
Do we need a test that it indeed refuses non-contiguous?
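If the new check fires, the user-side workaround is simply to materialize a contiguous copy first; a sketch, assuming the check rejects non-contiguous CUDA inputs outright:
```python
import torch

x = torch.randn(64, 128, device='cuda').t()   # transposed view: non-contiguous
assert not x.is_contiguous()
# CUDA topk expects contiguous input here, so make a contiguous copy first.
values, indices = x.contiguous().topk(5, dim=1)
```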
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9441
Reviewed By: soumith
Differential Revision: D8850719
Pulled By: ezyang
fbshipit-source-id: d50561bb37ed50ab97aeaf54d8e3fc6c765bdc7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9299
ONNX has ReduceL1 and ReduceL2 operators that would facilitate this, so allow PyTorch to export those and allow Caffe2 to run them.
I only implemented this on CPU so far.
Reviewed By: pjh5
Differential Revision: D8757381
fbshipit-source-id: 68afc9e2f90042a70929b73ace05a499b5c670c7
Summary:
During tracing (and export) we are currently introducing an unnecessary hard-coded view on the RHS of indexed assignments such as `tensor[idxs] = rhs`. This caused a regression in the PyTorch translate models because these expressions appear with variable RHS sizes. This change makes it so we only call view if we actually need to strip leading 1-dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9424
Reviewed By: colesbury
Differential Revision: D8838881
Pulled By: jamesr66a
fbshipit-source-id: 399e5daa7d021f4f59f6f92b9fae581f92bfc538
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9385
The operator transforms dense features into sparse features by bucketizing. Only the features in the indices tensor will be transformed and output.
Reviewed By: bddppq
Differential Revision: D8820351
fbshipit-source-id: a66cae546b870c6b2982ac20641f198334f2e853
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8999
Closes https://github.com/pytorch/pytorch/pull/8999
Implemented the WRgrad optimizer operator for the dense case (the base case as well as the case with additional outputs for the effective learning rate and update value) and the sparse case.
Reviewed By: pjh5
Differential Revision: D8627933
fbshipit-source-id: a63cde46c04bcc6b428ab5f77a4b3b2beb66c046
Summary:
I'm working through clang-tidy warnings. This PR addresses the `hi-cpp-override` check, which warns that `virtual` + `override` is redundant, since `override` already signifies that a function is overriding and is thus virtual.
Where there was `virtual` + `override` I removed the `virtual`, where there was `virtual` and no `override` I removed `virtual` and added `override`.
ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9335
Differential Revision: D8807082
Pulled By: goldsborough
fbshipit-source-id: e0a261053f6540a22cc56ec160a24aa285af6319
Summary:
This PR improves the performance of (formerly) latency-bound non-contig-dim reduction kernels by up to 20X, while maintaining determinism.
Currently, reducing across a non-contiguous dimension uses the parallelism exposed across the number of output elements. This means that performance suffers if the number of output elements is small. Example:
```
a = torch.cuda.FloatTensor(32768, 32)
a.sum(dim=0)
```
Before this PR, `a.sum`'s kernel (kernelReduceNoncontigDim_shared) took 138 microseconds on my machine. The speed-of-light estimate (based on a bandwidth of 700 GB/s) should be around 6 microseconds. After this PR's changes, `a.sum(dim=0)`'s kernel takes 6.9 microseconds on my machine.
Christian implemented some nice logic to squeeze out better performance for cases like `a.sum` using intra-block and instruction-level parallelism across the dimension being reduced, but his kernel still only launched one block for every 32 output elements. This was insufficient to saturate the device in many cases, like `a.sum` here (where only one block is launched).
My PR adds block cooperation across the dimension being reduced. Many blocks, instead of one block, help to reduce into each 32 output elements. Internally, each block leverages all of Christian's nice logic to compute a partial reduction into a per-block staging buffer, then the last block to finish combines the results to compute the final output.
Block cooperation does require THCudaMalloc-ing staging and semaphore buffers, so it's not always worthwhile. I included a set of rough heuristics to decide when the kernel should choose to use block cooperation. These heuristics are based on Python-side timings of calling sum() many times in a loop, and comparing to the old implementation.
I tested a wide range of sizes (to determine heuristics) and as long as the number of output elements is greater than 16ish, I don't think there are any remaining pathological sizes where users will encounter unexpectedly poor performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9214
Reviewed By: gchanan
Differential Revision: D8808127
Pulled By: colesbury
fbshipit-source-id: 139f310fc6ea6d187a7c983128f8eb8e1c9b4be3
Summary:
While talking to mruberry, I noticed a few places that use
special cast wrappers that are no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9401
Differential Revision: D8828874
Pulled By: colesbury
fbshipit-source-id: 2b7fe7ac3af3b71be26b43a9ad3949f8065a7bc9
Summary:
This is to unify the handling of empty tensors in std/var between the dimension reduce and all reduce cases.
Also to avoid triggering ubsan errors around divide by 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9400
Reviewed By: ezyang
Differential Revision: D8828879
Pulled By: gchanan
fbshipit-source-id: 6b9306805c94251eec28bd12e234618338bff4e3
Summary:
This includes either bug fixes or NumPy semantics changes for the following methods:
chunk, diagonal, unfold, repeat, flatten, reshape, split, unsqueeze.
The n-dimensional empty tensor feature is still hidden behind a feature flag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9362
Reviewed By: ezyang
Differential Revision: D8817002
Pulled By: gchanan
fbshipit-source-id: 6ff704ec96375f00b4dd39ebcd976efac0607fb4
Summary:
A purely experimental addition to guide us in delivering this
into real production systems and their threadpools. The biggest limitation
right now is that we need to turn off the BlackBoxPredictor activation
deallocation logic to get sane performance.
Reviewed By: highker
Differential Revision: D8798029
fbshipit-source-id: ec7962689d605fba62b2c9e0904309df567a25a4
Summary:
This was previously meant to be used for c10 code, but that plan has since changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9367
Reviewed By: orionr
Differential Revision: D8814361
Pulled By: smessmer
fbshipit-source-id: 8e35fa74e160343a2bb8432013847677aa73695a
Summary:
To allow our C++ customers to use our initialization methods as well, this PR moves some of the code from `torch.nn.init` to ATen, calls it from Python, and adds equivalent code to the C++ frontend.
Notes:
1. Happy to hear thoughts on whether it's ok to have e.g. `torch.nn.init.dirac_` *and* `torch.dirac_` (the former has a `no_grad` guard). We have this for `ones_` and stuff too, so I don't mind it.
2. I left the exception checking in Python because the Python checks throw `ValueError`s while ATen errors show up as `RuntimeError`s. I imagine this would break users' error handling if someone had a `try`-`except` handler for `ValueError` (or maybe that's far-fetched).
EDIT: After discussions with zdevito, the PR now simply duplicates the code in C++ exclusively for the C++ API, and we leave the Python code as-is (to make it easier for people to read/modify).
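For context, a small example of the Python-side initializers whose behavior the new C++ `torch::nn::init` functions are meant to mirror (illustrative only):
```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)
# In-place initializers; the Python wrappers keep their no_grad guard
# and ValueError-style argument checks discussed above.
nn.init.kaiming_normal_(conv.weight)
nn.init.constant_(conv.bias, 0.0)
```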
ebetica ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9295
Differential Revision: D8813793
Pulled By: goldsborough
fbshipit-source-id: 4b969f3f75952c1be4e837e19e23b8098e5fbd4b
Summary:
Migrated PriorCorrectionCalibration from Dper2 layer to Dper3 module.
A few notes:
1. Calibration operators need dynamic linking;
2. All calibration implementation and tests are located in /modules/calibration/
3. Added a type inference function in operator_schema.h/operator_schema.cc
Reviewed By: idning
Differential Revision: D8756832
fbshipit-source-id: 7e6300a3bb3d3feaaf3b82340ece2f35d71493fc
Summary:
This PR changes the ATen `CMakeLists.txt` slightly, to enable a standalone build of ATen inside PyTorch. Currently, the tests in ATen get linked to `libcaffe.so libcaffe2.so`. As a result, ATen can't be built standalone without building from the root pytorch directory. I know that there is a big merge happening between caffe2 and pytorch, and hence the purpose of this PR is really to start a conversation on what the proper way of migrating the CMakeLists would be to enable clean builds. We should also follow up on this PR: https://github.com/pytorch/pytorch/pull/7275. For your reference, that PR has the explanation for why `-Wl --no-as-need` is needed. Moreover, without `set(ATen_CUDA_SRCS ${all_cuda_cpp})`, the standalone build will throw unresolved references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9377
Reviewed By: smessmer
Differential Revision: D8825921
Pulled By: orionr
fbshipit-source-id: c521159b4885639fc7990a9819202051455d07db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9396
The custom and local CUDADevice RAII wrapper has been superseded by at::DeviceGuard so it doesn't make sense to keep it around.
Reviewed By: ailzhang
Differential Revision: D8824200
fbshipit-source-id: 39fa00ffab4f495606c8001446e976bbf603e866
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9309
This is faster when you're dealing with a small number of processes.
Around the 16-process mark, the halving/doubling algorithm becomes faster.
Reviewed By: apaszke
Differential Revision: D8785364
fbshipit-source-id: 4a03326266e473026d943787186e149d0cc489f0
Summary:
Use the decorator `torch.jit.batch` to implement auto-batching (call the `to_batch` pass to do the IR transformation).
- `to_batch` pass: "to_batch.h/cpp" in csrc/jit/passes to transform a graph into a new batched graph.
- Write several basic operators for BatchTensor (add, mul, sigmoid, tanh, mm, matmul, select).
- Register the operators in a lookup table `<std::string, std::shared_ptr<Graph>>`. (use the Graph to replace the original node in IR graph)
Move BatchTensor in python from torch.BatchTensor to torch.jit.BatchTensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9198
Reviewed By: zdevito
Differential Revision: D8744466
Pulled By: ChunliF
fbshipit-source-id: 9ea56a30f55cb870f13a2069a47cc635419763ff
Summary:
In the C++ API, `Sequential` currently was not refcounted itself, but stored `shared_ptr<AnyModule>` to get the reference semantics. This is unfortunate because most modules in the API are accessed via `->`, e.g. `Linear l(1, 2); l->forward(...);`. `Sequential` was different in that it had value semantics itself, thus was accessed via `.`.
This PR makes `Sequential` store `AnyModule` (without extra indirection), and uses the same pImpl mechanism we use for all other modules to make `Sequential` have reference semantics itself. This makes it consistent with the rest of the library. It also removes one level of indirection inside of `Sequential`, which is cool.
One thing I had to change was that the `ModuleHolder` with which the whole pImpl thing is implemented previously did some tricks to make `Linear(3, 4)` actually construct `Linear(LinearOptions(3, 4))`. This doesn't work well with `Sequential` since it takes a variadic parameter pack. Instead, I made `ModuleHolder` forward all arguments to the underlying module, and then further pushed the trick to forward parameters to modules' options types into the actual Modules. This adds one constructor per Module in the library. This is not something user modules have to do (unless they want this nice forwarding themselves). It makes the code simpler overall.
ezyang ebetica apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9151
Reviewed By: ezyang
Differential Revision: D8809298
Pulled By: goldsborough
fbshipit-source-id: da68452c3de912fbc67af330ba93b5220de6909f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9352
I am debugging a failed workflow f61490672, and found the original error message to be uninformative.
Differential Revision: D8808181
fbshipit-source-id: 3f524ca092881186a492c5c0456124ce31d54751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9350
Re-apply #9270
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
Reviewed By: mingzhe09088
Differential Revision: D8794431
fbshipit-source-id: de656334af46c697802073f8e8d9a6aeb9ca65a7
Summary:
Breaking this out of #8338
This fixes some CUDA related build and runtime issues after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9347
Reviewed By: orionr
Differential Revision: D8806954
Pulled By: mingzhe09088
fbshipit-source-id: 9f8e3feee06478d1ac2deb30796939453352d388
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9056
Closes https://github.com/pytorch/pytorch/pull/9056
Updates bbox_transform for rotated boxes with angle info to normalize the
predicted angle to be within the [angle_bound_lo, angle_bound_hi] range.
Reviewed By: pjh5
Differential Revision: D8706240
fbshipit-source-id: f3ee834cf362736136e285f0f8f0c063af94a879
Summary:
THNN was accumulating the result of reduction loss functions
into real instead of accreal. This was causing precision issues with
MSELoss.
This patch only fixes MSELoss. Some of the other losses exhibit bad precision as well (because they accumulate into real instead of accreal) and require more investigation. I will open an issue for those (#9286)
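The hazard is easy to see outside THNN; the following is only an illustration of why a wider accumulator matters (sequential float32 accumulation vs. a float64 accumulator), not the kernel code itself:
```python
import numpy as np

ones = np.ones(20000000, dtype=np.float32)
# cumsum accumulates strictly sequentially; in float32 the running sum stalls
# at 2**24 = 16777216 because adding 1.0 no longer changes it.
print(np.cumsum(ones)[-1])                    # 16777216.0
# Accumulating in float64 (the accreal analogue) gives the exact answer.
print(np.cumsum(ones, dtype=np.float64)[-1])  # 20000000.0
```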
Fixes #8710
cc li-roy SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9287
Reviewed By: SsnL
Differential Revision: D8775708
Pulled By: zou3519
fbshipit-source-id: d1a1f159deee0cb90fd8e81e63b246115eea8e9e
Summary:
operator.cpp is not generated. Removing the line prevents generate_code.py from always thinking it is out of date and re-running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9339
Reviewed By: ezyang
Differential Revision: D8798689
Pulled By: zdevito
fbshipit-source-id: f25a2e215fec29aa51571e6a31771f0f91e7a213
Summary:
dlpacks deserve documentation. :)
I wonder whether it might make sense to merge the various small torch.utils pages (and include a link for the larger ones, e.g. data) to enhance the structure in the docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9343
Differential Revision: D8801227
Pulled By: soumith
fbshipit-source-id: 2980d271971743b86f052bec5a2cb4d146a90d9b
Summary:
Helps prevent calling functions of the base class on float/double/int subclasses that aren't supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9321
Reviewed By: colesbury
Differential Revision: D8793627
Pulled By: cpuhrsch
fbshipit-source-id: 7fde779ecd4b890dda406f3d1306b58bab40efe2
Summary:
As discussed on the call, this will allow us to keep this integral part of the effort to run PyTorch on ROCm in sync with the main code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8812
Reviewed By: ezyang
Differential Revision: D8796245
Pulled By: bddppq
fbshipit-source-id: 8e12c2acf6a7e0740f31b21e50be74e10ed8b12c
Summary:
This is a series of two commits that should probably be read separately. They are stacked on top of #9018 since the second commit requires it for correctness.
Commit 1
=======
This commit is the first in a series that will clean up how we handle declaring operators and intrinsics in the JIT to make it more modular and readable. This introduces readable declarations that can be used to register operators and switches gen_jit_dispatch to generate this schema. A follow up PR will remove the dispatch keys like "add-3" and resolve ops directly based on the registered schema, further simplifying the generation process.
* Switches schema over to parsed declarations, in the future this will allow something like:
```
registry.register_intrinsic("foo(Tensor a, Tensor b) -> Tensor", [](Stack& stack) {
  ...
});
```
This will allow the scalable registration of intrinsics for lists, tuples, and other ops, along with metadata for these ops (e.g. derivatives and size propagation routines).
The declarations resemble those used by PythonArgParser but have been significantly cleaned up to minimize the number of types that can appear in the declaration. We should strive to get the other parts of PyTorch switched over to this restricted declaration set when possible, but it is too much to do in a single PR. My hope is that eventually we will use a very similar language to describe declarations in C10, and this can serve as a guide for that.
Parsing is done using the script lexer, so it is very robust to whitespace and extensible for future types.
This removes the other way we encoded schema, and makes it easier to see what schema are registered.
Current generated declarations: https://gist.github.com/zdevito/a96a17766fb3a098d69a91ee00abaaf6
* Switches how we handle attempting to use an integer in the place of a fixed-sized int list, such as in conv (e.g. 'int[3] stride=1'). Now that we can statically distinguish between int and Tensor, we handle the expansion as an implicit conversion in the compiler. This allows us to simplify the interpreter since it no longer needs to handle the conversion itself.
* Schema declarations have been changed so that they match the type system in the IR exactly. In particular, attribute_info which was used by liftConstantAttributes has been dropped and constant attributes are lifted purely based on the type of the input. Type conversions in compiler have been simplified due to this change.
* Error highlighting in ErrorReport now only reports at most 20 lines of code, to make reading where an error occurred easier.
Commit 2
=======
This commit unifies aten_dispatch and aten_schema into a single Operator object that both contains schema and implementation information. In the future we can use this object to also contain functionality like shape prop and autodiff needed by all operators. Operators are registered globally, and dispatch logic uses the schema information to figure out which variant to use. Descriptor keys, a frequent source of inscrutable debug errors, have been removed.
* Introduce Operator, to replace TensorOp. Unlike TensorOp, we use Operator for all op implementations, including primitives that may occur in the graphs. The only exceptions are ops that are only known to the interpreter like jumps, and GraphExecutors where we need to record additional debug info.
* Adds a global registry for Operator implementations. aten_dispatch.cpp turns into register_aten_ops.cpp, which registers all the Operators for aten with the operator registry. register_prim_ops.cpp now contains the implementations for primitive operators that used to be in the interpreter. This means that it is now safe to use `getOperation(node)` to lookup the true interpreter function for the node, which will simplify const-propagation passes.
* Remove addInterpreterOpHandler in favor of global operator registry.
* Instead of descriptors, we match Node arguments directly against FunctionSchema describing expected inputs in `matchSchema`. `matchSchema` knows how parse both attributes and positional inputs from a node and match it to the appropriate registered operator. Debug error messages when we try to run an invalid operator are significantly improved: they now automatically display the schema for the op with the same name that are registered.
* Merge aten_schema into register_aten_ops. Each Operator takes a string schema which is parsed to determine when to dispatch to that op.
* Cleans up gen_jit_dispatch.py now that we do not need to write out descriptors. In particular, skip_scalar_overloads can be removed since Richard's code sorts declarations to put Tensor, Tensor declarations first.
* remove matchSchemaAndLiftConstantAttributes and use emitBuiltinCall instead to remove code duplication
* refactor stack manipulation functions into a separate header file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8885
Reviewed By: jamesr66a
Differential Revision: D8751048
Pulled By: zdevito
fbshipit-source-id: 312aabfbf88307c5f6ab947b6caf691468b94557
Summary:
Breaking this out of #8338
This takes care of the Eigen failure we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Fix is to isolate Eigen from headers included by cu files and processed by nvcc. This was worked on with smessmer.
cc mingzhe09088 smessmer BIT-silence Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9270
Reviewed By: mingzhe09088
Differential Revision: D8768025
Pulled By: orionr
fbshipit-source-id: 5b34017aeb67e35a1b5938d962181ccd4cd37591
Summary:
Usually the DLPack consumer is expected to call DLManagedTensor's
deleter to signal that it doesn't need the contents.
This patch calls the deleter when freeing unconsumed
DLPack capsules created by PyTorch.
Test script:
```
import torch
import torch.utils.dlpack
import gc
for i in range(10000):
    a = torch.randn(1000,1000, dtype=torch.float32, device='cuda')
    b = torch.utils.dlpack.to_dlpack(a)
    gc.collect()
```
Before patch: consume all GPU ram.
After patch: constant GPU ram consumption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9297
Differential Revision: D8781571
Pulled By: soumith
fbshipit-source-id: 2ebadec6c857646220d632ca64110af430dbd52f
Summary:
I'm trying to write a multi-GPU network by pipelining some layers onto different GPUs. However, the current gradient clipping requires all the parameters to be on the same device.
The CUDA launch overhead is reduced since the scalar calculation is performed on the CPU, but this introduces extra data transfers.
No performance regression is observed by running the following snippet:
```python
import time
import torch
module = torch.nn.Sequential(
    torch.nn.LSTM(1024, 1024),
    torch.nn.LSTM(256, 256),
    torch.nn.Linear(100, 10000),
).cuda()
torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
start = time.time()
for _ in range(1000):
    torch.nn.utils.clip_grad_norm_(module.parameters(), 1)
torch.cuda.synchronize()
time_elapse = time.time() - start
print('{} ms per clip'.format(time_elapse))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9302
Differential Revision: D8781551
Pulled By: soumith
fbshipit-source-id: 9d76d01fe0531927f770a16b9523872a7e08e927
Summary:
Fixes #9264.
There can be so many elements in the output of `vol2col` that it overflows the `int` range! This PR changes 3d conv to use `int64_t` mostly.
Also fixes some unused-variable warnings (cc goldsborough).
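For a sense of scale, a back-of-the-envelope computation with hypothetical sizes chosen only to illustrate the overflow:
```python
# vol2col buffer elements = C * kD*kH*kW * outD*outH*outW
C, kD, kH, kW = 64, 3, 3, 3
outD = outH = outW = 128          # e.g. a 128^3 input with stride 1, padding 1
elements = C * kD * kH * kW * outD * outH * outW
print(elements)                   # 3623878656
print(elements > 2**31 - 1)       # True: already past the int32 limit
```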
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9274
Differential Revision: D8770682
Pulled By: SsnL
fbshipit-source-id: f6e37f1aa56fe1009dd4c9bcbc042244e47252db
Summary:
The underlying use-case is the file descriptor to storage cache in
torch.multiprocessing.reductions. Previously, this was implemented by wrapping
an existing allocator with a "weak ref" allocator which also knew to null out
the weak reference when the storage died. This is terribly oblique, and
prevents us from refactoring the allocators to get rid of per-storage allocator
state.
So instead of going through this fiasco, we instead directly implement weak
pointers and finalizers in THStorage. Weak pointers to THStorage retain the
THStorage struct, but not the data_ptr. When all strong references die,
data_ptr dies and the finalizers get invoked.
There is one major hazard in this patch, which is what happens if you
repeatedly call _weak_ref on a storage. For cleanliness, we no longer
shove our grubby fingers into the finalizer struct to see if there is already
a Python object for the weak reference and return it; we just create a new one
(no one is checking these Python objects for identity). This means if you
keep calling it, we'll keep piling on finalizers. That's bad! But I am
not going to fix it until it is actually a problem for someone, because
then we need to add another caching layer.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9148
Differential Revision: D8729106
Pulled By: ezyang
fbshipit-source-id: 69710ca3b7c7e05069090e1b263f8b6b9f1cf72f
Summary:
Fix a problem when Caffe2 works with an old version of ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9284
Reviewed By: yinghai
Differential Revision: D8773894
Pulled By: houseroad
fbshipit-source-id: 99b5a962099f854edc85a2ea815cb88c82a6e175
Summary:
ONNX-TensorRT is still using an old opset (<7). Patch it for now.
A future fix would be to expose versioning in the ONNX exporter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9285
Reviewed By: houseroad
Differential Revision: D8775268
Pulled By: yinghai
fbshipit-source-id: c272073f80cce35ebd971e44ec9472e3c8fd4b9e
Summary:
This PR implements and tests N-dimensional empty tensors for indexing, factories, and reductions if compiled with -DUSE_TH_SIZE_ZERO_DIM.
Still remaining to add:
1) TensorShape functions
2) Simple linear algebra functions (matrix multiply variants)
3) Other functions that operate over a dimension (but don't reduce).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9209
Reviewed By: ezyang
Differential Revision: D8751257
Pulled By: gchanan
fbshipit-source-id: 2113374dc7af6caf31a99bf67b3893f130a29e23
Summary:
Tested on my mac on a pretty clean anaconda3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8509
Reviewed By: orionr
Differential Revision: D8702257
Pulled By: pjh5
fbshipit-source-id: eda03ef9732da9fc56b31d909af5c0e39520d689
Summary:
Breaking this out of #8338
This fixed Mac build issues after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9283
Reviewed By: orionr
Differential Revision: D8773459
Pulled By: mingzhe09088
fbshipit-source-id: 71942e8e6891a625e6b1a7dc0160e87444c64209
Summary:
Breaking this out of #8338
When BUILD_CAFFE2 and BUILD_ATEN are removed, we need to install typing on Mac.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9271
Reviewed By: orionr
Differential Revision: D8768701
Pulled By: mingzhe09088
fbshipit-source-id: 052b96e90e64b01e6b5dd48b91c0fb12fb96b54a
Summary:
Breaking out of #8338
This fixes the build issues with pytorch on linux machines after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9273
Reviewed By: orionr
Differential Revision: D8768869
Pulled By: mingzhe09088
fbshipit-source-id: 2730426ed1bed398eb5dc804c7348aeeb27c93d3
Summary:
Breaking this out of #8338
This takes care of failures we saw on Mac CUDA builds when BUILD_CAFFE2 and BUILD_ATEN were removed. Specifically, smessmer fixed `std::hash` being handled in a weird way by nvcc and I fixed an nvcc template issue by moving `SparseNormalizeOp::RunOnDevice` implementation into the cc file.
cc mingzhe09088 smessmer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9269
Reviewed By: mingzhe09088
Differential Revision: D8767984
Pulled By: orionr
fbshipit-source-id: 550686bfcef6d331f16d593859c99169216c5c2e
Summary:
Breaking this out of #8338
This fixed an Android build issue after BUILD_CAFFE2 and BUILD_ATEN are removed.
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9275
Reviewed By: orionr
Differential Revision: D8769913
Pulled By: mingzhe09088
fbshipit-source-id: afce52a12697757a0b2103c7c343e19ab158a9f7
Summary:
Breaking this out of https://github.com/pytorch/pytorch/pull/8338
Use a local version of `np.rot90` with an `axes` argument, since we don't have NumPy 1.12.0 in all of the test environments. Caffe2 conda2-ubuntu16.04, for example, fails. Generally, it seems better to not require a NumPy bump just for this test.
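A minimal stand-in sketch of what such a local helper could look like, built from slicing-based flips and an axis swap (hypothetical, not necessarily the test's exact implementation):
```python
import numpy as np

def _flip(m, axis):
    # Reverse along `axis` using basic slicing (works on pre-1.12 NumPy).
    sl = [slice(None)] * m.ndim
    sl[axis] = slice(None, None, -1)
    return m[tuple(sl)]

def rot90_axes(m, k=1, axes=(0, 1)):
    # Rotate by 90*k degrees in the plane spanned by `axes`,
    # mimicking np.rot90's `axes` argument from NumPy 1.12.
    k %= 4
    if k == 0:
        return m.copy()
    if k == 2:
        return _flip(_flip(m, axes[0]), axes[1])
    if k == 1:
        return np.swapaxes(_flip(m, axes[1]), axes[0], axes[1])
    return _flip(np.swapaxes(m, axes[0], axes[1]), axes[1])  # k == 3
```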
cc mingzhe09088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9267
Reviewed By: mingzhe09088
Differential Revision: D8767819
Pulled By: orionr
fbshipit-source-id: c51a6295d58366eba06e4e55e3f1ffaa8af96975
Summary:
Breaking this out of #8338
More changes required to support USE_CUDNN=OFF. We should be able to land some of our fixes before the big BUILD_CAFFE2 and BUILD_ATEN removal lands.
cc mingzhe09088 Yangqing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9268
Reviewed By: mingzhe09088
Differential Revision: D8767981
Pulled By: orionr
fbshipit-source-id: 0607ca2773253b685209c274a3adf70180d8ce58
Summary:
Commits:
1. In the extension doc, get rid of all references to `Variable`s (Closes #6947)
+ also add minor improvements
+ also added a section with links to cpp extension :) goldsborough
+ removed mentions of `autograd.Function.requires_grad` as it's not used anywhere and is hardcoded to return `Py_True`.
2. Fix several sphinx warnings
3. Change `*` in equations in `module/conv.py` to `\times`
4. Fix docs for `Fold` and `Unfold`.
+ Added a better shape check for `Fold` (it could previously give a bogus result when there are not enough blocks). Added a test for the checks.
5. Fix doc saying `trtrs` not available for CUDA (#9247 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9239
Reviewed By: soumith
Differential Revision: D8762492
Pulled By: SsnL
fbshipit-source-id: 13cd91128981a94493d5efdf250c40465f84346a
Summary:
When we moved the libaten build into libcaffe2, we changed the location where it generated compile_commands.json such that it was no longer being picked up by the build script. This fixes it so it is still found.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9227
Reviewed By: goldsborough
Differential Revision: D8757984
Pulled By: zdevito
fbshipit-source-id: 73df26bf08d98f18ac841d6c0db7e332fd328ab6
Summary:
Here's an improved error message. Let me know if this change makes the errors a little clearer.
Closes https://github.com/pytorch/pytorch/pull/9212
Reviewed By: soumith
Differential Revision: D8752896
Pulled By: jramseyer
fbshipit-source-id: d2bd8462c3ddf14acd3de56a4c1aeb75a9bc4067
Summary:
This PR moves the THCStream logic (from both the THCStream and THCState APIs) to ATen. In particular, it:
+ Creates a new (THC free) at::CUDAStream class and API
+ Extends the at::Context API to expose it
+ Stubs the current THCStream and THCState APIs to use it
+ Updates THC to no longer violate stream encapsulation (stream.hpp is dead)
+ Adds an ATen cpp test of the API
+ Bonus: Removes some debug spew in test_nn.py
The new API has several advantages over the old one:
(1) It comes with an easy-to-use RAII class, CUDAStream. CUDAStreams have the expected copy and move semantics and are implicitly convertible to cudaStream_t.
(2) It does not depend on THCState, THCThreadLocal, or CUDA (thanks to goldsborough for suggesting the dynamic registration technique)
(3) It provides one consistent API/place for all stream operations, instead of having them split between THCStream and THCState
(4) The internals are completely encapsulated, unlike the historic THCStream
(5) It has getAndRetain semantics, which are safer than the historic gets (which allowed a gap between acquisition and retention)
There are a couple things this PR does not do, however, which are left for future work:
- It leaves the c10d:CUDAStream class as a THCStream wrapper (which now really wraps an at::CUDAStream).
- It leaves historic users of THCStream mostly untouched, except where they violated encapsulation (by using stream.hpp). A couple forward declarations were also changed.
I hope this PR allows easy usage of streams from ATen and is a useful pattern for porting more of the THCState API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8997
Differential Revision: D8683375
Pulled By: soumith
fbshipit-source-id: 2e48ad85f1f9c8817684fe63a267938e80eafdcf
Summary:
This looks like a totally cosmetic change, but for some reason it reduces the runtime by ~50% running in a single CPU thread.
```
import os
os.environ['OMP_NUM_THREADS']='1' #Use one CPU thread
import torch, torch.nn as nn, time
def test_net(net,offset):
    net.eval()
    total=0
    with torch.no_grad():
        for _ in range(100):
            x = torch.randn(100,100,100)+offset
            start_time = time.time()
            y = net(x)
            total+=time.time()-start_time
    print(net, total*10, 'ms')
for offset in [-1,0,+1]:
    test_net(nn.LeakyReLU(),offset)
    test_net(nn.PReLU(),offset)
```
Closes https://github.com/pytorch/pytorch/pull/9206
Reviewed By: yf225
Differential Revision: D8749491
Pulled By: btgraham
fbshipit-source-id: 3db8049dd151c0ba9ae1dd5c05bcc58bcab97e9a
Summary:
This PR addresses #5823.
* fix docstring: upsample doesn't support LongTensor
* Enable float scale up & down sampling for linear/bilinear/trilinear modes. (following SsnL 's commit)
* Enable float scale up & down sampling for nearest mode. Note that our implementation is slightly different from TF in that there's actually no "align_corners" concept in this mode.
* Add a new interpolate function API to replace upsample, and add a deprecation warning for upsample (a short usage sketch follows this list).
* Add an "area" mode, which is essentially adaptive average pooling, to the resizing API.
* Add test cases for interpolate in test_nn.py
* Add a few comments to help understand *linear interpolation code.
* There is only "*cubic" mode missing in resize_images API which is pretty useful in practice. And it's labeled as hackamonth here #1552. I discussed with SsnL that we probably want to implement all new ops in ATen instead of THNN/THCUNN. Depending on the priority, I could either put it in my queue or leave it for a HAMer.
* After the change, the files named *Upsampling*.c work for both up- and downsampling. I could rename the files if needed.
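A brief usage sketch of the new `interpolate` API described above (output sizes shown are for these example inputs):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)
# Upsample by a non-integer factor with bilinear interpolation...
y = F.interpolate(x, scale_factor=1.5, mode='bilinear', align_corners=False)
# ...or downsample; 'area' mode is the adaptive-average-pooling path mentioned above.
z = F.interpolate(x, scale_factor=0.5, mode='area')
print(y.shape, z.shape)  # torch.Size([1, 3, 48, 48]) torch.Size([1, 3, 16, 16])
```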
Differential Revision: D8729635
Pulled By: ailzhang
fbshipit-source-id: a98dc5e1f587fce17606b5764db695366a6bb56b
Summary:
Closes https://github.com/pytorch/pytorch/pull/9199
The input shapes are not logged correctly in production because `PerfNetObserver::Stop()` only gets called after inference is done for the net, and in mobile models it's common practice to reuse the blobs as much as possible to save memory. The shapes of the blobs keep changing during inference, so by the time you query `InputTensorShapes()` in `PerfNetObserver::Stop()`, you only get the final shape of the blobs.
To fix this bug, I moved the `InputTensorShapes()` query from `PerfNetObserver::Stop()` to `PerfOperatorObserver::Stop()`. The latter gets called at the end of operator->run() whereas `PerfNetObserver::Stop()` gets called at the end of net->run().
Also remove `PerfOperatorObserver::getAnalyticalCost()` since it's now done on the server side and no longer needed on mobile
Reviewed By: Maratyszcza
Differential Revision: D8743346
fbshipit-source-id: 5d2d0132e3f5e084be7d0173863e695e62a6b4a0
Summary:
Closes https://github.com/pytorch/pytorch/pull/9048
The max_length argument fixes the shape of the output to N * max_length * D, where N is the batch size and D is the feature dimension.
Reviewed By: bddppq
Differential Revision: D8702782
fbshipit-source-id: e30555608fee1c4a61cc95922f4a71c7f54903af
Summary:
[x] get registry working
[x] move all current ops to registry
Reviewed By: yinghai
Differential Revision: D8706115
fbshipit-source-id: 8dfce79039b57dea1c15e8e291cdd74f39766ade
Summary:
As I try to replicate DP in C++, I need to move some functions into C++ from Python. This PR ports the scatter and gather primitives from Python in torch/cuda/comm.py to C++ in torch/csrc/cuda/comm.cpp. The basic infrastructure was already there, since apaszke had rewritten broadcast in C++ already.
I'm not very familiar with this code, so let me know if I'm doing something wrong. I largely just literally translated the code.
I don't know how "public" `torch.cuda.comm` is, but I feel like the `destination_index` parameter for `gather` should be changed from -1 indicating CPU to `None` indicating CPU, and `-1` indicating the default CUDA device. That would make the code clearer IMO.
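For reference, a minimal usage sketch of the primitives being ported (assumes at least two visible GPUs; keyword names follow the torch.cuda.comm Python API):
```python
import torch
import torch.cuda.comm as comm

x = torch.randn(8, 16, device='cuda:0')
# Split along dim 0 and place the chunks on the listed devices...
chunks = comm.scatter(x, devices=[0, 1], dim=0)
# ...then concatenate them back onto a single destination device.
y = comm.gather(chunks, dim=0, destination=0)
print(y.shape)  # torch.Size([8, 16])
```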
apaszke colesbury teng-li pietern
Closes https://github.com/pytorch/pytorch/pull/9117
Differential Revision: D8721729
Pulled By: goldsborough
fbshipit-source-id: 1844a488079d21fa209b32e2c73e48632cbe9e68
Summary:
Added a way to `dynamic_cast` an `nn::Module` and get a pointer to it. `nn::Module::is<T>` just checked whether the return value of the `dynamic_cast` was nullptr, so I got rid of `is<T>` since it's equivalent to `as<T> != nullptr` (or just `as<T>`, due to boolean conversion).
We're now at
```
if (auto* conv = module.as<nn::Conv2d>()) {
  conv->weight.data().normal_(0.0, 0.02);
} else if (auto* bn = module.as<nn::BatchNorm>()) {
  bn->weight.data().normal_(1.0, 0.02);
  bn->bias.data().fill_(0);
}
```
ezyang apaszke ebetica
Closes https://github.com/pytorch/pytorch/pull/9149
Differential Revision: D8735954
Pulled By: goldsborough
fbshipit-source-id: e2b8f6f0cea16a621f8bc0807a33cc7651d25154
Summary:
Context: I am updating jit::FunctionSchema to use `Symbol name;` rather than `std::string name`. Sometimes the name refers to a builtin thing like `prim::UnpackTuple`, sometimes to an aten operator like `aten::add`, and sometimes just to a raw string, like `my_method_foo` that really doesn't belong in any namespace and should be printed to the user in that form. For this last case, I want the ability to create a raw Symbol again, like was previously possible, that just represents an interned string. This PR enables that use, keeps the other functionality still possible, and simplifies interned_string's implementation a bit.
This changes how Symbol is implemented. Now the namespace of a symbol
is optional and the namespaces themselves are Symbols.
This allows Symbol to be used with arbitrary namespaces, and allows
you to use Symbol as a simple interned string via fromQualString
and toQualString without :: in the string. This also simplifies the
implementation. Like with string conversion, builtin primitives go
through a fast path for namespace lookup while registered symbols require
holding a lock and reading an array entry to lookup the namespace.
Note: alexnet expect file update is from a previous commit. It doesn't run in CI because pytorch vision is not installed.
Closes https://github.com/pytorch/pytorch/pull/9018
Reviewed By: SsnL
Differential Revision: D8690449
Pulled By: zdevito
fbshipit-source-id: b65ee57704641d7294fe115c5470cf55d406458f
Summary:
Similar to https://github.com/pytorch/pytorch/pull/9187, This PR makes setting the `PYTORCH_TEST_WITH_ASAN` and `PYTORCH_TEST_WITH_UBSAN` flags easier internally, by allowing the flags to be set to `0`.
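The change boils down to how the environment variables are parsed; a minimal sketch of the intended behavior (the exact code in `common.py` may differ):
```python
import os

# Only an explicit "1" enables the flag, so exporting PYTORCH_TEST_WITH_ASAN=0
# now cleanly disables it instead of counting as "set".
TEST_WITH_ASAN = os.getenv('PYTORCH_TEST_WITH_ASAN', '0') == '1'
TEST_WITH_UBSAN = os.getenv('PYTORCH_TEST_WITH_UBSAN', '0') == '1'
```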
Closes https://github.com/pytorch/pytorch/pull/9202
Differential Revision: D8745533
Pulled By: yf225
fbshipit-source-id: 6293f52f2e8b1c3ef150becfdc2dd7ded56d5d80
Summary:
This is necessary for n-dimensional empty tensors, which have special native handling.
Closes https://github.com/pytorch/pytorch/pull/9197
Differential Revision: D8744083
Pulled By: gchanan
fbshipit-source-id: 3cc692a1d62cbeb169681b7c40e3df50e12953b7
Summary:
I've been cleaning up my email notifications, and noticed that this PR used a stack-allocated `random_device`. This is generally a bad idea due to this sentence from the C++ reference (emphasis mine):
> `std::random_device` may be implemented in terms of an implementation-defined pseudo-random number engine if a non-deterministic source (e.g. a hardware device) is not available to the implementation. **In this case each `std::random_device` object may generate the same number sequence.**
If this is how this object is implemented, then this `rd()` call will give the same result at every call.
cc yf225
Closes https://github.com/pytorch/pytorch/pull/9080
Differential Revision: D8748342
Pulled By: soumith
fbshipit-source-id: 22987befee61ff7faacda5ecc10138c2ac5d26ff
Summary:
Previously this used the ``.tolist`` method, which converted the
storage object into a list of Python objects, and then sent those to
pickle. For storage objects of non-trivial size, this was very slow.
Now we reuse the logic of the ``torch.save`` function to efficiently
turn the Storage object into bytes, and send those instead. This
reduces the semantic information (it's harder to interpret the bytes)
but should be orders of magnitude more efficient when serializing data
with the pickle protocol or with copy.
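A rough before/after sketch of the idea (not the exact copyreg hook used here, just the shape of the change):
```python
import io
import pickle
import torch

storage = torch.empty(1000).storage()

# Old path: materialize every element as a Python object, then pickle the list.
slow_payload = pickle.dumps(storage.tolist())

# New path: reuse torch.save's byte-level serialization and send the bytes.
buf = io.BytesIO()
torch.save(storage, buf)
fast_payload = buf.getvalue()
```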
For future work it would be nice to develop a mechanism to get a buffer
of bytes out of a Storage object, and use that alongside the current
``from_buffer`` method.
See #9168 for context
Closes https://github.com/pytorch/pytorch/pull/9184
Differential Revision: D8747794
Pulled By: soumith
fbshipit-source-id: ac598e660c043788ed1ffab3d0303812886edf79
Summary:
1. Let `ModuleTest` raise when they fail on non-contiguous inputs. Fix legacy modules.
2. Fix BN (both THNN and cuDNN) not working on non-contiguous inputs.
3. Fix CUDA EmbeddingBag not working on non-contiguous inputs (see the sketch after this list). To prevent calling `.contiguous()` in both `forward` and `backward`,
a. prefix all current `embedding_bag*` functions with `_`, indicating that they require input to be contiguous (there is a check in each function).
b. create `embedding_bag`, which makes input arguments `.contiguous()`, and calls `_embedding_bag`
4. Make many ATen `embedding*` functions work on non-contiguous inputs so we don't need to call `input = input.contiguous()` in Python `nn.functional.embedding`.
5. Fix dense-sparse addition when the sparse input is not coalesced and indices or values tensor is not contiguous. This came up in the test cases of Embedding modules with `sparse=True`. Added tests.
6. Update `TensorUtils.cpp` to use `AT_*` macros.
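A small sketch of what item 3 makes work without an explicit `.contiguous()` (illustrative shapes):
```python
import torch
import torch.nn as nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='sum')
indices = torch.arange(8, dtype=torch.long)[::2]   # non-contiguous 1-D indices
offsets = torch.tensor([0, 2], dtype=torch.long)   # two bags of two indices each
out = bag(indices, offsets)                        # no .contiguous() needed on the Python side
```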
Request:
review from cpuhrsch on the `Embedding*` changes.
review from ezyang on ATen sparse & BN changes.
Closes https://github.com/pytorch/pytorch/pull/9114
Differential Revision: D8717299
Pulled By: SsnL
fbshipit-source-id: 0acc6f1c9522b5b605361e75112c16bbe1e98527
Summary:
cc vishwakftw
Also added a check for the case where none of the input tensors passed to `gradcheck` have `requires_grad=True`.
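For illustration, a minimal `gradcheck` call of the kind the new check is aimed at (double-precision inputs with `requires_grad=True`; forgetting the flag is exactly what the added check catches):
```python
import torch
from torch.autograd import gradcheck

x = torch.randn(3, dtype=torch.double, requires_grad=True)
assert gradcheck(lambda t: (t * 2).sum(), (x,))
# Passing x without requires_grad=True used to "pass" vacuously; the new
# check flags that case instead.
```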
Closes https://github.com/pytorch/pytorch/pull/9192
Differential Revision: D8739401
Pulled By: SsnL
fbshipit-source-id: 81bb3aa0b5c04eb209b137a4bd978e040e76cbcd
Summary:
This PR makes setting the `NO_MULTIPROCESSING_SPAWN` easier internally, by allowing the flag to be set to `0`.
Closes https://github.com/pytorch/pytorch/pull/9187
Differential Revision: D8736206
Pulled By: yf225
fbshipit-source-id: b8a34cb9a747b13bc9428777a3ed766ce441cfe1
Summary:
With the Cppzation of a few files in `TH`/`THC`, the CPP extensions got broken whenever the user uses features from `THC` in their files and pytorch is installed via `python setup.py install`.
This addresses issues such as
```
/home/me/.conda/envs/pytorch/lib/python3.6/site-packages/torch/lib/include/THC/THCDeviceTensorUtils.cuh:5:25: fatal error: THCTensor.hpp: No such file or directory
```
Closes https://github.com/pytorch/pytorch/pull/9182
Reviewed By: soumith
Differential Revision: D8734581
Pulled By: fmassa
fbshipit-source-id: 2a1138f208592eaccb01fcdb805a6b369d7a497a
Summary:
Closes #9147
Added a test to prevent regression in test_torch
Added entries in docs
cc ezyang weiyangfb
Closes https://github.com/pytorch/pytorch/pull/9156
Differential Revision: D8732095
Pulled By: soumith
fbshipit-source-id: 7a6892853cfc0ccb0142b4fd25015818849adf61
Summary:
This file was added in #9107 but wasn't installed. The libraries in
./torch/lib use the headers from Caffe2/ATen from their temporary
install path at torch/lib/tmp_install, and c10d was not able to find
THC/THCGeneral.hpp before this fix.
Closes https://github.com/pytorch/pytorch/pull/9159
Reviewed By: Yangqing
Differential Revision: D8731107
Pulled By: pietern
fbshipit-source-id: d6009f6f6e8e6e0f37dea24cc4c3570736943ab1
Summary:
This will resolve the code and comments mis-match issue.
Closes https://github.com/pytorch/pytorch/pull/9070
Differential Revision: D8712261
Pulled By: ezyang
fbshipit-source-id: a8a7d8af890a41ec246e11c2a62b0bde297be9c1
Summary:
The loss plugin was using the old-style loss[0] access, which in PyTorch 0.4 and
later is an attempt to index into a scalar, generating a warning.
Replaced that with loss.item().
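In user-facing terms, roughly:
```python
import torch
import torch.nn.functional as F

loss = F.mse_loss(torch.randn(4), torch.randn(4))
# Old style: value = loss[0]  -> warns in PyTorch 0.4+, since loss is a 0-dim tensor.
value = loss.item()  # new style: extract the Python number directly
```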
This fixes
https://github.com/pytorch/pytorch/issues/9142
Closes https://github.com/pytorch/pytorch/pull/9143
Differential Revision: D8726403
Pulled By: ezyang
fbshipit-source-id: 6c496b140a74d22c8423f511db901b18615fd6fa
Summary:
- There were missing error messages for AT_CHECK in SparseTensorImpl::set_indices_and_values
- We have to check that the backends of all our inputs line up,
since native does not do it for us.
- Some math operations were missing shape tests.
Fixes #9110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Closes https://github.com/pytorch/pytorch/pull/9140
Differential Revision: D8724349
Pulled By: ezyang
fbshipit-source-id: 3c75104187aca97cbe92bb0ec24f6ded07b2c3d6
Summary:
Boolean indexing was special-cased to handle a single boolean value, but didn't generally work given multiple booleans.
This PR unifies the behavior with slicing. Note that only 'True' and torch.tensor(True) behave like NumPy due to the lack of n-dimensional empty tensors.
The corresponding tests for false values have been added, but are guarded behind a flag until we add n-dimensional empty tensors.
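A quick sketch of the unified behavior (mirroring NumPy, where indexing with a true scalar inserts a leading dimension of size 1):
```python
import torch

x = torch.arange(6).view(2, 3)
print(x[True].shape)                # expected: torch.Size([1, 2, 3])
print(x[torch.tensor(True)].shape)  # same as above
# x[False] needs n-dimensional empty tensors, so it stays behind the flag for now.
```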
Closes https://github.com/pytorch/pytorch/pull/8920
Reviewed By: ezyang
Differential Revision: D8661876
Pulled By: gchanan
fbshipit-source-id: 0dc8a45a303aa41f729d04ab8908cfaf2e3ce3d7
Summary:
Closes https://github.com/pytorch/pytorch/pull/9121
This main function causes 'buck test caffe2_test_cpu' to run 0 tests
Reviewed By: orionr
Differential Revision: D8719343
fbshipit-source-id: dc1cf76b0355637eaae193be2159f5746873b9f9
Summary:
Some functions are exactly implemented in THStorage_; in that case,
we called those functions directly.
Stacked on #9135
Closes https://github.com/pytorch/pytorch/pull/9136
Reviewed By: Yangqing
Differential Revision: D8723998
Pulled By: ezyang
fbshipit-source-id: 653d23a5e1db4b9bdda50641fa97730894cc8ed5
Summary:
There is no dedicated way to concatenate two `Sequential`s in Python, but it's easy enough to do in an immutable fashion by just writing `Sequential(first.modules() + second.modules())`. Concatenating vectors isn't as easy in C++, so I think it's fair to save users some for loops by giving them `Sequential::extend()`.
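For reference, a Python sketch of the immutable-style concatenation alluded to above (iterating a `Sequential` yields its child modules):
```python
import torch.nn as nn

first = nn.Sequential(nn.Linear(4, 8), nn.ReLU())
second = nn.Sequential(nn.Linear(8, 2))
combined = nn.Sequential(*(list(first) + list(second)))
```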
apaszke ebetica ezyang
CC jamespinkerton
Closes https://github.com/pytorch/pytorch/pull/9116
Reviewed By: ezyang
Differential Revision: D8719630
Pulled By: goldsborough
fbshipit-source-id: 840d7ac70755350e6202b493c531e30ecbb6546f
Summary:
The tests were using the old args, which caused them to emit a lot of deprecation warnings.
Closes #9103.
Reviewed By: ezyang
Differential Revision: D8720581
Pulled By: li-roy
fbshipit-source-id: 3b79527f6fe862fb48b99a6394e8d7b89fc7a8c8
Summary:
Closes https://github.com/pytorch/pytorch/pull/9107
Some details about how this was done:
- For now, the allocators for CPU and CUDA are different (unifying
the allocators is a bigger change to make, I'll contribute this in
a later patch). To smooth this over, the allocator field now
stores a void* instead of THAllocator* or THCDeviceAllocator*; to
make this clear the field is renamed to allocatorVoidPtr.
- Some THStorage functions which were generated per-scalar are now
generalized, and thus moved out of the generic/ library. This way
they can be called directly from a non-code-generated at::Storage
- THCState is moved into a C++ header. This is actually not really
related to this particular diff, but I'll need it soon to replace
THAllocator/THCDeviceAllocator with at::Allocator (C++, so I can't
mention it in a C header file.)
- THPPointer needs to be adjusted, since there is no more type refinement
between THStorage/THCStorage for it to template match over. This
is a little tricky, because I can't refer to THCStorage_free unless
we actually compile with CUDA. So there's two copies of the function
now: one for the CPU build, one for the CUDA build. If we ever split
CUDA/non-CUDA Python builds, you will have to indirect this through some
dynamic dispatch.
I want to soon replace the THCDeviceAllocator pointers in
THCState with at::Allocator, but I can't reference a C++ namespaced type
from C code, so THCState needs to move.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Closes https://github.com/pytorch/pytorch/pull/9087
Reviewed By: orionr
Differential Revision: D8712072
Pulled By: ezyang
fbshipit-source-id: c6e1ea236cd1df017b42a7fffb2dbff20d50a284
Summary:
Having circulated the C++ API a bit, I found that it would be easier for folks to access module parameters directly rather than through the `parameters()` map. So here I make all variables/submodules and also the configuration options for every module public.
For RNNs, I also updated the names of parameters to match PyTorch, e.g. `hhw` -> `w_hh`. This should make it easier to transition from Python.
apaszke ebetica
Closes https://github.com/pytorch/pytorch/pull/9111
Differential Revision: D8717112
Pulled By: goldsborough
fbshipit-source-id: 3d36d5e161f7a86f44db7136c9c2fa53067abe1c
Summary:
Closes https://github.com/pytorch/pytorch/pull/9108
OperatorDef ownership was given to the net in the past, we no longer
want to do that
Reviewed By: pjh5
Differential Revision: D8705347
fbshipit-source-id: 34976de202a7a7a71b935dd13c1bc8e9c73552e0
Summary:
Since we leave `weight` as the last computed weight in eval mode, we need to detach it from its computation graph so that backward can still be used.
The typical use case is in GANs when the discriminator has spectral norm, is in eval mode and we want to backprop through the discriminator to get weight gradients for the generator.
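A hedged sketch of that use case (names are illustrative):
```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

D = spectral_norm(nn.Linear(16, 1))
D.eval()  # no power-iteration updates; the cached weight is used
fake = torch.randn(4, 16, requires_grad=True)
D(fake).sum().backward()  # with the detach fix, this backward succeeds
print(fake.grad.shape)    # gradients flow back to the generator's output
```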
Closes https://github.com/pytorch/pytorch/pull/9020
Reviewed By: ezyang
Differential Revision: D8694054
Pulled By: SsnL
fbshipit-source-id: 09ee5843687cac3ed4c40759ac577a14c5371730
Summary:
Closes https://github.com/pytorch/pytorch/pull/9035
This diff builds on the structure in the stacked diff to add serialization/deserialization. It supports the old format and a new suggested format.
Reviewed By: ilia-cher
Differential Revision: D8415115
fbshipit-source-id: acaacce2b015f4c6ac0ae22625455290a3f30262
Summary:
add two small bindings to recently added attributes.
Also want to leave a reference gist here: https://gist.github.com/soumith/8102ef39530bac09070912b1a5401d0f
It showcases:
- traced a module
- symbolically differentiated the forward graph, to get a forward, backward graph
- executed the subsequent forward + backward graphs correctly
- compared the jit vs non-jit results
Closes https://github.com/pytorch/pytorch/pull/8890
Reviewed By: ezyang
Differential Revision: D8677663
Pulled By: soumith
fbshipit-source-id: a29919c05baad997cd7fb7df718f933a83035118
Summary:
Closes https://github.com/pytorch/pytorch/pull/9072
Use FixedDivisor in Reduce and Broadcast CUDA kernels
Reviewed By: houseroad
Differential Revision: D8710243
fbshipit-source-id: 6f1da12234898594a1be8c979d942aa515832aeb
Summary:
This will resolve some of the timeout issues in CPU and GPU tests internally.
Closes https://github.com/pytorch/pytorch/pull/9061
Reviewed By: ezyang
Differential Revision: D8707471
Pulled By: yf225
fbshipit-source-id: 9dc82a2c9da0c540ae015442f74b9b2b1a67a246
Summary:
Fixes #9049.
When provided with a domain string that lacks proper prefix, i.e. `org.pytorch.`, an exception is thrown.
Closes https://github.com/pytorch/pytorch/pull/9053
Differential Revision: D8708264
Pulled By: ezyang
fbshipit-source-id: e2593d8d36a17d3bb26fc0b239a61b84f1c38ecb
Summary:
Closes https://github.com/pytorch/pytorch/pull/9057
Make the `_C` target depend on the `csrc-no-python` target. Also removes the `csrc` target and the with-python version of autogradpp (which is not used). Let me know if we should pick better names here.
I also ran into a nasty linker issue with only one symbol being undefined. It turns out it had been given inline linkage in the `.cpp` file, which I believe is an error.
Reviewed By: orionr
Differential Revision: D8705750
fbshipit-source-id: 8de083e371dbf5e9f12c15572d88e1c595dfa087
Summary:
Closes https://github.com/pytorch/pytorch/pull/8933
The SpatialBN implementation cannot deal with an empty batch; this diff enables the zero-batch setting:
during training, when batch_size = 0:
in forward, the output's saved_mean and saved_var are zeros.
in backward, the gradients for SCALE_GRAD and BIAS_GRAD are zeros.
Reviewed By: pjh5
Differential Revision: D8644699
fbshipit-source-id: 599ea687329d68699c987e05f56f409f4e729d1c
Summary:
1. Instead of using the non-`_out` variant, we allocate a buffer and use the `_out` variant to write the intermediate results into the buffer.
2. Reduce dimensions in order of decreasing sizes.
Benchmark:
Sum a randn tensor of shape `[200, 1, 30, 40, 20, 1, 50]` along dimensions `[4, 6, 3, 0, 2, 5]`. Averaged across 1000 times:
```
before patch:
CPU: 0.0441 s
CUDA: 0.0273 s
after patch:
CPU: 0.0234 s
CUDA: 0.0047 s
```
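For reference, a rough Python re-creation of that benchmark workload, assuming the multi-dim reduction is expressed as successive single-dim sums (the real benchmark exercised the ATen path directly and averaged over 1000 iterations):
```python
import time
import torch

x = torch.randn(200, 1, 30, 40, 20, 1, 50)
dims = [4, 6, 3, 0, 2, 5]

start = time.time()
for _ in range(10):
    y = x
    for d in sorted(dims, reverse=True):  # reduce higher dims first so indices stay valid
        y = y.sum(d)
print((time.time() - start) / 10, y.shape)  # remaining shape: torch.Size([1])
```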
Closes https://github.com/pytorch/pytorch/pull/8992
Differential Revision: D8681069
Pulled By: SsnL
fbshipit-source-id: 2c5d5af5c5a284f2e945181f2b24ee8c78becd50
Summary:
The goal of this PR was to add support for dropout descriptors in the C++ API's RNN class.
The end result is a 4x-5x speedup for our RNN integration tests since they can now use cuDNN instead of autograd when dropout is set.
To achieve this, I had to move `_cudnn_init_dropout_state` to the `TensorOptions` API.
I also fixed a bug around `RNN::cuda()` not flattening parameters for cuDNN.
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/9012
Reviewed By: pjh5
Differential Revision: D8689786
Pulled By: goldsborough
fbshipit-source-id: 44fb191f5a38e41c4ded5417306b5bbc012cd56c
Summary:
Addresses #7415. Adding a note first; will do the API change if there's a need in the future.
Closes https://github.com/pytorch/pytorch/pull/9019
Differential Revision: D8694056
Pulled By: ailzhang
fbshipit-source-id: 0b6fa43fa62ac55deff3b3b099d1bc9fee74a5f9
Summary:
Add BatchTensor class
- construct from data, mask, dims or construct from list of tensors
- can return a list of tensors from a BatchTensor class
next step: do IR level transformation and operators
Closes https://github.com/pytorch/pytorch/pull/8922
Differential Revision: D8668986
Pulled By: ChunliF
fbshipit-source-id: 8b24d2a9f46a3b42dbb397e99e9e059dfb2b326e
Summary:
Just tried these and they work now
Closes https://github.com/pytorch/pytorch/pull/9044
Reviewed By: soumith
Differential Revision: D8698819
Pulled By: jamesr66a
fbshipit-source-id: 1d5574de1819aa31fc36ad245186c7aa68587178
Summary:
Closes https://github.com/pytorch/pytorch/pull/9037
Fixes flaky test failures due to port in use.
Reviewed By: soumith
Differential Revision: D8696779
fbshipit-source-id: a05412d1eb1dcb9a4b35023dead371aa33d62c39
Summary:
Tell people to run with num_workers=0 when a DataLoader worker fails
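The suggestion from the new error message, in code form (a toy dataset stands in for whatever the user was loading):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
# Debugging tip from the improved message: rerun with num_workers=0 so a
# failing sample raises in the main process with a full traceback.
loader = DataLoader(dataset, batch_size=32, num_workers=0)
```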
Closes https://github.com/pytorch/pytorch/pull/9007
Differential Revision: D8686005
Pulled By: SsnL
fbshipit-source-id: bf872267f609c7b86e943061caab953149507bfe
Summary:
Any flags linking libraries only take effect on inputs preceding them,
so we have to call `$cxx $in $ldflags -o $out` instead of the other way
around.
This was probably not detected so far since the torch libraries are
already loaded when loading JIT-compiled extensions, so this only has an
effect on third-party libraries.
This also matches our behavior on Windows.
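Where this shows up for users, roughly (a hypothetical `my_ext.cpp` linking a third-party library via `extra_ldflags`; before this fix the `-l` flag preceded the objects on the link line and the symbols went unresolved):
```python
from torch.utils.cpp_extension import load

# Hypothetical extension source and library name, for illustration only.
ext = load(
    name='my_ext',
    sources=['my_ext.cpp'],
    extra_ldflags=['-lcurl'],
    verbose=True,
)
```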
Closes https://github.com/pytorch/pytorch/pull/9021
Reviewed By: soumith
Differential Revision: D8694049
Pulled By: ezyang
fbshipit-source-id: e35745fc3b89bf39c14f07ce90d6bd18e6a3d7cc
Summary:
This is an initial implementation of Distributed Data Parallel module for c10d GLOO and NCCL backend.
Have done performance testing and made sure that both single-GPU-per-process and multi-GPU-per-process setups are able to overlap communication with backward computation.
The idea is that DDP buckets parameters and does all-reduce in the reverse order of the buckets. Since all c10d ops are async, no dedicated thread is needed anymore; we simply queue the all-reduce kernels once a bucket is ready, following the deterministic reduction order.
Tested with 8 nodes 64 GPUs, ResNet 50, hit the required accuracy within 90 epochs
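A hedged usage sketch using today's standard entry points (the exact module path of the c10d-based DDP at the time of this PR may have differed; MASTER_ADDR, MASTER_PORT, rank, and world size are assumed to be provided by the launcher):
```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='nccl', init_method='env://')
model = torch.nn.Linear(10, 10).cuda()
ddp_model = DistributedDataParallel(model)  # buckets grads and all-reduces them asynchronously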
Closes https://github.com/pytorch/pytorch/pull/8584
Reviewed By: goldsborough
Differential Revision: D8678696
Pulled By: teng-li
fbshipit-source-id: 440341b804befc6762e92acece2759ba47157cea
Summary:
Enable script for the time-sequence prediction example. I did a bunch of hacks to make script mode work, and a couple of issues were discovered while enabling it, all noted in #8452.
Shall we merge this PR and iteratively fix those issues thereafter?
Closes https://github.com/pytorch/pytorch/pull/8862
Differential Revision: D8677683
Pulled By: wanchaol
fbshipit-source-id: 02319cd56c87de523be898f0e6c541dd15e57cac
Summary:
When initializing weights for my C++ model, I had to write
```cpp
void initialize_weights(nn::Module& module) {
  if (module.name().find("Conv2d") != std::string::npos) {
    module.parameters()["weight"].data().normal_(0.0, 0.02);
  } else if (module.name().find("BatchNorm") != std::string::npos) {
    auto parameters = module.parameters();
    parameters["weight"].data().normal_(1.0, 0.02);
    parameters["bias"].data().fill_(0);
  }
}
```
The string-based module determination is not very nice, and not very C++-y. So I created `nn::Module::is<T>` which does a `dynamic_cast` inside. It also handles the `ModuleHolder` vs. `Module` distinction.
It now becomes
```cpp
if (module.is<nn::Conv2d>()) {
  module.parameters()["weight"].data().normal_(0.0, 0.02);
} else if (module.is<nn::BatchNorm>()) {
  auto parameters = module.parameters();
  parameters["weight"].data().normal_(1.0, 0.02);
  parameters["bias"].data().fill_(0);
}
```
ebetica ezyang apaszke
Closes https://github.com/pytorch/pytorch/pull/8970
Differential Revision: D8677476
Pulled By: goldsborough
fbshipit-source-id: 053294e19b6a58cce868167596c89639f7de91c2
Summary:
Currently the `test_RNG_after_pickle` in the PR would fail because pickling a tensor changes the RNG state. This PR aims to fix it.
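The invariant this enforces, as a small sketch:
```python
import pickle
import torch

t = torch.randn(3)
state_before = torch.get_rng_state()
pickle.dumps(t)  # with this fix, serialization no longer consumes RNG state
assert torch.equal(state_before, torch.get_rng_state())
```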
Closes https://github.com/pytorch/pytorch/pull/8971
Reviewed By: ezyang
Differential Revision: D8677474
Pulled By: yf225
fbshipit-source-id: 1713d9611699ad288b66d92dbb29ce9feb34b8cf
Summary:
- fixes log1p at #8853
- added log1p of sparse tensor in ATen
- make log1p of a sparse tensor non-differentiable and raise an error, because the local derivative of log1p at a zero element is 1 / (0 + 1) = 1, which would make the gradient (and hence the tensor) dense
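A small sketch of the new ATen support (out-of-place `log1p` on a coalesced sparse tensor; gradients are refused for the reason given above):
```python
import torch

i = torch.tensor([[0, 2]])    # indices of the non-zero entries
v = torch.tensor([1.0, 2.0])  # their values
s = torch.sparse_coo_tensor(i, v, (3,)).coalesce()
out = s.log1p()               # sparse result; log1p is applied to the values
print(out.to_dense())
```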
Closes https://github.com/pytorch/pytorch/pull/8969
Reviewed By: ezyang
Differential Revision: D8677491
fbshipit-source-id: 8363a613519de4bc75eda087ccd20a3eb2d18126
Summary:
The problem was a bad regex; the version hash match used to match 6
wildcards. This PR changes it to match \w+, which is sufficient for the
test because the version hash is always followed by either whitespace or
a right-paren.
Fixes #8981
Closes https://github.com/pytorch/pytorch/pull/8983
Differential Revision: D8677771
Pulled By: zou3519
fbshipit-source-id: dfdde98669bcd682335145cba98c82530a815afa
Summary:
Will bump up to opset 8 in another PR to match the current opset version.
Already tested by generating the models in the current model zoo.
Closes https://github.com/pytorch/pytorch/pull/8854
Reviewed By: ezyang
Differential Revision: D8666437
Pulled By: houseroad
fbshipit-source-id: feffdf704dd3136aa59c0f1ff1830c14d1bd20aa
Summary:
Operations on `Variable`s (or `torch::Tensor`) usually return `at::Tensor`. This is usually fine, but the `AnyModule` used in the implementation of `torch::Sequential` is very picky about types, and does not understand implicit conversions like this. This means that `sequential.forward(at_tensor_that_is_actually_a_variable)` will fail unless you wrap `at_tensor_that_is_actually_a_variable` with `torch::Tensor`.
This PR adds a special case to `AnyModule` that will convert an `at::Tensor` to `torch::Tensor` when the tensor is really a variable, and else just pass the `at::Tensor`. This is a nice little usability improvement for the often-used `Sequential` class.
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/8968
Reviewed By: ezyang
Differential Revision: D8670407
Pulled By: goldsborough
fbshipit-source-id: 3635ed6ed28238f3900ce4a876d07f1b11713831
Summary:
This PR does 3 things
- Reorder the search order of `intel_lp64` and `gf_lp64` as the first one is more essential and should have high priority.
- Avoid repetitive searching of MKL libraries in `ideep` and `mkldnn` submodule if we already found those in `FindMKL`
- Avoid adding more MKL dependencies to IDEEP if MKL is also found.
TODO: provide an option for the user to choose iomp or gomp.
Closes https://github.com/pytorch/pytorch/pull/8955
Reviewed By: bddppq
Differential Revision: D8666960
Pulled By: yinghai
fbshipit-source-id: 669d3142204a8b47c19a900444246fc44a139012
Summary:
disable operator tests for now until we have enough rocm workers in CI
Closes https://github.com/pytorch/pytorch/pull/8720
Reviewed By: ezyang
Differential Revision: D8654871
Pulled By: bddppq
fbshipit-source-id: ff2504d6a7182f85f7cc15618f2df8e512447fa8
Summary:
Closes https://github.com/pytorch/pytorch/pull/8959
MKL-DNN doesn't support 0-dim tensors. As a workaround, we produce a CPUTensor instead of an Ideep tensor in the fallback ops, and for those tensors we don't need the Ideep copy op anymore.
Reviewed By: viswanathgs
Differential Revision: D8665168
fbshipit-source-id: 59678de2c5aed8c691ab5caaadede6d6c000dd7b
Summary:
Sets the random seed at the start of C++ tests so that everything is super deterministic.
I made sure we only generate random values from torch instead of `std::`, so that this seed always applies. I.e. I do:
```
torch::randint(2, {2}, at::kInt64)
```
instead of
```
std::rand() % 2
```
Also got rid of the tests that test the random seeding, since it would interfere here. And the test is not useful since we just use ATen's seeding mechanism, which should work.
Fixes #7288, #7286, #7289
ebetica ezyang
Closes https://github.com/pytorch/pytorch/pull/8903
Differential Revision: D8667269
Pulled By: goldsborough
fbshipit-source-id: a833e86e156d5e68dae8c53a4b1c433cb0608b6c
Summary:
Closes https://github.com/pytorch/pytorch/pull/8927
Closes https://github.com/pytorch/pytorch/pull/8855
- Add parameter `enable_tracing` to the Arg field of NetDef. `net_async_tracing` will only enable Tracer for Net instances that have this field set (unless the command line argument also include the net name).
- Append a unique id to the json profiling result file because there could be multiple instances of the same net running.
- Dump the json profiling file regularly instead of just when the Tracer object is destroyed
Reviewed By: ilia-cher
Differential Revision: D8372378
fbshipit-source-id: 8adc9d59f48b67456beed2e3a88235c298fdfd01
Summary:
This PR is the final step to making `torch::` the only namespace users of the C++ API ever see. Basically, I did:
``` cpp
namespace torch {
using namespace at;
}
```
And then changed `torch::` to `at::` almost everywhere. This worked surprisingly well out of the box. So users can now write `torch::relu` and `torch::log_softmax` and `torch::conv2d` instead of having to know when to use `at::` and when `torch::`. This is happy!
Another thing I did was to have `using Dtype = at::ScalarType`, which will be the eventual name anyway.
ebetica ezyang apaszke zdevito
Closes https://github.com/pytorch/pytorch/pull/8911
Reviewed By: ezyang
Differential Revision: D8668230
Pulled By: goldsborough
fbshipit-source-id: a72ccb70fca763c396c4b0997d3c4767c8cf4fd3
Summary:
Closes https://github.com/pytorch/pytorch/pull/8951
Change the default value of the max decode error rate to 1.0, which means we don't throw such a runtime error by default
Reviewed By: avulanov
Differential Revision: D8665640
fbshipit-source-id: 9d373979dd8a97253ad528b167f8d73a28fee82a
Summary:
No longer required now that we've switched over to ShipIt on master.
Closes https://github.com/pytorch/pytorch/pull/8950
Reviewed By: Yangqing
Differential Revision: D8666175
Pulled By: orionr
fbshipit-source-id: 6d8b8b38f6558d87cabd0aa19b72a390057c137b
* add opencl + fpga context
adds an opencl context inside caffe2/fb which can be used for fpga access
* [Caffe2] Force tensor inference checks to be triggered during testing
We've started to rely on TensorInference functions more for different analyses. This diff ensures that the TensorInference function's result matches what is expected from the definition of the operator.
* Enable building //caffe2:torch with @mode/opt
In @mode/opt, python runs out of a PAR, which breaks a lot of
assumptions in the code about where templates/ folders live relative
to __file__. Rather than introduce hacks with parutil, I simply turn
template_path into a parameter for all the relevant functions and
thread it through from the top level.
* [Caffe2] Fix cost models for DotProduct and Div. Update Tensor Inference for dot product
As title. DotProduct states that the output is a 1-D tensor (https://caffe2.ai/docs/operators-catalogue.html#dotproduct), though the code suggests it is either 0- or 1-D depending on the inputs. The TensorInference function is defined to match the implementation.
* [SG-MoE] Add an option to make the experts NOT as components
* [nomnigraph] Rename and fixup convertToNeuralNetOperator API
This will make things a bit cleaner
* no longer symlink THNN.h and THCUNN.h
* forced decoder network (onnx export)
Closes https://github.com/pytorch/translate/pull/95
Add networks in ensemble_export.py to create a forced decoding network from PyTorch NMT checkpoints. This network takes an arbitrary numberized (source, target) pair and returns the model score for the translation, including penalties.
Vocabulary reduction networks are also supported, but note that target indices which are not in the possible_translation_tokens generated for the source input will be trea
* Revert schema change to fix production models
Revert schema change to fix production models
* MockLogDeviceReader - rebase on FIX
# Goal
1), Build a make_mock_log_device_reader using make_mock_reader
2), Replace the real log_device_reader here: https://fburl.com/raihwf1p
# Log by D8151734
Real log_device_reader:
```
I0529 20:29:05.373108 954994 tensor.h:839] Tensor print_net/log of type std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >. Dims: (): read_net/ParseOpenTrainingRow:0
I0529 20:29:05.373244 954994 tensor.h:839] Tensor read_net/ParseOpenTrainin
* [C2/D2][1/n]: Nonnegative-Constrained Optimization -- log barrier
implement log barrier as a regularization method
* Add teacher weight screening.
Add teacher weight screening according to teacher labels. If the teacher label is zero, we do not use the distill loss in the objective function.
* Add NormalizerContext
See task for more detail. This implementation is a copy of what exists for RegularizerContext except for how the parameters are defined in the model_definition thrift file.
I'll try an alternative implementation which overrides the default arguments of functions instead like for argscopes in tensorflow.
https://github.com/pytorch/pytorch/compare/master...MaximeBoucher:update-from-facebook-0939578c068c?expand=1
* Adding cosine similarity option in dot processor
Add pairwise cosine similarity option in dot product.
Add an option to concatenate dot product and cosine similarity.
Add test cases.
* [nomnigraph][redo] Concat elim for sparseNN
Same as D7962948, which was reverted because Operator Schema was not
defined
* [pytorch] Revert pytorch/pytorch#7918 'Release GIL when copying to shared memory', breaks ASAN
Revert this pytorch diff that breaks ASAN when running Filament in dev mode; in opt mode it gives "bad file descriptor" errors. Looks like a race when copying tensors to shared memory in multiple mp.Queue's (which spawn separate threads).
https://github.com/pytorch/pytorch/pull/7918/files
* [nomnigraph][mobile] Enable nomnigraph by default, use -Oz on nomnigraph related code to reduce code size
enables nomnigraph and reduces codesize
* [Warmup] Allow both offline incremental training and online training
Change plan name on saving side and reading side to support both training type
This diff depends on D8128530 and D8168651.
* Revert D7802642: [Warmup] Allow both offline incremental training and online training
This reverts commit afc213cf9b36cecf75333a788391c4d09f4afccc
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* Add legacy grad logic to fix div op on old graphs.
Add legacy grad logic to fix div op on old graphs.
* Correctly propagate operator failures
Propagate errors from operators that throw exceptions and return false
* Revert D8374829: [caffe2][nomnigraph][redo] Concat elim for sparseNN
This reverts commit 6dda028c463e54bb5c32188bbbe9202107e188a5
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [Caffe2] Added extra_info to core.DeviceOption(), enforced extra_info to be inherited in scope.DeviceScope
extra_info is a newly defined field in DeviceOption proto. This diff added extra_info to the core.DeviceOption(). And, In scope.DeviceScope(), this diff enforce the new scope to inherit the extra_info from old scope.
* [opt] hgdirsync wasn't enabled, merge diverged code
Here's the damage (P59732616): basically xplat was left behind but had
the change from assert to CAFFE_ENFORCE.
* OMP parallelism over RoIs for RoIAlign op
Simpler to parallelize over RoIs. Shouldn't affect other uses as it relies on
the number of OMP threads set during startup.
PR: https://github.com/pytorch/pytorch/pull/8562
* Use int64_t for shape in FillOps
to avoid overflow of int32
* Implement Rotated RoIAlign op
Based on Rotated RPNs as explained in https://arxiv.org/abs/1703.01086.
The idea is simple - orientation/angle is added as an RPN
anchor parameter and then the angle is further regressed similar to bbox
coords. There are some additional changes related to NMS and IoU, but besides
that it's a direct extension to Faster-RCNN. Further details in https://fb.quip.com/sZHlA1iMfWPZ.
RoIs are represented in [center_x, center_y, width, height, angle] format.
`angle` repre
* Rotated RoIAlign op CUDA forward implementation
CUDA forward impl for D8415490
* RoIAlignRotated op CUDA backward pass implementation
TSIA
* All remaining fixes to eliminate process_github.sh
Most of this diff has already been reviewed separately, except for the parts relating to _thnn/utils.py and _utils._internal.py
remove skipIf(True, 'Fbcode') line from process_github.sh
replace sed of cpp file with #ifdef to control cudnnDestroy use
undo sync-time deletion of .gitattributes, remove process_github.sh
switch to using _utils._internal rather than try-import-except
This diff also fixes the open-source bug where rebuilds have
* Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"
Original commit changeset: 7707d2efe60e. The original diff is backed out because the online trainer package is backed out. This code would only work with the new online trainer package.
* [easy] improve error log in adagrad op
as title
* re-allow use of thnn_h_path
This fixes cffi usage in OSS
* [4/4] [tum] parallelizing layerNorm for GPU full sync
as title
* add compile=False to pytorch tests, remove hack with pyc
* Add shape and type inference for RowWiseArgMax operator
See title
* Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training"
This reverts commit 78167eeef0af16b60f72c82f9dcdda9b41b4dcbd
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [fix-flaky-test] mock_hive_reader_test flaky, because GlobalCounter collects local counts intervally
# Problem
`MockHiveReader` uses `GlobalCounter` to limit `max_examples`.
GlobalCounter on server node collect local counts from worker nodes every 1 sec.
This 1 sec delay makes it impossible to limit exactly to `max_examples`; it will definitely exceed `max_examples`.
# Plan
Given,
```
Expected num_examples = max_examples + num_examples/sec (Read Speed) x 1 sec (GlobalCounter Sync Int
* [Caffe2] Fix FCGradient cost inference. Prevent overflow in cost inference
FCGradient missed a factor 2 in the `num_outputs == 3` case. Overflow was occurring with flop calculation for FC. Changed types to `uint64_t` to prevent future problems.
* Fix binary ops with empty inputs
Fix binary ops with empty inputs
* Support the filling of input blob with provided data
as title for Biz Integrity case
* Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""
Original commit changeset: 30c55dd38816 Original diff is reverted due to introducing bad integration test. Fixed the integration test.
* [c2][easy] improve pack ops error loggings
as desc.
* Add ShapeTypeInference for LpNorm operator
As desc
* Shard test_nn to reduce runtime for each test target
Closes https://github.com/pytorch/pytorch/pull/8793
The current test_nn would time out and be disabled in GreenWarden, and we need to have an option to split it up in order to pass the stress test. Right now GreenWarden roughly allows running 100 test cases in test_nn before timing out, and here we have an option to divide test_nn into 30 shards (with ~40 tests in each shard) to allow for some test suite growth in the future.
* Change default caffe2_streams_per_gpu to 1
* Remove IN_SANDCASTLE from common.py and test_nn.py
We prefer to disable the failing tests through Sandcastle UI instead.
* Add a new class for an updated prof_dag.proto
This diff contains:
- An updated prof_dag.proto that contains blob profiles.
- A class to deserialize this information (serialization is in a follow up diff)
- Update to separate profiling information from NeuralNet (and use it as part of the class above).
- Unit tests
* Lambdarank for SparseNN
This diff adds a lambda_rank_layer for SparseNN.
changes include
1) Adds support for multi sessions in c2 op
2) Adds support for two different loss functions in c2 op
3) Unit tests for op
* Revert D8586950: Back out "Revert D8515341: Back out "Revert D7802642: [Warmup] Allow both offline incremental training and online training""
This reverts commit 012220ed63eccc35659a57b31d16a3625da6317b
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [easy] A few fixups to multithread predictor benchmark
(1) support perf on T6 server
(2) remove dead code
* fix a bug about the map size
as title
* Fix reduce sum on in-place case.
Fix reduce sum on in-place case.
* [Warmup] Reland reverted diff Allow both offline incremental training and online training
Closes https://github.com/pytorch/pytorch/pull/8827
fix net transform integration test. Allow offline and online trainer to coexist D7802642.
* Add StoreHandlerNotAvailableException
Add an exception for a store that is not available or has been
deleted.
* Use exception handling for fault tolerance, missing KV store
Remove status blobs to communication ops so that exceptions propagate on
failure.
* [C2/D2][2/n]: Nonnegative-Constrained Optimization -- bounded grad proj
for simple bounded constrained optimization, incl non-negative box constraints.
* [GanH]: Adaptive Weighting with More Estimations
With implemented postivity optimization, we now learn adaptive weights with different
parameterizations.
This improves parameter estimation and training stability.
* Revert some changes for landing
* Remove AutoNoGIL in StorageSharing
* Temporarily disable net_tests
* Revert "[Caffe2] Force tensor inference checks to be triggered during testing"
This reverts commit 67ef05c22b2f71b4a489695384932f968384a2a4.
* Revert "Fix reduce sum on in-place case."
This reverts commit 6cb8a8e1b3db7b6d20941b0053e3f3836068eb64.
* Revert "Revert "Fix reduce sum on in-place case.""
This reverts commit 130a257c0893dc09f4bd6e6a45d112261807fd2c.
* use conda cmake in pytorch-linux-xenial-cuda8-cudnn6-py2 and pytorch-linux-xenial-cuda9-cudnn6-py3
* update test_expect
* add exit 1
* check cmake 3.5
* bump expect driver version
* add back space
* Better forward methods in C++ API
capitalize error message in test_torch.test_flatten
Support for operator()
* Add operator() to Functional
* Get rid of SigmoidLinear
* Add BoundFunction to FunctionalImpl
* Remove macro from conv because it makes errors more nasty
This should be set by the code that instantiates it, be it the Python
bindings or other C++ code. Defaulting to use localhost is not useful
beyond tests. Instead of keeping multiple default paths around we can
punt on it here and require it to be initialized elsewhere.
There is no relevant state in PinnedMemoryAllocator, so we
can have a single allocator with static lifetime.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rework optim folder
* Removed TORCH_OPTIMIZER_CLASS macro
* Got rid of CRTP/Impl
* Removed TORCH_AUTOGRAD_KWARG
* Differentiate between Optimizer and LossClosureOptimizer
* Make Optimizers parameters based instead of model based
* Allow construction of optimizer from arbitrary vector
* Added test for zero grad
* Added test for external parameter vectors
* Now comparing against baseline values
* Documentation
* Post rebase fixes
* Different strategy for creating and accessing buffers in optimizers
* Fix member ordering
* Unify isViewable, handle n-dimensional empty tensors.
1) Unifies the two isViewable functions in ATen and TH.
2) Handle n-dimensional empty tensors in the implementation
3) Clarify some comments.
This requires an extra copy in the TH case, but that will go away.
* Also unify THCTensor version.
* Remove C-linkage from THTensor_compute_stride.
* Update comment.
* Add pos_weight argument to nn.BCEWithLogitsLoss and F.binary_cross_entropy_with_logits (#5660)
- Add an option to control precision/recall in imbalanced datasets
- Add tests (but new_criterion_tests)
* Move pos_weight to the end of args list in the documentation.
`pos_weight` was moved to the end because it is the last argument in both
`nn.BCEWithLogitsLoss` and `binary_cross_entropy_with_logits`
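A short sketch of the new argument (one output class, so `pos_weight` has a single entry; values > 1 up-weight positive examples to trade precision for recall):
```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 1)
targets = torch.randint(0, 2, (4, 1)).float()
pos_weight = torch.tensor([3.0])  # weight positive examples 3x
loss = F.binary_cross_entropy_with_logits(logits, targets, pos_weight=pos_weight)
```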
If CUDNN_INCLUDE_DIR, CUDNN_LIB_DIR, and/or CUDNN_ROOT_DIR were set,
but USE_CUDNN was not explicitly set, the code in
cmake/Dependencies.cmake would set USE_CUDNN=OFF even though it could
be found. This caused an issue in ATen, where it includes its CuDNN
bindings if the variable CUDNN_FOUND is set. This was the case,
because the find_package call in cmake/public/cuda.cmake searches for
CuDNN and ends up finding it. The net result is that ATen tried to
compile CuDNN bits, but the caffe2::cudnn target is never defined let
alone added as dependency, and the build fails on not being able to
find the header cudnn.h.
This change does two things:
1) Restore CuDNN autodetection by setting USE_CUDNN=ON if it is found.
2) Remove obsolete FindCuDNN.cmake module. This functionality now
lives in cmake/public/cuda.cmake.
List dependency on gloo_cuda before dependency on gloo such that
unresolved symbols in gloo_cuda are correctly resolved (since the linker
resolves from left to right).
This fixes building c10d C++ tests on GCC 4.8.
currently torch/CMakeLists doesn't know how to find nanopb without
some higher-level script (setup.py or build_all.sh) telling it where
to look, which is an obstacle towards fully CMake-ifying libtorch.so.
This change removes that dependency.
* Bag of fixes
* Rename tensor_range.h to tensor_list_view.h
* Post rebase fixes
* Rename torch::tensor namespace to torch::tensors due to name conflict
* Avoid recursion in Module::to
This commit implements the solution proposed in https://github.com/pytorch/pytorch/issues/8410
to workaround the need to create zero tensors with the same shape as inputs.
It introduces the concept of a LinearBlock which marks places in the code
where we know if all the inputs to the node are zero, then the outputs
to the node are also zero. Autodiff introduces LinearBlocks around
backwards functions, which have this property. specializeUndef then
propagates Undef nodes using this information.
Notes:
* Since we do not always specialize, we have a pass LowerLinearBlocks
that replaces the block with an if statement that dynamically guards
the Undef case.
* We introduce AutogradAdd which is addition that still works when
its inputs might be undefined. In cases where we specialize this will
get removed in favor of a normal add, but there are cases where
gradient graphs do not specialize (e.g. when they are not differentiable,
but a derivative is required) so it is important for this op to be executable.
* make as_strided safer
* patching as_strided; and stop using it in backward
* Test a simple case in as_strided_backward
* a long note
* remove boundary checks of as_strided; implement slow path
* wip
* fix as_strided backward when input is overlapping
check for input overlapping too
[doc] clarify gradcheck behavior when input is overlapping
longer note
* fix a deprecation warning in test_autograd
* nits
* Created DefaultTensorOptions
* Fix TensorOptions() call which was interpreted as function decl
* Fix empty OptionsGuard
* Make options_ and mutex_ in DefaultTensorOptions class static because of dynamic linker issues
* Make DefaultOptions thread local
* Spectral norm improvements
- Don't do iterations on weight in eval mode
To facilitate this, register weight as buffer in order to be able
to use a module with spectral norm in eval mode immediately
after loading a state dict (#8208)
- Use weight instead of weight_orig as weight when removing
spectral norm
- Add dim parameter in case the normalization should occur w.r.t.
a dimension other than 0 (#7865)
* add and update spectral norm tests
* More spectral norm tests
Thank you, Simon, for the suggestions.
* Port THCS to ATen.
General structure of the sparse implementation:
- SparseCUDATensor.{cpp, cu} and SparseCUDATensorMath.cu contain
the same functions as their CPU analogues
- SparseCUDAApplyUtils.cuh contains what used to be in
THCSTensor.cu
- SparseCUDABlas.cu contains what used to be THCSparse.cu
Unrelated improvements:
- Forward declared CUDA types in Context.h are now moved
exclusively to CUDAHooks
- New getCurrentCUDASparseHandle in Context
- Support for printing CUSPARSE_STATUS_ZERO_PIVOT error message
directly
Some unusual pieces:
- get_device got the LegacyBridge makeover, as it needs special
logic on sparse tensors (defer to the inner tensors).
- I noticed that I need to turn off device_guard codegen
for many functions in sparse, noticed because get_device
became a native function, and resulted in an infinite recursion. This was
done by adding device_guard: False to the native definitions. An alternative
strategy might be to make the heuristic for deciding when to put in a device
guard more clever.
Scaffolding removal:
- LegacyBridge now special-cases only on sparse versus dense;
no more CUDA test (hooray!)
- Native bindings get CUDA/SparseCUDA dispatch entries.
CPU sparse refactoring:
- New SparseUtils.h header, with all of the utility functions that
used to live in SparseTensor.cpp
- new_with_tensor_sparse now correctly handles both CPU and CUDA
- transpose functions in sparse/ turned out to be dead, so I killed them
Bugs I noticed while working on this:
- I used accessor<...>() on a CUDA tensor, because I thought it does
the CUDA-CPU sync. It does not.
Last mile changes:
- I killed all of the THS/THCS directories, build scripts, bindings everything.
It is now no more!
- A bunch of trampolines in LegacyBridge are no more; anything
that was "sparse only" is now done natively.
- `sparse_coo_tensor` is implemented a little funny, but we think
it's a good idea.
- HIP is handled by explicitly ifdef'ing out all kernels; we'll add support
for this at some later point in time.
- TH_INDEX_BASE is now unconditionally set to 0.
- Some uses of x.type() now replaced with x.options(), the new way of doing it.
- More notes about checked_cast_tensor, and eliminate Storage/Tensor fields in
the code gen env when they are dead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* cache cufft plans
* use an LRU cache
* suffix CuFFTParams members with _
* import print_function for py2
* lint
* fix potential race; add dummy impl for CPU only builds
* cpp formatting; remove nccl makefile change
* Use CUDA hooks instead
* comments and doc
* update the error message
* move LRU cachae to a separate file and native::detail namespace
* update comment
* specify NOTE location in CuFFTPlanCache.h
* update disabled_features.yaml to make amd ci work
* another fix for AMD CI in disabled_features.yaml
* Wrap cufft_plan_cache_* methods in __HIP_PLATFORM_HCC__
* improve the notes
* lint
* revert onnx change
* put back inlining for CUFFT_CHECK
* Resolve conflicting name, ContextManager
Concept name `Context Manager` is taken by Python. See https://docs.python.org/3.6/reference/datamodel.html#with-statement-context-managers
It says,
A context manager is an object that defines the runtime context to be established when executing a with statement. The context manager handles the entry into, and the exit from, the desired runtime context for the execution of the block of code.
The `ContextManager` here is more like a registry.
And there is a C++ registry in caffe2 codebase `caffe2/caffe2/core/registry.h`.
There is also a Caffe2DBRegistry, declared by calling `CAFFE_DECLARE_REGISTRY(Caffe2DBRegistry, DB, const string&, Mode);` in `caffe2/caffe2/core/db.h`.
I think we can follow the concept name `Registry`, calling it `ContextRegistry`.
* Make Classes and Functions internal to this module start with "_"
Make Classes and Functions internal to this module start with "_"
* Update context.py
* Update context.py
* adds fp16 support to the jit
* improves formatting
* improves formatting
* added an explanatory comment
* fixes Python2 flake8
* updates c code
* all except halfs
The goal is to be able to use at::Half throughout ATen, including in
CUDA kernels and have it operate like built-in types. This avoids the
need for cuda::from_type and cuda::to_type before every
AT_DISPATCH_ALL_TYPES_AND_HALF call.
Addresses #8177
A design doc can be found here: [gist](https://gist.github.com/zou3519/4b7f13f03cc9f3612bd9363e6405fa0a) version or [quip](https://fb.quip.com/azL1AqUckBdo) version
General approach:
- Add NumberType, FloatType, IntType to represent Python numbers, floats and ints.
- Emit these types for python literals
- Change aten_schema such that Scalars are NumberType, int64_t and bool are IntType.
- Emit aten::type_as, prim::NumToTensor, and prim::TensorToNum nodes for tensor-number math. (see examples below)
- Erase NumberType, prim::NumToTensor, and prim::TensorToNum for ONNX export
### Tensor/number math
```
import torch
@torch.jit.script
def fn(x):
return x + 1
```
```
graph(%x : Dynamic) {
%1 : int = prim::Constant[value={1}]()
%2 : Dynamic = prim::NumToTensor(%1)
%3 : Dynamic = aten::type_as(%2, %x)
%4 : Dynamic = aten::add[alpha={1}](%x, %3)
return (%4);
}
```
### Number/Number Math
```
import torch
@torch.jit.script
def fn(zero):
c = 1 + 1
return zero + c
```
```
graph(%zero : Dynamic) {
%1 : int = prim::Constant[value={1}]()
%2 : int = prim::Constant[value={1}]()
%3 : Dynamic = prim::num_to_tensor(%1)
%4 : Dynamic = prim::num_to_tensor(%2)
%5 : Dynamic = aten::add[alpha={1}](%3, %4)
%c : int = prim::TensorToNum(%5) # this is the result of the addition
...
return (%13);
}
```
List of squashed commits:
* Introduce Python Number types
Added: IntType, FloatType, NumberType with
IntType <: NumberType
FloatType <: NumberType
Changed aten_schema so arguments have corresponding types
* Emit a NumberType for python literals.
Also emit a NumberType for Scalar default values.
* Add prim::NumToTensor and prim::TensorToNum
* Add DynamicType -> NumberType implicit cast for bc
* Better ensureTensor error message
* Add ensureTensorOrNumber. Allow passing Number to some functions
Like the range() construct and slices
* Patch IntList to work.
IntList is still a DynamicType in the frontend: a tensor gets built from
a List[int].
Also, IntList[1] is a "union between int and IntList" the way it is
implemented. If the frontend sees an int being passed for an IntList[1]
arg, it converts it to a tensor as well.
* Enforce some order on schemas to avoid overload ambiguity
add(Tensor, Tensor) should appear earlier than add(Tensor, Scalar). This
matches the order in which python_arg_parser parses its arguments.
* Disable std_dim and var_dim tests.
With the new schema information, std(input, keepdim) and std(input, dim)
are ambiguous. This will need to be fixed at a later date.
* Add NumberType erasure pass.
This is used for ONNX export and to ensure that NumberType information
doesn't reach the interpreter
* Add support for mixed tensor/number math ops.
* Tests for new functionality.
Includes:
- Tensor/number math
- number/number math
- EraseNumberTypes pass test
* Patch tests
Update expect tests for:
- decompose_addmm
- loop unrolling tests
Because python numbers are now NumberType, they cannot be returned by
functions anymore. Work around this by using "torch.full", or by adding
a tensor([0]) (taken from FIXME_zerol()). Both approaches are used
because torch.full is more readable, but it is broken in some cases.
* Add erase_number_types to torch/CMakeLists.txt
* Move math back to emitSimpleExpr from emitSugaredExpr
* Remove some dead lines
* Renable some excluded script/trace tests that are fixed.
* Move some tests to expected failure
* Address some comments (more addressing to come)
* Erase relevant aten::type_as nodes in EraseNumberTypes
I also changed it so that EraseNumberTypes is only called for ONNX
export. It is no longer used to prevent
prim::NumToTensor/prim::TensorToNum from reaching shape_analysis or
interpreter.cpp.
shape_analysis infers the type of the output of these nodes to be the
same as their input.
intepreter.cpp treats both of these nodes as no-ops.
* Add reminder to fix std/var
* Call EraseNumberTypes only when exporting a script module
* Update expects after rebase
Buck doesn't support passing arguments to Python unit tests, and we have to use environment variables to pass the sharding options instead. Also, buck test doesn't go through the __name__ == '__main__' code path and we need to move the env var checking logic to top-level.
* Use env var to pass sharing options to test_nn.py
* Move env var checking to top-level
* fix lint
* Support n-dimensional empty tensors in (most of) THNN.
Most of the argument checking in THNN is directly around dimensionality, which doesn't work in general for n-dimensional empty tensors, because
you will end up dividing by 0 or similar. Instead, we change these to check for empty and give error messages for those cases as well.
In some cases, the error messages are improved as well.
* Fix bug.
* enable captured inputs for if Stmt to fix the carried deps bug in nested
blocks
* postpone captured inputs deletion and add new test case
* recursively generate captured values for nested loops
* check asSimple when recursively create captured input
* Some 0-sized dimension support, port catArray away from resizeLegacy.
The goal of this PR is to port catArray away from resizeLegacy (so we can delete the legacy resize calls), but since catArray has some weird behavior because
we don't have arbitrary 0-sized dimension support, I made some effort to fix these both in one pass.
The major changes here are:
1) catArray uses the new resize API, no longer the old resizeLegacy API.
2) As 1) is the last usage of resizeLegacy, it is deleted.
3) If compiled with USE_TH_SIZE_ZERO_DIM, catArray will work and properly check shapes for n-dimensional empty tensors.
4) However, we retain the old behavior of "ignoring" size [0] tensors in catArray. We previously allowed this because we didn't have n-dimensional empty tensors.
5) To get the above to work, we also add support for n-dimensional empty tensors for narrow and slice (ifdef USE_TH_SIZE_ZERO_DIM).
6) We change the stride formula for empty tensors to match NumPy; basically, we never multiply by 0 as the size, always at least 1, so the
strides are monotonically increasing in the empty tensor case.
7) We print the size of empty tensors if size != [0]; this matches NumPy behavior (even in cases where the size could be inferred from the brackets).
8) For test purposes, we add torch._C._use_zero_size_dim() to add tests for the above.
* Fix flake8.
* Address review comments.
* Solves #8659
This PR adds a warning to alert users about the possibility of a failure in the gradcheck
* Fix lint
* Update gradcheck.py
* Update gradcheck.py
* update error message
* Update warning message to be more descriptive
This surfaces the options struct that can be passed to the
ProcessGroupGloo constructor to Python. By default, if no options struct
is passed at construction time, the Python bindings default to using a
struct with a TCP backed Gloo device that uses the machine's hostname to
resolve the IP address to bind to.
Currently, THTensor_(nDimension) goes to _dim(), which makes it difficult to move individual usages over to the new API.
Instead, let's create a THTensor_(_nDimension) going to _dim() and THTensor_(nDimension) going to dim(). To do this, we will redirect all current
calls and move them over as we did for _dim() and dim().
* Setup wrappers to get vectorized version of mean
* Responding to review 1
* Responding to review 2
* Use variadic AT_CHECK
* Fix AT_CHECKS in ReduceOps
* Fix broadcast copying device[0] tensor when not using NCCL; Avoids potential extra copy in flatten_dense_tensors
* use toType
* revert dense_flat changes
* address comments
* [c10d] NCCL python binding and CI test, with bug fixes
* Addressed comments and further bug fix
* Made NCCL build optional, made C10D libc10d.a only
* Fixed tests so that NCCL pg won't run when not neeeded
* Addressed comments
* Port all indirect calls of resizeNdLegacy to resizeNd.
* Handle 1-d to 1-d resize.
* Maintain behavior of tensor.set_().
* Fix lack of initializer_list in C :).
* Return full dimensionality from newSizeOf.
* Created TORCH_MODULE macro
Rewrote Linear
Rewrote Dropout and added default constructor to TORCH_MODULE macro
Turned TORCH_MODULE contens into a proper base class
Added some documentation
Got rid of the old Dropout module
Got rid of the old Embedding module
Got rid of the old BatchNorm module
Got rid of the old Conv module
Fixing optimizers
Rebase
Removed old RNN modules and the TORCH_ATTR macro
Removed temporary P:: namespace
Added cloning behavior to all modules
Got rid of some get() calls
self review nits
Remove noexcept from ModuleHolder methods that can throw
Remove spaces
Add missing override to reset() methods
Added examples to documentation in pimpl.h
* Post rebase fixes
catArray is more complicated because it requires real 0-size dimension support. The other changes are safe in that the functions are never called (and are now deleted), or
they are used on a result of THTensor_(newSizeOf), which has a valid size.
test_rnn_args_check generates mismatched input_shape and hidden_shape
args. To do this, it changes a dimension of input_shape or hidden_shape
to have an incorrect size.
Before, the test was changing the size of a dimension to -1. However,
this is flawed because an input of size (6, -1, 2), for example, is invalid on its own.
This PR fixes it so that the test changes sizes of dimensions to
`bad_size = 7`. As long as none of the other sizes (input_size,
hidden_size, num_layers, batch_size) divide this, we don't have to worry
about that dimension being accidentally broadcasted into working.
Unlike resizeLegacy / resizeNdLegacy, these don't call deprecated methods (e.g. _dim) and don't map between logical sizes (i.e. nDimension == 0 -> size [0]).
What you ask for is what you get.
The full 0-sized dimension support is hidden behind an ifdef, because it's not fully supported yet.
* Created TensorOptions
Storing the type in TensorOptions to solve the Variable problem
Created convenience creation functions for TensorOptions and added tests
Converted zeros to TensorOptions
Converted rand to TensorOptions
Fix codegen for TensorOptions and multiple arguments
Put TensorOptions convenience functions into torch namespace too
All factory functions except *_like support TensorOptions
Integrated with recent JIT changes
Support *_like functions
Fix in place modification
Some cleanups and fixes
Support sparse_coo_tensor
Fix bug in Type.cpp
Fix .empty calls in C++ API
Fix bug in Type.cpp
Trying to fix device placement
Make AutoGPU CPU compatible
Remove some auto_gpu.h uses
Fixing some headers
Fix some remaining CUDA/AutoGPU issues
Fix some AutoGPU uses
Fixes to dispatch_tensor_conversion
Reset version of new variables to zero
Implemented parsing device strings
Random fixes to tests
Self review cleanups
flake8
Undo changes to variable.{h,cpp} because they fail on gcc7.2
Add [cuda] tag to tensor_options_cuda.cpp
Move AutoGPU::set_index_from into .cpp file because Windows is stupid and sucks
Fix linker error in AutoGPU.cpp
Fix bad merge conflict in native_functions.yaml
Fixed caffe2/contrib/aten
Fix new window functions added to TensorFactories.cpp
* Removed torch::TensorOptions
Added code to generate wrapper functions for factory methods
Add implicit constructor from Backend to TensorOptions
Remove Var() from C++ API and use torch:: functions
Use torch:: functions more subtly in C++ API
Make AutoGPU::set_device more exception safe
Check status directly in DynamicCUDAHooksInterface
Rename AutoGPU to DeviceGuard
Removed set_requires_grad from python_variables.h and warn appropriately in Variable::set_requires_grad
remove python_default_init: self.type()
Add back original factory functions, but with deprecation warnings
Disable DeviceGuard for a couple functions in ATen
Remove print statement
Fix DeviceGuard construction from undefined tensor
Fixing CUDA device compiler issues
Moved as many methods as possible into header files
Dont generate python functions for deprecated factories
Remove merge conflict artefact
Fix tensor_options_cuda.cpp
Fix set_requires_grad not being checked
Fix tensor_new.h
TEMPORARILY put some methods in .cpp files to see if it solves issues on windows and mac
Fix bug in DeviceGuard.h
Missing includes
TEMPORARILY moving a few more methods into .cpp to see if it fixes windows
Fixing linker errors
* Fix up SummaryOps to use new factories
Undo device agnostic behavior of DeviceGuard
Use -1 instead of optional for default device index
Also move DeviceGuard methods into header
Fixes around device index after optional -> int32_t switch
Fix use of DeviceGuard in new_with_tensor_copy
Fix tensor_options.cpp
* Fix Type::copy(
* Remove test_non_float_params from ONNX tests
* Set requires_grad=False in ONNX tests that use ints
* Put layout/dtype/device on Tensor
* Post merge fixes
* Change behavior of DeviceGuard to match AutoGPU
* Fix C++ API integration tests
* Fix flip functions
* Spelling fix in MultivariateNormal docstring (#7915)
* [c10d] MPI Process Group Implementation (#7783)
This provides a bare-minimum MPI Process Group implementation, the commit is on top of @pietern's Gloo Process Group PR.
* [c10d] MPI Process Group Implementation
ref: https://github.com/pytorch/pytorch/issues/7434
* Better exception, atexit func, and addressed comments
* Clang formatting changes
* Static initialization and addressed comments
* Added constness back
* Test will now launch mpi processes if found
* CMakeList Changed
* Fix Windows doc for import error (#7704)
* Fix Windows doc for import error
* Fix doc again
* Fix wrong format
* Moved condition for dilated grouped convolutions to CUDNN convolution implementation (#7465)
* Updates to caffe2 operator documentation (#7917)
* Significant updates to the operator docs in prep for merge
* [auto] Update onnx to 307995b - Update from upstream (onnx/onnx#1038)
307995b143
* Test if ASAN is actually working as part of ASAN tests. (#6050)
* Test if ASAN is actually working as part of ASAN tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Drop explicit use of libstdc++, we should not care.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Build with DEBUG=1
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Increase main thread stack size when using ASAN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Split up detail.h (#7836)
* Fix THCUNN SpatialDepthwiseConvolution assuming contiguity (#7952)
* Fix fbcode compatibility (#7939)
* add test for correctness of transpose fusion (#7950)
* [JIT][script] Fix emitted gather and slice for dynamic indices (#7861)
* [JIT][script] Fix emitted gather for dynamic indices
* Also fix slice
* Address comments
* cache and use BLAS_SET_BY_USER so that it doesn't set itself to TRUE when run second time (#7942)
* Add unsafe flag to skip checking in prepare (#7832)
* Add unsafe flag to skip checking in prepare
* pop
* Rename cuda::type to cuda::into_type and provide cuda::from_type. (#7937)
These are used to convert Half -> half and half -> Half respectively.
from_type will be used for runtime type checking in THC.
* Try to fix TORCH_CUDA_ARCH_LIST for PyTorch again (#7936)
* try again
* use DEFINED
* use a loop
* Minor fixes
* remove sort requirement from pad-sequence (#7928)
* pad-sequence no longer requires sorting entries
pad_sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.
* remove sort requirement from pad-sequence
Picks up from #5974.
Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test
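A small sketch of the relaxed behavior described above; the sequence lengths are arbitrary and the sequences are deliberately unsorted.
```python
import torch
from torch.nn.utils.rnn import pad_sequence

# sequences no longer need to be sorted by length before padding
seqs = [torch.ones(2), torch.ones(5), torch.ones(3)]
padded = pad_sequence(seqs, batch_first=True, padding_value=0.0)
print(padded.shape)  # torch.Size([3, 5]); max_len comes from the longest sequence
```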
* Fix checkBackend error message (#7926)
* Fix checkBackend error message
Fixes #7849
* Switch order of printing args
* Split CI tests in half and run them in parallel (#7867)
* Split and run tests in parallel
* Refactor tests
* Handling of scalars in torch.Size (#5676)
* Handling of scalars in torch.Size
torch.Size() constructor uses python_arg_parser
IntList in python_arg_parser can take iter/range
Have IntList take python iterables and ranges.
Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew
Address comments
Address comments
* Rebased
* Address nit
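To illustrate the change in this commit, a small sketch of constructing torch.Size from an iterable; the particular range is just an example.
```python
import torch

# torch.Size now accepts general iterables such as ranges
s = torch.Size(range(2, 5))
print(s)                    # torch.Size([2, 3, 4])
print(torch.ones(s).shape)  # torch.Size([2, 3, 4])
```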
* [JIT] Fission and fusion passes for addmm (#7938)
* Addmm decomposition pass
* Addmm peephole pass
* Fix handling of output shape in fusion pass
* Add DCE to the peephole passes
* add comments
* maybe bugfix?
* Fix GPU tests
* fix py2/3 test issue
* Set smaller grain size for some cases (#7941)
* Fix returning scalar input in Python autograd function (#7934)
* fix _wrap_outputs not working with scalar inputs
* add a test
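As a rough illustration of the case this fix covers, the sketch below applies a custom autograd Function to a 0-dimensional (scalar) input; the Function itself (a simple doubling op) is hypothetical and only stands in for any user-defined Function.
```python
import torch

class Double(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return 2 * x

    @staticmethod
    def backward(ctx, grad_output):
        return 2 * grad_output

# a scalar (0-dimensional) tensor now round-trips through the Function machinery
x = torch.tensor(3.0, requires_grad=True)
y = Double.apply(x)
y.backward()
print(x.grad)  # tensor(2.)
```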
* Prevent git autocrlf for bash scripts (#7949)
* Delete unused file (#7919)
* Fix typo in autodiff formula for addmm (#7932)
* 1) use meshgrid for the flip() CPU implementation, so only one copy of the input tensor is needed; 2) changed the kernel of the CUDA implementation so the indices tensor no longer needs to be materialized; 3) reuse error-checking code
* [caffe2] YellowFin parameter update GPU code fix. (#6993)
* [Caffe2] Keep name of caffe2_pybind11_state and caffe2_pybind11_state_gpu in debug build (#7155)
* Allowing MatMul to create a gradient even with 3 inputs. Useful if you are differentiating a graph twice (#6536)
* added const for local variables
* Fix the cpp libtorch CUDA build (#7975)
* Use mingfeima's mkldnn (#7977)
* Fix the import part of the windows doc (#7979)
* Change perf test folder after git checkout (#7980)
* Move the broadcast check in MKL Add/Sum to runtime (#7978)
* Use Glog's implementation of STL logging when possible. (#7206)
Inject custom workaround into namespace std so that it can be found by ADL.
* [Hotfix] Bring back warnings and -Werror to ATen (#7866)
* Bring back warnings and -Werror to ATen
* Unbreak...
* Fix tbb errors
* Enable ONNX backend Mean tests (#7985)
* Add third way to determine IS_CONDA (#7971)
* Fix EmbeddingBag max_norm option (#7959)
* fix EmbeddingBag max_norm option
* flake8
* add warning to the embedding bag arg change
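For context, a minimal sketch of the option being fixed; the dimensions, mode, and offsets are made up for illustration.
```python
import torch
import torch.nn as nn

# embeddings whose norm exceeds max_norm are renormalized before the bag reduction
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, max_norm=1.0, mode='sum')

indices = torch.tensor([1, 2, 4, 5, 4, 3, 2, 9])
offsets = torch.tensor([0, 4])      # two bags of four indices each
print(bag(indices, offsets).shape)  # torch.Size([2, 4])
```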
* Raise error when torch.load a storage on a non-existing device (#7921)
* Raise error when torch.load a storage on a non-existing device
Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:
```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
223 if self.idx is -1:
224 return
--> 225 self.prev_idx = torch._C._cuda_getDevice()
226 if self.prev_idx != self.idx:
227 torch._C._cuda_setDevice(self.idx)
AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```
This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device and suggests the user to use
torch.load's map_location feature.
* Address comments
* missing dep
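A short sketch of the map_location workflow the new error message points users to; the file path is a placeholder.
```python
import torch

path = '/tmp/ckpt.pt'  # placeholder path for illustration
torch.save(torch.randn(3), path)

# map_location remaps storages at load time; strings, devices, dicts,
# and (storage, location) -> storage functions are all accepted
t = torch.load(path, map_location='cpu')
t = torch.load(path, map_location=lambda storage, loc: storage)
```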
* Make THStorage / THCStorage have void* data ptr. (#7964)
* Make THStorage / THCStorage have void* data ptr.
This is the initial step in unifying the ATen and TH tensor representations; the next step is to generate only a single THStorage / THCStorage type.
The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().
* Add include.
* Attempt to fix clang build issues.
* Clarify comment and remove extra character.
* Rename unsafeData -> unsafe_data.
* Remove unnecessary 'to' function to get compile time rather than link time errors.
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio. (#6834)
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.
* Add support of all default cmake build types for release to cuda.
* Remove python bindings for `torch.slice` (#7924)
* skip python bindings for slice
* remove tests
* convert slice test to indexing
* Build ONNX for PyTorch version of libcaffe2 (#7967)
* support loading gzip (#6490)
* support loading gzip
* address comments
* address comments
* fix lint
* fix test for python2
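Assuming this change is about torch.load accepting gzip file objects, a minimal round-trip sketch (the path is a placeholder).
```python
import gzip
import torch

path = '/tmp/tensor.pt.gz'  # placeholder path for illustration
x = torch.randn(3, 3)

# torch.save/torch.load take file-like objects, so a gzip stream can be passed directly
with gzip.open(path, 'wb') as f:
    torch.save(x, f)

with gzip.open(path, 'rb') as f:
    y = torch.load(f)
```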
* Add memory leak check in CUDA tests (#7270)
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
* Revert "Set smaller grain size for some cases" (#7988)
* Entry for c10d in CODEOWNERS (#8001)
* Fix a couple of typos (#7998)
* Fix typo
* Fix typo
* Fix typo
* Fix typo
* Add on-stack observer cache for Observable (#7931)
observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
speed-up for start and stop observer calls.
* Reduce grain size for Unary operations (#8003)
* [auto] Update onnx to 8ec0e5f - Add index check for Transpose's type inference function (onnx/onnx#1053)
8ec0e5fe9b
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace. (#7935)
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.
This requires renaming the _cast functions which used the unqualified names.
* Separate onnx mapping of scalar type from cast name.
* Fix flake8.
* Properly cast onnx.
* Remove WITH_ROCM cmake flag/variable (use USE_ROCM solely) (#8013)
* Mention the pytorch-ci-hud on the README. (#8004)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Re-enable build env check (#7969)
* Re-enable build env check
* Fix linux test error
* Try to fix macOS test error
* Update nn.rst (#8029)
* Example for Transformed Distribution (#8011)
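For readers unfamiliar with the API targeted above, a minimal sketch of torch.distributions.TransformedDistribution (building a log-normal from a Normal base; the parameters are arbitrary).
```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

# push a Normal through exp() to obtain a log-normal distribution
base = Normal(torch.zeros(1), torch.ones(1))
log_normal = TransformedDistribution(base, [ExpTransform()])

sample = log_normal.sample()
print(log_normal.log_prob(sample))
```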
* [auto] Update onnx to 33e9cd4 - Remove the usage of default value to fix invalid proto3 files. (onnx/onnx#1052)
33e9cd4182
* [auto] Update onnx to 1504a33 - Convert schema assert for duplicate type names to exception (onnx/onnx#1057)
1504a33abb
* Support CUDA tensors in ProcessGroupGloo (#7694)
This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we have split like ATen where we have
different artifacts for different backends so you can decide at runtime
what to use.
* [auto] Update onnx to 3fb9656 - Fix for fbcode CI (onnx/onnx#1062)
3fb965666e
* propagate nan in some activations (#8033)
* propagate nan in some activations
* fix py2 not having math.nan
* flake8
* Fix profiler crash when no events register (#8034)
* Fix profiler crash when no events register
When trying to profile, attempting to print the event table throws a vague error because the event list is empty:
....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence
This change fixes the error by returning an empty string.
* Update profiler.py
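A quick sketch of the code path being fixed; printing the profiler table is what previously raised the ValueError when no events had been recorded.
```python
import torch
from torch.autograd import profiler

x = torch.randn(100, 100)
with profiler.profile() as prof:
    y = x @ x

# with the fix, an empty event list prints an empty string instead of raising
print(prof.key_averages().table())
```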
* Allow CI testing with different AVX configs (#8020)
* Allow CI testing with different AVX configs
* Unset ATEN_DISABLE_AVX and ATEN_DISABLE_AVX2 in default config
* Support for generating ATen during the fbcode build, rather than committing the generated files (#8002)
Paint the internal bikeshed a slightly different color to appease Buck tooling.
* Factor python dependency out of interpreter (#7970)
* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there is no python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
* Fixing AlexNet test which is skipped in CI
* [auto] Update onnx to 760c928 - add missing hasNInputShapes check for bidirectionalBroadcastShapeInference (onnx/onnx#1060)
760c9283d0
* Support modules that output scalar in Gather (and data parallel) (#7973)
* Support modules that output scalar in Gather (and data parallel)
* Improve warning msg
* [auto] Update onnx to 9e7855d - Remove PyTorch generated Upsample tests cases (onnx/onnx#1064)
9e7855dcd4
* [script] Add support for torch.zeros, torch.ones, etc. (#7799)
* [script] Add support for torch.zeros, torch.ones, etc.
* modifies gen_jit_dispatch to create bindings for functions that do
not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
dtype specification
* extends the list of valid compiler constants to include device, layout,
and dtype.
* allows functions with Generators, but only using the default generator
Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
no checks that it is actually used only in a dtype specification.
This is similar to how we handle Python numbers, creating some situations
where the script is more permissive. Fixing this requires much more
significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
since we do not support string literals in general.
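A rough sketch of the kind of code this enables inside a scripted function; the syntax shown follows current TorchScript, and the exact form accepted in that release may differ.
```python
import torch

@torch.jit.script
def add_ones(x):
    # a factory call with an explicit dtype attribute inside script
    return x + torch.ones(x.size(), dtype=torch.float)

print(add_ones(torch.randn(2, 2)))
```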
* Add profiling annotations to NeuralNet[Operator|Data] (#8005)
* Update from facebook 1ee4edd286a3 (#8040)
* Adding instance weight to batch distill loss
as title
* add bfloat 16-31
added bfloat 16-31 and their respective unit tests
* [CUDA9] Upgrade - fbcode
CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching the fbcode TARGETS file (adding the nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").
This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)
* Share intermediate int32 buffer across Conv ops
Adding a known type
* [C2 fix] infer function for ensure_cpu_output_op
this is adding the missing device function for ensure_cpu_output_op
* [int8] Add blob serializer/deserializer for Int8TensorCPU
To export to logfiledb
* [nomnigraph] Add try catch block to optimization passes in predictor
This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE
CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If CAFFE_ENFORCE is used at static initialization time,
this is a SIOF. Recently CAFFE_ENFORCE was added into init
function registration, so we started to see this.
A Meyers singleton is going to provide safety here. If the stack trace
fetcher has not been registered yet, it will just use a dummy one.
* NUMA support in SparseNN CPU benchmark
Adding support for NUMA in SparseNN CPU benchmark
* [mobile-roofline] Add logging needed for roofline model
This should be all that's needed
* Let the operators use the same input if the operators are not chained
or else, we have to change the input data dims
* fix null-pointer-use UBSAN errors in in reshape_op.h
* revert previous fix on input blob name
as title
* Adding flag to let MineHardNegative automatically extract single value from dict
Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state
* Skip CUDA memory leak test on BN tests on windows (#8043)
* workaround for Sequential when one cannot retrieve python source (#8048)
* [auto] Update onnx to 0dbec2a - - Generate protoc type hints on Windows (onnx/onnx#1047)
0dbec2a047
* [auto] Update onnx to 4f8ef17 - Remove erroneous documentation around maps and sequences. (onnx/onnx#1069)
4f8ef17ad3
* [auto] Update onnx to e6a500e - Extract constant to initializer (onnx/onnx#1050)
e6a500e54c
* [auto] Update onnx to 033f956 - make gcc happy (onnx/onnx#1061)
033f956f41
* Remove NO_PYTHON macros from Exceptions.h/cpp (#8007)
Removes cases where NO_PYTHON was unnecessary in Exception.h/cpp
* [ready] Clean up torch.distributions (#8046)
* Have a single THStorage and THCStorage type. (#8030)
No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type
* Reduce usages of TensorUtils<T>::DataType in THC. (#8056)
TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.
Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly accessing the member),
but the DataType specific functions are more problematic.
So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types. To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
the corresponding template parameter. We will need to get rid of these static_asserts in the future, but this is useful for now.
* Support to run ONNX Upsample operator (mode=nearest) in Caffe2 (#8037)
* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2
* adding error checks to upsample
* adding error checks to upsample
* adding error checks to upsample
* changing to np.isclose
* Revert onnx submodule update
* still fixing
* [auto] Update onnx to eb12f72 - Add conv transpose test cases (onnx/onnx#886)
eb12f72a86
* [auto] Update onnx to bd98abb - Add a hook for doing post-processing on protobuf generated header files (onnx/onnx#1068)
bd98abbba0
* Skip ConvTraspose ONNX backend tests (#8074)
* Post process onnx proto (#8064)
* Post processing onnx generated protobuf files to hide global symbols
* .
* .
* Add code for TensorBoard visualization of JIT GraphExecutors (#8050)
* [auto] Update onnx to cc26486 - bump version to 7 for prelu. (onnx/onnx#1063)
cc26486541
* [auto] Update onnx to 356208d - add input tensor dimension checks to shape inference (onnx/onnx#1070)
356208d756
* Move backtrace to its own header (#8096)
* Move backtrace to its own header
* Move cxxabi.h into Backtrace.cpp
* Fix and ignore some warnings (#8081)
* Do an additional sanity check that nvcc and CUDA include dir agree. (#8094)
If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc. See #8092 for a concrete
case where this can occur. Explicitly detect this situation and
give a good error message in this case!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* use regex in kwarg parser (#8061)
* Removing remaining NO_PYTHON ifdefs (#8067)
* Remove NO_PYTHON in tracing
* Remove NO_PYTHON in ir.h
* Remove NO_PYTHON in test_jit.cpp
* Replace std::size_t with size_t (#8093)
* Remove out-of-date comment (#8114)
* [Caffe2] Enabling AMD GPU Backend for Caffe2 (#7955)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, has been fixed in the lastest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
* Resolve merge conflicts
* .
* Update GetAsyncNetHIPThreadPool
* Enable BUILD_CAFFE2 in pytorch build
* Unify USE_HIP and USE_ROCM
* always check USE_ROCM
* .
* remove unrelated change
* move all core hip files to separate subdirectory
* .
* .
* recurse glob core directory
* .
* correct include
* .
* Detect CUDNN related environment variables in cmake (#8082)
* Implement adaptive softmax (#5287)
* Implement adaptive softmax
* fix test for python 2
* add return_logprob flag
* add a test for cross-entropy path
* address review comments
* Fix docs
* pytorch 0.4 fixes
* address review comments
* don't use no_grad when computing log-probs
* add predict method
* add test for predict
* change methods order
* get rid of hardcoded int values
* Add an optional bias term to the head of AdaptiveSoftmax
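Assuming the adaptive softmax above landed as nn.AdaptiveLogSoftmaxWithLoss, a small usage sketch with made-up sizes and cutoffs.
```python
import torch
import torch.nn as nn

# 64-dim hidden states, 1000 classes split into frequency buckets at the cutoffs
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=1000, cutoffs=[100, 500])

hidden = torch.randn(32, 64)
target = torch.randint(0, 1000, (32,))

out, loss = asm(hidden, target)  # per-sample outputs and the mean loss
preds = asm.predict(hidden)      # argmax over the full distribution
logp = asm.log_prob(hidden)      # full (32, 1000) log-probabilities
```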
* Make libshm also test if rt requires pthread. (#8112)
In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread. This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.
Fixes #8110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [auto] Update onnx to 2d5ce4a - Remove empty model (onnx/onnx#1058)
2d5ce4aeb6
* Add missing pragma once. (#8118)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [auto] Update onnx to 2a87616 - Tests for LRN operator (onnx/onnx#903)
2a876162ac
* Split SparseTensorImpl off from TensorImpl. (#7990)
* Split SparseTensorImpl off from TensorImpl.
At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update SparseTensorImpl.h
* [Caffe2] Support non peer access in muji and fix bug when reduced_affix is empty (#6896)
* [Caffe2] Support non peer access in muji
* [Caffe2] Add test for 4 gpus and 2 groups
* [Caffe2] Add comments
* Fix bug when reduced_affix is empty
* Fix typo and add comments about cpu and amd gpu
* Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)
* Replace most remaining usages of TensorUtils<T>::DataType. (#8124)
As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces the usages of with a templatized parameter and static_asserts that the new and old are equal.
After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.
* Add utf-8 header to Python file with Unicode. (#8131)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add back lrn test (#8134)
* Revert "Skip OnnxBackendNodeModelTest::test_lrn_default_cuda that causes segfault (#8127)"
This reverts commit 410191c4175eaae141306cdb3c3c1c1e8a495225.
* Fix mismatched default values
* Add non_blocking to Tensor/Module.to (#7312)
* Add non_blocking to Tensor/Module.to
* flake8
* Add argparse tests
* cpp parse
* Use C++ parser
* use a commong parse function with Tensor.to
* fix test_jit
* use THPObjectPtr
* increase refcount for None, True, and False
* address comments
* address comments
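A minimal sketch of the new keyword; pinned source memory is what lets the copy overlap with other work.
```python
import torch

x = torch.randn(64, 128).pin_memory()

if torch.cuda.is_available():
    # asynchronous host-to-device copy; requires a pinned-memory source
    y = x.to('cuda', non_blocking=True)
```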
* Fix job name checking for AVX tests (#8135)
* Fix a corner case for ReShapeOp (#8142)
In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.
* cpu/ideep context converter (#8139)
* fix type mismatch while call torch._C._cuda_setDevice (#8065)
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch in scatter
* fix type mismatch in scatter
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* docs: Add warning to torch.repeat() (#8116)
* docs: Add warning to torch.repeat()
closes #7993
* docs: Add links for numpy functions
* docs: Break the too long line
* Accelerate bernoulli number generation on CPU (#7171)
* opt bernoulli rng with vsl and openmp
* detect cpu vendor for bernoulli
* retrigger test platform
* check the vendor more severely
* use cpuinfo to check vendor
* docs: add canonical_url and fix redirect link (#8155)
* docs: enable redirect link to work for each specific page
* docs: add canonical_url for search engines
closes #7222
* docs: update redirect link to canonical_url
* docstring support for @script and @script_method (#7898)
* docstring support for @script and @script_method
* make it python2 compatible
* improve according to review
* improve build_stmts
* use filter instead of list comprehension
* improve the way wrap is handled for script_method
* stash the original method instead
* allow dynamic attr for ScriptMethod and GraphExecutor
* a bit comment on build_Expr
* remove _build_wrap
* a bit improve on comments
* rename to __original_methods
* should be _original_methods
* [auto] Update onnx to 968d28d - fix Node::isBefore (onnx/onnx#1075)
968d28d901
* remove some unnecessary cudaGetDevices (#8089)
* remove unnecessary cudaGetDevices
* make curDevice argument non-optional, add explicit checks to current_device
* Fix cuda.framework error on OSX. (#8136)
When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable). Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework. (Is the folder even a framework? I have
no idea).
This commit attempts to fix this in a two pronged fashion:
1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help. So we set these
variables. However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.
2. PyTorch doesn't actually need the CUDA driver API. So we
only add the dep when building Caffe2.
Fixes#8022
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [C++ API] Improve and use OrderedDict for parameters / modules (#7823)
* Improve OrderedDict for C++ API
* Give OrderedDict a subject and fix review comments
* Fix OrderedDict use in torch/csrc/jit/script/init.cpp
* Fix __rshift__ bug (#8161)
* Fix __rshift__ bug
* Add small tests for __lshift__ and __rshift__ in test_cuda
* Add a more elaborate check for __lshift__ and __rshift__
* refactor the test to address @zou3519 's comments
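For reference, a tiny sketch of the operators being tested here.
```python
import torch

x = torch.tensor([1, 2, 4], dtype=torch.int32)
print(x << 1)  # tensor([2, 4, 8], dtype=torch.int32)
print(x >> 1)  # tensor([0, 1, 2], dtype=torch.int32)
```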
* Move non-generic Storage code needed by TensorUtils to non-generic C++. (#8164)
For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.
This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).
* Pinning opencv to < 3.4 in conda builds (#7923)
* Pinning opencv to 3.1.0 in conda builds
* Also pinning numpy to 1.11
* Trying only specifying <3.4
* Adding -setup- path, and better code structure (#8122)
* Abstract parallelization to facilitate using threadpools (#8163)
* [Caffe2] Update elementwise ops to support numpy style broadcast (#8070)
* Update elementwise ops to support numpy style broadcast
* Fix sqrt_op
* Fix compare ops
* Fix gradient test
* Fix optimizer legacy broadcast
* Fix legacy broadcast for elementwise ops
* Skip flaky test
* Fix eigen simple binary op
* Fix attention test
* Fix rnn test
* Fix LSTM test
* Fix tan grad
* Fix schema check
* Export getCudnnHandle (#7726)
* [JIT] Support a single TensorList argument anywhere in the argument list + index_put (#8173)
* [JIT] Support a single TensorList argument anywhere in the argument list
* [JIT] index_put
* use the correct datatype format (#8144)
* Add back onnx console scripts dropped during migration from onnx-caffe2 (#8143)
* Get rid of SOVERSION (again). (#8132)
We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X. This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library). Dropping SOVERSION
makes it impossible to make this mistake.
In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)
Partially fixes #8022.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix a corner case for ReShapeOp (#8178)
In my use case, in the backward propagation pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.
* Better conv error message basing on weight shape (#8051)
* Add retry logic to sccache download for Windows build (#7697)
* Add retry logic to sccache download for Windows build
* fix script bug
* clean up
* fix caffe2 docker build (#7411)
* [ONNX] Fix type_as symbolic (#8183)
* [ONNX] Nuke type_as symbolic
* make it better
* Fix lookup + test
* Yangqing as an ONNX codeowner (#8185)
* Fix protobuf options (#8184)
* protobuf
* fix protobuf_MSVC_STATIC_RUNTIME
* Add a loop unrolling pass to PyTorch JIT (#7672)
* [auto] Update onnx to 4e65fd8 - fuse consecutive squeezes (onnx/onnx#1078)
4e65fd83ba
* [Caffe2] Merging setup.py with setup_caffe2.py (#8129)
* Merging setup.py files; torch works, caffe2 works up to other KP
* Fix to super call for python 2
* Works on python2 on mac
* Consolidating Caffe2 flags
* Fix scalar check for sparse tensors. (#8197)
* Fix scalar check for sparse tensors.
As discovered in #8152
If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.
i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices() # was a sparse tensor, now is dense.
```
* Fix typos
* fix lint
* Add more annotations for arguments in ATen schema (#8192)
* use THCThrustAllocator in BCECriterion (#8188)
* Allow parallel_apply to take in list[Tensor] (#8047)
* Docs for gradcheck and gradgradcheck; expose gradgradcheck (#8166)
* Docs for gradcheck and gradgradcheck; expose gradgradcheck
* address comments
* Implement randperm for CUDA (#7606)
* Implement randperm for CUDA
* Use Thrust to implement randperm
* clean up
* Fix test
* Offload small input scenario to CPU
* Fixed test
* Try to fix Windows error
* Fix Windows error and clean up
* Use fork_rng context manager
* Move test_randperm_cuda to test_cuda
* Add half tensor support
* Fix cuda::type error
* Fix CPU offloading
* Fix issues
* No need to check range for n == 0 case
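A one-line sketch of the new capability, guarded on CUDA availability.
```python
import torch

if torch.cuda.is_available():
    # the permutation is generated on the GPU; small inputs may be offloaded to the CPU
    perm = torch.randperm(10, device='cuda')
    print(perm)
```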
* Update c10d build to link against Caffe2 (#8201)
This follows #7399.
* add wipe_cache option (#8204)
as title
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls. (#8176)
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.
TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.
This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
b) Have the generic versions call the non-generic versions.
c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.
* Add comment about THCTensor struct.
* Error if storage is null in setStorageNd or resizeNd.
* Fix c10d compiler warnings (#8206)
Copy compiler flags from the ones used in setup.py and fix warnings.
This makes the root build that includes c10d headers warning free.
* Bump gloo submodule (#8202)
This includes facebookincubator/gloo#125.
* rm -rf aten/contrib (#8165)
* Remove aten/contrib
* Remove from CMake
* Fix tanh_op on ios build (#8207)
* Fix tanh_op on ios build
* Fix tanh
* [auto] Update onnx to f28e2f1 - fix lrn spec (onnx/onnx#1090)
f28e2f1a60
* [cmake] deprecate caffe2_* specific cuda function in cmake. (#8200)
* deprecate caffe2_* specific cuda function in cmake.
* ENV{} -> $ENV{}
* CUDA_ARCH_NAME -> TORCH_CUDA_ARCH_LIST
* .
* .
* .
* skip CUDA memory leak check on Windows altogether (#8213)
* Record shape and type in autograd to validate gradients (#8168)
The check that the gradient is defined is currently disabled because
TestJit.test_ge_optimized will trigger the error.
* [auto] Update onnx to 18d70ff - Graph should only have one (input) kParam node (onnx/onnx#1088)
18d70ff529
* Set up a c10 source folder (#7822)
* Set up a c10 source folder
* Change the benchmark log format and also log flops (#8215)
as title
* Move helper functions to unnamed namespace. (#8224)
Currently, the helper functions in this file are in the global
namespace. I am guessing the purpose of excluding them was to
keep them local.
* [auto] Update onnx to e96d823 - Update Google benchmark to 1.4.1 (onnx/onnx#1083)
e96d823e5c
* Change new bernoulli implementation to be fully generic. (#8218)
The current implementation depends on THTensor types being unique, which is not guaranteed going forward.
* Structure THTensor like THCTensor is structured. (#8217)
In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).
* move THCP-related utils to cuda/utils.cpp. (#8221)
These files don't follow the usual pattern: in general the files torch/csrc/X and torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure; the torch/csrc/X files should probably be moved to torch/csrc/cpu/X).
utils.cpp combines these so that torch/csrc/utils.cpp has cuda specific code. This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).
* [READY TO MERGE] Use ccache in macOS build (#8009)
* Use ccache in macOS build
* Moving to sccache
* Don't use sccache in test job
* [NEEDS REVIEW] Add nan and inf probability check to multinomial (#7647)
* Add nan and inf probs check to multinomial
* fix bug
* Spawn CUDA test in subprocess
* Make sure invalid input won't pass the test case
* Try to fix error
* Test failure cases in Python 3 only
* Try to fix Windows error
* Move CUDA test to test_cuda.py
* fix issues
* fix module name error
* no need to check for CUDA existence in test_cuda
* Use PY3
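A small sketch of the validation being added; with the check in place, invalid probabilities raise instead of being silently sampled.
```python
import torch

probs = torch.tensor([0.3, float('nan'), 0.7])
try:
    torch.multinomial(probs, 1)
except RuntimeError as e:
    print(e)  # nan/inf probabilities are rejected
```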
* [READY TO MERGE] Enable tests that use DataLoader with multiple workers on Windows (#6745)
* Don't import TEST_CUDA for test_dataloader on Windows
* test_partial_workers is stuck on Windows
* Don't copy unneeded grads when using a function for several derivatives (Fixes#7722) (#7759)
Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.
Thanks to Sylvain Gugger for reporting!
* Fix win mkldnn (#7718)
* Sync build_pytorch_libs.bat with build_pytorch_libs.sh
* fix quoting
* add warnings
* fix warnings
* Add /EHa
* [Caffe2] Add ADD operator for IDEEP (#8220)
* Add ADD operator for IDEEP
* Add broadcast check
* Comments
* Allow optional build and installation of native test binaries (#8225)
* test finetuning
* install off by default
* Turn BUILD_TEST=ON for jenkins.
* Turn on install_test in jenkins as well
* Update MKL exporter to IDEEP ops (#8228)
IDEEP exporter support
* [ideep] Add IDEEP Squeeze op (#8227)
Similar to MKLSqueezeOp at caffe2/mkl/operators/squeeze_op.cc
* [auto] Update onnx to 62e63e9 - Fix build errors inside protobuf-bench (onnx/onnx#1084)
62e63e9de8
* Use .cc since some downstream libraries are configured for C++ only. (#8234)
* Rename SparseTensor to SparseTensorRef. (#8237)
I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [caffe2] Build Android tests and binaries in CI (#7593)
Update benchmark submodule to version with fixed Android/GNUSTL build
* Remove core and util warnings (#8239)
* Fix some signed/unsigned mismatches
* Skip unused result warning
* Explict fallthrough for murmur hash
* Enable aligned new support to eliminate warning
* Switch to int instead of unsigned in some cases
* Remove .gitmodules.aten since it is in .gitmodules now (#8232)
* Fix: gradcheck forced float32 (#8230)
* Print requires_grad and grad_fn in string repr of tensor (#8211)
For example:
>>> torch.ones(3).requires_grad_()
tensor([ 1., 1., 1.], requires_grad=True)
>>> torch.ones(3).requires_grad_() * 5
tensor([ 5., 5., 5.], grad_fn=<MulBackward0>)
The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.
>>> torch.ones(10).double().requires_grad_()
tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
dtype=torch.float64, requires_grad=True)
* Fix TEST_CUDA import in test_cuda (#8246)
* Fix lifting cat into its constant version (#8174)
This fixes a bug where schemas including varargs lists did not lift
properly, blocking correct ONNX export.
* Don't override Tensor, Storage macros defined outside torch/csrc in t… (#8243)
* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.
This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.
In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was a bit of a mess previously because you really have to understand which macros are redefined and which aren't.
We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.
* Don't change the plugin.
* [auto] Update onnx to 3a035f4 - Add retry logic to model downloading (onnx/onnx#1077)
3a035f4397
* Fully genericize THC/THCUNN (except for TensorUtils and DeviceTensorUtils). (#8251)
* [cmake] Use CAFFE2_USE_* for public/cuda.cmake (#8248)
* Fix app size check (#8256)
Fix app size check
* wip on CPU impl
* Stop BCELoss from returning negative results (#8147)
* Stop BCELoss from returning negative results
* check explicitly for 0 before taking log
* add tests
* fix lint
* address comments
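A minimal sketch of the boundary case; after this fix the loss stays non-negative when predictions saturate at exactly 0 or 1.
```python
import torch
import torch.nn as nn

loss_fn = nn.BCELoss()
pred = torch.tensor([0.0, 1.0])
target = torch.tensor([0.0, 1.0])
print(loss_fn(pred, target))  # tensor(0.), not a negative value
```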
* Relax CUDA_HOME detection logic, to build when libraries are found. (#8244)
Log when no cuda runtime is found, but CUDA is found
* Added backward function for kl_div target (#7839)
* added backward fn for target
* added module test for kl_div target, and assuming targets are probabilities
* Change the output format of caffe2 observers (#8261)
as title
* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor. (#8247)
* Remove TensorUtils<T>::getData, provide data<T>() in TH(C)Tensor.
* Fix template parameter.
* [caffe2] Move submodule onnx-tensorrt forward (#7659)
Commit 82106f833dcb0070446a150e658e60ca9428f89b is essential.
* [ideep] Add IDEEP fallbacks for Faster-RCNN ops (#8260)
TSIA
* un-genericize THCDeviceTensorUtils. (#8258)
* provide data<T>() in TH(C)Tensor.
* un-genericize THCDeviceTensorUtils.
This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.
* [caffe2] Fix ATen dispatch for ops with TensorList arg (#8226)
* [cmake] Add and export Modules_CUDA_fix (#8271)
* Add and export Modules_CUDA_fix
* actually, need to include before finding cuda
* [auto] Update onnx to 2508156 - Make error message more verbose (onnx/onnx#1097)
2508156135
* [auto] Update onnx to 39e4668 - fix optimizer does not set ir_version bug (onnx/onnx#1098)
39e46687ea
* [cmake] Make cudnn optional (#8265)
* Make cudnn optional
* Remove cudnn file from cpu file
* Move signal window functions to ATen; add Blackman window (#8130)
* Move signal window functions to ATen; add Blackman window
* fix cuda test not checking scipy
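For reference, a one-line sketch of the new window function; the length is arbitrary.
```python
import torch

# periodic=True is the convention typically used for spectral analysis
w = torch.blackman_window(128, periodic=True)
print(w.shape)  # torch.Size([128])
```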
* [ideep] Fuse Conv-Relu after IDEEP graph rewrite, skip group conv (#8233)
IDEEP supports fusion for non-group conv
* [c10d] NCCL Process Group implementation (#8182)
* [c10d] Process Group NCCL implementation
* Addressed comments
* Added one missing return and clang format again
* Use cmake/Modules for everything and fix gloo build
* Fixed compiler warnings
* Deleted duplicated FindNCCL
* Set up CI build for CUDA 9.2 + macOS (#8274)
* Add macOS CUDA build to CI
* Fix undefined symbols issue
* Use sccache for CUDA build
* Fix sccache issues
* clean up
* c10 build setup (#8264)
* Move c10/ to caffe2/dispatch/
* Set up caffe2/utils directory
* Remove remaining TensorTypeUtils functions. (#8286)
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Create initial Python bindings for c10d (#8119)
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
* Add option USE_NVRTC which defaults to off (#8289)
* [build] Remove /torch/lib/THD/cmake in favor of /cmake (#7159)
* Remove /torch/lib/THD/cmake in favor of /cmake
* path fix
* Explicitly marking gloo to use cuda
* Fix gloo path in THD
* Have a single THTensor / THCTensor type. (#8288)
* Remove remaining TensorTypeUtils functions.
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Have a single THTensor / THCTensor type.
As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.
For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.
* undef GENERATE_SPARSE.
* [auto] Update onnx to 58efe0a - add float16 support back for math and reduction ops (onnx/onnx#1102)
58efe0a9ca
* Some utils for compile-time programming (#7778)
* Add some C++17 features, implemented with C++14
* Add some type traits
* Compile-time type list abstraction
* Some utils for compile-time programming
* Fix compatibility with a larger range of compilers
* Use guts::array instead of std::array because of std::array shortcomings
* code review comments
* Use quotes for includes
* Remove THC's FindMAGMA (#8299)
* Entries for torch.distributed in CODEOWNERS (#8293)
* Add depthwise convolution test for IDEEP (#8301)
* Fix dividing by zero segfault in Reshape (#8302)
when inferring a dimension of a new shape that has zero size
* Removes unused THCTensorConv (#8229)
* Replace Variables to Tensors (#8309)
* Clean up old sccache log before build (#8305)
* Remove unused grad ops on mobile to reduce app size (#8297)
Remove unused grad ops on mobile to reduce app size
* Small fixes (#8296)
* [auto] Update onnx to 5ed684e - Remove/replace /MX with /WX for MSVC build. Was typo in a previous ch… (onnx/onnx#1104)
5ed684ebe5
* Fix sample code for cuda stream (#8319)
* [auto] Update onnx to 4b4085c - Add missing warning ignoring flags to onnx_proto CMake target (onnx/onnx#1105)
4b4085c2e9
* [THD] fix broken THD build with NCCL (#8323)
* Add docstring for `torch.sparse_coo_tensor` (#8152)
* add sparse_coo_tensor docstring
* update empty tensor example
* whitespace
* whitespace again
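To accompany the docstring work above, a small sketch of the constructor with made-up coordinates and values.
```python
import torch

indices = torch.tensor([[0, 1, 1],
                        [2, 0, 2]])     # 2 x nnz coordinate matrix
values = torch.tensor([3.0, 4.0, 5.0])  # one value per coordinate
s = torch.sparse_coo_tensor(indices, values, size=(2, 3))

print(s.to_dense())
# tensor([[0., 0., 3.],
#         [4., 0., 5.]])
```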
* add error when backend is not supported by DDP (#8325)
* Fix collect_env.py for Windows (#8326)
* Fix collect_env.py for Windows
* Fix expect file for Win machine
* Fix the script not stopping early on error for MSVC and Ninja (#8277)
* Simplify the solution
* Remove the usage of set errorlevel
* Skip test_multinomial_invalid_probs_cuda on Windows (#8324)
* Support printing sparse tensors in ATen, fixes #8333. (#8334)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* [C++ API] Cursors (#8190)
* Add cursors to C++ API
* Small self nits
* s/struct/class
* Use more STL like names for cursors
* Implement dim_arange operator (#8266)
* Implement arange_like operator
* add ONNX symbolic
* lint
* change name
* Comment the hack
* 1. fixed flip CPU impl for non-contiguous flip dims; 2. added more tests; 3. using TensorInfo and collapseDims to speed up the CUDA impl for cases where the flip dim is the 1st or last dim
* nits
* 1. removed for loop in pointwise CUDA kernel; 2. using templated (int64_t) IndexType for indices in pointwise CUDA kernel
* added torch.flip.__doc__
* nits
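Since several of the items above touch torch.flip, here is a small usage sketch with an arbitrary input.
```python
import torch

x = torch.arange(6).reshape(2, 3)
print(torch.flip(x, dims=[1]))  # reverse along the last dimension
# tensor([[2, 1, 0],
#         [5, 4, 3]])
```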
* Update operator documentation with markdown descriptions and interfaces
* Added rest of updated operator documentation to source files
* Commiting local changes for rebase
* fixed bracket typo in sqrt_op.cc file
* Added updated markdown documentation to remaining completed ops
We have 2 use cases where we want to experiment with new base ATen
tensor types:
* BatchTensor for matchbox
* Tensors that live on accelerators
It is possible to subclass TensorImpl to implement these but VariableType
does not work with them because it cannot find the equivalent variable type
in the registry.
This commit changes the way we implement type -> variable(type) lookup so that
torch::register_variable_type_for can be called on any at::Type.
Lookups are still done using arrays so there should be no perf impact from the change.
* Port THS to ATen.
The basic structure of the patch:
- All kernels in aten/src/THS got rewritten as native
functions in aten/src/ATen/native/sparse
I took the liberty to rename some of the kernels,
opting for a longer, more transparent names than
things like 'spaddcmul'.
- Instead of holding fields for sparse tensor in the TH
C struct THSTensor, they are now held in a C++ class
SparseTensorImpl (this explains why I had to do this
all in one go; I can't have *two* reps for sparse
tensors!)
Along the way, we change a key internal representation
invariant: an "empty" sparse tensor has dimI == 1 and
dimV == 0 (this is different from dimI == 0 and dimV == 0
we had before); this ensures that we maintain the invariant
that dim == dimI + dimV. "Scalar" sparse tensors are
made illegal, because there really is no way to properly
express them in COO format.
- Because we haven't ported THCS or any of the traditional
dense TH implementations, there is a new set of adapter
functions in native/LegacyBridge.cpp exclusively devoted
to deciding whether or not to go to the new native implementation
or back to the legacy TH binding (prefixed with th_).
The intent is that when everything gets ported, we can
delete this file.
- I've kept the stubs for all the THS functions, but they now all
error if you try to actually call them. Eventually, we should
replace these with calls to ATen so that everything keeps
working.
- I gobbled up SparseMM (SparseMM.cpp is no more). It was tasty.
There are some miscellaneous improvements which were needed for other
changes in this patch:
- There is now AT_FORALL_SCALAR_TYPES_EXCEPT_HALF, which does what
it says on the tin.
- axpy templated function moved to TH/BlasUtils.h, there's a new macro
which lets you easily forward to all of the TH functions. We also expose
THBlas_copy. I'm not terribly pleased with these functions but
they seem to serve a purpose they need.
- New method on Tensor to get TensorImpl*, unsafeGetTensorImpl
- accessor() is now this-const, since const-correctness on Tensor is a lie
- New toSparse()/toDense() methods on Type; now you can call these
directly without having to manually apply at::toSparse/toDense
on the Backend and then running toBackend yourself.
Changes to the kernels:
- Previously, the whole body of all kernels was compiled for
every supported scalar type. In our new implementation,
the scalar dispatch has been pushed into the smallest extent
which (1) is not in a type loop and (2) requires statically
knowing the scalar type. These sites all use
AT_DISPATCH_ALL_TYPES. I tried to use lambdas as much as
possible, but sometimes it was not possible when a OpenMP
pragma was used.
- Anywhere we tested if the nDimension of a tensor was zero,
we replaced with a test that numel is zero. Because, as we
know, nDimension of zero-size tensors in TH is zero, and
that's wrong wrong wrong (and not done this way in ATen).
Some subtleties:
- Places where previously fastget1d was used, I now use a
TensorAccessor. However, you have to be careful about grabbing
the accessor, because sometimes you will be accessor'ing
indices/values and they are empty, which means they will
be *1D* ("oh, aren't indices always 2D?" Nope. Nyet.)
So, essentially, it is only safe to grab an accessor *after*
you have checked that nnz != 0. All of these shenanigans
will go away when we properly support zero-size dimensions.
A few places, we test for this case just by wrapping the loop
in a conditional on nnz. Some other places this is not so easy,
so we instead short-circuit the function with a special case for
when nnz == 0 (usually, these implementations are degenerate).
- There is a very subtle but important difference between
_sparse_get_impl(self)->indices() and self._indices();
the latter may return a view! This is because nnz is
not guaranteed to match the dimensions of indices/values;
you can "truncate" a sparse tensor by setting the nnz.
Actually, I think this is not a good idea and we should
enforce a stronger invariant, but for this patch I slavishly
adhere to the old ways, and as such I have to be very
careful if I want to resize something, I had better use
the former and not the latter.
- I had to reimplement broadcasting by hand (thus the s_
and non-s_ functions in the sparse native files). There
is a very important distinction between foo_out and foo_,
so it is important that the LegacyBridge function always
call to the lower layer, and not try to avoid boilerplate
by calling to another LegacyBridge function first.
I did NOT put broadcasting in LegacyBridge (even though,
ultimately, that's where it must live), because the th_
functions which are invoked from LegacyBridge handle
broadcasting themselves, and I don't want to broadcast
twice.
- Sparse functions MUST explicitly specify the Type they
dispatch from, otherwise Variable wrapping/unwrapping will
not work correctly. If you use _get_sparse_impl, that is
sufficient to levy this requirement.
- The "has native" tests in LegacyBridge.cpp are not 100% precise,
because some of the functions are mixed dense-sparse functions,
and so you can't just say, "Oh, if it's sparse and CPU, call
the native sparse implementation." This is handled on a
case by case basis. There is some especially complex
logic for add(), which has dense-dense, sparse-sparse
and dense-sparse implementations.
- I added some uses of SparseTensorRef in native_functions.yaml,
but you will notice that these are all on native_* functions,
and not the actual, top-level functions. So the SparseTensorRef
is purely documentary (helping you not call the wrong overload)
but there is no magic; we do the wrapping ourselves the hard
way. (This is in contrast to the TH binding code which is magical.)
Except for _sparse_mask; _sparse_mask is magical.
- There is a raw_copy_sparse_ method, which is really my way of
getting around the fact that copy_ has never been implemented
for sparse tensors (even before this patch), but there IS a
super secret, internal way of doing these copies that the THS
code used, and which I needed to get my hands on when I did this
port. We should refactor so that either (a) copy_ does support
sparse-sparse copy natively, or (b) we do this some other way.
- Irritatingly, I must explicitly resize_as_ before copy_ into
a tensor. This was not the case with THTensor_(copy) but I don't
have any direct binding that doesn't have this requirement.
- For some reason, the sparse tensor constructor accepts a scalar
tensor for the values tensor. This is kind of weird because
you always need an nnz-dimension. However, the old code supported
this and just expanded it into a 1D size 0 tensor; so we need some
explicit code to do this.
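For orientation, here is a minimal Python sketch of the usual COO layout that the constructor normally expects (made-up data; the scalar-values expansion described above happens internally and is not shown):
```
import torch

# usual COO layout: indices are (sparseDims x nnz), values have a leading nnz dimension
i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])
v = torch.tensor([3., 4., 5.])
st = torch.sparse_coo_tensor(i, v, (2, 3))
```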
There are maybe a bit more AT_ASSERTs in some of the kernels
than is wise. I added them all when I was debugging and was
loath to remove them.
Some last mile fixes after this commit went into PR
- Move expand outside of dispatch so autograd works (it used to be inside and then we lost all of the recorded broadcasts).
- Hack to duplicate the derivatives for our now two definitions TH and native. Mercifully the derivatives are short.
- Apparently, TH has a special case to make foo_ functions method only, and if you don't do this the Python arg parsing is wrong. We carefully work around this in the native bindings
- Apply DCE to a test_jit case, fixes wobbling due to DCE trick in tracing
- Update test_function's output
- Some last mile fixes for dispatch confusion in sparse_coo_tensor functions.
- New simplified regression test based on failures I saw in ONNX
- Increase tolerance on super resolution test
- More robust dynamic_type normalization, fixes ONNX bug.
The dynamic_type situation is very delicate; probably need
to stop having both Scalar and real.
- Make new_with_tensor_sparse more CUDA safe
- Note about CUDA-safety in SparseTensorImpl
- Rename dimI/dimV to sparseDims/denseDims.
- Make localScalar on SparseTensorImpl work.
- Make numel uniformly supported on all types, not just dense
types
- Add tests for is_nonzero() method (which exercises localScalar)
- Disable constant JIT autogenerated tests, which are fragile and broken
by this change, but being fixed in a parallel track.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* throw error on 0-length tensor slicing
* return empty tensor instead of throwing error
* make 0 slice work for tuples also
* add tests
* move check to aten
* Address comments
* Move empty size logic from ATen into TH/THC.
The goal here is to unify the tensor representations; since the "majority" of the representation is in TH, we push the empty size ({0}) and empty stride ({1}) logic into TH.
This PR does the following:
1) Previously THTensor/THCTensor with dim_ == 0, size == nullptr, stride == nullptr are now dim_ == 1, size == {0}, stride == {1}.
2) The logic that previously implemented this at the ATen level (e.g. THLongStorageView STRIDE_EMPTY_TENSOR) is removed.
3) The above is pretty clean except for resize/resizeNd logic -- that is still called with nDimension == 0. So, we rename these to resizeLegacy, resizeNdLegacy, map nDimension == 1
into the new regime, and will later write an empty-aware resize/resizeNd and move over the calls to resizeLegacy, resizeNdLegacy.
4) Also introduces some ifdefs that are just used for testing:
a) USE_TH_SCALAR: move scalar logic in TH
b) USE_TH_ZERO_SIZE_DIM: support arbitrary 0-sized dimensions, i.e. {...,0,...}.
These are just used to write forward-looking correct code while call sites to _dim() (old TH nDimension) and resizeLegacy are updated.
* Get rid of noelem_to_empty.
* Use static_cast rather than C-style cast.
* Allocator size for empty tensors in THS/THCS.
* Add back THLongStorageView type Stride (TH and arg parsing has some magic that needs these to be nullptrs).
* 1. added hardshrink() to ATen (CPU + GPU); 2. removed nn.Hardshrink(); 3. reusing previous tests for nn.Hardshrink() and included CUDA tests at test_nn; 4. default parameter lambda=0.5 is not working yet
* optimized memory read/write
* 1. pass in lambd as scalar for CPU/CUDA_apply*; 2. removed tests for hardshrink at test_legacy_nn
* fixes test_utils
* 1. replace zeros_like with empty_like; 2. use scalar_cast in cuda
* 1. printing lambd value; 2. default lambd=0.5 is still failing
* getting around the Scalar bug by removing the default value of lambd from native_functions.yaml and declaring it in nn/functional.py (see the sketch below)
* cleaned up debug printf
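A quick usage sketch of the functional entry point touched above (assuming lambd keeps its usual default of 0.5 at the Python layer):
```
import torch
import torch.nn.functional as F

x = torch.tensor([-1.0, -0.25, 0.0, 0.25, 1.0])
# elements with |x| <= lambd are zeroed; the rest pass through unchanged
print(F.hardshrink(x, lambd=0.5))  # expected: tensor([-1., 0., 0., 0., 1.])
```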
* Temporary solution for having access to the root path for python installations until Caffe2/PyTorch figure out the best way to build.
* Update build.sh
Increasing the verbosity of HIP errors.
This commit turns autograd function/method tests into tests run inside of a trace, or directly written using
script. These tests have uncovered many bugs and limited functionality
in the trace/script pathway, and these failing parts of the tests
are disabled using new exclusion sets. The size of these sets will shrink
as the bugs are fixed.
* fix a bug for SkipIndices
* IDEEP bug, revise the output to CPUTensor in SkipOutputCopy strategy
* [IDEEP] Add IDEEP fallbacks for Style-Transfer ops
* Improve TypeId:
- move it to c10 namespace to allow for easy extraction from caffe2 into c10 (i.e. reusability from aten)
- Use unordered_map/unordered_set instead of map/set for performance
- Make TypeId a type safe class (i.e. no implicit casts from/to int)
- Make TypeId constexpr
- Some readability improvements (e.g. using instead of typedef)
- Don't explicitly implement TypeMeta copy assignment and construction - let the compiler do that for us.
- Add TypeMeta move constructor
- Make TypeMeta members noexcept
- Implement TypeMeta::operator== and operator!= as free functions instead of in-class
* CR comments
* fix
* fix windows
* Rename back to CaffeTypeId
* Remove c10::TypeId/TypeMeta
* remove C10_KNOWN_TYPE
* code review
* Implement CPU bincount feature support (a usage sketch follows this group)
* Incorporate feedback on renaming to SummaryOps file and other nits
* bincount gpu implementation
* refactor cuda code and incorporate nits
* doc fix
* cuda bincount - cast weights to double if integral type
* fix: signed unsigned comparison error
* fix: ssize_t error
* refactor
* make template typenames readable and other nits
* make compatible with v0.5
* incorporate comments
* update test cases to ensure CUDA code coverage
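A usage sketch of the operator added above; the keyword names shown (weights, minlength) follow the documented signature, and the data is illustrative:
```
import torch

x = torch.tensor([1, 2, 2, 5])
print(torch.bincount(x))               # tensor([0, 1, 2, 0, 0, 1])
w = torch.tensor([0.5, 1.0, 1.0, 2.0])
print(torch.bincount(x, weights=w))    # per-bin sums of the weights
print(torch.bincount(x, minlength=8))  # histogram padded to at least 8 bins
```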
* add comparison operators to jit
* try to fix CI
* address review comments
* fix type of comparison ops result
* address review comments
* fix indentation
* add comments
* require type_as to have non-dynamic tensor arg
* Typo (should check if template argument of type_as, inputs()[1], is tensor)
* Use .at() instead of []
* Use .at() again
* Improve number formatting in tensor print
* fix bad rebase
* address comments
* fix test
* fix test
* use assertExpected for tests
* address comments
* address comments
* More efficient kernel that avoids deprecated shuffles in Embedding.cu and THCUNN/LookupTable.cu
* Using WARP_BALLOT from THCDeviceUtils.cuh, also changing WARP_BALLOT to return unsigned
* [c10d] Rendezvous skeleton
The rendezvous function takes a URL and produces a triplet of a store,
a process rank, and the process group size.
For the file and TCP handlers, the rank and size must be specified, but
other handlers may discover these parameters dynamically.
It returns a generator function, such that if a rendezvous handler
supports rerendezvous, you can write:
for store, rank, size in c10d.rendezvous(...):
    pg = c10d.ProcessGroup(store, rank, size)
    while the process group is valid:
        # Do stuff with process group
* Add Python 2 fallback for urlparse library
* Import X as Y
* Relative import seems to fix it
* Spelling
* Gate import on c10d availability
* Modifying the build path to handle Caffe2's merge
* Update LoadHIP.cmake
Fixing typo.
* Update Dependencies.cmake
Keeping hip_include_directories since other Caffe2 libs depend on it.
* Update CMakeLists.txt
Only including for the second time if we're building with ATen.
* Update CMakeLists.txt
Adding comments to make sure future users understand why necessary commands have been added.
* [fix] fixup the bias multiplier data access issue
Hotfix for failures in conv_transpose
* [D2][Easy]: lint regularizer
lint with black
* [GanH]: Split mu in adaptive weight for diagnose
* [Dper] Add the ability to split FC weights into multiple smaller ones
* fix SumReduceLikeOp for empty blob
as desc.
* add ctc_greedy_decoder for caffe2
ctc_greedy_decoder same as tf's
* Update event callback handling
Allow multiple callbacks per event
* Add WeightedSum layer
The motivation is to do weighted sum in HoNet/crossnet; in the next diff, I'll replace model.Add with model.WeightedSum in
honet: https://fburl.com/f4rmolg2
crossnet: https://fburl.com/v7awn8se, https://fburl.com/63filbnm
* Replicate DAG's behavior
Some callers expect RunAsync to block, replicate that behavior in case of
explicit 'dag' net type
* [dper] layernorm layer
as title
* Override dag, async_dag, async_polling
Overriding dag, async_dag and async_polling with async_scheduling
* Name the thread pools
Caffe thread pools currently inherit the thread names from the thread that starts them, which can be misleading. Give them an explicit name instead.
* [Caffe2] FillerOp should support int64_t dimensions
Change argument type to int64_t for shape argument of FillerOp (used in ConstantFill, XavierFill, etc)
* Remove caffe2/caffe2/contrib/torch/
It's not used anywhere and depends on old lua torch that conflicts with Aten. Given PT1 it's not relevant any more (though it was nice and clever code!)
#accept2ship
* Fix linearWarmup multiplier check
The multiplier needs to be non-negative, not strictly positive.
* Revert D3314316
This is after 2 years and we do not seem to have a use case for this one, so
for the sake of clean API design we should potentially remove this. This would
allow us to potentially pass in arguments to optionally construct an object,
although it is indeed a little bit unclear how we can reuse existing objects if
constructor arguments are passed in. In any case, we may want to remove this
dangling feature.
* Speedup generate proposals by partial_sort.
Speedup generate proposals by partial_sort.
FACEBOOK:
- Saw speed improvement for training with this op.
- Yanghan benchmarked the op on a small dataset and saw a consistent 100% speed improvement (6ms -> 3ms) at 420 input resolution. See next diff for details.
* More parallel processing friendly for CPP version of GenerateProposals.
More parallel processing friendly for CPP version of GenerateProposals.
* [DT] [43/n] Lift stop conditions inside reader code back to flow control
1. Split multi_reader function into local_reader and remote_reader
2. Lifted stop conditions inside Limiter back to flow control
3. Split epoch flow building logic into 3 cases:
- single machine (1 reader, 1 trainer on trainer0 node, no PS)
- (1 reader + 1 trainer) on trainer0 node, has PS
- multiple readers, readers do not share nodes with trainers, might have PS or not
* Resolve conflicts for torch/_thnn/utils.py
* [Caffe2] Handle image decoding errors
Image decoding errors can make the whole training fail. This diff is to handle them
1. Catch imdecode exceptions and check whether the decoded image has zero columns or rows; this counts as a decoding error.
2. Replace the image with an empty one in case of error.
3. Count the number of errors and throw a runtime exception if the error rate reaches the given threshold.
The empty image data is kept. It might introduce noise in the training data.
* Update MKL exporter to IDEEP ops
TSIA
* [Caffe2] GlobalInit is thread safe, fixing the comment
With the mutex and lock, GlobalInit is thread safe.
Update the comments.
* Back out "Add support for generating ATen files during fbcode build"
Original commit changeset: 28970ddba353
@override-unit-failures
(Note: this ignores all push blocking failures!)
* [DT]: fix predictor save
similar to D6610058, here we add the fix for distributed online training
* Remove net_singlethread_async_gpu.cc
Closes https://github.com/caffe2/caffe2/pull/2528
This removes net_singlethread_async_gpu.cc as part of our effort to clean
CUDAContext and the net executors.
* Inline DFS task execution
Add a DFS inline task execution mode in executor
* Add c10 folder to fbcode
This adds the c10 folder and its test cases to fbcode. Build flags are mostly taken from aten.
* add dependencies for online trainer
Add some dependencies so that the online model can use DataPipeline and PredictionTransform operators
Relevant post: https://fb.intern.facebook.com/groups/1324375037655677/permalink/1740993462660497/
* Resolve conflicts for tools/jit/gen_jit_dispatch.py
* [Fix] sparse regularization in distributed training
* Support advanced pooling options in sum processor
* support advanced pooling options in sum processor
* remove redundant code
* support attention in sum processor
* Improve shard logging in net tracing code
Make it handle arbitrary shard ids instead of just one-digit ids.
* [Caffe2] Call GlobalInit in predictor only in mobile
FACEBOOK:
Calling GlobalInit long after the program starts may not be safe. There are issues if the following happens:
- The user does not call GlobalInit and initFacebook after the program starts.
- The user sets a flag manually: https://fburl.com/mcsumw7d
- The user calls the OSS predictor.
- The OSS predictor calls GlobalInit.
- GlobalInit calls initFacebook.
- initFacebook resets all flags: https://fburl.com/tolszha1
- Thus, the flags the user set manually are overwritten.
This would happen anytime GlobalInit is called long after the program starts.
I suppose the intention of the user in this case is not to call GlobalInit throughout the program,
but to use Caffe2 regardless (is that desired?)
But adding GlobalInit in the OSS predictor would automatically call GlobalInit when using Caffe2.
This issue doesn't exist in mobile, since initFacebook is not called on mobile.
For now, guard the GlobalInit in predictor for mobile only.
May want to ensure the GlobalInit is always called at the start of the program. @[3501714:kutta] has seen weird issues when not calling GlobalInit at the start of the program on server side. He has made some progress on this.
* resolve conflicts for caffe2/core/logging_is_google_glog.h and test/test_torch.py
* Add empty fix for SumLikeReduceOp
Add empty fix for SumLikeReduceOp
* Revert D7962948: [caffe2][nomnigraph] Concat elim for sparseNN
This reverts commit f7f434dc5c34ca6058b9765d2ef615453d2276a9
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* Remove Declarations.yaml
* Include common.h
* Change std::stoi to caffe2::stoi
* Add thread_name.cc to the CMake file
* No need to subtract 1. Fix test segfaults
* Fix NetTest, ObserverTest
Fix tests
(cherry picked from commit 3767e66c3f365596cba3d46d3e7322c933a0ab41)
* CTCGreedyDecoderOp only has CPU implementation, test should only run on CPU
* Add a variable to avoid conversion resizing issue
* Remove the code per soumith's comments
* Remove blank lines in the end of file
* [caffe2] upgrade IDEEP and hotfix for conv op accuracy issue (#8364)
* [IDEEP] Upgrade IDEEP version
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* [IDEEP] Fix accuracy issue in conv op
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Fix build error due to lack of src in CMakeLists
Signed-off-by: Gu, Jinghui <jinghui.gu@intel.com>
* Remove the code per soumith's comments
* [ONNX] Add an ATen fallback pathway for ONNX export (#8273)
* ATen fallback for ONNX export
* Move to enum
* Fix model test
* Add comment
* Address comments
BC interface
* Remove imaginary file (#8415)
* [Caffe2] Enable AMD/MIOPEN ops for Caffe2 (#8306)
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scaffolding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
* Add MIOPEN pooling operator
* Add MIOPEN activation operator
* Add MIOPEN softmax operator
* Add MIOPEN spatial batch norm operator
* Add MIOPEN local response normalization operator
* Add MIOPEN conv operator
* Clean-up LRN ops
* enable fp16 in MIOPEN pool ops
* Enable fp16 for MIOPEN relu op
* Enable fp16 for MIOPEN spatial batch norm op
* code clean-up
* revert float16 support
* Create Caffe2 python binding for AMD/ROCM/HIP
* Add op fallback for HIP operator
* add hip src/test files in cmake
* exclude hip src/test files
* fix python binding for hip backend
* fix MIOPEN pooling op workspace
* hack to compile miopen operators
* fix include path for MIOPEN ops
* Fix include path
* Add HIP math utilities
* Fix path for HIP math utils
* cmake fix
* Cmake fix / hipcc for hip files
* suppress hipcc warning
* cmake fix / replace USE_HIP with USE_ROCM
* revert LoadHIP.cmake change
* fix include for thrust/cub-hip
* include path fix for conversion.h
* Updated with latest upstream changes
* clang format fixes
* Context_hip updates
* Fixed typo in rocblas handle get function
* Updated hipified math utils
* Updated math hip test util
* Updated context hip test
* Updated common_hip
* Updated net async dag for HIP
* Added MIOPEN in operator hip test
* fix
* C2 dependencies clean-up
* fix include path for building custom protobuf
* Decouple miopen pool op and conv_pool_op base
* cmake refactor
* fix operator_hip_test
* move all hip/miopen ops files into caffe2/operators/hip
* sanitize cmake
* permission issue
* remove extra parenthesis
* remove artifact from resolving merge conflict
* cont. sanitize cmake files
* fix syntax error
* sanitize conversion.h
* .
* Revert "."
This reverts commit 56020cb0e996a31ae27bf1f8f491955ed0b121b9.
* clang-format
* Enable some reduce operators' ONNX backend tests (#8418)
* fix old comment to point to the right file (#8416)
* Stop pinning nccl version. (#8421)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Expose logsumexp docs and mark log_sum_exp in distributions for internal use (#8428)
* Enable some of the ONNX backend test on broadcasting (#8423)
* Enable some of the ONNX backend test on broadcasting
* enable gemm broadcast
* Expose proto utils and ONNX (#8073)
* Expose proto utils and ONNX from PyTorch libcaffe2.so
* Try to use protobuf from _C.so
* Fix ONNX proto header include
* Adjust order of imports for ONNX until nanopb goes away
* Set and use ONNX_NAMESPACE for PyTorch builds
* Show protobuf summary for all builds
* Add ONNX_NAMESPACE for cpp_build
* Statically link libprotobuf.a into libtorch.so
* Set ONNX_NAMESPACE on Windows build
* Move core/dispatch up as well
* Add /MD flag for Windows build of _C
* Potential Windows fix for ONNX and protobuf
* Add direct linkage from _C to ONNX on Windows
* Only include protobuf wrapper for PyTorch
* Pass extra_compile_args to _nvrtc ext build
* Remove installation of .a files
* Rebase creates some weird situations, revert them manually
* Remove more weird changes due to rebase
* Need to add thread_name.cc after merge
* Revert "Stop pinning nccl version. (#8421)"
This reverts commit 3cb45bafc8b9b023049e5f979a2bcb75e3f7009d.
* Allow downgrades from libnccl2 install.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Billing of changes:
- New Jenkins script for building on rocm. For now it is a bit hacked together, but we can improve it once CI is running
- New ROCM docker image for nightly HIP, and also some legacy packages that we need temporarily
- New enabled config py2-clang3.8-rocmnightly-ubuntu16.04-build based off of the existing Caffe2 image (not built yet)
- A big pile of cmake fixes, mostly to turn bits on/off when ROCM build is involved
- Switch from hiprng to hcrng
- Apply some patches directly in code, eliminating the patches
- Use __hdiv instead of hdiv, it's more portable
- THCNumerics<T>::gt doesn't work in HIP, so simulate it with sub
- Add a few more overloads HIP needs
- Turn off use of hcc to link (we plan to turn this back on to get tests running)
- Search for hiprand, hiprng, hipblas, hipsparse
- Better Python 2 portability
The Python binding generation code doesn't understand
method '_out' bindings correctly, and will compute the
indices wrong if you have an '_out' function that's also
a method. This is a quick check to prevent you from making
this mistake.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Back out "Back out "Add support for generating ATen files during fbcode build""
Original commit changeset: 7b8de22d1613
I'm re-sending this diff exactly as it was approved and
committed. Fixes to support @mode/opt will be sent separately for ease
of review.
* Enable building //caffe2:torch with @mode/opt
In @mode/opt, python runs out of a PAR, which breaks a lot of
assumptions in the code about where templates/ folders live relative
to __file__. Rather than introduce hacks with parutil, I simply turn
template_path into a parameter for all the relevant functions and
thread it through from the top level.
There is a bug in NCCL that causes seg faults while calling ncclCommDestroy() in the destructor during program exit. According to Nvidia, "Whether the NCCL destructor will be called before or after the CUDA runtime destructor is undefined, which can lead to crashes."
For the immediate workaround, skip calling ncclCommDestroy in the NCCL destructor. This is UGLY and we'll follow up with Nvidia to solve this ASAP.
This does the following:
1) makes nDimension an int64_t (to match ATen)
2) changes the dimension value to dim_ (so we catch direct usages)
3) provide an _dim() that provides access to the "old" view (so we can migrate functions one at a time)
4) have code call ->_dim() instead of ->nDimension.
Necessary for Tensor detemplatization (D8121878) - now Tensor won't have a default constructor (as we don't know the device).
Thus this diff makes TypeMeta be constructible with non-default-constructible types in which case ctor() is non-null but always throws.
It's dangerous however as we won't catch potential type errors at compile time. Luckily - the only place where ctor() is used is in Blob and Tensor which have templated wrappers there (GetMutable and mutable_data respectively). We can just enforce the necessary type requirements there explicitly as a static_assert.
It also changes the failure behavior to be throw() instead of abort(). Aborting the process is not cool for the library :)
* Add some C++17 features, implemented with C++14
* Add some type traits
* Compile-time type list abstraction
* Some utils for compile-time programming
* Fix compatibility with a larger range of compilers
* Use guts::array instead of std::array because of std::array shortcomings
* code review comments
* Use quotes for includes
* Remove remaining TensorTypeUtils functions.
Mostly what's remaining is copy utilities -- these are now provided in THCTensorCopy.hpp and templatized on the ScalarType rather than the TensorType.
* Have a single THTensor / THCTensor type.
As was previously done with Storages, have only a single (dtype-independent) THTensor / THCTensor.
For documentation and backwards compatibility purposes, the old names, e.g. TH(Cuda)LongTensor alias the new TH(C)Tensor type.
* undef GENERATE_SPARSE.
* Build and install c10d from tools/build_pytorch_libs.sh
* Create initial Python bindings for c10d
* clang-format
* Switch link order to include more symbols
* Add bindings and tests for ProcessGroupGloo
* Add broadcast test
* Separate build flag for c10d
* Explicit PIC property
* Skip c10d tests if not available
* Remove c10d from Windows blacklist
Let it skip by itself because it won't be available anyway.
* Make lint happy
* Comments
* Move c10d module into torch.distributed
* Close tempfile such that it is deleted
* [c10d] Process Group NCCL implementation
* Addressed comments
* Added one missing return and clang format again
* Use cmake/Modules for everything and fix gloo build
* Fixed compiler warnings
* Deleted duplicated FindNCCL
* provide data<T>() in TH(C)Tensor.
* un-genericize THCDeviceTensorUtils.
This is used outside of generic context, so we need to un-genericize it to have a single THCTensor type.
* Don't override Tensor, Storage macros defined outside torch/csrc in torch/csrc.
This PR does the following:
1) Removes THSTensor macros in torch/csrc, which aren't used.
2) For macros defined outside of torch/csrc (THTensor, THTensor_, THStorage, THStorage_):
a) No longer override them, i.e. previously THTensor could actually be THCTensor if a generic file was included from a file including THCP.h.
b) Instead, introduce new macros THW* (e.g. THWTensor) to represent a (potentially empty) wildcard character.
In addition to making this code easier to read and codemod, this allows us to more freely change TH/THC; for example:
currently in the THC random code, the state is casted to THByteTensor*; this happens to work because the macros don't happen to override THByteTensor.
But if THByteTensor just becomes an alias of THTensor (which is the plan for a single tensor type), then this no longer works.
The whole thing was a bit of a mess previously because you really had to understand which macros are redefined and which aren't.
We could also rename the macros that live in torch/csrc (e.g. the THPTensor macros), but since that is more self contained, I punted for now.
* Don't change the plugin.
For example:
>>> torch.ones(3).requires_grad_()
tensor([ 1., 1., 1.], requires_grad=True)
>>> torch.ones(3).requires_grad_() * 5
tensor([ 5., 5., 5.], grad_fn=<MulBackward0>)
The suffix (dtype, requires_grad, grad_fn) wraps to a new line if
it would cause the line to exceed the linewidth.
>>> torch.ones(10).double().requires_grad_()
tensor([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
dtype=torch.float64, requires_grad=True)
* Fix some signed/unsigned mismatches
* Skip unused result warning
* Explict fallthrough for murmur hash
* Enable aligned new support to eliminate warning
* Switch to int instead of unsigned in some cases
I want to introduce using SparseTensor = Tensor (as a documentary
type alias for Tensor), but the name is already taken.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Trying to copy all results fails when one of them is a tensor list which
has not been populated. This blew up for CuDNN RNNs when the weights
did not require grad.
Thanks to Sylvain Gugger for reporting!
* Add nan and inf probs check to multinomial (see the sketch after this group)
* fix bug
* Spawn CUDA test in subprocess
* Make sure invalid input won't pass the test case
* Try to fix error
* Test failure cases in Python 3 only
* Try to fix Windows error
* Move CUDA test to test_cuda.py
* fix issues
* fix module name error
* no need to check for CUDA existence in test_cuda
* Use PY3
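A hedged sketch of what the new multinomial validation is expected to reject (assuming the check surfaces as a RuntimeError; the exact message is not asserted):
```
import torch

probs = torch.tensor([0.2, float('nan'), 0.5])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as err:
    print("invalid probability distribution rejected:", err)
```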
These files don't follow the usual pattern: In general the files torch/csrc/X torch/csrc/cuda/X
both include the generic file torch/csrc/generic/X, where torch/csrc/X includes the cpu implementations and torch/csrc/cuda/X includes the cuda implementations.
(Aside: this is probably not the best structure, the torch/csrc/X files should probably be moved to torch/csrc/cpu/X).
utils.cpp combines these so that torch/csrc/utils.cpp has cuda specific code. This makes it impossible to declare a single THTensor and THCTensor template type (i.e. THPPointer<_THTensor>, THPPointer<_THCTensor>).
In particular, define a base type, _THTensor, that can be used for all THRealTensor structs.
This is just to have less cognitive load when dealing with generic THTensor/THCTensor types (as in templates).
* Replace (non-data) TensorUtils calls with non-generic THCTensor calls.
TensorUtils is templatized on the THTensor type, so to support a single tensor type (like ATen), we need to remove these.
This PR does the following:
1) Allows THCTensorTypeUtils.cuh to include THCTensor.hpp.
This involves moving includes of it outside of generic/, so we can use the new implementations.
2) Defines a single _THCTensor struct and changes THCRealTensor to be a derived type of _THCTensor.
This allows us to implement a single non-generic function and avoid static_cast or void * tricks to call it from the generic functions.
3) For functions inside of TensorUtils that don't use data pointers:
a) Implement the functions in (non-generic) THTensor.cpp and declare them in (non-generic) THTensor.hpp.
b) Have the generic versions call the non-generic versions.
c) Replace the corresponding TensorUtils<THCTensor>::fn call with (non-generic) THTensor_fn.
* Add comment about THCTensor struct.
* Error if storage is null in setStorageNd or resizeNd.
* Implement randperm for CUDA (usage sketch after this group)
* Use Thrust to implement randperm
* clean up
* Fix test
* Offload small input scenario to CPU
* Fixed test
* Try to fix Windows error
* Fix Windows error and clean up
* Use fork_rng context manager
* Move test_randperm_cuda to test_cuda
* Add half tensor support
* Fix cuda::type error
* Fix CPU offloading
* Fix issues
* No need to check range for n == 0 case
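A usage sketch of the CUDA path added above (assuming a CUDA build; per the notes above, small inputs are handled by offloading to the CPU kernel internally):
```
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
perm = torch.randperm(10, device=device)  # random permutation of 0..9 on the chosen device
print(perm)
```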
* Fix scalar check for sparse tensors.
As discovered in #8152
If `t` is a scalar sparse tensor, `t._indices` used to return a sparse
empty tensor because the scalar check was incorrect. This PR modifies
the scalar check to return a dense tensor instead of a sparse tensor.
i.e.
```
tensor = torch.sparse_coo_tensor([], [], torch.Size([]), device=device)
out = tensor._indices() # was a sparse tensor, now is dense.
```
* Fix typos
In my use case, in the backward pass, the reshape needs to
change a [0] tensor into a [0,0]-shaped tensor. The original implementation would
cause an out-of-index issue. This diff fixes the problem.
We don't want SOVERSION because pip will lose the symlink and
double your distribution size, and also because our setup.py
accidentally links against both libcaffe2.dylib and libcaffe2.1.dylib
on OS X. This leads to a very puzzling error where you get
the error "cannot initialize CUDA without ATen_cuda", because
there are actually two copies of your registry in memory (because
there are two copies of the dynamic library). Dropping SOVERSION
makes it impossible to make this mistake.
In principle, if the shared library load is done with DYLD_GLOBAL,
that should also prevent two copies of the registry from popping up.
Worth checking at some later point, if you need to bring back
SOVERSION (because, e.g., pip finally fixed their software.)
Partially fixes #8022.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update elementwise ops to support numpy style broadcast
Update elementwise ops to support numpy style broadcast
* Fix sqrt_op
* Fix compare ops
* Fix gradient test
* Fix optimizer legacy broadcast
* Fix legacy broadcast for elementwise ops
* Skip flaky test
* Fix eigen simple binary op
* Fix attention test
* Fix rnn test
* Fix LSTM test
* Fix tan grad
* Fix schema check
For non-generic function call implementations in Storage used by TensorUtils, we do the following:
1) Move the declaration from generic/C to non-generic/C++; we don't need backwards compatibility on these functions and want to use e.g. at::ScalarType.
2) Move the implementation from generic/C++ to non-generic/C++.
3) Change the generic implementation to call the non-generic implementation.
This will allow us to get rid of the corresponding TensorUtils calls (once we move over the Tensor functions in the same manner).
* Fix __rshift__ bug
* Add small tests for __lshift__ and __rshift__ in test_cuda
* Add a more elaborate check for __lshift__ and __rshift__
* refactor the test to address @zou3519 's comments
When compiling OSX with CUDA, Caffe2's build system uses
find_package(cuda) to get its grubby hands on the CUDA driver
library (for some strange reason, FindCUDA doesn't save this
information as a variable). Unfortunately, on OSX, sometimes
this picks up the cuda.framework folder, and then our build
system chokes to death because it doesn't try to link against
this as a framework. (Is the folder even a framework? I have
no idea).
This commit attempts to fix this in a two pronged fashion:
1. For some users, reducing the precedence of frameworks
using CMAKE_FIND_FRAMEWORK seems to help. So we set these
variables. However, this fix is not perfect; on my laptop
it doesn't actually solve the problem.
2. PyTorch doesn't actually need the CUDA driver API. So we
only add the dep when building Caffe2.
Fixes #8022
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* docstring support for @script and @script_method (see the sketch after this group)
* make it python2 compatible
* improve according to review
* improve build_stmts
* use filter instead of list comprehension
* improve the way wrap is handled for script_method
* stash the original method instead
* allow dynamic attr for ScriptMethod and GraphExecutor
* a bit comment on build_Expr
* remove _build_wrap
* a bit improve on comments
* rename to __original_methods
* should be _original_methods
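A minimal sketch of the feature described in this group (assuming the decorator is exposed as torch.jit.script and the docstring is now retained on the compiled function):
```
import torch

@torch.jit.script
def double(x):
    """Doubles the input tensor."""
    return x * 2

print(double.__doc__)  # expected to show the docstring after this change
```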
* docs: enable redirect link to work for each specific page
* docs: add canonical_url for search engines
closes #7222
* docs: update redirect link to canonical_url
* opt bernoulli rng with vsl and openmp
* detect cpu vendor for bernoulli
* retrigger test platform
* check the vendor more severely
* use cpuinfo to check vendor
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch in scatter
* fix type mismatch in scatter
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* fix type mismatch while call torch._C._cuda_setDevice
* Add non_blocking to Tensor/Module.to (usage sketch after this group)
* flake8
* Add argparse tests
* cpp parse
* Use C++ parser
* use a commong parse function with Tensor.to
* fix test_jit
* use THPObjectPtr
* increase refcount for None, True, and False
* address comments
* address comments
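A usage sketch of the new argument (non_blocking only helps for pinned-memory host-to-device copies; guarded on CUDA availability):
```
import torch

x = torch.randn(4, 4)
if torch.cuda.is_available():
    # copy asynchronously with respect to the host, since the source is pinned
    y = x.pin_memory().to('cuda', non_blocking=True)
    model = torch.nn.Linear(4, 4).to('cuda', non_blocking=True)
```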
As in https://github.com/pytorch/pytorch/pull/8056, this doesn't work with a single TensorImpl type.
This replaces those usages with a templatized parameter and static_asserts that the new and old are equal.
After this we can get rid of the old template parameter, but I want to ensure they are equivalent across all builds first.
* [Caffe2] Support non peer access in muji
* [Caffe2] Add test for 4 gpus and 2 groups
* [Caffe2] Add comments
* Fix bug when reduced_affix is empty
* Fix typo and add comments about cpu and amd gpu
* Split SparseTensorImpl off from TensorImpl.
At the moment they have the same data layout, but with the upcoming refactor
they will not, and we need a place to put all of the sparse tensor specific
fields.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Update SparseTensorImpl.h
In some configurations (e.g., our internal build of GCC 5 + GLIBC 2.23),
-lrt is not sufficient to use shm_open; you also need to declare
a dependency on pthread. This patch adds a surgical extra fix to
detect this situation, in the case that I noticed it failing in the
wild.
Fixes#8110
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Implement adaptive softmax (usage sketch after this group)
* fix test for python 2
* add return_logprob flag
* add a test for cross-entropy path
* address review comments
* Fix docs
* pytorch 0.4 fixes
* address review comments
* don't use no_grad when computing log-probs
* add predict method
* add test for predict
* change methods order
* get rid of hardcoded int values
* Add an optional bias term to the head of AdaptiveSoftmax
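A hedged usage sketch of the adaptive softmax work above, assuming it is exposed as nn.AdaptiveLogSoftmaxWithLoss; the sizes and cutoffs are made up:
```
import torch
import torch.nn as nn

asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=1000, cutoffs=[10, 100, 500])
hidden = torch.randn(32, 64)
target = torch.randint(0, 1000, (32,))
out, loss = asm(hidden, target)  # per-sample target log-probs and the mean loss
pred = asm.predict(hidden)       # most likely class per row, per the predict method above
```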
* Resolve merge conflicts
* .
* Update GetAsyncNetHIPThreadPool
* Enable BUILD_CAFFE2 in pytorch build
* Unify USE_HIP and USE_ROCM
* always check USE_ROCM
* .
* remove unrelated change
* move all core hip files to separate subdirectory
* .
* .
* recurse glob core directory
* .
* correct include
* .
If you set CUDA_HOME and CUDA_NVCC_EXECUTABLE together, you may
end up in a situation where the CUDA_VERSION of your includes
mismatches the CUDA version of your nvcc. See #8092 for a concrete
case where this can occur. Explicitly detect this situation and
give a good error message in this case!
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Added support to run ONNX Upsample operator (mode=nearest) in Caffe2
* adding error checks to upsample
* adding error checks to upsample
* adding error checks to upsample
* changing to np.isclose
* Revert onnx submodule update
* still fixing
TensorUtils<T> is basically ATen-dispatch-lite in that it allows one to do multi-type THC function dispatch with a single call.
However, it is templatized on the Tensor type, and since we are moving to a single Tensor type, this doesn't work.
Most of the functions in TensorUtils (e.g. getDims) can be pulled up a level, to just call THCTensor_nDimension (or directly access the member),
but the DataType specific functions are more problematic.
So, this PR does two things:
1) Replaces calls of 'TensorUtils<THCTensor>::DataType' with 'real' since these are identical
2) Templatizes the THC_pointwiseApplyX functions to take scalar types. To ensure this is done correctly, we static_assert that the scalar type template parameter matches the scalar type of
the corresponding template parameter. We will need to get rid of these static_asserts in the future, but this is useful for now.
No longer generate data-type specific Storage types, since all Storage types are now identical anyway.
For (some) backwards compatibility and documentation purposes, the Real names, e.g. THLongStorage are now #defined as aliases to the single THStorage type
* Adding instance weight to batch distill loss
as title
* add bfloat 16-31
added bfloat 16-31 and their respective unit tests
* [CUDA9] Upgrade - fbcode
CUDA9 upgrade diff D5654023 has been out for a while thanks to Pieter. But as time goes on it's becoming quite hard to rebase, because of the symlinks and auto-generated build/config files in tp2. Break D5654023 into two diffs, one touching tp2 config files, and another one touching the fbcode TARGETS file (adding the nvcc flag). These two should be a bit easier to rebase (for detailed procedure see "Test Plan").
This diff can only be committed if:
1. CUDA 9 rpm is rolled out fleet-wide (TBD)
2. NVidia driver 390.40 is rolled out fleet-wide (done)
3. Upgrade CUDA 9.1, cudnn 7.1, nccl 2.1 (done)
4. Make sure all dependents are built (done)
5. Test all C2 operators, PyTorch (see test plan)
* Share intermediate int32 buffer across Conv ops
Adding a known type
* [C2 fix] infer function for ensure_cpu_output_op
this is adding the missing device function for ensure_cpu_output_op
* [int8] Add blob serializer/deserializer for Int8TensorCPU
To export to logfiledb
* [nomnigraph] Add try catch block to optimization passes in predictor
This will catch failures that happen in the optimization pass.
* Caffe2: avoid static initialization order fiasco for CAFFE_ENFORCE
CAFFE_ENFORCE uses a stack trace fetcher, which is currently a
global static variable. If at static initialization time CAFFE_ENFORCE
is used, this is a SIOF. Recently CAFFE_ENFORCE was added into init
functions registration, so we started to see this.
A Meyers singleton is going to provide safety here. If the stack trace
fetcher was not registered yet, it will just use a dummy one.
* NUMA support in SparseNN CPU benchmark
Adding support for NUMA in SparseNN CPU benchmark
* [mobile-roofline] Add logging needed for roofline model
This should be all that's needed
* Let the operators use the same input if the operators are not chained
or else, we have to change the input data dims
* fix null-pointer-use UBSAN errors in reshape_op.h
* revert previous fix on input blob name
as title
* Adding flag to let MineHardNegative automatically extract single value from dict
Model exporter requires the output of the model to be a struct. This makes it convenient to use those models directly in MineHardNegative by allowing automatic extraction of the single element of the dict, which is a common use case.
* Reverting change that broke internal tests back to OSS compatible state
* [script] Add support for torch.zeros, torch.ones, etc. (sketch after this list)
* modifies gen_jit_dispatch to create bindings for functions that do
not take tensor arguments, but do have an initial type argument
* adds tensor attributes to these functions for device, layout, and
dtype specification
* extends the list of valid compiler constants to include device, layout,
and dtype.
* allows functions with Generators, but only using the default generator
Known limitations:
* when using `torch.float`, we convert it to a scalar tensor and make
no checks that it is actually used only in a dtype specification.
This is similar to how we handle Python numbers, creating some situations
where the script is more permissive. Fixing this requires much more
significant changes to the IR, so is lower priority for now.
* devices specified using string literals e.g. 'cuda:1' do not work,
since we do not support string literals in general.
* Factor python dependency out of interpreter
* Remove NO_PYTHON for the autograd engine
If there are no Python bindings, then a default Engine is constructed
the first time it is requested.
If the python libraries are loaded, then they override the default
accessor and the default engine becomes a python Engine.
Note: it is possible for two engines to be generated if a non-python
one gets created before the python bindings are loaded. This case
is rare, and just results in additional threads being spawned.
* Fixing AlexNet test which is skipped in CI
* Fix profiler crash when no events register
When trying to profile, attempting to print the event table throws a vague error because the event list is empty:
....
max_name_length = max(len(evt.key) for evt in events)
ValueError: max() arg is an empty sequence
This change fixes the error by returning an empty string.
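A minimal sketch of the guard (illustrative names, not the actual profiler code):
```
def build_table(events):
    # Bail out with an empty string instead of letting max() raise
    # "ValueError: max() arg is an empty sequence" on an empty event list.
    if len(events) == 0:
        return ""
    max_name_length = max(len(evt.key) for evt in events)
    # ... build and return the formatted table using max_name_length ...
    return "widest event name: {} chars".format(max_name_length)

print(repr(build_table([])))  # ''
```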
* Update profiler.py
This adds an unconditional dependency on CUDA, which is not desirable
for the long term. Ideally we would have a split like ATen, where we have
different artifacts for different backends so you can decide at runtime
what to use.
* Make AT_FORALL_SCALAR_TYPES usable outside of at::namespace.
This requires renaming the _cast functions which used the unqualified names.
* Separate onnx mapping of scalar type from cast name.
* Fix flake8.
* Properly cast onnx.
observers_list_ stores all the observers for an observable. The list is allocated on the heap, which
can cause LLC misses. Add an on-stack observer cache for fast access. In production, we have seen a 20%
speed up for start and stop observer calls.
* Add memory leak check in CUDA tests
* Tracking multi-GPU too
* fix run_test.py not running __name__ == '__main__' content; add test for make_cuda_memory_checked_test
* add a comment
* skip if cuda
* 1. Change the wrapper to a method in common.py:TestCase
2. Refactor common constants/method that initialize CUDA context into common_cuda.py
3. Update some test files to use TEST_CUDA and TEST_MULTIGPU
* Fix MaxUnpool3d forward memory leak
* Fix MultiLabelMarginCriterion forward memory leak
* Fix MultiMarginLoss backward memory leak
* default doCUDAMemoryCheck to False
* make the wrapper skip-able
* use TEST_MULTIGPU
* add align_corners=True/False tests for Upsample; fix TEST_CUDNN
* finalize interface
* VolumetricMaxUnpooling_updateOutput
* fix test_nccl
* rename THC caching allocator methods to be clearer
* make the wrapped function a method
* address comments; revert changes to aten/src/THC/THCCachingAllocator.cpp
* fix renamed var
* Import/export observer symbols for DLL, which fixes the linking error in Visual Studio.
* Add support of all default cmake build types for release to cuda.
* Make THStorage / THCStorage have void* data ptr.
This is the initial step in unifying the ATen and TH tensor representations; the next step is to generate only a single THStorage / THCStorage type.
The major changes here are:
1) data has been renamed to data_ptr and made void* in THStorage/THCStorage.
2) THStorage / THCStorage stores a at::ScalarType representing its data type (This will be useful when we generate a single THStorage/THCStorage).
3) APIs for Accessing the data as a real*:
a) storage->data<real>() -- this does runtime-type checking (checks that the at::ScalarType is correct).
b) storage->unsafeData<real>() -- as above, but no runtime-type checking (used in inner loops / fast code paths).
c) THStorage_(data)(storage) -- this already existed, just calls storage->data<real>().
* Add include.
* Attempt to fix clang build issues.
* Clarify comment and remove extra character.
* Rename unsafeData -> unsafe_data.
* Remove unnecessary 'to' function to get compile time rather than link time errors.
* Raise error when torch.load a storage on a non-existing device
Before, doing torch.load(...) on a CUDA tensor on a CPU-only machine
would raise an unreadable error:
```
~/pytorch/pytorch/torch/cuda/__init__.py in __enter__(self)
223 if self.idx is -1:
224 return
--> 225 self.prev_idx = torch._C._cuda_getDevice()
226 if self.prev_idx != self.idx:
227 torch._C._cuda_setDevice(self.idx)
AttributeError: module 'torch._C' has no attribute '_cuda_getDevice'
```
This PR makes it so that torch.load raises a hard error if one tries to
load a storage onto a non-existing device, and suggests that the user use
torch.load's map_location feature.
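For example, on a CPU-only machine the checkpoint can be remapped at load time (standard torch.load usage; the file name is illustrative):
```
import torch

# Remap all storages (including CUDA ones) to CPU while loading.
state = torch.load("checkpoint.pt", map_location="cpu")

# Equivalent callable form: map_location receives (storage, location) pairs.
state = torch.load("checkpoint.pt",
                   map_location=lambda storage, loc: storage)
```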
* Address comments
* missing dep
* Handling of scalars in torch.Size
torch.Size() constructor uses python_arg_parser
IntList in python_arg_parser can take iter/range
Have IntList take python iterables and ranges.
Address comments: don't use python_arg_parser and instead call __index__ in THPSize_pynew
Address comments
Address comments
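A small sketch of the intended behavior, assuming the __index__-based conversion described above (the Dim class is purely illustrative):
```
import torch

class Dim(object):
    """Any object exposing __index__ should be usable as a size entry."""
    def __init__(self, value):
        self.value = value
    def __index__(self):
        return self.value

print(torch.Size(range(2, 5)))        # torch.Size([2, 3, 4])
print(torch.Size([Dim(2), Dim(3)]))   # torch.Size([2, 3])
```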
* Rebased
* Address nit
* pad-sequence no longer requires sorting entries
pad_sequence can get the max_len from the list of sequences. Entries only need to be sorted if the output will be used for pack_padded_sequence, which can throw the error itself.
* remove sort requirement from pad-sequence
Picks up from #5974.
Removes the requirement that input sequences to pad_sequence have to be
sorted. Addressed the comments in the PR:
- Updated docstring for pad_sequence
- Remove sort requirement in pad_sequence test
- Test unsorted and sorted sequences in pad_sequence test
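A quick example of the relaxed behavior (standard pad_sequence usage):
```
import torch
from torch.nn.utils.rnn import pad_sequence

# The sequences are deliberately not sorted by length.
seqs = [torch.ones(2), torch.ones(5), torch.ones(3)]
padded = pad_sequence(seqs, batch_first=True)
print(padded.shape)  # torch.Size([3, 5])
```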
* Test if ASAN is actually working as part of ASAN tests.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Drop explicit use of libstdc++, we should not care.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Build with DEBUG=1
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Increase main thread stack size when using ASAN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
This provides a bare-minimum MPI Process Group implementation; the commit is on top of @pietern's Gloo Process Group PR.
* [c10d] MPI Process Group Implementation
ref: https://github.com/pytorch/pytorch/issues/7434
* Better exception, atexit func, and addressed comments
* Clang formatting changes
* Static initialization and addressed comments
* Added constness back
* Test will now launch mpi processes if found
* CMakeList Changed
* [mpscnn] MPSCNNChannelShuffle
att
* [Easy] Adding tags as an argument to the functional layer
Without it "tags" would be added as an argument to the operator.
The change here is based on the assumption that there is no operator that takes "tags" as an argument.
* Fix locally_connected_op schema check.
Fix locally_connected_op schema check.
* [C2] Add TypeAndShape inference for few more operators
As desc
* [c2] Shape inference should support 0 as dimension
Tensors can have 0 in their dimension.
* Make MockHiveReader loop over and support max_examples
Replace DatasetReader with RandomDatasetReader,
so that Mock Hive Reader can simulate a large data input using a small sample file as the source.
* Utility function to wipe cache between benchmark runs
Caffe2 benchmark does not wipe out the cache between runs, and this potentially creates an unrealistically optimistic picture of performance. This diff adds a utility function to wipe out the cache.
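A pure-Python sketch of the idea (not the Caffe2 utility itself): stream through a buffer larger than the last-level cache so previously cached data is evicted. The 64 MB size is an assumption, not a measured LLC size.
```
def wipe_cache(llc_size_bytes=64 * 1024 * 1024):
    """Rough cache wipe: touch one byte per 64-byte cache line of a large buffer."""
    buf = bytearray(llc_size_bytes)
    total = 0
    for i in range(0, len(buf), 64):
        buf[i] = (buf[i] + 1) & 0xFF
        total += buf[i]
    return total

wipe_cache()
```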
* Allow caffe2 GlobalInit to be invoked multiple times
Allow caffe2 GlobalInit to be invoked multiple times. Will re-parse gflags and update logging levels on successive invocations, but will not re-run init functions or perform other one-time initialization.
* Add Caffe2 GlobalInitIsCalledGuard to base net and operator classes
Warn if caffe2's GlobalInit function has not been invoked before creating an operator or net object. This is based on discussion here: https://fb.quip.com/kqGIAbmK7vNG
* Rethrow current exception on failure
Rethrow current exception instead of copy constructing a new one on op failure.
* Make `clone()` return subclass of List/Struct
`clone()` is not working correctly when we subclass those classes
* Wipe the cache before the net run
the util function is copied from D7409424
will rebase once D7409424 is landed.
* [Caffe2] [Mobile] Support utils/cast.h::GetCastDataType with LITE_PROTO builds
* Correct includes
async_polling include -> async_base include
* Prepare execution flags for executor migration
Making async_scheduling aware of underlying net type to prepare for executor
migration
* Add operator level observers into async executor
Adding operator level observers into RunAsync operators' calls
* Cleanup TEST_Benchmark
Remove duplicate code and provide default implementation in NetBase
* [C2] Fix type and shape inference for binary comparison ops
As desc.
* Add GlobalInit to predictor to ensure initialization is always done before prediction
FACEBOOK:
Redo D7651453 the correct way.
Now use a static variable for the arguments passed to GLog
* Remove spammy log message
This method is currently used in various places inside Caffe itself.
* Disable events for operators inside a chain
We don't need to use events in operators within a chain because the chain is
always scheduled on a single stream, keeping only the first and last events for
scheduling purposes
* Ensure correct finish run order
In rare cases we might call finishRun and trigger the net's destruction while
another worker is still holding a shared_ptr to a thread pool; that can cause
thread pool destruction from within a worker thread in case no other nets are
using the pool. This diff fixes the order of calling finishRun and also changes
pool() to return a raw pointer, to keep the pool's ownership within the net.
* Reduce unnecessary polling
Make sure we don't waste CPU by polling operators that we can set efficient
callbacks on
* Squash commit of syncing 9506eeb from github to fbcode
Patch xplat buck fix
add virtual destructor to OptimizationPass
add virtual destructor to OptimizationPass
build fixes for sync
build fixes for sync
* Fix net tracing
Fix net tracing from async_scheduling
* Fix logging
It's going to define a static variable, and this was a loaded
footgun if another C++ file directly included this header.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Revert "Fix error when setting multiple arch in TORCH_CUDA_ARCH_LIST (#7879)"
This reverts commit 45cdb63d8b8022ab26f073d3bed718e75d2aedaf.
* Disable dirty test; always run all CI runs.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Remove templatization of PyTypeObject in THP copy storage methods.
An in-progress refactoring of THStorage is collapsing the types of THStorages to not be ScalarType-specific.
The relevant PyTypeObject to use for the THPStorageType is currently templatized based on the current THStorage;
this doesn't work if the ScalarType is collapsed. Instead, just pass it explicitly.
* Pass src type instead of dst type.
* Line up columns.
* Avoid @generated in templates.
We want @generated only in the build products. Otherwise, templates are
locked and changes to the templates are excluded from phabricator.
Also adds @generated to autograd generated files (e.g.
VariableType.cpp).
See #7780
* Don't try to specify the template filename in generated comment
The template filename is not always the same as the generated filename.
* Make TensorMethods (fastGetSet) not depend on data type of Storage.
Currently, fastGetSet is implemented as macros that depend on the data type of Storage (i.e. that storage->data is real*).
Since we are moving to having 'void*' data this won't work in the future.
Also, due to the recent C/C++ split, these are actually C++ implementations (because they require the struct definition, which is C++),
so we move them to a generic .hpp file and implement them as static inline functions.
* Fix set functions.
* Add generic to CMakeLists.
* Not running ATEN tests on Caffe2 builds
* Keeping test directory when only aten is built
* Changing to run all aten tests too
* Skipping directories again
* .
* .
* skip aten/integer_divider_test (it hangs for unknown reason)
* Implement nn.Sequential that can be inlined into script modules
* fix bugs
* add comment
* add _ConstSequential class
* add script_method for forward in ConstSequential
* fix build bug
* refactor
* Add backward() to Tensor and Variable
* Add at:: in front of Tensor
* Trying to not move optional to appease windows?
* Move implementation into cpp file
* Undo some formatting changes
* Have PyTorch depend on minimal libcaffe2.so instead of libATen.so
* Build ATen tests as a part of Caffe2 build
* Hopefully cufft and nvcc fPIC fixes
* Make ATen install components optional
* Add tests back for ATen and fix TH build
* Fixes for test_install.sh script
* Fixes for cpp_build/build_all.sh
* Fixes for aten/tools/run_tests.sh
* Switch ATen cmake calls to USE_CUDA instead of NO_CUDA
* Attempt at fix for aten/tools/run_tests.sh
* Fix typo in last commit
* Fix valgrind call after pushd
* Be forgiving about USE_CUDA disable like PyTorch
* More fixes on the install side
* Link all libcaffe2 during test run
* Make cuDNN optional for ATen right now
* Potential fix for non-CUDA builds
* Use NCCL_ROOT_DIR environment variable
* Pass -fPIC through nvcc to base compiler/linker
* Remove THCUNN.h requirement for libtorch gen
* Add Mac test for -Wmaybe-uninitialized
* Potential Windows and Mac fixes
* Move MSVC target props to shared function
* Disable cpp_build/libtorch tests on Mac
* Disable sleef for Windows builds
* Move protos under BUILD_CAFFE2
* Remove space from linker flags passed with -Wl
* Remove ATen from Caffe2 dep libs since directly included
* Potential Windows fixes
* Preserve options while sleef builds
* Force BUILD_SHARED_LIBS flag for Caffe2 builds
* Set DYLD_LIBRARY_PATH and LD_LIBRARY_PATH for Mac testing
* Pass TORCH_CUDA_ARCH_LIST directly in cuda.cmake
* Fixes for the last two changes
* Potential fix for Mac build failure
* Switch Caffe2 to build_caffe2 dir to not conflict
* Cleanup FindMKL.cmake
* Another attempt at Mac cpp_build fix
* Clear cpp-build directory for Mac builds
* Disable test in Mac build/test to match cmake
* Skip some tests to unbreak CI
* Pass the opset_version to run_node
* Remove the stale check_graph call, caffe2_net_to_onnx_model will invoke check_model
* Add hip support for caffe2 core
* Add MIOPEN header/wrapper to caffe2 core
* Add HIP device into caffe2 PB
* top level makefile change for rocm/hip
* makefile scaffolding for AMD/RocM/HIP
* Makefile scafodding for AMD/RocM/HIP; add makefile/utility for HIP files
* caffe2 PB update for AMD/ROCM HIP device
* Add AMD/RocM/Thrust dependency
* HIP threadpool update
* Fix makefile macro
* makefile fix: duplicate test/binary name
* makefile clean-up
* makefile clean-up
* add HIP operator registry
* add utilities for hip device
* Add USE_HIP to config summary
* makefile fix for BUILD_TEST
* merge latest
* Fix indentation
* code clean-up
* Guard builds without HIP and use the same cmake script as PyTorch to find HIP
* Setup rocm environment variables in build.sh (ideally should be done in the docker images)
* setup locale
* set HIP_PLATFORM
* Revert "set HIP_PLATFORM"
This reverts commit 8ec58db2b390c9259220c49fa34cd403568300ad.
* continue the build script environment variables mess
* HCC_AMDGPU_TARGET
* Cleanup the mess, it has been fixed in the latest docker images
* Assign protobuf field hip_gpu_id a new field number for backward compatibility
* change name to avoid conflict
* Fix duplicated thread pool flag
* Refactor cmake files to not add hip includes and libs globally
* Fix the wrong usage of environment variables detection in cmake
* Add MIOPEN CNN operators
* Revert "Add MIOPEN CNN operators"
This reverts commit 6e89ad4385b5b8967a7854c4adda52c012cee42a.
Pull new revision of NNPACK which specifies non-executable stack in assembly files. Previous revision didn't do that, and depending on toolchain could cause linker to mark stack as executable for the linked binaries.
This is a starting point and only implements allreduce for CPU tensors. It includes most base functionality like algorithm caching (similar approach as taken in the THD GlooCache) and multi-threaded execution (new).
The expectation is that function calls on the process group class are globally serialized. They execute collective functions, so members of the collective must call the same functions in the same order, or a deadlock may happen.
The algorithm cache works as follows: the ProcessGroupGloo class has a cache map from algorithm keys to algorithm entries. The algorithm key is a struct with fields that make up the signature of a collective function. It includes the dimensionality of the input/output tensors, tensor device assignment, source/destination rank, etc. For collective calls with the same key, the process group will lazily initialize and then cache a Gloo algorithm instance. For now we only keep a single algorithm instance per key, but this may be revisited in the future, if we observe contention on a single key and can exploit additional parallelism.
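A rough Python sketch of the caching idea (the real implementation is C++ inside ProcessGroupGloo; all names here are illustrative):
```
class AlgorithmCache(object):
    """Lazily create and reuse one algorithm instance per collective signature."""
    def __init__(self, create_fn):
        self._create = create_fn
        self._cache = {}

    def get(self, op, tensor_sizes, devices, src_rank=None, dst_rank=None):
        # The key mirrors the signature of the collective call.
        key = (op, tuple(tensor_sizes), tuple(devices), src_rank, dst_rank)
        if key not in self._cache:
            self._cache[key] = self._create(key)
        return self._cache[key]

cache = AlgorithmCache(lambda key: object())  # stand-in for a Gloo algorithm
algo = cache.get("allreduce", [(4, 4)], ["cpu"])
assert algo is cache.get("allreduce", [(4, 4)], ["cpu"])  # cache hit
```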
* Change backward calls to grad to avoid memory leak from #7343; replace unnecessary create_graph=True with retain_graph=True
* fix gradgradcheck use of make_non_contiguous
* allow non-contiguous target
* remove unnecessary .grad.zero_()
* remove contiguous_detach
* fix PReLU double backward always returning ggW as a scalar
* let noncontig gO require grad
* move requires_grad to return
* Fix handling of empty batches in SumReduceDimsOp
As titled
* Deferrable async_scheduling finishRun fix
Proper order of finishing run operations in deferrable_async_scheduling net
* Simplify exception handling in async_scheduling
Simplify exception handling, no need to busy wait, thread that processes the
last task can finish the run
* [C2]worker_coordinator_memorize_worker_ids
As titled. This is related to T28689868, where the number of blobs we want to create is equal to the number of worker ids
* Add unit test for nets with no type set
* Ignore total length argument in symbolic_pad_packed_sequence
1) There was a mistake in the code: total_length was added to the wrong symbolic function (pack_padded_sequence) instead of (pad_packed_sequence).
2) There is no need to throw an exception if total_length is given, since it is only used to enable data_parallel training on multiple GPUs and has nothing to do with ONNX export, so just ignore it. https://fburl.com/tk4gciqp
* Add support for MKLDNN to async_scheduling
Just add MKLDNN as a possible CPU option to async_scheduling's pool function
* [AuFL][ensemble] support branch output for prediction
This diff supports using predictions from different branches and thus enables model ensembling (not fully independent).
* Fix a bug in add_loss in layer_model_helper
As titled.
* Support lradaption for adam
1. lr adaption operator
2. apply to dense adam
* Perf tweaks for async_scheduling
Restore single pool option + remove unnecessary (no-ops) calls
* add quantization to SparseSimdAdagradOp
add a bunch of quantization signatures to SparseSimdAdagradOp, implementations to come next
* [sr] [codemod] Change all SR callsites to use new API
@allow-large-files
This diff refactors all callsites of SR to use the slightly changed API introduced in the diff below. Really what this means is that you need to include the correct header. Also if you were using `ClientFactory::newFactory` you need to not prefix it with `ClientFactory::`.
```
cd ~/fbsource/fbcode
find ./ -type f -exec sed -i -e 's:#include "servicerouter/client/cpp2/ClientFactory.h":#include "servicerouter/client/cpp2/ServiceRouter.h":' -e 's:#include <servicerouter/client/cpp2/ClientFactory.h>:#include <servicerouter/client/cpp2/ServiceRouter.h>:' -e 's/ClientFactory::newFactory(/newFactory(/g' {} \;
```
Also manually fixed spots that couldn't be done automatically (or broke because they depended on transitive includes).
* Back out "Fix handling of empty batches in SumReduceDimsOp"
Original commit changeset: 282da1730cc2. This commit is blocking the
Github->fbcode sync, which really needs to get merged ASAP. D7881937 which this
diff depends on will be reverted in the sync D7990948 which causes this to
break. The sync diff cannot be patched with this reversion because it must be
landed against base revision 5c8c099, and D7881937 must not be included in the
sync diff because it is breaking GPU tests that are not available in sandcastle
: https://ci.pytorch.org/jenkins/job/caffe2-builds/job/py2-cuda8.0-cudnn6-ubuntu16.04-test/3638/console
for one example.
* Add the flow to support operator benchmark
1) generate model with the operator 2) upload to everstore 3) generate model spec into json file 4) start running the benchmark
* [tum][gpu] Connect DPM trainer with flow and unit tests
This diff:
- Fix some small bugs for Yiming's recent changes to parallelizer, so it suits real use cases.
- Add correct tags to the TUM code, so we can do data parallel transform
- pass extra info at instantiation.
- add unit test for using DPM in TUM model
After this diff, we can do simple box, multi-gpu fully-sync trainer for TUM in Fblearner workflow, but may still need to do speed benchmarking.
* w/o normalized lradaption for adam dense only
The previous lr adaption includes a normalization step when performing the dot product operation. This is not exactly the same as what is proposed in the paper. I add normalization as an option. Without it, the operator does exactly what the paper proposed. With the option, we add the normalization step.
* [fb] Use SharedPromise in DeferrableAsyncSchedulingNet
This code is to simplify DeferrableAsyncSchedulingNet by removing condition
variable + small fixes
* [tum] implement cuda sparseLengthsMean and LengthsMean
as title
* Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.
Adding an optional parameter to allow use of protobufs in InferShapesAndTypes function.
* Move feature_to_index to FeatureSpec.feature_to_index
move feature_to_index to FeatureSpec.feature_to_index to avoid override other fields
* [Caffe2] Rename bytes_moved to bytes_written
Just a rename in preparation for supporting bytes_read.
* [c2] fix ReduceFrontSumOp for empty case by setting 0
otherwise, it may use the results from the last iteration when the batch is empty.
* [Caffe2] [Int8] Improve Intel CPU performance
* [Easy] Improve PrependDim op logging
as titled
* DBFileReader expand db_path using os.path.expanduser(..)
Since there are many possible use cases where `DBFileReader` reads from a user home path, like `~/local/sample.db`, this saves people the trouble of calling `os.path.expanduser(db_path)` themselves.
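In other words, the reader now does the equivalent of this for you:
```
import os

db_path = "~/local/sample.db"
# DBFileReader expands the user home directory internally, equivalent to:
print(os.path.expanduser(db_path))  # e.g. /home/<user>/local/sample.db
```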
* [Caffe2] Add bytes_read to cost structure
We're adding analytical read bytes to cost functions. This extends the structure accordingly for all CostInference defined operators.
Additionally, some small bug fixes were performed:
1) Cost functions now extract type information of operands instead of assuming float
* Fix sleef on aarch64 for hhvm
@bypass-lint
Rename flag
* Remove duplicated part in caffe2/ideep/operators/conv_op.cc
should be a sync error
* Rename test helper function test_adagrad_sparse_helper to adagrad_sparse_test_helper to avoid confusing pytest
* Fix various sparse transpose issues; remove dead code from Declarations.yaml.
1) Fixes some checks in t_, transpose_ that don't allow transposing empty sparse tensors.
2) Remove out= variants from docs since they don't exist (and haven't since at least v0.3.1).
3) Unify implementations of t_, transpose_, t, transpose.
4) Move dead checking code from Declarations.cwrap to actual implementations.
5) Fix test which never tested transpose_.
* Add test for error with t, t_.
* Address review comments.
* Fix jit tests.
* Fix test_jit.
* Don't allow requires_grad to be set on integer Tensor constructors in tensor_new.
* Fix autograd test.
* Fix test_distributions.
* Fix test_jit.
* Fix NN tests.
Right now, if we add a zero-filled sparse tensor with another sparse
tensor, both tensors must have the same "density" (dimI, dimV) and size
(tensor.size()) for them to be added successfully. This relaxes that
constraint so that if both tensors have the same tensor.size() and at
least one is zero-filled, they can be added successfully.
Before:
```
i = torch.LongTensor([[0, 1, 1], [2, 0, 2]])
v = torch.FloatTensor([3, 4, 5]).unsqueeze(1)
sparse_mat = torch.sparse.FloatTensor(i, v, torch.Size([2,3,1]))
zeros = torch.zeros(sparse_mat.size(), layout=torch.sparse_coo)
sparse_mat + zeros
RuntimeError: cadd operands have incompatible sizes or dimension types at ../src/THS/generic/THSTensorMath.c:126
```
After: no error.
Compilers used to report a warning:
caffe2/core/net_async_tracing.cc: In member function 'void caffe2::tracing::Tracer::renameThreads()':
caffe2/core/net_async_tracing.cc:210:32: warning: overflow in implicit constant conversion [-Woverflow]
const long numa_multiplier = 10e9;
This patch fixes it.
* Makes accumulate_grad functions high priority in backwards passes
* Delegating constructor and comments
* Sequence_nr ain't pretty no more
* Sequence_nr ain't pretty no more
* Implemented fused builder based construction mechanism
* "weights" -> "weight"
* Use int64_t instead of size_t everywhere in RNN
* Extracted Conv::ExpandingSize into its own thing
* Rename TORCH_PARAMETER to TORCH_ATTR
* Added documentation
* Fix weight names in batchnorm module
Reference: https://github.com/pytorch/pytorch/issues/7434
* C10D: Added TCPStore to support C10D store interface
* Used pipe to terminate the store daemon and addressed all comments
* Used notify/wake for wait and addressed all comments
* Clean up nits
* Clean up all socket states when the socket is closed
* Adding LBFGS to cpp API
* Adding stop conditions
* Test cases now passing and adding closure to all algs
* Addressing code review
* Set seeds to make optim tests more deterministic
* Reduce gen_jit_dispatch options
This removes the power set of options generated for IntList[k] arguments
in aten_dispatch. Instead, the compiler now performs the broadcast using
schema information. This substantially cuts the compile time for aten_dispatch.cpp
* Make return uniform in lbfgs step
This ensures that we are returning results of the same type
in LBFGS step.
* Adding test case to exercise different exit points
Sets the tolerance_grad to negative infinity and positive
infinity to deterministically exercise the early exit branch
* Fixing lint error
* Fix python3.6 build in caffe2 CI
* Turn off onnx protobuf type stubs generation
* Revert "Turn off onnx protobuf type stubs generation"
This reverts commit 618b80911a316caa69f2d774fb12ae6b24b2a6d6.
Android unit tests failed to link because libnnpack and libcpuinfo appeared in the linker command line before libcaffe2. This patch somehow fixes it.
Fixes #7502.
Test Plan: build and test
Build output has this:
```
-- Checking prototype magma_get_sgeqrf_nb for MAGMA_V2 - True
-- Compiling with MAGMA V2 support
-- MAGMA INCLUDE DIRECTORIES: /data/users/rzou/miniconda3/include
-- MAGMA LIBRARIES: /data/users/rzou/miniconda3/lib/libmagma.a
```
* PyTorch AMD Build Script.
* Python invocation for hipify
* Adding individual hip fles.
* Updating CWD
Use the actual path for the file instead of the current working directory, which depends on where the script is invoked.
* Updating folder path for amd_build
* Removing previous amd_build directory
* Updated setup.py to support WITH_ROCM
* Renaming the files for CuDNN BatchNorm & Conv since having two .cpp files with the same name results in a linking error in the HCC compiler used for ROCm/AMD.
* Removing old BatchNorm & Conv files since they've been renamed.
* Updating build path to handle ROCM
* Cleaned up the build path and created a FindHIP cmake file for setting up relevant hip paths.
* Separated the individual patch files to make it easier to detect issues while building.
* Removed CMakeLists hip files and fixed directory structure
* Adding build pytorch amd script
* Merged setup patch into PyTorch setup.py & cleaned a few issues
* Added information on where to download the hipify-python script.
* Resolved linting issues inside of build_pytorch_amd.py
* Removing many unnecessary patch files. Removing unnecessary .hip files. Fixing up the build process.
* Refactored the PR for supporting HIP
* Minimizing the number of changes inside individual patches.
* Cleaned up patch files.
* Removed patch files.
* Updating patches
* Removing HIP change from file.
* Cleaned up patches
* Added AVX/SSE avoidance due to a bug in the ROCm stack. Just temporary for now.
* Removing the other HIP file
* Removed patch file + merged ROCm into Aten/test
* Removed ATen tests patch file and updated the disable_features yaml to remove headers that don't exist on the HIP stack.
* Reduced the number of patches down to 14 after Edward's suggestions.
* Transferred deletion of certain functions from patch to yaml file.
* Set default Thrust path
* Fixed aten files so we now use the templated pow/abs instead of std:: directly.
* Removed error from aten/src/THCUNN/Abs.cu
* Updated the locations of the cmake build files. Moved THCTensorRandom from a hip to a patch file. Added executable/library commands that can successfully handle either CUDA or HIP.
* Removed hip extraction from the build script and removed the old hip file.
* Replaced MACRO with function in upper level cmake.
* Added empty ELSE() block to prevent the loading of a command without CUDA or HIP. Also added IF guards around torch_cuda_based_add_executable in Aten tests.
* Updated aten tests.
* Removed the hip include from the ATen header.
* Can't throw exceptions on C++ AMP, using abort
* Missing IF guards for cuda/hip executables in aten tests.
* Removed a series of patch files.
* Added template keyword to help out the HCC compiler.
* Rebased the specific files displayed in the PR
* Fixing typo.
* Change flag from "WITH_CUDA" to "NOT NO_CUDA"
Replacing "WITH_CUDA" with "NOT NO_CUDA" after the rebase.
* Fix LoadHIP path
* Updating build files after rebasing.
* Reorganization after cpu/gpu separation.
* Removed HIPCC from setup.py & removed -shared extra linking args.
* Updated CMake / Setup build to correctly link when under ROCm stack.
* Removed the unnecessary argument from Extension constructor.
* Adding another test to be included with ROCm building.
* Updated the setup_helpers scripts in order to get around linter error
* Fix syntax issue
* Solving lint issue: line too long
Running sccache in foreground mode seems to uniformly slow down the builds and causes virtual memory exhausted errors for gcc7.2 builds. This PR moves sccache to background mode instead and prints the compilation log at the end of the build.
* Run onnx integration tests in caffe2 CI
* verbose log
* turn off onnx verbose installation log
* can not install ninja
* Do not use all cores to build pytorch
* install tests require
* pip install to user dir
* use determined path to improve (s)ccache hit
* Do not change path in test.sh
* Add the compile cache hit trick to conda install as well
* cover jenkins in CI environment detection
This PR uses Vec256 to vectorize the softmax and logsoftmax layers.
This comes in 4 steps:
log_softmax
softmax
log_softmax_backward
softmax_backward
* Vectorized Softmax and LogSoftmax
* Abstractions
* Style
* Remove <limits> for Kernel
* Perf investigations
* Last cleanups
Improve script builtin checking using schema
* This adds aten_schema.h, which provides a barebones amount of type and
argument information about each builtin operator
* emitBuiltinCall is updated to use this information rather than
aten_dispatch to ensure the operator is correct.
* handling of keyword and position arguments now matches python behavior
* There is no longer a requirement that kwargs be constant or that the
attributes of an op must be entirely constant or non-constant
* compiler now constructs a non-attributed version of the op first and
then turns it into the constant-attribute version if all attributes
are constants.
* default arguments for builtins now work
* SugaredValue::call and similar functions now have SourceRange information
for their arguments so that error reporting is more accurate
Notes:
* This does not try to merge the builtin checking with python arg parser.
Given that we will eventually have a C10 schema which will replace aten_schema,
we will eventually have a C++ description of the schema, and working off that
description directly will be the easiest form to understand.
* python function calls and script method calls do not support keyword arguments yet.
When we add this support we should refactor the handling in tryEmitSchema
that resolves keywords into a common function.
* default arguments work
* keyword arguments to builtins work (still need to extend to calling python and other script methods)
* much better error reporting for incorrect builtins
Lift any constants to attributes on nodes when possible
* Schema is usable internally in the compiler as
the function signatures of script functions as well as for builtin
operators.
* Adds a List[T] class to better represent the arguments to cat/stack
as a type rather than with custom checking.
* Support kwargs for calls of script methods
A future commit will be needed to add support for:
* calls to script _functions_, which currently are GraphExecutors without schema info.
* kwargs to python functions, which will require refactoring python op
* fix for #7532: clamping the return value of uniform.cdf() to the range [0,1]
* removed whitespace around equals to pass flake8 tests
* added a test for uniform.cdf() with arguments outside support
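A short example of the clamped behavior (standard torch.distributions usage; exact values follow from the clamp):
```
import torch
from torch.distributions import Uniform

d = Uniform(0.0, 1.0)
values = torch.tensor([-0.5, 0.25, 1.5])
# With the clamp, cdf() stays in [0, 1] even for arguments outside the support.
print(d.cdf(values))  # tensor([0.0000, 0.2500, 1.0000])
```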
This PR makes two improvements:
It fixes reduce kernels where accum type != type. Currently, for example, half tensors with small values may have norms that are (approximately) representable in fp16, but calling .norm() on them will result in underflow and a reported norm of zero. This PR fixes that behavior and adds a test in test_cuda.py to ensure underflow does not occur (test_tiny_half_norm).
It simplifies all reductions by removing excessive templating and the -2 contiguous special case from THC_reduceDim and THC_reduceAll. The latter was previously removed from pointwise apply. This has no performance impact as the -2 special case was already mapping to the 1D code path.
PyTorch currently attempts to handle accum type != type by either (1) writing kernels that immediately convert values to accum type after reading or (2) writing operations that take in type values and accumulate to the accum type. The latter path was not working properly (hence the current excessive half tensor underflow) and resulted in a lot of redundant code, with two reduce ops being passed to a kernel instead of one, and reduce ops frequently receiving the same template argument twice.
This PR makes the former approach THE approach. Kernels that accumulate to (potentially) different types should follow the pattern of converting their input to the accum type, performing all operations on that type, and then converting back to the appropriate type if writing their value back to the tensor. This pattern makes the second reduce op redundant and allows for simpler templating, which should improve readability, reduce build time, and reduce binary size. Also, this prevents ops from having to perform their own conversions, which could result in poor performance if the same value was operated on multiple times.
One exception to this simplification was that a new ThrustTensorDistOp was created to handle a call to thrust::inner_product(). This Op fuses the conversion and the TensorDistOp.
In addition to the expected simplification, there is also some cleanup of excessive template parameters. For example, kernelReduceAllPass2() had three template parameters: T, IndexType, and ReduceOp, but IndexType was never used.
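A small illustration of the underflow case described above (requires a CUDA device; the exact values are illustrative):
```
import torch

# Each square (1e-8) underflows in fp16 (smallest subnormal ~6e-8), so a norm
# accumulated in half precision would collapse to zero; accumulating in float
# recovers the ~3.16e-3 result, which is representable in fp16.
x = torch.full((1000,), 1e-4).half().cuda()
print(x.norm())
```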
* wip
* Adds tests
* Fixes Python linting
* mean and norm fusions, code cleanup
* fixes file permissions
* Built-in support for rebuilding in win-build.sh
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* fixups
Signed-off-by: Jenkins <jenkins@ci.pytorch.org>
* CR comments
* CR comments
* more delayed expansion fixes
* Updates collapseDims() function and documentation
* Adds C++ tests, validates input, updates names for readability
* Removes invalid test
* stashing to merge AT_CHECK macro
* Updates asserts, removes tests on Windows
* Fix advanced indexing with negative indices
Fixes #7156
Here is some behavior before this PR:
```
In[1]:
x = torch.arange(9).view(3, 3).contiguous()
x[[0], [-1]] # Should be equivalent to x[0, -1]
Out[1]:
tensor([ 8])
```
The bug is that negative indices are added to the computed linear index
directly. In the above example, the linear index computed is "-1", which
wraps around to "8", giving the last element of a flattened view of `x`.
Instead, we should wrap negative indices around before adding them to
the linear index.
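With the fix, fancy indexing with a negative index matches plain indexing:
```
import torch

x = torch.arange(9).view(3, 3)
# The negative index now wraps within its own dimension before the linear
# index is computed, so both lines select the same element.
print(x[[0], [-1]])  # tensor([2])
print(x[0, -1])      # tensor(2)
```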
* Use toCLong()
Previously, CUDAGenerator::CUDAGenerator would initialize the random
number generator on the current device. This would usually be device 0.
This is undesirable because initializing the CUDA context allocates a few
hundred MB due to all the kernels in libTHC.so.
This avoids the unnecessary call to THCRandom_getGenerator() in the
CUDAGenerator constructor.
Fixes #7320
* Fix call to get THCState
* Move ONNX integration tests from onnx-fb-universe to PyTorch repo
* Switch to use torchvision
* Delete single rnn operator tests, they have been covered in e2e tests in test_caffe2.py
* Mirror the fix in onnx-fb-universe to bypass cuda check
667326d84b
* this removes the flag controlling whether the interpreter works on variables.
* now the interpreter _always_ works on variables
* constants in the IR are still _always_ non-variables, and an assert was added to ensure this.
* as_tensor was split into as_variable and as_tensor since it is sometimes used
to construct constants in the IR
* I tried changing the IR to also always use variables but that change was much more
cross cutting and fragile and I never got it working
* [bootcamp] Improve "Shape" operator to support axes specification
Improve the "Shape" operator of Caffe2 to support x.shape(tensor, axes), which takes an optional int array "axes" as input. For example, x.shape(tensor, [1, 0]) will return the dimensions for axis 1 and axis 0, following the specified order. In the current version, the "axes" input allows duplicates and can have arbitrary length.
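A plain-Python sketch of the semantics (not the Caffe2 operator API): select shape entries by the given axes, with duplicates and arbitrary order allowed.
```
def shape_with_axes(shape, axes=None):
    """Return the full shape, or only the entries selected by `axes`."""
    if axes is None:
        return tuple(shape)
    return tuple(shape[a] for a in axes)

print(shape_with_axes((2, 3, 4)))          # (2, 3, 4)
print(shape_with_axes((2, 3, 4), [1, 0]))  # (3, 2)
print(shape_with_axes((2, 3, 4), [2, 2]))  # (4, 4) -- duplicates allowed
```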
* Back out "Add barrier net that runs before training nets"
Original commit changeset: b373fdc9c30f. Need additional changes to some callers to support barrier failures.
* Change warning to verbose log to reduce log spam
The `LOG(WARNING)` was a bit spammy for regular use so lets just make it a `VLOG`.
* Extract the shared code from different caffe2_benchmark binaries
The OSS benchmark and Internal benchmark will share most functions in the benchmark.
* Support MFR in sequence training
As titled.
* Make knowledge distillation work with using logged prediction feature as teacher label.
1) Add loading raw dense feature as teacher label.
2) Optional calibration function for teacher label
3) Add teacher label into generic unit test
4) Deprecated TTSN workflow version using feature_options to config teacher label
* [C2/CUDA]: unjoined cross entropy sigmoid
as desc
* Add async_scheduling executor into deferrable_net_exec_test
Add async_scheduling into tests and fix some exception cases
* Fix Event disabled error
When disabling event in RNN ops make sure we don't call Finish on disabled
event from op's RunAsync
* cuda ensure cpu output op can handle both TensorCPU and TensorCUDA
as desc.
* [C2 Core] Infer input device option in C2 hypothesis_test checkers
Improve how we default input blob device options.
Previously it defaulted to where the op lives, but that is not necessarily the case.
For example:
CopyCPUToGPU
* [C2 Op]SplitByLengthsOp CPU/GPU implementation
[C2 Op]SplitByLengthsOp CPU/GPU implementation
* fix undefined symbol error
not sure why we're getting undefined symbol even with link_whole = True
Need to figure out why but need this workaround for now
* Add tools in DAIPlayground platform to help debugging models
Add additional tools to allow Playground to override individual methods defined in AnyExp. This will allow users to create modules that specifically change certain default method behavior. An example included in this diff is deactivating test model and checkpointing. When debugging any model problems, switching off components helps me quickly narrow down the location of the bug. The technique is extensively used in task T27038712 (Steady memory increase in EDPM, eventually resulting in gloo/cuda.cu:34: out of memory)
* add shape and type inference for int8 conversion operator
* Fix flaky test for group_norm
Fix flaky test for group_norm
* Fix group_norm_op_test flaky
Fix group_norm_op_test flaky
* Implementation of composite learning rate policy
In many state-of-the-art deep learning works, people use a simple trick to
schedule the learning rate: use a fixed learning rate until error plateaus
and then switch to a different fixed learning rate, and so on. In this diff,
we implemented a simple version of the composite learning rate. The user gives
a set of learning rate policies and corresponding iteration numbers, and the
optimizer will change the learning rate policy based on the number of iterations so far.
For example, the user gives two learning rate policies, FixedLearningRate
and PolyLearningRate, with an iteration number of 1k. Then for the first 1k iterations,
we use FixedLearningRate. For the following iterations, we use PolyLearningRate.
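A pure-Python sketch of the scheduling idea (not the Caffe2 operator; the policy functions and iteration counts are illustrative):
```
def composite_lr(iteration, policies):
    """policies: list of (num_iters, lr_fn); switch policies as iterations accumulate."""
    start = 0
    for num_iters, lr_fn in policies:
        if iteration < start + num_iters:
            return lr_fn(iteration - start)
        start += num_iters
    # Past the last boundary: keep the last policy's final value.
    num_iters, lr_fn = policies[-1]
    return lr_fn(num_iters - 1)

fixed = lambda it: 0.1
poly = lambda it: 0.1 * (1.0 - float(it) / 9000.0)
schedule = [(1000, fixed), (9000, poly)]
print(composite_lr(500, schedule))   # 0.1   (FixedLearningRate phase)
print(composite_lr(5500, schedule))  # 0.05  (PolyLearningRate phase)
```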
* Split two use cases of CachedReader into two classes, DBFileReader and CachedReader
# Use Cases:
1). input: DB file -> output: DatasetReader.
Use DBFileReader.
2). input: Reader -> build cache DB file -> output: DatasetReader.
Use CachedReader.
# Changes to CachedReader:
1). Move db_path to the constructor, because in the mock reader the cache will always be built ahead of time.
# Changes to tests:
1). Make a separate TestCase class for CachedReader and DBFileReader.
2). Make it possible to add more test functions by adding setUp, tearDown and _make_temp_path.
3). Make delete db_path more general. `db_path` could be a file for `log_file_db`, but could also be a directory for `leveldb`.
* Back out "On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization"
Original commit changeset: 4489c6133f11
* Fix LARS bug
Fixed a bug in the LARS implementation which caused all subsequent blobs not using LARS to have the LARS learning rate multiplier applied to them.
* [tum] support sparse init & add uniformFill option
as title
* Propagate exception for async nets
Capture the exception when an exception is thrown in async nets and re-throw it after wait(). This allows exceptions to be propagated up to the caller.
This diff was a part of D7752068. We split the diff so that C2 core files changes are in a separate diff.
* Automatic update of fbcode/onnx to 69894f207dfcd72d1e70497d387201cec327efbc
Previous import was 403ccfbd0161c38f0834413d790bad0874afbf9a
Included changes:
- **[69894f2](https://github.com/onnx/onnx/commit/69894f2)**: Use op schema.all tensor types in random like definitions (#865) <Scott McKay>
- **[b9d6b90](https://github.com/onnx/onnx/commit/b9d6b90)**: Clarify random like operators (#846) <Scott McKay>
- **[fc6b5fb](https://github.com/onnx/onnx/commit/fc6b5fb)**: Refactor shape inference implementation (#855) <anderspapitto>
- **[b7d8dc8](https://github.com/onnx/onnx/commit/b7d8dc8)**: fix cmake warning message (#863) <Eric S. Yu>
- **[f585c5d](https://github.com/onnx/onnx/commit/f585c5d)**: add pytorch-operator test for tile (#831) <Wenhao Hu>
- **[993fe70](https://github.com/onnx/onnx/commit/993fe70)**: add install step (#832) <Eric S. Yu>
- **[68bc26c](https://github.com/onnx/onnx/commit/68bc26c)**: add type inference for traditional ml ops except classifier ops. (#857) <Ke Zhang>
- **[9cc0cda](https://github.com/onnx/onnx/commit/9cc0cda)**: fix string representation of scalar types (#858) <G. Ramalingam>
- **[1078925](https://github.com/onnx/onnx/commit/1078925)**: fix y in pow test case to scalar (#852) <Wenhao Hu>
- **[c66fb6f](https://github.com/onnx/onnx/commit/c66fb6f)**: Add some math function shape inference (#845) <anderspapitto>
- **[ff667d1](https://github.com/onnx/onnx/commit/ff667d1)**: Refactor return type and docs for ONNXIFI_BACKEND_DIRECTX_ID (#853) <Marat Dukhan>
- **[11c6876](https://github.com/onnx/onnx/commit/11c6876)**: clear initializer names when clear initializer (#849) <Wenhao Hu>
- **[73c34ae](https://github.com/onnx/onnx/commit/73c34ae)**: Clarify FeatureVectorizer description. (#843) <Scott McKay>
- **[1befb9b](https://github.com/onnx/onnx/commit/1befb9b)**: Remove useless text in docs (#850) <Lu Fang>
- **[e84788f](https://github.com/onnx/onnx/commit/e84788f)**: Fix SELU attributes' default values (#839) <Lu Fang>
- **[ebac046](https://github.com/onnx/onnx/commit/ebac046)**: Add tile test case (#823) <Wenhao Hu>
- **[8b7a925](https://github.com/onnx/onnx/commit/8b7a925)**: a few more shape inference functions (#772) <anderspapitto>
- **[9718f42](https://github.com/onnx/onnx/commit/9718f42)**: Make the coefficient non optional for LinearClassifier (#836) <Jaliya Ekanayake>
- **[ef083d0](https://github.com/onnx/onnx/commit/ef083d0)**: Add save_tensor and load_tensor functions for Protos (#770) <Lu Fang>
- **[45ceb55](https://github.com/onnx/onnx/commit/45ceb55)**: Check if CMAKE_BUILD_TYPE set before project(). (#812) <Sergii Dymchenko>
- **[4b3d2b0](https://github.com/onnx/onnx/commit/4b3d2b0)**: [WIP] reenable shape inference tests (#834) <anderspapitto>
- **[22d17ee](https://github.com/onnx/onnx/commit/22d17ee)**: RNN tests: LSTM, GRU, SimpleRNN (#739) <Peyman Manikashani>
- **[de65b95](https://github.com/onnx/onnx/commit/de65b95)**: dimension denotation (#443) <Tian Jin>
- **[eccc76e](https://github.com/onnx/onnx/commit/eccc76e)**: fix field number issue in onnx operator proto and enable its build (#829) <Ke Zhang>
- **[d582beb](https://github.com/onnx/onnx/commit/d582beb)**: disable shape inference test to unbreak ci (#830) <Lu Fang>
- **[485b787](https://github.com/onnx/onnx/commit/485b787)**: function proto for composite op. (#802) <Ke Zhang>
- **[cd58928](https://github.com/onnx/onnx/commit/cd58928)**: specify defaults for attributes of Affine op (#820) <G. Ramalingam>
- **[7ee2cf9](https://github.com/onnx/onnx/commit/7ee2cf9)**: merge the dummy backend back into the main one (#743) <anderspapitto>
- **[1c03a5a](https://github.com/onnx/onnx/commit/1c03a5a)**: [Proposal] ONNX Interface for Framework Integration (previously ONNX Backend API) header and docs (#551) <Marat Dukhan>
- **[3769a98](https://github.com/onnx/onnx/commit/3769a98)**: Rename real model test case from VGG-16 to ZFNet (#821) <Lu Fang>
* [C2]ReluN Op
relu n op.
tf reference: https://www.tensorflow.org/api_docs/python/tf/nn/relu6
* Call destructor when assigning a blob value
* Add executor overrides
Add executor overrides flag to enable migration to async_scheduling executor
* Add barrier net that runs before training nets - attempt #2
Add a synchronize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before starting training. This reduces the chances of the faster shards timing out during GLOO AllReduce.
Removed the explicit data_parallel_model.py.synchronize call in the holmes workflow.
This change was landed previously but caused errors for some EDPM workflows - see https://fb.facebook.com/groups/1426530000692545/permalink/1906766366002237/ - because EDPM assumes any call to CreateOrCloneCommonWorld and Gloo ops are wrapped in exception handlers, but in this case the exception thrown in the barrier init net is not handled.
To address this issue, we add _CreateOrCloneCommonWorld to the param_init_net instead of a new barrier init net. Since errors in the param_init_net run are handled gracefully with re-rendezvous, this should fix the problem.
* Handle empty nets in async_scheduling
Make sure we don't get stuck on empty nets
* use CUDA_ARCH for conditional compile
* [C2 fix] infer function for ensure_cpu_output_op
* Update group_norm test to reduce flaky test
* Fix lr_multiplier for GPU
The file store implementation is new and based on the file
initialization method (which uses a single file and file locking) and
the interface of the Caffe2 store handler.
See #7434.
When tracing we record expand nodes. This is useful in some cases because
it makes it clear a broadcast happened. However, in future runs
the broadcast may be different or not needed. This change adds an
attribute to expand to track if it was implicitly added. This
takes the form of an unused input to expand with a default value.
The execution engine then removes implicit expands before execution.
Note that shape_analysis will re-add expands when it can prove by
shape analysis that they will exist and this is useful for the fuser,
so this change should not affect fusion passes.
* Split libATen.so into libATen_cpu.so and libATen_cuda.so
Previously, ATen could be built with either CPU-only support, or
CPU/CUDA support, but only via a compile-time flag, requiring
two separate builds. This means that if you have a program which
indirectly uses a CPU-only build of ATen, and a CPU/CUDA-build of
ATen, you're gonna have a bad time. And you might want a CPU-only
build of ATen, because it is 15M (versus the 300M of a CUDA build).
This commit splits libATen.so into two libraries, CPU/CUDA, so
that it's not necessary to do a full rebuild to get CPU-only
support; instead, if you link against libATen_cpu.so only, you
are CPU-only; if you additionally link/dlopen libATen_cuda.so,
this enables CUDA support. This brings ATen's dynamic library
structure more similar to Caffe2's. libATen.so is no more
(this is BC BREAKING)
The general principle for how this works is that we introduce
a *hooks* interface, which introduces a dynamic dispatch indirection
between a call site and implementation site of CUDA functionality,
mediated by a static initialization registry. This means that we can continue
to, for example, lazily initialize CUDA from Context (a core, CPU class) without
having a direct dependency on the CUDA bits. Instead, we look up
in the registry if, e.g., CUDA hooks have been loaded (this loading
process happens at static initialization time), and if they
have been we dynamic dispatch to this class. We similarly use
the hooks interface to handle Variable registration.
We introduce a new invariant: if the backend of a type has not
been initialized (e.g., its library has not been dlopened; for
CUDA, this also includes CUDA initialization), then the Type
pointers in the context registry are NULL. If you access the
registry directly you must maintain this invariant.
There are a few potholes along the way. I document them here:
- Previously, PyTorch maintained a separate registry for variable
types, because no provision for them was made in the Context's
type_registry. Now that we have the hooks mechanism, we can easily
have PyTorch register variables in the main registry. The code
has been refactored accordingly.
- There is a subtle ordering issue between Variable and CUDA.
We permit libATen_cuda.so and PyTorch to be loaded in either
order (in practice, CUDA is always loaded "after" PyTorch, because
it is lazily initialized.) This means that, when CUDA types are
loaded, we must subsequently also initialize their Variable equivalents.
Appropriate hooks were added to VariableHooks to make this possible;
similarly, getVariableHooks() is not referentially transparent, and
will change behavior after Variables are loaded. (This is different
to CUDAHooks, which is "burned in" after you try to initialize CUDA.)
- The cmake is adjusted to separate dependencies into either CPU
or CUDA dependencies. The generator scripts are adjusted to either
generate a file as a CUDA (cuda_file_manager) or CPU file (file_manager).
- I changed all native functions which were CUDA-only (the cudnn functions)
to have dispatches for CUDA only (making it permissible to not specify
all dispatch options.) This uncovered a bug in how we were handling
native functions which dispatch on a Type argument; I introduced a new
self_ty keyword to handle this case. I'm not 100% happy about it
but it fixed my problem.
This also exposed the fact that set_history incompletely handles
heterogeneous return tuples combining Tensor and TensorList. I
swapped this codegen to use flatten() (at the possible cost of
a slight perf regression, since we're allocating another vector now
in this code path).
- thc_state is no longer a public member of Context; use getTHCState() instead
- This PR comes with Registry from Caffe2, for handling static initialization.
I needed to make a bunch of fixes to Registry to make it more portable
- No more ##__VA_ARGS__ token pasting; instead, it is mandatory to pass at
least one argument to the var-args. CUDAHooks and VariableHooks pass a nullary
struct CUDAHooksArgs/VariableHooksArgs to solve the problem. We must get rid of
token pasting because it does not work with MSVC.
- It seems MSVC is not willing to generate code for constructors of template
classes at use sites which cross DLL boundaries. So we explicitly instantiate
the class to get around the problem. This involved tweaks to the boilerplate
generating macros, and also required us to shuffle around namespaces a bit,
because you can't specialize a template unless you are in the same namespace as
the template.
- Insertion of AT_API to appropriate places where the registry must be exported
- We have a general problem which is that on recent Ubuntu distributions,
--as-needed is enabled for shared libraries, which is (cc @apaszke who was
worrying about this in #7160 see also #7160 (comment)). For now, I've hacked
this up in the PR to pass -Wl,--no-as-needed to all of the spots necessary to
make CI work, but a more sustainable solution is to attempt to dlopen
libATen_cuda.so when CUDA functionality is requested.
- The JIT tests somehow manage to try to touch CUDA without loading libATen_cuda.so. So
we pass -Wl,--no-as-needed when linking libATen_cuda.so to _C.so
- There is a very subtle linking issue with lapack, which is solved by making sure libATen_cuda.so links against LAPACK. There's a comment in aten/src/ATen/CMakeLists.txt about this as well as a follow up bug at #7353
- autogradpp used AT_CUDA_ENABLED directly. We've expunged these uses and added
a few more things to CUDAHooks (getNumGPUs)
- Added manualSeedAll to Generator so that we can invoke it polymorphically (it
only does something different for CUDAGenerator)
- There's a new cuda/CUDAConfig.h header for CUDA-only ifdef macros (AT_CUDNN_ENABLED, most prominently)
- CUDAHooks/VariableHooks structs live in at namespace because Registry's
namespace support is not good enough to handle it otherwise (see Registry
changes above)
- There's some modest moving around of native functions in ReduceOps and
UnaryOps to get the CUDA-only function implementations into separate files, so
they are only compiled into libATen_cuda.so. sspaddmm needed a separate CUDA
function due to object linkage boundaries.
- Some direct uses of native functions in CUDA code has to go away, since these
functions are not exported, so you have to go through the dispatcher
(at::native::empty_like to at::empty_like)
- Code in THC/THCS/THCUNN now properly use THC_API macro instead of TH_API
(which matters now that TH and THC are not in the same library)
- Added code debt in torch/_thnn/utils.py and other THNN parsing code to handle
both TH_API and THC_API
- TensorUtils.h is now properly exported with AT_API
- Dead uses of TH_EXPORTS and co expunged; we now use ATen_cpu_exports and
ATen_cuda_exports (new, in ATenCUDAGeneral.h) consistently
- Fix some incorrect type annotations on _cudnn_rnn_backward, where we didn't
declare a type as possibly undefined when we should have. We didn't catch this
previously because optional annotations are not tested on "pass-through" native
ATen ops (which don't have dispatch). Upstream issue at #7316
- There's a new cmake macro aten_compile_options for applying all of our
per-target compile time options. We use this on the cpu and cuda libraries.
- test/test_cpp_extensions.py can be run directly by invoking in Python,
assuming you've set up your PYTHONPATH correctly
- type_from_string does some new funny business to only query for all valid CUDA
types (which causes CUDA initialization) when we see "torch.cuda." in the
requested string
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Last mile libtorch fixes
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* pedantic fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Add name() to C++ modules
* Use RTTI to get module name by default
* Add functional.cpp to CMakeLists.txt
* Call typeid() inside name() instead of constructor
* Add tests and use default constructor
In Maratyszcza/NNPACK#140 @daquexian reported an error on a Faster-RCNN model with MobileNet V2 when running with the NNPACK engine. The error disappears when using the latest NNPACK and cpuinfo. Updating the submodules upstream to ensure others don't hit this issue.
* [ONNX] Allow specifying only a subset of input/output names
Then we can only specify the "real" names while ignoring the names for all the parameters
* fix
* Update utils.py
* Replace incorrect usages of "NotImplemented"
Fixes #7266. Replaces "NotImplemented" (which is supposed to be used for
binary ops) with the correct "NotImplementedError". A short sketch of the
distinction follows this change list.
* Address comments
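For context, a minimal sketch of the distinction (the class and method names here are purely illustrative):

```python
class Vec:
    def __init__(self, data):
        self.data = data

    def __add__(self, other):
        # NotImplemented is a sentinel *returned* from binary ops so that
        # Python can fall back to the other operand's reflected method.
        if not isinstance(other, Vec):
            return NotImplemented
        return Vec([a + b for a, b in zip(self.data, other.data)])

    def cross(self, other):
        # NotImplementedError is an *exception* raised for unimplemented
        # features, which is what the fixed call sites should use.
        raise NotImplementedError("cross product is not implemented")
```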
* Add batched linear solver to torch.gesv()
Fixes #3164
Picks up from #4502
I moved `gesv` to ATen.
Adds bindings for MAGMA's `gesv_batched` function for CUDA.
For CPU, runs `THLapack(gesv)` in a for loop.
The new function supports arbitrary batch dimensions (and broadcasting
of those dimensions). For example, the 4-d tensor `A x B x M x M` should
be treated as having batch-size `(A x B)`.
The overhead of creating the magma_queue_t is ~350,000 microseconds the first
time it's called and ~6 microseconds every time after that. (A usage sketch
follows the change list below.)
* Tests and docs
* Address comments
* Address comments
* Rebase
* Address comments
* Fix rebase
* Addressed comments
* Address comments
* Address comments
* Addressed comments
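A usage sketch of the batched solve, using the torch.gesv API as it existed at the time of this PR (shapes are arbitrary and chosen for illustration):

```python
import torch

# A batch of 2 x 3 independent 5x5 systems, each with 4 right-hand sides.
A = torch.randn(2, 3, 5, 5)
B = torch.randn(2, 3, 5, 4)

# Solves A @ X = B for every system in the batch; batch dimensions broadcast.
X, LU = torch.gesv(B, A)
print(X.shape)  # torch.Size([2, 3, 5, 4])
```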
This lets aten::expand be differentiable in TorchScript. It was probably
omitted from the list by accident in the past because gradientForNode already
supports aten::expand.
Also adds a test to check expand and its gradient in a torchscript fn.
* Make ATen buildable without all Caffe2 by root cmake
* Fix typo in aten cmake
* Set BUILD_ATEN from USE_ATEN as compat
* Only set BUILD_ATEN from USE_ATEN when on
* Have USE_GLOO only set when BUILD_CAFFE2
* Generic fuse conv relu pass for nomnigraph
* Use it in NNPACK conversion
* Comments
* Change the postprocess interface to take node instead of conv op
* Pinning conda-numpy to 1.14 to avoid SVD issue
* Adding another leveldb test to conda's ignored tests, removing a mkl-test from this
* Removing commented out section
The schema.Scalar class makes pretty strict assumptions (via its docstring)
on the spec of the shape of its underlying object. Because of idiosyncrasies
of numpy indexing and the use of np.dtype, those assumptions are broken on an
edge case (dtype = (scalar_type, 1)). This corrects the behavior of this
edge case to conform to the spec.
* Rename autograd namespace to torch and change torch.h into python.h
* Pave the way for torch::nn::Module
* Reorganize module code structure
* Undo ONNX update
* Remove sleef submodule
* ENH: add to method for PackedSequence
* ENH: return self if possible
* TST: remove extra data
* DOC: add more explanation
* TST: remove extra data
* DOC: minor fix
Apparently get() is a function of requests, not a module (not sure if in
the past get() used to be a module). Therefore, the syntax in #3280 will
always fail with ImportError, and the requests lib will never be used (which
kind of defeats the purpose of that pull request).
Also, if the requests lib is used, we should add the stream=True parameter;
otherwise requests.get() will load the whole response into memory.
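A sketch of the streaming download pattern described above (the URL and filename are placeholders):

```python
import requests

url = "https://example.com/checkpoint.pth"   # placeholder
destination = "checkpoint.pth"               # placeholder

# stream=True keeps the whole response from being loaded into memory at once.
response = requests.get(url, stream=True)
response.raise_for_status()
with open(destination, "wb") as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:  # skip keep-alive chunks
            f.write(chunk)
```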
* Clarify patience in ReduceLROnPlateau docs
It's unclear which definition of patience we have. The two ways to
interpret it are:
- How many bad epochs can you see before you start considering changing the learning rate.
- How many bad epochs can you see before you change the learning rate.
This PR clarifies the docs with an example. If `patience = 2`, then after
2 bad epochs we begin considering changing the learning rate. If the next
epoch (the 3rd) is also bad, then we change the learning rate after it.
(See the sketch after this list.)
* address comments
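A minimal sketch of the clarified behavior (the model, optimizer, and stagnating validation losses are placeholders):

```python
from torch import nn, optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1)
# patience=2: two bad epochs are tolerated; if a third consecutive epoch is
# also bad, the learning rate is reduced after that epoch.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=2)

for epoch, val_loss in enumerate([1.0, 1.0, 1.0, 1.0]):  # loss never improves
    scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]['lr'])  # lr drops on the last epoch
```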
* move softmax/logsoftmax to ATen
* specify cpu and gpu accum types
* use accreal for CPU
* expose softmax backward to python, fix legacy interface
* fix Distributions.cu to use common AccumulateType
* fix cuda 8 build
* delete commented out lines
* rebase on master, fix breakages
* Double-dispatch copy.
In order to split ATen's CPU/CUDA code into two separate libraries
which don't require a build flag (AT_CUDA_ENABLED) to separate them,
we need to be able to split source files based on whether or not they
handle CPU functionality only, or also touch CUDA. Copy poses a unique
challenge here, because the naive implementation involves writing
a matrix for all combinations of CPU/GPU in a single file.
This PR splits up Copy.cpp into CPUCopy.cpp and CUDACopy.cpp, respecting
the following matrix:
| to \ from | CPU          | CUDA         |
| --------- | ------------ | ------------ |
| CPU       | CPUCopy.cpp  | CUDACopy.cpp |
| CUDA      | CUDACopy.cpp | CUDACopy.cpp |
When you run x.copy_(y) where x is CPU and y is CUDA, we do a second
virtual dispatch to copy_from(y, x) on y's type, so that we can get
from CPUCopy.cpp to CUDACopy.cpp
The new autogenerated code for CPU looks like this:
```c++
Tensor & CPUByteType::s_copy_(Tensor & dst, const Tensor & src, bool non_blocking) const {
  // code generated by copy_wrapper
  checked_cast_tensor<CPUByteTensor>(dst.pImpl, "dst", 0, false);
  switch (src.type().ID()) {
    case TypeID::CPUByte:
      THByteTensor_copyByte(static_cast<CPUByteTensor*>(dst.pImpl)->tensor, static_cast<CPUByteTensor*>(src.pImpl)->tensor);
      break;
    case TypeID::CPUChar:
      THByteTensor_copyChar(static_cast<CPUByteTensor*>(dst.pImpl)->tensor, static_cast<CPUCharTensor*>(src.pImpl)->tensor);
      break;
    ...
    default:
      return src.type().s_copy_from(src, dst, non_blocking);
```
Notice that the fall through goes to s_copy_from. s_copy_from is like s_copy
but the arguments are reversed.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Lintfix and no-CUDA fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix compilation error.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Rename autograd namespace to torch and change torch.h into python.h
* Include torch.h instead of python.h in test/cpp/api
* Change some mentions of torch.h to python.h in C++ extensions
* Set paths directly, without find_path
This makes the JIT tracer much more robust, by allowing it to record
dependencies on tensor sizes. For example, if you were to trace this
function
def fn(x):
return x.view(x.size(1), -1)
before this patch, then it would embed the actual value of x.size(1)
in the trace as a constant, making it very hard to have e.g. batch size
independent traces. Now, this will correctly record the dependency, and
will retrieve the size of x at every run.
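A sketch of the behavior, using the present-day torch.jit.trace(fn, example_inputs) signature (which may postdate this commit):

```python
import torch

def fn(x):
    return x.view(x.size(1), -1)

traced = torch.jit.trace(fn, torch.randn(2, 3, 4))

# Because the size dependency is recorded rather than baked in as the
# constant 3, the trace also works for a different input shape.
out = traced(torch.randn(5, 7, 2))
print(out.shape)  # torch.Size([7, 10])
```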
* Refactor reduce ops to take flexible input types
* Add DISPATCH_FUNCTION macros in common_gpu.h
* Use macros to reduce switch case in dispatching cuda functions
AT_ASSERT is an internal, PyTorch specific error, so we should
give a little more debug information (than with the ordinary
errors.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
```
JIT_ASSERT(v->setUnique(x)->uniqueName() == x);
```
This works by changing any other value in the graph with name x to a
different name. This mirrors llvm behavior and is useful when you
want to ensure some names have particular values.
* Remove stale THD README
* Move common THD dependency into THD/base
The master_worker directory now no longer contains files that are
needed for building other parts of THD.
* [fix] Re-enable events in RNN ops
We have earlier added event disabling in RNN ops as back then we didn't use
events, with current use cases this is no longer true
(https://fburl.com/8vd0lp8y)
* use ops with cude impl
* Revert D7729695: [caffe2][fix] Re-enable events in RNN ops
This reverts commit 4b215c7496fb724656ff4c776933a15bdbbcde5e
@bypass-lint
An infra SEV is better than not reverting this diff.
If you copy this password, see you in SEV Review!
@cause_a_sev_many_files
* [observer] Clean up observer_config.h
#accept2ship
* [1/n] Refactor dataio_test.py
Replace code duplication with a common function
* Add barrier net that runs before training nets
Add a synchronize barrier net that is run before training nets. With this net, shards that are faster will wait for other shards before they start training. This reduces the chances of the faster shards timing out during GLOO AllReduce.
Removed explicit data_parallel_model.py.synchronize call in holmes workflow. Similar change in speech/asr_training workflow will come in another diff.
* Support the dnnlowp backend in caffe2_benchmark
This is for SHARE operator latency evaluation
* Migrate integral_image_op to main caffe2
migrate integral_image_op(GPU version) given by https://fburl.com/yvqezigi
to caffe2/caffe2/operators and implement its CPU version. Write up a test
using the hypothesis_test mechanism
* [pos_disc, fbcode] Implement unjoined lr loss
As explained in https://our.intern.facebook.com/intern/wiki/Model_Based_Calibration/, when the dataset is a joined data set, where labels might change later, we need to use unjoined logloss.
The implementation is almost the same as in Sigrid (https://fburl.com/1trngsls), where
loss = y (log(p) - log(1-p)) + (1-y)(log(1-p)) = xy - (1-y)x - (1-y)log(1+exp(-x))
For x < 0, to ensure stability and avoid overflow, we reformulate the above expression as
loss = xy - (1-y)x + (1-y)x - (1-y)log(1+exp(x)) = xy - (1-y)log(1+exp(x))
Then the final expression becomes
loss = xy + (y - 1) x (x >= 0) - (1 - y) log(1 + exp(x - 2 x (x >= 0)))
where y is the true label, x is the dot product and p = logistic(x).
This implementation is aligned with the current implementation of the original cross entropy in
https://phabricator.intern.facebook.com/diffusion/FBS/browse/master/fbcode/caffe2/caffe2/operators/cross_entropy_op.cc;0bae3b5d0f825897c5e0dd0ff10f489d7271bf25$7-13
(A NumPy sketch of the stable form follows this change list.)
* Keep the array to fix the conflict
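A NumPy sketch of the final stable expression above (the function name is illustrative, not the Caffe2 operator itself):

```python
import numpy as np

def unjoined_lr_loss(x, y):
    # x is the logit (dot product), y the observed label; p = logistic(x).
    pos = (x >= 0).astype(x.dtype)
    return x * y + (y - 1) * x * pos - (1 - y) * np.log1p(np.exp(x - 2 * x * pos))

x = np.array([-30.0, 0.5, 30.0])
y = np.array([0.0, 1.0, 1.0])
print(unjoined_lr_loss(x, y))  # no overflow even for large |x|
```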
* [C2] Compute Adagrad effective LR
The AdagradWithLR op outputs an extra blob which contains the average effective learning rate across all weights in this blob.
* Open-source extractMetaNetDef & runGlobalInitialization, add new Predictor constructor from db file, and add run_map_outputs
1. Open-source extractMetaNetDef and runGlobalInitialization, for use in (2).
2. Add a new Predictor constructor from a db file.
3. Add a new run function that returns outputs as a TensorMap.
* Disable eigen cpu
Disable eigen cpu in transpose and reduce
* Introduce request_only/object_only property of ModelLayer
by default this is False
* A simple TC Caffe2 benchmark
We can run the tuner, get MappingOptions, and then use them to
compare against cuBLAS. Currently broken due to LLVM issues. How to run:
hg checkout eec1ab31b59c03b8deded1c755a9abaf8c45be01
add D7401202
add D7434625
add D7506031
add D7540728
buck run @mode/dev-nosan tc/tc/benchmarks_python:caffe2_benchmark
* Move Caffe2 feature_maps_ops to open source
Need feature maps operators in open source project facebookresearch/BlueWhale
* Manually fix the conflicts in channel shuffle op
* Fix the inconsistency between different gh and fbcode
* Skip Adagrad GPU Test (Because some gpu implementation is missing)
* Fix another test to make sure it won't run on gpu when implementation is not available yet
These changes are already handled, either in native functions or via resize specifications in Declarations.cwrap.
The resize_ one is technically not handled, although in TH it is checked if the storage is actually reallocated; this is less strict, but seems okay.
* Add moments op in caffe2
* Use rsqrtf in float for group_norm
* Add docs for default behavior when axes is not provided.
* Update group_norm_op by using Eigen::sqrt on CPU
* Generate code without setup.py for C++ build
* Move code generation to CMake
* Set DEPENDS files correctly
* Fix some errors in codegen
* Fix blank line lint
* Implement torch.as_tensor, similar to numpy.asarray.
torch.as_tensor behaves like torch.tensor except that it avoids copies if possible; it is also somewhat like tensor.new, but without the size overloads. (A usage sketch follows this change list.)
I didn't add a requires_grad field, because we haven't decided on the semantics, such as as_param.
* Remove requires_grad for doc.
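A small sketch of the difference (assuming a NumPy array whose dtype Torch can share directly):

```python
import numpy as np
import torch

a = np.array([1.0, 2.0, 3.0])

t_copy = torch.tensor(a)      # always copies
t_view = torch.as_tensor(a)   # shares memory with `a` when dtype/device allow

a[0] = 42.0
print(t_copy[0].item())  # 1.0  -- unaffected by the mutation
print(t_view[0].item())  # 42.0 -- sees the change through the shared buffer
```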
Enables more warnings in the C++ API build.
Fixed a bunch of things in torch/csrc/.
Mostly taken from c10
* Enable -pedantic for C++ build
* Enable more warnings
* Include CUDA and library headers with -isystem
* Fix sign-promo warning
* Make AT_ASSERT/AT_ERROR non-printf based, other tweaks
- AT_ASSERT/AT_ERROR don't take printf strings anymore; instead,
they take a comma-separated list of things you wanted to print
(bringing it inline with Caffe2's conventions).
Instead of AT_ASSERT(x == 0, "%d is not zero", x)
you write AT_ASSERT(x == 0, x, " is not zero")
This is done by way of a new variadic template at::str(), which
takes a list of arguments and cats their string reps (as per
operator<<) together.
- A bunch of the demangling logic that was in Error.h is now
moved to Error.cpp (better header hygiene.) Also, demangle
has been moved out to its own helper function, and also
a new helper demangle_type (from Caffe2) added.
- A bunch of AT_ASSERT converted into AT_CHECK, to more properly
convey which checks can be caused by user error, and which are
due to logic error in ATen.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* CR
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Fix test failure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* buildfix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* More fixes.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* One more fix
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Try harder
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* initial commit for spectral norm
* fix comment
* edit rst
* fix doc
* remove redundant empty line
* fix nit mistakes in doc
* replace l2normalize with F.normalize (a power-iteration sketch follows this list)
* fix chained `by`
* fix docs
fix typos
add comments related to power iteration and epsilon
update link to the paper
make some comments specific
* fix typo
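A rough sketch of the power-iteration step mentioned above, with F.normalize in place of the hand-rolled l2normalize helper (the function name and loop count are illustrative):

```python
import torch
import torch.nn.functional as F

def power_iteration_step(W, u, eps=1e-12):
    # One power-iteration step estimating the largest singular value of W.
    v = F.normalize(torch.mv(W.t(), u), dim=0, eps=eps)
    u = F.normalize(torch.mv(W, v), dim=0, eps=eps)
    sigma = torch.dot(u, torch.mv(W, v))
    return u, v, sigma

W = torch.randn(4, 6)
u = F.normalize(torch.randn(4), dim=0)
for _ in range(20):
    u, v, sigma = power_iteration_step(W, u)
print(sigma.item(), torch.svd(W)[1][0].item())  # estimate vs. true spectral norm
```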
Right now, the bottleneck test_utils.py tests assume that a user's
python executable is 'python'. This may not be the case especially if
the user has multiple versions of python installed. This PR changes it
so that test_utils.py uses `sys.executable` as the python executable.
* Refactor extractMetaNetDef and runGlobalInitialization into open...
* Fix test by making get output blobs optional
* Update test instead of making output blobs optional
* Dump autogradpp into PyTorch
* Fixed up CMake for autogradpp/C++ API
* Made cereal a submodule
* Change search location of autogradpps mnist directory
* Add test_api to CI
* Download MNIST from the internet instead of storing in repo
* Fix warnings
Adds ability to JIT compile C++ extensions from strings
>>> from torch.utils.cpp_extension import load_inline
>>> source = '''
at::Tensor sin_add(at::Tensor x, at::Tensor y) {
return x.sin() + y.sin();
}
'''
>>> module = load_inline(name='inline_extension', cpp_sources=source, functions='sin_add')
Fixes #7012
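For completeness, a sketch of actually calling the compiled function (this mirrors the example above and needs a working C++ toolchain at runtime):

```python
import torch
from torch.utils.cpp_extension import load_inline

source = """
at::Tensor sin_add(at::Tensor x, at::Tensor y) {
  return x.sin() + y.sin();
}
"""

module = load_inline(name='inline_extension', cpp_sources=source, functions='sin_add')
print(module.sin_add(torch.randn(4), torch.randn(4)))
```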
* Inline JIT C++ Extensions
* jit_compile_sources -> jit_compile
* Split up test into CUDA and non-CUDA parts
* Documentation fixes
* Implement prologue and epilogue generation
* Remove extra newline
* Only create the CUDA source file when cuda_sources is passed
* Add max mode support to EmbeddingBag (see the usage sketch after this list)
* Lint fix
* Fix compilation issue on other platforms
* Rebase + don't waste memory when not in max mode
* Oops, missed a spot
* Fix whitespace from merge
* less precision
* Lower precision to avoid spurious failures
* Minor typo
* Switch to size()
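A usage sketch of the new max mode (sizes and indices are arbitrary):

```python
import torch
from torch import nn

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='max')

# Two bags: indices [1, 2, 4] and [4, 3]; offsets mark where each bag begins.
indices = torch.tensor([1, 2, 4, 4, 3])
offsets = torch.tensor([0, 3])
out = bag(indices, offsets)  # elementwise max over each bag's embedding rows
print(out.shape)             # torch.Size([2, 3])
```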
* Add full impl of GroupNorm
* Fix comments in math.h
* Remove unsed buffers
* Add #include <array> in gpu version
* Remove unused moments_buffer_
* Make inverse std to be a template.
* Add detailed comments
* Add support for dotted names in CPP Extensions
* Modify tests for cpp extensions
Test that dotted names work
* Py2 fixes
* Make run_test cpp_extensions Win-compatible
Changelist:
- Move *.c to *.cpp
- Change includes of ".c" to ".cpp"
- A bunch of cmake configuration modifying CMAKE_C_FLAGS changed
to CMAKE_CXX_FLAGS or add_compile_options, because if you do CMAKE_C_FLAGS it only applies when you compile C code
- Explicitly cast void* to T* in a number of places
- Delete extern "C" { ... } blocks; instead, properly apply TH_API to everything that should have it (TH_API handles extern "C")
- Stop using stdatomic.h; instead, use <atomic>. This resulted in a bunch of placement new/delete in order to be "totally properly correct"
- Refactor of THLongStorageView to not have static constructor methods (since it no longer has a copy/move constructor)
- Documentation about how the TH C interface (and extern C business) works
- Note that THD master_worker mode is dead
- C++ headers in TH libraries are given .hpp suffix, to make it less likely that you'll confuse them with the C-compatible headers (now suffixed .h)
- New function THCStream_stream and THCStream_device to project out fields of THCStream instead of accessing fields directly
- New function THStorage_(retainIfLive), which is equivalent to a retain but only if the refcount is greater than zero.
- In general, I tried to avoid using hpp headers outside of ATen/TH. However, there were a few places where I gave up and depended on the headers for my own sanity. See Note [TH abstraction violation] for all the sites where this occurred. All other sites were refactored to use functions
- Some extra Werror fixes (char* versus const char*)
* Add missing header "caffe2/core/common.h" before "caffe/proto/caffe.pb.h" to provide CAFFE2_API macro.
This only affects the Windows build since CAFFE2_API is only defined for DLL.
* Fix ".pb.h" dependency issue about DLL build.
CAFFE2_API defined in "caffe2/core/common.h" is required by ".pb.h" generated on Windows for DLL build.
We always need to have "#include <caffe2/core/common.h>" before using any proto header.
In this case "caffe2.pb.h" is already included by "context_gpu.h" -> "common_cudnn.h" in the correct order, hence we simply remove a line.
* Enable WERROR in tests
* Also set WERROR=1 for cpp_build in CI
* Enable Werror after the compiler checks
* Remove -DWERROR because it's picked up from the env var
* Had to fix some errors in aten/contrib/data
* Allow an uninitialized variable in ReduceOpsKernel.cpp
* Use CUDNN_DATA_UINT8 in cuDNN type string conversion
* Fixes and use target_compile_options
* Fix uninitialized variables in THNN
* Include Python.h earlier in tensor_types.cpp
* Use CUDNN_VERSION 7100 instead of 7000?
* More Python.h includes
* Make switch case in common_subexpression_elimination.cpp exhaustive
* Build with WERROR=0 just to see all the warnings
* Remove some Python includes
* Enable WERROR=1 again
* Bring back switch case default
* Allow `__constant__` values in a ScriptModule to be used as attributes for builtin functions
* Fix bugs in @script loops
1. while loops run shape propagation multiple times until the shapes have converged.
There were two bugs here. (a) First the 'changed' condition was not checking if it actually
changed the output, and instead would mark changed = true if the two inputs were different.
This is incorrect because the output of the block and the input of the block may always have different shapes.
Now it actually checks if it is about to change the output entry that it is writing to.
(b) expand nodes were being inserted into the graph even inside the while loop body. However, if
we iteratively discover that the input shape to one of these expands is actually dynamic, then
it was incorrect to insert the expand in the first place. This changes it so that we only insert expands
after we have converged on the shapes.
2. the way deleteExtraInputs removed loop-carried dependencies was unsafe because it would look up
Value* elements in the loop body's environment that were previously invalidated when deleteExtraInputs
removed another input to the loop. This changes the way deleteExtraInputs works so that it never has to
read a value out of the loop body's environment to avoid using the invalidated pointers.
* Fix torch.tensor(...) device-type calculation when used with numpy and type inference.
* Fix tensor device type inference as well.
* Better variable type inference: infer cuda-ness only if device is not specified.
Switches the step/direction variable names (steps and directions are flipped
in the current implementation of the two-loop recursion). This change does
not change the numerical output of the program, but should make it easier
to follow.
* Prevent stack overflow on deletion of deep graph
Fixes#5534.
Sometimes one can end up with a very big computation graph of Functions
and Edges. Each std::shared_ptr<Function> contains a list of Edge, and
each Edge contains a std::shared_ptr<Function>. Deleting a
std::shared_ptr<Function> can trigger the recursive deletion of other
std::shared_ptr<Function>'s: this can stack overflow if the graph
is deep enough. Here is an example of such a graph:
shared_ptr<Function> -> Edge -> shared_ptr<Function> -> Edge -> ... -> shared_ptr<Function>
The solution here is to use a custom deleter with each
std::shared_ptr<Function>. The custom deleter keeps track of how many
nested deleters it is in. When this number exceeds the maximum allowed
depth, the Function* to be deleted are accumulated in a per-thread
delete queue and handled by one of the deleters.
Example code that could trigger the overflow (set ``depth`` to something >
100000) is below. I also benchmarked the below code before/after the
changes to see if there are any significant performance differences.
```
import torch
def scope():
    depth = 80000
    x = torch.randn(9, requires_grad=True)
    y = x.clone()
    # build deeply nested computation graph
    for i in range(depth):
        y = y + y * 0.000001
%timeit -n 100 scope()
With changes:
376 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Without changes:
352 ms ± 6.58 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
With the change, the above code is 6.8% slower.
UPDATE: I did some more benchmarking. It looks like it takes 25% more time to free the computation graph in the case of the straight chain graph: https://gist.github.com/zou3519/93cf84d96ae431356ae7f7c1923ef51a
* WIP
* Add custom deleter to PyFunctions created by THPFunction
* Address some comments; pick new value
* Address some more comments
* Add more complicated test; special case the windows depth constant
* Add big warning about averaging to KLDivLoss documentation #6622
Also: An (independent) change in diagonal docstring tensor
formatting.
* Improve note with example
Thank you Richard Zou!
* use log_softmax
* Use Index rather than Long for IntList, so floating-point types convertible to ints fail the parsing.
Basically, our unpackLong code works with floating-point types that are convertible to ints, but this isn't often what you want (because of truncation).
What you actually want is to convert to an index, which will usually find such issues.
I made this the minimal change I could because:
1) I didn't want to change unpackLong because the existing code calls checkLong before unpackLong, so this should be a non-issue most of the time. And fixing this properly requires calling checkLong again, which will slow everything down.
2) An exception above is with IntList, which only checks that 1) it is a tuple or 2) it is a varargs tuple (i.e. torch.ones(1, 2, 3)).
* Fix bug.
* Don't conflict tensor and IntList bindings.
* Change function to be consistent between python 2 and 3.
* Check Index.
* Move IntList overloads in legacy new functions to below Tensor overloads.
* Implement matmul_out and dot_out.
* Fix autograd by only calling _out variants if we have an out ourselves.
* Disallow mismatched types in dot_out.
* Make sure out variant doesn't have a method.
* Do proper type conversion.
* Enhance diagonal
This patch
- adds Tensor.diagonal to complement torch.diagonal
- implements diagonal natively in ATen
- makes diagonal a view
- implements taking arbitrary diagonals
- implements diagonal backward instead of referring
to the (more limited) diag (a usage sketch follows this list)
* add tests, copy diagonal code to backward for double differentiability
* improve tests and doc comment. Thank you, Adam!
* Mark diagonal as view function in gen_autograd.py, use simple backward.
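A usage sketch of the enhanced diagonal (shapes are arbitrary; the key points are the offset/dim arguments and that the result is a view):

```python
import torch

x = torch.randn(3, 4, 5)

# Take the offset-1 diagonal over dims 0 and 2; the remaining dim is kept.
d = x.diagonal(offset=1, dim1=0, dim2=2)
print(d.shape)  # torch.Size([4, 3])

# Because diagonal is a view, in-place writes propagate to the original tensor.
d.zero_()
print(x[0, 0, 1].item())  # 0.0
```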
* Workaround in onnx to get transposes into init_nets
This adds a pass to ONNX so that it can speculate Transpose
operators so that ONNX's split pass can put them into an init_net
Also fixes a potential bug in onnx peephole where an optimization
across blocks might move a Value and violate scoping.
* Perform shape propagation when embedding a program into a trace.
This ensures the trace still has type information specific to that trace, which will help onnx export succeed in more cases.
* onnx export aten::repeat to Tile
* move repeats to input
* turn repeats to a long tensor constant
* deal with the case where the length of repeats is bigger than the number of dims in the input
DEPTHWISE_3x3 engine provides an optimized implementation of depthwise 3x3 convolution, e.g. for ShuffleNet, MobileNets
Implementations exist for CPU (generic), ARM CPU, and CUDA GPU.
Originally developed by @ajtulloch
* Refactor standard_gamma and implement CUDA gamma sampling
* Attempt fixes for AT_CUDA_ENABLED changes
* Gamma cuda and cpu forward as ATen native
* implement standard_gamma_grad_cuda
* update native_test.cpp, try to fix windows and various cuda version compiles
* searching a windows fix via CI... use std:: for math
* casting some constants in the calculation, compute at float for half precision
* whitespace fixes
* add acctype to do half->float computation, include HALF in generation, cast locally rather than tensors
* fix cuda8 half compilation
* always use scalar_cast with CUDACC, lock CPU generator, CPU acctype = double. Thank you for your review comments!
* Added ReLU unit to LP pooling, so the gradient does not become NAN if all inputs are zero.
* Added workaround for odd p. Added a bit of doc.
* Make the linter happy.
* Changes incorrect "overlappingIndices" call to correct "maybeOverlappingIndices"
THE PROBLEM
The current overlappingIndices() is meant to detect if a tensor defines multiple valid indices for the same data element. There are two significant issues with this function:
(1) The algorithm it attempts to implement cannot do this.
(2) That algorithm is not implemented correctly.
This call is used by pointwiseApply() and scatter(). If a tensor is readable/writable and detected as overlapped, these algorithms will create a non-overlapped copy of it to work on. When tensors are improperly identified as overlapped this causes extra work. If tensors are improperly identified as non-overlapped then this would cause the operations to exhibit unexpected behavior.
For example,
ref = torch.arange(0, 32 * 5).view(4, 8, 5).cuda().double()
p = ref[:,:,::2]
p += 1
Results in a call to pointwiseApply1, which detects p as an overlapped tensor (it is not), causing a call to pointwiseApply2 that copies it into a non-overlapped temporary, and then another call to pointwiseApply2 later that copies it back to the original tensor. If, however, the original tensor is given dimensions of (4, 8, 4), instead, it is correctly detected as non-overlapped and only a single pointwiseApply1 call is made.
DISCUSSION + FIX
The algorithm that overlappingIndices() attempts to implement tests for a sufficient but not necessary condition of a tensor to be non-overlapping. That is, if its algorithm were implemented properly then it would be a conservative check that would ensure all overlapped tensors were copied (as desired), but also that some non-overlapped tensors were copied too.
The algorithm can be thought of as trying to test whether the dimensions can be ordered like "nesting dolls," with each dimension fitting within the next one larger than it. If this is true then the tensor is non-overlapping, but if it's false the tensor may or may not be overlapped. For example, a tensor with dims (2, 3) and strides (4, 3) cannot be "nested," but is non-overlapping. (The tensor looks like [[0, 3, 6], [4, 7, 10]].)
The algorithm is currently implemented improperly, as can be seen in the example above. The tensor p has dimensions [4, 8, 3] and strides [40, 5, 2]. This confuses the current implementation, which thinks the innermost dimension needs a stride of 6, which is incorrect. The first row is [0, 2, 4] and the next row begins with 5. The current implementation also improperly implemented its sorting behavior. (qsort comparators require -1, 0, and 1, not true/false return values.)
Fixing the existing algorithm is straightforward (and what this PR does, see below), but it is important to note that the algorithm never performed as intended, so its name and the documentation around it has been updated, too. A natural question is if it's possible to write an efficient overlappingIndices(), and I believe the answer is "no." Disambiguating overlapping from non-overlapping tensors is equivalent to finding a nonzero solution to a linear diophantine equation with restricted coefficients, that is, an equation of the form x_0s_0 + x_1s_1 ... = 0 where s_X is the stride in dimension X and x_X is an integer from [-size_X + 1, size_X - 1].
Another note is that the CPU does not perform this check. For example, if we run:
a = torch.FloatTensor([[0,1], [10, 11]])
b = torch.FloatTensor([[0,0],[0,0]])
b = b.set_(a.storage(), storage_offset=0, size=a.size(), stride=(1,1))
b += 1
Then b is [[1, 3], [3, 11]] because the operation is applied twice to the second element of the original tensor. This causes no warning.
Since the CPU does not perform a similar check, another question is whether the GPU code should remove its check. While it may seem that writing to overlapping tensors is an error state, running test_cuda.py reveals 171 instances of possibly overlapped tensors being copied by pointwiseApply(). (The prior incorrect version has 176 copies.) Allowing writing to overlapped tensors on the GPU may violate assumptions about memory accesses, too. In fairness, these assumptions may be violated on the CPU already.
Leaving the CPU vs GPU behavior question for the future, this fix corrects the current intended GPU behavior. This means that there will be fewer unnecessary copies and no chance of an overlapped tensor sneaking through on the GPU. The CPU behavior remains unchanged. The fix also adds a test to test_cuda.py to ensure that overlapped tensors on the GPU are written to as expected.
* cleanup
* Fixes Python formatting
* Make cuda 9 behave as cuda 8 wrt half conversions
CUDA 9 is too smart about implicit half conversions; this change disables them so that CUDA 8 and CUDA 9 behave in the same way wrt half.
* try fixing windows build
* one more broken conversion
* Statically linking CUDA for Anaconda builds
* typo
* Adding a summary line
* Comments
* Typo fix
* Fix faulty parameter passing
* Removing problem CUDA modules for now
* Fixing unused debugging function
* Turning off static cuda linking until script changes are in
* Disabling mkl
THC had a concept of per-device per-stream scratch space that was
persistent in THCState. This was useful before the caching allocator
because it avoided synchronizations in kernels that needed temporary
scratch space. However, it's not thread-safe since multiple threads can
operate on the same stream: In a two-pass reduction the scratch space
may get clobbered in between the two kernels.
This removes the scratch space and just uses THCudaMalloc and THCudaFree
within the reductions.
I've kept THCState_getCurrentDeviceScratchSpaceSize for now since it's
useful to have the temporary buffer be sized based on the number of SMs.
ATen can be configured to compile without CUDA support by passing
-DNO_CUDA=1 to cmake. However, cmake will look for CuDNN independently
of that flag and may eventually find it. In cases where compilation
without CUDA support was requested on a system with CUDA installed, this
will result in linking errors while building some tests that rely only
on CuDNN being found.
Do not look for CuDNN if -DNO_CUDA=1 was provided in the cmake call
since it does not make sense to compile with CuDNN if CUDA support was
disabled.
* add threshold for ops using omp macro
* modify interface for ops using omp macro
* modify some thresholds
* implement C macros with optional parameters to avoid duplicating definitions for all pointwise operations
* add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing
* modify the comment
* Revert "add a parameter of LAB_IMPLEMENT_BASIC_FUNCTION for vectorizing"
Modify macro LAB_IMPLEMENT_VECTORIZED_FUNCTION to enable optional parameters
This reverts commit 8ef783a0cc67b653c435e64a3beb6866a6b4216d.
Conflicts:
aten/src/TH/generic/THTensorMath.c
* fix build error on windows
* retrigger the test
The long-term fix is to remove the handling-creating pathways and
remove all the modes from PythonOp, making it into an op that simply
calls a PyObject. Right now ONNX expects PythonOp to hold a
nn.Function, not a generic callable, so completely removing the legacy
pathway will also require changes to how ONNX symbolics are found.
* [jit][script] Fix a bug combining sizes/unsized tensors
This add an isSubtypeOf method to reflect that sized tensors are a subtype
of Dynamic[Tensors]. It updates the typechecking code to reflect this
relationship.
* Add index_select to shape prop
* Speed up printing of large tensors.
Instead of deciding on the format based on all of the elements of the tensor, decide based on the elements that will actually be printed.
* Fix flake8.
* Add else case.
Sebastian Messmer noticed that these iterators were writeable by
default, which seemed dangerous. Replaced with const iterators.
This doesn't seem to affect any ATen code; seems reasonable enough.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
- ATen repo now has a new top-level, so Travis script has
to be adjusted to (1) be moved to the top-level and (2)
cd into the aten directory before doing anything.
- Unfortunately, this makes the import script even slower,
because I'm banging on the entire index every commit. If
anyone has better suggestions for how to twiddle the index, let me know.
One possibility is to fold the ATen build into the base
.travis.yml but only activate it when a file is missing
(and then filter out that file).
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
* Track checkpoint performance in scuba
As title.
* [C2/CUDA]: fix cross entropy sigmoid with logits
when adding log_d_trick, I forgot to add it to the cuda impl; this diff fixes
it.
* Back out "[caffe2] Unregister MKL fallbacks for NCHW conversions"
Original commit changeset: 8918dd40205a
Will land after @jongsoo's diff https://phabricator.intern.facebook.com/D7596315 lands
* [Easy][C2] Don't add blob to external outputs from output_record if it's already external output
As desc.
* On Mobile phones, call GlobalInit with no arguments in predictor in case we need to perform initialization
FACEBOOK:
The QPL logger needs the initialization code. In the past, the initialization code was put in the pipeline calling Caffe2. However, those places become obsolete quickly, as the product teams change where they call Caffe2 from time to time. We also need to track which teams use Caffe2 so that we can put the initialization code there.
With this diff, the initialization code is put in the predictor constructor, only enabled for mobile phones. This way, we can always enable QPL logging.
Once we do this, we can check how many times Caffe2 inference is called in production, and which models are more popular in production. This way, we can prioritize our effort supporting those models.
Will clean up the old code calling the init in the product in a separate diff.
* add padding op for sparse length tensor
to pad length-based sparse tensor with padding_value
* Add conv_op with cudaconvnet engine
Add conv_op with cudaconvnet engine
* [numa] Fix simple NUMA copy benchmark
Move XavierFill into init_net and also compute BW
* call roundf (device function) instead of round (host function)
* [caffe2_benchmark][observer] Make caffe2_benchmark use its own observer
1. Add ClearGlobalNetObservers()
2. Make caffe2_benchmark use its own observer and observer_reporter
* [detectron] Use roundf instead of round in the detectron module ops
* allow K larger than number of elements in top k op
one use case is to use this op together with PackSegments for sparse tensors, where the number of elements in each slice is not statically defined.
* add ChannelShuffle DNNLOWP op
* fixup math_cpu.cc break
* Support list and tuple literals: Adds support for [a, b], (a, b), and "a," (a single-element tuple written with a trailing comma)
* Allow non-tensors to reach emitBuiltinCall; each SugaredValue::call
is now responsible for checking the types of its inputs.
Add support for calling cat with a tuple to emitBuiltinOp
This PR makes it so that the collect_env.py tests ignore the least significant
(most minor) number of most version strings. It also bumps the version up to 0.5.0a
to fix the CI.
Reopening #6606 with fix for TEST_CUDA import issue on Windows and improvement to how we wait for manager exit in test_manager_unclean_exit. Loop tested on the Windows CI multiple times to make sure this actually fixes the CUDA OOM issue.
* Terminate dataloader workers properly when parent process is SIGKILL'ed
* Wait for worker processes to finish before shutting down manager process
* Add test for checking proper worker exit
* cosmetic change
* Test only if CUDA exists
* Don't call multiprocessing.set_start_method() in Python 2
* import TEST_CUDA only when we are in __main__
* Tune JOIN_TIMEOUT
* handle os.getppid() == 0 case
* Reset to original JOIN_TIMEOUT
* Use WaitForSingleObject() to check parent process status on Windows
* Fix TEST_CUDA import
* clean up
* Check main process only when index_queue.get() times out
* Change index_queues to multiprocessing.Queue
* Move manager checking logic to watchdog class
* Fix bugs in dataloader
* Fix TEST_CUDA import issue
* Don't import TEST_CUDA from common_nn
* Use event to signal manager exit in test
* fix lint
* Add comments
* Add environment collection script
Fixes #6111. This should make it easier for users to report bugs by giving
them a script to collect system environment information.
Changes include:
- Refactor out the environment collecting code from utils.bottleneck
- Add script (collect_env.py)
- Cleaned up the issues template so that it suggests using the script
and is more readable.
Testing: added expect tests to go with 4 CI configurations. Whenever one
of these configurations gets updated, the test will fail until the test
also gets updated.
* Expect tests
* Update issue template
* Fix random space
* Minor improvement to issue template; fix expect test
* Skip expect test if BUILD_ENVIRONMENT not found; test fix; split off smoke/expect test
Previously we would see errors like:
variable 'states' previously has type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) but is now being assigned to a value of type (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor):
since the default case in the diagnostic printout was "Tensor". This adds a virtual member function to each Type class that returns a human-readable string for better error reporting
* Improve error reporting for tuple type mismatch
* Add better Tensor printout
We like to limit our issues to bug reports and feature requests. If you have a question or would like help and support, please visit our forums: https://discuss.pytorch.org/
If you have a question or would like help and support, please ask at our
[forums](https://discuss.pytorch.org/).
If you are submitting a feature request, please preface the title with [feature request].
If you are submitting a bug report, please fill in the following details.
## Issue description
Provide a short description.
## Code example
Please try to provide a minimal example to repro the bug.
about: Report an issue related to https://pytorch.org/docs
---
## 📚 Documentation
<!-- A clear and concise description of what content in https://pytorch.org/docs is an issue. If this has to do with the general https://pytorch.org website, please file an issue at https://github.com/pytorch/pytorch.github.io/issues/new/choose instead. If this has to do with https://pytorch.org/tutorials, please file an issue at https://github.com/pytorch/tutorials/issues/new -->
about: Submit a proposal/request for a new PyTorch feature
---
## 🚀 Feature
<!-- A clear and concise description of the feature proposal -->
## Motivation
<!-- Please outline the motivation for the proposal. Is your feature request related to a problem? e.g., I'm always frustrated when [...]. If this is related to another GitHub issue, please link here too -->
## Pitch
<!-- A clear and concise description of what you want to happen. -->
## Alternatives
<!-- A clear and concise description of any alternative solutions or features you've considered, if any. -->
## Additional context
<!-- Add any other context or screenshots about the feature request here. -->
echo "NOTE: To run \`import torch\`, please make sure to activate the conda environment by running \`call C:\\Jenkins\\Miniconda3\\Scripts\\activate.bat C:\\Jenkins\\Miniconda3\` in Command Prompt before running Git Bash."
) else (
7z a %IMAGE_COMMIT_TAG%.7z C:\\Jenkins\\Miniconda3\\Lib\\site-packages\\torch && python ci_scripts\\upload_image.py %IMAGE_COMMIT_TAG%.7z
@ -15,6 +15,7 @@ We are in an early-release beta. Expect some adventures and rough edges.
- [Binaries](#binaries)
- [From Source](#from-source)
- [Docker Image](#docker-image)
- [Building the Documentation](#building-the-documentation)
- [Previous Versions](#previous-versions)
- [Getting Started](#getting-started)
- [Communication](#communication)
@ -27,37 +28,21 @@ We are in an early-release beta. Expect some adventures and rough edges.
| Linux GPU | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) | [](https://ci.pytorch.org/jenkins/job/pytorch-master/) |
| Windows GPU | <center>—</center> | [](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-win-ws2016-cuda9-cudnn7-py3-trigger/)
See also the [ci.pytorch.org HUD](https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master).
## More about PyTorch
At a granular level, PyTorch is a library that consists of the following components:
<table>
<tr>
<td><b> torch </b></td>
<td> a Tensor library like NumPy, with strong GPU support </td>
</tr>
<tr>
<td><b> torch.autograd </b></td>
<td> a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch </td>
</tr>
<tr>
<td><b> torch.nn </b></td>
<td> a neural networks library deeply integrated with autograd designed for maximum flexibility </td>
</tr>
<tr>
<td><b> torch.multiprocessing </b></td>
<td> Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training. </td>
</tr>
<tr>
<td><b> torch.utils </b></td>
<td> DataLoader, Trainer and other utility functions for convenience </td>
</tr>
<tr>
<td><b> torch.legacy(.nn/.optim) </b></td>
<td> legacy code that has been ported over from torch for backward compatibility reasons </td>
</tr>
</table>
| Component | Description |
| ---- | --- |
| **torch** | a Tensor library like NumPy, with strong GPU support |
| **torch.autograd** | a tape-based automatic differentiation library that supports all differentiable Tensor operations in torch |
| **torch.nn** | a neural networks library deeply integrated with autograd designed for maximum flexibility |
| **torch.multiprocessing** | Python multiprocessing, but with magical memory sharing of torch Tensors across processes. Useful for data loading and Hogwild training |
| **torch.utils** | DataLoader, Trainer and other utility functions for convenience |
| **torch.legacy(.nn/.optim)** | legacy code that has been ported over from torch for backward compatibility reasons |
Usually one uses PyTorch either as:
@ -70,10 +55,10 @@ Elaborating further:
If you use NumPy, then you have used Tensors (a.k.a ndarray).
@ -121,8 +106,7 @@ We hope you never spend hours debugging your code because of bad stack traces or
PyTorch has minimal framework overhead. We integrate acceleration libraries
such as Intel MKL and NVIDIA (cuDNN, NCCL) to maximize speed.
At the core, its CPU and GPU Tensor and neural network backends
(TH, THC, THNN, THCUNN) are written as independent libraries with a C99 API.
They are mature and have been tested for years.
(TH, THC, THNN, THCUNN) are mature and have been tested for years.
Hence, PyTorch is quite fast – whether you run small or large neural networks.
@ -139,9 +123,8 @@ and with minimal abstractions.
You can write new neural network layers in Python using the torch API
[or your favorite NumPy-based libraries such as SciPy](http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html).
If you want to write your layers in C/C++, we provide an extension API based on
[cffi](http://cffi.readthedocs.io/en/latest/) that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/c_extension.html) and [an example here](https://github.com/pytorch/extension-ffi).
If you want to write your layers in C/C++, we provide a convenient extension API that is efficient and with minimal boilerplate.
There is no wrapper code that needs to be written. You can see [a tutorial here](http://pytorch.org/tutorials/advanced/cpp_extension.html) and [an example here](https://github.com/pytorch/extension-cpp).
## Installation
@ -165,7 +148,9 @@ If you want to compile with CUDA support, install
If you want to disable CUDA support, export environment variable `NO_CUDA=1`.
Other potentially useful environment variables may be found in `setup.py`.
If you want to build on Windows, Visual Studio 2017 and NVTX are also needed.
If you want to build on Windows, Visual Studio 2017 14.11 toolset and NVTX are also needed.
Especially, for CUDA 8 build on Windows, there will be an additional requirement for VS 2015 Update 3 and a patch for it.
The details of the patch can be found out [here](https://support.microsoft.com/en-gb/help/4020481/fix-link-exe-crashes-with-a-fatal-lnk1000-error-when-you-use-wholearch).
Dockerfile is supplied to build images with cuda support and cudnn v7. Build as usual
Dockerfile is supplied to build images with cuda support and cudnn v7. You can pass -e PYTHON_VERSION=x.y flag to specificy which python to be used by Miniconda, or leave it unset to use the default. Build as usual
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at soumith@pytorch.org
* Slack: general chat, online discussions, collaboration etc. https://pytorch.slack.com/ . Our slack channel is invite-only to promote a healthy balance between power-users and beginners. If you need a slack invite, ping us at slack@pytorch.org
* newsletter: no-noise, one-way email newsletter with important announcements about pytorch. You can sign-up here: http://eepurl.com/cbG0rv
ATen is a simple tensor library thats exposes the Tensor operations in Torch
and PyTorch directly in C++11. The wrapper respects the semantics of operators
in PyTorch, except minor details due to differences between C++ in Python in
in PyTorch, except minor details due to differences between C++ and Python in
the way default arguments are handled. See the [documentation for tensors](http://pytorch.org/docs/tensors.html) in PyTorch for what these operations do.
ATen's API is auto-generated from the same declarations PyTorch uses so the
two APIs will track each other over time.
@ -12,7 +12,7 @@ does not include templates. That is, there is one `Tensor` type. It can hold a
CPU or CUDA Tensor, and the tensor may have Doubles, Float, Ints, etc. This design
makes it easy to write generic code without templating everything.
See the _generated_ [`Tensor.h` file](doc/Tensor.h) and [`Functions.h` file](doc/Functions.h) for the provided API. Excerpt:
See https://pytorch.org/cppdocs for the provided API. Excerpt:
```c++
Tensor atan2(const Tensor & other) const;
Tensor & atan2_(const Tensor & other);
@ -48,7 +48,8 @@ sudo pip install pyyaml
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/where/you/want # specify your dest directory
# cmake .. -DNO_CUDA=true # for CPU only machines
# cmake .. -DUSE_NVRTC=ON -DUSE_TENSORRT=OFF -DCMAKE_INSTALL_PREFIX=../install -DCAFFE2_CMAKE_BUILDING_WITH_MAIN_REPO=OFF -DUSE_CUDA=ON # for CUDA
# cmake .. -DUSE_CUDA=OFF # for CPU only machines
make install
```
@ -87,7 +88,7 @@ for(auto i = 0; i < 100000; i++) {
Expressions like `CUDA(kFloat)` are first-class `at::Type` objects that represent
the type of a Tensor and are used to create Tensors when their type cannot be
inferred. See the _generated_ [Type header](doc/Type.h) for its API.
inferred.
See more in [sample files](src/ATen/test).
@ -164,7 +165,7 @@ behave as normal tensors.
### Scalars and zero-dimensional tensors
In addition to the `Tensor` objects, ATen also includes `Scalar`s that represent a single number.
Like a Tensor, Scalars are dynamically typed and can hold any one of ATen's [number types](doc/Type.h).
Like a Tensor, Scalars are dynamically typed and can hold any one of ATen's number types.
Scalars can be implicitly constructed from C++ number types. Scalars are needed because some functions like `addmm` take numbers along with Tensors and expect these
numbers to be the same dynamic type as the tensor. They are also used in the API to indicate places where
a function will _always_ return a Scalar value, like `sum`.