1. Update "Package Reference" to "Python API"
2. Add in torchaudio and torchtext reference links so they show up across all docs, not just the main page
3. Add "Other Languages" section and add in C++ docs
Changelog:
- When the number of batches is 1, dispatch to trsm instead of trsm_batched in MAGMA
Test Plan:
- All triangular_solve tests should pass to ensure that the change is valid
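As an illustrative sanity check (not part of the test suite), a single-batch call exercises the case that now dispatches to trsm; this sketch uses the modern `torch.linalg.solve_triangular` spelling of the same operation, and the MAGMA path itself only runs on CUDA:

```python
import torch

# Upper-triangular, well-conditioned system (illustrative sizes).
A = torch.triu(torch.randn(3, 3)) + 3 * torch.eye(3)
b = torch.randn(3, 2)

# Batch size 1 — the case that now dispatches to trsm rather than
# trsm_batched on the MAGMA backend:
x = torch.linalg.solve_triangular(A.unsqueeze(0), b.unsqueeze(0), upper=True)
print(torch.allclose(A @ x[0], b, atol=1e-4))
```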
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23376
This uses master version of sphinxcontrib-katex as it only
recently got prerender support.
Fixes #20984
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16582064
Pulled By: ezyang
fbshipit-source-id: 9ef24c5788c19572515ded2db2e8ebfb7a5ed44d
This is temporary, won't be needed with the new serialization format.
But for now, since the main module gets its name from the archive name,
we need this for safety; otherwise something like
`torch.jit.save("torch.pt")` will break things.
ghstack-source-id: f36febe1025ff04e7f79617e548819d4876dc7fa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23630
Now when initializing a ScriptModule during the torch.jit.load()
process, there is already a C++ module backing it. That means
that setting `training` would overwrite whatever value the initialized
ScriptModule already had.
This PR splits apart the common "set up internal state" part of the
Module __init__ and calls that from ScriptModule.__init__ and
Module.__init__, leaving the "nn.Module-specific" part (setting
`self.training`) for the nn.Module __init__.
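A minimal Python sketch of the split described above (class and helper names here are illustrative, not the actual torch.nn source):

```python
# Common "set up internal state" goes in a helper both __init__s call;
# only nn.Module sets `training`.
class Module:
    def __init__(self):
        self._construct()      # common internal-state setup
        self.training = True   # nn.Module-specific part

    def _construct(self):
        self._parameters = {}
        self._buffers = {}
        self._modules = {}

class ScriptModule(Module):
    def __init__(self):
        # Only the common setup: the C++ module backing a loaded
        # ScriptModule keeps whatever `training` value it already had.
        self._construct()
```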
ghstack-source-id: 9b2ba8a15c43cf230363e4cd10ba4ad3ac4931f7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23680
* [jit] Support nn.GRU and Make nn.LSTM accept PackedPaddedSequence
* fix
* add link to comments
In max_pool2d, max_pool3d, avg_pool2d, avg_pool3d.
There is only one substantive change: when stride.size() == 1,
we expand it to size 2. However, I also took the opportunity
to give a better error message.
Testing here is the bare minimum, because I'm in a hurry: just make
sure the C++ API works with all size-1 inputs.
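A hypothetical Python sketch of the C++ change: a stride (or kernel-size) argument of length 1 is expanded to the expected number of spatial dims, with a clearer error for other mismatches (the helper name is illustrative):

```python
# Expand a length-1 parameter list to `dims` entries, mirroring the
# stride.size() == 1 handling described above.
def expand_param(param, dims, name):
    if len(param) == 1:
        return list(param) * dims
    if len(param) != dims:
        raise ValueError(
            "%s must be a single value or %d values, got %d"
            % (name, dims, len(param)))
    return list(param)

print(expand_param([3], 2, "stride"))     # [3, 3]
print(expand_param([2, 1], 2, "stride"))  # [2, 1]
```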
This is a squash of four commits.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Summary:
Simplifying https://github.com/pytorch/pytorch/issues/23793: The dependency relationship between
{INSTALL,BUILD}_TEST is already properly handled in CMakeLists.txt. All
we need to do is to pass down INSTALL_TEST.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23806
Differential Revision: D16691833
Pulled By: soumith
fbshipit-source-id: 7607492b2d82db3f79b174373a92e2810a854a61
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/23480.
I only verified that the schedule reaches the restart at the expected step as specified in the issue; it would be good to have someone else verify correctness here.
Script:
```
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    torch.optim.SGD([torch.randn(1, requires_grad=True)], lr=0.5), T_0=1, T_mult=2)
for i in range(9):
    print(i)
    print(scheduler.get_lr())
    scheduler.step()
```
Output:
```
0
[0.5]
1
[0.5]
2
[0.25]
3
[0.5]
4
[0.42677669529663687]
5
[0.25]
6
[0.07322330470336313]
7
[0.5]
8
[0.4809698831278217]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23833
Differential Revision: D16657251
Pulled By: gchanan
fbshipit-source-id: 713973cb7cbfc85dc333641cbe9feaf917718eb9
Summary:
This allows `INSTALL_*` to pass through to cmake.
An additional fix: if `INSTALL_TEST` is specified, it won't use `BUILD_TEST` as the default value for `INSTALL_TEST`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23793
Differential Revision: D16648668
Pulled By: soumith
fbshipit-source-id: 52c2a0d8033bc556355b87a6731a577940de9859
Summary:
Changelog:
- Add batching for det / logdet / slogdet operations
- Update derivative computation to support batched inputs (and consequently batched outputs)
- Update docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22909
Test Plan:
- Add a `test_det_logdet_slogdet_batched` method in `test_torch.py` to test `torch.det`, `torch.logdet` and `torch.slogdet` on batched inputs. This relies on the correctness of `torch.det` on single matrices (tested by `test_det_logdet_slogdet`). A port of this test is added to `test_cuda.py`
- Add autograd tests for batched inputs
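A quick illustration of the batched behavior this adds (shapes only, random values):

```python
import torch

A = torch.randn(2, 4, 3, 3)   # a 2x4 batch of 3x3 matrices
print(torch.det(A).shape)     # one determinant per matrix in the batch
sign, logabsdet = torch.slogdet(A)
print(sign.shape, logabsdet.shape)
```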
Differential Revision: D16580988
Pulled By: ezyang
fbshipit-source-id: b76c87212fbe621f42a847e3b809b5e60cfcdb7a
* [jit] Recursive compilation error hot fixes
This is a combination of #23454 and #23682, which are needed for
error reporting on recursively compiled code
* #23682
Summary:
Changelog:
- Use narrow instead of narrow_copy while returning
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23591
Test Plan:
- All tests should pass to ensure that the change is correct
Fixes https://github.com/pytorch/pytorch/issues/23580
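For context, `narrow` returns a view that shares storage with its input, so returning it (rather than a copy) avoids an allocation; a small illustration:

```python
import torch

x = torch.arange(6)
v = x.narrow(0, 1, 3)   # view of x[1:4], no copy
v[0] = 99               # writes through to x
print(x[1].item())
```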
Differential Revision: D16581174
Pulled By: ezyang
fbshipit-source-id: 1b6bf7d338ddd138ea4c6aa6901834dd202ec79c
Summary:
This accidentally calls clone; what we want is to create an empty tensor and set its storage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23452
ghstack-source-id: 87438096
Differential Revision: D16442756
fbshipit-source-id: 6d5663f82c9bd4e9de8fc846c52992477843af6a
Summary:
Changelog:
- Rename `gels` to `lstsq`
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lstsq` under the name `gels` and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23460
Test Plan: - All tests should pass to confirm that the patch is correct
Differential Revision: D16547834
Pulled By: colesbury
fbshipit-source-id: b3bdb8f4c5d14c7716c3d9528e40324cc544e496
Summary:
Only check for cmake dependencies we directly depend on (e.g., hipsparse but not rocsparse)
Use cmake targets for ROCm where possible.
While there, update the docker CI build infrastructure to only pull in packages by name we directly depend on (anticipating the demise of, e.g., miopengemm). I do not anticipate a docker rebuild to be necessary at this stage as the changes are somewhat cosmetic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23527
Differential Revision: D16561010
Pulled By: ezyang
fbshipit-source-id: 87cd9d8a15a74caf9baca85a3e840e9d19ad5d9f
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
LARGE: 66
MEDIUM: 649
XLARGE: 1
Updated actions:
From LARGE to MEDIUM: 18
From LARGE to XLARGE: 2
From MEDIUM to LARGE: 20
From XLARGE to LARGE: 1
Differential Revision: D16559356
fbshipit-source-id: a51ef034265649314661ab0e283089a069a20437
Summary:
When a user tries to change metadata of a tensor created from `.data` or `.detach()`, we currently shows an error message "<function_name> is not allowed on Tensor created from .data or .detach()". However, this error message doesn't suggest what the right fix should look like. This PR improves the error message.
This PR improves the error message. Closes https://github.com/pytorch/pytorch/issues/23393.
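Background for why the restriction exists (illustrative, not the PR's code): a tensor from `.detach()` or `.data` shares storage with the original, so changing its metadata in place would corrupt autograd's view of the original tensor:

```python
import torch

x = torch.randn(3, requires_grad=True)
d = x.detach()
print(d.data_ptr() == x.data_ptr())  # same storage
# Metadata changes such as d.resize_(6) raise an error; the improved
# message now suggests a fix (e.g. clone first, or operate on x itself).
```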
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23504
Differential Revision: D16547415
Pulled By: yf225
fbshipit-source-id: 37f4a0385442e2b0966386fb14d3d938ecf4230c
Summary:
Previously these were left out which would lead to confusing messages,
now it looks something like:
```
torch.jit.frontend.UnsupportedNodeError: import statements aren't supported
:
at ../test.py:13:9
    def bad_fn(self):
        import pdb
        ~~~~~~ <--- HERE
'__torch__.X' is being compiled since it was called from 'fn'
at ../test.py:16:12
    def fn(x):
        return X(10)
               ~~~~ <--- HERE
```
Fixes #23453
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23454
Pulled By: driazati
Differential Revision: D16526027
fbshipit-source-id: 109f2968430dbf51ee91b1b3409badfd557d19a4
Summary:
Use the recursive script API in the existing docs
TODO:
* Migration guide for 1.1 -> 1.2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21612
Pulled By: driazati
Differential Revision: D16553734
fbshipit-source-id: fb6be81a950224390bd5d19b9b3de2d97b3dc515
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23521
non-fbgemm path should have the same arguments as fbgemm path.
Reviewed By: jianyuh
Differential Revision: D16547637
fbshipit-source-id: bb00d725fb968cbee32defb8facd2799a7e79bb4
Summary:
This resolves two issues in one shot:
- sub shouldn't be available for bool type.
- When sub is applied to an unsupported type, the current error message
shows "add_cpu/add_cuda is not implemented for [type]". It should be
"sub_cpu/sub_cuda" instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23519
Differential Revision: D16548770
Pulled By: izdeby
fbshipit-source-id: fe404a2a97b8d11bd180ec41364bf8e68414fb15
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/23417
Test Plan:
cd docs; make html
Imported from OSS
Differential Revision: D16523781
Pulled By: ilia-cher
fbshipit-source-id: d6c09e8a85d39e6185bbdc4b312fea44fcdfff06
Summary:
No real change on the CI since currently the default latest is 0.4.0. houseroad bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23517
Differential Revision: D16550375
Pulled By: bddppq
fbshipit-source-id: a669b8af678c79c4d6909300b28458fe6b7cd30c
Summary:
There is an internal fbcode assert that fails if I do not add these checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23511
Differential Revision: D16545606
Pulled By: eellison
fbshipit-source-id: cd3a799850bae8f052f9d81c1e4a2678fda19317
Summary:
The PyTorch test suite sets a policy() method to assertLeaksNoCudaTensors.
Whenever a test is run, assertLeaksNoCudaTensors is called,
which in turn calls CudaMemoryLeakCheck, which in turn calls
initialize_cuda_context_rng, which executes torch.randn
on each device, launching a kernel on each device.
Since the kernel may not have finished on device 1, the assertion
self.assertTrue(s1.query()) fails.
The fix is to insert
```
torch.cuda.synchronize(d0)
torch.cuda.synchronize(d1)
```
at the beginning of the test so that previously launched kernels finish before the real
test begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23520
Differential Revision: D16547701
Pulled By: soumith
fbshipit-source-id: 42ad369f909d534e15555493d08e9bb99dd64b6a
Summary:
Add a sorting policy to ChunkDataset.
This is an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before it is sampled into minibatches.
Unlike the collate method, this policy is applied at the chunk level rather than the minibatch level. When a chunk of data is loaded (multiple chunks if cross_chunk_shuffle_count_ is greater than 1), the policy applies to the full loaded data. It is useful if developers want to perform some pre-processing (like bucketing) on the chunk data before the example sampler samples it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23053
Differential Revision: D16537692
Pulled By: colesbury
fbshipit-source-id: cd21ed40ab787a18b8c6dd304e5b806a7a45e6ba
Summary:
Thanks adefazio for the feedback, adding a note to the Contribution guide so that folks don't start working on code without checking with the maintainers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23513
Differential Revision: D16546685
Pulled By: soumith
fbshipit-source-id: 1ee8ade963703c88374aedecb8c9e5ed39d7722d
Summary:
This modernizes distributions code by replacing a few uses of `.contiguous().view()` with `.reshape()`, fixing a sampling bug in the `Categorical` distribution.
The bug is exercised by the following test:
```py
batch_shape = (1, 2, 1, 3, 1)
sample_shape = (4,)
cardinality = 2
logits = torch.randn(batch_shape + (cardinality,))
dist.Categorical(logits=logits).sample(sample_shape)
# RuntimeError: invalid argument 2: view size is not compatible with
# input tensor's size and stride (at least one dimension spans across
# two contiguous subspaces). Call .contiguous() before .view().
# at ../aten/src/TH/generic/THTensor.cpp:203
```
I have verified this works locally, but I have not added this as a regression test because it is unlikely to regress (the code is now simpler).
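The underlying distinction, in a standalone illustration: `.reshape()` handles non-contiguous inputs that `.view()` rejects (here the non-contiguity comes from `expand`):

```python
import torch

x = torch.randn(1, 2, 1, 3, 1).expand(4, 2, 5, 3, 2)  # non-contiguous
y = x.reshape(-1)          # copies when needed, always succeeds
print(y.numel())           # 4 * 2 * 5 * 3 * 2 = 240
try:
    x.view(-1)
except RuntimeError:
    print("view fails on non-contiguous input")
```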
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23328
Differential Revision: D16510678
Pulled By: colesbury
fbshipit-source-id: c125c1a37d21d185132e8e8b65241c86ad8ad04b
Summary:
Currently there is no way to build MKLDNN with optimizations beyond sse4. This commit makes the MKLDNN build respect USE_NATIVE_ARCH.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23445
Differential Revision: D16542275
Pulled By: ezyang
fbshipit-source-id: 550976531d6a52db9128c0e3d4589a33715feee2
Summary:
- MSVC_Z7_OVERRIDE has already been handled in CMakeLists.txt. No need to process it once more in the Python scripts.
- Option MSVC_Z7_OVERRIDE should be visible to the user only if MSVC is used.
- Move the setting of "/EHa" flag to CMakeLists.txt, where other MSVC-specific flags are processed. This also further prepares the removal of redundant cflags setup in Python build scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23455
Differential Revision: D16542274
Pulled By: ezyang
fbshipit-source-id: 4d3b8b07161478bbba8a21feb6ea24c9024e21ac
Summary:
Closes gh-16955.
Closes https://github.com/pytorch/vision/issues/977
On Linux both `lib64` and `lib` may be present (symlinked). The reports
seem to all be about macOS, but it seems like this is also possibly more
robust on Linux and can't hurt. So not treating platforms differently.
Note that Eigen has a similar check in its CMake:
```
if(CUDA_64_BIT_DEVICE_CODE AND (EXISTS "${CUDA_TOOLKIT_ROOT_DIR}/lib64"))
link_directories("${CUDA_TOOLKIT_ROOT_DIR}/lib64")
else()
link_directories("${CUDA_TOOLKIT_ROOT_DIR}/lib")
endif()
```
There may be other issues for building from source on macOS, can't test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23491
Differential Revision: D16538973
Pulled By: soumith
fbshipit-source-id: cc309347b7d16e718e06878d3824d0a6e40b1019
Summary:
Currently set_rng_state and get_rng_state do not accept strings as their device argument. This commit lets them accept strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23448
Differential Revision: D16527172
Pulled By: soumith
fbshipit-source-id: 8f9a2129979706e16877cc110f104770fbbe952c
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 981
LARGE: 56
Updated actions:
From MEDIUM to LARGE: 10
From LARGE to MEDIUM: 3
From LARGE to XLARGE: 1
Differential Revision: D16532427
fbshipit-source-id: c58bf59e6c571627b3994f8cdfa79758fb85892b
Summary:
(1) Add `COMMON_MSVC_FLAGS` to the flags in the ninja codepath
(2) Add `/EHsc` to `COMMON_MSVC_FLAGS`
(3) Remove `-fPIC` and `-std=c++11` from the flags in the windows codepath
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23472
Differential Revision: D16532993
Pulled By: soumith
fbshipit-source-id: bc2d983f5f8b4eae9c7385bf170f155679e92e87
Summary:
Add `sorted` keyword to JIT for lists and dicts. This desugars to a list copy and a call to `list.sort()`. Since we don't have interfaces yet I implement it in terms of `list.sort()`. When we do we can re-visit implementing this op in a different manner.
The test fails bc of a fix to specialized lists which is landing here: https://github.com/pytorch/pytorch/pull/23267
Ignore the first commit because it is formatting, plz use clang_format ppl :'(
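A minimal illustration of the new builtin (the desugaring — a list copy followed by `list.sort()` — is described above):

```python
from typing import List
import torch

@torch.jit.script
def sorted_copy(xs: List[int]) -> List[int]:
    # `sorted` desugars to a copy of the list plus a call to list.sort()
    return sorted(xs)

print(sorted_copy([3, 1, 2]))  # [1, 2, 3]
```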
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23274
Differential Revision: D16527323
Pulled By: eellison
fbshipit-source-id: aed8faef23cb790b9af036cd6c1b9b1d7066345d
Summary:
Scatter is unnecessary if only one device is used, and it breaks on some custom data structures like namedtuple, so we would like to avoid it :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22384
Differential Revision: D16428208
Pulled By: soumith
fbshipit-source-id: eaa3876b2b95c1006ccaaacdb62f54c5280e730c
Summary:
This is part of the effort to shrink OSS libtorch mobile build size.
We shouldn't need Module::save function on mobile - it depends on
csrc/jit/export.cpp which then depends on ONNX. By gating these two
methods we can avoid these dependencies for libtorch mobile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23415
ghstack-source-id: 87288228
Reviewed By: dreiss
Differential Revision: D16511143
fbshipit-source-id: fd031f91fcf9b7be54cbe1436506965af94ab537
Summary:
Add early returns to JIT with minimal changes to compiler.cpp and an IR->IR pass that will transform the graph so that there is only one return value.
In compiler.cpp, record when a block will exit so that in the following example will work:
```
if cond:
    a = torch.zeros([2])
else:
    return 2
a += 2
...
```
To match block outputs with values that will not be used, like in the above example with `a`, I add a Bottom Type that subtypes everything else. This allows shape propagation to continue to work, and makes it so that we don't need many extra nodes filling up the graph.
The IR transform currently doesn't work on Loops, I didn't add that to this PR to avoid too much complexity, but will add it as a stack (and it should be very little extra code). the IR transform is commented at the top of the file.
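A runnable variant of the example above (using ints so both paths share one return type):

```python
import torch

@torch.jit.script
def f(cond: bool) -> int:
    if cond:
        a = 2
    else:
        return 2   # early return from one branch
    a += 2         # code after the if still sees `a` on the other path
    return a

print(f(True), f(False))  # 4 2
```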
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19179
Differential Revision: D16519819
Pulled By: eellison
fbshipit-source-id: 322a27f69966d1fd074ebe723c3e948b458b0e68
Summary:
Adds qtensor specific fields to the proto file so that they get serialized into the model.json
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23356
ghstack-source-id: 87263428
Differential Revision: D16473237
fbshipit-source-id: bf5b51d0863d036d30a1644a3c3b74516468224b
Summary:
As pointed out by SsnL in https://github.com/pytorch/pytorch/issues/20910, when clone destination is different from the module's device,
`Cloneable` currently calls `clone()` and then `to()` on every parameter and buffer, where the first clone is unnecessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20995
Differential Revision: D15517353
Pulled By: mrshenli
fbshipit-source-id: 6b6dc01560540a63845663f863dea0a948021fa5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23442
Replace the argument name from `operator` to `operators` which can take a list of operators to test.
Reviewed By: hl475
Differential Revision: D16520779
fbshipit-source-id: 94284a87c64471793e319f5bd3143f89b9a192bb
Summary:
When an exception occurs in one of the modules passed to `parallel_apply()`, it is caught and re-raised in the main thread. This preserves the original exception type and message, but has the traceback point at the position where it's re-raised, rather than the original point of failure.
This PR saves the exception information required to generate the traceback, and includes the original traceback in the message of the exception raised in the main thread.
Before:
```
...
File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File ".../torch/nn/parallel/parallel_apply.py", line 84, in parallel_apply
raise output
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```
After:
```
...
File ".../torch/nn/parallel/data_parallel.py", line 153, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File ".../torch/nn/parallel/parallel_apply.py", line 88, in parallel_apply
''.join(traceback.format_exception(*exc_info)))
RuntimeError: Caught exception in replica 0. Original traceback and message:
Traceback (most recent call last):
...
File "../models/foo.py", line 319, in bar
baz = asdf / ghij[:, np.newaxis]
RuntimeError: expected type torch.FloatTensor but got torch.cuda.FloatTensor
```
I took care to raise an exception of the original type (in case the main code checks for that), but replaced the message. It helped me find a bug that did not occur outside `data_parallel()`.
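The pattern can be sketched in isolation (illustrative, not the actual `parallel_apply` code): the worker stores `sys.exc_info()`, and the main thread re-raises an exception of the original type with the original traceback embedded in the message:

```python
import sys
import traceback

def worker(results, i, fn, *args):
    try:
        results[i] = fn(*args)
    except Exception:
        results[i] = sys.exc_info()   # keep type, value, and traceback

def reraise(exc_info, replica):
    exc_type, exc_value, tb = exc_info
    msg = "Caught exception in replica {}. Original traceback and message:\n{}".format(
        replica, "".join(traceback.format_exception(exc_type, exc_value, tb)))
    raise exc_type(msg)               # original type, augmented message

results = {}
worker(results, 0, lambda: 1 / 0)
try:
    reraise(results[0], 0)
except ZeroDivisionError as e:        # original type is preserved
    print("replica 0" in str(e) and "ZeroDivisionError" in str(e))
```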
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18055
Differential Revision: D16444972
Pulled By: zhangguanheng66
fbshipit-source-id: ec436c9d4677fad18106a8046cfa835a20a101ce
Summary:
Don't automatically unwrap the top-layer DataParallel for users. Instead, we provide useful error information and tell users what action to take.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23365
Reviewed By: zrphercule
Differential Revision: D16514273
Pulled By: houseroad
fbshipit-source-id: f552de5c53fb44807e9d9ad62126c98873ed106e
Summary:
The conda compiler are gcc/c++ 7.3.0, but have custom version strings
for clarity:
x86_64-conda_cos6-linux-gnu-cc
x86_64-conda_cos6-linux-gnu-c++
Using these compilers to build a C++ or CUDA extension now gives this warning (unnecessarily):
```
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (/home/rgommers/anaconda3/envs/pytorch-nightly/bin/x86_64-conda_cos6-linux-gnu-c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux.
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23396
Differential Revision: D16500637
Pulled By: soumith
fbshipit-source-id: 5b2fc3593e22e9a7d07dc2c0456dbb4934ffddb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23104
ghstack-source-id: 87247148
As suggested in https://github.com/pytorch/pytorch/pull/22891, we will add an overload for `torch.fbgemm_linear_int8_weight` (the dynamically quantized version of the linear function) that takes PackedLinearWeight as input and is pretty much the same in signature as regular aten::linear.
Differential Revision: D16381552
fbshipit-source-id: 1ccc4174fd02c546eee328940ac4b0da48fc85e8
Summary:
adding qconv+relu and qlinear+relu modules in nn/_intrinsic/quantized
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23410
Test Plan:
Extended tests to test these new modules as well
buck test mode/dev caffe2/test:quantized -- 'test_linear_api' --print-passing-details
```
Running 1 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/2251799820197379
✓ caffe2/test:quantized - test_linear_api (test_nn_quantized.ModuleAPITest) 4.055 1/1 (passed)
Test output:
> test_linear_api (test_nn_quantized.ModuleAPITest)
> test API functionality for nn.quantized.linear and nn._intrinsic.quantized.linear_relu ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 4.056s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/2251799820197379
Summary (total time 10.66s):
PASS: 1
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
buck test mode/dev caffe2/test:quantized -- 'test_conv_api' --print-passing-details
```
Running 2 tests
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/4785074607089664
✓ caffe2/test:quantized - test_conv_api (test_quantized_conv.QuantizedConvTest) 5.195 1/2 (passed)
Test output:
> test_conv_api (test_quantized_conv.QuantizedConvTest)
> Tests the correctness of the conv functional. ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 5.195s
>
> OK
✓ caffe2/test:quantized - test_conv_api (test_nn_quantized.ModuleAPITest) 10.616 2/2 (passed)
Test output:
> test_conv_api (test_nn_quantized.ModuleAPITest)
> Tests the correctness of the conv module. ... ok
>
> ----------------------------------------------------------------------
> Ran 1 test in 10.616s
>
> OK
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/4785074607089664
Summary (total time 17.31s):
PASS: 2
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
Differential Revision: D16505333
Pulled By: dskhudia
fbshipit-source-id: 04f45cd0e76dc55f4694d558b913ab2958b7d727
Summary:
This is still a work in progress.
There are several more items to add to complete this doc, including
- [x] LHS indexing, index assignments.
- [x] Tensor List.
- [x] ~Shape/Type propagation.~
- [x] FAQs
Please review and share your thoughts, feel free to add anything that you think should be included as well. houseroad spandantiwari lara-hdr neginraoof
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23185
Differential Revision: D16459647
Pulled By: houseroad
fbshipit-source-id: b401c005f848d957541ba3b00e00c93ac2f4609b
Summary:
They should be forwarded as their actual type, not as an rvalue reference.
This looked like perfect forwarding but actually wasn't.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23412
ghstack-source-id: 87214575
Reviewed By: dzhulgakov
Differential Revision: D16507872
fbshipit-source-id: 2b20a37df83067dd53e917fe87407ad687bb147c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21211
There are cases where the `init` method used to create inputs can exit with error. When this happens, that specific input should be skipped.
Reviewed By: zheng-xq
Differential Revision: D15466410
fbshipit-source-id: 55e86764b2ec56f7730349ff1df6e50efc0239d7
Summary:
Align the Argument's operator<< with parser,
additional support:
1) List size
2) real default value
3) Alias information
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23203
ghstack-source-id: 87118985
Reviewed By: zrphercule
Differential Revision: D16433188
fbshipit-source-id: aea5711f93feacd94d1732e2f0d61218a31a0c5c
Summary:
The builder pattern doesn't seem to work well with return-value-optimization.
This saves ~100 ns in the construction of TensorIterator::binary_op.
```
import torch
x = torch.rand(1)
y = torch.rand(1)
z = torch.rand(1)
%timeit torch.add(x, y, out=z) # ~1.76 us vs ~1.88 us on my machine
```
cc resistor zheng-xq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23329
Differential Revision: D16495070
Pulled By: VitalyFedyunin
fbshipit-source-id: 8ce116075fa4c7149dabfcdfa25885c1187c8e2f
Summary:
The legacy iOS build script (`build_ios.sh`) still works, but the output is caffe2, not Pytorch. To enable the Pytorch iOS build, we can set the value of `BUILD_CAFFE2_MOBILE` to `NO` and turn on another cmake arg, `INTERN_BUILD_MOBILE`, which ljk53 created for Android.
There is a trivial issue in `used_kernel.cpp` that causes a compile error when running `build_ios.sh`, as it uses a `system` API that has been deprecated since iOS 11. The fix below bypasses this file since it's not needed on mobile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23293
Test Plan:
The `build_ios.sh` completed successfully, and all the generated static libraries can be compiled and linked successfully on iOS devices.
### Build script
```shell
./scripts/build_ios.sh \
-DBUILD_CAFFE2_MOBILE=OFF \
-DCMAKE_PREFIX_PATH=$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())') \
-DPYTHON_EXECUTABLE=$(python -c 'import sys; print(sys.executable)')
```
Differential Revision: D16456100
Pulled By: xta0
fbshipit-source-id: 38c73e1e3a0c219a38ddc28b31acc181690f34e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22175
- Rename AliasAnalysisKind::DEFAULT to AliasAnalysisKind::CONSERVATIVE
- Introduce AliasAnalysisKind::FROM_SCHEMA that means the alias annotations of the schema should be honored
- Introduce AliasAnalysisKind::INTERNAL_SPECIAL_CASE to be able to run assertions that internal special cased ops are treated correctly
- aten:: and prim:: ops are not treated as special cases anymore, but just use AliasAnalysisKind::FROM_SCHEMA
- There's a set of assertions to ensure that aten:: and prim:: ops are all correctly set up to use AliasAnalysisKind::FROM_SCHEMA. Once this PR lands and passes all tests, we will remove those assertions and open up for the possibility of different AliasAnalysisKind settings for aten:: and prim:: ops
Differential Revision: D15929595
fbshipit-source-id: 7c6a9d4d29e13b8c9a856062cd6fb3f8a46a2e0d
Summary:
torch::List recently received some polishing that is now also done for Dict. This should be done before the PyTorch 1.2 release because of backwards compatibility.
- Dict is just a reference type, so "const Dict" should have the same capabilities as "Dict", constness is not guaranteed in any way.
- DictIterator gets comparison operators <, <=, >, >=
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23344
ghstack-source-id: 87170304
Differential Revision: D16468800
fbshipit-source-id: 2978c3b9cdcfb2cfb3f26516b15bd455d9a48ba9
Summary:
This check is not needed. Even if it were, the assignment is clobbered anyway.
Closes #23300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23370
ghstack-source-id: 87157671
Differential Revision: D16485329
fbshipit-source-id: 8ccac79e81f5e0d0d20099d550411c161f58c233
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22808
- Use `size_to_dim_`.
- `mod` is not in scope; it should be `module`.
Reviewed By: mingzhe09088
Differential Revision: D16225799
fbshipit-source-id: 9a263227d2d508eefdfddfee15fd0822819de946
Summary:
All cases should be prim ops, but let's support it. It will expect the variadic return schema to be prim::PythonOp(...) -> ...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23199
ghstack-source-id: 87113845
Differential Revision: D16431635
fbshipit-source-id: 798b6957ce5d800f7fcf981c86fdcb009cd77a78
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/22833
grad_sum_to_size does not commute with AutogradAdd after all, because it turns the broadcasting AutogradAdd into a broadcasting add.
Chillee actually did most of the work tracking this down to the fusion of grad_sum_to_size, pinging me when he had found the cause. Thank you!
About the choice of removing the fusion completely instead of being more precise:
- We do have grad_sum_to_size elimination which works for cases where broadcasting does not actually happen in the forward, so the cases where the fusing of grad_sum_to_size is actually beneficial is much smaller than when initially proposed.
- There will be less fusion, in terms of the tests, IOU stops being fully fused. I vaguely think that it is a case we could handle with refined logic.
- Keeping it would add complexity in checking when to merge fusion groups to the complexities that this PR removes.
- The future of fusion probably lies more in more complete solutions including reductions (TVM or KeOps or our own or ...).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23372
Differential Revision: D16489930
Pulled By: soumith
fbshipit-source-id: bc0431b0d3eda264c401b634675872c4ce46f0f4
Summary:
Instead, defer its default value to CMakeLists.txt
NO_FBGEMM has already been handled in tools/setup_helpers/env.py
(although deprecated)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23314
Differential Revision: D16493580
Pulled By: ezyang
fbshipit-source-id: 7255eb1df5e8a6dd0362507d68da0986a9ed46e2
Summary:
This is a small fix on top of gh-23348, which fixed the libtorch
nightly build timeouts.
For the latest nighly build (25 July), see
https://circleci.com/workflow-run/33d0a24a-b77c-4a8f-9ecd-5646146ce684
The only failures are these uploads, which is because `aws s3 cp`
can only deal with one file at a time. The only way to make it do
multiple files at once is:
```
aws s3 cp . "$s3_dir" --exclude "*" --include "libtorch-*.zip" --recursive --acl public-read
```
which is much more verbose. Executing one `cp` per file should be fine,
and this is also what's done in `binary_macos_upload.sh`.
Closes gh-23039
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23368
Differential Revision: D16488853
Pulled By: soumith
fbshipit-source-id: 6dc04b4de2f6cd2de5ae9ad57a6e980f56896498
Summary:
With this change you can now list multiple interfaces separated by
comma. ProcessGroupGloo creates a single Gloo context for every device
in the list (a context represents a connection to every other
rank). For every collective that is called, it will select the context
in a round robin fashion. The number of worker threads responsible for
executing the collectives is set to be twice the number of devices.
If you have a single physical interface, and wish to employ increased
parallelism, you can also specify
`GLOO_SOCKET_IFNAME=eth0,eth0,eth0,eth0`. This makes ProcessGroupGloo
use 4 connections per rank, 4 I/O threads, and 8 worker threads
responsible for executing the collectives.
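The round-robin context selection described above can be sketched as follows (the names are illustrative stand-ins, not the actual ProcessGroupGloo internals):

```python
import itertools

# One Gloo context per entry in GLOO_SOCKET_IFNAME; listing eth0 four times
# gives four connections per rank over the same physical interface.
ifnames = "eth0,eth0,eth0,eth0".split(",")
contexts = [f"context[{i}]@{name}" for i, name in enumerate(ifnames)]

# Each collective picks the next context in round-robin order.
picker = itertools.cycle(contexts)
first_four = [next(picker) for _ in range(4)]
fifth = next(picker)  # wraps back around to the first context
```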
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22978
ghstack-source-id: 87006270
Differential Revision: D16339962
fbshipit-source-id: 9aa1dc93d8e131c1714db349b0cbe57e9e7266f1
Summary:
An illegal instruction is encountered in MKL-DNN in the pre-built package. https://github.com/pytorch/pytorch/issues/23231
To avoid such binary-compatibility issues, the HostOpts option in MKL-DNN is disabled in order to build MKL-DNN for a generic arch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23292
Differential Revision: D16488773
Pulled By: soumith
fbshipit-source-id: 9e13c76fb9cb9338103cb767d7463c10891d294a
Summary:
This is step 1 in trying to get rid of constants that are set prior to
executing the test runner. All setup logic should be concentrated in
the setupClass() function of the TestCase subclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23223
ghstack-source-id: 87005260
Reviewed By: zhaojuanmao
Differential Revision: D16439147
fbshipit-source-id: 7a929ad4b1c8e368e33d1165becbd4d91220882c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23347
This diff replaces uint8 with int8 to match with the underlying kernel implementation. When we do int8 quantization, we are computing with uint8 (input activation) * int8 (weight) -> uint8 (output activation). The weight is quantized into int8.
Reviewed By: jianyuh
Differential Revision: D16469435
fbshipit-source-id: a697655b0e97833fc601e5980970aec4dba53c39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23354
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16474254
Pulled By: ezyang
fbshipit-source-id: 0dd7ce02e1aa1a42a24d2af066ebd0ac5206c9a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23325
Fixes #19990
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Differential Revision: D16473826
Pulled By: ezyang
fbshipit-source-id: 466db2c22fabd7b574f0a08aec67a18318ddb431
Summary:
Proposed PR for
https://github.com/pytorch/pytorch/issues/23342
Disables execution of QNNpack tests if IS_PPC.
This parallels the skipping of these tests for IS_WINDOWS, which is already present.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23343
Differential Revision: D16469218
Pulled By: soumith
fbshipit-source-id: 80b651d00e5d413e359cf418f79e20d74cd9c8e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23317
Print out the kind type when fail to export
Reviewed By: zrphercule
Differential Revision: D16462641
fbshipit-source-id: 27157c0bd597362f90ac8cfb33e1808bac0ec48b
Summary:
fix https://github.com/pytorch/pytorch/issues/21044
Bicubic interpolation can cause overshoot.
OpenCV keeps the result dtype aligned with the input dtype:
- If the input is uint8, the result is clamped to [0, 255]
- If the input is float, the result is unclamped.
In PyTorch's case, we only accept float input, so we keep the result unclamped, and add some notes so that users can explicitly call `torch.clamp()` when necessary.
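A minimal sketch of the recommended explicit clamp, shown here with NumPy's `clip` (which mirrors `torch.clamp`); the values are made up to illustrate overshoot:

```python
import numpy as np

# Bicubic interpolation of float input can overshoot the input range;
# clamp explicitly when the result must stay within [0, 255].
upsampled = np.array([-3.7, 12.0, 255.0, 261.4])  # illustrative overshoot
clamped = np.clip(upsampled, 0.0, 255.0)
```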
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23321
Differential Revision: D16464796
Pulled By: ailzhang
fbshipit-source-id: 177915e525d1f54c2209e277cf73e40699ed1acd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23257
Overall context: open-source BlackBoxPredictor as the entry
point for inference in Caffe2 (thread safe abstraction for Caffe2
inference). This should be used in ThroughputBenchmark for the purpose
of framework comparison
This specific diff:
There should be no harm in moving transformation code to
OSS. On the advantages side we will be able to compare production
Caffe2 setup with PyTorch in the most fair way via
ThroughputBenchmark. This approach avoids any complicated
transformation registries. Building those properly would be a significant
engineering effort as well as production risk. In the past we had SEVs
related to transforms being turned off due to various refactors. Given
that we don't plan to build any other significant investments into
transformation logic except existing ones (like TVM and Glow), and
those also relate to open-source technologies, I came up to the
conclusion of moving to OSS the whole thing.
Reviewed By: zrphercule
Differential Revision: D16428124
fbshipit-source-id: b35deada5c015cd97b91ae12a7ea4aac53bd14b8
Summary:
Covering fleet-wide profiling, api logging, etc.
It's my first time writing rst, so suggestions are definitely welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23010
Differential Revision: D16456721
Pulled By: dzhulgakov
fbshipit-source-id: 3d3018f41499d04db0dca865bb3a9652d8cdf90a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23291
This diff implements LSTM with FP16 weights based on FBGEMM.
At a high level, here are the steps:
1. Quantize and pack weight in every layer of LSTM
2. Pass weights from step 1 to the ATen `quantized_lstm` function which does matrix multiplication with FP16 weight. The following code shows the dtype of each variable used in MM:
Y = X * W + B
(fp32, fp32, fp16, fp32)
Reviewed By: jianyuh
Differential Revision: D16389595
fbshipit-source-id: c26ae4e153c667a941f4af64e9d07fc251403cee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22733
This refactor changes the conv module to avoid the usage of the functional ops.
Reviewed By: jerryzh168
Differential Revision: D15835572
fbshipit-source-id: f2294cd708fbe8372eb3a15cc60d83777d4f7029
Summary:
It used to be run with comm_size=8, which caused flaky results in a
stress run. The flakiness was caused by too many listening sockets
being created by Gloo context initialization (8 processes times 7
sockets times 20-way concurrency, plus TIME_WAIT).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23221
ghstack-source-id: 86995596
Reviewed By: d4l3k
Differential Revision: D16437834
fbshipit-source-id: 998d0e2b087c0ab15eca64e308059c35e1b51e7b
Summary:
I manually went through all functions in `torch.*` and corrected any mismatch between the arguments mentioned in doc and the ones actually taken by the function. This fixes https://github.com/pytorch/pytorch/issues/8698.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22973
Differential Revision: D16419602
Pulled By: yf225
fbshipit-source-id: 5562c9b0b95a0759abee41f967c45efacf2267c2
Summary:
Currently the build type is decided by the environment variable DEBUG
and REL_WITH_DEB_INFO. This commit also lets CMAKE_BUILD_TYPE be
effective. This makes the interface more consistent with CMake. This
also prepares https://github.com/pytorch/pytorch/issues/22776.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22875
Differential Revision: D16281663
Pulled By: ezyang
fbshipit-source-id: 952f92aad85ff59f1c7abe8256eca8a4a0936026
Summary:
Rehash of https://github.com/pytorch/pytorch/issues/22322 .
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Added a pattern-match skip for anything but the ROCm CI (compared to #22322) for the Python find step in the PyTorch build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23088
Differential Revision: D16448261
Pulled By: bddppq
fbshipit-source-id: 69ece1a213418d9abf1444c496dce1c190ee07c8
Summary:
There are a lot of formatting changes, which make other diffs to these PRs noisy and hard to read.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23283
Differential Revision: D16453590
Pulled By: eellison
fbshipit-source-id: 97b4bf1dbbbfb09c44c57402f61ea27287060044
Summary:
In Python, `register_module` / `register_parameter` / `register_buffer` method in `nn.Module` is public. This PR makes those APIs public for C++ `nn::Module` as well. Closes https://github.com/pytorch/pytorch/issues/23140.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23196
Differential Revision: D16440239
Pulled By: yf225
fbshipit-source-id: e0eff6e1db592961fba891ec417dc74fa765e968
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23272
We see significant performance improvements by limiting concurrency
at caffe2 level on mobile. This diff enables setting the number of caffe2
workspaces used during rnn inference.
Reviewed By: akyrola
Differential Revision: D16448611
fbshipit-source-id: 28abaddb4ea60bacb084ceb28cb7a4d1e67ccc17
Summary:
Support exporting
* Standard tensor indexing like
```
x = torch.ones(4, 5)
ind = torch.tensor([0, 1])
return x[ind]
```
* [Advanced indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing) like
```
x = torch.ones(4,5,6,7,8)
ind1 = torch.tensor([0, 1])
ind2 = torch.tensor([[3], [2]])
ind3 = torch.tensor([[2, 2], [4, 5]])
return x[2:4, ind1, None, ind2, ind3, :]
```
It would be ideal if ONNX could natively support indexing in future opsets, but for opset <= 10 it will always need this kind of workaround.
There are still various limitations, such as not supporting advanced indexing with negative indices, or mask indices of rank > 1. My feeling is that these are less common cases that require great effort to support with the current opset, and it's better not to make the index export more cumbersome than it already is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21716
Reviewed By: zrphercule
Differential Revision: D15902199
Pulled By: houseroad
fbshipit-source-id: 5f1cc687fc9f97da18732f6a2c9dfe8f6fdb34a6
Summary:
Previously we weren't specializing the list returned from `dict.keys()`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23267
Differential Revision: D16448512
Pulled By: eellison
fbshipit-source-id: fcd2a37ac680bdf90219b099a94aa36a80f4067c
Summary:
Overall context: open-source BlackBoxPredictor as the entry
point for inference in Caffe2 (thread safe abstraction for Caffe2
inference). This should be used in ThroughputBenchmark for the purpose
of framework comparison
This specific diff:
There should be no harm in moving transformation code to
OSS. On the advantages side we will be able to compare production
Caffe2 setup with PyTorch in the most fair way via
ThroughputBenchmark. This approach avoids any complicated
transformation registries. Building those properly would be a significant
engineering effort as well as production risk. In the past we had SEVs
related to transforms being turned off due to various refactors. Given
that we don't plan to build any other significant investments into
transformation logic except existing ones (like TVM and Glow), and
those also relate to open-source technologies, I came up to the
conclusion of moving to OSS the whole thing.
Reviewed By: bertmaher
Differential Revision: D16367134
fbshipit-source-id: fc6bacc1be3ff6336beb57cdad58168d3a2b8c28
Summary:
Per https://github.com/pytorch/pytorch/issues/22260, the default number of OpenMP threads spawned equals the number of cores available. For multiprocessing data-parallel cases, too many threads may be spawned, which can overload the CPU and cause a performance regression.
So we default OMP_NUM_THREADS to (number of CPU processors) / (number of processes), to neither overload nor waste CPU threads.
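The default described above amounts to the following (a sketch; `num_processes` would come from the launcher arguments such as `--nproc_per_node`):

```python
import os

num_cpus = os.cpu_count() or 1
num_processes = 2  # e.g. --nproc_per_node=2

# Default OMP_NUM_THREADS per worker: max(1, num_cpus // num_processes),
# so the workers neither oversubscribe nor waste CPU cores.
omp_num_threads = max(1, num_cpus // num_processes)
os.environ.setdefault("OMP_NUM_THREADS", str(omp_num_threads))
```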
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22501
Test Plan:
1. without and with this change, example codes result in same result
python ~/local/fbsource-fbcode/fbcode/caffe2/torch/distributed/launch.py --nproc_per_node=2 pytorch/examples/yanlizhao/distributed_launch_example.py
Setting OMP_NUM_THREADS environment variable for each process to be: 24, which
is max(1, num_cpus / num_processes), you can further tune the variable for optimal performance in your application if needed.
final loss = tensor(0.5211, device='cuda:0', grad_fn=<MseLossBackward>)
Differential Revision: D16092225
Pulled By: zhaojuanmao
fbshipit-source-id: b792a4c27a7ffae40e4a59e96669209c6a85e27f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23003
torch.quantization.fuse_module and torch.nn._intrinsic convRelu and LinearRelu
Fusion function to combine specific modules: (conv,bn) and (conv,bn,relu).
In all cases, replace modules in place. The first module is replaced with the _intrinsic fused module and the remaining modules are replaced by nn.Identity.
Support both training and eval. For training, the modules are "fused" with a sequential container. This is to allow for further module swaps for quantization aware training.
Also add: torch.nn._intrinsic for convRelu and LinearRelu.
TODO: Add tests for _intrinsic modules.
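The in-place replacement pattern can be sketched without the real modules (the `Identity` class and the tuple stand in for the actual `nn.Identity` and `_intrinsic` fused modules):

```python
# Hedged sketch: replace the first module of a (conv, bn) pair with a fused
# stand-in and the second with an identity, mirroring the in-place fusion.
class Identity:
    def __call__(self, x):
        return x

def fuse_pair(modules, first, second):
    fused = ("fused", modules[first], modules[second])  # stand-in for the _intrinsic module
    modules[first] = fused
    modules[second] = Identity()
    return modules

mods = {"conv": "Conv2d", "bn": "BatchNorm2d"}
fuse_pair(mods, "conv", "bn")
```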
Conv BN fusion code is based on DsKhudia's implementation
Differential Revision: D16199720
fbshipit-source-id: 95fb9ffe72b361d280313b2ec57de2acd4f9dda2
Summary:
This adds a replace_module method to the C++ API, which is needed to be able to replace modules.
The primary use case I am aware of is to enable finetuning of models.
Given that finetuning is fairly popular these days, I think it would be good to facilitate this in the C++ api as well.
This has been reported by Jean-Christophe Lombardo on the [forums](https://discuss.pytorch.org/t/finetuning-a-model-on-multiple-gpu-in-c/49195).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22546
Differential Revision: D16440289
Pulled By: yf225
fbshipit-source-id: c136f914b8fc5c0f1975d877ea817fda5c851cda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23022
will be tested in later diffs.
Added LinearReLU module for qat, allows conversion from torch.nn._intrisic.LinearReLU to torch.nn._intrinsic.qat.LinearReLU
Reviewed By: zafartahirov
Differential Revision: D16286800
fbshipit-source-id: 84cce3551d46e649781b9b6107d4076e10e51018
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23181
We can't run dead code elimination after erasing number types because dce relies on graph invariants that erase_number_types breaks.
Reviewed By: houseroad
Differential Revision: D16427819
fbshipit-source-id: d1b98a74d2558b14d4be692219691149689a93d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23180
This pass needs to be run later because it breaks jit graph invariants and the lower_all_tuples pass still needs a valid jit graph.
Reviewed By: houseroad
Differential Revision: D16427680
fbshipit-source-id: 427c7e74c59a3d7d62f2855ed626cf6258107509
Summary:
Creating an untyped generic list is deprecated, we always want type information to be present.
This fixes test cases and removes one that used lists with ambiguous types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23192
ghstack-source-id: 86972891
Differential Revision: D16431482
fbshipit-source-id: 4ca5cd142118a3f0a4dcb8cd77383127c54abb29
Summary:
---
How does the current code subsume all detections in the deleted `nccl.py`?
- The dependency of `USE_NCCL` on the OS and `USE_CUDA` is handled as dependency options in `CMakeLists.txt`.
- The main NCCL detection happens in [FindNCCL.cmake](8377d4b32c/cmake/Modules/FindNCCL.cmake), which is called by [nccl.cmake](8377d4b32c/cmake/External/nccl.cmake). When `USE_SYSTEM_NCCL` is false, the previous Python code deferred the detection to `find_package(NCCL)`. The change in `nccl.cmake` retains this.
- `USE_STATIC_NCCL` in the previous Python code simply changes the name of the detected library. This is done in `IF (USE_STATIC_NCCL)`.
- Now we only need to look at how the lines below line 20 in `nccl.cmake` are subsumed. These lines list paths to header and library directories that NCCL headers and libraries may reside in and try to search these directories for the key header and library files in turn. These are done by `find_path` for headers and `find_library` for the library files in `FindNCCL.cmake`.
* The call of [find_path](https://cmake.org/cmake/help/v3.8/command/find_path.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for headers in `<prefix>/include` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. Like the Python code, this commit sets `CMAKE_PREFIX_PATH` to search for `<prefix>` in `NCCL_ROOT_DIR` and home to CUDA. `CMAKE_SYSTEM_PREFIX_PATH` includes the standard directories such as `/usr/local` and `/usr`. `NCCL_INCLUDE_DIR` is also specifically handled.
* Similarly, the call of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) (Search for `NO_DEFAULT_PATH` in the link) by default searches for libraries in directories including `<prefix>/lib` for each `<prefix>` in `CMAKE_PREFIX_PATH` and `CMAKE_SYSTEM_PREFIX_PATH`. But it also handles the edge cases intended to be solved in the Python code more properly:
- It only searches for `<prefix>/lib64` (and `<prefix>/lib32`) if it is appropriate on the system.
- It only searches for `<prefix>/lib/<arch>` for the right `<arch>`, unlike the Python code searches for `lib/<arch>` in a generic way (e.g., the Python code searches for `/usr/lib/x86_64-linux-gnu` but in reality systems have `/usr/lib/x86_64-some-customized-name-linux-gnu`, see https://unix.stackexchange.com/a/226180/38242 ).
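The prefix-based search being relied on can be sketched in a few lines (a simplification of CMake's `find_path` default behavior; the directory layout is illustrative):

```python
import os
import tempfile

def find_path(header, prefixes):
    """Search <prefix>/include for each prefix, like CMake's find_path default."""
    for prefix in prefixes:
        candidate = os.path.join(prefix, "include", header)
        if os.path.exists(candidate):
            return candidate
    return None

# Illustrative layout: an NCCL_ROOT_DIR-style prefix takes precedence over
# the system prefixes that come later in the search list.
nccl_root = os.path.join(tempfile.mkdtemp(), "nccl")
os.makedirs(os.path.join(nccl_root, "include"))
open(os.path.join(nccl_root, "include", "nccl.h"), "w").close()

found = find_path("nccl.h", [nccl_root, "/usr/local", "/usr"])
```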
---
Regarding for relevant issues:
- https://github.com/pytorch/pytorch/issues/12063 and https://github.com/pytorch/pytorch/issues/2877: These are properly handled, as explained in the updated comment.
- https://github.com/pytorch/pytorch/issues/2941 did not change NCCL detection specifically for Windows (it changed CUDA detection).
- b7e258f81ef61d19b884194cdbcd6c7089636d46 A versioned library detection is added, but the order is reversed: the unversioned library becomes preferred. This is because normally unversioned libraries are linked to versioned libraries and preferred by users, and local installations by users are often unversioned. As the documentation of [find_library](https://cmake.org/cmake/help/v3.8/command/find_library.html) suggests:
> When using this to specify names with and without a version suffix, we recommend specifying the unversioned name first so that locally-built packages can be found before those provided by distributions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22930
Differential Revision: D16440275
Pulled By: ezyang
fbshipit-source-id: 11fe80743d4fe89b1ed6f96d5d996496e8ec01aa
Summary:
Some overlap with https://github.com/pytorch/pytorch/pull/21716 regarding caffe2 nonzero. Will rebase the other one accordingly whichever gets merged first.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22601
Reviewed By: zrphercule
Differential Revision: D16224660
Pulled By: houseroad
fbshipit-source-id: dbfd1b8776cb626601e0bf83b3fcca291806e653
Summary:
https://github.com/pytorch/pytorch/issues/20153
I believe you need 2 passes for this. Take this example
```python
@torch.jit.script
def f():
x = torch.ones(10, 9, 8, 7, 6)
return x[..., None, None].shape
```
which results in `[10, 9, 8, 7, 6, 1, 1]`
vs
```python
@torch.jit.script
def f():
x = torch.ones(10, 9, 8, 7, 6)
return x[..., None, None, :].shape
```
which results in `[10, 9, 8, 7, 1, 1, 6]`
After processing only `x[..., None, None`, we don't know whether we should be creating a new dimension at the end of the dimension list or somewhere in the middle. What we do depends on the elements to the right of it.
Thus, I do 2 passes - one to collect all the dimensions that the index operations operate on, and another that executes the index operations.
This still doesn't work for an ellipsis index followed by a tensor index, but it wasn't working previously either.
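The eager-mode behavior the pass must reproduce can be checked with NumPy, whose basic indexing rules match here:

```python
import numpy as np

x = np.ones((10, 9, 8, 7, 6))

# Trailing ellipsis with nothing after the Nones: new dims land at the end.
shape_a = x[..., None, None].shape      # (10, 9, 8, 7, 6, 1, 1)

# A trailing `:` pins the last dim, so the new dims land before it.
shape_b = x[..., None, None, :].shape   # (10, 9, 8, 7, 1, 1, 6)
```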
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22905
Differential Revision: D16433558
Pulled By: Chillee
fbshipit-source-id: c1b303cb97b1af8b6e405bad33495ef3b4c27c4a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23182
This fixes the issue seen in D16390551
Changing the load op to take in a shapes vector needs changes in lots of places (almost all usages of the load op).
Instead this is a small and safe change where the behavior is unchanged if we are loading multiple blobs and when loading a single blob without shape information.
If you are loading just one blob and the shape information is provided, then this returns the right shape info back.
For all other cases, behavior is unchanged as before we introduced the issue.
This fixes the issue reported by Andrey in D16229465
Reviewed By: boryiingsu
Differential Revision: D16428140
fbshipit-source-id: 8ef6705ab2efb346819489e1f166e23269f7ef8a
Summary:
fbgemm requires AVX512, which requires a more recent compiler, so this also switches all the nightlies from devtoolset3 to devtoolset7. Since CUDA 9.0 doesn't support devtoolset7, we also switch from CUDA 9.0 to CUDA 9.2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22784
Differential Revision: D16428165
Pulled By: pjh5
fbshipit-source-id: c1af3729d8edce88a96fa9069d4c5a1808c25f99
Summary:
We need a way to get a complete list of features that are used in training a model. One way to do this is to make it possible to get the list of features used in each model layer. Then once the model is complete we can go through the layers and aggregate the features.
I've introduced a function to expose that information here, get_accessed_features, and implemented it in the FeatureSparseToDense layer to start with.
I've tried to include the minimum amount of information to make this useful, while making it easy to integrate into the variety of model layers. This is, for example, why AccessedFeatures does not contain feature_names which is not always present in a model layer. I debated whether or not to include feature_type, but I think that's useful enough, and easy enough to figure out in a model layer, that it's worth including.
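The aggregation step can be sketched as follows (the `AccessedFeatures` shape and the layer class here are illustrative stand-ins, not the real Caffe2 classes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessedFeatures:  # illustrative stand-in for the real record
    feature_ids: frozenset
    feature_type: str

class FakeLayer:
    def __init__(self, ids, ftype):
        self._af = AccessedFeatures(frozenset(ids), ftype)

    def get_accessed_features(self):
        return [self._af]

# Once the model is complete, walk the layers and union the feature ids.
layers = [FakeLayer({1, 2}, "float"), FakeLayer({2, 3}, "id_list")]
all_ids = set()
for layer in layers:
    for af in layer.get_accessed_features():
        all_ids |= af.feature_ids
```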
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23036
Test Plan:
Added a unit test to verify the behavior of get_accessed_features in FeatureSparseToDense.
aml_dper2-fblearner-flow-integration-tests failed due to a known issue D16355865
aml_dper3-fblearner-flow-integration-tests failed due to a known issue T47197113
I verified no tests in the integration tests failed to issues other than those known ones.
DPER2 canaries: https://fburl.com/fblearner/1217voga
Reviewed By: volkhin
Differential Revision: D16365380
Pulled By: kevinwilfong
fbshipit-source-id: 2dbb4d832628180336533f29f7d917cbad171950
Summary:
I ran into the following error when trying to pass a Python int as an arg to `torch::jit::createStackForSchema`, and I think it is due to the missing support for `NumberType` in [toIValue](https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/pybind_utils.h#L448).
> RuntimeError: Missing cases in toIValue for type: Scalar! File a bug report. (toIValue at ../torch/csrc/jit/pybind_utils.h:449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22817
Differential Revision: D16276006
Pulled By: mrshenli
fbshipit-source-id: 7f63519bb37219445e836ec1f51ca4f98bf52c44
Summary:
Bumping up the producer_version in exported ONNX models in view of the next release. Updating tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23120
Reviewed By: zrphercule
Differential Revision: D16420917
Pulled By: houseroad
fbshipit-source-id: 6686b10523c102e924ecaf96fd3231240b4219a9
Summary:
`pickle` supports this and a lot of the quantized use cases for get/set
state follow this pattern
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23119
Pulled By: driazati
Differential Revision: D16391234
fbshipit-source-id: 9f63e0a1679daa61b17aa64b5995e2be23b07b50
Summary:
Previously we looked at the stack frame of the function that called
`script` to resolve variables. This doesn't work if someone calls script
with a function defined somewhere else that references captured
variables. We already have a mechanism to look at the closed over
variables for a function, so this changes the `rcb` to use that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22270
Pulled By: driazati
Differential Revision: D16391346
fbshipit-source-id: ad9b314ae86c249251b106079e76a5d7cf6c04c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23166
Changing the load op to take in a shapes vector needs changes in lots of places (almost all usages of the load op).
Instead this is a small and safe change where the behavior is unchanged if we are loading multiple blobs and when loading a single blob without shape information.
If you are loading just one blob and the shape information is provided, then this returns the right shape info back.
For all other cases, behavior is unchanged as before we introduced the issue.
This fixes the issue reported by Andrey in D16229465
Reviewed By: boryiingsu
Differential Revision: D16390551
fbshipit-source-id: 1055b481a7a9e83021209e59f38a7cc0b49003cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23077
Although the difference between running from Python and this is small if the
forward method's loop is long enough (like 1000 iterations in this case).
Reviewed By: mingzhe09088
Differential Revision: D16122343
fbshipit-source-id: 5c1d1b98ae82c996baf9d42bcd04995e2ba60c78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23076
Tracing based and non tracing based added
Reviewed By: mingzhe09088
Differential Revision: D16097280
fbshipit-source-id: 3a137092f7ccc3dd2d29d95e10178ec89d3ce892
Summary:
Update the ScatterWeightedSum op for the case where only one weighted X updates a slice of Y, which is usually the case when the op is used for gradient updates. The change removes the copy overhead, and we see significant operator performance improvements:
- 25-50% improvement on CUDA, depending on input configuration
- ~50% improvement on ROCm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23087
Differential Revision: D16385194
Pulled By: bddppq
fbshipit-source-id: 3189e892940fb9c26305269eb0d47479b9b71af0
Summary:
This is a small patch to not overwrite unchanged files to help a bit with building.
It is not as incremental as one might like, given that one has to pass `--out-of-place-only` to not run into the patching and things.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23112
Differential Revision: D16402623
Pulled By: bddppq
fbshipit-source-id: 531ce0078bc716ae31bd92c5248080ef02a065b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22765
the pooling signature is the same as the non-quantized one. Adding it to the native_functions.yaml
Reviewed By: jerryzh168
Differential Revision: D16102608
fbshipit-source-id: 7627ad8f02a231f488b74d1a245b853f89d9c419
Summary:
USE_{C11,MSC,GCC}_ATOMICS are not used in PyTorch or submodules. Now we remove their underlying detection code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23089
Differential Revision: D16402750
Pulled By: ezyang
fbshipit-source-id: fde84b958eb0b5b4d3f0406acefa92ab30ea43be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21749
This is the first version without "requantization"
Reviewed By: jerryzh168
Differential Revision: D15807940
fbshipit-source-id: 19bb0482abed8ed9d1521a3fa1f15bda8e6a6a7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23096
Nets can have state that depends on the rest of the state in the Workspace. Hence, they should be destructed first.
Reviewed By: ajyu
Differential Revision: D16382987
fbshipit-source-id: 3fd030ba206e2d0e897abb9e31c95bdaeb9482b7
Summary:
Add support for quantization aware training in eager mode
Modifications to Post training flow:
## Prepare
* Fusion: e.g. (Conv, Bn) → ConvBn (float)
* Swapping: To insert fake_quant into the weight, we need to swap the float modules that have weights with the corresponding qat modules, e.g. Conv → torch.nn.qat.Conv, ConvBn → torch.nn._intrinsic.qat.ConvBn
* Previously we were thinking about modifying the weight in a forward_pre hook and changing it back in a forward hook:
```python
def forward_pre_hook(self, input):
    self.float_weight = self.weight
    self.weight = self.fake_quantize(self.float_weight)

def forward_hook(self, input):
    self.weight = self.float_weight
```
* Assignments to self.weight are needed because we can't change the forward function, and the forward function uses self.weight.
* But we will need to keep two copies of weight in this case, so it’s probably better to just swap the module
* So we want to just swap Conv to torch.nn.qat.Conv and Linear to torch.nn.qat.Linear
* qat modules will have fake_quant for output and weights inserted in forward function
## Convert
* flow should be identical to ptq, but the swapping dictionary is slightly different since modules are changed in prepare step.
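The swap step amounts to applying a class-to-class mapping over the model's submodules (a sketch; the classes here stand in for e.g. Conv → torch.nn.qat.Conv):

```python
# Hedged sketch of the module-swapping dictionary used in the convert step.
class Conv: pass
class Linear: pass
class QATConv: pass       # stand-in for torch.nn.qat.Conv
class QATLinear: pass     # stand-in for torch.nn.qat.Linear

SWAP_MAPPING = {Conv: QATConv, Linear: QATLinear}

def swap_modules(named_modules, mapping):
    """Replace each module whose type appears in the mapping, in place."""
    for name, mod in list(named_modules.items()):
        target = mapping.get(type(mod))
        if target is not None:
            named_modules[name] = target()
    return named_modules

model = {"features.0": Conv(), "classifier": Linear()}
swap_modules(model, SWAP_MAPPING)
```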
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23082
ghstack-source-id: 86824650
Differential Revision: D16379374
fbshipit-source-id: 7d16d1acd87025065a24942ff92abf18e9fc8070
Summary:
Overall context: open-source BlackBoxPredictor as the entry
point for inference in Caffe2 (thread safe abstraction for Caffe2
inference). This should be used in ThroughputBenchmark for the purpose
of framework comparison
This specific diff:
There should be no harm in moving transformation code to
OSS. On the advantages side we will be able to compare production
Caffe2 setup with PyTorch in the most fair way via
ThroughputBenchmark. This approach avoids any complicated
transformation registries. Building those properly would be a significant
engineering effort as well as production risk. In the past we had SEVs
related to transforms being turned off due to various refactors. Given
that we don't plan to build any other significant investments into
transformation logic except existing ones (like TVM and Glow), and
those also relate to open-source technologies, I came up to the
conclusion of moving to OSS the whole thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22877
Test Plan:
did a bunch of unit tests locally and now
waitforsandcastle
AdFinder canary:
https://our.intern.facebook.com/intern/ads/canary/419623727275650390
adindexer:
https://our.intern.facebook.com/intern/ads/canary/419623750891549182
prospector:
https://our.intern.facebook.com/intern/ads/canary/419644899887610977
https://our.intern.facebook.com/intern/ads/canary/419645123742738405
Differential Revision: D16267765
Pulled By: salexspb
fbshipit-source-id: 776a1cd5415e0695eae28254b3f155e7a9bd8c2b
Summary:
1. Fix out of range memory access for reduction on all dimensions for non-packed
tensor.
2. Enabling launch config that maps block width to reduction on fastest striding
dimension. This mapping was previously only active when reducing on fastest
striding dimension of packed tensor, which is not necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22827
Differential Revision: D16271897
Pulled By: zdevito
fbshipit-source-id: 20763f6cf9a58e44ffc0e7ec27724dfec8fe2c5d
Summary:
Fixes https://github.com/pytorch/pytorch/issues/22389
In most cases we only import `PIL` methods when we need them, but we missed a spot.
cc lanpa natalialunova sanekmelnikov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23023
Reviewed By: sanekmelnikov
Differential Revision: D16373492
Pulled By: orionr
fbshipit-source-id: b08bf8a9b5a861390eadf62eda21ac055777180f
Summary:
This PR fixes the invalid None return when calling get_all_math_dtype(device='cuda').
The issue came from `list.append`, which returns None, so `return dtypes.append(...)` returned None instead of the list.
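A minimal reproduction of this bug class (the function names here are hypothetical, for illustration only):

```python
def get_dtypes_buggy():
    dtypes = ["float32", "float64"]
    # Bug: list.append mutates in place and returns None,
    # so this returns None instead of the list.
    return dtypes.append("float16")

def get_dtypes_fixed():
    dtypes = ["float32", "float64"]
    dtypes.append("float16")  # mutate first...
    return dtypes             # ...then return the list

assert get_dtypes_buggy() is None
assert get_dtypes_fixed() == ["float32", "float64", "float16"]
```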
Pull Request resolved: https://github.com/pytorch/pytorch/pull/23028
Differential Revision: D16362732
Pulled By: colesbury
fbshipit-source-id: 0bbc30a0c663749d768159f1bc37b99f7263297b
Summary:
This PR aims at improving BERT performance on CPU by using `mkldnn` inner product for `nn.Linear()`.
The current logic is to use `mkldnn` only when `input` tensor is of mkldnn layout. This PR loosens this condition, `mkldnn` will be used for `nn.Linear()` when `input` tensor is of dense layout. The aten tensor is viewed inplace in `mkldnn` without additional memory copy.
1. when `input.dim() >= 3` , it is viewed as 2d tensor. e.g. `[T, N, C]` is treated as `[TN, C]`;
2. when `input` is not contiguous, it is copied so as to be contiguous. `mkldnn` inner product can't handle non-contiguous memory.
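The two conditions above can be sketched as follows. This is a hypothetical numpy stand-in for the mkldnn path, not the actual implementation:

```python
import numpy as np

def linear_2d_view(x, weight, bias):
    # Illustration of the dispatch conditions above: collapse leading
    # dimensions to a 2-D view, and force contiguity, since the inner
    # product needs contiguous memory.
    orig_shape = x.shape
    if x.ndim >= 3:
        x = x.reshape(-1, orig_shape[-1])   # e.g. [T, N, C] -> [T*N, C]
    if not x.flags["C_CONTIGUOUS"]:
        x = np.ascontiguousarray(x)
    y = x @ weight.T + bias
    return y.reshape(*orig_shape[:-1], weight.shape[0])

x = np.random.randn(4, 2, 8)   # [T, N, C]
w = np.random.randn(16, 8)     # [out_features, in_features]
b = np.zeros(16)
assert linear_2d_view(x, w, b).shape == (4, 2, 16)
```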
With this PR, BERT on `glue/MRPC` inference (batch size = 1) on Xeon 6148 single socket (20 cores@2.5GHz) improves by `44%`:
1. before (unit: iterations/sec):
```bash
408/408 [00:24<00:00, 16.69it/s]
```
2. after (unit: iterations/sec):
```bash
408/408 [00:16<00:00, 24.06it/s]
```
The latency reduces from `59.92 ms` to `41.56ms` correspondingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21851
Differential Revision: D16056334
Pulled By: dzhulgakov
fbshipit-source-id: 9b70ed58323b5e2f3f4e3ebacc766a74a8b68a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22732
Add support for quantization aware training in eager mode
Modifications to Post training flow:
## Prepare
* Fusion: e.g. (Conv, Bn) → ConvBn (float)
* Swapping: To insert fake_quant for the weight, we need to swap the float modules that have weights with the corresponding qat modules, e.g. Conv → torch.nn.qat.Conv, ConvBn → torch.nn._intrinsic.qat.ConvBn
```
# Previously we considered modifying the weight in a forward_pre_hook
# and changing it back in a forward_hook:
def forward_pre_hook(self, input):
    self.float_weight = self.weight
    self.weight = self.fake_quantize(self.float_weight)

def forward_hook(self, input):
    self.weight = self.float_weight
```
* Assignments to self.weight are needed because we can’t change the forward function, and the forward function uses self.weight.
* But we will need to keep two copies of weight in this case, so it’s probably better to just swap the module
* So we want to just swap Conv to torch.nn.qat.Conv and Linear to torch.nn.qat.Linear
* qat modules will have fake_quant for output and weights inserted in forward function
## Convert
* flow should be identical to ptq, but the swapping dictionary is slightly different since modules are changed in prepare step.
Reviewed By: zafartahirov
Differential Revision: D16199356
fbshipit-source-id: 62aeaf47c12c62a87d9cac208f25f7592e245d6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22714
We need this module for add fake_quant for weight
Reviewed By: zafartahirov
Differential Revision: D16193585
fbshipit-source-id: ed6c04ecf574ca1fe1dcded22c225da05976f7a3
Summary:
When working on https://github.com/pytorch/pytorch/pull/22762, we discovered that we haven't actually deprecated legacy autograd function. This PR puts up the deprecation warning for 1.2, with the goal to remove legacy function support completely in the near future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22922
Differential Revision: D16363916
Pulled By: yf225
fbshipit-source-id: 4b554010a3d1f87a3fa45cc1aa29d019c8f1033c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22950
Print quantized tensor by first dequantizing it and then printing. Also print the scale, zero_point, size, and type of the tensor.
Reviewed By: jerryzh168
Differential Revision: D16286397
fbshipit-source-id: 2d6fb1796e5b329a77c022b18af0a39f6edde0d7
Summary:
We are planning to put up a deprecation warning for legacy autograd function in 1.2: https://github.com/pytorch/pytorch/pull/22922. This PR removes all usage of legacy function in PyTorch core and test suite, to prepare for the eventual removal of legacy function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22925
Differential Revision: D16344834
Pulled By: yf225
fbshipit-source-id: 8bf4cca740398835a08b7a290f3058c3e46781ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22316
Adding the quantized ReLU to native_functions.yaml, as it has the same signature as the non-quantized relu
Reviewed By: jerryzh168
Differential Revision: D16038441
fbshipit-source-id: 1cfbb594eb9bca1b7ec49ca486defcf1908b0d26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22966
We want to implement "trimmed lasso" for feature selection with learnable and regularizable weights. Trimmed lasso is a simple yet powerful improved version from traditional lasso. More reference can be found at https://arxiv.org/abs/1708.04527 and http://proceedings.mlr.press/v97/yun19a.html. For quick and necessary intro, please refer to P1-3 of the paper at https://arxiv.org/abs/1708.04527.
Given n weights, traditional lasso sums up all weights' l1 norms. The trimmed lasso takes an input integer k (how many weights you want to select from n) and only sums over the smallest n - k weights. Given lambda as the regularization constant, the penalty term is only on the smallest n - k weights, but not other larger weights. If lambda becomes larger than certain threshold, the smallest n - k weights are shrunk to zero. That means we have those weights "dropped". With this property, the number k is the number of weights left after lasso, which we can easily control.
Meanwhile, we further support all available regularization in a single interface. Current supported regularizers on weights include no reg, l1, l2, elastic, trimmed l1, elastic with trimmed l1, group l1, and logbarrier.
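The trimmed lasso penalty described above can be sketched in a few lines (an illustrative sketch, not the actual Caffe2 operator):

```python
import numpy as np

def trimmed_lasso_penalty(weights, k, lam):
    # Trimmed lasso: penalize only the n - k smallest |w_i|,
    # leaving the k largest-magnitude weights unregularized.
    abs_w = np.sort(np.abs(weights))   # ascending order
    n = abs_w.size
    return lam * abs_w[: n - k].sum()

w = np.array([0.01, -0.02, 3.0, -4.0, 0.05])
# k = 2: only the three smallest magnitudes (0.01, 0.02, 0.05) are penalized,
# so with a large enough lambda those three weights shrink to zero.
assert np.isclose(trimmed_lasso_penalty(w, k=2, lam=1.0), 0.08)
```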
Differential Revision: D16326492
fbshipit-source-id: 6e1fd75606005d9bc09d6650435c96a7984ba69c
Summary:
Given that python 2.7 will be EOL'd on Jan 1, 2020 and we have models depending on python3.5+, we'd like to update the ROCm CI across the board to python3.6.
This PR adds the skip tests and some semantic changes for PyTorch.
Open tasks/questions:
* RoiAlignTest.CheckCPUGPUEqual fails in the Caffe2 unit tests. Is this expected / can it be skipped?
* for testing, I've used update-alternatives on CentOS/Ubuntu to select python == python 3.6. Is this the preferred way?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22322
Differential Revision: D16199862
Pulled By: ezyang
fbshipit-source-id: 46ca6029a232f7d23f3fdb5efc33ae39a379fca8
Summary:
Fixes https://github.com/pytorch/pytorch/issues/21935 by using the integer floor division that was introduced for convolution shapes in https://github.com/pytorch/pytorch/issues/9640. Without this fix, the pooling operators can produce a 1-element output in cases they shouldn't.
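The shape computation in question can be sketched with a hypothetical helper mirroring the standard pooling output-size formula; without integer floor division, rounding could yield a spurious extra output element:

```python
import math

def pool_output_size(input_size, kernel, stride, padding, ceil_mode=False):
    # out = floor((in + 2*pad - kernel) / stride) + 1, or ceil in ceil_mode.
    numer = input_size + 2 * padding - kernel
    if ceil_mode:
        out = math.ceil(numer / stride) + 1
        # ensure the last pooling window starts inside the (padded) input
        if (out - 1) * stride >= input_size + padding:
            out -= 1
    else:
        out = numer // stride + 1   # integer floor division
    return out

assert pool_output_size(7, kernel=2, stride=2, padding=0) == 3
assert pool_output_size(7, kernel=2, stride=2, padding=0, ceil_mode=True) == 4
```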
Disclaimer: I couldn't properly test it locally (it's not picking up the modified version for some reason). I'm marking this WIP until I've checked what the CI tools say...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22304
Differential Revision: D16181955
Pulled By: ezyang
fbshipit-source-id: a2405372753572548b40616d1206848b527c8121
Summary:
This cleans up the `torch.utils.tensorboard` API to remove all kwargs usage (which isn't clear to the user) and removes the "experimental" warning in prep for our 1.2 release.
We also don't need the additional PyTorch version checks now that we are in the codebase itself.
cc ezyang lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21786
Reviewed By: natalialunova
Differential Revision: D15854892
Pulled By: orionr
fbshipit-source-id: 06b8498826946e578824d4b15c910edb3c2c20c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22958
When we use `extension_loader.DlopenGuard()` to dyndep or import modules, it sets the `RTLD_GLOBAL` flag and restores the original flags after the `yield`. However, if the module is not there, the yield will fail and the flags won't be restored, creating all kinds of symbol conflict problems.
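The fix pattern is to restore the flags in a finally block. A minimal sketch on Unix (hypothetical, not the actual extension_loader code):

```python
import contextlib
import os
import sys

@contextlib.contextmanager
def dlopen_guard():
    # Restore the dlopen flags in a finally block, so a failing import
    # inside the `with` body cannot leave RTLD_GLOBAL set.
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
    try:
        yield
    finally:
        sys.setdlopenflags(old_flags)

before = sys.getdlopenflags()
try:
    with dlopen_guard():
        import module_that_does_not_exist  # noqa: F401
except ImportError:
    pass
assert sys.getdlopenflags() == before  # flags restored despite the failure
```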
Reviewed By: bddppq
Differential Revision: D16311949
fbshipit-source-id: 7b9ec6d60423ec5e78cae694b66c2f17493840b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22830
Separating the tensor generation and the generation of the quantization parameters
- Introducing hypothesis filter `assume_not_overflowing`, which makes sure that the generated tensor and qparams play well with each other. **Note: This is an expensive filter!**
- `qtensor` -> Renamed to `tensor`
- `qtensor_conv` -> Renamed to `tensor_conv2d`
- The tensors don't return the quantization parameters anymore, use `qparams` for it
- The `dtypes` argument is just a quantized dtype now.
- The enforcement for zero_point is predefined as before: if set to `None`, the zero_point will be sampled. However, sampling can be overridden with `zero_point_min` and `zero_point_max`
- Scale sampling can also be overridden using `scale_min` and `scale_max`
Reviewed By: jerryzh168
Differential Revision: D16234314
fbshipit-source-id: 5b538a5aa9772b7add4f2ce5eff6fd0decd48f8e
Summary:
ONNX uses virtualenv and PyTorch doesn't, so the --user flag is causing problems in the ONNX CI.
Fix it by moving the flag to PyTorch-only scripts; ninja will be installed separately in the ONNX CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22946
Reviewed By: bddppq
Differential Revision: D16297781
Pulled By: houseroad
fbshipit-source-id: 52991abac61beaf3cfbcc99af5bb1cd27b790485
Summary:
…te argument in macro
Changelog:
- Update note about tensors on CPU for the following MAGMA functions
- magma_(d/s)getrf_gpu and magma_getrf_nopiv_gpu require tensors on CPU for pivots
- magma_(d/s)geqrf2_gpu requires tensors on CPU for elementary reflectors
- magma_(d/s)syevd_gpu requires tensors on CPU for eigenvalues
- Remove dummy tensor in ALLOCATE_ARRAY MACRO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22618
Test Plan:
- All existing tests should pass to verify that the patch is correct
This PR has been proposed to eliminate confusion due to the previous comments, as indicated in https://github.com/pytorch/pytorch/issues/22573
Differential Revision: D16286198
Pulled By: zou3519
fbshipit-source-id: a5a6ec829084bdb752ca6006b8795227cbaf63b1
Summary:
This fixes up the test suite (mostly just adding `ignore` decorations
to tests that need to call Python function) so that it all passes with
recursive script enabled.
The main user-facing result of this change is that Python functions are
compiled without any decorators, so non-TorchScriptable code must be
decorated with `torch.jit.ignore` (or
`torch.jit.ignore(drop_on_export=True)` to maintain the functionality of
the current `ignore`)
Details can be found in #20939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22887
Pulled By: driazati
Differential Revision: D16277608
fbshipit-source-id: 0abd0dc4291cf40651a1719bff813abb2b559640
Summary:
Motivation:
The forward method of MultiheadAttention has a kwarg key_padding_mask. This mask is of shape (N,S) where N is batch and S is sequence length. It is applied prior to the attention softmax, where positions that are True in the mask have their scores set to float('-inf'). This allows you to mask position j from attention for all positions i in the input sequence. It's typically used to mask padded inputs, so for a sample in a batch we can make sure no encoder outputs depend on padding inputs. Currently the Transformer, TransformerEncoder, and TransformerEncoderLayer do not have this kwarg, and only have options for (S,S), (T,T), and (S,T) masks, which are applied equally across the batch for source input, target output, and target-source memory respectively. These masks can't be used for padding and are instead used for things like subsequent masking in language modeling, by masking the attention of position i to position j.
This diff exposes the key_padding_mask to Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods which is ultimately passed to MultiheadAttention forward.
Open question: should we also allow a key_padding_mask for the decoder layer? As padding is usually at the end of each sentence in a batch and sentences are usually decoding from left to right, usually people deal with padding on decoded outputs by just masking those outputs at the loss layer. There might be some scenarios where it's needed though I don't think it would be common. People can also still just subclass and override the layers. We could also pass the input key_padding_mask to the memory <> decoder attention layer. Not sure if that's necessary though because the output of position i from each attention encoder layer won't depend on any masked positions in the input (even if position i is a masked position itself) so there's not really any point in masking position i again.
Adds the key_padding_mask kwarg to Transformer, TransformerEncoder, and TransformerEncoderLayer forward methods.
The standard TransformerEncoderLayer uses a MultiheadAttention layer as self_attn. MultiheadAttention forward method has a key_padding_mask kwarg that allows for masking of values such as padding per sequence in a batch, in contrast to the attn_mask kwarg which is usually of shape (S,S) and applied equally across the batch.
MultiheadAttention calls functional.multi_head_attention_forward, which has the same key_padding_mask kwarg of shape (N,S). Masked (True) values are set to float('-inf').
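The masking semantics can be illustrated in numpy (an illustrative sketch of what happens before the softmax, not the torch implementation; `masked_attention_weights` is a hypothetical helper):

```python
import numpy as np

def masked_attention_weights(scores, key_padding_mask):
    # scores: (N, S, S) raw attention scores; key_padding_mask: (N, S) bool
    # where True marks a padding key. Masked key positions are set to -inf
    # before softmax, so no query position can attend to padding.
    scores = scores.copy()
    expanded = np.repeat(key_padding_mask[:, None, :], scores.shape[1], axis=1)
    scores[expanded] = -np.inf
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((1, 3, 3))
mask = np.array([[False, False, True]])  # last token in the sequence is padding
attn = masked_attention_weights(scores, mask)
assert np.allclose(attn[0, :, 2], 0.0)      # nothing attends to the padding key
assert np.allclose(attn.sum(axis=-1), 1.0)  # rows still normalize to 1
```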
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22588
Test Plan:
buck test mode/dev caffe2/test:nn -- 'test_transformerencoderlayer \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_Transformer_cell \(test_nn\.TestNN\)'
buck test mode/dev caffe2/test:nn -- 'test_transformer_args_check \(test_nn\.TestNN\)'
Differential Revision: D16112263
Pulled By: lucasgadams
fbshipit-source-id: dc4147dd1f89b55a4c94e8c701f16f0ffdc1d1a2
Summary:
Asterisks start emphases in rst. We should either escape them or put them as interpreted text.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22896
Differential Revision: D16282869
Pulled By: zou3519
fbshipit-source-id: 15ec4286434db55fb8357b1a12e6f70ef54f8c66
Summary:
The sccache wrapping strategy causes problems for at-runtime kernel
compilation of MIOpen kernels. We therefore - after the builds of
caffe2/pytorch are complete - unwrap sccache again by moving the clang-9
actual binary back into its original place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22743
Differential Revision: D16283329
Pulled By: bddppq
fbshipit-source-id: 4fcdc92be295d5ea9aba75c30e39af1a18a80c13
Summary:
This is achieved by using `cuDevicePrimaryCtxGetState` as a way to check whether a primary context exists on a device. It is not too slow, from this benchmark of a single call to it on CUDA 10.1, Titan Xp, driver 415.27:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
BM_cuDevicePrimaryCtxGetState 301 ns 301 ns 2319746
```
Commits:
1. Add `CUDAHooks::getDeviceWithPrimaryContext` which returns a device index with primary context (if exists).
Link `c10/cuda` against `libcuda` for device API calls.
2. Use `getDeviceWithPrimaryContext` to check primary context in `pin_memory`.
Fix `OptionalDeviceGuard` doc.
3. Refactor `test_cuda_primary_ctx.py` to support multiple tests.
Add test for this in that file.
Fixes https://github.com/pytorch/pytorch/issues/21081.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22229
Differential Revision: D16170194
Pulled By: zou3519
fbshipit-source-id: 485a45f211b7844c9e69c63f3b3b75194a796c5d
Summary:
…te argument in macro
Changelog:
- Update note about tensors on CPU for the following MAGMA functions
- magma_(d/s)getrf_gpu and magma_getrf_nopiv_gpu require tensors on CPU for pivots
- magma_(d/s)geqrf2_gpu requires tensors on CPU for elementary reflectors
- magma_(d/s)syevd_gpu requires tensors on CPU for eigenvalues
- Remove dummy tensor in ALLOCATE_ARRAY MACRO
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22618
Test Plan:
- All existing tests should pass to verify that the patch is correct
This PR has been proposed to eliminate confusion due to the previous comments, as indicated in https://github.com/pytorch/pytorch/issues/22573
Differential Revision: D16227440
Pulled By: zou3519
fbshipit-source-id: 97d5537c5da98c0ed3edc4668a09294794fc426b
Summary:
…rides
Changelog:
- Fix behavior of `torch.triu` / `torch.tril` on certain unsqueezed tensors that lead to uninitialized values on CPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22730
Test Plan:
- Add tests for these cases in test_triu_tril in test_torch
Fixes https://github.com/pytorch/pytorch/issues/22581
Differential Revision: D16222897
Pulled By: zou3519
fbshipit-source-id: b86b060187797e5cd2a7731421dff1ba2b5c9596
Summary:
Align the behavior of `torch.utils.cpp_extension.CUDA_HOME` with that of `tools.setup_helpers.cuda.CUDA_HOME`.
Specifically, I swapped the position of guess 2 and guess 3 in `torch.utils.cpp_extension.CUDA_HOME`.
Fixing issue https://github.com/pytorch/pytorch/issues/22844
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22845
Differential Revision: D16276241
Pulled By: zou3519
fbshipit-source-id: 3b62b439b2f794a6f3637a5fee58991f430985fe
Summary:
We introduced RTTI in recent change: https://github.com/pytorch/pytorch/pull/21613
For the internal mobile build we don't enable '-frtti' yet. This diff is trying to replace
RTTI with an alternative approach.
According to dzhulgakov we could compare two tensors' type_id directly in most cases -
this is stricter than comparing the TensorImpl subclass type, as the TensorImpl -> type_id
mapping is 1-to-n, but it's more proper for this use case.
The only two cases where we can relax direct type comparison (for legacy reason) are:
1. CPUTensor <-> CUDATensor;
2. SparseCPUTensor <-> SparseCUDATensor;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22773
Differential Revision: D16277696
Pulled By: ljk53
fbshipit-source-id: 043e264fbacc37b7a11af2046983c70ddb62a599
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22892
Think of num_runs as manually running the binary <num_runs> times. Each run runs the operator for many iterations.
Reviewed By: hl475
Differential Revision: D16271597
fbshipit-source-id: b6f509ee0332c70f85bec0d447b84940c5c0cecd
Summary:
Since recursive script creates a ScriptModule from an `nn.Module`,
there's no ties to the original module to pull a type name from, so we
have to explicitly pass it in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22873
Pulled By: driazati
Differential Revision: D16268547
fbshipit-source-id: 902a30e6e36427c6ba7033ded027a29d9dcbc1ee
Summary:
Changelog:
- Port SVD TH implementation to ATen/native/BatchLinearAlgebra.cpp
- Port SVD THC implementation to ATen/native/cuda/BatchLinearAlgebra.cu
- Allow batches of matrices as arguments to `torch.svd`
- Remove existing implementations in TH and THC
- Update doc string
- Update derivatives to support batching
- Modify nuclear norm implementation to use at::svd instead of _batch_svd
- Remove _batch_svd as it is redundant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21588
Test Plan:
- Add new test suite for SVD in test_torch.py with port to test_cuda.py
- Add tests in common_methods_invocations.py for derivative testing
Differential Revision: D16266115
Pulled By: nairbv
fbshipit-source-id: e89bb0dbd8f2d58bd758b7830d2389c477aa61fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22517
Force anybody creating an untyped Dict to call c10::impl::deprecatedUntypedDict().
This should hopefully make it clear that this is not public API and prevent people from using it.
Reviewed By: dzhulgakov
Differential Revision: D16115214
fbshipit-source-id: 2c8d0e4e375339c699d583995f79c05c59693c3e
Summary:
Introduce Azure Pipelines for the linting checks. This is meant to be equivalent to the existing Travis linting phase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22839
Differential Revision: D16260376
Pulled By: ezyang
fbshipit-source-id: 1e535c3096358be67a0dad4cd920a92082b2d18e
Summary:
As part of the Variable/Tensor merge, `variable.tensor_data()` should be removed in favor of `variable.detach()`. This PR removes `tensor_data()` call sites in Python `Variable()` and `nn.Parameter()` constructor paths.
Note that this PR is BC-breaking in the following way:
- For Python `Variable()` constructor:
Previously, in-place updating a tensor after it's been used to create a Variable does not bump the Variable's version counter, which causes the following problem:
```python
t = torch.ones(2, 3)
v = torch.autograd.Variable(t).requires_grad_()
y = v * v
t.add_(1) # This bumps version counter of `t`
y.sum().backward() # This computes `v`'s gradient incorrectly before this patch, and throws error after this patch
```
After this patch, in-place updating a tensor after it's been used to create a Variable will also bump the Variable's version counter, thus preserving the correctness of the Variable's version counter.
- For Python `nn.Parameter()` constructor:
Previously, in-place updating a tensor after it's been used to create an nn.Parameter does not bump the nn.Parameter's version counter, which causes the following problem:
```python
t = torch.ones(2, 3)
v = torch.nn.Parameter(t)
y = v * v
t.add_(1) # This bumps version counter of `t`
y.sum().backward() # This computes `v`'s gradient incorrectly before this patch, and throws error after this patch
```
After this patch, in-place updating a tensor after it's been used to create an nn.Parameter will also bump the nn.Parameter's version counter, thus preserving the correctness of the nn.Parameter's version counter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22821
Differential Revision: D16258030
Pulled By: yf225
fbshipit-source-id: 9a6d68cea1864893193dbefbb6ef0c1d5ca12d78
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22829
Sending out the caffe2 load op changes separately since we want to pick it to open source.
This change is needed because the shape information of the blobs is determined from the load operator and that shape information is needed in our download_group.
Reviewed By: boryiingsu
Differential Revision: D16229465
fbshipit-source-id: f78b2df9a7f26968d70eca68dde75cd11ab6f7a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22323
This diff adds an interface to use quantized Linear op in JIT.
Reviewed By: jamesr66a
Differential Revision: D16040724
fbshipit-source-id: 90e90aff9973c96ea076ed6a21ae02c349ee2bcf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22023
This diff implements a Linear operation with fp16 weights based on FBGEMM. At a high level, we want to perform the following operation:
Y = X * W + B with dtypes:
(fp32, fp32, fp16, fp32)
To do that, three steps are needed:
1. Quantize weights from fp32 to fp16, this is done using `PackedGemmMatrixFP16` in the `fbgemm_pack_gemm_matrix_fp16`
2. Conduct matrix multiplication with quantized weights using `cblas_gemm_compute` in `fbgemm_linear_fp16_weight`
3. Add bias to the result from step2 and return the final Y
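The three steps can be sketched with numpy (a minimal sketch; `linear_fp16_weight` here is a hypothetical stand-in for the FBGEMM-backed op, not its real signature):

```python
import numpy as np

def linear_fp16_weight(x, w_fp32, b):
    # Step 1: quantize/pack the weights fp32 -> fp16 (done once at pack time)
    w_fp16 = w_fp32.astype(np.float16)
    # Step 2: matrix multiplication in fp32 with the half-precision weights
    y = x @ w_fp16.astype(np.float32).T
    # Step 3: add the fp32 bias and return the final Y
    return y + b

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(4, 8).astype(np.float32)
b = np.zeros(4, dtype=np.float32)
y = linear_fp16_weight(x, w, b)
assert y.shape == (2, 4)
# fp16 weights lose some precision, but results stay close to full fp32.
assert np.allclose(y, x @ w.T + b, atol=1e-1)
```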
Reviewed By: jianyuh
Differential Revision: D15921768
fbshipit-source-id: dc4e5b366f846ce9d58975876940a9b3372b8b8d
Summary:
Add support for breaks and continues in the jit. We do this with a Graph transform pre-SSA.
A graph of the form
```
def test():
while i < 5:
if i == 3:
break
i += 1
print(i)
```
has the body of the loop transformed to
```
if i == 3:
did_break = True
else:
did_break = False
if did_break:
loop_exit = True
else:
i += 1
print(i)
loop_exit = i < 5
```
I am going to add more tests but I think it is ready for review now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21692
Differential Revision: D16215807
Pulled By: eellison
fbshipit-source-id: 365102f42de4861d9323caaeb39a96de7619a667
Summary:
This is an extension to the original PR https://github.com/pytorch/pytorch/pull/21765
1. Increase the coverage of different opsets support, comments, and blacklisting.
2. Adding backend tests for both caffe2 and onnxruntime on opset 7 and opset 8.
3. Reusing onnx model tests in caffe2 for onnxruntime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22421
Reviewed By: zrphercule
Differential Revision: D16225518
Pulled By: houseroad
fbshipit-source-id: 01ae3eed85111a83a0124e9e95512b80109d6aee
Summary:
Using PMCTest (https://www.agner.org/optimize/) to measure
TensorIterator construction, this results in ~600 fewer instructions
retired (~300 fewer cycles) for constructing TensorIterator on a 1D
tensor. (Should be roughly ~100 ns, but it's hard to measure that
precisely end-to-end).
```
Before:
Clock Core cyc Instruct Uops L1D Miss
5082 2768 5690 7644 3
After:
Clock Core cyc Instruct Uops L1D Miss
4518 2437 5109 6992 0
```
Note that Instruct is reliable, Core cyc is a little noisy, and Clock
is a little more noisy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22756
Differential Revision: D16207777
Pulled By: VitalyFedyunin
fbshipit-source-id: bcc453a90472d9951a1c123bcb1b7a243fde70ac
Summary:
Speeds up the common case where Tensor is a torch.Tensor (not a
subclass). This reduces the number of executed instructions for a
torch.add(tensor1, tensor2) by ~328 (should be ~65 ns faster).
Note that most of the PythonArgs accessors are too large to be inlined.
We should move most of them to the cpp file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22782
Differential Revision: D16223592
Pulled By: colesbury
fbshipit-source-id: cc20f8989944389d5a5e3fab033cdd70d581ffb1
Summary:
This PR aims at improving `topk()` performance on CPU. This is useful when computing **beam search** during `Transformer` and `BERT`.
Given a tensor x of size `[N, C]` to which we want to apply `x.topk(K)`, the current logic is to **sequentially** loop over the dimension `N` and do **quick select** over the dimension `C` to find the top K elements.
Performance can be further improved in two ways:
- The loop over `N` can be parallelized
- A faster selection algorithm can be used for `topk` (after a bunch of experimenting, `std::partial_sort` seems to be the most promising)
So i compared 3 versions:
1. vanilla: sequential + quick select
2. reference PR https://github.com/pytorch/pytorch/issues/19737: parallel + quick select
3. this PR: parallel + partial sort
with the following benchmark, on `Xeon 8180, 2*28 cores@2.5 GHz`:
```python
import torch
from time import time
num_iters = 1000
def bench_topk(N=8, C=168560, k=10):
    a = torch.randn(N, C)
    # warm up
    for i in range(100):
        torch.topk(a, k)
    t = 0
    for i in range(num_iters):
        a = torch.randn(N, C)
        start = time()
        value, indice = torch.topk(a, k)
        t += time() - start
    print("#[%d, %d] times: %f ms" % (N, C, t / num_iters * 1000))

Ns = [10, 20, 30]
Cs = [10000, 20000, 40000, 80000, 160000, 320000]
for n in Ns:
    for c in Cs:
        bench_topk(N=n, C=c)
```
### vanilla: sequential + quick select
```
#[10, 10000] times: 0.746740 ms
#[10, 20000] times: 1.437399 ms
#[10, 40000] times: 2.832455 ms
#[10, 80000] times: 5.649426 ms
#[10, 160000] times: 11.309466 ms
#[10, 320000] times: 22.798765 ms
#[20, 10000] times: 1.511303 ms
#[20, 20000] times: 2.822024 ms
#[20, 40000] times: 5.564770 ms
#[20, 80000] times: 11.443044 ms
#[20, 160000] times: 22.747731 ms
#[20, 320000] times: 46.234449 ms
#[30, 10000] times: 2.214045 ms
#[30, 20000] times: 4.236179 ms
#[30, 40000] times: 8.418577 ms
#[30, 80000] times: 17.067578 ms
#[30, 160000] times: 33.826214 ms
#[30, 320000] times: 68.109420 ms
```
### reference PR: parallel + quick select
```
#[10, 10000] times: 0.271649 ms
#[10, 20000] times: 0.593016 ms
#[10, 40000] times: 1.133518 ms
#[10, 80000] times: 2.082355 ms
#[10, 160000] times: 4.049928 ms
#[10, 320000] times: 7.321285 ms
#[20, 10000] times: 0.315255 ms
#[20, 20000] times: 0.539054 ms
#[20, 40000] times: 1.000675 ms
#[20, 80000] times: 1.914586 ms
#[20, 160000] times: 4.437122 ms
#[20, 320000] times: 8.822445 ms
#[30, 10000] times: 0.347209 ms
#[30, 20000] times: 0.589947 ms
#[30, 40000] times: 1.102814 ms
#[30, 80000] times: 2.112201 ms
#[30, 160000] times: 5.186837 ms
#[30, 320000] times: 10.523023 ms
```
### this PR: parallel + partial sort
```
#[10, 10000] times: 0.150284 ms
#[10, 20000] times: 0.220089 ms
#[10, 40000] times: 0.521875 ms
#[10, 80000] times: 0.965593 ms
#[10, 160000] times: 2.312356 ms
#[10, 320000] times: 4.759422 ms
#[20, 10000] times: 0.167630 ms
#[20, 20000] times: 0.265607 ms
#[20, 40000] times: 0.471477 ms
#[20, 80000] times: 0.974572 ms
#[20, 160000] times: 3.269645 ms
#[20, 320000] times: 6.538608 ms
#[30, 10000] times: 0.204976 ms
#[30, 20000] times: 0.342833 ms
#[30, 40000] times: 0.589381 ms
#[30, 80000] times: 1.398579 ms
#[30, 160000] times: 3.904077 ms
#[30, 320000] times: 9.681224 ms
```
In summary, `2` is **5x** faster than `vanilla` on average and `3` is **8.6x** faster than `vanilla`.
On `Fairseq Transformer`, the default parameter on dataset `wmt14` would have a `topk` size of `[8, 168560]`, and this operator gets `3x` faster with this PR.
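The "parallel + partial sort" idea can be sketched in Python, with `heapq.nlargest` standing in for `std::partial_sort` and a thread pool standing in for the parallel loop over `N` (an illustrative sketch, not the C++ implementation):

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def topk_rows(x, k):
    # Each row is processed independently (parallel over N); within a row
    # only the top k elements are selected and kept in sorted order,
    # analogous to std::partial_sort over the C dimension.
    def row_topk(row):
        return heapq.nlargest(k, range(len(row)), key=row.__getitem__)
    with ThreadPoolExecutor() as pool:
        indices = list(pool.map(row_topk, x))
    values = [[row[i] for i in idx] for row, idx in zip(x, indices)]
    return values, indices

vals, idxs = topk_rows([[3, 1, 4, 1, 5], [9, 2, 6, 5, 3]], k=2)
assert vals == [[5, 4], [9, 6]]
assert idxs == [[4, 2], [0, 2]]
```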
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19736
Differential Revision: D16204820
Pulled By: VitalyFedyunin
fbshipit-source-id: ea70562c9149a0d832cf5872a891042ebd74fc63
Summary:
For three 1-D operands, compute_strides now takes 298 instructions instead
of 480. (Saves ~36 ns). We'll want to make Tensor::sizes(), strides(), and
element_size() trivially inlinable to speed this up more.
(Using PMCTest from https://www.agner.org/optimize/ to measure instructions retired)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22779
Differential Revision: D16223595
Pulled By: colesbury
fbshipit-source-id: e4730755f29a0aea9cbc82c2d376a8e6a0c7bce8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22781
The custom op is required to make the op benchmark work with JIT. Run `python setup.py install` in the pt_extension directory to install it; it is required.
Reviewed By: hl475
Differential Revision: D16214430
fbshipit-source-id: c9221c532011f9cf0d5453ac8535a6cde65e8376
Summary:
Currently ONNX constant folding (`do_constant_folding=True` arg in `torch.onnx.export` API) supports only opset 9 of ONNX. For opset 10, it is a no-op. This change enables ONNX constant folding for opset 10. Specifically there are three main changes:
1) Turn on constant folding ONNX pass for opset 10.
2) Update support for opset 10 version of `onnx::Slice` op for backend computation during constant folding.
3) Enable constant folding tests in `test/onnx/test_utility_funs.py` for multiple opsets (9 and 10).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22515
Reviewed By: zrphercule
Differential Revision: D16189336
Pulled By: houseroad
fbshipit-source-id: 3e2e748a06e4228b69a18c5458ca71491bd13875
Summary:
1. Update on restricting block.z <= 64, compliant with the CUDA maximum z-dimension of
a block;
2. clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22602
Differential Revision: D16203857
Pulled By: ezyang
fbshipit-source-id: 567719ae175681a48eb0f818ca0aba409dca2550
Summary:
Some other environment variables can be added to speed things up for development.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22736
Differential Revision: D16200904
Pulled By: soumith
fbshipit-source-id: 797ef91a863a244a6c96e0adf64d9f9b4c9a9582
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22706
Moved the models used for quantization test from the test_quantization.py file to common_quantization.py
Reviewed By: jerryzh168
Differential Revision: D16189865
fbshipit-source-id: 409b43454b6b3fe278ac16b1affb9085d6ed6835
Summary:
Previously in tracing when we called a script function we would inline the graph and set the graph inputs equal to the types the graph was invoked with.
This breaks for optional arguments invoked with None since we rely on None being set to Optional[T] in schema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22686
Differential Revision: D16186372
Pulled By: eellison
fbshipit-source-id: e25c807c63527bf442eb8b31122d50689c7822f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22694
Move quantization and quantized utility functions for testing to common_quantized.py and common_quantization.py. Additionally, add a quantized test case base class which contains common methods for checking the results of quantization on modules. As a consequence of the move, fixed the imports at the top of test_quantized.py and test_quantization to use the new utilities.
Reviewed By: jerryzh168
Differential Revision: D16172012
fbshipit-source-id: 329166af5555fc829f26bf1383d682c25c01a7d9
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22631
Test Plan:
test suite
Imported from OSS
Differential Revision: D16185040
fbshipit-source-id: 9b83749f6c9cd05d13f54a3bb4801e263293252b
Summary:
After converting BN layers to SyncBN layers, the function will set all `requires_grad = True` regardless of the original requires_grad states. I think it is a bug and have fixed it in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22569
Differential Revision: D16151647
Pulled By: zou3519
fbshipit-source-id: e2ad1886c94d8882485e7fb8be51ad76469ecc67
Summary:
Addressing potential dependency issue by adding forward declaration for OutputArchive/InputArchive.
This change follows the same pattern in base.h in 'torch/csrc/api/include/torch/data/samplers/base.h'
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22562
Differential Revision: D16161524
Pulled By: soumith
fbshipit-source-id: d03f8a2ece5629762f9fa8a27b15b0d037e8f07b
Summary:
Also revert the change of cmake.py in
c97829d7011bd59d662f6af9c3a0ec302e7e75fc. Comments are added to
prevent similar incidents in the future (which have occurred a couple of times in the past).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22641
Differential Revision: D16171763
Pulled By: ezyang
fbshipit-source-id: 5a65f9fbb3c1c798ebd25521932bfde0ad3d16fc
Summary:
No need to `clone` if the expanded size matches original size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22634
Differential Revision: D16171091
Pulled By: ezyang
fbshipit-source-id: 3d8f116398f02952488e321c0ee0ff2868768a0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21209
This diff introduces a new interface to add a list of operators. Here are the steps to add ops using this interface:
- create op_list:
```
unary_ops_list = op_bench.op_list(
attr_names=["op_name", "op_function"],
attrs=[
["abs", torch.abs],
["abs_", torch.abs_],
],
)
```
- create a bench class:
```
class UnaryOpBenchmark(op_bench.TorchBenchmarkBase):
def init(self, M, N, op_function):
self.input_one = torch.rand(M, N)
self.op_func = op_function
def forward(self):
return self.op_func(self.input_one)
```
- register those ops:
```
op_bench.generate_pt_tests_from_list(unary_ops_list, unary_ops_configs, UnaryOpBenchmark)
```
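A minimal sketch of what the list-based generation step might do (a hypothetical helper, not the actual op_bench implementation): cross each (name, function) pair with each config and emit a named test case.

```python
import itertools

def generate_test_names(op_list, configs):
    # Hypothetical sketch of list-based test generation: cross every
    # (name, fn) pair with every config dict and yield a named case.
    for (name, fn), cfg in itertools.product(op_list, configs):
        case = name + "_" + "_".join(f"{k}{v}" for k, v in sorted(cfg.items()))
        yield case, fn, cfg
```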
Reviewed By: zheng-xq
Differential Revision: D15514188
fbshipit-source-id: f09b359cab8175eeb8d51b3ad7bbbcfbc9f6430f
Summary:
The error for `test_error_stack_module`:
```
Traceback (most recent call last):
File "../test.py", line 35, in <module>
scripted = torch.jit.script(M())
File "/home/davidriazati/other/pytorch/torch/jit/__init__.py", line 1119, in script
return _convert_to_script_module(obj)
File "/home/davidriazati/other/pytorch/torch/jit/__init__.py", line 1825, in _convert_to_script_module
raise e
RuntimeError:
d(int x) -> int:
Expected a value of type 'int' for argument 'x' but instead found type 'str'.
:
at ../test.py:11:12
def c(x):
return d("hello") + d(x)
~ <--- HERE
'c' is being compiled since it was called from 'b'
at ../test.py:14:12
def b(x):
return c(x)
~~~ <--- HERE
'b' is being compiled since it was called from 'forward'
at ../test.py:22:16
def forward(self, x):
return b(x)
~~~ <--- HERE
'forward' is being compiled since it was called from 'forward'
at ../test.py:31:20
def forward(self, x):
return x + self.submodule(x)
~~~~~~~~~~~~~~~~ <--- HERE
```
This also unifies our error reporting in the front end with `ErrorReport`
TODO
* Include module names in message, #22207 should make this easy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22280
Pulled By: driazati
Differential Revision: D16060781
fbshipit-source-id: c42968b53aaddb774ac69d5abbf7e60c23df8eed
Summary:
Some of my qpth users have told me that updating to the latest version of PyTorch and replacing the btrifact/btrisolve calls with the LU ones wasn't working and I didn't believe them until I tried it myself :)
These updates have broken unpivoted LU factorizations/solves on CUDA. The LU factorization code used to return the identity permutation when pivoting wasn't used but now returns all zeros as the pivots. This PR reverts it back to return the identity permutation. I've not yet tested this code as I'm having some trouble compiling PyTorch with this and am hitting https://github.com/pytorch/pytorch/issues/21700 and am not sure how to disable that option.
Here's a MWE to reproduce the broken behavior, and my fix.
```python
torch.manual_seed(0)
n = 4
L = torch.randn(n,n)
A = L.mm(L.t()).unsqueeze(0)
b = torch.randn(1, n)
A_lu_cpu = torch.lu(A)
A_lu_cuda_nopivot = torch.lu(A.cuda(), pivot=False)
A_lu_cuda_pivot = torch.lu(A.cuda(), pivot=True)
print('A_lu_cuda_nopivot\n', A_lu_cuda_nopivot)
print('-----\nA_lu_cuda_pivot\n', A_lu_cuda_nopivot)
x_cpu = b.lu_solve(*A_lu_cpu)
x_cuda_nopivot = b.cuda().lu_solve(*A_lu_cuda_nopivot)
x_cuda_nopivot_fixed = b.cuda().lu_solve(
A_lu_cuda_nopivot[0], torch.arange(1, n+1, device='cuda:0').int())
x_cuda_pivot = b.cuda().lu_solve(*A_lu_cuda_pivot)
print(x_cpu, x_cuda_nopivot, x_cuda_nopivot_fixed, x_cuda_pivot)
```
Output:
```
A_lu_cuda_nopivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
-----
A_lu_cuda_pivot
(tensor([[[ 2.8465, -0.7560, 0.8716, -1.7337],
[-0.2656, 5.5724, -1.1316, 0.6678],
[ 0.3062, -0.2031, 1.4206, -0.5438],
[-0.6091, 0.1198, -0.3828, 1.5103]]], device='cuda:0'), tensor([[0, 0, 0, 0]], device='cuda:0', dtype=torch.int32))
(tensor([[-0.3121, -0.1673, -0.4450, -0.2483]]),
tensor([[-0.1661, -0.1875, -0.5694, -0.4772]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'),
tensor([[-0.3121, -0.1673, -0.4450, -0.2483]], device='cuda:0'))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22242
Differential Revision: D16049334
Pulled By: ezyang
fbshipit-source-id: 7eacae810d87ffbdf8e07159bbbc03866dd9979d
Summary:
This PR activates faster depthwise convolution kernels for Volta and Turing GPUs using cudnn >= 7600.
The script to benchmark the current PyTorch master branch and this PR branch can be found [here](https://gist.github.com/ptrblck/4590cf20721d8f43296c9903abd4a774).
(50 warmup iterations, 1000 iterations for timing)
I've used https://github.com/pytorch/pytorch/issues/3265 to create a similar benchmark and added a few additional setups.
Since the results are quite long, I've uploaded them in a spreadsheet [here](https://docs.google.com/spreadsheets/d/13ByXcqg7LQUr3DVG3XpLwnJ-CXg3GUZJ3puyTMw9n2I/edit?usp=sharing).
Times are given in ms per iteration.
We've benchmarked this PR on a DGX1 using V100 GPUs.
The current workload check in `check_cudnn_depthwise_workload` is quite long and can be moved to another file, if wanted.
CC ngimel (Thanks for the support while benchmarking it ;) )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22302
Differential Revision: D16115057
Pulled By: ezyang
fbshipit-source-id: bad184658518e73b4d6b849d77e408f5a7a757de
Summary:
Having the NVRTC stub in ATen is necessary to call driver APIs in ATen. This is currently blocking https://github.com/pytorch/pytorch/pull/22229.
`DynamicLibrary` is also moved as it is used in the stub code, and seems general enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22362
Differential Revision: D16131787
Pulled By: ezyang
fbshipit-source-id: add2ee8a8865229578aa00001a00d5a6671e0e73
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 488
LARGE: 29
XXLARGE: 2
Updated actions:
From MEDIUM to LARGE: 227
From XLARGE to MEDIUM: 1
From XLARGE to LARGE: 1
From XLARGE to XXLARGE: 1
From LARGE to MEDIUM: 2
From LARGE to XLARGE: 2
Differential Revision: D16161669
fbshipit-source-id: 67a4e0d883ca3f1ca3185a8285903c0961537757
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22143
Like the Conv DNNLOWP operator, allow FC to run the slow path to debug numerical issues caused by Intel's int8 instruction that does horizontal addition of two int8 multiplication results in 16 bits
Reviewed By: hx89
Differential Revision: D15966885
fbshipit-source-id: c6726376a3e39d341fd8aeb0e54e0450d2af8920
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22174
This is a preliminary change outlining the approach we plan to follow to integrate QNNPACK operators into the pytorch backend. The operators will not be made visible to the user in the python world, so ultimately we will have a function that calls qnnpack backend based on the environment being run on.
The goal of the project is to integrate QNNPACK library with PyTorch to achieve good performance for quantized mobile models.
Reviewed By: ljk53
Differential Revision: D15806325
fbshipit-source-id: c14e1d864ac94570333a7b14031ea231d095c2ae
Summary:
Some duplicated code is removed. It also becomes clear that there is only one special case `div_kernel_cuda` is handling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22555
Differential Revision: D16152091
Pulled By: zou3519
fbshipit-source-id: bb875370077c1f84efe4b766b3e1acc461e73e6c
Summary:
Fix a grammatical error of the comment in line 233.
change from " Returns an `OrderedDict` of he submodules of this `Module`"
to " Returns an `OrderedDict` of the submodules of this `Module`"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22548
Differential Revision: D16134534
Pulled By: zou3519
fbshipit-source-id: 33b1dd0fbc3a24bef99b6e0192566e2839292842
Summary:
As part of the Variable/Tensor merge, we want to be able to pass Variables into Caffe2 without doing extra shallow copy, to improve performance and also allow for in-place mutations in Caffe2 ops. There are a few approaches outlined in https://github.com/pytorch/pytorch/pull/22418, and this PR is the chosen approach.
Specifically, we can have the assumption that we won't be connecting autograd to C2 gradients at any point (as it's too tricky and not that useful). Therefore, we can pass Variable into Caffe2 ops by requiring that all Variables in Caffe2 don't require grad. For code paths in Caffe2 that might potentially track gradients (e.g. `ScriptModuleOp` and `call_caffe2_op_from_c10`), we use the `torch::NoGradGuard` to make sure gradients are not tracked.
This supersedes https://github.com/pytorch/pytorch/pull/22418.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22473
Differential Revision: D16099042
Pulled By: yf225
fbshipit-source-id: 57efc3c7cfb3048d9abe90e63759acc14ebd2972
Summary:
Forgot to mirror the `nn/ __init__.py` semantics in the new `nn` type stub.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22411
Differential Revision: D16149798
Pulled By: ezyang
fbshipit-source-id: 0ffa256fbdc5e5383a7b9c9c3ae61acd11de1dba
Summary:
`addcmul_out` overwrote the samples, which led to constant values being output by `torch.normal`.
Changelog:
- Replace the `addcmul_out` calls with a combination of in-place `mul` and `add`, with justification for this change.
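The replacement can be sketched, torch-free, on a plain list (illustrative only; the real change uses in-place tensor `mul`/`add`):

```python
def rescale_normal_(samples, mean, std):
    # Sketch of the fix: the equivalent of `out.mul_(std).add_(mean)`.
    # Unlike the broken `addcmul_out` path, this scales and shifts the
    # fresh standard-normal draws instead of overwriting them.
    for i in range(len(samples)):
        samples[i] = samples[i] * std + mean
    return samples
```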
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22533
Test Plan:
- Enable tests for test_normal on all devices
Fixes https://github.com/pytorch/pytorch/issues/22529
Differential Revision: D16141337
Pulled By: ezyang
fbshipit-source-id: 567a399042e0adcd154582f362318ce95a244c62
Summary:
Currently, specifying the "USE_"-family build options is in disarray.
Many build options accept three variants: USE_OPTION, WITH_OPTION, and
NO_OPTION; some accept only the USE_ and NO_ variants; some accept only
USE_. This inconsistency is confusing and hard to maintain.
To resolve this inconsistency, we can either let all these build options
support all three variants, or we only support the USE_ variant.
This commit makes a step to the latter choice, i.e., deprecates and sets
a date for removing the NO_ and WITH_ variants and keeps only the
USE_ variant. This is likely better than the former solution because:
- NO_ and WITH_ variants are not documented.
- CMakeLists.txt only has the USE_ variants for relevant build options
defined. It would be surprising for users to pass these variables to
CMake during a rebuild and find them ineffective.
- Multiple variants are difficult to maintain.
- The behavior is confusing if more than one variant is passed. For
example, what to be expected if one sets "NO_CUDA=1 USE_CUDA=1"?
The downside is that this will break backward compatibility for existing
build scripts in the future (if they used the undocumented build
options).
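The intended end state can be sketched as follows (a hypothetical helper, not the actual build script): only the USE_ spelling is consulted, and NO_/WITH_ spellings are ignored.

```python
import os

def use_flag(name, default=False):
    # Hypothetical sketch of the post-deprecation behavior: only the
    # USE_<NAME> variant is read; NO_<NAME> and WITH_<NAME> are ignored.
    val = os.environ.get("USE_" + name)
    if val is None:
        return default
    return val.lower() not in ("0", "off", "no", "false", "")
```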
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22474
Differential Revision: D16149396
Pulled By: ezyang
fbshipit-source-id: 7145b88ad195db2051772b9665dd708dfcf50b7d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22477
There is actually no use of an uninitialized variable, but some compilers are not smart enough to reason that the two if branches are always taken together.
Reviewed By: hx89
Differential Revision: D16100211
fbshipit-source-id: 25f01d668063603d7aaa776451afe8a10415d2ea
Summary:
After the Variable/Tensor merge, code paths in ATen need to be able to check whether a tensor requires gradient, and throw errors in places where a `requires_grad=true` tensor is not allowed (such as https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/Utils.h#L76-L78 and https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/SparseTensorImpl.cpp#L86). Since the `GradMode` thread-local variable controls whether a tensor should accumulate gradients, we need to be able to check this variable from ATen when we determine whether a tensor requires gradient, hence the PR to move `GradMode` / `AutoGradMode` / `NoGradGuard` to ATen.
Note that we intentionally don't merge `at::GradMode` and `at::NonVariableTypeMode`, with the following reasoning:
Semantically, `at::GradMode` and `at::NonVariableTypeMode` actually mean different things: `at::GradMode` controls whether a tensor should accumulate gradients, and `at::NonVariableTypeMode` controls whether a Variable should be treated as a non-Variable tensor in type dispatches. There are places whether we *don't* want the tensor to accumulate gradients, but *still* want the Variable to be treated as a Variable. Here is one example:
```python
# torch/tensor.py
with torch.no_grad():
...
new_tensor = self.new() # `at::GradMode` is false at this point
...
```
```cpp
// tools/autograd/templates/python_variable_methods.cpp
static PyObject * THPVariable_new(PyObject* self, PyObject* args, PyObject* kwargs)
{
...
// if we merge `at::GradMode` and `at::NonVariableTypeMode`, since `at::GradMode` is false and `self_.type()` checks `at::GradMode` to decide whether to return non-Variable type, it will return a non-Variable type here, which is not what we want (and throws a "Tensor that was converted to Variable was not actually a Variable" error)
return THPVariable_Wrap(torch::utils::legacy_tensor_new(self_.type(), args, kwargs));
...
}
```
For the above reason, we cannot merge `at::GradMode` and `at::NonVariableTypeMode`, as they have different purposes.
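The two modes can be sketched as independent thread-local flags (a Python sketch of the C++ thread-locals, not the actual implementation):

```python
import threading
from contextlib import contextmanager

_modes = threading.local()

def mode(name, default=True):
    return getattr(_modes, name, default)

@contextmanager
def set_mode(name, value):
    # Two independent thread-local flags: toggling grad_mode leaves
    # non_variable_type_mode untouched, mirroring why at::GradMode and
    # at::NonVariableTypeMode are kept separate.
    prev = mode(name)
    setattr(_modes, name, value)
    try:
        yield
    finally:
        setattr(_modes, name, prev)
```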
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18573
Differential Revision: D16134413
Pulled By: yf225
fbshipit-source-id: 6140347e78bc54206506499c264818eb693cdb8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22479
In some cases, for example when training on CTR data, we would like to start training from old samples and finish on recent ones.
This diff adds the option to disable shuffling in DistributedSampler to accommodate this use case.
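The behavior can be sketched with a toy sampler (illustrative only, not the real torch.utils.data.DistributedSampler):

```python
import random

class ShardSampler:
    """Toy sketch of a distributed sampler: shards indices across
    replicas, with shuffling optional so training can walk samples in
    their original (e.g. chronological) order."""

    def __init__(self, dataset_len, num_replicas, rank, shuffle=True, seed=0):
        self.dataset_len = dataset_len
        self.num_replicas = num_replicas
        self.rank = rank
        self.shuffle = shuffle
        self.seed = seed

    def indices(self, epoch=0):
        idx = list(range(self.dataset_len))
        if self.shuffle:
            # deterministic per-epoch shuffle shared by all replicas
            random.Random(self.seed + epoch).shuffle(idx)
        return idx[self.rank::self.num_replicas]
```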
Reviewed By: soumith
Differential Revision: D16100388
fbshipit-source-id: 35566581f5250040b2db5ec408a63037b47a9f5d
Summary:
Replaces https://github.com/pytorch/pytorch/pull/21501 because ghimport had errors I couldn't figure out when I tried to import the stack :'(
has the two commits that were previously accepted and the merge commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22561
Differential Revision: D16135743
Pulled By: eellison
fbshipit-source-id: f0a98842ccb334c7ceab04d1437e09dc76be0eb1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22516
Force anybody creating an untyped Dict to call c10::impl::deprecatedUntypedDict().
This should hopefully make it clear that this is not public API and prevent people from using it.
Differential Revision: D16115215
fbshipit-source-id: 2ef4cb443da1cdf4ebf5b99851f69de0be730b97
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22005
When a Dict or List is created with type information, it will remember that.
If at any point later, this list is instantiated to a List<T> with a concrete type, it will assert that T is the correct type.
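The checked behavior can be sketched in Python (illustrative only, not the c10::List/Dict API):

```python
class TypedList:
    # Sketch: the list remembers its element type at creation and, once
    # used with a concrete type, asserts every element matches it.
    def __init__(self, elem_type, items=()):
        self.elem_type = elem_type
        self._items = []
        for x in items:
            self.append(x)

    def append(self, x):
        if not isinstance(x, self.elem_type):
            raise TypeError(
                f"expected {self.elem_type.__name__}, got {type(x).__name__}")
        self._items.append(x)
```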
Differential Revision: D15914462
fbshipit-source-id: a8c3d91cb6d28d0c1ac0b57a4c4c6ac137153ff7
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22551
Test Plan:
ran test locally
Imported from OSS
Differential Revision: D16132182
fbshipit-source-id: 5b9efbf883efa66c4d8b7c400bdb804ac668a631
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22510
Added a new function to implement the clone operation on quantized tensors, along with a test case (see test plan).
This change is required to be able to call torch.jit.trace on quantized models.
Clone implementation calls copy_ on QTensor internally.
Differential Revision: D16059576
fbshipit-source-id: 226918cd475521b664ed72ee336a3da8212ddcdc
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22397
Test Plan:
Added test for reentrant backwards with checkpoint and a test for a recursive backwards function (which should fail if we run all the reentrant tasks recursively in the same thread) and for testing priority of reentrant tasks.
~~Will add a test for priority of reentrant tasks in future pr.~~
Imported from OSS
Differential Revision: D16131955
fbshipit-source-id: 18301d45c1ec9fbeb566b1016dbaf7a84a09c7ac
Summary:
Currently, the **stream** parameter is not set when launching these two kernels: softmax_warp_forward() and softmax_warp_backward(), i.e. the kernels are always put on the default stream, which may fail to respect the stream that was set previously. Add **at::cuda::getCurrentCUDAStream()** as a launch argument to fix this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22470
Differential Revision: D16115051
Pulled By: izdeby
fbshipit-source-id: 38b27e768bb5fcecc1a06143ab5d63b0e68a279e
Summary:
re-apply changes reverted in:
https://github.com/pytorch/pytorch/pull/22412
Also change log_softmax to take positional arguments. Long-term we do want the kwarg-only interface, but it currently seems to be incompatible with JIT serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22456
Differential Revision: D16097159
Pulled By: nairbv
fbshipit-source-id: 8cb73e9ca18fc66b35b873cf4a574b167a578b3d
Summary:
* Deletes all weak script decorators / associated data structures / methods
* In order to keep supporting the standard library in script, this enables recursive script on any function defined in `torch.nn`
* Most changes in `torch/nn` are the result of `ag -Q "weak" torch/nn/ -l | xargs sed -i '/weak/d'`; only `rnn.py` needed manual editing, using `ignore` and `export` to continue supporting the overloaded `forward` methods
* `Sequential`/`ModuleList` no longer need to be added to constants since they are compiled on demand
This should also fix https://github.com/pytorch/pytorch/issues/22212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22212
Differential Revision: D15988346
Pulled By: driazati
fbshipit-source-id: af223e3ad0580be895377312949997a70e988e4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22309
This diff enables PT operators to run with JIT mode. Users can control eager and JIT mode using the `use_jit` flag.
In this diff, we are putting operators in a loop and passing it to JIT. One extra step which wraps the operator with the `_consume` op is introduced to avoid JIT's dead code elimination optimization. With that, the reported time includes the real operator execution time plus the `_consume` op (which directly returns its input; nothing else happens inside).
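A minimal sketch of the wrapping (Python stand-ins; `_consume` here is illustrative, not the real JIT op):

```python
def _consume(x):
    # Stand-in for the `_consume` op: returns its input unchanged, so
    # the benchmarked op's result is "used" and cannot be removed as
    # dead code.
    return x

def bench_loop(op, x, iters):
    # Run the op in a loop, consuming each result.
    for _ in range(iters):
        x = _consume(op(x))
    return x
```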
Reviewed By: zheng-xq
Differential Revision: D16033082
fbshipit-source-id: e03be89fd5a505e44e81015dfc63db9cd76fb8a1
Summary:
- Fix typo in ```torch/onnx/utils.py``` when looking up registered custom ops.
- Add a simple test case
1. Register custom op with ```TorchScript``` using ```cpp_extension.load_inline```.
2. Register custom op with ```torch.onnx.symbolic``` using ```register_custom_op_symbolic```.
3. Export model with custom op, and verify with Caffe2 backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21321
Differential Revision: D16101097
Pulled By: houseroad
fbshipit-source-id: 084f8b55e230e1cb6e9bd7bd52d7946cefda8e33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21432
This diff introduce a new interface to generate tests based on the metadata of operators.
Reviewed By: ajauhri
Differential Revision: D15675542
fbshipit-source-id: ba60e803ea553d8b9eb6cb2bcdc6a0368ef62b1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22499
Another place where onnx export is running dead code elimination after making the jit graph invalid. Fixing it.
Reviewed By: houseroad
Differential Revision: D16111969
fbshipit-source-id: 5ba80340c06d091988858077f142ea4e3da0638c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22348
This is the last step of LRU hash eviction weight re-init. This diff checks if there are evicted values in sparse_lookup; if so, it calls the op created in D15709866 to re-init the values for the indices in evicted_values. Also created a gradient op for the operator; the gradient op just passes the output gradient through as the input gradient.
Reviewed By: itomatik
Differential Revision: D16044736
fbshipit-source-id: 9afb85209b0de1038c5153bcb7dfc5f52e0b2abb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22476
Dead code elimination assumes a valid jit graph because it checks if operators have side effects.
The onnx export path destroys the jit graph right before calling dead code elimination, but it actually doesn't care about side effects.
We can just call dead code elimination and disable side effect lookup and things should work.
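The idea can be sketched with a toy DCE pass (illustrative, not the real jit pass): a node is kept if any of its outputs is live, or, only when checking is enabled, if it has side effects.

```python
def eliminate_dead_code(nodes, graph_outputs, check_side_effects=True):
    # Toy DCE sketch: walk nodes in reverse, keep a node if any of its
    # outputs is live, or (when enabled) if it is marked side-effecting.
    # Each node is a dict with "inputs", "outputs", "side_effect".
    live = set(graph_outputs)
    kept = []
    for node in reversed(nodes):
        if any(o in live for o in node["outputs"]) or (
                check_side_effects and node["side_effect"]):
            kept.append(node)
            live.update(node["inputs"])
    kept.reverse()
    return kept
```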
Reviewed By: houseroad
Differential Revision: D16100172
fbshipit-source-id: 8c790055e0d76c4227394cafa93b07d1310f2cea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22441
This include doesn't seem to be needed. Remove it to simplify mobile build dependency.
Reviewed By: dreiss
Differential Revision: D16088224
fbshipit-source-id: f6aec21655e259726412e26a006d785912436c2a
Summary:
This has been requested in https://github.com/pytorch/pytorch/issues/20323
(It is still not exactly the same as NumPy, which allows you to pass tensors at mean/std and broadcast them with size, but the present PR is extremely simple and does the main thing people are asking for)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20545
Differential Revision: D15358736
Pulled By: zhangguanheng66
fbshipit-source-id: 762ea5eab5b8667afbac2df0137df017ba6e413c
Summary:
The changes include:
1. Allow key/value to have a different number of features from query. This supports the case where key and value have different feature dimensions.
2. Support three separate proj_weights, in addition to a single in_proj_weight. The proj_weight of key and value may have different dimensions from that of query, so three separate proj_weights are necessary. In case key and value have the same dimension as query, it is preferred to use a single large proj_weight for performance reasons. However, it should be noted that using a single large weight or three separate weights is a size-dependent decision.
3. Give an option to use static k and v in the multihead_attn operator (see saved_k and saved_v). Those static key/value tensors can now be re-used when training the model.
4. Add more test cases to cover the arguments.
Note: current users should not be affected by the changes.
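The weight layout described in point 2 can be sketched with a shape helper (`in_projection_shapes` is a hypothetical illustration, not part of the API):

```python
def in_projection_shapes(embed_dim, kdim=None, vdim=None):
    # When key/value dims match the query's embed_dim, a single packed
    # in_proj_weight suffices; otherwise three separate proj_weights
    # with per-tensor input dims are required.
    kdim = embed_dim if kdim is None else kdim
    vdim = embed_dim if vdim is None else vdim
    if kdim == embed_dim and vdim == embed_dim:
        return {"in_proj_weight": (3 * embed_dim, embed_dim)}
    return {
        "q_proj_weight": (embed_dim, embed_dim),
        "k_proj_weight": (embed_dim, kdim),
        "v_proj_weight": (embed_dim, vdim),
    }
```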
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21288
Differential Revision: D15738808
Pulled By: zhangguanheng66
fbshipit-source-id: 288b995787ad55fba374184b3d15b5c6fe9abb5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21927
Add `OUTPUT_PROB` output to CTCBeamSearchDecoderOp to return a probability for each sequence.
Add argument to output top-k instead of top-1 decoded sequences.
Reviewed By: SuperIRabbit
Differential Revision: D15797371
fbshipit-source-id: 737ca5cc4f90a0bcc3660ac9f58519a175977b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22461
We shouldn't call dead code elimination after EraseNumberTypes because dead code elimination assumes a valid jit graph which EraseNumberTypes just broke.
Let's have it clean up after itself instead.
Reviewed By: houseroad
Differential Revision: D16094656
fbshipit-source-id: f2752277d764e78ab276c57d56b2724b872b136f
Summary:
It's always set to equal USE_NCCL; we made Gloo depend on the Caffe2 NCCL
build. See 30da84fbe1614138d6d9968c1475cb7dc459cd4b
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22467
Differential Revision: D16098581
Pulled By: ezyang
fbshipit-source-id: f706ec7cebc2e6315bafca013b669f5a72e04815
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22279
This new operator is used for embedding table weight re-init. After we get the evicted indices, they identify the rows that need resetting in the embedding table. We can then create a 1-D tensor with default values and apply this operator to copy the tensor to all evicted rows in the embedding table.
Will add gradient op in next diff
Reviewed By: itomatik
Differential Revision: D15709866
fbshipit-source-id: 2297b70a7326591524d0be09c73a588da245cc08
Summary:
The sgemm in cuBLAS 9.0 has some issues with sizes above 2M on Maxwell and Pascal architectures. Warn in this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22034
Differential Revision: D15949930
Pulled By: zhangguanheng66
fbshipit-source-id: 0af977ec7900c76328d23898071de9c23778ff8b
Summary:
ROCm is already detected in cmake/public/LoadHIP.cmake. No need to
detect twice. Plus, the Python script read environment variable
ROCM_HOME, but what is really used in CMake scripts is ROCM_PATH -- a
user must specify both environment variables correctly. Since ROCM_HOME is
undocumented, this commit completely eradicates it.
---
ezyang A remake of https://github.com/pytorch/pytorch/issues/22228 because its dependency has been dismissed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22464
Differential Revision: D16096833
Pulled By: bddppq
fbshipit-source-id: fea461e80ee61ec77fa3a7b476f7aec4fc453d5d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22425
Currently, in bound_shape_inference.cc: InferBoundShapeAndType, we first infer ops in order and then infer the inputs of concat in reverse order. In the tiny version of ctr_instagram_model, concat is right before FC, so we can infer the inputs for concat. But in the production version, we found there are some ops between concat and FC (or other ops whose shapes we know), so the shapes of these ops cannot be inferred.
This diff is a temporary solution for this problem: infer shapes in order and in reverse order repeatedly until there are no more changes.
Reviewed By: yinghai, ipiszy
Differential Revision: D16082521
fbshipit-source-id: d5066509368029c6736dce156030adf5c38653d7
Summary:
MKL-DNN is the main library for computation when we use ideep device. It can use kernels implemented by different algorithms (including JIT, CBLAS, etc.) for computation. We add the "USE_MKLDNN_CBLAS" (default OFF) build option so that users can decide whether to use CBLAS computation methods or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19014
Differential Revision: D16094090
Pulled By: ezyang
fbshipit-source-id: 3f0b1d1a59a327ea0d1456e2752f2edd78d96ccc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22004
In future, we want all dicts/lists to store information about the types they contain.
This is only possible if the creation API doesn't allow creating lists/dicts without type information.
This diff updates some call sites that don't specify type information so that they do.
Reviewed By: dzhulgakov
Differential Revision: D15906387
fbshipit-source-id: 64766a2534b52c221e8a5501a85eaad13812e7bd
Summary:
Currently the build system accepts USE_NAMEDTENSOR from the environment
variable and turns it into NAMEDTENSOR_ENABLED when passing to CMake.
This discrepancy does not seem necessary and complicates the build
system. The naming of this build option is also semantically incorrect
("BUILD_" vis-a-vis "USE_"). This commit eradicate this issue before it
is made into a stable release.
The support of NO_NAMEDTENSOR is also removed, since PyTorch has been
quite inconsistent about "NO_*" build options.
---
Note: All environment variables with their names starting with `BUILD_` are currently automatically passed to CMake with no need of an additional wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22360
Differential Revision: D16074509
Pulled By: zou3519
fbshipit-source-id: dc316287e26192118f3c99b945454bc50535b2ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21389
As titled. To do weight re-init on evicted rows in the embedding table, we need to pass the info about the evicted hashed values to SparseLookup, which is the layer model responsible for constructing the embedding table and doing pooling.
To pass evicted values, we need to adjust the output record of lru_sparse_hash to include the evicted values, and add an optional input to all processors that need to take in a sparse segment. For SparseLookup to get the evicted values, its input record needs to be adjusted. Now the input record can have type IdList/IdScoreList, or a struct of feature + evicted values.
Reviewed By: itomatik
Differential Revision: D15590307
fbshipit-source-id: e493881909830d5ca5806a743a2a713198c100c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22241
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20387
glibc has a non-standard function, feenableexcept, that triggers a floating-point exception handler. Compared to feclearexcept + fetestexcept, this approach allows us to see precisely where the exception is raised from the stack trace.
Reviewed By: jspark1105
Differential Revision: D15301095
fbshipit-source-id: 94f6e72456b2280f78d7d01c2ee069ae46d609bb
Summary:
empty_like uses the tensor options of `self`, rather than the passed in tensor options. This means it messes up variable/tensor types, and ignores specifications like different dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21978
Differential Revision: D15903948
Pulled By: gchanan
fbshipit-source-id: f29946be01c543f888daef2e99fe928e7b7d9d74
Summary:
# What is this?
This is an implementation of the AdamW optimizer as implemented in [the fastai library](803894051b/fastai/callback.py) and as initially introduced in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101). It decouples the weight decay regularization step from the optimization step during training.
There have already been several abortive attempts to push this into pytorch in some form or fashion: https://github.com/pytorch/pytorch/pull/17468, https://github.com/pytorch/pytorch/pull/10866, https://github.com/pytorch/pytorch/pull/3740, https://github.com/pytorch/pytorch/pull/4429. Hopefully this one goes through.
# Why is this important?
Via a simple reparameterization, it can be shown that L2 regularization has a weight decay effect in the case of SGD optimization. Because of this, L2 regularization became synonymous with the concept of weight decay. However, it can be shown that the equivalence of L2 regularization and weight decay breaks down for more complex adaptive optimization schemes. It was shown in the paper [Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101) that this is the reason why models trained with SGD achieve better generalization than those trained with Adam. Weight decay is a very effective regularizer. L2 regularization, in and of itself, is much less effective. By explicitly decaying the weights, we can achieve state-of-the-art results while also taking advantage of the quick convergence properties that adaptive optimization schemes have.
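The distinction can be written out. In L2-regularized Adam the penalty gradient enters the adaptive moment estimates, while AdamW applies the decay directly to the weights (a sketch in the paper's notation, with learning rate $\eta$ and decay $\lambda$):

```latex
% Adam + L2 regularization: the penalty flows through the moment estimates
g_t = \nabla f_t(\theta_{t-1}) + \lambda \theta_{t-1}, \qquad
\theta_t = \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

% AdamW: moments are built from \nabla f_t(\theta_{t-1}) alone,
% and the decay is applied as a separate, decoupled step
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
```

Because $\hat{v}_t$ normalizes the penalty term in the first form, large-gradient weights are decayed less than intended; the decoupled form decays all weights at the same rate.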
# How was this tested?
There were test cases added to `test_optim.py` and I also ran a [little experiment](https://gist.github.com/mjacar/0c9809b96513daff84fe3d9938f08638) to validate that this implementation is equivalent to the fastai implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21250
Differential Revision: D16060339
Pulled By: vincentqb
fbshipit-source-id: ded7cc9cfd3fde81f655b9ffb3e3d6b3543a4709
Summary:
Address the issue raised in https://github.com/pytorch/pytorch/issues/22377.
The PR https://github.com/pytorch/pytorch/issues/22016 introduces a temporary tensor of weights `grad_weight_per_segment` of the same dtype as the end result, which can be a problem when using `float16`.
In this PR, it now uses a `float32` temporary tensor when the input is `float16`.
ngimel, can I get you to review? I think I have fixed the issues you have pointed out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22401
Differential Revision: D16077319
Pulled By: mrshenli
fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
Summary:
The CMake modifications include removal of some unnecessary paths
(e.g. find_package(CUDA) and friends) that are no longer used since
c10d is always part of the larger torch build. The macro
`C10D_USE_...` was ambiguous and is now removed in favor of only
having top level `USE_...`. The c10d test suite is changed to include
skip annotations for the tests that depend on Gloo as well.
Now, if you compile with `USE_DISTRIBUTED=1` and `USE_GLOO=0` you get
a functioning build for which the tests actually pass.
Closes https://github.com/pytorch/pytorch/issues/18851.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22257
Differential Revision: D16087993
Pulled By: pietern
fbshipit-source-id: 0cea66bd5cbd9736b06fa1d45ee13a18cab88adb
Summary:
The `assert False` lint error has been causing CI to fail:
./torch/utils/throughput_benchmark.py:14:13: B011 Do not call assert False since python -O removes these calls. Instead callers should raise AssertionError().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22424
Differential Revision: D16083464
Pulled By: bddppq
fbshipit-source-id: 6d96e36c8fcbb391d071b75fe79c22d526c1ba3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22429
Android NDK r20 removes the guard `(__ANDROID_API__ <= __ANDROID_API_O_MR1__)`, so we do it here also. There is insufficient reason to keep these decls undefined for earlier API levels. NDK r15 and earlier don't even define `__ANDROID_API_O_MR1__`, so the preprocessor defaults it to 0 and the guard evaluates as TRUE.
Reviewed By: smeenai, hlu1
Differential Revision: D16084105
fbshipit-source-id: f0857b3eb0573fe219f0d6c5e6583f89e2b5518f
Summary:
This change adds advanced support for cross-chunk shuffling.
For training with a static dataset, the default configuration suffices. However, in some use cases, new data is added to the dataset over each epoch, so the dataset's size is dynamically increasing. In order to mix the new data and the old data for better random sampling, one approach is to shuffle examples from more than one chunk. This feature is supported with this change: by specifying `cross_chunk_shuffle_count_` on construction, advanced users can specify how many chunks to shuffle examples from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22347
Differential Revision: D16081378
Pulled By: zhangguanheng66
fbshipit-source-id: fd001dfb9e66947839adecfb9893156fbbce80d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22413
_jit_pass_erase_number_types invalidates the jit graph but parts of _jit_pass_onnx rely on having a valid jit graph.
This splits _jit_pass_onnx into _jit_pass_onnx_remove_print and _jit_pass_onnx_preprocess_caffe2 (which rely on the valid jit graph), runs these before _jit_pass_erase_number_types,
and then runs the rest of _jit_pass_onnx after _jit_pass_erase_number_types
Reviewed By: houseroad
Differential Revision: D16079890
fbshipit-source-id: ae68b87dced077f76cbf1335ef3bf89984413224
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22334
Improve the function signatures of save_to_db and load_from_db in predictor_exporter.
Reviewed By: akyrola
Differential Revision: D16047208
fbshipit-source-id: a4e947f86e00ef3b3dd32c57efe58f76a38fcec7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22293
Just wrapping the C class with a nicer Python interface; now just print it
directly to get all the data. Later we can add various
visualizations there.
Differential Revision: D16023999
fbshipit-source-id: 8436e37e36965821a690035617784dcdc352dcd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22292
Since we do an atomic fetch_add to decide whether a thread should
finish, we should not take the last iteration into account. As a
result, the total number of iterations should be exactly the same as the user
sets via config.num_iters.
Now when running a unit test I see the exact number of iterations reported.
Differential Revision: D16023963
fbshipit-source-id: 3b12ee17276628ecd7b0979f28cd6deb777a1543
Summary:
As part of the Variable/Tensor merge, one invariant for tensor libraries such as ATen / Caffe2 / XLA is that they should only deal with Tensors, not Variables. However, currently in `variable_factories.h` we are potentially passing Variables into those tensor libraries without the `at::AutoNonVariableTypeMode` guard, which will cause those libraries to treat those Variables as Variables (i.e. their `is_variable()` is true), not Tensors.
Consider the following example for `full_like`:
```cpp
inline at::Tensor full_like(const at::Tensor & self, at::Scalar fill_value) {
...
// Both ATen and XLA rely on `at::full_like` to dispatch to library specific implementations.
//
// When `self` is a Variable, since we are not using `at::AutoNonVariableTypeMode`,
// `at::full_like` will also use `self` as a Variable (and it will see that `self.is_variable()` is true),
// which breaks the invariant that ATen / XLA should never deal with Variables.
at::Tensor tensor = at::full_like(self, fill_value, self.options().is_variable(false));
at::Tensor result =
autograd::make_variable_consuming(std::move(tensor), /*requires_grad=*/false);
...
return result;
}
```
Instead, the invariant-preserving implementation would be:
```cpp
inline at::Tensor full_like(const at::Tensor & self, at::Scalar fill_value) {
...
at::Tensor tensor = ([&]() {
at::AutoNonVariableTypeMode non_var_type_mode(true);
// Both ATen and XLA rely on `at::full_like` to dispatch to library specific implementations.
//
// When `self` is a Variable, since we have `at::AutoNonVariableTypeMode` in the scope,
// `at::full_like` will use `self` as a Tensor (and it will see that `self.is_variable()` is false),
// which preserves the invariant that ATen / XLA should only deal with Tensors.
return at::full_like(self, fill_value, self.options().is_variable(false));
})();
at::Tensor result =
autograd::make_variable_consuming(std::move(tensor), /*requires_grad=*/false);
...
return result;
}
```
This PR makes the suggested change for all variable factory functions.
cc. ailzhang This should allow us to remove all `tensor_data()` calls in the XLA codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22364
Differential Revision: D16074862
Pulled By: yf225
fbshipit-source-id: 3deba94b90bec92a757041ec05d604401a30c353
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22285
Previously, forward hooks were expected to return None. This PR adds support for overwriting the input and output in `forward_pre_hook` and `forward_hook`; this is used to implement inserting quant/dequant function calls around forward functions.
Differential Revision: D16022491
fbshipit-source-id: 02340080745f22c8ea8a2f80c2c08e3a88e37253
Summary:
As per attached tasks, these are noops and are being deprecated/removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22113
Reviewed By: philipjameson
Differential Revision: D15901131
fbshipit-source-id: 3acf12208f692548afe4844be13717a49d74af32
Summary:
Saying `I` in an error message is too subjective to be used in a framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22369
Differential Revision: D16067712
Pulled By: soumith
fbshipit-source-id: 2a390646bd5b15674c99f65e3c460a7272f508b6
Summary:
`setup.py` recommends setting `USE_QNNPACK=0` and `USE_NNPACK=0` to disable building QNNPACK and NNPACK respectively. However this wasn't reflected correctly because we were looking for `NO_QNNPACK` and `NO_NNPACK`. This PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22367
Differential Revision: D16067393
Pulled By: soumith
fbshipit-source-id: 6491865ade9a6d41b7a79d68fd586a7854051f28
Summary:
Say the user inputs reduction=False. Of course, we can't add a bool and a string, so the ValueError itself will error, which is more confusing to the user. Instead, we should use string formatting. I would use `f"{reduction} is not..."` but am unsure whether we are ok with using f"" strings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22160
Differential Revision: D15981826
Pulled By: soumith
fbshipit-source-id: 279f34bb64a72578c36bdbabe2da83d2fa4b93d8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22319
The onnx pass replacing ints with Tensors produces an invalid JIT graph. It should only be called right before the onnx pass.
Also, it should only be called if we actually export to onnx.
Reviewed By: houseroad
Differential Revision: D16040374
fbshipit-source-id: e78849ee07850acd897fd9eba60b6401fdc4965b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22317
About to add an observer that is also statically initialized in a different
file, so we need to enforce initialization order.
Reviewed By: ilia-cher
Differential Revision: D16012275
fbshipit-source-id: f26e57149a5e326fd34cb51bde93ee99e65403c4
Summary:
Syncing worker requirement mismatches to improve remote build time.
Created actions:
MEDIUM: 445
LARGE: 354
Updated actions:
From MEDIUM to LARGE: 21
From LARGE to XLARGE: 34
From LARGE to MEDIUM: 9
From XLARGE to MEDIUM: 1
Differential Revision: D16047893
fbshipit-source-id: 7afab2ef879277f114d67fd1da9f5102ec04ed7f
Summary:
This does not occur in CUDA code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22271
Differential Revision: D16024605
Pulled By: bddppq
fbshipit-source-id: bb4f16bacbdc040faa59751fba97958f4c2d33cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22307
MSVC-specific pragma doesn't silence the warning about throwing constructor and therefore `clang-cl` fails to compile this file. This diff fixes the problem by adding additional check for `clang` compiler.
Reviewed By: smessmer
Differential Revision: D16032324
fbshipit-source-id: 6dbce0ebf0a533d3e42b476294720590b43a8448
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21921
Call FBGEMM kernels to implement quantized linear operator. This operator is used only for inference.
Differential Revision: D15375695
fbshipit-source-id: b9ca6c156fd60481fea83e55603b2897f7bfc3eb
Summary:
Reduction of gradients for unused parameters should happen as soon as
possible, because they potentially block reduction of gradients for
used parameters. This used to happen instantly when
`prepare_for_backward` was called and it found parameters that didn't
contribute. This meant that if you have a model with unused
parameters, and you want to discard the model output (i.e. not call
backward on some loss), reduction of the gradients of those unused
parameters would have been kicked off, and you'd see an error the next
time you called `forward`.
In this commit, this original approach is slightly changed to delay
reduction of the gradients of those unused parameters until the first
autograd hook is called. This means that you can now discard the model
output regardless of the model having unused parameters or not.
This is a prerequisite for making the `find_unused_parameters`
argument to DDP default to `True`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22219
Differential Revision: D16028698
Pulled By: pietern
fbshipit-source-id: c6aec2cd39c4a77746495d9cb1c9fb9c5ac61983
Summary:
This adds the rest of the `dict.???` methods that were missing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21979
Pulled By: driazati
Differential Revision: D16023573
fbshipit-source-id: 3ea9bd905090e2a176af654a8ca98c7d965ea679
Summary:
In talks with smessmer, we decided that it'd be better to put the logic in `list`, as optimal behavior requires knowing `.capacity()`
Results on my cpu (for the benchmark here: https://twitter.com/VahidK/status/1138674536679821312) now look like this:
```
Pytorch batch_gather took 0.018311 seconds.
Pytorch batch_gather jit took 0.013921 seconds.
Pytorch vectorized batch_gather took 0.001384 seconds.
```
Previously, `batch_gather jit` took 3x as long as `batch_gather`.
Some logic taken from https://github.com/pytorch/pytorch/pull/21690. Note that these two PR's are somewhat orthogonal. That PR handles this benchmark by looking at the alias analysis, while this PR specializes for `+=`.
Note that we can't jit the vectorized version as we think `torch.arange` returns a float tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21896
Differential Revision: D15998628
Pulled By: Chillee
fbshipit-source-id: b0085960da4613578b94deb98ac62c0a4532a8c3
Summary:
This is yet another step to disentangle Python build scripts and CMake
and improve their integration (Let CMake handle more build environment
detections, and less by our handcrafted Python scripts).
The processor detection logic also changed a bit: Instead of detecting
whether the system processor is PPC or ARM, this PR changes to detect
Intel CPUs, because this is more precise as MKL only supports Intel
CPUs. The build option `USE_MKLDNN` will also not be presented to
users on non-Intel processors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22215
Differential Revision: D16005953
Pulled By: ezyang
fbshipit-source-id: bf3f74d53609b3f835e280f63a872ff3c9352763
Summary:
When dealing with a large-scale dataset, it is handy if we can save the dataset status and resume later, especially in cases where an unexpected crash happens: users don't need to start over the whole dataset from the beginning. Instead, they can reload it from the last checkpoint.
This change adds support for checkpoint save/load logic in ChunkDataset.
On ChunkDataset construction, the user can specify a file name from which to load the checkpoint. If it is empty, the dataset defaults to starting fresh; otherwise the ChunkDataset will 'fast forward' the chunk sampler to the corresponding checkpoint.
The user can also call ChunkDataset::save() to serialize current status to a file, which can be used later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21889
Differential Revision: D16024582
Pulled By: ailzhang
fbshipit-source-id: 1862ab5116f94c9d29da174ce04a91041d06cad5
Summary:
`cmake/public/LoadHIP.cmake` calls `find_package(miopen)`, which uses the CMake module in MIOpen installation (It includes the line `set(miopen_DIR ${MIOPEN_PATH}/lib/cmake/miopen)`). `cmake/Modules/FindMIOpen.cmake` is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22244
Differential Revision: D16000771
Pulled By: bddppq
fbshipit-source-id: 07bb40fdf033521e8427fc351715d47e6e30ed34
Summary:
The original name `copy_tensor_data` could be confusing because users are not sure whether it deep-copies data in the tensor's storage or just copies the tensor's metadata. The renaming makes it more clear.
cc. ailzhang This might break XLA build, but I think the renaming makes it more clear why we use `copy_tensor_data` in XLATensorImpl's shallow-copy functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22266
Differential Revision: D16014724
Pulled By: yf225
fbshipit-source-id: f6ee966927d4d65d828b68264b3253b2f8fd768d
Summary:
This adds the rest of the `dict.???` methods that were missing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21979
Pulled By: driazati
Differential Revision: D15999938
fbshipit-source-id: 7bc2a55e3f791015a0ff2e3731703075cf0770ee
Summary:
I learned from https://github.com/pytorch/pytorch/pull/22058 that `worker_kill` is just flaky, regardless of `hold_iter_reference`. So let's disable it altogether for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22208
Differential Revision: D15990307
Pulled By: soumith
fbshipit-source-id: d7d3f4fe7eaac4987f240cb8fd032c73a84157d7
Summary:
As part of the Variable/Tensor merge, we want to gradually remove call sites of `tensor_data()` and the API itself, and instead uses `variable_data()`. This PR removes the `tensor_data()` call in the tensor_to_numpy conversion path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22214
Differential Revision: D15997397
Pulled By: yf225
fbshipit-source-id: 6fcab7b14e138824fc2adb5434512bcf868ca375
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22077
ghimport-source-id: 39cf0a2e66e7fa2b6866af72782a22a4bd025e4c
Test Plan:
- Compared the build/aten/src folder before and after this change
locally and verified they are identical (`diff -r`).
- Wait for CI + Also, [namedtensor ci]
Imported from OSS
Differential Revision: D15941967
Pulled By: zou3519
fbshipit-source-id: d8607df78f48325fba37e0d00fce0ecfbb78cb36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20729
Currently there is no way to specify what scalar types each nn function will support.
This change allows specifying the supported scalar types for each function/backward function and device. By default each function supports Float, Double, and Half.
If you want to specify extra supported scalar types beyond the default, you will need to change nn.yaml:
```
- name: _some_func(Tensor self)
  cname: SomeFunction
  CPU:
    forward_scalar_types: ['Float', 'Double', 'Long']
    backward_scalar_types: ['Float', 'Double']
```
Differential Revision: D15423752
fbshipit-source-id: b3c157316d6e629bc39c1b377a3b23c71b1656cf
Summary:
In `torch/csrc/autograd/function.h` we define `torch::autograd::Function`, a (the?) central autograd record-holding class. `Function` is declared public API (`TORCH_API`).
We also define a custom deleter `deleteFunction` which we use throughout PyTorch's own use of `Function`. This trivial PR declares the deleter public API as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22236
Differential Revision: D16001335
Pulled By: yf225
fbshipit-source-id: 6ef0a3630e8f82f277a0e6e26cc64455ef7ee43e
Summary:
we used to not print the device when a tensor is on XLA. It's sometimes confusing as it looks the same as a CPU tensor...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22094
Differential Revision: D15975405
Pulled By: ailzhang
fbshipit-source-id: f19ceb9e26f5f2f6e7d659de12716f0dfe065f42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22084
For DictPtr/ListPtr, default construction was disallowed because it was ambiguous whether it was supposed to create an empty list or a nullptr.
But since we renamed them to Dict/List, we can now allow default construction without ambiguity.
Differential Revision: D15948098
fbshipit-source-id: 942a9235b51608d1870ee4a2f2f0a5d0d45ec6e6
Summary:
This cleans up the `checkScript` API and some old tests that were hardcoding outputs. It also now runs the Python function when a string is passed in to verify the outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22002
Differential Revision: D15924485
Pulled By: driazati
fbshipit-source-id: ee870c942d804596913601cb411adc31bd988558
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22157
This header uses `std::swap_ranges` function which is defined in `<algorithm>` header (https://en.cppreference.com/w/cpp/algorithm/swap_ranges). Therefore this file isn't guaranteed to compile on all platforms.
This diff fixes the problem by adding the missing header.
Reviewed By: smessmer
Differential Revision: D15971425
fbshipit-source-id: e3edcec131f72d729161f5644ee152f66489201a
Summary:
Changelog:
- Port `symeig` from TH/THC to ATen
- Enable batching of matrix inputs for `symeig`
- Modify derivative computation based on batching
- Update docs to reflect the change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21858
Test Plan: - Added additional tests in `test_torch.py` (with a port to `test_cuda.py`) and `common_methods_invocations.py` to test if both the port and batching work.
Differential Revision: D15981789
Pulled By: soumith
fbshipit-source-id: ab9af8361f8608db42318aabc8421bd99a1ca7ae
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21885
If a kernel is defined as a stateful lambda:
```cpp
static auto registry = torch::RegisterOperators().op("my::op", [some_closure] (Tensor a) {...});
```
this can have very unexpected behavior when kernels are instantiated: there is no guarantee that the state is kept.
In the options-based API, state is already disallowed:
```cpp
// this is a compiler error
static auto registry = torch::RegisterOperators().op("my::op", torch::RegisterOperators::options().kernel([some_closure] (Tensor a) {...}));
```
but we can't disallow it in the non-options-based API for backwards compatibility reasons.
We can, however, show a deprecation warning, which is what this diff introduces.
Differential Revision: D15867089
fbshipit-source-id: 300fa4772fad8e7d177eb7cb910063d360537a4a
Summary:
Re-implementation of the `embedding_dense_backward_cuda()` and the `embedding_bag_backward_cuda_sum_avg()` functions.
#### Performance
Running a [Mortgage Workflow](https://github.com/EvenOldridge/MortgageWorkflowA) with a block size of 100K on a DXG-2 (single GPU), we see a 270% speedup:
```
Original version: 370,168 example/s
Optimized version: 1,034,228 example/s
```
The original version is bounded by the `EmbeddingBag_accGradParametersKernel_sum_avg`, which takes 70% of the CUDA execution time. In the optimized version, the optimized kernel now takes only 17% of the time.
#### Greater Numerical Stability
An added benefit is greater numerical stability. Instead of doing a flat sum where a single variable is used to accumulate the weights, this code uses two steps: each GPU thread computes a sub-result defined by `NROWS_PER_THREAD` before the final result is accumulated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22016
Differential Revision: D15944339
Pulled By: mrshenli
fbshipit-source-id: 398d5f48826a017fc4b31c24c3f8b56d01830bf0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22130
Optimize InstanceNormOp forward
For InstanceNormOp on CPU with order = NHWC, N = 128, C = 256, H = 56, W = 56: 183ms -> 115ms.
For InstanceNormOp on GPU with N = 256, C = 256, H = 112, W = 112:
NCHW: 1475ms -> 45ms
NHWC: 1597ms -> 79ms
Reviewed By: houseroad
Differential Revision: D15963711
fbshipit-source-id: 3fa03109326456b9f301514fecbefa7809438d3e
Summary:
In order to select the more important features in a dot product among a list of candidate sparse features, we can assign one learnable weight to each feature, reweighting each feature by multiplying the weight onto its embedding before the dot product. We finally select features based on the weight magnitude after training.
We can perform L1 and/or L2 regularization on the weights. To summarize, the weights tend to shrink their values (avoiding overfitting) due to L2 regularization, and some weights will vanish to zero under L1. To avoid sparse feature embeddings being ignored due to early collapse of the weights, a piecewise lr warm-up policy is used in optimizing the regularization term, such that regularization is weak in the first stage and gets stronger afterwards (a small lr constant for iters below threshold 1, a medium lr constant in stage 2, and a final, reasonably large lr constant for all iters after threshold 2). The features with nonzero and relatively large weights (in absolute value) will be selected for the module.
We can also apply softmax on the original weights to make them sum to 1. We can even boost the softmaxed weights by multiplying by the number of softmax components, which essentially makes them sum to the number of components and average to 1. In this scheme, all the weights are positive and sum to a constant. Regularization is not a must, since we can count on the competition between the softmax weights themselves to achieve reasonable re-weighting. We expect these weights to be denser, compared with the sparse ones from L1 regularization, and we can select features based on the top K weights.
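In symbols (our notation, not from the diff), with $K$ softmax components and raw weights $a_1, \ldots, a_K$, the boosted weights are:

```latex
w_i = K \cdot \frac{e^{a_i}}{\sum_{j=1}^{K} e^{a_j}}, \qquad
\sum_{i=1}^{K} w_i = K, \qquad
\frac{1}{K} \sum_{i=1}^{K} w_i = 1
```

So the weights are positive, sum to the constant $K$, and average to 1, and features can be ranked by $w_i$ without any explicit regularization.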
Overall, we aim to demonstrate the selected feature set outperform current v0 feature set in experiments. Special acknowledgement goes to Shouyuan Chen, who initiated the work of regularizable weighting.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22176
The diff will export updates to Github repository, as stated below.
{F162787228}
Basically, the updates on the files are summarized as below:
- adding logger messages
`caffe2/python/layer_model_helper.py`
- add ElasticNet regularizer, which combines both L1 and L2 regularization
`caffe2/python/regularizer.py`
- implement piecewarmup, specifically warm up with three constant pieces
`caffe2/sgd/learning_rate_functors.h, caffe2/sgd/learning_rate_op.cc, caffe2/sgd/learning_rate_op.h`
Differential Revision: D15923430
fbshipit-source-id: ee18902cb88c23b1b7b367cc727d690a21e4cda9
Summary:
- PyCQA/flake8-bugbear#53 has been fixed (but not yet closed on their side) and a new version of flake8-bugbear has been released on Mar 28, 2019. Switch CI to use the latest stable version.
- Fix the new B011 errors that flake8-bugbear catches in the current codebase.
---
B011: Do not call assert False since python -O removes these calls. Instead callers should raise AssertionError().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21944
Differential Revision: D15974842
Pulled By: soumith
fbshipit-source-id: de5c2c07015f7f1c50cb3904c651914b8c83bf5c
Summary:
Returning the result of an inplace `squeeze_` in `einsum` (which itself is traced) interacts badly with `autograd.Function`.
I must admit that I'm not 100% certain whether it should be necessary to change this, but I consider this a good change overall.
Fixes: https://github.com/pytorch/pytorch/issues/22072
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22111
Differential Revision: D15974990
Pulled By: soumith
fbshipit-source-id: 477e7f23833f02999085f665c175d062e7d32acd
Summary:
The current error message displays as:
`RuntimeError: index koccurs twice in output`
A space is missing between the index and 'occurs'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21904
Differential Revision: D15878941
Pulled By: colesbury
fbshipit-source-id: 163dda1829bf4956978cd01fd0e751673580722d
Summary:
The bug is that when target_length == 0, there is no preceding BLANK state and the original implementation will lead to an out-of-bounds pointer access.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21910
Differential Revision: D15960239
Pulled By: ezyang
fbshipit-source-id: 7bbbecb7bf91842735c14265612c7e5049c4d9b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22088
This diff is similar to D14163001. We need to handle the edge case when add_axis=1.
Reviewed By: jspark1105
Differential Revision: D15949003
fbshipit-source-id: 328d1e07b78b69bde81eee78c9ff5a8fb81f629b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22037
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out of band signal
is needed whether or not a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient or not. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15926383
fbshipit-source-id: 39c0d5dbd95bf0534314fdf4d44b2385d5321aaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22036
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, we can create a follow-up commit to make all in
place operations make their results accessible through this function
as well. This doesn't break any existing contracts but does have the
potential to add interface ambiguity.
This is a resubmission of #19146.
Reviewed By: mrshenli
Differential Revision: D15926384
fbshipit-source-id: b6ee5d81606bfa8ed63c3d63a9e307613491e0ae
Summary:
This change is backwards incompatible in *C++ only* on mean(), sum(), and prod() interfaces that accepted either of:
```
Tensor sum(IntArrayRef dim, bool keepdim=false) const;
Tensor sum(IntArrayRef dim, ScalarType dtype) const;
```
but now to specify both the dim and dtype will require the keepdim parameter:
```
Tensor sum(IntArrayRef dim, bool keepdim=false, c10::optional<ScalarType> dtype=c10::nullopt) const;
```
[xla ci]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21088
Reviewed By: ailzhang
Differential Revision: D15944971
Pulled By: nairbv
fbshipit-source-id: 53473c370813d9470b190aa82764d0aea767ed74
Summary:
Currently many build options are explicitly passed from Python build scripts to CMake. But this is unnecessary, at least for many of them. This commit removes the build options that have the same name in CMakeLists.txt and environment variables (e.g., `USE_REDIS`). Additionally, many build options that are not explicitly passed to CMake currently get lost.
For `ONNX_ML`, `ONNX_NAMESPACE`, and `BUILDING_WITH_TORCH_LIBS`, I changed their default values in CMake scripts (as consistent with what the `CMake.defines` call meant), to avoid their default values being redundantly set in the Python build scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21877
Differential Revision: D15964996
Pulled By: ezyang
fbshipit-source-id: 127a46af7e2964885ffddce24e1a62995e0c5007
Summary:
This PR tackles issue https://github.com/pytorch/pytorch/issues/18352 .
Progress:
- [x] conv_dilated2d CPU
- [x] conv_dilated3d CPU
- [x] conv_dilated2d CUDA
- [x] conv_dilated3d CUDA
- [x] RocM port
- [x] Port of CUDA gemm and gemv
- [x] Refactored 2d and 3d functions as well as output and gradient computations into a single C++ template function
- [x] Cleanup
+ [x] eliminate forward functions
+ [x] eliminate buffers `columns` and `ones` from functions API
+ [x] eliminate out functions
+ [x] eliminate using `ones`
Note that col2im, im2col, col2vol, vol2col implementations are exposed in `ATen/native/im2col.h` and `ATen/native/vol2col.h`. The corresponding operators (not ported in this PR) should use these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20983
Differential Revision: D15958088
Pulled By: ezyang
fbshipit-source-id: 1897f6e15abbf5710e9413cd1e443c2e1dc7d705
Summary:
This is useful for measuring inference performance of your
models. This is a very basic benchmark for now. We don't support
batching on the benchmark side; no inter- and intra-op parallelism is
supported yet, just caller-based parallelism.
The main philosophy here is that the user should be able to provide inputs
from Python and just stack them within the benchmark. The API should be
exactly the same as passing inputs to module.forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20766
Test Plan: Added a new unit test
Differential Revision: D15435461
Pulled By: salexspb
fbshipit-source-id: db08829dc3f4398bb1d8aa16cc4a58b6c72f16c6
Summary:
Previously any assert failures would leave the updated setting, making
the test suite semantics dependent on the order in which the tests are run.
The diff is large only due to the indentation change (might be good to review without whitespace changes).
cc yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22115
Differential Revision: D15960875
Pulled By: soumith
fbshipit-source-id: 9313695277fc2d968786f13371719e03fff18519
Summary:
Apply launch bounds annotations for ROCm as the maximum threads per
block (1024) is higher than the ROCm internal default (256).
Reduce the minBlocksPerMultiprocessor for ROCm to 8 from 16 as this
improves performance in some microbenchmarks by (statistically
significant) 4%.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22081
Differential Revision: D15947426
Pulled By: bddppq
fbshipit-source-id: b4b7015417f99e14dfdedb62639e4d837c38e4fd
Summary:
We can't really test these until we get Python 3.8 in the CI, but these all work locally and won't be invoked at all for Python 3.7 and lower so this should be pretty safe.
Fixes#21710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22007
Pulled By: driazati
Differential Revision: D15914735
fbshipit-source-id: 83833cebe7e38b162719a4f53cbe52c3fc638edd
Summary:
This was originally introduced because at::Half overloaded a number of operators; since this isn't necessary anymore, get rid of it.
Note in many cases, these files still need THCNumerics.cuh (which was included by THCHalfAutoNumerics); I was not careful about isolating these usages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21878
Differential Revision: D15941236
Pulled By: gchanan
fbshipit-source-id: 65f30a20089fcd618e8f3e9646cf03147a15ccba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21753
- it accidentally didn't move non-IValue-based lists before. This is fixed now.
- it only needs to recreate a T() for IValue-based lists
Reviewed By: resistor
Differential Revision: D15809220
fbshipit-source-id: 944badf1920ee05f0969fff0d03284a641dae4a9
Summary:
Get benefit from the compile time vectorization and multi-threading.
Before:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
2.29 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
After:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
452 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After, with multi-threading:
```python
In [1]: import torch
In [2]: x = torch.randn(1000000)
In [3]: y = torch.randn(1000000)
In [4]: w = 0.7
In [5]: timeit torch.lerp(x, y, w)
167 µs ± 48.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22038
Differential Revision: D15941468
Pulled By: VitalyFedyunin
fbshipit-source-id: fa8a5126187df4e6c849452e035b00b22be25739
Summary:
# Motivation
We allow to override JIT module serialization with `__getstate__/__setstate__` in order to cover cases where parameters are not serializable. Use cases include: MKLDNN integration: a388c78350/torch/utils/mkldnn.py (L18-L26)
and also fbgemm prepacked format integration for quantized tensors.
However, many Eager scripts use the `torch.save(module.state_dict())` form of serialization. There are several ways to make it work:
* make packed_weight itself pickleable (e.g. by binding `__getstate__/__setstate__` on the C++ UDT level)
  * change: we'd need to allow module buffers to be of arbitrary, non-Tensor types
  * pro: no change to state_dict behavior
  * cons: might not be directly inspectable by a user calling .state_dict(), especially if packed weights represent several tensors fused together
* make packed_weight a proper Tensor layout
  * pro: no change to state_dict or buffers behavior
  * cons: adding new tensor layouts is pretty costly today
  * cons: doesn't work if multiple tensors are packed in one interleaved representation
* *[this approach]* allow Modules to override state_dict and return regular tensors
  * pro: most flexible and hackable
  * pro: maintains the semantic meaning of state_dict as all data necessary to represent the module's state
  * cons: complicates state_dict logic
  * cons: potential code duplication between `__getstate__/__setstate__`
Based on discussions with zdevito and gchanan we decided to pick the latter approach. Rationale: this behavior is fully opt-in and will impact only modules that need it. For those modules the requirement listed above won't be true. But we do preserve the requirement that all elements of state_dict are tensors. (https://fburl.com/qgybrug4 for internal discussion)
In the future we might also implement one of the approaches above but those are more involved.
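The chosen approach can be sketched in plain Python; `PackedLinear` and its packing helpers are made-up stand-ins for the real fbgemm/MKLDNN packed formats:

```python
class PackedLinear:
    """Sketch: weights live in a packed, non-serializable form internally,
    but state_dict() exposes them as regular (unpacked) values."""
    def __init__(self, weight):
        self._packed = self._pack(weight)   # opaque packed format

    @staticmethod
    def _pack(weight):
        return bytes(weight)                # stand-in for real weight packing

    @staticmethod
    def _unpack(packed):
        return list(packed)

    def state_dict(self):
        # override: return regular values so torch.save(module.state_dict())
        # keeps working even though the internal buffer isn't a plain tensor
        return {"weight": self._unpack(self._packed)}

    def load_state_dict(self, state):
        self._packed = self._pack(state["weight"])

m = PackedLinear([1, 2, 3])
print(m.state_dict())   # → {'weight': [1, 2, 3]}
```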
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21933
Differential Revision: D15937678
Pulled By: dzhulgakov
fbshipit-source-id: 3cb5d1a8304d04def7aabc0969d0a2e7be182367
Summary:
This pull request adds the necessary Windows DLL code to be able to support JIT fusion for CUDA. CPU JIT Fusion isn't supported. This also adds all the non-CPU JIT tests back in on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21861
Differential Revision: D15940939
Pulled By: soumith
fbshipit-source-id: e11f6af1ac258fcfd3a077e6e2f2e6fa38be4ef1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22015
Previous fusion logic only works for operators back-to-back in the linear order of protobuf file.
This diff generalizes to work for any predecessor-successor operators in the graph without any "interfering" use/def of the related blobs.
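The generalized fusion check can be sketched as follows; the `(name, inputs, outputs)` tuple representation and the `can_fuse` helper are hypothetical, not the actual Caffe2 graph API:

```python
def can_fuse(ops, i, j):
    """Ops are (name, inputs, outputs) tuples in topological order.
    Op j may be fused into op i if j consumes an output of i and no op
    between them uses or redefines that blob (no interfering use/def)."""
    blobs = set(ops[i][2]) & set(ops[j][1])
    if not blobs:
        return False
    for k in range(i + 1, j):
        _, ins, outs = ops[k]
        if blobs & (set(ins) | set(outs)):
            return False
    return True

ops = [
    ("Conv",    ["x"], ["y"]),
    ("Dropout", ["z"], ["z2"]),   # unrelated op sitting between the pair
    ("Relu",    ["y"], ["w"]),
]
print(can_fuse(ops, 0, 2))  # → True: Dropout never touches "y"
```

Previously only index-adjacent (back-to-back) pairs would have been considered, so the Conv/Relu pair above would have been missed.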
Reviewed By: csummersea
Differential Revision: D15916709
fbshipit-source-id: 82fe4911a8250845a8bea3427d1b77ce2442c495
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21709
Change the return type from Scalar to double/int64_t so we don't need to do conversion when we call other quantize related aten functions
Differential Revision: D15793003
fbshipit-source-id: 510936c69fa17a4d67340a31ebb03415647feb04
Summary:
This is a modified version of https://github.com/pytorch/pytorch/pull/14705 since commit structure for that PR is quite messy.
1. Add `IterableDataset`.
2. So we have two data-loading modes: `Iterable` and `Map`.
   1. `Iterable` if the `dataset` is an instance of `IterableDataset`
   2. `Map` otherwise.
3. Add better support for non-batch loading (i.e., `batch_size=None` and `batch_sampler=None`). This is useful for doing things like bulk loading.
4. Refactor `DataLoaderIter` into two classes, `_SingleProcessDataLoaderIter` and `_MultiProcessingDataLoaderIter`. Rename some methods to be more generic, e.g., `get_batch` -> `get_data`.
5. Add `torch.utils.data.get_worker_info`, which returns worker information in a worker process (e.g., worker id, dataset obj copy, etc.) and can be used in `IterableDataset.__iter__` and `worker_init_fn` to do per-worker configuration.
6. Add `ChainDataset`, which is the analog of `ConcatDataset` for `IterableDataset`.
7. Import torch.utils.data in `torch/__init__.py`.
8. Add data loader examples and documentation.
9. Use `get_worker_info` to detect whether we are in a worker process in `default_collate`.
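The per-worker sharding pattern enabled by `get_worker_info` can be sketched in plain Python; `RangeStream` and `shard` are hypothetical stand-ins for an `IterableDataset.__iter__` that calls `torch.utils.data.get_worker_info()`:

```python
class RangeStream:
    """Sketch of per-worker sharding for an iterable-style dataset:
    each worker skips to its own offset and strides by the worker count,
    so the workers jointly cover the stream exactly once."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def shard(self, worker_id, num_workers):
        # real code would read worker_id/num_workers from
        # torch.utils.data.get_worker_info() inside __iter__
        return iter(range(self.start + worker_id, self.end, num_workers))

ds = RangeStream(0, 10)
# two workers together yield every element once, with no duplicates
print(sorted(list(ds.shard(0, 2)) + list(ds.shard(1, 2))))
# → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```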
Closes https://github.com/pytorch/pytorch/issues/17909, https://github.com/pytorch/pytorch/issues/18096, https://github.com/pytorch/pytorch/issues/19946, and some of https://github.com/pytorch/pytorch/issues/13023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19228
Reviewed By: bddppq
Differential Revision: D15058152
fbshipit-source-id: 9e081a901a071d7e4502b88054a34b450ab5ddde
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20673
Add option to bucket-weighted pooling to hash the bucket so that any cardinality score can be used.
Reviewed By: huginhuangfb
Differential Revision: D15003509
fbshipit-source-id: 575a149de395f18fd7759f3edb485619f8aa5363
Summary:
The first attempt and more discussions are available in https://github.com/pytorch/pytorch/issues/19577
#### Goal
Allow toggling DDP gradient synchronization across iterations. With this feature, users may accumulate grads in module variables, and only kick off an expensive grad synchronization every few iterations.
#### Concerns
Our first attempt in https://github.com/pytorch/pytorch/issues/19577 tries to do it using a variable or a function. But apaszke made a good point that it would be error-prone, and favors a context manager instead.
#### Proposed Solution
Instead of providing a `accumulate_grads` variable/function/context, we provide a `DistributedDataParallel.no_sync()` context manager. And it does exactly what the name suggests, i.e., disable DDP grad synchronization within the context. Note that `accumulate_grads` means `no_sync` + no optimizer step, where the latter is not controlled by DDP.
It is true that users need to call another `model(input).backward()` after exiting the context, and this is indeed more verbose. But I think it is OK, as one major concern in the previous discussion was to prevent users from running into errors without knowing it. This API should reaffirm the expected behavior, and does not interfere with other use cases where accumulating grads is not required.
The application would then look like:
```python
with ddp.no_sync():
    for input in inputs:
        ddp(input).backward()
ddp(one_more_input).backward()
optimizer.step()
```
chenyangyu1988 myleott
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21736
Differential Revision: D15805215
Pulled By: mrshenli
fbshipit-source-id: 73405797d1e39965c52016af5cf45b15525ce21c
Summary:
There aren't any substantive changes aside from some test renames (e.g. `TestScript.test_dict_membership` -> `TestDict.test_membership`) and the addition of `TestDict.dict()`.
Adding the rest of the dict ops was making the tests a mess and `TestScript` is already > 10000 lines by itself, so breaking them up should make things cleaner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/22000
Pulled By: driazati
Differential Revision: D15911383
fbshipit-source-id: 614428e03fbc14252f0e9cde74ab9a707169a860
Summary:
The cppdocs build job (originally run on Chronos as a cron job) was frequently broken because it was not run on every PR. This PR moves it to CircleCI and enables it on every PR, so that we can get the build failure signal much earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19768
Differential Revision: D15922289
Pulled By: yf225
fbshipit-source-id: e36ef59a2e42f78b7d759ee02f2d94dc90f88fff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19443
This adds support for sparse gradients to the reducer as well as to
the DistributedDataParallel wrapper. Note that an out-of-band signal
is needed to indicate whether a dense parameter (e.g. an embedding) is
expected to receive a sparse gradient. This information is
passed to the bucket assignment computation routine and the reducer as
a vector of booleans. Every parameter for which we expect a sparse
gradient is assigned its own bucket, as we cannot easily group
multiple unrelated sparse tensors.
Reviewed By: mrshenli
Differential Revision: D15007365
fbshipit-source-id: f298e83fd3ca828fae9e80739e1db89d045c99ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19146
Implemented only on ProcessGroupGloo, as an allgather of metadata
(sparse_dim, dense_dim, and nnz), followed by an allgather of indices,
followed by an allgather of values. Once these operations have
finished, all ranks locally compute a reduction over these sparse
tensors. Works for both CPU and CUDA tensors.
This surfaced a problem with the existing assumption of only modifying
tensors that are passed at the call site, because for sparse tensors
we don't know the dimensions of the output tensors before we run the
collective. To deal with this unknown, this commit adds a `result`
function to the `c10d::ProcessGroup::Work` class that returns a vector
of tensors.
It's a bit odd to have to retrieve the result through this function
only for operations on sparse tensors. To make this work irrespective
of tensor layout, we can create a follow-up commit to make all in
place operations make their results accessible through this function
as well. This doesn't break any existing contracts but does have the
potential to add interface ambiguity.
Reviewed By: mrshenli
Differential Revision: D14889547
fbshipit-source-id: 34f3de4d6a2e09c9eba368df47daad0dc11b333e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21938
After having changed all call sites, we can now remove the old naming scheme.
Reviewed By: zdevito
Differential Revision: D15892402
fbshipit-source-id: 1f5b53a12fa657f6307811e8657c2e14f6285d2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21937
This changes call sites to use the new naming scheme
Reviewed By: zdevito
Differential Revision: D15892404
fbshipit-source-id: 8d32aa90a0ead1066688166478f299fde9c2c133
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21936
This introduces torch::List and torch::Dict as aliases to ListPtr/DictPtr.
After this lands, we can step by step change the call sites to the new naming
and finally remove the old spellings.
Reviewed By: zdevito
Differential Revision: D15892405
fbshipit-source-id: 67b38a6253c42364ff349a0d4049f90f03ca0d44
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21806
Dispatcher::findSchema(op_name) now uses a lookup table instead of iterating through the list of operators to find it.
This speeds up op lookup (as in finding the operator handle from the name, not as in finding a kernel when you already have the operator handle)
and it also speeds up op registration, since registration needs to check whether an op with the same name already exists.
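The change can be sketched in Python; `Dispatcher`, `register`, and `find_schema` here are illustrative stand-ins for the C++ dispatcher, not its real interface:

```python
class Dispatcher:
    """Sketch: op-handle lookup by name via a hash table instead of a scan."""
    def __init__(self):
        self._ops = []        # handles in registration order
        self._by_name = {}    # new: name -> handle (here, a list index)

    def register(self, name):
        # the duplicate-name check at registration time is now O(1) too
        if name in self._by_name:
            raise RuntimeError(f"op {name} already exists")
        self._by_name[name] = len(self._ops)
        self._ops.append(name)

    def find_schema(self, name):
        # before: linear scan -- for i, op in enumerate(self._ops): ...
        return self._by_name.get(name)

d = Dispatcher()
d.register("aten::add")
print(d.find_schema("aten::add"))  # → 0
```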
Differential Revision: D15834256
fbshipit-source-id: c3639d7b567e4ed5e3627c3ebfd01b7d08b55ac1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21809
Many error messages show dispatch keys, for example when the dispatcher didn't find a kernel to dispatch to.
Previously, this was a string like "CPU" or "CUDA" for known backends and just an arbitrary number for other backends.
Now, tensor type id registration also registers a name for the dispatch key and shows that in the error messages.
There is no API change, just the error messages are better now.
Differential Revision: D15835809
fbshipit-source-id: 4f0c9d0925c6708b02d79c653a2fae75b6623bb9
Summary:
https://github.com/pytorch/pytorch/pull/17072 breaks `model.to(xla_device)`, because moving `model` to XLA device involves changing its parameters' TensorImpl type, and the current implementation of `nn.Module.to()` doesn't support changing module parameters' TensorImpl type:
```python
# 6dc445e1a8/torch/nn/modules/module.py (L192-L208)
def _apply(self, fn):
...
for param in self._parameters.values():
if param is not None:
# Tensors stored in modules are graph leaves, and we don't
# want to create copy nodes, so we have to unpack the data.
param.data = fn(param.data) # NOTE: this doesn't allow changing `param.data`'s TensorImpl type
if param._grad is not None:
param._grad.data = fn(param._grad.data) # NOTE: this doesn't allow changing `param._grad.data`'s TensorImpl type
...
```
yf225 TODO: fix the description here when we finish the implementation
To fix this problem, we introduce a new API `model.to_()` that always assigns new tensors to the parameters (thus supporting changing the parameters to any TensorImpl type), and also bumps the version counter of the original parameters correctly so that they are invalidated in any autograd graph they participate in.
We also add warning to the current `model.to()` API to inform users about the upcoming behavior change of `model.to()`: in future releases, it would create and return a new model instead of in-place updating the current model.
This unblocks adding XLA to our CI test suite, which also allows XLA to catch up with other changes in our codebase, notably the c10 dispatcher.
[xla ci]
cc. resistor ailzhang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21613
Differential Revision: D15895387
Pulled By: yf225
fbshipit-source-id: b79f230fb06019122a37fdf0711bf2130a016fe6
Summary:
When we pass `fn` to `nn.Module._apply()` and `fn` is an in-place operation, the correct behavior should also include bumping the parameters' and their gradients' version counters. This PR fixes the old incorrect behavior and makes sure the new behavior is right.
Note that this PR is BC-breaking in the following way:
Previously, passing an in-place operation to `nn.Module._apply()` does not bump the module's parameters' and their gradients' version counters. After this PR, the module's parameters' and their gradients' version counters will be correctly bumped by the in-place operation, which will invalidate them in any autograd graph they previously participate in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21865
Differential Revision: D15881952
Pulled By: yf225
fbshipit-source-id: 62f9244a4283a110147e9f20145ff232a5579fbd
Summary:
Added some extra tests for std_mean and var_mean for multiple dims.
Some refactoring of previously created tests based on PR comments: https://github.com/pytorch/pytorch/pull/18731
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20650
Differential Revision: D15396101
Pulled By: ifedan
fbshipit-source-id: d15c3c2c7084a24d6cfea4018173552fcc9c03a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21852
To enable change of q_scale and q_zero_point in `copy_`
Differential Revision: D15793427
fbshipit-source-id: a7040b5b956d161fd6af6176287f4a4aa877c9be
Summary:
The code in `python_sugared_value.cpp` to recursively compile methods
was not being tested, so this adds a test for it and fixes some errors
in it
It was necessary to disable any hooks that were set, since (at least in
our tests) they would try to export a half-finished graph when called on
recursively compiled methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21862
Differential Revision: D15860314
Pulled By: driazati
fbshipit-source-id: e8afe9d4c75c345b6e1471072d67c5e335b61337
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21914
https://github.com/pytorch/pytorch/pull/21591 added a needed feature to clean up grad accumulator post hooks when the DistributedDataParallel model object is cleaned up. There's a minor typo that causes it to loop infinitely over the first element.
Differential Revision: D15878884
fbshipit-source-id: b7fd0bbd51eb187579d639b1709c6f7b62b85e7a
Summary:
This PR adds support for `in` checks like `key in my_dict`
For now it leaves lists as a follow up due to the changes around `IValue` lists and it needing an `IValue` equality op.
For objects it uses the magic method `__contains__(self, key)`
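In plain Python, the object case works the same way the check is lowered: `key in obj` dispatches to `obj.__contains__(key)`. A minimal illustration:

```python
class SparseSet:
    """`key in obj` calls obj.__contains__(key) -- the same magic method
    the compiler emits a call to for `in` checks on objects."""
    def __init__(self, items):
        self._items = set(items)

    def __contains__(self, key):
        return key in self._items

s = SparseSet([1, 2, 3])
print(2 in s, 9 in s)  # → True False
```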
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21527
Pulled By: driazati
Differential Revision: D15811203
fbshipit-source-id: 95745060394f8a9450efaaf8ab09d9af83bea01e
Summary:
This adds support for inferred attributes (everything except empty lists, dicts, and tuples) as well as using the PEP 526 style annotations on a class, so this eliminates the need for `torch.jit.Attribute`
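A plain-Python sketch of the PEP 526 style: class-body annotations land in `__annotations__`, which is where type information can be read from without explicit `torch.jit.Attribute` wrappers (the `Config` class and its fields are a made-up example):

```python
from typing import Dict, List

class Config:
    # PEP 526 class-body annotations: a declared type source readable by
    # tooling, replacing explicit torch.jit.Attribute(value, type) wrappers
    sizes: List[int]
    names_to_ids: Dict[str, int]

    def __init__(self):
        self.sizes = [1, 2]
        self.names_to_ids = {"a": 0}

print(sorted(Config.__annotations__))  # → ['names_to_ids', 'sizes']
```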
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21379
Differential Revision: D15718537
Pulled By: driazati
fbshipit-source-id: b7481ae3d7ee421613e931b7dc3427ef2a99757f
Summary:
This is a fix for https://github.com/pytorch/pytorch/issues/21469
Currently there is no way to tell whether a backward function has released its saved variables when those variables were stored in a vector. This change sets a flag when a function has saved variables and they have been released, so we can raise an error if somebody calls the function again with already-released variables.
Functions that do not have saved variables can still be called multiple times, for backward compatibility.
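The guard can be sketched in plain Python; `BackwardNode` and `apply` are hypothetical stand-ins for the autograd C++ machinery:

```python
class BackwardNode:
    """Sketch: error out when a backward function whose saved variables
    were already released is applied a second time."""
    def __init__(self, saved):
        self._saved = saved
        self._released = False

    def apply(self):
        if self._released:
            raise RuntimeError(
                "Trying to backward through the graph a second time after "
                "saved variables have been freed")
        out = sum(self._saved)
        if self._saved:        # only nodes that had saved variables set the flag
            self._saved = None
            self._released = True
        return out

n = BackwardNode([1, 2, 3])
print(n.apply())  # → 6; a second n.apply() raises RuntimeError
```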
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21533
Differential Revision: D15810481
Pulled By: ifedan
fbshipit-source-id: 5663e0c14f1b65727abc0d078aef348078d6a543
Summary:
This will need a conflict resolution once avg_pool2d() has been merged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21732
Differential Revision: D15824923
Pulled By: ezyang
fbshipit-source-id: 83341e0209b660aecf788272079d8135d78b6ff1
Summary:
This was some code I added :^)
Time for me to remove it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21897
Differential Revision: D15873213
Pulled By: Chillee
fbshipit-source-id: 769c3bd71c542be4afddc02dc2f65aa5c751b10d
Summary:
What's the point of having warnings if we never fix them :^)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21898
Differential Revision: D15873280
Pulled By: Chillee
fbshipit-source-id: a8274bab2badd840d36a9d2e1354677a6114ae1d
Summary:
cosine_similarity has two non-tensor parameters and needs some special handling. This diff adds support for its export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21884
Reviewed By: zrphercule
Differential Revision: D15866807
Pulled By: houseroad
fbshipit-source-id: a165fbc00c65c44b276df89ae705ca8960349d48
Summary:
```
This replaces the kernel helpers in Loops.h/cuh with the following:
cpu_kernel
cpu_kernel_vec
gpu_kernel
gpu_kernel_with_scalars
These work with functions with any number of input arguments, with the
exception of 'gpu_kernel_with_scalars' which is limited to binary
operations. Previously, we only supported functions of 0, 1, or 2 input
arguments. Adding support for 3 or 4 input argument functions required
significant amount of additional code.
This makes a few other changes:
Remove 'ntensors' from the for_each/serial_for_each loop. Most loops
assume a fixed number of tensors, and the value is accessible from
TensorIterator::ntensors()
Only lift CPU scalars to parameters in 'gpu_kernel_with_scalars'.
Previously, we performed this recursively in gpu_unary_kernel and
gpu_binary_kernel, so something like `torch.add(3, 4, out=cuda_tensor)`
would specialize to a "nullary" kernel. Now, only the first
scalar input is lifted to a kernel parameter. Any additional scalar
inputs are copied to CUDA tensors. Note that operations like `x + 5`
and `5 + x` still work efficiently. This avoids generating an exponential
number of specializations in the number of input arguments.
```
**Performance measurements**
Timing numbers are unchanged for basic elementwise operations. Linked below is a script to measure torch.add perf on PR vs. master CPU+GPU (GCC 7.3):
[miniperf.py](https://gist.github.com/colesbury/4a61893a22809cb0931f08cd37127be4)
**Generated assembly**
cpu_kernel and cpu_kernel_vec still generate good vectorized code with
both GCC 7.3 and GCC 4.8.5. Below is the assembly for the "hot" inner loop of
torch.add as well as an auto-vectorized torch.mul implementation using cpu_kernel/
binary_kernel. (The real torch.mul uses cpu_kernel_vec but I wanted to check that
auto vectorization still works well):
[torch.add GCC 7.3](https://gist.github.com/colesbury/927ddbc71dc46899602589e85aef1331)
[torch.add GCC 4.8](https://gist.github.com/colesbury/f00e0aafd3d1c54e874e9718253dae16)
[torch.mul auto vectorized GCC 7.3](https://gist.github.com/colesbury/3077bfc65db9b4be4532c447bc0f8628)
[torch.mul auto vectorized GCC 4.8](https://gist.github.com/colesbury/1b38e158b3f0aaf8aad3a76963fcde86)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21475
Differential Revision: D15745116
Pulled By: colesbury
fbshipit-source-id: 914277d7930dc16e94f15bf87484a4ef82890f91
Summary:
PR https://github.com/pytorch/pytorch/issues/20685 incorrectly only enabled P2P access for non-contiguous copies.
This can make cudaMemcpy slow for inter-gpu copies, especially on ROCm
devices. I didn't notice a difference on CUDA 10, but ngimel says it's
important for CUDA too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21872
Differential Revision: D15863965
Pulled By: colesbury
fbshipit-source-id: 0a858f3c338fa2a5d05949d7f65fc05a70a9dfe1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21080
Add Huber loss as a new option for regression training (refer to TensorFlow implementation: https://fburl.com/9va71wwo)
# huber loss
def huber(true, pred, delta):
    error = abs(true - pred)
    loss = 0.5 * min(error, delta)^2 + delta * max(error - delta, 0)
    return mean(loss)
As a combination of MSE loss (`x < delta`) and MAE loss (`x >= delta`), the advantage of Huber loss is to reduce the training dependence on outlier.
One thing worth noting is that the Huber loss is not twice differentiable at `x = delta`. To further address this, one could consider adopting the loss `log(cosh(x))`.
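As a sanity check, the min/max form above matches the textbook piecewise definition; a plain-Python sketch on scalar inputs (`error = |true - pred|` already computed):

```python
def huber(error, delta):
    # min/max form from the summary above
    return 0.5 * min(error, delta) ** 2 + delta * max(error - delta, 0.0)

def huber_piecewise(error, delta):
    # textbook piecewise form: quadratic below delta, linear above
    if error < delta:
        return 0.5 * error ** 2
    return delta * (error - 0.5 * delta)

for e in (0.0, 0.5, 1.0, 3.0):
    assert abs(huber(e, 1.0) - huber_piecewise(e, 1.0)) < 1e-12
print(huber(3.0, 1.0))  # → 2.5
```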
Reviewed By: chintak
Differential Revision: D15524377
fbshipit-source-id: 73acbe2728ce160c075f9acc65a1c21e3eb64e84
Summary:
After fixing https://github.com/pytorch/pytorch/issues/20774 the TRT build was broken
Because of missing annotations, pybind_state_gpu.so was missing symbols, but pybind_state.so was not. This caused a weird failure mode: trying to import pybind_state_gpu first left the system in a semi-initialized state and led to a segfault.
Minimal repro:
```
>>> import ctypes
>>> ctypes.CDLL('/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/ctypes/__init__.py", line 362, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state_gpu.so: undefined symbol: _ZN6caffe219TensorRTTransformer9TransformEPNS_9WorkspaceEPNS_6NetDefERKSt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS_11TensorShapeESt4hashISB_ESt8equal_toISB_ESaISt4pairIKSB_SC_EEE
>>> ctypes.CDLL('/var/lib/jenkins/.local/lib/python2.7/site-packages/caffe2/python/caffe2_pybind11_state.so')
Segmentation fault (core dumped)
```
Too lazy to repro locally, let's see if CI passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21775
Differential Revision: D15829605
Pulled By: dzhulgakov
fbshipit-source-id: 1adb2bde56b0cd68f84cfca67bc050adcf787cd9
Summary:
Following up b811b6d5c03596d789a33d7891b606842e01f7d2
* Use property instead of __setattr__ in CMake.
* Add a comment clarifying when build_ext.run is called.
---
cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21792
Differential Revision: D15860606
Pulled By: umanwizard
fbshipit-source-id: ba1fa07f58d4eac81ac27fa9dc7115d1cdd3dec0
Summary:
https://github.com/pytorch/pytorch/issues/11866 has corrected this issue in function `host_softmax` (aten/src/ATen/native/SoftMax.cpp). But I tried the example proposed in https://github.com/pytorch/pytorch/issues/11752. `log_softmax` is still not working for big logits.
I have looked into the source code, found that example had called `vec_host_softmax_lastdim`, not `host_softmax`.
This code fixes the issue in `_vec_log_softmax_lastdim` and has a test for `log_softmax`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21672
Differential Revision: D15856327
Pulled By: VitalyFedyunin
fbshipit-source-id: 7a1fd3c0a03d366c99eb873e235361e4fcfa7567
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21735
ghimport-source-id: 4a4289693e372880e3d36e579c83d9e8745e70ed
Test Plan:
- I'm not sure how to test this other than making sure it compiles.
- [namedtensor ci]
gh-metadata: pytorch pytorch 21735 gh/zou3519/49/head
Imported from OSS
Differential Revision: D15833456
Pulled By: zou3519
fbshipit-source-id: ea2fa6d5c5f1eb2d7970d47189d6e4fcd947146d
Summary:
kuttas pointed out that the DDP Reducer only needs to remember `uintptr, Function` pairs, and hence does not need an unordered map as added by https://github.com/pytorch/pytorch/issues/21591. Using a vector should speed it up a bit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21783
Differential Revision: D15854312
Pulled By: mrshenli
fbshipit-source-id: 153ba035b8d658c7878a613f16a42de977d89c43
Summary:
After https://github.com/pytorch/pytorch/pull/17072, we are allowed to pass Variables into ATen ops, thus there is no need to unwrap input variables in the c10 call path.
Note that since Caffe2 still expects inputs to be pure Tensors, we moved the unwrapping logic to the Caffe2 wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21620
Differential Revision: D15763560
Pulled By: yf225
fbshipit-source-id: 5375f0e51eb320f380ae599ebf98e6b259f0bff8
Summary:
This refactors pybind_utils so we can have all our type-inferring stuff in
1 place (e.g. for #21379)
There is some follow up work to make the error messages better, but I think that's fine to save for another PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21550
Pulled By: driazati
Differential Revision: D15727002
fbshipit-source-id: a6974f2e1e5879f0503a18efc138da31cda7afa2
Summary:
Resolves https://github.com/pytorch/lockdown/issues/18
This implements NamedTuple by taking advantage of the existing `names` field in `TupleType`.
TODO: This currently doesn't retain the NamedTuple-ness through serialization. Discussed with suo offline; we can probably add a way to define an anonymous NamedTuple in script (e.g. `NamedTuple('Foo', [('a', int), ('b', float), ('c', List[float])])`) and serialize that.
TODO: implement support for calling the constructor with kwargs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21428
Differential Revision: D15741564
Pulled By: jamesr66a
fbshipit-source-id: c077cbcea1880675ca6deb340a9ec78f824a136c
Summary:
When enabling this flag, there were a lot of warnings. This PR focuses on the warnings where the comparison could affect array indices, since those are the ones most prone to fail.
The good news is that I didn't find anything obviously concerning.
One degenerate case: if the matrices we work with are too skinny (dim1=1, dim2 needs to hold a big number), we could run into issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18187
Differential Revision: D14527182
Pulled By: hyuen
fbshipit-source-id: b9f46b6f68ab912c55368961758a7a5af1805555
Summary:
We plan on generating python bindings for C++ ChunkDataset API using the current Pytorch Dataloader class, which must call get_batch() instead of get_batch(size)
This change doesn't break the current API; it just adds one more method that will make future extensions easier (WIP).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21797
Differential Revision: D15830522
Pulled By: soumith
fbshipit-source-id: 7208f305b48bf65d2783eaff43ff57a05e62c255
Summary:
Originally, the tests for tensorboard writer are smoke tests only. This PR lets CI compare the output with expected results at low level. The randomness of the tensors in the test are also removed.
P.S. I found that protobuf serializes data differently across Python environments. One way to solve this is to write the data and then read it back immediately, comparing the data at a higher level.
For `add_custom_scalars`, the data to be written is a dictionary, and the serialized result might differ (it is not an `OrderedDict`), so only a smoke test is kept for that.
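The "write then read back, compare at a higher level" idea can be sketched like this (using `json` as a stand-in for protobuf serialization; the point is comparing parsed structures, not raw bytes):

```python
import json

def serialized_equal(expected, actual):
    # Compare at a higher level: round-trip both objects through the
    # serializer and compare the parsed results, so byte-level differences
    # (e.g. dict key ordering across environments) don't cause false failures.
    return json.loads(json.dumps(expected)) == json.loads(json.dumps(actual))
```

Two dicts with different insertion order still compare equal this way, which is exactly the property byte-for-byte comparison lacks.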
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20987
Reviewed By: NarineK, lanpa
Differential Revision: D15804871
Pulled By: orionr
fbshipit-source-id: 69324c11ff823b19960d50def73adff36eb4a2ac
Summary:
Try to fix a sporadic failure on some CIs.
I've run this test hundreds of times on my machine (GeForce 1060, MAGMA) but I cannot reproduce this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21638
Differential Revision: D15827779
Pulled By: ezyang
fbshipit-source-id: 3586075e48907b3b84a101c560a34cc733514a02
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21712
Warn when people use unordered_map or vector with IValues. These APIs are deprecated.
The unordered_map API is slow because it requires copying the whole map.
The vector API is slow for some types (e.g. std::string) because for them it also requires copying the whole vector.
Also, the vector API would get slow for all types if we decide to switch to SmallVector.
Differential Revision: D15792428
fbshipit-source-id: 1b72406b3a8d56521c862858c9f0ed01e56f2757
Summary:
When kwargs are specified in a test defined via common_method_invocations, it doesn't work if there isn't also a positional argument (`{'foo':'foo'}` without a positional arg generates a python call like: `self.method(, foo=foo)`, erroring on the `,`). I wanted to test something in a different PR and noticed I couldn't.
Also fixed some flake8 warnings I was seeing locally.
I replaced `lambda x: x` with `ident` since it seems a bit cleaner to me, but happy to revert that if others don't agree?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21499
Differential Revision: D15826974
Pulled By: nairbv
fbshipit-source-id: a3f37c80ba2303c7d9ae06241df06c7475b64e36
Summary:
So far, we only have py2 CI for ONNX. I think py3 support is important, and we plan to add onnxruntime backend tests, which only support py3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21715
Reviewed By: bddppq
Differential Revision: D15796885
Pulled By: houseroad
fbshipit-source-id: 8554dbb75d13c57b67ca054446a13a016983326c
Summary:
Some data loader tests are flaky on py 2 with the following error
```
Jun 12 22:17:31 Traceback (most recent call last):
Jun 12 22:17:31 File "test_dataloader.py", line 798, in test_iterable_dataset
Jun 12 22:17:31 fetched = sorted([d.item() for d in dataloader_iter])
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 697, in __next__
Jun 12 22:17:31 idx, data = self._get_data()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 664, in _get_data
Jun 12 22:17:31 success, data = self._try_get_data()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 617, in _try_get_data
Jun 12 22:17:31 data = self.data_queue.get(timeout=timeout)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/queues.py", line 135, in get
Jun 12 22:17:31 res = self._recv()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/queue.py", line 22, in recv
Jun 12 22:17:31 return pickle.loads(buf)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1382, in loads
Jun 12 22:17:31 return Unpickler(file).load()
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 858, in load
Jun 12 22:17:31 dispatch[key](self)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/pickle.py", line 1133, in load_reduce
Jun 12 22:17:31 value = func(*args)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/site-packages/torch/multiprocessing/reductions.py", line 274, in rebuild_storage_fd
Jun 12 22:17:31 fd = multiprocessing.reduction.rebuild_handle(df)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 157, in rebuild_handle
Jun 12 22:17:31 new_handle = recv_handle(conn)
Jun 12 22:17:31 File "/opt/python/2.7.9/lib/python2.7/multiprocessing/reduction.py", line 83, in recv_handle
Jun 12 22:17:31 return _multiprocessing.recvfd(conn.fileno())
Jun 12 22:17:31 OSError: [Errno 4] Interrupted system call
```
Apparently, Python 2.7's `recvfd` calls `recvmsg` without EINTR retry: https://github.com/python/cpython/blob/2.7/Modules/_multiprocessing/multiprocessing.c#L174
So we should wrap the call in an outer retry loop that catches EINTR.
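The retry pattern can be sketched as follows (`retry_on_eintr` is an illustrative helper, not the actual dataloader code):

```python
import errno

def retry_on_eintr(call):
    # Keep retrying a system call that was interrupted by a signal (EINTR),
    # mirroring the outer retry loop needed around Python 2.7's recvfd,
    # which does not retry internally.
    while True:
        try:
            return call()
        except (OSError, IOError) as e:
            if getattr(e, "errno", None) != errno.EINTR:
                raise
```

Any other error still propagates; only `Errno 4: Interrupted system call` triggers a retry.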
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21723
Differential Revision: D15806247
Pulled By: ezyang
fbshipit-source-id: 16cb661cc0fb418fd37353a1fef7ceeb634f02b7
Summary:
Currently when building extensions, variables such as USE_CUDA, USE_CUDNN are used to determine what libraries should be linked. But we should use what CMake has detected, because:
1. If CMake found them unavailable but the variables say some libraries should be linked, the build would fail.
2. If the first build is made using a set of non-default build options, rebuild must have these option passed to setup.py again, otherwise the extension build process is inconsistent with CMake. For example,
```bash
# First build
USE_CUDA=0 python setup.py install
# Subsequent builds like this would fail, unless "build/" is deleted
python setup.py install
```
This commit addresses the above issues by using variables from CMakeCache.txt when building the extensions.
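Reading build options back out of `CMakeCache.txt` amounts to parsing `KEY:TYPE=VALUE` lines; a minimal sketch (my own helper, not the actual `setup.py` code) might look like:

```python
import re

def parse_cmake_cache(text):
    # Parse "KEY:TYPE=VALUE" lines (as found in CMakeCache.txt) into a dict.
    # Comment lines start with '#' or '//' and are skipped.
    cache = {}
    pattern = re.compile(r"^(.+?):([A-Za-z]+)=(.*)$")
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "//")):
            continue
        m = pattern.match(line)
        if m:
            key, _type, value = m.groups()
            cache[key] = value
    return cache
```

With this, a rebuild can consult `cache['USE_CUDA']` as detected by CMake instead of re-reading environment variables.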
---
The changes in `setup.py` may look lengthy, but the biggest changed block is mostly moving them into a function `configure_extension_build` (along with some variable names changed to `cmake_cache_vars['variable name']` and other minor changes), because it must be called after CMake has been called (and thus the options used and system environment detected by CMake become available).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21653
Differential Revision: D15824506
Pulled By: ezyang
fbshipit-source-id: 1e1eb7eec7debba30738f65472ccad966ee74028
Summary:
This makes the error thrown in aten_to_numpy_dtype consistent with that in numpy_dtype_to_aten.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21608
Differential Revision: D15816035
Pulled By: gchanan
fbshipit-source-id: 392e8b9ea37003a859e7ed459911a1700fcbd695
Summary:
This PR is intended as a fix for https://github.com/pytorch/pytorch/issues/21644.
It allows the `with emit_nvtx` context manager to take an additional `record_shapes` argument. `record_shapes` is False by default, but if True, the nvtx ranges generated for each autograd op will append additional information about the sizes of Tensors received by that op.
The format of shape information is equivalent to what the CPU-side profiler spits out. For example,
```
M = torch.randn(2, 3)
mat1 = torch.randn(2, 3)
mat2 = torch.randn(3, 3)
with torch.cuda.profiler.profile():
with torch.autograd.profiler.emit_nvtx(record_shapes=True):
torch.addmm(M, mat1, mat2)
```
produces the following nvtx range label for addmm:

(cf the "Input Shapes" shown in 864cfbc216 (diff-115b6d48fa8c0ff33fa94b8fce8877b6))
I also took the opportunity to do some minor docstring cleanup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21691
Differential Revision: D15816226
Pulled By: gchanan
fbshipit-source-id: b2b01ea10fea61a6409a32b41e85b6c8b4851bed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20924
I found a python3 bug for deserializing caffe2 code. The exception thrown is Unicode related error instead of just decode error, and we need to catch that as well
Reviewed By: ipiszy
Differential Revision: D15293221
fbshipit-source-id: 29820800d1b4cbe5bf3f5a189fe2023e655d0508
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21763
Custom __getattr__ functions can only raise AttributeError. This code threw NotImplementedError, which caused trouble upstream when hasattr() was called.
Differential Revision: D15815176
fbshipit-source-id: 0982e2382de4578d3fc05c5d2a63f624d6b4765e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21446
This is used for easier tracing of the iteration id when looking at the trace diagram.
Reviewed By: ilia-cher
Differential Revision: D15628950
fbshipit-source-id: ee75b3bdb14a36abc18c7bddc49d8ec9789b724d
Summary:
```
The stride calculation using OffsetCalculator performs poorly with
MAX_DIMS=25. This reduces MAX_DIMS (after coalescing) to 16 on ROCm.
I think it's unlikely that anyone will exceed this limit. If they do,
we can add additional specializations for ROCm with more dimensions.
```
I'm not sure about the underlying cause. With MAX_DIM=25, the add kernel's params
is ~648 bytes vs. ~424 bytes with MAX_DIM=16. The kernel instruction footprint is
bigger too, but most of these instructions are never executed and most kernel parameters
are never loaded because the typical dimensionality is much smaller.
Mini benchmark here:
https://gist.github.com/colesbury/1e917ae6a0ca9d24712121b92fed4c8f
(broadcasting operations are much faster)
cc iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21754
Reviewed By: bddppq
Differential Revision: D15811906
Pulled By: colesbury
fbshipit-source-id: 063f92c083d26e2ef2edc98df7ff0400f9432b9d
Summary:
Currently multihead attention for half type is broken
```
File "/home/ngimel/pytorch/torch/nn/functional.py", line 3279, in multi_head_attention_forward
attn_output = torch.bmm(attn_output_weights, v)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #2 'mat2'
```
because softmax converts half inputs into fp32 inputs. This is unnecessary - all the computations in softmax will be done in fp32 anyway, and the results need to be converted into fp16 for the subsequent batch matrix multiply, so nothing is gained by writing them out in fp32. This PR gets rid of type casting in softmax, so that half works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21658
Differential Revision: D15807487
Pulled By: zhangguanheng66
fbshipit-source-id: 4709ec71a36383d0d35a8f01021e12e22b94992d
Summary:
In this PR, we use `expect` to fill in the token for pytorchbot when doing `git push`, so that we don't need to save the token in the git remote URL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20459
Differential Revision: D15811676
Pulled By: yf225
fbshipit-source-id: cd3b780da05d202305f76878e55c3435590f15a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21742
Add error message to NotImplementedError so we know which function it is about.
Reviewed By: bddppq
Differential Revision: D15806379
fbshipit-source-id: 14eab9d03aa5b44ab95c5caeadc0e01d51f22188
Summary:
When converting pixel_shuffle to reshape + transpose + reshape, the first reshape should
be:
[N, C * r^2, H, W] => [N, C, r, r, H, W]
in order to match pytorch's implementation (see ATen PixelShuffle.cpp).
This previously wasn't caught by the test case, since it uses C = r = 4. Updated test case to
have C = 2, r = 4.
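The corrected decomposition can be sketched in NumPy (a sketch mirroring the reshape/transpose/reshape order described above; `pixel_shuffle` here is my own helper, not the torch API):

```python
import numpy as np

def pixel_shuffle(x, r):
    # [N, C*r^2, H, W] -> [N, C, r, r, H, W] -> [N, C, H, r, W, r] -> [N, C, H*r, W*r]
    n, c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(n, c, r, r, h, w)       # the first reshape the fix corrects
    x = x.transpose(0, 1, 4, 2, 5, 3)     # interleave spatial blocks
    return x.reshape(n, c, h * r, w * r)
```

Note that a test with C = r = 4 cannot distinguish `[N, C, r, r, H, W]` from `[N, r, r, C, H, W]`, which is why the test case was changed to C = 2, r = 4.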
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21486
Reviewed By: houseroad
Differential Revision: D15700945
Pulled By: houseroad
fbshipit-source-id: 47019691fdc20e152e867c7f6fd57da104a12948
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21718
adding a detection method on whether the package is built for AMD.
Reviewed By: bddppq
Differential Revision: D15795893
fbshipit-source-id: 91a21ee76b2273b1032507bdebe57e016717181d
Summary:
**Closes:** Confusing documentation with distributions.Categorical about logits https://github.com/pytorch/pytorch/issues/16291
**Solution**: Changes the documentation of the Categorical distribution from `log probabilities` to `event log-odds`. This should reduce the confusion raised in that issue, and is consistent with other distributions such as `torch.Binomial`.
More than happy to make any other changes if they fit :).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21707
Differential Revision: D15799181
Pulled By: soumith
fbshipit-source-id: f11acca7a5c130102a3ff6674640235ee5aa69bf
Summary:
- [x] Add tests after https://github.com/pytorch/pytorch/pull/20256 is merged
- Support exporting ScriptModule with inputs/outputs of arbitrarily constructed tuples.
- Moved the assigning of output shapes to after graph conversion to ONNX is completed. By then all tuples in the IR have already been lowered by the pass ```_jit_pass_lower_all_tuples```. If assigning output shapes were required to happen before that, we'd need to hand-parse the tuple structures in the graph and repeat the same logic as ```_jit_pass_lower_all_tuples```. Handling inputs is easier because all tuple information is encoded within the input tensor type.
- Swap the order of ```_jit_pass_lower_all_tuples``` and ```_jit_pass_erase_number_types```. Ops like ```prim::TupleIndex``` rely on the index being a scalar. ```_jit_pass_erase_number_types``` will convert these kinds of scalars to tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20784
Reviewed By: zrphercule
Differential Revision: D15484171
Pulled By: houseroad
fbshipit-source-id: 4767a84038244c929f5662758047af6cb92228d3
Summary:
This renames the CMake `caffe2` target to `torch`, as well as renaming `caffe2_gpu` to `torch_gpu` (and likewise for other gpu target variants). Many intermediate variables that don't manifest as artifacts of the build remain for now with the "caffe2" name; a complete purge of `caffe2` from CMake variable names is beyond the scope of this PR.
The shell `libtorch` library that had been introduced as a stopgap in https://github.com/pytorch/pytorch/issues/17783 is again flattened in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20774
Differential Revision: D15769965
Pulled By: kostmo
fbshipit-source-id: b86e8c410099f90be0468e30176207d3ad40c821
Summary:
Class member annotations can be marked with `Final[T]` instead of adding them to `__constants__`. `Final` comes from the `typing_extensions` module (which will be used if it is present). If not, the polyfill from `_jit_internal` is exposed as `torch.jit.Final` for users that don't want to install `typing_extensions`.
This keeps around `__constants__` since a lot of code is still using it, but in documentation follow ups we should change the examples to all to use `Final`.
TODO: install typing_extensions on CI, move tests to a Python3 only file when #21489 lands
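The import-fallback pattern described above can be sketched like this (using stdlib `typing.Final` as the fallback for illustration; the PR's actual fallback is the `torch.jit.Final` polyfill):

```python
try:
    from typing_extensions import Final  # preferred if installed
except ImportError:
    from typing import Final  # stdlib Final, available on Python 3.8+

class Config:
    # Marked Final instead of being listed in __constants__.
    num_layers: Final[int] = 4
```

Either import gives the same annotation semantics, so user code doesn't need to care which module provided `Final`.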
](https://our.intern.facebook.com/intern/diff/15746274/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21603
Pulled By: driazati
Differential Revision: D15746274
fbshipit-source-id: d2c9b5643b4abba069b130c26fd42714c906ffac
Summary:
This adds support for PEP 526 style annotations on assignments in place of
`torch.jit.annotate()`, so
```python
a = torch.jit.annotate(List[int], [])
```
turns into
```python
a : List[int] = []
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21390
Differential Revision: D15790937
Pulled By: driazati
fbshipit-source-id: 0cc204f7209a79839d330663cc6ba8320d3a4120
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21177
- Integrate c10::ListPtr into IValue and the c10 dispatcher.
- Streamline conversion to/from IValue. Before, we had IValue::to<> and kernel_functor.h had its own ivalue_to_arg_type and return_type_to_ivalue. They are now unified. Also, this means that nested types like Dicts of Lists of Optional of Dict of ... do work as expected now
Differential Revision: D15476433
fbshipit-source-id: bde9df80df20091aa8e6ae17ba7e90abd149b954
Summary:
Accidentally rebased the old PR and make it too messy. Find it here (https://github.com/pytorch/pytorch/pull/19274)
Create a PR for comments. The model is still WIP but I want to get some feedback before moving too far. The transformer model depends on several modules, like MultiheadAttention (landed).
Transformer is implemented based on the paper (https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). Users have the flexibility to build a transformer with self-defined and/or built-in components (i.e encoder, decoder, encoder_layer, decoder_layer). Users could use Transformer class to build a standard transformer model and modify sub-layers as needed.
Add a few unit tests for the transformer module, as follow:
TestNN.test_Transformer_cell
TestNN.test_transformerencoderlayer
TestNN.test_transformerdecoderlayer
TestNN.test_transformer_args_check
TestScript.test_scriptmodule_transformer_cuda
There is another demonstration example for applying transformer module on the word language problem. https://github.com/pytorch/examples/pull/555
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20170
Differential Revision: D15417983
Pulled By: zhangguanheng66
fbshipit-source-id: 7ce771a7e27715acd9a23d60bf44917a90d1d572
Summary:
Currently we don't have any Linux libtorch binary build in the PR CI, which led to nightly build failure such as https://circleci.com/gh/pytorch/pytorch/1939687. This PR adds Linux libtorch CPU binary build to prevent such breakage from happening in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21671
Differential Revision: D15785003
Pulled By: yf225
fbshipit-source-id: d1f2e4235e48296ddecb3367f8e5a0df16f4ea49
Summary:
Fix https://github.com/pytorch/pytorch/issues/20421
`ProcessGroupGloo` only requires input/output tensors to be contiguous. Contiguous tensors might not start from the beginning of the underlying storage, e.g., `chunk(..., dim=0)[1]`. The current implementation passes `tensor.storage().data()` ptr to gloo buffer. This leads to wrong results if the tensor has a non-zero storage offset.
The proposed solution is to use `tensor.data_ptr()` instead. Let's see if this breaks any tests.
cc qijianan777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21490
Differential Revision: D15768907
Pulled By: mrshenli
fbshipit-source-id: 9d7d1e9baf0461b31187c7d21a4a53b1fbb07397
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21592
We now support groupwise convolutions for qconv2d
Reviewed By: zafartahirov
Differential Revision: D15739239
fbshipit-source-id: 80b9b4fef5b9ee3d22ebecbaf205b970ab3d4250
Summary:
Closes https://github.com/pytorch/pytorch/issues/21344
DDP assigns the original module to the first module replica instead of creating a new one. Then, it creates a new Reducer to add post hooks to sync gradients. However, because every reconstructed DDP instance wraps the same original module, all their reducers will add hooks to the same set of variables. This PR deletes DDP hooks from variables when destructing Reducer, trying to make DDP failure recoverable.
pietern kuttas and I discussed the following solutions:
#### Solution 1
Keep `add_post_hook` API intact, and do a `dynamic_cast` in `del_post_hook` to check hook type. If the type matches Reducer's hook, delete it. As pietern mentioned, this will not work if we create multiple DDP instances from the same original model.
#### Solution 2
Use a counter to generate a unique key for every hook in `Function`, and keep them in a map. return the key to the caller of `add_post_hook`, and ask the caller to provide key if it needs to delete the hook.
Con: this would add extra overhead to `add_post_hook` and every `Function` object.
#### Solution 3 [Current implementation]
kuttas suggests that, instead of generating a unique key, directly using the address of the pointer would be better. In order to avoid messing up dereferencing, let `add_post_hook` return a `uintptr_t`.
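Solution 3 can be sketched in Python, with `id()` playing the role of the C++ pointer address cast to `uintptr_t` (an illustrative analogy, not the actual autograd `Function` code):

```python
class Function:
    """Sketch: post-hooks keyed by their own object address."""

    def __init__(self):
        self._post_hooks = {}

    def add_post_hook(self, hook):
        # id() stands in for reinterpret_cast<uintptr_t>(hook_ptr): the hook's
        # address is its key, so no counter or extra state is needed.
        key = id(hook)
        self._post_hooks[key] = hook
        return key

    def del_post_hook(self, key):
        # The Reducer's destructor hands the key back to remove only its hook,
        # leaving hooks installed by other DDP instances untouched.
        self._post_hooks.pop(key, None)
```

Because the key identifies one specific hook object, two DDP instances wrapping the same module can each remove their own hook safely.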
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21591
Differential Revision: D15745706
Pulled By: mrshenli
fbshipit-source-id: e56d2d48de0c65f6667790ab16337eac7f7d8b76
Summary:
This makes it so we can see the output of prim::Print in environments like IPython notebooks which override sys.stdout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21625
Differential Revision: D15756793
Pulled By: jamesr66a
fbshipit-source-id: 7d9a14b2e229ed358e784318e9d862677db2c461
Summary:
Emit loop condition as a separate block in loops, then inline them before conversion to SSA. This is needed for breaks & continues where we will inline the condition block after the continue pass and before the break pass.
I also considered emitting a prim::For and a prim::While, but I think it's easier to just have one pathway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21611
Differential Revision: D15775820
Pulled By: eellison
fbshipit-source-id: de17c5e65f6e4a0256a660948b1eb630e41b04fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21606
StoreMatrixInMatrixMarketFormat could only dump quantized tensors, but sometimes we want to dump float tensors.
Reviewed By: csummersea
Differential Revision: D15741611
fbshipit-source-id: 95b03c2fdf1bd8407f7d925171d9dc9f25677464
Summary:
Stream is not respected on range/linspace/logspace functions, which contributes to https://github.com/pytorch/pytorch/issues/21589 (this is not a complete solution for that issue).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21619
Differential Revision: D15769666
Pulled By: ezyang
fbshipit-source-id: 7c036f7aecb3119430c4d432775cad98a5028fa8
Summary:
Resolves issue https://github.com/pytorch/pytorch/issues/19003
The author of this issue also asked that `cycle_momentum` default to `False` if the optimizer does not have a momentum parameter, but I'm not sure what the best way to do this would be. Silently changing the value based on the optimizer may confuse the user in some cases (say the user explicitly set `cycle_momentum=True` but doesn't know that the Adam optimizer doesn't use momentum).
Maybe printing a warning when switching this argument's value would suffice?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20401
Differential Revision: D15765463
Pulled By: ezyang
fbshipit-source-id: 88ddabd9e960c46f3471f37ea46013e6b4137eaf
Summary:
This adds support for PEP 526 style annotations on assignments in place of
`torch.jit.annotate()`, so
```python
a = torch.jit.annotate(List[int], [])
```
turns into
```python
a : List[int] = []
```
](https://our.intern.facebook.com/intern/diff/15706021/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21390
Pulled By: driazati
Differential Revision: D15706021
fbshipit-source-id: 8bf1459f229d5fd0e16e59953b9656e85a2207fb
Summary:
Ops on a Process Group (pg) instance will hit an error when input/output tensors are created on a different process, because, pg calls `recordStream` on `CUDACachingAllocator` which only knows tensors created within the same process.
The proposed solution is to add a `suppressError` arg (suggestions for better names?) to `recordStream`. See comments in code for arguments.
CC pichuang1984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21449
Differential Revision: D15689736
Pulled By: mrshenli
fbshipit-source-id: e7fc81b167868f8666536067eaa7ae2c8584d88e
Summary:
1. reduce the overhead of mkldnn-bridge itself
2. remove redundant code and useless APIs
3. provide new operators, including int8 inner_product, ND permute/transpose, elem_add/mul, etc.
4. improve inner_product to support io format weights without implicit reorder
5. add SoftMax support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20569
Reviewed By: houseroad
Differential Revision: D15558663
Pulled By: bddppq
fbshipit-source-id: 79a63aa139037924e9ffb1069f7e7f1d334efe3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21207
This diff adds 80 PT pointwise unary ops to the benchmark suite. Most of the ops are added using the generate_pt_tests_from_list interface. The rest are handled separately.
Reviewed By: zheng-xq
Differential Revision: D15471597
fbshipit-source-id: 8ea36e292a38b1dc50f064a48c8cd07dbf78ae56
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21210
This diff introduces a new path to run op with JIT. There are two steps involved here:
1. Users need to script the op. This should happen in the `init` method.
2. The generated graph from step1 is passed to `jit_forward` which will be executed by the benchmark backend
Reviewed By: zheng-xq
Differential Revision: D15460831
fbshipit-source-id: 48441d9cd4be5d0acebab901f45544616e6ed2ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20723
These classes already existed but only as c10::Dict and c10::OperatorKernel.
Since they're now part of torch::RegisterOperators(), they should also live in the torch namespace.
Differential Revision: D15421575
fbshipit-source-id: d64ebd8664fadc264bbbae7eca1faa182529a32b
Summary:
yf225 helped me discovered that our CI does not run multi-gpu tests in `test_c10d.py`. There are quite a few multi-gpu c10d tests. This PR tries to enable those tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21598
Differential Revision: D15744256
Pulled By: mrshenli
fbshipit-source-id: 0a1524a862946128321f66fc8b7f331eff10e52a
Summary:
Create an uninitialized ivalue. This will be needed for breaks & continues, to match up if-block outputs for values that are guaranteed not to be used but need to escape the block scope. It is not exposed to users.
Was previously part of final returns but I was asked to make a separate PR for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21387
Differential Revision: D15745124
Pulled By: eellison
fbshipit-source-id: ae6a6f766b4a70a71b9033987a630cfbf044e296
Summary:
For consistency, derivatives.yaml now uses the same schema specification as native_functions.yaml.
Note that there are some small downsides, e.g. changing the default values or return parameter names in native_functions.yaml also now requires updating derivatives.yaml as well. But this has a few nice properties:
1) Able to copy-paste definitions from native_functions to derivatives.
2) Makes it impossible to write derivatives for operators without schemas (e.g. old TH operators).
3) Moves us closer to the ideal situation of co-locating forward and backwards declarations.
Note that this doesn't change any generated code; in particular, this has the same behavior of mapping in-place and out-of-place definitions together.
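For illustration, a derivatives.yaml entry under the unified schema might look like the following (sketched from memory; treat the exact formula names as approximate):

```yaml
# The `name` line is copy-pasteable from native_functions.yaml,
# including default values and return names.
- name: add(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  self: grad
  other: maybe_multiply(grad, alpha)
```

The key point is that the schema line is now identical to the one in native_functions.yaml, rather than an abbreviated variant.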
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20916
Differential Revision: D15497800
Pulled By: gchanan
fbshipit-source-id: baee5caf56b675ce78dda4aaf6ce6a34575a6432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21599
We prevented this because c10 ops can't have a backwards yet and calling them with requires_grad=True would do the wrong thing
if the c10 op is not purely implemented by calling other autograd-able ops.
However, it is a valid use case to have c10 ops that just call other autograd-aware ops, and these ops should be callable with requires_grad=True.
This should fix https://github.com/pytorch/pytorch/issues/21584.
Differential Revision: D15744692
fbshipit-source-id: ba665365c850ef63fc9c51498fd69afe49e5d7ec
Summary:
An incorrect increment / decrement caused the samples to not be generated from a multinomial distribution
Changelog:
- Remove the incorrect increment / decrement operation
Fixes #21257, fixes #21508
cc: LeviViana neerajprad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21324
Differential Revision: D15717575
Pulled By: ezyang
fbshipit-source-id: b1154e226d426c0d412d360c15f7c64aec95d101
Summary:
Fix a test that wasn't run on the CI, but is tested internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21594
Differential Revision: D15742157
Pulled By: eellison
fbshipit-source-id: 11fc82d1fc0281ffedd674ed96100e0c783c0599
Summary:
This PR addresses some numerical issues of Sigmoid/StickBreakingTransform, where these transforms give +-inf when the unconstrained values move to +-20 areas.
For example, with
```
t = torch.distributions.SigmoidTransform()
x = torch.tensor(20.)
t.inv(t(x)), t.log_abs_det_jacobian(x, t(x))
```
currently the inverse returns `inf` and logdet returns `-inf`, while this PR makes them `15.9424` and `-15.9424`.
And for
```
t = torch.distributions.StickBreakingTransform()
x = torch.tensor([20., 20.])
t.inv(t(x)), t.log_abs_det_jacobian(x, t(x))
```
current value is `(inf, nan)` and `-inf` for logdet, while this PR makes it `[16.6355, 71.3942]` and `-47.8272` for logdet.
Although these finite values are wrong and seems unavoidable, it is better than returning `inf` or `nan` in my opinion. This is useful in HMC where despite that the grad will be zero when the unconstrained parameter moves to unstable area (due to clipping), velocity variable will force the parameter move to another area which by chance can move the parameter out of unstable area. But inf/nan can be useful to stop doing inference early. So the changes in this PR might be inappropriate.
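The clamping idea behind the stabilized inverse can be sketched in plain Python (`sigmoid_inv` is my own illustrative helper, not the transform's actual implementation, and `1e-7` is an arbitrary eps for the sketch):

```python
import math

def sigmoid_inv(y, eps=1e-7):
    # Clamp y away from 0 and 1 so the inverse stays finite even when the
    # forward sigmoid saturated (e.g. sigmoid(20.) rounds to exactly 1.0
    # in low precision, whose true inverse would be inf).
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y) - math.log1p(-y)
```

The recovered value is wrong in magnitude (bounded by the clamp) but finite, which is the trade-off the PR discussion weighs for HMC.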
I also fix some small issues of `_Simplex` and `_RealVector` constraints where batch shape of the input is not respected when checking validation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20288
Differential Revision: D15742047
Pulled By: ezyang
fbshipit-source-id: b427ed1752c41327abb3957f98d4b289307a7d17
Summary:
This changes our compiler so it first emits Loads & Stores, and then transforms the graph to SSA in a follow up pass. When a variable is set, we emit a prim::Store, and when a variable is referenced, we emit a prim::Load.
```
a = 1
print(a)
```
becomes:
```
%a.1 : int = prim::Constant[value=1]()
prim::Store[name="a"](%a.1)
%a : int = prim::Load[name="a"]()
prim::Print(%a)
```
In the follow up pass, convertToSSA, the values are turned into SSA form with the Loads & Stores removed. This change will enable breaks and continues because you can transform the graph with the variable naming information still intact.
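A toy sketch of the idea for straight-line code (hypothetical helper; the real convertToSSA pass also handles control flow, which is the whole point of keeping the names around):

```python
def to_ssa(instructions):
    """Replace each ("load", name) with the value of the most recent
    ("store", name, value), dropping the loads and stores themselves."""
    env = {}   # variable name -> SSA value most recently stored
    out = []
    for inst in instructions:
        if inst[0] == "store":
            _, name, value = inst
            env[name] = value               # remember the latest definition
        elif inst[0] == "load":
            _, name = inst
            out.append(("use", env[name]))  # rewire the use to the stored value
        else:
            out.append(inst)
    return out

# mirrors the example above: store %a.1 into "a", then load "a" for the print
prog = [("store", "a", "%a.1"), ("load", "a")]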
There are still some remaining jitter and edge-case issues that I have to look through, but I think it is still ready for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21101
Differential Revision: D15723353
Pulled By: eellison
fbshipit-source-id: 3269934d4bc24ddaf3a87fdd20620b0f954d83d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21382
The Concat tensor inference function was not correctly handling the case where the axis argument points to the last dimension, in which case input tensors don't need to have the same number of dimensions.
The Split tensor inference function was not correctly handling the case where split information is provided as the second input tensor rather than as an argument.
Reviewed By: mdschatz
Differential Revision: D15633148
fbshipit-source-id: d566af44dc882457ee9efe83d2461b28408c2c5d
Summary:
Should be self-explanatory. This `int` variable is overflowing.
Reported in #21526
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21530
Differential Revision: D15719275
Pulled By: umanwizard
fbshipit-source-id: 24e917a00a5b78bc3af29ef3b8b72eea7e89d5d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21556
Optimize the batch mm op when broadcasting the second input
Reviewed By: houseroad
Differential Revision: D15728914
fbshipit-source-id: c60441d69d4997dd32a3566780496c7ccda5e67a
Summary:
This was looking at the number of elements in the memo table, not the total capacity, and was thus calling reserve() a lot more than it should have
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21542
Reviewed By: driazati
Differential Revision: D15723132
Pulled By: jamesr66a
fbshipit-source-id: 20e1f9099b6a51a33994ea9dbc3f22eb3bc0c8f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21195
The motivation is that, while we shouldn't break USER code for using
deprecated declarations, we should keep our internal code base
deprecation clean.
Differential Revision: D15576968
fbshipit-source-id: fb73a8986a5b60bf49ee18260653100319bb1030
Summary:
namedtensor build + test should run on PRs only if the commit message
includes [namedtensor ci].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21520
Differential Revision: D15718404
Pulled By: zou3519
fbshipit-source-id: ce8b5df2682e795e64958a9d49e2e3c091599b33
Summary:
This should further reduce noise by only clang-formatting the lines you actually touched in the precommit hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15657
Differential Revision: D15717337
Pulled By: suo
fbshipit-source-id: 57e65a679a8fdee5c3ff28e241c74ced9398eb0c
Summary:
The new implementation of tracing supports more modules, so a lot of error-handling code can be removed by replacing the old one (LegacyTracedModule).
cc orionr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21339
Reviewed By: natalialunova
Differential Revision: D15695154
Pulled By: orionr
fbshipit-source-id: af7d35754e9f34bd1a0ad7b72a9ebe276ff8ab98
Summary:
Fixes#12259, needs to make sure tests (see #13766) don't break due to numerical precision issues. Not sure what would need to be adjusted here...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13774
Differential Revision: D15715021
Pulled By: ezyang
fbshipit-source-id: 20ce2beee1b39ebe9f023c5f2b25be53acccb5f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21492
If one async operator failed, async_scheduling net currently only marks all scheduled async operators as finished without cancelling the callbacks.
The new behavior is to cancel the callbacks first, then set event status to finished.
Reviewed By: ilia-cher
Differential Revision: D15702475
fbshipit-source-id: 55a1774d768b2e238bab859b83332f1877a001ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21502
In BenchResult, we keep name, avg_fwd, std_fwd, avg_bwd, and std_bwd. There is no information about the number of each iteration. In this diff, I am adding more info to BenchResult to include the number reported from each iteration.
Reviewed By: wanchaol
Differential Revision: D15706306
fbshipit-source-id: 3f14be4ba91f1f6da473995783bd7af1d067938d
Summary:
This moves `JitTestCase` to its own file so that we can have other jit
test files (ex. `test_jit_py3.py`)
There aren't any code changes, just a move and cleaning up the imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21491
Pulled By: driazati
Differential Revision: D15703060
fbshipit-source-id: 6082e8b482100bb7b0cd9ae69738f1273e626171
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21230
tsia; this diff adds empty-tensor support to the reshape operator
Reviewed By: jerryzh168
Differential Revision: D15583356
fbshipit-source-id: 6d44c04e95ca3546509bfb12102e29c878f9a7c7
Summary:
Modify MKLDNN pooling operation to support ceil mode by adjusting the right/bottom padding accordingly. This is done similarly as in Caffe (see discussion https://github.com/pytorch/pytorch/pull/19205#discussion_r276903751).
To make this possible, I split the padding to left and right (top / bottom). This naming is confusing but actually follows mkldnn's own naming for pooling::compute(). We increase the r paddings so that it matches the ceiling mode expected output size.
Strengthened the test case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21310
Reviewed By: bddppq
Differential Revision: D15611664
Pulled By: akyrola
fbshipit-source-id: 46b40015dafef69a8fd5e7b2c261d8dbf448cd20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21393
Result of splitting the base diff. We moved a header from src/* to include/fbgemm/*
Reviewed By: jianyuh
Differential Revision: D15635188
fbshipit-source-id: ad7d0ddba964ff1cb8b2e33f5f98e457a4d2eac9
Summary:
changed `UpsampleBilinearKernel` s.t. the throughput increased by 40~50%.
I tested locally with my own test code -- **not pytorch's provided test code** -- because I am having a build problem (which I made an issue about [here](https://github.com/pytorch/pytorch/issues/19184)). I tested with various tensor sizes, and across all the sizes it showed a significant increase in throughput.
1. added `__restrict__`
2. instead of launching as many threads as there are output elements, I launched only `output_height * output_width` threads and had each thread iterate through the channel and batch dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19306
Differential Revision: D15701840
Pulled By: ezyang
fbshipit-source-id: 53c54d4f4e4a28b58ecc7d7ae6b864cbfc760e27
Summary:
Currently, when the input of MVN is precision matrix, we take inverse to convert the result to covariance matrix. This, however, will easily make the covariance matrix not positive definite, hence will trigger a cholesky error.
For example,
```
import torch
torch.manual_seed(0)
x = torch.randn(10)
P = torch.exp(-(x - x.unsqueeze(-1)) ** 2)
torch.distributions.MultivariateNormal(loc=torch.ones(10), precision_matrix=P)
```
will trigger `RuntimeError: cholesky_cpu: U(8,8) is zero, singular U.`
This PR uses some math tricks ([ref](https://nbviewer.jupyter.org/gist/fehiepsi/5ef8e09e61604f10607380467eb82006#Precision-to-scale_tril)) to only take inverse of a triangular matrix, hence increase the stability.
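A self-contained NumPy sketch of the triangular trick (variable names are mine; the actual implementation uses torch ops inside torch.distributions). Reversing the precision matrix before the Cholesky yields an upper-triangular factor `U` with `P = U @ U.T`, so the scale_tril is the inverse of a triangular matrix rather than of `P` itself:

```python
import numpy as np

def precision_to_scale_tril(P):
    # Cholesky of the reversed precision matrix; un-reversing the factor
    # gives an upper-triangular U with P = U @ U.T, so inverting U.T (a
    # triangular solve) yields a lower-triangular L with L @ L.T == inv(P),
    # without ever inverting P directly.
    Lf = np.linalg.cholesky(P[::-1, ::-1])
    L_inv = Lf[::-1, ::-1].T                     # lower triangular == U.T
    return np.linalg.solve(L_inv, np.eye(P.shape[0]))

# example: a well-conditioned SPD precision matrix
A = np.array([[2.0, 0.5], [0.5, 1.0]])
P = A @ A.T + np.eye(2)
L = precision_to_scale_tril(P)
```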
cc fritzo, neerajprad , SsnL
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21366
Differential Revision: D15696972
Pulled By: ezyang
fbshipit-source-id: cec13f7dfdbd06dee94b8bed8ff0b3e720c7a188
Summary:
This PR addresses the problem described in the comment: https://github.com/pytorch/pytorch/pull/20203#issuecomment-499231276
and fixes the previously coded bad behaviour:
- a warning was raised every time lr scheduling was initialized
Now the code checks that:
- on the second call of `lr_scheduler.step`, ensure that `optimizer.step` has already been called, otherwise raise a warning (as was done in #20203)
- if the optimizer's step is overridden -> raise another warning (once) to make the user aware of the new pattern:
`opt.step()` -> `lrs.step()`, since we cannot check this.
Now tests check that
- at initialization (`lrs = StepLR(...)`) there are no warnings
- if we replace `optimizer.step` by something else (similarly to the [code of nvidia/apex](https://github.com/NVIDIA/apex/blob/master/apex/amp/_process_optimizer.py#L287)) there is another warning raised.
cc ezyang
PS. honestly I would say that there is a lot of overhead introduced for simple warnings. I hope all these checks will be removed in future `1.2.0` or other versions...
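A stripped-down sketch of the check (class names are hypothetical; the real logic lives in torch.optim.lr_scheduler and tracks `optimizer.step` via a wrapper):

```python
import warnings

class _Optimizer:
    def __init__(self):
        self._step_count = 0
    def step(self):
        self._step_count += 1

class _Scheduler:
    def __init__(self, optimizer):
        self.optimizer = optimizer
        self._step_count = 0          # constructing the scheduler never warns
    def step(self):
        self._step_count += 1
        # only from the second scheduler step onward, check that the
        # optimizer has actually stepped at least once
        if self._step_count >= 2 and self.optimizer._step_count == 0:
            warnings.warn("Detected lr_scheduler.step() before optimizer.step(); "
                          "use the pattern opt.step() -> lrs.step()")

opt = _Optimizer()
sched = _Scheduler(opt)   # no warning here, matching the tests above
```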
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21460
Differential Revision: D15701776
Pulled By: ezyang
fbshipit-source-id: eac5712b9146d9d3392a30f6339cd33d90c497c7
Summary:
Fixes#21026.
1. Improve build docs for Windows
2. Change `BUILD_SHARED_LIBS=ON` for Caffe2 local builds
3. Change to out-source builds for LibTorch and Caffe2 (transferred to #21452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21190
Differential Revision: D15695223
Pulled By: ezyang
fbshipit-source-id: 0ad69d7553a40fe627582c8e0dcf655f6f63bfdf
Summary:
Another simple bit of syntax that NumPy supports and we don't.
Support int, float, and bool.
```python
>>> torch.randn((2,3), dtype=float)
tensor([[-0.1752, -0.3240, -0.6148],
[ 0.1861, 1.6472, 0.1687]], dtype=torch.float64)
```
A bit confusingly, Python's "float" actually means double, but nothing we can do about that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21215
Differential Revision: D15697012
Pulled By: umanwizard
fbshipit-source-id: 9a38d960a610b8e67023486b0c9265edd3c22246
Summary:
Adds support for recursively compiling `nn.Sequential` and
`nn.ModuleList`. When either is used, it is converted to a
`jit._ConstModuleList` or `jit._ConstSequential` as necessary. Due to
this, we don't need to add it to `__constants__` since it's made
constant on demand.
This PR also moves the recursive script tests out to their own class
`TestRecursiveScript` (the added test is called `test_iterable_modules`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21306
Pulled By: driazati
Differential Revision: D15611738
fbshipit-source-id: fac52993990bd2dfad71d044c463a58a3759932a
Summary:
Enable bool tensors for these index methods:
- index_select
- index_copy
- put
- take
- index_fill
Tested via unit tests
TODO:
Enable index_add in a separate PR as it requires more "side" changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21435
Differential Revision: D15684964
Pulled By: izdeby
fbshipit-source-id: 48440e4d44873d70c4577e017dd0d8977e0fa15a
Summary:
`torch.tensor([True, False, True], dtype=torch.bool).sum()` should return **2** instead of **True** as it does now.
Tested via unit tests
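The intended semantics mirror plain Python, where summing booleans yields an integer count rather than a bool:

```python
# plain-Python analogue of the fixed behaviour: True counts as 1
values = [True, False, True]
total = sum(values)   # an int, not a bool
```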
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21421
Differential Revision: D15674203
Pulled By: izdeby
fbshipit-source-id: b00e3d0ca809c9b92b750adc05632522dad50c74
Summary:
Fixes#19540
CC nmerrill67
C++ data parallel was using Module.clone() to create module replicas on every destination device. However, clone() does not set up gradient edges to point from replicas to the original module. As a result, the gradient will not be aggregated into the original module. This commit fixes the problem by manually setting gradient edges from every parameter X in every replica to the same parameter X in the original module.
## Failed Attempt
Initially I tried implementing what we did in `replicate.py`, which
1. create module replicas
2. use Python `Broadcast` autograd function to broadcast every parameter in the original module to all destination devices.
3. assign the broadcast result params to module replicas' `_parameters` dict.
This works in Python because derived module member field params (e.g., `Linear.weight`) and base module `_parameters` (e.g., `Linear._parameters['weight']`) are referencing the same parameter instance. Assigning one of them will apply to both. However, in C++, even though I can modify Module's `parameters_ `values and gradient edges to point to the broadcast source, I cannot touch the weight and bias member fields in Linear, because replicate cannot (and should not) add special-case handlers to every different module. (See `Linear` [.h](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/include/torch/nn/modules/linear.h), [.cpp](https://github.com/pytorch/pytorch/blob/master/torch/csrc/api/src/nn/modules/linear.cpp)) Although they initially point to the same `TensorImpl` instance, after assigning to `Module.parameters_['weight']`, it will be different from `Linear.weight`.
## Solution Options
gchanan and I had several discussions on this issue and figured two solutions to this problem.
### Option One [implemented in this PR]
Replicate the module in two steps:
1. call `Module.clone()` to create a module replica on every destination device.
2. manually setting gradient edges from every parameter in every replica to the same parameter in the original module.
* Pro: Does not need to change any existing module, and relatively easier to implement
* Con: It is a little hackish.
### Options Two
Implement a `Replicatable` class (similar to `Cloneable`), and make it a friend class of `Module`. For more details see `Note [Replicating Modules]` in the code change.
* Pro: Maybe this aligns more with our existing approach implemented in `Cloneable`?
* Con: Require changes to every existing module.
I am inclined to go with option one, because `replicate` will only be used on data parallel. I feel it is too big an overkill if we have to change all existing module implementations due to a data parallel requirement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20910
Differential Revision: D15556426
Pulled By: mrshenli
fbshipit-source-id: aa836290ec657b32742e2bea80bd0ac2404ef3b0
Summary:
Fixed an issue where models can not be loaded in a 32-bit environment like Raspbian.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20900
Differential Revision: D15696709
Pulled By: ezyang
fbshipit-source-id: 37a81f05f235d3b9fc6244e12d3320ced3d1465e
Summary:
Current versions of NVRTC incorrectly map error code 7 to the error string "NVRTC unknown error." This update maps error code 7 to the correct string explicitly in PyTorch. See the documentation at: https://docs.nvidia.com/cuda/nvrtc/index.html#group__error.
This may give us a better idea of the source of NVRTC errors that some community members, like Uber, have reported.
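A sketch of the explicit mapping, in Python for illustration (the actual fix is in C++; the enum name for code 7 is taken from the NVRTC documentation linked above and is worth double-checking there):

```python
# nvrtcResult codes per the NVRTC documentation; code 7 is the one that
# needs an explicit entry, since the driver's own string for it is wrong.
NVRTC_ERROR_STRINGS = {
    0: "NVRTC_SUCCESS",
    6: "NVRTC_ERROR_COMPILATION",
    7: "NVRTC_ERROR_BUILTIN_OPERATION_FAILURE",
}

def nvrtc_error_string(code):
    # fall back to the driver's generic string for unmapped codes
    return NVRTC_ERROR_STRINGS.get(code, "NVRTC unknown error")
```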
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21174
Differential Revision: D15696593
Pulled By: ezyang
fbshipit-source-id: f5c7b5876c07b311ab5f2d7c8e375e93273912c6
Summary:
Fixed #21269 by removing the expected `ValueError` when converting a tensor to a NumPy `int8` array in the Numba interoperability test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21458
Differential Revision: D15696363
Pulled By: ezyang
fbshipit-source-id: f4ee9910173aab0b90a757e75c35925b026d1cc4
Summary:
I added default `weight` and `reduction` params to the `binary_cross_entropy_with_logits` function. These defaults already exist in Python and in the C++ `binary_cross_entropy` function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21336
Differential Revision: D15628917
Pulled By: ezyang
fbshipit-source-id: 38e5f53851125238842df1bd71cb6149c8603be1
Summary:
This could serve as an alternative solution for exporting ```torch.gather``` before something similar goes into the ONNX spec. The exported model is verified to be correct against the onnxruntime backend. We weren't able to test against the Caffe2 backend because it doesn't seem to support OneHot in opset 9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21235
Differential Revision: D15613039
Pulled By: houseroad
fbshipit-source-id: 7fc097f85235c071474730233ede7d83074c347f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21440
This diff modifies the output format when ai_pep_format is enabled.
Reviewed By: hl475
Differential Revision: D15681042
fbshipit-source-id: df5f2dbb38d1bd866ca7f74ef4e63459d480be6e
Summary:
We have encountered a `std::bad_cast` error when running a PyTorch binary built with the cxx11 ABI on CentOS7; stack trace:
```
#0 0x00007fec10160207 in raise () from /lib64/libc.so.6
#1 0x00007fec101618f8 in abort () from /lib64/libc.so.6
#2 0x00007fec015767d5 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fec01574746 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fec01574773 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fec01574993 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007fec015c94d2 in std::__throw_bad_cast() () from /lib64/libstdc++.so.6
#7 0x00007feb2ab3c2d7 in std::__cxx11::numpunct<char> const& std::use_facet<std::__cxx11::numpunct<char> >(std::locale const&) ()
from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so
#8 0x00007feb28643d62 in torch::jit::script::strtod_c(char const*, char**) () from /root/.local/lib/python2.7/site-packages/torch/lib/libcaffe2.so
```
We suspect this line gets compiled into a GCC-ABI-dependent symbol:
```
char decimal_point = std::use_facet<std::numpunct<char>>(std::locale()).decimal_point();
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21293
Differential Revision: D15609910
Pulled By: bddppq
fbshipit-source-id: e247059729863868e4b36d6fec4fcbc36fbc4bb1
Summary:
Fixing an incorrect implementation of the CELU activation function. The existing implementation works by a chance combination of errors that seem to cancel each other out. This change makes the code more readable, aligns the parameter names correctly, and is consistent with the cuda implementation.
I came across this issue while working on version counters... I attempted to specify a gradient in derivatives.yaml for CELU due to a failed test, but the derivative couldn't be specified correctly without fixing the celu implementation.
https://github.com/pytorch/pytorch/pull/20612
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21213
Differential Revision: D15678823
Pulled By: nairbv
fbshipit-source-id: 29fa76b173a66c2c44ed2e0b7959e77f95d19c43
Summary:
This PR is a continuation of #15310, which itself is a continuation of #14845, #14941, & #15293. It should be synced up with the pytorch/master branch as of yesterday.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19465
Differential Revision: D15632268
Pulled By: ezyang
fbshipit-source-id: 8e337e8dc17ac31439935ccb530a7caf77f960e6
Summary:
We want to be able to call stft from a torchscript which requires that stft have a type annotation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21302
Differential Revision: D15607973
Pulled By: cpuhrsch
fbshipit-source-id: c4a5c09cdaafe7e81cf487a3ad216d1b03464a21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21392
as discussed at https://github.com/pytorch/pytorch/pull/21244, we
found that some values in log_beta were not properly initialized. This diff will 1)
initialize all log_beta to -inf; 2) fix a tricky compare condition; and 3) zero out all
the gradient elements corresponding to padding.
Offline experiments show that this diff can fix previous seen NaN loss.
Differential Revision: D15637977
fbshipit-source-id: 477008a5e11aae946bd2aa401ab7e0c513421af0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21398
The Module::forward method calls the find_method() function, potentially from multiple threads.
Internally it calls the find_offset() method and reads the dict_ object.
If the corresponding name is not in the dictionary, the thread calls the insert() method and modifies the dict_ object.
While the first thread is modifying the dict_ object, another thread can enter the forward()->find_method()->find_offset() path
and access the dict_ object for reading while it is being modified -> crash.
Moved the mutex protection up to protect both the find_offset() and insert() calls.
Consider using a C++17 shared_mutex locking object instead of a recursive_mutex object.
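A Python analogue of the widened lock scope (names hypothetical; the fix itself is in C++):

```python
import threading

class MethodTable:
    """One lock must cover both the lookup and the insert; locking only the
    insert lets a concurrent reader observe the dict mid-mutation."""
    def __init__(self):
        self._lock = threading.Lock()
        self._methods = {}

    def find_or_insert(self, name, make_method):
        with self._lock:                 # protects lookup + insert together
            if name not in self._methods:
                self._methods[name] = make_method()
            return self._methods[name]
```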
Reviewed By: bddppq
Differential Revision: D15638942
fbshipit-source-id: ca6a453448302a0b3666c87724755fa4e9ce242f
Summary:
Something flaky is going on with `test_inplace_view_saved_output` on Windows.
With my PR #20598 applied, the test fails, even though there is no obvious reason it should be related, so the PR was reverted.
Based on commenting out various parts of my change and re-building, I think the problem is with the name -- renaming everything from `T` to `asdf` seems to make the test stop failing. I can't be sure that this is actually the case though, since I could just be seeing patterns in non-deterministic build output...
I spoke with colesbury offline and we agreed that it is okay to just disable this test on Windows for now and not block landing the main change. He will look into why it is failing.
**Test Plan:** I will wait to make sure the Windows CI suite passes before landing this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21175
Differential Revision: D15566970
Pulled By: umanwizard
fbshipit-source-id: edf223375d41faaab0a3a14dca50841f08030da3
Summary:
Currently tools/build_pytorch_libs.py looks quite convoluted. This commit reorganizes the cmake-related functions into a separate file to make the code clearer.
---
This is hopefully helpful for further contribution for better integration with cmake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21367
Differential Revision: D15636991
Pulled By: soumith
fbshipit-source-id: 44d76e4e77aec0ce33cb32962b6a79a7f82785da
Summary:
This default was incorrect and made printing in python not print file:line:col
This wasn't caught because FileCheck internally uses operator<< to print the graph, which has `true` hardcoded as the value. I've added more comprehensive tests to catch this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21370
Differential Revision: D15631135
Pulled By: jamesr66a
fbshipit-source-id: c809e06fff4f0174eefeb89062024384b4944ef7
Summary:
I found this significantly speeds up incremental builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21334
Differential Revision: D15632994
Pulled By: suo
fbshipit-source-id: bb4af90f4400bffa90d168d82ff30fece5e3835c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21365
This diff adds new operators to benchmark_all_test so all the supported ops can be built as one binary
Reviewed By: hl475
Differential Revision: D15627328
fbshipit-source-id: b7ca550a279f485102a6a6bd47e4032c7beb9940
Summary:
The original PR (#16071) is not working anymore after `caffe2` and `torch` is unified. What's more, It is making the binary big since the optimizing flag is disabled on a very big project(the `torch` library used to be small, but it now applies on the whole `caffe2` and `caffe2_gpu` library). We need to get it reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21335
Differential Revision: D15622163
Pulled By: soumith
fbshipit-source-id: 900bd400106d27a1512eed1e9f2288114f5f41bb
Summary:
This adds a regression test for the bug fix in #21236. Operations
involving CUDA tensors and CPU scalars should not copy the CPU scalar to
the device (because that is slow). They should instead "lift" the scalar
to a kernel parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21253
Reviewed By: bddppq
Differential Revision: D15604080
Pulled By: colesbury
fbshipit-source-id: c14ded5d584499eaa5ea83337ffc50278205f3d6
Summary:
This solves the situation where, for example, someone instantiates LSTM with `dropout=0`, a Python integer. This works fine in Python, but JIT throws a type error because it expected float but got int
Resolves https://github.com/pytorch/lockdown/issues/65
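A simplified sketch of the promotion rule (hypothetical helper; the real change lives in the JIT's schema matching):

```python
def coerce_arg(expected, value):
    # accept a Python int where a float is expected (but not bool, which is
    # a subclass of int), mirroring Python's own implicit conversion
    if expected is float and type(value) is int:
        return float(value)
    if not isinstance(value, expected):
        raise TypeError(
            f"expected {expected.__name__}, got {type(value).__name__}")
    return value

# e.g. LSTM(dropout=0) passes an int where the schema expects float
dropout = coerce_arg(float, 0)
```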
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21304
Differential Revision: D15613153
Pulled By: jamesr66a
fbshipit-source-id: eabff76e3af3de0612583b37dbc5f7eab7e248a4
Summary:
This PR adds support for torch.rand export in the PyTorch ONNX exporter. There are other generator ops that need to be supported for export and they will added in subsequent PRs. This op is needed with priority for a model on our end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20559
Differential Revision: D15379653
Pulled By: houseroad
fbshipit-source-id: d590db04a4cbb256c966f4010a9361ab8eb3ade3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20915
Clean up the unary processor code. Some questions are added in the comments to seek suggestions.
Reviewed By: pjh5
Differential Revision: D15448502
fbshipit-source-id: ef0c45718c1a06187e3fe2e4e59b7f20c641d9c5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21206
This diff changes the default test_name to be a globally unique value across tests. With that, users can list all the tests and choose to run a specific one.
Reviewed By: zheng-xq
Differential Revision: D15543508
fbshipit-source-id: 0814ef6a60d41637fed5245e30c282497cf21bb8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21149
The diff modifies the interface for PyTorch operators in the benchmark suite
Reviewed By: zheng-xq
Differential Revision: D15433897
fbshipit-source-id: e858183431eb37d90313356716c2de8709372b58
Summary:
This doesn't affect anything because we run constant pooling, and in the case of Closures and Forks creates unnecessary closures over constants.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21229
Differential Revision: D15587764
Pulled By: eellison
fbshipit-source-id: d5609b0a5697071fab5050eb9e03876ab9ebb27a
Summary:
~~This is work in progress due to its dependency on multiple pending PRs.~~
- [x] ONNX: Relax constraint on subgraph input/output type & shape check. https://github.com/onnx/onnx/pull/2009
- [x] PyTorch: Add infra to test_pytorch_onnx_caffe2.py to test ScriptModule models. https://github.com/pytorch/pytorch/pull/20256
This PR should partially resolve https://github.com/pytorch/pytorch/issues/17531. However, ideally we shouldn't need to insert cast (and reshape) nodes to help convert the loop condition.
- Added cast node for condition values before entering loop node. The ONNX spec only accepts Bool type, while in PyTorch if the condition value is an output from other node it could potentially have any integral type.
- Tidying up the exported ONNX loop subgraph input type & shape. According to ONNX spec, input "M" is exported as 0-d scalar tensor with type int64. input "Cond" is exported as incomplete tensor of type Bool without shape information. This is because through out the iteration, the rank of condition value is dynamic, either 0-d or 1-d, as long as it holds a single value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20445
Differential Revision: D15534188
Pulled By: houseroad
fbshipit-source-id: d174e778529def05ee666afeee4b8fb27786e320
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21267
Replace AT_ASSERTM with TORCH_CHECK: AT_ASSERTM is deprecated.
Not sure whether ```AT_ASSERT``` is deprecated in favour of some new TORCH assert function.
Reviewed By: zafartahirov
Differential Revision: D15599242
fbshipit-source-id: 23f21a9a23dc3c147dc817e6d278066d0832e08d
Summary:
This PR improves performance of advanced indexing backward, partially solving #15245 (performance is still worse than gather, but not by such outrageous margins). Before, using benchmarking harness from #15245, cuda 10/V100:
```
Indexing is faster by at most -270.61607820767887 us on N: 16 D: 256 K: 1
Indexing is slower by at most 11127.466280784833 us on N: 16 D: 4096 K: 4096
```
after:
```
Indexing is faster by at most 23.524456737696028 us on N: 512 D: 4096 K: 4096
Indexing is slower by at most 186.24056029472553 us on N: 16 D: 1024 K: 4096
```
Strategy is to reuse embedding backward kernel, adapting it to handle unindexed dimensions in the beginning by launching additional threadblocks, and also allowing it to handle slices that are bigger than `65K*128`, that is hardly ever a problem for embedding. Still, integer indexing is baked in the kernel, and is important for performance, so for now bigger than 2G element tensors are not supported.
The main savings come from not having to expand index to all unindexed dimensions, and not sorting expanded index with incoming gradient values, but rather only sorting unexpanded index.
There are ways to make sorting overhead smaller (thanks mcarilli for suggestions) but I'll get to it when it becomes a real problem, or rather, when cuda graphs will force us to get rid of thrust::sort calls.
I've also added tests for indexing backward, before tests for index_put_ and indexing backward were non-existent.
This PR also fixes#20457 by casting indices to `self` backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20557
Differential Revision: D15582434
Pulled By: ezyang
fbshipit-source-id: 91e8f2769580588ec7d18823d99a26f1c0da8e2a
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/21217
This adds support for recording file and line information during tracing, by extracting the top Python interpreter frame
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21247
Reviewed By: suo, driazati
Differential Revision: D15594553
Pulled By: jamesr66a
fbshipit-source-id: 72e1b3a46f1dabe3e83a608ec1a7d083bd1720f9
Summary:
Remove Dropout from the opset 10 blacklist.
ONNX Dropout was modified in opset 10, but only the output "mask" was modified, which is not exported in pytorch opset 9. So we can still fallback on the opset 9 op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20710
Differential Revision: D15571248
Pulled By: houseroad
fbshipit-source-id: 15267eb63308a29a435261034b2f07324db1dea6
Summary:
We're not getting much from checking the export strings, and they are noisy and slow down development. Didn't realize they existed until now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21238
Differential Revision: D15604256
Pulled By: eellison
fbshipit-source-id: 488e9401231228cffe132dab99d519563fa63afc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21100
Added multifile flag to write scalar data into separate files. This can slow down dashboard loading.
Reviewed By: orionr
Differential Revision: D15548913
fbshipit-source-id: dd39a7f76f93025d28f14babbf933e39860e6910
Summary:
Loops.h contains specializations for cases where all the inputs are
contiguous as well as cases where one input is a scalar and all other
inputs are contiguous.
Previously, there were separate checks for each function that takes
zero, one, or two input arguments. This is getting unwieldy, especially
once we add support for functions that take three inputs (#21025).
This requires the use of recursive templates (which have their own
downsides), but this seems better than the alternative.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21106
Differential Revision: D15562430
Pulled By: colesbury
fbshipit-source-id: 5f19ab2212e16e29552887f4585c2b4a70309772
Summary:
Instead of attempting to hardcode calls to "ninja" or "make", we should always let cmake do it. This better integrates build configurations (DEBUG or REL_WITH_DEB_INFO) and better handles the case in which the native build tool is not in PATH (cmake has some capacity to find them and has options for users to specify their locations).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21105
Differential Revision: D15602883
Pulled By: soumith
fbshipit-source-id: 32ac46d438af00e791defde6ae5ac21c437d0bb0
Summary:
Retry #21197
The previous one failed because it uses some Python3 only syntax.
ezyang Do we still have multi-GPU py2 tests? I am curious why the CI tests did not catch this error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21262
Differential Revision: D15598941
Pulled By: mrshenli
fbshipit-source-id: 95f416589448c443685d6d236d205b011998a715
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20883
Add autograd for layer_norm on CPU. After this diff, both PyTorch and JIT models can automatically benefit from the performance improvement of nn.functional.layer_norm
Reviewed By: zheng-xq
Differential Revision: D15483790
fbshipit-source-id: 94ed3b16ab6d83ca6c254dbcfb224ff7d88837f3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20665
Add gelu activation forward on CPU in pytorch
Compared to the current Python-implemented version of gelu in the BERT model:
    def gelu(self, x):
        return x * 0.5 * (1.0 + torch.erf(x / self.sqrt_two))
The torch.nn.functional.gelu function can reduce the forward time from 333ms to 109ms (with MKL) / 112ms (without MKL) for input size = [64, 128, 56, 56] on a devvm.
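For reference, the Python-side computation above can be written as a self-contained scalar function using `math.erf` (a sketch; `self.sqrt_two` in the snippet is assumed to be `sqrt(2)`):

```python
import math

def gelu(x: float) -> float:
    # GELU via the error function: x * 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

The new `torch.nn.functional.gelu` computes the same thing elementwise on tensors, but in fused C++ code rather than as a chain of Python-level tensor ops.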
Reviewed By: zheng-xq
Differential Revision: D15400974
fbshipit-source-id: f606b43d1dd64e3c42a12c4991411d47551a8121
Summary:
cc ezyang this is meant to fix the fuser failures on master
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21252
Differential Revision: D15594283
Pulled By: jamesr66a
fbshipit-source-id: 85f37e78b2de051c92ade3fe4c44c7530b4542e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21233
It is possible that OnnxifiOp is created in a thread where weights have been cleaned from the workspace, which is a legit use case, as we can create the backend once and lower all the weights. So we need to extract the weight shape info the first time we create the backend and save it.
Reviewed By: bertmaher, rdzhabarov
Differential Revision: D15587237
fbshipit-source-id: 1f264dc32c0398c42b618e9c41c119eb13e1c9f1
Summary:
Fixes#21108
When grad is disabled, Python autograd function outputs are [wrapped as detached aliases](8cde4c4d22/torch/csrc/autograd/python_function.cpp (L395-L399)), which prevents calling `Tensor.set_()` on them after recent changes in Tensors and Variables. This will hit a problem when users would like to call `rnn.flatten_parameters()` in the forward pass, as the function [calls `set_()`](9d09f5df6c/aten/src/ATen/native/cudnn/RNN.cpp (L669)).
The proposed solution is to avoid using an autograd Broadcast if in no_grad mode.
apsdehal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21197
Differential Revision: D15577342
Pulled By: mrshenli
fbshipit-source-id: 1a024c572171a3f2daca9454fd3ee6450d112f7c
Summary:
I think there was a typo in #20690 here https://github.com/pytorch/pytorch/pull/20690/files#diff-b47a50873394e38a005b4c1acd151957R130.
Original conditional was ` common_backend == Backend::CUDA && op.tensor.type().backend() == Backend::CPU)`, now it is `op.device.is_cuda() && op.tensor.device().is_cpu()`. It seems that `op.device` and `op.tensor.device()` should be the same, so this conditional is never true. This leads to spurious h2d copies for operations between cuda tensors and cpu scalars, because cpu scalars are now sent to gpu, instead of being passed to lambdas directly.
Unfortunately, I don't know how to test this change, because functionally everything was fine after #20690, it was just a performance regression.
cc colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21236
Differential Revision: D15592754
Pulled By: soumith
fbshipit-source-id: 105bfecc61c222cfdb7294a03c9ecae3cc7f5817
Summary:
`Tensor.is_cuda` and `Tensor.is_leaf` are not predicate functions but `bool` attributes. This patch fixes the type hints in `torch/__init__.pyi` for those attributes.
```diff
- def is_cuda(self) -> bool: ...
+ is_cuda: bool
- def is_leaf(self) -> bool: ...
+ is_leaf: bool
```
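A minimal illustration of why the stub change matters, using a pure-Python stand-in (not the real `torch.Tensor`): as an attribute, `is_cuda` evaluates directly to a `bool`, whereas the old stub typed it as a method that had to be called:

```python
class FakeTensor:
    # mirrors the corrected stub: `is_cuda: bool` is a plain attribute
    def __init__(self, on_cuda: bool = False) -> None:
        self.is_cuda = on_cuda

t = FakeTensor()
assert isinstance(t.is_cuda, bool)  # attribute access, no call needed
assert t.is_cuda is False
```

With the old stub, type checkers would flag `if t.is_cuda:` as always-truthy, since a bound method is never falsy.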
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21192
Differential Revision: D15592766
Pulled By: soumith
fbshipit-source-id: 8c4ecd6939df8b8a8a19e1c9db6d40193bca7e4a
Summary:
This makes file-line reporting also work for things loaded using `torch.jit.load()` as well as the string frontend (via `CompilationUnit`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21217
Differential Revision: D15590838
Pulled By: jamesr66a
fbshipit-source-id: 6b6a12574bf9eca0b83f24f0b50535fda5863243
Summary:
Studied why sparse tensor coalesce was slow: issue #10757.
Using nvprof and a simple benchmark, I determined the bulk of the time was spent in ``kernelTransformReduceInnermostDimIndex``, which is called when a sparse tensor is constructed with sparse_coo_tensor and sanity-checks the minimum and maximum indices. However, we do not need this sanity check here, because after coalescing the tensor these min/max values won't change.
On my benchmark with 1 million non-zeros, the runtime of coalesce dropped from 0.52 s to 0.005 s, roughly a 100x speedup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21214
Reviewed By: bddppq
Differential Revision: D15584338
Pulled By: akyrola
fbshipit-source-id: a08378baa018dbd0b45d7aba661fc9aefd3791e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21163
These two backend transformation share some common traits. Therefore we want to reuse the data struct/code as much as possible.
Reviewed By: hlu1
Differential Revision: D15561177
fbshipit-source-id: 35f5d63b2b5b3657f4ba099634fd27c3af545f1b
Summary:
Some of the functions are only used in this file - mark them `static`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21140
Differential Revision: D15578076
Pulled By: Krovatkin
fbshipit-source-id: 71ae67baabebd40c38ecb9292b5b8202ad2b9fc1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21152
Migrate existing add benchmark to use the new op front-end
Reviewed By: zheng-xq
Differential Revision: D15325524
fbshipit-source-id: 34e969e1bd289913d881c476711bce9f8ac18a29
Summary:
I will do loops in a follow-up after some other changes I am working on have landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20911
Differential Revision: D15497205
Pulled By: eellison
fbshipit-source-id: 8cac197c6a6045b27b552cbb39e6fc86ca747b18
Summary:
Following on #19747, this implements most of the `torch.jit.script()` changes laid out in #20939.
Still to do:
* Accessing a method from Python does not add it as a `ScriptMethod` (so only `export`ed methods and `forward` are compiled)
* Calling a method other than `forward` on a submodule doesn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20708
Pulled By: driazati
Differential Revision: D15560490
fbshipit-source-id: cc7ef3a1c2772eff9beba5f3e66546d2b7d7198a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21085
Now that torch::jit::RegisterOperators() always passes through to torch::RegisterOperators() (see diffs stacked below this), we can remove the old custom op implementation.
Reviewed By: dzhulgakov
Differential Revision: D15542261
fbshipit-source-id: ef437e6c71950e58fdd237d6abd035826753c2e4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21084
- Now AliasAnalysisKind can be set using the torch::RegisterOperators() API
- This also allows us to remove the last place in torch::jit::RegisterOperators that didn't use c10 yet.
Reviewed By: dzhulgakov
Differential Revision: D15542097
fbshipit-source-id: ea127ecf051a5c1e567e035692deed44e04faa9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21181
Implement c10::OperatorOptions as a class to store metadata about operators.
This is meant to replace torch::jit::OperatorOptions.
Reviewed By: dzhulgakov
Differential Revision: D15569897
fbshipit-source-id: 95bf0bf917c1ef2bdf32702405844e1a116d9a64
Summary:
This reduces DenseNet load time by about 25% (down to 5.3s on my laptop) and gets AliasAnalysis out of the profile top hits entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21203
Differential Revision: D15578155
fbshipit-source-id: ddbb1ad25c9540b5214702830084aa51cc6fd3cb
Summary:
Adds persistent cuda kernels that speed up SoftMax applied over the fast dimension, i.e. torch.nn.Softmax(dim=-1) and torch.nn.LogSoftmax(dim=-1). When the size is <= 1024, this code is 2-10x faster than the current code, speedup is higher for smaller sizes. This code works for half, float and double tensors with 1024 or fewer elements in the fast dimension. Numerical accuracy is on par with the current code, i.e. relative error is ~1e-8 for float tensors and ~1e-17 for double tensors. Relative error was computed against the CPU code.
The attached image shows kernel time in us for torch.nn.Softmax(dim=-1) applied to a half precision tensor of shape [16384,n], n is plotted along the horizontal axis. Similar uplifts can be seen for the backward pass and for LogSoftmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20827
Differential Revision: D15582509
Pulled By: ezyang
fbshipit-source-id: 65805db37487cebbc4ceefb1a1bd486d24745f80
Summary:
This is a follow up on Jame's PR: https://github.com/pytorch/pytorch/pull/19041. The idea is to replace the legacy `sinh` / `cosh` ops that are being dispatched to TH with the operations defined in `Vec256` for better performance.
benchmark(from Jame's script):
```python
import torch, time

ops = ['sinh', 'cosh']
x = torch.rand(1024, 1024)
NITER = 10000

print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
    s = time.time()
    for i in range(NITER):
        getattr(x, op)()
    elapsed_sec = ((time.time() - s) / NITER)
    print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```
code on master:
```
op time per iter (ms) gops/s GB/s
sinh 3.37614369392395 0.3105839369002935 2.484671495202348
cosh 3.480502033233643 0.3012714803748572 2.4101718429988574
```
after change (on Macbook pro 2018):
```
op time per iter (ms) gops/s GB/s
sinh 0.8956503868103027 1.1707425301677301 9.365940241341841
cosh 0.9392147302627564 1.1164390487217428 8.931512389773943
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21115
Reviewed By: ljk53
Differential Revision: D15574580
Pulled By: xta0
fbshipit-source-id: 392546a0df11ed4f0945f2bc84bf5dea2750b60e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21196
we'll add `quantize(quantizer)` as a tensor method later when we expose `quantizer` in Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15577123
fbshipit-source-id: d0abeea488418fa9ab212f84b0b97ee237124240
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21169
We should minimize dependency from perfkernels (we were including eigen header files only in cc files not compiled with avx or avx2 options but better to be very strict because it's easy to introduce illegal instruction errors in perfkernels)
Reviewed By: salexspb
Differential Revision: D15563839
fbshipit-source-id: d4b1bca22d7f2e6f20f23664d4b99498e5984586
Summary:
Most important fix: Correct "tensor.rst" to "tensors.rst"
Secondary fix: some minor English spelling/grammar fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21029
Differential Revision: D15523230
Pulled By: umanwizard
fbshipit-source-id: 6052d8609c86efa41a4289cd3a099b2f1037c810
Summary:
Dynamically creating a type at runtime was messing up the MRO and has been causing many other problems. I think it's best to delete it. This causes a regression, since
```python
self.linear = nn.Linear(10, 10)
isinstance(self.linear, nn.Linear)
```
will now be `False` again, but this will be fixed once recursive script mode is the default (#20939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21107
Pulled By: driazati
Differential Revision: D15560549
fbshipit-source-id: 7bd6b958acb4f353d427d66196bb4ee577ecb1a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21148
The diff modifies the interface for Caffe2 operators in the benchmark suite
Reviewed By: zheng-xq
Differential Revision: D15433888
fbshipit-source-id: c264a95906422d7a26c10b1f9836ba8b35e36b53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21147
This diff introduces a new interface to add PT/C2 operators to the benchmark suite.
The following steps are needed to add a new operator:
1. Specify the input shapes, args to an operator in configs
2. Create a PT/C2 benchmark class which includes ```init``` (create tensors), ```forward``` (specify the operator to be tested.), and ```backward```(gradient of an op.) methods
3. call generate_pt_test/generate_c2_test to create test cases based on configs
Reviewed By: zheng-xq
Differential Revision: D15250380
fbshipit-source-id: 1025a7cf60d2427baa0f3f716455946d3d3e6a27
Summary:
This should pass once https://github.com/pytorch/vision/pull/971 is merged.
To remove torchvision as a baseline, we just compare to the sum of all param.sum() in the pretrained resnet18 model, which means we need to manually update the number only when the pretrained weights change, which is generally rare.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21132
Differential Revision: D15563078
Pulled By: ailzhang
fbshipit-source-id: f28c6874149a1e6bd9894402f6847fd18f38b2b7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21164
Write a List type to be used in operator kernels. This abstracts away from the concrete list type used (e.g. std::vector vs SmallVector)
and allows us to change these implementation details without breaking the kernel API.
Also, this class allows for handling List<bool>, which would not work with ArrayRef because vector<bool> is a bitset and can't be converted to ArrayRef<bool>.
Reviewed By: ezyang
Differential Revision: D15476434
fbshipit-source-id: 5855ae36b45b70437f996c81580f34a4c91ed18c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21156
we'll add `quantize(quantizer)` as a tensor method later when we expose `quantizer` in Python frontend
Python
```
torch.quantize_linear(t, ...)
```
C++
```
at::quantize_linear(t, ...)
```
Differential Revision: D15558784
fbshipit-source-id: 0b194750c423f51ad1ad5e9387a12b4d58d969a9
Summary:
In the previous implementation of triu / tril, we passed the batch size in the 2nd dimension of a grid. This is limited to 65535, which means that performing triu / tril on a tensor with batch size > 65535 will throw an error. This PR removes the dependence on the 2nd dimension, and corresponding non-contiguity constraints.
Changelog:
- Compute offset, row and col in the kernel
- Use 1st dimension of grid alone
- Remove unnecessary contiguity checks on tensors as a result of this change
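The index arithmetic described in the changelog can be sketched in pure Python (an illustration of the mapping, not the actual CUDA kernel):

```python
def decompose_index(linear_index: int, rows: int, cols: int):
    """Recover (batch, row, col) from a single flattened index, so the
    batch no longer needs its own (65535-limited) grid dimension."""
    per_matrix = rows * cols
    batch, offset = divmod(linear_index, per_matrix)
    row, col = divmod(offset, cols)
    return batch, row, col
```

Each CUDA thread can then decide whether its `(row, col)` lies above or below the diagonal, regardless of how many matrices are in the batch.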
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21067
Differential Revision: D15572501
Pulled By: ezyang
fbshipit-source-id: 93851cb661918ce794d43eeb12c8a38762e1358c
Summary:
Resolves https://github.com/pytorch/lockdown/issues/51
This adds support for converting simple f-string literals to calls to `string.format()`. It does not support conversion specifiers or format specs.
This also does not support the string parser frontend, since that implementation would be more involved and likely would require modifying our TorchScript AST
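Conceptually, the conversion turns an f-string literal into an equivalent `str.format` call; a hypothetical illustration in plain Python (not the compiler's actual output):

```python
name, count = "jit", 3

# what the user writes
literal = f"{name}: {count}"
# what the simple cases lower to
lowered = "{}: {}".format(name, count)

assert literal == lowered == "jit: 3"
```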
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21037
Reviewed By: zdevito
Differential Revision: D15541183
Pulled By: jamesr66a
fbshipit-source-id: ae9df85e73f646d7219c1349f5b7683becbcef20
Summary:
# Overall Improvements
1. Switched from using `unordered_set` to sparse bitset.
1. Prevent some excessive memory allocations (thanks to resistor)
1. Take advantage of the sparse bitset operations
1. Switch to `flat_hash_map` instead of `unordered_map` in some places.
# Benchmarks (somewhat approximate, best of a couple runs)
1. InceptionNet (load + one forward pass): 19.8 s -> 13.3 s
1. GoogleNet (load + one forward pass): 10.0 s -> 7.24 s
1. DenseNet (only load): 7.3 s -> 5.3 s
I use the `sparse bitset` taken from https://llvm.org/doxygen/SparseBitVector_8h_source.html. I had to make some modifications to use `__builtin_popcountl` and instructions like that instead of other transitive clang dependencies.
## Some notes on our graph topologies
In general, our graphs are very sparse, and most of the components aren't connected. For GoogleNet, we have 200k nodes, we do 2k `mayAlias` queries, and the sum of magnitudes of sets at each node is 500k (ie: every node, on average, reaches 2.5 leaves).
PS: Holy crap macbooks throttle an insane amount with the default fan settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20899
Differential Revision: D15564612
Pulled By: Chillee
fbshipit-source-id: 2a293a21a9be25f942ca888c8f225cab32bbfcd0
Summary:
Now you can run `python test/run_tests --jit` to run all jit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21161
Differential Revision: D15563912
Pulled By: eellison
fbshipit-source-id: 4bb0285cda4168b72a3dc4bba471485566a59873
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21051
In net transforms, we perform an SSARewrite where we update the 'net_pos' for all the ops in the net. The transform function also takes an unordered set of net positions for blacklisting. It's possible that SSARewrite will change the indices of the ops, so the blacklist is applied to the wrong ops. We fix this issue by having SSARewrite only assign a new net_pos if the op doesn't already have one.
Reviewed By: yinghai
Differential Revision: D15532795
fbshipit-source-id: e020492a7b5196a91cdc39d0eda761b1ca612cdb
Summary:
These do not work. We'll save time and CPU until someone has the time to fix them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21153
Differential Revision: D15558601
Pulled By: pjh5
fbshipit-source-id: f9bfe580aa7962a88506f9af0032647f553637a4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21027
Previously, we were only able to adjust batch size when the output shape had batch size conditioned on its first dim. Although not common, there are cases where we want to slice back an output whose batch size is conditioned on a non-first dim, or whose output shape doesn't really have batch size in it but rather is an expression of it. Examples are shapes at the output of `Transpose` or `Tile`. This diff redesigns how we handle the output size. The key is that when we run OnnxifiOp, the input shapes are given, and we can actually do a shape inference to derive the real output shapes, no matter how they got transformed. We then compare the real output shape with the max-batch-sized output shape, dim by dim, and use a `Slice` op to cut the max output back to the real output shape.
Notice that the general `Slice` op is slow, and in most cases we still prefer adjusting batch size by shrinking its first dim, which is just an operation on meta info without data allocation/manipulation. Therefore, we add a flag `fast_path` to detect this situation and operate accordingly.
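The general (non-fast-path) shrink amounts to slicing each dim of the max-batch output back to the inferred real shape. A NumPy sketch with a hypothetical helper name (assuming NumPy is available; the real code operates on Caffe2 tensors):

```python
import numpy as np

def shrink_output(out_max: np.ndarray, real_shape: tuple) -> np.ndarray:
    # Slice dim by dim; the fast path instead just shrinks the first dim,
    # which touches only metadata.
    return out_max[tuple(slice(0, d) for d in real_shape)]
```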
Reviewed By: tracelogfb
Differential Revision: D15515189
fbshipit-source-id: 9c1fff161f82d0bc20eeac07ca4a2756e964e9fd
Summary:
Resolves https://github.com/pytorch/lockdown/issues/29
Examples:
```
import torch

@torch.jit.script
def foobar(x):
    return torch.blargh(xyz)
==
RuntimeError:
object has no attribute blargh:
at compile.py:5:12
@torch.jit.script
def foo(x):
    return torch.blargh(x)
           ~~~~~~~~~~~~ <--- HERE
```
It also gets the correct column number in the case where the original source file has common leading whitespace in front of the callable:
```
import torch

with torch.no_grad():
    @torch.jit.script
    def foo(x):
        return torch.blargh(x)
==
RuntimeError:
object has no attribute blargh:
at compile_leading.py:6:24
    @torch.jit.script
    def foo(x):
        return torch.blargh(x)
               ~~~~~~~~~~~~ <--- HERE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20898
Differential Revision: D15552424
Pulled By: jamesr66a
fbshipit-source-id: 78d0f0de03f7ccbf3e7ea193a1b4eced57ea5d69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20874
A criterion for what should go in a Tensor method is whether NumPy has it; NumPy does not have this one,
so we are removing it as a Tensor method. We can still call it as a function.
Python
```
torch.quantize_linear(t, ...), torch.dequantize(t)
```
C++
```
at::quantize_linear(t, ...), at::dequantize(t)
```
Reviewed By: dzhulgakov
Differential Revision: D15477933
fbshipit-source-id: c8aa81f681e02f038d72e44f0c700632f1af8437
Summary:
Following on #19747, this implements most of the `torch.jit.script()` changes laid out in #20939.
Still to do:
* Accessing a method from Python does not add it as a `ScriptMethod` (so only `export`ed methods and `forward` are compiled)
* Calling a method other than `forward` on a submodule doesn't work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20708
Pulled By: driazati
Differential Revision: D15546045
fbshipit-source-id: c2c8fe179088ffbdad47198e799a456560655b86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20869
Adding support for the functions listed in the title, by implementing the copy kernel.
Differential Revision: D15474060
fbshipit-source-id: 9264df6e442cca1cc5d952e3e5dcc9f4a426f317
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20876
Tell the compiler that assertions are likely to succeed.
This allows the compiler to generate better code and optimize for the success case.
Differential Revision: D15480066
fbshipit-source-id: 4485154d66b2ee0ef8a401718712dbd61d811aee
Summary:
Thanks Jonas1312 for validating this workaround.
Fixes#20635.
However, I don't know exactly why this one is needed.
The following are my guesses:
1. It is a CUDA bug. Static linking against `cudart` is the default now, so they didn't run enough tests for dynamic ones.
2. It is related to the UCRT. But (1) according to MSDN, shared DLLs should share the same CRT. (2) The CUDA-related objects like `CUDevice` passed to `cudart` are stored on the stack, not the heap. (3) If this were the case, it should always fail, not sometimes. https://docs.microsoft.com/en-us/cpp/c-runtime-library/potential-errors-passing-crt-objects-across-dll-boundaries?view=vs-2019
3. It is a bug of our side. However, I was unable to find it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21062
Differential Revision: D15543557
Pulled By: ezyang
fbshipit-source-id: c23af45ebf582fad93ce5f029af6e1f06cf1d49d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20887
Switch AT_xxx assertion macros to the TORCH_ variants and make sure the separation between TORCH_CHECK and TORCH_INTERNAL_ASSERT makes sense.
Differential Revision: D15484658
fbshipit-source-id: 490ae64cc36946756c30971f1b685048bc5f77da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20940
- `torch.nn._intrinsic` will contain normal(unquantized) fused modules like Conv2DRelu, Conv2DBnRelu, FakeQuantize ops etc.
- `torch.nn._intrinsic.quantized` will contain fused and quantized modules like Quantized Conv2DRelu, Quantized LinearRelu etc.
Right now I only added FakeQuantize op in `torch.nn._intrinsic` namespace, we'll have more later
Differential Revision: D15505228
fbshipit-source-id: d380929e38af7a5bcfbea27474d5b80f95d43b03
Summary:
A bunch of modules were missing entries for `__constants__` which was making their `__repr__`s not work. Others had `__constants__` that were not necessary since it was provided by some parent class instead.
Fixes#20978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21071
Pulled By: driazati
Differential Revision: D15539518
fbshipit-source-id: 24bdd1ef41ef636eefd5d2bad4ab2d79646ed4f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17946
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14430749
fbshipit-source-id: 2b0037a9ed227a22aa7376a90e6d3d09d3e04707
Summary:
Fixes#18440
I calculate a derived index from `start,stop,step` as `start + step*index`. When `start=0` and `step=1` (the defaults/`range(n)`), this is the same behavior as before.
Unfortunately, it seems that we do not optimize out operations like `x*1` or `x+0`. That means that we're doing lots of redundant operations when we don't need to. EDIT: More specifically, it seems like we only do this optimization for (tensor, scalar): https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/passes/peephole.cpp#L128
The most annoying part of this code is calculating the number of iterations, given `start, stop, step`. I ended up going with the formula `(abs(stop-start) + abs(step)-1)//abs(step)`. Other intuitively appealing formulas like `(stop-start + step -1)//step` don't work for negative numbers.
I tried using `SymbolicVariable` for the calculations, but it seems that `symbolicvariable` only outputs ops for `tensors`, not the integers we have.
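The iteration-count formula from the summary can be checked directly against `len(range(...))`; a sketch with an explicit guard for empty ranges, which the bare formula assumes away:

```python
def trip_count(start: int, stop: int, step: int) -> int:
    """Number of iterations of range(start, stop, step), via the formula
    (abs(stop-start) + abs(step)-1) // abs(step) from the summary."""
    assert step != 0
    if (stop - start) * step <= 0:
        return 0  # step points away from stop (or start == stop): empty range
    return (abs(stop - start) + abs(step) - 1) // abs(step)

# agrees with Python's own range semantics, including negative steps
for args in [(0, 10, 1), (0, 10, 3), (10, 0, -3), (7, 7, 2), (-5, 5, 2)]:
    assert trip_count(*args) == len(range(*args))
```

The derived loop index is then `start + step * i` for `i` in `[0, trip_count)`, which collapses back to plain `i` when `start=0, step=1`.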
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20795
Differential Revision: D15446869
Pulled By: Chillee
fbshipit-source-id: 6085545ace04e25985c6ac870226f7a651f670d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21035
Fix the dtype error in `dequantize_linear`, it should accept the same dtype argument as `quantize_linear`
Differential Revision: D15521931
fbshipit-source-id: 0114c046a3f1046e42fca49c74c85e487fee8616
Summary:
This PR adds a check that prints a warning if a type annotation prefix isn't what mypy expects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20884
Differential Revision: D15511043
Pulled By: Krovatkin
fbshipit-source-id: 9038e074807832931faaa5f4e69628f94f51fd72
Summary:
I accidentally added a TF dependency in #20413 by using `from tensorboard.plugins.mesh.summary import _get_json_config`.
I'm removing it at the cost of some code duplication.
orionr, Please review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21066
Reviewed By: natalialunova
Differential Revision: D15538746
Pulled By: orionr
fbshipit-source-id: 8a822719a4a9f5d67f1badb474e3a73cefce507f
Summary:
In larger system environments, there's usually a need to store some information about how the model was created (e.g. from which process, workflow, by which user, etc). It's almost like the JPEG metadata written by a camera.
This PR adds a low-level C++ hook to allow population of additional files in the zip container based on the environment. The reason to have it as a low-level hook instead of a top-level API wrapper (e.g. `m.save_with_metadata`) is to capture all usages of the saving API transparently for the user.
Let me know if there are concerns.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20863
Differential Revision: D15487941
Pulled By: dzhulgakov
fbshipit-source-id: 120c5a4c9758aa82846bb51a1207f923e3da1333
Summary:
This doesn't have `strace` yet, but it still has `faulthandler` to print stack traces when hanging. Also part of an attempt to isolate changes from #19228.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20166
Differential Revision: D15536504
Pulled By: ezyang
fbshipit-source-id: fe6e6e2e9899f30d8167436d7bc62b42883a3356
Summary:
Previously, this didn't work when 2d target tensors had extra columns at the end. Now we just ignore those.
Also fix the confusion in the doc example regarding the number of classes.
Thank you, ypw-rich for the report with reproducing example.
Fixes: #20522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20971
Differential Revision: D15535481
Pulled By: ezyang
fbshipit-source-id: 397e44e20165fc4fa2547bee9390d4c0b688df93
Summary:
https://github.com/pytorch/pytorch/pull/17783 made ninja and makefile builds print out build commands unconditionally, which has made the build log very verbose, e.g. the ROCm CI build log grows beyond 13 MB. A large build log makes searching for the real error hard.
https://github.com/pytorch/pytorch/pull/20508 has reverted the ninja change, and this one reverts the makefile change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21053
Differential Revision: D15533412
Pulled By: bddppq
fbshipit-source-id: ad89b617d06acc670d75d4cf25111a4081e9c95e
Summary:
I've reported inconsistency between `checkpoint_sequential` and `nn.Sequential` at https://github.com/pytorch/pytorch/issues/19260. Both should provide the same input signature but they don't. I think the consistency is important and I agree with apaszke that `nn.Sequential`'s semantics should be kept instead of `checkpoint_sequential`.
I hope `checkpoint_sequential` will raise a `TypeError` on variadic arguments starting with PyTorch 1.2.0. But for now, it's okay just to warn with a `DeprecationWarning`. I've talked about this approach with soumith.
Please review this pull request. Any comment will be my pleasure.
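A sketch of the warning path being proposed, with a hypothetical simplified signature (not the real `torch.utils.checkpoint` code, which also does the actual checkpointing):

```python
import warnings

def checkpoint_sequential_like(functions, segments, *inputs):
    # nn.Sequential.forward takes a single input, so variadic
    # inputs are now deprecated rather than silently accepted
    if len(inputs) > 1:
        warnings.warn(
            "Passing multiple inputs is deprecated; pass a single input "
            "like nn.Sequential does.",
            DeprecationWarning,
        )
    x = inputs[0] if len(inputs) == 1 else inputs
    for fn in functions:
        x = fn(x)
    return x
```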
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21006
Differential Revision: D15530801
Pulled By: soumith
fbshipit-source-id: 0ceb2cc6a17dcc547d0d00ebaf9df8603be53183
Summary:
gradcheck currently includes a determinism check (although it only tries twice and sees if the results match).
This can lead to flaky tests, e.g. in #20971, but also #13818.
This adds nondet_tol for both gradcheck and gradgradcheck. It does not change / reenable any tests yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20980
Differential Revision: D15530129
Pulled By: soumith
fbshipit-source-id: 04d7f85b5b59cd62867820c74b064ba14f4fa7f8
Summary:
Fixes a typo in the CyclicLR docs by adding the lr_scheduler directory and puts in other required arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21021
Differential Revision: D15530109
Pulled By: soumith
fbshipit-source-id: 98781bdab8d82465257229e50fa3bd0015da1286
Summary:
Just an annoying warning that's been popping up a lot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20964
Differential Revision: D15531064
Pulled By: Chillee
fbshipit-source-id: 9580115676c5e246481054bbfc749a551a3cca5e
Summary:
This PR covers two important points with respect to the QR decomposition:
- batching of input matrices (#7500)
- adding `some` as an option in `torch.qr` akin to NumPy's `mode` option (#10538)
Changelog:
- Enable batching for inputs to `torch.qr`
- Move QR decomposition implementation to ATen (CPU and CUDA)
- Remove existing implementations in TH/THC
- Add a `some` option to `torch.qr` that will enable users to switch between complete and reduced decomposition
- Modify doc strings
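NumPy's analogous `mode` option illustrates the two shapes involved (assuming NumPy is available; `some=True` corresponds to NumPy's reduced mode):

```python
import numpy as np

a = np.random.randn(5, 3)
q_red, r_red = np.linalg.qr(a, mode="reduced")    # like torch.qr(a, some=True)
q_com, r_com = np.linalg.qr(a, mode="complete")   # like torch.qr(a, some=False)

assert q_red.shape == (5, 3) and r_red.shape == (3, 3)
assert q_com.shape == (5, 5) and r_com.shape == (5, 3)
assert np.allclose(q_red @ r_red, a)  # both modes reconstruct the input
```

The reduced form is cheaper and sufficient for most least-squares uses; the complete form gives a full orthonormal basis of the column space and its complement.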
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20689
Differential Revision: D15529230
Pulled By: soumith
fbshipit-source-id: 16af82b1d2db8a3a758fa8a5f798d83f5f950efb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20603
When we use intra_op_parallel operators, Caffe2 tracing was generating a trace only for the master task, giving a false impression that a lot of threads are underutilized.
This diff also traces child tasks.
Reviewed By: ilia-cher
Differential Revision: D14820008
fbshipit-source-id: ff4ed203804d86d9231c21c99d869f1ddf1d1ef9
Summary:
Add an option to setup.py to stop the build process once cmake terminates. This gives users a chance to fine-tune build options. Also update the README accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21034
Differential Revision: D15530096
Pulled By: soumith
fbshipit-source-id: 71ac6ff8483c3ee77c38d88f0d059db53a7d3901
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20647
The initial assumption was that `qint8` would be unsigned. After introduction of `quint8` and `qint8`, some tests break.
Reviewed By: jerryzh168
Differential Revision: D15332106
fbshipit-source-id: 6ed18da428915aea918a363c5f38754a3c75d06b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20493
This helps distinguish if the op was a quantized op or not.
Reviewed By: salexspb
Differential Revision: D15337854
fbshipit-source-id: 43c7aef143085cfaeb4ec2102a7f36cc454e0e94
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20173
Enabled op profiling even when net type is not dag or prof dag. Also added
engine type info to summary.
Reviewed By: salexspb, ilia-cher
Differential Revision: D15177813
fbshipit-source-id: 5be0efeaabc9a961cf1d73b0703749c08bb1adbb
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19587 [jit] Make ScriptModule.training an attribute instead of a parameter**
Remove the hack we had previously where `training` was a buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19587
Differential Revision: D15502768
Pulled By: driazati
fbshipit-source-id: 3022f2d57ec6849868f9225d9bc2bfb7828cb318
Summary:
Before we look into supporting `deepcopy` we could at least improve an error msg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20885
Differential Revision: D15511023
Pulled By: Krovatkin
fbshipit-source-id: 93b8730a2cc663eee0147f14d3341d0606748eaf
Summary:
This is #20919 without the changes to aten/src/THC/THCIntegerDivider.cuh
that broke the ROCm build.
cc bddppq
Original summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21019
Differential Revision: D15518477
Pulled By: colesbury
fbshipit-source-id: 4db5626fda76eb58250793e8aa7d4f2832db3a34
Summary:
Fixes#20495 .
Now for
```python
class A(torch.jit.ScriptModule):
    def __init__(self):
        super(A, self).__init__()

    @torch.jit.script_method
    def forward(self, x):
        return x + self.whatisgoingon


class B(A):
    def __init__(self):
        super(B, self).__init__()

    @torch.jit.script_method
    def bar(self, x):
        return x * x


A()
```
it does
```
RuntimeError:
attribute 'whatisgoingon' does not exist:
@torch.jit.script_method
def forward(self, x):
    return x + self.whatisgoingon
               ~~~~~~~~~~~~~~~~~~ <--- HERE
```
I added a test in `test_jit.py` as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20779
Differential Revision: D15441138
Pulled By: Chillee
fbshipit-source-id: 88f458c36b5e32a1ffc467b27bbc28a3c5c07321
Summary:
As a part of https://github.com/pytorch/pytorch/pull/20580 I noticed that we had some unusual variable naming in `summary.py`. This cleans it up and also removes some variables that weren't being used.
I'll wait until we have an `add_custom_scalars` test to land this.
cc lanpa natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20861
Differential Revision: D15503420
Pulled By: orionr
fbshipit-source-id: 86d105a346198a1ca543d1c5d297804402ab5a0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20880
This clarifies how the momentum parameters should be used.
Reviewed By: soumith
Differential Revision: D15482450
fbshipit-source-id: e3649a38876c5912cb101d8e404abca7c3431766
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20610
- Change InferLengthsRangeFill
- Add InferGatherRanges
- Add tests from ClipRangesGatherSigridHash all the way to SparseLengthsWeightedSum
- Add tests from SigridTransforms all the way to SparseLengthsWeightedSum

An e2e test will be added in the following diff.
Reviewed By: ipiszy
Differential Revision: D15382730
fbshipit-source-id: a611cd129007a273dfc43955cd99af1c4ed04efd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20938
Dequantize_linear need not be exposed to the front end users.
It will only be used for the jit passes for q-dq insertion and op
substitution.
Differential Revision: D15446097
fbshipit-source-id: a5fbcf2bb72115122c9653e5089d014e2a2e891d
Summary:
Remove the internal functions in multi_head_attention_forward. Those internal functions cause 10-15% performance regression and there is possibly a JIT issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20653
Differential Revision: D15398888
Pulled By: cpuhrsch
fbshipit-source-id: 0a3f053a4ade5009e73d3974fa6733c2bff9d929
Summary:
Changes:
- protobuf has been moved to protocolbuffers/protobuf a while ago.
- cpuinfo has been moved to pytorch/cpuinfo and updated in FBGEMM recently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20973
Differential Revision: D15511926
Pulled By: soumith
fbshipit-source-id: 2c50373c9b245524f839bd1059870dd2b84e3b81
Summary:
Sometimes users forget to use the "--recursive" option when they update submodules. The check added here should help expose this issue.
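A minimal sketch of such a check (a hypothetical helper, not the actual setup.py code): an empty checkout directory is the telltale sign that `git submodule update --init --recursive` was never run.

```python
import os
import tempfile

def find_uninitialized_submodules(root, submodules):
    # An empty directory means the submodule contents were never fetched.
    return [s for s in submodules
            if not os.listdir(os.path.join(root, s))]

# Demo with a fake repo layout.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'third_party', 'good'))
open(os.path.join(root, 'third_party', 'good', 'CMakeLists.txt'), 'w').close()
os.makedirs(os.path.join(root, 'third_party', 'empty'))
missing = find_uninitialized_submodules(
    root,
    [os.path.join('third_party', 'good'), os.path.join('third_party', 'empty')])
```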
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20937
Differential Revision: D15502846
Pulled By: mrshenli
fbshipit-source-id: 34c28a2c71ee6442d16b8b741ea44a18733b1536
Summary:
This changes the progress bars in `_download_url_to_file` from saying things like `49773343.40it/s` to `47.5MB/s`.
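The scaling effect can be illustrated with a small formatter (an illustrative sketch, not the actual progress-bar code, which configures the bar's unit options instead):

```python
def human_rate(bytes_per_sec):
    # Scale through binary unit prefixes, the way a byte-aware
    # progress bar renders transfer rates.
    for unit in ('B', 'kB', 'MB', 'GB', 'TB'):
        if bytes_per_sec < 1024.0:
            return '%.1f%s/s' % (bytes_per_sec, unit)
        bytes_per_sec /= 1024.0
    return '%.1fPB/s' % bytes_per_sec

print(human_rate(49773343.40))  # -> 47.5MB/s
```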
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20908
Differential Revision: D15511223
Pulled By: soumith
fbshipit-source-id: 2422eb5fb486f9ef4bd69c556c4ed1775b8b2860
Summary:
I believe the `True` and `False` in the doc are reversed :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20961
Differential Revision: D15510806
Pulled By: soumith
fbshipit-source-id: 62566bb595e187506b23dedc24892e48f35b1147
Summary:
Fixes#20630
Haven't tested it yet. Let's see if it passes all CI tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20882
Reviewed By: pietern
Differential Revision: D15483561
Pulled By: mrshenli
fbshipit-source-id: 5f0730a04d92906af077b2fe2170b674ca371e6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20868
When `input_boxes_include_bg_cls` is false (which means `input_scores_fg_cls_starting_id` is 0), it doesn't map the class index of the score correctly when sorting and limiting the detections over all classes after NMS.
Reviewed By: newstzpz
Differential Revision: D15472706
fbshipit-source-id: dc1e808b63ad09fb4bd95acf866771bb3fa92d69
Summary:
This fixes advanced indexing in cases where there's more than 2^31-1
bytes in the output. The `gpu_index_kernel` was missing the
`can_use_32bit_indexing`/`with_32bit_indexing` check.
This also adds a number of TORCH_INTERNAL_ASSERTS in Loops.cuh,
OffsetCalculator, and IntDivider checking that sizes fit in a signed 32-bit
integer.
More comprehensive tests that require a 32 GB GPU are here:
https://gist.github.com/colesbury/e29387f5851521256dff562be07b981e
Fixes #20888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20919
Differential Revision: D15501945
Pulled By: colesbury
fbshipit-source-id: e876e678e866d2efda8ee92c47a1d2d1310671f0
Summary:
Previously, this used `crepr` after the decref of `repr`. This is not
allowed because `repr` owns the cached copy of `crepr`.
Let's see if this fixes the contbuild.
See #20926
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20931
Differential Revision: D15501929
Pulled By: colesbury
fbshipit-source-id: 24141ba62df8758d2a3998cf7c2054be09088b6a
Summary:
Bug reported internally at FB:
```python
>>> t=torch.from_numpy(np.empty((0,4)))
>>> t[:,1::2]*=1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: Trying to resize storage that is not resizable at ../aten/src/TH/THStorageFunctions.cpp:76
```
This happens because the storage offset of `t[:, 1::2]` is 1, and it has 0 elements. We can fix this by avoiding resizing the storage for no-element arrays.
(We could *also* have avoided it by not modifying the storage offset in this case, but I felt this way was more semantically correct -- in general, we should not be assuming it's okay to do anything to the storage when it has zero elements).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20914
Differential Revision: D15497860
Pulled By: umanwizard
fbshipit-source-id: 6af61d73a05edfc5c07ce8be9e530f15bf72e6a9
Summary:
I started adding support for the new **[mesh/point cloud](https://github.com/tensorflow/graphics/blob/master/tensorflow_graphics/g3doc/tensorboard.md)** data type introduced to TensorBoard recently.
I created the functions to add the data, created the appropriate summaries.
This new data type however requires a **Merged** summary containing the data for the vertices, colors and faces.
I got stuck at this stage. Maybe someone can help. lanpa?
I converted the example code by Google to PyTorch:
```python
import numpy as np
import trimesh
import torch
from torch.utils.tensorboard import SummaryWriter
sample_mesh = 'https://storage.googleapis.com/tensorflow-graphics/tensorboard/test_data/ShortDance07_a175_00001.ply'
log_dir = 'runs/torch'
batch_size = 1
# Camera and scene configuration.
config_dict = {
'camera': {'cls': 'PerspectiveCamera', 'fov': 75},
'lights': [
{
'cls': 'AmbientLight',
'color': '#ffffff',
'intensity': 0.75,
}, {
'cls': 'DirectionalLight',
'color': '#ffffff',
'intensity': 0.75,
'position': [0, -1, 2],
}],
'material': {
'cls': 'MeshStandardMaterial',
'roughness': 1,
'metalness': 0
}
}
# Read all sample PLY files.
mesh = trimesh.load_remote(sample_mesh)
vertices = np.array(mesh.vertices)
# Currently only supports RGB colors.
colors = np.array(mesh.visual.vertex_colors[:, :3])
faces = np.array(mesh.faces)
# Add batch dimension, so our data will be of shape BxNxC.
vertices = np.expand_dims(vertices, 0)
colors = np.expand_dims(colors, 0)
faces = np.expand_dims(faces, 0)
# Create data placeholders of the same shape as data itself.
vertices_tensor = torch.as_tensor(vertices)
faces_tensor = torch.as_tensor(faces)
colors_tensor = torch.as_tensor(colors)
writer = SummaryWriter(log_dir)
writer.add_mesh('mesh_color_tensor', vertices=vertices_tensor, faces=faces_tensor,
colors=colors_tensor, config_dict=config_dict)
writer.close()
```
I tried adding only the vertex summary, hence the others are supposed to be optional.
I got the following error from TensorBoard and it also didn't display the points:
```
Traceback (most recent call last):
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/serving.py", line 302, in run_wsgi
execute(self.server.app)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/serving.py", line 290, in execute
application_iter = app(environ, start_response)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/application.py", line 309, in __call__
return self.data_applications[clean_path](environ, start_response)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/werkzeug/wrappers/base_request.py", line 235, in application
resp = f(*args[:-2] + (request,))
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/plugins/mesh/mesh_plugin.py", line 252, in _serve_mesh_metadata
tensor_events = self._collect_tensor_events(request)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/plugins/mesh/mesh_plugin.py", line 188, in _collect_tensor_events
tensors = self._multiplexer.Tensors(run, instance_tag)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 400, in Tensors
return accumulator.Tensors(tag)
File "/home/dawars/workspace/pytorch/venv/lib/python3.6/site-packages/tensorboard/backend/event_processing/plugin_event_accumulator.py", line 437, in Tensors
return self.tensors_by_tag[tag].Items(_TENSOR_RESERVOIR_KEY)
KeyError: 'mesh_color_tensor_COLOR'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20413
Differential Revision: D15500737
Pulled By: orionr
fbshipit-source-id: 426e8b966037d08c065bce5198fd485fd80a2b67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20821
Change registration API. Instead of
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>()
.dispatchKey(CPUTensorId()));
it is now
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(CPUTensorId()));
This binds kernel and dispatch key together, allowing them to be separate from other future configuration options like alias analysis or autograd wrappers.
The semantic problem behind this is that the dispatch key is a *kernel config parameter* and not an *operator config parameter* while things like autograd wrappers, alias info, and actually the kernel itself are *operator config parameters*. And while previously, the different kind of config parameters have been mixed, this diff now separates them.
Before this change, it wouldn't have been well defined if you specified a dispatchKey together with an autogradWrapper or aliasInfo for example.
// what is this supposed to do?
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.aliasInfo(DEFAULT)
.dispatchKey(CPUTensorId()));
If we get more kernel config parameters in the future, we could introduce something like this
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel>(torch::RegisterOperators::kernelOptions()
.dispatchKey(CPUTensorId())
.otherConfig());
but that's overkill as long as dispatch keys are the only kernel config parameter, and we can introduce that later without breaking backwards compatibility.
A nice side effect of this is that people can register multiple kernels to the same operator in the same `.op()` call:
static auto registry = torch::RegisterOperators()
.op("my::op", torch::RegisterOperators::options()
.kernel<Kernel1>(CPUTensorId())
.kernel<Kernel2>(CUDATensorId()));
Reviewed By: dzhulgakov
Differential Revision: D15455790
fbshipit-source-id: 1c46bfe676dcacf74cf36bd3f5df3d2c32b8fb11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17818
Some of these are probably implementable for exported operators,
but aren't implemented yet and for now it's better to assert than to just return wrong results.
Reviewed By: ezyang
Differential Revision: D14392459
fbshipit-source-id: bf86e6cb0a7cfefd112a65dc85cc243e57a5ad52
Summary:
This PR also moves Device::validate into the header file, which makes
statements like `Device d = kCPU` effectively free.
Device includes the device's index, so TensorIterator::compute_types
now implicitly checks that all CUDA inputs are on the same GPU.
Previously, this was done ad-hoc in places like TensorIterator::binary_op.
Note that zero-dim Tensors (scalars) are NOT required to be on the
same device as other inputs because they behave almost like Python numbers.
TensorIterator handles copying zero-dim Tensors to the common device.
Prior to this PR, TensorIterator would copy zero-dim Tensors between CPU
and GPU, but not between different GPUs (because Backend didn't encode
the GPU index). This removes that restriction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20690
Differential Revision: D15414826
Pulled By: colesbury
fbshipit-source-id: 1d0ad1f7d663252af36dd4590bcda418c2f7a09f
Summary:
This PR is a eliminates unneeded grad_sum_to_size and in particular speeds up the LSTM backward by allowing better fusion.
It consists of two parts:
- In AutoDiff, record broadcasting sizes only if the broadcast output size is different from the input size, otherwise record None.
- The specialization of Optional arguments (#18407) allows us to then eliminate ` _grad_sum_to_size(t, None)` in the peephole optimization step.
Thus, in the LSTM case, no SumToSize remain in the crucial fusion group. The trick here is that we can specialize on the runtime information from the forward.
I'm testing that different broadcasting situations lead to different graphs.
I didn't move all symbolic_script _grad_sum_to_size to the new logic, but it might be better to do this incrementally, anyway.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18697
Differential Revision: D15482076
Pulled By: wanchaol
fbshipit-source-id: 7f89367e35b8729910077c95c02bccefc8678afb
Summary:
To say that we don't do refinement on module attributes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20912
Differential Revision: D15496453
Pulled By: eellison
fbshipit-source-id: a1ab9fb0157a30fa1bb71d0793fcc9b1670c4926
Summary:
Earlier, the workspace size query and allocation was placed inside the loop.
However, since we have batches of matrices with the same number of rows and columns, the workspace size query and allocation for every matrix in the batch is redundant.
This PR moves the workspace size query and allocation outside the loop, effectively saving (batch_size - 1) number of queries and allocation (and consequently the deallocation).
There is a tremendous speedup in inverse computation as a result of this change.
Changelog:
- Move workspace query and allocation outside the batch loop
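The shape of the optimization, as a sketch (illustrative Python, not the actual ATen/MAGMA code):

```python
class Workspace:
    allocations = 0  # track how many allocations occur

    def __init__(self, size):
        Workspace.allocations += 1
        self.buf = bytearray(size)

def batched_op(batch, workspace_size):
    # After this PR: one size query/allocation for the whole batch, since
    # every matrix shares the same dimensions. Previously this line lived
    # inside the loop, costing (batch_size - 1) extra allocations.
    ws = Workspace(workspace_size)
    return [m for m in batch]  # stand-in for the per-matrix LAPACK call

batched_op([[1.0], [2.0], [3.0]], 64)
```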
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20904
Differential Revision: D15495505
Pulled By: ezyang
fbshipit-source-id: 226729734465fcaf896f86e1b1a548a81440e082
Summary:
- Do not install unnecessary packages in the Docker image.
- In the Docker image, use conda to install ninja (saving one layer)
- When workdir is set, use "." to refer to it to reduce redundancy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20881
Differential Revision: D15495769
Pulled By: ezyang
fbshipit-source-id: dab7df71ac107c85fb1447697e25978daffc7e0b
Summary:
Currently PyTorch forces color output due to #20662. But users should be left an option to turn it off, because output redirected to a file would be garbled if color output is forced.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20771
Differential Revision: D15495677
Pulled By: ezyang
fbshipit-source-id: 9d89bbed40d0b67368554305394763a54c5ff6f5
Summary:
Currently when the argument to isinf and isfinite is not tensor, a ValueError is raised. This, however, should be a TypeError, because the error is a type mismatch.
In the error message, "str(tensor)" is replaced by "repr(tensor)" because, when an error occurs, a printable representation of the object is likely more useful than the "informal" string version of the object.
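The pattern being described, sketched with a hypothetical checker (not the actual source):

```python
def check_tensor_arg(obj, expected_type, fn_name):
    # A type mismatch raises TypeError (not ValueError), and the message
    # uses repr() so the offending object is shown unambiguously.
    if not isinstance(obj, expected_type):
        raise TypeError('%s expected a %s, but got %s'
                        % (fn_name, expected_type.__name__, repr(obj)))

class FakeTensor:  # stand-in type so the demo is self-contained
    pass

try:
    check_tensor_arg('not a tensor', FakeTensor, 'isfinite')
except TypeError as e:
    message = str(e)
```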
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20817
Differential Revision: D15495624
Pulled By: ezyang
fbshipit-source-id: 514198dcd723a7031818e50a87e187b22d51af73
Summary:
Attention mask should be of shape `(L, S)` since it is added to `attn_output_weights`.
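A small shape check makes the convention concrete (illustrative only; `L` is the target length, `S` the source length, and a `(L, S)` mask broadcasts over the leading batch-times-heads dimension):

```python
import torch

L, S, num_heads = 3, 5, 2
attn_output_weights = torch.randn(num_heads, L, S)  # per-head attention logits
attn_mask = torch.zeros(L, S)
attn_mask[:, -1] = float('-inf')          # e.g. hide the last source position
masked = attn_output_weights + attn_mask  # (L, S) broadcasts over heads
```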
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20850
Differential Revision: D15495587
Pulled By: ezyang
fbshipit-source-id: 61d6801da5291df960daab273e874df28aedbf6e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20892
FBGEMM uses 64-bit values, so we need to change our implementation to match.
Reviewed By: jerryzh168
Differential Revision: D15487664
fbshipit-source-id: 29cba26093c6f9aeafce14982c1ae12149e63562
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20773
This removes the feature to register fallback kernels that are called when no other kernel matches.
Instead, we introduce the concept of catchall kernels that are always called independent of inputs.
If you only have a fallback/catchall kernel and no kernels with concrete dispatch keys, then both concepts behave in the same way.
The difference is that we now disallow operators to have both, a catchall kernel and kernels with concrete dispatch keys.
This was possible before when they have been fallback kernels.
The reason for this change is that we anticipate needing a method_missing feature in backends, i.e. a backend-wide fallback to call when the backend doesn't specify a kernel for an operator.
We are not clear on precedence between this backend-wide fallback and an operator-level fallback. Disallow fallbacks for now so we are free to choose later without breaking backwards compatibility.
Reviewed By: dzhulgakov
Differential Revision: D15438977
fbshipit-source-id: cb3aa764a1659d909ee21a7bd8ec3d32438aafaa
Summary:
Resubmit #20698 which got messed up.
Idea is that when PyTorch is used in a custom build environment (e.g. Facebook), it's useful to track usage of various APIs centrally. This PR introduces a simple very lightweight mechanism to do so - only first invocation of a trigger point would be logged. This is significantly more lightweight than #18235 and thus we can allow to put logging in e.g. TensorImpl.
Also adds an initial list of trigger points. Trigger points are added in such a way that no static initialization triggers them, i.e. just linking with libtorch.so will not cause any logging. Further suggestions of what to log are welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20745
Differential Revision: D15429196
Pulled By: dzhulgakov
fbshipit-source-id: a5e41a709a65b7ebccc6b95f93854e583cf20aca
Summary:
As part of the Variable/Tensor merge work: https://github.com/pytorch/pytorch/issues/13638, we make the following changes in this PR:
1. Remove the `Variable::Impl` class and the `DifferentiableViewImpl` class
2. Change all `Variable.data()` call sites to either use `Variable` directly, or use `Variable.tensor_data()`
3. Remove the `Variable.data()` API
4. Add `Variable.variable_data()` that matches `tensor.data` in the Python API, which creates a new `Variable` that shares the same storage and tensor metadata with the original `Variable`, but with a completely new autograd history.
After this PR, Variable doesn't wrap a Tensor internally anymore, and both Variable and Tensor use the same TensorImpl class as its `impl_`. The only difference is that Variable always has AutogradMeta in its TensorImpl, but Tensor doesn't.
**Note that this PR is BC-breaking in the following use cases:**
**Use Case 1:**
Previously, `x.data = y` works even if `x` and `y` are of different TensorImpl type (e.g. `x` is a CPU dense tensor whose impl is of type TensorImpl, while `y` is a CPU sparse tensor whose impl is of type SparseTensorImpl). However, after this PR, `x.data = y` doesn't work anymore if `x` and `y` are of different TensorImpl type, because the underlying implementation `variable.set_data(tensor)` no longer works if `variable` and `tensor` have different TensorImpl type.
**Use Case 2:**
If a tensor `x`'s `grad` is sparse, accumulating dense gradients to `x` will change the tensor that `x.grad` is pointing to. This is better illustrated with the following example:
```python
params = torch.tensor([1.5, 1.5]).requires_grad_()
with torch.no_grad():
# Change gradient to a sparse tensor
params.grad = torch.sparse_coo_tensor(torch.tensor([[1, 1]]).long(), torch.tensor([1., 1.]))
grad_saved = params.grad
params.backward(torch.tensor([1.5, 1.5]))
assert id(grad_saved) == id(params.grad) # This will fail after this PR
```
The assertion in the last line will fail after this PR, because adding dense gradients to sparse gradients will change the `params.grad` tensor reference.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17072
Differential Revision: D14075257
Pulled By: yf225
fbshipit-source-id: 0e681df641270dea586042dd26db59f2e76b5957
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20669
Before, Dict was a value type, i.e. copying it did a deep copy.
Unfortunately, this doesn't work well with storing and passing Dicts around in IValues because IValues are reference types.
This diff changes Dict to be a reference type.
Reviewed By: dzhulgakov
Differential Revision: D15404911
fbshipit-source-id: dc990d3eb7cae044b74dd0253f8b704dde6a6c86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20833
Att. The algorithm is still "horrendously inefficient". But since we are sunsetting Nomnigraph, I just did the minimal fix here.
Reviewed By: tracelogfb
Differential Revision: D15463880
fbshipit-source-id: 413a1280a92c1923ba49031177816a2d5f888575
Summary:
This tries to fix the following error on current master:
```
May 23 16:18:47 Traceback (most recent call last):
May 23 16:18:47 File "main.py", line 7, in <module>
May 23 16:18:47 from torchvision import datasets, transforms
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/__init__.py", line 1, in <module>
May 23 16:18:47 from torchvision import models
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/__init__.py", line 11, in <module>
May 23 16:18:47 from . import detection
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/detection/__init__.py", line 1, in <module>
May 23 16:18:47 from .faster_rcnn import *
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/models/detection/faster_rcnn.py", line 7, in <module>
May 23 16:18:47 from torchvision.ops import misc as misc_nn_ops
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/ops/__init__.py", line 1, in <module>
May 23 16:18:47 from .boxes import nms, box_iou
May 23 16:18:47 File "/opt/conda/lib/python3.6/site-packages/torchvision/ops/boxes.py", line 2, in <module>
May 23 16:18:47 from torchvision import _C
May 23 16:18:47 ImportError: /opt/conda/lib/python3.6/site-packages/torchvision/_C.cpython-36m-x86_64-linux-gnu.so: undefined symbol: _ZN2at19NonVariableTypeMode10is_enabledEv
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20865
Differential Revision: D15481736
Pulled By: yf225
fbshipit-source-id: 67d4fd70652ccc709b44cb15392d6e44a8fe9235
Summary:
This PR changes CPU implementation of `AdaptiveAveragePool2D` by
- move dispatch to outside the OpenMP loop
- support fp16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20366
Differential Revision: D15456069
Pulled By: ezyang
fbshipit-source-id: 00fa2916f8b136af9f5c8b5db0eca4619f9f5bac
Summary:
When adding custom scalars like this
```python
from torch.utils.tensorboard import SummaryWriter
with SummaryWriter() as writer:
writer.add_custom_scalars({'Stuff': {
'Losses': ['MultiLine', ['loss/(one|two)']],
'Metrics': ['MultiLine', ['metric/(three|four)']],
}})
```
This error is raised:
```
TypeError: Parameter to MergeFrom() must be instance of same class: expected tensorboard.SummaryMetadata.PluginData got list.
```
Removing the square brackets around `SummaryMetadata.PluginData(plugin_name='custom_scalars')` should be enough to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20580
Differential Revision: D15469700
Pulled By: orionr
fbshipit-source-id: 7ce58034bc2a74ab149fee6419319db68d8abafe
Summary:
fix#20781#20757
I don't know an easy way to add a test to make sure it runs against a package installed as an .egg, but I tested it locally with torchvision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20782
Differential Revision: D15443600
Pulled By: ailzhang
fbshipit-source-id: 285eb0d9a44d6edb8e93618fa293f4feb431d2ae
Summary:
XLA needs a way to override CPUTensor.copy_(XLATensor), but we only
dispatch on the "self" argument. This inverts the dispatch order when
"src" is an unhandled type.
Note that things like XLATensor.copy_(CPUTensor) never enter this
implementation.
cc dlibenzi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20783
Differential Revision: D15443187
Pulled By: colesbury
fbshipit-source-id: 4ee93ba598ef0fed2a99c0683aae30cb50a1f99c
Summary:
was reading the README on github and came across a couple of typos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20819
Differential Revision: D15469603
Pulled By: nairbv
fbshipit-source-id: 0ed7868de2d4e6d82557a8c170783966f8a1afd7
Summary:
The duplicated code of `_optimize_trace` in _pytorch_graph.py is used to bypass some optimization step which causes missing scope.
It seems that most of the problematic steps have been fixed recently. Standard models implemented in torchvision are visually inspected before the commit. However, the `+=` in 50d54a82d1/torchvision/models/resnet.py (L63) will let f4d9bfaa4d/torch/onnx/utils.py (L159) produce a bad result. It can be fixed by replacing it with `out = out + identity`. This also implies that `+=` has non-intuitive behavior.
cc orionr ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20394
Reviewed By: NarineK
Differential Revision: D15452204
Pulled By: orionr
fbshipit-source-id: eaa4c13f16551c78dc6419f1e22eb2c560af4cc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20772
Copy of D15178352
A conflicting commit landed at the same time as D15178352 that removed registering kernels using IntArrayRef; hence, D15178352 was reverted. Using std::vector instead.
Reviewed By: zafartahirov
Differential Revision: D15437237
fbshipit-source-id: cd2f1caebcc720352b48ce25d716cb1ca49a5197
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20816
Previously, the c10 dispatcher expected ops to be called with Variables and unwrapped them to Tensors before calling into the kernel.
The kernel was expected to return Tensors that were re-wrapped into Variables before passing them on into the system.
However, that doesn't work with kernels that call other operators. One recent example was a kernel that returned the result of `torch::ones()` as output.
Now, with this diff, the c10 dispatcher still passes Tensors to the kernel and Variables back into the system, but it allows ops to be called with either Tensors or Variables,
and kernels are also allowed to return either.
After https://github.com/pytorch/pytorch/pull/17072 , we should be able to get rid of the whole wrapping/unwrapping logic.
Reviewed By: hl475
Differential Revision: D15453963
fbshipit-source-id: 7602b7f2bc43e8ceb8a8c0e97aafcc53d4c47b6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20740
Provide a way to assemble quantized Tensor from int8 Tensor, scale and zero point.
Differential Revision: D15232416
fbshipit-source-id: c3a3d9d7214b1dc569214c019440c2779fbd063b
Summary:
This is the first part of the planned changes to change the comparison operations result tensor dtype from Byte to Bool. You can see the whole list of changes (not cleaned up) [here](https://github.com/pytorch/pytorch/pull/19332). As the PR is too big for a single review im breaking it into pieces.
**Changes in this PR:**
1. Enable these methods for bool tensors:
- maskedSelect
- maskedSelectBool
- bitand
- cbitand
- bitor
- cbitor
- bitxor
- cbitxor
- sign
- equal
- neg
2. Add bool clause for the TH version of sign method.
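For example, the bitwise operations on bool tensors (a quick sketch exercising bitand, bitor, and bitxor via their operator forms):

```python
import torch

a = torch.tensor([True, False, True])
b = torch.tensor([True, True, False])
print((a & b).tolist())    # bitand
print((a | b).tolist())    # bitor
print((a ^ b).tolist())    # bitxor
print(torch.equal(a, b))   # equal
```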
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20767
Differential Revision: D15436446
Pulled By: izdeby
fbshipit-source-id: 8d2494b5f4873cd79c7f1a40d2cb045cadfad51a
Summary:
I didn't update the Windows references because I wasn't sure if they apply to CUDA 9. peterjc123 what should the Windows section say?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20718
Differential Revision: D15459276
Pulled By: colesbury
fbshipit-source-id: 917e22f8ac75378d88c962c226b5a42b6799c79a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20802
Need this for sequence model
Reviewed By: dzhulgakov
Differential Revision: D15448529
fbshipit-source-id: cd5abe3b689fc0e02feff10faf8cd61c99369f4f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20786
Add a method to LayerModelHelper to filter metrics_schema. A general model builder may add metric schema that is not needed in some situations. This change adds the ability to skip those unneeded fields.
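A sketch of what such a filter looks like (hypothetical names; the real helper operates on caffe2 schema structs):

```python
def filter_metrics_schema(schema, keep):
    # drop every metric field the caller did not ask for
    return {name: field for name, field in schema.items() if name in keep}
```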
Reviewed By: alex1o1o7cloud
Differential Revision: D15418140
fbshipit-source-id: 520f5dffd9938cf206cb1352e2953a4d4d2b6ab1
Summary:
When detecting the presence of NumPy using import, move numpy-related variable assignments outside the try block (i.e., to an else block) to improve readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20739
Differential Revision: D15453916
Pulled By: ezyang
fbshipit-source-id: d3c37f2b290846be3c6a1462251cbb3e95d493be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20787
Set requires_grad=False for bias: this will block the jit tracing.
The as_type fix: The input tensor shape and output tensor shape will be different, which will trigger the assertion failure at https://fburl.com/0m8xy7tc.
Reviewed By: jamesr66a
Differential Revision: D15445092
fbshipit-source-id: 22da41a56ecb9ac092585d0cc1ff0658fb9d631b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20045
This pass adds quant-dequant nodes for bias. It requires the
quant-dequant pass for activations and weights to have run first, since
their qparams are needed to compute the qparams for bias.
Differential Revision: D15179141
fbshipit-source-id: 3aab9fceefcadc3fa42a4e802d9b1e18addad78a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20770
Add dict type since it's part of the pytorch built-in system, and sparse features and text features will be converted to Dict
Reviewed By: pritamdamania87
Differential Revision: D15436255
fbshipit-source-id: 239adbd6a8f68be29020fe656d790f6872f1f0e9
Summary:
as title. We were using AT_ASSERT, which is newly deprecated. In this case, we do in fact want an internal assertion since this is used in testing code to describe expected behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20555
Differential Revision: D15362964
Pulled By: suo
fbshipit-source-id: 984bfe71a774571611f3bbd81767d3cdb878a6fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20345
Separate from D15194600
Optimize pytorch layer_norm op, part 1:
optimize layer_norm_forward_cpu
use Eigen Maps to improve reduction performance
Reviewed By: zheng-xq
Differential Revision: D15290608
fbshipit-source-id: cf2c208dfd6fbcbc4c69db3ed60278d9bee156b5
Summary:
Previous implementation of magic methods extended from BuiltinOperators, but it should be able to work with other sugared values, such as casts.
I was also considering making CastValue and BuiltinOperators extend from a MagicMethod superclass, and having them try to call into the superclass before their own call. However, not all builtin operators have corresponding magic methods, so I did it this way instead (although there are workarounds for that).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20654
Differential Revision: D15434469
Pulled By: eellison
fbshipit-source-id: 813fa00bf8b5b9ada46505075ebf984d8eee6aef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20711
For uint8_t, ```std::numeric_limits::digits``` returns 8;
For int8_t, ```std::numeric_limits::digits``` returns 7.
FBGEMM wants to get the ```qparams.precision``` to be always 8 for both int8_t and uint8_t.
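The discrepancy comes from `digits` counting value bits only, which excludes the sign bit for signed types; the fix adds that bit back. A sketch of the adjustment (hypothetical helper names):

```python
def value_bits(storage_bits, signed):
    # mirrors std::numeric_limits<T>::digits: value bits, excluding the sign bit
    return storage_bits - 1 if signed else storage_bits

def qparams_precision(storage_bits, signed):
    # add the sign bit back so both int8_t and uint8_t report a width of 8
    return value_bits(storage_bits, signed) + (1 if signed else 0)
```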
Reviewed By: jerryzh168
Differential Revision: D15410695
fbshipit-source-id: 17dc3842d7c426947454c201bcb167b87b7301ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20726
Edward says it doesn't actually provide compilers,
but it does provide dependencies, so let's mention that instead.
Reviewed By: ezyang
Differential Revision: D15423316
fbshipit-source-id: 9b384f88e5bf7a3d2c132508620c276b49e1569f
Summary:
This PR implements auto-conversion of GPU arrays that support the `__cuda_array_interface__` protocol (fixes #15601).
If an object exposes the `__cuda_array_interface__` attribute, `torch.as_tensor()` and `torch.tensor()` will use the exposed device memory.
#### Zero-copy
When using `torch.as_tensor(...,device=D)` where `D` is the same device as the one used in `__cuda_array_interface__`.
#### Implicit copy
When using `torch.as_tensor(...,device=D)` where `D` is the CPU or another non-CUDA device.
#### Explicit copy
When using `torch.tensor()`.
#### Exception
When using `torch.as_tensor(...,device=D)` where `D` is a CUDA device not used in `__cuda_array_interface__`.
#### Lifetime
`torch.as_tensor(obj)` grabs a reference to `obj` so that the lifetime of `obj` exceeds the lifetime of the tensor.
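The lifetime guarantee boils down to the tensor holding a strong reference to the exporting object; a pure-Python sketch (`Exporter` and `FakeTensor` are stand-ins, not real classes):

```python
import weakref

class Exporter:
    # stands in for an object exposing __cuda_array_interface__
    pass

class FakeTensor:
    def __init__(self, base):
        self._base = base  # strong reference keeps the exporter's memory valid

obj = Exporter()
alive = weakref.ref(obj)
t = FakeTensor(obj)
del obj
still_alive_with_tensor = alive() is not None  # tensor keeps obj alive
del t
alive_after_tensor = alive() is not None       # obj freed with the tensor
```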
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20584
Differential Revision: D15435610
Pulled By: ezyang
fbshipit-source-id: c423776ba2f2c073b902e0a0ce272d54e9005286
Summary:
Appending `arch` to the generator name is not supported for VS starting from VS 2019.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20752
Differential Revision: D15436740
Pulled By: ezyang
fbshipit-source-id: 20057aae8f708d82619927bf2cb87dd1bc2df312
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20737
If someone tries to register multiple kernels in the same .op() call, we're now throwing an error.
Differential Revision: D15425660
fbshipit-source-id: 6d2f1444da3e16a6a98863d847965c2aa211e046
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20674
A few targets in caffe2/caffe2/distribute need to be split too; otherwise they won't compile. Also some cleanups, and rename select_gpu_type to gpu_library_selector.
Differential Revision: D15406019
fbshipit-source-id: 6455ab885b248502b48d4c7565597e00fecfd547
Summary:
Let there be color!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20662
Differential Revision: D15434110
Pulled By: suo
fbshipit-source-id: a317ae72ad72e0b8249f55c9c8d31f420c78c040
Summary:
building with cuda and gcc 4.8.5-28, we see many warnings like:
[893/1645] Building NVCC (Device) object caffe2/CMakeFiles/caffe2_gpu.dir/__/aten/src/THCUNN/caffe2_gpu_generated_ELU.cu.o
/home/bvaughan/repos/pytorch/c10/util/ArrayRef.h:277:48: warning: ‘deprecated’ attribute directive ignored [-Wattributes]
using IntList C10_DEPRECATED_USING = ArrayRef<int64_t>;
This change prevents those warnings on the older compiler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20587
Differential Revision: D15432749
Pulled By: nairbv
fbshipit-source-id: fd707afcbd6564f96617378d7cd6d62d941a052b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20468
ScalarType node is mandatory for activations and parameters now.
This change inserts ScalarType node for all the quant-dequant nodes. For the activations, currently the default value is at::ScalarType::Undefined. Remove this and explicitly pass the at::ScalarType::QUint8 dtype
Differential Revision: D15331600
fbshipit-source-id: 5b51e0b42e694bf409026af4783a12da6d7e234b
Summary:
Copy.cu goes from 308 to 190 lines of code. In general it uses the same
copy strategy: cudaMemcpyAsync, a pointwise kernel, or a copy
using temporary buffers. The pointwise kernel has slightly improved
performance when broadcasting due to faster index calculation.
This deletes "`s_copy_`", "`_s_copy_from`", and "`_copy_same_type_`". The only
entry-point now is "`copy_`".
A mini-benchmark is here:
https://gist.github.com/colesbury/706de1d4e8260afe046020988410b992
Before:
https://gist.github.com/colesbury/ab454b6fe3791bff420d7bcf8c041f18
After:
https://gist.github.com/colesbury/9024d242b56ab09a9ec985fa6d1620bc
Results were measured on 2.2 GHz Broadwell; no-turbo; one thread;
compiled with GCC 7.3.0. (Results are slower than typical usage due to
turbo being off.)
The only significant difference is in the CUDA [1024] -> [1024, 1024]
broadcasting copy which is ~25% faster. I don't expect a noticeable
difference in real programs.
CPU copy overhead is a tiny bit (~200 ns) faster, but I don't expect
anyone to notice that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20685
Differential Revision: D15414819
Pulled By: colesbury
fbshipit-source-id: d3c6e04a5020470e3bef15b1fc09503cae5df440
Summary:
Symbols are given hidden visibility by default on Linux to emulate the behavior on Windows. This helps developers catch visibility issues in their streamlined Linux dev environment before being surprised, late in the process, by Windows errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20461
Reviewed By: kostmo
Differential Revision: D15410410
Pulled By: dzhulgakov
fbshipit-source-id: 1d684b5a9a80b692966a775c3f1c56b7c72ffc95
Summary:
Fixes#20017
This wraps the `torch._C.Function` currently returned from `torch.jit.script` and `torch.jit.trace` in a `ScriptFunction` and `TracedFunction` respectively, both of which are just wrappers to hold the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20386
Pulled By: driazati
Differential Revision: D15403161
fbshipit-source-id: 94fb9f32929e62a00be6cf7512ea144ec9b91e0b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20677
With new changes in IR, it is possible to insert nodes after param
nodes in graph. Thus we do not need to have two methods for inserting q-dq
nodes to input or output to quantizable nodes.
Differential Revision: D15406354
fbshipit-source-id: 1963762f434fd82877fa76a272e8520c342b6069
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20709
- Remove ArrayRef based API. This is neither the old nor the planned new API.
- De-deprecate kernels based on std::vector and std::unordered_map. We don't have the Dict/List based API figured out entirely yet, so we shouldn't push people towards using them.
std::vector and std::unordered_map will get deprecated again once we've figured out List/Dict.
Reviewed By: dzhulgakov
Differential Revision: D15417025
fbshipit-source-id: bfbb33c762e43487bb499bc8cc36d515e678f8fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20667
Compilation errors:
```
xplat/caffe2/caffe2/utils/signal_handler.h:31:10: error: private field 'SIGINT_action_' is not used [-Werror,-Wunused-private-field]
Action SIGINT_action_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:32:10: error: private field 'SIGHUP_action_' is not used [-Werror,-Wunused-private-field]
Action SIGHUP_action_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:33:17: error: private field 'my_sigint_count_' is not used [-Werror,-Wunused-private-field]
unsigned long my_sigint_count_;
^
xplat/caffe2/caffe2/utils/signal_handler.h:34:17: error: private field 'my_sighup_count_' is not used [-Werror,-Wunused-private-field]
unsigned long my_sighup_count_;
^
4 errors generated.
xplat/caffe2/caffe2/share/fb/stylizer/median_blur_ops.cc:593:14: error: private field 'ws_' is not used [-Werror,-Wunused-private-field]
Workspace* ws_;
^
1 error generated.
```
Reviewed By: bwasti
Differential Revision: D15402928
fbshipit-source-id: 5b98499850aa659fd37ab8e7f2e75166787b8129
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20040
Add the support of feature store example in fblearner pytorch predictor, end to end
Reviewed By: dzhulgakov
Differential Revision: D15177897
fbshipit-source-id: 0f6df8b064eb9844fc9ddae61e978d6574c22916
Summary:
load_state_dict includes a recursive inner function `load` that captures
Tensors through the close-over variable `state_dict`. Because it's
recursive, it also captures itself leading to a reference cycle.
This breaks the reference cycle so that any Tensors in state_dict can be
collected immediately instead of waiting until the next GC cycle.
Alternatively, we could have passed `state_dict` and `metadata` as
arguments to load to prevent capture of Tensors. (That would still
result in cyclic garbage, but not any cyclic garbage of Tensors).
See:
https://github.com/pytorch/pytorch/issues/20199#issuecomment-491089004
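The cycle and the fix can be reproduced in a few lines: a recursive closure's cell holds the function and the function holds the cell, so everything it captures (here a stand-in for `state_dict`) stays reachable until the cycle collector runs, while rebinding the name breaks the cycle:

```python
import gc
import weakref

class Payload:
    # stands in for a Tensor held by state_dict
    pass

def load_leaky(state_dict):
    def load(depth):
        _ = state_dict          # closure captures state_dict...
        if depth:
            load(depth - 1)     # ...and itself, forming a reference cycle
    load(2)

def load_fixed(state_dict):
    def load(depth):
        _ = state_dict
        if depth:
            load(depth - 1)
    load(2)
    load = None  # break the cycle so refcounting frees state_dict promptly

p = Payload()
ref = weakref.ref(p)
load_leaky({"w": p})
del p
leaked = ref() is not None      # cycle keeps the payload alive
gc.collect()                    # only the cycle collector frees it
collected = ref() is None

p = Payload()
ref2 = weakref.ref(p)
load_fixed({"w": p})
del p
freed_immediately = ref2() is None
```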
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20397
Differential Revision: D15414834
Pulled By: colesbury
fbshipit-source-id: 4c2275a08b2d8043deb3779db28be03bda15872d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20607
Add a new method SummaryWriter.flush() that iterates through all of the FileWriters and flushes them
Reviewed By: orionr
Differential Revision: D15380124
fbshipit-source-id: 1975f3f61c5ae3754552bfdb23f2cd78f687d19f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20649
I went through every occurrence of AT_ASSERT in this file and
thought about whether or not it should be TORCH_INTERNAL_ASSERT
or TORCH_CHECK. I think I did a good job at it. Some thoughts:
- In order to decide if a check is "internal" or not, we must
think about where the separation between userspace and our internals
are. I think any code that utilizes the PyTorch or Caffe2 C++ frontends
count as userspace. An important corollary is that the majority of operator
code "counts" as userspace, even though it lives in our repository. This
is in line with TCB (trusted computing base) thinking: you want the TCB to
be as small as possible, and because we have a *lot* of operator
implementations, they should not count as TCB.
- The primary test I applied when considering an AT_ASSERT was whether or
not I could trigger this error by just making method calls on caffe2::Tensor
or at::Tensor. If I could, that made it a TORCH_CHECK. This covers most
of the misapplications of TORCH_INTERNAL_ASSERT. One place I didn't
do this was the "is variable" checks; I think you have to work a bit
harder to trigger this case, and userspace code is not mixing up
Variables and Tensors.
- I updated the docs for device_opt_, explaining when it could be nullopt.
(The nullopt checks here are TORCH_CHECK, because you can trigger them
by taking an undefined tensor and poking the methods.)
Differential Revision: D15395576
fbshipit-source-id: 1c51b396012e7d949fbb4258092cf80e5e6f851b
Summary:
Fixes#20651
Communication collectives in `torch.distributed` call `CUDACachingAllocator::recordStream()` on input and output tensors to prevent their memory blocks being freed too early. `CUDACachingAllocator` uses tensor's data pointer to track memory blocks, which does not accept null pointers. However, empty tensor's `storage().data()` might be null. In this case, as there is no associated memory block for the empty tensor, it should be fine to make `recordStream()` a no-op.
Tests only cover `broadcast` empty tensors for GLOO backend, because GLOO does not support empty inputs (facebookincubator/gloo/issues/179). It can be addressed in either `ProcessGroupGloo` or GLOO itself. Will add more tests when that gap is filled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20658
Differential Revision: D15399371
Pulled By: mrshenli
fbshipit-source-id: d29ebd1c72fddae49531f32695f81b89e42e5a4d
Summary:
According to https://pytorch.org/docs/stable/notes/broadcasting.html, in-place operations do not allow the in-place tensor to change shape as a result of the broadcast. Therefore our shape analysis could keep the shape information on inputs.
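A sketch of the rule the shape analysis relies on (hypothetical helper names): broadcast the two shapes right-aligned, and reject any in-place op whose broadcast result would differ from the in-place tensor's own shape:

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    # right-aligned broadcasting, as described in the linked docs
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != 1 and y != 1 and x != y:
            raise ValueError(f"shapes {a} and {b} do not broadcast")
        out.append(max(x, y))
    return tuple(reversed(out))

def inplace_result_shape(self_shape, other_shape):
    # an in-place op must not resize self, so the result keeps self's shape
    if broadcast_shape(self_shape, other_shape) != tuple(self_shape):
        raise ValueError("in-place broadcast would change self's shape")
    return tuple(self_shape)
```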
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20661
Differential Revision: D15406477
Pulled By: wanchaol
fbshipit-source-id: 8ab60e783292f2fe26e5fdecfb64bec43bca6826
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20561
We previously planned to deprecate the direct passing of a kernel function or lambda to the op() call, e.g.
static auto registry = RegisterOperators().op("my::op", &func);
and push users towards the options based API:
static auto registry = RegisterOperators().op("my::op", RegisterOperators::options().kernel<decltype(func), &func>());
because that has a slightly lower performance overhead when calling the kernel.
However, that overhead is negligible for all but exotic use cases, so there's no reason to push users towards a more verbose API.
This diff removes the deprecation warning from that API.
However, if you use the API together with deprecated types like std::unordered_map, you will now get a deprecation warning there.
Reviewed By: zdevito
Differential Revision: D15364271
fbshipit-source-id: 56dae0c5870bbab16ad19ba5178f4bea9eafed9f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20514
Change API from
static auto registry = c10::RegisterOperators()
.op("my::op",
c10::kernel(...),
c10::dispatchKey(...)
);
to
static auto registry = c10::RegisterOperators()
.op("my::op", c10::RegisterOperators::options()
.kernel(...)
.dispatchKey(...)
);
because this allows better discoverability. People looking for which options are available will easier find it and IDE autocompletion will work better.
Reviewed By: zdevito
Differential Revision: D15346348
fbshipit-source-id: 4b74a33b75c2b9cda4a903639fb7abd2c7cff167
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20672
Current test case looks for q->int_repr->dq pattern and constant nodes also.
The prim::Constant nodes are not guaranteed to be present at same point in graph.
So we modify the test case to only look for the q->int_repr->dq nodes.
Differential Revision: D15405606
fbshipit-source-id: 2086ffb5bbd328d2a9a55f4c2a2de342575194d3
Summary:
Otherwise users see something like (Tensor, Tensor)? and don't know what the ? means.
First commit is formatting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20657
Differential Revision: D15400225
Pulled By: eellison
fbshipit-source-id: cf826790bf2ddafd34f6d5c144526cad9904770b
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19578 [jit] Try to script all Python functions**
This adds the `torch.jit._enable_recursive_script` context manager, which will try to compile any Python functions it sees. It's hidden behind an internal context manager for now since it's incomplete (doesn't work for script_methods/Python submodules). If it can't compile the Python function it outputs an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19578
Pulled By: driazati
Differential Revision: D15386727
fbshipit-source-id: 4e308f67677b8e9fccfc525a91bb2f4585062048
Summary:
Adds support for `__getstate__` and `__setstate__` on modules that are called as part of export (`torch.save()`) and import (`torch.jit.load`).
* `__getstate__` and `__setstate__` must be TorchScript functions with the signatures `() -> T` and `(T) -> None` respectively
* The results of `__getstate__` are stored using the pickler in `states.pkl` with one for each module in definition order (`__getstate__` returns `None` by default if an implementation is not provided)
* This prevents sharing between `__getstate__` and attributes, but this should be fine since their use is mostly unrelated (attributes are for storing values to be used in script methods, `__getstate__` for running arbitrary computations during import)
Follow up
* Somehow replacing `__getstate__`/`__setstate__` with a `ScriptMethodStub` makes `MyScriptModule().__getstate__()` call `ScriptModule.__getstate__()` when used in Python. This should be fixed so semantics in Python are preserved, but it doesn't affect the typical usage.
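The protocol mirrors what `pickle` does for plain Python objects; a minimal sketch of the contract (`() -> T` paired with `(T) -> None`, class names are illustrative):

```python
import pickle

class Counter:
    def __init__(self):
        self.n = 0
        self._scratch = lambda: None  # unpicklable state we don't want saved

    def __getstate__(self):           # () -> T
        return self.n

    def __setstate__(self, state):    # (T) -> None
        self.n = state
        self._scratch = lambda: None  # rebuilt on load, never serialized

c = Counter()
c.n = 5
restored = pickle.loads(pickle.dumps(c))  # works despite the lambda attribute
```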
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20242
Pulled By: driazati
Differential Revision: D15287161
fbshipit-source-id: b3f5f33ab74a21a89e6d15460af63aff75cab2d8
Summary:
- Earlier, we had to use the legacy implementation of `getri` for single matrix inverse from TH and THC
- Now, this has been moved to ATen
Changelog:
- Move single matrix inverse implementation to ATen
- Remove unused code in TH and THC resulting from the change
- Minor modifications made to single matrix CPU function implementations in ATen to avoid redundancy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20534
Differential Revision: D15393383
Pulled By: ezyang
fbshipit-source-id: 81972111cd9757d15f1d634f294c93fd0f35636c
Summary:
Recent versions of GCC split unaligned load and store intrinsics into
two 128-bit instructions. On old processors (Sandy Bridge) this was a
bit faster for unaligned data, but bit slower for aligned data. On new
processors (Intel Haswell+, recent AMD) splitting loads is slower on
both aligned and unaligned data.
Clang, MSVC, and ICC do not split unaligned load and store intrinsics.
There's a good explanation here:
https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
Splitting load and store intrinsics makes no sense in our AVX2
configuration because the CPUs that support AVX2 instructions are the
same CPUs where splitting is disadvantageous on all data alignments.
Note that this doesn't change the AVX configuration (used by CPUs that
support AVX but not AVX2). It's possible this would be beneficial for
that configuration too (our data is usually 32-byte aligned), but I'd
prefer the conservative change for now.
torch.add generated assembly (hot loop) (GCC 7.3.0)
before:
https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after:
https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91
Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell,
no turbo):
before: 7.35 us after: 6.39 us
(Take the torch.add timings with a grain of salt. The difference in timings
is much larger than I would expect.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f
Summary:
Change `Inputs` to `Shape` to unify the format of CTCLoss `class`, and add the type of `Output` in `Shape`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20422
Differential Revision: D15393484
Pulled By: ezyang
fbshipit-source-id: 5b49647f9740de77db49a566fa2de74fcecd9110
Summary:
CUDA 8 is no longer supported and removed from CI, so these checks are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20482
Differential Revision: D15393438
Pulled By: ezyang
fbshipit-source-id: ac0979bf660b3314eec502c745e34ce4940bda0e
Summary:
Fixes#20568.
Looks like CMake is passing `/MD` when we call `add_library`. We need to fix these with C source files too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20574
Differential Revision: D15392682
Pulled By: ezyang
fbshipit-source-id: c92034d8725fcec48fd7db6cf5322868e956dc6b
Summary:
Fixes#20523 .
nn.Upsample was unable to accept tuple inputs for the scale_factor argument due to direct casting to float, which was done in #17732.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20581
Differential Revision: D15392622
Pulled By: ezyang
fbshipit-source-id: b56ba8197a5bbf8891bc7e1bebf5cad63dcab04d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20441
This op is fairly complex and the fact that it isn't formatted
correctly makes things that much harder to reason about. Clean it up.
Reviewed By: dreiss
Differential Revision: D15220006
fbshipit-source-id: 30632d8bdbf15f96e73d8b6c96c5f29c052e6e7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20502
Following D15307410 removing more floating point exceptions in unit tests
Reviewed By: hx89
Differential Revision: D15340930
fbshipit-source-id: 269fc75e0800bc9d39126767a0f3ca15cd8b0cad
Summary:
(Reopens https://github.com/pytorch/pytorch/pull/20330 and fixes test error.)
After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need do shallow-copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20614
Differential Revision: D15385811
Pulled By: yf225
fbshipit-source-id: e963fcf5e4097f8c881b56145f408565d97cf5c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19932
In preparation to add int8_t data type for QTensor
Reviewed By: zafartahirov
Differential Revision: D15137838
fbshipit-source-id: 59462c36d6fc5982986d4196bf3f32f49bb294d7
Summary:
After the Variable/Tensor merge, there is no guarantee that `indices` and `values` passed into the sparse tensor constructor don't contain AutogradMeta. However, we want to maintain the existing invariant that `indices_` and `values_` of a sparse tensor don't contain AutogradMeta, and to achieve this we need do shallow-copy in the sparse tensor constructor.
Note that this is BC-breaking for code that changes the sizes / strides of the indices or values tensor after it's used to create a sparse tensor. In current master, such changes will be reflected in the sparse tensor and break sparse tensor invariants. After this PR, those changes will not be reflected in the sparse tensor, and thus the sparse tensor invariants are always preserved. Specifically, running in-place size/stride-changing ops such as `resize_` / `resize_as_` / `as_strided_` / `set_` / `transpose_` on the original values tensor will not update the sparse tensor's `values_`. For example:
```python
# Calling resize_ on non-requires-grad value tensor
i2 = torch.zeros([1, 1])
v2 = torch.ones([1, 2, 3])
t2 = torch.sparse_coo_tensor(i2, v2, torch.Size([2, 2, 3]))
v2.resize_(4, 5)
t2.coalesce().values().size()
# On current master, this throws "indices and values must have same nnz, but got nnz from indices: 1, nnz from values: 4", because resizing the original value tensor affects `values_` of the sparse tensor.
# After this PR, this prints "torch.Size([1, 2, 3])", which means resizing the original value tensor doesn't affect `values_` of the sparse tensor.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20330
Differential Revision: D15373683
Pulled By: yf225
fbshipit-source-id: 32e7275d7121e17937c7cc258e8a60bb0848ff25
Summary:
Currently `bmm()` has very heavy performance overhead on CPU due to construction/deconstruction of `TensorImpl`. Applying `TensorAccessor` when indexing tensor data can greatly improve the performance.
I tested this on `fairseq` Transformer model. Results on Xeon 6148 (20*2 cores 2.5GHz) indicate this PR improves Transformer training performance by approximately **10%** (seconds per iteration reduced from **3.60** to **3.21**). Considering the fact that `bmm()` takes only **14%** of the total time, 10% overall improvement indicates `bmm()` itself improves by roughly **3x**.
Before:
```
| epoch 001: 0%| | 43/25337 [02:34<25:17:11, 3.60s/it, loss=16.179, nll_loss=16.137, ppl=72045.59, wps=1320, ups=0, wpb=4758.767, bsz=136.558, num_updates=43, lr=6.45e-06, gnorm=6.88
```
After:
```
| epoch 001: 0%| | 23/25337 [01:13<22:32:48, 3.21s/it, loss=17.072, nll_loss=17.068, ppl=137419.42, wps=1478, ups=0, wpb=4746.870, bsz=128.348, num_updates=23, lr=3.45e-06, gnorm=10.
```
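For reference, the access pattern `bmm()` performs is the plain triple loop below (pure-Python sketch); the PR's change is to drive these indices through a raw `TensorAccessor` rather than constructing tensor objects per index:

```python
def bmm(a, b):
    # batched matmul over nested lists: out[i] = a[i] @ b[i]
    B, N, M = len(a), len(a[0]), len(a[0][0])
    P = len(b[0][0])
    out = [[[0.0] * P for _ in range(N)] for _ in range(B)]
    for i in range(B):
        for n in range(N):
            for m in range(M):
                v = a[i][n][m]
                for p in range(P):
                    out[i][n][p] += v * b[i][m][p]
    return out
```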
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20266
Differential Revision: D15262201
Pulled By: cpuhrsch
fbshipit-source-id: c2e4e406c06714b04cc7534f3da71e986eddca35
Summary:
In onnx spec, the supported input/output type for `And` and `Or` is `Bool` only.
Thus in exporting, cast to/from `Bool` is inserted for input/output.
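The exported pattern is Cast-to-Bool, the logical op, then Cast back to the original dtype; sketched on plain lists (illustrative names, not the exporter's API):

```python
def export_logical_and(x, y, out_type=int):
    # Cast -> And -> Cast, mirroring the inserted ONNX nodes
    return [out_type(bool(a) and bool(b)) for a, b in zip(x, y)]
```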
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17894
Reviewed By: zrphercule
Differential Revision: D15103148
Pulled By: houseroad
fbshipit-source-id: 3e1068ea236c743260d42882fb11f0e3a21707e6
Summary:
First time this was merged it broke master and was reverted. This time I do not add ```set -u``` to the .circleci/scripts/setup* scripts. There's still a chance that ```set -u``` breaks the binary builds on master, but at least those can be fixed in parallel and don't completely eliminate signal from all merges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20540
Differential Revision: D15373444
Pulled By: pjh5
fbshipit-source-id: 0203c20865827366ecd8fa07b2db74d255549ed1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20501
Fixing unit tests related to optimizer related operators and tests
Reviewed By: hx89
Differential Revision: D15307410
fbshipit-source-id: e5400c26e08f26191ee542fe6b02e0a69bc4e1ae
Summary:
#19975 was separated by 2 PRs.
This one:
Introduce MemoryFormat argument to the `x.is_contiguous(memory_format=torch.channels_last)` and to the `y = x.contiguous(memory_format=torch.channels_last)` functions.
At this moment both functions just operate on strides and don't store any tensor state.
(Original RFC #19092)
-----
Expands functionality of two tensor functions `.is_contiguous` and `.contiguous` (both python and c++ api).
Note: We had several complaints about `.to(memory_format)` function, and decided not to support it.
1. `.contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- Using `torch.contiguous_format` will preserve existing `.contiguous()` behavior.
- Calling `x.contiguous(memory_format=torch.channels_last)` returns new tensor which maintain same semantical layout (NCHW), but have different memory allocation pattern.
`x.contiguous(memory_format=torch.channels_last)` expects input tensor to be 3d, 4d or 5d; and fails otherwise.
2. `.is_contiguous` now support optional keyword-only argument - `memory_format`, which can be either `torch.contiguous_format` or `torch.channels_last`.
- `x.is_contiguous(memory_format=torch.contiguous_format)` preserves same functionality as `x.is_contiguous()` and remains unchanged.
- `x.is_contiguous(memory_format=torch.channels_last)` returns true if A) input tensor is contiguous in memory AND B) allocated in memory in NHWC (or similar for 3d,5d) format.
Note: By the end of the phase one `x.is_contiguous(memory_format=torch.channels_last)` will calculate state of the Tensor on every call. This functionality going to be updated later.
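In stride terms, the distinction the two memory formats make is (a sketch; helper names are illustrative):

```python
def contiguous_strides(shape):
    # standard row-major (NCHW) strides
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))

def channels_last_strides(shape):
    # NCHW sizes with NHWC memory layout: channels become the fastest-moving dim
    n, c, h, w = shape
    return (c * h * w, 1, w * c, c)
```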
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20455
Differential Revision: D15341577
Pulled By: VitalyFedyunin
fbshipit-source-id: bbb6b4159a8a49149110ad321109a3742383185d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20513
They've been using an old API, switch them to the new one instead.
Reviewed By: li-roy
Differential Revision: D15346349
fbshipit-source-id: 538eb460897ec6addebeebf88b316eb0d6b1dd6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20379
The legacy custom op API allowed nesting of std::unordered_map and std::vector. While we haven't figured out yet how to do that with the new API,
we at least have to keep backwards compatibility. This diff adds the feature so we can switch to the new API without breaking third party code.
Reviewed By: li-roy
Differential Revision: D15287693
fbshipit-source-id: bb5b8429fddf6298719cbf567b584ed371f8fc81
Summary:
Previously, the caller of `shallow_copy_and_detach()` is responsible for deciding whether the shallow-copy should share the source TensorImpl's version counter, or have its own new version counter. However, since this decision is crucial for ensuring the correctness of the shallow-copy's version counter, we want to enforce users of `shallow_copy_and_detach()` to pass a version counter to the function call, so that they are required to make the decision at the time of API usage, not as an afterthought.
For similar reasons, we want to enforce users of `shallow_copy_and_detach()` to pass `allow_tensor_metadata_change` to the function call, so that they are required to decide "whether the TensorImpl shallow-copy should allow tensor metadata change" at the time of API usage, not as an afterthought.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20496
Differential Revision: D15363620
Pulled By: yf225
fbshipit-source-id: a65e74738b10452668d6dc644b43aad5b3d8c9e6
Summary:
Remove weak_script. After recently splitting the forward() function in the MultiheadAttention module, we noticed a memory leak on GPU. Fix the problem by removing those "weak_script" decorators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20563
Differential Revision: D15368262
Pulled By: zhangguanheng66
fbshipit-source-id: 475db93c9ee0dbaea8fb914c004e7d1e0d419bc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20543
All of that code for concatenating strings together adds up. Just discard it all for mobile builds.
Reviewed By: ljk53
Differential Revision: D15353447
fbshipit-source-id: a82dd0b884335d662605aabf7dd3d09dfcc1478b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20439
This is the QTensorProto workflow for multi-group quantization on the C2 side.
No DNNLOWP tensor related things are included in this PR, so once we finish the Glow side, we should be able to test this PR using ResNet50.
Reviewed By: yinghai
Differential Revision: D15096919
fbshipit-source-id: 741eecd59eb79d24d9fe2b035f6246d42422d25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19816
We need this to support quantization for bias.
Add a third argument of ScalarType to `quantize_linear`.
Differential Revision: D15094174
fbshipit-source-id: f19ec8f4716cf5fe0aa21b38d45af6d27c9ab377
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20512
Fixing typos in the description of schema for one of the inputs for BatchMatMul operator.
Reviewed By: jianyuh, BIT-silence
Differential Revision: D15343879
fbshipit-source-id: 06354e8e6b0d79fea937ed2703bb457b2d04f859
Summary:
Fix a typo in the doc.
Add an AssertError check back to MultiheadAttention module
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20492
Differential Revision: D15349008
Pulled By: cpuhrsch
fbshipit-source-id: 2d898345f03787c713e537673613a748ad826b34
Summary:
The current variance kernels compute mean at the same time. Many times we want both statistics together, so it seems reasonable to have a kwarg/function that allows us to get both values without launching an extra kernel.
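Assuming the `var_mean` spelling that eventually landed for this feature, a quick sketch of getting both statistics from one call:

```python
import torch

x = torch.arange(6, dtype=torch.float32)

# One call returns both statistics instead of two separate kernel launches.
var, mean = torch.var_mean(x)
assert torch.isclose(mean, x.mean())
assert torch.isclose(var, x.var())
```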
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18731
Differential Revision: D14726082
Pulled By: ifedan
fbshipit-source-id: 473cba0227b69eb2240dca5e61a8f4366df0e029
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20374
This test case now also tests that the argument type works correctly in kernels that
- don't return outputs
- return multiple outputs
Reviewed By: li-roy
Differential Revision: D15298233
fbshipit-source-id: 82ab9d81b55b4f9fb34d66a155cc426af8592e25
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20373
- Add support for Dict<Key, Value> arguments and returns to c10 operators
- Add support for std::unordered_map<Key, Value> to the legacy API (but not to c10 kernels)
Reviewed By: li-roy
Differential Revision: D15298235
fbshipit-source-id: 6d9793db1f12bea377f508a9b33a495ebe0bec18
Summary:
Add automatic translations for a few argument names that commonly differ between PyTorch and NumPy.
For now, they are as follows:
* `keepdim` -> `keepdims`
* `dim` -> `axis`
* `input` -> (any of `a`, `x`, `x1`)
* `other` -> `x2`
Basic examples:
```python
>>> t = torch.randn(10, 10)
>>> torch.sum(x=t, axis=1)
tensor([ 0.5199, -0.3768, 4.3619, -0.9105, 1.1804, 1.0837, -0.9036, 0.2365,
1.1171, -0.0999])
```
```python
>>> torch.add(x1=5, x2=6)
tensor(11)
```
The additional overhead is zero when using traditional PyTorch argument names, and a few (usually 1) extra PyDict lookups when using NumPy argument names.
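The translation itself amounts to a per-keyword dictionary lookup; a hypothetical Python-level sketch of the idea (the real lookup lives in the C++ argument parser, and these helper names are illustrative):

```python
# Hypothetical sketch: the real implementation is in the C++ argument parser.
NUMPY_TO_TORCH = {
    "keepdims": "keepdim",
    "axis": "dim",
    "a": "input",
    "x": "input",
    "x1": "input",
    "x2": "other",
}

def translate_kwargs(kwargs):
    # One dict lookup per NumPy-style keyword; zero cost for positional args.
    return {NUMPY_TO_TORCH.get(name, name): value
            for name, value in kwargs.items()}
```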
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20451
Differential Revision: D15337521
Pulled By: umanwizard
fbshipit-source-id: 7a7d389786f4ccf5c86a14ecb2002c61730c51b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20020
Add shape inference for the LearningRate op. The output (lr) should have a similar shape to the input (iteration), but not the same type (float vs. int).
Reviewed By: un-disclosed
Differential Revision: D15112300
fbshipit-source-id: 09969aefa15172a6f3c70cd9b2548e3020da5d7a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20372
Implement a Dict type that allows us to abstract away from the concrete implementation used.
The API is similar to std::unordered_map, but behind the scenes we can switch to any map implementation we like. ska::flat_hash_map, google dense map, or any future map implementation with better performance.
Switching such an implementation choice does not have to break backwards compatibility of kernel code using the Dict type.
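c10::Dict itself is C++, but the abstraction idea can be sketched in Python: callers program against a small, stable surface while the backing map stays swappable (the class and parameter names here are illustrative, not the actual API):

```python
class StableDict:
    """Illustrative sketch: a dict-like wrapper whose backing store can be
    swapped (e.g. for a flat hash map) without breaking caller code."""

    def __init__(self, backing_factory=dict):
        # Only this line changes when the implementation is swapped.
        self._impl = backing_factory()

    def __setitem__(self, key, value):
        self._impl[key] = value

    def __getitem__(self, key):
        return self._impl[key]

    def __contains__(self, key):
        return key in self._impl

    def __len__(self):
        return len(self._impl)
```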
Reviewed By: zdevito
Differential Revision: D15298234
fbshipit-source-id: b5ad368a9e9516030805cd8f5f1b02e3986933c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20463
Source file changes mostly involve ifdef'ing-out references to JIT code
from files that are part of Caffe2Go. Update Internal build scripts to
remove those files from our globs.
After this, changes to most of the JIT files should not trigger mobile CI.
Reviewed By: dzhulgakov
Differential Revision: D15329407
fbshipit-source-id: 48f614c6b028eef0a03ce5161d083a3e078b0412
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20021
Add shape inference for the AtomicIter operator. The operator takes two blobs, iteration and iter_mutex, as input and outputs iteration, which should have the same type and shape as the input.
Reviewed By: un-disclosed
Differential Revision: D15111643
fbshipit-source-id: 0d06413305cc4c6257c0cfabf62fb874970803bc
Summary:
Moving functions from torch/nn/modules/activation.py to torch/nn/functional.py. For functions not implemented (_get_input_buffer and _set_input_buffer), a TODO is added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20415
Differential Revision: D15318078
Pulled By: jamarshon
fbshipit-source-id: 5ca698e2913821442cf8609cc61ac8190496a3c6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390
duc0 Ngo implemented observing floating point exceptions, but there were a couple of places where we had "benign" floating point exceptions leading to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph on a partially uninitialized array in the remainder loop.
Reviewed By: hx89
Differential Revision: D15307358
fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9
Summary:
Tagging along to changes in #20191 which added more support for types in the pickler
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20444
Pulled By: driazati
Differential Revision: D15321463
fbshipit-source-id: 985061bf5070a7d7bad58ea8db11d531f3d13e74
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20108
Add cpp runs for c2, hooked up via pybinds. Print output to terminal. This is not hooked up with the pep output yet because I'd like to verify the numbers first.
Note that this isn't quite the same mechanism as the pytorch cpp hookup, which uses cpp_python_extensions. If I can use the same mechanism to pull all the inputs for c2 through cpp and do FeedBlobs in cpp, then I'll switch to that.
Reviewed By: zheng-xq
Differential Revision: D15155976
fbshipit-source-id: 708079dacd3e19aacfe43d70c5e5bc54da2cf9e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20321
First part of https://github.com/pytorch/pytorch/issues/20287
- Rename `AT_ASSERT` to `TORCH_INTERNAL_ASSERT`
- Make `TORCH_INTERNAL_ASSERT` work with variadic inputs
- Deprecated `AT_ASSERT` and `AT_ASSERTM`
- Rename `AT_CHECK` to `TORCH_CHECK`
- Make `TORCH_CHECK` give a better error message when no arguments are
provided
- Deprecate `AT_ERROR` in favor of `TORCH_CHECK(false, ...)`
- Deprecate `AT_INDEX_ERROR` in favor of `TORCH_CHECK_INDEX(false, ...)`
- Rename `AT_WARN` to `TORCH_WARN`
No use sites are changed; I'll work on that in follow up patches
(or disable the deprecation, if necessary.)
Differential Revision: D15278439
fbshipit-source-id: 7e0ed489d4e89e5f56b8ad7eafa72cb9a06065ee
Summary:
In https://github.com/pytorch/pytorch/pull/18223/files#diff-77a6f3462f2233b921d3042412fed6d3R178, we used `auto saved_version_ = data_.unsafeGetTensorImpl()->version_counter().current_version()` and then `new_data_impl_copy->set_version_counter(saved_version_)`, which actually doesn't preserve the original semantics that `var.set_data(tensor)` should keep `var`'s version counter object intact. This PR fixes the bug and adds test to make sure it doesn't happen again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20391
Differential Revision: D15323430
Pulled By: yf225
fbshipit-source-id: e3ba49b51ec8ccecd51c80cb182387f74cfd2b2b
Summary:
As part of the Variable/Tensor merge, we allow passing Tensor with AutogradMeta into ATen ops, but we want to make sure they are not treated as Variables (i.e. their `is_variable()` is false). This PR makes the necessary change to make this work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20392
Differential Revision: D15321899
Pulled By: yf225
fbshipit-source-id: c2ab09db73c63bd71ba2d8391095f4d6b4240a9a
Summary:
"then the output would also has k tensors" -> "then the output would also have k tensors"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20425
Differential Revision: D15320152
Pulled By: zou3519
fbshipit-source-id: b04e2ccd29c6a3e33ad1040d0ea975a01a7bd9b5
Summary:
As a first step for this plan: https://github.com/pytorch/pytorch/issues/19508#issuecomment-485178192, this PR moves `THCTensor_(uniform)` to ATen. Major changes are:
- `uniform_` cuda kernel now utilizes a philox generator.
- the kernel also utilizes TensorIterator
- the kernel uses a grid-stride loop to achieve peak effective bandwidth
- Since the engine has changed from `curandStateMTGP32` to `curandStatePhilox4_32_10`, the randoms generated now will be different.
- Here is the diff showing codegen changes: https://gist.github.com/syed-ahmed/4af9ae0d42b6c7dbaa13b9dd0d1dd1e8 (BC breaking change if any)
- Philox4_32_10 is known to pass the standard TestU01 Big Crush test (https://www.thesalmons.org/john/random123/papers/random123sc11.pdf) and hence the quality of random numbers generated isn't an issue when compared to the previously used `curandStateMTGP32`.
- I have added a test case in `aten/src/ATen/test/cuda_distributions_test.cu` which verifies that philox offset is incremented properly
The benchmark was done on a DGX station with 4 V100s.
I modified the script from jcjohnson's [multinomial benchmark](https://github.com/jcjohnson/pytorch-multinomial-benchmark) to produce this notebook, which shows that there is a general speedup with this PR and that no regression has been introduced: https://gist.github.com/syed-ahmed/9d26d4e96308aed274d0f2c7be5218ef
To reproduce the notebook:
- Run https://gist.github.com/syed-ahmed/4208c22c541f1d30ad6a9b1efc1d728f in a container with the current pytorch top of tree with the command: `python uniform_benchmark.py --stats_json before.json`
- Apply this diff to the current pytorch top of tree and run the same script in a container with the command: `python uniform_benchmark.py --stats_json after.json`
- Run the notebook attached above with the `after.json` and `before.json` in the same directory
The effected bandwidth was calculated using the script (thanks to ngimel ): https://gist.github.com/syed-ahmed/f8b7384d642f4bce484228b508b4bc68
Following are the numbers before and after.
```
uniform, size, elements 65536 forward 5.168914794921875e-06 bandwidth (GB/s) 50.71548098597786
uniform, size, elements 131072 forward 5.056858062744141e-06 bandwidth (GB/s) 103.67860705101367
uniform, size, elements 262144 forward 7.164478302001953e-06 bandwidth (GB/s) 146.357621001797
uniform, size, elements 524288 forward 1.1217594146728515e-05 bandwidth (GB/s) 186.9520302275877
uniform, size, elements 1048576 forward 1.923084259033203e-05 bandwidth (GB/s) 218.10297600317384
uniform, size, elements 2097152 forward 3.640890121459961e-05 bandwidth (GB/s) 230.39992200138826
uniform, size, elements 4194304 forward 6.778717041015625e-05 bandwidth (GB/s) 247.49839679819922
uniform, size, elements 8388608 forward 0.00012810707092285157 bandwidth (GB/s) 261.92490202361347
uniform, size, elements 16777216 forward 0.00025241613388061524 bandwidth (GB/s) 265.86598474620627
uniform, size, elements 33554432 forward 0.000497891902923584 bandwidth (GB/s) 269.5720239913193
```
```
uniform, size, elements 65536 forward 5.550384521484375e-06 bandwidth (GB/s) 47.22988091821306
uniform, size, elements 131072 forward 5.581378936767578e-06 bandwidth (GB/s) 93.93520954942333
uniform, size, elements 262144 forward 6.165504455566406e-06 bandwidth (GB/s) 170.071404141686
uniform, size, elements 524288 forward 6.3276290893554685e-06 bandwidth (GB/s) 331.4277702414469
uniform, size, elements 1048576 forward 8.509159088134765e-06 bandwidth (GB/s) 492.91639239047356
uniform, size, elements 2097152 forward 1.2989044189453124e-05 bandwidth (GB/s) 645.8218077979443
uniform, size, elements 4194304 forward 2.347707748413086e-05 bandwidth (GB/s) 714.6211452997259
uniform, size, elements 8388608 forward 4.4286251068115234e-05 bandwidth (GB/s) 757.6715389250498
uniform, size, elements 16777216 forward 8.672237396240235e-05 bandwidth (GB/s) 773.8356427961071
uniform, size, elements 33554432 forward 0.00016920566558837892 bandwidth (GB/s) 793.2224227438523
```
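The effective-bandwidth figures above are bytes written divided by kernel time; a rough sketch of that computation (the linked gist is authoritative; this version also runs on CPU):

```python
import time
import torch

def uniform_bandwidth_gbs(num_elements, device="cpu", iters=10):
    # Effective bandwidth in GB/s: bytes written by uniform_ per second.
    x = torch.empty(num_elements, device=device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        x.uniform_()
    if device == "cuda":
        torch.cuda.synchronize()
    seconds = (time.perf_counter() - start) / iters
    return num_elements * x.element_size() / seconds / 1e9
```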
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20292
Differential Revision: D15277761
Pulled By: ezyang
fbshipit-source-id: 8bfe31a01eeed77f0ed6e7ec4d2dda4c6472ecaa
Summary:
To fully support the incremental_state function, several additional utils available in fairseq are required. However, we lack a proper unit test for it. Therefore, the incremental_state function will be disabled for now. If it is needed in the future, a feature request could be created. Fixes #20132
Add some unit tests to cover the arguments of MultiheadAttention module, including bias, add_bias_kv, add_zero_attn, key_padding_mask, need_weights, attn_mask.
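A sketch of exercising those arguments (shapes follow the (seq_len, batch, embed_dim) convention; the sizes here are arbitrary):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2,
                            add_bias_kv=True, add_zero_attn=True)
query = torch.randn(4, 2, 8)          # (seq_len, batch, embed_dim)
key_padding_mask = torch.zeros(2, 4, dtype=torch.bool)  # nothing masked

out, attn_weights = mha(query, query, query,
                        key_padding_mask=key_padding_mask,
                        need_weights=True)
assert out.shape == (4, 2, 8)
assert attn_weights is not None        # returned because need_weights=True
```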
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20177
Differential Revision: D15304575
Pulled By: cpuhrsch
fbshipit-source-id: ebd8cc0f11a4da0c0998bf0c7e4e341585e5685a
Summary:
We don't need to overlay the VC env when not using Ninja; CMake will deal with it automatically. Overlaying is a no-op when the env is the same as the specified generator, but it generates the error "Cannot find CMAKE_CXX_COMPILER" when they are different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20417
Differential Revision: D15317081
Pulled By: ezyang
fbshipit-source-id: 5d9100321ecd593e810c31158f22c67d3e34973b
Summary:
This is an attempt to isolate unrelated changes from #19228 for easier review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20150
Differential Revision: D15314891
Pulled By: ezyang
fbshipit-source-id: 8c429747ba83ad5aca4cdd8f8086bcf65a326921
Summary:
* Constructs a new type at runtime so that `isinstance` checks work for
weak modules assigned to `ScriptModule`s
* Fix some extraneous names in `__constants__`
* Add `in_features` and `out_features` to `nn.Linear` `__constants__`
Fixes #19363
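The isinstance trick can be sketched in plain Python: build a new class at runtime that subclasses the original module class (the names below are illustrative, not the actual JIT internals):

```python
class Linear:            # stand-in for nn.Linear
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

def make_weak_type(cls):
    # Construct a new type at runtime; instances still pass
    # isinstance checks against the original class.
    return type("Weak" + cls.__name__, (cls,), {"_is_weak": True})

WeakLinear = make_weak_type(Linear)
layer = WeakLinear(3, 4)
assert isinstance(layer, Linear)
assert layer._is_weak
```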
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20190
Pulled By: driazati
Differential Revision: D15302350
fbshipit-source-id: 1d4d21ed44ab9578a4bc2a72396a82e9bbcd387c
Summary:
TensorList, DoubleList, and BoolList were missing from the pickler, so
this adds them.
As a follow up a lot of the code for these could be templated and cut
down
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20191
Pulled By: driazati
Differential Revision: D15299106
fbshipit-source-id: f10c0c9af9d60a6b7fb8d93cea9f550b1a7e2415
Summary:
Given that tensorboardX and our PyTorch 1.1 release had `log_dir` as the argument for SummaryWriter initialization and member variable (which some users access), we need to preserve this name. However, we might deprecate this in the future and I've added a `get_logdir` method that can be used in the future.
cc natalialunova, lanpa
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20382
Reviewed By: NarineK
Differential Revision: D15300941
Pulled By: orionr
fbshipit-source-id: a29a70fcbc614a32ebfa6c655962fdff081af1af
Summary:
This code is unused and has been superseded by TensorIterators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20207
Differential Revision: D15240832
Pulled By: cpuhrsch
fbshipit-source-id: 4f600bb8645f9b28a137e2cefb099978f5152d05
Summary:
This PR add Poisson NLL loss to aten and substitute the python implementation with a call to the c++.
Fixes #19186.
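A quick sketch of the resulting op via `torch.nn.functional.poisson_nll_loss` (with the default `log_input=True`, the per-element loss is `exp(input) - target * input`):

```python
import torch
import torch.nn.functional as F

log_rate = torch.randn(5)
target = torch.tensor([0., 1., 2., 3., 4.])

# Defaults: log_input=True, full=False, reduction='mean'.
loss = F.poisson_nll_loss(log_rate, target)
expected = (log_rate.exp() - target * log_rate).mean()
assert torch.isclose(loss, expected)
```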
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19316
Differential Revision: D15012957
Pulled By: ezyang
fbshipit-source-id: 0a3f56e8307969c2f9cc321b5357a496c3d1784e
Summary:
This PR is an intermediate step toward the ultimate goal of eliminating "caffe2" in favor of "torch". This PR moves all of the files that had constituted "libtorch.so" into the "libcaffe2.so" library, and wraps "libcaffe2.so" with a shell library named "libtorch.so". This means that, for now, `caffe2/CMakeLists.txt` becomes a lot bigger, and `torch/CMakeLists.txt` becomes smaller.
The torch Python bindings (`torch_python.so`) still remain in `torch/CMakeLists.txt`.
The follow-up to this PR will rename references to `caffe2` to `torch`, and flatten the shell into one library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17783
Differential Revision: D15284178
Pulled By: kostmo
fbshipit-source-id: a08387d735ae20652527ced4e69fd75b8ff88b05
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19803
There is no reason to set a specific logging level for this module. Removing it to just use the default logging level.
Differential Revision: D15098834
fbshipit-source-id: 1654c04500c19690ddde03343f2e84b04bb0f1ef
Summary:
Fixes #20250
Not sure if there's any specific design reason to use `add_dependencies()` and manually add a few include dirs, instead of linking the target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20319
Differential Revision: D15294584
Pulled By: ezyang
fbshipit-source-id: 97f813a6b1829dad49958e0f880b33eb95747607
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20351
This was broken because of a merge race between #20282 and the stack in #20236.
Cleaned up the test and comments a bit as well.
Differential Revision: D15292786
fbshipit-source-id: a4379ea700cad959d3a6921fc5ddf9384fb8f228
Summary:
The trick here is that creating a mapping from const values to
const values means that downstream clients that want to mutate
the output of the mapping are stuck. However, a mapping from
const values to non-const values is just fine and doesn't put
constraints on downstream clients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20303
Differential Revision: D15284076
fbshipit-source-id: 16206fd910dd5f83218525ca301b1889df0586cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20236
Use the new version of broadcast_coalesced that deals with both CPU
and CUDA models. Add tests that evaluate correctness of
DistributedDataParallel for CPU models.
Closes #17757.
Reviewed By: mrshenli
Differential Revision: D15245428
fbshipit-source-id: d2fa09f68593b3cd1b72efeb13f5af23ebd5c80a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20235
The tests expected to only run for CUDA models. In a future commit we
need to update this to work for CPU models as well. Therefore, we can
no longer rely on only integers being passed for device identifiers.
With this change we pass both the materialized list of devices to use
(as `torch.Device` objects), as well as an optional list of integers.
The latter is specified to exercise the code in the
DistributedDataParallel constructor that turns a list of integers into
CUDA devices, IFF it is used to wrap a single-device CUDA module.
This commit also groups together the 'str' and non-'str' tests. These
used to test passing the list of devices as integers or as
`torch.Device` instances. These are now executed from the same test.
Reviewed By: mrshenli
Differential Revision: D15245429
fbshipit-source-id: 5797ba9db33d2c26db8e7493c91bb52f694285ac
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20234
The differences with the existing function _dist_broadcast_coalesced
is that this one works for both CPU and CUDA tensors and that it has a
maximum number of in flight operations.
This should be the final change needed to have only a single version
of DistributedDataParallel that both supports CPU and CUDA models, or
even a mix of both.
See #17757 for more information.
Reviewed By: mrshenli
Differential Revision: D15228099
fbshipit-source-id: a2113ba6b09b68cb5328f49f4c1960031eb43c93
Summary:
isConvFusion(...) is only valid for Conv ops.
Calling it on a non-Conv op causes a crash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20139
Differential Revision: D15280604
Pulled By: yinghai
fbshipit-source-id: eb45be11990b3bf7c5b45f02ebb6018444ab5357
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20281
This is how it looks like now:
```
<FunctionEventAvg key=mm self_cpu_time=11.404s cpu_time=2.895ms
cuda_time=0.000us input_shapes=[[26, 4096], [4096, 1024]]>
```
Previously I forgot to update the repr for these when I updated it for
non-averaged events.
Differential Revision: D15262862
fbshipit-source-id: a9e5b32c347b31118f98b4b5bf2bf46c1cc6d0d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20185
This test case now also tests that the argument type works correctly in kernels that
- don't return outputs
- return multiple outputs
Reviewed By: li-roy
Differential Revision: D15227621
fbshipit-source-id: 83db7536e9065e0f8c5032d6b96e970bbaf718b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20184
- Add support for Dict<Key, Value> arguments and returns to c10 operators
- Add support for std::unordered_map<Key, Value> to the legacy API (but not to c10 kernels)
Reviewed By: li-roy
Differential Revision: D15227620
fbshipit-source-id: c1ea6c12165e07b74272cb48c6021bdb5c2d7922
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19976
Implement a Dict type that allows us to abstract away from the concrete implementation used.
The API is similar to std::unordered_map, but behind the scenes we can switch to any map implementation we like. ska::flat_hash_map, google dense map, or any future map implementation with better performance.
Switching such an implementation choice does not have to break backwards compatibility of kernel code using the Dict type.
Reviewed By: li-roy
Differential Revision: D15156384
fbshipit-source-id: b9313ec4dd9acb3b6a0035345b6ba4f2a437d1e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20282
Add a unit test to ensure no gradient sync happens when calling ddp.module(input), i.e., not invoking prepare_for_backward.
PyText depends on DDP for data-parallel distributed training. To support accumulating gradients locally before gradient sync, we call orig_model.forward instead of ddp_model.forward. Add a unit test to prevent future changes from breaking this assumption.
Reviewed By: pietern, mrshenli
Differential Revision: D15263155
fbshipit-source-id: 7734e174f507690fb23ea6c52dffff4a93f9b151
Summary:
Fixes #20215
The confusing behavior was caused by typos in type annotation :(
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20306
Differential Revision: D15276216
Pulled By: ailzhang
fbshipit-source-id: 1b0c9635a72a05c9b537f80d85b117b5077fbec7
Summary:
This addresses #18436
The logic replicates the essence of closing file descriptors in numpy:
bf20e30340/numpy/core/include/numpy/npy_3kcompat.h (L278)
This stores the position of the file descriptor before resetting it to the Python handle offset, then resets to the original position before exit. The Python-side handle is then updated to reflect the new position. Also added somewhat more demanding tests to cover this.
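The save/seek/restore dance can be sketched in Python (the helper name is illustrative; the real code is in C++ and driven by the serialization reader):

```python
import os
import tempfile

def read_at_python_offset(f, nbytes):
    # Seek the OS-level descriptor to the Python handle's offset, read,
    # then restore the descriptor's original position; finally update
    # the Python-side handle to reflect the new position.
    fd = f.fileno()
    saved = os.lseek(fd, 0, os.SEEK_CUR)      # remember original position
    os.lseek(fd, f.tell(), os.SEEK_SET)       # jump to Python-side offset
    try:
        data = os.read(fd, nbytes)
        new_offset = os.lseek(fd, 0, os.SEEK_CUR)
    finally:
        os.lseek(fd, saved, os.SEEK_SET)      # restore before returning
    f.seek(new_offset)                        # sync the Python handle
    return data

with tempfile.TemporaryFile("w+b") as f:
    f.write(b"abcdef")
    f.seek(2)                                 # Python-side offset
    assert read_at_python_offset(f, 3) == b"cde"
    assert f.tell() == 5                      # handle reflects new position
```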
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20270
Differential Revision: D15275902
Pulled By: soumith
fbshipit-source-id: 5ca8a52b61c7718d2e69571f72f80b1350b0acdb
Summary:
See Issue #20301
Specifying dim in docstring example to prevent UserWarning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20310
Differential Revision: D15277734
Pulled By: ezyang
fbshipit-source-id: 2e8b748dbe743675a5a538ccbe97713aad02e8ac
Summary:
Schema matching for sort is a little tricky because you have to check whether the class defines the __lt__ method correctly, and you have to mark whatever effects happen in __lt__ to the sorting op as well.
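The contract can be illustrated in plain Python, since the TorchScript sort op mirrors how `sorted()` drives `__lt__` on user-defined classes:

```python
class Pair:
    def __init__(self, key, label):
        self.key = key
        self.label = label

    def __lt__(self, other):
        # sorted() (and the TorchScript sort op) only needs __lt__.
        return self.key < other.key

pairs = [Pair(3, "c"), Pair(1, "a"), Pair(2, "b")]
assert [p.label for p in sorted(pairs)] == ["a", "b", "c"]
```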
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19706
Differential Revision: D15244366
Pulled By: eellison
fbshipit-source-id: 73b3e36462c6cc40f9d8cb235b44499a67d3149e
Summary:
This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compilation issues when building C++ extensions.
In addition to the proposed API change, this PR adds missing include
headers to chunk.h. The current implementation works, but if
users try to create C++ extensions that implement new ChunkDataReaders
alongside the existing ChunkDataset, the build will fail due to
the missing headers.
In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19485
Differential Revision: D15261725
Pulled By: soumith
fbshipit-source-id: 38c9465d665392ae6a2d12c5a520a4f501e1a6ca
Summary:
C++ `Scatter` and `Gather` always set autograd history for input data tensors regardless whether they require grad. This hits assertion failure in `set_history(Tensor, shared_ptr<Function> grad_fn)`
where `grad_fn` cannot be nullptr. After this PR, C++ `Scatter` and `Gather` only record `grad_fn` when required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20286
Differential Revision: D15266610
Pulled By: mrshenli
fbshipit-source-id: 641df0ea36e7c922b5820c8dc3f83e2a050412b5
Summary:
Previously we only had a Python wrapper for `torch.quantized_lstm_cell`. We had the op `torch.quantized_lstm`, but it didn't have a wrapper. This PR adds that wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20249
Reviewed By: driazati
Differential Revision: D15250023
Pulled By: jamesr66a
fbshipit-source-id: f05ad784d903e0ef3a62633c8bf80bad79de48ae
Summary:
This is a step towards enabling the ONNX constant folding pass by default in the PT->ONNX export. In this change we have enabled test points in `test/onnx/test_pytorch_onnx_caffe2.py` to run with constant folding pass enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20290
Reviewed By: zrphercule
Differential Revision: D15271674
Pulled By: houseroad
fbshipit-source-id: 9e59ab46ae74b4ad8dea1a2200ecc1f3eb8aad75
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19660
Implementation of aggregated Scale operator.
The operator takes a list of tensors as input and scales all of them by the float argument value.
The tensor sizes can differ, so bookkeeping of the sizes and pointers to the tensors is
necessary for the GPU version of the kernel.
Reviewed By: BIT-silence
Differential Revision: D14984233
fbshipit-source-id: 37cc97159a4f2c38cd6fff4f5710ab7d3a773611
Summary:
Don't make an alias value for a value that is known to be None. This was preventing constant propagation from running the `out is None` check in nn.functional.normalize, and thus preventing the if statement from being inlined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20112
Differential Revision: D15267328
Pulled By: eellison
fbshipit-source-id: 5b878b0dc50944c2e7a2f583ea483dad9d6bbec3
Summary:
This PR makes `torch.save` call out to the pickler which saves a tensor in the same format that `torch.save()` does, the file looks like `| pickle archive 1 (includes sizes, strides, requires_grad, etc...) | pickle archive 2 (list of tensor keys) | tensor binary data |` and can be read back in with `torch.load(my_file, pickle_module=torch.jit._pickle)`
Fixes #18003
Unpickling in the JIT for things such as model parallelism will be a follow up PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18154
Pulled By: driazati
Differential Revision: D15015160
fbshipit-source-id: ef76a44b8c243f4794cd7e245ec8305e965bc59f
Summary:
Also, the current MKL-DNN primitive code recreates the computation every time, causing tiny convolutions to spend a significant portion of their time on the repeated codegen. ideep has implemented an LRU cache to save the computation, so this change will help improve performance for tiny convolutions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19963
Differential Revision: D15156527
Pulled By: bddppq
fbshipit-source-id: 6a8fbd10a213ec22cdeaff1a2bdb0d09905d1fcd
Summary:
Canonicalize the ordering of outputs of if and loop nodes based on their first usage. Previously we were able to canonicalize output order by sorting on variable name, but this breaks down with outputs added in an early return pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20015
Differential Revision: D15266066
Pulled By: eellison
fbshipit-source-id: ba5340c068a68b1ffc73f056db194b92d3274dc4
Summary:
cc nairbv
All failures I have seen are of this combination, so let's just disable it for all cases. After #20063 I found it failing for py3 once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20172
Differential Revision: D15266527
Pulled By: nairbv
fbshipit-source-id: afb9389dfc54a0878d52975ffa37a0fd2aa3a735
Summary:
As part of supporting writing data into a TensorBoard-readable format, we show more examples of how to use the function in addition to the API docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20008
Reviewed By: natalialunova
Differential Revision: D15261502
Pulled By: orionr
fbshipit-source-id: 16611695a27e74bfcdf311e7cad40196e0947038
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19541
For the quantized FC operator, replace the tuple (Tensor, scale, zero_point) with QTensor.
Differential Revision: D14900407
fbshipit-source-id: 164df38f3564e0a68af21b9fedaba98a44ca1453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19497
Implements a basic quantized FC (uint8 * int8 -> uint8) with FBGEMM APIs.
Related document:
https://fb.quip.com/rP5tAx56ApMM
https://fb.quip.com/MU7aAbzGDesu
Work Item List:
1. [DONE] currently we use prepack routines inside Quantized FC operator. Will separate it as a standalone operator soon.
2. [DONE] rebase to D14817809 and D14994781 (cpp custom types).
3. [DONE] correctness unit test.
4. [To Do] rebase to QTensor. Similar to D14565413, this will be implemented in the next Diff.
Differential Revision: D14761865
fbshipit-source-id: 031a39915fecd947afb4dd2719112b4ddc1082d3
Summary:
Fixes https://github.com/pyro-ppl/pyro/issues/1853
This fixes a memory leak in `torch._dirichlet_grad()`. This function is used for reparametrized gradients for the `Dirichlet` and `Beta` distributions.
- [x] Could a reviewer please confirm that `freeCopyTo()` is being used correctly and doesn't need an additional `decref()`? The author is unfamiliar with PyTorch C++ memory utilities. Help appreciated.
- ran locally and confirmed leak is fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20244
Differential Revision: D15259008
Pulled By: ezyang
fbshipit-source-id: 222ec7d80ddd97bcdd7d54549f3e756575e8402e
Summary:
This just updates the `JIT` comments with the issue number #20215. Hopefully this will stop the proliferation of the workaround. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20222
Differential Revision: D15259007
Pulled By: ezyang
fbshipit-source-id: 5060a351aa618c6dae49d0b7a6ac9b0f57f2490a
Summary:
The earlier fix to extract scripts missed an attach_workspace which was used to make the built binaries available to the nightly build upload jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20265
Differential Revision: D15259080
Pulled By: soumith
fbshipit-source-id: bf835c2cd76976b4563798ee348f7db83c7a79c1
Summary:
The current pytorch config.yml is causing some backend performance
problems on CircleCI, due to the size of the file when all of the YAML
anchors have been expanded. You can view the "processed" config as our
internal systems deal with it, by running `circleci config process`:
circleci config process .circleci/config.yml | wc -c
Before: 2833769 bytes
After: 558252 bytes (~80% less)
Create a new job, `setup`, that has 2 functions:
- Assert that config.yml is up to date
- Put the .circleci/scripts directory into a workspace, so that
downstream jobs can easily access it.
The `setup` job becomes the parent of all jobs in the workflow. This
allows us to fail fast if config is invalid. It might be a good place to
add other, quick, lint checks to help fail the build faster.
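The workflow shape described above can be sketched as a config fragment; the `pytorch_linux_build` job name and the details here are placeholders for illustration, not the actual config:

```yaml
version: 2
workflows:
  build:
    jobs:
      - setup                  # asserts config.yml freshness, persists .circleci/scripts
      - pytorch_linux_build:   # placeholder job name
          requires:
            - setup            # every downstream job fails fast if setup fails
```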
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19674
Differential Revision: D15252864
Pulled By: pjh5
fbshipit-source-id: 0778c7b8f95e7f3f33ac92fbb8862377fc9fb0ac
Summary:
This ensures that custom operators registered through c10::RegisterOperators are recorded in autograd profile traces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20175
Differential Revision: D15221311
Pulled By: jamesr66a
fbshipit-source-id: 9452b24272c2399c20a49af85b62d34cabe6e27a
Summary:
Do tests with common models from torchvision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20007
Differential Revision: D15251754
Pulled By: orionr
fbshipit-source-id: 9dc09bd407b3ccaaa310d2f4a8d53d5a7d12469d
Summary:
We now can build libtorch for Android.
This patch aims to provide two improvements to the build
- Make the architecture overridable by providing an environment variable `ANDROID_ABI`.
- Use `--target install` when calling cmake to actually get the header files nicely in one place.
I ran the script without options to see if the caffe2 builds are affected (in particular by the install), but they seem to run OK and probably only produce a few files in build_android/install.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20152
Differential Revision: D15249020
Pulled By: pjh5
fbshipit-source-id: bc89f1dcadce36f63dc93f9249cba90a7fc9e93d
Summary:
Support operator overloading for User Defined Types, which includes desugaring `a + b` and python builtin functions which call into a method if it is defined like `len(x)`.
See https://rszalski.github.io/magicmethods/ for list of magic methods.
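As a plain-Python illustration of the desugaring involved (the `Vec` class is invented for this example, not taken from the PR), `a + b` dispatches to `__add__` and `len(x)` to `__len__`:

```python
class Vec:
    def __init__(self, xs):
        self.xs = xs

    def __add__(self, other):
        # `a + b` desugars to a.__add__(b)
        return Vec([a + b for a, b in zip(self.xs, other.xs)])

    def __len__(self):
        # `len(x)` desugars to x.__len__()
        return len(self.xs)

v = Vec([1, 2]) + Vec([3, 4])
print(v.xs, len(v))  # [4, 6] 2
```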
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20033
Reviewed By: driazati
Differential Revision: D15246573
Pulled By: eellison
fbshipit-source-id: 03d45dd524ea2a3b40db36843d6067bede27b30d
Summary:
Similar to the "too few blank lines" rule, I feel this is not important enough to warrant breaking signal for all linters when it's violated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20225
Differential Revision: D15243480
Pulled By: suo
fbshipit-source-id: 37cdc18daf09e07081e42b69c72d331d81660217
Summary:
This is useful when you would like to understand performance
bottlenecks of your model. One can use the shape analysis in order to
fit model to a roofline model of their hardware.
Please note that this feature can potentially skew profiling
results. Timing for non-nested events will also become wrong; one
should only use timing for the bottom-most events when shape analysis
is used. For the case where people don't need shapes, profiling
should not be affected, as in this case we don't collect shapes, which
is the default behavior and this diff doesn't change it.
One of the next steps
could be, for example, choosing the best candidates for quantization. In
the scope of this diff I am just adding optional shape collection
to the Event class. On top of that, there is minor functionality in Python
for grouping by shapes.
In the output tables shapes are truncated, but for grouping the full
shape string is used as the key.
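A rough Python sketch of the grouping idea (the event tuples and field names here are invented for illustration; the real profiler stores richer `Event` records):

```python
from collections import defaultdict

def group_by_shapes(events):
    """Aggregate (name, input_shapes, cpu_time_ms) records, using the
    full shape string together with the op name as the key."""
    groups = defaultdict(lambda: {"count": 0, "cpu_total_ms": 0.0})
    for name, shapes, cpu_time_ms in events:
        key = (name, str(shapes))  # full shape string, not truncated
        groups[key]["count"] += 1
        groups[key]["cpu_total_ms"] += cpu_time_ms
    return dict(groups)

events = [
    ("addmm", [[30], [128, 20], [20, 30]], 9.199),
    ("addmm", [[30], [128, 20], [20, 30]], 9.250),
    ("addmm", [[40], [128, 30], [30, 40]], 3.621),
]
grouped = group_by_shapes(events)
```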
Here is an example output:
test_profiler_shapes (test_autograd.TestAutograd) ...
```
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
unsigned short 2.30% 305.031us 2.30% 305.031us 305.031us NaN 0.000us 0.000us 1 [[30, 20]]
addmm 69.40% 9.199ms 69.40% 9.199ms 9.199ms NaN 0.000us 0.000us 1 [[30], [128, 20], [20, 30], [], []]
unsigned short 0.98% 129.326us 0.98% 129.326us 129.326us NaN 0.000us 0.000us 1 [[40, 30]]
addmm 27.32% 3.621ms 27.32% 3.621ms 3.621ms NaN 0.000us 0.000us 1 [[40], [128, 30], [30, 40], [], []]
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
unsigned short 2.30% 305.031us 2.30% 305.031us 305.031us NaN 0.000us 0.000us 1 [[30, 20]]
addmm 69.40% 9.199ms 69.40% 9.199ms 9.199ms NaN 0.000us 0.000us 1 [[30], [128, 20], [20, 30], [], []]
unsigned short 0.98% 129.326us 0.98% 129.326us 129.326us NaN 0.000us 0.000us 1 [[40, 30]]
addmm 27.32% 3.621ms 27.32% 3.621ms 3.621ms NaN 0.000us 0.000us 1 [[40], [128, 30], [30, 40], [], []]
------------------ --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 13.255ms
CUDA time total: 0.000us
```
Also added this for older aggregation test:
```
test_profiler_aggregation_lstm (test_autograd.TestAutograd) ...
======================================================================================================================================================================================================
TEST
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
lstm 0.69% 4.606ms 5.30% 35.507ms 35.507ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.67% 4.521ms 5.27% 35.340ms 35.340ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.66% 4.399ms 5.02% 33.638ms 33.638ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.354ms 4.92% 32.958ms 32.958ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.351ms 4.96% 33.241ms 33.241ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.65% 4.323ms 5.10% 34.163ms 34.163ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.304ms 4.92% 32.938ms 32.938ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.300ms 5.10% 34.172ms 34.172ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.292ms 5.05% 33.828ms 33.828ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
lstm 0.64% 4.263ms 4.98% 33.357ms 33.357ms NaN 0.000us 0.000us 1 [[5, 3, 10]]
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls Input Shapes
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
sigmoid 15.32% 102.647ms 15.32% 102.647ms 171.078us NaN 0.000us 0.000us 600 [[3, 20]]
mul 15.20% 101.854ms 15.20% 101.854ms 169.757us NaN 0.000us 0.000us 600 [[3, 20], [3, 20]]
lstm 12.74% 85.355ms 100.00% 670.120ms 33.506ms NaN 0.000us 0.000us 20 [[5, 3, 10]]
addmm 11.16% 74.808ms 11.16% 74.808ms 249.361us NaN 0.000us 0.000us 300 [[80], [3, 20], [20, 80], [], []]
tanh 9.89% 66.247ms 9.89% 66.247ms 165.617us NaN 0.000us 0.000us 400 [[3, 20]]
split 6.42% 43.019ms 6.42% 43.019ms 215.095us NaN 0.000us 0.000us 200 [[3, 80]]
add 5.67% 38.020ms 5.67% 38.020ms 190.101us NaN 0.000us 0.000us 200 [[3, 80], [3, 80], []]
add 4.81% 32.225ms 4.81% 32.225ms 161.124us NaN 0.000us 0.000us 200 [[3, 20], [3, 20], []]
addmm 3.79% 25.380ms 3.79% 25.380ms 253.796us NaN 0.000us 0.000us 100 [[80], [3, 10], [10, 80], [], []]
unsigned short 3.72% 24.925ms 3.72% 24.925ms 83.083us NaN 0.000us 0.000us 300 [[80, 20]]
----------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- -----------------------------------
Self CPU time total: 670.120ms
CUDA time total: 0.000us
Total time based on python measurements: 691.366ms
CPU time measurement python side overhead: 3.17%
ok
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20035
Differential Revision: D15174987
Pulled By: salexspb
fbshipit-source-id: 9600c5d1d1a4c2cba08b320fed9da155d8284ab9
Summary:
In line 508.
convert_sync_batchnorm is called recursively to convert BatchNorm to SyncBatchNorm, so the process_group should also be passed to the recursive call.
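A pure-Python analogue of the fix (the `Module`, `BatchNorm`, and `SyncBatchNorm` classes here are simplified stand-ins for their `torch.nn` counterparts, not the real implementation): the recursive call must forward `process_group`, otherwise nested BatchNorm layers would be converted with the default group.

```python
class Module:
    def __init__(self, children=()):
        self.children = list(children)

class BatchNorm(Module):
    pass

class SyncBatchNorm(Module):
    def __init__(self, process_group=None):
        super().__init__()
        self.process_group = process_group

def convert(module, process_group=None):
    if isinstance(module, BatchNorm):
        return SyncBatchNorm(process_group)
    # The fix: forward process_group on the recursive call.
    module.children = [convert(child, process_group) for child in module.children]
    return module

net = Module([Module([BatchNorm()])])
net = convert(net, process_group="my_group")
```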
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19240
Differential Revision: D15240318
Pulled By: ezyang
fbshipit-source-id: 0fc9e856392824814991e5e9e8f9513d57f311af
Summary: This can be used for problems where the action vector must sum to 1
Reviewed By: kittipatv
Differential Revision: D15206348
fbshipit-source-id: 665fbed893d8c52d451a12d3bb2e73b2638b7963
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20186
We're missing two USE_CUDA macro for GPU-related code in THD's DataChannelGloo.
Also adding GlooCache back to compilation.
Differential Revision: D15227502
fbshipit-source-id: f260e1cb294d662ba0c170931913b64287d62344
Summary:
Currently, the constant folding pass during ONNX conversion removes all onnx::Constant nodes that are parents of folded nodes. In situations where the parent onnx::Constant node has other subscribers downstream, this could be a problem. This change updates the removal logic to remove only those onnx::Constant nodes that do not have other subscribers downstream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20109
Reviewed By: zrphercule
Differential Revision: D15220392
Pulled By: houseroad
fbshipit-source-id: 150788654ea1c84262becaffd6de152114bf76c0
Summary:
Add logging import and a failed MLP model that confirms that we don't fail `add_graph` when graph optimization fails.
This addresses part of https://github.com/pytorch/pytorch/issues/18903
cc lanpa ezyang natalialunova
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20115
Reviewed By: natalialunova
Differential Revision: D15206765
Pulled By: orionr
fbshipit-source-id: c40b7e2671ef845a1529a2910ba030159f53f393
Summary:
This patch specializes `Optional[Tensor]` graph inputs to either a `DimensionedTensorType` (if a Tensor is passed) or `NoneType`. Other `Optional[T]` are specialized to `T` or `None`.
- For unwrapping (checked and unchecked) we need to keep the output type, as IR code that follows unwrapping may not work with NoneType (just as it doesn't deal with Optional). While it would not be hit during execution, it will run against the (legitimate) assumptions of the analysis passes.
- Function lookup currently will not match NoneType when it expects optional (I'm not entirely sure why this doesn't lead to unhappiness currently, but hey), I amend this at the level of the function matching code (`operator.cpp`), but see Adam's comments. We would run into trouble if we needed to select between functions whose signatures only differ in Optional types with different subtypes, but we would have the same problem when calling them directly, so I would think this is OK.
- It would enable throwing away branches we can't hit. This also reduces the "blockiness" of the graph, so it may be easier to apply optimizations (e.g. fuse things inside `if t is None: ...` and outside the `if`).
- Arguments passed into `Optional[Tensor]` arguments will get shape information, which is very handy.
- It gets rid of the problem that tensors passed into Optional arguments get requires_grad set erroneously #18270 (though that also affects lists, which aren't fixed here).
- `Optional[List[int]]` is needed for #18697.
- We're changing typing in a more subtle way than the `TensorType`->`DimensionedTensorType`.
- In particular, specializing to NoneType loses the Type information captured in the `OptionalType` element type.
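A plain-Python sketch of why the specialization helps (the `scale` function is invented for illustration, not from the PR): once an `Optional[float]` input is specialized to `float` or `None`, one arm of the `is None` check becomes dead and could be pruned:

```python
from typing import Optional

def scale(x: float, factor: Optional[float]) -> float:
    if factor is None:       # dead branch once factor is known to be float
        return x
    return x * factor        # dead branch once factor is known to be None

# Specialized to float: only the multiply branch can run.
assert scale(2.0, 3.0) == 6.0
# Specialized to NoneType: only the passthrough branch can run.
assert scale(2.0, None) == 2.0
```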
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18407
Reviewed By: zdevito
Differential Revision: D15216808
Pulled By: eellison
fbshipit-source-id: 01f1a7643deaf4962c3f55eff2070d54b0e54b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20062
Previously, the batch counter is incremented even if none of the readers has data. In this diff,
1) Limiter is applied to the last reader so that the batch counter is not incremented unless the first N-1 readers have data
2) The stop blob of the last reader is used as the stop blob of the task, so that it's checked before the counter is incremented
Reviewed By: xianjiec
Differential Revision: D15099761
fbshipit-source-id: 47ed6c728118fe453cf57ac3457085867939485b
Summary:
As a work around for dynamic shape case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20093
Reviewed By: zrphercule
Differential Revision: D15220661
Pulled By: houseroad
fbshipit-source-id: de271fce542be380bd49a3c74032c61f9aed3b67
Summary:
Fix for https://github.com/pytorch/pytorch/issues/16962
This needs fixing because we turn lists into tuples when constantifying a module, so indexing into a Tuple of one element type with a non-constant integer is quite common.
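In plain Python the pattern looks like this (a hypothetical example, not from the PR): indexing a tuple whose elements all share one type with an index computed at runtime:

```python
def pick(ts, i):
    # ts behaves like a constantified list: a tuple of one element type.
    # i is not a compile-time constant, which is the case this fix supports.
    return ts[i % len(ts)]

sizes = (10, 20, 30)
assert pick(sizes, 4) == 20   # 4 % 3 == 1
```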
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20081
Differential Revision: D15205893
Pulled By: eellison
fbshipit-source-id: 61d74ee071ad0aad98e46fe807d6f6cc5f6abd2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19513
Add support for printing a QTensor in python frontend
Differential Revision: D15017168
fbshipit-source-id: 312d1f18e6ca3c9eb4a5b8bb1c64f7cc8bc1dcf5
Summary:
Eigen was updated with the commit needed to get rid of this warning that plagued the CI. This PR bumps third_party/eigen to that commit head.
```
warning: #warning "host_defines.h is an internal header file and must not be used directly. This file will be removed in a future CUDA release. Please use cuda_runtime_api.h or cuda_runtime.h instead." [-Wcpp]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19917
Differential Revision: D15218183
Pulled By: ezyang
fbshipit-source-id: 653c7d61ea401a7d4469c2009612dc43cc70122d
Summary:
pytorch failed to build with the following error, complaining about the first regex match
It may be caused by a bug in python 2.7.5
This change proposed is a workaround for building pytorch with python 2.7.5
Since the '*' notation is greedy in Python regexes, the new expression produces results identical to the old one.
```
Traceback (most recent call last):
File "/data2/nihuini/pytorch/cmake/../aten/src/ATen/gen.py", line 14, in <module>
import preprocess_declarations
File "/data2/nihuini/pytorch/aten/src/ATen/preprocess_declarations.py", line 3, in <module>
from function_wrapper import TYPE_FORMAL_GENERIC
File "/data2/nihuini/pytorch/aten/src/ATen/function_wrapper.py", line 5, in <module>
from code_template import CodeTemplate
File "/data2/nihuini/pytorch/aten/src/ATen/code_template.py", line 13, in <module>
class CodeTemplate(object):
File "/data2/nihuini/pytorch/aten/src/ATen/code_template.py", line 23, in CodeTemplate
subtitution = re.compile(substitution_str, re.MULTILINE)
File "/usr/lib64/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib64/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
--
CMake Error at cmake/Codegen.cmake:162 (message):
Failed to get generated_cpp list
Call Stack (most recent call first):
caffe2/CMakeLists.txt:2 (include)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20137
Differential Revision: D15218122
Pulled By: ezyang
fbshipit-source-id: 10b618ff92a04e9074f5d83e31411fc2341e0cf8
Summary:
Some functions were not decorated with `CAFFE2_API`, makes them unusable when creating unit tests for custom ops outside Caffe2 repo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20114
Differential Revision: D15217490
Pulled By: ezyang
fbshipit-source-id: dda3910ad24e566567607deaac705a34ec8e7b8d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20044
We do not have a gating functor; this diff adds it. I'm leveraging the existing learning rate op because there are other policies I'll need to use together as a union.
* Since there are other policies in LearningRateOp which will be used as a union, I chose to add it as a LearningRateOp.
* constantwarmup cannot express a step function that is nonzero first and zero later.
* There are multiple uses for it,
* e.g. as a gating blob generator that is useful for turning off.
* e.g. as a learning rate switcher at certain iteration.
* For generalizability, no regulation or constraint is applied on the range of the values
* see figure below for illustration
{F157366621}
Reviewed By: ccheng16
Differential Revision: D15178229
fbshipit-source-id: 1e66e9a4bc1bfb946a57f8aefc97d8170f6be731
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19777
When used from iPython/Jupyter/Bento, the cell doing kernel registration might be executed multiple times.
Also, there might be a kernel library that wants to overwrite one of the existing kernels.
Let's allow this.
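A minimal Python analogue of the "last registration wins" behavior described above (`KernelRegistry` is a hypothetical stand-in, not the c10 API):

```python
class KernelRegistry:
    def __init__(self):
        self._kernels = {}

    def register(self, op_name, kernel):
        # Re-running a registration cell simply overwrites the previous
        # kernel instead of raising a duplicate-registration error.
        self._kernels[op_name] = kernel

    def lookup(self, op_name):
        return self._kernels[op_name]

reg = KernelRegistry()
reg.register("my::relu", lambda x: max(x, 0))
reg.register("my::relu", lambda x: x if x > 0 else 0)  # same cell, run twice
```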
Reviewed By: dzhulgakov
Differential Revision: D15090318
fbshipit-source-id: 09f842e8fd36646053c5c2f11325de4d31105b0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19776
The diff stacked on top of this enables overwriting of kernels, but before that, we need to do some refactorings.
This diff:
- hides deregistration logic behind a RAII class so we can keep more information about what exactly to deregister without the API user knowing about it.
- Split KernelRegistration from SchemaRegistration by taking Dispatcher::OperatorDef and moving it to a different file. This is better readable, especially since kernel registration will become more complex in the next diff.
- Move LeftRight synchronization out of DispatchTable to Operator because there will be a mutex added to Operator in the next diff and related synchronization primitives shouldn't live on different abstraction levels.
Reviewed By: dzhulgakov
Differential Revision: D15090322
fbshipit-source-id: 2e51a192075163f0d496956d9e54b9aaf26b2369
Summary:
This PR adds a new trace API `trace_module` that will allow us to trace multiple methods as a part of a single `ScriptModule`
See the example below.
```python
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.conv = nn.Conv2d(1, 1, 3)
def forward(self, x):
return self.conv(x)
def weighted_kernel_sum(self, weight):
return weight * self.conv.weight
example_weight = torch.rand(1, 1, 3, 3)
example_forward_input = torch.rand(1, 1, 3, 3)
n = Net()
inputs = {'forward' : example_forward_input, 'weighted_kernel_sum' : example_weight}
module = torch.jit.trace_module(n, inputs)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19905
Differential Revision: D15200007
Pulled By: Krovatkin
fbshipit-source-id: 0354d973fe40cb6e58b395bd866df14e0fc29d5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19784
For backwards compatibility, we allow vector arguments if a kernel is registered with the deprecated API.
Reviewed By: dzhulgakov
Differential Revision: D15091972
fbshipit-source-id: 4db3e3a262e605504b05c42d40046011408501d2
Summary:
Class attributes should preferably be explicitly initialized within
the __init__() call. Otherwise, overriding step() is
prone to bugs.
This patch partially reverts #7889
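A small illustration of the pattern the patch restores (`Scheduler` here is a hypothetical stand-in for the scheduler classes, not the actual torch.optim code): state used by `step()` is set explicitly in `__init__`, so a subclass that overrides `step()` finds it already there:

```python
class Scheduler:
    def __init__(self, base_lr):
        self.base_lr = base_lr
        # Initialized up front, not lazily inside step(): an overriding
        # subclass can rely on the attribute existing.
        self.last_epoch = -1

    def step(self):
        self.last_epoch += 1
        return self.get_lr()

    def get_lr(self):
        return self.base_lr * (0.9 ** self.last_epoch)

class WarmScheduler(Scheduler):
    def step(self):
        # Safe: last_epoch exists even though this override runs first.
        if self.last_epoch < 0:
            self.last_epoch = 0
            return self.base_lr
        return super().step()

s = WarmScheduler(1.0)
first = s.step()   # warm-up step at base_lr
second = s.step()  # decayed step
```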
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20059
Differential Revision: D15195747
Pulled By: soumith
fbshipit-source-id: 3d1a51d8c725d6f14e3e91ee94c7bc7a7d6c1713
Summary:
This takes care of some outstanding review comments for https://github.com/pytorch/pytorch/pull/16196/
Specifically:
1. Add comment about kind
2. Add comment about GraphPy
3. Remove ONNX version comment
4. Remove scalar_dict from SummaryWriter and all history functions
cc lanpa ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20038
Reviewed By: natalialunova
Differential Revision: D15177257
Pulled By: orionr
fbshipit-source-id: 218aa799d8b7dbb58f422a331236bba4959347de
Summary:
Sometimes people need to checkout an older version and build PyTorch. In that case, they need to do `git submodule sync` and maybe `git submodule update --init` as mentioned [here](https://github.com/pytorch/pytorch/issues/20074).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20088
Differential Revision: D15195729
Pulled By: soumith
fbshipit-source-id: 73232b801e5524cdba462dd504fb973d95d0498c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/20027
Before this change, any pointer was converted to bool, i.e.
IValue("string") == IValue(true)
After this change, it does the right thing and creates a string.
Reviewed By: dzhulgakov
Differential Revision: D15172409
fbshipit-source-id: 8167dd780005f9bceef4fe3c751f752e42ceeb20
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19910
This change modifies the quant-dequant node pattern from
qparam->q->dq to qparam->q->int_repr->qparam->dq. The motivation for
this change is to make the qparams required for op substitution available one
level up, at the dequant node, instead of multiple levels up.
Differential Revision: D15120146
fbshipit-source-id: 74b0fd5cb50a338f562740a9cc727a7791c718c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19680
This was broken for quite some time because of an operator schema
check that went into effect at some point in time.
Reviewed By: manojkris
Differential Revision: D15055082
fbshipit-source-id: 7f730f9b810bdaffd69bab7ac4d02c5b2e40645b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19402
This pass propagate the qparams calculated after calibration to the
quant nodes which will be used later for quantization
Differential Revision: D14995230
fbshipit-source-id: 5709153ea1c039c4ab4470ddb689a303b0bcc6fd
Summary:
From the comment: "don't use TYPE again in case it is an expensive or side-effect op"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19968
Differential Revision: D15151567
Pulled By: gchanan
fbshipit-source-id: 4d42c081ac1472b71f1cea5172cb42a7c83a7043
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19232
Add observer nodes to collect stats for input data nodes excluding params
which are constant at inference and need not be observed. This information
is required to compute quantization params.
Differential Revision: D14885485
fbshipit-source-id: 8762cc2a4e510e1553b3dbd1d1aecd55b4bdb89f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19783
Previously, the IValues were copied into the kernel arguments, which caused a refcount bump if Tensor was taken by value.
Now, a kernel can take Tensor by value without any refcount bump because it is moved in.
Reviewed By: dzhulgakov
Differential Revision: D15091973
fbshipit-source-id: 4c5ff2e3ee86f5934cc84191697f7dbc9c3ee345
Summary:
The second commit will be removed before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19965
Differential Revision: D15153243
Pulled By: pjh5
fbshipit-source-id: 70eae38d0cb07dc732c0cf044d36ec36d0a4472d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19945
This logging was removed in D6888977, but I think it could be useful for debugging upstream data issues to check the violating index
Reviewed By: xianjiec
Differential Revision: D15127887
fbshipit-source-id: 4ad7eceefcd063bf45bc190a4c0d458a089c918a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19681
For accelerators, we need to lower just the quantized weight data without layout transformation. This diff provides that option.
Reviewed By: jerryzh168, zrphercule
Differential Revision: D15066568
fbshipit-source-id: 133d749e087c2ad4a899bee5e96f597f70b2443c
Summary:
log_normal_ and geometric_ were disabled for CPU by mistake in [this PR](bc53805f2e), this PR fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19938
Differential Revision: D15143404
Pulled By: izdeby
fbshipit-source-id: 41c7bd29f046b5a3ac6d601de8c64ab553771d19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19811
This does not immediately take effect for the custom op API but will break backwards compatibility once we switch to the new operator registration.
Reviewed By: dzhulgakov
Differential Revision: D15101924
fbshipit-source-id: 8890a5a3e163d3263dc1837be0b4851984771917
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19779
This macro wasn't set correctly because the target macros weren't included from Apple's header.
Reviewed By: dzhulgakov
Differential Revision: D15090427
fbshipit-source-id: 43ca44f0f409e11718b7f60c3fdcd2aa02d7018e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19773
const members can't be moved, so whenever somebody moved a function schema, it was copied instead.
This diff fixes this.
Reviewed By: dzhulgakov
Differential Revision: D15090323
fbshipit-source-id: 123a1d6b96ac46cb237966c0b072edebcdafe54c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19817
A lot of files were depending on the JIT's typesystem
because operator.h depends on function_schema.h. However,
this isn't fundamental to the design. This diff tries to
remove the direct depenency and only includes the c10
wrapper helpers in files where it is required.
Reviewed By: smessmer
Differential Revision: D15112247
fbshipit-source-id: 2c53d83e542c32d9a398c8b60dbf40ab7a1cb0f6
Summary:
This adds method details and corrects an example on the page that didn't run properly. I've now confirmed that it runs in colab with nightly.
For those with internal access the rendered result can be seen at https://home.fburl.com/~orionr/pytorch-docs/tensorboard.html
cc lanpa, soumith, ezyang, brianjo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19915
Differential Revision: D15137430
Pulled By: orionr
fbshipit-source-id: 833368fb90f9d75231b8243b43de594b475b2cb1
Summary:
Trying to get this in before 1.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19918
Reviewed By: driazati
Differential Revision: D15124430
Pulled By: eellison
fbshipit-source-id: 549cdcbaff91218657e94ce08c0f4e69b576d809
Summary:
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#19638 [jit] Serialize attribute module as torch.jit._pickle**
* use `torch.jit._pickle` as the module for globals in the pickle program. Pickle will try to resolve these to the actual functions in `torch.jit._pickle.py` automatically (I believe this can also be overridden to point to whatever functions you want). This means that `pickle.load("my_model/attributes.pkl")` will work instead of having to use a custom `pickle.Unpickler`
* use `REDUCE` opcodes instead of `BUILD` to make use of the last bullet
* use a union in the unpickler to support globals better (+ any future metadata we might need that can't be stored in an `IValue`), this makes some of the code around `IntList`s clearer and lets us get rid of any lookbehind for opcodes
* pickle things as a tuple instead of a list (an immutable result is more semantically correct)](https://our.intern.facebook.com/intern/diff/15111203/)
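The `REDUCE`-based mechanism can be sketched in plain Python (`build_intlist` is a hypothetical helper standing in for the functions in `torch.jit._pickle`): a global referenced from the pickle program resolves to a real module-level function at load time, so a plain `pickle.load` works without a custom `Unpickler`:

```python
import pickle

def build_intlist(values):
    # Hypothetical stand-in for a torch.jit._pickle helper.
    return list(values)

class IntListHolder:
    def __reduce__(self):
        # Emits a GLOBAL opcode for build_intlist plus a REDUCE opcode,
        # so plain pickle.loads() reconstructs the value by calling it.
        return (build_intlist, ((1, 2, 3),))

payload = pickle.dumps(IntListHolder())
restored = pickle.loads(payload)
```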
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19638
Pulled By: driazati
Differential Revision: D15111203
fbshipit-source-id: 526c6c2b63a48eb1cba1c658045a7809730070dd
Summary:
Added deprecation warnings for the masked methods and enabled them for a bool tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19140
Differential Revision: D14888021
Pulled By: izdeby
fbshipit-source-id: 0e42daf8f3732ca29f36d10485402bfc502716ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19688
Minor changes to hipify script to take extra folders.
Reviewed By: bddppq
Differential Revision: D15068427
fbshipit-source-id: e2e792c8227cbd0e15fd2564f87d740a62c477da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19901
The existing code used `expect_autograd_hooks_` as a proxy for the
situation where finalization of the previous iteration is needed. This
is not correct, however, since you may decide to completely ignore the
output of a DDP wrapped module. If this is the case, and no gradients
have been passed to the reducer, it is fine to keep going. This commit
adds a new variable `require_finalize_` that tracks whether the
finalization is really needed.
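The gist of the new flag can be sketched in a few lines of Python (a toy model of the reducer, not the actual C++ implementation):

```python
class Reducer:
    """Toy sketch: only finalize when gradients actually reached the reducer."""

    def __init__(self):
        self.require_finalize = False

    def autograd_hook(self):
        # Called when autograd hands a gradient to the reducer.
        self.require_finalize = True

    def prepare_for_backward(self):
        # If the DDP output was ignored and no gradients arrived,
        # there is nothing to finalize and we can keep going.
        if self.require_finalize:
            self.require_finalize = False
            return "finalized"
        return "skipped"
```

Under this model, skipping an iteration's output no longer trips the finalization logic.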
Reviewed By: mrshenli
Differential Revision: D15118871
fbshipit-source-id: 25938eaf1fe13e2940feae1312892b9d3da8a67d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19897
During validation, gradient reduction is not needed, and autograd is
never called. The model output will always be a detached tensor. After
the new reducer was merged, this meant that it would find all model
parameters unused and kick off reduction for them. With #19799, this is
treated as an output where no parameters are used, and the reducer tries
to kick off reduction of zeroed gradients. Test for `torch.is_grad_enabled()`
and `self.training` before calling into the reducer.
Reviewed By: mrshenli
Differential Revision: D15118726
fbshipit-source-id: b0208f632a61cbe8110fa626fa427937b7f05924
Summary:
As DDP in previous releases does not support unused params, turning off `find_unused_parameters` by default to derisk new reducer.
CC pietern soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19895
Reviewed By: pietern
Differential Revision: D15118563
Pulled By: mrshenli
fbshipit-source-id: 6215c486e1dae3387b36011d8e64a2721ac85f58
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19821
It is possible that not a single parameter is used during an
iteration. If this is the case, the `prepare_for_backward` function
marks all parameters as unused, kicks off reduction of all buckets,
*and* finalizes the reduction.
This is different from the prior implementation where we assumed that
autograd would produce a gradient for at least a single parameter.
We then used the autograd callback mechanism to queue a finalizer
callback. Now, this finalizer may be executed in line.
Reviewed By: mrshenli
Differential Revision: D15113272
fbshipit-source-id: dc91458b569cd8c106ddaeea558464b515683550
Summary:
When output blob names are specified with load_all=1, the output blob names are ignored. However, this behavior is not documented. In this diff, we simply disallow users from providing blob names when load_all=1.
See discussion at https://fb.workplace.com/groups/1405155842844877/permalink/2714909788536136/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19133
Reviewed By: dzhulgakov
Differential Revision: D14883698
Pulled By: chandlerzuo
fbshipit-source-id: 6e4171e36c4ccc4f857e79da98b858a06b7d8ad6
Summary:
Previously, in type unification, when we encountered an Optional[T] and a None, we would unify them to Optional[Optional[T]]. If you think about Optionals as a union of [T, None], then a union of [Optional[T], None] is still [T, None]. We should just never create an Optional of an Optional.
The other fix would be to change unify_types directly, but I think this is the more general fix, and it plays more nicely with our optional type refinement, which also assumes we never encounter an Optional[Optional[T]].
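A toy sketch of the corrected rule, with made-up tuple type descriptors rather than the actual JIT type objects:

```python
def is_optional(t):
    return isinstance(t, tuple) and t[0] == "Optional"

def unify(a, b):
    """Unify two toy type descriptors, never producing Optional[Optional[T]]."""
    if a == b:
        return a
    # None unifies with Optional[T] to Optional[T] itself.
    if a == "None" and is_optional(b):
        return b
    if b == "None" and is_optional(a):
        return a
    # None unifies with a plain T to Optional[T].
    if a == "None":
        return ("Optional", b)
    if b == "None":
        return ("Optional", a)
    return None  # no unification

# Union of [Optional[int], None] stays Optional[int], not Optional[Optional[int]].
assert unify(("Optional", "int"), "None") == ("Optional", "int")
```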
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19813
Reviewed By: suo
Differential Revision: D15103083
Pulled By: eellison
fbshipit-source-id: db803db10d6934eaa5458e7c1746546b0d0c0a6c
Summary:
It's been hard to understand how workers are launched and what code runs in the worker vs. the main process, especially on Windows, which leads to many of our samples failing. This explains when workers run and how to make code work on Windows as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18091
Differential Revision: D15083766
Pulled By: soumith
fbshipit-source-id: 8a7e60defc8a72ec63874f657d7d5267d951dccf
Summary:
One more fix for https://github.com/pytorch/pytorch/pull/19810
We now know that we are running with python3, so no need to check python version. The quotes were probably causing problems here.
cc ezyang soumith zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19814
Differential Revision: D15106459
Pulled By: orionr
fbshipit-source-id: 0443b9b54d17fead9c8c2c9d8d2f373e1f95a28b
Summary:
This is a follow up PR for #19547.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19748
Differential Revision: D15103230
Pulled By: ezyang
fbshipit-source-id: e7ce925faeadea502f77ed42d52e247c8c6571d8
Summary:
This is a follow up PR for #19409.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19738
Differential Revision: D15103231
Pulled By: ezyang
fbshipit-source-id: 11c9fec641b389906b8accd22504a683331fa6ec
Summary:
Eigen was updated with the commit needed to get rid of this warning that plagued the CI. This PR bumps third_party/eigen to that commit head.
```
warning: #warning "host_defines.h is an internal header file and must not be used directly. This file will be removed in a future CUDA release. Please use cuda_runtime_api.h or cuda_runtime.h instead." [-Wcpp]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19789
Differential Revision: D15103223
Pulled By: ezyang
fbshipit-source-id: 5b56c4dd9cc41ff1794570ba2f6abfbe23f6ab68
Summary:
Tested locally that this fixes #19039; did not add a test since there's no way to create a script module in the C++ world.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19700
Differential Revision: D15094195
Pulled By: wanchaol
fbshipit-source-id: fcc2c1e5efbc160d976ae485ba2457442f62f065
Summary:
Currently, the Python API doesn't serialize layers that don't have weights (such as `nn.ReLU` and `nn.MaxPool2d`, e.g. in https://github.com/pytorch/vision/blob/master/torchvision/models/densenet.py#L80-L81). If one saves a model that contains weight-less layers in Python and tries to load it into C++, the C++ module loading code (`torch::load(...)`) will throw an error complaining that the expected layers are not found in the serialized file (e.g. https://github.com/pytorch/vision/pull/728#issuecomment-480974175). This PR solves the problem by ignoring layers that are not serializable (which currently only includes `nn::Functional`) in the C++ module serialization code (`torch::save(...)` and `torch::load(...)`); the user is expected to wrap weight-less layers in `nn::Functional` so that they can be ignored when serializing / deserializing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19740
Differential Revision: D15100575
Pulled By: yf225
fbshipit-source-id: 956481a2355d1de45341585abedda05e35d2ee8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18820
ghimport-source-id: 220b2a3dd9d4d6d2e557e1802851f082c2dc6452
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18820 Refactor ProcessGroupNCCL collective primitives**
Planning to add reduce-scatter, but no room in my stomach for more
copypasta.
Also rewrote the tensor list validation logic. The existing validation
was ill-suited for all the cases it was being used for; it took a vector
of input tensors and a vector of output tensors, but only ever received
either two references to the same vector, or a bespoke singleton vector
and a vector of outputs (for which it would ignore all but the first
output). In the first case, it performed unnecessary checks, and in the
second, it skipped necessary ones.
Reviewed By: mrshenli
Differential Revision: D14762369
fbshipit-source-id: dcf882ce1c5854333a9eb4424bfc18d9f4648ddf
Summary:
In order to have `torch.utils.tensorboard.SummaryWriter` rendered in the documentation at the bottom of https://pytorch.org/docs/master/tensorboard.html we need to have TensorBoard installed.
This change makes it so our pinned version of `tb-nightly` is used for doc generation same as it is used for running tests at https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/test.sh#L45-L52
Eventually we'll use a pinned version of `pip install tensorboard`, but it's not on the release channel yet.
cc kostmo soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19810
Differential Revision: D15101730
Pulled By: orionr
fbshipit-source-id: c41678c4f9ef3d56a168f2b96a1ab05f351bdc56
Summary:
Input argument `f` in `_model_to_graph()` method in `torch/onnx/utils.py` is unused. This PR removes it. If there's a reason to keep it around, please let me know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19647
Reviewed By: dzhulgakov
Differential Revision: D15071720
Pulled By: houseroad
fbshipit-source-id: 59e0dd7a4d5ebd64d0e30f274b3892a4d218c496
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19799
A module that returns multiple outputs and whose caller may end up
doing multiple calls to torch.autograd.backward did not work with
DistributedDataParallel. It expected the first call to
torch.autograd.backward to provide gradients for ALL parameters that
expect gradients and were used in computing the module output. If you
have outputs with disjoint autograd graphs it is fine to call
torch.autograd.backward on both and fill in the module's parameter
gradients in separate chunks.
With this change we delay queuing the finalizer callback until we have
marked all buckets as ready, instead of queueing it the first time we
receive an autograd hook. This returns the current implementation to
be functionally equivalent to the DistributedDataParallel
implementation before #18953 was merged.
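The change can be illustrated with a toy bucket-tracking sketch (illustrative only; the real reducer lives in C++):

```python
class Reducer:
    """Queue the finalizer only once every bucket is ready, not on the
    first autograd hook, so disjoint backward() calls can each fill in
    their own chunk of the parameter gradients."""

    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.ready = set()
        self.finalized = False

    def mark_bucket_ready(self, idx):
        self.ready.add(idx)
        if len(self.ready) == self.num_buckets:
            self.finalized = True  # finalizer runs here, possibly inline

reducer = Reducer(num_buckets=2)
reducer.mark_bucket_ready(0)   # first backward() call: not finalized yet
assert not reducer.finalized
reducer.mark_bucket_ready(1)   # second backward() call completes the set
assert reducer.finalized
```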
Reviewed By: mrshenli
Differential Revision: D15097045
fbshipit-source-id: 2df023319713bc31e29a8b45108c78e6593fccd4
Summary:
* adds TORCH_API and AT_CUDA_API in places
* refactor code generation Python logic to separate
caffe2/torch outputs
* fix hip and asan
* remove profiler_cuda from hip
* fix gcc warnings for enums
* Fix PythonOp::Kind
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19554
Differential Revision: D15082727
Pulled By: kostmo
fbshipit-source-id: 83a8a99717f025ab44b29608848928d76b3147a4
Summary:
This PR adds TensorBoard logging support natively within PyTorch. It is based on the tensorboardX code developed by lanpa and relies on changes inside the tensorflow/tensorboard repo landing at https://github.com/tensorflow/tensorboard/pull/2065.
With these changes users can simply `pip install tensorboard; pip install torch` and then log PyTorch data directly to the TensorBoard protobuf format using
```
import torch
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
s1 = torch.rand(1)
writer.add_scalar('data/scalar1', s1[0], 0)
writer.close()
```
Design:
- `EventFileWriter` and `RecordWriter` from tensorboardX now live in tensorflow/tensorboard
- `SummaryWriter` and PyTorch-specific conversion from tensors, nn modules, etc. now live in pytorch/pytorch. We also support Caffe2 blobs and nets.
Action items:
- [x] `from torch.utils.tensorboard import SummaryWriter`
- [x] rename functions
- [x] unittests
- [x] move actual writing function to tensorflow/tensorboard in https://github.com/tensorflow/tensorboard/pull/2065
Review:
- Please review for PyTorch standard formatting, code usage, etc.
- Please verify unittest usage is correct and executing in CI
Any significant changes made here will likely be synced back to github.com/lanpa/tensorboardX/ in the future.
cc orionr, ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16196
Differential Revision: D15062901
Pulled By: orionr
fbshipit-source-id: 3812eb6aa07a2811979c5c7b70810261f9ea169e
Summary:
disable_autodiff_subgraph_inlining should always be on to check for AD regressions.
Thanks eellison for spotting the test regression!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19787
Differential Revision: D15093104
Pulled By: ailzhang
fbshipit-source-id: 82a75a7dd7097d5f93a2e4074023da2105341c1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19378
Function profile events are typically nested. In this diff I
add parent child relationship to the intervals. This way we can
attribute self time easily. As a result, user printing a table from a
profiler trace gets self cpu time.
This diff doesn't try to address CUDA self time as CUDA kernels are
already getting special care in the profiler.
There are also some other minor improvements, like reporting the total
CPU time spent, reversed sorting, aggregated data after the table, etc.
A new unit test is added which tests more functionality than the
previous profiler test.
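With parent/child links on the intervals, self CPU time falls out of a simple subtraction; a minimal sketch (the event layout here is made up for illustration, not the profiler's actual data structures):

```python
def self_times(events):
    """events: list of (name, start, end, parent_index_or_None).
    Self time = own duration minus the durations of direct children."""
    total = {i: end - start for i, (_, start, end, _) in enumerate(events)}
    self_t = dict(total)
    for i, (_, _, _, parent) in enumerate(events):
        if parent is not None:
            self_t[parent] -= total[i]
    return {events[i][0]: t for i, t in self_t.items()}

events = [("outer", 0, 10, None), ("inner", 2, 6, 0)]
assert self_times(events) == {"outer": 6, "inner": 4}
```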
Reviewed By: zheng-xq
Differential Revision: D14988612
fbshipit-source-id: 2ee6f64f0a4d0b659c6b23c0510bf13aa46f07dc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19319
Quantized SUM + ReLU (Fused). The implementation is the same as the one in the DNNLOWP.
Reviewed By: jianyuh
Differential Revision: D14866442
fbshipit-source-id: c8c737a37e35b6ce3c1c2077c07546aba16e0612
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19312
Replaces the tuple hack with QTensor. Please note this can be landed ONLY after #18960 (D14810261) is landed.
Reviewed By: raghuramank100
Differential Revision: D14819460
fbshipit-source-id: 75ca649304b1619cb3cfe845962c9f226b8f884a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19714
We had rounding in the quantizer set as `round(x/scale) + zp`. To make it consistent, converting it to `round(x/scale + zp)`.
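With round-half-to-even (as in Python's built-in `round`), the two formulas are not interchangeable, which is why consistency matters; a small illustration:

```python
def quantize_old(x, scale, zp):
    return round(x / scale) + zp   # round first, then shift by the zero point

def quantize_new(x, scale, zp):
    return round(x / scale + zp)   # shift first, then round

# With round-half-to-even, ties land differently depending on the zero point:
assert quantize_old(0.5, 1.0, 1) == 1   # round(0.5) -> 0, then + 1
assert quantize_new(0.5, 1.0, 1) == 2   # round(1.5) -> 2
```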
Reviewed By: raghuramank100
Differential Revision: D15077095
fbshipit-source-id: 5d20a90391fe8c2e11b338c05631fcf7770320c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19751
This has probably never been tested on Windows, but destruction of WorkersPool crashes because it uses _aligned_malloc to allocate and `free` to deallocate, which is not symmetric. The fix is to use _aligned_free for deallocation.
Reviewed By: hlu1
Differential Revision: D15083472
fbshipit-source-id: 42243fce8f2dfea7554b52e6b289d9fea81d7681
Summary:
The new pip package is more restricted. We need to add an extra flag to make the installation work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19725
Differential Revision: D15078698
Pulled By: houseroad
fbshipit-source-id: bbd782a0c913b5a1db3e9333de1ca7d88dc312f1
Summary:
Fixes #19650
When driazati started the bicubic implementation we used the TF result as ground truth. It turns out OpenCV's bicubic resize is more commonly used.
This PR does two things:
- Fix a bug where we didn't use area mode to compute the source index
- Follow the OpenCV logic to handle computed negative source indices (we used to bound them by 0)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19703
Differential Revision: D15078159
Pulled By: ailzhang
fbshipit-source-id: 06a32baf2fbc93b90a156b863b4f9fab326d3242
Summary:
I want to use libtorch in a C++/CUDA project but as soon as I include `<torch/torch.h>`, ".cu" files fail to compile:
`torch/csrc/jit/script/tree.h(64): error C3520: 'args': parameter pack must be expanded in this context`
This PR makes it build on my machine (don't know if it breaks anything though).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19615
Differential Revision: D15063712
Pulled By: ezyang
fbshipit-source-id: 7561e705f8f5b42b8e6a23430710b36508fee1ee
Summary:
Because of merge error with master in #15042, open a new PR for ezyang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17226
Differential Revision: D14418145
Pulled By: mrshenli
fbshipit-source-id: 099ba225b28e6aba71760b81b2153ad1c40fbaae
Summary:
Replace cpu_apply functions with the TensorIterator.
Vectorize copy and clone functions.
Move big pieces of the code to cpu kernels folder to be able to use AVX2.
Add fast path for copy_ function if tensor types matches.
A slowdown was observed on smaller tensors (up to 10%, or about 1us per op), which might be explained by the bigger CPU footprint of TensorIterator compared to the simpler cpu_apply. Conversely, on bigger tensors we see a 2x-3x performance improvement (single threaded; multithreading gives an even better performance boost).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18618
Differential Revision: D14954118
Pulled By: VitalyFedyunin
fbshipit-source-id: 9d9bdf3fd9d5e539a03071cced50d0a47bac1615
Summary:
Sometimes at::cat gets transposed inputs and goes on a slow path. Also, make jit_premul lstm benchmark add bias to the whole input tensor to avoid separate reduction kernels in the backward pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18816
Differential Revision: D15013576
Pulled By: wanchaol
fbshipit-source-id: bcfa1cf44180b11b05b0f55f034707012f66281a
Summary:
Move the insert_guard all the way up to the beginning of the decomposition. This fixes the case where we lose the insertion-point context after decomposeCommonNormalization while we still need to modify the graph.
Fixes #19502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19646
Differential Revision: D15058040
Pulled By: wanchaol
fbshipit-source-id: ebdbf8623ebfe4556c461e1b650e94b905791adb
Summary:
We had a few hard-to-repro cases where very occasionally libthnvrtc failed to load due to what looked like a garbled dladdr return in the `info.dli_fname` field. We could not root-cause why this happens, but this workaround avoids the problem altogether. $ORIGIN is already added to RPATH as the first search location, so dlopen("libthnnvrtc.so") will look for the library in the caller's (`libtorch.so.1`) directory, which was the purpose of the previous code that obtained the `libtorch.so.1` directory using dladdr.
```
root@4ec0aab027a0:/opt/conda/lib/python3.6/site-packages/torch/lib# readelf -d ./libtorch.so.1 | grep RPATH
0x000000000000000f (RPATH) Library rpath: [$ORIGIN:/usr/local/cuda/lib64:/opt/conda/lib]
```
Hopefully the same happens on Mac.
cc zdevito ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19690
Differential Revision: D15076990
Pulled By: soumith
fbshipit-source-id: a4d2992ccf26953f1fc73f17c4e752d69c58e2fc
Summary:
In the distributed training development work, we need to be able to serialize a `std::vector` of `torch::Tensor`s. This PR adds support for serializing `std::vector<torch::Tensor>`.
cc. mrshenli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19677
Differential Revision: D15069860
Pulled By: yf225
fbshipit-source-id: 505147e5f5fea78be1bf60fb8418bc187dbc2a98
Summary:
Print out the tensor value when throwing the "cannot insert tensor with grad" error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19645
Differential Revision: D15057809
Pulled By: eellison
fbshipit-source-id: 3f622ef1322a75c965e780275f1fb447e9acf38d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19676
Make copy work with QTensor, enable assignment of QTensor in pytorch frontend.
Differential Revision: D15064710
fbshipit-source-id: 04f2dc02a825695d41fa1114bfca49e92108fef3
Summary:
The input shape checker in the conv/int8_conv operators aims to avoid an issue when running with MKL-DNN Winograd: the weights have to be reordered each time the input shape changes.
However, the checker results in a big performance regression due to frequent reorders.
Meanwhile, in mkldnn-bridge, this case has already been fixed by correcting the prop_kind.
Therefore, we remove the now-useless checker to fix the performance regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19608
Differential Revision: D15061169
Pulled By: yinghai
fbshipit-source-id: 649a43ae6fce989e84939210f6dffb143ec3d350
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19585
Originally we would unroll every If op into many different subnets.
Now we no longer unroll; we just add all external inputs of its subnets to the If op and SSA-rewrite all external inputs/outputs. That is enough.
Reviewed By: yinghai
Differential Revision: D15038139
fbshipit-source-id: 8532216d8749068acd5558ad0d8cb1d98463a063
Summary:
Also
1. Bump the multiprocessing test timeout following python core tests.
2. Fix one type of flakiness in `test_proper_exit`.
3. Add trace reporting when the loader process hangs in `test_proper_exit` using `faulthandler`.
4. Give `test_proper_exit` another try.
I'll heavily retest this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19421
Differential Revision: D15063728
Pulled By: ezyang
fbshipit-source-id: 4e0d992622e11053c44a9ec237b88b9a28a4472c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19530
Make copy work with QTensor, enable assignment of QTensor in pytorch frontend.
Differential Revision: D15008160
fbshipit-source-id: 5f1166246d768b23f009cde1fa03e8952368a332
Summary:
We can't introduce aliasing to a graph output, since it may be mutated afterwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19576
Differential Revision: D15057734
Pulled By: eellison
fbshipit-source-id: 33594c05d985a0c58edebd6252e1ee2c0efb6f0e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19607
Explicit is better than implicit - it's pretty hard to debug where a particular file is if it's not greppable.
As a follow-up step, we should look at whether we can just include build_variables.py in CMake directly to share the setup between the two build systems.
Reviewed By: ezyang
Differential Revision: D15023348
fbshipit-source-id: 600ef2d1871bc28530c6a02681b284f7499904df
Summary:
We would previously have statements like
```
set_history(flatten_tensor_args( result ), grad_fn);
```
Internally, {set,rebase}_history would check grad_fn and short circuit if it is nullptr. However, this means that we are executing the expression `flatten_tensor_args( result )` and immediately throwing away the results. This was causing unnecessary allocations + overhead.
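The issue is plain eager argument evaluation; a Python analogue of the before/after (names mirror the C++, but the code is illustrative):

```python
calls = {"flatten": 0}

def flatten_tensor_args(result):
    calls["flatten"] += 1          # stands in for the wasted allocation work
    return [result]

def set_history(flattened, grad_fn):
    if grad_fn is None:            # short-circuits inside the callee...
        return

# Before: the argument expression runs even though the callee ignores it.
set_history(flatten_tensor_args(42), None)
assert calls["flatten"] == 1       # work done, result thrown away

# After: guard at the call site so the expression is never evaluated.
grad_fn = None
if grad_fn is not None:
    set_history(flatten_tensor_args(42), grad_fn)
assert calls["flatten"] == 1       # no extra work
```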
My JIT overhead benchmark script (with custom benchmark method):
```
import torch, time
@torch.jit.script
def add(x, y):
    return x + y

a = torch.rand([])
b = torch.rand([])
niter = 1000000
with torch.no_grad():
    s = time.time()
    add.__getattr__('forward').benchmark(niter, a, b)
    e = time.time() - s
print('overhead per call (us)', e / niter * 1e6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19623
Differential Revision: D15053399
Pulled By: jamesr66a
fbshipit-source-id: 8777e1a2b5c5a5bbd3a035b7247c8154c5fc4aa6
Summary:
Add base support for torch.logspace. See #19220 for details.
SsnL can you give feedback? Thanks a lot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19542
Differential Revision: D15028484
Pulled By: soumith
fbshipit-source-id: fe5a58a203b279103abbc192c754c25d5031498e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19621
The comments for group_norm_op are not accurate (i.e., the math part); this diff fixes them.
Reviewed By: BIT-silence
Differential Revision: D15048695
fbshipit-source-id: 27d41d3ae21054257967815254134849944d56ca
Summary:
This is the second part of #18064.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19547
Differential Revision: D15046630
Pulled By: ezyang
fbshipit-source-id: 03f80602b94d47bca66bfd0dcab1b7bb99e5b7f1
Summary:
Add setting requires_grad = True within torchscript to torch.Tensor
Within constant propagation, we can't insert any constants that require grad.
Also added shape analysis and requires grad analysis to torch.tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19445
Differential Revision: D15046211
Pulled By: eellison
fbshipit-source-id: b4ef7a6b4b6b8dc03e1fa49f87dc415874cd1998
Summary:
The current code initializes the `state` in the `__init__` method, but this initialization is not invoked from `add_param_group`.
I followed the same approach as the other Optimizers to init the `state`.
```python
import torch
emb = torch.nn.Embedding(10,10)
emb2 = torch.nn.Embedding(10,10)
optim = torch.optim.Adagrad(emb.parameters())
print(optim.state[emb.weight]) # already initialized
optim.add_param_group({'params': emb2.parameters()})
print(optim.state[emb2.weight]) # empty dict
loss = emb2.weight.sum() + emb.weight.sum()
loss.backward()
optim.step() # raised KeyError
```
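One way to sidestep the KeyError is lazy per-parameter state creation; the actual fix instead mirrors the other optimizers and initializes `state` for the new group's parameters as well. A toy sketch of the lazy variant (plain strings stand in for parameter tensors):

```python
class ToyOptimizer:
    def __init__(self, params):
        self.param_groups = [{"params": list(params)}]
        self.state = {}

    def add_param_group(self, group):
        self.param_groups.append(group)   # state not touched here

    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                # Create the per-parameter state on first use: no KeyError.
                st = self.state.setdefault(p, {"sum": 0.0})
                st["sum"] += 1.0

opt = ToyOptimizer(["w1"])
opt.add_param_group({"params": ["w2"]})
opt.step()                         # would raise KeyError with an eager lookup
assert opt.state["w2"]["sum"] == 1.0
```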
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17679
Differential Revision: D14577575
Pulled By: ezyang
fbshipit-source-id: 12440079ac964b9eedad48e393d47f558babe300
Summary:
Often, we want to experiment with the loss per element (image, etc.). This changeset allows getting the per-element loss as well. This output is optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19579
Reviewed By: jerryzh168
Differential Revision: D15035797
Pulled By: prigoyal
fbshipit-source-id: 562dea514f49c1f2f1cbbc083a1938dc019a75c4
Summary:
In the functional interfaces we do boolean dispatch, but always to max_pool*d_with_indices. This changes it to emit the max_pool*d op instead when it's not necessary to expose the with_indices ops to different backends (for the JIT).
It also binds max_pool*d to the torch namespace, which matches the behavior of avg_pool*d.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19449
Differential Revision: D15016839
Pulled By: wanchaol
fbshipit-source-id: f77cd5f0bcd6d8534c1296d89b061023a8288a2c
Summary:
Adding a fakequant op so that we can use it in PyTorch models; the exact implementation might change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19387
Differential Revision: D13739657
fbshipit-source-id: d5cb084e843d236bb1da9827ac1ba3900ed99786
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18499
If the init op is not fp16 compatible, it should throw.
However, in the special case where the original init op is UniformFill,
we replace it with Float16UniformFill
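A hypothetical sketch of the replacement rule (the op names besides UniformFill/Float16UniformFill, and the helper itself, are invented for illustration; the real logic lives in Caffe2's fp16 layer code):

```python
# Toy tables: which init ops are fp16 compatible, and which get rewritten.
FP16_COMPATIBLE = {"Float16UniformFill"}
REPLACEMENTS = {"UniformFill": "Float16UniformFill"}

def fp16_init_op(op_type):
    """Return an fp16-compatible init op, rewriting UniformFill; else throw."""
    if op_type in FP16_COMPATIBLE:
        return op_type
    if op_type in REPLACEMENTS:
        return REPLACEMENTS[op_type]
    raise ValueError(f"init op {op_type} is not fp16 compatible")

assert fp16_init_op("UniformFill") == "Float16UniformFill"
```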
Reviewed By: kennyhorror
Differential Revision: D14627209
fbshipit-source-id: eb427772874a732ca8b3a25d06670d119ce8ac14
Summary:
Added the formula for the corner case. Updated unit tests.
Fixes #17913
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19180
Differential Revision: D14942023
Pulled By: ezyang
fbshipit-source-id: 167c109b97a7830d5b24541dc91e4788d531feec
Summary:
I fixed a mistake in the explanation of `pos_weight` argument in `BCEWithLogitsLoss` and added an example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19212
Differential Revision: D14923431
Pulled By: ezyang
fbshipit-source-id: 15696c67d56789102ac72afbe9bdd7b667eae5a0
Summary:
Fix:
- the order of `Arguments` in the `RandomSampler` doc
- the meaningless check of `replacement`'s type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19113
Differential Revision: D15013081
Pulled By: ezyang
fbshipit-source-id: 39e367f42841de6814b1214eb9df7b75f14f747e
Summary:
A future version of cusparse will define "cusparseGetErrorString." This PR simply updates PyTorch's name for this function to "getCusparseErrorString" to avoid the collision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19591
Differential Revision: D15046871
Pulled By: ezyang
fbshipit-source-id: 821304f75fe84c68a26680a93809a18cfdbd540b
Summary:
Hello everyone :) !!
I've found that lr_scheduler is initialized with last_epoch as -1.
This causes the learning rate of the scheduler's optimizer to remain at its previous value even after the first step (not the one in init, but an explicit step of the scheduler).
```python
>>> import torch
>>> cc = torch.nn.Conv2d(10,10,3)
>>> myinitial_lr = 0.1
>>> myoptimizer = torch.optim.Adam(cc.parameters(), lr=myinitial_lr)
>>> mylrdecay = 0.5
>>> myscheduler = torch.optim.lr_scheduler.ExponentialLR(myoptimizer,mylrdecay)
>>> myscheduler.get_lr()
[0.2] # this is because of get_lr calculates lr by 0.1 * 0.5^-1
>>> myscheduler.optimizer.param_groups[0]["lr"]
0.1 # this is not consistent with get_lr value
>>> myscheduler.last_epoch
-1
>>> myscheduler.step()
>>> myscheduler.get_lr()
[0.1] # this should be the value right after the init, not after first step
>>> myscheduler.optimizer.param_groups[0]["lr"]
0.1 # since this is after first step, it should have been decayed as 0.05
>>> myscheduler.last_epoch
0
>>> myscheduler.step()
>>> myscheduler.last_epoch
1
>>> myscheduler.get_lr()
[0.05]
>>> myscheduler.optimizer.param_groups[0]["lr"]
0.05
>>> myscheduler.last_epoch
1
```
The first problem is that even after the init of lr_scheduler, you get inconsistent parameter values.
The second problem is that you are stuck with the same learning rate for the first 2 epochs if the step function of lr_scheduler is not called at the beginning of the epoch loop.
Of course, you can avoid this by calling lr_scheduler's step at the beginning, but I don't think this is proper use since, in the case of the optimizer, step is called at the end of the iteration loop.
I've simply avoided all the above issues by setting last_epoch to 0 after the initialization.
This also makes sense when you init with some value of last_epoch which is not -1.
For example, if you want to init with last epoch 10, lr should not be decayed one step further, which is what happens in the previous code where last_epoch effectively gets +1 in
base_lr * self.gamma ** self.last_epoch
Instead, it should be set to the exact step-10 value.
I hope this fix finds its way in with all your help :)
I'm really looking forward & excited to become a contributor for pytorch!
Pytorch Rocks!!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7889
Differential Revision: D15012769
Pulled By: ezyang
fbshipit-source-id: 258fc3009ea7b7390a3cf2e8a3682eafb506b08b
Summary:
n was set to self.in_channels, but not used within the scope of the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19194
Differential Revision: D14937764
Pulled By: ezyang
fbshipit-source-id: 55cb599109309503fee897f77d798fd454fcc02d
Summary:
This is the first part of #18064.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19409
Differential Revision: D15037390
Pulled By: ezyang
fbshipit-source-id: 16a3feed2fd9cc66033696da224a7d5fb7208534
Summary:
Add setting requires_grad = True within torchscript to torch.Tensor
Within constant propagation, we can't insert any constants that require grad.
Also added shape analysis and requires grad analysis to torch.tensor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19445
Differential Revision: D15039713
Pulled By: eellison
fbshipit-source-id: 47f1931b6fc4a1137c13d80110cc404465bfdf06
Summary:
I believe the existing check in FuseGraph was only `false` if PyTorch was built with NO_CUDA=1. Otherwise, we would create fusion groups even if we're on a CPU-only machine running CPU code. This is confusing. Instead I've made it so that the decision to fuse or not is dependent on if the producer Value is a known CPU tensor. If it is, we skip fusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19342
Differential Revision: D15038351
Pulled By: jamesr66a
fbshipit-source-id: fce9d83929309a7bf14346833f84b996f3e7f6db
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19516
Explicitly define types that are supported in kernel inputs and outputs.
Also, this allows us to show much nicer error messages if a user writes kernels with wrong argument types.
Reviewed By: ezyang
Differential Revision: D15020306
fbshipit-source-id: 55ebec81e075e874777acd59aa29a5578fc19ef7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19442
For cases like CV, some ops like transpose and tile will mangle the batch size so that we don't know how to adjust the output batch size. In this case, the current solution is to just fix the input batch size statically and not adjust the output batch size.
Reviewed By: zrphercule
Differential Revision: D15007237
fbshipit-source-id: a21b943a52ee5462d9d7804dfae44360f579f8cf
Summary:
Changelog:
- Rename `potri` to `cholesky_inverse` to remain consistent with names of `cholesky` methods (`cholesky`, `cholesky_solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `cholesky_inverse` under the name `potri` and add a deprecation warning to not promote usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19498
Differential Revision: D15029901
Pulled By: ezyang
fbshipit-source-id: 2074286dc93d8744cdc9a45d54644fe57df3a57a
Summary:
This just plugs into the existing mechanism to do a direct translation to TensorOptions in the backend, so no codegen changes.
After this lands, all native_functions will match the JIT signature.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19484
Differential Revision: D15013051
Pulled By: gchanan
fbshipit-source-id: 6818f868d2f765ca3e56e7e6f75fe4f68492466c
Summary:
Remove a useless format checker in the mkldnn-bridge to fix the crash issue in the DNNLOWP dequantize op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19159
Differential Revision: D15027670
Pulled By: yinghai
fbshipit-source-id: ac97d6ff94de013105108b9596b1bd7621c5aa75
Summary:
In this PR, the fusion algorithms are improved to support DNNLOWP.
1. Enabled conv fusions for DNNLOWP
2. Fused the order switch op into the following quantize op
3. Improved conv+sum fusion to parse a larger scope/window
4. Re-organized the fusion code to fix a random crash caused by mutating the graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18843
Differential Revision: D15021030
Pulled By: yinghai
fbshipit-source-id: 88d2199d9fc69f392de9bfbe1f291e0ebf78ab08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19418
This change makes the Observer class a template which always
takes an observer function as an argument. The second test case becomes redundant, so it is
removed.
Reviewed By: jerryzh168
Differential Revision: D15000594
fbshipit-source-id: 9555fe98a5f2054b8fd01e64e9ac2db72c043bfa
Summary:
There are two corrections in this pull request.
The first is specific to gcc-7.4.0.
When compiled with -std=c++14, gcc-7.4.0 has __cplusplus = 201402L.
This does not meet the check set in Deprecated.h, which asks for a value strictly greater than 201402L.
The compiler then falls through to the __GNUC__ check, which passes and sets C10_DEPRECATED_MESSAGE to a value that C++14 does not appear to support or even recognize, leading to a compile-time error.
My recommended solution, which worked for my case, was to change the > into a >=.
The second correction comes in response to this error:
caffe2/operators/crash_op.cc: In member function ‘virtual bool caffe2::CrashOp::RunOnDevice()’:
caffe2/operators/crash_op.cc:14:11: error: ‘SIGABRT’ was not declared in this scope
I am merely committing to the repository the solution suggested here (which worked for me)
https://discuss.pytorch.org/t/building-pytorch-from-source-in-conda-fails-in-pytorch-caffe2-operators-crash-op-cc/42859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19470
Differential Revision: D15019529
Pulled By: ailzhang
fbshipit-source-id: 9ce9d713c860ee5fd4266e5c2a7f336a97d7a90d
Summary:
This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5
And I got ~8 GB/s before this change, but ~14 GB/s after this change.
This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166):
== Before ==
time_per_iter 0.0001298875093460083
GB/s 3.082544287868467
== After ==
time_per_iter 0.00010104801654815674
GB/s 3.9623142905451076
The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that.
EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression:
Before:
time_per_iter 8.983819484710693e-05
GB/s 4.456723564864611
After (no axpy):
time_per_iter 7.19951868057251e-05
GB/s 5.56126065872172
After (perfkernels):
time_per_iter 5.6699180603027346e-05
GB/s 7.061548257694262
After (perfkernels, no grad):
time_per_iter 4.388842582702637e-05
GB/s 9.122769670026413
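As a rough sanity check (not part of the original benchmark), GB/s is just bytes moved per iteration divided by time per iteration; backing the payload out of one row and applying it to another confirms the rows describe the same workload:

```python
def gb_per_s(bytes_per_iter, time_per_iter):
    # bandwidth in GB/s, with GB = 1e9 bytes
    return bytes_per_iter / time_per_iter / 1e9

# back out the per-iteration payload from the "After (perfkernels)" row
bytes_per_iter = 7.061548257694262 * 5.6699180603027346e-05 * 1e9  # ~400 KB

# the "Before" row implies (nearly) the same payload
before_bw = gb_per_s(bytes_per_iter, 8.983819484710693e-05)
```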
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329
Reviewed By: dzhulgakov
Differential Revision: D14969630
Pulled By: jamesr66a
fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19515
This is still done by default, but can now be disabled by specifying
`find_unused_parameters=False`. There are use cases where finding
unused parameters results in erroneous behavior, because a subset of
model parameters is used *outside* the `forward` function. One can
argue that doing this is not a good idea, but we should not break
existing use cases without an escape hatch. This configuration
parameter is that escape hatch.
Reviewed By: bddppq
Differential Revision: D15016381
fbshipit-source-id: f2f86b60771b3801ab52776e62b5fd6748ddeed0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19511
In the c10 operator registration API, disallow std::vector arguments and show a nice error message
pointing users towards using ArrayRef instead.
Reviewed By: ezyang
Differential Revision: D15017423
fbshipit-source-id: 157ecc1298bbc598d2e310a16041edf195aaeff5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19503
After reading the arguments from the stack, the c10 kernel wrapper accidentally popped them again, causing a vector to be allocated.
Instead, it should just drop them because they have already been read.
Reviewed By: ezyang
Differential Revision: D15016023
fbshipit-source-id: b694a2929f97fa77cebe247ec2e49820a3c818d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19518
The previous design required running the op benchmarks from the PyTorch root directory, which could lead to a `module not found` error in the OSS environment. This diff fixes that issue by making the benchmarks launchable from the `benchmarks` folder.
Reviewed By: ilia-cher
Differential Revision: D15020787
fbshipit-source-id: eb09814a33432a66cc857702bc86538cd17bea3b
Summary:
First step at allowing container types within alias analysis.
Since the current implementation hides the concept of Wildcards within alias analysis and does not expose it to memory dag, we cannot represent whether a container type holds a wildcard. As a result, only handle TupleConstruct, where we can directly inspect if any input values are wildcards, and don't handle nested containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18710
Differential Revision: D15017068
Pulled By: eellison
fbshipit-source-id: 3ee76a5482cef1cc4a10f034593ca21019161c18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19458
The algorithm in https://fburl.com/ggh9iyvc fails to really ensure topological ordering of nodes. The fix is ugly but effective. I think we need a real topological sort to fix this issue more nicely. Mikhail Zolotukhin, Bram Wasti.
Differential Revision: D15011893
fbshipit-source-id: 130c3aa442f5d578adfb14fbe5f16aa722434942
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19381
Expose QScheme enum in frontend so that people can use it in
quantization configs in modules.
Differential Revision: D14922992
fbshipit-source-id: ab07b8a7ec42c1c1f5fe84a4a0c805adbcad408d
Summary:
Add the defaults field to the copied object.
Prior to this patch, optimizer.__getattr__ has excluded the defaults
attribute of optimizer source object, required by some LR schedulers. (e.g. CyclicLR with momentum)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19308
Differential Revision: D15012801
Pulled By: soumith
fbshipit-source-id: 95801b269f6f9d78d531d4fed95c973b280cc96f
Summary:
This is a simple yet useful addition to the torch.nn modules: an identity module. This is a first draft - please let me know what you think and I will edit my PR.
There is no identity module; nn.Sequential() can be used, but it is argument-sensitive, so it can't be used interchangeably with any other module. This adds nn.Identity(...), which can be swapped with any module because it accepts dummy arguments. It's also more understandable than seeing an empty Sequential inside a model.
See discussion on #9160. The current solution is to use nn.Sequential(). However this won't work as follows:
```python
batch_norm = nn.BatchNorm2d
if dont_use_batch_norm:
    batch_norm = Identity
```
Then in your network, you have:
```python
nn.Sequential(
    ...
    batch_norm(N, momentum=0.05),
    ...
)
```
If you try to simply set `Identity = nn.Sequential`, this will fail since `nn.Sequential` expects modules as arguments. Of course there are many ways to get around this, including:
- Conditionally adding modules to an existing Sequential module
- Not using Sequential but writing the usual `forward` function with an if statement
- ...
**However, I think that an identity module is the most pythonic strategy,** assuming you want to use nn.Sequential.
Using the very simple class (this isn't the same as the one in my commit):
```python
class Identity(nn.Module):
    def __init__(self, *args, **kwargs):
        super().__init__()

    def forward(self, x):
        return x
```
we can get around using nn.Sequential, and `batch_norm(N, momentum=0.05)` will work. There are of course other situations this would be useful.
Thank you.
Best,
Miles
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19249
Differential Revision: D15012969
Pulled By: ezyang
fbshipit-source-id: 9f47e252137a1679e306fd4c169dca832eb82c0c
Summary:
This should have been fixed in newest ROCm version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19436
Reviewed By: ezyang
Differential Revision: D15004685
Pulled By: bddppq
fbshipit-source-id: 19fd4cca94c914dc54aabfbb4e62b328aa348a35
Summary:
This is a continuation of efforts into packed accessor awareness.
A very simple example is added, along with the mention that the template can hold more arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19464
Differential Revision: D15012564
Pulled By: soumith
fbshipit-source-id: a19ed536e016fae519b062d847cc58aef01b1b92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18092
Previously, tracing required all inputs to be either tensors,
or tuples of tensor. Now, we allow users to pass dicts as well.
Differential Revision: D14491795
fbshipit-source-id: 7a2df218e5d00f898d01fa5b9669f9d674280be3
Summary:
Strip the doc_string by default from the exported ONNX models (this string has the stack trace and information about the local repos and folders, which can be confidential).
Users can still generate the doc_string by specifying add_doc_string=True in torch.onnx.export().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18882
Differential Revision: D14889684
Pulled By: houseroad
fbshipit-source-id: 26d2c23c8dc3f484544aa854b507ada429adb9b8
Summary:
We are already using custom comparators for sorting (for a good reason), but are still making 2 sorting passes - global sort and stable sorting to bring values into their slices. Using a custom comparator to sort within a slice allows us to avoid second sorting pass and brings up to 50% perf improvement.
t-vi I know you are moving sort to ATen, and changing THC is discouraged, but #18350 seems dormant. I'm fine with #18350 landing first, and then I can put in these changes.
cc umanwizard for review.
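The same idea can be sketched in Python (purely illustrative; the actual change is a CUDA comparator in THC): a single sort with a composite key of (slice id, value) leaves every slice internally ordered, replacing a global sort plus a stable regrouping pass.

```python
def sort_within_slices(values, slice_ids, descending=False):
    """Order values within each slice in one sorting pass.

    The composite key (slice id, value) groups elements by slice and
    sorts them inside the slice at the same time.
    """
    order = sorted(
        range(len(values)),
        key=lambda i: (slice_ids[i], -values[i] if descending else values[i]),
    )
    return [values[i] for i in order]
```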
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19379
Differential Revision: D15011019
Pulled By: soumith
fbshipit-source-id: 48e5f5aef51789b166bb72c75b393707a9aed57c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19450
We want to make each operator benchmark a separate binary. Previously, the benchmark collected all operators into a single binary, which is unnecessary when we want to filter for a specific operator. This diff resolves that issue.
Reviewed By: ilia-cher
Differential Revision: D14808159
fbshipit-source-id: 43cd25b219c6e358d0cd2a61463b34596bf3bfac
Summary:
Fixes: #19253
Fixing pow(Tensor, float) is straightforward.
The breakage for pow(float, Tensor) is a bit more subtle to trigger, and fixing needs `torch.log` (`math.log` didn't work) from the newly merged #19115 (Thanks ngimel for pointing out this has landed.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19324
Differential Revision: D15003531
Pulled By: ailzhang
fbshipit-source-id: 8b22138fa27a43806b82886fb3a7b557bbb5a865
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19290
Add test cases for the supported argument types
And TODOs for some unsupported ones that we might want to support.
Reviewed By: dzhulgakov
Differential Revision: D14931920
fbshipit-source-id: c47bbb295a54ac9dc62569bf5c273368c834392c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19433
For the operator benchmark project, we need to cover a lot of operators, so the interface for adding operators needs to be very clean and simple. This diff implements a new interface for adding ops.
Here is the logic to add new operator to the benchmark:
```
long_config = {}
short_config = {}
map_func
add_test(
[long_config, short_config],
map_func,
[caffe2 op]
[pt op]
)
```
Reviewed By: zheng-xq
Differential Revision: D14791191
fbshipit-source-id: ac6738507cf1b9d6013dc8e546a2022a9b177f05
Summary:
My bad - it might be called in both variable and non-variable contexts, so it's better to just inherit variable-ness from the caller.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19400
Reviewed By: ezyang
Differential Revision: D14994781
Pulled By: dzhulgakov
fbshipit-source-id: cb9d055b44a2e1d7bbf2e937d558e6bc75037f5b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19360
We'll return the output object verbatim since it is a freeform object.
We need to find any tensors in this object, though, because we need to
figure out which parameters were used during this forward pass, to
ensure we short circuit reduction for any unused parameters.
Before this commit only lists were handled and the functionality went
untested. This commit adds support for dicts and recursive structures,
and also adds a test case.
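The recursive traversal can be sketched as follows (a torch-free illustration; `Leaf` is a hypothetical stand-in for `torch.Tensor`, and the real reducer does this on the C++ side):

```python
class Leaf:
    """Stand-in for torch.Tensor, for illustration only."""

def find_leaves(obj):
    """Collect all Leaf instances in obj, descending into lists, tuples, and dicts."""
    if isinstance(obj, Leaf):
        return [obj]
    if isinstance(obj, (list, tuple)):
        return [leaf for item in obj for leaf in find_leaves(item)]
    if isinstance(obj, dict):
        return [leaf for value in obj.values() for leaf in find_leaves(value)]
    return []  # any other object contributes no tensors
```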
Closes #19354.
Reviewed By: mrshenli
Differential Revision: D14978016
fbshipit-source-id: 4bb6999520871fb6a9e4561608afa64d55f4f3a8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19388
The old implementation forced a refcount bump when converting at::Tensor to caffe2::Tensor.
Now, it is possible to move it without a refcount bump.
Reviewed By: dzhulgakov
Differential Revision: D14986815
fbshipit-source-id: 92b4b0a6f323ed38376ffad75f960cad250ecd9b
Summary:
Attempt fix for #14057 . This PR fixes the example script in the issue.
The old behavior is a bit confusing here. What happened is that Python 2 failed to recognize that `torch.float32` is in module `torch`, so it went looking for `torch.float32` in module `__main__`. Python 3 is smart enough to handle this.
According to the doc [here](https://docs.python.org/2/library/pickle.html#object.__reduce__), `__reduce__` should return `float32` instead of the old name `torch.float32`. This way, Python 2 is able to find `float32` in the `torch` module.
> If a string is returned, it names a global variable whose contents are pickled as normal. The string returned by __reduce__() should be the object’s local name relative to its module
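A torch-free sketch of this rule (the `_DType` class and its module-level `float32` singleton below are hypothetical stand-ins for the real dtype objects):

```python
import pickle

class _DType:
    """Stand-in for a dtype singleton such as torch.float32."""
    def __init__(self, name):
        self.name = name

    def __reduce__(self):
        # Return the *local* name ("float32"), not a dotted path
        # ("torch.float32"); pickle then resolves it as a module-level
        # global, which works on both Python 2 and Python 3.
        return self.name

float32 = _DType("float32")
```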
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18045
Differential Revision: D14990638
Pulled By: ailzhang
fbshipit-source-id: 816b97d63a934a5dda1a910312ad69f120b0b4de
Summary:
Previously, to get a list of parameters, this code just put them in the reverse of the order in which they were defined, which is not always right. This PR allows parameter lists to define the order themselves. To do this, parameter lists need a corresponding function that provides the names of the parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18198
Differential Revision: D14966270
Pulled By: driazati
fbshipit-source-id: 59331aa59408660069785906304b2088c19534b2
Summary:
This PR paves the way for support more iterator types in for-in loops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19341
Differential Revision: D14992749
Pulled By: Krovatkin
fbshipit-source-id: e2d4c9465c8ec3fc74fbf23006dcb6783d91795f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19033
torch.distributed.init_process_group() has had many parameters added, but the contract isn't clear. Adding documentation, asserts, and explicit args should make this clearer to callers and more strictly enforced.
Reviewed By: mrshenli
Differential Revision: D14813070
fbshipit-source-id: 80e4e7123087745bed436eb390887db9d1876042
Summary:
Added the ">>>" Python interpreter prompt (three greater-than symbols) so that the edited lines appear as code, not comments/output, in the documentation. Normally the interpreter would display "..." when expecting a block, but I'm not sure how that would render on the PyTorch docs website. Other code examples use the ">>>" prompt as well, so I used it here too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19347
Differential Revision: D14986154
Pulled By: soumith
fbshipit-source-id: 8f4d07d71ff7777b46c459837f350eb0a1f17e84
Summary:
This PR aims to improve Transformer performance on CPU, `bmm()` is one of the major bottlenecks now.
Current logic of `bmm()` on CPU only uses MKL batch gemm when the inputs `A` and `B` are contiguous or transposed. So when `A` or `B` is a slice of a larger tensor, it falls to a slower path.
`A` and `B` are both 3D tensors. MKL is able to handle the batch matrix multiplication when `A.stride(1) == 1 || A.stride(2) == 1` and `B.stride(1) == 1 || B.stride(2) == 1`.
From [fairseq](https://github.com/pytorch/fairseq) implementation of Transformer, multi-head attention has two places to call bmm(), [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L167) and [here](https://github.com/pytorch/fairseq/blob/master/fairseq/modules/multihead_attention.py#L197), `q`, `k`, `v` are all slices from larger tensor. So the `bmm()` falls to slow path at the moment.
Results on Xeon 6148 (20*2 cores 2.5GHz) indicate this PR improves Transformer training performance by **48%** (seconds per iteration reduced from **5.48** to **3.70**), the inference performance should also be boosted.
Before:
```
| epoch 001: 0%| | 27/25337 [02:27<38:31:26, 5.48s/it, loss=16.871, nll_loss=16.862, ppl=119099.70, wps=865, ups=0, wpb=4715.778, bsz=129.481, num_updates=27, lr=4.05e-06, gnorm=9.133,
```
After:
```
| epoch 001: 0%| | 97/25337 [05:58<25:55:49, 3.70s/it, loss=14.736, nll_loss=14.571, ppl=24339.38, wps=1280, ups=0, wpb=4735.299, bsz=131.134, num_updates=97, lr=1.455e-05, gnorm=3.908,
```
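The dispatch condition above can be sketched as a pure-Python predicate over the two operands' stride tuples (illustrative only; the real check lives in the C++ `bmm` implementation):

```python
def mkl_batch_gemm_ok(a_strides, b_strides):
    """True if 3-D operands A and B can go to MKL batch gemm directly.

    Each operand needs one of its two non-batch dimensions to be
    contiguous (stride 1); the batch stride is unconstrained.
    """
    return (a_strides[1] == 1 or a_strides[2] == 1) and \
           (b_strides[1] == 1 or b_strides[2] == 1)
```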
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19338
Differential Revision: D14986346
Pulled By: soumith
fbshipit-source-id: 827106245af908b8a4fda69ed0288d322b028f08
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19287
Since we now have a string-schema-based op registration API, we can also use it when exposing caffe2 operators.
Reviewed By: dzhulgakov
Differential Revision: D14931925
fbshipit-source-id: ec162469d2d94965e8c99d431c801ae7c43849c8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19286
The operator registration API now allows registering an operator by only giving the operator name and not the full operator schema,
as long as the operator schema can be inferred from the kernel function.
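As a loose analogy in Python (the real API infers the schema from the C++ kernel's signature; the operator name, kernel, and type table below are made up for illustration):

```python
import inspect

_TYPE_NAMES = {int: "int", float: "float", bool: "bool", str: "str"}

def infer_schema(op_name, kernel):
    """Derive an operator schema string from a kernel's type annotations."""
    sig = inspect.signature(kernel)
    args = ", ".join(
        f"{_TYPE_NAMES[param.annotation]} {name}"
        for name, param in sig.parameters.items()
    )
    return f"{op_name}({args}) -> {_TYPE_NAMES[sig.return_annotation]}"

def my_kernel(x: int, scale: float) -> float:
    return x * scale
```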
Reviewed By: dzhulgakov
Differential Revision: D14931921
fbshipit-source-id: 3776ce43d4ce67bb5a3ea3d07c37de96eebe08ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19285
The either type is a tagged union with two members.
This is going to be used in a diff stacked on top to allow a function to return one of two types.
Also, generally, either<Error, Result> is a great pattern for returning value_or_error from a function without using exceptions and we could use this class for that later.
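In Python terms, the pattern looks roughly like this (a sketch, not the C++ class):

```python
class Either:
    """Tagged union holding either an error (left) or a result (right)."""

    def __init__(self, is_right, value):
        self._is_right = is_right
        self._value = value

    @classmethod
    def left(cls, error):
        return cls(False, error)

    @classmethod
    def right(cls, result):
        return cls(True, result)

    def map(self, fn):
        # Transform the result if present; errors pass through untouched.
        return Either.right(fn(self._value)) if self._is_right else self

    def value_or(self, default):
        return self._value if self._is_right else default
```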
Reviewed By: dzhulgakov
Differential Revision: D14931923
fbshipit-source-id: 7d1dd77b3e5b655f331444394dcdeab24772ab3a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19284
Instantiating a dispatch table previously only worked when the op had a tensor argument we could dispatch on.
However, the legacy API for custom operators didn't have dispatch and also worked for operators without tensor arguments, so we need to continue supporting that.
It probably generally makes sense to support this as long as there's only a fallback kernel and no dispatched kernel registered.
This diff adds that functionality.
Reviewed By: dzhulgakov
Differential Revision: D14931926
fbshipit-source-id: 38fadcba07e5577a7329466313c89842d50424f9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19283
Now that the function schema parser is available in ATen/core, we can use it from the operator registration API to register ops based on string schemas.
This does not allow registering operators based on only the name yet - the full schema string needs to be defined.
A diff stacked on top will add name based registration.
Reviewed By: dzhulgakov
Differential Revision: D14931919
fbshipit-source-id: 71e490dc65be67d513adc63170dc3f1ce78396cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19282
This is largely a hack because we need to use the function schema parser from ATen/core
but aren't clear yet on how the final software architecture should look like.
- Add function schema parser files from jit to ATen/core build target.
- Also move ATen/core build target one directory up to allow this.
We only change the build targets and don't move the files yet because this is likely
not the final build setup and we want to avoid repeated interruptions
for other developers. cc zdevito
Reviewed By: dzhulgakov
Differential Revision: D14931922
fbshipit-source-id: 26462e2e7aec9e0964706138edd3d87a83b964e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19281
String<->Number conversions aren't available in the STL used in our Android environment.
This diff adds workarounds for that so that the function schema parser can be compiled for android
Reviewed By: dzhulgakov
Differential Revision: D14931649
fbshipit-source-id: d5d386f2c474d3742ed89e52dff751513142efad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19280
We want to use the function schema parser from ATen/core, but with as little dependencies as possible.
This diff moves the function schema parser into its own file and removes some of its dependencies.
Reviewed By: dzhulgakov
Differential Revision: D14931651
fbshipit-source-id: c2d787202795ff034da8cba255b9f007e69b4aea
Summary:
A few improvements while doing bert model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19247
Differential Revision: D14989345
Pulled By: ailzhang
fbshipit-source-id: f4846813f62b6d497fbe74e8552c9714bd8dc3c7
Summary:
Op was improperly schematized previously. Evidently checkScript does not test if the outputs are the same type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19370
Differential Revision: D14985159
Pulled By: eellison
fbshipit-source-id: feb60552afa2a6956d71f64801f15e5fe19c3a91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19082
When you have just one line of deletions, just as with additions, there is no count printed.
Without this fix, we ignore all hunks with single-line deletions when selecting which lines were changed.
When all the changes in the file were single-line, this meant no line-filtering at all!
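The relevant corner of the unified-diff format can be illustrated with a small parser (a sketch for illustration, not the actual line-filtering script):

```python
import re

# In a hunk header "@@ -start[,count] +start[,count] @@", the count is
# omitted when it equals 1, so a parser must default the missing count
# to 1 rather than skip the hunk.
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")

def parse_hunk(header):
    match = HUNK_RE.match(header)
    if match is None:
        return None
    old_start, old_count, new_start, new_count = match.groups()
    return (int(old_start), int(old_count) if old_count else 1,
            int(new_start), int(new_count) if new_count else 1)
```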
Differential Revision: D14860426
fbshipit-source-id: c60e9d84f9520871fc0c08fa8c772c227d06fa27
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19362
The `float` type is never used in OnnxifiOp.
Reviewed By: bddppq
Differential Revision: D14977970
fbshipit-source-id: 8fee02659dbe408e5a3e0ff95d74c04836c5c281
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18960
empty_affine_quantized creates an empty affine quantized Tensor from scratch.
We might need this when we implement quantized operators.
Differential Revision: D14810261
fbshipit-source-id: f07d8bf89822d02a202ee81c78a17aa4b3e571cc
Summary:
As part of implicitly casting condition statements, we should be casting not expressions as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19361
Differential Revision: D14984275
Pulled By: eellison
fbshipit-source-id: f8dae64f74777154c25f7a6bcdac03cf44cbb60b
Summary:
It turns out that copying bytes is the same no matter what type
they're interpreted as, and memcpy is already vectorized on every
platform of note. Paring this down to the simplest implementation
saves just over 4KB off libtorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19198
Differential Revision: D14922656
Pulled By: resistor
fbshipit-source-id: bb03899dd8f6b857847b822061e7aeb18c19e7b4
Summary:
This adds checks for `mul_`, `add_`, `sub_`, `div_`, the most common
binops. See #17935 for more details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19317
Differential Revision: D14972399
Pulled By: zou3519
fbshipit-source-id: b9de331dbdb2544ee859ded725a5b5659bfd11d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19273
Some CI jobs fail if protobuf is not installed. Protobuf is imported as part of `caffe2.python.core`, so this adds a skip decorator to avoid running tests that depend on `caffe2.python.core`.
Reviewed By: jianyuh
Differential Revision: D14936387
fbshipit-source-id: e508a1858727bbd52c951d3018e2328e14f126be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19083
As we have discussed, there are too many AdjustBatch ops; they incur reallocation overhead and hurt performance. We will eliminate these ops by
- inlining the input adjust-batch op into Glow
- inlining the output adjust-batch op into OnnxifiOp, and doing so only conditionally.
This is the C2 part of the change and requires a change on the Glow side to work e2e.
Reviewed By: rdzhabarov
Differential Revision: D14860582
fbshipit-source-id: ac2588b894bac25735babb62b1924acc559face6
Summary:
I tried first to convert the `.bat` script to a Bash `.sh` script, but I got this error:
```
[...]/build/win_tmp/ci_scripts/test_python_nn.sh: line 3: fg: no job control
```
Line 3 was where `%TMP_DIR%/ci_scripts/setup_pytorch_env.bat` was invoked.
I found a potential workaround on Stack Overflow of adding the `monitor` (`-m`) flag to the script, but that didn't work either:
```
00:58:00 /bin/bash: cannot set terminal process group (3568): Inappropriate ioctl for device
00:58:00 /bin/bash: no job control in this shell
00:58:00 + %TMP_DIR%/ci_scripts/setup_pytorch_env.bat
00:58:00 /c/Jenkins/workspace/pytorch-builds/pytorch-win-ws2016-cuda9-cudnn7-py3-test1/build/win_tmp/ci_scripts/test_python_nn.sh: line 3: fg: no job control
```
So instead I decided to use Python to replace the `.bat` script. I believe this is an improvement in that it's both "table-driven" now and cross-platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18756
Differential Revision: D14957570
Pulled By: kostmo
fbshipit-source-id: 87794e64b56ffacbde4fd44938045f9f68f7bc2a
Summary:
Make it possible to construct a pinned-memory tensor without creating a storage first and without calling the pin_memory() function. It is also faster, as the copy operation is unnecessary.
Supported functions:
```python
torch.rand_like(t, pin_memory=True)
torch.randn_like(t, pin_memory=True)
torch.empty_like(t, pin_memory=True)
torch.full_like(t, 4, pin_memory=True)
torch.zeros_like(t, pin_memory=True)
torch.ones_like(t, pin_memory=True)
torch.tensor([10,11], pin_memory=True)
torch.randn(3, 5, pin_memory=True)
torch.rand(3, pin_memory=True)
torch.zeros(3, pin_memory=True)
torch.randperm(3, pin_memory=True)
torch.empty(6, pin_memory=True)
torch.ones(6, pin_memory=True)
torch.eye(6, pin_memory=True)
torch.arange(3, 5, pin_memory=True)
```
Part of the bigger: `Remove Storage` plan.
Now compatible with both TorchScript forms:
`_1 = torch.zeros([10], dtype=6, layout=0, device=torch.device("cpu"), pin_memory=False)`
and
`_1 = torch.zeros([10], dtype=6, layout=0, device=torch.device("cpu"))`
The same was checked for all similar functions (`rand_like`, `empty_like`, and others).
It is fixed version of #18455
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18952
Differential Revision: D14801792
Pulled By: VitalyFedyunin
fbshipit-source-id: 8dbc61078ff7a637d0ecdb95d4e98f704d5450ba
Summary:
Unit tests that hang on clock64() calls are now fixed.
test_gamma_gpu_sample is now fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19307
Differential Revision: D14953420
Pulled By: bddppq
fbshipit-source-id: efe807b54e047578415eb1b1e03f8ad44ea27c13
Summary:
I audited the relevant kernel and saw it accumulates a good deal into float
so it should be fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19293
Differential Revision: D14942274
Pulled By: zou3519
fbshipit-source-id: 36996ba0fbb29fbfb12b27bfe9c0ad1eb012ba3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19299
I saw larger than 5% performance variation with small operators; this diff aims to reduce the variation by avoiding Python overhead. Previously, the benchmark ran the main loop for 100 iterations and then looked at the time. If the result was not significant, we doubled the number of iterations and reran, continuing until it became significant. We calculated the time as total_time / number_of_iterations. The issue is that this folds multiple rounds of Python launch overhead into the measurement.
Now, I change the logic to calculate the execution time based on the last run instead of all runs; the equation is time_in_last_run / number_of_iterations.
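The revised measurement loop can be sketched as follows (illustrative, not the benchmark harness's actual code):

```python
import time

def measure(op, min_runtime=0.1, start_iters=100):
    """Double the iteration count until a run is long enough to be
    significant, then report per-iteration time from the last run only,
    so earlier (shorter) runs don't fold their Python launch overhead
    into the reported number."""
    iters = start_iters
    while True:
        begin = time.perf_counter()
        for _ in range(iters):
            op()
        elapsed = time.perf_counter() - begin
        if elapsed >= min_runtime:
            return elapsed / iters  # based on the last run only
        iters *= 2
```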
Reviewed By: hl475
Differential Revision: D14925287
fbshipit-source-id: cb646298c08a651e27b99a5547350da367ffff47
Summary:
Add input information into generated RecordFunction calls in
VariableType wrappers, JIT operators and a few more locations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18717
Differential Revision: D14729156
Pulled By: ilia-cher
fbshipit-source-id: 811ac4cbfd85af5c389ef030a7e82ef454afadec
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19170
As title
The quantized resnext3d model in production got the following failures without the fix:
```
Caffe2 operator Int8ConvRelu logging error: [enforce fail at conv_pool_op_base.h:463] order == StorageOrder::NCHW. 1 vs 2. Conv3D only supports NCHW on the production quantized model
```
Reviewed By: jspark1105
Differential Revision: D14894276
fbshipit-source-id: ef97772277f322ed45215e382c3b4a3702e47e59
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19118
A bug introduced by D14700576 and reported by Yufei (fixed by D14778810 and D14785256) was not detected by our unit tests.
This diff improves unit tests to catch such errors (with this diff and without D14778810, we can reproduce the bug Yufei reported).
This improvement also revealed a bug that affects accuracy when we pre-pack weight and bias together and the pre-packed weight/bias are used by multiple nets. We were modifying the pre-packed bias in place, even though it was supposed to be constant.
Reviewed By: csummersea
Differential Revision: D14806077
fbshipit-source-id: aa9049c74b6ea98d21fbd097de306447a662a46d
Summary:
closes #18873
Doesn't fail the build on warnings yet.
Also fix most severe shellcheck warnings
Limited to `.jenkins/pytorch/` at this time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18874
Differential Revision: D14936165
Pulled By: kostmo
fbshipit-source-id: 1ee335695e54fe6c387ef0f6606ea7011dad0fd4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18953
This removes Python side bucketing code from DistributedDataParallel
and replaces it with calls to the new C++ based bucketing and reducing
code. To confirm this is working well, we ran a test with both the
previous implementation and the new implementation, and confirmed they
are numerically equivalent.
Performance is improved by a couple percent or more, including the
single machine multiple GPU runs.
Closes #13273.
Reviewed By: mrshenli
Differential Revision: D14580911
fbshipit-source-id: 44e76f8b0b7e58dd6c91644e3df4660ca2ee4ae2
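The reducer packs gradients into capacity-limited buckets before all-reducing them. A greatly simplified, hypothetical sketch of such greedy bucketing (the function and parameter names here are illustrative; the real C++ bucket assignment logic differs):

```python
def bucket_grads(grad_numels, bucket_cap_bytes=25 * 1024 * 1024, elem_size=4):
    """Greedily group gradient indices so each bucket covers roughly
    bucket_cap_bytes of data per all-reduce call."""
    buckets, current, current_bytes = [], [], 0
    for i, numel in enumerate(grad_numels):
        nbytes = numel * elem_size
        if current and current_bytes + nbytes > bucket_cap_bytes:
            buckets.append(current)       # flush the full bucket
            current, current_bytes = [], 0
        current.append(i)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

With a tiny 80-byte cap, three 10-element float32 gradients (40 bytes each) split into two buckets: `[[0, 1], [2]]`.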
Summary:
The derivative of the Cholesky decomposition was previously a triangular matrix.
Changelog:
- Modify the derivative of Cholesky from a triangular matrix to symmetric matrix
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19116
Differential Revision: D14935470
Pulled By: ezyang
fbshipit-source-id: 1c1c76b478c6b99e4e16624682842cb632e8e8b9
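The symmetry of the new Cholesky gradient can be checked directly (this sketch uses the modern `torch.linalg.cholesky` name; the original change landed on the older `torch.cholesky` API):

```python
import torch

A = torch.randn(3, 3, dtype=torch.double)
A = A @ A.t() + 3 * torch.eye(3, dtype=torch.double)  # make A SPD
A.requires_grad_(True)
L = torch.linalg.cholesky(A)
L.sum().backward()
# After this change the gradient is a symmetric matrix, not triangular:
symmetric = torch.allclose(A.grad, A.grad.t())
```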
Summary:
This PR splits the configuration tree data from the logic used to construct the tree, for both `pytorch` and `caffe2` build configs.
Caffe2 configs are also now illustrated in a diagram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18517
Differential Revision: D14936170
Pulled By: kostmo
fbshipit-source-id: 7b40a88512627377c5ea0f24765dabfef76ca279
Summary:
The caching allocator tries to free all blocks on an out-of-memory
error. Previously, it did not free blocks that still had outstanding
stream uses. This change synchronizes on the outstanding events and
frees those blocks.
See #19219
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19222
Differential Revision: D14925071
Pulled By: colesbury
fbshipit-source-id: a2e9fe957ec11b00ea8e6c0468436c519667c558
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19218
Sync some contents between fbcode/caffe2 and xplat/caffe2 to move closer towards a world where they are identical.
Reviewed By: dzhulgakov
Differential Revision: D14919916
fbshipit-source-id: 29c6b6d89ac556d58ae3cd02619aca88c79591c1
Summary:
Enable multi-GPU tests that work with ROCm 2.2. Have been run three times on CI to ensure stability.
While there, remove skipIfRocm annotations for tests that depend on MAGMA. They still skip but now for the correct reason (no MAGMA) to improve our diagnostics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19169
Differential Revision: D14924812
Pulled By: bddppq
fbshipit-source-id: 8b88f58bba58a08ddcd439e899a0abc6198fef64
Summary:
Partially fuse layer_norm by decomposing it into the batchnorm kernel that computes the stats, then fusing the affine operations after the reduce operations. This is similar to the batchnorm fusion that apaszke did; it also only works in inference mode for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18266
Differential Revision: D14879877
Pulled By: wanchaol
fbshipit-source-id: 0197d8f2a17ec438d3e53f4c411d759c1ae81efe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18546
We'll expose all combinations of various ways of quantization in the top level dispatch key, that is we have AffineCPUTensor, PerChannelAffineCUDATensor, etc.
QTensor method added:
- is_quantized()
- item()
Differential Revision: D14637671
fbshipit-source-id: 346bc6ef404a570f0efd34e8793056ad3c7855f5
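The QTensor methods above can be exercised as follows (this sketch uses the modern `torch.quantize_per_tensor` name; the API at the time of this diff used earlier names):

```python
import torch

x = torch.tensor([0.0, 0.5, 1.0])
# Per-tensor affine quantization to uint8
q = torch.quantize_per_tensor(x, scale=0.01, zero_point=0, dtype=torch.quint8)
is_q = q.is_quantized             # True for quantized tensors
val = q.dequantize()[1].item()    # round-trips through the quantized encoding
```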
Summary:
If JIT constant propagation doesn't work, we have to handle the ListConstructor in symbolic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19102
Reviewed By: zrphercule
Differential Revision: D14875588
Pulled By: houseroad
fbshipit-source-id: d25c847d224d2d32db50aae1751100080e115022
Summary:
This PR is to fix the CI error:
```
nvidia-docker2 : Depends: nvidia-container-runtime (= 2.0.0+docker18.09.4-1) but 2.0.0+docker18.09.5-1 is to be installed
E: Unable to correct problems, you have held broken packages.
Exited with code 100
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19195
Differential Revision: D14913104
Pulled By: yf225
fbshipit-source-id: d151205f5ffe9cac7320ded3c25baa7e051c3623
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19182
This is a bug discovered by zafartahirov, right now if one of the tensor is QInt
type we'll return undefined, but actually we want to allow ops that accepts
Tensors of the same QInt type to work.
Reviewed By: zafartahirov
Differential Revision: D14909172
fbshipit-source-id: 492fd6403da8c56e180efe9d632a3b7fc879aecf
Summary:
According to https://github.com/pytorch/pytorch/issues/13638#issuecomment-468055428, after the Variable/Tensor merge, we may capture variables without autograd metadata inside an autograd function, and we need a working version counter in these cases. This PR makes it possible by moving `version_counter_` out of autograd metadata and into TensorImpl, so that variables without autograd metadata still have version counters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18223
Differential Revision: D14735123
Pulled By: yf225
fbshipit-source-id: 15f690311393ffd5a53522a226da82f5abb6c65b
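The effect is observable through the internal `_version` attribute: even a plain tensor without autograd metadata now has a working version counter that in-place ops bump.

```python
import torch

x = torch.ones(3)     # plain tensor, no autograd metadata required
v0 = x._version       # version counter now lives on TensorImpl
x.add_(1)             # an in-place op bumps the counter
v1 = x._version
```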
Summary:
Currently, a TensorImpl's `is_variable_` is true if and only if the TensorImpl has AutogradMeta. This PR unifies these two concepts by removing `is_variable_` and change `is_variable()` to check existence of AutogradMeta instead.
Removing `is_variable_` is part of the work in Variable/Tensor merge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19139
Differential Revision: D14893339
Pulled By: yf225
fbshipit-source-id: ceb5e22c3c01f79b5d21d5bdbf4a7d1bc397796a
Summary:
This PR propagates the use of first-class module objects into the compiler. This creates a transitionary state where:
* compiler.cpp creates Graphs where `self` is a Module class and attributes/parameters/buffers/submodules are looked up with `prim::GetAttr`
* GraphExecutor still runs "lowered graphs" where the self object has been removed by a compiler pass `lower_first_class_method`.
* Tracing still creates "lowered graphs", and a pass "lift_lowered_method" creates a first-class method graph for things.
* This PR separates out Method and Function. A script::Function is a pure Graph with no `self` bound. Similar to Python, a script::Method is just a bound `self` and its underlying `script::Function`.
* This PR also separates CompilationUnit from Module. A CompilationUnit is just a list of named script::Functions. Classes have a CompilationUnit holding the class methods, and Modules also have a CompilationUnit holding their Methods. This avoids the weird circular case Module --has a-> Class -> has a -> Module ...
Details:
* In this transitionary state, we maintain two copies of a Graph, first-class module and lowered. The first-class one has a self argument that is the module's class type. The lowered one is the lowered graph that uses the initial_ivalues inputs.
* When defining lowered methods using `_defined_lowered` we immediately create the first-class equivalent. The reverse is done lazily, creating lowered_methods on demand from the class.
* The two-way conversions will be deleted in a future PR when the executor itself runs first-class objects. However, this requires more changes to (1) the tracer, (2) the python bindings, and (3) the onnx export pass, and would make this PR way too large.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19167
Differential Revision: D14891966
Pulled By: zdevito
fbshipit-source-id: 0b5f03118aa65448a15c7a7818e64089ec93d7ea
Summary:
It's not intended that Storages have 'default' CUDA devices, but this is allowable via the Storage::create_legacy codepath.
This also messes with device caching, because the initial cache is obtained from the Storage, which may have a 'default' device.
Instead, we materialize a device by allocating 0 bytes via the allocator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18605
Differential Revision: D14680620
Pulled By: gchanan
fbshipit-source-id: 6d43383d836e90beaf12bfe37c3f0506843f5432
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19154
I recently saw some weird workflow errors due to an empty but set net_type. Maybe we should just fall back to simple net in this case.
Reviewed By: dzhulgakov
Differential Revision: D14890072
fbshipit-source-id: 4e9edf8232298000713bebb0bfdec61e9c5df17d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19147
After #14809 was merged there is no longer a need for getGroupRank.
Every ProcessGroup object has its own rank and size fields which are
accurate for the global group as well as subgroups.
Strictly speaking removing a function in a minor version bump is a big
no-no, but I highly doubt this was ever used outside of
`torch.distributed` itself. This will result in a compile error for
folks who have subclassed the ProcessGroup class though.
If this is a concern we can delay merging until a later point in time,
but eventually this will need to be cleaned up.
Differential Revision: D14889736
fbshipit-source-id: 3846fe118b3265b50a10ab8b1c75425dad06932d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19091
Implements a basic quantized ReLU (uint8). This is a temporary solution before using the `QTensor` type instead of the tuple.
Reviewed By: dzhulgakov
Differential Revision: D14565413
fbshipit-source-id: 7d53cf5628cf9ec135603d6a1fb7c79cd9383019
Summary:
Import MultiheadAttention into the core pytorch framework.
Users now can import MultiheadAttention directly from torch.nn.
See "Attention Is All You Need" for more details related to MultiheadAttention function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18334
Differential Revision: D14577966
Pulled By: zhangguanheng66
fbshipit-source-id: 756c0deff623f3780651d9f9a70ce84516c806d3
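With this change, a minimal self-attention usage looks like the following (shapes follow the default `(seq_len, batch, embed_dim)` layout):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4)
q = torch.randn(5, 2, 16)   # (seq_len, batch, embed_dim)
out, attn = mha(q, q, q)    # self-attention: query == key == value
```

`out` keeps the input shape, and the (head-averaged) attention weights have shape `(batch, tgt_len, src_len)`.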
Summary:
Remove pointer to nonexistent Note.
It is already removed in "Remove support for CUDNN 6 (#15851)"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19148
Differential Revision: D14891514
Pulled By: soumith
fbshipit-source-id: dd33cfefa3a21e18afae5b3992dea085adaabda8
Summary:
"""
This will verify that the func syntax follows the JIT signature schema. If you are a developer outside the core team, set this to False first to help us track unification. After your tests pass try setting this to True once and leave it set to True if it doesn't trigger any asserts. This means that your signature happens to be compliant. In general, it serves as a means of tracking an ongoing schema unification with the goal of aligning func syntax with other components of PyTorch in order to reduce overall complexity and assert coverage of all functions by each component.
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18956
Differential Revision: D14807952
Pulled By: cpuhrsch
fbshipit-source-id: 42dac49269fb3cd96dc62e0b10820d0c32c7fb0e
Summary:
* `torch.hub.list('pytorch/vision')` - show all available hub models in `pytorch/vision`
* `torch.hub.show('pytorch/vision', 'resnet18')` - show docstring & example for `resnet18` in `pytorch/vision`
* Moved `torch.utils.model_zoo.load_url` to `torch.hub.load_state_dict_from_url` and deprecate `torch.utils.model_zoo`
* We have too many env variables controlling where the cache dir is; it's not really necessary. I actually want to unify `TORCH_HUB_DIR`, `TORCH_HOME` and `TORCH_MODEL_ZOO`, but haven't done it yet. (more suggestions are welcome!)
* Simplify the `pytorch/vision` example in the doc; it was used to show how a hub entrypoint can be written, so it had some confusing unnecessary args.
An example of hub usage is shown below
```
In [1]: import torch
In [2]: torch.hub.list('pytorch/vision', force_reload=True)
Downloading: "https://github.com/pytorch/vision/archive/master.zip" to /private/home/ailzhang/.torch/hub/master.zip
Out[2]: ['resnet18', 'resnet50']
In [3]: torch.hub.show('pytorch/vision', 'resnet18')
Using cache found in /private/home/ailzhang/.torch/hub/vision_master
Resnet18 model
pretrained (bool): a recommended kwargs for all entrypoints
args & kwargs are arguments for the function
In [4]: model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True)
Using cache found in /private/home/ailzhang/.torch/hub/vision_master
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18758
Differential Revision: D14883651
Pulled By: ailzhang
fbshipit-source-id: 6db6ab708a74121782a9154c44b0e190b23e8309
Summary:
Previously, MPI process groups were created for all processes, even if
they were not part of the created group. Their MPI_Comm member field
would be MPI_COMM_NULL and they would ignore any calls. Their rank and
size were identical to that of the global process group and they had a
special groupRank and groupSize field to capture the _real_ rank.
This also meant asymmetry with other process group types, where creating
a new group would either return the process group OR
GroupMember.NON_GROUP_MEMBER. For the MPI process group, it would always
return a process group and an additional check was needed to verify
whether or not a process was indeed part of a process group or not.
This commit changes this such that every MPI process group is a valid
process group, and by extension that we no longer have to special case
MPI to determine whether or not a process is part of a group. Now, if
the value returned by `new_group` is GroupMember.NON_GROUP_MEMBER, the
process is not a member, otherwise it is.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14809
Differential Revision: D14887937
Pulled By: pietern
fbshipit-source-id: c5bf86d3b33e524cc5004ee68e30103178fa491d
Summary:
~Sometimes, `init_process_group()`, `store.get()`, and `destroy_process_group()` can take more than a few seconds. Hence, removing the thread join timeout.~
The error was due to `Address already in use` when starting TPC backend. The solution is to catch the error and report it to the `retry_on_address_already_in_use_error` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19114
Reviewed By: ezyang
Differential Revision: D14872680
Pulled By: mrshenli
fbshipit-source-id: fc504d02853ca73f76288c0ade564ab20bc01f7e
Summary:
It is done by flattening all tensor lists that are inputs/outputs to the
graph into the inputs/outputs list in the autograd graph.
This is less desirable than simply allowing IValues to exist in the
inputs/outputs of autograd::Function but it is substantially less
intrusive.
CaptureList describes the variables captured for backward in a single class.
UnpackInstructs describes how the flattened inputs to backwards are re-packed into lists.
ailzhang
This PR is also part 2 of covering maskrcnn & bert AD formulas, following #16689.
Ops added in this PR:
```
cat
index
meshgrid
reshape
split
split_with_sizes
stack
unbind
```
I will also add a few perf numbers here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16784
Differential Revision: D14104063
Pulled By: ailzhang
fbshipit-source-id: 5ceadadfd67ccaac60c5fd6740786c5354e252b9
Summary:
Previously the error message would look like:
```
Attempted to set the storage of a tensor on device cuda:0 to a storage on different device cuda. This is no longer allowed; the devices must match.
```
Now it looks like:
```
Attempted to set the storage of a tensor on device "cuda:0" to a storage on different device "cuda". This is no longer allowed; the devices must match.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19068
Reviewed By: dzhulgakov
Differential Revision: D14854257
Pulled By: gchanan
fbshipit-source-id: deb1ef73c2fcbf9338e7d67f2856282db2befac8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19085
This is a bug where input_shapes_ and output_shapes_ will grow indefinitely. Fix it here.
Reviewed By: bertmaher, rdzhabarov
Differential Revision: D14861695
fbshipit-source-id: d59116f27c3b54f5cc5a33533de4b9222dbb7afc
Summary:
Bug fix for https://github.com/pytorch/pytorch/issues/15043, where a large fusion in JIT with a large number of kernel arguments, which exceeds the limit allowed by nvrtc on a cuda device.
The fix is to check the number of arguments before a cuda kernel is generated. If the number exceeds the limit, take the runFallBack() path.
Add a reduced test from the original issue to keep the test time low. The test would fail without this fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18063
Differential Revision: D14691401
Pulled By: soumith
fbshipit-source-id: b98829bc89ed7724e91eda82ae3a5a1151af721a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18902
The fix in D14778810 had an issue: when we fall back to acc32 because the outlier density is too high, W_quantized_ has already been modified. In this diff we first just count the number of outliers (without modifying W_quantized_), and only when the density is low enough and no fallback is needed do we modify W_quantized_ and construct an outlier matrix.
Reviewed By: jspark1105
Differential Revision: D14785256
fbshipit-source-id: 03933110a4ca7409686a06b18a9bb921f8657950
Summary:
I've been messing around with vectorizing the fusion compiler in JIT, and noticed that these ops were pathologically slow. I moved them to use TensorIterator + Vec256<> and got some speed wins.
Benchmark script:
```
import torch, time
ops = ['abs', 'neg', 'reciprocal', 'frac']
x = torch.rand(1024, 1024)
NITER = 10000
print('op', 'time per iter (ms)', 'gops/s', 'GB/s', sep='\t')
for op in ops:
s = time.time()
for i in range(NITER):
getattr(x, op)()
elapsed_sec = ((time.time() - s) / NITER)
print(op, elapsed_sec * 1000, (1024*1024/elapsed_sec)/1e9, (1024*1024*4*2) / elapsed_sec / 1e9, sep='\t')
```
Before this change (on my mac with a skylake):
```
op time per iter (ms) gops/s GB/s
abs 0.9730974197387695 1.0775652866097343 8.620522292877874
neg 1.0723679780960083 0.9778136063534356 7.822508850827485
reciprocal 1.2610594034194946 0.8315040490215421 6.6520323921723366
frac 1.1681334018707275 0.8976509004200546 7.181207203360437
```
After this change:
```
op time per iter (ms) gops/s GB/s
abs 0.5031076192855835 2.084198210889721 16.673585687117768
neg 0.4433974027633667 2.3648672578256087 18.91893806260487
reciprocal 0.47145988941192624 2.2241043693195985 17.79283495455679
frac 0.5036592721939087 2.0819154096627024 16.65532327730162
```
So, after this change it looks like we are hitting machine peak for bandwidth and are bandwidth bound.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19041
Differential Revision: D14862037
Pulled By: jamesr66a
fbshipit-source-id: e2032ac0ca962dbf4120bb36812277c260e22912
Summary:
Fixes the problem of #18391
The issue is that when we codegen the ATenOp, we always generate a static number of outputs for each operator. E.g., if there's an operator from an old model that only requires two outputs, its createOperator will only allocate two output blobs, while the newer version of the operator (`unique` in this case) requires more output blobs to be allocated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18581
Differential Revision: D14865647
Pulled By: wanchaol
fbshipit-source-id: 85f63fe16d6fe408a09eca84798c7e8cab3070e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19080
OSS: add a tiny unit test utility function to create tensors given shape and data outside of any workspace. I use it in an internal test
Reviewed By: dzhulgakov
Differential Revision: D14814194
fbshipit-source-id: 6d53b235d99a97da812215f5c7f11fecad363c8c
Summary:
Changelog:
- Rename `btrisolve` to `lu_solve` to remain consistent with names of solve methods (`cholesky_solve`, `triangular_solve`, `solve`)
- Fix all callsites
- Rename all tests
- Create a tentative alias for `lu_solve` under the name `btrisolve` and add a deprecation warning to not promote usage
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18726
Differential Revision: D14726237
Pulled By: zou3519
fbshipit-source-id: bf25f6c79062183a4153015e0ec7ebab2c8b986b
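For reference, the factor-then-solve pattern behind `lu_solve` looks like this (the op was later moved again into `torch.linalg`, so this sketch uses the current `lu_factor`/`lu_solve` names rather than the ones introduced in this diff):

```python
import torch

A = torch.tensor([[4., 1., 0.],
                  [1., 3., 1.],
                  [0., 1., 2.]], dtype=torch.double)
b = torch.ones(3, 2, dtype=torch.double)

LU, pivots = torch.linalg.lu_factor(A)      # factor once
x = torch.linalg.lu_solve(LU, pivots, b)    # reuse the factorization
ok = torch.allclose(A @ x, b)
```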
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19062
As a temporary demonstration on how to extend this hack further until custom C types are ready.
Reviewed By: ezyang
Differential Revision: D14817809
fbshipit-source-id: 6eaf731e9135313eb858e178abcd9f25380ab8fe
Summary:
Closes #16520
Hi pietern, I am not sure if this is the expected way to pass timeout to `Store`, could you please help take a look? Thanks!
Questions:
1. How do I write tests for this? I wanted to do something like `test_barrier_timeout_global`, but it seems I need to set the pg's timeout larger than the `Store`'s default timeout (3 min) to see a difference, which is too long for a unit test. And I do not want to change the `Store`'s default timeout either. Any suggestion?
2. Should I also propagate timeout configuration down to `PrefixStore` in `_new_process_group_helper`?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16571
Differential Revision: D13954527
Pulled By: mrshenli
fbshipit-source-id: 77f2653903f24255207233eb298f7c0321119a87
Summary:
Fixes #18518
I changed the C++ API torch::nn::init::orthogonal_ implementation to match the Python implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18915
Differential Revision: D14851833
Pulled By: ezyang
fbshipit-source-id: 45b5e9741582777c203e9ebed564ab3ac1f94baf
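The Python behavior the C++ API was aligned with can be checked directly: for a tall matrix, `orthogonal_` produces orthonormal columns.

```python
import torch
import torch.nn as nn

w = torch.empty(5, 3)
nn.init.orthogonal_(w)        # semi-orthogonal init
gram = w.t() @ w              # should be the 3x3 identity
ok = torch.allclose(gram, torch.eye(3), atol=1e-5)
```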
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19042
show the model saving step in the log.
Reviewed By: kennyhorror
Differential Revision: D14809385
fbshipit-source-id: c7a1e50ff92bb45b16b1c501d9325b304b07fbd3
Summary:
Partial fix of: https://github.com/pytorch/pytorch/issues/394
- `gels` and `triangular_solve` now returns namedtuple
- refactor test for namedtuple API for better coverage and maintainability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17195
Differential Revision: D14851875
Pulled By: ezyang
fbshipit-source-id: 9b2cba95564269d2c3a15324ba48751d68ed623c
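With the namedtuple return, results can be accessed by field name instead of positional index (sketch; `triangular_solve` was later deprecated in favor of `torch.linalg.solve_triangular`, which returns a plain tensor):

```python
import torch

A = torch.triu(torch.ones(3, 3, dtype=torch.double)) \
    + 2 * torch.eye(3, dtype=torch.double)
b = torch.ones(3, 1, dtype=torch.double)

res = torch.triangular_solve(b, A)  # returns a namedtuple
sol = res.solution                  # field access instead of res[0]
ok = torch.allclose(A @ sol, b)
```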
Summary:
DDP does not support replicating BN layers within a process. Existing BN tests fail if the test environment has more than 8 GPUs. This is fixed by explicitly setting each process to use a single replica.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19049
Differential Revision: D14845286
Pulled By: mrshenli
fbshipit-source-id: 937dda5081d415ece48b21f2781b6b4e008dd42f
Summary:
Define `AT_CPP14_CONSTEXPR` from `constexpr` to empty on Windows with CUDA >= 9.2 as workaround.
Discussed in #18425.
When using CUDA 10.1 on Windows, I faced following errors:
~~~
D:/data/source/pytorch\c10/util/ArrayRef.h(144): error: variable in constexpr function does not have automatic storage duration
detected during instantiation of "const T &c10::ArrayRef<T>::front() const [with T=at::Tensor]"
D:/data/source/pytorch/aten/src\ATen/DeviceGuard.h(30): here
~~~
From documentation of CUDA Toolkit v10.1.105, compiler supports `constexpr` and relaxing requirements (in C++14), but compilation failed.
I suppose this could be compiler bug and require this workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18986
Differential Revision: D14821836
Pulled By: ezyang
fbshipit-source-id: 9800da2fe7291e7c09e8e5e882adebab08d83ae3
Summary:
This is a minimalist PR to add MKL-DNN tensor per discussion from Github issue: https://github.com/pytorch/pytorch/issues/16038
Ops with MKL-DNN tensor will be supported in following-up PRs to speed up imperative path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17748
Reviewed By: dzhulgakov
Differential Revision: D14614640
Pulled By: bddppq
fbshipit-source-id: c58de98e244b0c63ae11e10d752a8e8ed920c533
Summary:
Previously, when a user built PyTorch from source, but set the version string manually to be binary-formatted, it would've simply used CXX11_ABI=0 incorrectly.
We have this information available at runtime with `torch._C._GLIBCXX_USE_CXX11_ABI`, so this PR improves the situation by simply using that information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18994
Differential Revision: D14839393
Pulled By: soumith
fbshipit-source-id: ca92e0810b29ffe688be82326e02a64a5649a3ad
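The runtime query mentioned above is a one-liner:

```python
import torch

# Ask the installed libtorch which C++ ABI it was built with,
# instead of guessing from the version string:
uses_cxx11_abi = torch._C._GLIBCXX_USE_CXX11_ABI
```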
Summary:
This is a fix for issue https://github.com/pytorch/pytorch/issues/18525. The issue is related not only to ONNX export, but can manifest in other scenarios.
An existing test point in test/onnx/test_operators.py has been updated to cover this scenario as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18764
Reviewed By: zrphercule
Differential Revision: D14735166
Pulled By: houseroad
fbshipit-source-id: 5a737c648f64355929ff31eb12bd4869e744768d
Summary:
Almost there, feel free to review.
these c10 operators are exported to _caffe2 domain.
TODO:
- [x] let the onnx checker pass
- [x] test tensor list as argument
- [x] test caffe2 backend and converter
- [x] check the c10 schema can be exported to onnx
- [x] refactor the test case to share some code
- [x] fix the problem in ONNX_ATEN_FALLBACK
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18210
Reviewed By: zrphercule
Differential Revision: D14600916
Pulled By: houseroad
fbshipit-source-id: 2592a75f21098fb6ceb38c5d00ee40e9e01cd144
Summary:
I haven't had a chance to rigorously try these out yet so don't merge yet.
Closes #18725.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18963
Differential Revision: D14832897
Pulled By: ezyang
fbshipit-source-id: 4780e7a34126bc66ddbfd9d808dfc9e0edd77e68
Summary:
The ROCm compiler cannot and will not satisfy them, causing compile-time warnings.
Reason being a runtime loop trip count.
Some warnings remain arising from other parts of the ROCm stack - tickets are filed and they will be resolved within these components.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19018
Differential Revision: D14832859
Pulled By: ezyang
fbshipit-source-id: 0d66e4aebe4e56af14dd5e2967d3c374a82be25c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/19004
Handle the exceptional case where the data has min 3.40282e+38 and max -3.40282e+38
Reviewed By: jspark1105
Differential Revision: D14822193
fbshipit-source-id: b9771d1584fdf8317f5b8c7f5806be5d27314386
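The min/max pair above is the classic sentinel for "no data observed" (min initialized to +FLT_MAX, max to -FLT_MAX). A hypothetical sketch of the check (names are illustrative, not from the diff):

```python
FLT_MAX = 3.40282e+38

def sanitize_minmax(mn, mx):
    # An untouched histogram reports min=+FLT_MAX, max=-FLT_MAX;
    # min > max signals that no data was ever observed.
    if mn > mx:
        return 0.0, 0.0
    return mn, mx
```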
Summary:
1) sparse_dispatches in native_parse was not used anymore, got rid of it.
2) got rid of overloaded sizes_ in SparseTensorImpl, which just uses the base implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18962
Differential Revision: D14811545
Pulled By: gchanan
fbshipit-source-id: 2fa60ef50456b5f605caa63beae1d8d2542fd527
Summary:
* Annotate also two pass reduction with launch bounds
* ifdef some shortcomings of ROCm w.r.t. short-circuit returns - internal tickets filed
* while there, plug memory leak by destroying matrix descriptor after the sparse call (applicable to cuSPARSE)
* while there, fix types for cusparseXcoo2csr as per cuSPARSE documentation
* enable test_dsmm in test_sparse which now passes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18985
Differential Revision: D14822009
Pulled By: bddppq
fbshipit-source-id: 757267a47a63ee56ef396c33059f7eca099f4833
Summary:
Make sure to check if profiler is disabled in push/pop and mark event
functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18908
Differential Revision: D14791931
Pulled By: ilia-cher
fbshipit-source-id: e4f5149e69999ee2b9238c21cccad6d27c6a714a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18848
The Observer Module is based on eager mode compute qparam implementation.
Goal is to validate QParam result for EagerMode and Script Mode for simple
model
Observer stats are collected and qparam computed only for activations only at this point
Reviewed By: zafartahirov
Differential Revision: D14720805
fbshipit-source-id: cb2f321b4b9927b37905fdb8eb55c5610d41b351
Summary:
Dear All,
The proposed patch fixes the test code snippets used in the cmake infrastructure and the resulting failure to properly set the ```CAFFE2_COMPILER_SUPPORTS_AVX2_EXTENSIONS``` flag. Without it, libcaffe2.so will have some ```UND``` avx2-related references, rendering it unusable.
* Using GCC 9 test code from cmake build infra always fails:
```
$ gcc -O2 -g -pipe -Wall -m64 -mtune=generic -fopenmp -DCXX_HAS_AVX_1 -fPIE -o test.o -c test.c -mavx2
test.c: In function ‘main’:
test.c:11:26: error: incompatible type for argument 1 of ‘_mm256_extract_epi64’
11 | _mm256_extract_epi64(x, 0); // we rely on this in our AVX2 code
| ^
| |
| __m256 {aka __vector(8) float}
In file included from /usr/lib/gcc/x86_64-redhat-linux/9/include/immintrin.h:51,
from test.c:4:
/usr/lib/gcc/x86_64-redhat-linux/9/include/avxintrin.h:550:31: note: expected ‘__m256i’ {aka ‘__vector(4) long long int’} but argument is of type ‘__m256’ {aka ‘__vector(8) float’}
550 | _mm256_extract_epi64 (__m256i __X, const int __N)
|
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/9/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-redhat-linux
Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 9.0.1 20190328 (Red Hat 9.0.1-0.12) (GCC)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18991
Differential Revision: D14821838
Pulled By: ezyang
fbshipit-source-id: 7eb3a854a1a831f6fda8ed7ad089746230b529d7
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/18815 and https://github.com/pytorch/pytorch/pull/18811.
This makes it so that we emit a higher-precision literal for float values in the fusion kernel, as well as assign that to a `double` variable. This prevents us from losing precision for values such as `pi`, but with the previous fixes this will also get downcasted to `float` if downstream operations require it. Therefore, we should not lose performance because of implicit promotions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18817
Differential Revision: D14820842
Pulled By: jamesr66a
fbshipit-source-id: 519671c6ca5e7adac746a4c4c72760a6d91e332f
Summary:
Closes#18382
Please let me know if any changes are required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18787
Differential Revision: D14821147
Pulled By: soumith
fbshipit-source-id: edd98eab1b3f4151c4ae5148146435ddb2ae678d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18974
When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.
Reviewed By: bddppq
Differential Revision: D14814067
fbshipit-source-id: aec917322de695e283f0aca1e930c5603d196404
Summary:
Move these 2 ops back to autodiff to unblock xla CI.
I will leave them for my next PR to cleanup symbolic_variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18969
Differential Revision: D14816811
Pulled By: ailzhang
fbshipit-source-id: dd8a7e133dcad29560d3d1d25691883960117299
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18713
- Quantizer observer node output is hooked up to following node
which mutates the naming for input/output. This is not desired and
required because observer op can be a sync node
- Quantizer is aimed for quantizing tensors so we should insert observer
op for Values that are type tensor
Reviewed By: zafartahirov
Differential Revision: D14715916
fbshipit-source-id: feca04c65a43103b46084d3548998498b19ee599
Summary:
Stacked on https://github.com/pytorch/pytorch/pull/18811
This makes it so that we only emit the *f variants of math functions if the output value's type is FloatTensor, otherwise we call the double variants to prevent loss of precision. This fixes more numerical issues
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18815
Differential Revision: D14816965
Pulled By: jamesr66a
fbshipit-source-id: 464be644168875ede987142281fb2168f4041e81
Summary:
Thanks to dusty-nv , we now have Stable and Weekly wheels provided for the NVIDIA Jetson Platform. They require JetPack 4.2.
He's also maintaining source build instructions.
This PR adds links to the binaries and source build instructions to the README.
The links are dynamic, so when new stable / weekly wheels are available, Dustin will update the same URL to point to the new files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18990
Differential Revision: D14820158
Pulled By: soumith
fbshipit-source-id: 761a56557decb72ad9c1b9f8a2745667f558eec3
Summary:
- Quantizer pass to mutate IR by inserting quant-dequant nodes
before and after nodes which support quantized ops. This information
will be used by jit compiler to substitute with quantized ops
- This currently covers simple model. It will be expanded later
for subgraph pattern matching to cover more complex patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18446
Differential Revision: D14592265
Pulled By: nishantpdce
fbshipit-source-id: c9ba6c12aa96cb9c117826e386721eec83a55ea6
Summary:
Trainer was removed a long time ago.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18980
Differential Revision: D14819855
Pulled By: ezyang
fbshipit-source-id: f62020e688ebf6663416aec7435bf1f531607941
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18469
ghimport-source-id: 73cb8b58f43f10b1dcfca805fd5b25c4fa977632
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18469 Create Object that represents a Module**
* #18468 slots with explicit value/setValue make more sense in future patches
* #18467 Make Object hold its ClassType
* #18379 Enforce single parent for script submodules
* #18378 Unify namespace of script::Module
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
This changes the underlying storage for script::Module to hold
a ivalue::Object which has slots for all the parameters and attributes.
NamedIValue and Slot are now merged together into one class Slot that stores
the tuple (ivalue::Object, offset) and can be used to read the name, type,
or value of the slot and also to set the value. This cleans up a bunch
of client uses.
This PR does not actually use the module object in any generated code.
A future PR will switch how code is generated to treat modules as
first class.
Differential Revision: D14613508
fbshipit-source-id: d853a7559f58d244de2ef54a781427fcd1060ed0
Summary:
Fixes https://github.com/pytorch/pytorch/issues/10654
The issue is that in tracing `.size` returns an int tensor, and when an int tensor is multiplied by a scalar, the int type dominates and the scalar gets cast to 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18875
Differential Revision: D14814441
Pulled By: eellison
fbshipit-source-id: a4e96a2698f2fcbf3ec4b2bb4c43a30250f30ad9
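The failure mode can be mimicked in plain Python (a hypothetical sketch of the old promotion rule, not the tracer's actual code): when the scalar is cast to the tensor's dtype before the multiply, a fractional scalar truncates to 0.

```python
def mul_with_old_promotion(tensor_val, tensor_dtype, scalar):
    # Old behaviour: the int tensor's dtype dominates, so the
    # scalar is cast to int first -- 0.5 truncates to 0.
    if tensor_dtype == "int":
        scalar = int(scalar)
    return tensor_val * scalar

print(mul_with_old_promotion(4, "int", 0.5))    # 0 (wrong)
print(mul_with_old_promotion(4, "float", 0.5))  # 2.0
```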
Summary:
This adds a C++ function `debugGetFusedKernelCode` as well as a Python binding `_jit_fuser_get_fused_kernel_code` that will, given a FusionGroup graph and a set of specified inputs, return the compiled kernel source code. We can then check the contents of this source code for verification of the fuser codegen backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18884
Differential Revision: D14795508
Pulled By: jamesr66a
fbshipit-source-id: 8f6e9dd13ebbb517737d893b0b5f5e9aa06af124
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18468
ghimport-source-id: d4b41c521f2269a695e03c8e7d05d5542731ee48
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18469 Create Object that represents a Module
* **#18468 slots with explicit value/setValue make more sense in future patches**
* #18467 Make Object hold its ClassType
* #18379 Enforce single parent for script submodules
* #18378 Unify namespace of script::Module
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
Reviewed By: suo
Differential Revision: D14613509
fbshipit-source-id: 9f2208d0efd01465c78cebdc3e8365a9e0adf9ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18467
ghimport-source-id: d51bdd64d2529d08c634c58df1a0870b54ad49fb
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18469 Create Object that represents a Module
* #18468 slots with explicit value/setValue make more sense in future patches
* **#18467 Make Object hold its ClassType**
* #18379 Enforce single parent for script submodules
* #18378 Unify namespace of script::Module
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
Currently it holds a symbol whose unqualified name is the name of the
class. This will get confusing when there are multiple possible registries,
and it makes getting the class type from the object difficult.
The pointer to the class is only 4 more bytes so this patch just puts
it in the object.
Reviewed By: suo
Differential Revision: D14613510
fbshipit-source-id: b35175ba4be83d2522deaa6dad5070d6ec691fed
Summary:
I have experienced that sometimes both were in `__dict__`, but it chose to copy `probs`, which loses precision compared to `logits`. This is especially important when training (Bayesian) neural networks or doing other types of optimization, since the loss is heavily affected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18614
Differential Revision: D14793486
Pulled By: ezyang
fbshipit-source-id: d4ff5e34fbb4021ea9de9f58af09a7de00d80a63
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18878
When the weight is prepacked and it doesn't contain a prepacked weight for acc32, we shouldn't fall back to acc32.
TODO: add unit tests with better coverage
Reviewed By: feiyu1990
Differential Revision: D14778810
fbshipit-source-id: d49a8c4b7c815ab29b77feb53ee730ad63780488
Summary:
The C++ and CUDA implementations of the lerp are not numerically stable. This is discussed on Wikipedia [here](https://en.wikipedia.org/wiki/Linear_interpolation#Programming_language_support). I checked the GPU SASS output and there's no overhead from using the more precise implementation, from Kepler all the way to Turing. I haven't looked at CPU ASM though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18871
Differential Revision: D14793438
Pulled By: ezyang
fbshipit-source-id: 2ddc2e026c5285466cae7d1b4101174253100445
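The stability issue is the one described on the Wikipedia page linked above: the one-multiply form `a + t*(b - a)` is not exact at `t == 1`, while the two-multiply form `(1 - t)*a + t*b` is. A minimal sketch with scalar floats:

```python
def lerp_naive(a, b, t):
    # One fewer multiply, but not guaranteed to return b at t == 1.
    return a + t * (b - a)

def lerp_precise(a, b, t):
    # Returns b exactly when t == 1 (the numerically stable form).
    return (1 - t) * a + t * b

a, b = 1e16, 1.0
print(lerp_naive(a, b, 1.0))    # 0.0 -- catastrophic cancellation
print(lerp_precise(a, b, 1.0))  # 1.0
```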
Summary:
As discussed with gchanan we should deduplicate symbolic_variable and symbolic_script to prepare for the future merge with derivatives.yaml.
This PR moves most easy formulas to symbolic_script.
TODO: run benchmarks to make sure no perf regression
cc: apaszke zdevito wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17986
Differential Revision: D14766412
Pulled By: ailzhang
fbshipit-source-id: d95a3f876e256c0f505779a71587c985571d3b8f
Summary:
This PR:
* pulls four distinct installation steps out of `build_pytorch.bat` and into their own scripts.
* eliminates the copy step for helper scripts called by `win-build.sh` and `win-test.sh`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18917
Differential Revision: D14807236
Pulled By: kostmo
fbshipit-source-id: 03e91a5834dfd6d68903ad9725eacc099bbf6d53
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18791
As a temporary demonstration on how to extend this hack further until custom C types are ready.
Reviewed By: jamesr66a
Differential Revision: D14742020
fbshipit-source-id: 0f2fd83ae56ab2abe16977a1829ed421e6abe74b
Summary:
Looks like the issue of using `std::` functions is fixed in the new ROCm version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18905
Differential Revision: D14792943
Pulled By: bddppq
fbshipit-source-id: af11acbb85872943f23b6e55415db1f0699e7b8f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18711
ghimport-source-id: c9caedc0660b2b7ba3730cd0e1a2e0e9c3cf422b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18711 [jit] fix side-effects and aliasing for custom ops**
Previously we didn't track aliasing, mutation, or side effects for
custom ops. This PR adds in guards with the most conservative
assumptions possible: the op will
1) have side effects,
2) write to everything
3) produce a wildcard.
In order to tell whether a given operator is a custom op, this PR introduces
the concept of a "reserved" namespace (basically all our builtin namespaces).
Custom ops live in non-reserved namespaces, so a check on the namespace
is sufficient to tell whether a schema/node is "custom" or not.
This is just to get things correct for now. Follow-ups to this:
- Users should be able to specify aliasing/mutability without having to learn
the whole alias annotation schema.
- Relax assumptions a bit. In particular outputs can only alias input tensors,
they don't have to be wildcards.
Fixes #18490
Differential Revision: D14730978
fbshipit-source-id: 540b47a24ccf24145051609bdcc99c97e46e0fe0
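The namespace check can be sketched in a few lines (the reserved set below is illustrative, not the exact list used in the PR):

```python
RESERVED_NAMESPACES = {"aten", "prim", "onnx"}  # illustrative set

def is_custom_op(qualified_name):
    # Builtin ops live in reserved namespaces such as aten::;
    # anything else is treated as a custom op and gets the
    # conservative aliasing/side-effect annotations.
    namespace = qualified_name.split("::", 1)[0]
    return namespace not in RESERVED_NAMESPACES

print(is_custom_op("aten::add"))     # False
print(is_custom_op("my_ops::warp"))  # True
```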
Summary:
Expand the list of ops that resize an input in-place to include broadcasting ops and other ops that affect shape. Whoever is reviewing this PR: could you please look through PyTorch's in-place ops and see if I missed any?
Expanding the PR from: https://github.com/pytorch/pytorch/pull/17518
This is already being tested in test_resize_input_ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18812
Differential Revision: D14793410
Pulled By: eellison
fbshipit-source-id: 125f4f5375ac1036fb96fabc9da2aaccc9adc778
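For the broadcasting ops added here, the output shape an in-place resize must account for follows NumPy-style broadcasting rules; a minimal sketch of the shape computation:

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    # Align trailing dimensions; a size-1 dim stretches to match.
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != 1 and y != 1 and x != y:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(max(x, y))
    return tuple(reversed(out))

print(broadcast_shape((3, 1), (1, 4)))  # (3, 4)
print(broadcast_shape((5,), (2, 5)))    # (2, 5)
```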
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18845
This adds a few CPU only test cases for the reducer class.
Reviewed By: mrshenli
Differential Revision: D14768432
fbshipit-source-id: c008a52206826304e634a95bc14167ed94c97662
Summary:
Fix a few instances of notifying on a condition variable while holding the lock, so that the lock is released before notifying. This avoids an extra thread suspension when the notified thread tries to grab the lock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18857
Differential Revision: D14779132
Pulled By: resistor
fbshipit-source-id: b18a05c4c15be1426ebfdffac1c8f002b771cfd7
Summary:
`scripts/build_windows.bat` is the original way to build Caffe2 on Windows, but since Caffe2 is merged into libtorch, the build scripts should be unified; they actually do the same thing apart from a few different flags.
The follow-up is to add the tests. Looks like the CI job for caffe2 windows is defined [here](https://github.com/pytorch/ossci-job-dsl/blob/master/src/jobs/caffe2.groovy#L906). Could we make them a separate file, just like what we've done in `.jenkins/pytorch/win-build.sh`? There's a bunch of things we can do there, like using ninja and sccache to accelerate build.
cc orionr yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18683
Differential Revision: D14730188
Pulled By: ezyang
fbshipit-source-id: ea287d7f213d66c49faac307250c31f9abeb0ebe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18833
ghimport-source-id: 6f2be25fcc5e6be3ffe20582e604bd2c1fbab66b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18833 [STACK] Cache device on TensorImpl; clean up TensorImpl constructors.**
* #18832 [STACK] Disallow changing the device of a tensor via set_.
* #18831 [STACK] Stop swapping in Storages of the wrong device for Tensors.
1) We cache device on TensorImpl. This means we can access the device without a virtual function and allows us to more easily extend TensorImpls (because they don't need to figure out how to store the Device for themselves).
2) Clean up TensorImpl APIs. We had a constructor that took a TensorTypeId and an allocator and would allocate a Storage based on the recognized types of TensorTypeIds. Instead, we just have two different constructors: one for types with a storage, one without.
Reviewed By: dzhulgakov
Differential Revision: D14766230
fbshipit-source-id: 745b8db84dcd6cb58f1a8675ad3ff8d033bc50df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18912
We intentionally test a deprecated API, no need to show the warnings here.
Reviewed By: dzhulgakov
Differential Revision: D14792617
fbshipit-source-id: 9ea2a4106d566064283726eed2c274b98f49a2e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18716
Might be useful as an intermediate stage for some systems that currently use Caffe2 nets as an execution mechanism.
Not sure it's a good idea altogether, please comment.
Limitations:
- only Tensor types as inputs/outputs
- the entire module is serialized as a zip archive inside a proto in Caffe2 db, it'd be subject to 4Gb limit and is likely very slow. For small models it'd work though.
- no autograd, though it can be attached in principle
- no way to retrieve parameters inside the script module from C2 runtime perspective (though they potentially can be alias-fetched and stored as individual blobs)
- after deserialization, the Python wrappers returned don't have the correct type (as we don't do the module_lookup trick)
Build-wise, I had to add dependency from pybind_state to libtorch.so. I don't think we build Caffe2 python frontend independently anymore, so it should be fine.
Reviewed By: amirshim, houseroad
Differential Revision: D14339599
fbshipit-source-id: 88a37a8abd1f1c4703e5ef937031f222535d4080
Summary:
This refactor lets us track the types of initial values added onto a `Method`. The main motivation for this is the change in `module.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18252
Differential Revision: D14673459
Pulled By: driazati
fbshipit-source-id: 21200180c47f25bb70898771adfb569856e6c34a
Summary:
Replace link to a file in a private repo with a gist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18852
Reviewed By: ezyang
Differential Revision: D14778481
Pulled By: izdeby
fbshipit-source-id: 8389aa4bf115ddcfd85079cc2c861404efa678e7
Summary:
Return `missing_keys` and `unexpected_keys` from `load_state_dict` so the user can handle them when strict mode is off; also removed an unused variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18668
Differential Revision: D14782073
Pulled By: ezyang
fbshipit-source-id: ab3b855eb77bb7422594d971988067e86eef20f2
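The semantics of the returned keys can be illustrated with a plain-dict analogue (a sketch of the behavior, not PyTorch's implementation):

```python
def load_state_dict_like(model_keys, incoming, strict=True):
    # Keys the model expects but the checkpoint lacks:
    missing = [k for k in model_keys if k not in incoming]
    # Keys the checkpoint carries that the model doesn't know:
    unexpected = [k for k in incoming if k not in model_keys]
    if strict and (missing or unexpected):
        raise RuntimeError(f"missing: {missing}, unexpected: {unexpected}")
    return missing, unexpected

missing, unexpected = load_state_dict_like(
    ["conv.weight", "fc.bias"],
    {"conv.weight": 1, "extra.bias": 2},
    strict=False)
print(missing)     # ['fc.bias']
print(unexpected)  # ['extra.bias']
```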
Summary:
Tested by running the script in #16562 , and there was no error.
Then:
```
>>> print(mat.grad)
tensor([[1., 2., 3.],
        [1., 2., 3.],
        [1., 2., 3.]])
```
which is correct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18737
Differential Revision: D14773078
Pulled By: umanwizard
fbshipit-source-id: 8aa36eb6f6aa104263a467d9ac91d61b3bfd05f5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18595
There is no need to force the backend to be the same as the global
process group, as long as the backend is "nccl" or "gloo".
Reviewed By: mrshenli
Differential Revision: D14657204
fbshipit-source-id: 868817b9f219e3be8db0761a487f0027ed46663b
Summary:
This was causing some numerical issues in the fuser
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18811
Differential Revision: D14767390
Pulled By: jamesr66a
fbshipit-source-id: f1123d1aab5501abad850d2edc996f8aa8dafe04
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18832
ghimport-source-id: fde4ad90541ba52dfa02bdd83466f17e6541e535
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18833 [STACK] Cache device on TensorImpl; clean up TensorImpl constructors.
* **#18832 [STACK] Disallow changing the device of a tensor via set_.**
* #18831 [STACK] Stop swapping in Storages of the wrong device for Tensors.
This is necessary to cache the device on a TensorImpl.
Differential Revision: D14766231
fbshipit-source-id: bba61634b2d6252ac0697b96033c9eea680956e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18831
ghimport-source-id: 2741e0d70ebe2c2217572c3af54ddd9d2047e342
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18833 [STACK] Cache device on TensorImpl; clean up TensorImpl constructors.
* #18832 [STACK] Disallow changing the device of a tensor via set_.
* **#18831 [STACK] Stop swapping in Storages of the wrong device for Tensors.**
This is necessary to support device caching, see https://github.com/pytorch/pytorch/pull/18751 and https://github.com/pytorch/pytorch/pull/18578.
In library code, we potentially swap in Storages with the wrong device when device_guard is False. This happens as follows with "view-like" operations.
1) We allocate a tensor on the 'wrong' device (because device_guard is false).
2) We swap out the 'wrong' storage with the 'right' storage using e.g. THCTensor_setStorage.
Instead, we can just construct the Tensor with the correct Storage from the beginning. This is what we do with 'view'.
Note there are two other "view-like" cases where this happens:
1) unfold
2) set_()
Because these aren't performance critical, I just added the device_guard instead of applying the above correction.
For completeness, this also includes a test that all `device_guard: false` functions behave properly under these conditions.
Reviewed By: dzhulgakov
Differential Revision: D14766232
fbshipit-source-id: 0865c3ddae3f415df5da7a9869b1ea9f210e81bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17991
changes:
- Breaks BC: `Tensor::type()` now returns `DeprecatedTypeProperties&` rather than `Type&`.
- Added `DeprecatedTypeProperties`; it serves as a temporary replacement for `Type` as the return value of `Tensor::type()`. This contributes to making `Type` just for dispatch purposes so that we can make it dtype agnostic.
- `Tensor::dispatch_type()` now returns `Type&` like `Tensor::type()` used to.
- Changed call sites of `Tensor::type()` appropriately.
Reviewed By: ezyang
Differential Revision: D14443117
fbshipit-source-id: 239ccb7a09626279a71d1a37f8f82e7f57bf7d9e
Summary:
DLPack can have non-strided tensors, which is represented by a nullptr in the place of dl_tensor.strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18510
Differential Revision: D14647328
Pulled By: bwasti
fbshipit-source-id: 5364282810a5772cfc2319fc8133fe86fdd84dd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18838
It turns out that we don't have shape inference function of `Split` op at all. This diff adds that.
Reviewed By: bertmaher
Differential Revision: D14766871
fbshipit-source-id: 535cb4f24bdada603c76579e00e7a39aee93e19f
Summary:
Since `parameter.data` creates a new `torch.Tensor` each time, `_unique_state_dict` currently returns duplicate tensors. Try to deduplicate before creating the new tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18139
Reviewed By: dzhulgakov
Differential Revision: D14511262
Pulled By: houseroad
fbshipit-source-id: cb69795d0b6509721220650bbb19edeb3459a503
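Identity-based deduplication can be sketched as follows (a hypothetical helper, not the PR's actual code): two names that alias the same underlying object should yield the object only once.

```python
def unique_state_dict(state):
    # Deduplicate by object identity, keeping the first name seen,
    # so aliased parameters are not serialized twice.
    seen = set()
    out = {}
    for name, tensor in state.items():
        if id(tensor) in seen:
            continue
        seen.add(id(tensor))
        out[name] = tensor
    return out

shared = [1.0, 2.0]  # stands in for a tied weight tensor
state = {"embed.weight": shared, "decoder.weight": shared, "bias": [0.0]}
print(list(unique_state_dict(state)))  # ['embed.weight', 'bias']
```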
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17026
D14013931 was for FC. This diff is similar optimizations for Conv.
A subtle difference is that in FC, once we fold col_offset into bias during pre-processing step, we can treat everything as if A_zero_offset == 0 (symmetric quantization of A).
In Conv, we can't do this because padding still needs to use the original A_zero_offset.
From requantization point of view, once col_offset folded into bias, we can treat as if we're doing symmetric A quantization.
But, for steps involving padding like im2col, im2col fused with packing, and direct conv for depth-wise/group convolution we still need to pass the original A_zero_offset.
Reviewed By: jianyuh
Differential Revision: D14020276
fbshipit-source-id: c29caefd1127bbc6aff0e9d535939bb0c1ecb66c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18826
ghimport-source-id: 7ffa3bc7ef7402a6d6eb6ba5849e197019d77bf8
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18826 [jit] run cpp tests for non-cuda builds in test_jit.py**
We did all the work of nicely separating our cpp tests that don't require
CUDA, but they aren't run from test_jit.py if CUDA is missing.
Reviewed By: ZolotukhinM
Differential Revision: D14766287
fbshipit-source-id: 9326b3a5c90f6c20fc8cfaf1a1885a363b91f30a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18379
ghimport-source-id: 9895ecc1ff7897e98853dc00675341f36726e7c7
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18379 Enforce single parent for script submodules**
* #18378 Unify namespace of script::Module
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
The assumption that a ScriptModule has a single parent is present in
our serialization format, and likely a few other places. It is not
enforced on creation of script module hierarchies though, meaning that
problems (e.g. replicating a module twice in the output format)
will not be caught until much later in the development cycle.
This patch enforces the property when a submodule is registered.
It also removes NamedModule since it is no longer necessary in this regime.
This will also allow easy discovery of a module's fully-qualified name
without needing to traverse the Module hierarchy.
Differential Revision: D14603722
fbshipit-source-id: 63ab5d0cccf7d66c7833e0adf9023024ca9607cb
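Enforcing the invariant at registration time can be sketched like this (an illustrative toy, not the script::Module implementation):

```python
class Module:
    def __init__(self):
        self._parent = None
        self._submodules = {}

    def register_module(self, name, child):
        # Enforce the single-parent invariant at registration time,
        # so duplicate placement fails early rather than at save time.
        if child._parent is not None:
            raise RuntimeError(f"{name!r} already has a parent module")
        child._parent = self
        self._submodules[name] = child

root_a, root_b, leaf = Module(), Module(), Module()
root_a.register_module("leaf", leaf)
try:
    root_b.register_module("leaf", leaf)  # second parent -> rejected
except RuntimeError as e:
    print(e)
```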
Summary:
Per our offline discussion, allow Tensors, ints, and floats to be casted to be bool when used in a conditional
Fix for https://github.com/pytorch/pytorch/issues/18381
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18755
Reviewed By: driazati
Differential Revision: D14752476
Pulled By: eellison
fbshipit-source-id: 149960c92afcf7e4cc4997bccc57f4e911118ff1
Summary:
Fix the layernorm formula when weight and bias passed in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18233
Differential Revision: D14760375
Pulled By: wanchaol
fbshipit-source-id: d6bd3b137bc04c391aa5c24d021d1f811ba2a877
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18378
ghimport-source-id: 55c29bb436a2153d29ff2f4488d99d8863c187b1
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18379 Enforce single parent for script submodules
* **#18378 Unify namespace of script::Module**
* #18314 Add ability to specialize class types to ArgumentSpec
* #18226 Add Slot type to abstract the raw pointers being used for slots.
This removes individual OrderedDicts in favor of a single unified
namespace for all things in a script::Module. This removes a whole
class of bugs where both a method and a parameter could get the
same name, for instance.
Since we no longer have to expose OrderedDict::Item objects, a lot of
downstream code can be simplified.
We no longer now double-store names (both in the key of the dictionary,
and in the object itself).
Differential Revision: D14603723
fbshipit-source-id: b5f7551b3074679623edd6ea70269830353b4d4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18648
ghimport-source-id: 1cf4a8fe91492621e02217f38cae5d7e0699fb05
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18661 Step 7: remove _unique
* #18655 Step 6: Rename _unique2 to unique and add int? dim
* #18654 Step 5: remove _unque_dim in favor of unique_dim
* #18651 Step 4: add support for unique with dim=None
* #18650 Step 3: Add support for return_counts to torch.unique for dim not None
* #18649 Step 2: Rename _unique_dim2_temporary_will_remove_soon to unique_dim
* **#18648 Step 1: Secretly add return_counts to unique, and refactor unique_dim for performance**
`unique` is fragile. Previously I tried to change it in #18391 and #17097; they all passed OSS tests but were finally reverted due to internal failures. My previous work of refactoring unique, #18459, is based on #18391, and after #18391 got reverted, I could not work on #18459. To continue working on #18459, #18391, and #17097 without worrying about internal failures, I am suggesting the following steps for the improvements of `unique` and `unique_dim`. soumith Please take this and there is no need to put #18391 back.
The motivation is basically to move forward as much as possible without causing any internal failures. So I will try to divide it into steps and sort from low probability of internal failure to high probability. (I don't know what the internal failure is, so I have to guess.) Let's merge this PR stack one by one until we encounter an internal failure.
Step 1: Create two new ATen operators, `_unique2_temporary_will_remove_soon` and `_unique_dim2_temporary_will_remove_soon`, and keep `_unique` and `_unique_dim` unchanged. The backends of these two functions and of `_unique` and `_unique_dim` are all the same; the only difference is that the temporary ones support `return_counts` while `_unique` and `_unique_dim` do not. Step one is mostly #18391 + #18459. The cuda8 errors have been fixed. At this point, there is no user visible API change, so no docs are updated. `torch.unique` does not support `return_counts` yet, and `return_counts` is tested through the newly added temporary operators. This step just added two new ATen operators, so there shouldn't be any internal failure.
Step 2: Rename `_unique_dim2_temporary_will_remove_soon` to `unique_dim`. This should cause no internal failure either, because there is no change to existing operators. The only thing to worry about is deleting `unique_dim` from the Python side, because we don't want users to use it. At this point, C++ users now have `return_counts` support for `unique_dim`.
Step 3: Update the docs of `torch.unique` and use `unique_dim` inside `torch.unique` to support `return_counts`. In the docs, we should say that `torch.unique` with None dim does not support `return_counts` yet. This might cause internal failure.
Step 4: Rename `_unique2_temporary_will_remove_soon` to `_unique2` and use `_unique2` inside `torch.unique` to support `return_counts`. Update the docs saying that `torch.unique` with None dim now support `return_counts`. This might cause internal failure.
Step 5: Remove `_unique_dim`. This might cause internal failure.
Step 6: Rename `_unique2` to `unique`, add an optional `dim` argument to make it look like the signature of Python's `torch.unique`. Inside `torch.unique`, use `unique` and get rid of `unique_dim`. Unbind `unique_dim` totally from Python at codegen. This is likely to cause internal failure.
Step 7: Remove `_unique`. This is very likely to cause internal failure.
This PR
======
This PR is for step 1. This create two new ATen operators, `_unique2_temporary_will_remove_soon` and `_unique_dim2_temporary_will_remove_soon` and implement `return_counts` inside them and do refactor for performance improvements.
Please review ngimel VitalyFedyunin. They are mostly copied from #18391 and #18459, so the review should be easy.
Below is a benchmark on a tensor of shape `torch.Size([15320, 2])`:
Before
---------
```python
print(torch.__version__)
%timeit a.unique(dim=0, sorted=True, return_inverse=False); torch.cuda.synchronize()
%timeit a.unique(dim=0, sorted=True, return_inverse=True); torch.cuda.synchronize()
```
```
1.0.1
192 µs ± 1.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
548 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
print(torch.__version__)
%timeit a.unique(sorted=True, return_inverse=False); torch.cuda.synchronize()
%timeit a.unique(sorted=True, return_inverse=True); torch.cuda.synchronize()
```
```
1.0.1
226 µs ± 929 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
302 µs ± 7.06 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
After
-------
```python
print(torch.__version__)
%timeit a.unique(dim=0, sorted=True, return_inverse=False); torch.cuda.synchronize()
%timeit a.unique(dim=0, sorted=True, return_inverse=True); torch.cuda.synchronize()
%timeit torch._unique_dim2_temporary_will_remove_soon(a, dim=0, sorted=True, return_inverse=False, return_counts=True); torch.cuda.synchronize()
%timeit torch._unique_dim2_temporary_will_remove_soon(a, dim=0, sorted=True, return_inverse=True, return_counts=True); torch.cuda.synchronize()
```
```
1.1.0a0+83ab8ac
190 µs ± 2.14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
237 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
219 µs ± 2.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
263 µs ± 1.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
```python
print(torch.__version__)
%timeit a.unique(sorted=True, return_inverse=False); torch.cuda.synchronize()
%timeit a.unique(sorted=True, return_inverse=True); torch.cuda.synchronize()
%timeit torch._unique2_temporary_will_remove_soon(a, sorted=True, return_inverse=False, return_counts=True); torch.cuda.synchronize()
%timeit torch._unique2_temporary_will_remove_soon(a, sorted=True, return_inverse=True, return_counts=True); torch.cuda.synchronize()
```
```
1.1.0a0+83ab8ac
232 µs ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
301 µs ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
264 µs ± 7.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
339 µs ± 9.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Differential Revision: D14730905
fbshipit-source-id: 10026b4b98628a8565cc28a13317d29adf1225cc
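The `return_counts` semantics being added can be mirrored for a flat Python sequence with the standard library (an illustrative sketch, not the ATen kernel):

```python
from collections import Counter

def unique_with_counts(values):
    # Mirrors torch.unique(..., sorted=True, return_counts=True)
    # for a flat sequence: sorted unique values plus their counts.
    counts = Counter(values)
    uniq = sorted(counts)
    return uniq, [counts[v] for v in uniq]

values = [2, 1, 2, 3, 1, 2]
uniq, counts = unique_with_counts(values)
print(uniq)    # [1, 2, 3]
print(counts)  # [2, 3, 1]
```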
Summary:
If the input `network` resides on multiple GPUs, `devices` must be a 2D list with `devices[0]` matching `network`'s devices. See #18591
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18687
Differential Revision: D14706162
Pulled By: mrshenli
fbshipit-source-id: dca630d3308f2dbcf8b75629c452d7a64092ba42
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18230
Implementing minimum qtensor API to unblock other workstreams in quantization
Changes:
- Added Quantizer which represents different quantization schemes
- Added qint8 as a data type for QTensor
- Added a new ScalarType QInt8
- Added QTensorImpl for QTensor
- Added following user facing APIs
- quantize_linear(scale, zero_point)
- dequantize()
- q_scale()
- q_zero_point()
Reviewed By: dzhulgakov
Differential Revision: D14524641
fbshipit-source-id: c1c0ae0978fb500d47cdb23fb15b747773429e6c
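The affine scheme behind `quantize_linear(scale, zero_point)` / `dequantize()` can be sketched for a single value (a simplified model of the API, with an assumed qint8 clamp range):

```python
def quantize_linear(x, scale, zero_point, qmin=-128, qmax=127):
    # q = clamp(round(x / scale) + zero_point) -- affine quantization.
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    # Inverse map; exact up to one quantization step of error.
    return (q - zero_point) * scale

q = quantize_linear(0.5, scale=0.1, zero_point=10)
print(q)                       # 15
print(dequantize(q, 0.1, 10))  # 0.5
```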
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18560
We have to import python protobuf here **before** we load cpp extension.
Otherwise it breaks under certain build conditions if cpp implementation of
protobuf is used. Presumably there's some registry in protobuf library and
python side has to initialize the dictionary first, before static
initialization in python extension does so. Otherwise, duplicated protobuf
descriptors will be created and it can lead to obscure errors like
Parameter to MergeFrom() must be instance of same class: expected caffe2.NetDef got caffe2.NetDef.
I think it also fixes https://github.com/facebookarchive/caffe2/issues/1573
Reviewed By: ezyang, iroot900
Differential Revision: D14622054
fbshipit-source-id: 2499eb88ecdee85ff8d845859048f7ae5da2a480
Summary:
To make test_operators.py more stable. In the future, we will bump this up manually, and I think that's acceptable, since ir_version shouldn't be bumped too often.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18768
Reviewed By: zrphercule
Differential Revision: D14741514
Pulled By: houseroad
fbshipit-source-id: 0369dbc55424e345a113e49fc104a441ea290d58
Summary:
Introduce this check to see whether it will break any existing workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18145
Reviewed By: dzhulgakov
Differential Revision: D14511711
Pulled By: houseroad
fbshipit-source-id: a7bb6ac84c9133fe94d3fe2f1a8566faed14a136
Summary:
The mkldnn-bridge is upgraded in this PR to support DNNLOWP operators.
Meanwhile, APIs in caffe2 have been updated to use the latest version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16308
Differential Revision: D14697018
Pulled By: yinghai
fbshipit-source-id: ca952589098accb08295fd5aa92924c61e74d69c
Summary:
Fixes: #6469
1. `ATen/native/native_functions.yml` had [dispatch](03e7953a98/aten/src/ATen/native/native_functions.yaml (L451-L455)) variants for `embedding_dense_backward`; however, `embedding_backward` explicitly made a [call](03e7953a98/aten/src/ATen/native/Embedding.cpp (L35-L45)) to it, leading to an error.
2. In the case of a CUDA tensor, the function used to crash when dereferencing the indices' data [pointer](03e7953a98/aten/src/ATen/native/Embedding.cpp (L93)).
Both issues have been fixed and checked (on CUDA and CPU) against:
1. The script mentioned in the issue:
```
import torch

class Test(torch.nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.embd = torch.nn.Embedding(1000, 100)
        self.dense = torch.nn.Linear(100, 1)

    def forward(self, inp):
        inp = self.embd(inp)
        return self.dense(inp)

test = Test()
inp = torch.tensor([0, 1, 2, 1, 1])
out = test(inp)
raw_loss = out.mean(dim=0)
loss_grad = torch.autograd.grad(outputs=raw_loss,
                                inputs=list(test.parameters()),
                                retain_graph=True, create_graph=True, only_inputs=True)
norm = sum([param.norm()**2 for param in loss_grad])
loss = raw_loss + norm
loss.backward(retain_graph=True)
print(test.embd.weight.grad)
```
2. Test Script
```
import torch
import time

start = time.time()
l = [1, 1] * 100
input = torch.tensor([[1, 0], [1, 0]], device='cpu')
embedding_matrix = torch.tensor([[1.0, 3.0], [2.0, 4.0]], requires_grad=True, device='cpu')
sq = embedding_matrix * embedding_matrix
emb = torch.nn.functional.embedding(input, sq, scale_grad_by_freq=False)
print('Embedding Matrix')
print(embedding_matrix)
print('-----------------')
sum_ = emb.sum()  # prod.sum()
loss_grad, = torch.autograd.grad(outputs=sum_, inputs=embedding_matrix, create_graph=True)
print('Gradient')
print(loss_grad)
print('-----------------')
sum2_ = sum_ + loss_grad.sum()
print(sum2_)
sum2_.backward()
print(embedding_matrix.grad)
print(time.time() - start)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9078
Reviewed By: ezyang
Differential Revision: D14691901
Pulled By: soumith
fbshipit-source-id: 78e2612ba39080be564c876311671eb5a0119a0f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18749
ghimport-source-id: 9026a037f5e11cdb9ccd386f4b6b5768b9c3259b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18751 Disallow changing the device of a tensor via set_.
* #18750 Use non-legacy constructors for tensor deserialization.
* **#18749 Add device and dtype to storage.**
The goal here is to fix our serialization, which currently depends on the legacy constructors. Having dtype and device on Storage allows us to use the non-legacy constructors.
This fits somewhat with our goal of removing Storage, by having Storage act like a Tensor.
Differential Revision: D14729516
fbshipit-source-id: bf4a3e8669ad4859931f4a3fa56df605cbc08dcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18750
ghimport-source-id: f1475cfb67841c41d9867d4429ba9125d5c7dd07
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18751 Disallow changing the device of a tensor via set_.
* **#18750 Use non-legacy constructors for tensor deserialization.**
* #18749 Add device and dtype to storage.
Deserialization currently uses legacy constructors. This is bad because we need to maintain them, but there is a more immediate problem:
1) We are trying to implement device caching on TensorImpl to get rid of a virtual dispatch
2) This doesn't work if one is able to change the device of a Tensor underlying a Variable.
3) Deserialization does 2)
So the plan is to change deserialization, then enforce that we don't change the device out from underneath a Variable.
Differential Revision: D14729513
fbshipit-source-id: 090d6cdb375b94dc1bf4f554b2df243952b8cdc6
Summary:
It's not used and unfold's use of `device_guard: False` is scary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18773
Differential Revision: D14736526
Pulled By: gchanan
fbshipit-source-id: 6281a284bee45fa5038783e4c1ed4d1ed7ca81ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18531
Currently we use C10_LOG_EVERY_MS to log the data type change, but it pollutes the logs of some services,
so we would like to change it to C10_LOG_FIRST_N to prevent that.
Reviewed By: dzhulgakov
Differential Revision: D14647704
fbshipit-source-id: b84e4002bd4aa94d616133cd1049c3d4ab05386e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18284
ghimport-source-id: 5a92c03fda19072ffb6afd40e0f56806716c7be6
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18296 [jit] Add namespacing for ScriptClasses
* **#18284 [jit] make test module hook use save/load**
* #18211 [jit] Turn script_type_parser into a class
* #18148 [jit] python interop for script classes
Instead of python-printing and comparing strings (which does not capture
dependency information, etc.), use save/load on in-memory buffers and
compare the main module contents inside the buffer.
Reviewed By: ailzhang
Differential Revision: D14581129
fbshipit-source-id: 52264ae9ce076775ab3fd1a0c32c8d6f6677a903
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18314
ghimport-source-id: 8cecb768d476ab19c9460f39c8f94a764e4cb052
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18314 Add ability to specialize class types to ArgumentSpec**
* #18226 Add Slot type to abstract the raw pointers being used for slots.
Differential Revision: D14574395
fbshipit-source-id: cc3af6e56e9ae52990f4a1ad56ecceaa2d493577
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18740
Test utilities for writing Caffe2/PyTorch performance microbenchmarks. Brief description of the file structure:
* benchmark_core.py : core utilities for running microbenchmark tests
* benchmark_caffe2.py : Caffe2-specific benchmark utilities
* benchmark_pytorch.py : PyTorch-specific benchmark utilities
* benchmark_runner.py : main entry point. Currently it can run the microbenchmark tests in a stand-alone mode. The next step is to have this integrate with AI-PEP.
The utilities are located at https://github.com/pytorch/pytorch/tree/master/test to have access to both the Caffe2 and PyTorch Python frontends.
Include two operator microbenchmarks; support both Caffe2/PyTorch:
* MatMul
* Add
Reference: PyTorch benchmarks: https://github.com/pytorch/benchmark/tree/master/timing/python. In this work, we start with two example binary operators, MatMul and Add, but eventually we should also cover unary operators like in the PyTorch benchmark repo.
Reviewed By: zheng-xq
Differential Revision: D13887111
fbshipit-source-id: b7a56b95448c9ec3e674b0de0ffb96af4439bfce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18166
ghimport-source-id: a8e2ba2d966e49747a55701c4f6863c5e24d6f14
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18166 Bool Tensor for CUDA**
* #18165 Resolved comments from Bool Tensor for CPU PR
------
This PR enables bool tensor creation and some basic operations for the CUDA backend. This is a part of the Bool Tensor feature implementation work. The whole plan looks like this:
1. Storage Implementation [Done]
2. Tensor Creation.
a) CPU [Done]
b) CUDA [This PR]
3. Tensor Conversions.
4. Tensor Indexing.
5. Tensor Operations.
6. Back compatibility related changes.
Change:
Enable bool tensor in CUDA with the following operations:
torch.zeros
torch.tensor
torch.ones
torch.rand/rand_like/randint/randint_like
torch.full
torch.full_like
torch.empty
torch.empty_like
Tested via unit tests and local scripts.
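A minimal sketch of the enabled creation ops (the CPU path from the earlier PR is shown unconditionally; the CUDA path this PR enables is guarded on availability):

```python
import torch

# Bool tensor creation via the ops listed above.
t = torch.zeros(2, 3, dtype=torch.bool)
u = torch.ones(2, 3, dtype=torch.bool)
v = torch.tensor([True, False, True])      # dtype inferred as torch.bool

# The CUDA path (this PR) only runs where a CUDA runtime is present.
if torch.cuda.is_available():
    t = torch.zeros(2, 3, dtype=torch.bool, device='cuda')
```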
Differential Revision: D14605104
fbshipit-source-id: b7d7340a7d70edd03a109222d271e68becba762c
Summary:
To debug a `one of the variables needed for gradient computation has been modified by an inplace operation` error, I wanted to know *which* variable has been modified, so I extended the error message with what information is easily available at this point.
Before:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation
```
After:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [80, 1]], which is output 0 of UnsqueezeBackward0, is at version 1, not expected version 0. Hint: enable anomaly detection to find the forward pass operation which modified it.
```
The hint to enable anomaly detection is only shown when it is not enabled. It's meant to save people some googling. I'd even go further and reference `torch.autograd.set_detect_anomaly(True)`, but maybe we're not running Python?
Disclaimer: I haven't looked at other parts of the code to check if using `std::stringstream` is acceptable practice, let me know if it isn't. Similarly, I haven't checked about indentation practices.
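For illustration, a short script that triggers the extended message (a sketch; the exact wording of the message may differ by version):

```python
import torch

torch.autograd.set_detect_anomaly(True)  # opt in to the hint's suggestion

a = torch.randn(3, requires_grad=True)
b = a.sigmoid()      # sigmoid's backward needs its *output*...
b.add_(1)            # ...but this in-place op bumps b's version counter
try:
    b.sum().backward()
except RuntimeError as e:
    # With this change, the message names the tensor type/shape, the producing
    # node, and the version mismatch.
    msg = str(e)
```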
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18523
Differential Revision: D14683249
Pulled By: soumith
fbshipit-source-id: f97a99d4aabea7461df766d66cd72300b48e2350
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18763
Without the `link_whole` flag, in opt builds some of the files are not linked into the `_C_impl` library, which causes some static initializers not to run (namely, registering a custom Python operation from python_interpreter.cpp). This diff fixes it.
Differential Revision: D14732471
fbshipit-source-id: 57cff6b4b6d479ad7ab7fd29f677746d91d6ff45
Summary:
Fix the bug introduced by #18681 where an undefined variable was being used to limit max cpu count when building for Windows without Ninja.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18748
Differential Revision: D14733209
Pulled By: soumith
fbshipit-source-id: 52fc0dd4dde99da75a6956b63f02da2e647eed4f
Summary:
The argument dim=-1 didn't work for torch.cross. The signature of torch.cross has been changed to take c10::optional<int64_t> dim instead of int64_t. So, based on the documentation ("If dim is not given, it defaults to the first dimension found with the size 3."), if dim is specified (even a negative one), the corresponding dimension is used.
Fixes #17229
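A minimal sketch of the fixed behavior (assuming a current torch build, where a negative dim resolves like any other indexing op):

```python
import torch

a = torch.randn(4, 3)
b = torch.randn(4, 3)

c_neg = torch.cross(a, b, dim=-1)  # negative dim now resolves to the last dimension
c_pos = torch.cross(a, b, dim=1)   # equivalent explicit positive dim
```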
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17582
Differential Revision: D14483063
Pulled By: ifedan
fbshipit-source-id: f9699093ec401cb185fd33ca4563c8a46cdcd746
Summary:
Multiple configurations are the default (e.g. Release;Debug) on Windows, and this check always broke that configuration because CMAKE_BUILD_TYPE was not set. The workaround was to always set CMAKE_BUILD_TYPE to Debug or Release, which was very unfortunate.
The correct method is to use generator expressions that expand depending on the current CONFIG being processed.
Side note: anywhere else CMAKE_BUILD_TYPE is checked should probably be fixed too.
Note that the CMakeLists.txt forces it into Release mode. However, I came across this error when importing the prebuilt config into another project, where CMAKE_BUILD_TYPE was not set.
> 3>CMake Error at pre_built/pytorch-1.0.1/share/cmake/Caffe2/public/cuda.cmake:380 (message):
> 3> Unknown cmake build type:
Proper support for configurations would mean we can build debug and release at the same time and as you can see, it is less CMake code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18548
Differential Revision: D14730790
Pulled By: ezyang
fbshipit-source-id: 70ae16832870d742c577c34a50ec7564c3da0afb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18577
This is also part of the legacy API and we need to support it if we want to replace it.
Reviewed By: dzhulgakov
Differential Revision: D14671432
fbshipit-source-id: 007abf4ab816647a509fc08e35d79b6c1aa55b03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18551
This is helpful for defining a set of operators as an interface but not adding concrete kernels just yet.
The registration logic will ensure that any other libraries that add kernels for these schemas exactly match the schema defined here.
Reviewed By: dzhulgakov
Differential Revision: D14660208
fbshipit-source-id: 7adb5a4876cff5a0ad21d92d8c450cb889f00cc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18550
When the operator registration API is used wrongly, in most cases we should now get a nice compiler error
instead of weird template error messages.
This is done by making the enable_if conditions more broad so they also match error cases,
but then having static_asserts against these error cases inside the function.
Before that, since the function didn't match, the error message said something like "no function found to match your call";
now it will show the error message specified in the static_asserts.
Reviewed By: dzhulgakov
Differential Revision: D14659178
fbshipit-source-id: 7ca4fb72d9051eadf0a7e2717b962bf1213a52b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18547
- Argument indices in the error messages are 1-indexed not 0-indexed.
- Add test cases that a mismatching signature actually shows the correct error messages
Reviewed By: dzhulgakov
Differential Revision: D14656695
fbshipit-source-id: 55e45634baa3117e18b8687ea6b2a2f83715bdf6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18703
`zeroPtr` is sometimes a `std::string` tensor, so `memset` to 0 is undefined behavior.
This might be accidentally safe with `std::string` implementations that use SSO (Small String Optimization), but will crash otherwise.
Reviewed By: zheng-xq
Differential Revision: D14714458
fbshipit-source-id: 012a18464e6514d38ff791509b88ddc3fc55b2b1
Summary:
Make it possible to construct a pinned-memory tensor without creating a storage first and without calling the pin_memory() function. It is also faster, as the copy operation is unnecessary.
Supported functions:
```python
torch.rand_like(t, pin_memory=True)
torch.randn_like(t, pin_memory=True)
torch.empty_like(t, pin_memory=True)
torch.full_like(t, 4, pin_memory=True)
torch.zeros_like(t, pin_memory=True)
torch.ones_like(t, pin_memory=True)
torch.tensor([10,11], pin_memory=True)
torch.randn(3, 5, pin_memory=True)
torch.rand(3, pin_memory=True)
torch.zeros(3, pin_memory=True)
torch.randperm(3, pin_memory=True)
torch.empty(6, pin_memory=True)
torch.ones(6, pin_memory=True)
torch.eye(6, pin_memory=True)
torch.arange(3, 5, pin_memory=True)
```
Part of the bigger: `Remove Storage` plan.
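A hedged sketch that uses the new keyword but falls back when no CUDA runtime is present (the helper name is illustrative):

```python
import torch

def empty_pinned(*shape):
    # pin_memory=True needs a CUDA runtime; fall back to pageable memory otherwise.
    pin = torch.cuda.is_available()
    return torch.empty(*shape, pin_memory=pin)

t = empty_pinned(3, 5)
```

Pinned (page-locked) host memory enables faster, asynchronous host-to-device copies, which is why constructing tensors pinned from the start beats allocating and then calling `.pin_memory()` (which copies).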
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18455
Reviewed By: ezyang
Differential Revision: D14672084
Pulled By: VitalyFedyunin
fbshipit-source-id: 9d0997ec00f59500ee018f8b851934d334012124
Summary:
Hi. It seems that when building C++ extensions with CUDA for Windows, the `extra_cuda_cflags` options are not properly forwarded to `nvcc`.
Use of extra CUDA options is necessary to build, for instance, InplaceABN (https://github.com/mapillary/inplace_abn), which requires the `--expt-extended-lambda` option.
This PR adds one line that correctly appends `extra_cuda_cflags`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18638
Differential Revision: D14704270
Pulled By: ezyang
fbshipit-source-id: e1e330d193d9afd5707a5437a74c0499460d2b90
Summary:
At some point, we needed these functions to deal with autograd dispatching to the sparse or TH version of a backward. But we rewrote all backwards definitions in terms of native functions, so this is no longer necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18696
Differential Revision: D14710834
Pulled By: gchanan
fbshipit-source-id: b22568c58eefc79d672555bd8832398ccd965cb7
Summary:
Added stubs for:
* The `device` module
* The `cuda` module
* Parts of the `optim` module
* Began adding stubs for the `autograd` module. I'll annotate more later but `no_grad` and friends are probably the most used exports from it so it seemed like a good place to start.
This would close #16996, although comments on that issue reference other missing stubs, so maybe it's worth keeping open as an umbrella issue.
The big remaining missing package is `nn`.
Also added a `py.typed` file so mypy will pick up on the type stubs. That closes #17639.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18511
Differential Revision: D14715053
Pulled By: ezyang
fbshipit-source-id: 9e4882ac997063650e6ce47604b3eaf1232c61c9
Summary:
Peephole optimize ops that just require Dimensioned Tensor Type, which is what we specialize graphs on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18549
Differential Revision: D14690827
Pulled By: eellison
fbshipit-source-id: 9d7439eb584f0a5b877f5aa53cf80150f00e7e5f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18542
This adds the deprecated API for defining kernels as lambdas. The new API for defining kernels as lambdas was introduced in D14653005.
Reviewed By: dzhulgakov
Differential Revision: D14653551
fbshipit-source-id: 99900f1436716c69e52c83b68333b642ec2c8558
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18444
This adds the deprecated function based API to c10::RegisterOperators().
This is the API currently exposed under jit::RegisterOperators() and we need to support it for backwards compatibility.
Reviewed By: dzhulgakov
Differential Revision: D14514218
fbshipit-source-id: c77676851cfd431d66f18fd8038cf153a3a7d7cc
Summary:
This commit adds the `c10d::Reducer` class that hooks into autograd
and performs gradient bucketing and reduction. These are the core
parts of `nn.parallel.DistributedDataParallel` that up to now were
only usable for CUDA models.
This should enable the following:
* Distributed data parallelism for models defined using the C++ frontend.
* Allow overlap of gradient computation and reduction for non-CUDA models.
* Enable distributed data parallelism for models with some unused parameters.
This does not include any logic for computing bucket assignment, which
can be done separately; either by observing autograd execution order
(this is what Apex does), or by assigning buckets based on some
maximum byte size, or both.
Also see #17757 and #13273.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18251
Reviewed By: mrshenli
Differential Revision: D14571899
Pulled By: pietern
fbshipit-source-id: 20f95eefd288dfe8cfffe0a28ca22fa7c9c3cd4c
Summary:
If none of the outputs require grad, we don't actually check gradgrad; instead we check that their numerical gradients are 0.
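For context, a minimal gradcheck invocation (double precision, as gradcheck's finite-difference estimates require):

```python
import torch

# gradcheck compares the analytical Jacobian from autograd against a
# numerical Jacobian computed via finite differences.
x = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(torch.sigmoid, (x,))
```

When an output does not require grad, there is no analytical second derivative to compare, hence the fallback described above of checking that the numerical gradients are 0.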
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18190
Differential Revision: D14563388
Pulled By: ifedan
fbshipit-source-id: a4eb94c9eb60f14dbe6986cd8cef1fe78a7bc839
Summary:
The last time I tried to land it there was a merge race with the docs coverage test lol. Re-landing with the fix.
Re-land of https://github.com/pytorch/pytorch/pull/18304
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18570
Reviewed By: driazati
Differential Revision: D14707285
Pulled By: eellison
fbshipit-source-id: 3a0265928aa8cad78961723d8bf0fbf871fdb71d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18672
On Skylake, when n < 128 or k < 128, acc16 is slower.
Reviewed By: jianyuh
Differential Revision: D14700576
fbshipit-source-id: 80ca9f1af4626637eed9c5ca49f95ae744811189
Summary:
Since we are going to add ideep to ATen, and ATen is always compiled, it makes sense to have the registration in ATen rather than C2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18335
Reviewed By: bddppq
Differential Revision: D14578652
Pulled By: gchanan
fbshipit-source-id: 4d77fcfc21a362b21d5291a127498aa722548873
Summary:
`python setup.py develop` fails with the following messages.
~~~
...
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using MIOpen
-- Not using CUDA
-- Using MKLDNN
-- Not using NCCL
-- Building without distributed package
Copying extension caffe2.python.caffe2_pybind11_state
Copying caffe2.python.caffe2_pybind11_state from torch\Lib\site-packages\caffe2\python\caffe2_pybind11_state.cp37-win_amd64.pyd to C:\data\source\pytorch\build\lib.win-amd64-3.7\caffe2\python\caffe2_pybind11_state.cp37-win_amd64.pyd
copying torch\Lib\site-packages\caffe2\python\caffe2_pybind11_state.cp37-win_amd64.pyd -> C:\data\source\pytorch\build\lib.win-amd64-3.7\caffe2\python
building 'torch._C' extension
creating build\temp.win-amd64-3.7
creating build\temp.win-amd64-3.7\Release
creating build\temp.win-amd64-3.7\Release\torch
creating build\temp.win-amd64-3.7\Release\torch\csrc
...
creating C:\data\source\pytorch\build\lib.win-amd64-3.7\torch
C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\bin\HostX64\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /nodefaultlib:libucrt.lib ucrt.lib /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:C:\data\source\pytorch\torch\lib /LIBPATH:C:\data\dlenv\libs /LIBPATH:C:\data\dlenv\PCbuild\amd64 "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\ATLMFC\lib\x64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.16.27023\lib\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.17763.0\um\x64" shm.lib torch_python.lib /EXPORT:PyInit__C build\temp.win-amd64-3.7\Release\torch/csrc/stub.obj /OUT:build\lib.win-amd64-3.7\torch\_C.cp37-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.7\Release\torch/csrc\_C.cp37-win_amd64.lib /NODEFAULTLIB:LIBCMT.LIB
Creating library build\temp.win-amd64-3.7\Release\torch/csrc\_C.cp37-win_amd64.lib and object build\temp.win-amd64-3.7\Release\torch/csrc\_C.cp37-win_amd64.exp
Generating code.
Finished generating code.
copying build\lib.win-amd64-3.7\torch\_C.cp37-win_amd64.pyd -> torch
copying build\lib.win-amd64-3.7\caffe2\python\caffe2_pybind11_state.cp37-win_amd64.pyd -> caffe2\python
copying build/temp.win-amd64-3.7/Release/torch/csrc/_C.cp37-win_amd64.lib -> build/lib.win-amd64-3.7/torch/lib/_C.lib
error: could not create 'build/lib.win-amd64-3.7/torch/lib/_C.lib': No such file or directory
~~~
When `python setup.py install` is executed, `torch/lib` has been created by a previous process (copying many files) and this copy succeeds. But in develop mode, that process is not executed and this copy fails.
This patch creates the `torch/lib` directory if it does not exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18666
Differential Revision: D14704269
Pulled By: ezyang
fbshipit-source-id: b2d7c698a906b945bf34bb78f17b91b4fdfd3294
Summary:
MSVC errors on these flags as they are not supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18686
Differential Revision: D14704254
Pulled By: ezyang
fbshipit-source-id: 936d33ed6b7474d7774a49505cdac50dbe8dd99a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18239
When min is inf or nan, we get UBSAN errors
Reviewed By: csummersea
Differential Revision: D14537668
fbshipit-source-id: e70ffb5ecd2b10793356070c69fdabf8f25b203e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18628
ghimport-source-id: d94b81a6f303883d97beaae25344fd591e13ce52
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18629 Provide flake8 install instructions.
* **#18628 Delete duplicated technical content from contribution_guide.rst**
There's useful guide in contributing_guide.rst, but the
technical bits were straight up copy-pasted from CONTRIBUTING.md,
and I don't think it makes sense to break the CONTRIBUTING.md
link. Instead, I deleted the duplicate bits and added a cross
reference to the rst document.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14701003
fbshipit-source-id: 3bbb102fae225cbda27628a59138bba769bfa288
Summary:
If a test triggers autodiff, it must have a `DifferentiableGraph` in its differentiated forward graph, and this subgraph must have either the original aten node, or the corresponding nodes used in AD formula.
Typically a forward differentiable graph looks like this:
```
graph(%i0 : Float(),
%i1 : Float()):
%3 : Float() = prim::DifferentiableGraph_0(%i0, %i1)
return (%3)
with prim::DifferentiableGraph_0 = graph(%0 : Float(),
%1 : Float()):
%2 : Float() = aten::max(%0, %1)
return (%2)
```
which tells us `aten::max(Tensor self, Tensor other) -> Tensor` is symbolically differentiable.
Update: there are a lot of cases (fusions/ConstantChunk/python implementations) that break it, so I decided to make the check optionally take node names if they differ from the function name.
~~[OLD]Theoretically I could also check if `aten::max` is in the differentiable block or not to be more precise, but there're also cases like `chunk` where in a differentiable block it's replaced with a prim node (ConstantChunk) and we will have to special case them. Any suggestions here (to be more precise or no) is very welcome!~~
We used to have a list of nn tests that should be run against AD; I moved it to a field set when constructing our tests (either torch or nn). I think it's cleaner this way, and it matches the fact that for the same op we may support one schema but not all of them; this way we can just turn on the corresponding test that triggers the supported schema.
cc: apaszke zdevito wanchaol ngimel for a review
[Done] :
- Going through a manual second pass of all tests to check if they should enable AD test or not....
- Add a readme about how to add AD for an op and how to add/enable its test in test_jit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18509
Differential Revision: D14696811
Pulled By: ailzhang
fbshipit-source-id: c5e693277baac585cd3aed5ab2c0e7faa5e6f29f
Summary:
Problem:
```cpp
// This function expects a `Variable` as input
inline PyObject* wrap(at::Tensor tensor) {
return THPVariable_Wrap(Variable(std::move(tensor)));
}
inline PyObject* wrap(at::Scalar scalar) {
// This function calls `wrap(at::Tensor tensor)` (the function above), but since
// `scalar_to_tensor(...)` returns a `Tensor` and not a `Variable`, the call to
// `wrap(at::Tensor tensor)` will fail with "Tensor that was converted to Variable
// was not actually a Variable", which is not what we want.
return wrap(scalar_to_tensor(scalar));
}
```
The right fix is to call `make_variable(...)` with the tensor returned from `scalar_to_tensor(scalar)`.
This unblocks https://github.com/pytorch/pytorch/pull/18230 as it is the only patch that hits this code path now. All other native functions that return Scalar (such as `item()` or `_local_scalar_dense()`) either has custom-defined implementation that doesn't go through this path, or is not exposed to Python at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18632
Differential Revision: D14689293
Pulled By: yf225
fbshipit-source-id: be7ba5d3de83a69533a2997de97ad92989ff78ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18598
ghimport-source-id: c74597e5e7437e94a43c163cee0639b20d0d0c6a
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18598 Turn on F401: Unused import warning.**
This was requested by someone at Facebook; this lint is turned
on for Facebook by default. "Sure, why not."
I had to noqa a number of imports in __init__. Hypothetically
we're supposed to use __all__ in this case, but I was too lazy
to fix it. Left for future work.
Be careful! flake8-2 and flake8-3 behave differently with
respect to import resolution for # type: comments. flake8-3 will
report an import unused; flake8-2 will not. For now, I just
noqa'd all these sites.
All the changes were done by hand.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14687478
fbshipit-source-id: 30d532381e914091aadfa0d2a5a89404819663e3
Summary:
This is meant to resolve #18249, where I pointed out a few things that could improve the CTCLoss docs.
My main goal was to clarify:
- Target sequences are sequences of class indices, excluding the blank index
- Lengths of `target` and `input` are needed for masking unequal-length sequences, and do not necessarily equal S, which is the length of the longest sequence in the batch.
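To illustrate the two points above, a minimal CTCLoss sketch (the shapes, the choice of blank index 0, and `zero_infinity=True` are assumptions of this example):

```python
import torch

T, N, C = 50, 4, 20                      # input length, batch, classes (index 0 = blank)
log_probs = torch.randn(T, N, C).log_softmax(2)

# Targets hold class indices 1..C-1: the blank index 0 is excluded.
targets = torch.randint(1, C, (N, 30), dtype=torch.long)

# Per-sample lengths mask the padded tails; each S_n <= 30, and 30 need not
# equal the length of the longest sequence in some other batch.
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.randint(10, 31, (N,), dtype=torch.long)

loss = torch.nn.CTCLoss(zero_infinity=True)(log_probs, targets,
                                            input_lengths, target_lengths)
```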
I thought about Thomas's suggestion to link the distill.pub article, but I'm not sure about it. I think that should be up to y'all to decide.
I have no experience with .rst, so it might not render as expected :)
t-vi ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18415
Differential Revision: D14691969
Pulled By: soumith
fbshipit-source-id: 381a2d52307174661c58053ae9dfae6e40cbfd46
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18443
Allow registering a kernel without a dispatch key. In this case, the kernel becomes a fallback kernel that is called whenever no other kernel matches.
This is also useful for the legacy function based API (since that API doesn't know about dispatch keys) or any other custom ops that don't care about dispatch
and just want one kernel to be called no matter the dispatch key.
Reviewed By: dzhulgakov
Differential Revision: D14603258
fbshipit-source-id: 242dc8871dad2989ca25079854d0cc97429e7199
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18302
These might be use cases we want to support in the future, but they don't work yet.
Let's at least report an error instead of doing segfaults or worse.
Reviewed By: dzhulgakov
Differential Revision: D14572346
fbshipit-source-id: 49262ce131493bc887defe2978d8b22f202cd8cc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18301
Move code out of headers and templates into source files and non-templates.
Reviewed By: dzhulgakov
Differential Revision: D14572347
fbshipit-source-id: 9fd5d62d54000a95e93076cd73f591ba2c5c2653
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18256
This diff infers the function schema from the kernel function/functor and checks that it matches the specified function schema.
This diff does not allow (yet) to omit specifying the function schema in the registration API. That will come in a future diff.
Reviewed By: dzhulgakov
Differential Revision: D14552738
fbshipit-source-id: 00202b489ede19f26ae686c97416b38c72c11532
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18162
- Adds the API to register a functor- and function-based kernel.
- Change the experimental c10 ops to use this new API instead of the old one
- Deletes the old APIs in KernelRegistration.h and OpSchemaRegistration.h
Reviewed By: dzhulgakov
Differential Revision: D14514239
fbshipit-source-id: 35b2f6e8f62964e54886450a6a5fac812ed20f26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18161
This introduces version 0 for the new operator registration.
For now, it only works with kernels that are defined as stack-based functions.
This is actually not the intended public API for defining kernels, but it's the basis which is going to be used to define the public APIs (see diffs on top for them),
and it's also the API used for exposing caffe2 operators.
This diff also switches the mechanism for exposing caffe2 operators to the new mechanism.
Reviewed By: dzhulgakov
Differential Revision: D14514231
fbshipit-source-id: 454ab7b5b46a10203aa27b175400d23f818dd1df
Summary:
caffe2_py2_cuda9_0_cudnn7_ubuntu16_04_build is failing
```
...
Mar 29 04:44:46 Need to get 174 MB of archives.
Mar 29 04:44:46 After this operation, 576 MB of additional disk space will be used.
Mar 29 04:44:46 Do you want to continue? [Y/n] Abort.
Exited with code 1
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18609
Differential Revision: D14694990
Pulled By: bddppq
fbshipit-source-id: 260446a8650f660a2baf123a3f17efdf0a8d6c64
Summary:
* adds attributes to `ScriptModule.__getattr__` so they can be accessed in Python after re-importing
* full support for all the possible values for an `int64_t`
* this necessitated a bunch more `pushWhatever` functions, so re-introduced a templated version to cut down on duplicate code
* tests to validate references / value sharing works
* adds `torch.jit.Unpickler` which people can use to de-serialize the pickle files into Python / have a quick reference on how to do this without PyTorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18188
Differential Revision: D14527490
Pulled By: driazati
fbshipit-source-id: efd15579cc04aa2e28c4b2c9490d82d849dee559
Summary:
For MKL-DNN, the filter data will be reordered to the primitive format, which takes a lot of time.
So this patch provides a method to convert the filter format before training.
"OptimizeForIdeep" will also be renamed to "OptimizeForMkldnn" in this patch.
This patch depends on https://github.com/pytorch/pytorch/pull/12866
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15171
Differential Revision: D14590741
Pulled By: yinghai
fbshipit-source-id: 07971c9977edac3c8eec08ca2c39cda639683492
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16373
motivation: https://github.com/pytorch/pytorch/pull/12407
This is a manual diff.
most of the fixes should be:
```
auto* Y = Output(0);
Y->Resize(dims);
Y->raw_mutable_data(dtype);
```
-->
```
auto* Y = Output(0, dims, at::dtype(dtype));
```
But there might be other cases.
Reviewed By: dzhulgakov
Differential Revision: D13725460
fbshipit-source-id: 649a4b0e42f62cda1a60171dd9fa3e440dc9dca1
Summary:
This adds `hash()` which supports `int`, `str`, and `float`. It relies on `std::hash` which is implementation defined, so the result of `hash()` in TorchScript is not the same as in Python, but should satisfy the same properties.
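As a rough illustration (plain Python, not TorchScript), the property the change relies on is that equal values hash equal and hashing is deterministic within a run, even though the concrete hash values are implementation defined:

```python
# Equal values must produce equal hashes; the concrete values may differ
# between implementations (e.g. CPython's hash() vs. std::hash in TorchScript).
def check_hash_properties(values):
    for v in values:
        assert hash(v) == hash(v)  # deterministic within one run
    # distinct representations of the same number agree
    assert hash(1) == hash(1.0)

check_hash_properties([3, 2.5, "hello"])
print("hash properties hold")
```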
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18258
Differential Revision: D14692317
Pulled By: driazati
fbshipit-source-id: 909df5d024bb3feea157d5a203b7de53c72261c9
Summary:
Start of breaking up test_jit.py
New files will have the format test_jit_* so they are easily grepable but remain in the same directory so we don't have to go through multiple sources for imports.
I am adding a test that's expected to fail to be sure it's running.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18590
Reviewed By: wanchaol
Differential Revision: D14677094
Pulled By: eellison
fbshipit-source-id: 9782c6aa9525bb6f332fc75cfff004c83a417522
Summary:
This defines a generic counters API that users can utilize to provide monitoring functionality in e.g. a production service. We expose both counters for runtime internals as well as a TorchScript API to create user-defined counters. Synopsis of the API:
- `torch/csrc/jit/script/logging.h` specifies the externally-facing API in C++
- `torch/jit/_logging.py` specifies the Python API
We use an interface, `LoggerBase`, to define the interactions between users and a logging backend. Implementing a subclass of `LoggerBase` allows the user to handle these events in a custom way, such as logging into a DB or calling into an infra-specific counters API.
From the frontend perspective, we can create log events in two ways:
1. We provide an `add_stat_value(name, val)` function. This calls into the Logger backend with a key/value pair. For example, we might call `add_stat_value('foo', 1)` to bump an event counter.
2. We provide a `time_point()` function to record a timestamp in nanoseconds. This can be used in conjunction with `add_stat_value` to record runtime wall clock durations.
Examples of frontend usage can be found in `test_jit.py TestLogging`.
We provide a trivial `LockingLogger` implementation as an example and for testing purposes. It is likely not ready for production usage. It demonstrates that a backend implementing the API can do things like specify aggregation types and report these aggregate stats via the `get_counters()` API.
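A rough pure-Python sketch of the counter-aggregation idea behind `LockingLogger` (the names and aggregation kinds here are illustrative, not the actual C++ API):

```python
import threading
from collections import defaultdict

class LockingLogger:
    """Toy logger: aggregates key/value stats under a lock (SUM or AVG)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._sums = defaultdict(int)
        self._counts = defaultdict(int)
        self._agg = {}  # key -> "sum" (default) or "avg"

    def set_aggregation(self, key, kind):
        self._agg[key] = kind

    def add_stat_value(self, key, value):
        # Thread-safe increment, mirroring add_stat_value(name, val).
        with self._lock:
            self._sums[key] += value
            self._counts[key] += 1

    def get_counters(self):
        # Report aggregates: integer average for "avg", raw sum otherwise.
        with self._lock:
            return {k: (s // self._counts[k] if self._agg.get(k) == "avg" else s)
                    for k, s in self._sums.items()}

logger = LockingLogger()
logger.set_aggregation("latency_ns", "avg")
for v in (10, 20, 30):
    logger.add_stat_value("latency_ns", v)
logger.add_stat_value("requests", 1)
print(logger.get_counters())  # {'latency_ns': 20, 'requests': 1}
```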
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18235
Differential Revision: D14545060
Pulled By: jamesr66a
fbshipit-source-id: 04099543a1898cfdd411511e46e03d5dce9b4881
Summary:
They are called as (outputs, inputs) but were named (inputs, outputs).
A possible follow-up fix is to make the outputs argument an lvalue to allow calling multiple post hooks without ever copying the outputs vector. It looks like the copy is currently forced because the hook takes a const reference as input and returns a value. This would change the prototype of the function, so it needs further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18140
Differential Revision: D14684498
Pulled By: pietern
fbshipit-source-id: 1bd3ddbdd1ff7fe0a18241de5a9ec745a4e7ef07
Summary:
The last time I tried to land it there was a merge race with the docs coverage test lol. Re-landing with the fix.
Re-land of https://github.com/pytorch/pytorch/pull/18304
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18570
Differential Revision: D14668859
Pulled By: eellison
fbshipit-source-id: 3825a35ddc6179a0d433d70d22b5c1a96c20b21a
Summary:
In blob feeder for ideep device, the wrong device option is given and led to a crash issue.
This patch aims to correct the device option to fix this bug.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18552
Differential Revision: D14679838
Pulled By: yinghai
fbshipit-source-id: bde11e6a6fe44822166881dcb7c9bd0b34b4ecf3
Summary:
Previously, we were not able to assign names to `nn::Sequential`'s submodules. This PR adds this feature to match the Python API. Example use:
```cpp
Sequential sequential(named_submodule({
{"linear", Linear(10, 3)},
{"conv2d", Conv2d(1, 2, 3)},
{"dropout", Dropout(0.5)},
{"batchnorm", BatchNorm(5)},
{"embedding", Embedding(4, 10)},
{"lstm", LSTM(4, 5)}
}));
```
It also enables loading parameters of Python `nn.Sequential` module with custom submodules names into C++ frontend, unblocking https://github.com/pytorch/vision/pull/728#issuecomment-466661344.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17552
Differential Revision: D14246834
Pulled By: yf225
fbshipit-source-id: 3030b5c5d68f6dd5d3e37ac4b4f98dc6d6d9ba72
Summary:
Changelog:
- Renames `btriunpack` to `lu_unpack` to remain consistent with the `lu` function interface.
- Rename all relevant tests, fix callsites
- Create a tentative alias for `lu_unpack` under the name `btriunpack` and add a deprecation warning to not promote usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18529
Differential Revision: D14683161
Pulled By: soumith
fbshipit-source-id: 994287eaa15c50fd74c2f1c7646edfc61e8099b1
Summary:
Kindly let me know if it's okay and whether there are any places where I need to make a fix. Closes #18534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18604
Differential Revision: D14680712
Pulled By: soumith
fbshipit-source-id: 030e4a3d8f7839cbe2b8a3ef386323f0d39eb81a
Summary:
Changelog:
- Renames `btrifact` and `btrifact_with_info` to `lu`to remain consistent with other factorization methods (`qr` and `svd`).
- Now, we will only have one function and method named `lu`, which performs the `lu` decomposition. This function takes a `get_infos` kwarg which, when set to True, includes an infos tensor in the returned tuple.
- Rename all tests, fix callsites
- Create a tentative alias for `lu` under the name `btrifact` and `btrifact_with_info`, and add a deprecation warning to not promote usage.
- Add the single batch version for `lu` so that users don't have to unsqueeze and squeeze for a single square matrix (see changes in determinant computation in `LinearAlgebra.cpp`)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18435
Differential Revision: D14680352
Pulled By: soumith
fbshipit-source-id: af58dfc11fa53d9e8e0318c720beaf5502978cd8
Summary:
Deleting batch tensor, since we are no longer maintaining the project and keeping it functional is blocking other improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18575
Differential Revision: D14671126
Pulled By: eellison
fbshipit-source-id: b42d5b699c4d12171ed95e6d3a977532167f0d2c
Summary:
This allows a pathlib.Path object to be passed to torch.load as an input argument.
Fixes #16607
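A minimal stdlib-only sketch of the idea (the `load` helper below is hypothetical, not the actual torch.serialization code): accept both open file objects and anything path-like by coercing through `os.fspath`:

```python
import os
import pickle
import pathlib
import tempfile

def load(f):
    # Accept str / pathlib.Path (anything os.PathLike) as well as file objects.
    if isinstance(f, (str, os.PathLike)):
        with open(os.fspath(f), "rb") as fh:
            return pickle.load(fh)
    return pickle.load(f)

with tempfile.TemporaryDirectory() as d:
    p = pathlib.Path(d) / "ckpt.pkl"
    with open(p, "wb") as fh:
        pickle.dump({"step": 7}, fh)
    print(load(p))       # path-like input
    print(load(str(p)))  # plain string still works
```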
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18562
Differential Revision: D14668255
Pulled By: soumith
fbshipit-source-id: 0ae4f7c210918582912f2d1ef2a98f1ab288c540
Summary:
Adding the same warning message already present in the mse_loss function to the L1 losses when the input and target sizes are different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18565
Differential Revision: D14671415
Pulled By: soumith
fbshipit-source-id: 01f5e1fb1ea119dbb2aecf1d94d0cb462f284982
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18226
ghimport-source-id: b9ec8651212875b30971cc6859d2ddec6559ae3a
If modules become first-class IValues, then the slots will no longer be raw pointers but (IValue, index) pairs. This commit inserts the Slot abstraction so that this change can be made in later patches.
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18226 Add Slot type to abstract the raw pointers being used for slots.**
Differential Revision: D14542022
fbshipit-source-id: b81d7f4334c983d663e7551bda82df43680d7c5f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18512
Ceil and Floor have been supported since version 6 of ONNX: export them using the native onnx ops instead of an Aten op.
Similarly, support for the Where op has been added in version 9, so we don't need to wrap these op in an Aten op.
Reviewed By: houseroad
Differential Revision: D14635130
fbshipit-source-id: d54a2b6e295074a6214b5939b21051a6735c9958
Summary:
While benchmarking a kernel with broadcasted inputs, I noticed
that it was much slower than a hand-coded kernel for the same task.
The kernel in question computed a * b + c for a of shape
32 x 32 x 10240 and b and c of shape 1 x 32 x 1.
This patch accelerates said kernel from 450us to 250us on my GTX1080Ti.
I didn't change half because there doesn't seem to be __ldg for
half.
An alternative could be to sprinkle const and restrict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18540
Differential Revision: D14657840
Pulled By: soumith
fbshipit-source-id: 408847346ec12d1d1d9b119ac50bbc70f0d9ed33
Summary:
This implements a cyclical learning rate (CLR) schedule with an optional inverse cyclical momentum. More info about CLR: https://github.com/bckenstler/CLR
This is finishing what #2016 started. Resolves #1909.
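The triangular policy at the core of CLR (following the description in the bckenstler/CLR repo; this is a sketch, not the scheduler's actual code) can be written as:

```python
import math

def triangular_clr(iteration, step_size, base_lr, max_lr):
    """Learning rate oscillates linearly between base_lr and max_lr,
    completing one full cycle every 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)

print(triangular_clr(0, 100, 0.001, 0.006))    # base_lr at the cycle start
print(triangular_clr(100, 100, 0.001, 0.006))  # max_lr at the peak (up to rounding)
print(triangular_clr(200, 100, 0.001, 0.006))  # back at base_lr
```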
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18001
Differential Revision: D14451845
Pulled By: sampepose
fbshipit-source-id: 8f682e0c3dee3a73bd2b14cc93fcf5f0e836b8c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18538
ghimport-source-id: 665b09f158d1c5dd94686d4212792504b55b7f73
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18538 Completely synchronize behavior of Facebook flake8 and public flake8.**
Previously, developers at Facebook had the very funny experience
wherein /usr/local/bin/flake8 behaved differently than a freshly
installed flake8 from pip. In this commit, I add enough ignores to
.flake8 and install enough plugins to make the Facebook flake8
and public flake8 line up exactly. This means you don't have
to care which flake8 you use; they will all report accurate information
on your Python files.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14652336
fbshipit-source-id: ba7776eaa139cf2e3df2e65349da6fd7c99acca4
Summary:
This allows you to embed checks in IR, making the test more readable.
E.g.
```
graph_str = '''graph(%0 : Double(5, 5)):
  # CHECK: aten::relu
  %1 : Double(5, 5) = aten::relu(%0)
  return (%1)'''
FileCheck().run(graph_str, parseIR(graph_str))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18304
Differential Revision: D14652372
Pulled By: eellison
fbshipit-source-id: 7430b9d1dc2b7584704375aac02d7392ecec76a0
Summary:
Previously we were moving nodes with writers into differentiable subgraphs, without necessarily preserving whether or not they were written to. This can lead to bugs with CSE, which needs that context.
I'm not completely sure if there's anything else we can do to be more aggressive here - inline these nodes and not run CSE and just run constant pooling, or possibly something else - but I think we should land this correctness condition first and then think further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18491
Differential Revision: D14648562
Pulled By: eellison
fbshipit-source-id: bc1e444774ccdb708e22f0e06a477a221a231f9e
Summary:
`isTensor` has been brought up as misleading a couple of times; rename it to `isCompleteTensor` for clarity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18437
Differential Revision: D14605223
Pulled By: eellison
fbshipit-source-id: 189f67f12cbecd76516a04e67d8145c260c79036
Summary:
Enable unit tests working with ROCm 2.3. In particular, these are unit tests that we previously skipped for double data types, plus some tests for multi-GPU setups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18537
Differential Revision: D14651822
Pulled By: ezyang
fbshipit-source-id: 7dd575504ebe235a91489866c91000e9754b1235
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18494
Today, some C2 end-to-end test runs require reading model data from external filesystems (for example, Gluster and AWS). This can be a source of flaky tests when the external filesystems are not reachable during the tests.
In this diff, we add try/catch logic around where we download models and open model files from external systems. In case such an attempt fails, we catch the exception and let the unittest skip the current test instead of failing.
I also refactor the code a little bit by removing some duplicated logic for downloading and building the C2 model data. It had been duplicated in two classes and a few functions...
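The pattern described above can be sketched in plain Python (the downloader and exception types here are illustrative, not the actual C2 test utilities):

```python
import unittest

def _download_model(url):
    # Placeholder for the real Gluster/AWS fetch; here it always fails.
    raise OSError("external filesystem unreachable: %s" % url)

class ModelTest(unittest.TestCase):
    def _fetch_or_skip(self, url):
        try:
            return _download_model(url)
        except OSError as e:
            # Environment problem, not a product bug: skip instead of failing.
            self.skipTest("could not fetch model data: %s" % e)

    def test_end2end(self):
        model = self._fetch_or_skip("gluster://models/resnet50")
        self.assertIsNotNone(model)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ModelTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```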
Reviewed By: yinghai
Differential Revision: D14442241
fbshipit-source-id: da8bf56c8d096efa34ca2070de5cd10a18aad70c
Summary:
We are about to merge onnxifi quantization support soon. Before that, I would like to merge this diff separately to make sure it doesn't break anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18486
Reviewed By: bddppq, houseroad
Differential Revision: D14626419
Pulled By: yinghai
fbshipit-source-id: 504c1eae60be1e629203267b59defb8b69d82c0a
Summary:
There are a number of pages in the docs that serve insecure content. AFAICT this is the sole source of that.
I wasn't sure if docs get regenerated for old versions as part of the automation, or if those would need to be manually done.
cf. https://github.com/pytorch/pytorch.github.io/pull/177
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18508
Differential Revision: D14645665
Pulled By: zpao
fbshipit-source-id: 003563b06048485d4f539feb1675fc80bab47c1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18507
ghimport-source-id: 1c3642befad2da78a7e5f39d6d58732b85c76267
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18507 Upgrade flake8-bugbear to master, fix the new lints.**
It turns out Facebook is internally using the unreleased master
flake8-bugbear, so upgrading it grabs a few more lints that Phabricator
was complaining about but that we didn't get in open source.
A few of the getattr sites that I fixed look very suspicious (they're
written as if Python were a lazy language), but I didn't look more
closely into the matter.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14633682
fbshipit-source-id: fc3f97c87dca40bbda943a1d1061953490dbacf8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18385
By moving the weight offload into the backend initialization function, we can instantiate the backend once by creating the OnnxifiOp once and then clean up the parameter workspace. And we need to keep hold of that instantiated net (OnnxifiOp) without cleaning it. Subsequent ctor of OnnxifiOp of the same model will hit the cached backend and they will not look into weight offloading, which is safe as the weight is already gone.
Reviewed By: ipiszy
Differential Revision: D14590379
fbshipit-source-id: f7f34016e09777ad3df0af487885cd14658e1044
Summary:
Added full instructions for how to use the `ccache` package. Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18495
Differential Revision: D14635351
Pulled By: ezyang
fbshipit-source-id: 158e1052bae580e95f73644252fdbddcc0213128
Summary:
This depends on https://github.com/pytorch/pytorch/pull/16039
This prevents people (reviewers, PR authors) from forgetting to add things to `tensors.rst`.
When something new is added to `_tensor_doc.py` or `tensor.py` but intentionally not to `tensors.rst`, people should manually whitelist it in `test_docs_coverage.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16057
Differential Revision: D14619550
Pulled By: ezyang
fbshipit-source-id: e1c6dd6761142e2e48ec499e118df399e3949fcc
Summary:
It is okay for the argument order to be different.
ajyu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18466
Differential Revision: D14627258
Pulled By: bddppq
fbshipit-source-id: 430e1fb1bea2c5639a547ae7c1652368788c86b9
Summary:
Set value as tensor of 1 element instead of scalar, according to ONNX spec.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18199
Reviewed By: dzhulgakov
Differential Revision: D14542588
Pulled By: houseroad
fbshipit-source-id: 70dc978d870ebe6ef37c519ba4a20061c3f07372
Summary:
More ops for https://github.com/pytorch/pytorch/issues/394. ~~Also need to rebase after landing #16186, because we need to update the whitelist of the new unit test added in #16186.~~
cc: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17093
Differential Revision: D14620068
Pulled By: ezyang
fbshipit-source-id: deec5ffc9bf7624e0350c85392ee59789bad4237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18160
When exposing a c10 operator to the caffe2 frontend, don't use the operator schema but use the operator name instead.
This allows us to get rid of the existing mechanism for operator schema registration in a diff stacked on top.
Reviewed By: dzhulgakov
Differential Revision: D14513420
fbshipit-source-id: 6b08a9c6d9497eaf18b62361dd44bc07c7b4b76b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18242
ghimport-source-id: b949d312a48226a34f90304162e910acee7c95cd
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18242 Test running a CUDA build on CPU machine.**
* #18362 Add ability to query if built with CUDA and MKL-DNN.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14584429
fbshipit-source-id: b54de5b33f0c795a7d9605d30576cdf9b74050fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18485
I don't know (1) how we landed the wrong version of the patch or (2) how
this passed the push-blocking test
Reviewed By: pjh5
Differential Revision: D14621961
fbshipit-source-id: 0a3953d7adcdc79727a61c2acff65f436dcafe55
Summary:
This PR adds a Global Site Tag to the site.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17690
Differential Revision: D14620816
Pulled By: zou3519
fbshipit-source-id: c02407881ce08340289123f5508f92381744e8e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18165
ghimport-source-id: 55cb3fb63a25c2faab1725b4ec14c688bf45bd38
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18166 Bool Tensor for CUDA
* **#18165 Resolved comments from Bool Tensor for CPU PR**
-------
This is a follow up PR that resolves some additional feedback on one the of previous Bool Tensor PRs.
gchanan, here is a list of almost all the comments from the original PR with respective fixes and replies:
**[utils/python_scalars.h]** why is this converting from uint8_t and not bool? (comment?)
When I was adding this, I was testing by creating a tensor and then calling its .tolist(). It worked equally well for bool and uint8_t, so I left uint8_t as I thought it made more sense since we are calling PyBool_FromLong. Changing it to bool.
**[ATen/Dispatch.h]** better name?
Fixed.
**[test/test_torch.py]** what about other factories, such as full? (and more).
There is a test that goes through the factory methods - test_tensor_factories_empty. I added some bool cases above it and added a comment that once CUDA is done, I will unite them and it will iterate not just between CUDA and CPU but over all types. Adding all bool cases now. Will unite in the CUDA PR.
**[generic/THTensorMath.h]** any changes in this file actually needed?
Bad merge. Fixed.
**[TH/THTensor.h]** this generates code for random, clampedRandom, and cappedRandom -- do we have tests for all of these with bool?
Added
**[c10/core/ScalarType.h]** I'm not very confident about the lack of Bool here -- can you look at the call sites and see what makes sense to do here?
Added bool to the macro and created a similar one without it for a single case, which fails the build with errors:
_./torch/csrc/jit/symbolic_variable.h:79:20: error: ambiguous overload for ‘operator*’ (operand types are ‘const torch::jit::SymbolicVariable’ and ‘torch::jit::Value*’)
return (*this) * insertConstant(rhs);_
Differential Revision: D14605105
fbshipit-source-id: abf82d50e8f8c50b386545ac068268651b28496d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18445
ghimport-source-id: 30d018737bf6989bc68b7e3676f44e0ca6141fde
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18445 Unify cudaGetDeviceCount implementations.**
I went about doing this by searching for calls to cudaGetDeviceCount,
and then methodically replacing them with references to c10::cuda::device_count()
or at::cuda::device_count().
There is a point to doing this: the various implementations wildly differed
in their handling of what to do when cudaGetDeviceCount returns an error.
The final standardized behavior is that **all errors are swallowed** and
we return device count of zero. This indirectly fixes running CUDA builds
on CPU, which was broken in #17847.
I added 'noexcept' to the 'deviceCount' virtual method on DeviceGuardImpl.
This is a BC-breaking change for anyone inheriting from DeviceGuardImpl
but all you need to do is put 'noexcept' on your method and it is backwards
compatible with older libtorch.
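The standardized behavior - swallow any error from the driver call and report zero devices - can be sketched in Python (the raw query below is a stand-in for `cudaGetDeviceCount`, not a real binding):

```python
def make_device_count(raw_query):
    """raw_query() returns the device count or raises on driver failure."""
    def device_count():
        try:
            n = raw_query()
            return n if n >= 0 else 0
        except Exception:
            # All errors are swallowed: a CUDA build on a CPU-only
            # machine simply reports zero devices instead of crashing.
            return 0
    return device_count

def broken_driver():
    raise RuntimeError("CUDA driver version is insufficient")

count = make_device_count(broken_driver)
print(count())  # 0
count_ok = make_device_count(lambda: 2)
print(count_ok())  # 2
```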
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14612189
fbshipit-source-id: 3c8d186e3dd623c0e27625212c7ce30f75d943cb
Summary:
`SobolEngine` is a quasi-random sampler used to sample points evenly between [0,1]. Here we use direction numbers to generate these samples. The maximum supported dimension for the sampler is 1111.
Documentation has been added, and tests have been added based on Balandat's references. The implementation is an optimized / tensor-ized version of Balandat's Cython implementation as provided in #9332.
This closes #9332.
cc: soumith Balandat
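For intuition, the first dimension of a Sobol sequence coincides with the base-2 van der Corput sequence (the radical inverse); a tiny stdlib-only sketch:

```python
def van_der_corput(n):
    """Radical inverse of n in base 2 -- the first Sobol dimension."""
    q, bk = 0.0, 0.5
    while n:
        q += (n & 1) * bk  # each low-order bit contributes a halved weight
        n >>= 1
        bk /= 2
    return q

# Points fill [0, 1) evenly: 0.5, 0.25, 0.75, 0.125, 0.625, ...
print([van_der_corput(i) for i in range(1, 6)])
```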
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10505
Reviewed By: zou3519
Differential Revision: D9330179
Pulled By: ezyang
fbshipit-source-id: 01d5588e765b33b06febe99348f14d1e7fe8e55d
Summary:
Simplify or eliminate boolean and/or expressions, optimize unwrapping a value that cannot be None, and optimize using `is` with a None and a non-None value
Since peephole optimize now introduces constants, I added another constant propagation pass after running it.
Previously I had a PR that did this & optimized shape ops - I will add the shape optimizations in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18259
Differential Revision: D14602749
Pulled By: eellison
fbshipit-source-id: 1c3f5a67067d8dfdf55d7b78dcb616472ea8a267
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/12598
This PR was originally authored by ptrblck at https://github.com/pytorch/pytorch/pull/15495, but since there was no update for months after the requested changes, I cloned that branch and resolved the code reviews here. Hope everything is good now. In particular, the implementation of count has been changed from ptrblck's original algorithm to the one ngimel suggested, i.e. using `unique_by_key` and `adjacent_difference`.
The current implementation of `_unique_dim` is VERY slow for computing the inverse index and counts, see https://github.com/pytorch/pytorch/issues/18405. I will refactor `_unique_dim` in a later PR. For this PR, please allow me to keep the implementation as is.
cc: ptrblck ezyang ngimel colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18391
Reviewed By: soumith
Differential Revision: D14605905
Pulled By: VitalyFedyunin
fbshipit-source-id: 555f5a12a8e28c38b10dfccf1b6bb16c030bfdce
Summary:
Dropout is now eligible for fusion, and generated fused kernels are just as fast as dropout in ATen. Change its lowering in symbolic script so that it can actually be fused. Still special-cased for cuda, because without fusion this lowering is less efficient than current (bernoulli_ * input). Testing is covered by the test case that ailzhang added (test_dropout_cuda).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18375
Differential Revision: D14611938
Pulled By: soumith
fbshipit-source-id: 11b18f4784e6c9265e382a8f8deca7add8df3b37
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18155
- Make a python decorator caffe2_flaky for caffe2 operator unit tests.
- The environment variable CAFFE2_RUN_FLAKY_TESTS is now used to mark flaky-test mode
During a test run:
- If flaky-test mode is on, only flaky tests are run
- If flaky-test mode is off, only non-flaky tests are run
Mark ctc_beam_search_decoder_op_test as flaky.
Reviewed By: ezyang, salexspb
Differential Revision: D14468816
fbshipit-source-id: dceb4a48daeb5437ad9cc714bef3343e9761f3a4
Summary:
This PR did two things:
1. Enable scalar->float specialization in symbolic script, so AD formulas that contain a scalar in the schema should write `float` instead.
2. add addcmul, lerp to AD and fuser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18081
Differential Revision: D14490493
Pulled By: wanchaol
fbshipit-source-id: b3b86d960d5f051b30733bc908b19786111cdaa4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18362
ghimport-source-id: 374b7ab97e2d6a894368007133201f510539296f
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18242 Test running a CUDA build on CPU machine.
* **#18362 Add ability to query if built with CUDA and MKL-DNN.**
Fixes#18108.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14584430
fbshipit-source-id: 7605a1ac4e8f2a7c70d52e5a43ad7f03f0457473
Summary:
This is to fix#16141 and similar issues.
The idea is to track a reference to every shared CUDA Storage and deallocate memory only after a consumer process deallocates received Storage.
ezyang Done with cleanup. Same (insignificantly better) performance as in file-per-share solution, but handles millions of shared tensors easily. Note [ ] documentation in progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16854
Differential Revision: D13994490
Pulled By: VitalyFedyunin
fbshipit-source-id: 565148ec3ac4fafb32d37fde0486b325bed6fbd1
Summary:
Also asserts in storage_initialized that there is a storage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18347
Differential Revision: D14582028
Pulled By: gchanan
fbshipit-source-id: df3f5d181188f39e361839169fd054539c3b2839
Summary:
There's no reason we can't check this, but I'm punting on implementing it for now. It currently segfaults, so this is an improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18338
Differential Revision: D14580308
Pulled By: gchanan
fbshipit-source-id: 44d4cafeab12e1beeb3453a2d4068d221c2e9c4f
Summary:
Previously it would look for the Config even if it was not written.
Fixed #18419
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18421
Differential Revision: D14597139
Pulled By: ezyang
fbshipit-source-id: c212cbf5dc91564c12d9d07e507c8285e11c6bdf
Summary:
This reverts commit 7cc7ed1322405ba3c627b9c5661a330f92c4183d.
I think it's better to sort out the issues raised in #18407 first. I'm sorry for not stopping it earlier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18411
Differential Revision: D14594937
Pulled By: soumith
fbshipit-source-id: 3c90b7fa7694e2f59e55607acecde4a47af801ea
Summary:
Sorry for not sending these fixes in a single PR. I found this compiler warning while I was working on something else, and I just went to GitHub and modified the file directly for convenience...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18406
Differential Revision: D14594180
Pulled By: soumith
fbshipit-source-id: 92f48513bc62fbe2c67c759d68830a973296e43b
Summary:
To address the issue of broadcasting giving the wrong result in `nn.MSELoss()` as mentioned here https://github.com/pytorch/pytorch/issues/16045 . In particular, the issue often arises when computing the loss between tensors with shapes (n, 1) and (n,)
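A stdlib-only illustration of why the (n, 1) vs (n,) case is a trap: under NumPy-style broadcasting rules the two shapes silently combine to (n, n), so the loss averages over n^2 element pairs instead of n. The helper below is hypothetical and merely mirrors the broadcasting rules PyTorch follows:

```python
def broadcast_shape(a, b):
    """NumPy/PyTorch-style broadcasting of two shapes (tuples of ints)."""
    result = []
    # Right-align the shapes, padding the shorter one with 1s on the left.
    for x, y in zip(reversed((1,) * (len(b) - len(a)) + a),
                    reversed((1,) * (len(a) - len(b)) + b)):
        if x != y and 1 not in (x, y):
            raise ValueError("incompatible shapes")
        result.append(max(x, y))
    return tuple(reversed(result))

print(broadcast_shape((5, 1), (5,)))  # (5, 5) -- not the (5,) the user expects
print(broadcast_shape((5,), (5,)))    # (5,)
```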
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18349
Differential Revision: D14594176
Pulled By: soumith
fbshipit-source-id: f23ae68a4bf42f3554ad7678a314ba2c7532a6db
Summary:
Previously, we would continue to re-run requires-grad analysis on a loop body even when the outputs and inputs disagreed. This adds a check so that we don't continue running if the results haven't changed since the last run.
Fix for https://github.com/pytorch/pytorch/issues/18320
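The fix amounts to a standard fixed-point check: stop re-running the analysis once an iteration produces no change. A schematic version (the loop body and state here are hypothetical, not the actual JIT analysis):

```python
def analyze_to_fixed_point(initial_state, step, max_iters=100):
    """Re-run `step` until its output stops changing (a fixed point)."""
    state = initial_state
    for _ in range(max_iters):
        new_state = step(state)
        if new_state == state:  # nothing changed since the last run: done
            return state
        state = new_state
    raise RuntimeError("analysis did not converge")

# Toy analysis: propagate requires_grad=True through a chain of 3 values.
def step(flags):
    # each value requires grad if it already did, or if its predecessor does
    return (flags[0],) + tuple(flags[i] or flags[i - 1]
                               for i in range(1, len(flags)))

print(analyze_to_fixed_point((True, False, False), step))  # (True, True, True)
```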
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18361
Differential Revision: D14584332
Pulled By: eellison
fbshipit-source-id: 696b225f80a2036318540946428b525985a9e735
Summary:
This specializes optional tensor inputs to either a DimensionedTensorType or, when None is passed,
UndefinedTensor (aka AutogradZeroTensorType).
This works because we already have different specs and thus separate plans for the two cases.
It enhances the shape analysis - because now unwrapped optional tensors will have DimensionedTensorType with appropriate shape and required grad etc.
Also, when combined with "if-pruning" (which I understand #18259 works towards), we actually get much nicer concrete graphs, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18360
Differential Revision: D14590577
Pulled By: soumith
fbshipit-source-id: cac204a506d1d38b15703cbcc67a6b75fd4979f4
Summary:
Currently, `THPVariable_Wrap(…)` and `THPVariable_NewWithVar(…)` depend on the existence of `pyobj_` in the autograd metadata of a Variable to convert the Variable to a Python tensor. However, after the Variable/Tensor merge, there will be Variables that don't contain autograd metadata, and to allow the conversion from non-autograd-meta Variable to a Python tensor we need to store the `pyobj_` outside of autograd metadata and in a place where it will always be available.
This PR makes it possible by moving `pyobj_` into TensorImpl, so that `THPVariable_Wrap(…)` and `THPVariable_NewWithVar(…)` can always access a Variable's `pyobj_` and convert the Variable to a Python tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18225
Differential Revision: D14562616
Pulled By: yf225
fbshipit-source-id: 18d4aaace70eee6120abaf9276036d1f8f51b18d
Summary:
Adds a suggestion to add to __constants__ when a torch.nn.Module attr is accessed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18164
Differential Revision: D14580060
Pulled By: eellison
fbshipit-source-id: 0c5adc21d7341a5691d4b45930947cb1ba84c8e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18211
ghimport-source-id: 73b81e9ec631937b14db1da10991831788a6894b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18296 [jit] Add namespacing for ScriptClasses
* #18284 [jit] make test module hook use save/load
* **#18211 [jit] Turn script_type_parser into a class**
* #18148 [jit] python interop for script classes
If we are namespacing classes, the type parser will need to carry around
some state about which namespaces to look in. This PR just wraps it in a
class in preparation.
Also, subscriptToType can no longer be static, since parseTypeFromExpr
may give different results depending on the namespaces available, so
it's been made a regular function instead of a static map lookup.
Reviewed By: eellison
Differential Revision: D14581128
fbshipit-source-id: 711315472ccde1920abf9fdb5a871ac27fb86787
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18148
ghimport-source-id: 40a9d745dc9aeba53d098743323fcbd50ca65137
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18148 py interop**
Support for converting classes across the Python–TorchScript boundary. Like other TorchScript values, ScriptClasses are native Python values when used in Python and IValues when used in TorchScript.
Notably, there is a copy across this boundary, which will be surprising to users who will expect standard Python reference semantics. I have some ideas for fixing that, but it's a more involved process.
Reviewed By: jamesr66a
Differential Revision: D14526259
fbshipit-source-id: 5916e3032488a42dc7da756c1826d7c040a21ebd
Summary:
Fix for https://github.com/pytorch/pytorch/issues/17583
There's an unrelated issue right now causing a segfault when printing tensor so that might have to fixed first for this to land
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18298
Differential Revision: D14584266
Pulled By: eellison
fbshipit-source-id: 4e7850dadc78ef1e98ad40b9d8adc0fef42acf48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18345
ghimport-source-id: 9649d76bb194866859d62e6ba2a3a265c96ebba5
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18345 Make it possible to trigger XLA/slow tests via commit message.**
Four variants are supported: `[xla ci] [ci xla] [xla test] [test xla]`; substitute
xla with slow for slow tests.
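A hypothetical sketch of how such trigger tags could be parsed from a commit message (the function name and exact matching are illustrative, not the actual CI script):

```python
import re

def ci_triggers(commit_message):
    """Hypothetical sketch of parsing the trigger tags; the real check
    lives in the CI scripts, not in a function with this name."""
    found = set()
    for kind in ("xla", "slow"):
        # matches [xla ci], [ci xla], [xla test], [test xla] (or slow)
        pattern = r"\[(?:{0} (?:ci|test)|(?:ci|test) {0})\]".format(kind)
        if re.search(pattern, commit_message):
            found.add(kind)
    return found

assert sorted(ci_triggers("Fix bug [ci xla] and also [test slow]")) == ["slow", "xla"]
assert ci_triggers("no tags here") == set()
```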
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14584557
fbshipit-source-id: fcbfdfb28246823135bb3d3910baae073d16e81d
Summary:
so that functions like `def fn(x, p: float)` can be fused. Fixes #9940 and #11186. Fuses only float (not integer) arguments to simplify assembling arguments for fusion launch.
CPU fusion is disabled in CI and this won't be tested, but I tested it locally.
cc t-vi, apaszke
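Why restricting to floats simplifies argument assembly can be sketched with `struct` (a hypothetical illustration, not the actual fuser code):

```python
import struct

def pack_fusion_args(args):
    """Hypothetical sketch: with only float scalars allowed, every
    non-tensor argument packs with one uniform 32-bit-float format."""
    if not all(isinstance(a, float) for a in args):
        raise TypeError("only float scalar arguments are fused")
    return struct.pack("f" * len(args), *args)

assert len(pack_fusion_args([0.5, 2.0])) == 8  # two 32-bit floats
```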
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18087
Differential Revision: D14581206
Pulled By: wanchaol
fbshipit-source-id: ccb0cf79b1751706f9b2cdf1715115eae5a39fb6
Summary:
Two functions were not directed at NVRTC.
It's a bit hard to test this, as the fuser usually produces correct code - unless I try to hack on it. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18327
Differential Revision: D14579285
Pulled By: soumith
fbshipit-source-id: 1be7ba461cc473d514ba619507742a47d4d7c97e
Summary:
This is handy when testing various core-dump-related
things. If in the future we want to unit test our future gdb debugger
extensions, we can use this op to generate a core dump for us within a
unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18207
Differential Revision: D14482186
Pulled By: salexspb
fbshipit-source-id: 39a9fffbdd4bd083597f544d1c783a82cf023a89
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18194
Add a util method to clean up external inputs and outputs from a NetDef
The following conditions will be met after the modification
- No duplicate external inputs
- No duplicate external outputs
- Going through list of ops in order, all op inputs must be outputs
from other ops, or registered as external inputs.
- All external outputs must be outputs of some operators.
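The conditions above can be sketched as a small, hypothetical Python routine, with a NetDef modeled as a list of (inputs, outputs) pairs (illustrative only; the real util operates on protobufs):

```python
def cleanup_external_io(ops, external_outputs):
    """Hypothetical sketch of the rules above, with a NetDef modeled as
    a list of (inputs, outputs) pairs rather than a protobuf."""
    produced = set()
    external_inputs, seen = [], set()
    for op_inputs, op_outputs in ops:
        for name in op_inputs:
            # op inputs not produced by an earlier op must be external
            if name not in produced and name not in seen:
                seen.add(name)
                external_inputs.append(name)
        produced.update(op_outputs)
    # deduplicate and drop external outputs that no operator produces
    external_outputs_clean = []
    for name in external_outputs:
        if name in produced and name not in external_outputs_clean:
            external_outputs_clean.append(name)
    return external_inputs, external_outputs_clean

ops = [(["x", "w"], ["h"]), (["h", "b"], ["y"])]
assert cleanup_external_io(ops, ["y", "y", "z"]) == (["x", "w", "b"], ["y"])
```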
Reviewed By: ZolotukhinM
Differential Revision: D14528589
fbshipit-source-id: c8d82fda1946aa3696abcbec869a4a8bb22f09b6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18267
Motivation: we don't actually want to use it for real under any circumstances. This is an idea to unblock our internal progress and parallelize workstreams. We can easily define schemas for all ops in question and implement forwarding to C2 ops which is NOT going to be performant. Then several things can be happening in parallel:
* move code of ops outside of C2 ops that depend on protobuf into c10
* development of optimization/fusion passes
* building python-level wrappers with clean API
* improving perf
This demonstrates Relu, quant, dequant. It seems to cover all use cases necessary (maybe except weights prepacking). Ideally I'd demonstrate Conv, but will get to it later in a separate PR (contributions welcome)
Reviewed By: ezyang
Differential Revision: D14531232
fbshipit-source-id: 4cd4a71ae0cb373c6c0e81f965c442b82a1b4069
Summary: Removing the maximum number of blocks limit from the operator and making the nesterov parameter templated to remove branching.
Reviewed By: BIT-silence
Differential Revision: D14567003
fbshipit-source-id: 394c2039ee214adc6ccd2e562e4e9563d307131f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18291
ghimport-source-id: d6e95e899bd320407967df41435801e54864ba62
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18292 Add test for #17271 (torch.exp incorrect for 2**31 size tensor)
* **#18291 Correctly call superclass setUp in TestCase subclasses.**
This makes PYTORCH_TEST_SKIP_FAST work correctly for more
tests, reducing the wasted testing effort on our slow_test job.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14567643
fbshipit-source-id: 40cf1d6556e0dd0a0550ff3d9ffed8b6000f8191
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18129
A lot of tensor inference functions assume the operator passes the schema.
So call Verify to make sure this is actually the case.
Created a diff before to add checking in Concat (https://github.com/pytorch/pytorch/pull/17110), but I encountered a lot more places where this is assumed (for example ElementwiseOpShapeInference)
Reviewed By: mdschatz
Differential Revision: D14503933
fbshipit-source-id: cf0097b8c3e4beb1cded6b61e092a6adee4b8fcb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18246
Simplifies histogram collection and quantization process.
Histogram collection before this diff was something like this
```
from caffe2.quantization.server import dnnlowp_pybind11
...
dnnlowp_pybind11.ObserveHistogramOfOutput(hist_file)
for ...:
    workspace.RunNet(predict_net)
dnnlowp_pybind11.ClearNetObservers() # This is to trigger Stop function in the observer to dump out histogram file but this can have unintended consequence of also clearing all the other useful observers we attached
```
After this diff we can
```
workspace.CreateNet(predict_net) # Note we need to create net to have a net to attach observer
histogram_observer = dnnlowp_pybind11.AddHistogramObserver(predict_net, hist_file)
for ...:
    workspace.RunNet(predict_net)
predict_net.RemoveObserver(histogram_observer)
```
Choosing quantization parameters of weights before this diff was something like this
```
dnnlowp_pybind11.ObserveHistogramOfOutput(weight_hist_file)
workspace.RunNetOnce(init_net)
dnnlowp_pybind11.ClearNetObservers() # Has same issue as the histogram collection example above
dnnlowp_pybind11.RegisterQuantizationParamsWithHistogram(
weight_hist_file, is_weight=True, qparams_output_file_name=qparams_file
)
workspace.CreateNet(init_net, overwrite=True)
dnnlowp_pybind11.ClearNetObservers()
logger.info("Loading quantization params from {}".format(qparams_file))
blobs_to_qparams = {}
with open(qparams_file) as f:
    lines = f.readlines()
for line in lines:
    op_id, op_type, output_id, tensor_name, mini, maxi, scale, zero_point, precision = (
        line.split()
    )
    op_id = int(op_id)
    output_id = int(output_id)
    op = net.Proto().op[op_id]
    if op_type != op.type or op.output[output_id] != tensor_name:
        print(
            "Corrupt qparams file {} {} {} {} {}".format(
                qparams_file, op_type, op.type, op.output[output_id], tensor_name
            )
        )
    blobs_to_qparams[tensor_name] = QuantizationParam(float(scale), int(zero_point))
```
After this diff this can be simplified to
```
blobs_to_qparams = {}
for op in init_net.Proto().op:
    for output in op.output:
        scale, zero_point = dnnlowp_pybind11.ChooseQuantizationParams(output)
        blobs_to_qparams[output] = QuantizationParam(scale, zero_point)
```
Reviewed By: dskhudia
Differential Revision: D14544694
fbshipit-source-id: 4fd06cd63256201e2e9d15c39f503138d1be53c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18257
Support adding ops in global_init_net, because pred_init_net is per-thread and just doesn't cut it.
Reviewed By: jspark1105
Differential Revision: D14552695
fbshipit-source-id: 53dd44c84ad019019ab9f35fc04d076b7f941ddc
Summary:
* Adds more headers for easier scanning
* Adds some line breaks so things are displayed correctly
* Minor copy/spelling stuff
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18234
Reviewed By: ezyang
Differential Revision: D14567737
Pulled By: driazati
fbshipit-source-id: 046d991f7aab8e00e9887edb745968cb79a29441
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18290
Some ops, such as `Tile`, will mess up our tracking of batch size, and for now it makes sense to stop the shape inference on these ops so that we don't lower them and downstream ops without proper batch info.
Reviewed By: zrphercule
Differential Revision: D14463550
fbshipit-source-id: 2792481efa540f2a7dd310e677c213860c3053ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18159
In some instances, the call to forward could clash with std::forward. Fully qualify it to make sure it gets the right one
Reviewed By: ezyang
Differential Revision: D14512189
fbshipit-source-id: 6242607dbe54fcdb93229c1a4aaee8b84a88caa1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18158
They didn't work when called from other namespaces before because they didn't fully specify the c10 namespace.
Reviewed By: ezyang
Differential Revision: D14512187
fbshipit-source-id: a496b89a1bbe2b56137cfae03ab94a60f38d7068
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18090
This schema inference is needed by the c10 operator registration mechanism. Move it to c10.
It is going to be used by diffs stacked on top.
Reviewed By: ezyang
Differential Revision: D14491454
fbshipit-source-id: 0f8ddcdbd91467c8347d315dd443a1ca8b216481
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18038
Now that we have named overloads, we can allow registering the same function schema multiple times and just check it's identical.
This is going to be used in custom op registration since they register the schema every time a kernel is registered.
Reviewed By: dzhulgakov
Differential Revision: D14467494
fbshipit-source-id: 2c26cf72a64b65f120afe05e989302ec42597515
Summary:
Changelog:
- Renames `trtrs` to `triangular_solve` to remain consistent with `cholesky_solve` and `solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `triangular_solve` under the name `trtrs`, and add a deprecation warning to not promote usage.
- Move `isnan` to _torch_docs.py
- Remove unnecessary imports
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18213
Differential Revision: D14566902
Pulled By: ezyang
fbshipit-source-id: 544f57c29477df391bacd5de700bed1add456d3f
Summary:
Fixes Typo and a Link in the `docs/source/community/contribution_guide.rst`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18237
Differential Revision: D14566907
Pulled By: ezyang
fbshipit-source-id: 3a75797ab6b27d28dd5566d9b189d80395024eaf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18236
ghimport-source-id: 2bb80d017c2ea833669a2d55b340a922b2d44685
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18236 Enable running of slow tests in CI.**
* #18231 Add a decorator for marking slow tests.
These tests only run on master, as they are slow.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14563115
fbshipit-source-id: f54ddef4abedc7e872e58657fc9ac537952773d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18278
ghimport-source-id: 3c35f6e7229c3c2b3a27d96370d7c05fad58365e
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18278 Shut up compiler about unused this_type.**
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14563050
fbshipit-source-id: 4b516f6c9ef3784d1430f793f304066c351b1a93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18231
ghimport-source-id: 78c230f60c41877fe91b89c8c979b160f36f856b
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18231 Add a decorator for marking slow tests.**
The general strategy:
- It's a normal skip decorator, which triggers a skip if
PYTORCH_TEST_WITH_SLOW is not set.
- It also annotates the method in question that says it's
slow. We use this to implement a catch-all skipper in
setUp that skips all non-slow tests when
PYTORCH_TEST_SKIP_FAST is set.
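The strategy above can be sketched with stdlib `unittest` (a hypothetical approximation; names and details differ from the actual decorator):

```python
import os
import unittest

def slowTest(fn):
    """Hypothetical sketch: skip unless PYTORCH_TEST_WITH_SLOW is set,
    and annotate the method so a catch-all skipper can recognize it."""
    wrapped = unittest.skipUnless(
        os.environ.get("PYTORCH_TEST_WITH_SLOW") == "1", "test is slow"
    )(fn)
    wrapped.__slow_test__ = True
    return wrapped

class ExampleTest(unittest.TestCase):
    def setUp(self):
        # catch-all skipper: with PYTORCH_TEST_SKIP_FAST set, skip
        # every test that is not annotated as slow
        fn = getattr(self, self._testMethodName)
        if (os.environ.get("PYTORCH_TEST_SKIP_FAST") == "1"
                and not getattr(fn, "__slow_test__", False)):
            self.skipTest("skipping fast test")

    @slowTest
    def test_expensive(self):
        pass
```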
I added a little smoketest to test_torch and showed that I get:
```
Ran 432 tests in 0.017s
OK (skipped=431)
```
when running with PYTORCH_TEST_WITH_SLOW=1 and PYTORCH_TEST_SKIP_FAST=1
CI integration coming in later patch, as well as nontrivial uses of
this decorator.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14544441
fbshipit-source-id: 54435ce4ec827193e019887178c09ebeae3ae2c9
Summary:
This moves median to ATen.
- median with dimension reduces to kthvalue
- median without dimension (aka medianall) is implemented in parallel to kthvalue because we would not want to reshape (copying for non-contiguous) and then copy again in kthvalue. We can use the helper functions we moved from kthvalue.
- `median_cuda` was accidentally already put into ATen in #17544.
- The quickselect algorithm without indices for CPU in TH is now obsolete and removed.
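The reduction of median-with-dimension to kthvalue can be illustrated in plain Python (a sketch; for even-length inputs the median here picks the lower of the two middle elements, i.e. kthvalue with k = (n + 1) // 2):

```python
def kthvalue(values, k):
    """Plain-Python stand-in for torch.kthvalue: k-th smallest, 1-indexed."""
    return sorted(values)[k - 1]

def median(values):
    # median along a dimension reduces to kthvalue at (n + 1) // 2,
    # returning the lower of the two middle elements for even n
    n = len(values)
    return kthvalue(values, (n + 1) // 2)

assert median([5, 1, 4, 2]) == 2   # lower middle of [1, 2, 4, 5]
assert median([3, 1, 2]) == 2
```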
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17637
Differential Revision: D14346510
Pulled By: ezyang
fbshipit-source-id: c07ad144efbd6b4194179bb1c02635862521d8cb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18181
ghimport-source-id: 9c23551584a1a1b0b7ac246367f3a7ae1c50b315
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18184 Fix B903 lint: save memory for data classes with slots/namedtuple
* **#18181 Fix B902 lint error: invalid first argument.**
* #18178 Fix B006 lint errors: using mutable structure in default argument.
* #18177 Fix lstrip bug revealed by B005 lint
A variety of sins were committed:
- Some code was dead
- Some code was actually a staticmethod
- Some code just named it the wrong way
- Some code was purposely testing the omitted case
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14530876
fbshipit-source-id: 292a371d9a76ddc7bfcfd38b6f0da9165290a58e
Summary:
Two small refinements to the shape analysis:
- `detach` can set requires grad to false for dimensioned tensors (not sure if I would also need to deal with Complete?).
- add `batch_norm_stats`.
I noticed these while looking at what's going on when trying to code batch norm manually. (Hi wanchaol )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18271
Differential Revision: D14561303
Pulled By: ezyang
fbshipit-source-id: 64a6879392e77403c44f2ed82f84b6397754d0ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18177
ghimport-source-id: fbbf915b66762fc88bc5b541464e71ba27500958
Stack from [ghstack](https://github.com/ezyang/ghstack):
* #18184 Fix B903 lint: save memory for data classes with slots/namedtuple
* #18181 Fix B902 lint error: invalid first argument.
* #18178 Fix B006 lint errors: using mutable structure in default argument.
* **#18177 Fix lstrip bug revealed by B005 lint**
lstrip() doesn't strip a prefix; it strips all of the characters
in the passed in string. B005 lint revealed this. Replaced with
substring operation.
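For illustration, the difference between `lstrip()` and prefix removal (the `strip_prefix` helper is a hypothetical stand-in for the substring operation):

```python
# lstrip() removes any leading characters drawn from the given set,
# not a prefix -- the bug class B005 flags:
assert "libliberty".lstrip("lib") == "erty"   # not "liberty"!

# Substring-based fix, as a sketch:
def strip_prefix(s, prefix):
    return s[len(prefix):] if s.startswith(prefix) else s

assert strip_prefix("libliberty", "lib") == "liberty"
```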
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14530873
fbshipit-source-id: 13b3438fcc3cce13b5110730dc3d0b528a52930f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18240
For rare cases when dst_bin_width == 0 we should just put all numbers into an arbitrary bin.
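A minimal sketch of the guard, assuming a simple histogram-remapping routine (names and the remapping itself are illustrative, not the actual DNNLOWP code):

```python
def remap_histogram(src_counts, src_bin_width, dst_nbins, dst_bin_width):
    """Sketch of the guard: when dst_bin_width == 0 every observed value
    is identical, so dividing by the bin width would fail; put all
    counts into one arbitrary bin instead."""
    dst = [0] * dst_nbins
    if dst_bin_width == 0:
        dst[0] = sum(src_counts)
        return dst
    for i, count in enumerate(src_counts):
        j = min(int(i * src_bin_width / dst_bin_width), dst_nbins - 1)
        dst[j] += count
    return dst

assert remap_histogram([1, 2, 3], 1.0, 4, 0.0) == [6, 0, 0, 0]
```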
Reviewed By: csummersea
Differential Revision: D14544685
fbshipit-source-id: 02d04ff8bd1555d6cf7e7eeb1196a4ab3325a9e5
Summary:
Why do we need this workaround? `PythonArgParser` handles these two cases well.
The discussion started at https://github.com/pytorch/pytorch/pull/6201#issuecomment-378724406. The conclusion at that time by goldsborough was:
> Because we wanted to allow `dim=None` in Python and route to a different function. Essentially the problem was wanting to wrap the C++ function in Python. AFAIK there is no way of translating `dim=None` behavior into C++? So Richard and I came up with this strategy
Maybe at that time `PythonArgParser` was not powerful enough to handle the routing of two functions with the same name but different C++ signatures.
Will keep an eye on the CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17103
Differential Revision: D14523503
Pulled By: VitalyFedyunin
fbshipit-source-id: cae3e2678062da2eccd93b51d4050578c7a9ab80
Summary:
Fix #17801 to add an exception regarding `ignore_index` in the documentation for `torch.nn.CrossEntropyLoss` and `torch.nn.NLLLoss`
If any other files/functions are hit, I'd be glad to incorporate the changes there too! 😊
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18117
Differential Revision: D14542079
Pulled By: ezyang
fbshipit-source-id: 7b918ac61f441dde7d3d6782d080c500cf2097f1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17943
Together with xw285cornell came up with a solution for static destruction
order fiasco that caused the NCCL context to be destroyed **after**
the CUDA context was already destroyed. In this commit we destroy all
cached NCCL contexts as soon as the last NCCL related Caffe2 operator
instance is destructed, thereby avoiding a dependency on static
variable destruction.
Reviewed By: xw285cornell
Differential Revision: D14429724
fbshipit-source-id: fe5ce4b02b1002af8d9f57f6fa089b7a80e316ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17905
Support adding ops in global_init_net, because pred_init_net is per-thread and just doesn't cut it.
Reviewed By: jspark1105
Differential Revision: D14114134
fbshipit-source-id: 112bb2ceb9d3d5e663dd430585567f4eaa2db35f
Summary:
So, we will keep the names of ONNX initializers the same as the names in PyTorch state dict.
Later, we will make this as the default behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17551
Reviewed By: dzhulgakov
Differential Revision: D14491920
Pulled By: houseroad
fbshipit-source-id: f355c02e1b90d7ebbebf4be7c0fb6ae208ec795f
Summary:
- Remove single batch TH/THC implementations
- Remove `_batch_trtrs_lower` from `multivariate_normal`
- Add tests for batched behavior
- Modify trtrs_backward to accommodate for batched case
- Modify docs
In a future PR, this will be renamed to `triangular_solve`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18025
Differential Revision: D14523004
Pulled By: ifedan
fbshipit-source-id: 11c6a967d107f969b60e5a5c73ce6bb8099ebbe1
Summary:
I don't know if we actually want to expose this or not, but it's useful for debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18197
Reviewed By: ezyang
Differential Revision: D14530712
Pulled By: gchanan
fbshipit-source-id: 98fdba9cf113738f0db3a198c49365de536b9919
Summary:
Loop analysis indicates that there is a runtime trip count and hence
unrolling cannot take place.
This will silence compile-time warnings we have been observing with recent ROCm releases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18204
Differential Revision: D14539875
Pulled By: ezyang
fbshipit-source-id: a7ea7f2a95603754296b76a6b62a154f56f4ad4d
Summary:
Further breakup test_misc.h. The remaining tests don't directly map to a jit file so I left them in test_misc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18191
Differential Revision: D14533442
Pulled By: eellison
fbshipit-source-id: 7f538ce0aea208b6b55a4716dfcf039548305041
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18193
ghimport-source-id: 540859cf0b238a9832f45b3f4c2351e3343fc1a2
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18193 Turn on Travis builds for ghstack PRs.**
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D14529945
fbshipit-source-id: 4476e996e311a04f2a997ca9b7c4cf2157dd6286
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18195
ghimport-source-id: 05102cb115c6bd6d141f51905e20155bcd79a908
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18195 [build] do not throw when unicode is seen in pull request info**
Differential Revision: D14529707
fbshipit-source-id: 2f6a31b01b3a9b044fd24be466cc5325b70929ad
Summary:
The type of each `initial_ivalue` is completely known at some point but that information is discarded by the time a call to it is emitted. This PR is kind of a hack, as a better (longer) solution, the method should know about the type of each initial value.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18156
Differential Revision: D14525768
Pulled By: driazati
fbshipit-source-id: 52d53e9711a07a4551c988bd95fe997e654aa465
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18138
ghimport-source-id: be62a71ef98714e6f168a00f84120f612363528e
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18138 Enable flake8-bugbear line length checking.**
This enables flake8-bugbear's line length checker (B950), which permits violations
of up to 10% but reports the "true" limit when you go over.
I had to ignore a bunch of flake8-bugbear's other checks when I
turned this on. They're good checks though (they're turned on
in fbcode) and we should fix them eventually.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Reviewed By: salexspb
Differential Revision: D14508678
fbshipit-source-id: 2610ecc0dd43cc0788d77f4d024ebd85b26b8d41
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18146
ghimport-source-id: 4b061c27c5c44ef0d06066490ed16cab3d0c7a64
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18146 [jit] fix bug in alias analysis**
We handled hasWriters() incorrectly in the case of wildcards. There's
even a comment describing the correct behavior. Sad!
Much thanks to t-vi for tracking this down and suggesting the fix!
Differential Revision: D14524208
fbshipit-source-id: 8010b54257241bd64013a0d0a8b6e7d22d8c70af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18123
The motivation of this fix is to resolve things like:
`for (auto i = 0; i < N; i++)` where N is bigger than int32
These instances of comparison were found by enabling -Wsign-compare
There are way too many things to fix, so issuing this as a series of fixes
The plan is to fix all these issues and then enable this flag into Caffe2 to catch future instances
Reviewed By: ZolotukhinM
Differential Revision: D14497094
fbshipit-source-id: bca3927a2188bd33a508fa503ba221c220cdaefe
Summary:
The momentum buffer is initialized to the value of
d_p, but the current code takes the long way to do this:
1. Create a buffer of zeros
2. Multiply the buffer by the momentum coefficient
3. Add d_p to the buffer
All of these can be collapsed into a single step:
1. Create a clone of d_p
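The collapse can be illustrated with plain Python lists standing in for tensors (a sketch; the real optimizer code uses tensor ops such as `d_p.clone()`):

```python
import copy

d_p = [0.5, -1.0, 2.0]                     # stand-in for a gradient tensor
momentum = 0.9

# Before: three steps
buf = [0.0] * len(d_p)                     # 1. buffer of zeros
buf = [momentum * b for b in buf]          # 2. multiply by momentum (still zeros)
buf = [b + g for b, g in zip(buf, d_p)]    # 3. add d_p

# After: a single step (torch equivalent: d_p.clone())
buf_fast = copy.deepcopy(d_p)

assert buf == buf_fast
```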
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18114
Differential Revision: D14509122
Pulled By: ezyang
fbshipit-source-id: 4a79b896201d5ff20770b7ae790c244ba744edb8
Summary:
In aten we have a _fused_dropout implementation for the CUDA case. As ngimel suggested, if we discard it in JIT AD, it hurts performance.
It doesn't seem ideal to include backend specific implementation in AD, but this is helpful to prevent performance regression atm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17756
Differential Revision: D14368999
Pulled By: ailzhang
fbshipit-source-id: 9a371c5020f630e8f6e496849ec9772b6f196169
Summary:
Addresses #15738, using fritzo's suggestion. This adds a `torch._sample_dirichlet` method in `Distributions.cpp` and `Distributions.cu`.
- For CPU, this leads to no perf hit since all we do is to promote the `alpha` to double when getting the gamma samples (the gamma sampler anyways uses `accscalar_t`(double for CPU)) and cast it back to float32 on return.
- I have added an analogous method for CUDA as well, but the default sampler for CUDA uses scalar_t for efficiency, so I have kept it as that. With this, I do not see the bias towards 1 as reported in #15738 with `float32`, but there is a spurious mode at 0.5, as would be expected. Users would need to explicitly use `float64` for GPU to not see the spurious mode at 0.5. (EDIT: see note below, it appears that the bias issue is still there for certain builds).
Added some tests and checked that there is no perf regression. My experience with C++ is very limited, so apologies in advance if I missed something basic. cc. ailzhang, fritzo, fmassa
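The standard gamma-based Dirichlet sampler that this method implements can be sketched with the stdlib (illustrative only; the real implementation draws gamma samples in the ATen kernels, promoting `alpha` to double on CPU):

```python
import random

def sample_dirichlet(alphas):
    """Sketch of the standard gamma-based Dirichlet sampler: draw
    Gamma(alpha_i, 1) variates and normalize to get a point on the
    probability simplex."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

sample = sample_dirichlet([0.5, 0.5, 0.5])
assert abs(sum(sample) - 1.0) < 1e-9
assert all(0.0 <= s <= 1.0 for s in sample)
```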
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17488
Differential Revision: D14410301
Pulled By: ezyang
fbshipit-source-id: 62b2f694b4642685eab06db96d74ce28e05c3992
Summary:
It's wrong and unused. Use one of the many other constructors instead :).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18137
Differential Revision: D14508364
Pulled By: gchanan
fbshipit-source-id: 19c6ff78ad9d9221d0874425edd02b78627c4ca7
Summary:
There are multiple backends for a device type, so we just kill this function.
Also, kill an getNonVariableType instance which was also underspecified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18135
Differential Revision: D14507474
Pulled By: gchanan
fbshipit-source-id: fc791a76d4b851b23d09a070725f3838621eb13d
Summary:
This gets rid of 'aten_sparse' which was used at one time with legacy THS code, but is now only overloaded in native_parse.py.
The way that 'aten_sparse' worked was wonky -- it extended all backends (default [CPU, CUDA]) to include sparse.
But this is totally unnecessary; we already have the backends we need to generate for from type_method_definition_dispatch.
codegen changes: fc37c8e171/diff.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18144
Reviewed By: ezyang
Differential Revision: D14511324
Pulled By: gchanan
fbshipit-source-id: 8bb4ac4cf0985f8756790779a22bc229e18e8e7f
Summary:
Fix #16428 by correcting type of 'swap' from `float` to `bool`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18115
Differential Revision: D14516615
Pulled By: ezyang
fbshipit-source-id: c61a45d533f3a443edf3c31c1ef3d9742bf46d2b
Summary:
Allows serialization/loading of attributes (`IValue`s of any type).
* metadata (attribute name, type) is stored in the `model.json`
* The binary format is a subset of the `pickle` module that supports the operations necessary for `IValue`s
* Attributes are serialized in the order they are defined on a module to a list in a single `attributes` file, with submodule attributes coming first. This order directly matches the order attributes are listed in `model.json`
* This can be inspected in Python with `pickle.load()` or with `pickletools` (PyTorch need not be installed for this to work)
* A class is used to store a tensor's index into the tensor table of the model, so to unpickle the file you have to use a custom Unpickler:
```python
import pickle

class TensorID(object):
    def __setstate__(self, id):
        self.id = id

class JitUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == '__main__' and name == 'TensorID':
            return TensorID
        return super(JitUnpickler, self).find_class(module, name)

JitUnpickler(open("my_model/attributes.pkl", "rb")).load()
```
* pickle format: https://svn.python.org/projects/python/trunk/Lib/pickletools.py
* It currently does not support/guarantee that anything saved out with `pickle` (i.e. if you edit `attributes` with `pickle` directly) instead of our tools will be imported correctly
Also will fix #17683 and fix #16367
Followup Work:
* document format / choice of pickle: #17951
* create an example
* list specializations
* int size specializations, large binputs
* do a first pass over attributes to output only necessary `BINPUT` ops
* attribute reassignment (e.g `self.my_attribute = new_value`)
* `tensor.save("some_checkpoint.pkl")` support with tensors embedded in Pickle file
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17423
Differential Revision: D14470965
Pulled By: driazati
fbshipit-source-id: 6a21a9939efdbe59b4bc57fd31d6d630bab5297e
Summary:
Changelog:
- Renames `gesv` to `solve` to remain consistent with `cholesky_solve`.
- Rename all tests, fix callsites
- Create a tentative alias for `solve` under the name `gesv`, and add a deprecation warning to not promote usage.
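A tentative alias of this kind can be sketched in plain Python (hypothetical stand-ins; the real alias lives in the generated torch API):

```python
import warnings

def solve(b, A):
    return "solution"          # hypothetical stand-in for the new op

def gesv(b, A):
    """Sketch of the tentative alias: forward to the new name and warn."""
    warnings.warn(
        "torch.gesv is deprecated in favour of torch.solve",
        DeprecationWarning, stacklevel=2,
    )
    return solve(b, A)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = gesv(None, None)
assert result == "solution"
assert caught[0].category is DeprecationWarning
```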
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18060
Differential Revision: D14503117
Pulled By: zou3519
fbshipit-source-id: 99c16d94e5970a19d7584b5915f051c030d49ff5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18107
Pull Request resolved: https://github.com/pytorch/translate/pull/396
also:
1. fix issues with OptionalType not having a createWithContainedType (PyTorch diff)
2. Delete tests for ONNX full beam search export (nobody is using it and it just makes things harder. Currently ONNX doesn't support `_unwrap_optional`)
Reviewed By: jmp84
Differential Revision: D14483771
fbshipit-source-id: 0e37ef1cb5a16d03a535eef808b0488b98802128
Summary:
...because gcc will have failures with very strange error messages
if you do.
This affects people with Debian/Ubuntu-provided NVCC, the PR should
not change anything for anyone else.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18127
Differential Revision: D14504386
Pulled By: soumith
fbshipit-source-id: 1aea168723cdc71cdcfffb3193ee116108ae755e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18121
ghimport-source-id: 70c273bfbcb68f7b25cf87f5614c662960864758
Stack from [ghstack](https://github.com/ezyang/ghstack):
* **#18121 [jit] fix double free in test_jit**
These definitions used to be in anonymous namespace so they weren't exported from the translation unit. #18071 put those in a `test` namespace so I guess they were getting their destructors called twice on exit somehow. Making them static again fixes the problem.
Reviewed By: ezyang
Differential Revision: D14498349
fbshipit-source-id: f969781695dcbebdfcfce667fce5b986222a373e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18059
Replace resize_dim() with set_sizes_and_strides() in `THTensor_(squeeze)` in aten/src/TH/generic/THTensor.cpp and `THCTensor_(squeeze)` in aten/src/THC/generic/THCTensor.cpp
Reviewed By: ezyang
Differential Revision: D14471066
fbshipit-source-id: 1c8c412ff09246c4df6843736e3bf0279bfadea8
Summary:
Sphinx doesn't understand the hyphen; it does not merge the two halves together in HTML.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18118
Differential Revision: D14498012
Pulled By: mrshenli
fbshipit-source-id: d6f4cfddc0a8e3a8f91578da43c26ca9c6fff3ce
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18072
ghimport-source-id: 9653731602c72f299e095dd50e3afe6bcc8b01d6
Stack:
* **#18072 properly device_guard IndexTensor and BoolTensor.**
* #18073 Change one_hot from IndexTensor to Tensor.
Currently IndexTensor and BoolTensors do not have device_guards applied to them.
This is bad in the case where the only tensor(s) are IndexTensors or BoolTensors, because no device guard is present.
The only case where this currently happens is one_hot, which ends up not mattering because of the way its implementation is written. But I wanted to make sure we are covered here.
Reviewed By: ezyang
Differential Revision: D14485249
fbshipit-source-id: e57b28086fa1ad2fdd248bb1220e8a2e42da03e1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18093
ghimport-source-id: 021adc52aa7bfe5fff74531c76a8cd28cab30b2a
Stack:
* **#18093 [jit] fix corner case for optional aliasing**
Occasionally the compiler can insert constant Nones to make types line
up. In that case, don't try to make a pointer from the optional type to
None, since we know statically that None won't be mutated.
Reviewed By: shannonzhu
Differential Revision: D14493004
fbshipit-source-id: 6564065f39d99ee5af664f3a0fe235892973d9be
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18040
Add a flag that fails if a floating point exception is detected during an operator run.
Sample exception
Exception [enforce fail at operator.h:837] !std::fetestexcept(FE_DIVBYZERO). Division by zero floating point exception (FE_DIVBYZERO) reported.
Error from operator:
input: "1" input: "0" output: "out" name: "" type: "Div"
Reviewed By: jspark1105
Differential Revision: D14467731
fbshipit-source-id: fad030b1d619a5a661ff2114edb947e4562cecdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18084
The data_strategy parameter was not used in some of the unit tests for optimizers.
Reviewed By: hyuen
Differential Revision: D14487830
fbshipit-source-id: d757cd06aa2965f4c0570a4a18ba090b98820ef4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18037
The FunctionSchema can now store an overload name and the parser knows how to parse it. Specify like this:
my_func.overload1(arg1: Tensor) -> Tensor
my_func.overload2(arg1: Tensor, arg2: Tensor) -> Tensor
Reviewed By: zdevito
Differential Revision: D14467497
fbshipit-source-id: 8832b32f07351bb61090357b17b77a6a2fed3650
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18036
- Add macros to export c10 cuda operators to caffe2 frontend
- Instead of having a separate caffe2 registry for the c10 operator wrappers, use the existing caffe2 registries
Reviewed By: ezyang
Differential Revision: D14467495
fbshipit-source-id: 7715ed2e38d2bbe16f1446ae82c17193a3fabcb9
Summary:
Changes:
1) https://github.com/pytorch/pytorch/pull/17527 changed dispatch macros to be ScalarType based instead of at::Type based. This broke cpp extensions that relied on dispatch macros. Since IMO these should be ScalarType based (and some extensions have already updated), we allow either at::Type or at::ScalarType to be passed, but passing at::Type will result in a deprecated warning.
2) Reintroduce macros that were deleted (AT_DISPATCH_ALL_TYPES_AND_HALF, AT_DISPATCH_COMPLEX_TYPES, AT_DISPATCH_ALL_TYPES_AND_HALF_AND_COMPLEX, AT_DISPATCH_ALL_TYPES_AND_COMPLEX); the AND_HALF ones now give a deprecated warning because there are more extensible macros that were introduced in their place.
3) Makes AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND into a ScalarType based macro (and updates usages). This was the result of a logical merge conflict.
4) Adds a new macro, C10_DEPRECATED_MESSAGE for passing a deprecated message to the compiler. I didn't spend much time seeing if this can be enabled for versions before C++14.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17996
Reviewed By: ezyang
Differential Revision: D14446203
Pulled By: gchanan
fbshipit-source-id: 1da56e2e9c15aa8f913ebbf6bf1110c5b6dc375e
Summary:
Break up test_misc so that the tests for a given file live in test_filename. I think we might want to wait on moving test files into the source directory, since that would involve moving some tests over to the C10 folder, and this goes 99% of the way toward test discoverability IMO anyway.
I added a file test_utils for common functions invoked in the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18071
Differential Revision: D14485787
Pulled By: eellison
fbshipit-source-id: dcb20d1978d490999d435ea20c1d0503413a5c80
Summary:
Stack:
⚫ **#17856 [jit] support serialization of classes** [💛](https://our.intern.facebook.com/intern/diff/D14402599/)
Add support for saving/loading TorchScript modules that depend on user-defined classes.
We track class dependencies the same way we track tensor constants, then write them
all out such that we can just compile them in order before compiling the module
hierarchy.
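The write-them-out-so-they-compile-in-order step can be sketched in plain Python — a toy `compile_order` helper over a hypothetical dependency map, not the actual serializer:

```python
def compile_order(deps):
    # deps: {class_name: [names it depends on]}
    # Depth-first post-order: a class is emitted only after everything
    # it depends on has been emitted, so compilation can proceed in order.
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in deps.get(name, []):
            visit(dep)
        order.append(name)  # all dependencies already emitted

    for name in deps:
        visit(name)
    return order
```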
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17856
Reviewed By: shannonzhu
Differential Revision: D14461599
Pulled By: suo
fbshipit-source-id: 7115f87e069fd00dc8381d7de9997864fef7ea9f
Summary:
The C10 ops are not registered as custom ops in PyTorch, so we have to add explicit support for them here, too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17899
Reviewed By: dzhulgakov
Differential Revision: D14436999
Pulled By: houseroad
fbshipit-source-id: a31fdf13a5c84f9b156a7288e0ffa57deb23b83f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17993
ghimport-source-id: 5427773f6306bdeddffd9a3ae032acc3f253f458
Stack:
* #17926 Implement at::has_internal_overlap helper function
* #17927 Error out on in-place (unary) ops on tensors that have internal overlap
* **#17993 [easy] Delete dead code in THTensorMoreMath.cpp**
We seem to have new implementations already for these in ATen.
Reviewed By: ezyang
Differential Revision: D14457838
fbshipit-source-id: 8481aad74b2127bd28c0f3e09740889fc0488a31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17927
ghimport-source-id: 626d321e430b6b5c0ea3aa1eb9df8c1e2d058bf8
Stack:
* #17926 Implement at::has_internal_overlap helper function
* **#17927 Error out on in-place (unary) ops on tensors that have internal overlap**
On the way to #17935.
Works for CPU and CUDA on the following ops:
- abs_, acos_, asin_, atan_, ceil_, cos_, erf_, erfc_, exp_, expm1_
- floor_, log_, log10_, log1p_, log2_, round_, rsqrt_,
- sin_, sqrt_, tan_, tanh_, trunc_
This PR adds a check to see if the out/result tensor has internal
overlap. If it does, then we error out because the result **may** be
incorrect.
This is overly conservative; there are some cases where if the result is
the same as the input, the inplace operation is OK (such as floor_,
round_, and trunc_). However, the current code isn't organized in such a
way that this is easy to check, so enabling those will come in the future.
Reviewed By: ezyang
Differential Revision: D14438871
fbshipit-source-id: 15e12bf1fdb2ab7f74bb806e22bc74840bd6abd1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17926
ghimport-source-id: 9f7572b5d43e474492363fa17dcb86a6c27ca13c
Stack:
* **#17926 Implement at::has_internal_overlap helper function**
* #17927 Error out on in-place (unary) ops on tensors that have internal overlap
On the way to #17935.
Checks whether a tensor's sizes/strides indicate that multiple elements share
the same memory location. The general problem is hard, so
at::has_internal_overlap implements two heuristics and avoids solving
the general problem:
- if a tensor is contiguous, it cannot have internal overlap
- if a tensor has any zero strides, it does have internal overlap
- otherwise, return MemOverlap::kTooHard to indicate that there might be
overlap, but we don't know.
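The two heuristics can be sketched in plain Python over explicit size/stride tuples (a toy model, not the ATen implementation):

```python
from enum import Enum

class MemOverlap(Enum):
    NO = 0
    YES = 1
    TOO_HARD = 2

def has_internal_overlap(sizes, strides):
    # Heuristic 1: a contiguous tensor cannot overlap itself.
    expected = 1
    contiguous = True
    for size, stride in zip(reversed(sizes), reversed(strides)):
        if size != 1 and stride != expected:
            contiguous = False
        expected *= size
    if contiguous:
        return MemOverlap.NO
    # Heuristic 2: a zero stride on a dim with size > 1 means the same
    # memory location is visited repeatedly.
    if any(stride == 0 and size > 1 for size, stride in zip(sizes, strides)):
        return MemOverlap.YES
    # Anything else is the hard general problem; report "don't know".
    return MemOverlap.TOO_HARD
```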
Reviewed By: ezyang
Differential Revision: D14438858
fbshipit-source-id: 607ab31771315921ab6165b2a1f072ac3e75925a
Summary:
In Python 2, float values get truncated. We are storing default float values as floats (not 100% sure why?), which results in the defaults being truncated in the JIT and not matching the (specified) native function signatures.
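The kind of precision loss involved can be demonstrated by round-tripping a Python float (a C double) through 4-byte float storage — illustrative only; the actual bug is in how the JIT stored default values:

```python
import struct

def as_float32(x):
    # Pack as a 32-bit float and unpack back to a Python double.
    # Values not exactly representable in float32 come back changed.
    return struct.unpack('f', struct.pack('f', x))[0]
```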
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18044
Reviewed By: ezyang
Differential Revision: D14469868
Pulled By: gchanan
fbshipit-source-id: a456de599e8dab106966bcac7a6033f02ce3cdd2
Summary:
Currently, we cannot run a checkpointed function with None argument.
```python
out = torch.utils.checkpoint.checkpoint(run_fn, input_var, None)
```
```
File "/home/tunz/anaconda3/envs/torchdev/lib/python3.7/site-packages/torch/utils/checkpoint.py", line 14, in detach_variable
x = inp.detach()
AttributeError: 'NoneType' object has no attribute 'detach'
```
This PR makes the checkpoint function handle None arguments safely.
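The guard can be sketched with a toy tensor class (not the actual torch.utils.checkpoint code — `FakeTensor` and this `detach_variable` are illustrative stand-ins):

```python
class FakeTensor:
    # Minimal stand-in for a tensor with .detach()
    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad

    def detach(self):
        return FakeTensor(self.value, requires_grad=False)

def detach_variable(inputs):
    out = []
    for inp in inputs:
        if not isinstance(inp, FakeTensor):
            # None (or any non-tensor) is forwarded as-is instead of
            # raising AttributeError on .detach()
            out.append(inp)
            continue
        out.append(inp.detach())
    return tuple(out)
```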
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17969
Differential Revision: D14475148
Pulled By: ezyang
fbshipit-source-id: 9afe9e9aac511a6df1e1620e9ac341536890d451
Summary:
According to https://docs.python.org/3/tutorial/inputoutput.html, it is good practice to use the "with" keyword when dealing with file objects; otherwise you must call f.close() to close the file and immediately free any system resources it uses. This PR therefore changes the file-opening code to use "with open() as f".
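The pattern in question, runnable as-is:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Preferred: the context manager closes the file even if the body raises.
with open(path, "w") as f:
    f.write("hello")

with open(path) as f:
    data = f.read()

assert f.closed  # released as soon as the block exits
```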
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18017
Differential Revision: D14475112
Pulled By: ezyang
fbshipit-source-id: d1c0821e39cb8a09f86d6d08b437b4a99746416c
Summary:
This is now working in rocm 2.2
cc xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18043
Differential Revision: D14477493
Pulled By: bddppq
fbshipit-source-id: 4d2dab1d5dbdbd4d6189162c074b19c4e9882c7d
Summary:
The output format of NonZero in ONNX (which follows NumPy, https://docs.scipy.org/doc/numpy/reference/generated/numpy.nonzero.html) differs from that in PyTorch:
In ONNX: `[rank_of_input, num_of_nonzeros]`, whereas in PyTorch: `[num_of_nonzeros, rank_of_input]`.
To resolve the difference, the exporter adds a Transpose op after the nonzero output.
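The layout difference amounts to a transpose, sketched here in plain Python with toy helpers (not the exporter code):

```python
def nonzero_torch_layout(matrix):
    # PyTorch-style result: [num_of_nonzeros, rank_of_input]
    # (one row of coordinates per nonzero element)
    return [[i, j]
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]

def to_onnx_layout(indices):
    # ONNX-style result: [rank_of_input, num_of_nonzeros] — a transpose
    return [list(col) for col in zip(*indices)]
```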
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18047
Differential Revision: D14475081
Pulled By: ezyang
fbshipit-source-id: 7a3e4899f3419766b6145d3e9261e92859e81dc4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17958
In some places, we need 64-bit for corner cases even though it's going to be rare.
In some places, we were using 64-bit unnecessarily.
Reviewed By: hyuen
Differential Revision: D14435523
fbshipit-source-id: e01ab73029ff780133af7ff4bbbe2e17926ed5a2
Summary:
Fix a very common typo in my name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17949
Differential Revision: D14475162
Pulled By: ezyang
fbshipit-source-id: 91c2c364c56ecbbda0bd530e806a821107881480
Summary:
ROCm 2.2 was released today, if we respin the CI docker images with the attached, PyTorch/Caffe2 will support ROCm 2.2
Changes necessary:
* for the Ubuntu target, HIP PR 934 needs to be applied to fix the forceinline definition. ROCm 2.3 will contain this.
* two unit tests proved flaky on different platforms; disable them defensively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18007
Differential Revision: D14473903
Pulled By: bddppq
fbshipit-source-id: b1939f11d1c765a3bf71bb244b15f6ceb0e816d3
Summary:
Fixes#17558
The flattened tuple `Optional[Tuple[int, int]]` could either result in 1 (`None`) or 2 (`int` and `int`) values, so allow this case in `ArgumentSpec`
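The two accepted arities can be illustrated with a toy helper (hypothetical name; the actual ArgumentSpec logic is C++):

```python
def flatten_optional_pair(x):
    # Flattening Optional[Tuple[int, int]]:
    # None flattens to a single value, a pair flattens to two values —
    # the ArgumentSpec fix accepts either arity.
    if x is None:
        return [None]
    a, b = x
    return [a, b]
```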
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17826
Differential Revision: D14415290
Pulled By: driazati
fbshipit-source-id: 971bfa39502cfb8f08a991f16ffed6d138e48dc9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/18039
We basically flatten the whole net in order to ease the ONNXIFI transform. An alternative is to ONNXIFI the internal net of the If op, which could be done by adding interface inputs/outputs that map the internal then_net or else_net to the inputs/outputs of the If op. This is left as a TODO option.
Reviewed By: zrphercule
Differential Revision: D14452132
fbshipit-source-id: 00ad48d40da6fb8eabf9cca36701bcf61cbe4edc
Summary:
Modified the Tensor Iterator GPU reduction kernel.
Creating multiple accumulators during thread reduce removes the data dependency
between unrolled loops and exposes instruction-level parallelism, which benefits
latency-bound kernels (e.g. the Welford kernel used by `torch.std`).
This approach increases register usage, so we need to tune the unrolling
factors to prevent register spilling.
The current implementation tunes the unrolling factor down to 2 for Welford (a
register-heavy kernel), while keeping it unchanged (4) for the rest of the reduction kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17667
Differential Revision: D14368325
Pulled By: umanwizard
fbshipit-source-id: 9d64c0dccabdb1b7c3922a6557224af704a1974e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17965
ghimport-source-id: 0d3d6340141d8413ce524a8d8ed0d308854ee7ef
Stack:
* (to be filled)
Also added it to the python bindings. Not for any particular reason,
just because otherwise the function gets elided (even in debug mode!)
and thus can't be called from the debugger
Reviewed By: eellison
Differential Revision: D14442654
fbshipit-source-id: 2868bb32ccb80b04f9483883faa702f63a7948bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17781
The wrapper for calling a c10 operator from caffe2 is now based on a runtime FunctionSchema instead of compile time information. This way, it can be created for any c10 operator schema with just one invocation to a simple macro instead of having to define arguments and more as compile time structures.
Furthermore, previously, the wrapper assumed there's an argument present for preallocated outputs, but that was only true for caffe2 operators exported to c10. So the wrapper only worked correctly for calling caffe2->c10->caffe2. Now with the new implementation, it works for any c10 operator.
Also, binary size for this should be much smaller.
Reviewed By: ezyang
Differential Revision: D14375054
fbshipit-source-id: bac7ab8e63929e6e2a148eacac41ed092009aa86
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17743
- caffe2::Operator::SetOutputTensor() can now be used in operators that are called from c10/PyTorch.
- If the operator uses SetOutputTensor() instead of XOutput(), the wrapper doesn't preallocate an empty tensor for the operator anymore. Only outputs accessed in XOutput() will get an output tensor preallocated.
- Remove the copying of the vector with output tensors into a vector with pointer to output tensors.
- Preallocated outputs are now passed in as one TensorList argument on the stack. This TensorList argument has a well-defined name so other wrappers (i.e. the wrapper calling from c2 into c10) can recognize and use it.
- Macros for exporting caffe2 operators to c10 are simplified. Instead of having `c10_op_handle_for_c2_op`, we now pass in the operator handle as a template argument.
- `SetOutputTensor` and `OutputTensorOrUndefined` now work with operators exported to c10
Reviewed By: ezyang
Differential Revision: D14362434
fbshipit-source-id: 44a5e717204f21ea8e9728437429d9b84906f9f5
Summary:
Raise an error in conv1d when:
1. the kernel size is larger than the input
2. the expected output size would be less than zero
Test case added:
- invalid_conv1d
- Relevant test cases for conv2d and conv3d exists
Fixes#17247
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17436
Reviewed By: mrshenli
Differential Revision: D14354272
Pulled By: fmassa
fbshipit-source-id: 94b98621aa03b1f60d151ef9399ed3da55d41b42
Summary: The CI run on https://github.com/pytorch/pytorch/pull/17995 has verified that this should fix the CI.
Reviewed By: bddppq
Differential Revision: D14447674
fbshipit-source-id: 50085db9ae7421b5be216ed0a2216234babfdf6c
Summary:
```
In file included from /var/lib/jenkins/workspace/aten/src/ATen/native/hip/BatchLinearAlgebra.hip:3:
In file included from /var/lib/jenkins/workspace/aten/src/ATen/hip/HIPContext.h:5:
/var/lib/jenkins/workspace/aten/src/ATen/hip/impl/HIPStreamMasqueradingAsCUDA.h:107:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17961
Reviewed By: houseroad
Differential Revision: D14436421
Pulled By: bddppq
fbshipit-source-id: 962665602178699d7c7b55f4ca7ff1eb72ee0349
Summary:
Our AVX2 routines use functions such as _mm256_extract_epi64
that do not exist on 32 bit systems even when they have AVX2.
This disables AVX2 when _mm256_extract_epi64 does not exist.
This fixes the "local" part of #17901 (except disabling FBGEMM),
but there also is sleef to be updated and NNPACK to be fixed,
see the bug report for further discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17915
Differential Revision: D14437338
Pulled By: soumith
fbshipit-source-id: d4ef7e0801b5d1222a855a38ec207dd88b4680da
Summary:
FBGEMM doesn't work on x86 32bit and prior to this patch, it will
generate x86_64 objects in a build that is supposed to be x86 32bit.
FBGEMM actually relies on registers not available on x86_32, so
we disable it.
This takes of one element of #17901. There are more dependencies
and a separate PR (#17915) regarding AVX detection for the code in the
main repository.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17922
Differential Revision: D14437340
Pulled By: soumith
fbshipit-source-id: bd9fc98cf607d9b0bc28127fbbc8b04fa10eecbe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17925
There's no need for OpKernel to keep the cache creator around if we initialize cache on construction.
This basically means, kernel caches are now constructed when the kernel is looked up from the dispatcher, and not delayed to the first call anymore.
This gives us the benefit of cheaper calling because now kernel calling doesn't have to check if the cache is already initialized.
Also, this improves thread-safety. Now, OpKernel is thread-safe if the kernel is thread-safe.
Reviewed By: ezyang
Differential Revision: D14424907
fbshipit-source-id: a0d09a3a560dfe78aab53d558c9ebb91b57722df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17957
So developer knows what action should be taken when model contains nondeterministic node
Reviewed By: dzhulgakov
Differential Revision: D14435923
fbshipit-source-id: 12d930185852f78c54efc8e90c51aa7c7c7faab5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17947
Instead of having a gtest and a no-gtest file that you have to remember to register tests in, add a single registration point and use some macro magic to make it work for both gtest and non-gtest builds
Reviewed By: eellison
Differential Revision: D14431302
fbshipit-source-id: e1abac135992577a943eaa7abcc81a6ed31fa6e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17917
D14375995 introduced instantiation of the following templates with `bool` type (more specifically `To` is `int64_t`, `From` is `bool`):
```
template <typename To, typename From>
typename std::enable_if<std::is_integral<From>::value, bool>::type overflows(
From f) {
using limit = std::numeric_limits<typename scalar_value_type<To>::type>;
if (!limit::is_signed && std::numeric_limits<From>::is_signed) {
// allow for negative numbers to wrap using two's complement arithmetic.
// For example, with uint8, this allows for `a - b` to be treated as
// `a + 255 * b`.
return f > limit::max() ||
(f < 0 && -static_cast<uint64_t>(f) > limit::max());
} else {
return f < limit::lowest() || f > limit::max();
}
}
template <typename To, typename From>
typename std::enable_if<std::is_floating_point<From>::value, bool>::type
overflows(From f) {
using limit = std::numeric_limits<typename scalar_value_type<To>::type>;
if (limit::has_infinity && std::isinf(static_cast<double>(f))) {
return false;
}
if (!limit::has_quiet_NaN && (f != f)) {
return true;
}
return f < limit::lowest() || f > limit::max();
}
```
MSVC gives a C4804 warning, and because "treat warnings as errors" is on, the build fails on Windows. This disables that warning for those 2 templates.
Reviewed By: mingzhe09088
Differential Revision: D14421157
fbshipit-source-id: e72ba34406628c84da48518b32a46f851819bad1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17939
Instead of just asserting min <= 0 and max >= 0, we adjust the histogram to include 0 in the range.
We need to include 0 in the range during norm error minimization to correctly represent our quantization method that includes 0.
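The range adjustment can be sketched as follows (hypothetical helper name; the real code adjusts the histogram bins, not just the endpoints):

```python
def include_zero(lo, hi):
    # Widen the observed [lo, hi] range so that 0 is always inside it,
    # mirroring a quantization scheme that must encode 0 exactly.
    return min(lo, 0.0), max(hi, 0.0)
```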
Reviewed By: csummersea
Differential Revision: D14428732
fbshipit-source-id: 6669a9d2c7d409ec3b31aee0afe48071986b9b71
Summary:
This improves locality and affinity by keeping work on the same
threads preferentially to starting work on new ones, and reduces
contention on the threadpool lock more generally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17808
Differential Revision: D14391282
Pulled By: resistor
fbshipit-source-id: 3aec81656a50460a725aa4187c61864295d4f46e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17931
When converting from NetDef to IR and back, the prefix string should be removed so the operator types are preserved in caffe2.
Reviewed By: ZolotukhinM
Differential Revision: D14425954
fbshipit-source-id: 2807e7337b0f804f126970768b1250a4a8c5f35c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17618
Based on the code, we only add keys to `missing_keys` and `unexpected_keys` if `$strict` is `True`; the documentation is confusing on this point.
This diff also fixes one FLAKE8 warning.
Reviewed By: ailzhang
Differential Revision: D14280593
fbshipit-source-id: d368f5596bdf74ff62ee4d28d79120f5af91e0a3
Summary:
This PR removes dead code from THTensorMath.h
I found these unused methods while working on a PR where I plan to move fill and zero methods from TH/THC to ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17873
Differential Revision: D14407013
Pulled By: izdeby
fbshipit-source-id: a3551c5d91e7b380931a8b3bd4b3ae972d16911d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17807
Lint also detected a bug in test_linspace where we weren't
actually testing the CUDA case.
Differential Revision: D14388241
fbshipit-source-id: e219e46400f4952c6b384bca3baa0724ef94acde
Summary:
Stack:
⚫ **#17804 Eliminate the use of Type.** [💛](https://our.intern.facebook.com/intern/diff/D14382165/)
at::CPU produces Type object which is then casted into TensorOptions, instead directly using TensorOptions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17804
Differential Revision: D14407851
Pulled By: ezyang
fbshipit-source-id: 6462d698305b7c24382c1bfd440d3227bd28d9e4
Summary:
IIRC we decided to remove warning in code in #11568. This got reverted accidentally in #14123.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17921
Differential Revision: D14422811
Pulled By: ailzhang
fbshipit-source-id: 7067264bd1d3e3b7861d29e18ade2969ed705ca1
Summary:
This PR allows Scalars to be castable with `int()` and `float()`, allows scalars to match with float arguments, and prints out a better error message if `x.item()` is used as an int.
Scalars are a very uncommon case, and I don't think we want to add the maintenance burden of building out op coverage for it. It's more maintainable to better handle converting it to int/float.
Fix https://github.com/pytorch/pytorch/issues/17652
Also note: https://github.com/pytorch/pytorch/issues/16849
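The castability being added can be illustrated with a plain Python class (a hypothetical stand-in, not the JIT's actual Scalar type):

```python
class Scalar:
    # Toy Scalar: rather than implementing every op on Scalar itself,
    # make it convertible so users write int(s) / float(s) explicitly.
    def __init__(self, value):
        self._value = value

    def __int__(self):
        return int(self._value)

    def __float__(self):
        return float(self._value)
```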
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17875
Differential Revision: D14411138
Pulled By: eellison
fbshipit-source-id: a4e957cefb0ffd10ddb234d92f6d1558cfce8751
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17810
Partially addresses #12728. Also, switch the element_size bindings
to use the new function, rather than the method on Type.
We don't add Python bindings yet, as they need to be special
(they will be properties.)
Differential Revision: D14388790
fbshipit-source-id: 294183d0c8a59b0c13f2bf21d6f1cd557333e83b
Summary:
These changes add the following new Python bindings:
- Values have a 'type' property now that allows getting to the 'type' object
- Blocks have now inputs and outputs as well as returnNode and paramNode properties
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17822
Differential Revision: D14410123
Pulled By: ezyang
fbshipit-source-id: 64ef79f85a7a43b83e4b127b1d39efcaa64b74dc
Summary:
This PR causes kthvalue to be consistent with sort
(i.e. treat NaN as larger than any number), so that
`a.kthvalue(n) == a.sort()[n - 1]`.
One drawback is that median with a NaN argument does not return NaN,
which is a deviation from NumPy.
Thank you, ngimel, for raising this.
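A Python sketch of the intended semantics — a toy kthvalue over a list, not the actual kernel:

```python
import math

def kthvalue(values, k):
    # Treat NaN as larger than any number, matching sort semantics,
    # so that kthvalue(a, n) == sorted(a)[n - 1].
    ordered = sorted(values, key=lambda x: (math.isnan(x), x))
    return ordered[k - 1]
```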
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17824
Differential Revision: D14410092
Pulled By: ezyang
fbshipit-source-id: bdec2d8272dc4c65bcf2f9b8995e237774c44c02
Summary:
Fixes some minor grammatical mistakes in the doc of `loss.py`.
I think in the doc:
> Note that for some losses, there multiple elements per sample.
the "are" is lost between "there" and "multiple".
This mistake takes place in all the descriptions of parameter `size_average` and there are 17 of them.
It's minor but perfects the doc I think. 😁
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17892
Differential Revision: D14418177
Pulled By: ezyang
fbshipit-source-id: 412759f2f9b215819463bf8452ab0e0513218cd6
Summary:
This PR resolves two concurrent issues discovered when running the test in windows. Details about the windows test can be found here: https://github.com/pytorch/pytorch/issues/17609
The change covers two fixes:
1. update running_preloaders_ upfront before creating worker thread to prevent underflow.
2. add a lock when updating stop_ to prevent a deadlock on the condition variable cv_write_.
The fix has been tested on both Windows and Linux. With --gtest_repeat=1000, the tests run smoothly without issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17778
Differential Revision: D14404910
Pulled By: soumith
fbshipit-source-id: 2fbb8007e4b0bce4613e9a9fd31b8aace1bbfa8d
Summary:
Motivation:
- Earlier, `torch.btrifact` could not handle tensors with greater than 3 dimensions. This is because of the check:
> AT_CHECK(THTensor_(nDimension)(a) == 3, "expected 3D tensor, got size: ", a->sizes());
What is in this PR?:
- Move `btrifact` to ATen
- Remove relation to TH/THC.
- Handle tensors with more than three dimensions
- Tests
- Docs modifications: added a note about the non-pivoting variant.
[blocked due to old magma-cuda binaries]
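The usual trick for supporting arbitrary leading dimensions is to collapse them into a single batch dimension; a toy shape-level sketch (hypothetical helper, not the actual ATen code):

```python
def to_batched(shape):
    # Collapse all leading dims into one batch dim:
    # (*, m, n) -> (b, m, n), where b is the product of the leading dims.
    *batch, m, n = shape
    b = 1
    for d in batch:
        b *= d
    return b, m, n
```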
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14964
Differential Revision: D14405106
Pulled By: soumith
fbshipit-source-id: f051f5d6aaa45f85836a2867176c065733563184
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17628
This is not hooked up anywhere yet, just adding support.
This shares the same restrictions as the python frontend—namely, that the only exprs allowed right now are method defs.
Reviewed By: shannonzhu
Differential Revision: D14291654
fbshipit-source-id: 7798e5ff412a52ef8803c7bae8f439e50968a73a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17624
Just to make sure this path works
Reviewed By: shannonzhu
Differential Revision: D14288056
fbshipit-source-id: b719c0e90252b6821b1f9b22d3d98982985a6cb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17585
Create a sugared value that represents a class during initialization. This is
so that assignments to attributes correctly define attributes in __init__ but
raise an error elsewhere.
Reviewed By: shannonzhu
Differential Revision: D14263403
fbshipit-source-id: 09b2feeb272302f00a79c2a0302fbdf5483aed6a
Summary:
This is not used anywhere and wasn't cleaned up prior to 1.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17718
Reviewed By: janewangfb
Differential Revision: D14355154
Pulled By: pietern
fbshipit-source-id: f8ff3c8f50cd6365b369a5c5b85d72d8940df048
Summary:
Last batch of IR expect files removed. Includes some removal of expect files that are no longer used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17886
Differential Revision: D14414435
Pulled By: eellison
fbshipit-source-id: 0bfd7ce66ac2f72a57f15f45ebd60b95e80b6c16
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17859
This has been fixed by improvements in shape analysis.
Reviewed By: driazati
Differential Revision: D14402781
fbshipit-source-id: 4ef2722ffedd9c8ac1eff55c244b421d7d3715ed
Summary:
Currently the following code gives an error on python 2 because `ret` is a structseq which is not a tuple
```python
ret = a.max(dim=0)
ret1 = torch.max(a, dim=0, out=ret)
```
This PR modifies the tuple check in the Python arg parser to allow a structseq as input to operators where a tuple is expected, which makes the above code work.
Depend on: https://github.com/pytorch/pytorch/pull/17136
Partially fixes: https://github.com/pytorch/pytorch/issues/16813
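A relaxed check can be sketched in plain Python (hypothetical helper; the actual arg parser is C++, and the point is that tuple-like results that are not tuple subclasses — as torch's structseq returns were on Python 2 — should still be accepted):

```python
from collections.abc import Sequence

def accept_tuple_like(value):
    # Accept real tuples, and any non-string sequence that behaves
    # like one, instead of a strict isinstance(value, tuple) check.
    if isinstance(value, tuple):
        return tuple(value)
    if isinstance(value, Sequence) and not isinstance(value, (str, bytes)):
        return tuple(value)
    raise TypeError("expected a tuple-like value")
```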
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17208
Differential Revision: D14280198
Pulled By: VitalyFedyunin
fbshipit-source-id: beffebfd3951c4f5c7c8fe99a5847616a89491f3
Summary: Adding new documents to the PyTorch website to describe how PyTorch is governed, how to contribute to the project, and lists persons of interest.
Reviewed By: orionr
Differential Revision: D14394573
fbshipit-source-id: ad98b807850c51de0b741e3acbbc3c699e97b27f
Summary:
This PR removes dead code from THTensorMath.h
I found these unused methods while working on a PR where i plan to move **fill** and **zero** methods from TH/THC to Aten.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17769
Differential Revision: D14372732
Pulled By: izdeby
fbshipit-source-id: 94fd3b52c691ebc89d2bdc8905452e7498038bf5
Summary:
In the loss doc descriptions, replace the deprecated 'reduce' and 'size_average' parameters with the 'reduction' parameter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17300
Differential Revision: D14195789
Pulled By: soumith
fbshipit-source-id: 625e650ec20f13b2d22153a4a535656cf9c8f0eb
Summary:
Indices in Subset were previously stored as tensors;
they are now passed as a list in random_split to ensure integer indexing.
fixes: #17466
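The idea of the fix — keep the permutation as a plain Python list so that subset indexing yields ints, not 0-dim tensors — can be sketched with a toy helper (not the actual torch.utils.data code):

```python
import random

def random_split(seq, lengths, seed=0):
    # Permutation kept as a plain Python list, so each index is an int.
    perm = list(range(len(seq)))
    random.Random(seed).shuffle(perm)
    out, offset = [], 0
    for length in lengths:
        out.append([seq[i] for i in perm[offset:offset + length]])
        offset += length
    return out
```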
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17649
Differential Revision: D14400250
Pulled By: soumith
fbshipit-source-id: cd20a959f33773c4babf8e861ea37ec61c2713a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17311
I've extended our model metadata framework in this diff to support
traced modules as well. Re-used a lot of components from the previous
implementation of ScriptModule metadata.
Tracing is a little different from Scripting since you can't just create a
subclass of TopLevelTraceModule (type returned by torch.jit.trace) and attach
metadata the way we did for ScriptModule. As a result, I've introduced a
separate API torch.fb.jit_trace which returns an instance of
TracedModuleWithMetadata which is a subclass of TopLevelTracedModule. As a
result, we can now attach metadata to this instance.
Reviewed By: dzhulgakov
Differential Revision: D14117966
fbshipit-source-id: 3eee5eef733cb8d6a219c02e2f41d08698eca326
Summary:
1. Move ATen threadpool & open registration mechanism to C10
2. Move the `global_work_queue` to use this open registration mechanism, to allow users to substitute in their own
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17788
Reviewed By: zdevito
Differential Revision: D14379707
Pulled By: jamesr66a
fbshipit-source-id: 949662d0024875abf09907d97db927f160c54d45
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17813
We have a lot of manually written out dict() constructors,
and (1) I don't think use of curly brace syntax is much
of an improvement and (2) it seems like a waste of time to
fix them all.
Reviewed By: eellison
Differential Revision: D14390136
fbshipit-source-id: 6199bef4dea75b6079bcb9d9e8acf20a2e1a86e1
Summary:
CreateDB actually returns nullptr when db type is unknown and throws when the file is missing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17795
Reviewed By: ezyang
Differential Revision: D14383226
Pulled By: dzhulgakov
fbshipit-source-id: 1dcf75a6b4ba8b64a24d4e5daf02db3189d56b7b
Summary:
This causes the tracer to record the select / cast to int operation instead of just an int constant
Fixes #15319 but relies on a fix for #17583 first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17727
Differential Revision: D14377886
Pulled By: driazati
fbshipit-source-id: 59453def54ba72756303f723993844dbeb5d2f8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17742
This path isn't used anymore, and is incompatible with the changes stacked on top of this diff.
Removing it.
cc bwasti to check and confirm these can really be deleted
Reviewed By: ezyang
Differential Revision: D14362426
fbshipit-source-id: 32cdc19f28c2a981ae1e204901420998367ee588
Summary:
We used to have different ATen Tensor types, but we don't anymore. This was just being maintained by a codegen'ed comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17782
Reviewed By: ezyang
Differential Revision: D14378004
Pulled By: gchanan
fbshipit-source-id: 1bbf276393a391252d372cc385230c784bd78588
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17764
Original commit changeset: f1923fdca4a1
Reverting the int8 ops fixes the original runtime regression.
We'll ignore the memory regression since it is flaky, see D14228484
Reviewed By: dzhulgakov
Differential Revision: D13885233
fbshipit-source-id: ccbe4b94acb44b7b4cb3ae4d73e3f6091e1e1195
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17548
Expose half-float operators to OSS.
common/math/Float16.h is the original implementation;
it is substituted by caffe2/c10/util/Half.h.
From the comments, it seems that neither implementation handles denormals.
Reviewed By: jspark1105
Differential Revision: D14244200
fbshipit-source-id: f90ba28c5bf6a2b451b429cc4925b8cc376ac651
Summary:
1) The changes in the new opset won't affect the internal pipeline.
2) The CI won't be affected by the ONNX changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17736
Reviewed By: zrphercule
Differential Revision: D14358710
Pulled By: houseroad
fbshipit-source-id: 4ef15d2246b50f6875ee215ce37ecf92d555ca6a
Summary:
Similar to `nn.Parameter`s, this PR lets you store any `IValue` on a module as an attribute on a `ScriptModule` (only from the Python front-end currently). To mark something as an attribute, it should wrapped in `jit.Attribute(value, type)` (ex. `self.table = torch.jit.Attribute(table, Dict[str, torch.Tensor])`)
Followup Work:
* (de)serializing for use in C++
* change `self.training` to be a `bool` attribute instead of a buffer
* mutable attributes
* string frontend support
* documentation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17309
Differential Revision: D14354316
Pulled By: driazati
fbshipit-source-id: 67e08ab5229366b67fbc837e67b58831a4fb3318
Summary:
Currently, serialization of model parameters in ONNX export depends on the order in which they are stored in a container (`list` on Python side and `std::vector` on C++ side). This has worked fine till now, but if we need to do any pass on that graph that mutates the parameter list, then strictly order-based serialization may not work.
This PR is the first in a set to bring in more passes (such as constant folding) related to ONNX export. This PR lays the groundwork by moving the serialization in ONNX export from order-based to name based approach, which is more amenable to some of the passes.
houseroad - As discussed this change uses a map for export, and removes the code from `export.cpp` that relies on the order to compute initializer names.
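The idea can be sketched in plain Python (the names and values here are illustrative, not the actual export code): with name-based lookup, a pass that reorders or mutates the parameter container no longer corrupts which initializer goes with which name.

```python
# Illustrative parameter table; in the real exporter these are
# graph initializers keyed by parameter name.
params = {"fc.weight": [1.0, 2.0], "fc.bias": [0.5]}

# Order-based serialization depends on container order, so a pass
# that reorders or removes entries silently mispairs values.
order_based = list(params.values())

# Name-based serialization survives any reordering of the container.
def initializer_for(name, table=params):
    return table[name]

assert initializer_for("fc.bias") == [0.5]
```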
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17420
Differential Revision: D14361993
Pulled By: houseroad
fbshipit-source-id: da93e945d55755c126de06641f35df87d1648cc4
Summary:
Use flake8 installed with mypy checks so that our linter matches fbcode. Mypy type errors also provide valuable signal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17721
Differential Revision: D14357778
Pulled By: eellison
fbshipit-source-id: d8c9ea3fe3b5f550c3b70fe259e0eabf95e4c92d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17725
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17461
Implementing a standalone LSTM operator in Caffe2, adapted from the ATen implementation in diffusion/FBS/browse/master/fbcode/caffe2/aten/src/ATen/native/RNN.cpp. The trickiest part of this exercise was that caffe2::Tensor has no copy constructor, which made it necessary to implement a custom templated copy constructor for the different Tensor containers used in the code. There was also no easy way to use off-the-shelf C2 operators, so I had to copy some code that does basic matmul, cat, split, transpose and linear as utility functions.
Two things missing:
- Profiling this implementation against the current ONNXified LSTM op
- Make this operator available to use in PyTorch
Reviewed By: dzhulgakov
Differential Revision: D14351575
fbshipit-source-id: 3b99b53212cf593c7a49e45580b5a07b90809e64
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17729
When doing "import torch" in fbcode, previously the caffe2 cuda kernels weren't loaded because libcaffe2_gpu.so wasn't loaded.
Once you also did "from caffe2.python import workspace", then the cuda kernels were loaded because that triggered a runtime mechanism for loading libcaffe2_gpu.so.
We want the cuda kernels to always be available, so this diff adds a dependency from caffe2:libtorch_cuda to caffe2:caffe2_gpu.
Reviewed By: ezyang
Differential Revision: D14353498
fbshipit-source-id: 76a9fe69f231b308ab40eac393bb216c6fad3658
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17734
If an input is not BATCH, we will skip adjusting its batch size during onnxifi transformation. So when we take hints, we take it as CONSTANT, but later need to change it to BATCH if possible.
Reviewed By: jackm321
Differential Revision: D14355983
fbshipit-source-id: 63eb54a44afb1565c71486fdd73db07ca0ac4fd4
Summary:
xxtemp, colesbury, bhushan23, zou3519: convert GPU round behavior to half-to-even, consistent with the torch CPU version and numpy. Your feedback is welcome.
See #16498
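Round-half-to-even ("banker's rounding") is the behavior the GPU kernel is being aligned to; Python's built-in round() and numpy.round already follow this rule, so it can be illustrated without torch:

```python
# Ties (.5 cases) round toward the nearest even integer, avoiding the
# upward bias of always rounding .5 away from zero.
vals = [0.5, 1.5, 2.5, -0.5, -1.5]
assert [round(v) for v in vals] == [0, 2, 2, 0, -2]
```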
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17443
Differential Revision: D14261786
Pulled By: VitalyFedyunin
fbshipit-source-id: 98156436b545d72769831a89e2775d43ad913ebc
Summary:
Eventually we should remove these when we're certain that all our ops
handle memory overlaps correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17576
Differential Revision: D14349990
Pulled By: zou3519
fbshipit-source-id: c3a09f6113b9b1bf93e7f13c0b426c45b2cdf21f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17545
This diff avoids renaming boundary inputs of the net during onnxifi transform.
It also removes adding mappings for the initializer during onnxifi op creation,
and thus gets rid of the mapped workspace creation during onnxifi op creation.
Reviewed By: zrphercule
Differential Revision: D14243161
fbshipit-source-id: 6eafa920c45f6a6bfacbbb443e8e84cf9778644c
Summary:
Another batch of removing expect files.
One note - I removed the Batched expect files without adding equivalent tests since they are already being tested in other ways, and we are no longer actively maintaining that project.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17581
Differential Revision: D14343578
Pulled By: eellison
fbshipit-source-id: ce0b1fd2b5b4ec80ad9003bab1b58f41645d3da6
Summary:
- Summary:
Added synchronized batch normalization, allows synchronization of stats across mini-batches between processes within a process group.
Current implementation uses a mixture of extended ATen native functions (cpp cuda extension) + torch.nn.modules (c10d python API)
- User-facing api:
1. torch.nn.utils.convert_sync_batchnorm(modules, process_group=None)
2. torch.nn.SyncBatchNorm(num_features, eps=1e-5, momentum=0.1, affine=True, track_running_stats=True, ***process_group=None***)
- supported use case:
DistributedDataParallel with ***single-gpu multi-process***
a. User creates model containing `torch.nn.SyncBatchNorm` layers through one of the ways listed below:
1. use layers directly:
torch.nn.SyncBatchNorm(...)
similar API as with torch.nn.BatchNormXd(...)
with added argument `process_group` which is used to limit the scope of
synchronization within each process group. Default value is None, which
implies synchronization across all GPUs
2. use torch.nn.utils.convert_sync_batchnorm(modules, process_group)
recursively convert all `torch.nn.BatchNormXd` into `torch.nn.SyncBatchNorm`
preserving values of parameters/buffers.
the utility function also allows user to specify process_group value to all
converted layers.
b. user wraps their model with
`torch.nn.parallel.DistributedDataParallel`; from this point, the user
should follow the general guidelines in the DDP usage guide
- Error checking
For use cases not supported, we error out:
1. Application launched without ddp:
> import torch
> sbn = torch.nn.SyncBatchNorm(10).cuda()
> inp = torch.randn(5, 10, 3, 3).cuda()
> sbn(inp) --> Error!
> AttributeError: SyncBatchNorm is only supported within torch.nn.parallel.DistributedDataParallel
2. Application launched using DDP with multi-GPU per-process:
> ddp_module = nn.parallel.DistributedDataParallel(module, device_ids=device_ids, output_device=args.local_rank)
> ValueError: SyncBatchNorm is only supported for DDP with single GPU per process
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14267
Differential Revision: D14270035
Pulled By: ezyang
fbshipit-source-id: 4956d8fa565c32e9df5408d53719ff9f945f4d6d
Summary:
teng-li is passing the baton to mrshenli. Thanks for all your work on distributed teng-li!! 🎉
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17720
Differential Revision: D14350120
Pulled By: pietern
fbshipit-source-id: edfe784520c54630203cc8fbb296455d3dbf341b
Summary:
Observed the test `TestGroupConvolution.test_group_convolution` to fail with the following error:
```
Falsifying example: test_group_convolution(self=<caffe2.python.operator_test.group_conv_test.TestGroupConvolution testMethod=test_group_convolution>, stride=3, pad=0, kernel=5, size=8, group=4, input_channels_per_group=7, output_channels_per_group=8, batch_size=2, order='NHWC', engine='', use_bias=False, gc=, dc=[, device_type: 1])
You can reproduce this example by temporarily adding reproduce_failure('3.59.1', b'AAAA') as a decorator on your test case
```
This example generated by hypothesis has `group=2, order='NHWC' and dc=[, device_type: 1])`.
I think this example should be skipped.
I have mimicked the change corresponding to [PR#13554](https://github.com/pytorch/pytorch/pull/13554) to skip this example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17715
Differential Revision: D14346642
Pulled By: ezyang
fbshipit-source-id: b1f1fef09f625fdb43d31c7213854e61a96381ba
Summary:
Check for tuple matching in isSubvalueOf, since tuples may contain container types that need to be recursed into within isSubvalueOf
Fix for https://github.com/pytorch/pytorch/issues/17650
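A pure-Python sketch of the recursive check (not the actual C++ isSubvalueOf, just the shape of the fix): tuple types must be recursed into element-wise rather than compared shallowly.

```python
def is_subvalue_of(value, type_):
    # A tuple "type" is matched element-wise, recursing so that
    # nested container types inside the tuple are also checked.
    if isinstance(type_, tuple):
        return (isinstance(value, tuple)
                and len(value) == len(type_)
                and all(is_subvalue_of(v, t) for v, t in zip(value, type_)))
    return isinstance(value, type_)

assert is_subvalue_of((1, (2.0, "x")), (int, (float, str)))
assert not is_subvalue_of((1, 2), (int, str))
```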
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17687
Differential Revision: D14324642
Pulled By: eellison
fbshipit-source-id: 7f1e019875286b2640a3b9c003d1635dda8cf543
Summary:
In discussion with houseroad, because Upsample op is being updated in ONNX https://github.com/onnx/onnx/pull/1773 and these tests are blocking it. These tests will be updated once the ONNX PR goes in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17696
Differential Revision: D14338845
Pulled By: houseroad
fbshipit-source-id: cfaf8cf1ab578ae69dd3bf21b1c0681b572b9b6f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17623
Despite its generic-sounding name, caffe2::DeviceGuard actually
only worked on CUDA devices. Rename it to something that more
clearly spells out its applicability.
I'm not sure if it's the right call, but in this patch I added
'using CUDAGuard = c10::cuda::CUDAGuard', as this seems to be more
in-line with how the Caffe2 codebase is currently written. More
idiomatic c10 namespace style would be to say cuda::CUDAGuard.
Willing to change this if people shout.
This is a respin of D13156470 (#14284)
Reviewed By: dzhulgakov
Differential Revision: D14285504
fbshipit-source-id: 93b8ab938b064572b3b010c307e1261fde0fff3d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17461
Implementing a standalone LSTM operator in Caffe2, adapted from the ATen implementation in diffusion/FBS/browse/master/fbcode/caffe2/aten/src/ATen/native/RNN.cpp. The trickiest part of this exercise was that caffe2::Tensor has no copy constructor, which made it necessary to implement a custom templated copy constructor for the different Tensor containers used in the code. There was also no easy way to use off-the-shelf C2 operators, so I had to copy some code that does basic matmul, cat, split, transpose and linear as utility functions.
Two things missing:
- Profiling this implementation against the current ONNXified LSTM op
- Make this operator available to use in PyTorch
Reviewed By: dzhulgakov
Differential Revision: D14160172
fbshipit-source-id: c33e3f9e8aeae578b64d97593cb031a251216029
Summary:
hip-clang uses triple chevron kernel dispatch syntax. Add an option to the hipification script to skip translating triple chevron to hipLaunchKernelGGL.
Once we switch to hip-clang, this option will be default and subsequently removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17686
Differential Revision: D14327810
Pulled By: bddppq
fbshipit-source-id: 5e1512325077dd3ebb8fb9b5bf35fd1f8d9a4dc3
Summary:
```
NVIDIA changed the CUDA allocation behavior on Pascal GPUs. The
page size increased from 1MB to 2MB and allocations larger than 1MB
are now always page-aligned. Previously, allocations larger than 1MB
were aligned to 128KB boundaries.
This interacted poorly with the caching allocator. The remaining
memory in a page could only be filled by small cudaMalloc calls, but
the caching allocator never cudaMalloc's a chunk smaller than 1MB.
This behavior could also cause a large discrepancy between the memory
usage reported by nvidia-smi and the memory usage reported by
PyTorch, because nvidia-smi counts a partially used page as "full",
while PyTorch only counts the actual memory requested.
This PR makes a few changes to the caching allocator to better support
Pascal and Volta GPUs:
- All cudaMalloc calls are now multiples of 2MB (the page size)
- Requests between 1-10MB allocate (and split) a 20MB block to
reduce wasted space due to rounding
- Small requests are now packed into 2MB blocks (instead of 1MB)
This improves Mask R-CNN memory usage by 10-20% in internal tests on
Volta GPUs. Maxwell performance seems to be largely unchanged, but
it's possible that some use cases suffer slightly.
```
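The sizing policy described above can be sketched as follows (a simplification with assumed thresholds, not the actual caching-allocator code):

```python
MB = 1024 * 1024

def alloc_size(request):
    """Map a requested size (bytes) to the cudaMalloc size, per the
    policy sketched in the commit message above."""
    if request <= 1 * MB:
        return 2 * MB   # small requests are packed into 2 MB blocks
    if request <= 10 * MB:
        return 20 * MB  # allocate (and split) a 20 MB block
    # otherwise round up to a multiple of the 2 MB page size
    return ((request + 2 * MB - 1) // (2 * MB)) * (2 * MB)

assert alloc_size(512 * 1024) == 2 * MB   # small request
assert alloc_size(3 * MB) == 20 * MB      # mid-size request
assert alloc_size(11 * MB) == 12 * MB     # page-aligned large request
```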
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17120
Differential Revision: D14301536
Pulled By: colesbury
fbshipit-source-id: a8282315ea8f7b8ca149b5066fdeaecd0d404edf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17579
These methods previously just returned 0 when the operator was not a legacy operator,
making it impossible to convert some operators.
Reviewed By: dzhulgakov
Differential Revision: D14253094
fbshipit-source-id: 72bfdcf6da291a4ab80d1e0ceb20984b86edc408
Summary:
Fixes #17449
Context: before #17186, we did not fuse `clamp` when the `min`/`max` inputs were missing, because they were `prim::None` nodes. After #17186, None became a `prim::Constant` node, which enables fusion for `clamp`. But codegen.cpp did not handle the case where a `prim::Constant` is not a Double/Int/Bool. This PR handles missing inputs correctly, in the following way:
1. emit nothing when you see `type? = prim::Constant()`
2. when emitting the RHS, special-case aten::clamp
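A pure-Python sketch of the intended clamp semantics (illustrative only, not the fuser codegen): a missing bound emits nothing, and only the bounds that are present are applied.

```python
def clamp(x, min=None, max=None):
    """Clamp x to [min, max]; either bound may be absent (None)."""
    if min is not None and x < min:
        x = min
    if max is not None and x > max:
        x = max
    return x

assert clamp(5, min=0, max=3) == 3
assert clamp(-2, min=0) == 0     # max missing: only lower bound applies
assert clamp(7, max=10) == 7     # min missing: only upper bound applies
```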
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17533
Differential Revision: D14238450
Pulled By: wanchaol
fbshipit-source-id: 61a272154754b13e89021bb86002927f02cde19c
Summary:
1. Enabling int32 indexing for cases where TI cannot accumulate in output due to
incompatible data types (e.g. Welford).
2. Updating Welford kernel to use int32 instead of int64 indexing on GPU.
This change improves performance for torch.var / torch.std
Implementation:
1. Allocated extra buffer to handle accumulation between sub Tensor Iterators.
2. Removed int64 indexing in gpu_reduce_kernel
3. WelfordOps now supports the index type / combination type as a template parameter.
While the GPU uses int32_t and float, the CPU implementation uses int64_t and double.
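The kernel implements Welford's online algorithm for variance; a pure-Python sketch (the GPU/CPU difference above is essentially which integer type holds the running count):

```python
def welford_var(xs):
    """Single-pass sample variance via Welford's online update."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return m2 / (count - 1)  # unbiased, matching torch.var's default

assert abs(welford_var([1.0, 2.0, 3.0, 4.0]) - 5.0 / 3.0) < 1e-12
```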
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17428
Differential Revision: D14264608
Pulled By: umanwizard
fbshipit-source-id: 3eb54451de925b469dbc1127e5ea7443c4431036
Summary:
TH_Index_Base is hard coded to 0 and can be removed from the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17591
Differential Revision: D14269273
Pulled By: izdeby
fbshipit-source-id: d844e261f4af7297bad8a81e7d6dcf0a391b94e6
Summary:
Because of two separate python extensions with different pybind
instances I have to go through void* conversion. Since it's hidden from
user, it's fine.
New APIs added on C2 side:
- workspace.FetchTorch('blob')
- workspace.Workspace.current.blobs['blob'].to_torch()
- workspace.FeedBlob('blob', pytorch_tensor)
Works on CPU and GPU.
The only glitches are with resizing, because of the variable/tensor split,
but data sharing works properly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17190
Reviewed By: ezyang
Differential Revision: D14163882
Pulled By: dzhulgakov
fbshipit-source-id: d18e5b8fcae026f393c842a1149e972515732de2
Summary:
Hi, there.
There is a typo in aten/src/ATen/native_parse.py, and I fix it.
`std::aray` -> `std::array`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17641
Differential Revision: D14301981
Pulled By: ezyang
fbshipit-source-id: a37859cdedcbf6c29333b954486dfa086d6c2176
Summary:
Create a `make_variable` override that moves out of a tensor instead of going through `shallow_copy_and_detach`. Call this override from factory methods like `empty` that create a brand new tensor, do nothing with it, and then copy it into a variable.
Will update this with actual numbers, but it seems to get rid of around 20-40% of the overhead of calling `torch.empty(0)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17565
Differential Revision: D14266130
Pulled By: umanwizard
fbshipit-source-id: f57d5f2ca3f80ee8ee96d50f905e852fd10db941
Summary:
Currently, the fake tqdm implementation requires an input (whereas real tqdm does not).
This caused a problem in torchvision (https://github.com/pytorch/vision/pull/770), and seems likely to cause minor irritations elsewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17636
Differential Revision: D14296530
Pulled By: ezyang
fbshipit-source-id: bc077d898773c93dab34c985a7b30525a43e558a
Summary:
Various functions aren't used by the JIT, so they're jit-compliant w.r.t. their schema by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17631
Differential Revision: D14295559
Pulled By: cpuhrsch
fbshipit-source-id: a2ecdcb5df47eb67c54ec642d88d42e985515142
Summary:
The CPU version is based on the TH version.
The GPU version is based on #8406 by Pararth Shah (thank you).
CPU quickselect based on that in TH's THTensorMoreMath.cpp, but with C++ (quickselectnoindex will be achieved by a different swap)
CPU kthvalue is based on the THTensor function in the same file.
The dim_apply function is a C++ replacement for TH_TENSOR_DIM_APPLYx macros.
The CUDA kernel uses functions adapted from the THCTensorSortK implementation.
In particular radixSelect is from THCTensorTopK.cuh.
The CUDA launcher code replaces a bunch of macros with C++. It will be re-used in one of the following patches.
Plan for further PRs:
- This
- Sort
- TopK + Mode + Median in any order
- Rip out THC stuff.
There may be utility functions / structs in the SortingCommon.cuh that come into
relevance only with sort.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17544
Differential Revision: D14286934
Pulled By: ezyang
fbshipit-source-id: 35dbea050b097e88777ac5fa5c0f499d5e23c738
Summary:
Fixing MSVC errors
```
D:\pytorch-scripts\caffe2_builders\v141\pytorch\aten\src\THC/THCReduce.cuh(144): error C4002: too many actual paramet
ers for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caffe2_gpu.vcxp
roj]
D:\pytorch-scripts\caffe2_builders\v141\pytorch\aten\src\THC/THCReduce.cuh(259): error C4002: too many actual paramet
ers for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caffe2_gpu.vcxp
roj]
D:/pytorch-scripts/caffe2_builders/v141/pytorch/aten/src/THCUNN/SpatialDilatedMaxPooling.cu(51): error C4002: too man
y actual parameters for macro 'C10_LAUNCH_BOUNDS_1' [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2
\caffe2_gpu.vcxproj]
```
on variadic C10_LAUNCH_BOUNDS as well as Debug linking issues with at::Half in pool_op_cudnn.cc like this one
```
pool_op_cudnn.obj : error LNK2019: unresolved external symbol "public: bool __cdecl caffe2::MaxPoolFunctor<class caff
e2::CUDAContext>::GlobalPoolingBackward<struct c10::Half,2>(int,int,int,struct c10::Half const *,struct c10::Half const
,struct c10::Half const ,struct c10::Half ,class caffe2::CUDAContext )const " (??$GlobalPoolingBackward@UHalf@c10@
@$01@?$MaxPoolFunctor@VCUDAContext@caffe2@@caffe2@QEBA_NHHHPEBUHalf@c10@00PEAU23@PEAVCUDAContext@1@Z) referenced in
function "public: bool __cdecl caffe2::`anonymous namespace'::CuDNNMaxPoolFunctor::GlobalPoolingBackward<struct c10::H
alf,2>(int,int,int,struct c10::Half const ,struct c10::Half const ,struct c10::Half const ,struct c10::Half ,class
caffe2::CUDAContext *)const " (??$GlobalPoolingBackward@UHalf@c10@@$01@CuDNNMaxPoolFunctor@?A0xb936404a@caffe2@QEBA_NH
HHPEBUHalf@c10@00PEAU34@PEAVCUDAContext@2@Z) [D:\pytorch-scripts\caffe2_builders\v141\pytorch\build\Debug\caffe2\caff
e2_gpu.vcxproj]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17201
Differential Revision: D14165732
Pulled By: ezyang
fbshipit-source-id: 875fd9a5b2db6f83fc483f6d750d2c011260eb8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17619
--filler hive --iter -1 will let the debugger exhaust all batches from a hive partition before exiting.
Add a README that summarizes command line options and usage.
Reviewed By: yinghai
Differential Revision: D14220166
fbshipit-source-id: daa23b7e8a9184481c6d7b67acf1599e5c99d74a
Summary:
Sparse Linear in TH(CU)NN implements sparse linear layers without
using sparse matrices.
It is currently not documented in PyTorch and there is no functional or
module interface. This means it is unused from a PyTorch point of view.
The reason for removing it is twofold:
- The module uses sort, which I would like to move to ATen.
- When we implement a SparseLinear layer, we would want to do it
using sparse tensors, so it's not all that useful, anyway.
I checked this on slack with soumith, I hope the above is an accurate
representation. All bad ideas are my own.
This is part of the ongoing work to move
sort/topk/mode/median/kthvalue to ATen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17610
Differential Revision: D14280663
Pulled By: gchanan
fbshipit-source-id: 289231d2c20626855ce2ceecd4f204b460c32378
Summary:
They were previously merged to resolve #17051. However, since it was resolved upstream, and they were causing some issues like https://github.com/abjer/tsds/issues/8, I think it's time to revert these changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17567
Differential Revision: D14265241
Pulled By: kostmo
fbshipit-source-id: 7fa2b7dd4ebc5148681acb439cf82d983898694e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17528
as title. register_prim_ops is messy because someone ruined clang-format, but I figured it's okay to include here since this is such a mechanical change
Reviewed By: driazati
Differential Revision: D14236943
fbshipit-source-id: c2b22845837b7f830015510e48ec2ee5202fa407
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17594
The original version of this broke things because a concurrent change raced with it in CI.
Reviewed By: ezyang
Differential Revision: D14266663
fbshipit-source-id: e8ac5dfcb7349b4f2c425d9f0eabbfc964314063
Summary:
It would be better to split the CPU job on CI, but unfortunately we are out of Windows machines.
cc, davidbrownellWork yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17608
Differential Revision: D14281393
Pulled By: soumith
fbshipit-source-id: ae9a6140b7207ce56cfb2da3d812bc3fe060764a
Summary:
- Test updates
1. test_torch: added 0-d test case and t_() test cases
2. test_jit : updated error message for TestAsync.test_async_script_error
- Updating documentation for torch.t()
Adding information regarding new support for 0-D and 1-D tensors
Fixes #17520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17535
Differential Revision: D14269984
Pulled By: gchanan
fbshipit-source-id: 38b723f31484be939261c88edb33575d242eca65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17549
Currently Dropout is only enabled during training; we add the option of enabling dropout in eval mode.
This is to follow [1]. This functionality would be used for uncertainty estimation in exploration project.
[1] Gal, Yarin, and Zoubin Ghahramani. "Dropout as a bayesian approximation: Representing model uncertainty in deep learning." international conference on machine learning. 2016.
Reviewed By: Wakeupbuddy
Differential Revision: D14216216
fbshipit-source-id: 87c8c9cc522a82df467b685805f0775c86923d8b
Summary:
The max pooling backwards kernel is currently annotated with launch bounds (256,8).
Adjust the number of waves to 4 (4 times 64 is 256) for ROCm. This improves training performance for torchvision models by up to 15% (AlexNet) on a gfx906 GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17555
Differential Revision: D14277744
Pulled By: bddppq
fbshipit-source-id: 2a62088f7b8a87d1e350c432bf655288967c7883
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17561
The push at the top of the file was missing a corresponding pop
Reviewed By: ezyang
Differential Revision: D14254500
fbshipit-source-id: ff20359b563d6d6dcc68273dc754ab31aa8fad12
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17522
Dispatch is still based on the first tensor arg, but that first "tensor arg" is now allowed to be a tensor list.
That is, the first argument that is either Tensor or TensorList will be the deciding factor for dispatch.
If it is a TensorList, then that TensorList must not be empty or dispatch will fail.
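The rule can be sketched in plain Python (illustrative stand-ins, not the actual dispatcher):

```python
class Tensor:
    """Stand-in carrying only the dispatch-relevant device."""
    def __init__(self, device):
        self.device = device

def dispatch_device(args):
    # The first Tensor or TensorList argument decides dispatch; an
    # empty TensorList in that position is an error.
    for a in args:
        if isinstance(a, Tensor):
            return a.device
        if isinstance(a, list):  # TensorList
            if not a:
                raise RuntimeError("cannot dispatch on an empty TensorList")
            return a[0].device
    raise RuntimeError("no Tensor or TensorList argument")

assert dispatch_device([3, [Tensor("cuda")], Tensor("cpu")]) == "cuda"
```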
Reviewed By: ezyang
Differential Revision: D14235840
fbshipit-source-id: 266c18912d56ce77aa84306c5605c4191f3d882b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17491
Before, there was no way to expose a caffe2 operator that had a variable number of inputs.
Now, this is allowed by giving the operator one tensor list input.
Note that the tensor list must be the first input, and that any other tensor inputs will be ignored and inaccessible in this case.
Reviewed By: ezyang
Differential Revision: D14220705
fbshipit-source-id: 7f921bfb581caf46b229888c409bbcc40f7dda80
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17088
clangr codemod
also manually moved the constructor of a class from the .cpp file to the .h file.
Reviewed By: ezyang
Differential Revision: D14078531
fbshipit-source-id: 2adb4ac0ce523742da6cce3bc3b6c177b816c299
Summary:
HIPGuard interfaces that interacted with HIPStream were previously
totally busted (because the streams had the wrong device type).
This fixes it, following along the same lines as MasqueradingAsCUDA.
Along the way I beefed up the explanatory comment.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc jithunnair-amd iotamudelta bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17469
Differential Revision: D14243396
Pulled By: ezyang
fbshipit-source-id: 972455753a62f8584ba9ab194f9c785db7bb9bde
Summary:
As discussed here #16952, this PR aims at improving the __repr__ for distribution when the provided parameters are torch.Tensor with only one element.
Currently, __repr__() relies on dim() == 0, leading to the following behaviour:
```
>>> torch.distributions.Normal(torch.tensor([1.0]), torch.tensor([0.1]))
Normal(loc: torch.Size([1]), scale: torch.Size([1]))
```
With this PR, the output looks like the following:
```
>>> torch.distributions.Normal(torch.tensor([1.0]), torch.tensor([0.1]))
Normal(loc: 1.0, scale: 0.10000000149011612)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17503
Differential Revision: D14245439
Pulled By: soumith
fbshipit-source-id: a440998905fd60cf2ac9a94f75706021dd9ce5bf
Summary:
See comment inside of code. This fixes a bug where sometimes we would try to avoid printing long lines but would inadvertently reorder the expressions, which can change the semantics of the program
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17557
Differential Revision: D14250608
Pulled By: zdevito
fbshipit-source-id: d44996af4e90fe9ab9508d13cd04adbfc7bb5d1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17511
AliasTracker was doing bookkeeping for three concepts: the points-to graph,
writes, and wildcards.
This PR makes AliasTracker's job clearer: it keeps track of the points-to
graph. Thus it has been renamed MemoryDAG. Write and wildcard information were
pulled back into AliasDb as part of this—I may decide to pull them into their
own little modules since I don't want the alias analysis stuff to get too
bloated.
This refactor is necessary because we want to start tracking information for
aliasing elements that _aren't_ first-class IR Values (e.g. the "stuff" inside
a list). So MemoryDAG can't know too much about Values
Reviewed By: houseroad
Differential Revision: D14231251
fbshipit-source-id: 6cd98ae6fced8d6c1522c2454da77c3c1b2b0504
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17480
This was always part of our "spec" but not implemented
Reviewed By: houseroad
Differential Revision: D14214301
fbshipit-source-id: 118db320b43ec099dc3e730c67d39487474c23ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17478
Enable onnxifi_ext in glow and build an e2e test in caffe2.
Reviewed By: yinghai
Differential Revision: D14190136
fbshipit-source-id: 26245278b487b551623109b14432f675279b17b5
Summary:
+ All quotes for ENV VARS are erroneous;
+ The toolset hasn't been specified;
+ Provide paths for all 3 Visual Studio 2017 products: Community/Professional/Enterprise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17475
Differential Revision: D14262968
Pulled By: soumith
fbshipit-source-id: c0504e0a6be9c697ead83b06b0c5cf569b5c8625
Summary:
The generated comments are wrong in the generated files below:
```bash
./torch/csrc/autograd/generated/VariableType_0.cpp:3:// generated from tools/autograd/templates/VariableType_0.cpp
./torch/csrc/autograd/generated/VariableType_1.cpp:3:// generated from tools/autograd/templates/VariableType_1.cpp
./torch/csrc/autograd/generated/VariableType_2.cpp:3:// generated from tools/autograd/templates/VariableType_2.cpp
./torch/csrc/autograd/generated/VariableType_3.cpp:3:// generated from tools/autograd/templates/VariableType_3.cpp
./torch/csrc/autograd/generated/VariableType_4.cpp:3:// generated from tools/autograd/templates/VariableType_4.cpp
./torch/csrc/autograd/generated/VariableTypeEverything.cpp:3:// generated from tools/autograd/templates/VariableTypeEverything.cpp
./torch/csrc/jit/generated/register_aten_ops_0.cpp:23:// generated from tools/autograd/templates/register_aten_ops_0.cpp
./torch/csrc/jit/generated/register_aten_ops_1.cpp:23:// generated from tools/autograd/templates/register_aten_ops_1.cpp
./torch/csrc/jit/generated/register_aten_ops_2.cpp:23:// generated from tools/autograd/templates/register_aten_ops_2.cpp
```
These generated files were split to speed up compilation; however, the template files were not.
After this fix, the comments will look like below:
```bash
./torch/csrc/autograd/generated/VariableType_0.cpp:3:// generated from tools/autograd/templates/VariableType.cpp
./torch/csrc/autograd/generated/VariableType_1.cpp:3:// generated from tools/autograd/templates/VariableType.cpp
......
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17563
Differential Revision: D14260992
Pulled By: soumith
fbshipit-source-id: 038181367fa43bee87837e4170704ddff7f4d6f2
Summary:
`resize_` and `resize_as_` resize the input tensor. Because our shape analysis
is flow-invariant, we don't do shape analysis on any op that relies on a Tensor that can alias a resized Tensor.
E.g., in the following graph, by the time `x += 10` runs, `x` may have been resized:
```
@torch.jit.script
def test(x, y):
    for i in range(10):
        x += 10
        x.resize_as_(y)
    return x
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17518
Differential Revision: D14249835
Pulled By: eellison
fbshipit-source-id: f281b468ccb8c29eeb0f68ca5458cc7246a166d9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17481
Usually, feature macros are either defined or undefined and checked accordingly.
C10_MOBILE was a weird special case that was always defined, but defined to either 1 or 0.
This caused a lot of confusion for me when trying to disable something from the mobile build: it also disabled it
in the server build (because I was using `#ifdef`). Also, I found a place in the existing code base that made
that wrong assumption and used the macro incorrectly, see https://fburl.com/y4icohts
Reviewed By: dzhulgakov
Differential Revision: D14214825
fbshipit-source-id: f3a155b6d43d334e8839e2b2e3c40ed2c773eab6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17078
This prevents caffe2 operators from being exposed to c10 on mobile,
which in turn causes the whole c10 dispatcher to be stripped away
and saves binary size.
We probably want to re-enable the c10 dispatcher for mobile,
but for now this is ok.
Reviewed By: ezyang
Differential Revision: D14077972
fbshipit-source-id: e4dd3e3b60cdfbde91fe0d24102c1d9708d3e5c4
Summary:
And adding timestamps to linux build jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17516
Differential Revision: D14244533
Pulled By: pjh5
fbshipit-source-id: 26c38f59e0284c99f987d69ce6a2c2af9116c3c2
Summary:
This PR allows `gather` to optionally return sparse gradients, as requested in #16329. It also allows to autograd engine to accumulate sparse gradients in place when it is safe to do so.
I've commented out the size.size() check in `SparseTensor.cpp` that also caused #17152; it does not seem to me that the check serves a useful purpose, but please correct me if I'm wrong and a better fix is required.
Motivating example:
For this commonly used label smoothing loss function
```
def label_smoothing_opt(x, target):
    padding_idx = 0
    smoothing = 0.1
    logprobs = torch.nn.functional.log_softmax(x, dim=-1, dtype=torch.float32)
    pad_mask = (target == padding_idx)
    ll_loss = logprobs.gather(dim=-1, index=target.unsqueeze(1), sparse_grad=True).squeeze(1)
    smooth_loss = logprobs.mean(dim=-1)
    loss = (smoothing - 1.0) * ll_loss - smoothing * smooth_loss
    loss.masked_fill_(pad_mask, 0)
    return loss.sum()
```
backward goes from 12.6 ms with dense gather gradients to 7.3 ms with sparse gradients, for 9K tokens x 30K vocab, which is a low-single-digit percent end-to-end improvement, and also an improvement in peak memory required.
Shout-out to core devs: adding python-exposed functions with keyword arguments through native_functions.yaml is very easy now!
cc gchanan apaszke
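A minimal sketch of the sparse-gradient path this PR enables; it assumes the keyword ended up as `sparse_grad`, as in current PyTorch releases (the exact spelling at merge time may have differed):

```python
import torch

# Request a sparse gradient from gather; the autograd engine then
# accumulates the leaf's gradient as a sparse tensor.
x = torch.randn(4, 5, requires_grad=True)
idx = torch.tensor([[0], [2], [1], [4]])
out = torch.gather(x, 1, idx, sparse_grad=True)
out.sum().backward()
assert x.grad.is_sparse  # gradient comes back sparse, not dense
```

For a large vocabulary, this avoids materializing a mostly-zero dense gradient of the full `(tokens, vocab)` shape.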
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17182
Differential Revision: D14158431
Pulled By: gchanan
fbshipit-source-id: c8b654611534198025daaf7a634482b3151fbade
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16746
As titled. We use a special URL scheme, elasticzeus, for elastic Zeus so that we don't need to change the public interface of init_process_group.
Reviewed By: aazzolini, soumith
Differential Revision: D13948151
fbshipit-source-id: 88939dcfa0ad93467dabedad6905ec32e6ec60e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17456
Using an instruction sequence similar to the function in fbgemm/src/QuantUtilAvx2.cc.
Added elementwise_sum_benchmark.
Reviewed By: protonu
Differential Revision: D14205695
fbshipit-source-id: 84939c9d3551f123deec3baf7086c8d31fbc873e
Summary:
For some additional context on this change, please see this [PR](https://github.com/pytorch/pytorch/pull/17376).
As part of the work on Bool Tensor, we will need to add support for a bool type to the _fill() and _zero() methods that are currently located in THTensorMath. As we don't need anything else and those methods are not really math related, we are moving them out into a separate THTensorFill for simplicity.
Changes:
- moved _fill() and _zero() from THTensorMath.h to THTensorFill
- enabled _fill() and _zero() for the HALF type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17536
Differential Revision: D14242130
Pulled By: izdeby
fbshipit-source-id: 1d8bd806f0f5510723b9299d360b70cc4ab96afb
Summary:
Causing a problem with spectral norm, although SN won't use that anymore after #13350 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13352
Differential Revision: D14209562
Pulled By: ezyang
fbshipit-source-id: f5e3183e1e7050ac5a66d203de6f8cf56e775134
Summary:
As of MIOpen 1.7.1, as shipped in ROCm 2.1, this works correctly, so we can use MIOpen and do not need to fall back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17472
Differential Revision: D14210323
Pulled By: ezyang
fbshipit-source-id: 4c08d0d4623e732eda304fe04cb722c835ec70e4
Summary:
This only deals with four functions, but is an important first step towards removing BoolTensor and IndexTensor entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17193
Differential Revision: D14157829
Pulled By: cpuhrsch
fbshipit-source-id: a36f16d1d88171036c44cc7de60ac9dfed9d14f2
Summary:
PyTorch's tensor.t() is now equivalent to NumPy's ndarray.T for 1-D tensors,
i.e. tensor.t() == tensor
Test case added:
- test_t
fixes #9687
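A quick sketch of the new 1-D behavior (2-D transpose is unchanged):

```python
import torch

v = torch.arange(3.0)          # 1-D tensor
assert torch.equal(v.t(), v)   # t() is a no-op for 1-D, like numpy's ndarray.T

m = torch.zeros(2, 3)
assert m.t().shape == (3, 2)   # 2-D behavior is unaffected
```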
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17462
Differential Revision: D14214838
Pulled By: soumith
fbshipit-source-id: c5df1ecc8837be22478e3a82ce4854ccabb35765
Summary:
This code is a bit intricate, so I refactored it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16995
Differential Revision: D14050667
Pulled By: ifedan
fbshipit-source-id: 55452339c6518166f3d4bc9898b1fe2f28601dc4
Summary:
First pass at user defined types. The following is contained in this PR:
- `UserType` type, which contains a reference to a module with all methods for the type, and a separate namespace for data attributes (map of name -> TypePtr).
- `UserTypeRegistry`, similar to the operator registry
- `UserObject` which is the runtime representation of the user type (just a map of names -> IValues)
- `UserTypeValue` SugaredValue, to manage getattr and setattr while generating IR, plus compiler.cpp changes to make that work.
- Frontend changes to get `torch.jit.script` to work as a class decorator
- `ClassDef` node in our AST.
- primitive ops for object creation, setattr, and getattr, plus alias analysis changes to make mutation safe.
Things that definitely need to get done:
- Import/export, python_print support
- String frontend doesn't understand class definitions yet
- Python interop (using a user-defined type outside TorchScript) is completely broken
- Static methods (without `self`) don't work
Things that are nice but not essential:
- Method definition shouldn't matter (right now you can only reference a method that's already been defined)
- Class definitions can only contain defs, no other expressions are supported.
Things I definitely won't do initially:
- Polymorphism/inheritance
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17314
Differential Revision: D14194065
Pulled By: suo
fbshipit-source-id: c5434afdb9b39f84b7c85a9fdc2891f8250b5025
Summary:
Not sure the best way to integrate this…I wrote something that focuses on mutability "vertically" through the stack. Should I split it up and distribute it into the various sections, or keep it all together?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17454
Differential Revision: D14222883
Pulled By: suo
fbshipit-source-id: 3c83f6d53bba9186c32ee443aa9c32901a0951c0
Summary:
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17476
Differential Revision: D14218312
Pulled By: suo
fbshipit-source-id: 64df096a3431a6f25cd2373f0959d415591fed15
Summary:
Temporarily disable them for perf consideration. Will figure out a way to do `torch.zeros(sizes, grad.options())` in torchscript before enabling these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17470
Differential Revision: D14210313
Pulled By: ailzhang
fbshipit-source-id: efaf44df1192ae42f4fe75998ff0073234bb4204
Summary: Update the docs to include the value parameter that was missing in the `scatter_` function.
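A hedged sketch of the scalar `value` overload the docs now cover, i.e. `scatter_(dim, index, value)` with a Python number instead of a source tensor:

```python
import torch

x = torch.zeros(2, 4)
index = torch.tensor([[2], [3]])
# Scalar overload: writes the number `value` at the indexed positions
# along `dim`, instead of gathering from a source tensor.
x.scatter_(1, index, 0.5)
assert x[0, 2].item() == 0.5 and x[1, 3].item() == 0.5
```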
Differential Revision: D14209225
Pulled By: soumith
fbshipit-source-id: 5c65e4d8fbd93fcd11a0a47605bce6d57570f248
Summary:
Add check and provide useful warning/error information to user if foxi is not checked out.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17477
Reviewed By: zrphercule
Differential Revision: D14212896
Pulled By: houseroad
fbshipit-source-id: 557247d5d8fdc016b1c24c2a21503e59f874ad09
Summary:
Stack:
⚫ **#17453 [jit] simplify aliasdb interface** [💛](https://our.intern.facebook.com/intern/diff/D14205209/)
The previous "getWrites" API relies on the user to do alias checking, which is confusing and inconsistent with the rest of the interface. So replace it with a higher-level call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17453
Differential Revision: D14209942
Pulled By: suo
fbshipit-source-id: d4aff2af6062ab8465ee006fc6dc603296bcb7ab
Summary:
Previously we were unifying the types of lists across if block outputs. This now fails with Optional subtyping because two types which can be unified have different runtime representations.
```
@torch.jit.script
def list_optional_fails(x):
    # type: (bool) -> Optional[int]
    if x:
        y = [1]
    else:
        y = [None]
    return y[0]
```
The indexing op will expect `y` to be a generic list, but it will find an int list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17424
Differential Revision: D14210903
Pulled By: eellison
fbshipit-source-id: 4b8b26ba2e7e5bebf617e40316475f91e9109cc2
Summary:
When switching back to `d0` from a stream on a different device `d1`, we need to restore the current streams on both `d0` and `d1`. The current implementation only does that for `d0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17439
Differential Revision: D14208919
Pulled By: mrshenli
fbshipit-source-id: 89f2565b9977206256efbec42adbd789329ccad8
Summary:
I originally set out to fix to_sparse for scalars, which had an overly restrictive check (sparse_dim > 0, which is impossible for a scalar).
This fix uncovered an issue with nonzero: it didn't properly return a size (z, 0) tensor for an input scalar, where z is the number of nonzero elements (i.e., 0 or 1).
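A sketch of the fixed behavior, assuming current PyTorch semantics for 0-dim inputs:

```python
import torch

s = torch.tensor(2.0)                              # 0-dim "scalar" tensor
assert s.nonzero().shape == (1, 0)                 # z = 1 nonzero element, 0 dims
assert torch.tensor(0.0).nonzero().shape == (0, 0) # z = 0 for a zero scalar
assert s.to_sparse().to_dense().item() == 2.0      # to_sparse accepts scalars
```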
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17406
Differential Revision: D14185393
Pulled By: gchanan
fbshipit-source-id: f37a6e1e3773fd9cbf69eeca7fdebb3caa192a19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17308
In some cases there is still no RVO/NRVO and std::move is still needed. Latest
Clang gained -Wreturn-std-move warning to detect cases like this (see
https://reviews.llvm.org/D43322).
Reviewed By: igorsugak
Differential Revision: D14150915
fbshipit-source-id: 0df158f0b2874f1e16f45ba9cf91c56e9cb25066
Summary:
as title. These were already added to the tutorials, but I didn't add them to the cpp docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17452
Differential Revision: D14206501
Pulled By: suo
fbshipit-source-id: 89b5c8aaac22d05381bc4a7ab60d0bb35e43f6f5
Summary:
" ProTip! Great commit summaries contain fewer than 50 characters. Place extra information in the extended description."
lol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17450
Differential Revision: D14206500
Pulled By: suo
fbshipit-source-id: af7ffe299f8c8f04fa8e720847a1f6d576ebafc1
Summary:
The chunk buffer could hang when no data is read and the buffer size is lower than the chunk size. We detected this while running with a larger dataset, hence the fix. I added a test to mimic the situation and validated that the fix works. Thank you, Xueyun, for finding this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17409
Differential Revision: D14198546
Pulled By: soumith
fbshipit-source-id: b8ca43b0400deaae2ebb6601fdc65b47f32b0554
Summary:
The CI is broken right now; this diff should fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17430
Differential Revision: D14198045
Pulled By: houseroad
fbshipit-source-id: a1c8cb5ccff66f32488702bf72997f634360eb5b
Summary:
This involves another purely cosmetic (ordering) change to the `config.yml` to facilitate simpler logic.
Other changes:
* add some review feedback as comments
* exit with nonzero status on config.yml mismatch
* produce a diagram for pytorch builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17427
Differential Revision: D14197618
Pulled By: kostmo
fbshipit-source-id: 267439d3aa4c0a80801adcde2fa714268865900e
Summary:
Previously we only generate one class for each extension backend. This caused issues with scalarType() calls and mapping from variable Types to non-variable types. With this change we generate one Type for each scalar type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17278
Reviewed By: ezyang
Differential Revision: D14161489
Pulled By: li-roy
fbshipit-source-id: 91e6a8f73d19a45946c43153ea1d7bc9d8fb2409
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17384
Better handling of possible net run errors in prof_dag counters.
Reviewed By: yinghai
Differential Revision: D14177619
fbshipit-source-id: 51bc952c684c53136ce97e22281b1af5706f871e
Summary:
Batch of removing expect files, and some tests that no longer test anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17414
Differential Revision: D14196342
Pulled By: eellison
fbshipit-source-id: 75c45649d1dd1ce39958fb02f5b7a2622c1d1d01
Summary:
This will evolve into complete technical docs for the JIT. Posting what I have so far so people can start reading it and offering suggestions. Go to Files Changed and click 'View File' to see the markdown formatted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16887
Differential Revision: D14191219
Pulled By: zdevito
fbshipit-source-id: 071a0e7db05e4f2eb657fbb99bcd903e4f46d84a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17375
Previously we created the onnxGraph first and took it to the onnx manager for registration. That doesn't work well in practice. This diff takes a "bring your own constructor" approach to reduce the resources spent doing backend compilation.
Reviewed By: kimishpatel, rdzhabarov
Differential Revision: D14173793
fbshipit-source-id: cbc4fe99fc522f017466b2fce88ffc67ae6757cf
Summary:
The benchmarks are now running on gpu cards with more memory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17416
Differential Revision: D14190493
Pulled By: bddppq
fbshipit-source-id: 66db1ca1fa693d24c24b9bc0185a6dd8a3337103
Summary:
This PR removes a few sizes of `self` that were passed from the forward pass to the backward pass when `self` is already required in the backward pass. This could be the reason for the potential slowdown in #16689. I will attach a few perf numbers (still a bit volatile among runs, though) that I got in the comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17187
Differential Revision: D14179512
Pulled By: ailzhang
fbshipit-source-id: 5f3b1f6f26a3fef6dec15623b940380cc13656fa
Summary:
This fell through the cracks during the migration from pytorch/builder to CircleCI. It's technically still racy, but much less likely to race now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17381
Differential Revision: D14190137
Pulled By: pjh5
fbshipit-source-id: 2d4cd04ee874cacce47d1d50b87a054b0503bb82
Summary:
Creates a new shared type parser to be shared between the IR parser and the schema parser.
Also adds parsing of CompleteTensorType and DimensionedTensorType, and feature-gates that for the IRParser.
Renames the existing type parser for Python annotations to python_type_parser, and names the new one jit_type_parser.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17383
Differential Revision: D14186438
Pulled By: eellison
fbshipit-source-id: bbd5e337917d8862c7c6fa0a0006efa101c76afe
Summary:
Still WIP; needs more tests and correct handling for opset 8 in symbolics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16068
Reviewed By: zrphercule
Differential Revision: D14185855
Pulled By: houseroad
fbshipit-source-id: 55200be810c88317c6e80a46bdbeb22e0b6e5f9e
Summary:
reorder some envars for consistency
add readme and notice at the top of config.yml
generate more yaml from Python
closes #17322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17323
Differential Revision: D14186734
Pulled By: kostmo
fbshipit-source-id: 23b2b2c1960df6f387f1730c8df1ec24a30433fd
Summary:
Fall back operators to CPU for ONNX support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15270
Differential Revision: D14099496
Pulled By: yinghai
fbshipit-source-id: 52b744aa5917700a802bdf19f7007cdcaa6e640a
Summary:
As of right now, the script will produce a new generated file which will be inconsistent with the rest.
Test Result:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17370
Differential Revision: D14184943
Pulled By: izdeby
fbshipit-source-id: 5d3b956867bee661256cb4f38f086f33974a1c8b
Summary:
Currently there is a mismatch in naming between Python BatchNorm's `running_var` and C++ BatchNorm's `running_variance`, which causes JIT model parameter loading to fail (https://github.com/pytorch/vision/pull/728#issuecomment-466067138):
```
terminate called after throwing an instance of 'c10::Error'
what(): No such serialized tensor 'running_variance' (read at /home/shahriar/Build/pytorch/torch/csrc/api/src/serialize/input-archive.cpp:27)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x85 (0x7f2d92d32f95 in /usr/local/lib/libc10.so)
frame #1: torch::serialize::InputArchive::read(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, at::Tensor&, bool) + 0xdeb (0x7f2d938551ab in /usr/local/lib/libtorch.so.1)
frame #2: torch::nn::Module::load(torch::serialize::InputArchive&) + 0x98 (0x7f2d9381cd08 in /usr/local/lib/libtorch.so.1)
frame #3: torch::nn::Module::load(torch::serialize::InputArchive&) + 0xf9 (0x7f2d9381cd69 in /usr/local/lib/libtorch.so.1)
frame #4: torch::nn::Module::load(torch::serialize::InputArchive&) + 0xf9 (0x7f2d9381cd69 in /usr/local/lib/libtorch.so.1)
frame #5: torch::nn::operator>>(torch::serialize::InputArchive&, std::shared_ptr<torch::nn::Module> const&) + 0x32 (0x7f2d9381c7b2 in /usr/local/lib/libtorch.so.1)
frame #6: <unknown function> + 0x2b16c (0x5645f4d1916c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest)
frame #7: <unknown function> + 0x27a3c (0x5645f4d15a3c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest)
frame #8: <unknown function> + 0x2165c (0x5645f4d0f65c in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest)
frame #9: <unknown function> + 0x1540b (0x5645f4d0340b in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest)
frame #10: __libc_start_main + 0xf3 (0x7f2d051dd223 in /usr/lib/libc.so.6)
frame #11: <unknown function> + 0x1381e (0x5645f4d0181e in /home/shahriar/Projects/CXX/build-TorchVisionTest-Desktop_Qt_5_12_1_GCC_64bit-Debug/TorchVisionTest)
```
Renaming C++ BatchNorm `running_variance` to `running_var` should fix this problem.
This is a BC-breaking change, but it should be easy for end user to rename `running_variance` to `running_var` in their call sites.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17371
Reviewed By: goldsborough
Differential Revision: D14172775
Pulled By: yf225
fbshipit-source-id: b9d3729ec79272a8084269756f28a8f7c4dd16b6
Summary:
**WIP**
Attempt 2 at #14831
This adds `nn.LSTM` to the jit standard library. Necessary changes to the module itself are detailed in comments. The main limitation is the lack of a true `PackedSequence`, instead this PR uses an ordinary `tuple` to stand in for `PackedSequence`.
Most of the new code in `rnn.py` is copied to `nn.LSTM` from `nn.RNNBase` to specialize it for LSTM since `hx` is a `Tuple[Tensor, Tensor]` (rather than just a `Tensor` as in the other RNN modules) for LSTM.
As a hack it adds an internal annotation `@_parameter_list` to mark that a function returns all the parameters of a module. The weights for `RNN` modules are passed to the corresponding op as a `List[Tensor]`. In Python this has to be gathered dynamically since Parameters could be moved from CPU to GPU or be deleted and replaced (i.e. if someone calls `weight_norm` on their module, #15766), but in the JIT parameter lists are immutable, hence a builtin to handle this differently in Python/JIT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15744
Differential Revision: D14173198
Pulled By: driazati
fbshipit-source-id: 4ee8113159b3a8f29a9f56fe661cfbb6b30dffcd
Summary:
The test I added was failing lint because a constant was being created that wasn't being destroyed.
It was being inserted into all_nodes, then failing the check
` AT_ASSERT(std::includes(ALL_OF(sum_set), ALL_OF(all_nodes_set)));`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17316
Differential Revision: D14172548
Pulled By: eellison
fbshipit-source-id: 0922db21b7660e0c568c0811ebf09b22081991a4
Summary:
This provides the minimum necessary to allow derivative formulas for things that have a kwarg only specifier in their schema. Support for non-parser frontend default arguments for kwargs is not completed.
Fixes#16921
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17339
Differential Revision: D14160923
Pulled By: zdevito
fbshipit-source-id: 822e964c5a3fe2806509cf24d9f51c6dc01711c3
Summary:
Fix for #17261. SsnL, do you have tests for it in your other PR? If not, I'll add them to this one. The example from #17261 now does not error out (and the same goes for log_softmax).
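A sketch of the now-working pattern, using a float64 upcast as a stand-in example for the `dtype` argument:

```python
import torch

x = torch.randn(3, 4)
# Passing `dtype` upcasts the computation and output in one call.
out = torch.nn.functional.softmax(x, dim=-1, dtype=torch.float64)
assert out.dtype == torch.float64
assert torch.allclose(out.sum(dim=-1), torch.ones(3, dtype=torch.float64))
```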
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17330
Differential Revision: D14171529
Pulled By: soumith
fbshipit-source-id: ee925233feb1b44ef9f1d757db59ca3601aadef2
Summary:
Adds about 30 matches due to new functions / misuse of double.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17340
Differential Revision: D14161109
Pulled By: cpuhrsch
fbshipit-source-id: bb3333446b32551f7469206509b480db290f28ee
Summary:
The method will be used in IRParser and in NetDef converter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17372
Differential Revision: D14172494
Pulled By: ZolotukhinM
fbshipit-source-id: 96cae8422bc73c3c2eb27524f44ec1ee8cae92f3
Summary:
This PR switches from `OperationCreator` to `Operation` to simplify the logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17183
Differential Revision: D14169829
Pulled By: Krovatkin
fbshipit-source-id: 27f40a30c92e29651cea23f08b5b1f13d7eced8c
Summary:
Our sparse tests still almost exclusively use legacy constructors. This means you can't, for example, easily test scalars (because the legacy constructors don't allow them), and not surprisingly, many operations are broken with sparse scalars.
Note: this doesn't address the SparseTensor constructor itself, because there is a separate incompatibility there that I will address in a follow-on commit, namely, that torch.sparse.FloatTensor() is supported, but torch.sparse_coo_tensor() is not (because the size is ambiguous).
The follow-on PR will explicitly set the size for sparse tensor constructors and add a test for the legacy behavior, so we don't lose it.
Included in this PR are changes to the constituent sparse tensor pieces (indices, values):
1) IndexTensor becomes index_tensor
2) ValueTensor becomes value_tensor if it is a data-based construction, else value_empty.
3) Small changes around using the legacy tensor type directly, e.g. torch.FloatTensor.dtype exists, but torch.tensor isn't a type.
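As a sketch of the constructor difference noted above: `torch.sparse_coo_tensor` takes an explicit size, which resolves the ambiguity that the legacy `torch.sparse.FloatTensor()` constructor leaves open:

```python
import torch

i = torch.tensor([[0, 1, 1],
                  [2, 0, 2]])          # 2 x nnz index matrix
v = torch.tensor([3.0, 4.0, 5.0])      # one value per index column
# Size is given explicitly; it cannot be inferred unambiguously from indices.
s = torch.sparse_coo_tensor(i, v, (2, 3))
assert s.to_dense()[1, 2].item() == 5.0
```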
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17324
Differential Revision: D14159270
Pulled By: gchanan
fbshipit-source-id: 71ee63e1ea6a4bc98f50be41d138c9c72f5ca651
Summary:
Apparently, before this, the only way we enforced it was size >= 0 in alloc_cpu. So empty((5, -5)) would fail, but empty((-5, -5)) would hang :)
Please suggest a better place to enforce it, if any.
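A quick sketch of the intended behavior after this change: both shapes with a negative dimension should raise rather than one of them hanging:

```python
import torch

for size in [(5, -5), (-5, -5)]:
    try:
        torch.empty(*size)
        raise AssertionError("negative size should have been rejected")
    except RuntimeError:
        pass  # both shapes now raise consistently
```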
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17077
Differential Revision: D14077930
Pulled By: dzhulgakov
fbshipit-source-id: 1120513300fd5448e06fa15c2d72f9b0ee5734e4
Summary:
This PR addresses the slowness of MVN's log_prob as reported in #17206.
t-vi I find it complicated to handle permutation dimensions if we squeeze singleton dimensions of bL, so I leave it as-is and keep the old approach. What do you think?
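For reference, a small check of MVN's log_prob (the op this PR speeds up), evaluated at the mean with identity covariance, where the density reduces to -d/2 * log(2*pi):

```python
import math
import torch
from torch.distributions import MultivariateNormal

mvn = MultivariateNormal(torch.zeros(3), covariance_matrix=torch.eye(3))
lp = mvn.log_prob(torch.zeros(3))
# At the mean with identity covariance the quadratic term and log-det vanish.
assert torch.isclose(lp, torch.tensor(-1.5 * math.log(2 * math.pi)))
```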
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17294
Differential Revision: D14157292
Pulled By: ezyang
fbshipit-source-id: f32590b89bf18c9c99b39501dbee0eeb61e130d0
Summary:
Fix#16650.
Headers such as `ATen/cpu/vml.h` contain `#include <ATen/cpu/vec256/vec256.h>`
for example, but these vec256 headers aren't included, due to commit e4c0bb1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17220
Differential Revision: D14165695
Pulled By: ezyang
fbshipit-source-id: 27b2aa2a734b3719ca4af0565f79623b64b2620f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17297
When `torch.load` needs to load a tensor, no matter which device it will be end up being loaded on, it first creates a CPU storage for it of the necessary size. This storage is allocated but it's not "set" yet, hence no data is written to it: it exists in the kernel's memory map, but it's not resident and doesn't take up physical pages. Then, this storage is passed to the `map_location` function (if the parameter is a string, a device or a map, PyTorch builds that function automatically). The default map for CUDA consists effectively in `lambda storage, _: storage.cuda()` (I omitted the code needed to pick the correct device). This creates a GPU storage and copies over the data of the CPU storage. *This step is unnecessary as we're copying uninitialized memory*. (Surprisingly enough, though, it appears the kernel is smart enough that reading from the unpaged CPU memory doesn't cause it to become paged.) Once `map_location` returns a storage residing on the correct target device, `torch.load` resumes reading the file and copying the tensor's content over into the storage. This will overwrite the content that had previously been written to it, which confirms that the above copy was pointless.
A way to avoid this useless copy is to just create and return a new empty storage on the target GPU, instead of "transforming" the original one.
This does indeed increase the performance:
```
In [5]: torch.save(torch.rand(100, 100, 100), "/tmp/tensor")
In [6]: %timeit torch.load("/tmp/tensor", map_location="cuda")
1.55 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit torch.load("/tmp/tensor", map_location=lambda storage, _: torch.cuda.FloatStorage(storage.size()))
1.03 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
Credit for this diff is shared with adamlerer and fmassa.
Differential Revision: D14147673
fbshipit-source-id: a58d4bc0d894ca03a008499334fc2cdd4cc91e9f
Summary:
If something is a TensorList, it should be a list of `TensorType`, not a list of some specialized type.
Fixes#17140, #15642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17321
Differential Revision: D14158192
Pulled By: suo
fbshipit-source-id: ba8fe6ae8d618c73b23cd00cbcb3111c390c5514
Summary:
Bunch of random stuff I came across while doing UDT stuff. Putting in a separate PR to avoid noise
- fix up the alias analysis list ops to include fork/wait
- improve dump() for aliasDb to print writes
- Move BuiltinFunction::call() to sugaredvalue with the rest of the methods
- formatting and includes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17288
Differential Revision: D14147105
Pulled By: suo
fbshipit-source-id: 62e2a922a1726b684347365dc42c72188f154e9c
Summary:
MKL-DNN supports multi-node mode but not multi-device mode; this commit adds multi-device support for MKL-DNN. This commit depends on https://github.com/pytorch/pytorch/pull/11330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12856
Differential Revision: D13735075
Pulled By: ezyang
fbshipit-source-id: b63f92b7c792051f5cb22e3dda948013676e109b
Summary:
Add the missing `std` introduced by #16689. Investigating why this wasn't caught in CI (nor in my local dev environment).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17263
Reviewed By: ezyang
Differential Revision: D14134556
Pulled By: ailzhang
fbshipit-source-id: 6f0753fa858d3997e654924779646228d6d49838
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15034
Rethrow exceptions that happened during RunAsync; ensure that pending tasks
are not executed after the run is marked as finished.
Reviewed By: andrewwdye
Differential Revision: D13409649
fbshipit-source-id: 3fd12b3dcf32af4752f8b6e55eb7a92812a5c057
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17132
The schedule() function is not supposed to throw exceptions and is supposed
to succeed in scheduling the full graph of tasks; potential errors (e.g. errors
from the underlying thread pool, out-of-memory exceptions, etc.) are considered not
recoverable.
The invariant: the graph of tasks is either not executed or
executed in full before the call to finishRun().
Reviewed By: andrewwdye
Differential Revision: D14092457
fbshipit-source-id: a3e5d65dfee5ff5e5e71ec72bb9e576180019698
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17275
Previous implementation used a memcpy inside the kernel. It is more efficient to reduce the data fetched per thread to a single word from memory. This exposes more concurrency and takes advantage of GPU memory coalescing support.
Reviewed By: takatosp1
Differential Revision: D14120147
fbshipit-source-id: c4734003d4342e55147c5b858f232a006af60b68
Summary:
With this patch you can use USE_DISTRIBUTED=OFF (possibly in combination with USE_NCCL=OFF (?)).
This is significant partly because NCCL doesn't build with CUDA 8.
This is written under the assumption that NCCL is required for distributed; if not, the USE_DISTRIBUTED check in nccl.py should be replaced by a check for the USE_NCCL environment variable.
Fixes: #17274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17295
Differential Revision: D14155080
Pulled By: ezyang
fbshipit-source-id: 0d133f7c5b4d118849f041bd4d4cbbd7ffc3c7b4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16723
Removed obsolete argument correct_transform_coords in bbox_transform op.
* It was only for backward compatibility. We should not have models using it now.
Differential Revision: D13937430
fbshipit-source-id: 504bb066137ce408c12dc9dcc2e0a513bad9b7ee
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17194
we found that there is a per row absolute error due to int8 quant
and a relative error table-wide in case fp16 is used
Reviewed By: csummersea
Differential Revision: D14113353
fbshipit-source-id: c7065aa9d15c453c2e5609f421ad0155145af889
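The per-row absolute error behavior described above can be sketched in plain Python (a hedged illustration of rowwise asymmetric int8 quantization, not the fbgemm implementation): each row gets its own scale and zero point, and the reconstruction error of any entry is bounded by half that row's quantization step.

```python
def quantize_row(row, num_bits=8):
    """Rowwise asymmetric quantization: a per-row scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # guard against constant rows
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in row]
    deq = [(v - zero_point) * scale for v in q]
    return deq, scale

row = [0.1, -0.4, 2.0, 0.7]
deq, scale = quantize_row(row)
# Per-row absolute error is bounded by half the quantization step.
max_err = max(abs(a - b) for a, b in zip(row, deq))
assert max_err <= scale / 2 + 1e-9
```

Because the step size scales with the row's value range, the absolute error is per-row, matching the observation in the commit; the fp16 case then adds a table-wide relative error on top.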
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17198
We have come to the point where we need to apply rules that bind certain ops together to avoid un-inferrable intermediate shapes: we either lower them together to the backend or lower neither. This diff adds a pass that lets us add rules like this. The first one binds `Gather` with `SparseLengthsWeighted*`.
Reviewed By: ipiszy
Differential Revision: D14118326
fbshipit-source-id: 14bc62e1feddae02a3dd8eae93b8f553d52ac951
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17272
after Windows-specific fixes were applied, a new file was left out of CMakeLists
Reviewed By: orionr
Differential Revision: D14140419
fbshipit-source-id: 6a6c652048ed196ec20241bc2a1d08cbe2a4e155
Summary:
This commit makes the following enhancements:
1. Add docs for build_android.sh;
2. Add an install step for build_android.sh, so that the headers and libraries can be conveniently collected together for further usage;
3. Change the default INSTALL_PREFIX from $PYTORCH_ROOT/install to $PYTORCH_ROOT/build_android/install to keep the project directory clean.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17298
Differential Revision: D14149709
Pulled By: soumith
fbshipit-source-id: a3a38cb41f26377e21aa89e49e57e8f21c9c1a39
Summary:
The particular use case reported is Jetson TX2 and maskrcnn.
Fixes #17144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17296
Differential Revision: D14147886
Pulled By: soumith
fbshipit-source-id: 44d5a89aaeb4cc07d1b53dd90121013be93c419c
Summary:
`TestNN.test_variable_sequence_cuda` sometimes breaks due to a reported CUDA leak.
The cause appears to be too small a tolerance breaking the float16 sub-test of the test above.
When it breaks, it calls abort, disrupting the correct tear-down of the test
and falsely alarming about the leak.
~~Also, removed annoying **Upsample** module warning.
IMHO this warning is wrong because the module **Upsample** is not deprecated. Seems like it's been mixed
with `nn.functional.upsample` function which is indeed deprecated in favor of `nn.functional.interpolate`, see `torch/nn/functional.py:2387` for details (this replacement is also performed in `test_nn.py`).~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17242
Differential Revision: D14141686
Pulled By: soumith
fbshipit-source-id: faa8f87440d94bdc6ab0ff00be6dad82353115c4
Summary:
Currently, when the input tensor `self` is not contiguous, `tril_` and `triu_` call `self = self.contiguous()`, which allocates a new contiguous tensor and assigns it to `self`. This effectively changes the input tensor `self`'s pointer and will break downstream code after the Variable/Tensor merge.
This PR fixes it so that `tril_` and `triu_` always update the input tensor in-place and preserve the input tensor's TensorImpl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17031
Differential Revision: D14069592
Pulled By: yf225
fbshipit-source-id: d188218f426446a44ccc1d33fc28ac3f828c6a05
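The contract the PR enforces can be illustrated with a plain-Python sketch (lists standing in for tensors, not the actual TH/ATen code): the in-place variant must mutate the existing object rather than rebinding the name to a new allocation, so the caller's object identity (the TensorImpl, in PyTorch terms) survives.

```python
def triu_(matrix, diagonal=0):
    """Zero the entries below the given diagonal, in place.

    The matrix object itself is mutated -- its identity (and hence any
    alias of it held elsewhere) is preserved, mirroring what the PR
    guarantees for the input tensor's TensorImpl.
    """
    for i, row in enumerate(matrix):
        for j in range(len(row)):
            if j - i < diagonal:
                row[j] = 0
    return matrix

m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
out = triu_(m)
assert out is m                                  # same object: no reallocation
assert m == [[1, 2, 3], [0, 5, 6], [0, 0, 9]]
```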
Summary:
Some values are copied when they could have been moved.
Detected by the compiler flag -Wreturn-std-move.
Reviewed By: igorsugak
Differential Revision: D14134303
fbshipit-source-id: 8fc3bb2017108b3d65097cb8447e33f5b6c743b4
Summary:
Lightweight implementation of the LLVM FileCheck utility. It currently only handles string matching; regexes and saving a regex to a variable name can be added as needed.
The current intended usage is through the FileCheckBuilder python handle, as shown in the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16858
Differential Revision: D14096244
Pulled By: eellison
fbshipit-source-id: c7c8d1457691c105e6ccbb3c1a378d96baac2569
Summary:
Trying to land again: make prim::None into a case of prim::Constant. The previous landing was reverted because it broke an important onnx export test.
https://github.com/pytorch/pytorch/pull/16160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17186
Differential Revision: D14115304
Pulled By: eellison
fbshipit-source-id: 161435fc30460b4e116cdd62c7b2e5b94581dcb7
Summary: Clarified in the docs that `tensor` is used as `end`.
Differential Revision: D14132212
Pulled By: ezyang
fbshipit-source-id: e9bca14d5079e5f7adfc18afcb1eec832ef86e9e
Summary:
Reenables rand_like fusion if no tensor is broadcasted in the fusion group. This is a sufficient but not necessary condition for fused rand_like to produce correct results, and it has an unpleasant side effect of falling back to non-fused path if rand_like was optimistically included in the fusion group, but there is a broadcast in the fusion group not necessarily related to rand_like. E.g. before this PR, if the network had (biasAdd -> relu -> dropout), fuser could fuse biasAdd and relu, now it will try fusing the whole thing (if dropout is expressed via rand_like) and fall back every time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16087
Differential Revision: D13720232
Pulled By: zou3519
fbshipit-source-id: 1e19203bec4a59257bfc7078b054a19f00fab4ad
Summary:
This is the first commit from a series of planned changes in order to add boolean tensors to PyTorch. The whole plan looks like this:
0. Storage Implementation (this change)
1. Tensor Creation.
2. Tensor Conversions.
3. Tensor Indexing.
4. Tensor Operations.
5. Back compatibility related changes.
This feature was requested by the community:
https://github.com/pytorch/pytorch/issues/4764
https://github.com/pytorch/pytorch/issues/4219
https://github.com/pytorch/pytorch/issues/4288
**Change**:
Added boolean type to the Storage class for CPU and CUDA backends.
**Tested via**:
1. unit tests
2. running this:
-> import torch
-> torch.BoolStorage
<class 'torch.BoolStorage'>
-> torch.cuda.BoolStorage
<class 'torch.cuda.BoolStorage'>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16810
Reviewed By: gchanan
Differential Revision: D14087246
Pulled By: izdeby
fbshipit-source-id: 042642ced1cb0fd1bb6bff05f9ca871a5c54ee5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17188
Using the flag "-Wreturn-std-move", the compiler can identify cases where a copy
operation is performed when a move operation would have been available.
Wrapped the return statements with std::move to fix.
For some reason, these files are not automatically modded. With D14115372
we should be able to turn on the compile flag.
Reviewed By: soumith
Differential Revision: D14115786
fbshipit-source-id: e763b92eecbe4468027fc141d029618d1e9f280b
Summary:
Adding two distributed samplers, Random and Sequential, to the mix. Similar to the Python counterpart, DistributedSampler introduces a new method `set_epoch(size_t epoch)` which can be used to shuffle data deterministically between distributed processes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16910
Differential Revision: D14130980
Pulled By: soumith
fbshipit-source-id: ec08b7130c01e2fc6dc3693f7ac622a0a6d60f10
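A minimal plain-Python sketch of the idea (class and method names are illustrative, not the actual C++ API): seeding the shuffle with the epoch makes every replica compute the same permutation, and each replica then takes a disjoint strided shard, so `set_epoch` reshuffles deterministically across processes.

```python
import random

class DistributedRandomSampler:
    """Sketch: each of num_replicas processes sees a disjoint shard, and
    set_epoch() reseeds the shuffle identically everywhere."""

    def __init__(self, dataset_size, num_replicas, rank):
        self.size = dataset_size
        self.num_replicas = num_replicas
        self.rank = rank
        self.epoch = 0

    def set_epoch(self, epoch):
        self.epoch = epoch

    def __iter__(self):
        indices = list(range(self.size))
        # Same seed on every replica => same global permutation.
        random.Random(self.epoch).shuffle(indices)
        # Each replica takes a strided shard of the permutation.
        return iter(indices[self.rank::self.num_replicas])

s0 = DistributedRandomSampler(8, num_replicas=2, rank=0)
s1 = DistributedRandomSampler(8, num_replicas=2, rank=1)
s0.set_epoch(3); s1.set_epoch(3)
shard0, shard1 = list(s0), list(s1)
assert sorted(shard0 + shard1) == list(range(8))  # shards partition the data
```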
Summary:
`del Tensor.grad` sets the PyObject to nullptr,
while `Tensor.grad = None` sets the PyObject to Py_None.
Handling both cases now.
Fixes #16471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16525
Differential Revision: D14130800
Pulled By: soumith
fbshipit-source-id: ed85c38305bba94d5047311cb58e4e4cedd09832
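The two cases can be sketched in plain Python with a descriptor (a hedged illustration of the semantics, not the actual C++ binding code): `del t.grad` clears the slot entirely (the nullptr case), while `t.grad = None` stores an explicit None (the Py_None case), and both must leave the accessor returning None.

```python
class GradSlot:
    """Sketch of the two cases the fix must handle uniformly."""

    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # A missing attribute (the "nullptr" case) reads back as None too.
        return getattr(obj, self.name, None)

    def __set__(self, obj, value):
        setattr(obj, self.name, value)

    def __delete__(self, obj):
        # Remove the slot entirely rather than storing None.
        obj.__dict__.pop(self.name, None)

class Tensor:
    grad = GradSlot()

t = Tensor()
t.grad = "some-grad"
del t.grad                 # nullptr-style clear
assert t.grad is None
t.grad = None              # explicit Py_None assignment
assert t.grad is None
```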
Summary:
It might need some cleaning up and might be missing some features, but it should be already working for most cases.
This PR is based on top of PR16986 (so please review only the last commit here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16987
Differential Revision: D14074577
Pulled By: ZolotukhinM
fbshipit-source-id: 712b598f423265655f574bb9903e2066628eaad3
Summary:
Similar to softmax, there are issues with randomly getting NaN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17170
Differential Revision: D14110515
Pulled By: bddppq
fbshipit-source-id: 5c97661184d45a02122fd69d35a839fdf4520c8c
Summary:
Currently the converters are very straightforward, i.e. there is no code for trying to
preserve semantics; we purely perform conversion from one format to another.
Two things that we might want to add/change:
1. Add semantic conversion as well (but probably it would be a good idea to keep
it separate as a temporary thing).
2. Make sure we don't mess with value names, as they are crucial for current
uses of NetDefs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17123
Differential Revision: D14090244
Pulled By: ZolotukhinM
fbshipit-source-id: 07175fa9235582e1d1da5f10a42a5c1280b1b394
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17062
from jiyan's training jobs it seems like we found a quantization bug
fp32 is fine
fp32 -> rowwise int8 is fine
fp16 is fine
fp16 -> rowwise int8 is not fine
we are preconverting everything to fp32 and using the existing code, so there is no need to change the epsilon in the case of fp16 since at the time of converting, everything is a float
Reviewed By: jspark1105
Differential Revision: D14063271
fbshipit-source-id: 747297d64ed8c6fdf4be5bb10ac584e1d21a85e6
Summary:
This change simplifies analysis done on constants since prim::None does not need to be handled separately now. To check if a constant node is None, use node->isNone().
Next step will be to remove prim::Undefined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16160
Differential Revision: D14109636
Pulled By: eellison
fbshipit-source-id: d26fd383976163a2ddd4c24984bd672a541cc876
Summary:
Based on https://github.com/pytorch/pytorch/pull/12413, with the following additional changes:
- Inside `native_functions.yml` move those outplace operators right next to everyone's corresponding inplace operators for convenience of checking if they match when reviewing
- `matches_jit_signature: True` for them
- Add missing `scatter` with Scalar source
- Add missing `masked_fill` and `index_fill` with Tensor source.
- Add missing test for `scatter` with Scalar source
- Add missing test for `masked_fill` and `index_fill` with Tensor source by checking the gradient w.r.t source
- Add missing docs to `tensor.rst`
Differential Revision: D14069925
Pulled By: ezyang
fbshipit-source-id: bb3f0cb51cf6b756788dc4955667fead6e8796e5
Summary:
The main problem there is with differentiating batch norm statically
is that we make a lot of complex run-time decisions about the backend
we choose. Then, the autograd derivatives are implemented for every
backend separately, which makes sense, because they might be saving
buffers containing different values. To resolve the issue, the forward
op returns an index of the chosen backend, and the backward function
takes it as an argument, such that it knows how to interpret the buffers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15403
Differential Revision: D14098815
Pulled By: ailzhang
fbshipit-source-id: 7fcd3e6e0566433e81fe8286fb441c1ecaf198ad
Summary:
/cc goldsborough
Working on #14582
The corresponding python implementations are at: [pytorch/torch/nn/init.py](6302e4001a/torch/nn/init.py (L261-L327))
Here is my initial implementation of Kaiming Initialization. I have not been able to figure out how to successfully run tests locally so I haven't added any yet.
A couple questions:
- Are the enums defined in the right place? I copied their names from Python, but do you prefer different naming conventions for C++?
- To run tests locally do I use `python setup.py test`? Can I run just a subset of the tests somehow?
- Should I add my tests at [test/cpp/api/misc.cpp](https://github.com/pytorch/pytorch/blob/master/test/cpp/api/misc.cpp#L47-L54)?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14718
Differential Revision: D14049159
Pulled By: goldsborough
fbshipit-source-id: 966ac5126875936e69b185b5041f16476ed4cf70
Summary:
In `torch.distributed.launch.py`, it passes `local_rank` as argument and requires user's program to parse it. However, it would be more flexible for users and consistent with other variables, e.g. `RANK`, `MASTER_PORT`, `WORLD_SIZE`, if passing through environment variables.
265ed8ff45/torch/distributed/launch.py (L200-L212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16360
Differential Revision: D14070372
Pulled By: ezyang
fbshipit-source-id: c3f6a8e55ab513918cad09d1326eccdedb4d98c9
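From the user program's side, the environment-variable style looks like the sketch below (hedged: the variable name `LOCAL_RANK` follows the convention of `RANK`, `WORLD_SIZE`, `MASTER_PORT`; the launcher that exports it is not shown).

```python
import os

# Normally the launcher exports LOCAL_RANK; default it here so the sketch
# is runnable standalone.
os.environ.setdefault("LOCAL_RANK", "0")

# The training script reads the environment instead of parsing --local_rank.
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ.get("WORLD_SIZE", "1"))
```

This keeps the user's argument parser free of launcher-specific flags.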
Summary:
The binary ops that are using TensorIterator do a trick in order to only write the code once for out and non-out variants:
1) Have the non-out variant call the out variant with an undefined tensor.
2) the out variant then reassigns the result tensor to the output of the TensorIterator; this is a no-op in the case where a valid tensor was passed and it correctly propagates the result back to the non-out variant, which is legal because it's just reassigning an undefined tensor.
I believe other solutions to this problem would require an unnecessary reference bump, e.g. defining another out variant that returns a Tensor rather than a reference.
Unfortunately, this doesn't work with const-references, which we want to move our output arguments to be (because const doesn't actually provide const correctness here, and writers mistakenly reassign the parameter in the case it isn't an out variant).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17059
Differential Revision: D14068402
Pulled By: gchanan
fbshipit-source-id: 89fef177a1e174dbe2858e2eae0f6d85460b07d1
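The trick described above can be sketched in plain Python, with `None` standing in for the undefined tensor (a hedged illustration of the pattern, not the actual TensorIterator code): the out variant reassigns its `out` parameter when it is undefined, which is a no-op from the caller's perspective, so one body serves both variants.

```python
def add_out(a, b, out=None):
    """Out variant: if out is None (the "undefined tensor"), allocate the
    result and rebind the local name; otherwise write into the caller's
    buffer. The rebinding is invisible to the caller, which is why the
    trick is legal."""
    if out is None:
        out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def add(a, b):
    # Non-out variant simply calls the out variant with an undefined output.
    return add_out(a, b, out=None)

buf = [0, 0]
assert add_out([1, 2], [3, 4], out=buf) is buf   # result written in place
assert add([1, 2], [3, 4]) == [4, 6]
```

With C++ const-reference output arguments, the rebinding step is no longer allowed, which is the conflict the commit describes.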
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17158
Because of the Reshape op, the batch size can change. This diff addresses the first-order issue raised by a multiple-batch-size system. We need to export a different real_batch_size for each max_batch_size input and attach it to the right output.
It also fixes a false exception.
Reviewed By: ipiszy
Differential Revision: D14099541
fbshipit-source-id: 0fa9e86826f417a11d2b5dd2ee60dff64a7ce8c4
Summary:
This initial PR splits the `.circleci/config.yml` file into several smaller files that are stitched verbatim back into the original. A proof of concept of dynamically generating yaml for the job configuration list is also introduced.
Since the `config.yml` file must exist in the repo in its final form, there must exist a manual update and check-in step to regenerate `config.yml` from its constituent parts.
Consistency between the checked-in `config.yml` file and the authoritative source data is enforced at build time through TravisCI.
closes #17038
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17039
Reviewed By: yf225
Differential Revision: D14109059
Pulled By: kostmo
fbshipit-source-id: bc04a73145290358854f5a5e552a45e559118fc3
Summary:
This PR add supports for simpler for-in-list loops such as the example below:
```python
@torch.jit.script
def sum_list(a):
    # type: (List[int]) -> int
    sum = 0
    for i in a:
        sum += i
    return sum
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16726
Differential Revision: D14070007
Pulled By: ezyang
fbshipit-source-id: b4d971ee647729a6caa3099ceac34ec5c4f143de
Summary:
one_hot docs are missing [here](https://pytorch.org/docs/master/nn.html#one-hot).
I dug around and could not find a way to get this working properly.
Differential Revision: D14104414
Pulled By: zou3519
fbshipit-source-id: 3f45c8a0878409d218da167f13b253772f5cc963
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17105
To make FC with rowwise quantization faster, reduce code duplication, and make code consistent with Convolution
Reviewed By: csummersea
Differential Revision: D14080461
fbshipit-source-id: 2b0e67b86e7e3029c90751a8824bf80ae1223680
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17145
Prepacked weight contains both weight and bias, so the bias should be obtained from input index 1, not from 2
Reviewed By: jianyuh
Differential Revision: D14097281
fbshipit-source-id: b8b836b85a7b240e2fd1734377c46d9bf2ce3390
Summary:
In the NUMA case, PinnedCPUAllocator's allocate() would return a
DataPtr constructed by DefaultCPUAllocator, which would reference
the Default... Delete() rather than the Pinned... Delete(). That
meant Pinned... Delete() would never run, so cudaHostUnregister()
would never be called when regions were freed.
See: https://github.com/pytorch/pytorch/issues/16280
This change adds a 'naked_allocate()' method to the Default allocator
that just returns a pointer to the allocated memory rather than
wrapping it in a DataPtr. Pinned allocator uses that then constructs
a DataPtr with reference to its own Delete().
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16340
Reviewed By: dzhulgakov
Differential Revision: D13843206
Pulled By: ezyang
fbshipit-source-id: 9efb572e5a01b49ef2a4aceeccc13cd0b1066528
Summary:
This prevents people (reviewers, PR authors) from forgetting to add things to `torch.rst`.
When something new is added to `_torch_doc.py` or `functional.py` but intentionally not in `torch.rst`, people should manually whitelist it in `test_docs_coverage.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16039
Differential Revision: D14070903
Pulled By: ezyang
fbshipit-source-id: 60f2a42eb5efe81be073ed64e54525d143eb643e
Summary:
setting the correct math type for cudnn rnn, which is enforced starting from cudnn 7.5+
1. Updating persistent rnn check with input data type instead of rnn math type;
2. Updating rnn type promotion to set correct math type for accumulation;
3. Replace datatype check for filter descriptor from rnn.datatype to input.datatype;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16825
Differential Revision: D14071190
Pulled By: ezyang
fbshipit-source-id: 1c9a1531ccf510cb0619e830be444c20c5e72f3f
Summary:
In light of the antistatic feature being a part of the released ROCm 2.1, remove
the feature in pyHIPIFY for extraction of kernel arguments and insertion of
static_casts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17055
Differential Revision: D14068478
Pulled By: bddppq
fbshipit-source-id: 6895f490c78247a129aa18c520ff8d4d1a3d3642
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16722
Updated bbox_transform and nms unit test for caffe2 ops.
Differential Revision: D13937416
fbshipit-source-id: 034743d29671c6e73d323a935e2d734ecc071bff
Summary:
Support data parallel for ScriptModule.
See the unit tests for the testing done for this PR. I also tried a traced version of resnet18 from torchvision.
I have yet to try complete end-to-end data-parallel training; that will be the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16891
Differential Revision: D14002222
Pulled By: gqchen
fbshipit-source-id: fce3598169113215599815c6978e66d3c3a8c282
Summary:
Follow-up of #14533: add more test coverage for `emitIf` metaprogramming conditions. Also delete some unwrap optional usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16794
Differential Revision: D14096868
Pulled By: wanchaol
fbshipit-source-id: ee1cec609c58d0dd65211249a90207be06649e71
Summary:
When adaptive pooling has to produce a single pixel feature map, it is faster to do so by calling .mean(). Backward calls a pretty inefficient cuda kernel with atomics, which becomes ridiculously slow for halfs. For half this PR provides approx 30x speed-up for adaptive average pooling, which results in 30% end-to-end speed-up on senet. Improvements are smaller for float, but still significant (approx 5x).
Also this PR unifies handling of 3d (no batch dimension) and 4d tensors, using negative dimension indices.
cc ezyang for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17011
Reviewed By: ailzhang
Differential Revision: D14078747
Pulled By: soumith
fbshipit-source-id: 0eb9255da2351190a6bcaf68c30e2ae2402a2dd9
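The equivalence being exploited can be shown with a tiny plain-Python sketch (lists standing in for tensors): adaptive average pooling to a single output pixel is exactly the mean over the spatial dimensions, so the fast path can dispatch to a plain mean reduction.

```python
def adaptive_avg_pool_1x1(feature_map):
    """When the target output is 1x1, adaptive average pooling over an
    HxW map reduces to the mean of all spatial entries."""
    h, w = len(feature_map), len(feature_map[0])
    return sum(sum(row) for row in feature_map) / (h * w)

fm = [[1.0, 2.0], [3.0, 6.0]]
assert adaptive_avg_pool_1x1(fm) == 3.0
```

The backward of a mean is a uniform scatter of the gradient, which avoids the atomics in the generic adaptive-pooling backward kernel.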
Summary:
This updates the example for `torch.mode` to show a case where there is a mode.
Also add a bit of a description to the explanation as well as being a bit more precise about "a" mode rather than "the" mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17069
Differential Revision: D14078722
Pulled By: soumith
fbshipit-source-id: 837a238d53a9b8e868511acbdc258633975bea48
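The "a" mode versus "the" mode distinction can be sketched in plain Python (an illustration of the semantics, not PyTorch's kernel): when several values are tied for the highest count, any one of them is a valid mode, and an implementation must pick a tie-breaking rule.

```python
from collections import Counter

def mode(values):
    """Return *a* mode: with ties, any most-frequent value is valid
    (here, the smallest is chosen for determinism)."""
    counts = Counter(values)
    best = max(counts.values())
    return min(v for v, c in counts.items() if c == best)

assert mode([1, 2, 2, 3]) == 2
assert mode([1, 1, 2, 2]) == 1   # tie: both 1 and 2 are modes
```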
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17074
There are some common functionalities in backend lowering. This diff creates a base class which hosts these common stuff.
Reviewed By: ipiszy
Differential Revision: D14073192
fbshipit-source-id: 9617603d0e73db6f7fcc5572756b9dbab506dae5
Summary:
I noticed that we were sinking a lot of time into `cat` operations in machine translation on CPU, and drilled down to us doing the cat element-by-element, even though all the inputs were contiguous. The reason was we were doing the cat along a dimension that was not 0, and that caused us to not use the fast `memcpy` branch. This PR generalizes that branch.
Quick benchmark script:
```
import torch, time
tensors = [torch.rand(6, 2, 1024) for i in range(5)]
NITER = 1000
s = time.time()
for i in range(NITER):
    torch.cat(tensors, dim=1)
print('time per iter ', (time.time() - s) / NITER)
```
Before:
```
time per iter 8.089399337768554e-05
```
After:
```
time per iter 2.183413505554199e-05
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17032
Differential Revision: D14090038
Pulled By: jamesr66a
fbshipit-source-id: 2c733a84915896008ac95f2233f44894bd2573de
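The generalized fast path can be sketched in plain Python, with flat lists standing in for contiguous buffers (a hedged illustration of the copy pattern, not the actual C++ kernel): for contiguous inputs concatenated along dim 1, each outer index contributes one contiguous slab per input, so the work is a handful of bulk slice copies (memcpy in the kernel) instead of an element-by-element loop.

```python
def cat_dim1_flat(buffers, shape):
    """Concatenate flat contiguous buffers of common shape (d0, d1, d2)
    along dim 1, copying one contiguous slab per (outer index, input)."""
    d0, d1, d2 = shape
    slab = d1 * d2                                   # contiguous chunk size
    out = []
    for i in range(d0):
        for buf in buffers:
            out.extend(buf[i * slab:(i + 1) * slab])  # one bulk copy
    return out

a = list(range(12))            # shape (2, 2, 3), contiguous
b = [x + 100 for x in a]
out = cat_dim1_flat([a, b], (2, 2, 3))
assert out[:6] == [0, 1, 2, 3, 4, 5]
assert out[6:12] == [100, 101, 102, 103, 104, 105]
```

The number of copies is `d0 * len(buffers)` regardless of `d1 * d2`, which is why large inner slabs see the biggest speedup.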
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17080
This changes all operators using this macro to the new format
Reviewed By: dzhulgakov
Differential Revision: D14078628
fbshipit-source-id: 67048e485e326765fd49567cc008633d3d500d5c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17076
OSS: slightly change tools/amd_build/build_amd.py to add the output_directory option for internal use. Also modify the renaming convention in the hipify script to reflect the updated rules.
Reviewed By: bddppq
Differential Revision: D13767218
fbshipit-source-id: cbcadc51daab42197d545f204840dcc18176bb3d
Summary:
- Moved a few functions from the `autograd` namespace to the `aten` namespace to make them visible from the JIT nativeResolver.
- Added a hack to look up keyword-only arguments. Will add proper support for kwarg-only arguments later.
- Simulate function overloading in aten using `_<number>` as a function-name suffix.
- Even when `forward` returns multiple outputs, as in `kthvalue`, there's at most one that requires grad that we currently support.
- Removed the `TensorList`-related ops here since partial `TensorList` support is prone to bugs. Our symbolic diff for `cat` was never tested with autodiff, and it seems broken. Need to find another proper way to support these ops (either by properly supporting `TensorList` or something like `prim::ConstantChunk`) and leave them for the next PR.
Ops supported in this PR:
```
erf
expand_as
index
kthvalue
mean
permute
pow
rsub
select
sqrt
squeeze
t
to
topk
transpose
view
var
embedding
logsumexp
// grad is None
_dim_arange
contiguous
nonzero
ones_like
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16689
Differential Revision: D14020806
Pulled By: ailzhang
fbshipit-source-id: a5e2c144a7be5a0d39d7ac5f93cb402ec12503a5
Summary:
update:
1. global_reduce now checks for should_block_y_reduce first.
This avoids enabling global_reduce without block_y_reduce, which led to
accessing shared memory during global reduce without allocation.
2. Updated the block_y_reduce heuristics. Improves perf on tiny tensors.
3. Added a test case covering old cases where illegal memory access might occur.
TensorIterator cuda launch configs update (#16224)
Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible
block dimension to improve efficiency for reduction cases with small fast
dimension.
Previously TensorIterator launches blocks with fixed 32x16 threads.
For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
The fixed launch config does not handle coalesced memory access efficiently for such shapes.
The updated launch config enables a flexible block dimension. Combined with the
improved reduction scheme (using flexible vertical / horizontal reduction
instead of the limited warp / block reduction in the old code), it ensures an optimal
memory access pattern even when reducing over a dimension with a small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17040
Differential Revision: D14078295
Pulled By: umanwizard
fbshipit-source-id: ecc55054a5a4035e731f0196d633412225c3b06c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17046
As we are moving to use bound shape inference, we can remove the awkward fake inference run path and make the code cleaner.
Reviewed By: ipiszy
Differential Revision: D14061501
fbshipit-source-id: b3ace98b3dabef3c3359086a0bb1410518cefa26
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17096
LengthsRangeFill takes a batch-size lengths input and expands it into a sequence. Later ops should follow this type until they hit another batch-type-moderating op, e.g. SparseLengthsSum.
Reviewed By: ipiszy
Differential Revision: D14079422
fbshipit-source-id: 1a26925d502c32875ea95c160268bf6a256cc955
Summary:
Currently, when you pass a negative index to a `Dataset` created with `ConcatDataset`, it simply passes that index to the first dataset in the list. So if, for example, we take `concatenated_dataset[-1]`, this gives us the last entry of the *first* dataset, rather than the last entry of the *last* dataset, as we would expect.
This is a simple fix to support the expected behavior for negative indices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15756
Reviewed By: ezyang
Differential Revision: D14081811
Pulled By: fmassa
fbshipit-source-id: a7783fd3fd9e1a8c00fd076c4978ca39ad5a8a2a
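A minimal plain-Python sketch of the fix (names are illustrative, not the exact torch.utils.data implementation): negative indices are normalized against the total length before the usual dispatch to the per-dataset offsets.

```python
import bisect

class ConcatDataset:
    """Sketch: cumulative lengths pick the sub-dataset; negative indices
    wrap against the total length first."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        self.cumulative = []
        total = 0
        for d in self.datasets:
            total += len(d)
            self.cumulative.append(total)

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)                 # the fix: wrap before dispatch
            if idx < 0:
                raise IndexError("index out of range")
        ds = bisect.bisect_right(self.cumulative, idx)
        prev = self.cumulative[ds - 1] if ds > 0 else 0
        return self.datasets[ds][idx - prev]

cat = ConcatDataset([[1, 2, 3], [4, 5]])
assert cat[-1] == 5          # last entry of the *last* dataset
assert cat[0] == 1 and cat[4] == 5
```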
Summary:
Add an end-to-end test for AveragePool with count_include_pad.
We can export AveragePool from PyTorch with the count_include_pad attribute; however, we don't directly support it in Caffe2's ONNX backend.
We also want to check whether the end-to-end test for the average pool operator with the count_include_pad attribute passes (pytorch => onnx => caffe2).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17034
Reviewed By: houseroad
Differential Revision: D14060186
Pulled By: dwarakrajagopal
fbshipit-source-id: 10dae532611c71f8c8cfc3fa701cc7c1c1c02695
Summary:
Since we don't do tmp_install any more it's better to include all necessary headers.
cc kostmo for better suggestions of how to list all headers here
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16890
Differential Revision: D14079848
Pulled By: dzhulgakov
fbshipit-source-id: 4522c80d05e5d91f99f6700cde46cac559330d28
Summary:
Some legacy TH code was relying on alloc to throw when called with a negative number!!! E.g. `torch.linspace(0, 1, -1)`. And it breaks the ASAN build. I still believe alloc should receive size_t, but I added a safety enforce inside.
It should fix ASAN. I'll follow up with a proper fix for empty_cpu (which is probably the right place to do it) separately
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17071
Differential Revision: D14074157
Pulled By: dzhulgakov
fbshipit-source-id: 3ed3bdb873e446edecb558e1df491310fd7179e3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16275
Adding a generic string `metadata` field as part of the model to capture additional metadata with the model.
Reviewed By: dzhulgakov
Differential Revision: D13579029
fbshipit-source-id: 7456ef2edbe73bb70bbb31889cecd94e0db329a2
Summary:
libshm_manager doesn't need to depend on all of libtorch. It only uses tiny tempfile.h which can be moved to c10. I could just duplicate the file too, but it's not worth it as c10 is small enough.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17019
Differential Revision: D14052688
Pulled By: dzhulgakov
fbshipit-source-id: 8797d15f8c7c49c49d40b7ab2f43aa3bf6becb0c
Summary:
Currently the converters are very straightforward, i.e. there is no code for trying to
preserve semantics; we purely perform conversion from one format to another.
Two things that we might want to add/change:
1. Add semantic conversion as well (but probably it would be a good idea to keep
it separate as a temporary thing).
2. Make sure we don't mess with value names, as they are crucial for current
uses of NetDefs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16967
Differential Revision: D14062537
Pulled By: ZolotukhinM
fbshipit-source-id: 88b184ee7276779e5e9152b149d69857515ad98a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16929
Separate CPU reduce functions from math
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13999469
fbshipit-source-id: bd628b15a6e3c1f04cc62aefffb0110690e1c0d1
Summary:
For >2D input, the code previously used the static shape captured during tracing and reshaped before/after `Gemm`.
Now we add `-1` to the first `Reshape`, and use `Shape(X) => Slice(outer) => Concat(with -1 for inner) => Reshape` for the second.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16184
Differential Revision: D14070754
Pulled By: ezyang
fbshipit-source-id: 86c69e9b254945b3406c07e122e57a00dfeba3df
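Why emitting `-1` removes the traced batch size can be sketched with a small shape-inference helper (a hedged illustration, not the ONNX runtime's code): one target dimension may be `-1`, and it is inferred from the total element count, so the inner dims stay fixed while the batch dim is free.

```python
from functools import reduce
import operator

def infer_reshape(shape, target):
    """Resolve a single -1 in `target` from the total element count of
    `shape`, as Reshape semantics allow."""
    numel = reduce(operator.mul, shape, 1)
    known = reduce(operator.mul, (d for d in target if d != -1), 1)
    return tuple(numel // known if d == -1 else d for d in target)

# A (batch, 4, 5) input flattened for Gemm: inner dims fixed, batch free.
assert infer_reshape((2, 4, 5), (-1, 20)) == (2, 20)
assert infer_reshape((7, 4, 5), (-1, 20)) == (7, 20)   # any batch size works
```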
Summary:
This PR fixes following issue: https://github.com/pytorch/pytorch/issues/16828
It is a combination of two things:
1) MKLDNN streams are not thread-safe but are currently shared between different threads. This change makes them thread_local
2) By default MKLDNN primitives can share global memory and can't be invoked from multiple threads. This PR enables the MKLDNN_ENABLE_CONCURRENT_EXEC cmake configuration option that makes them thread-safe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17022
Differential Revision: D14069052
Pulled By: ezyang
fbshipit-source-id: f8f7fcb86c40f5d751fb35dfccc2f802b6e137c6
Summary:
This removes curly braces from the outputs (we have indentation to indicate scopes), also adds ':' after graph and blocks declaration and removes ';' from the return line. ".expect" tests are updated to keep up with it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16986
Differential Revision: D14062540
Pulled By: ZolotukhinM
fbshipit-source-id: 7f8e2d11619152a21ef7f1f7f8579c49392c3eca
Summary:
Previously, the ChunkBuffer depended on the remaining chunk count to signal the end of data loading. This does not work with distributed samplers, where each sampler only loads a subset of chunks. This refactor removes the ChunkBuffer's dependency on the remaining chunk count.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16868
Differential Revision: D14066517
Pulled By: goldsborough
fbshipit-source-id: 293dfe282ceff326dff0876c2f75c2ee4f4463e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16824
There was a big wooly yak getting the deprecated macros to work.
Gory details are in Deprecated.h
Reviewed By: smessmer
Differential Revision: D13978429
fbshipit-source-id: f148e5935ac36eacc481789d22c7a9443164fe95
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17027
Glow doesn't support the second output of Reshape right now, and it's unused. For correctness, we do make sure that the second output of Reshape is of Constant type during bound shape inference.
Reviewed By: ipiszy
Differential Revision: D14056555
fbshipit-source-id: f39cca7ba941bf5a5cc3adc96e2b1f943cc0be93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16942
We can fold col offsets into bias if zero point of activation is constant.
fbgemm still needs to provide an option to pass col offsets in case the zero point of the activation keeps changing (e.g., dynamic quantization).
A trick to optimize the static quantization case is setting the A zero point to 0 after folding it into the bias.
This diff also optimizes the case when weights use symmetric quantization: when the B zero point is 0, we use PackAMatrix instead of PackAWithRowOffset.
TODO:
Ideally, PackAWithRowOffset should perform as fast as PackAMatrix when B_zero_point is 0 to make client code simpler
Same in PackAWithIm2Col and depth-wise convolution (group convolution is already doing this)
Reviewed By: csummersea
Differential Revision: D14013931
fbshipit-source-id: e4d313343e2a16a451eb910beed30e35de02a40c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16691
Previous diffs already introduced a macro that registers caffe2 CPU kernels with c10.
This now also registers the CUDA kernels with it.
Reviewed By: bwasti
Differential Revision: D13901619
fbshipit-source-id: c15e5b7081ff10e5219af460779b88d6e091a6a6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/17025
Extract ShapeInfo and some util functions into a separate file.
Reviewed By: yinghai
Differential Revision: D14017432
fbshipit-source-id: 201db46bce6d52d9355a1a86925aa6206d0336bf
Summary:
Fix issue #12174 for Mac OSX.
PS: This is a duplicate of PR #16968 that got messed up. Sorry for the confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16999
Differential Revision: D14050669
Pulled By: zou3519
fbshipit-source-id: a4594c03ae8e0ca91a4836408b6c588720162c9f
Summary:
This fixes the segfault.
Changelog:
- Modify the function calls in LegacyDefinitions for `geqrf_out` and `ormqr_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16964
Differential Revision: D14025985
Pulled By: gchanan
fbshipit-source-id: aa50e2c1694cbf3642273ee14b09ba12625c7d33
Summary:
The second input (`lengths`) is not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16727
Differential Revision: D14054105
Pulled By: houseroad
fbshipit-source-id: 36b8d00460f9623696439e1bd2a6bc60b7bb263c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16985
These statements were causing some redundant allocations + copying, so I cleaned
them up
Reviewed By: zdevito, wanchaol
Differential Revision: D14031067
fbshipit-source-id: f760fb29a2561894d52a2663f557b3e9ab1653de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16997
1. Don't create multiple AdjustBatch ops for the same input name. We create it once and hook input to abc_post_adjust_batch.
2. Dangling tensor. The problem for such an error is still with AttachAdjustBatchOp. Considering such as net
```
op {
type : "Relu"
input: "X"
output: "Y"
}
op {
type : "Relu"
input: "Y"
output: "Y2"
}
external_output: "Y"
external_output: "Y2"
```
In this net, the output of the first Relu is used both as an internal node and as an external output. We cannot simply rename Y into Y_pre_batch_adjust. Basically, we need another pass to check all the inputs of the ops in the net and rename Y into Y_pre_batch_adjust.
Reviewed By: bertmaher
Differential Revision: D14041446
fbshipit-source-id: f6553e287a8dfb14e4044cc20afaf3f290e5151b
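The renaming pass described above can be sketched in Python (a toy version operating on ops as dicts; `rewrite_shared_output` and the `_pre_batch_adjust` suffix handling are simplified assumptions, not the Caffe2 implementation):

```python
def rewrite_shared_output(ops, name):
    """Toy pass: `name` is both an external output and an input to later ops.

    Rename the tensor everywhere inside the net, then append an op that
    re-materializes the original (external) name from the renamed tensor,
    so consumers and external outputs both stay valid.
    """
    internal = name + "_pre_batch_adjust"
    for op in ops:
        op["inputs"] = [internal if t == name else t for t in op["inputs"]]
        op["outputs"] = [internal if t == name else t for t in op["outputs"]]
    # Re-expose the original name as an output of the adjust op.
    ops.append({"type": "AdjustBatch", "inputs": [internal], "outputs": [name]})
    return ops
```

Applied to the two-Relu net above, both the producer's output and the second Relu's input become `Y_pre_batch_adjust`, and the appended op restores `Y`.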
Summary:
Closes #16983
Remove backticks that are being interpreted by the shell. Add the `-e` option to the bash script to avoid such failures in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16984
Reviewed By: yf225
Differential Revision: D14039128
Pulled By: kostmo
fbshipit-source-id: c31a1895377ca86c1b59e79351843cc8c4fd7de3
Summary:
The use case for making this PR is the following bug :
(with F = torch.nn.functional)
`F.max_pool2d.__module__` is `torch._jit_internal`
`F.max_pool2d.__name__` is `fn`
With this PR you get:
`F.max_pool2d.__module__` is `torch.nn.functional`
`F.max_pool2d.__name__` is `max_pool2d`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16922
Differential Revision: D14020053
Pulled By: driazati
fbshipit-source-id: c109c1f04640f3b2b69bc4790b16fef7714025dd
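The usual way to preserve `__module__`/`__name__` through a wrapper is `functools.wraps`; a minimal sketch (the `weak_script` decorator here is a stand-in, not the real one in `torch._jit_internal`):

```python
import functools


def weak_script(fn):
    """Wrap a function while keeping its metadata intact.

    functools.wraps copies __module__, __name__, __doc__, etc. from the
    wrapped function onto the wrapper, so introspection sees the original.
    """
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


@weak_script
def max_pool2d(x):
    return x
```

Without `functools.wraps`, `max_pool2d.__name__` would report the wrapper's name (`wrapper`) instead of `max_pool2d`.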
Summary:
gchanan pointed out in https://github.com/pytorch/pytorch/pull/16389 that `allow_inf` is treating `-inf` and `inf` as equal. This fixes it.
Also fixing #16448 since it's nearby and 2.1 has been released.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16959
Differential Revision: D14025297
Pulled By: gchanan
fbshipit-source-id: 95348309492e7ab65aa4d7aabb5a1800de66c5d6
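A sign-aware comparison along the lines of the fix can be sketched as (`close_allowing_inf` is a hypothetical helper, not the actual test utility):

```python
import math


def close_allowing_inf(a, b, tol=1e-5):
    """Compare two floats, allowing infinities.

    Infinities only match when the signs agree (so inf != -inf);
    finite values are compared with an absolute tolerance.
    """
    if math.isinf(a) or math.isinf(b):
        return a == b  # inf == inf, -inf == -inf, but inf != -inf
    return abs(a - b) <= tol
```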
Summary:
Previously, we used the templated class directly to provide
implementations. However, there is a subtle difference
between this, and CUDAStreamGuard: CUDAStreamGuard has refined types
for the Streams it returns. This led to a compilation failure
of the HIPified ddp.cpp. This commit lines them up more closely,
at the cost of copy-paste.
A possible alternate strategy would have been to extend the
InlineDeviceGuard templates to optionally accept refinements
for Stream. I leave this for future work.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16978
Differential Revision: D14045346
Pulled By: ezyang
fbshipit-source-id: 2b101606e62e4db588027c57902ea739a2119410
Summary:
This is needed to check for wrong arguments or --help options
before `build_deps()` is executed. Otherwise command line arguments
are not parsed and checked until `setup()` is run.
Fixes: #16707
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16914
Differential Revision: D14041236
Pulled By: soumith
fbshipit-source-id: 41f635772ccf47f05114775d5a19ae04c495ab3b
Summary:
Fixes the bug for when tensor is created on Caffe2 side, then passed to PT and resized. Now we just initialize allocator correctly.
Note that the code in raw_mutable_data() is still necessary because of non-resizable tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16857
Reviewed By: houseroad
Differential Revision: D14019469
Pulled By: dzhulgakov
fbshipit-source-id: 14d3a3b946d718bbab747ea376903646b885706a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16932
During the onnxifi transformation, the net's SSA is rewritten. At the last step the weight
names are changed back to what they were before. This diff keeps the weight
names unchanged throughout the process.
Reviewed By: yinghai
Differential Revision: D13972597
fbshipit-source-id: 7c29857f788a674edf625c073b345f2b44267b33
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16965
Instead of having one large templated function to wrap the caffe2 op, minimize the amount of templated code.
Non-templated code can be reused between different operators and decreases binary size.
Reviewed By: orionr
Differential Revision: D14018806
fbshipit-source-id: bedd4152eec21dd8c5778446963826316d210543
Summary:
Fixes #16577.
This greatly improves the memory efficiency of certain ops like Dropout2d. Previously, they were implemented as `input * mask` where mask never requires_grad, but we didn't use that knowledge in forward, and (in case of an in-place dropout) kept input.clone() for the backward, where it would simply get ignored.
This patch tries to address this situation by emitting some guards for stores like this, but only if they are as simple as checking whether a single value requires_grad.
Interestingly, the same optimizations apply to methods like bmm, baddbmm, etc., but _not to mm nor addmm_, because of how their derivatives are defined. Apparently they unnecessarily use `mat1` to compute the derivative of `mat1` just to improve the error message in case `mat1` was sparse. I'd like to apply this optimization to that case too, but I don't want to lose the nicer error message, so if anyone has any ideas for solutions, please let me know...
Full list of operators affected by this patch:
* _nnpack_spatial_convolution
* addbmm
* addcdiv
* addcmul
* addmv
* addr
* baddbmm
* bmm
* cross
* div
* dot
* fmod
* ger
* index_add_
* mul
* mv
* scatter_add_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16583
Differential Revision: D13900881
Pulled By: gchanan
fbshipit-source-id: dd0aeb2ab58c4b6aa95b37b46d3255b3e014291c
Summary:
In VariableType.cpp, when a function modifies its input tensors, it should only change the input tensors' storage data in-place, and should never change the input tensors' storage pointers. This PR adds checks for this, and also fixes functions that fail this test.
This is part of the Variable/Tensor merge work (https://github.com/pytorch/pytorch/issues/13638).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16305
Differential Revision: D13897855
Pulled By: yf225
fbshipit-source-id: 0c4fc7eb530d30db88037b1f0981f6f8454d3b79
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16906
In C++11, constexpr implies const, so these methods actually wouldn't be rvalue overloads as intended but const rvalue overloads.
Let's only apply the constexpr flag in C++14 to be safe.
Reviewed By: bddppq
Differential Revision: D13998486
fbshipit-source-id: a04d17ef0cc8f45e3d0a1ca9843d194f4f0f6f7f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16399
Catching cudaError_t return values in a few places, because it's nodiscard in rocm. Unless we add -Wno-unused-result, it'll end up with a compilation error.
Also in c10/cuda/test, check whether a host has GPU or not. We were silently throwing out the error before (so not really testing the cuda api).
Reviewed By: bddppq
Differential Revision: D13828281
fbshipit-source-id: 587d1cc31c20b836ce9594e3c18f067d322b2934
Summary:
Here is a stab at implementing an option to zero out infinite losses (and NaN gradients).
It might be nicer to move the zeroing to the respective kernels.
The default is currently `False` to mimic the old behaviour, but I'd be half inclined to set the default to `True`, because the behaviour wasn't consistent between CuDNN and Native anyways and the NaN gradients aren't terribly useful.
This topic seems to come up regularly, e.g. in #14335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16199
Differential Revision: D14020462
Pulled By: ezyang
fbshipit-source-id: 5ba8936c66ec6e61530aaf01175dc49f389ae428
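The zeroing behavior can be illustrated with a toy helper (plain Python, not the CUDA/native kernels; `zero_infinite_losses` is hypothetical):

```python
import math


def zero_infinite_losses(losses):
    """Replace non-finite per-sample losses with 0.

    Infinite losses (e.g. from impossible CTC alignments) and any NaNs
    are zeroed so they don't poison the batch mean or its gradients.
    """
    return [l if math.isfinite(l) else 0.0 for l in losses]
```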
Summary:
Merge binaries "convert_image_to_tensor" and "caffe2_benchmark" to remove the overhead of writing to/reading from Tensor file.
*TODO next: TensorProtos is another overhead. No need for de-serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16875
Reviewed By: sf-wind
Differential Revision: D13997726
Pulled By: ZhizhenQin
fbshipit-source-id: 4dec17f0ebb59cf1438b9aba5421db2b41c47a9f
Summary:
I'm seeing a bunch of apt gpg key errors on CI with the following message:
```
An error occurred during the signature verification. The repository is not
updated and the previous index files will be used. GPG error:
https://packagecloud.io trusty InRelease: The following signatures couldn't
be verified because the public key is not available:
NO_PUBKEY 4E6910DFCB68C9CD
```
Most of the times apt will reuse the old cached version, but sometimes this results in a build failure: https://circleci.com/gh/pytorch/pytorch/758366?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link.
This should hopefully fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16961
Differential Revision: D14028151
Pulled By: ezyang
fbshipit-source-id: 7648a0a58ece38d8d04916937a9fa17f34f8833e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16911
I think the Thrust package has what we want for /opt/rocm/include/thrust. We can probably stop patching it now.
Reviewed By: bddppq
Differential Revision: D14015177
fbshipit-source-id: 8d9128783a790c39083a1b8b4771c2c18bd67d46
Summary:
Hi,
caffe2/operators/quantized/int8_given_tensor_fill_op.cc expects the value array to be named "values", but the operator schema describes "value" (no s). I guess it is a small typo, but it made me lose a bit of time before understanding why I got this error when passing "value" instead of "values":
```
[F int8_given_tensor_fill_op.h:95] Check failed: output->t.numel() == values_.numel() output size: 3 given size: 0
Aborted (core dumped)
```
Thanks,
Eyyüb Sari
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16204
Differential Revision: D14020476
Pulled By: ezyang
fbshipit-source-id: a8a46bfc44ec125e7925ce4b7c79fdf99c890a50
Summary:
Instead of converting the sparse matrix from COO to CSR format as in the original implementation, my revision directly uses the COO format for sparse-dense matrix multiplication.
On my Linux machine it is about 5 times faster than the original code:
```
(original code)
SIZE: 15000 DENSITY: 0.01 DEVICE: cpu
torch: 0.39403 seconds
np: 0.00496674 seconds
torch/np: 79.3338
----------------------------------------
(my update)
SIZE: 15000 DENSITY: 0.01 DEVICE: cpu
torch: 0.0812583 seconds
np: 0.00501871 seconds
torch/np: 16.1911
```
Further code feedback and running-time tests are very welcome. I will keep revising my code as needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16905
Differential Revision: D14020095
Pulled By: ezyang
fbshipit-source-id: 4ab94075344a55b375f22421e97a690e682baed5
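The COO-direct multiplication idea can be sketched in pure Python (a naive reference implementation, not the optimized C++ kernel):

```python
def coo_matmul(rows, cols, vals, dense, out_rows, out_cols):
    """Sparse (COO) x dense matmul by iterating the nonzeros directly.

    rows/cols/vals are the parallel COO arrays of the sparse matrix;
    `dense` is a list of rows. Each nonzero (r, c, v) contributes
    v * dense[c] to output row r — no CSR conversion needed.
    """
    out = [[0.0] * out_cols for _ in range(out_rows)]
    for r, c, v in zip(rows, cols, vals):
        drow = dense[c]
        orow = out[r]
        for j in range(out_cols):
            orow[j] += v * drow[j]
    return out
```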
Summary:
Renewed attempt at https://github.com/pytorch/pytorch/pull/14171
From the original PR:
> Currently, the pin_memory_batch function in the dataloader will return a batch comprised of any unrecognized type without pinning the data, because it doesn't know how.
>
>This behavior was preventing us from overlapping data prefetching in Mask-RCNN, whose custom collate_fn returns a custom batch type.
The old PR allowed the user to implement batch pinning for custom batch and data types by passing a custom pin function to the dataloader. slayton58 suggested a cleaner approach: allow the user to define a `pin_memory` method on their custom types, and have `pin_memory_batch` [check for the presence of that method](https://github.com/pytorch/pytorch/pull/16743/files#diff-9f154cbd884fe654066b1621fad654f3R56) in the incoming batch as a fallback. I've updated the test and docstrings accordingly.
The old PR was merged but then reverted due to weird cuda OOM errors on windows that may or may not have been related. I have no idea why my changes would cause such errors (then or now) but it's something to keep an eye out for.
fmassa and yf225 who were my POCs on the old PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16743
Differential Revision: D13991745
Pulled By: ezyang
fbshipit-source-id: 74e71f62a03be453b4caa9f5524e9bc53467fa17
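The duck-typed fallback can be sketched as follows (a simplified, hypothetical version of the dataloader helper — the real one also handles tensors, dicts, and strings):

```python
def pin_memory_batch(batch):
    """Recurse into known containers; for an unrecognized type, use its
    own pin_memory() method if it defines one, else return it untouched."""
    if isinstance(batch, (list, tuple)):
        return type(batch)(pin_memory_batch(b) for b in batch)
    if hasattr(batch, "pin_memory"):
        return batch.pin_memory()
    return batch


class CustomBatch:
    """Example custom batch type opting into pinning via a pin_memory method."""

    def __init__(self):
        self.pinned = False

    def pin_memory(self):
        self.pinned = True
        return self
```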
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/16233
The following changes are made:
- Modify `TupleType` to store optional field names
- Modify schema matching to fill in those field names when creating `TupleType` as the return type.
- Modify codegen of JIT to copy field names to schema string
- Modify `SchemaParser` to set field names of returned schema.
- Modify `SimpleValue::attr` to emit tuple indexing for named tuple.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16253
Reviewed By: ezyang
Differential Revision: D13954298
Pulled By: zdevito
fbshipit-source-id: 247d483d78a0c9c12d1ba36e1f1ec6c3f1a3007b
Summary:
I found a few sentences in DataParallel docstring confusing, so I suggest this enhancement.
- Arbitrary arguments are allowed to be passed .... *INCLUDING* tensors (Not *EXCLUDING*)
- The original author said that "other types" are shallow-copied but I think actually only some builtin types are (effectively) shallow-copied. And "other types" are shared. Here is an example.
```python
import torch
from torch.nn import Module, DataParallel
from collections import deque
class MyModel(Module):
def forward(self, x):
x.append(None)
model = MyModel(); model.cuda()
model = DataParallel(model)
d = deque()
model.forward(d)
print(d)
```
This is a side note.
As far as I know, copying objects is not an especially frequent operation in Python, unlike in some other languages. Notably, no copying is involved in assignment or function parameter passing. They are only name bindings, and that is the whole point of the "everything is an object" Python philosophy, I guess. If one keeps this in mind, it may help when dealing with things like multithreading.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15993
Differential Revision: D14020404
Pulled By: ezyang
fbshipit-source-id: a38689c94d0b8f77be70447f34962d3a7cd25e2e
Summary:
This PR is a simple fix for the mistake in the "tensor" and "torch.Tensor" doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16842
Differential Revision: D14020300
Pulled By: ezyang
fbshipit-source-id: 3ab04f1223d6e60f8da578d04d759e385d23acbb
Summary:
This changes the libnvToolsExt dependency to go through CMake find_library.
I have a machine where cuda libs, and libnvToolsExt in particular, are in the "usual library locations". It would be neat if we could find libnvToolsExt and use the path currently hardcoded as default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16714
Differential Revision: D14020315
Pulled By: ezyang
fbshipit-source-id: 00be27be10b1863ca92fd585f273d50bded850f8
Summary:
The documentation for LogSigmoid says:
> Applies the element-wise function:
> \<blank\>
Now the documentation properly displays the math string.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16900
Differential Revision: D14020097
Pulled By: ezyang
fbshipit-source-id: 41e229d0fcc6b9bb53367be548bf85286dc13546
Summary:
When Variable and Tensor are merged, the dynamic type of the tensors passed to certain functions will become variables, and expecting `type()` on those variables to still return non-Variable types will cause type mismatch error.
One way to fix this problem is to use the thread-local guard `at::AutoNonVariableTypeMode` to force `type()` to return non-Variable type, but ideally we want to limit the use of `at::AutoNonVariableTypeMode` to be only in VariableType.cpp. Another way to fix the problem is to use `at::globalContext().getNonVariableType()` instead to get the non-Variable type of the tensor, which is what this PR is trying to achieve.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16325
Differential Revision: D14012022
Pulled By: yf225
fbshipit-source-id: 77ef1d2a02f78bff0063bdd72596e34046f1e00d
Summary:
This PR is a simple fix for the mistake in the first note for `torch.device` in the "tensor attributes" doc.

```
>>> # You can substitute the torch.device with a string
>>> torch.randn((2,3), 'cuda:1')
```
Above code will cause error like below:
```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-53-abdfafb67ab1> in <module>()
----> 1 torch.randn((2,3), 'cuda:1')
TypeError: randn() received an invalid combination of arguments - got (tuple, str), but expected one of:
* (tuple of ints size, torch.Generator generator, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool requires_grad)
* (tuple of ints size, Tensor out, torch.dtype dtype, torch.layout layout, torch.device device, bool requires_grad)
```
Simply adding the argument name `device` solves the problem: `torch.randn((2,3), device='cuda:1')`.
However, another concern is that this note seems redundant as **there is already another note covering this usage**:

So maybe it's better to just remove this note?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16839
Reviewed By: ezyang
Differential Revision: D13989209
Pulled By: gchanan
fbshipit-source-id: ac255d52528da053ebfed18125ee6b857865ccaf
Summary:
Post 2.1 release, packing is fixed and alignas works as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16920
Differential Revision: D14018539
Pulled By: bddppq
fbshipit-source-id: 0ed4d9e9f36afb9b970812c3870082fd7f905455
Summary:
It now works post ROCm 2.1 release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16919
Differential Revision: D14018538
Pulled By: bddppq
fbshipit-source-id: c4e1bafb53204a6d718b2d5054647d5715f23243
Summary:
This is the first round of enabling unit tests that work on ROCm 2.1 in my tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16871
Differential Revision: D13997662
Pulled By: bddppq
fbshipit-source-id: d909a3f7dd5fc8f85f126bf0613751c8e4ef949f
Summary:
This was serializing all calls to `addmm` (and any op that used it, in my case `bmm`) in the entire process, and led to downright atrocious performance in the TorchScript threaded runtime. Removing this gives a 2x throughput boost for high-load machine translation inference.
The original justification for this is dubious: there are other `gemm` callsites in the codebase that are not protected by critical sections. And in caffe2 land we never had any issues with nonreentrant BLAS libraries
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16889
Differential Revision: D14008928
Pulled By: jamesr66a
fbshipit-source-id: 498e2133bd6564dba539a2d9751f4e61afbce608
Summary:
Implement the ExpandDims op and fall back to CPU if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15264
Differential Revision: D13808797
Pulled By: yinghai
fbshipit-source-id: 7795ec303a46e85f84e5490273db0ec76e8b9374
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16721
The very key line is we have to set the stream to the default
stream before calling the allocator. This is very interesting.
It shouldn't be necessary, but seemingly is!
Reviewed By: dzhulgakov
Differential Revision: D13943193
fbshipit-source-id: c21014917d9fe504fab0ad8abbc025787f559287
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16720
I'm taking the deduplication slowly because there is something here
that is causing problems, and I want to figure out what it is.
Reviewed By: dzhulgakov
Differential Revision: D13943194
fbshipit-source-id: cbc08fee5862fdcb393b9dd5b1d2ac7250f77c4b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16615
This is another go at landing https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
The difference between this and the previous PR is that this
version faithfully maintains the binding code; in particular,
we end up with a SECOND copy of the caching allocator in
this patch. I verified that this code does NOT cause a crash
in the workflow we canaried last time.
In further diffs, I plan to eliminate the second copy, and then
adjust the binding code.
Reviewed By: dzhulgakov
Differential Revision: D13901067
fbshipit-source-id: 66331fd4eadffd0a5defb3cea532d5cd07287872
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16867
Some caffe2 operators (for example, BBoxTransform) don't have just one template parameter (the context) but might have multiple template parameters.
Because of this, we can't handle the context parameter inside the macro.
Reviewed By: bwasti
Differential Revision: D13995696
fbshipit-source-id: f55c3be913c8b125445a8d486846fc2fab587a63
Summary:
This PR implements:
1. a fix to issue #12174 - determine the location of cudnn library using `ldconfig`
2. a fix to determine the installed conda packages (in recent versions of conda, the command `conda` is a Bash function that cannot be called from within a python script, so we use the CONDA_EXE environment variable instead)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16859
Differential Revision: D14000399
Pulled By: soumith
fbshipit-source-id: 905658ecacb0ca0587a162fade436de9582d32ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16855
Save the histogram of each net to a separate file
Reviewed By: jspark1105
Differential Revision: D13991610
fbshipit-source-id: a5be4e37a5e63567dcd7fdf99f451ee31bb350a5
Summary:
Some batched updates:
1. bool is a type now
2. Early returns are allowed now
3. The beginning of an FAQ section with some guidance on the best way to do GPU training + CPU inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16866
Differential Revision: D13996729
Pulled By: suo
fbshipit-source-id: 3b884fd3a4c9632c9697d8f1a5a0e768fc918916
Summary:
During tracing, we record `aten::_convolution` rather than `aten::convolution`. The schema for the former was not present in the shape analysis pass, which resulted in some missing shape information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16837
Differential Revision: D13993831
Pulled By: jamesr66a
fbshipit-source-id: ebb63bf628d81613258caf773a3af5930303ce5a
Summary:
* we do not need EAP packages any longer as the antistatic feature is now in the release
* consistently install the rccl package
* Skip one unit test that has regressed with 2.1
* Follow-up PRs will use 2.1 features once deployed on CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16808
Differential Revision: D13992645
Pulled By: bddppq
fbshipit-source-id: 37ca9a1f104bb140bd2b56d403e32f04c4fbf4f0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16834
Inserting AdjustBatch ops will possibly change the names of the input/output, so we need to create a mapping and use the renamed names for external_inputs/outputs and input_shape_info for the onnxifi_net.
Reviewed By: ipiszy
Differential Revision: D13982731
fbshipit-source-id: c18b8a03d01490162929b2ca30c182d166001626
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16643
The test was disabled in D13908117 because it conflicted with another diff that was about to land.
Now fixed the merge conflict and re-landing it.
Reviewed By: ezyang
Differential Revision: D13911775
fbshipit-source-id: b790f1c3a3f207916eea41ac93bc104d011f629b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16548
With this macro, a caffe2 operator can now directly be registered with c10.
No need to write custom wrapper kernels anymore.
Differential Revision: D13877076
fbshipit-source-id: e56846238c5bb4b1989b79855fd44d5ecf089c9c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16835
We were using the label type `multi_label_dense` to denote both 1) a dense representation of integer labels and 2) embedding labels with floating-point values.
This causes some issues since the two cases have different assumptions; for example, for integer labels we check whether the label value is in [0, number_class - 1], but such a check should be skipped for embedding labels.
Reviewed By: BIT-silence
Differential Revision: D13985048
fbshipit-source-id: 1202cdfeea806eb47647e3f4a1ed9c104f72ad2c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16739
Some ATen code locations seemed to use int, etc. incorrectly where either
int64_t or size_t was required. Update them to use int64_t for dimension indexing where necessary.
Reviewed By: ezyang
Differential Revision: D13950124
fbshipit-source-id: aaf1cef783bf3c657aa03490f2616c35c816679f
Summary:
Discussed with zdevito and we want to use Variable (with `set_requires_grad(false)`) instead of Tensor in all parts of JIT, to eliminate the distinction and the conceptual overhead when trying to figure out which one to use.
This also helps with the Variable/Tensor merge work tracked at https://github.com/pytorch/pytorch/issues/13638, which will make common functions (such as `numel()` / `sizes()` / `dim()`) on Variable much faster when finished.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16596
Differential Revision: D13979971
Pulled By: yf225
fbshipit-source-id: c69119deec5bce0c22809081115f1012fdbb7d5a
Summary:
List of changes:
- Always push the final state of the doc build docker for debugging purposes.
- Adds code for the stable doc build. This code is never actually run on master, only the v1.0.1 branch. There is a big note for this behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16503
Differential Revision: D13972469
Pulled By: zou3519
fbshipit-source-id: 68f459650ef0de200a34edd43fc1372143923972
Summary:
This PR is a follow up of #15460, it did the following things:
* remove the undefined tensor semantic in jit script/tracing mode
* change ATen/JIT schema for at::index and other index related ops with `Tensor?[]` to align with what at::index is really doing and to adopt `optional[tensor]` in JIT
* change python_print to correctly print the exported script
* register both TensorList and ListOfOptionalTensor in JIT ATen ops to support both
* Backward compatibility for `torch.jit.annotate(Tensor, None)`
List of follow ups:
* remove the undefined tensor semantic in jit autograd, autodiff and grad_of
* remove prim::Undefined fully
For easy reviews, please turn on `hide white space changes` in diff settings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16379
Differential Revision: D13855677
Pulled By: wanchaol
fbshipit-source-id: 0e21c14d7de250c62731227c81bfbfb7b7da20ab
Summary:
This adds calls to `super().__init__()` in three classes in torch.distributions.
This is needed when `Distribution` and `Transform` objects are used with multiple inheritance, as e.g. combined with `torch.nn.Module`s. For example
```py
class MyModule(torch.distributions.Transform, torch.nn.Module):
...
```
cc martinjankowiak esling who have wanted to use this pattern, e.g. in #16756
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16772
Differential Revision: D13978633
Pulled By: soumith
fbshipit-source-id: 8bc6cca1747cd74d32135ee2fe588bba2ea796f1
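The cooperative-init pattern the fix enables looks like this with toy classes (stand-ins for `Transform` and `nn.Module`, not the real torch classes):

```python
class Transform:
    """Stand-in base; super().__init__() keeps the MRO chain going."""

    def __init__(self):
        self.cache = {}
        super().__init__()  # without this call, Module.__init__ is skipped


class Module:
    """Stand-in for the other base in a multiple-inheritance setup."""

    def __init__(self):
        self.training = True
        super().__init__()


class MyTransformModule(Transform, Module):
    pass
```

With the `super().__init__()` calls in place, constructing `MyTransformModule` runs both bases' initializers in MRO order; dropping the call in `Transform` would leave `training` unset.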
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16811
As the title. The AdjustBatch ops will be inserted before and after the Onnxifi op to:
1) adjust batch/seq sizes to the ideal batch/seq size before these tensors are processed by the Onnxifi op;
2) adjust batch size to the original batch size for batches generated by the Onnxifi op.
Reviewed By: yinghai
Differential Revision: D13967711
fbshipit-source-id: 471b25ae6a60bf5b7ebee1de6449e0389b6cafff
Summary:
Emphasize that the docker build should be triggered from the pytorch repo directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16812
Differential Revision: D13985531
Pulled By: soumith
fbshipit-source-id: c6511d1e81476eb795b37fb0ad23e8951dbca617
Summary:
Update launch configs for TensorIterator gpu_reduce_kernel. Enable flexible
block dimension to improve efficiency for reduction cases with small fast
dimension.
Previously TensorIterator launches blocks with fixed 32x16 threads.
For cases like:
import torch
torch.randn(2**20, 4, device='cuda').sum(0)
The fixed launch config does not handle coalesced memory access efficiently.
The updated launch config enables flexible block dimensions. Combined with the
improved reduction scheme (using flexible vertical / horizontal reduction
instead of limited warp / block reduction in the old code), it ensures optimal
memory access pattern even with reduction on dimension with small stride.
Possible future improvements:
1. Precise dynamic shared memory allocation.
2. Using warp shuffle for vertical (block_y) reduction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16224
Differential Revision: D13806753
Pulled By: soumith
fbshipit-source-id: 37e45c7767b5748cf9ecf894fad306e040e2f79f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16535
There is now no need anymore to define the layer norm schema in a central location.
It can just be defined in caffe2 next to the kernel implementation.
Reviewed By: ezyang
Differential Revision: D13869503
fbshipit-source-id: c478153f8fd712ff6d507c794500286eb3583149
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16534
All c10 ops from the c10 dispatcher are now automatically registered with JIT
Reviewed By: dzhulgakov
Differential Revision: D13869275
fbshipit-source-id: 5ab5dec5b983fe661f977f9d29d8036768cdcab6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16676
This op is used for changing batch size (first dimension) of the tensor.
Reviewed By: bertmaher, ipiszy
Differential Revision: D13929200
fbshipit-source-id: 4f2c3faec072d468be8301bf00c80d33adb3b5b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16820
Sometimes parsing the histogram was not working correctly due to changes in D13633256.
We need to call istringstream::clear() after str().
Reviewed By: csummersea
Differential Revision: D13977509
fbshipit-source-id: ce3e8cb390641d8f0b5c9a7d6d6daadffeddbe11
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16785
There's no EIGEN engine implemented for DeformConv but unit test was checking it.
Reviewed By: BIT-silence
Differential Revision: D13967306
fbshipit-source-id: e29c19f59f5700fc0501c59f45d60443b87ffedc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16597
This diff fixes some bugs in shape inference for `SparseLengthsSumFused8BitRowwise`. It also adds input shape inference for `Concat` when `add_axis=1`.
Reviewed By: bertmaher
Differential Revision: D13892452
fbshipit-source-id: 6cd95697a6fabe6d78a5ce3cb749a3a1e51c68e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16765
Code changes required to build caffe2 for windows with toolchain used by FB.
Reviewed By: orionr
Differential Revision: D13953258
fbshipit-source-id: 651823ec9d81ac70e32d4cce5bc2472434104733
Summary:
Adding torch/lib64 in .gitignore so that a git status --porcelain
check during CI build and test passes for ppc64le. During build
torch/lib64 is created and contains third-party libraries. This
should be ignored by the porcelain check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16782
Differential Revision: D13972794
Pulled By: ezyang
fbshipit-source-id: 5459c524eca42d396ac46e756a327980b4b1fa53
Summary:
Since pip 18.0 (2018-07-22), `legacy` is no longer a valid choice for `pip list --format` as can be seen in the [Release Notes](https://pip.pypa.io/en/stable/news/#id62). Therefore, the options now are: `columns`, `freeze` and `json`. With `legacy`, this is how it looked:
```
[...]
Versions of relevant libraries:
[pip3] numpy (1.16.1)
[pip3] torch (1.0.1)
[pip3] torchvision (0.2.1)
[...]
```
Changing to `freeze`, this is how it looks:
```
[...]
Versions of relevant libraries:
[pip3] numpy==1.16.1
[pip3] torch==1.0.1
[pip3] torchvision==0.2.1
[...]
```
Currently, this is what happens:
```
[...]
Versions of relevant libraries:
[pip] Could not collect
[...]
```
The `freeze` option is also available in old pip, so this change is backwards compatible. Also, if we would like to keep the old style, which I think is not necessary, I could easily change that.
---
In case anyone wants to know how `columns` looks (I prefer `freeze`):
```
[...]
Versions of relevant libraries:
[pip3] numpy 1.16.1
[pip3] torch 1.0.1
[pip3] torchvision 0.2.1
[...]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16798
Differential Revision: D13971793
Pulled By: soumith
fbshipit-source-id: 3721d9079a2afa245e1185f725598901185ea4cd
Summary:
(review top commit only).
As expected, fork/wait introduces some corner cases into the alias analysis. The comments inline should describe the changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16671
Differential Revision: D13963219
Pulled By: suo
fbshipit-source-id: 2bec6fc03a4989cf309fbb9473f3f2ffe2c31431
Summary:
Doubling the sccache timeout from default of 600.
the asan build of #16645 will fail without this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16728
Differential Revision: D13963727
Pulled By: li-roy
fbshipit-source-id: 3614d75c1b46d663fa05b84f99d8a099283a8e64
Summary:
In #16085 , we introduced initial hip-clang bring-up code. Document the use of the __HIP__ macro now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16771
Differential Revision: D13961538
Pulled By: ezyang
fbshipit-source-id: 67f6226abcbe62e2f4efc291c84652199c464ca6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16751
This was made more complicated by the fact that ivalue::IntList
is a thing, so I had to fix all of the sites where we were referring
to IValue post facto.
The following codemods were run, in this order:
```
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntList IntArrayRef
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in IntArrayRef::create IntList::create
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in ivalue::IntArrayRef ivalue::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in Tag::IntArrayRef Tag::IntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in isIntArrayRef isIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in toIntArrayRef toIntList
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'Shared<IntArrayRef>' 'Shared<IntList>'
codemod -m -d . --extensions cc,cpp,cu,cuh,h,hpp,py,cwrap,yaml,in 'intrusive_ptr<IntArrayRef>' 'intrusive_ptr<IntList>'
```
Some manual fixups were done afterwards; they can be reviewed separately
at https://github.com/pytorch/pytorch/pull/16752
Reviewed By: dzhulgakov
Differential Revision: D13954363
fbshipit-source-id: b5c40aacba042402155a2f5a229fa6db7992ac64
Summary:
Adds some operations for dicts to match Python and tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16629
Differential Revision: D13961144
Pulled By: driazati
fbshipit-source-id: b31f27a4320ff62cd118b508fb0a13056535dc7c
Summary:
There is no way to test this until it is merged.
On master jobs that run after a PR is merged, there is no CIRCLE_PR_NUMBER so the binary builds clone pytorch/pytorch/master, which races.
Based off of https://circleci.com/docs/2.0/env-vars/ and the circleci checkout code
```
git config --global url."ssh://git@github.com".insteadOf "https://github.com" || true
git config --global gc.auto 0 || true
if [ -e /home/circleci/project/.git ]
then
cd /home/circleci/project
git remote set-url origin "$CIRCLE_REPOSITORY_URL" || true
else
mkdir -p /home/circleci/project
cd /home/circleci/project
git clone "$CIRCLE_REPOSITORY_URL" .
fi
if [ -n "$CIRCLE_TAG" ]
then
git fetch --force origin "refs/tags/${CIRCLE_TAG}"
else
git fetch --force origin "master:remotes/origin/master"
fi
if [ -n "$CIRCLE_TAG" ]
then
git reset --hard "$CIRCLE_SHA1"
git checkout -q "$CIRCLE_TAG"
elif [ -n "$CIRCLE_BRANCH" ]
then
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B "$CIRCLE_BRANCH"
fi
git reset --hard "$CIRCLE_SHA1"
```
I believe we do not use git tags.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16773
Differential Revision: D13962132
Pulled By: pjh5
fbshipit-source-id: c62d2139f38ff39ecda1509b0bcd8bd102828e40
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16478
This diff includes an example registration of a caffe2 op in torch. A previous attempt ran into a static initialization order bug.
Reviewed By: smessmer
Differential Revision: D13854304
fbshipit-source-id: ec463ce2272126d08a5163d1599361ee5b718bbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16730
With Jerry's new updates, Tensor must be defined; as a result, I've needed to update the shim for caffe2 ops being used in PyTorch.
Reviewed By: smessmer
Differential Revision: D13946950
fbshipit-source-id: 6f77877c61a743f82bdfc2ad04d6ab583000cc18
Summary:
Fixes#16591
This uses uniqueBaseName so that parameters do not end up with suffixes. It changes next_id to be per-base-name rather than global to fix jittering issues when re-importing a re-numbered graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16750
Differential Revision: D13960282
Pulled By: zdevito
fbshipit-source-id: 2156f581d9b95d77bf1f1252074e800b19116555
Summary:
This should enable xla tests thus let master xla tests pass.
As usual, I will add the branch filters back before landing.
Thanks ezyang !
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16761
Differential Revision: D13959746
Pulled By: ailzhang
fbshipit-source-id: 7384da281d093d16edccb4283c74e47ac659eeff
Summary:
I'll test with this really long summary.
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce risus sem, mattis vitae commodo vitae, mattis vel ex. Integer nec consectetur ligula, sit amet ultricies risus. Suspendisse potenti. Donec aliquet quam ante. Donec porttitor justo ligula, ut vestibulum erat facilisis a. Nullam eget lobortis nisi. Aenean quis sem id ante eleifend condimentum nec a lacus. Sed sed dolor augue. Proin feugiat, tellus in eleifend cursus, libero nulla lacinia erat, et efficitur dui odio ut ex. In et sem purus. Proin dictum scelerisque magna, nec feugiat dolor lobortis id. Proin ante urna, ultrices in semper et, pulvinar et dui. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; Mauris ullamcorper neque a pharetra rhoncus.
Aliquam vel semper felis. Integer id massa erat. Morbi leo eros, varius sed viverra eu, dictum nec purus. Fusce vitae mollis sem, non fringilla nulla. Donec tincidunt luctus dolor. Morbi lobortis, magna quis viverra bibendum, lacus tortor pulvinar risus, eu porta tellus nulla vitae dolor. Sed tincidunt, turpis quis facilisis malesuada, nulla eros lobortis lorem, a fermentum mi nisl non quam. Pellentesque vehicula, nisl non eleifend viverra, tellus neque accumsan tellus, id ultricies lacus mi sed sapien. Proin rutrum ultrices quam sit amet euismod. Maecenas vel faucibus libero, nec efficitur mi. Proin felis augue, elementum eget vestibulum non, euismod sed urna. Curabitur purus nisi, interdum nec rutrum id, faucibus nec sapien. Integer consectetur interdum elit, volutpat vulputate velit. Integer et ultricies magna. Fusce blandit lorem urna, quis sodales sapien porttitor in. Nulla nec sodales sem.
Morbi consequat massa sit amet fringilla pretium. Nunc maximus vitae neque auctor pharetra. Morbi gravida feugiat urna, eu sagittis est pulvinar eget. Maecenas ut fermentum ante, eget malesuada neque. In ut maximus magna. Donec nec finibus sapien. Quisque viverra erat lobortis, rhoncus augue sed, hendrerit dui. Donec in feugiat augue, a ultrices justo. Pellentesque rutrum augue sed nulla auctor, a venenatis risus aliquam. Nullam ipsum justo, dictum sit amet elementum eu, eleifend a turpis. Proin ut tellus ut urna volutpat fermentum ac aliquam tellus.
Quisque ultricies est id eros dictum ultrices. Cras eu urna interdum, eleifend felis vitae, vulputate nulla. Cras tincidunt, mi sodales imperdiet tristique, diam odio convallis ligula, ac vulputate enim sapien eu tellus. Phasellus eleifend finibus sapien id ullamcorper. Donec aliquet eleifend consectetur. Proin in nulla venenatis, egestas neque quis, blandit sem. Suspendisse pellentesque arcu vel ligula fermentum maximus. Aliquam non ipsum ut ante pharetra finibus.
Nunc rhoncus purus sit amet risus congue venenatis. Integer id vestibulum neque, et fermentum elit. Nunc sit amet tortor quis mi aliquam vestibulum et in mauris. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Maecenas mollis hendrerit nulla, non tempus neque pharetra ac. Proin commodo bibendum velit, consectetur pretium metus sollicitudin eget. Aliquam malesuada semper tempor. Ut vel vulputate dolor, eu faucibus mauris. Nam commodo quis dolor sit amet eleifend. Phasellus eget massa odio. Donec tempor est at ante finibus lobortis. Suspendisse porttitor imperdiet ultrices. Orci varius natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus.
Nullam id dignissim magna, non suscipit odio. Vestibulum vel maximus erat, suscipit ullamcorper tellus. Fusce egestas augue lorem, in ultricies est vehicula ac. Integer pretium, ex in elementum varius, nisi turpis posuere lectus, nec posuere ligula mi ac ligula. Donec vehicula dolor ut ex elementum, quis scelerisque tellus molestie. Mauris euismod magna ac ornare cursus. Vivamus dapibus quam nec tellus aliquam elementum.
Phasellus ultricies quis augue ut fringilla. Suspendisse eu molestie eros. Suspendisse potenti. Curabitur varius sodales maximus. Etiam nec rutrum est. Sed vulputate suscipit elit, eu condimentum mauris pretium eget. Curabitur convallis commodo dui. Aenean lectus orci, pretium non mi sit amet, commodo imperdiet dui. In hac habitasse platea dictumst. In et ex nisl. Duis justo tortor, finibus at augue vitae, fermentum hendrerit tellus. Donec malesuada justo a molestie posuere. Morbi nisl leo, feugiat ut faucibus ut, mattis id purus.
Vestibulum hendrerit lorem ligula, et ullamcorper nisl lacinia sed. Integer vitae lacinia nunc, sed interdum enim. Aliquam aliquet ipsum vitae eros ornare accumsan. Phasellus venenatis laoreet est, sed feugiat neque lobortis id. Proin pulvinar placerat leo lacinia vehicula. Duis accumsan semper lobortis. Donec elementum nunc non quam aliquam, rutrum fringilla justo interdum. Morbi pulvinar pellentesque massa vitae maximus. Cras condimentum aliquam massa, et pellentesque lorem dictum a. Vivamus at dignissim justo. Donec ligula dui, tempus vestibulum est vel, rutrum blandit arcu. Vivamus iaculis molestie neque in elementum. Sed convallis tempus quam non elementum. Nulla euismod lobortis ligula. Etiam ac mauris eget magna posuere ornare id vitae felis. Nunc efficitur lorem et euismod porttitor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16766
Differential Revision: D13959962
Pulled By: pjh5
fbshipit-source-id: 9b71bdf981d4fda9d8951e2d183db81f349b7f81
Summary:
-In the case where an operator does not support a given data type,
an error message is emitted to alert the user; this message was
incorrectly structured. This commit adds to and rearranges the
error message to make it a little clearer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16537
Differential Revision: D13958859
Pulled By: zou3519
fbshipit-source-id: 935fc3adcef2f969042b1db902c9ec004488ea9c
Summary:
As the comment indicates, the issue is only present in some versions of
Python 2, so we should be able to use heavily optimized PyTuple_Check in
most cases, and skip allocation of the strings, and unnecessary lookups
on object's type.
cc ezyang zasdfgbnm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16657
Differential Revision: D13957854
Pulled By: ezyang
fbshipit-source-id: be32eb473ad77a0805e8247d8d583d673d4bdf25
Summary:
This PR updates the logic for using cudnnGet* and cudnnFind*. Current version of cudnn find and get (v7) returns a pair of best algorithm and the convDesc mathType. While we were using the returned algorithm, we didn't update the mathType. As a result, we ended up with a slow choice of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.
Changelist:
- Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both cudnnFind and cudnnGet have the same purpose and hence, it made sense to unify them and get rid of `getAlgorithm`.
- Used cudnnGet*_v7 everywhere cudnnGet* was being used.
- Removed all cudnn6 paths (This PR depends on https://github.com/pytorch/pytorch/pull/15851)
Differential Revision: D13957944
Pulled By: ezyang
fbshipit-source-id: a88c39d80ae37f2d686665622302b62b50fab404
Summary:
Move `logsumexp` and `max_values` to `TensorIterator` and use it to make `logsumexp` work for multiple dimensions.
Timings on a tensor of shape `(10,1000000,10)`, for each combination of (cpu, single-threaded cpu, gpu) and dimension:
**before**
208 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
279 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
199 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.11 s ± 33.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.25 s ± 25.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.11 s ± 6.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 ms ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
132 ms ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.6 ms ± 19.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
**after**
199 ms ± 8.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
307 ms ± 8.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
207 ms ± 7.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.16 s ± 8.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.26 s ± 47.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.13 s ± 13.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
15.4 ms ± 868 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
132 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.6 ms ± 21.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
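The reduction being moved is, mathematically, the numerically stable log-sum-exp; a plain-Python sketch (stdlib only — not the TensorIterator implementation, just the math the kernel computes):

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x))): shift by the max before exponentiating."""
    m = max(xs)
    if math.isinf(m):            # all -inf: avoid nan from (-inf) - (-inf)
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Reducing over multiple dimensions is just flattening those dims first:
data = [[1000.0, 1000.0], [1000.0, 1000.0]]
flat = [x for row in data for x in row]
print(logsumexp(flat))   # ~1001.386; naive exp(1000) would overflow
```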
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16475
Differential Revision: D13855746
Pulled By: umanwizard
fbshipit-source-id: aaacc0b967c3f89073487e1952ae6f76b7bd7ad3
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/309
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16481
This gives us a boolean flag `quantize` on the `BeamSearch` module that allows us to apply FBGEMM quantization to a pretrained PyTorch model and export this to PyTorch native runtime.
Reviewed By: jmp84
Differential Revision: D13514776
fbshipit-source-id: 3f7cbff0782aae54c9623ad1ea7e66d7f49e2b32
Summary:
Pull Request resolved: https://github.com/pytorch/translate/pull/310
This adds fork/join parallelism to the EncoderEnsemble and DecoderBatchedStepEnsemble models. Note that when run in Python, these calls are no-ops, and similarly we remove these calls before exporting to ONNX. But when we run in the PyTorch native runtime, we will now have the opportunity to run these sections in parallel.
Benchmark validation is pending me slogging through FBLearner Flow issues, as usual
Reviewed By: jmp84
Differential Revision: D13827861
fbshipit-source-id: 0cb9df6e10c0ba64a6b81fa374e077bce90f1d5b
Summary:
This PR reworks the mutability API to be simpler (updates passes to use "mayAlias" calls) and improves the caching logic.
The difference is that we now directly express the idea of a "memory location." Leaves in the alias tracker's points-to graph are considered unique memory locations, and mayAlias questions boil down to whether two values share a leaf.
To speed up queries, some basic path compression has been added.
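A toy sketch of the "shared leaf memory location" idea, with a leaf-set cache standing in for the speedup described above (all names are hypothetical, not the JIT's actual AliasTracker API):

```python
class AliasTracker:
    """Toy points-to graph: leaves are unique memory locations, and two
    values may alias iff they can reach a common leaf. Sketch only."""

    def __init__(self):
        self.points_to = {}      # value -> set of values it may point to
        self._leaf_cache = {}    # memoized leaf sets to speed up queries

    def add_edge(self, frm, to):
        self.points_to.setdefault(frm, set()).add(to)
        self.points_to.setdefault(to, set())
        self._leaf_cache.clear()          # graph changed: invalidate cache

    def leaves(self, v):
        if v in self._leaf_cache:
            return self._leaf_cache[v]
        tos = self.points_to.get(v, set())
        # A value with no outgoing edges is its own memory location.
        result = {v} if not tos else set().union(*(self.leaves(t) for t in tos))
        self._leaf_cache[v] = result
        return result

    def may_alias(self, a, b):
        return bool(self.leaves(a) & self.leaves(b))

t = AliasTracker()
t.add_edge("view", "tensor")     # a view points into tensor's storage
t.add_edge("other", "other_buf")
print(t.may_alias("view", "tensor"))   # shared leaf -> True
print(t.may_alias("view", "other"))    # disjoint leaves -> False
```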
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16605
Differential Revision: D13952738
Pulled By: suo
fbshipit-source-id: cfc7fb2b23369f1dc425d1d8ca2c753c193d95dd
Summary:
The idea is to unify the environment variables `JOB_BASE_NAME` and `BUILD_ENVIRONMENT`, which controlled the PyTorch and Caffe2 jobs respectively. In this commit, we have converted all the `JOB_BASE_NAME` references in _.jenkins/pytorch/*_ files to `BUILD_ENVIRONMENT`, then did the same thing in _.circleci/config.yml_. One thing we needed to be careful about was when both `BUILD_ENVIRONMENT` and `JOB_BASE_NAME` were present under the same declaration in the _config.yml_ file (e.g., for "caffe2-" stuffs). To ensure that all "==" checks work as expected, we also had to add "*" in some if conditions in the _.jenkins/caffe2/build.sh_ file. Finally, removed the "-build", "-test", etc. suffixes from the `COMPACT_JOB_NAME` variable assignment in the bash script files in the _.jenkins/pytorch_ folder, e.g., modified `COMPACT_JOB_NAME="${BUILD_ENVIRONMENT}-build"` to `COMPACT_JOB_NAME="${BUILD_ENVIRONMENT}"`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16649
Differential Revision: D13946392
Pulled By: mmh683
fbshipit-source-id: 790de6abf96de184758e395c9098a50998e05bc5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16673
Replace resize_dim() with set_sizes_and_strides() in THTensor_(unsqueeze1d) in aten/src/TH/generic/THTensor.cpp, as described in T38058642.
Reviewed By: ezyang
Differential Revision: D13928879
fbshipit-source-id: d593cebcc82589cd362ac78884d4e367d0da0ce6
Summary:
Just noticed while building on a machine without cudnn present - it was building but the runtime failed since some methods weren't bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16701
Differential Revision: D13937247
Pulled By: dzhulgakov
fbshipit-source-id: c81f05be7a9e64a1a8591036dcf8692c0ed4064e
Summary:
The op was implicitly relying on pos_to_output being zero-initialized after extending. We're removing this functionality from the allocator, so we fix it here. For some reason it wasn't spotted by junk-initialization but was reliably reproducible with standard malloc() if both junk_fill and zero_fill flags are turned off.
cc kittipatv jerryzh168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16702
Reviewed By: kittipatv
Differential Revision: D13937257
Pulled By: dzhulgakov
fbshipit-source-id: 3ee520b05467108e6c3e64eb3e6c60589bdf3d87
Summary:
This bump includes:
* Memory leak fix where the Gloo transport would hold on to auxiliary
structures for send/recv pairs after they finished.
* Fix write-after-free from Gloo thread during stack unwinding on error.
* Removal of the PATENTS file.
Fixes#16144.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16638
Differential Revision: D13937950
Pulled By: pietern
fbshipit-source-id: 3cfecaf13ee0f214c06681386557a4b1c3e1d6b9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16524
- Make it exception safe. When an exception happens during write, the old state is recovered.
- Use RAII instead of try/catch to increment counters in readers. This is more readable, and it also makes it work with reader closures that return void, which previously didn't work because the reader return value was stored on the stack.
- Assert there's no reads or writes happening when it's destructed to avoid destruction race conditions
- Explain the algorithm in detail in comments
- Add test cases
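The RAII pattern described in the second bullet maps naturally onto a context manager in Python; a hedged sketch (illustrative only, not the actual C++ reader/writer code):

```python
import threading
from contextlib import contextmanager

class ReadCounter:
    """Sketch of RAII-style reader counting: the counter is always
    decremented on scope exit, even if the reader body throws, and
    the reader's return value never has to pass through the guard."""

    def __init__(self):
        self._lock = threading.Lock()
        self.readers = 0

    @contextmanager
    def reading(self):
        with self._lock:
            self.readers += 1
        try:
            yield
        finally:                      # runs on normal exit AND on exception
            with self._lock:
                self.readers -= 1

c = ReadCounter()
try:
    with c.reading():
        raise RuntimeError("reader failed")
except RuntimeError:
    pass
print(c.readers)   # 0: the count was restored despite the exception
```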
Reviewed By: ezyang
Differential Revision: D13866609
fbshipit-source-id: 01306a282a3f555569caa13d8041486f960d00e2
Summary:
When trying to get a test to pass I was missing an exclamation mark. Instead now I just use a different function in the conditional
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16686
Differential Revision: D13935182
Pulled By: jamesr66a
fbshipit-source-id: 7525a1a829276641dbafe06734f03f6202df6b22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16648
We added onnxGraph sharing keyed on model id and net seq number, but we forgot to supply this info to Onnxifi. Therefore, we would only ever create ONE onnxGraph... This diff adds the necessary info to the OnnxifiOp to prevent this from happening.
Reviewed By: bertmaher, rdzhabarov
Differential Revision: D13912356
fbshipit-source-id: fe8982327287a35f32fe3b125d94b617d18c0ab5
Summary:
Adds a decorator `torch.jit.ignore` for Python functions that tells the compiler to skip over these Python values, putting a `prim::Error` in their place which always throws an exception when run.
This lets you have Python-only code in your model in an explicit way, which is useful for debugging, and still be able to save/load the model.
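The stub-that-throws behavior can be sketched in plain Python (a hypothetical decorator, not the real `torch.jit.ignore` implementation):

```python
def ignore(fn):
    """Sketch of the semantics described above: the compiled model keeps
    a stub that always raises, so the model can still be saved/loaded."""
    def stub(*args, **kwargs):
        raise RuntimeError(
            f"{fn.__name__} was ignored during scripting and cannot be run")
    stub.__name__ = fn.__name__
    return stub

@ignore
def debug_print(x):
    print("debugging:", x)     # Python-only code, skipped by the compiler

try:
    debug_print(42)
    raised = False
except RuntimeError as e:
    raised = True
    print(e)
```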
Fixes#15815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16055
Differential Revision: D13797286
Pulled By: driazati
fbshipit-source-id: 29d36776608ec101649a702952fc6ff3c27655b1
Summary:
Add winograd conv method. Users can select the direct conv or winograd conv in the model file.
We close the origin pr https://github.com/pytorch/pytorch/pull/12154 and create this new one for better rebasing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15196
Differential Revision: D13463721
Pulled By: yinghai
fbshipit-source-id: c5cd5c8aa7622ae7e52aeabd3dbb8ffb99b9b4ee
Summary:
Previously this would fail with the error message:
```
ValueError: Auto nesting doesn't know how to process an input object of type dict. Accepted types: Tensors, or lists/tuples of them
```
Turns out we're not using the line that causes this error (or a side effect of that line), so removing it fixes the issue. Also cleaned up some related dead code (cc apaszke to make sure the code isn't useful in some way)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16616
Differential Revision: D13908352
Pulled By: suo
fbshipit-source-id: 27094f1f4ea0af215b901f7ed3520e94fbc587b3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16625
This is a squash of multiple PRs that refactored the old c10 dispatcher into a new one that follows the c10 dispatcher design doc.
It is now unboxed and follows the Stack semantics from JIT. It also uses the runtime JIT schema instead of its own compile time schema definitions.
Reviewed By: ezyang
Differential Revision: D13907069
fbshipit-source-id: edcc4806ccd21474fdfb5a98516219b1956db13d
Summary:
There is a regression in cudnnGet*_v7 that causes slowdown in resnet50 training. I am opening a bug with cuDNN team about this. This reverts commit 38374468832e307ca741901870914857a836dd5d.
ezyang 😿
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16484
Differential Revision: D13924755
Pulled By: soumith
fbshipit-source-id: 8c719345fc443f1289539bfae630eea9224ba4a5
Summary:
Adds better bounds checks for target lengths in CTC loss, checks for integral types for target and prediction lengths, and adds tests for each, according to #15946
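A hedged sketch of the kind of checks described (function and parameter names are hypothetical, not the C++ kernel's API):

```python
def check_ctc_lengths(target_lengths, max_target_length):
    """Lengths must be integral and must fit inside the padded target."""
    for i, length in enumerate(target_lengths):
        if not isinstance(length, int):
            raise TypeError(f"target_lengths[{i}] must be integral, "
                            f"got {type(length).__name__}")
        if not 0 <= length <= max_target_length:
            raise ValueError(f"target_lengths[{i}]={length} is out of "
                             f"range [0, {max_target_length}]")

check_ctc_lengths([3, 5], max_target_length=5)       # passes
try:
    check_ctc_lengths([3, 6], max_target_length=5)   # 6 exceeds padding
    in_bounds = True
except ValueError as e:
    in_bounds = False
    print(e)
```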
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16269
Differential Revision: D13847567
Pulled By: ezyang
fbshipit-source-id: 5d7a975565e02baf78fe388813a1d1ef56dfb212
Summary:
-Skip the test due to flaky behavior on AMD/Rocm
-The fix is expected in Rocm 2.2 (HSA runtime)
bddppq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16639
Differential Revision: D13915231
Pulled By: bddppq
fbshipit-source-id: 66e1d275836337170b15ceb9d60cfdd3242d4df8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16620
LogfiledbNetLoader loads all external input blobs into a workspace instance; we pack a shared pointer to this loaded workspace into the SingleLoadedNetSupplier.
SingleLoadedNetSupplier will pass this workspace to BlackBoxPredictor to be executed. (D13891759 is a WIP of how it all comes together)
Reviewed By: pjh5
Differential Revision: D13901467
fbshipit-source-id: 20589f898922f5f1aec50be131dad17a8c38e9b2
Summary:
Resolves#15863
Changed the documentation for MultiLabelSoftMarginLoss and MultiLabelMarginLoss to be more explicit about the `target` format.
More than happy to change the messaging based on discussion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16644
Differential Revision: D13912395
Pulled By: soumith
fbshipit-source-id: 24a3c214c5f6f9d043e25b13ac758c1c1211b641
Summary:
We inadvertently switched the OSX build over to ninja on CI. It then fails to respect MAX_JOBS and hits the same sccache deadlock bug; this makes the ninja build respect MAX_JOBS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16641
Differential Revision: D13910751
Pulled By: zdevito
fbshipit-source-id: 61bec500539519b019b74421a13cd87fc1d86090
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16513
compare_exchange_deleter makes it easier to replace a
deleter on a DataPtr with a new one, without requiring
allocating another closure to hold the old deleter.
See comment for details.
This diff was originally landed as part of D13762540
(#16226) but we are reverting that diff D13863610 (#16510)
Reviewed By: smessmer
Differential Revision: D13864245
fbshipit-source-id: 56eda4748238dd3a5130ba6434fda463fe7c690e
Summary:
So that things like below can be JITable, and available in C++ API:
```python
import torch
@torch.jit.script
def f(x, y, z):
    return x.index_add(0, y, z)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12413
Differential Revision: D13899948
Pulled By: suo
fbshipit-source-id: b0006b4bee2d1085c813733e1037e2dcde4ce626
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16630
Two PRs landed concurrently: enforcing tensor constraints and refactoring c10. Since it's not prod code, disable the test and I'll let Sebastian fix it properly.
Reviewed By: ezyang
Differential Revision: D13908117
fbshipit-source-id: 381c5626078b794afa1fc7a95cb1ea529650424c
Summary:
I went through my build log and did what I thought were reasonable fixes to all the C++ compilation warnings that came up
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16411
Differential Revision: D13901006
Pulled By: jamesr66a
fbshipit-source-id: 02df4e3e5a5c8dd9e69ac9f065cd3f2a80645033
Summary:
This PR adds basic support (creation and indexing) for immutable dictionaries in Script. This includes Python/string frontend support and a `IValue::GenericDict` type backed by a `std::unordered_map`. Only `str`, `int`, and `float` are supported as keys, any type can be a value. Structure is pretty similar to list.
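The key-type restriction can be sketched as a small validator (a hypothetical helper, not the compiler's actual type checking):

```python
ALLOWED_KEY_TYPES = (str, int, float)   # keys a Script dict may use

def check_script_dict(d):
    """Reject dicts whose keys a Script GenericDict could not hold;
    values can be of any type."""
    for k in d:
        if not isinstance(k, ALLOWED_KEY_TYPES):
            raise TypeError(f"unsupported Script dict key type: "
                            f"{type(k).__name__}")
    return d

ok = check_script_dict({"a": [1, 2], 3: "x"})   # str and int keys: fine
try:
    check_script_dict({(1, 2): "bad"})          # tuple keys: rejected
    accepted = True
except TypeError as e:
    accepted = False
    print(e)
```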
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16208
Differential Revision: D13881686
Pulled By: driazati
fbshipit-source-id: 29ce9835b953c3456f57bcc2bbdf7fe0cbf941c0
Summary:
so that it's included in the hashed key that decides whether to call Find or not. This is required to ensure that Find is run for all devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16613
Differential Revision: D13901769
Pulled By: bddppq
fbshipit-source-id: 7d29ea9e40231cd4eef80847afa1307efeb0945c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16514
Original commit changeset: dc371697f14b
Relanding https://github.com/pytorch/pytorch/pull/15860 - the problem was that layer_norm was using at::empty which is not yet on mobile
Reviewed By: ezyang
Differential Revision: D13861480
fbshipit-source-id: e2116da32bc117175c96b9151b1beba9b31eff36
Summary:
This simplifies the process for building on windows, since users no longer have to find and run the vcvarsall.bat file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16540
Differential Revision: D13893596
Pulled By: zdevito
fbshipit-source-id: 79b7ad55c3251b3f573fd8464931138f8a52dd1d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16473
This resolves the issues associated with caffe2 initialization (specifically the REGISTER_FUNCTION_SCHEMA_OPERATOR calls) being run after Torch's static op registration calls.
The fix employs a meyer's singleton wrapped by the constructor of a type. Everything is placed inside a macro to make it easier for users to use.
Reviewed By: smessmer
Differential Revision: D13854306
fbshipit-source-id: ecf60861f229532826fae254974e9af4389055df
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16576
allows instantiation of operator with arguments passed by move rather than explicit copies
per Sebastian's suggestion
Reviewed By: smessmer
Differential Revision: D13882416
fbshipit-source-id: bc8d50e73f5a1ae87155b0cf96799b8573a7a8fa
Summary:
Here is a fresh attempt at getting some fusion back in autodiff-generated graphs in the presence of SumToSize.
- The sum to size operator is now `aten::_grad_sum_to_size` to allow symbolic script differentiation (and that in turn would need to use this in place of sum_to_size to signal that it strictly operates on gradients). This is also used in the autodiff code, replacing `prim::SumToSize`.
- `_grad_sum_to_size` is now fusable, `cat`s - which are fused afterwards thanks to Adam's simplification of the code - are only fused if there is no `_grad_sum_to_size` in the fusion group.
- I push the `_grad_sum_to_size` out of the fusion group when compiling and record the desired summations in the KernelSpec. The reasoning is the following:
- As the autodiff is a repeated application of the chain rule, we always have the pattern `grad_in = mm(A, grad_out)`, with A often diagonal for cases interesting to the fuser, whence it is `grad_in = a * grad_out` (a pointwise multiplication). We know that only `grad_out` may have AutodiffGradSumToSize applied, so we can commute AutodiffGradSumToSize with the `mul` (and `div` and `neg` are of similar origin).
- For `type_as` the gradient might be giving the type, so just skip SumToSize,
- `add` (which was inserted as `prim::AutogradAdd`) adding gradients when the forward used the same value in several places. This is non-broadcasting, so we know that the two arguments would have the same sizes as inputs - which is good so we don't have to do bookkeeping of the two parts.
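The shape bookkeeping behind a grad-sum-to-size step can be sketched in plain Python (a hypothetical helper following standard broadcasting rules, not the actual implementation):

```python
def sum_to_size_dims(grad_shape, target_shape):
    """Return the dimensions of grad_shape that must be summed so the
    result has target_shape. Extra leading dimensions (absent from the
    target) are summed away entirely; dimensions that were broadcast
    from size 1 are summed with keepdim semantics."""
    leading = len(grad_shape) - len(target_shape)
    dims = list(range(leading))  # extra leading dims: sum and drop
    for i, t in enumerate(target_shape):
        if t == 1 and grad_shape[leading + i] != 1:
            dims.append(leading + i)  # broadcast dim: sum, keep dim
    return dims
```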
Details:
- During fusion, the Tensor arguments are always kept as the first parameters of the fusion group to accommodate indexing assumptions in the fuser.
- The rewriting of the fusion group to record the necessary output transformation and eliminate `_grad_sum_to_size` from the fusion group is now in the fuser compile step.
- In the execution step, the arguments are split into Tensor / Non-Tensor and the non-tensor args are mostly forgotten about except for doing `sum_to_size` at the end. This would want to be improved if/when we fuse nonconstant scalar arguments.
- In a number of places in the fuser, the non-Tensor arguments to the fusion group needed to be ignored.
Thank you, apaszke for the insightful discussion. All bad ideas and errors are my own.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14957
Differential Revision: D13888173
Pulled By: zou3519
fbshipit-source-id: 071992c876e8b845f2b3e6329ae03a835d39a0ea
Summary:
In the warning box on https://pytorch.org/docs/stable/tensors.html#torch.Tensor.new_tensor it says:
> new_tensor() always copies data. [...] If you have a numpy array and want to avoid a copy, use **torch.from_numpy()**.
But then further up the page we have another warning box with the message:
> torch.tensor() always copies data. [...] If you have a numpy array and want to avoid a copy, use **torch.as_tensor()**.
Now I believe this is just a small oversight, since from_numpy is to be deprecated in favour of as_tensor. See for example https://github.com/pytorch/pytorch/issues/6885 and https://github.com/pytorch/pytorch/issues/8611. I suggest to just use **torch.as_tensor()** in both of the warning boxes.
cc gchanan
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16587
Differential Revision: D13897038
Pulled By: gchanan
fbshipit-source-id: 2eb3cd47d2c0b5bf4350f980de3be9fe59b4a846
Summary:
applySelect modifies the tensor and removes the topmost dimension, which makes it complicated to track the original dimension using `dim` alone; we need another parameter, `real_dim`, to signify the original dimension.
Fixes #16192
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16495
Differential Revision: D13897182
Pulled By: gchanan
fbshipit-source-id: 105581dbbff6b431cc8e2539a07e0058161e53a1
Summary:
We don't use this in the lambda body anymore. Remove it to fix a warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16526
Differential Revision: D13867043
Pulled By: umanwizard
fbshipit-source-id: 4c9a9d194fdfcb63fde16823517d2c6c8e2ae93d
Summary:
This just moves thing around to make AliasTracker independently testable and keep things a little more separate. Follow-on PRs will change the interfaces of AliasDb and AliasTracker to be more clearly distinct.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16588
Differential Revision: D13891894
Pulled By: suo
fbshipit-source-id: c5b590b5fdd462afefe743e499034068bf35784a
Summary:
Doc doesn't need to be changed. Also clarifies two inaccurate comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16384
Differential Revision: D13886637
Pulled By: soumith
fbshipit-source-id: 227385008211a6f3ad9135c54fd2d3754cc9daaf
Summary:
- Add libtorch upload jobs
- Unify checkout and env code for binary jobs (sans binary test jobs)
- Compress variables passed into binary jobs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16511
Differential Revision: D13893714
Pulled By: pjh5
fbshipit-source-id: b8bd72e1397dec569a8ec3e859e319178c7c6f8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15388
This is another pass to make perfkernels code safer from illegal instruction errors.
Removed the dependency on c10/util/Logging.h
We err on the safe side at the expense of some verbosity.
Reviewed By: dskhudia
Differential Revision: D13502902
fbshipit-source-id: 4f833115df885c5b4f8c1ca83b9badea1553f944
Summary:
The current implementation of the `torch.utils.model_zoo.load_url`
function is prone to a race condition when creating the directory in
which it saves the loaded models, since it checks whether the
directory exists and then creates it in two separate steps. The
directory can be created after the check was made but before we
attempt to create the directory, resulting in an unhandled exception.
Instead, try to create the directory directly, and do nothing if it
already exists.
Note: for Python versions ≥ 3.2, we could simply use the
`exist_ok=True` flag on `os.makedirs`, but this is unavailable in
Python 2.7.
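The race-free pattern described above can be sketched as a small helper (illustrative; the actual `model_zoo` code differs):

```python
import errno
import os

def makedirs_exist_ok(path):
    """Create `path`, tolerating a concurrent creation: attempt the
    creation directly and ignore only the 'already exists' failure.
    Equivalent to os.makedirs(path, exist_ok=True) on Python >= 3.2,
    but also works on Python 2.7."""
    try:
        os.makedirs(path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise
```

Checking `os.path.exists` first and then creating the directory would reintroduce the window between check and creation; trying the creation and handling `EEXIST` closes it.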
Signed-off-by: Antoine Busque <antoine.busque@elementai.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16578
Differential Revision: D13886470
Pulled By: soumith
fbshipit-source-id: 88815c8a65eec96caea32d6e9a7f83802502fdb9
Summary:
As there are no checks that all the functions are actually being used, we can end up with stale entries. This diff removes unused entries from Declarations.cwrap
Testing:
Successful build via "python setup.py develop"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16463
Differential Revision: D13885815
Pulled By: izdeby
fbshipit-source-id: 4e35c2ac9196167af74dff3d4f971210721285f8
Summary:
Start splitting up these tests so we don't have a massive test file. Doesn't change how you run them, since `gtest.cpp` and `no-gtest.cpp` will still collect everything.
Renamed `tests.h` to `test_misc.h` to vaguely discourage people from adding yet more stuff to it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16536
Reviewed By: zdevito, eellison
Differential Revision: D13882215
Pulled By: suo
fbshipit-source-id: 61cf97f3c2c50703dcf6a3a34da01415ecb7e7d6
Summary:
Fixes https://github.com/pytorch/pytorch/issues/16326
Previously we didn't handle module inputs which included Generic Lists. When checking whether a generic list is a subvalue of the input arg type, I currently recurse on every element of the list. This shouldn't be too slow since the innermost list will be specialized and we won't have to check its elements.
E.g. Tensor[][] -> GenericList [TensorList ].
The error message could be improved, but extracting the complete type of nested lists would have to deal with unifying types across lists / empty lists & typevars so I'm going to save that for a follow up PR.
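The per-element recursion can be sketched like so (a simplified Python stand-in for the C++ subvalue check, using tuples to model list types):

```python
def is_subvalue(value, typ):
    """Check whether `value` conforms to `typ`, where `typ` is either a
    concrete Python type or ("List", elem_type). Lists recurse on every
    element, mirroring the generic-list check described above."""
    if isinstance(typ, tuple) and typ[0] == "List":
        return isinstance(value, list) and all(
            is_subvalue(v, typ[1]) for v in value)
    return isinstance(value, typ)
```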
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16482
Differential Revision: D13882582
Pulled By: eellison
fbshipit-source-id: 3609bc572f0ee9ebf20a77ea5ebc8fa3b165e24b
Summary:
Absolutely no idea why this is needed. This should be a valid argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16575
Differential Revision: D13884796
Pulled By: pjh5
fbshipit-source-id: 6011e721e2870499f6b5e627d5ad00ece08b530b
Summary:
This PR changes the way we store aliasing information from a "set" approach to a "points-to" analysis. Set-based approaches lose information in ways that make it difficult to do "live" updates to the alias DB as one is mutating the graph.
The tradeoff is that simple queries get more expensive, since they require traversing the points-to graph to answer most questions. In practice, this is unlikely to be that costly since we don't have massive aliasing chains, but we could create an approximation/caching layer if this becomes a problem.
My rough plan is:
1. This PR, switching to a points-to graph
2. Make it "live": analyzing a node should record all the edges the node added, so that we can rollback when the node is destroyed.
3. Reduce wildcard scope: we can make the wildcard a special vertex that points to anything that we're not "sure" about; namely, things that have been put inside lists, or graph inputs.
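A toy version of the points-to query described above (two values may alias if some vertex is reachable from both; illustrative only, the real AliasDb tracks JIT Values, wildcards, and containers):

```python
from collections import defaultdict

class PointsToGraph:
    """Minimal points-to graph with a reachability-based alias query."""

    def __init__(self):
        self._points_to = defaultdict(set)

    def add_edge(self, value, target):
        self._points_to[value].add(target)

    def _reachable(self, value):
        # Depth-first traversal of the points-to edges from `value`.
        seen, stack = set(), [value]
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            stack.extend(self._points_to[v])
        return seen

    def may_alias(self, a, b):
        # Simple queries traverse the graph, as noted above.
        return bool(self._reachable(a) & self._reachable(b))
```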
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16386
Differential Revision: D13855117
Pulled By: suo
fbshipit-source-id: f009f58143173c275501624eb105d07ab60fe5e1
Summary:
Changelog:
- Modify concatenation of [1] to a tuple by using cases for list and non-list types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16489
Differential Revision: D13875838
Pulled By: soumith
fbshipit-source-id: fade65cc47385986b773b9bde9b4601ab93fe1cf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16462
This file was moved, now we change the includes to the new location and remove the proxy header.
Reviewed By: ezyang
Differential Revision: D13847279
fbshipit-source-id: 4617d52fdcfe785cb7b2154460a6686c437abd8b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16282
This changes the core kernel abstraction to be a function taking a stack, popping its arguments from the stack and pushing results to the stack,
instead of getting arguments as ArrayRef<IValue> and returning an output IValue.
Caffe2 operators need to have a way to pass in preallocated output tensors.
The convention for them is to get all inputs *and* outputs on the stack and also return all of them, i.e. a caffe2 op will always have inputs == outputs.
This will probably change in later diffs towards making the outputs in-arguments optional in the JIT schema.
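The stack-based calling convention can be sketched in Python (illustrative; the real kernels operate on a stack of IValues in C++):

```python
def add_kernel(stack):
    # A kernel takes the whole stack, pops its arguments,
    # and pushes its results back onto the same stack.
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def call_op(kernel, *args):
    # The caller materializes the arguments on a stack, invokes the
    # kernel, and whatever remains on the stack is the result.
    stack = list(args)
    kernel(stack)
    return stack
```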
Reviewed By: ezyang
Differential Revision: D13792335
fbshipit-source-id: e9cc2b5e438cc4653e1f701633a154b92b604932
Summary:
This PR contains the implementation of chunk dataset, with the API proposed in PR https://github.com/pytorch/pytorch/pull/15562
A chunk dataset is derived from StatefulDataset. It utilizes worker threads to prefetch chunk data, split it into batches, and cache them in a queue. When get_batch is called from the dataloader, batch data is retrieved from the queue, and data from new chunks is pushed for later batches.
Chunk dataset uses two samplers (chunk_sampler and example_sampler) to perform sampling. The chunk_sampler decides which chunk to load, and example_sampler shuffles the examples inside a specific chunk. More detail of this sampling approach can be found here: http://martin.zinkevich.org/publications/nips2010.pdf
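The two-level sampling can be sketched as a generator (a simplified single-threaded illustration, not the actual prefetching implementation):

```python
import random

def chunked_batches(chunks, batch_size, seed=0):
    """Yield batches using two samplers: a chunk sampler decides which
    chunk to load next, and an example sampler shuffles the examples
    within that chunk before batching."""
    rng = random.Random(seed)
    chunk_order = list(range(len(chunks)))
    rng.shuffle(chunk_order)              # chunk_sampler
    for ci in chunk_order:
        examples = list(chunks[ci])
        rng.shuffle(examples)             # example_sampler
        for i in range(0, len(examples), batch_size):
            yield examples[i:i + batch_size]
```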
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15932
Differential Revision: D13868688
Pulled By: soumith
fbshipit-source-id: a43000c478ca2a3c64cc84b3626d6b8b1ad9a07e
Summary:
Rehash of previous attempts. This tries a different approach where we accept the install as specified in cmake (leaving bin/ include/ and lib/ alone), and then try to adjust the rest of the files to this more standard layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16414
Differential Revision: D13863635
Pulled By: zdevito
fbshipit-source-id: 23725f5c64d7509bf3ca8f472dcdcad074de9828
Summary:
This is particularly useful when using a c10d::Store from tests.
cc jgehring
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16278
Reviewed By: janewangfb
Differential Revision: D13866271
Pulled By: pietern
fbshipit-source-id: c8670b5f4ebd5cd009f2cabbe46cc17a9237d775
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16510
This diff was supposed to be memory usage neutral, but based on
some internal flows involving cuDNN, it was not. Reverting pending
further investigation.
Original commit changeset: 03f1ebf7f11c
Reviewed By: xw285cornell
Differential Revision: D13863610
fbshipit-source-id: 15517e255fd6b0c064b65fb99f0ef19742236cfd
Summary:
In the case of spurious failure, refcount is not incremented -- which leads to underflow once all references are released.
This was discovered when exercising multiprocessing on ppc64le.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16302
Differential Revision: D13845435
Pulled By: ezyang
fbshipit-source-id: 8e264fff9dca8152cb12617e3216d5e48acd9557
Summary:
We have:
- This is an initial stab at creating a type stub `torch/__init__.pyi` .
- This is only tested on Python 3, since that's the only Python version mypy
works on.
- So far, we only aim at doing this for torch functions and torch.Tensor.
- Quite a few methods and functions have to be typed manually. These are
done in `torch/__init__.pyi.in`
For me, PyCharm (the non-paid one) didn't seem to indicate errors in the .pyi when opening and seemed to be able to get the type hint for the few functions I tried, but I don't use PyCharm for my usual PyTorch activities, so I didn't extensively try this out.
An example of a generated PYI is at [this gist](https://gist.github.com/ezyang/bf9b6a5fa8827c52152858169bcb61b1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12500
Differential Revision: D13695553
Pulled By: ezyang
fbshipit-source-id: 4566c71913ede4e4c23ebc4a72c17151f94e8e21
Summary:
Some HTTP servers don't return Content-Length; account for that.
Fixes: https://github.com/pytorch/pytorch/issues/16152
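A sketch of the download loop that tolerates a missing Content-Length (hypothetical helper, not the actual code): with a known length the caller could report progress, without one we simply read until EOF.

```python
import io

def read_body(stream, content_length=None, chunk_size=8192):
    """Read a response body whether or not the server sent
    Content-Length. `content_length` is None when the header is absent,
    in which case we read chunks until the stream is exhausted."""
    data = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data.extend(chunk)
        if content_length is not None and len(data) >= content_length:
            break
    return bytes(data)
```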
Differential Revision: D13858882
Pulled By: soumith
fbshipit-source-id: e4293e9368ed4c87548d22adec1ce0c25ea4bd8f
Summary:
It looks like `WithInsertionPoint` and `WithCurrentScope` can be easily implemented without
`ResourceGuard` - that helps readability and removes one more dependency. Is there anything I'm missing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16351
Differential Revision: D13821826
Pulled By: ZolotukhinM
fbshipit-source-id: b203200b345fb5508a97dc8656e6f51cde4cc21f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15860
Few changes (which are harder to split in separate diffs, so together):
- make conversion explicit (as they can throw to avoid surprises)
- fix tensor legacy dispatch not initialized when tensor is created on C2 side
- add a bunch of invariants to enforce
Reviewed By: ezyang
Differential Revision: D13596031
fbshipit-source-id: d20b601e06ba47aeff2f6e8e15769840e2d46108
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16247
Stack is going to be used by the c10 dispatcher.
This just moves the file, also changing the namespace turned out to be more complicated than I thought, I'll leave the namespace for now.
Reviewed By: ezyang
Differential Revision: D13774189
fbshipit-source-id: 66aeee36425e0ea2b3a4f8159604f38572306d57
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16180
Only the kernel knows about its state, the caller doesn't see it anymore.
Reviewed By: ezyang
Differential Revision: D13744071
fbshipit-source-id: cb00ff1a881508c1b36ac4123bee1f68ca02ca9c
Summary:
The current uses of `IR_IF` are mostly trivial, so there is not much value in having special macros for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16354
Differential Revision: D13821823
Pulled By: ZolotukhinM
fbshipit-source-id: 1ca73111f5b4868fa38a1f29c9230540773e5de6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16273
Previously we have SetOutputSize which accept a partially initialized Output Tensor and set it to the correct size,
the diff change this to GetOutputSize that returns the correct size instead.
e.g.
```
auto* Y = Output(0);
ConvPoolOp<Context>::SetOutputSize(X, Y, channels);
...
Y->mutable_data<T>...
```
-->
```
auto sizes = ConvPoolOp<Context>::GetOutputSize(X, channels);
auto* Y = Output(0, sizes, at::dtype<T>());
```
Reviewed By: dzhulgakov
Differential Revision: D13736281
fbshipit-source-id: 64abce3dbaed0b375098463333dfd0ea5a3b1945
Summary:
Working on the tracer was really annoying because a lot of the implementations were in `tracer.h` and editing that file caused us to rebuild almost the whole world. So this moves all the implementations into tracer.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16410
Differential Revision: D13847776
Pulled By: jamesr66a
fbshipit-source-id: ec8500da32b2d4cd990f293a0a96101d3e82f158
Summary:
Fix alias annotations for ops that may return a fresh tensor. The previous version was overly conservative.
Currently there is no actual behavior change in the alias analysis, but we may use the information in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16460
Differential Revision: D13849086
Pulled By: suo
fbshipit-source-id: cd23b314a800e5e077d866e74456d37a321439d5
Summary:
Adds Tensor alias annotations.
This isn't a full implementation of alias annotations, but that isn't required to increase compliance with the JIT signature schema. There are some sanity checks within native_parse.py for their usage, which can also help overall correctness. Otherwise, this exists solely for further alignment between the JIT signature schema and the native_functions.yaml func schema.
This gets us to ~85% matches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16239
Differential Revision: D13804133
Pulled By: cpuhrsch
fbshipit-source-id: aa5750f2c7e0f08b8c35d6d8f38cb148e9629855
Summary:
Also, sometimes we have a `CMakeCache.txt` even though cmake errored out, so I'm adding the existence of `build.ninja` as another criterion for rerunning cmake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16426
Differential Revision: D13843801
Pulled By: ezyang
fbshipit-source-id: ea1efb201062f23b7608f8d061997d8a8e293445
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16177
Change the API for calling operators so that it can store state in an OpKernel object.
This diff doesn't store the state there yet, that comes in a follow up diff.
Reviewed By: ezyang
Differential Revision: D13742889
fbshipit-source-id: 20511a9a1b9f850074e50634d4b4acf87f8c6ecd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16246
The op schema says it returns multiple values, so let's actually return multiple values instead of one tuple.
For some reason, this did work when called from python (probably some auto-unpacking),
but once called from JIT, it segfaulted. This diff fixes that.
Reviewed By: dzhulgakov
Differential Revision: D13780147
fbshipit-source-id: fe94f82f4c53b7454f77c4484fca4ac9dc444475
Summary:
swapBytes64 used to use SwapByteOrder_32 and value, both of which don't exist. This commit rewrites that part from scratch.
This happened in a Debug build with the Microsoft compiler. For that case " && !defined(_DEBUG)" is also removed, because _byteswap_uint64 works fine in debug mode (if it is necessary, it should be commented why).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16418
Differential Revision: D13843306
Pulled By: ezyang
fbshipit-source-id: dde1c7baeccec3aaa750d4b7200b3f4ccb4a00cb
Summary:
This flag is useful in identifying if a test is taking way too long like the ones in the following snippet when running the test suite with pytest. 9757ad35b0/test/common_utils.py (L814-L835)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16423
Differential Revision: D13843507
Pulled By: ezyang
fbshipit-source-id: 643e1766a85905b3b112ea5ca562135a17896a72
Summary:
cdist is used for calculating distances between collections of observations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16168
Differential Revision: D13739147
Pulled By: ifedan
fbshipit-source-id: 9419c2c166891ac7db40672c72f17848f0b446f9
Summary:
Before this diff, we execute `std::vector<optional<acc_t>> buffer((unsigned)max_threads, optional<acc_t> {});` in every iteration of `foreach_reduced_elt`. Change the code to only execute that line if we need it; i.e., we are actually about to parallelize.
This overhead is quite significant when we are doing a lot of small reductions in single-threaded code.
```
x=torch.randn((1024,10,1024),dtype=torch.float64)
torch.set_num_threads(1)
%timeit x.std(1)
```
Before (with #15845 applied): 708.25 ms
After: 508 ms
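The hoisting idea can be sketched in plain Python (illustrative, single-process; the real code allocates a per-thread buffer of optionals in C++):

```python
def reduce_sum(values, num_threads=1):
    """Allocate the per-thread buffer only when we actually
    parallelize, so the serial path (the common case for many small
    reductions) pays no allocation cost."""
    if num_threads <= 1 or len(values) < 2:
        return sum(values)             # serial path: no buffer needed
    buffer = [0] * num_threads         # allocated only when parallelizing
    step = (len(values) + num_threads - 1) // num_threads
    for t in range(num_threads):       # stand-in for the parallel region
        buffer[t] = sum(values[t * step:(t + 1) * step])
    return sum(buffer)
```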
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15850
Differential Revision: D13612960
Pulled By: umanwizard
fbshipit-source-id: f5e61abfe0027775c97ed81ac09c997fbee741df
Summary:
Made the change requested in #15555
The PR was failing the build due to a timeout error while fetching packages with pip.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16416
Differential Revision: D13833873
Pulled By: soumith
fbshipit-source-id: e2200e9e8015558fcd359dfa3d025b25802d62b5
Summary:
This one needs to be merged ASAP because the CUDA build for Windows is skipped at this time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16412
Differential Revision: D13833889
Pulled By: soumith
fbshipit-source-id: 95a401a01fb0f9c1045df0bfd72d8206b8a6f3fd
Summary:
The real fix for https://github.com/pytorch/pytorch/issues/15605.
This is sort of BC breaking because now
```py
In [1]: import torch
In [2]: a = torch.randn(3, 3, requires_grad=True)
In [3]: a.slogdet()
Out[3]: (tensor(1.), tensor(0.1356, grad_fn=<SlogdetBackward>))
In [4]: a.slogdet()[0].requires_grad
Out[4]: False
```
while before this patch `a.slogdet()[0]` required grad with `grad_fn=<SlogdetBackward>`. But any attempt to backprop through this value would hit the error in #15605, so I don't think this is a problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16337
Differential Revision: D13832644
Pulled By: soumith
fbshipit-source-id: f96c477e99edcbdbd966888e5c5ea7fd058429a8
Summary:
The documentation stated that operands to einsum should be a list of Tensors, not individual arguments. The function, however, now accepts individual arguments for each Tensor operand *and* a single argument consisting of a list of Tensors. The documentation was updated to reflect this change.
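The argument normalization described above can be sketched as (hypothetical helper mirroring the documented behavior, not the actual `torch.einsum` code):

```python
def normalize_operands(*operands):
    """Accept either the list form einsum('ij,jk', [a, b]) or the
    variadic form einsum('ij,jk', a, b) and return a flat list of
    operands."""
    if len(operands) == 1 and isinstance(operands[0], (list, tuple)):
        return list(operands[0])
    return list(operands)
```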
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16323
Differential Revision: D13832647
Pulled By: soumith
fbshipit-source-id: c01c2b350f47674d3170337f493b0ee2ea381b3f
Summary:
These were really annoying to see in the phabricator UI when trying to land PRs that touched test_jit.py, so this fixes them.
One remaining item is the T484 error. Locally, flake8 still chokes on that line even though I put the noqa comment there (and tried varying whitespaces around it etc). Not sure why it still persists...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16409
Differential Revision: D13832658
Pulled By: jamesr66a
fbshipit-source-id: 46356ba6444ae5ee1a141c28489bdcc7c99e39c0
Summary:
Changelog:
- Append a condition that switches to the native CUDA implementation for affine_grid
Fixes #16365
Differential Revision: D13832192
Pulled By: soumith
fbshipit-source-id: 3f484e6673d71e3ba7627b170cb8f1611e12b9b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16374
this fixes the original attempt in OSS (adds to CMake and python build files)
Reviewed By: smessmer
Differential Revision: D13821061
fbshipit-source-id: 82f0dade0145fd04bdf8e3cb3954b5790e918162
Summary:
This commit removes the dependency on `build_pytorch_libs.sh` by moving the remaining functionality that is not expressible in cmake into python. Removing the indirection through bash also removes over 300 lines of environment munging code that is incredibly hard to understand because it passes a lot of secret parameters through `os.env`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16289
Reviewed By: ezyang
Differential Revision: D13821662
Pulled By: zdevito
fbshipit-source-id: d658d26925e3b1169ac1e3d44a159cf8a1f0d9b1
Summary:
1. Improve error message for better debugging info
2. Increase timeout
3. Also apply the windows worker failure detection mechanism on non-Windows platforms, for better robustness
Attempt to fix #14501
cc ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16249
Differential Revision: D13784702
Pulled By: ezyang
fbshipit-source-id: 09a7cff83ab9edce561ed69f9fb555ab35d1275f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16191
logdevice related modifications for generic feature type
we directly convert the generic feature structures to json strings, which corresponds to the column input in offline and dper
Reviewed By: itomatik
Differential Revision: D13551909
fbshipit-source-id: 807830c50bee569de202530bc3700374757793a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16350
Example usage of the new caffe2 integration
Reviewed By: smessmer
Differential Revision: D13408546
fbshipit-source-id: 87240ca7f48d653a70241d243aa0eb25efa67611
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16173
Helper to make it easy to run ops in caffe2
Reviewed By: smessmer
Differential Revision: D13468240
fbshipit-source-id: 2276c7870af6dcdf829957f005fd16ac1ef319b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16048
This enables full shimming of the operator (previously it was only
Output() shimmed).
Reviewed By: smessmer
Differential Revision: D13468241
fbshipit-source-id: c853b775ab5cdcd968f4a6cc4766e91c3c6b1c45
Summary:
Some cleanups in ir.{h,cpp}. I plan to continue cleaning it up, so this is a first step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16342
Differential Revision: D13808897
Pulled By: ZolotukhinM
fbshipit-source-id: 2dedb414576c3efbf8e36434145d7f14a66b1ee7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16335
group conv is not implemented with EIGEN engine so this diff disables related tests
Reviewed By: jamesr66a
Differential Revision: D13807204
fbshipit-source-id: 41f6de43da40882f57e64474520e185733caefb7
Summary:
Remove calls to torch.jit._unwrap_optional that are no longer needed.
The remaining instances would require control flow logic for exceptions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16245
Differential Revision: D13804292
Pulled By: eellison
fbshipit-source-id: 08c5cbe4b956519be2333de5cf4e202488aff626
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16341
as in the title
Reviewed By: intermilan
Differential Revision: D13808679
fbshipit-source-id: 0d12d3253f380bec66bc9be899be565861b8163a
Summary:
This PR adds thread-local guard (`at::AutoNonVariableTypeMode`) to make sure that in VariableType.cpp the operations on baseType still dispatch to non-Variable type, even if the parameters will become Variables after the Tensor/Variable merge. We achieve this by making `legacyTensorType()` and `getType()` check the `at::AutoNonVariableTypeMode` guard to decide whether to return non-Variable type for a variable.
This is part of the VariableImpl/TensorImpl merge work: https://github.com/pytorch/pytorch/issues/13638.
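The thread-local guard pattern can be sketched in Python (the real guard is a C++ RAII object; names here are illustrative):

```python
import threading

_tls = threading.local()

class NonVariableTypeModeGuard:
    """While active on the current thread, dispatch would choose the
    non-Variable path; the previous value is restored on exit, so
    guards nest correctly and other threads are unaffected."""

    def __enter__(self):
        self._prev = getattr(_tls, "non_variable", False)
        _tls.non_variable = True
        return self

    def __exit__(self, exc_type, exc, tb):
        _tls.non_variable = self._prev
        return False

def non_variable_mode_enabled():
    return getattr(_tls, "non_variable", False)
```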
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15939
Reviewed By: ezyang
Differential Revision: D13640980
Pulled By: yf225
fbshipit-source-id: d12c2543822958558d7d70d36c50999a5eb8783f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16226
Now that the caching allocator is moved to c10_cuda, we can
delete the duplicate copy from Caffe2.
Reviewed By: dzhulgakov, smessmer
Differential Revision: D13762540
fbshipit-source-id: 03f1ebf7f11c68c19aa0d66110156fe228da6138
Summary:
Some renaming and renamespacing also took place. I was originally planning not to do anything, but it turns out that it was easier to make HIPify work by using a namespace CUDACachingAllocator:: rather than THCCachingAllocator_, since :: is a word boundary but _ is not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16119
Reviewed By: smessmer
Differential Revision: D13718768
fbshipit-source-id: 884a481d99027fd3e34471c020f826aa12225656
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16117
This means I can move it to c10_cuda with minimal fuss.
Reviewed By: smessmer
Differential Revision: D13717836
fbshipit-source-id: a94c7dc649af64542480fc1c226b289588886c00
Summary:
In preparation for setting up a doc build job for stable docs, I wanted
to refactor the workflow so that future changes will be easier.
This PR the following changes:
- Refactor the doc push script into a reusable command
- Add command line options for the doc push script.
These don't matter too much for now but will be useful
for setting up future jobs for building different versions of the
docs.
- Instead of checking out pytorch/pytorch:master, we re-use the pytorch
installation inside the docker image.
- Change the sed in the script to a perl command. sed is annoyingly
different across platforms; the perl command is more stable
- Run the script in dry run mode (without pushing the doc build)
whenever a PR is opened. This lets us test changes to the doc build workflow.
Test Plan
- I tested the doc build script locally with my own credentials and it
worked fine.
- Wait for the pytorch_doc_push CI.
- After merging this PR, keep an eye on the pytorch_doc_push CI status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16265
Differential Revision: D13803511
Pulled By: zou3519
fbshipit-source-id: 4564bca3e74d490f89a1d1da9fb8b98eb44bdbb1
Summary:
pdist was recently patched to remove buggy batch support and fix issues
with large tensors. That fix missed a few spots and didn't handle a
few recommendations, which this commit addresses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16210
Differential Revision: D13791914
Pulled By: gchanan
fbshipit-source-id: 0595841be1b298f7268fd4c02a6628acfec918f2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16294
In `ReinitializeTensor`, we compare `tensor->GetDevice()` and `options.device()`, but at the callsite we actually just provide an option with `device_type`, which means the `device_id` will always be the default (-1) for `options`. For the tensor, although it is passed a `device` with the default `device_id`, when we allocate the data the `device` of the `tensor` is the `device` of its `Storage`, which is the `device` of the underlying `DataPtr`, which is the same as the `device` of the operator's `Context`, which has a non-default `device_id`.
Therefore every time we call `ReinitializeTensor`, we find that the `device` does not match, and after the call it still does not match. That's why we allocate a new Tensor every time, causing perf regressions for ops that use `ReinitializeTensor` on multiple GPUs.
Reviewed By: BIT-silence
Differential Revision: D13795635
fbshipit-source-id: 24d6afa1a0196a32eb0134ee08b4280244cdb0c3
Summary: Some automation to fix uninitialized members for caffe2 code. Ran canary to make sure I don't have any regression in prod, but not sure how to test comprehensively for caffe2
Reviewed By: ezyang
Differential Revision: D13776185
fbshipit-source-id: fb2a479971cc0276d8784be1c44f01252410bd24
Summary:
This PR adds support for overloaded functions as a step toward adding rnn modules to the JIT standard library.
Possible overloads must be manually specified, and when resolving the overload it chooses by the first one that passes the schema matching logic. The structure is very similar to boolean dispatch in #14425. The overload will only work on weak modules.
In order to avoid supporting overloaded methods in Python to match the JIT execution, the current setup offloads that work to the user. In the test added in `test_jit.py`, two methods are used to overload the `forward` method. In order to call `forward` outside the JIT, a Python-only `forward` that does the right argument type switching must also be provided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15556
Differential Revision: D13576348
Pulled By: driazati
fbshipit-source-id: 7d3bdd4ee5a6088cc20c92f26a696d1ee5b9204b
Summary:
- remove loop node that is guaranteed not to execute
- remove extra loop outputs that are no longer needed
- if we are inlining an if node, only run constant propagation on the block that will execute
- remove the recurse argument since we only expose the Graph Constant Propagation and it's not used
This also includes a few extra hooks to python_ir that I think make it a little be easier to test graph conditions from python.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16244
Differential Revision: D13791635
Pulled By: eellison
fbshipit-source-id: d16351fffcfc8013b02015db200f8fde002e0577
Summary:
- Fix environment variable used to guard binary uploads
- Move common MacOS brew setup-code into a common function to decrease code duplication and also to move that noisy console output into its own CircleCI step
- Split Mac builds into separate build-test and upload jobs. Add one of these jobs to PR runs; add upload jobs to nightly binarybuilds workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16016
Differential Revision: D13791084
Pulled By: pjh5
fbshipit-source-id: 8eeb8e1963d46eab84f0f6dad9f0265163d5bf73
Summary:
Otherwise, it won't work if we sync on this event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/8219
Reviewed By: pietern
Differential Revision: D13788657
Pulled By: teng-li
fbshipit-source-id: 8c96e9691ed2441d7a685fb7ae8fece906f58daf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16176
This makes PyTorch and Caffe2's data() method line up.
Historically, PyTorch made no distinction between tensors
with const or non-const data, and thus provided a
non-const pointer with data() member. Changing the API to
return a const-pointer would break all mutable code, whereas
changing the Caffe2 API to change a pointer doesn't break
any code, *except* for code which required an exact match
on const-ness (e.g., in template arguments). Since the latter
is less disruptive, we've opted for it here.
The few places downstream that broke due to this are fixed
in this patch.
Reviewed By: smessmer
Differential Revision: D13742916
fbshipit-source-id: baa4b4544cfdf7c1f369f4d69a1e0d5953c1bd99
Summary:
This PR applies a few minor modifications leading to 100s of additional matches
Modifications to native_functions.yaml
1) double to float
2) int64_t to int
3) IntList[\d*] to int[\d*]
4) {} to []
5) Tensor? x=[] to Tensor? x=None
6) TensorList to Tensor[]
7) 1e-x to 1e-0x
8) Generator* x = nullptr to Generator? x = None
9) `{.*}` to `[.*]`
Overall this adds about 300 new matches and brings us to about 1/2 compliance of native_functions func with their JIT signature equivalent
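A hypothetical before/after combining several of the rewrites above (the `foo` signature is illustrative, not taken from the actual yaml):

```yaml
# before
- func: foo(Tensor self, int64_t dim, double eps=1e-5, TensorList tensors, Generator* generator=nullptr) -> Tensor
# after
- func: foo(Tensor self, int dim, float eps=1e-05, Tensor[] tensors, Generator? generator=None) -> Tensor
```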
While this is still a draft, "tools/jit/gen_jit_dispatch.py" contains code to aid in finding close signatures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16111
Reviewed By: ezyang
Differential Revision: D13738123
Pulled By: cpuhrsch
fbshipit-source-id: d1ec1e089bdb26ec155f6f31ccf768270acb76c7
Summary:
This PR updates the logic for using cudnnGet* and cudnnFind*. Current version of cudnn find and get (v7) returns a pair of best algorithm and the convDesc mathType. While we were using the returned algorithm, we didn't update the mathType. As a result, we ended up with a slow choice of algorithm and math type. Without this patch, we are seeing a 10x regression in group convolutions.
Changelist:
- Changed the template arguments to be `perf_t` instead of `algo_t` to unify cudnnFind and cudnnGet. Both cudnnFind and cudnnGet have the same purpose and hence, it made sense to unify them and get rid of `getAlgorithm`.
- Used cudnnGet*_v7 everywhere cudnnGet* was being used.
- Removed all cudnn6 paths (This PR depends on https://github.com/pytorch/pytorch/pull/15851)
Differential Revision: D13787601
Pulled By: ezyang
fbshipit-source-id: 81fe86727673d021306fe1c99c3e528b7c9ad17f
Summary:
Tune elementwise kernel for AMD architectures by increasing the work group sizes and launch bounds. This change improves training throughput for torchvision models by up to 11% in our tests while exhibiting no significant performance regression.
No functional/performance change for CUDA - just shifting numbers into constexpr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16217
Differential Revision: D13776684
Pulled By: bddppq
fbshipit-source-id: edbaebe904598b2de66a9e9a68a1aa219ebc01e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16166
Since we now don't use std::function anymore, we can make kernel registration constexpr again.
Reviewed By: ezyang
Differential Revision: D13738630
fbshipit-source-id: 918fa3a3c8c6f0ddbd0f08b3b143cdf066265387
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16165
Store kernels as direct function pointers instead of std::function.
Using direct function pointers avoids a performance risk std::function would introduce.
Reviewed By: ezyang
Differential Revision: D13738627
fbshipit-source-id: a348906c8a201436699681980a82ca95065a06a0
Summary:
Partially fixes: https://github.com/pytorch/pytorch/issues/394
Implementation detail:
Codegen is modified to generate codes that looks like below:
```C++
static PyObject * THPVariable_svd(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "svd(Tensor input, bool some=True, bool compute_uv=True, *, TensorList[3] out=None)",
  }, /*traceable=*/true);
  ParsedArgs<6> parsed_args;
  auto r = parser.parse(args, kwargs, parsed_args);
  static PyStructSequence_Field fields0[] = {
    {"U", ""}, {"S", ""}, {"V", ""}, {nullptr}
  };
  static PyStructSequence_Desc desc0 = {
    "torch.return_types.svd_out", nullptr,
    fields0, 3
  };
  static PyTypeObject type0;
  static bool namedtuple_type_initialized0 = false;
  if (!namedtuple_type_initialized0) {
    PyStructSequence_InitType(&type0, &desc0);
    namedtuple_type_initialized0 = true;
  }
  static PyStructSequence_Field fields1[] = {
    {"U", ""}, {"S", ""}, {"V", ""}, {nullptr}
  };
  static PyStructSequence_Desc desc1 = {
    "torch.return_types.svd", nullptr,
    fields1, 3
  };
  static PyTypeObject type1;
  static bool namedtuple_type_initialized1 = false;
  if (!namedtuple_type_initialized1) {
    PyStructSequence_InitType(&type1, &desc1);
    namedtuple_type_initialized1 = true;
  }
  if (r.idx == 0) {
    if (r.isNone(3)) {
      return wrap(&type1, dispatch_svd(r.tensor(0), r.toBool(1), r.toBool(2)));
    } else {
      auto results = r.tensorlist_n<3>(3);
      return wrap(&type0, dispatch_svd(r.tensor(0), r.toBool(1), r.toBool(2), results[0], results[1], results[2]));
    }
  }
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
```
Types are defined as static member of `THPVariable_${op_name}` functions, and initialized at the first time the function is called.
When parsing function prototypes in `native_functions.yaml`, the parser will set the specified name as `field_name` when see things like `-> (Tensor t1, ...)`. These field names will be the field names of namedtuple. The class of namedtuples will be named `torch.return_types.${op_name}`.
In some Python 2 builds, `PyStructSequence` is not a subtype of tuple, so we had to create helper functions that check whether an object is a tuple or a namedtuple, for compatibility.
Operators in `native_functions.yaml` are changed such that only `max` and `svd` are generated as namedtuple. Tests are added for these two operators to see if the return value works as expected. Docs for these two ops are also updated to explicitly mention the return value is a namedtuple. More ops will be added in later PRs.
There is an issue with the Windows build where the linker is unable to resolve `PyStructSequence_UnnamedField`; a workaround is added to deal with this case.
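As a quick illustration of the new return type (using `torch.max`, one of the two converted ops):

```python
import torch

x = torch.arange(6.).reshape(2, 3)     # [[0, 1, 2], [3, 4, 5]]
result = torch.max(x, dim=0)           # an instance of torch.return_types.max
values, indices = result               # still unpacks like a plain tuple
named_values = result.values           # fields are also accessible by name
```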
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15429
Differential Revision: D13709678
Pulled By: ezyang
fbshipit-source-id: 23a511c9436977098afc49374e9a748b6e30bccf
Summary:
Initial enabling of the upcoming hip-clang compiler for the PyTorch source base.
Changes:
* update the Eigen submodule to a version including our upstreamed hip-clang enabling there
* modify a few ifdef guards with the `__HIP__` macro used by hip-clang
* use `__lane_id` instead of `hc::__lane_id`
* add Debug flags for ROCm to the cmake infrastructure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16085
Differential Revision: D13709459
Pulled By: ezyang
fbshipit-source-id: 1b7b33fe810a0434766180580d4443ea177eb7c7
Summary:
`torch.distributed.launch` does not raise an error when a `subprocess.Popen` process exits with a non-zero return code.
For easier debugging, it should always raise an error when launched processes behave abnormally.
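The stricter behavior can be sketched roughly like this (`check_exit` is an illustrative helper, not the launcher's actual code):

```python
import subprocess
import sys

def check_exit(cmd):
    """Wait for a launched process and raise if it exited non-zero."""
    process = subprocess.Popen(cmd)
    process.wait()
    if process.returncode != 0:
        raise subprocess.CalledProcessError(process.returncode, cmd)
    return process.returncode

rc = check_exit([sys.executable, "-c", "print('worker ok')"])
```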
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16069
Differential Revision: D13709467
Pulled By: ezyang
fbshipit-source-id: 31d32a5ec8fed7bccd62d845bfba0e670ed3fe20
Summary:
Save reallocation costs, by reserving vectors according to how many elements we expect to put in.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16201
Differential Revision: D13762594
Pulled By: ezyang
fbshipit-source-id: 7e3bfe421489dde48a2ddb0920dd155f69baecc0
Summary:
Fixed a few C++ API callsites to work with v1.0.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16221
Differential Revision: D13759207
Pulled By: yf225
fbshipit-source-id: bd92c2b95a0c6ff3ba5d73cb249d0bc88cfdc340
Summary:
Now it is only necessary to use 'develop' or 'install' to build. Incremental cmake is on by default. `develop --cmake` forces it to rerun.
The NinjaBuilder stuff is dead. It was used to make building _C.so
faster but now _C.so is just an empty stub file.
Removed a bunch of custom build commands from setup.py that are
no longer meaningful now that cmake handles most of the build.
Removed unused targets in build_pytorch_lib.sh/bat
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16162
Differential Revision: D13744155
Pulled By: zdevito
fbshipit-source-id: d836484782c65b7f8e8c7a82620886f7a7777892
Summary:
This PR does three things:
~~Allow `int64_t?` in function schema, which provide an elegant way of implementing null-able int arguments, as discussed in https://github.com/pytorch/pytorch/pull/15208#pullrequestreview-185230081~~
~~Originally implemented in https://github.com/pytorch/pytorch/pull/15235~~
~~Example:~~
```yaml
- func: myop(Tensor self, int64_t? dim=None) -> Tensor
  variants: function
```
~~cc: zou3519~~
Edit: implemented in https://github.com/pytorch/pytorch/pull/15234
Previously tried in https://github.com/pytorch/pytorch/pull/12064. There was a problem that C++ does not have kwarg support, which makes it confusing to know whether `unique(t, 1)` actually means `unique(t, dim=1)` or `unique(t, sorted=1)`.
Now I think I have a better idea on how to implement this: there are two ATen operators: `unique` and `unique_dim`. `unique` has the same signature as in python, and exported to both python and C++. `unique_dim` has signature `unique_dim(tensor, dim, sorted=False, return_inverse=False)`, and only exported to C++, which could be used more naturally for a C++ user.
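The resulting Python-facing behavior can be sketched as follows (`unique_dim` itself is the C++-only entry point):

```python
import torch

t = torch.tensor([1, 3, 2, 3])
# sorted unique values, plus indices mapping each input element
# back into the returned values tensor
values, inverse = torch.unique(t, sorted=True, return_inverse=True)
```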
Differential Revision: D13540278
Pulled By: wanchaol
fbshipit-source-id: 3768c76a90b0881f565a1f890459ebccbdfe6ecd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16175
Separate Moments from math and optimize it
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13742472
fbshipit-source-id: 90757d908d38c98ca69818855aaf68315e525992
Summary:
Submitting this PR as an update to existing PR (https://github.com/pytorch/pytorch/pull/15938) on houseroad 's request.
This PR replaces the use of ONNX op `ConstantLike` with `ConstantOfShape` in the ONNX exporter. In addition to removing the call sites in `symbolic.py`, it also replace the call site in `peephole.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16095
Differential Revision: D13745723
Pulled By: houseroad
fbshipit-source-id: e2a5f534f01adf199df9e27544f7afcfa540e1f0
Summary:
Resolves #15923 where LBFGS threw "Error: a leaf Variable that requires grad has been used in an in-place operation."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16167
Differential Revision: D13745822
Pulled By: soumith
fbshipit-source-id: 7d1d0511d06838c0c6f4c8a6b53cf15193283059
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16174
Our service creates a new caffe2 workspace for the same underlying network on multiple threads concurrently at service startup time (later these workspaces are being reused for sequential requests), resulting in concurrent quantization via FullyConnectedDNNLowPOp calling GetOrCreateFbgemmPackBMatrix(). The lazily performed quantizations during the first inference in each workspace are all funnelled through GetOrCreateFbgemmPackBMatrix()'s cache_mutex, which means quantization is serialized, so at service startup time only a single CPU core is being used for around a minute until the serial quantization is done.
A better solution would be to avoid quantizing the same weight matrix of the operator copies in different net copies to begin with, but this is the simpler solution for our current problem.
Reviewed By: jspark1105
Differential Revision: D13708785
fbshipit-source-id: 537519896b3b939c552d67f400bafc8a69ce11eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16135
Separate affine_channel from math and optimize it
i-am-not-moving-c2-to-c10
Reviewed By: houseroad
Differential Revision: D13727606
fbshipit-source-id: 8980af4afadaf964a18a9da581106fe30896a7e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16065
Before, we registered the caffe2 kernel with the c10 dispatcher using plain C types.
Now, we pass in IValues, which avoids the unwrapping inbetween.
Reviewed By: ezyang
Differential Revision: D13689036
fbshipit-source-id: b976a2c46a5a541f6a926b3df255e8a535e32420
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16051
This changes the kernels stored in the c10 dispatcher from plain C function pointers to IValue-based KernelFunction*.
Note that KernelFunction is currently taking an `ArrayRef<IValue>` as arguments. A later diff will change that to it taking a `Stack*`.
Reviewed By: ezyang
Differential Revision: D13684518
fbshipit-source-id: 1fa54f60cec2e967b92a4a043d6e3ac1627ed991
Summary:
This tests the water for adding back NNPACK in PyTorch, it's a lot better than the fallback THNN versions.
In #6151, we (ezyang and soumith) removed NNPACK support from PyTorch. Of course Maratyszcza might have advice, too. (Or an opinion on the CMake changes.)
The only functional changes are to use NNPack more aggressively on mobile and a .contiguous() to match NNPack's assumption (I stumbled over that while using NNPack for style transfer.)
The CMake changes try to use the NNPack we already have in git.
In terms of lines of code this is a large part of the diff of https://lernapparat.de/pytorch-jit-android/ . As far as I can tell, we don't have MKLDNN on mobile and the native THNN implementation are prohibitively expensive in terms of both CPU and memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15924
Differential Revision: D13709576
Pulled By: ezyang
fbshipit-source-id: f2e287739909451c173abf046588209a7450ca2c
Summary:
Partial fix for #15804, only w/o dim.
For jcjohnson benchmarking script I'm getting the following results on V100:
Before:
```
Running with N = 10000, M = 10000
cuda (no inverse): 0.98 ms
cpu (no inverse): 0.96 ms
cuda (with inverse): 1.07 ms
cpu (with inverse): 1.76 ms
Running with N = 10000, M = 100000
cuda (no inverse): 0.76 ms
cpu (no inverse): 1.53 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 3.02 ms
Running with N = 100000, M = 100000
cuda (no inverse): 1.28 ms
cpu (no inverse): 11.22 ms
cuda (with inverse): 69.76 ms
cpu (with inverse): 20.28 ms
Running with N = 100000, M = 1000000
cuda (no inverse): 0.78 ms
cpu (no inverse): 18.78 ms
cuda (with inverse): 133.45 ms
cpu (with inverse): 34.09 ms
Running with N = 500000, M = 500000
cuda (no inverse): 1.43 ms
cpu (no inverse): 61.13 ms
cuda (with inverse): 3315.18 ms
cpu (with inverse): 104.57 ms
Running with N = 500000, M = 5000000
cuda (no inverse): 0.86 ms
cpu (no inverse): 96.44 ms
cuda (with inverse): 5209.93 ms
cpu (with inverse): 176.10 ms
```
After
```
Running with N = 10000, M = 10000
cuda (no inverse): 1.04 ms
cpu (no inverse): 0.94 ms
cuda (with inverse): 0.64 ms
cpu (with inverse): 1.76 ms
Running with N = 10000, M = 100000
cuda (no inverse): 0.77 ms
cpu (no inverse): 1.55 ms
cuda (with inverse): 0.58 ms
cpu (with inverse): 2.79 ms
Running with N = 100000, M = 100000
cuda (no inverse): 1.30 ms
cpu (no inverse): 14.15 ms
cuda (with inverse): 1.63 ms
cpu (with inverse): 20.90 ms
Running with N = 100000, M = 1000000
cuda (no inverse): 0.82 ms
cpu (no inverse): 18.63 ms
cuda (with inverse): 0.61 ms
cpu (with inverse): 33.52 ms
Running with N = 500000, M = 500000
cuda (no inverse): 1.51 ms
cpu (no inverse): 59.81 ms
cuda (with inverse): 1.23 ms
cpu (with inverse): 110.69 ms
Running with N = 500000, M = 5000000
cuda (no inverse): 0.92 ms
cpu (no inverse): 104.26 ms
cuda (with inverse): 0.84 ms
cpu (with inverse): 187.12 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16145
Differential Revision: D13738821
Pulled By: soumith
fbshipit-source-id: 0811fb4ade47e3b466cebbc124e3f3333a986749
Summary:
It turns out that clang-tidy is bundled with travis's standard trusty distribution, so no need to install it manually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16164
Differential Revision: D13738986
Pulled By: suo
fbshipit-source-id: d0cd76c615625b2ed7f18951289412989f15849d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16049
We might see the pattern
```
if (scale_.numel() != N) {
  scale_->Resize(N);
  // set initial value for scale_
}
// In class:
Tensor scale_{CPU};
```
before in the code, where `scale_` is a member variable of Type `caffe2::Tensor`
This pattern actually serves two purposes, if `scale_` is partially initialized with device type but not size, this call will
initialize Tensor with the correct size, or if `scale_` is already initialized with size, it will check whether the size
matches a runtime value `N` and if not it will Resize. To rewrite this we'll do the following:
```
if (!scale_.defined() || scale_.numel() != N) {
  ReinitializeTensor(&scale_, {N}, at::dtype<float>().device(CPU));
  // set initial value for scale_
}
```
There are some variants, if `scale_` is resized to a constant size, we can call `ReinitializeTensor` instead
```
if (scale_.numel() != 1) {
  scale_->Resize(1);
}
```
-->
```
ReinitializeTensor(&scale_, {1}, at::dtype<float>().device(CPU));
```
Normal Resize will be refactored directly into ReinitializeTensor:
```
scale_->Resize(N);
```
-->
```
ReinitializeTensor(&scale_, {N}, at::dtype<float>().device(CPU));
```
Reviewed By: dzhulgakov
Differential Revision: D13667883
fbshipit-source-id: 2c7cb61544b72765b594011b99150eb5a1b50836
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16086
[caffe2] RNN operators should inherit step_net device_options
According to the NetDef documentation, if a network has a specific device option it applies to all network operators that do not explicitly specify one.
But this does not seem to be the case for RecurrentNetwork operators
Reviewed By: orionr
Differential Revision: D13699552
fbshipit-source-id: 14529bc9504e3b02f763e3c2429be21e46f82b68
Summary:
Add support for type inference for optional type refinement.
If a conditional is of the form "x is None" or "x is not None", or is a boolean expression containing multiple none checks, the proper type refinements are inserted in each branch.
For example:
if optional_tensor is not None and len(optional_tensor) < 2:
    # optional_tensor is a Tensor
if optional_tensor1 is not None and optional_tensor2 is not None:
    # both optional_tensor1 and optional_tensor2 are Tensors
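A minimal TorchScript sketch of the resulting refinement (the function name is illustrative):

```python
import torch
from typing import Optional

@torch.jit.script
def first_dim(x: Optional[torch.Tensor]) -> int:
    if x is not None and x.dim() > 0:
        # x is refined from Optional[Tensor] to Tensor in this branch
        return x.size(0)
    return -1
```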
TODO:
- not run an op for unchecked unwrap optional in the interpreter
- potentially refine types to prim::None (omitted for now to simplify things & because it's not an actual use case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15587
Differential Revision: D13733810
Pulled By: eellison
fbshipit-source-id: 57c32be9f5a09ab5542ba0144a6059b96de23d7a
Summary:
Mention that if enforce_sorted=True, the user can set
enforce_sorted=False. This is a new flag that is probably hard to
discover unless one thoroughly reads the docs.
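For reference, the flag in question, shown with toy data:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Two padded sequences of lengths 2 and 3 -- NOT sorted by decreasing
# length, which would normally raise without enforce_sorted=False.
seqs = torch.zeros(2, 3, 1)
lengths = torch.tensor([2, 3])
packed = pack_padded_sequence(seqs, lengths, batch_first=True,
                              enforce_sorted=False)
```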
Fixes #15567
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16084
Differential Revision: D13701118
Pulled By: zou3519
fbshipit-source-id: c9aeb47ae9769d28b0051bcedb8f2f51a5a5c260
Summary:
This PR fixes a race condition for the TCP init method, where the master rank can exit earlier than slave ranks, causing the TCP daemon thread to be shut down before other slaves are able to access it.
This change lets every rank (process) write a special key to the store to mark that it has completed (and thus is about to exit). The master rank (which hosts the server) always waits until all ranks have completed before completing itself.
This should fix: https://github.com/pytorch/pytorch/issues/15638
Tested using the repro of https://github.com/pytorch/pytorch/issues/15638 and works fine. Also test_distributed and test_c10d should have already had this coverage.
I had to make the rendezvous test in c10d use a world size of 1, since it is single-process code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15684
Differential Revision: D13570904
Pulled By: teng-li
fbshipit-source-id: 34f3bc471204bbd29320df359347ad5561c6b589
Summary:
Based on offline discussion it should be less surprising to the users of existing code. Thus caffe2::Tensor is now a move-only class (as it used to be), explicit calls to UnsafeSharedInstance() are necessary to get shared_ptr behavior.
This change also identified a few places that misused the copy constructor - those are fixed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15416
Reviewed By: Yangqing
Differential Revision: D13524598
fbshipit-source-id: aea12d6dff77342606fa88ce4ddddbff266245a7
Summary:
This PR inlines `Attributes` into `Node`. It helps to cleanup the code a little as everything is one place (some of the cleanups are included in the PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16098
Differential Revision: D13717637
Pulled By: ZolotukhinM
fbshipit-source-id: c54ae65178a95a01354688921a9ccb1ca699f8eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15856
They seem to be wrong.
cc zdevito to take a look but I think this is now more correct.
It's weird this didn't cause linker errors. Probably, this functionality isn't used across library boundaries yet.
Reviewed By: dzhulgakov
Differential Revision: D13605257
fbshipit-source-id: 7077ca9027c3ac79a4847ec15ead7ddb28696445
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15855
This is preparation work for moving IValue to c10.
Reviewed By: ezyang
Differential Revision: D13605259
fbshipit-source-id: cc545f582ab8607bb02aaf71273cb2710200b295
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16052
We need IValue to take/return Blob as an intrusive_ptr because we want to pass it around and Blob has disabled copying.
This is needed in a diff on top.
Reviewed By: ezyang
Differential Revision: D13684761
fbshipit-source-id: 7cb3d7e9fec39a2bc9f063d4d30404e6d7016eb2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16050
The c10 dispatcher will (soon) depend on IValue and IValue can't be moved to c10 yet because it depends on at::Tensor, which depends on legacy Type dispatch and we don't want the legacy dispatch in c10.
So instead, we move the c10 dispatcher back to ATen/core until we can actually move at::Tensor to c10.
Reviewed By: ezyang
Differential Revision: D13684517
fbshipit-source-id: 1125f4254223907c52f96ff73034f6d4ae9fd0a7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16104
PyTorch PR 15784 removed cuda-convnet from the contrib directory. This broke
some internal-only fb dependencies. Moving this to the internal area.
Reviewed By: ezyang
Differential Revision: D13709112
fbshipit-source-id: 2d7811545da67489869b59c350a29817eff693cf
Summary:
Some cleanup to wildcard handling, including one bugfix: previously, we were not considering writes to the wildcard set as part of the potential write set for nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16041
Differential Revision: D13705738
Pulled By: suo
fbshipit-source-id: acb8ccbaa70fe47445577ddf24a69f84630de411
Summary:
Confirmed on a local run that all the additional headers are present. This shouldn't be caught in any existing tests though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16124
Differential Revision: D13720773
Pulled By: pjh5
fbshipit-source-id: 22a42639f5649cac555ecc5a8b6760a8cbfcf01f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15843
RNN/LSTMs only need one bias vector, but our implementation uses two to be compatible with CuDNN. This diff adds a comment to explain this.
Reviewed By: ezyang
Differential Revision: D13602365
fbshipit-source-id: eef5bd9383d9f241dc0ef0472f753b4a44cc19b5
Summary:
Fixes #12643, amends #3341.
- Allow multidimensional input ~~(but apply softmax over `dim=-1`)~~ with `dim` argument
- Cleaner: Less lines of code
- Faster (1.32x speedup vs original, 2x speedup vs using `torch.Distributions`)
- Small fixes in docstring
- Remove some references in docstring. Was the linked (excellent) ipynb the first to do the straight-through trick? Instead, I propose changing the reference to the two papers best known for it.
- Add deprecationwarning for `eps`. It's not needed anymore.
- Initial commit keeps some code alternatives commented to exploit CI
- As of discussion when `gumbel_softmax` was added (#3341), this was merged into `torch.nn.functional` before all the work with `Distributions` and `Pyro`, and there will probably be multiple other best practices for this in the future.
I've tested building using the `Distributions`-api, but it was too slow, see below.
I therefore propose not using `Distributions` to keep it fast and simple, but adding a comment in docstring that `gumbel_softmax` may be deprecated in the future.
```
dist = torch.distributions.RelaxedOneHotCategorical(temperature=tau, logits=logits, validate_args=False)
y_soft = dist.rsample()
```
Pros:
* Built using tricks like `logsumexp` etc
* Explicitly uses `torch.distributions.utils._finfo` to avoid overflow (old implementation had an `eps` flag)
* Maintained for this exact purpose.
Cons:
* Very slow. Construction of distribution adds overhead see timings below. May be solved in future with speedups of `TransformedDistribution` and `Distribution`.
* Assumes which `dim` to apply softmax over.
```
y_soft = logits.new(logits.shape)
y_soft = (logits - y_soft.exponential_().log()) / tau # Gumbel noise
y_soft = y_soft.softmax(dim) # Gumbel softmax noise
```
Pros:
* Faster
```
import time
start = time.time()
num_draws = 1000000
logits = torch.randn(1, 3)
counts = torch.zeros(1, 3)  # accumulator for drawn one-hot samples
for draw in range(num_draws):
    y_draw = gumbel_softmax(logits, hard=True)
    counts = counts + y_draw
end = time.time()
print(end - start)
>> 12.995795965194702
>> 7.658372640609741
>> 20.3382670879364
```
Decide on which path to choose. I'll commit changes to the unit tests in a while to show that it passes both old and new tests. I'll also remove the commented code about `RelaxedOneHotCategorical`.
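For completeness, the functional entry point under discussion can be exercised like this:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 3, requires_grad=True)
# hard=True returns a one-hot sample in the forward pass while keeping
# the soft sample's gradient via the straight-through trick
y_hard = F.gumbel_softmax(logits, tau=1.0, hard=True)
```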
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13339
Differential Revision: D13092434
Pulled By: ezyang
fbshipit-source-id: 4c21788df336f4e9c2ac289022e395b261227b4b
Summary:
If "matches_jit_signature" is set to True for a particular function, we will assume that the func syntax follows the JIT signature syntax. This is a temporary attribute and doesn't need to be set by developers outside the core team. It serves as a means of tracking an ongoing schema unification with the goal of aligning func syntax with other components of PyTorch in order to reduce overall complexity and match coverage of different function descriptions.
Followup PRs might be about removing _out from native_functions.yaml and using Tensor annotations instead, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16040
Reviewed By: ezyang
Differential Revision: D13703176
Pulled By: cpuhrsch
fbshipit-source-id: ce248e1823a6f18efa95502f9f3eebf023b4a46c
Summary:
Without this "if", the code below throws "'Linear' object has no attribute '_buffers'".
With this "if", the error becomes "cannot assign buffer before Module.\_\_init\_\_() call", which I think is more accurate, just like register_parameter.
```
import math
import torch
from torch.nn.parameter import Parameter
from torch.nn import functional as F
from torch.nn import Module

class Linear(Module):
    def __init__(self, in_features, out_features, bias=True):
        self.in_features = in_features
        self.out_features = out_features
        self.register_buffer('test', torch.Tensor(out_features, in_features))
        self.weight = Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)
        super(Linear, self).__init__()
        self.reset_parameters()

    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)

    def forward(self, input):
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self):
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )

linear = Linear(3, 4)
```
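For contrast, the same buffer registration succeeds once `Module.__init__()` runs first; a minimal sketch (the class name is illustrative):

```python
import torch
from torch.nn import Module

class Buffered(Module):
    def __init__(self):
        super(Buffered, self).__init__()  # sets up _buffers before use
        self.register_buffer('test', torch.zeros(4, 3))

m = Buffered()
```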
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16110
Differential Revision: D13715839
Pulled By: soumith
fbshipit-source-id: c300eff0a8655aade448354cf489a592f7db722a
Summary:
respect grad guard for torch.jit._fork and torch.jit._wait.
Verified that the test failed without the fix, and pass with the fix.
Ideally I would like to enable and disable grad inside the forked function.
It doesn't seems like it's supported at this moment. This code handles that
as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16101
Differential Revision: D13708374
Pulled By: gqchen
fbshipit-source-id: 0533f080c4d0253fb4c61d2a0d3cc22de5721a09
Summary:
1) Reverts https://github.com/pytorch/pytorch/pull/12302 which added support for batched pdist. Except I kept the (non-batched) test improvements that came with that PR, because they are nice to have. Motivation: https://github.com/pytorch/pytorch/issues/15511
2) For the non-batched pdist, improved the existing kernel by forcing fp64 math and properly checking cuda launch errors
3) Added a 'large tensor' test that at least on my machine, fails on the batch pdist implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15901
Reviewed By: ezyang
Differential Revision: D13616730
Pulled By: gchanan
fbshipit-source-id: 620d3f9b9acd492dc131bad9d2ff618d69fc2954
Summary:
PR to update the shape notation for all of the torch.nn modules to take a unified form. The goal is to make these definitions machine-readable and checkable by unifying the style across all of the different modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15741
Differential Revision: D13709601
Pulled By: ezyang
fbshipit-source-id: fb89a03903fdf0cd0dcf76f3e469b8582b2f3634
Summary:
This issue was discovered by fehiepsi in https://github.com/uber/pyro/issues/1706 with the `log_prob` computation for Binomial, ~and can be seen with `torch.float32` when we have a combination of low probability value and high `total_count` - a test is added to capture this (since scipy only uses float64, the comparison is done using relative tolerance).~
The problem is in the code that tries to pull out the minimum values amongst the logits (written by me earlier, presumably to avoid numerical instability issues), but it is not needed.
EDIT: After a few attempts, I have been unable to reliably show that the change is more numerically stable, and have removed my previous test which fails on linux. The reason is that the issue manifests itself when `total_count` is high and `probs` is very low. However, the precision of `lgamma` when `total_count` is high is bad enough to wash away any benefits. The justification for this still stands though - (a) simplifies code (removes the unnecessary bit), (b) is no worse than the previous implementation, (c) has better continuity behavior as observed by fehiepsi in the issue above.
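The simplified `log_prob` reduces to the standard form below, shown as a pure-Python sketch (not the actual Binomial implementation) to make the lgamma-precision point concrete:

```python
import math

def binomial_log_prob(k, total_count, probs):
    # log C(n, k) + k*log(p) + (n - k)*log(1 - p). The log-binomial
    # coefficient uses lgamma, whose precision at large total_count
    # dominates the overall error (as noted above).
    log_binom = (math.lgamma(total_count + 1)
                 - math.lgamma(k + 1)
                 - math.lgamma(total_count - k + 1))
    return (log_binom + k * math.log(probs)
            + (total_count - k) * math.log1p(-probs))

# probability of exactly one head in two fair flips
p = math.exp(binomial_log_prob(1, 2, 0.5))
```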
cc. fehiepsi, alicanb, fritzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15962
Differential Revision: D13709541
Pulled By: ezyang
fbshipit-source-id: 596c6853b6e4d5fba42336afa168a665ab6fbde2
Summary: This PR aims to remove support for cuDNN 6.
Differential Revision: D13709595
Pulled By: ezyang
fbshipit-source-id: 853624db1cf66b0534d7028654c38c2806fb4107
Summary:
Idiomatic pyi files will fail with Python 2 flake8 even
though they would work with mypy. This is because pyi
files generally use Python 3 only syntax. No point
in linting them.
There are currently no pyi files checked in, this is purely
a prophylactic measure.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16105
Reviewed By: zou3519
Differential Revision: D13709409
Pulled By: ezyang
fbshipit-source-id: ec4a959e146f81ccb9533b04348be8dd78808421
Summary:
On some cloud-based x86 systems /sys/ is not mounted.
cpuinfo has a work-around for these systems, but it reports an error if sysfs files fail to read, and this error was confusing to some users (e.g. pytorch/cpuinfo#20). This update downgrades the error to a warning, so it is not reported with default configuration options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16107
Differential Revision: D13715243
Pulled By: soumith
fbshipit-source-id: f5c4c86422343ca449487f0185f3a8865ccf3b9d
Summary:
1. Added `torch/csrc/cuda/Event.h` and `torch/csrc/cuda/Event.cpp` to bind Python Event class to C++ implementation.
2. Move all CUDA runtime invocations from `torch/cuda/streams.py` to C++
3. Added tests to cover Stream and Event APIs. ~(event IPC handle tests is introduced in #15974)~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15937
Differential Revision: D13649001
Pulled By: mrshenli
fbshipit-source-id: 84ca58f35f6ba679a4ba33150ceba678d760d240
Summary:
- a typo fixed
- made the docs consistent with #5108
And maybe one more change is needed. According to the current docs
> The batch size should be larger than the number of GPUs used **locally**.
But shouldn't the batch size be larger than the number of GPUs used **either locally or remotely**? Sadly, I couldn't experiment this with my single GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16010
Differential Revision: D13709516
Pulled By: ezyang
fbshipit-source-id: e44459a602a8a834fd365fe46e4063e9e045d5ce
Summary:
There is a little error in the comment: since the dependency is "A->B", it is task B that must start after task A finishes, not "B".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15922
Differential Revision: D13709579
Pulled By: ezyang
fbshipit-source-id: 735afe83f4532b7c7456da3e96209b3e07071f37
Summary:
TensorProto.DataType in caffe2/proto/caffe2.proto has BYTE = 3 defined, while there is no corresponding TypeMeta defined in caffe2/core/types.cc: DataTypeToTypeMeta. This issue failed the C++ tutorial of MNIST + LMDB.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15627
Differential Revision: D13709602
Pulled By: ezyang
fbshipit-source-id: d4826d0f9b3975e6a8478d4bad1abbbedcaea197
Summary:
1. I fixed the importing process, which had some problems
- **I think `setup_helpers` should not be imported as the top level module. It can lead to many future errors. For example, what if `setup_helpers` imports another module from the upper level?** So we need to change it.
- The code is not consistent with other modules in the `tools` package. For example, other
modules in the package import `from tools.setuptools...`, not `from setuptools...`.
- **It should be able to run with `python -m tools.build_libtorch` command** because this module is a part of the tools package. Currently, you cannot do that and I think it's simply wrong.
~~2. I Added platform specific warning messages.
- I constantly forgot that I needed to define some environment variables in advance specific to my platform to build libtorch, especially when I'm working at a non pytorch root directory. So I thought adding warnings for common options would be helpful .~~
~~3. Made the build output path configurable. And a few other changes.~~
orionr ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15471
Differential Revision: D13709607
Pulled By: ezyang
fbshipit-source-id: 950d5727aa09f857d973538c50b1ab169d88da38
Summary:
If we use clang with SSE4 support, we will hit a function-redefinition
error between [1] and [2]. This patch adds some checks to fix the
problem.
I just turn on USE_NATIVE_ARCH with clang, then I hit the redefinition error.
[1]
caffe2/operators/quantized/int8_simd.h
[2]
third_party/gemmlowp/gemmlowp/fixedpoint/fixedpoint_sse.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13859
Differential Revision: D13095694
Pulled By: ezyang
fbshipit-source-id: c65166e4d5a04bb54e2b82c52740af00116ccb0d
Summary:
Use case:
Some data loader tests rely on `psutil` (a third party lib). So they are guarded by `skipIf`. But we want to always test them on CI envs. With `IS_PYTORCH_CI`, we can raise if `psutil` is not found.
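A minimal sketch of the pattern (the env-var name `IS_PYTORCH_CI` spelling and the helper shape here are assumptions for illustration):

```python
import os
import unittest

# Hypothetical: assume CI sets this env var.
IS_PYTORCH_CI = os.environ.get("IS_PYTORCH_CI", "0") == "1"

try:
    import psutil  # third-party; optional outside CI
    HAS_PSUTIL = True
except ImportError:
    HAS_PSUTIL = False
    if IS_PYTORCH_CI:
        # On CI we want a hard failure rather than a silent skip.
        raise RuntimeError("psutil is required on CI but was not found")

@unittest.skipIf(not HAS_PSUTIL, "psutil not found")
class TestDataLoader(unittest.TestCase):
    def test_workers(self):
        self.assertGreater(psutil.cpu_count(), 0)
```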
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16006
Reviewed By: ezyang
Differential Revision: D13673957
Pulled By: yf225
fbshipit-source-id: c63a7138093f45333c0b371fed0bcc88b67f2a22
Summary:
Adds support for `torch.norm`:
i. multiple dimensions for `dim`
ii. a `dtype` argument that specifies the math/output tensor type
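For intuition, the scalar case reduces to the usual p-norm; a pure-Python sketch (the `dim`/`dtype` generalizations are described in comments, not implemented here):

```python
def p_norm(values, p=2.0):
    # p-norm over a flat sequence. torch.norm's `dim` argument
    # generalizes this to reducing over several dimensions at once,
    # and `dtype` selects the type used for the accumulation/output.
    return sum(abs(float(v)) ** p for v in values) ** (1.0 / p)

assert p_norm([3, 4]) == 5.0        # Euclidean norm
assert p_norm([1, 1, 1, 1], p=1.0) == 4.0  # L1 norm
```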
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15414
Differential Revision: D13702022
Pulled By: ezyang
fbshipit-source-id: da2676f2b6aff988889b1539d0de8ecd4946823a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15912
Codemod generated with clangr shard mode, 25 files per diff,
To eliminate partially initialized Tensors, we split the initialization of local Tensor variables into two steps: first declare an uninitialized Tensor, then
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: dzhulgakov
Differential Revision: D13586734
fbshipit-source-id: 8485d2c51225343961351c7a2e8f95055534f9a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16081
A simple version of bound shape inference, conditioned on batch size. In addition to doing normal shape inference, it will change the batch size (1st dim of the shape) of the inputs as well as of batch-size-modulating ops such as `SparseLengthsSum`. Support for more ops, such as `SparseToDense`, is probably needed. We can build on this.
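A toy sketch of the "conditioned on batch size" step (the helper name and shape representation are hypothetical; real bound shape inference then propagates these shapes through the net):

```python
def rebind_batch_size(input_shapes, batch_size):
    # Rewrite the 1st (batch) dim of every input shape to the given
    # batch size; downstream inference must special-case ops that
    # modulate the batch dim, e.g. SparseLengthsSum.
    return {name: (batch_size,) + tuple(dims[1:])
            for name, dims in input_shapes.items()}

shapes = rebind_batch_size({"X": (1, 16), "Y": (1, 8)}, 32)
```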
Reviewed By: jackm321, rdzhabarov
Differential Revision: D13661968
fbshipit-source-id: 6a724a647e109757c26e3e26e15a49725ecc75cc
Summary:
1. Port the FractionalMaxPool3d implementation from THNN/THCUNN to ATen.
2. Expose this function to Python module nn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15575
Differential Revision: D13612848
Pulled By: chandlerzuo
fbshipit-source-id: 5f474b39005efa7788e984e8a805456dcdc43f6c
Summary:
The cumsum over the probabilities can be not monotonically
non-decreasing. Thus it is hard to detect zero probability
classes using just the cumsum.
This changes the binary search postprocessing to use the
(non-cumulated) distribution instead.
Thank you, jcjohnson, for the bug report with
reproducing case.
Fixes: #13867
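A pure-Python sketch of the corrected sampling logic (not the CUDA kernel): binary-search the cumulative distribution, then post-process with the *non-cumulated* probabilities to step past zero-probability classes:

```python
import bisect
import random

def sample_index(probs):
    # Inclusive cumulative distribution.
    cum, total = [], 0.0
    for p in probs:
        total += p
        cum.append(total)
    u = random.uniform(0.0, total)
    idx = bisect.bisect_left(cum, u)
    # The cumsum can be flat (not strictly increasing) at classes with
    # zero probability, so check the raw distribution and step past
    # any class whose own probability is zero.
    while probs[idx] == 0.0 and idx + 1 < len(probs):
        idx += 1
    return idx

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_index([0.5, 0.0, 0.5])] += 1
assert counts[1] == 0  # the zero-probability class is never sampled
```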
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16075
Differential Revision: D13695565
Pulled By: soumith
fbshipit-source-id: 02c4d6f868f0050c1ae7d333f4317c5610e49cd9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16047
Implements a single thread-safe map enabling sharing of the generated graph between
different ops.
Added `model_id` to every onnxified op to help create a unique id in the map.
Some formatting fixes.
Reviewed By: yinghai
Differential Revision: D13663927
fbshipit-source-id: 27417e8fe752fdd48abb6a87966cd76d592e1206
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16061
I discovered I needed to delete these names in preparation of moving
THCCachingAllocator to c10_cuda; might as well also fix all the other
sites too.
Reviewed By: dzhulgakov
Differential Revision: D13686869
fbshipit-source-id: e8cc55d39ac4bfd3e3a22c761f89a7a111ce5f5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16059
Just deleted all __cplusplus ifdef guards; we only ever use
these headers in C++ contexts.
Reviewed By: dzhulgakov
Differential Revision: D13686580
fbshipit-source-id: ce28c4a32f3596bfb17aeeb34904a02899991453
Summary:
The correct logic is as follows:
* If there is an earlier split, we need to combine with its result
* If there is *not* a later split, we need to project before saving into the output.
This should partially fix #15837. For example:
```
In [7]: a=torch.ones([1838860800], dtype=torch.float, device="cuda:1")
In [8]: a.mean()
Out[8]: tensor(1., device='cuda:1')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16023
Differential Revision: D13678449
Pulled By: umanwizard
fbshipit-source-id: ab5078484c88e96bb30121b5cf24a0e8b0a8c2f8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15814
Plan is to remove the APIs we want to deprecate one by one and make sure it still builds in sandcastle and ossci
Reviewed By: ezyang
Differential Revision: D12812029
fbshipit-source-id: ea0c3dd882bec95fcd4507160ebc61f598b6d040
Summary:
Treat GenericList similarly to tuples and TensorList: recursively unpack them and assignValueTrace accordingly. Also add interpreter support for ListUnpack on GenericList
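The recursive-unpacking idea can be sketched in pure Python (a stand-in for the tracer's handling of tuples/TensorList/GenericList, not the actual implementation):

```python
def flatten(value):
    # Recursively unpack tuples and lists into a flat sequence of leaf
    # values, mirroring how the tracer assigns a value trace to each
    # element of a nested container.
    if isinstance(value, (tuple, list)):
        out = []
        for v in value:
            out.extend(flatten(v))
        return out
    return [value]

assert flatten((1, [2, (3, 4)])) == [1, 2, 3, 4]
```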
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15969
Differential Revision: D13665139
Pulled By: jamesr66a
fbshipit-source-id: cd8cb3dd7475f424e48a69d217f2eac529df9f6a
Summary:
This puts stubs in the autograd profiler for the use of cuda APIs allowing the cuda parts of libtorch to be linked separately from the CPU parts.
This also edits the buck build.
Previous:
For GPU builds:
_C -> csrc -> caffe2
For CPU builds:
_C -> csrc-cpu -> caffe2
Now:
GPU:
_C -> libtorch_cuda -> (libtorch -> caffe2, for CPU)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15898
Reviewed By: ailzhang
Differential Revision: D13617991
Pulled By: zdevito
fbshipit-source-id: 6d84a50bb356a54b4217f93219902755601b00e1
Summary:
1. Add some gloo communication operators into related fallback list;
2. Work around to avoid compiling errors while using fallback operator whose CPU operator inherits from 'OperatorBase' directly like PrefetchOperator;
3. Add new cpu context support for some python module files and resnet50 training example file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11330
Reviewed By: yinghai
Differential Revision: D13624519
Pulled By: wesolwsk
fbshipit-source-id: ce39d57ddb8cd7786db2e873bfe954069d972f4f
Summary:
Previously we were only constant propping prim::Constants, but we should be constant propping prim::None as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15979
Differential Revision: D13664692
Pulled By: eellison
fbshipit-source-id: 01839403576c21fc030c427e49275b8e1210fa8f
Summary:
Similarly to https://github.com/pytorch/pytorch/pull/13777, we apply post-processing quantization to RNN cell modules (`RNNCell`, `LSTMCell`, and `GRUCell`).
A further follow-up PR will involve quantizing the full `RNN`, `GRU`, and `LSTM` modules. This depends on those modules being scriptable as part of the standard library scripting effort, though. Note that infrastructure in this pr such as `gather_quantized_params` is currently unused but should be used in the future when we can port over the full RNN modules.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15469
Differential Revision: D13545802
Pulled By: jamesr66a
fbshipit-source-id: ad3b694517842893ea619438e9f5e88fd7b96510
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/16020
Needs to go over more iterations. For conv, I think we need a high-level interface that abstracts out the low-level details of which code path will be taken (acc16, outlier-aware, depth-wise, group conv, ...); otherwise the client code will be complex, as can be seen from the DNNLOWP Conv ops. This will also help us make the interface more stable.
Reviewed By: dskhudia, jianyuh
Differential Revision: D13588996
fbshipit-source-id: 9afce9e441bcaf20437fcc2874fb9d4165a46bcb
Summary:
Timings are the same as for `std`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15892
Differential Revision: D13651173
Pulled By: umanwizard
fbshipit-source-id: a26bf1021dd972aa9e3e60fb901cd4983bfa190f
Summary:
Use new test utils in converter_nomnigraph_test , and add utils to set device option name, external inputs, outputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15751
Differential Revision: D13586228
Pulled By: duc0
fbshipit-source-id: ff809dd7bf9f30641ce2a6fef7e2810f005521c2
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc meganset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15694
Differential Revision: D13573064
Pulled By: zou3519
fbshipit-source-id: 1d0b693d7c26db91826b81e6c98b45a69b5e9bc4
Summary:
In #15964, I learned that `errno` is only meaningful if the function call fails. E.g., on some macOS versions, a successful `fork()` sets `errno` to `EINVAL` in the child process. This commit changes the `SYSCALL` macro so error checking is only done when an error happens. This means checking whether `rv == -1` for most calls, but checking `rv == nullptr` for `inet_ntop`.
Now `SYSCALL` accepts a second argument `success_cond`, which should be an expression returning whether the call succeeded. `SYSCHECK_ERR_RETURN_NEG1` is the shorthand for checking if rv is `-1`.
Any suggestion on better macro names is welcomed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15986
Reviewed By: janewangfb
Differential Revision: D13661790
Pulled By: pietern
fbshipit-source-id: 9551b14b9f88805454a7bfb8e4d39e0f3aed8131
Summary:
bypass-lint
- Change all Caffe2 builds to use setup.py instead of cmake
- Add a -cmake- Caffe2 build configuration that uses cmake and only builds cpp
- Move skipIfCI logic from onnx test scripts to the rest of CI logic
- Removal of old PYTHONPATH/LD_LIBRARY_PATH/etc. env management
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15917
Reviewed By: orionr
Differential Revision: D13637583
Pulled By: pjh5
fbshipit-source-id: c5c5639db0251ba12b6e4b51b2ac3b26a8953153
Summary:
In Python, you can use the call operator to invoke the `forward()` method of a module. In C++ this was currently not possible, because I couldn't figure out how to deduce the return type of a module's `forward()` method under the constraint that `forward()` may not exist at all (since the base module class in C++ does not mandate a `forward()` method). I now figured it out, so the call operator can be used.
ezyang ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15831
Differential Revision: D13652676
Pulled By: goldsborough
fbshipit-source-id: ccab45a15215dda56460e560f0038781b539135f
Summary:
- Fixed a few typos and grammar errors.
- Changed the sentences a bit.
- Changed the format of the tuples to be consistent with padding notations in the other places. For example, `ReflectionPad2d`'s dostring contains :math:`H_{out} = H_{in} + \text{padding\_top} + \text{padding\_bottom}`.
I also made sure that the generated html doesn't break.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15984
Differential Revision: D13649939
Pulled By: soumith
fbshipit-source-id: 0abfa22a7bf1cbc6546ac4859652ce8741d41232
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15996
As reported in issue #15911, gcc 4.9 was hitting an internal compiler error due to a complex use of lambda functions in conv_dnnlowp_op.cc and conv_acc16_op.cc. This diff simplifies them.
Reviewed By: viswanathgs
Differential Revision: D13648264
fbshipit-source-id: 1551ae8a0a7653749185dca51ccceb2471b96b82
Summary:
Tested locally. It can now be started by running `set EXTRA_CAFFE2_CMAKE_FLAGS= -DTORCH_STATIC=1` before the build. If we want to make sure it works, maybe we should add it to CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15989
Differential Revision: D13649935
Pulled By: soumith
fbshipit-source-id: 956945ed572819d8cf0bc9bd48df3ea9bc6f4a8a
Summary:
This is follow up on #13945 where we had to turn off some TRT tests because some ops were not ready to accept ONNX opset 9+ models. This PR fixes Reshape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15380
Differential Revision: D13649825
Pulled By: houseroad
fbshipit-source-id: b72e62803de5b63cc001c3fe4b3bf64dfa996e94
Summary:
For inference, if the StopGradient op is in-place, we just remove it.
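A minimal sketch of such a removal pass (op records here are simplified stand-ins for Caffe2 OperatorDefs, not the real net representation):

```python
def remove_inplace_stop_gradient(ops):
    # An in-place StopGradient (input blob == output blob) is a no-op
    # at inference time, so the pass simply drops it from the net.
    return [op for op in ops
            if not (op["type"] == "StopGradient"
                    and op["input"] == op["output"])]

net = [{"type": "StopGradient", "input": "x", "output": "x"},
       {"type": "Relu", "input": "x", "output": "y"}]
pruned = remove_inplace_stop_gradient(net)
```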
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12152
Differential Revision: D13633946
Pulled By: yinghai
fbshipit-source-id: 57762bcc37b38a1d39cb4af316ca50bfe961b105
Summary:
This is the first of several PRs to simplify AliasDb usage.
- Hide the concept wildcards from users. They are too hard to think about and too easy to forget about.
- Start moving "mutability-safe" graph mutation methods into AliasDb (right now, the various methods that deal with topological move).
Eventually I want to create a "mutability-aware" handle to the graph. If you only use that handle to transform the graph, you can be sure that all transformations are safe with respect to mutability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15656
Differential Revision: D13615492
Pulled By: suo
fbshipit-source-id: 5c39a157b4ea76f1f976315d06a314a89cc4f22f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15908
"OperatorBase::" is changed to "this->template ".
For example,
# This no longer works
OperatorBase::GetSingleArgument<>()
# Should change to:
this->template GetSingleArgument<>()
https://fb.workplace.com/groups/101100140348621/permalink/576804082778222/
Follow up of D13574832.
Sample Diff:
D9319742, D10045844.
Reviewed By: jspark1105
Differential Revision: D13613574
fbshipit-source-id: 2cb4094557b4af78d41e289816cad3e1194fb82c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15967
Codemod generated with clangr shard mode, 25 files per diff,
To eliminate partially initialized Tensors, we split the initialization of local Tensor variables into two steps: first declare an uninitialized Tensor, then
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: smessmer
Differential Revision: D13586735
fbshipit-source-id: eae2d79e1107a2e813ce3809e690af4706aaa9ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15947
Codemod generated with clangr shard mode, 25 files per diff,
To eliminate partially initialized Tensors, we split the initialization of local Tensor variables into two steps: first declare an uninitialized Tensor, then
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: smessmer
Differential Revision: D13586732
fbshipit-source-id: 5295ab27ca0155f96a4fccf9c0ba8a609101ba24
Summary:
While integrating fork/join into production translation, we found that trying to export `transpose()` where the input is of `TensorType` (rather than `CompleteTensorType`) failed. This is not ideal, since `TensorType` still contains the number of dimensions of the tensor, and that's all the `transpose` symbolic needs.
This PR introduces a pybind binding for `dim()` on `TensorType` (and `CompleteTensorType` by inheritance). We now use this in places where it logically makes sense in the symbolics: those symbolics which only require knowledge of the number of dimensions rather than concrete sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15933
Differential Revision: D13639657
Pulled By: jamesr66a
fbshipit-source-id: 6e50e407e93060085fd00a686a928764d0ec888d
Summary:
Implemented the LeakyRelu operator for MKL-DNN; the speed-up of a single operation is up to 10X on BDW.
Implemented the reshape operator for MKL-DNN; it resolves an occasional crash seen when using the fallback reshape operator.
Implemented the CreateBlobQueue and SafeEnqueueBlobs operators; they resolve a crash seen when using the fallback operators.
Fell back to the CreateBlobsQueueDB, TensorProtosDBInput, and CloseBlobsQueue operators.
Implemented the adam operator for MKL-DNN; the speed-up of a single operator is up to 6X on BDW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/11696
Reviewed By: yinghai
Differential Revision: D10100438
Pulled By: wesolwsk
fbshipit-source-id: 0b6e06897cc11e0a8e349d80a870b1e72e47f10d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15692
It was leading to occasional crashes with dynamically linked CUDA because the runtime was already destroyed.
Also, unique_ptr<T[]> is more suitable than deque<T> for the purpose.
Reviewed By: Yangqing
Differential Revision: D13571988
fbshipit-source-id: 37eb26dfbe361c49160367b53f87bd037c6c0e46
Summary:
That makes the definition of a "fusable node" much simpler,
as we don't need to keep considering whether something has to be an
"exit node" at every step. The fuser now tries to maximize the
pointwise fusions first, and proceeds to prepending chunks and appending
concats only once a fixed point is reached.
This patch not only makes the fuser much simpler to reason about,
it also makes it significantly easier to implement features like SumToSize
fusion, improving the performance of derivative graphs.
cc zou3519 mruberry
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15633
Differential Revision: D13575306
Pulled By: zou3519
fbshipit-source-id: 0c55ea61d65d1f1ed3d75a8e1e83bc85a83f3aff
Summary:
Adding bindings for `.cpu()` and `.cuda()` to script.
It's worth noting that if the device remains unchanged, the returned tensor aliases the input, but if the device does change, they do not alias each other.
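The aliasing contract can be sketched with a toy class (purely illustrative; `FakeTensor`/`to_device` are not PyTorch APIs):

```python
class FakeTensor:
    def __init__(self, device):
        self.device = device

    def to_device(self, device):
        # If the device is unchanged, return self (the result aliases
        # the input); otherwise construct a new tensor on the target
        # device -- mirroring the .cpu()/.cuda() aliasing behavior.
        return self if device == self.device else FakeTensor(device)

t = FakeTensor("cpu")
assert t.to_device("cpu") is t       # same device: aliases the input
assert t.to_device("cuda") is not t  # device change: fresh tensor
```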
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15904
Differential Revision: D13632879
Pulled By: eellison
fbshipit-source-id: 024a04f267909674aa1e510562efd9cb081f407c
Summary:
4GB is still too large and leads to CUDA OOM failures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15959
Differential Revision: D13635146
Pulled By: mrshenli
fbshipit-source-id: 3dc34a03d6ed65c458839d8fa37cd05bf3bc8106
Summary:
Turns out this has basically been implemented already in Resize.h / Resize.cuh.
Also added some testing, basically just to check that empty_strided behaves equivalently to as_strided.
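For reference, the layout contract that `as_strided` (and now `empty_strided`) expose can be sketched in pure Python:

```python
def strided_get(storage, strides, index, offset=0):
    # Map a multi-dimensional index to a flat storage position via
    # strides: pos = offset + sum(index[d] * strides[d]).
    pos = offset
    for i, s in zip(index, strides):
        pos += i * s
    return storage[pos]

storage = list(range(12))
# A 3x2 "view" with strides (1, 3): stepping dim 0 moves 1 element in
# storage, stepping dim 1 moves 3 elements.
value = strided_get(storage, (1, 3), (2, 1))  # position 2*1 + 1*3 = 5
```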
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15948
Differential Revision: D13631098
Pulled By: gchanan
fbshipit-source-id: eb0e04eead45e4cff393ebde340f9d265779e185
Summary:
This PR moves `deviceProperties` from `THCState` struct to `CUDAContext` in ATen and hence, takes one more step towards removing `THCState`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14834
Differential Revision: D13633956
Pulled By: soumith
fbshipit-source-id: 51820ac224fc566f17aa92570fd378cff4248596
Summary:
When compiling for `TORCH_CUDA_ARCH_LIST=7.5` we were getting ptxas warnings (https://github.com/pytorch/pytorch/issues/14310). This was because we had some hardcoded values when using launch_bounds in kernels. The maximum number of threads per multiprocessor is 1024 for Turing architecture (7.5) but 2048 for previous architectures. The hardcoded launch_bounds in the kernel were requesting for 2048 threads when compiling for Turing and hence were generating the warning.
This PR adds a macro that checks for the bounds on the launch bounds value supplied. The max number of threads per block across all architectures is 1024. If a user supplies more than 1024, I just clamp it down to 512. Depending on this value, I set the minimum number of blocks per sm. This PR should resolve https://github.com/pytorch/pytorch/issues/14310. The gradient computation being wrong reported in that PR is probably due to the faulty card.
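The clamping logic can be sketched as follows (pure-Python stand-in for the C++ macro; the 2048 threads-per-SM figure is the pre-Turing value and the derivation of min blocks-per-SM is an assumption for illustration):

```python
def clamp_launch_bounds(max_threads_per_block, threads_per_sm=2048):
    # The per-block thread limit is 1024 across all architectures;
    # requests above it fall back to 512, and the minimum number of
    # blocks per SM is then derived from the clamped value.
    if max_threads_per_block > 1024:
        max_threads_per_block = 512
    min_blocks_per_sm = threads_per_sm // max_threads_per_block
    return max_threads_per_block, min_blocks_per_sm

assert clamp_launch_bounds(2048) == (512, 4)   # over-limit request clamped
assert clamp_launch_bounds(1024) == (1024, 2)  # in-range request kept
```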
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15461
Differential Revision: D13633952
Pulled By: soumith
fbshipit-source-id: 795aa151109f343ab5433bf3cb070cb6ec896fff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15884
Codemod generated with clangr shard mode, 25 files per diff,
To eliminate partially initialized Tensors, we split the initialization of local Tensor variables into two steps: first declare an uninitialized Tensor, then
call `ReinitializeTensor` to initialize it.
motivation: https://github.com/pytorch/pytorch/pull/12407
Reviewed By: hyuen
Differential Revision: D13586737
fbshipit-source-id: dc8e49e9f29505b8898bb19f84c1a983f2d811ab
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15407
Don't ask the tensor for its intrusive pointer if we just want to check if two tensors are the same.
This mirrors ATen APIs.
Reviewed By: dzhulgakov
Differential Revision: D13520389
fbshipit-source-id: 681317f36f480ab60e532bb08a073f98f39770fd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15316
This starts cleaning up the files in c10 according to the module structure we decided on.
Move to c10/util:
- Half.h, Half-inl.h, Half.cpp, bitcasts.h
Move to c10/core:
- Device.h, Device.cpp
- DeviceType.h, DeviceType.cpp
i-am-not-moving-c2-to-c10
Reviewed By: dzhulgakov
Differential Revision: D13498493
fbshipit-source-id: dfcf1c490474a12ab950c72ca686b8ad86428f63
Summary:
Unfortunately I do not know how to test this without merging it first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15934
Reviewed By: orionr
Differential Revision: D13627472
Pulled By: pjh5
fbshipit-source-id: 35eced1483bbf3c0c3f6f62fb7bbbf2f200e50e6
Summary:
Wasn't clearing optimizer buffers before adding new entries to it during deserialization. Successive calls to `torch::load` with the same optimizer would just append to the buffer container. Also moved `serialize()` function from `torch::optim::detail` into `torch::optim` so users can use it for custom optimizers.
Fixes #15792
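The clear-before-load fix generalizes beyond C++; a toy Python sketch of the bug and the fix (the `Optimizer` class here is illustrative, not the libtorch API):

```python
class Optimizer:
    def __init__(self):
        self.buffers = []

    def load(self, saved_buffers):
        # Clear existing entries first; without this, successive load()
        # calls on the same optimizer would append to the buffer
        # container instead of replacing its contents.
        self.buffers.clear()
        self.buffers.extend(saved_buffers)

opt = Optimizer()
opt.load([1, 2])
opt.load([1, 2])  # second load must not double the buffers
assert opt.buffers == [1, 2]
```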
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15926
Differential Revision: D13623615
Pulled By: goldsborough
fbshipit-source-id: e193091f25f56a95f2a9648af312cb7caa45f300
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15876
Build changes made it so some .so libraries are now registered after GlobalInit is called. Although this shouldn't be common, it also shouldn't be explicitly excluded. These changes allow for late Caffe2 registration, but also warn in that case.
Reviewed By: kuttas
Differential Revision: D13608186
fbshipit-source-id: 0ca7bcd32516d374077db0c2548cf8c28ccdd5f6
Summary:
Currently these tests are taking most of the time in test_jit.py run, with the
proposed changes the testing time is reduced by ~75%:
```
TestEndToEndHybridFrontendModels.test_neural_style: 203.360s -> 10.650s
TestEndToEndHybridFrontendModels.test_snli: 422.315s -> 9.152s
TestEndToEndHybridFrontendModels.test_super_resolution: 73.362s -> 19.185s
time python test/test_jit.py (real): 13m50.828s -> 3m11.768s
time python test/test_jit.py (user): 85m59.745s -> 13m18.135s
time python test/test_jit.py (sys): 144m9.028s -> 25m58.019s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15906
Differential Revision: D13619659
Pulled By: ZolotukhinM
fbshipit-source-id: 6c22d8740f8ddb865c3a0667af32653723383816
Summary:
Other changes:
1. Avoided using `THCDeviceTensor` by re-calculating the mapping from cuda (blockIdx, threadIdx) to input/output tensor index.
2. Changed Camelcase naming to underscore naming.
Differential Revision: D13546803
fbshipit-source-id: 1df54f13e64934da3d803d9b6586bd5208d42d6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15841
Fix the bugs in dnnlowp to support int8/int16 quantization for sparsenn.
Reviewed By: jspark1105
Differential Revision: D13600878
fbshipit-source-id: 27f06d7c54a663208320c8f211714220a9b49540
Summary:
Python 2 doesn't allow invoking `exec` from a nested function:
```
  File "test/test_jit.py", line 4653
    exec(code, globals(), scope)
SyntaxError: unqualified exec is not allowed in function 'test' it is a nested function
```
This patch wraps exec with a separate function, making it work for both python2
and python3.
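A minimal sketch of the workaround pattern (names are illustrative; in Python 3, where `exec` is a function, the wrapper is simply a no-op pass-through):

```python
# Factoring the exec call into a module-level helper sidesteps
# Python 2's restriction on unqualified exec inside nested functions.
def _exec_wrapper(code, glob, loc):
    exec(code, glob, loc)

def test():
    def nested():
        scope = {}
        _exec_wrapper("x = 1 + 1", globals(), scope)
        return scope["x"]
    return nested()

assert test() == 2
```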
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15882
Differential Revision: D13614235
Pulled By: ZolotukhinM
fbshipit-source-id: 9a074308c2379f089402e0bf5a996cc649d6dbca
Summary:
Optimized the CPU version of nonzero. It is now 2x faster on average than numpy.
Can be further optimized for 1D tensors and boolean tensors.
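For reference, the semantics being optimized reduce (in 1-D) to collecting the indices of non-zero entries, sketched in pure Python:

```python
def nonzero_1d(values):
    # Indices of non-zero entries for a 1-D input; torch.nonzero
    # generalizes this to one row of coordinates per non-zero element.
    return [i for i, v in enumerate(values) if v != 0]

assert nonzero_1d([0, 3, 0, 1, 2]) == [1, 3, 4]
```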
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15190
Differential Revision: D13468570
Pulled By: VitalyFedyunin
fbshipit-source-id: e55ce54d60626a42d9a10a02e407856458b8055e
Summary:
macos builds are broken now with the following error:
```
/usr/local/Homebrew/Library/Homebrew/config.rb:39:in `initialize': no implicit conversion of nil into String (TypeError)
from /usr/local/Homebrew/Library/Homebrew/config.rb:39:in `new'
from /usr/local/Homebrew/Library/Homebrew/config.rb:39:in `<top (required)>'
from /usr/local/Homebrew/Library/Homebrew/vendor/portable-ruby/2.3.7/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /usr/local/Homebrew/Library/Homebrew/vendor/portable-ruby/2.3.7/lib/ruby/2.3.0/rubygems/core_ext/kernel_require.rb:55:in `require'
from /usr/local/Homebrew/Library/Homebrew/global.rb:25:in `<top (required)>'
from /usr/local/Homebrew/Library/Homebrew/brew.rb:13:in `require_relative'
from /usr/local/Homebrew/Library/Homebrew/brew.rb:13:in `<main>'
Exited with code 1
```
No recent commits look suspicious, and I can even reproduce locally on my macbook, so it might be related to some new `brew` updates. Empirically, calling `brew update` first seems to fix this.
Example error build: https://circleci.com/gh/pytorch/pytorch/534392?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15873
Differential Revision: D13608019
Pulled By: soumith
fbshipit-source-id: 1499cb5246929e275a11ca6fccef6ef32918e45e
Summary:
This was causing a problem in #15735 but appears to have been fixed.
Adding this test to prevent regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15835
Differential Revision: D13600282
Pulled By: zou3519
fbshipit-source-id: d9939e74d372be71c50122a5f6a615fbd7fa4df6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15865
Factored out code used in the tests for the Add, Mul, and Sub operators
into two new methods: one to generate the test vectors, and a second
one to run the actual tests given a caffe2 and Python operator.
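The shape of that refactor can be sketched roughly like this; the helper names and the toy operators are illustrative, not the actual caffe2 test code:

```python
import random

def make_test_vectors(seed=0, n=4):
    # First helper: deterministically generate input pairs for the tests.
    rng = random.Random(seed)
    return [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]

def check_binary_op(op, reference, vectors):
    # Second helper: run a candidate operator against a reference
    # implementation on every generated vector.
    for a, b in vectors:
        assert abs(op(a, b) - reference(a, b)) < 1e-9

check_binary_op(lambda a, b: a + b, lambda a, b: a + b, make_test_vectors())
```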
Reviewed By: houseroad
Differential Revision: D13526955
fbshipit-source-id: 8970ba5a1305ca19a54a14b51816d4a19f19d678
Summary:
I fixed a very small extra parenthesis in a doctest.
I'm also going to use this issue as a place to propose the eventual inclusion of xdoctest (a pip-installable library I wrote) in PyTorch's test suite. I think there are a lot of problems with Python's built-in doctest module, and I've built xdoctest to fix them. I would love for my project to get some exposure, and its addition to PyTorch may benefit both projects. Please see the readme for more details on what xdoctest brings to the table over the builtin doctest module: https://github.com/Erotemic/xdoctest
I came across this small syntax error while ensuring xdoctest was compatible with PyTorch. It isn't 100% there yet, but I'm working on it. My goal is to make xdoctest compatible with all of torch's doctests out of the box before writing up the PR. I'm also airing the idea out loud before I commit too much time into this (or get my hopes up), so I'm attaching this little blurb to a no-brainer-merge PR to (1) demonstrate a little bit of value (xdoctest flagged this syntax error) and (2) see how it's received.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15646
Differential Revision: D13606111
Pulled By: soumith
fbshipit-source-id: d4492801a38ee0ae64ea0326a83239cee4d811a4
Summary:
Thank you, freesouls, for the reproducing example!
This strictly fixes the bug in gradients for varying-length inputs discussed in the middle-to-bottom of the bug report. I'll have a separate feature patch for inf losses -> NaN grads.
Fixes: #14401
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15798
Differential Revision: D13605739
Pulled By: soumith
fbshipit-source-id: 167ff42399c7e4cdfbd88d59bac5d25b57c0363f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15199
In order to call it from PyTorch, this op schema can't live in caffe2 but must be included from PyTorch.
Moving it to c10. This is not where it should be in the end (that's why there is a large TODO here),
but an intermediate hack to enable this use case and proof-of-concept.
Reviewed By: ezyang
Differential Revision: D13462124
fbshipit-source-id: 1e187b9def8ef049c91e6de947ea4a85758d711b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15367
This updates flat_hash_map and fixes an issue with singletons across library boundaries
(see the PRs linked at the top of the file)
Reviewed By: ezyang
Differential Revision: D13510912
fbshipit-source-id: e90a297a7a2d69ae3fe48e4fcd8a44ad4b81292a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15324
This was missing but needs to be here, otherwise we can't register schemas without linker errors.
Reviewed By: ezyang
Differential Revision: D13500679
fbshipit-source-id: ba06351cb8ae09ec456cb93e527d388ace578fbb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15195
This removes the use of caffe2::Tensor or at::Tensor in the c10 dispatcher and only uses c10::Tensor.
It also changes output tensors to be passed as `const Tensor&` instead of `Tensor*` because we otherwise can't forward them in operator_c10wrapper.h.
Reviewed By: ezyang
Differential Revision: D13461640
fbshipit-source-id: 7f79925a7d60f01660a24bbfda47391af0c70ed3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14819
This is a minimal wrapper for a c10::TensorImpl,
maybe destined for greatness later when we move caffe2::Tensor or at::Tensor into c10.
Reviewed By: dzhulgakov
Differential Revision: D13348039
fbshipit-source-id: 874f515358e94f35dc7a4c3e55b35fde59c51ff1
Summary:
Allow the comparison function used in ReadyQueue to handle the empty FunctionTasks created by the reentrant autograd.
Fix #11732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15791
Differential Revision: D13598006
Pulled By: soumith
fbshipit-source-id: 0bfdf28a735fbfe44f0fdbaf8b74a6198e6a1984
Summary:
The overhead of the copy actually makes an appreciable difference when doing a lot of small reductions (i.e., when the reduced dimension is significantly smaller than the non-reduced dimensions).
```
import torch
x = torch.randn((1024, 10, 1024), dtype=torch.float64)
torch.set_num_threads(1)
%timeit x.std(1)
```
Before: 813.0 ms
After: 708.25 ms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15845
Differential Revision: D13603246
Pulled By: umanwizard
fbshipit-source-id: 020d224d76fcb8a0b55b75b0f2937e9508891beb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15553
Add unit test and implementation of NHWC layout for Resize operator.
Also, add a parallel-loop pragma to the old NCHWC layout.
Reviewed By: jspark1105
Differential Revision: D13540762
fbshipit-source-id: eebf252bf0d1efdff180a171d804181045f100a5
Summary:
I fixed a grammatical error in this function previously, but I realized that its content was also wrong. A weight tensor of a convolutional layer should be at least 3-dimensional, not 2.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15830
Differential Revision: D13597968
Pulled By: soumith
fbshipit-source-id: 72a75106e88945c68d6462828b149441cfb5acde
Summary:
Fixes #15223.
This fixes an autograd bug where backprop either fails or produces
gradients of incorrect sizes when tensors with zero-sized dimensions are
involved.
Previously, we were reducing along dimensions that had size greater than 1
when summing to a size in autograd. This is incorrect because we should also reduce
along dimensions with size 0 to produce a tensor of size 1 in that
dimension that then gets viewed to the correct shape.
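The corrected dimension-selection logic can be sketched in plain Python; the function name and shapes are illustrative, not the autograd internals:

```python
def reduction_dims(grad_shape, target_shape):
    # Reduce over the extra leading dims, plus every dim where the target
    # size is 1 but the gradient size differs -- including size 0, which
    # the old code skipped because it only reduced dims of size > 1.
    lead = len(grad_shape) - len(target_shape)
    dims = list(range(lead))
    for i, (g, t) in enumerate(zip(grad_shape[lead:], target_shape)):
        if t == 1 and g != 1:
            dims.append(i + lead)
    return dims
```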
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15796
Differential Revision: D13593199
Pulled By: zou3519
fbshipit-source-id: 2e2acac34943a9b7fabadc10c9efd4f66db298fd
Summary:
Cache the workspace size information for MIOpen for a given configuration as opposed to inquiring it every time. This reduces overhead significantly as inquiring the workspace size forces a full read of the performance database in MIOpen and this database has grown significantly in recent releases. This caching gets us back to ideal performance.
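The caching pattern can be sketched with Python memoization as a stand-in; the real change caches MIOpen workspace sizes in C++, and the names and size formula below are illustrative only:

```python
from functools import lru_cache

query_count = []

@lru_cache(maxsize=None)
def workspace_size(config):
    query_count.append(config)      # models the expensive database read
    return len(config) * 1024       # placeholder for the real query

workspace_size("conv3x3")           # first call hits the "database"
workspace_size("conv3x3")           # second call is served from the cache
```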
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15742
Differential Revision: D13598932
Pulled By: bddppq
fbshipit-source-id: 4e65d247b71dec828293cf0562aac3fbd4fad83a
Summary:
This PR:
- Removes shape logic from the code generator, which was previously relied on to return chunk and concat information
- Copies the logic to detect if a kernel has a rand_like node to the executor, making its pass independent of the code generator
- Fixes a possible segfault where references to a vector still being modified were relied upon
The actual shape logic is unchanged.
The possible segfault is in the handling of the former "flat_inputs" in codegen.cpp. This vector holds pairs, and the second element of these pairs is a reference. In some cases these would be references to items in the vector chunk_desc, which could be added to later, possibly invalidating any references to items in it. I hit a similar segfault in testing when naively making parallel code for "flat_outputs."
I'm submitting this small PR because it's separable, self-contained, has a fix, and I am trying to actively get away from large PRs to encourage more stability and incremental change in the fuser.
ngimel zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15750
Differential Revision: D13597451
Pulled By: zou3519
fbshipit-source-id: 0d48b365779b42849b044ba0286258aacc7b0332
Summary:
The Thrust shipped with ROCm is recent enough to support this API. Minimize divergence between CUDA/ROCm by changing the ifdef guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15481
Differential Revision: D13598739
Pulled By: bddppq
fbshipit-source-id: 20d0a7e3887a4050eea65033161561af47411de1
Summary:
This PR contains changes for:
1. Using memory alloc from HIPContext while allocating workspace for MIOpen conv and transpose_conv operators rather than direct HIP mem alloc
2. Minor cleanup and removing an unnecessary sync call from MIOpen conv op
Differential Revision: D13598894
Pulled By: bddppq
fbshipit-source-id: 44886161abdf91cd29c7c93b3e23620e1b09c7c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15418
Previously we are using Resize + ShareData.
Instead, we'll create a function on Tensor that clones itself with same storage.
Suppose we want `t` to `ShareData` with `t0`. Previously:
```
Tensor t(dims, CPU);
t.Resize(t0.sizes());
t.ShareData(t0);
```
Now:
```
Tensor t = t0.Alias();
```
Reviewed By: dzhulgakov
Differential Revision: D13507609
fbshipit-source-id: 6e4275d02f4c3356cbce91127f1b01111dc86b9f
Summary:
Wanted to use `Tensor.isnan` in C++, figured it'd be nice to have, so I made it into a tiny native function.
gchanan ezyang apaszke
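For reference, a NaN check like this typically relies on the IEEE-754 rule that NaN is the only value that compares unequal to itself; a pure-Python sketch of the idea:

```python
def isnan(x):
    # NaN != NaN under IEEE 754; every other float equals itself.
    return x != x
```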
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15722
Differential Revision: D13591315
Pulled By: goldsborough
fbshipit-source-id: a78bd22101fde87a0257f759b9bfcf3b4208f5fa
Summary:
Fixes#15308. Before this change, `torch.save` and `torch.load` would
initialize the CUDA context on GPU 0 if it hadn't been initialized
already, even if the serialized tensors are only on GPU 1.
This PR fixes that bug by using CUDAGuard in the storage serialization
path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15807
Differential Revision: D13593201
Pulled By: zou3519
fbshipit-source-id: 4addc91ea5a5278d56a03f3d422577ee39e99897
Summary:
We don't support reductions yet, but simply decomposing batch_norm
into a kernel that computes the stats, and the fusing everything else
with ReLU and following pointwise ops provides nice speedups.
Note that this is only limited to inference mode for now, because we
don't support convolutions and batch norm in AD, so the fuser isn't
applied to those parts.
This commit gives us a 7% end-to-end speedup for ResNet50 with batch size 32. Note that this only applies to inference mode at the moment due to lack of AD support for CNN operations (I'll be adding that soon), and not to the standard `torchvision` models, because they use in-place ops which aren't supported by the fuser (we need a way of proving that de-inplacing them is safe).
cc zou3519 zdevito mruberry ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15146
Differential Revision: D13548303
Pulled By: zou3519
fbshipit-source-id: a2e2e5abc383f637fae19bd1b423f20c2cbc056a
Summary:
See #15682
Pushing up this small PR to check if I am doing the right thing. If correct, more will follow for other Stream APIs. Questions will be added inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15737
Differential Revision: D13581400
Pulled By: mrshenli
fbshipit-source-id: 24afed7847b89b62f0692c79a101ec7ff9d9ee4d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15759
Some flags have too long names. And some other few minor clean ups.
Reviewed By: jianyuh
Differential Revision: D13587353
fbshipit-source-id: f8aee7f167505644f5d8f80fe2eed70201ef1e54
Summary:
* With the update of split outputs to a dynamic list, the export to ONNX broke.
The split IR now becomes two ops: 1. Dynamic[] <= Split(), and 2. out1, out2, out3
<= Prim::ListUnpack. In this fix, these two consecutive ops get fused when being
exported to ONNX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15092
Reviewed By: dzhulgakov
Differential Revision: D13583832
Pulled By: houseroad
fbshipit-source-id: 3eb18c871e750921ad6d5cc179254bee9bcf4c99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15758
DNNLOWP Conv operators became very complex due to many options. This diff simplifies them by not allowing fp32 in/out. This is OK for Conv operators because Conv operators are usually used in deep networks where quantizing and dequantizing using separate operators is not much overhead.
Reviewed By: csummersea
Differential Revision: D13587341
fbshipit-source-id: e88c919dae79d1c5b7d787ea539edf5bcb064afc
Summary:
Enable conv+add fusion, same as conv+sum
Caution: only element-wise add is supported on IDEEP without scalar
broadcast. Otherwise, the fusion is illegal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15268
Differential Revision: D13577375
Pulled By: yinghai
fbshipit-source-id: 92c9c4b667c5ca5f7a262a5bffaa8aa68eeff3bd
Summary:
Adds `List` to the eval environment for type lines and allows `List` to be used on PythonOps (following the same style as the `Tuple` code); fixes #15661
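The underlying mechanism can be sketched as evaluating the type string with the `typing` names in scope; `resolve_type` is an illustrative name, not the actual JIT function:

```python
from typing import List, Tuple

def resolve_type(annotation_str):
    # Evaluate a type comment string with List/Tuple available, similar
    # in spirit to the eval-environment change described above.
    env = {"List": List, "Tuple": Tuple}
    return eval(annotation_str, env)
```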
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15721
Differential Revision: D13578540
Pulled By: driazati
fbshipit-source-id: fce54dc3c0931d8b017b2e3483f0ac53826dda94
Summary:
Just changing the version number doesn't seem to work. I also needed to fix the macOS brew parallel conflict.
Should this be merged together with https://github.com/pytorch/ossci-job-dsl/pull/36?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15795
Differential Revision: D13591839
Pulled By: yf225
fbshipit-source-id: 6b2a90943e63c8dcc4b6d9159eb54f1b5974c9ac
Summary:
Fix submitted by huntzhan in https://github.com/pytorch/cppdocs/pull/4. The source is in this repo so the patch has to be applied here.
soumith ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15701
Differential Revision: D13591302
Pulled By: goldsborough
fbshipit-source-id: 796957696fd560a9c5fb42265d7b2d018abaebe3
Summary:
Hello,
This is a little patch to fix `DeprecationWarning: invalid escape sequence`.
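This is the class of fix involved: escape sequences such as `\d` inside an ordinary string literal are invalid and trigger the warning on newer Pythons, while raw strings avoid it:

```python
import re

# r"\d+" is a raw string, so the backslash is not treated as an
# (invalid) escape sequence -- no DeprecationWarning is emitted.
pattern = re.compile(r"\d+")
digits = pattern.findall("a1b22")
```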
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15733
Differential Revision: D13587291
Pulled By: soumith
fbshipit-source-id: ce68db2de92ca7eaa42f78ca5ae6fbc1d4d90e05
Summary:
Fixing error in caffe2_benchmark binary
```
2018-12-29T14:09:59.7867995Z d:\a\1\s\caffe2_builders\v141\pytorch\binaries\benchmark_helper.h(90): error C2678: binary '|=': no operator found which takes a left-hand operand of type 'std::_Iosb<int>::_Openmode' (or there is no acceptable conversion) (compiling source file D:\a\1\s\caffe2_builders\v141\pytorch\binaries\benchmark_helper.cc) [D:\a\1\s\caffe2_builders\v141\pytorch\build\Release\binaries\caffe2_benchmark.vcxproj]
2018-12-29T14:09:59.7868252Z d:\a\1\s\caffe2_builders\v141\pytorch\binaries\benchmark_helper.h(92): error C2678: binary '|=': no operator found which takes a left-hand operand of type 'std::_Iosb<int>::_Openmode' (or there is no acceptable conversion) (compiling source file D:\a\1\s\caffe2_builders\v141\pytorch\binaries\benchmark_helper.cc) [D:\a\1\s\caffe2_builders\v141\pytorch\build\Release\binaries\caffe2_benchmark.vcxproj]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15619
Differential Revision: D13580195
Pulled By: soumith
fbshipit-source-id: b0a4479cd5f7555801b1977aeee96b6433293da7
Summary:
Implementing a stream is quite annoying, since it is closely tied to the underlying storage stream buffer.
So in this PR, we add ReadAdapterInterface and have PyTorchStreamReader use it. We implement IStreamAdapter as a wrapper of std::istream, and keep the user interface unchanged.
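A rough Python sketch of the adapter idea; the real interface is C++, and the method names here are only illustrative:

```python
import io

class ReadAdapterInterface:
    # Abstract reader: random-access reads independent of the backing store.
    def size(self):
        raise NotImplementedError
    def read(self, pos, n):
        raise NotImplementedError

class IStreamAdapter(ReadAdapterInterface):
    """Adapts any seekable stream to the reader interface."""
    def __init__(self, stream):
        self._stream = stream
    def size(self):
        cur = self._stream.tell()
        self._stream.seek(0, io.SEEK_END)
        end = self._stream.tell()
        self._stream.seek(cur)      # restore the caller's position
        return end
    def read(self, pos, n):
        self._stream.seek(pos)
        return self._stream.read(n)
```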
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15551
Reviewed By: zrphercule
Differential Revision: D13568907
Pulled By: houseroad
fbshipit-source-id: 93708cb801248a6c101f35cb14d1631029365c3c
Summary:
support 0 size in any of the tensor dimensions in mkldnn
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15295
Differential Revision: D13573747
Pulled By: yinghai
fbshipit-source-id: 5bf7a0b9e2567e80f44981a7823be5407fc94e53
Summary:
port replication padding 2D and 3D from legacy TH API implementation
to ATen implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15538
Differential Revision: D13547567
Pulled By: lhuang04
fbshipit-source-id: decfe100d9edfdcfb62f39ee23f37b6cae0d461f
Summary:
Before this PR, rsub did not convert its two operands to the same dtype; therefore "1 - x" could be exported to an ONNX model in which the two operands of rsub have different dtypes.
Adding this symbolic patch fixes the bug.
Related test cases are also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15707
Differential Revision: D13583042
Pulled By: zrphercule
fbshipit-source-id: 3a2de47a1a8d1ded1a0adfb911adbe6ac729cdef
Summary:
We are going to have some breaking changes in ConstantLike and related operators in onnx, therefore it is better to disable all related tests for these operators for now.
These operators are not currently supported by caffe2, and are not included in our most recently released onnx, therefore we do not need to worry about internal/external production breaking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15740
Differential Revision: D13582528
Pulled By: zrphercule
fbshipit-source-id: 92a890c1dc2a833969af69edfea85331bb4d562f
Summary:
This improves the error message for "unknown builtin op" to suggest similarly named ops.
Currently it prints out all operators with a name within two edits.
Related issue: https://github.com/pytorch/pytorch/issues/13409
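The suggestion mechanism can be sketched with a plain Levenshtein distance; this is a sketch of the idea, not the actual implementation:

```python
def edit_distance(a, b):
    # Classic Levenshtein distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def suggest(name, ops, max_dist=2):
    # All known op names within max_dist edits of the unknown name.
    return sorted(op for op in ops if edit_distance(name, op) <= max_dist)
```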
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15183
Differential Revision: D13578509
Pulled By: eellison
fbshipit-source-id: 5c73408eda1f7aa456f5bd28790c34df0c76aeca
Summary:
see issue #15636
Please note: I built the documents, but the HTML is not updated with the edited content.
I also did not build the fork.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15664
Differential Revision: D13571310
Pulled By: soumith
fbshipit-source-id: d43be0f61705693d778cc12c13e86d6b06130ac7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15708
nbits_in_non_outlier == 0 doesn't make sense because it means everything is an outlier and we can just use 32-bit accumulation.
Depending on the architecture, the break-even point between acc16 and acc32 can differ, so this adds thresholds for falling back to acc32.
Reviewed By: jianyuh
Differential Revision: D13574832
fbshipit-source-id: b7a37aacbfdc7867e31838dafcdd5f7c2ac282af
Summary:
see #15682
This is a quick fix implementing the simpler solution suggested by colesbury. As the benchmark result shows, it slows down `Stream.query()` by ~20%. I would be happy to further pursue a more complex solution by implementing this in C++/ATen, but I would still vote to merge this quick fix first just to get rid of the bug sooner.
Tests added.
FYI jeffreyksmithjr
Now:
```python
In [1]: def f():
...: d0 = torch.device('cuda:0')
...: d1 = torch.device('cuda:1')
...: with torch.cuda.device(d0):
...: s0 = torch.cuda.current_stream()
...: with torch.cuda.device(d1):
...: s1 = torch.cuda.current_stream()
...: s0.query()
...: s1.query()
In [4]: %timeit f()
38.1 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
37.6 µs ± 2.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Before:
```python
In [4]: %timeit f()
28.5 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [5]: %timeit f()
35.3 µs ± 2.91 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15689
Differential Revision: D13571697
Pulled By: mrshenli
fbshipit-source-id: 4fe697f91248c6419136d37bb5b7147e612e2f4c
Summary:
This PR breaks up `TestJitGenerated` into 3 classes. This makes for
easier testing of specific groups (e.g. run all generated functional
tests without having to wait for the autograd tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13992
Differential Revision: D13076371
Pulled By: driazati
fbshipit-source-id: 1267af59be7d69feb690f5805fcd43fea58a7159
Summary:
This PR bypasses checking the user's configuration entirely and always use strict, since the CI considers it a hard failure if you can't pass flake8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15693
Differential Revision: D13574889
Pulled By: suo
fbshipit-source-id: f5e1c5731cc49b6223b415317033c275bc7d4fec
Summary:
These `std::forward` calls cause VS2017 to emit:
error C2872: 'std': ambiguous symbol
This fix prevents the ambiguity by specifying that `::std` is intended.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15697
Differential Revision: D13573483
Pulled By: goldsborough
fbshipit-source-id: 0439de3523a37a18df7af0cff4a1284a53833ddd
Summary:
s_copy_ was previously special-cased for out of place tracing.
This adds support for inplace tracing, which fixes tracing of
inception_v3
Fixes #15216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15690
Differential Revision: D13572011
Pulled By: zdevito
fbshipit-source-id: 1d565dec039a4b8c59179254285e61d2517ef9a9
Summary:
Fixes #15353.
Like cudnn conv implementation, mkldnn also falls back to the default `_convolution_double_backward` as double backward.
This bug wasn't caught by CI before because mkldnn is only used when input scalar type is float, but our tests are all using double as default.
Adding a test for float inputs, but mkldnn seems to have imprecision issues similar to the cudnn implementation, so here I only check that the double backward exists instead of calling `gradgradcheck`. Please correct me if the precision should actually be checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15686
Differential Revision: D13571682
Pulled By: ailzhang
fbshipit-source-id: f1762439762370f276cfd59e8b8b8a4dee960a4b
Summary:
This is an updated version of the earlier PR https://github.com/pytorch/pytorch/pull/15185, since that one was closed.
Currently PyTorch ONNX exporter exports the logical ops (lt, gt, le, ge, eq, ne) with output type in corresponding ONNX ops as type tensor(uint8). But ONNX spec allows for only tensor(bool), which is why models that have these ops fail to load properly.
This issue is captured in #11339. Part of this issue, relating to the allowed input types, has been fixed in ONNX spec by houseroad. This PR fixes the other part pertaining to output type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15677
Reviewed By: dzhulgakov
Differential Revision: D13568450
Pulled By: houseroad
fbshipit-source-id: a6afbea1afdb4edad8f8b1bc492f50b14e5f2fce
Summary:
1. Avoided using `THCDeviceTensor` by re-calculating the mapping from cuda (blockIdx, threadIdx) to input/output tensor index.
2. Changed Camelcase naming to underscore naming.
Profiling:
Legacy:
```bash
$py.test test/test_nn.py -k ReflectionPad1d -v -s
....
=========== 2 passed, 1258 deselected, 800 warnings in 4.35 seconds ============
```
Now:
```bash
$py.test test/test_nn.py -k ReflectionPad1d -v -s
...
=========== 2 passed, 1258 deselected, 800 warnings in 4.03 seconds ============
```
I have two questions about the code. Any insights are appreciated. gchanan zou3519
1. I can verify that [this magic](https://github.com/pytorch/pytorch/blob/master/aten/src/THCUNN/TemporalReflectionPadding.cu#L32-L36) correctly maps output index to input index in different cases. But, I have no idea about how did you come up with this algorithm that merges three categories (in left padding, in original input, in right padding) into a single statement?
2. Why do we need [get contiguous](https://github.com/pytorch/pytorch/blob/master/aten/src/THNN/generic/TemporalReflectionPadding.c#L80) tensors when calculating forward and backward propagation?
Reflection_pad2d porting will come in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15480
Differential Revision: D13544924
Pulled By: mrshenli
fbshipit-source-id: 182045434f210032a82cab721a190da0cd781fbf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15625
3D group conv (both NCHW and NHWC layouts) was not correct.
Added group=2 to test_1d_convolution and test_3d_convolution in conv_test
Reviewed By: protonu
Differential Revision: D13562099
fbshipit-source-id: 586e8a7574a2764f2a3b559db6c2415b3ab90453
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15651
Add count_include_pad arg for PoolOpGradient on CPU and fix ARM performance issue.
Reviewed By: houseroad
Differential Revision: D13564257
fbshipit-source-id: 3a143f1122bc507ccb7827e9b46908d5c7203735
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15685
The declaration of "Dequantize" is in "fbsource/fbcode/deeplearning/fbgemm2/QuantUtils.h", so it requires the "namespace fbgemm".
<T> is actually optional, since the type can be deduced from the first argument.
In some places we have "Dequantize<T>(...)", while in other places we have "Dequantize(...)". We'd better unify them. As a reference, all occurrences of "Quantize" use "fbgemm::Quantize<T>(...)".
Reviewed By: jspark1105
Differential Revision: D13570847
fbshipit-source-id: 7fca9f7f9e4e0d9e5eb27ac44b8707adc3c80717
Summary:
soumith zou3519
I was browsing the code, and think `vec256_int.h` might need a minor revision, but not 100% sure.
1. It currently inverts the result by `XOR` with 0. Should it `XOR` with 1 instead?
~2. AVX2 logical operations would set all bits in a byte/word/... to `1` if the condition holds. So functions, such as `_mm256_cmpeq_epi64 ` would return `0/-1` instead of `0/1`. Should it be masked with `1` to make sure it returns 0/1?~
~Would I be correct if I assume that the code revised below is not yet activated, but will be after we port legacy code to ATen?~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15659
Differential Revision: D13565929
Pulled By: mrshenli
fbshipit-source-id: 8ae3daf256c3d915dd855a2215c95275e899ea8c
Summary:
The requested changes support building PyTorch 1.0 on the Jetson Xavier with OpenBLAS. The Jetson Xavier with JetPack 3.3 has generic LAPACK installed. To pick up the CUDA-accelerated BLAS/LAPACK, I had to build OpenBLAS and build/link PyTorch from source; otherwise, I got a runtime error indicating the LAPACK routines were not CUDA-enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15660
Differential Revision: D13571324
Pulled By: soumith
fbshipit-source-id: 9b148d081d6e7fa7e1824dfdd93283c67f69e683
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15417
Right now the way we test whether a Blob contains a CPU tensor in ```PythonOpBase``` is broken, which means the non-CPU path might never be taken.
Searching through the codebase, the non-GPU path is used in PythonDLPack, and that is used in PytorchOp, which is unused. So we'll remove the non-GPU path in this diff.
Reviewed By: dzhulgakov
Differential Revision: D13495011
fbshipit-source-id: 9fe9537f05026d2a2cf7051efa81d184de722710
Summary:
Right now it just prints whatever flake8 errors it finds and moves forward with the commit. This is too easy to miss.
It should block the commit so that the user can fix the issue.
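The gating behavior can be sketched as checking the linter's exit code and propagating failure; this is only a sketch, as the real hook is a git pre-commit script, and the command below stands in for the flake8 invocation:

```python
import subprocess
import sys

def gate_commit(cmd):
    # Run the lint command; a nonzero exit code means the caller should
    # abort the commit instead of printing errors and continuing.
    return subprocess.run(cmd).returncode == 0
```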
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15675
Differential Revision: D13567821
Pulled By: suo
fbshipit-source-id: 5f0de40ddd771bad8d6848417408cffbceb03183
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15632
Just formatting and a few lints.
Reviewed By: yinghai
Differential Revision: D13562403
fbshipit-source-id: c56f8ee61f68cdaccc0828a764ff729454f68259
Summary:
Changelog:
- Optimize btriunpack by using `torch.where` instead of indexing, in-place operations instead of out-of-place operations, and avoiding costly permutations by computing the final permutation over a list.
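The "final permutation over a list" idea can be sketched like this, assuming 0-indexed pivot values; the function is illustrative, not the actual btriunpack code:

```python
def final_permutation(pivots, n):
    # Apply LAPACK-style pivot swaps to a plain Python list once, then use
    # the result as a single permutation instead of permuting a tensor
    # repeatedly.
    perm = list(range(n))
    for i, p in enumerate(pivots):
        perm[i], perm[p] = perm[p], perm[i]
    return perm
```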
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15286
Differential Revision: D13562038
Pulled By: soumith
fbshipit-source-id: e2c94cfab5322bf1d24bf56d7b056619f553acc6
Summary:
Since #1323, tensors are shared via shared memory, but this feature was not active for numpy.
This PR fixes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14534
Differential Revision: D13561649
Pulled By: soumith
fbshipit-source-id: b6bc9e99fb91e8b675c2ef131fba9fa11c1647c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15588
Use NHWC2NCHW or NCHW2NHWC functions which is easier to understand compared to code using transpose and generalizable to non-2D convolutions.
Reviewed By: csummersea
Differential Revision: D13557674
fbshipit-source-id: c4fdb8850503ea58f6b17b188513ae2b29691ec0
Summary:
This PR removes the TH/THC binding for gesv.
Changelog:
- Remove TH/THC binding
- Port single matrix case to ATen
- Enable test_gesv for CUDA as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15510
Differential Revision: D13559990
Pulled By: soumith
fbshipit-source-id: 9da2825e94d3103627e719709e6b1f8b521a07fb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15244
This diff keeps track of the extra_info information attached to each operator. When getPerOpStas() is called, it attaches the extra_info to the resulting ProfDagStats protobuf.
Facebook
Net transform attaches a global_op_id which is defined as a tuple of (orig_net_name, original_op_index) to each operator,
The global_op_id is encoded as extra_info in each operator.
Reviewed By: aazzolini
Differential Revision: D13016289
fbshipit-source-id: 3e2719ec7ed0ebe47740b77581c565ff7e79b102
Summary:
Throw a warning when calling `torch.load` on a zip file
Fixes #15570
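A minimal sketch of such a warning for file-like inputs; the function name and message are illustrative, not the exact ones added by the PR:

```python
import warnings
import zipfile

def warn_if_zip(f):
    # A zip signature means the file is likely not a legacy torch.save
    # archive, so alert the user before attempting to load it.
    if zipfile.is_zipfile(f):
        warnings.warn("torch.load received what looks like a zip archive, "
                      "not a file produced by torch.save")
```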
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15578
Differential Revision: D13555954
Pulled By: driazati
fbshipit-source-id: a37ecdb3dd0c23eff809f86e2f8b74cd48ff7277
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15582
Following convention of having caffe2_ prefix in command line options
Reviewed By: viswanathgs
Differential Revision: D13252055
fbshipit-source-id: 142a6395b832f211f34d0a87ec2d62c1e5fcdc69
Summary:
In this PR, we are moving all functions away from `Variable::Impl`, in order to get rid of `Variable::Impl` (and the `data_` Tensor in it) in the next PR. Some of the functions (such as `set_requires_grad` / `requires_grad` / `grad`) will be living in `AutogradMeta` class, while others (such as `backward()` / `rebase_history()` / `grad_accumulator()` / `grad_fn()`) will be living in `Variable` class.
This is the 2nd PR mentioned in https://github.com/pytorch/pytorch/issues/13638.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15487
Differential Revision: D13553173
Pulled By: yf225
fbshipit-source-id: 691f9432d0cd0640af380c757f3e3a2f64f8851c
Summary:
Short term solution, export group norm as an ATen op to unblock users.
Long term will add GroupNorm to onnx.
Add an end to end test for this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15569
Differential Revision: D13554293
Pulled By: houseroad
fbshipit-source-id: b4974c9ea2a1b81338ca1e5c6747efe2715d7932
Summary:
Now that `cuda.get/set_rng_state` accept `device` objects, the default value should be a device object, and the docs should mention so.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14324
Reviewed By: ezyang
Differential Revision: D13528707
Pulled By: soumith
fbshipit-source-id: 32fdac467dfea6d5b96b7e2a42dc8cfd42ba11ee
Summary:
It should be ScriptModule rather than TracedModule :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15559
Differential Revision: D13552058
Pulled By: soumith
fbshipit-source-id: 0aa17639c225818b00d59daec4bc2336f039f658
Summary:
Simple check that runs against your PR's changes and complains if running clang-format would have created a change. Does nothing when run against master, so it's "safe" to accept changes that fail this check and it won't break the build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15543
Reviewed By: soumith
Differential Revision: D13552080
Pulled By: suo
fbshipit-source-id: 462a73894c16e7108806af7fa88440c377d4d0d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15365
On top of D13017791, adding rotated NMS support with the same kernel building
blocks. Results in 218x speedup on avg.
Reviewed By: SuperIRabbit
Differential Revision: D13509114
fbshipit-source-id: c1d33c8dc4bc50b5906b4f01bb0caf1115e2a357
Summary:
Changes originally in this PR:
1. Move Variable::Impl data members into TensorImpl as `AutogradMeta` struct
2. Change Variable::Impl functions to use data members in `AutogradMeta` struct
3. Add `shallow_copy_and_detach()` function to each subclass of TensorImpl
4. Do shallow copy when the user calls `make_variable(tensor)` / `make_variable_view(tensor)` / `variable.set_data(tensor)` / `variable.detach()`
Changes moved from https://github.com/pytorch/pytorch/pull/13645:
1. Add a flag to Variable to disallow size/stride/storage_ptr changes from in-place operations such as `resize_` / `resize_as_` / `set_` / `transpose_`, and set this flag to true when people call `tensor.data` in Python.
2. Write text in the docs to actively discourage changing the shape or storage of `tensor_detached` and expecting `tensor` to also be updated.
This is the 1st+2nd PR mentioned in https://github.com/pytorch/pytorch/issues/13638.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13827
Differential Revision: D13507173
Pulled By: yf225
fbshipit-source-id: b177b08438d534a8197e34e1ad4a837e2db0ed6a
Summary:
In the current README.md, `CMAKE_PREFIX_PATH` is set to the conda root even when you have activated a virtual environment. When a conda virtualenv is activated, packages are installed in `CONDA_PREFIX`, not the conda root, so I think `CMAKE_PREFIX_PATH` should also be set to `CONDA_PREFIX` in this case. Some build issues, perhaps including #14954, can be solved with the new instructions.
soumith,
When I made PR #15335 I was confused and made a wrong point. I think this PR could be the real solution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15548
Differential Revision: D13549681
Pulled By: soumith
fbshipit-source-id: 42d855b6e49ee58d735d2f4715d3e5752a748693
Summary:
The `EmbeddingBag` module does not include a `from_pretrained` method like the `Embedding` module. I added it for consistency between the two modules.
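A minimal sketch of the added method, mirroring `Embedding.from_pretrained` (tensor values here are illustrative):

```python
import torch
import torch.nn as nn

weight = torch.tensor([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0]])
bag = nn.EmbeddingBag.from_pretrained(weight)  # frozen by default, like Embedding
input = torch.tensor([[0, 1]])                 # one bag holding rows 0 and 1
out = bag(input)                               # default mode="mean"
```

With the default mean pooling, `out` is the average of rows 0 and 1, i.e. `[[2.5, 3.5, 4.5]]`.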
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15273
Differential Revision: D13547842
Pulled By: soumith
fbshipit-source-id: 8ffde51ff0c1e8fc8310263b6f375da88089ff7d
Summary:
There is an inconsistency in the size of arguments for gesv, which is fixed in this PR.
Changelog:
- Replicate check in CPU as done for CUDA
- Fix argument ordering (minor) in CUDA checking
Fixes #15328
Differential Revision: D13531167
Pulled By: soumith
fbshipit-source-id: c4b4e4fc12880208d08e88d1e47e730ac98c2ad3
Summary:
The PR clang-formats everything in `torch/csrc/jit/` and adds it to the pre-commit hook.
Here is a list of non-mechanical changes:
- I went over each file and fixed up whenever I could tell that clang-format was clobbering comment formatting.
- Made the macros in register_prim_ops a little more clang-format friendly by omitting trailing commas
- Refactored autodiff.cpp to use a helper class with explicit state rather than a bunch of capturing lambdas
- Small improvements to the precommit hook clang-format
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15524
Differential Revision: D13547989
Pulled By: suo
fbshipit-source-id: 3ff1541bb06433ccfe6de6e33f29227a2b5bb493
Summary:
Currently, torch.isinf on an integral tensor raises `RuntimeError: value cannot be converted to type int16_t without overflow: inf`.
This PR suppresses the error and returns false (0) for all integral tensors, which also makes the behavior consistent with np.isinf.
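A short sketch of the behavior change:

```python
import torch

x = torch.tensor([1, 2, 3], dtype=torch.int16)
# Previously this raised a RuntimeError; now it returns all-false,
# matching np.isinf on integer arrays.
mask = torch.isinf(x)
assert not mask.any()

# Floating-point tensors are unaffected.
assert torch.isinf(torch.tensor([float("inf"), 1.0])).tolist() == [True, False]
```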
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15489
Reviewed By: zou3519
Differential Revision: D13540786
Pulled By: flashhack
fbshipit-source-id: e730dea849da6a59f3752d347bcfbadfd12c6483
Summary:
I removed the explanation on `num_inputs` parameter. This parameter was removed in #8168
colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15529
Differential Revision: D13547854
Pulled By: soumith
fbshipit-source-id: 8a9ac58f2c93a2533b82ec63089477166ed0bcb9
Summary:
Upgrade MKL-DNN to 0.17 and statically build MKL-DNN to fix potential build errors due to an old mkldnn version on the host system.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15504
Differential Revision: D13547885
Pulled By: soumith
fbshipit-source-id: 46f790a3d9289c1e153e51c62be17c5206ea8f9a
Summary:
There was a typo in C++ docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15527
Differential Revision: D13547858
Pulled By: soumith
fbshipit-source-id: 1f5250206ca6e13b1b1443869b1e1c837a756cb5
Summary:
Currently re-implements the dataloader for stateful datasets. Outstanding work:
- Refactor DataLoader and DataLoader2 to have common base classes and only differ in specific pieces of logic,
- Figure out how to not duplicate the `MapDataset` logic for stateful vs. non-stateful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15096
Differential Revision: D13522043
Pulled By: goldsborough
fbshipit-source-id: 08e461ca51783047f11facc4d27dfa2e4f1e4c2a
Summary:
This makes compatibility with different versions of python a little bit simpler, and fixes a problem where stdin wasn't being read from the terminal properly in the prompt.
zdevito This should fix your EOF exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15521
Differential Revision: D13546358
Pulled By: suo
fbshipit-source-id: fb7551a86c888196831c046d9d9848e7ff05b925
Summary:
The precommit hook shouldn't hard fail if there's no `clang-tidy`, just warn and omit the check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15514
Differential Revision: D13545776
Pulled By: suo
fbshipit-source-id: 9bf3f8ee18703c6d1a39eb7776092fb5e120d2a1
Summary:
This does two things:
(1): revert #15114, which is incorrect and actually just completely disables parallelization in this function (because `at::get_num_threads` returns `-1` unless it has been set explicitly)
(2): Fix our (FB-internal) failing tests that #15114 was intended to fix, by still working correctly in a setup where `#ifdef _OPENMP` is set and `omp_get_max_threads() > 1` , but `#pragma omp parallel` only launches one thread. I believe such an unusual situation only exists in certain unit tests within FB infra but we still need it to work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15483
Differential Revision: D13538940
Pulled By: umanwizard
fbshipit-source-id: a3362c7ac7327ced350d127bb426f82c59e42732
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15082
We didn't have unit test for low-precision rowwise adagrad
Reviewed By: chocjy
Differential Revision: D13300732
fbshipit-source-id: 46e7bdfc82c5a6855eeb6f653c0a96b0b3a20546
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15389
SparseLengthsMean was generating uninitialized data for empty inputs (lengths == 0). We should return zeros.
The unit tests were also not covering this special case which is fixed by this diff.
Reviewed By: salexspb
Differential Revision: D13515970
fbshipit-source-id: 3c35265638f64f13f0262cee930c94f8628005da
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15492
Add pthreadpool_create and pthreadpool_destroy, which are used by NNPACK tests.
Reviewed By: Maratyszcza
Differential Revision: D13540997
fbshipit-source-id: 628c599df87b552ca1a3703854ec170243f04d2e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15252
We would like to extend the model file format to include strongly typed, semantic information
about the model inputs and outputs.
The goal is for a user to be able to consider a model file like a function with
a well defined API describing what the inputs and outputs would be.
Reviewed By: dzhulgakov
Differential Revision: D13009915
fbshipit-source-id: 5df124a876ad03c05fbdaacae0eab659637734c1
Summary:
(otherwise len is not resolvable using torch::jit::compile)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15488
Differential Revision: D13539991
Pulled By: zdevito
fbshipit-source-id: 3ba85fa7b1adb163f9229c568f7997d22321903d
Summary:
This adds `self` to the list of reserved words and also sorts the lines and prevents the tracer from naming values 'self' (which happens in torch/tensor.py)
Fixes #15240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15318
Differential Revision: D13540192
Pulled By: driazati
fbshipit-source-id: 46ae02e51b1b31d5c62110fa83ba258ea6bada27
Summary:
This adds AD support for adaptive_avg_pool2d, which is necessary for resnet50 in pytorch/vision:master. cc: soumith asuhan dlibenzi
apaszke I saw that autodiff bug you fixed in #15403 , as it doesn't prevent this PR from passing, so I'll leave it for your PR to fix it. :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15459
Differential Revision: D13534732
Pulled By: ailzhang
fbshipit-source-id: 4e48b93e35d5ecfe7bd64b6a132a55b07843f206
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15458
Many nets in the wild seem to have outputs that are never produced by the net.
Reviewed By: ZolotukhinM
Differential Revision: D13534185
fbshipit-source-id: 2b23b39c28404c53f68868f3bf6df53c5fea9eab
Summary:
This PR allows a subclass of programs that have return statements that are not final in the graph.
`final_returns.h` contains a comment describing how this is accomplished.
To minimize complexity in `compiler.cpp`, this pass is done as an AST-to-AST rewrite before the compiler runs.
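For instance, scripted code in this shape (a hypothetical example) now compiles:

```python
import torch

@torch.jit.script
def sign_of_sum(x: torch.Tensor) -> int:
    s = float(x.sum())
    if s > 0:
        return 1       # non-final return, rewritten by the AST pass
    if s < 0:
        return -1
    return 0

assert sign_of_sum(torch.ones(3)) == 1
assert sign_of_sum(-torch.ones(3)) == -1
assert sign_of_sum(torch.zeros(3)) == 0
```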
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15463
Differential Revision: D13538962
Pulled By: zdevito
fbshipit-source-id: 67105ca873351825b4a364092ab1873779f3e462
Summary:
This PR implements infrastructure for post-processing a model to apply int8 quantization to its `nn.Linear` modules. Highlights of the implementation:
1) Inputs and outputs are `float` (quantized and packed internally), but the weight is quantized and packed ahead of time for efficiency. This implementation performs well in small-batch size GEMM calls. It should not be considered a general-purpose quantized GEMM kernel.
2) Weight packing is dependent on machine architecture (e.g. vector register width), so it is done just-in-time. Concretely, it is done on model load for the weights and it is done during operator execution for the input value.
3) Biases are unquantized
4) We fail loudly if we are attempting to run this on a machine that does not support FBGEMM. This is because we do not want a model's numerics to differ based on which machine it is run on. A model containing these FBGEMM ops *must* be run with FBGEMM
The API can be seen in the added test case. Highlights are:
1) `torch.jit.quantized.quantize_linear_modules` walks the module hierarchy of the passed-in Module and replaces all `nn.Linear` modules with a new `QuantizedLinear` module, which encapsulates the behavior described above.
2) `_pack()` and `_unpack()` script methods are present on `QuantizedLinear` modules. These methods should be called before serialization and after deserialization, respectively. This ensures that the weight matrix is properly packed for the running machine's architecture. Note that in the long term, we would like to move toward a more Pickle-style serialization technique, rather than having these explicit methods that mutate member values. This is blocked on being able to assign attributes in a ScriptMethod, among other things.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13777
Differential Revision: D13383276
Pulled By: jamesr66a
fbshipit-source-id: 00f29c9f34544add2b90107e3cf55a287802c344
Summary:
It is sometimes beneficial to run multiple batches in one benchmark and check the aggregated results.
This PR enables this functionality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15443
Reviewed By: llyfacebook
Differential Revision: D13531129
Pulled By: sf-wind
fbshipit-source-id: 553a762a5cbadf5a3d9fd6af767ae34899bc1aa2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15473
Revert accidental changes introduced in D13335176
IntList is a range, and copying it just copies pointers. Thus the pointers would point either to deallocated memory or to the same memory, causing the equality check to always pass.
Reviewed By: ezyang
Differential Revision: D13537131
fbshipit-source-id: c97b3533be689bb4cdadd9e612f1284ac50e4bda
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15453
Just move things around to facilitate further development. No logic change.
Reviewed By: rdzhabarov
Differential Revision: D13533959
fbshipit-source-id: eebab1306939e802aacffb24a711d372fd67916c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15174
Previously, Caffe2 maintained a separate per-thread per-device
current logical CUDA stream ID. In this PR, we switch Caffe2 over
to using c10::Stream to manage the current stream, and also
manage the allocation of cudaStream_t objects.
This results in a slight behavior change: previously, Caffe2
would have been willing to allocate an arbitrary number of
CUDA streams, depending on how high the logical stream IDs
went. The c10::Stream pool has a fixed number of streams, once
you exceed it, it wraps around.
Reviewed By: dzhulgakov
Differential Revision: D13451550
fbshipit-source-id: da6cf33ee026932a2d873835f6e090f7b8a7d8dc
Summary:
Fixes #3584.
Motivation: manually sorting sequences, packing them, and then unsorting them
is something a lot of users have complained about doing, especially when we can
offer library support for them.
Overview: we internally sort sequences before packing them and store a list of
`unsorted_indices` that represent how to unsort the sequences inside
PackedSequence. The packing helper functions return PackedSequence with the
`permutation` field and the unpacking helper functions use it to unsort.
To implement this, the following changes were made:
- PackedSequence now keeps `sorted_indices` and `unsorted_indices`.
These two can be thought of as permutations and are inverses of each other.
`sorted_indices` is how the sequences were sorted; `unsorted_indices` is how
to unsort the sequences.
- Added an `enforce_sorted` argument to pack_sequence and pack_padded_sequence
that maintains the legacy behavior of error-ing out on unsorted-sequences.
When `enforce_sorted=True`, these functions maintain their ONNX exportability.
- pack_sequence(sequences, enforce_sorted) takes in unsorted sequences.
- pack_padded_sequence can take in a padded tensor that represents padded,
unsorted sequences.
- pad_packed_sequence unsorts the PackedSequence such that it is still the
inverse operation of packed_padded_sequence.
- RNNs apply `sort_indices` to their input hidden state and apply
`unsort_indices` to their output hidden state. This is to ensure that the
hidden state batches correspond to the user's ordering of input sequences.
NOT BC-Breaking
- The default for pack_sequence and pack_padded_sequence is
`enforce_sorted=True` to avoid breaking ONNX export. To use the new
functionality, pass in `enforce_sorted=False`
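The new flow can be sketched as follows (shapes and lengths here are illustrative):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

padded = torch.randn(3, 4, 5)       # (batch, max_len, features)
lengths = torch.tensor([2, 4, 3])   # deliberately unsorted

# No manual sorting needed anymore:
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)
unpacked, out_lengths = pad_packed_sequence(packed, batch_first=True)

# pad_packed_sequence undoes the internal sort, so the original
# batch ordering is preserved.
assert torch.equal(out_lengths, lengths)
assert unpacked.shape == (3, 4, 5)
```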
Testing Plan
- Modified TestNN.test_pack_sequence, TestNN.test_packed_padded_sequence,
and TestNN.test_variable_sequence (RNN test) to check the behavior
of unsorted sequences, sorted sequences, and sorted sequences with
enforce_sorted=True
- test/test_jit.py has a test to see if RNNs are exportable with
enforce_sorted=True
cc colesbury
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15225
Reviewed By: soumith
Differential Revision: D13507138
Pulled By: zou3519
fbshipit-source-id: b871dccd6abefffca81bc4e3efef1873faa242ef
Summary:
I noticed that some users don't even know we have this support. Adding into the doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15440
Differential Revision: D13531045
Pulled By: teng-li
fbshipit-source-id: 9757c400c0010608758c754df04e603b36035a10
Summary:
According to mypy, the trailing -> None is mandatory.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15448
Differential Revision: D13532179
Pulled By: ezyang
fbshipit-source-id: e8972f8c9ada4657c518cd7bcd46e489ab8ddf5f
Summary:
ROCm 2.0's compiler requires launch_bounds annotations if flat work group sizes are larger than the default of 256.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15400
Differential Revision: D13531239
Pulled By: ezyang
fbshipit-source-id: c0b40600a8c332823da6c7113c644d8dba424a9c
Summary:
This PR adds enough of the infra for supporting closures (inner script functions) in order to allow us to expression symbolic gradients using them. We do not actually ever run graphs that contain these closures. The symbolic_script infrastructure just extracts them out of the original forward graph and turns them into discrete forward/backward pairs. This cuts down on the type annotations necessary to write forward/backward pairs and aligns closely with the "differentiator" function approach to expression reverse-mode AD.
Example:
This code:
```
import torch
r = torch.jit.CompilationUnit(
'''
def mul_forward(self, other):
def backward(grad_output):
grad_self = (grad_output * other).sum_to_size(self.size())
grad_other = (grad_output * self).sum_to_size(other.size())
return grad_self, grad_other
return self * other, backward
''')
print(r.module.code)
```
Will produce this graph (pretty printed for clarity):
```
def mul_forward(self,
self: Tensor,
other: Tensor) -> Tuple[Tensor, Tuple[None, Tuple[Tensor, Tensor]]]:
backward = (self.__lambda, (other, self))
return (torch.mul(self, other), backward)
def __lambda(self,
context: Tuple[Tensor, Tensor],
grad_output: Tensor) -> Tuple[Tensor, Tensor]:
other, self, = context
grad_self = torch.sum_to_size(torch.mul(grad_output, other), torch.size(self))
grad_other = torch.sum_to_size(torch.mul(grad_output, self), torch.size(other))
return (grad_self, grad_other)
```
symbolic_script will then do some modifications to remove the unsupported prim::Function node, yielding:
```
def mul_forward(self,
self: Tensor,
other: Tensor) -> Tuple[Tensor, Tuple[None, Tuple[Tensor, Tensor]]]:
return (torch.mul(self, other), (other, self))
def backward(self,
context: Tuple[Tensor, Tensor],
grad_output: Tensor) -> Tuple[Tensor, Tensor]:
other, self, = context
grad_self = torch.sum_to_size(torch.mul(grad_output, other), torch.size(self))
grad_other = torch.sum_to_size(torch.mul(grad_output, self), torch.size(other))
return (grad_self, grad_other)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15411
Differential Revision: D13523340
Pulled By: zdevito
fbshipit-source-id: 4d4a269460e595b16802c00ec55ae00e3e682d49
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15413
In order to pass arguments to the iOS app, we need to extract the arguments
into their own file. Also, in the iOS app, do not use benchmark.json, which
parses the arguments.
This is an incompatible change and needs a hotfix to the tests.
Reviewed By: llyfacebook
Differential Revision: D13523240
fbshipit-source-id: b559cc7f52d8f50ee206a7ff8d7b59292d855197
Summary:
Currently PyTorch ONNX exporter exports the logical ops (`lt`, `gt`, `le`, `ge`, `eq`) with output type in corresponding ONNX ops as type `tensor(uint8)`. But ONNX spec allows for only `tensor(bool)`, which is why models that have these ops fail to load properly.
This issue is captured in https://github.com/pytorch/pytorch/issues/11339. Part of this issue, relating to the allowed input types, has been fixed in ONNX spec by houseroad. This PR fixes the other part pertaining to output type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15185
Differential Revision: D13494873
Pulled By: houseroad
fbshipit-source-id: 069d2f956a5ae9bf0ac2540a32594a31b01adef8
Summary:
This PR makes some small changes for better consistency in our README and
CONTRIBUTING docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15373
Differential Revision: D13512753
Pulled By: driazati
fbshipit-source-id: 44398ad1894eef521d5f5acb1d06acaad67728cf
Summary:
Followup PR of #14904, and the stretch goal of #12653.
Directly calculate coordinates in the original tensor using column index in the result tensor. Every GPU thread takes care of a column (two numbers) in the output tensor.
The implementation detects and handles precision loss during calculating the square root of a `int64_t` variable, and supports tensors with up to `row * column = 2 ^ 59` numbers.
Algorithm details are described in [comments of TensorFactories.cu](23ddb6f58a/aten/src/ATen/native/cuda/TensorFactories.cu (L109-L255)).
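The ops in question are `tril_indices`/`triu_indices` (#12653); a quick usage sketch:

```python
import torch

idx = torch.tril_indices(3, 3)   # coordinates of the lower triangle
# idx[0] holds row indices, idx[1] holds column indices
assert idx.shape == (2, 6)       # 6 elements in the lower triangle of a 3x3
```

On CUDA builds, passing `device="cuda"` exercises the kernel described above.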
zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15203
Reviewed By: zou3519
Differential Revision: D13517695
Pulled By: mrshenli
fbshipit-source-id: 86b305d22cac08c8962a3b0cf8e9e620b7ec33ea
Summary:
This updates pdist to work for batched inputs, and updates the
documentation to reflect issues raised.
closes#9406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12302
Reviewed By: ezyang
Differential Revision: D13528485
Pulled By: erikbrinkman
fbshipit-source-id: 63d93a6e1cc95b483fb58e9ff021758b341cd4de
Summary:
This is the CUDA version of #14535 .
It refactors Reduce.cuh to allow more general classes of reductions to be performed -- we no longer assume that the temporary data returned during reduction is just one scalar, and instead allow an arbitrary accumulate type.
We also allow 64-bit indexing when necessary, since in general we will no longer be able to accumulate directly in the output. (In the cases when we can, we continue to split the tensors until they can be addressed with 32-bits, as before).
As an initial use-case, we implement `std` in multiple dimensions.
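A sketch of the user-facing change:

```python
import torch

x = torch.randn(2, 3, 4)
out = x.std(dim=(1, 2))   # reduce over multiple dimensions at once
assert out.shape == (2,)
```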
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14990
Differential Revision: D13405097
Pulled By: umanwizard
fbshipit-source-id: a56c24dc2fd5326d417632089bd3f5c4f9f0d2cb
Summary:
This adds `self` to the list of reserved words and also sorts the lines and prevents the tracer from naming values 'self' (which happens in torch/tensor.py)
Fixes #15240
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15318
Differential Revision: D13498974
Pulled By: driazati
fbshipit-source-id: 488efb661476cdcdb8ecb9cb48942f02e3c1e611
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13321
This diff simply refactors the `ProfDAGCounters` into two:
* `ProfDAGCounters` that gathers stats at runtime.
* `ProfDAGReport` which holds the report from the gathered stats once stats collection is done.
This refactoring allow us to implement `+=` for `ProfDAGReport`, which can be used for aggregating same-net reports on each host.
Reviewed By: donglimm
Differential Revision: D12837988
fbshipit-source-id: 0470c5fd6437f12711cab25a15a12965d79b2a91
Summary:
Optional cleanup. This PR removes python_default_init from the yaml files and the code-gen, and utilizes the optional type to do the work.
This also fixes the bug in #13149 to correctly adopt the as_strided backward.
Fixes #9941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15234
Differential Revision: D13502044
Pulled By: wanchaol
fbshipit-source-id: 774b61fc4414482cf11d56e22bd0275aefb352a4
Summary:
Save error info in the future for parent thread to pick up. Throw the error
when the thread is the root thread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14523
Differential Revision: D13251756
Pulled By: highker
fbshipit-source-id: b40f9a45665e1a934743f131ec5e8bad5622ce67
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15248
OutputTensorCopyFrom takes four arguments: index, a source Tensor, TensorOptions and whether we want to perform an async call.
We want to provide default options for TensorOptions: (1) default device to context_.device(); (2) default dtype to input.dtype(). Users can also explicitly provide these options to override the default values.
The next diff will change the order of the TensorOptions parameter so that users don't need to write down tensor options unless they want to override them.
Reviewed By: dzhulgakov
Differential Revision: D13453824
fbshipit-source-id: 87401f81c7c3f9fd3d8936c710e6c2e04a59b689
Summary:
Current documentation example doesn't compile. This fixes the doc so the example works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15372
Differential Revision: D13522167
Pulled By: goldsborough
fbshipit-source-id: 5171a5f8e165eafabd9d1a28d23020bf2655f38b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15366
Swap the old implementation with a slightly easier-to-understand one.
I ran the tests and compared the number of chains against the old algorithm. This one outperforms it on every test, but we have yet to see if that impacts performance at all.
old chain 34 nomnigraph chain 25
old chain 46 nomnigraph chain 34
old chain 228 nomnigraph chain 188
old chain 397 nomnigraph chain 338
Reviewed By: ilia-cher
Differential Revision: D13057451
fbshipit-source-id: ccd050bfead6eb94ab9c7b0a70b09a22c2b9e499
Summary:
Same as #14668, and was approved there.
ailzhang , please apply this patch to Horizon's `data_streamer.py`: https://gist.github.com/SsnL/020fdb3d6b7016d81b6ba1d04cc41459 Thank you!
Below is the original description at #14668:
As I am working on tasks in https://github.com/pytorch/pytorch/issues/13023, I realized how unreadable the code is because all functions to be run in multiprocessing must be at top global level. Adding more functionalities to `dataloader.py` will only make things worse.
So in this PR, I refactor `dataloader.py` and move much of it into `data._utils`. E.g., the `_worker_loop` and related methods are now in `data._utils.worker`, signal handling code in `data._utils.signal_handling`, collating code in `data._utils.collate`, etc. This split, IMHO, makes code much clearer. I will base my future changes to DataLoader on top of this.
No functionality is changed, except that I added `torch._six.queue`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15331
Reviewed By: yf225
Differential Revision: D13503120
Pulled By: ailzhang
fbshipit-source-id: 94df16b4d80ad1102c437cde0d5a2e62cffe1f8e
Summary:
Changelog:
- Renames `potrs` to `cholesky_solve` to remain consistent with Tensorflow and Scipy (not really, they call their function chol_solve)
- Default argument for upper in cholesky_solve is False. This will allow a seamless interface between `cholesky` and `cholesky_solve`, since the `upper` argument in both function are the same.
- Rename all tests
- Create a tentative alias for `cholesky_solve` under the name `potrs`, and add deprecated warning to not promote usage.
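A small sketch of the renamed API (using `torch.linalg.cholesky` for the factorization, which on current builds supersedes the older `torch.cholesky`):

```python
import torch

A = torch.tensor([[4.0, 1.0],
                  [1.0, 3.0]])      # symmetric positive-definite
b = torch.tensor([[1.0],
                  [2.0]])
L = torch.linalg.cholesky(A)        # lower-triangular factor
x = torch.cholesky_solve(b, L)      # upper=False is now the default
assert torch.allclose(A @ x, b)
```

The `upper=False` default is what lets the output of the factorization feed directly into the solve without extra flags.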
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15334
Differential Revision: D13507724
Pulled By: soumith
fbshipit-source-id: b826996541e49d2e2bcd061b72a38c39450c76d0
Summary:
A number of different passes rely on whether a node has side effects. This centralizes the list of side effectful ops in one place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15188
Differential Revision: D13508438
Pulled By: eellison
fbshipit-source-id: 2143e782b787731ce007b6dcd50cbde30e1b8dd0
Summary:
For #6593 and #9515
This completes the support for optional<ScalarType> in native, JIT and autograd.
Note: Mostly following the existing implementation for optional<Scalar> that was added in https://github.com/pytorch/pytorch/pull/12582.
This PR introduces a way to make functions accept an optional dtype and it will unblock #9515 by allowing the `dtype` param for type promotion interface:
```
func: name(inputs, *, ScalarType? dtype=None, Casting casting=same_kind)
```
An alternative approach could have been using `ScalarType::Undefined` for the same purpose but without optional, though it would have been a bit hacky.
```
func: name(inputs, *, ScalarType dtype=Undefined, Casting casting=same_kind)
```
Here's an example use of this in action: 971f69eac6
There are already a bunch of native functions that were getting optional `dtype` through function overloading. https://github.com/pytorch/pytorch/pull/15133 is the attempt to migrate all of those. I will send those changes separately after this since some functions (e.g. sum) need quite a bit of change in the codebase. See the commits over there.
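The user-facing effect for a function like `sum` (which today gets its optional `dtype` via overloading) can be sketched as:

```python
import torch

t = torch.ones(3, dtype=torch.int32)
# dtype=None (the default) keeps the usual promotion behavior;
# an explicit dtype overrides it.
assert t.sum().dtype == torch.int64              # integral inputs promote to int64
assert t.sum(dtype=torch.float64).dtype == torch.float64
```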
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15154
Differential Revision: D13457760
Pulled By: tugrulates
fbshipit-source-id: 706134f0bd578683edd416b96329b49a1ba8ab48
Summary:
Pull cpuinfo changes that should make it work on AWS Lambda servers (which don't have `/sys/devices/system/cpu/{possible,present}` files, and probably don't mount sysfs at all).
I'm not 100% sure it will fix the issue, but getting this update in would make it easier for users to test using a nightly build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15385
Reviewed By: soumith
Differential Revision: D13517467
Pulled By: Maratyszcza
fbshipit-source-id: e8e544cd1f9dad304172ebb7b6ba7a8ad7d34e66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15371
Similar to D13387692:
Never call mutable_data from an OpenMP region!!!
Reviewed By: jspark1105
Differential Revision: D13511259
fbshipit-source-id: 100812d2a547c0a1d5018749d5fdc88162375673
Summary:
This PR adds clang-format automation:
- It only checks on whitelisted files, so we can enable incrementally without noise
- There is a pre-commit hook provided that will do the same check, plus prompt users to apply the clang-format changes (no change is made without the user agreeing).
My plan is to migrate over whole files at a time, clang-formatting them and then adding them to the whitelist. Doing it this way should avoid too many merge pains (the most you'll have to is run clang-format on the affected file before rebasing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15254
Differential Revision: D13515888
Pulled By: suo
fbshipit-source-id: d098eabcc97aa228c4dfce8fc096c3b5a45b591f
Summary:
This separates the different parts of compiler.cpp to make their relationship more clear. In particular it adds:
* sugared_value.{h,cpp} - all the public SugaredValues that the compiler defines and a few that were inside compiler.cpp
* type_parser.{h, cpp} - Turns TreeRef's defining types into TypePtr
* schema_matching.{h, cpp} - infrastructure for matching arguments against overloaded schema and emitting builtin operators with a particular schema.
Retains:
* compiler.{h, cpp} - now responsible simply for the `defineMethodsInModule` infrastructure.
Some utility functions like inlineCallTo have moved to ir.h.
The only thing that is not a move is some changes in module.h/cpp that remove multiple returns from `Method::emit_call_to`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15355
Reviewed By: suo, wanchaol
Differential Revision: D13507524
Pulled By: zdevito
fbshipit-source-id: 69ec936a9ff1a383c12a883616346b219c72e393
Summary:
This PR enables autodiff to use the forward/backward graph compiled from python code, instead of using symbolic gradients(modifying the original graph directly).
We put the map in a separate .h file for now to wait for the native_functions.yaml and derivatives.yaml merge. This should ideally go into native_functions.yaml eventually.
This PR should be enough to unblock us for now, we can start writing gradients for aten functions in python.
Differential Revision: D13494635
Pulled By: ailzhang
fbshipit-source-id: f8d51a15243ac46afd09d930c573ccdfcd9fdaaf
Summary:
Modified step_lr for StepLR, MultiStepLR, ExponentialLR and CosineAnnealingLR. In this way, multiple schedulers can be used simultaneously to modify the learning rates.
Related issue: https://github.com/pytorch/pytorch/issues/13022
Added unit tests combining multiple schedulers.
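The composition semantics can be sketched in plain Python (toy classes, not the `torch.optim.lr_scheduler` API): each scheduler's `step()` applies its own multiplicative factor to the optimizer's current learning rate, so stacking schedulers composes their effects.

```python
class ToyOptimizer:
    def __init__(self, lr):
        self.lr = lr

class ToyStepLR:
    """Multiply lr by gamma every step_size epochs."""
    def __init__(self, opt, step_size, gamma):
        self.opt, self.step_size, self.gamma = opt, step_size, gamma
        self.epoch = 0
    def step(self):
        self.epoch += 1
        if self.epoch % self.step_size == 0:
            self.opt.lr *= self.gamma

class ToyExponentialLR:
    """Multiply lr by gamma every epoch."""
    def __init__(self, opt, gamma):
        self.opt, self.gamma = opt, gamma
    def step(self):
        self.opt.lr *= self.gamma

opt = ToyOptimizer(lr=1.0)
schedulers = [ToyStepLR(opt, step_size=2, gamma=0.1),
              ToyExponentialLR(opt, gamma=0.5)]
for epoch in range(2):
    for s in schedulers:
        s.step()

# StepLR fired once (x0.1), ExponentialLR fired twice (x0.25): 1.0 * 0.1 * 0.25
print(opt.lr)  # 0.025
```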
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14010
Reviewed By: ezyang
Differential Revision: D13494941
Pulled By: chandlerzuo
fbshipit-source-id: 7561270245639ba1f2c00748f8e4a5f7dec7160c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15348
We have a function resize_dim() on TensorImpl in c10/core/TensorImpl.h which lets you change the dimensionality of a tensor, resizing both sizes and strides. Unfortunately, this API is fairly easy to misuse, because it fills in the new entries with garbage when you size it larger. We want to refactor the call sites to use set_sizes_and_strides() instead, so that there is never an intermediate tensor state where the sizes/strides don't make sense. In this diff, resize_dim() is
replaced with set_sizes_and_strides() in aten/src/TH/THTensor.hpp.
Reviewed By: ezyang
Differential Revision: D13505512
fbshipit-source-id: 193bab89f0018c13ca07488be336d8e967746b76
Summary:
Changelog:
- change some expect tests that didn't have to be expect tests,
instead use self.assertAllFused
- Some of the fuser tests weren't using self.assertAllFused.
- Minor test renames
cc apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15134
Differential Revision: D13507481
Pulled By: zou3519
fbshipit-source-id: dd0788530a60bb5ed2f42b961fae3db2b4404b64
Summary:
Max and ReduceMax are smashed together, so we need to support the one-input case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15241
Reviewed By: yinghai
Differential Revision: D13473312
Pulled By: houseroad
fbshipit-source-id: 9b8c847286a2631b006ca900271bc0d26574101a
Summary:
This PR changes Method (just Method not all graphs) to always have a single
return argument.
This is part 1 in a set of changes that will enable us to have better handling of early return statements.
The simplification that this change provides greatly reduces the work for the next step.
This change makes it so that Method and Python handle multiple returns in the same way:
* 0 - None
* 1 - <single value>
* many - Tuple[...]
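The convention can be sketched with an illustrative helper (plain Python, not JIT code):

```python
def pack_returns(values):
    """Mirror the return convention: zero returns -> None,
    one -> the value itself, many -> a tuple."""
    if len(values) == 0:
        return None
    if len(values) == 1:
        return values[0]
    return tuple(values)

print(pack_returns([]))      # None
print(pack_returns([42]))    # 42
print(pack_returns([1, 2]))  # (1, 2)
```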
The result is that a lot of special-case handling in compiler.cpp and its
bindings can be removed. It also fixes several bugs in return handling,
including one where return values were not always checked against their
attributed values.
Notes:
* inferTypeFrom is renamed to be more accurate and discourage use.
* This has uncovered some bugs in other components, which are noted in
the diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15289
Differential Revision: D13481649
Pulled By: zdevito
fbshipit-source-id: 0e2242a40bb28cca2d0e8be48bede96195e4858c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15322
caffe2 mobile opengl code is not used, deleting it to reduce complications when we perform other changes
Reviewed By: Maratyszcza
Differential Revision: D13499943
fbshipit-source-id: 6479f6b9f50f08b5ae28f8f0bc4a1c4fc3f3c3c2
Summary:
There is still a limitation on this: if a script module is somewhere
in the trace, the inputs/outputs can only be tensors or tuples of
tensors.
resolves #15052
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15184
Differential Revision: D13457691
Pulled By: highker
fbshipit-source-id: 8fe46afc41357a0eb8eadd83f687b31d074deb0e
Summary:
…on](#12115)
mean is calculated in two step sum()/numel(). For half precision, data gets
casted back to half after sum().
We fused the division into the reduction kernel by adding pre_op/post_op.
This allows us to do torch.ones(65536).cuda().half().mean() to return correct
result.
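The overflow can be reproduced outside CUDA with a numpy sketch (assuming numpy is available; it stands in for the reduction kernel here):

```python
import numpy as np

x = np.ones(65536, dtype=np.float16)

# Two-step mean: sum() exceeds fp16's max normal value (~65504)
# before the division ever happens.
naive = x.sum() / x.size
print(naive)  # inf

# Fusing the division into the reduction (sketched here by
# accumulating in fp32 and casting back) keeps the result representable.
fused = np.float16(x.astype(np.float32).sum() / x.size)
print(fused)  # 1.0
```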
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14878
Differential Revision: D13491159
Pulled By: soumith
fbshipit-source-id: e83802e1628b6d2615c45e18d7acf991d143a09e
Summary:
Fixes an issue that arose from https://github.com/pytorch/pytorch/pull/13481 where `.shared_memory()` couldn't be called. Effectively undoes all changes to `nn.Module` from that PR and solve the relevant problem in a different way (the goal was to be able to call `._apply()` on the Python wrapper for a C++ module).
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15305
Differential Revision: D13493937
Pulled By: goldsborough
fbshipit-source-id: 4cb8687f90fc8709a536c5e7eacd0dc8edf6f750
Summary:
The JIT uses `int64_t` for its integer type and `double` for its floating point type, but users quite often want to write `int` or `float` and that currently fails in not-so-nice ways for custom ops. This PR adds a simple `static_assert` to catch these common failure cases.
zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15247
Differential Revision: D13493941
Pulled By: goldsborough
fbshipit-source-id: c1cd0d10ab5838c75f167c0bdb57e45a0bc1344e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15250
This adds `__repr__` methods to all of the classes under task.py. This makes the objects much easier to interact with when using them in an interactive manner, such as in a Jupyter notebook.
The default `__repr__` method just returns the object ID which is very unhelpful.
Reviewed By: hanli0612
Differential Revision: D13475758
fbshipit-source-id: 6e1b166ec35163b9776c797b6a2e0d002560cd29
Summary:
Addresses #918, interpolation results should be similar to tf
* Adds bicubic interpolation operator to `nn.functional.interpolate`
* Corresponding test in `test_nn.py`
The operator is added in legacy `TH` to be aligned with the other upsampling operators; they can be refactored/moved to ATen all at once when #10482 is resolved
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9849
Differential Revision: D9007525
Pulled By: driazati
fbshipit-source-id: 93ef49a34ce4e5ffd4bda94cd9a6ddc939f0a4cc
Summary:
This PR add isinstance to do static type checking in JIT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15076
Differential Revision: D13471067
Pulled By: wanchaol
fbshipit-source-id: d39b7ed5db9fcca4b503659d02cf7795950ea8ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15126
I want to make people stop manufacturing StreamId from thin air,
and a first step is to make people use the default stream.
Reviewed By: dzhulgakov
Differential Revision: D13432922
fbshipit-source-id: 9f0d8d70646c50d979bde5ba3c3addeebac48a3d
Summary:
Allows 2 functions that are boolean dispatched to have no docstrings (the only case that will fail now is if both functions have docstrings)
Fixes #15281
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15306
Differential Revision: D13494884
Pulled By: driazati
fbshipit-source-id: 65fec39ae03a7d6a68ad617c9b270faeb1617930
Summary:
`torch.expand` and `torch.ne` are used often in models and this PR adds ONNX export support for them. ArmenAg has created issue https://github.com/pytorch/pytorch/issues/10882 for this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15050
Differential Revision: D13453036
Pulled By: houseroad
fbshipit-source-id: 4724b4ffcebda6cd6b2acac51d6733cb27318daf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15125
I realized that it is really bad juju if you fake a StreamId
out of thin air, because in general this isn't going to work.
So, make the constructor a lot scarier.
Most "faking StreamId out of thin air" happens because someone
just wants to put something on the default stream.
Reviewed By: dzhulgakov
Differential Revision: D13432800
fbshipit-source-id: a86991d6fc1d8aa4e54e8175e5f06f90856238e6
Summary:
`rsplit` doesn't have kwargs in Python 2 so this line raises an error
Fixes #15135
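For illustration, the positional form works on both Python versions while the keyword form was Python-3-only:

```python
path = "torch.jit.frontend"

# Positional form: valid on both Python 2 and 3.
parts = path.rsplit(".", 1)
print(parts)  # ['torch.jit', 'frontend']

# Keyword form: fine on Python 3, but raised
# "TypeError: rsplit() takes no keyword arguments" on Python 2.
kw_parts = path.rsplit(sep=".", maxsplit=1)
print(kw_parts == parts)  # True
```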
Pull Request resolved: https://github.com/pytorch/pytorch/pull/12732
Differential Revision: D10458630
Pulled By: driazati
fbshipit-source-id: a63e42fbc0e39e4291480775b516c98122ec05a1
Summary:
Cholesky by default returns the lower triangular matrix, see [docs](https://pytorch.org/docs/stable/torch.html#torch.cholesky).
However `torch.potrs` by default requires the upper triangular matrix. The naming of the variable `u` suggests that the example expects the upper to be returned, so I've added the flag to make that happen in the example.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15215
Differential Revision: D13476468
Pulled By: soumith
fbshipit-source-id: 7b68035f435a2b1be4d363b3f63e407394af949d
Summary:
This makes DCE more granular by tracking live values/aliases through the graph (rather than just nodes). So we can be more aggressive in DCE around control flow blocks. For example, in:
```
%a0 = aten::foo()
%b = aten::foo()
%a2, %b2 = prim::If(%cond) {
block0() {
%a1 = aten::foo(%a0)
%b1 = aten::foo(%b)
} -> (%a1, %b1)
}
return (%a2)
```
we will now dce all the `%b` stuff.
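The value-level liveness idea can be sketched in a few lines of plain Python over a toy IR (dicts, not the JIT's data structures; real DCE must also be conservative about side effects and aliasing, which this ignores):

```python
# Toy IR: each node maps its output name to the names of its inputs.
nodes = [
    ("a0", []),        # %a0 = aten::foo()
    ("b",  []),        # %b  = aten::foo()
    ("a1", ["a0"]),    # inside the if-block
    ("b1", ["b"]),
    ("a2", ["a1"]),    # if-output aliased to %a1
    ("b2", ["b1"]),
]
returns = ["a2"]

# Walk definitions backward from the returned values; anything
# never reached is dead.
defs = {out: ins for out, ins in nodes}
live = set()
work = list(returns)
while work:
    v = work.pop()
    if v in live:
        continue
    live.add(v)
    work.extend(defs.get(v, []))

dead = [out for out, _ in nodes if out not in live]
print(sorted(dead))  # ['b', 'b1', 'b2']
```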
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14910
Differential Revision: D13476445
Pulled By: suo
fbshipit-source-id: 2bf5db19711c07dde946697a4f4b270bd8baf791
Summary:
A friend of me is learning deep learning and pytorch, and he is confused by the following piece of code from the tutorial https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html#gradients :
```python
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
y = y * 2
print(y)
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)
print(x.grad)
```
He doesn't know where the following line comes from:
```python
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
```
What are we computing? Why don't we compute "the gradient of `y` w.r.t `x`"?
In the tutorial, it only says
> You can do many crazy things with autograd!
Which does not explain anything. It seems to be hard for some beginners of deep learning to understand why we ever call backward with an external gradient fed in and what doing so means. So I modified the tutorial in https://github.com/pytorch/tutorials/pull/385
and the docstring correspondingly in this PR, explaining the Jacobian vector product. Please review this PR and https://github.com/pytorch/tutorials/pull/385 together.
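The vector-Jacobian product being computed can be checked numerically with a numpy sketch (numpy stands in for autograd here; for the elementwise map `y = k * x` the Jacobian is `diag(k)`, so `y.backward(v)` would leave `x.grad = J.T @ v = k * v`):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])
k = 2.0
while np.linalg.norm(k * x) < 1000:
    k *= 2.0                        # mirrors `y = y * 2` in the tutorial loop

J = np.diag(np.full(3, k))          # Jacobian of the map y = k * x
v = np.array([0.1, 1.0, 0.0001])    # the "gradients" vector from the tutorial

grad_x = J.T @ v                    # what x.grad would hold after y.backward(v)
print(k)        # 512.0
print(grad_x)   # equals k * v
```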
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15197
Differential Revision: D13476513
Pulled By: soumith
fbshipit-source-id: bee62282e9ab72403247384e4063bcdf59d40c3c
Summary:
Several enhancements are implemented:
* Resize the images to be within a boundary between min-size and max-size (height and width). It tries to resize the minimum side to match min-size while keeping the aspect ratio. However, if that would make the maximum side exceed max-size, it instead resizes the maximum side to equal max-size (and the minimum side ends up smaller than min-size). The min/max sizes are specified in the scale argument, in comma-separated form. If one of the sizes is -1, that size is not a restriction.
* Change the OpenCV resize function arguments from using cv::Size() to the x, y scale. Theoretically they should be the same. But in reality, the two ways of specifying them may result to different resized outputs.
* Once the image is read in, change the data to floats. That means, after resize and other preprocessing steps, the float values are preserved (not truncated to int).
* It is possible to convert data in text format to the blob format.
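The resize rule in the first bullet can be sketched as a small helper (hypothetical function, not the benchmark tool's code):

```python
def compute_scale(height, width, min_size, max_size):
    """Scale the shorter side to min_size while keeping the aspect
    ratio, but cap the longer side at max_size; -1 lifts that
    constraint."""
    short, long_ = min(height, width), max(height, width)
    scale = 1.0 if min_size == -1 else min_size / float(short)
    if max_size != -1 and long_ * scale > max_size:
        scale = max_size / float(long_)
    return scale

# 480x640 image: scaling the min side to 600 would push the max side
# to 800, so the max-size cap of 700 wins.
s = compute_scale(480, 640, min_size=600, max_size=700)
print(round(480 * s), round(640 * s))  # 525 700
```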
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15204
Reviewed By: llyfacebook
Differential Revision: D13467225
Pulled By: sf-wind
fbshipit-source-id: 7da34a72d43a9603cd7ab953f5821c1222d0178f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15242
Newer version ONNX Reshape gets shape info from a tensor. Hence for static backend, we need to provide this info to it when doing `onnxGetCompatibility` too.
Reviewed By: jackm321
Differential Revision: D13471959
fbshipit-source-id: 8a58e28edd900b6ad54a1dbd63ff2579fbe0e820
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15191
OSS:
Just splitting out basic flags from a unit test so I can extend them in another test where I need to add additional flags.
Reviewed By: yinghai
Differential Revision: D13159184
fbshipit-source-id: 9823e792cf0ed8d0379235c44564862b7d784845
Summary: Record unit of time for torch.cuda.Event's elapsed_time
Differential Revision: D13467646
Pulled By: zou3519
fbshipit-source-id: 4f1f4ef5fa4bc5a1b4775dfcec6ab155e5bf8d6e
Summary:
We need this, for example, to properly call `_unpack` when we have a traced module in the hierarchy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15101
Differential Revision: D13468467
Pulled By: jamesr66a
fbshipit-source-id: c2b6740b12cde6e23395d12e42d4fc2c4c7ca3f2
Summary:
The compiler understands it and profits from knowing it by not using too
many VGPRs, as it otherwise assumes the default workgroup size of 256.
Fixes a problem in bringup of ROCm 2.0 on gfx906.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15228
Differential Revision: D13470950
Pulled By: bddppq
fbshipit-source-id: f9aa44c7c95299a099c0ea9317b9044cc056acc5
Summary:
tests work on ROCm 1.9.2 as present on CI (fp16 bringup, hipMemset and sparse improvements)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15232
Differential Revision: D13470991
Pulled By: bddppq
fbshipit-source-id: 45acc4f9ea5baaaf7672b86eb022948055779925
Summary:
Some of the codeblocks were showing up as normal text and the "unsupported modules" table was formatted incorrectly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15227
Differential Revision: D13468847
Pulled By: driazati
fbshipit-source-id: eb7375710d4f6eca1d0f44dfc43c7c506300cb1e
Summary:
This PR adds the final set of clang-tidy checks we should add for our codebase: a last set of performance-related checks. Most fixes here are around changing `auto` to `const auto&` in a few places where unnecessary copies were made, and adding `reserve()` calls before loops doing repeated `push_back()`. Also a few cases of calling `std::string::find` with a single-character string literal instead of a single char, which uses a less efficient string search algorithm meant for searching larger substrings.

ezyang apaszke
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15198
Differential Revision: D13468797
Pulled By: goldsborough
fbshipit-source-id: 2bed1ea1c7c162b7f3e0e1026f17125e88c4d5b2
Summary:
Methods like `module.named_modules()` returns a container of `shared_ptr<nn::Module>`. Currently the `nn::Module` base class does not have Python bindings. This PR fixes this, and adds more unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15193
Differential Revision: D13458713
Pulled By: goldsborough
fbshipit-source-id: 4091fe1b96a1be8db14c6a4307fbacc2b41ff6fe
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15027
- Make DataRandomFiller able to accept input_dims and input_types for only non intermediate inputs. Add a helper to fill input directly to a workspace
Reviewed By: highker
Differential Revision: D13408345
fbshipit-source-id: 5fc54d33da12e3f0a200e79380d4c695b0339b17
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15022
Add setArgument testing utils to make it easy to set argument for an operator
Reviewed By: yinghai
Differential Revision: D13405225
fbshipit-source-id: b5c1859c6819d53c1a44718e2868e3137067df36
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15020
Add test utils for assertion of a tensor (sizes and values)
Reviewed By: salexspb
Differential Revision: D13401146
fbshipit-source-id: bc385df074043e03ea884940b5631b96de4a607e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15019
Put some utils to fill tensors to test_utils
Reviewed By: salexspb
Differential Revision: D13386691
fbshipit-source-id: 51d891aad1ca12dc5133c0352df65b8db4f96edb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15180
Test utils to create an operator
On top of D13370461
Reviewed By: ZolotukhinM
Differential Revision: D13382773
fbshipit-source-id: a88040ed5a60f31d3e73f1f958219cd7338dc52e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15014
Currently it looks like many of the simple operations, such as comparing tensors, creating tensors, and fetching tensors, are too verbose and take effort to write correctly in unit tests.
Easy-to-use utilities are important for productivity when writing unit tests. While Caffe2 Python unit tests are relatively easy to write at the moment, the C++ side is lacking.
In this change I create a test_util, started with assertsTensorEquals, getTensor, createTensor, and we can start putting more easy to use utilities there.
Reviewed By: salexspb
Differential Revision: D13370461
fbshipit-source-id: bee467a127e1d032ef19482f98aa5c776cf508c0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14858
This diff doesn't change logic but just takes the existing code and moves it to caffe2::Tensor
Reviewed By: ezyang
Differential Revision: D13365817
fbshipit-source-id: bc73b27a793602cb14200dcdf357aa63233da43c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14656
This diff doesn't move it yet, but prepares it to be moved, i.e. removes all access to class internals.
dzhulgakov: Please comment on if you think it still makes sense to land this even though it's not blocking anymore since we're going to move at::CopyBytes anyhow.
ezyang: There's some changes in the implementation, especially handling undefined dest tensors. Please review carefully.
Reviewed By: ezyang
Differential Revision: D13287688
fbshipit-source-id: 17800ca8a79ab1633f23be58d96f99a160d8ed24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15113
cv::rotatedRectangleIntersection has a known float underflow bug that would cause failure in ```CV_Assert(intersection.size() <= 8)```
For rotated proposals, replace cv::rotatedRectangleIntersection with a correct version that doesn't have underflow problem.
Otherwise, when ```USE_CPP_GENERATE_PROPOSALS = true```, the training would fail.
Reviewed By: viswanathgs
Differential Revision: D13429770
fbshipit-source-id: 5e95d059f3c668f14059a0a83e8e53d8554cdb99
Summary:
Adding support for torch.tensor in script.
The input list is typed as t[], because it can be arbitrarily nested. I added a compile-time check that the inner type of the list is a bool, float, or int.
Also adds specialization for Boolean lists, which already existed at the IValue level but had not been added to the compiler yet.
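A sketch of that inner-type check in plain Python (hypothetical helper, not the compiler's implementation):

```python
def innermost_type(value):
    """Recurse through arbitrarily nested lists and verify every
    leaf is a bool, float, or int."""
    if isinstance(value, list):
        types = {innermost_type(v) for v in value}
        if len(types) != 1:
            raise TypeError("empty or mixed inner types: %r" % types)
        return types.pop()
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return bool
    if isinstance(value, (int, float)):
        return type(value)
    raise TypeError("unsupported element type: %s" % type(value).__name__)

print(innermost_type([[1, 2], [3, 4]]))    # <class 'int'>
print(innermost_type([[True], [False]]))   # <class 'bool'>
```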
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14913
Differential Revision: D13407930
Pulled By: eellison
fbshipit-source-id: d17f1195a22149d5b0d08d76c89a7fab8444f7c5
Summary:
This PR fixes around 250 places in the codebase where we were making unnecessary copies of objects (some large, some small).
ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15026
Differential Revision: D13458784
Pulled By: goldsborough
fbshipit-source-id: be5148b2ce09493588d70952e6f6d6ff5ec5199b
Summary:
This PR removes the usage of _finfo defined in torch.distributions.utils and changes the call sites
to use torch.finfo instead
Differential Revision: D13451936
Pulled By: soumith
fbshipit-source-id: 6dbda3a6179d9407bc3396bf1a2baf3e85bc4cf2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14918
When ProtoBuf-Lite is in use, ProtoDebugString just calls SerializeAsString.
This produces binary output, which is not a very suitable "debug" string.
Specifically, we've observed it causing problems when calling code tries to
add the debug string to a Java exception message (which requires valid UTF-8).
Now, we replace all non-ASCII bytes with "?".
This is not a very fast implementation, but generating debug strings shouldn't
be a performance-sensitive operation in any application.
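The byte-replacement idea, sketched in Python (the function name is illustrative, not the actual Caffe2 API):

```python
def sanitize_debug_string(data):
    """Replace every non-ASCII byte with '?' so the debug string
    is always valid UTF-8."""
    return "".join(chr(b) if b < 0x80 else "?" for b in data)

raw = b'op: "Conv" \xde\xad\xbe\xef'
print(sanitize_debug_string(raw))  # op: "Conv" ????
```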
Reviewed By: dzhulgakov
Differential Revision: D13385540
fbshipit-source-id: 8868172baf20efaf53fecf7d666a6980f59b64f5
Summary:
This PR enables C++ frontend modules to be bound into Python and added as submodules of Python modules. For this, I added lots of pybind11 bindings for the `torch::nn::Module` class, and modified the `torch.nn.Module` class in Python to have a new Metaclass that makes `isinstance(m, torch.nn.Module)` return true when `m` is a C++ frontend module. The methods and fields of C++ modules are bound in such a way that they work seamlessly as submodules of Python modules for most operations (one exception I know of: calling `.to()` ends up calling `.apply()` on each submodule with a Python lambda, which cannot be used in C++ -- this may require small changes on Python side).
I've added quite a bunch of tests to verify the bindings and equality with Python. I think I should also try out adding a C++ module as part of some large PyTorch module, like a WLM or something, and see if everything works smoothly.
The next step for inter-op across our system is ScriptModule <-> C++ Frontend Module inter-op. I think this will then also allow using C++ frontend modules from TorchScript.
apaszke zdevito
CC dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13481
Differential Revision: D12981996
Pulled By: goldsborough
fbshipit-source-id: 147370d3596ebb0e94c82cec92993a148fee50a7
Summary:
Before this PR, loop unrolling + the graph fuser was creating multiple
FusionGroups with the same bodies (with different variable names) for
JIT LSTMs. Each FusionGroup got registered to a separate fusion key;
each key resulted in a different compilation for the same
specializations.
This PR makes it so that when registering FusionGroups with the fusion
compiler, the compiler first checks the KernelSpec cache to see if the
FusionGroup's graph exists already. If it does, then return the
corresponding KernelSpec's key to share compiled kernels.
In addition, graphs in the KernelSpec cache are canonicalized before
being cached. I added a flag to the canonicalize pass to remove unique
names of values.
This shortens the compile time for a JIT LSTM (seq_len of 100, loop
unroll factor of 8) from 5.3s to 2.3s. Most of this compile time is
running the graph fuser and/or fusion compiler; while this PR
makes it so that there is only one unique kernel in the forward pass,
there are a lot of different kernels (6) in the backward pass
(after loop unrolling) that should be investigated.
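A toy sketch of the caching idea (not the fuser's KernelSpec machinery): rename values in first-use order so two graphs that differ only in unique value names map to the same cache key.

```python
import re

def canonical_key(graph_lines):
    """Rewrite %-prefixed value names in first-use order, yielding a
    name-independent key for the graph."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "%" + str(len(mapping))
        return mapping[name]
    return tuple(re.sub(r"%[\w.]+", rename, line) for line in graph_lines)

# Same graph, different unique names -> same key, one compiled kernel.
g1 = ["%x.1 = aten::mul(%a.2, %a.2)", "return (%x.1)"]
g2 = ["%y.7 = aten::mul(%b.3, %b.3)", "return (%y.7)"]
print(canonical_key(g1) == canonical_key(g2))  # True
```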
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14541
Differential Revision: D13324487
Pulled By: zou3519
fbshipit-source-id: b841d82ed35a959b5cfc72db033bf5a7b42cc4fb
Summary:
We don't need THCNumerics here since at::Half can be implicitly converted to float and the cuda math dispatches are handled by `/usr/local/cuda/include/crt/math_functions.hpp` and `cmath`. ATen should be free of THCNumerics after this and when porting kernels from THC, one should not use THCNumerics.
Should close: https://github.com/pytorch/pytorch/issues/11878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15085
Differential Revision: D13447558
Pulled By: soumith
fbshipit-source-id: 4ff5cbf838edcd01e2d1397e4d7f4f920e9e9fc3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14950
Minimize the number of headers included from _avx2.cc files to avoid accidental compilation of functions defined in the header files reused by other translation units, which can lead to illegal instruction errors.
Reviewed By: dskhudia
Differential Revision: D13394483
fbshipit-source-id: 67149a6fb51f7f047e745bfe395cb6dd4ae7c1ae
Summary:
Now that PyTorch 1.0 is out, this should be updated :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15136
Differential Revision: D13447377
Pulled By: soumith
fbshipit-source-id: bd4e662c53d0699f25d4d90c1b4c1e182b4427c2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15110
support casting to string on CPU
Reviewed By: intermilan
Differential Revision: D13429381
fbshipit-source-id: b737a1ba1237b10f692d5c42b42a544b94ba9fd1
Summary:
The speed-up of a single operation is up to 3X.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15106
Differential Revision: D13429596
Pulled By: bddppq
fbshipit-source-id: f8d987cafeac9bef9c3daf7e43ede8c6a4ee2ce5
Summary:
Certain tensor shapes failed when being resized. This pull request addresses the bug found in #13404.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14874
Differential Revision: D13429788
Pulled By: soumith
fbshipit-source-id: 8aa6451dbadce46d6d1c47a01cb26e6559bcfc8c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15147
Forgot to take out dnnlowp.cc from avx2 list in a previous diff.
Reviewed By: dskhudia
Differential Revision: D13440686
fbshipit-source-id: 9ada98b6e885c7d5f22c91a735ff60304480b4cb
Summary:
* relax MIOpen if statement to allow fp16/fp32 mixed precision training now supported by ROCm 1.9.2
* use gemm_ex API of rocBLAS in ROCm 1.9.2 instead of the previous hgemm API
* with this: enable all but one half test in test_nn
While there, also fix:
* a group convolution issue w/ MIOpen pertaining to properly initializing MIOpen on multi-GPU systems, which we detected while working on this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14994
Differential Revision: D13439869
Pulled By: bddppq
fbshipit-source-id: 75e4eb51a59488882e64b5eabdc30555b25be25e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15103
There are two main optimizations in this diff:
1. Previously we generated all anchors for every single spatial grid first, and
then applied NMS to pick 2000 anchors according to RPN_PRE_NMS_TOP_N. First
sorting the scores, picking the top 2000, and then lazily generating only the
corresponding anchors is much faster.
2. Transposing bbox_deltas from (num_anchors * 4, H, W) to
(H, W, num_anchors * 4) was also quite slow - taking about 20ms in the RRPN
case when there are lots of anchors, whereas it's negligible in the RPN case
(about 0.1 ms). Instead of transposing, performing all operations in the
(num_anchors, H, W) format speeds things up.
For regular RPN scenario, this gives 5x speedup from 5.84ms to 1.18ms a case
with 35 anchors over a 600x600 image.
For rotated boxes with 245 anchors, the runtime down from 80ms to 27ms per
iter.
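The sort-then-generate idea from optimization (1), sketched in numpy (shapes and the `make_anchor` helper are illustrative stand-ins for the real anchor generator):

```python
import numpy as np

def topk_anchors(scores, k, make_anchor):
    """Pick the top-k scores first, then materialize anchors only
    for the survivors."""
    top = np.argpartition(-scores, k - 1)[:k]   # O(n) partial selection
    top = top[np.argsort(-scores[top])]         # order the k survivors
    return top, [make_anchor(i) for i in top]

rng = np.random.default_rng(0)
scores = rng.random(35 * 600)                   # toy flattened score map
idx, anchors = topk_anchors(scores, k=2000,
                            make_anchor=lambda i: ("anchor", i))
print(len(anchors))  # 2000
```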
Reviewed By: newstzpz
Differential Revision: D13428688
fbshipit-source-id: 6006b332925e01a7c9433ded2ff5dc9e6d96f7d3
Summary:
This is an optimized implementation that does the following:
1. Create an empty Tensor of the correct size.
2. Fill the Tensor with the correct values.
The following three designs for filling in the Tensor result in roughly the same performance. Hence, the second option is taken for simpler code and to return contiguous tensors.
1. Sequential: fill row coordinates first, then columns. This results in two for-loops and more arithmetic operations.
2. Interleaved: fill in index coordinates one by one, which jumps between the two output Tensor rows in every iteration.
3. Transpose: create an n x 2 Tensor, fill it sequentially, and then transpose it.
<img width="352" alt="screen shot 2018-12-10 at 3 54 39 pm" src="https://user-images.githubusercontent.com/16999635/49769172-07bd3580-fc94-11e8-8164-41839185e9f9.png">
NOTE:
This implementation returns a 2D tensor, instead of a tuple of two tensors. It means that users will not be able to do the following:
```python
x = torch.ones(3, 3)
i = torch.tril_indices(3, 3)
x[i] # need to first convert the 2D tensor into a tuple of two 1D tensors.
```
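The NOTE can be illustrated with numpy (which stands in for torch here; `np.tril_indices` already returns the tuple form, so we stack it to mimic the PR's 2D output):

```python
import numpy as np

i2d = np.stack(np.tril_indices(3))   # shape (2, 6): like the PR's 2D output
x = np.ones((3, 3))

# Direct fancy indexing with the 2D matrix indexes only axis 0:
print(x[i2d].shape)   # (2, 6, 3) -- not what the user wanted

# Convert to a tuple of two 1D index arrays first:
rows, cols = i2d
x[rows, cols] = 0
print(int(x.sum()))   # 3 strictly-upper-triangular entries remain
```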
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14904
Reviewed By: zou3519
Differential Revision: D13433027
Pulled By: mrshenli
fbshipit-source-id: 41c876aafcf584832d7069f7c5929ffb59e0ae6a
Summary:
Documents what is supported in the script standard library.
* Adds `my_script_module._get_method('forward').schema()` method to get function schema from a `ScriptModule`
* Removes `torch.nn.functional` from the list of builtins. The only functions not supported are `nn.functional.fold` and `nn.functional.unfold`, but those currently just dispatch to their corresponding aten ops, so from a user's perspective it looks like they work.
* Allow printing of `IValue::Device` by getting its string representation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14912
Differential Revision: D13385928
Pulled By: driazati
fbshipit-source-id: e391691b2f87dba6e13be05d4aa3ed2f004e31da
Summary:
Fixes#15119. Before this PR, we were propagating constants through
aten::warn AND running it as a part of shape analysis.
This caused aten::warn to be run regardless of if it is
supposed to be run dynamically. This PR adds an exclusion for aten::warn
in constant propagation and shape analysis, similar to that of prim::RaiseException.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15124
Differential Revision: D13432815
Pulled By: zou3519
fbshipit-source-id: 15ab533ce2accb2da3fd4e569070c7979ce61708
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14248
This diff also introduces a horrifying hack to override CUDA's DeviceGuardImpl
with a HIPGuardImplMasqueradingAsCUDA, to accommodate PyTorch's current
behavior of pretending CUDA is HIP when you build with ROCm enabled.
Reviewed By: bddppq
Differential Revision: D13145293
fbshipit-source-id: ee0e207b6fd132f0d435512957424a002d588f02
Summary:
…r_list_unwrap.
These functions use unsafeGetTensorImpl(), which doesn't work with Variables (in a silent way that may blow up later).
So let's do early checking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15105
Reviewed By: ezyang
Differential Revision: D13429149
Pulled By: gchanan
fbshipit-source-id: b85f6f5b7cdb9a6dd0c40205b924c840a3920ba0
Summary:
Fixes#15038.
aten::_cast_Float(tensor, non_blocking) support was added in #14336.
Its second argument is a bool, but because we don't support generating values
of type bool in the fuser codegen, the codegen errored out.
aten::_cast_Float in the fuser never actually uses its non_blocking
argument, so another way to fix this would be to have a special op for a
fused cast but I thought that we might have fusible ops that do take
bool arguments in the future so this would be good to have.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15057
Differential Revision: D13432091
Pulled By: zou3519
fbshipit-source-id: 455fe574f5f080aca9a112e346b841a2534a8dc3
Summary:
While moving these scenarios into `_test_dim_ops` I accidentally left an empty loop in the actual tests, causing them to do nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15077
Differential Revision: D13428759
Pulled By: umanwizard
fbshipit-source-id: 08f53068981d9192c1408878b168e9053f4dc92e
Summary:
When I do this setup in a local Docker development environment,
I get the following error:
x86_64-linux-gnu-gcc: error trying to exec 'cc1plus': execvp: No such file or directory
Somehow, gcc seems to get confused when it gets run from the wrong
directory. Best not to do it.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15078
Differential Revision: D13432143
Pulled By: ezyang
fbshipit-source-id: b18e15f493503a4c8205c85f92a214e49762a7bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14631
Adds an empty name scope to allow people to jump out of the current name scope.
This can be useful when you want to access a blob from a parent or sibling scope.
Facebook:
e.g.: we encountered a potential use case in D13124249 (it's a large diff, please search for EmptyNameScope in that diff); we need to access a blob declared in the root name scope from a device name scope (device name scope has been used by the parallel_GPU API). `EmptyNameScope` can help us do that with ease.
I referenced to `EmptyDeviceScope` D6103412 while implementing this one.
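The idea can be sketched in plain Python (a minimal illustration of scope nesting, not the actual caffe2 `scope` module API; all names below are hypothetical):

```python
import contextlib
import threading

_scopes = threading.local()

def _stack():
    return getattr(_scopes, "stack", [])

@contextlib.contextmanager
def name_scope(name):
    """Push one scope component, in the spirit of caffe2's NameScope."""
    saved = _stack()
    _scopes.stack = saved + [name]
    try:
        yield
    finally:
        _scopes.stack = saved

@contextlib.contextmanager
def empty_name_scope():
    """Temporarily jump back to the root scope, restoring on exit."""
    saved = _stack()
    _scopes.stack = []
    try:
        yield
    finally:
        _scopes.stack = saved

def scoped_blob(name):
    """Resolve a blob name under the current scope."""
    return "/".join(_stack() + [name])
```

Inside `empty_name_scope()`, `scoped_blob("w")` resolves to the root-level `"w"` even when nested under device scopes; the previous scope stack comes back when the context exits.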
Reviewed By: yinghai
Differential Revision: D13272240
fbshipit-source-id: d4cde5abcc2336e456b6c6ef086266ef94d86da8
Summary:
Fixes a bug where serializing and then deserializing a hierarchy of submodules, in which one submodule has no parameters but its own submodules do, did not load properly. This had to do with the fact that the old protobuf format couldn't store empty parameters.
Fixes https://github.com/pytorch/pytorch/issues/14891
soumith ezyang ebetica
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15033
Differential Revision: D13411322
Pulled By: goldsborough
fbshipit-source-id: 2ef73b2aa93fa9e46b1cbe1fd47d9f134d6016d5
Summary:
…on first and all the values in the next line. This way, it can output arbitrary blob
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15108
Reviewed By: llyfacebook
Differential Revision: D13429346
Pulled By: sf-wind
fbshipit-source-id: 5e0bba2a46fbe8d997dfc3d55a698484552e3af8
Summary:
Provide a pre-commit hook that does flake8 and clang tidy checks. Enables the clang-tidy script to run in parallel to make it fast enough to be used in a pre-commit hook.
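A minimal sketch of what such a hook runner might look like (illustrative only, not the actual script added by the PR; the concrete lint commands are supplied by the caller):

```python
import subprocess
import sys

def run_check(cmd):
    """Run one lint command; return True if it exits cleanly."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        sys.stderr.write(proc.stdout + proc.stderr)
    return proc.returncode == 0

def main(checks):
    # Run every check even if an early one fails, so the developer
    # sees all problems in one pass; fail the commit if any failed.
    results = [run_check(cmd) for cmd in checks]
    return 0 if all(results) else 1
```

A git pre-commit hook would call `main` with commands like `["flake8"]` and exit with its return value; a nonzero exit aborts the commit.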
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15102
Reviewed By: soumith
Differential Revision: D13429629
Pulled By: zdevito
fbshipit-source-id: bd52fe5652f29b033de8d9926d78350b2da4c2fc
Summary:
…_tensor.
This is part of a long series of paring down the Type interface.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15074
Differential Revision: D13421482
Pulled By: gchanan
fbshipit-source-id: 84010ee71fef2cb74d32d5de7858d8ed9f36b885
Summary:
```
This diff changes the HIPification of ATen to be out-of-place.
We now have the following mappings:
- ATen/cuda => ATen/hip
- ATen/native/cuda => ATen/native/hip
- ATen/native/sparse/cuda => ATen/native/sparse/hip
- THC => THH
- THCUNN => THHUNN
The build system is adjusted to know about these new build paths,
and HIPify is taught how to adjust include paths and
THC_GENERIC_FILE appropriately. ATen_hip is now built as
the ATen_hip library, rather than reusing ATen_cuda.
However, despite these new filepaths, none of the identifiers in ATen
have actually changed. So, e.g., THHGeneral.h still defines functions
named THC_blahblah, and HIP still shows up as CUDA in PyTorch itself.
We'll tackle this in a subsequent PR; this diff is just to get the files
out-of-place.
Minor extra improvements:
- Don't edit tmp_install when hipifying
- HIP no longer builds native_cudnn_cpp; it was unnecessary
- Caffe2_HIP_INCLUDES is now Caffe2_HIP_INCLUDE, for consistency
with all the other variables.
- HIP build now properly respects ATEN_CUDA_FILES_GEN_LIB (it
did not previously.)
- You can now override file extension matching in pyHIPIFY
by explicitly specifying its full name in the matching list.
This is used so we can HIPify CMakeLists.txt in some situations.
A little bit of string and sealing wax:
- gen.py grows a --rocm flag so that it knows to generate CUDA
files which actually refer to the HIP headers (e.g., THH.h)
We'll get rid of this eventually and generate real HIP files,
but not for this PR.
- Management of HIP dependencies is now completely deleted
from the ATen CMakeLists.txt. The old code was dead (because
it was shoveled in ATen_CUDA_DEPENDENCY_LIBS and promptly
ignored by the Caffe2 build system) and didn't actually work.
```
Stacked on https://github.com/pytorch/pytorch/pull/14849 review last commit only
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14866
Differential Revision: D13419475
Pulled By: ezyang
fbshipit-source-id: cb4c843df69a1d8369314c9fab1b7719520fa3db
Summary:
Removing the deprecated functions in `torch/csrc/variable_tensor_functions.h` (like `torch::CPU`) and corresponding implementations from `torch/csrc/torch.cpp` from master after the release.
ezyang gchanan soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15003
Differential Revision: D13418086
Pulled By: goldsborough
fbshipit-source-id: a0accdf6f7b0efa1ec07ac7b74b86ff2da37543f
Summary:
…done once
This allow no-op build to work correctly even when BUILD_CAFFE2_OPS is on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14982
Differential Revision: D13413960
Pulled By: zdevito
fbshipit-source-id: 6e5412a8c375af8a47c76f548cdd31cff15f3853
Summary:
This PR creates TestFuser inside test_jit.py to be a home for graph fuser
specific tests.
This was a useful exercise because now that all the fuser tests are in
one place, I can spot redundant and bitrotting tests for cleanup in a
future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15072
Differential Revision: D13421458
Pulled By: zou3519
fbshipit-source-id: 80b1a7712feff75a0c186d1664601c4edbbca694
Summary: Removes all warnings spew for the TestJitGenerated tests
Differential Revision: D13420919
fbshipit-source-id: f251c12f923088ccc5daa2984c15003a67cbd1c1
Summary:
The coverage of scalar-input test cases were not accurate. This patch fixed that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15029
Differential Revision: D13419764
Pulled By: zrphercule
fbshipit-source-id: a14a5cbef432bea8c9126156f5deb1125e1aeb47
Summary:
We were only using this file to configure flake8, and fbcode linters do not recognize tox.ini which causes spurious linter warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15065
Differential Revision: D13420774
Pulled By: suo
fbshipit-source-id: e43a46befa36862c8b3c0a90074aec6a66531492
Summary:
Previously we were returning true if either IValue wasn't a tensor, which…is bad
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15031
Differential Revision: D13409759
Pulled By: suo
fbshipit-source-id: f8bdcd05d334c1276ce46f55812065d358c1ff5d
Summary:
Currently in caffe2, one cannot properly fetch the content of Int8 blobs.
Upon digging into the source, it turns out that the relevant code is not being compiled. Adding the source to CMakeLists.txt fixes this issue.
First time ever doing a pull request. Please let me know if there's any rule I should follow. Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15047
Differential Revision: D13417583
Pulled By: bddppq
fbshipit-source-id: dd39575971a3012635edbf97a045d80e4b62a8eb
Summary:
This is broken out of https://github.com/pytorch/pytorch/pull/13733/
We want to install cpp tests so they can ultimately be runnable from that location for Caffe2 tests run from PyTorch builds.
cc pjh5 yf225 anderspapitto
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15000
Reviewed By: pjh5
Differential Revision: D13416253
Pulled By: orionr
fbshipit-source-id: 51280be0a22557a742f90c9f303c58c35cbd4a38
Summary:
This removes FloatToInt style names, replacing them with just the destination
name (e.g. FloatToInt -> Float). This makes it more consistent with the
syntax and makes it easier to add type conversions (just add a new
prim::Int op, for instance).
None of these ops get serialized, so this should not affect loading of
old models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14947
Differential Revision: D13408409
Pulled By: zdevito
fbshipit-source-id: d773fe863f14d9de893f686832769f8cc8903a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/15041
Adding an alternative implementation of a task graph based on TBB
Reviewed By: dmudiger
Differential Revision: D13412517
fbshipit-source-id: f5efedd680bbe0072bf38d504e5682ab51dd630f
Summary:
1) at::functions are now also exposed in the at::legacy::th namespace and we move relevant calls over to use them (to avoid merge conflicts)
2) LegacyTHDispatch now handles device-type initialization
3) We generate derived LegacyTHDispatchers, e.g. THLegacyCPULongDispatcher, although they are currently empty.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14852
Reviewed By: ezyang
Differential Revision: D13360852
Pulled By: gchanan
fbshipit-source-id: af6705aeba3593ea5dba9bfc62890e5257bc81f8
Summary:
This PR aligns the Array struct such that cuda vector performance improvements can be utilized.
I tested this by using it on our Philox header. Note how the vector store instruction gets used for cuda vector types and when using alignas on Array, vs when not using alignas on Array.
With cuda vector type (uint4, uint2, float4): https://godbolt.org/z/UaWOmR
With alignas: https://godbolt.org/z/Eeh0t5
Without alignas: https://godbolt.org/z/QT63gq
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14920
Differential Revision: D13406751
Pulled By: soumith
fbshipit-source-id: 685b1010ef1f576dde30c278b1e9b642f87c843d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14197
Pull Request resolved: https://github.com/pytorch/pytorch/pull/13642
Previously we passed a partially initialized Tensor to Deserialize, which filled
it with the result of deserializing a tensor proto. Now we want it to return
a Tensor directly, since a Tensor is just a shared pointer to TensorImpl.
Reviewed By: dzhulgakov
Differential Revision: D12874357
fbshipit-source-id: 12b80a763375da23cfa64a74d6bc186d8d03b94f
Summary:
This can be use to initialize state that is not necessarily eligible for serialization/is implementation-specific. Concretely, I'm going to use this to pack the weight matrices for quantized Linear modules according to the FBGEMM APIs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14655
Differential Revision: D13404438
Pulled By: jamesr66a
fbshipit-source-id: 2d327cef5520fdd716b5b1b29effd60a049e8a4a
Summary:
We've virtualized the destructor for storage, so we
no longer have to forward to a particular backend.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14897
Differential Revision: D13399216
Pulled By: ezyang
fbshipit-source-id: 531d29c3f278477cfa8759f30ab4f304d695b659
Summary:
cc iotamudelta
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14993
Differential Revision: D13405804
Pulled By: ezyang
fbshipit-source-id: c4aa9ed29ee2a4f3abf76c1e0fa8babfd738db35
Summary:
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc iotamudelta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14999
Differential Revision: D13405754
Pulled By: ezyang
fbshipit-source-id: 98459496494390ad1115b4f1f6738d53c14f0745
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14911
In optimized modes the compiler tries to inline all the
`unordered_map::operator[]` calls, creating a massive amount of code
which takes several minutes to optimize. Instead, create a table of
PODs and populate the maps using a simple loop.
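The shape of the fix can be illustrated in Python (the real change is in C++, where each inlined `operator[]` call bloated compile time; the table entries below are hypothetical stand-ins):

```python
# A plain table of (name, value) pairs, analogous to the table of PODs.
OP_TABLE = (
    ("relu", 1),
    ("sigmoid", 2),
    ("tanh", 3),
)

def build_registry(table):
    """Populate the map with one simple loop instead of one
    map[key] = value statement per entry; in the C++ original this
    replaces many inlined unordered_map::operator[] expansions."""
    registry = {}
    for name, value in table:
        registry[name] = value
    return registry
```

The resulting map is identical to what per-entry assignments would build; only the amount of generated code (and hence optimizer time) changes in the C++ case.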
Reviewed By: soumith, luciang
Differential Revision: D13382948
fbshipit-source-id: b6752921e0f7213595d26b39e4397f6a3897960b
Summary:
When rewriting `default_collate`, I noticed that `from_numpy`, `as_tensor`, and `tensor` all fail on `np.int8` arrays.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14700
Reviewed By: weiyangfb
Differential Revision: D13305297
Pulled By: soumith
fbshipit-source-id: 2937110f65ed714ee830d50098db292238e9b2a9
Summary:
The other direction of #14700
cc soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14710
Reviewed By: weiyangfb
Differential Revision: D13306052
Pulled By: soumith
fbshipit-source-id: 202d038f139cf05e01069ff8d05268c66354c983
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14881
This diff allows us to pre-quantize and pre-pack the weight matrix used in DNNLOWP_ACC16.
The intended use pattern is to run Int8ConvPackWeight in init_net to generate a packed weight, which Int8Conv with the DNNLOWP_ACC16 engine then uses.
Reviewed By: csummersea
Differential Revision: D13374662
fbshipit-source-id: dd02b9a4eb7af1fe208aa857fcd0b445e6e395af
Summary:
1. Changes the prints along the 'rebuild' pathway to respect the '-q' flag of setup.py
A clean rebuild now only prints:
[zdevito@devgpu172.prn2 /data/users/zdevito/pytorch] python setup.py -q rebuild develop
[0/1] Install the project...
-- Install configuration: "RelWithDebInfo"
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
ninja: no work to do.
2. Deletes apparently dead calls to `generate_code`. Now that CMake builds these files,
it appears that it is getting called twice and the second version is never used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14972
Reviewed By: soumith
Differential Revision: D13396330
Pulled By: zdevito
fbshipit-source-id: 83c45143bbc6a6d2c1cfee929291ec059f2b5dc3
Summary: we now get mkldnn automatically from third_party/ideep
Differential Revision: D13396480
Pulled By: soumith
fbshipit-source-id: 20f819ba4b78cbe9c7d0baeab1c575669cbf6c20
Summary:
This fixes rebuild issues with the ninja part of the build. With this patch all ninja files will now report `nothing to do` if nothing has changed assuming `BUILD_CAFFE2_OPS=0`.
1. This only does the python file processing for caffe2 when BUILD_CAFFE2_OPS=1. That part of the build is written in such a way that it always reruns and can take substantial time moving files around in an otherwise no-op build. In the future it should be rewritten to use a faster method of copying the files, or should treat copying the files as part of the build rules and only run when the files are out of date.
2. This points `sleef` to a patched version that fixes a dead build output that is causing everything to relink all the time. See https://github.com/shibatch/sleef/pull/231#partial-pull-merging for the upstream change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14969
Reviewed By: soumith
Differential Revision: D13395998
Pulled By: zdevito
fbshipit-source-id: ca85b7be9e99c5c578103c144ef0f2c3b927e724
Summary:
Fix auto grad summing for IfOp where an intermediate output needs renaming.
Bug before this diff:
- we only renamed the output of IfOp without changing the subnet ops' outputs
- this resulted in a blob-not-found error
The unit test provides an example; this diff fixes that for IfOp.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14772
Differential Revision: D13327090
Pulled By: harouwu
fbshipit-source-id: ec40ee88526ace3619c54551e223dd71158a02f8
Summary:
This PR does the following:
1) Updates the ONNX export for `torch.zeros_like` and `torch.full_like` ops to use ONNX op `ConstantLike`. This reduces the export of experimental op `ConstantFill`, which may possibly be removed in future, see https://github.com/onnx/onnx/pull/1434).
2) It also adds export support for `torch.ones_like`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14903
Differential Revision: D13383700
Pulled By: houseroad
fbshipit-source-id: 566d00a943e9497172fcd5a034b638a650ab13a2
Summary:
Anywhere we used #include "foo.h", we now say #include <foo.h>
Paths are adjusted to be rooted out of aten/src, torch/lib, or
the root level directory.
I modified CMakeLists.txt by hand to remove TH and THC from
the include paths.
I used the following script to do the canonicalization:
```
import subprocess
import re
import os.path
files = subprocess.check_output(['git', 'ls-files']).decode('utf-8').rstrip().split('\n')
for fn in files:
    if not any(fn.endswith(suff) for suff in ['.cu', '.cpp', '.in', '.h', '.hpp', '.cu', '.cuh', '.cc']):
        continue
    if not any(fn.startswith(pref) for pref in ["aten/", "torch/"]):
        continue
    with open(fn, 'r') as f:
        c = f.read()
    def fmt(p):
        return "#include <{}>".format(p)
    def repl(m):
        p = m.group(1)
        if p in ["dlfcn.h", "unistd.h", "nvrtc.h", "cuda.h", "cuda_runtime.h", "cstdint", "cudnn.h", "Python.h", "cusparse.h", "cuda_runtime_api.h", "cuda_fp16.h", "cublas_v2.h", "stdint.h", "curand_kernel.h"]:
            return fmt(p)
        if any(p.startswith(pref) for pref in ["torch/csrc", "c10/", "ATen/", "caffe2/", "TH/", "THC/", "Eigen/", "gtest/", "zdl/", "gloo/", "onnx/", "miopen/"]):
            return fmt(p)
        for root in ["aten/src", "torch/lib", ""]:
            for bad_root in [os.path.dirname(fn), "aten/src/TH", "aten/src/THC", "torch/csrc"]:
                new_p = os.path.relpath(os.path.join(bad_root, p), root)
                if not new_p.startswith("../") and (os.path.exists(os.path.join(root, new_p)) or os.path.exists(os.path.join(root, new_p + ".in"))):
                    return fmt(new_p)
        print("ERROR: ", fn, p)
        return m.group(0)
    new_c = re.sub(r'#include "([^"]+)"', repl, c)
    if new_c != c:
        print(fn)
        with open(fn, 'w') as f:
            f.write(new_c)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14849
Reviewed By: dzhulgakov
Differential Revision: D13363445
Pulled By: ezyang
fbshipit-source-id: 52361f878a672785f9306c9e9ab2513128092b68
Summary:
50x-100x speedup compared to current version.
Also, fixes a bug in the current version when batch size exceeds 1 (current version processes only the first image in this case).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14883
Differential Revision: D13390655
Pulled By: Maratyszcza
fbshipit-source-id: 1b33a97bf2d0866d38faa2b42e64fd2859017898
Summary:
Tested on a tensor with 1 billion elements and 3 dimensions on a powerful, highly
multi-core Linux machine.
parallelized: All operations (e.g., `t.std(1)`) that could be done in the old code are now several times faster. All
new operations (e.g., `t.std((0, 2))`) are significantly faster than the NumPy equivalents.
`t.std((0, 1, 2))`, a new operation, is logically equivalent to the
old `t.std()`, but faster.
serial: The above comment about old operations now being faster still
holds, but `t.std((t1, ..., tn))` is now a few
times slower than `t.std()`. If this turns out to be important, we can
special-case that to use the old algorithm.
The approach is to create a new method, `TensorIterator::foreach_reduced_elt`,
valid for `TensorIterator`s that represent a dimension reduction. This
method calls a supplied function for each element in the output,
supplying it with the input elements that correspond to that output.
Given that primitive, we can implement reductions like the following pseudocode:
If there is more than one output element:
```
PARALLEL FOR EACH element IN output:
accumulator = identity
SERIAL FOR EACH data_point IN element.corresponding_input:
accumulator.update(data_point)
element = accumulator.to_output()
```
If there is only one output element, we still want to parallelize, so we
do so along the *input* instead:
```
accumulators[n_threads]
PARALLEL FOR EACH input_chunk IN input.chunks():
accumulators[thread_num()] = identity
SERIAL FOR EACH data_point IN input_chunk:
accumulators[thread_num()].update_with_data(data_point)
accumulator = identity
SERIAL FOR EACH acc in accumulators:
accumulator.update_with_other_accumulator(acc)
output_element = accumulator.to_output()
```
Note that accumulators and data points do not have to be the same type
in general, since it might be necessary to track arbitrary amounts of
data at intermediate stages.
For example, for `std`, we use a parallel version of Welford's
algorithm, which requires us to track the mean, second moment, and number
of elements, so the accumulator type for `std` contains three pieces of
data.
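The single-output path above, per-thread accumulators merged at the end, maps naturally onto the parallel form of Welford's algorithm. A minimal Python sketch of the update and combine steps (illustrative only; the actual implementation lives in TensorIterator's C++ reduction kernels, and the chunks here stand in for per-thread work):

```python
import math

def welford_update(state, x):
    """Add one data point to a (count, mean, M2) accumulator."""
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return (n, mean, m2)

def welford_combine(a, b):
    """Merge two accumulators (Chan et al.'s parallel variant)."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    if na == 0:
        return b
    if nb == 0:
        return a
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return (n, mean, m2)

def std(chunks, unbiased=True):
    """Reduce each chunk serially, then merge the partial accumulators."""
    partials = []
    for chunk in chunks:  # each chunk would run on its own thread
        acc = (0, 0.0, 0.0)
        for x in chunk:
            acc = welford_update(acc, x)
        partials.append(acc)
    total = (0, 0.0, 0.0)
    for acc in partials:
        total = welford_combine(total, acc)
    n, _, m2 = total
    return math.sqrt(m2 / (n - 1 if unbiased else n))
```

Because the combine step is exact, splitting the input into chunks gives the same result as a single serial pass, which is what makes the input-parallel path safe.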
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14535
Differential Revision: D13283887
Pulled By: umanwizard
fbshipit-source-id: 8586b7bf00bf9f663c55d6f8323301e257f5ec3f
Summary:
* Enable unit tests known to work on ROCm.
* Disable a few that are known to be flaky for the time being.
* Use std::abs for Half
* No more special casing for ROCm in TensorMathReduce
* Document an important detail for a hardcoded block size w.r.t. ROCm in TensorMathReduce
ezyang bddppq for awareness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14011
Differential Revision: D13387679
Pulled By: bddppq
fbshipit-source-id: 4177f2a57b09d866ccbb82a24318f273e3292f71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14658
Remove this dependency by moving at::CopyBytes to c10.
The implementations for at::CopyBytes will have to live in aten/caffe2 for now because they're not unified for CUDA yet.
They'll be moved into c10/backend/xxx later.
Reviewed By: dzhulgakov
Differential Revision: D13288655
fbshipit-source-id: 1c92379345308b3cd39a402779d7b7999613fc0d
Summary:
Tracing records variable names and we have new types and stuff in the IR, so this updates the graph printouts in the docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14914
Differential Revision: D13385101
Pulled By: jamesr66a
fbshipit-source-id: 6477e4861f1ac916329853763c83ea157be77f23
Summary:
Added a few examples and explains to how publish/load models.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14862
Differential Revision: D13384790
Pulled By: ailzhang
fbshipit-source-id: 008166e84e59dcb62c0be38a87982579524fb20e
Summary:
This will let us install tests and other Caffe2 python code as a part of running Caffe2 tests in PyTorch.
Broken out of https://github.com/pytorch/pytorch/pull/13733/
cc pjh5 yf225
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14898
Reviewed By: pjh5
Differential Revision: D13381123
Pulled By: orionr
fbshipit-source-id: 0ec96629b0570f6cc2abb1d1d6fce084e7464dbe
Summary:
_th_tensor is moving off Type, so these calls need to be replaced.
Unfortunately, replacing these with a full-fledged solution [e.g. from_storage(..., TensorOptions)] is a bit complicated because the storage itself fully defines the Type (modulo variable). It's simpler to just wait for the Variable/Tensor merge rather than to solve this now, so instead I changed the call sites to: at::empty({0}, type.options()).set_(storage...).
This isn't great because we are also trying to get rid of Type::options, but this seems to be the lesser-of-two-evils.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14877
Differential Revision: D13374310
Pulled By: gchanan
fbshipit-source-id: eb953ed041507e6190d6f32e383912e5a08311cd
Summary:
Fixes#14099
I attempted to be as consistent as possible with the formatting, which is why my equation reads d*(k - 1) instead of (k - 1)*d.
Also there is an unused variable on line 46: `n = self.in_channels`. I could fix that here too if that's not too out of scope.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14876
Differential Revision: D13374317
Pulled By: soumith
fbshipit-source-id: a9f110acafa58cdb4206956dbe3ab4738d48292d
Summary:
- allow gradcheck to take sparse tensor as input
- sparse output is not allowed yet at gradcheck
- add backward for `to_dense()` to get around sparse output
- calling gradcheck at test_sparse, so that we can use `_gen_sparse()` and also easily cover coalesced / uncoalesced test cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14596
Differential Revision: D13271904
Pulled By: weiyangfb
fbshipit-source-id: 5317484104404fd38058884c86e987546011dd86
Summary:
Otherwise, these tests will fail, even though they are never meant to run on single-GPU machines.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14860
Differential Revision: D13369060
Pulled By: teng-li
fbshipit-source-id: 8a637a6d57335491ba8602cd09927700b2bbf8a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14838
The GPU memory tracking logs are incredibly annoying and merely serve
to pollute output. I `VLOG(1)`ed them. Hopefully, this is non-controversial.
Reviewed By: kuttas
Differential Revision: D13343290
fbshipit-source-id: b3cae99346c97b66e97ea660061e15dc5c99b9fc
Summary:
Latest hcc can now properly cast to correct type internally, so there is no need to insert static_cast in hipify scripts anymore.
However the hcc included in the latest ROCm release (1.9.2) doesn't have this fix, so leaving a flag to continue doing static_cast for those using the official ROCm releases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14853
Differential Revision: D13363171
Pulled By: bddppq
fbshipit-source-id: a36476a8511222ff3c933d31788e8a0ffb04f5ca
Summary:
Drop custom hcc/hip as the 1.9.2 release should contain the relevant patches therein.
Most notable feature in 1.9.2 is mixed precision support in rocBLAS and MIOpen. These features will be enabled by subsequent PRs.
bddppq ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14216
Differential Revision: D13354294
Pulled By: bddppq
fbshipit-source-id: 2541d4a196af21c9432c1aff7f6e65b572628028
Summary:
`torch.linspace(0, 1, 1)` fails with `RuntimeError: invalid argument 3: invalid number of points at ../aten/src/TH/generic/THTensorMoreMath.cpp:2119`, while `np.linspace(0, 1, 1)` works fine.
Looking at the code, there is even a comment by gchanan asking: "NumPy allows you to pass different points even if n <= 1 -- should we?"
I would say "yes". Currently, I would need to handle the case of `steps == 1` or `steps == 0` separately, making sure to change the `end` when calling `torch.linspace`. This is impractical. If we support `start != end`, there are two possibilities for the result: Either we ensure the first value in the resulting sequence always equals `start`, or we ensure the last value in the resulting sequence always equals `end`. Numpy chose the former, which also allows it to support a boolean `endpoint` flag. I'd say we should follow numpy.
This PR adapts `linspace` and `logspace` to mimic the behavior of numpy, adapts the tests accordingly, and extends the docstrings to make clear what happens when passing `steps=1`.
If you decide against this PR, the error message should become explicit about what I did wrong, and the documentation should be extended to mention this restriction.
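The proposed numpy-style behavior can be sketched in plain Python (an illustration of the semantics, not the ATen implementation):

```python
def linspace(start, end, steps):
    """NumPy-style linspace: steps == 1 yields [start], steps == 0 yields []."""
    if steps < 0:
        raise ValueError("number of steps must be non-negative")
    if steps == 0:
        return []
    if steps == 1:
        # NumPy's choice: the single value is `start`, not `end`.
        return [float(start)]
    step = (end - start) / (steps - 1)
    return [start + i * step for i in range(steps)]
```

With this, `linspace(0, 1, 1)` returns `[0.0]` instead of raising, matching `np.linspace(0, 1, 1)`.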
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14748
Differential Revision: D13356136
Pulled By: ezyang
fbshipit-source-id: db85b8f0a98a5e24b3acd766132ab71c91794a82
Summary:
Removes cast of half to float in torch.sum, with float16 input tensor and
float32 output tensor, instead we cast data when loading input in kernel.
This supposingly would save a kernel launch as well as a full global memory load
on promoted data type (float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14580
Differential Revision: D13356203
Pulled By: ezyang
fbshipit-source-id: 85e91225b880a65fe3ceb493371b9b36407fdf48
Summary:
Not ready yet, need some comments / help with this. It's good enough for https://github.com/pytorch/xla immediate goals (forward + backward trace fusion), but there are at least two issues with it:
1. If we don't allow it, `test/test_jit.py` fails to cover the change.
2. If we allow the weight to be set, running `test/test_jit.py TestJitGenerated.test_nn_nll_loss` fails with:
```
======================================================================
ERROR: test_nn_nll_loss (__main__.TestJitGenerated)
----------------------------------------------------------------------
Traceback (most recent call last):
File "test/test_jit.py", line 10001, in do_test
fn, f_args_variable, kwargs_variable, no_grad=no_grad)
File "test/test_jit.py", line 9360, in check_against_reference
outputs_test = self.runAndSaveRNG(func, recording_inputs, kwargs)
File "test/test_jit.py", line 425, in runAndSaveRNG
results = func(*inputs, **kwargs)
File "test/test_jit.py", line 9298, in script_fn
self.assertExportImport(CU.the_method.graph, tensors)
File "test/test_jit.py", line 415, in assertExportImport
self.assertExportImportModule(m, inputs)
File "test/test_jit.py", line 419, in assertExportImportModule
self.assertEqual(self.runAndSaveRNG(m.forward, inputs),
File "test/test_jit.py", line 425, in runAndSaveRNG
results = func(*inputs, **kwargs)
RuntimeError:
arguments for call are not valid:
for operator aten::nll_loss_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight, *, Tensor out) -> Tensor:
expected a value of type Tensor for argument 'total_weight' but found bool
<internally-created-node>
~ <--- HERE
for operator aten::nll_loss_backward(Tensor grad_output, Tensor self, Tensor target, Tensor? weight, int reduction, int ignore_index, Tensor total_weight) -> Tensor:
expected a value of type Tensor for argument 'total_weight' but found bool
<internally-created-node>
~ <--- HERE
for call at:
<internally-created-node>
~ <--- HERE
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14305
Differential Revision: D13356265
Pulled By: ezyang
fbshipit-source-id: 504d783b2d87f923e698a6a4efc0fd9935a94a41
Summary:
This pull request contains changes for:
1. Added MIOpen RNN API miopenGetRNNLayerBiasSize and miopenGetRNNLayerParamSize.
2. Fixed usage of API miopenGetRNNLayerParam.
3. Modifying the RNN test to run using MIOpen engine.
Differential Revision: D13355699
Pulled By: bddppq
fbshipit-source-id: 6f750657f8049c5446eca893880b397804120b69
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14827
We need to send complete IO info when doing `onnxGetBackendCompatibility` to a backend like Glow. Previously we were missing some info because sometimes we generate more than one node from one C2 op. This fixes the issue.
Reviewed By: jackm321
Differential Revision: D13352049
fbshipit-source-id: 8d8ac70656a0ac42f3a0ccecad61456a4f3b2435
Summary:
Currently, pytorch doesn't depend on protobuf, so we don't need to include the protobuf dir in the pytorch cmake file.
And if we build caffe2 without custom protobuf[1], we will have a protobuf mismatch problem.
[1]
92dbd0219f/CMakeLists.txt (L65)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14182
Differential Revision: D13356273
Pulled By: ezyang
fbshipit-source-id: 8120c3452d158dc51d70156433d7b9076c6aed47
Summary:
Fix CMakeLists.txt, so the test for CPU won't run profile_observer_test.cc, as currently it only supports GPU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14168
Differential Revision: D13356274
Pulled By: ezyang
fbshipit-source-id: 7d105f2e18675e5fab129864958148b0f18d582c
Summary:
I know that including CAFFE2_INCLUDE_DIRS in include headers is not necessary for newer cmakes. But I had this in one of my old projects and **cmake gave me an error that "/usr/lib/include" is an invalid path**.
It seems like "${_INSTALL_PREFIX}/lib/include" should be changed to "${_INSTALL_PREFIX}/include" as all caffe2 headers are in /include rather than /lib/include/
Please correct me if I am wrong?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14306
Differential Revision: D13356246
Pulled By: ezyang
fbshipit-source-id: e2d5d3c42352e59b245714ad90fd7a9ef48170d7
Summary:
On Windows, some ATen core files (Type.h, Tensor.h, TensorMethods.h) are generated with CRLF newlines (this may be environment dependent).
As a result, the file comparison in generate_outputs() fails and compilation stops.
This patch forces these files to be generated with LF line endings.
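One way to force LF output from a Python generator, regardless of platform (a minimal sketch of the idea, not the patch itself; the function name is made up):

```python
def write_generated_file(path, text):
    # Passing an explicit newline argument to open() disables Python's
    # platform newline translation, so the file gets LF line endings
    # even on Windows, where the default would be CRLF.
    with open(path, "w", newline="\n") as f:
        f.write(text)
```

Generated files written this way compare byte-identical across platforms, which is exactly what the generate-then-compare step needs.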
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14667
Differential Revision: D13356170
Pulled By: ezyang
fbshipit-source-id: ef8cc3a6cc8bf3c45b78e9eb3df98cf47c0d33bb
Summary:
pytorch_theme.css is no longer necessary for the cpp or html docs site build. The new theme styles are located at https://github.com/pytorch/pytorch_sphinx_theme. The Lato font is also no longer used in the new theme.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14779
Differential Revision: D13356125
Pulled By: ezyang
fbshipit-source-id: c7635eb7512c7dcaddb9cad596ab3dbc96480144
Summary:
Implement some simple fixes to clean up windows build by fixing compiler warnings. Three main types of warnings were fixes:
1. GCC specific pragmas were changed to not be used on windows.
2. cmake flags that don't exist on windows were removed from windows build
3. Fix a macro that was defined multiple times on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14490
Differential Revision: D13241988
Pulled By: ezyang
fbshipit-source-id: 38da8354f0e3a3b9c97e33309cdda9fd23c08247
Summary:
See #14554.
I can't figure out how the reported issue can happen. The next best
thing is to have more information when this happens again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14813
Differential Revision: D13351908
Pulled By: pietern
fbshipit-source-id: 61b30fcae2e34da54329d0893ca4921b6ad60f0d
Summary:
It is possible that some sort of contention causes process scheduling
delays which in turn cause the timeout to *not* be hit.
Increased sleep here will decrease the probability of this happening.
Fixes #14555.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/14814
Differential Revision: D13351924
Pulled By: pietern
fbshipit-source-id: 1222cf0855408dfcb79f30f94694c790ee998cf9
2. Persists CircleCI scripts (everything in `.circleci`) into a workspace. Why?
We don't always do a Git checkout on all subjobs, but we usually
still want to be able to call scripts one way or another in a subjob.
Persisting files this way lets us have access to them without doing a
checkout. This workspace is conventionally mounted on `~/workspace`
(this is distinguished from `~/project`, which is the conventional
working directory that CircleCI will default to starting your jobs
in.)
3. Write out the commit message to `.circleci/COMMIT_MSG`. This is so
we can determine in subjobs if we should actually run the jobs or
not, even if there isn't a Git checkout.
CircleCI configuration generator
================================
One may no longer make changes to the `.circleci/config.yml` file directly.
Instead, one must edit these Python scripts or files in the `verbatim-sources/` directory.
Usage
----------
1. Make changes to these scripts.
2. Run the `regenerate.sh` script in this directory and commit the script changes and the resulting change to `config.yml`.
You'll see a build failure on TravisCI if the scripts don't agree with the checked-in version.
Motivation
----------
These scripts establish a single, authoritative source of documentation for the CircleCI configuration matrix.
The documentation, in the form of diagrams, is automatically generated and cannot drift out of sync with the YAML content.
Furthermore, consistency is enforced within the YAML config itself, by using a single source of data to generate
multiple parts of the file.
* Facilitates one-off culling/enabling of CI configs for testing PRs on special targets
Also see https://github.com/pytorch/pytorch/issues/17038
Future direction
----------------
### Declaring sparse config subsets
See comment [here](https://github.com/pytorch/pytorch/pull/17323#pullrequestreview-206945747):
In contrast with a full recursive tree traversal of configuration dimensions,
> in the future I think we actually want to decrease our matrix somewhat and have only a few mostly-orthogonal builds that test as many different features as possible on PRs, plus a more complete suite on every PR and maybe an almost full suite nightly/weekly (we don't have this yet). Specifying PR jobs in the future might be easier to read with an explicit list when we come to this.
----------------
----------------
# How do the binaries / nightlies / releases work?
### What is a binary?
A binary or package (used interchangeably) is a pre-built collection of c++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A **binary configuration** is a collection of
* release or nightly
* releases are stable, nightlies are beta and built every night
* python version
* linux: 2.7m, 2.7mu, 3.5m, 3.6m, 3.7m (mu is wide unicode or something like that. It usually doesn't matter but you should know that it exists)
* macos and windows: 2.7, 3.5, 3.6, 3.7
* cpu version
* cpu, cuda 9.0, cuda 10.0
* The supported cuda versions occasionally change
* operating system
* Linux - these are all built on CentOS. There haven't been any problems in the past building on CentOS and using on Ubuntu
* MacOS
* Windows - these are built on Azure pipelines
* devtoolset version (gcc compiler version)
* This only matters on Linux because only Linux uses gcc. tl;dr is that gcc made a backwards-incompatible change from gcc 4.8 to gcc 5, because it had to change how it implemented std::vector and std::string
### Where are the binaries?
The binaries are built in CircleCI. There are nightly binaries built every night at 9pm PST (midnight EST) and release binaries corresponding to PyTorch releases, usually every few months.
We have 3 types of binary packages
* pip packages - nightlies are stored on s3 (pip install -f <as3url>). releases are stored in a pip repo (pip install torch) (ask Soumith about this)
* conda packages - nightlies and releases are both stored in a conda repo. Nightly packages have a '_nightly' suffix
* libtorch packages - these are zips of all the c++ libraries, header files, and sometimes dependencies. These are c++ only
* shared with dependencies
* static with dependencies
* shared without dependencies
* static without dependencies
All binaries are built in CircleCI workflows. There are checked-in workflows (committed into the .circleci/config.yml) to build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwrite the config.yml to build the release)
# CircleCI structure of the binaries
Some quick vocab:
* A **workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments; environment variables declared in one script DO NOT persist to following steps.*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
## How are the workflows structured?
The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build, test, and upload) per binary configuration
3. For each binary configuration, e.g. linux_conda_3.7_cpu, there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
2. Runs the smoke tests
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources . Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
* binary_run_in_docker.sh - Takes a bash script file (the actual test code) from a hardcoded location, spins up a docker image, and runs the script inside the docker image
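The variable-splitting that binary_populate_env.sh performs can be sketched as follows. This is an illustrative simplification, not the real script; the example BUILD_ENVIRONMENT value and the three variables mirror the local-build instructions later in this document.

```shell
# Simplified sketch (NOT the real binary_populate_env.sh): split a
# BUILD_ENVIRONMENT string into the variables a binary configuration needs.
BUILD_ENVIRONMENT="conda 3.6 cpu"   # example value
read -r PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA <<EOF
$BUILD_ENVIRONMENT
EOF
export PACKAGE_TYPE DESIRED_PYTHON DESIRED_CUDA
echo "$PACKAGE_TYPE $DESIRED_PYTHON $DESIRED_CUDA"
```

The real script also computes dates, version strings, and s3 folder locations, which this sketch omits.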
### **Why do the steps all refer to scripts?**
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
### **What is binary_run_in_docker for?**
So, CircleCI has several executor types: macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux vm. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the dockers are faster, we try to run everything that we can in dockers. Thus
* linux build jobs use the docker executor. Running them on the docker executor was at least 2x faster than running them on the machine executor
* linux test jobs use the machine executor and spin up their own docker. Why this nonsense? It's because we run nvidia-docker for our GPU tests; any code that calls into the CUDA runtime needs to be run on nvidia-docker. To run a nvidia-docker you need to install some nvidia packages on the host machine and then call docker with the `--runtime=nvidia` argument. CircleCI doesn't support this, so we have to do it ourselves.
* This is not just a mere inconvenience. **This blocks all of our linux tests from using more than 2 cores.** But there is nothing that we can do about it, but wait for a fix on circleci's side. Right now, we only run some smoke tests (some simple imports) on the binaries, but this also affects non-binary test jobs.
* linux upload jobs use the machine executor. The upload jobs are so short that it doesn't really matter what they use
* linux smoke test jobs use the machine executor for the same reason as the linux test jobs
binary_run_in_docker.sh is a way to share the docker start-up code between the binary test jobs and the binary smoke test jobs
### **Why does binary_checkout also checkout pytorch? Why shouldn't it?**
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later circleci changed that to use a single pytorch checkout and persist it through the workspace (they did this because our config file was too big, so they wanted to take a lot of the setup code into scripts, but the scripts needed the code repo to exist to be called, so they added a prereq step called 'setup' to checkout the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke from missing pytorch code no longer existing. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there's two checkouts of pytorch on the binary jobs. This problem still needs to be fixed, but it takes careful tracing of which code is being called where.
# Code structure of the binaries (circleci agnostic)
## Overview
The code that runs the binaries lives in two places, in the normal [github.com/pytorch/pytorch](http://github.com/pytorch/pytorch), but also in [github.com/pytorch/builder](http://github.com/pytorch/builder) , which is a repo that defines how all the binaries are built. The relevant code is
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/ # Folder that holds all circleci related stuff
- config.yml # GENERATED file that actually controls all circleci behavior
- verbatim-sources # Used to generate job/workflow sections in ^
- scripts/ # Code needed to prepare circleci environments for binary build scripts
- setup.py # Builds pytorch. This is wrapped in pytorch/builder
- cmake files # used in normal building of pytorch
# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
- build_pytorch.sh # Entrypoint. Delegates to proper conda build folder
- switch_cuda_version.sh # Switches the active CUDA installation in Docker
- pytorch-nightly/ # Build-folder
- manywheel/
- build_cpu.sh # Entrypoint for cpu builds
- build.sh # Entrypoint for CUDA builds
- build_common.sh # Actual build script that ^^ call into
- wheel/
- build_wheel.sh # Entrypoint for wheel builds
```
Every type of package has an entrypoint build script that handles all the important logic.
## Conda
Both Linux and MacOS use the same code flow for the conda builds.
Conda packages are built with conda-build, see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass `conda build` a build folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called in that env to build the thing.
tl;dr on conda-build:
1. Creates a brand new conda environment, based off of deps in the meta.yaml
1. Note that environment variables do not get passed into this build env unless they are specified in the meta.yaml
2. If the build fails this environment will stick around. You can activate it for much easier debugging. The “General Python” section below explains what exactly a python “environment” is.
2. Calls build.sh in the environment
3. Copies the finished package to a new conda env, also specified by the meta.yaml
4. Runs some simple import tests (if specified in the meta.yaml)
5. Saves the finished package as a tarball
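As a dry-run sketch of what the entrypoint ends up invoking (the command is only assembled and echoed here, never executed; the folder name comes from the tree above and the python version is illustrative):

```shell
# Assemble, but do not run, the conda-build command the entrypoint delegates to.
DESIRED_PYTHON=3.6                 # example value
BUILD_FOLDER=pytorch-nightly/      # the build folder from the repo tree above
cmd="conda build $BUILD_FOLDER --python $DESIRED_PYTHON"
echo "$cmd"
```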
The build.sh we use is essentially a wrapper around `python setup.py build`, but it also manually copies in some of our dependent libraries into the resulting tarball and messes with some rpaths.
The entrypoint file `builder/conda/build_conda.sh` is complicated because
* It works for both Linux and MacOS
* The mac builds used to create their own environments, since they all used to be on the same machine. There’s now a lot of extra logic to handle conda envs. This extra machinery could be removed
* It used to handle testing too, which adds more logic messing with python environments too. This extra machinery could be removed.
## Manywheels (linux pip and libtorch packages)
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
`builder/manywheel/build_cpu.sh` and `builder/manywheel/build.sh` (for CUDA builds) just set different env vars and then call into `builder/manywheel/build_common.sh`
The entrypoint file `builder/manywheel/build_common.sh` is really really complicated because
* This used to handle building for several different python versions at the same time. The loops have been removed, but there's still unnecessary folders and movements here and there.
* The script is never used this way anymore. This extra machinery could be removed.
* This used to handle testing the pip packages too. This is why there’s testing code at the end that messes with python installations and stuff
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* This should really be separate. libtorch packages are c++ only and have no python. They should not share infra with all the python specific stuff in this file.
* There is a lot of messing with rpaths. This is necessary, but could be made much much simpler if the above issues were fixed.
## Wheels (MacOS pip and libtorch packages)
The entrypoint file `builder/wheel/build_wheel.sh` is complicated because
* The mac builds used to all run on one machine (we didn’t have autoscaling mac machines till circleci). So this script handled siloing itself by setting-up and tearing-down its build env and siloing itself into its own build directory.
* The script is never used this way anymore. This extra machinery could be removed.
* This also builds libtorch packages
* Ditto the comment above. This should definitely be separated out.
Note that the MacOS Python wheels are still built in conda environments. Some of the dependencies present during build also come from conda.
## General notes
### Note on run_tests.sh, smoke_test.sh, and check_binary.sh
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn’t mess anything up.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
### Note on libtorch
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this
* It’s confusing. Most of those scripts deal with python specifics.
* The extra conditionals everywhere severely complicate the wheel build scripts
* The process for building libtorch is different from the official instructions (a plain call to cmake, or a call to a script)
### Note on docker images / Dockerfiles
All linux builds occur in docker images. The docker images are
* soumith/conda-cuda
* Has ALL CUDA versions installed. The script pytorch/builder/conda/switch_cuda_version.sh sets /usr/local/cuda to a symlink to e.g. /usr/local/cuda-10.0 to enable different CUDA builds
* Also used for cpu builds
* soumith/manylinux-cuda90
* soumith/manylinux-cuda92
* soumith/manylinux-cuda100
* Also used for cpu builds
The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be run locally (unless you have the correct local packages/paths). Only Soumith can build them right now.
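The /usr/local/cuda switch that switch_cuda_version.sh performs boils down to re-pointing one symlink. A safe sketch of the mechanism, using throwaway paths under /tmp instead of /usr/local:

```shell
# Simulate switching the "active" CUDA installation by re-pointing a symlink.
mkdir -p /tmp/fake_usr_local/cuda-9.0 /tmp/fake_usr_local/cuda-10.0
ln -sfn /tmp/fake_usr_local/cuda-9.0  /tmp/fake_usr_local/cuda
ln -sfn /tmp/fake_usr_local/cuda-10.0 /tmp/fake_usr_local/cuda  # switch builds
readlink /tmp/fake_usr_local/cuda   # now points at cuda-10.0
```

`ln -sfn` replaces the existing symlink atomically enough for this purpose, which is why a single image can serve all CUDA builds.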
### General Python
* This is still a good explanation of python installations https://caffe2.ai/docs/faq.html#why-do-i-get-import-errors-in-python-when-i-try-to-use-caffe2
# How to manually rebuild the binaries
tldr; make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
## How to test changes to the binaries via .circleci
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all .circleci behavior, and is generated using `.circleci/regenerate.sh` in python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits, one for your changes and one to temporarily hardcode jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci without having to re-write which binary jobs you want to test on. The downside is that all updates will be force pushes.
## How to build a binary locally
### Linux
You can build Linux binaries locally easily using docker.
```
# Run the docker
# Use the correct docker image, soumith/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
# machine that you're running the command on) accessible to the docker
# container at path/to/bar. So if you then run `touch path/to/bar/baz`
# in the docker container then you will see path/to/foo/baz on your local
# machine. You could also clone the pytorch and builder repos in the docker.
#
# If you're building a CUDA binary then use `nvidia-docker run` instead, see below.
#
# If you know how, add ccache as a volume too and speed up everything
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.6
export DESIRED_CUDA=cpu
# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
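The comments in the block above describe the `docker run` invocation without showing one. A dry-run sketch (the command is only assembled and echoed, never executed; the host path is an example):

```shell
# Assemble the docker run command described above; echo it instead of running it.
DOCKER_IMAGE=soumith/conda-cuda   # use nvidia-docker run for CUDA builds, see below
HOST_CLONES=$HOME/clones          # example host dir containing pytorch/ and builder/
cmd="docker run -it -v $HOST_CLONES/pytorch:/pytorch -v $HOST_CLONES/builder:/builder $DOCKER_IMAGE bash"
echo "$cmd"
```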
**Building CUDA binaries on docker**
To build a CUDA binary you need to use `nvidia-docker run` instead of just `docker run` (or you can manually pass `--runtime=nvidia`). This adds some needed libraries and things to build CUDA stuff.
You can build CUDA binaries on CPU only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary on a docker on your laptop if you so choose (though it’s gonna take a loong time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
### MacOS
There’s no easy way to generate reproducible hermetic MacOS environments. If you have a Mac laptop then you can try emulating the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will probably interfere with the build. If you’re trying to repro an error on a Mac build in .circleci and you can’t seem to repro locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I’d recommend
```
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do
# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=2.7
conda activate binary
# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.6
export DESIRED_CUDA=cpu
# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the “General Python” section above, but tldr; is that
1. You make the ‘conda’ command accessible by prepending `path/to/conda_root/bin` to your PATH.
2. You make a new env and activate it, which then also gets prepended to your PATH. Now you have `path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH`
3. Now say you (or some code that you ran) call python executable `foo`
1. if you installed `foo` in `new_env`, then `path/to/conda_root/envs/new_env/bin/foo` will get called, as expected.
2. But if you forgot to install `foo` in `new_env` but happened to previously install it in your root conda env (called ‘base’), then unix/linux will still find `path/to/conda_root/bin/foo` . This is dangerous, since `foo` can be a different version than you want; `foo` can even be for an incompatible python version!
Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
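The shadowing described in step 3 is easy to reproduce without conda at all; this sketch fakes the two bin directories with plain shell scripts:

```shell
# Fake a conda root and an env, each (initially) providing its own `foo`.
mkdir -p /tmp/conda_root/bin /tmp/conda_root/envs/new_env/bin
printf '#!/bin/sh\necho root-foo\n' > /tmp/conda_root/bin/foo
printf '#!/bin/sh\necho env-foo\n'  > /tmp/conda_root/envs/new_env/bin/foo
chmod +x /tmp/conda_root/bin/foo /tmp/conda_root/envs/new_env/bin/foo

PATH="/tmp/conda_root/bin:$PATH"               # step 1: root's bin on PATH
PATH="/tmp/conda_root/envs/new_env/bin:$PATH"  # step 2: env's bin prepended
foo   # the env's foo wins: prints env-foo

# The dangerous case: foo is missing from the env, so lookup silently
# falls through to the root install.
rm /tmp/conda_root/envs/new_env/bin/foo
hash -r   # clear the shell's command-location cache
foo       # prints root-foo
```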
z-value >= 3, there is a high chance of perf regression.
To reproduce this regression, run
`cd .jenkins/pytorch/perf_test/ && bash {}.sh` on your local machine
and compare the runtime before/after your code change.
'''.format(test_name))
else:
    print("z-value < 3, no perf regression detected.")
if ! python perf-tests/modules/test_cpu_torch.py ${ARGS}; then
    echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
if ! python perf-tests/modules/test_cpu_torch_tensor.py ${ARGS}; then
    echo "To reproduce this regression, run \`cd .jenkins/pytorch/perf_test/ && bash ${FUNCNAME[0]}.sh\` on your local machine and compare the runtime before/after your code change."
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
) else (
7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\caffe2 && python %SCRIPT_HELPERS_DIR%\upload_image.py %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z