Summary:
[Here](https://docs.gradle.org/current/userguide/gradle_wrapper.html), the Gradle documentation gives the following guidance:
`The recommended way to execute any Gradle build is with the help of the Gradle Wrapper`
I spent a little time preparing Gradle for the `pytorch_android` build (version, etc.).
I think using the Gradle wrapper will make the `pytorch_android` build more seamless.
Gradle wrapper version: 4.10.3
250c71121b/.circleci/scripts/build_android_gradle.sh (L13)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51067
Reviewed By: izdeby
Differential Revision: D26315718
Pulled By: IvanKobzarev
fbshipit-source-id: f8077d7b28dc0b03ee48bcdac2f5e47d9c1f04d9
Summary:
This PR adds a local [`mypy` plugin](https://mypy.readthedocs.io/en/stable/extending_mypy.html#extending-mypy-using-plugins) that warns if you accidentally run `mypy` using a version that doesn't match [the version we install for CI](6045663f39/.circleci/docker/common/install_conda.sh (L117)). A mismatch sometimes trips people up, since `mypy` gives errors in some versions (see https://github.com/pytorch/pytorch/issues/51513) but not others.
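As a rough illustration of how such a plugin can work (a minimal sketch, not the PR's exact code; the pinned version and message text are taken from the test plan below):
```python
# mypy_plugin.py -- minimal sketch of a version-checking mypy plugin
import sys
from mypy.plugin import Plugin

PINNED_VERSION = "0.770"  # assumed pin, matching the CI install script

def plugin(version: str) -> type:
    # mypy calls this hook with its own version string on startup
    if version != PINNED_VERSION:
        print(
            f"You are using mypy version {version}, which is not supported\n"
            f"in the PyTorch repo. Please switch to mypy version {PINNED_VERSION}.",
            file=sys.stderr,
        )
    return Plugin  # otherwise act as a no-op plugin
```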
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51799
Test Plan:
To check that this doesn't break our `mypy` test(s) when you have the correct version installed:
```
python test/test_type_hints.py
```
To check that this does indeed warn when you have an incorrect `mypy` version installed, switch to a different version (e.g. 0.782), and run the above command or either of these:
```
mypy
mypy --config-file=mypy-strict.ini
```
You should get the following message on stderr:
```
You are using mypy version 0.782, which is not supported
in the PyTorch repo. Please switch to mypy version 0.770.
For example, if you installed mypy via pip, run this:
pip install mypy==0.770
Or if you installed mypy via conda, run this:
conda install -c conda-forge mypy=0.770
```
Reviewed By: janeyx99
Differential Revision: D26282010
Pulled By: samestep
fbshipit-source-id: 7b423020d0529700dea8972b27afa2d7068e1b12
Summary:
This is a followup to https://github.com/pytorch/pytorch/issues/49190. Vaguely speaking, the goals are to make it easy to identify test time regressions introduced by PRs. Eventually the hope is to use this information to edit Dr CI comments, but this particular PR just does the analysis and prints it to stdout, so a followup PR would be needed to edit the actual comments on GitHub.
**Important:** for uninteresting reasons, this PR moves the `print_test_stats.py` file.
- *Before:* `test/print_test_stats.py`
- *After:* `torch/testing/_internal/print_test_stats.py`
Notes on the approach:
- Just getting the mean and stdev for the total job time of the last _N_ commits isn't sufficient, because e.g. if `master` was broken 5 commits ago, then a lot of those job times will be much shorter, breaking the statistics.
- We use the commit history to make better estimates for the mean and stdev of individual test (and suite) times, but only when the test in that historical commit is present and its status matches that of the base commit (see the sketch after this list).
- We list all the tests that were removed or added, or whose status changed (e.g. skipped to not skipped, or vice versa), along with time (estimate) info for that test case and its containing suite.
- We don't list tests whose time changed a lot if their status didn't change, because there's a lot of noise and it's unclear how to do that well without too many false positives.
- We show a human-readable commit graph that indicates exactly how many commits are in the pool of commits that could be causing regressions (e.g. if a PR has multiple commits in it, or if the base commit on `master` doesn't have a report in S3).
- We don't show an overall estimate of whether the PR increased or decreased the total test job time, because it's noisy and it's a bit tricky to aggregate stdevs up from individual tests to the whole job level. This might change in a followup PR.
- Instead, we simply show a summary at the bottom which says how many tests were removed/added/modified (where "modified" means that the status changed), and our best estimates of the mean times (and stdevs) of those changes.
- Importantly, the summary at the bottom is only for the test cases that were already shown in the more verbose diff report, and does not include any information about tests whose status didn't change but whose running time got much longer.
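A minimal sketch of the estimation idea (a hypothetical helper, not the PR's actual code): pool a test's historical times, keeping only commits where the test exists and its status matches the base commit.
```python
from statistics import mean, stdev
from typing import Optional, Tuple

def estimate(test: str, base_status: str,
             history: list) -> Optional[Tuple[float, float]]:
    """history: per-commit dicts mapping test name -> (status, seconds)."""
    times = [
        commit[test][1]
        for commit in history
        if test in commit and commit[test][0] == base_status
    ]
    if len(times) < 2:
        return None  # too few matching samples to estimate a stdev
    return mean(times), stdev(times)
```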
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50171
Test Plan:
To run the unit tests:
```
$ python test/test_testing.py
$ python test/print_test_stats.py
```
To verify that this works, check the [CircleCI logs](https://app.circleci.com/pipelines/github/pytorch/pytorch/258628/workflows/9cfadc34-e042-485e-b3b3-dc251f160307) for a test job run on this PR; for example:
- pytorch_linux_bionic_py3_6_clang9_test
To test locally, use the following steps.
First run an arbitrary test suite (you need to have some XML reports so that `test/print_test_stats.py` runs, but we'll be ignoring them here via the `--use-json` CLI option):
```
$ DATA_DIR=/tmp
$ ARBITRARY_TEST=testing
$ python test/test_$ARBITRARY_TEST.py --save-xml=$DATA_DIR/test/test_$ARBITRARY_TEST
```
Now choose a commit and a test job (it has to be on `master` since we're going to grab the test time data from S3, and [we only upload test times to S3 on the `master`, `nightly`, and `release` branches](https://github.com/pytorch/pytorch/pull/49645)):
```
$ export CIRCLE_SHA1=c39fb9771d89632c5c3a163d3c00af3bef1bd489
$ export CIRCLE_JOB=pytorch_linux_bionic_py3_6_clang9_test
```
Download the `*.json.bz2` file(s) for that commit/job pair:
```
$ aws s3 cp s3://ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/ $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB --recursive
```
And feed everything into `torch/testing/_internal/print_test_stats.py`:
```
$ bzip2 -kdc $DATA_DIR/ossci-metrics/test_time/$CIRCLE_SHA1/$CIRCLE_JOB/*Z.json.bz2 | torch/testing/_internal/print_test_stats.py --compare-with-s3 --use-json=/dev/stdin $DATA_DIR/test/test_$ARBITRARY_TEST
```
The first part of the output should be the same as before this PR; here is the new part, at the end of the output:
- https://pastebin.com/Jj1svhAn
Reviewed By: malfet, izdeby
Differential Revision: D26317769
Pulled By: samestep
fbshipit-source-id: 1ba06cec0fafac77f9e7341d57079543052d73db
Summary:
Currently the PyTorch repository provides a Dockerfile to build Docker images with nightly builds, but it doesn't have CI to actually build those images.
This PR adds a GitHub Actions workflow to build the PyTorch nightly Docker images and publish them to the GitHub Container Registry.
Also, add the `--always` option to the `git describe --tags` command that generates the Docker image tag.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51755
Test Plan: Manually trigger the workflow build in the GitHub Actions web UI.
Reviewed By: seemethere
Differential Revision: D26320180
Pulled By: xuzhao9
fbshipit-source-id: e00b472df14f5913cab9b06a41e837014e87f1c7
Summary:
Fixes https://github.com/pytorch/pytorch/issues/39502
This PR adds support for exporting **fake_quantize_per_channel_affine** to a pair of QuantizeLinear and DequantizeLinear ops. Per-tensor support was added by PR https://github.com/pytorch/pytorch/pull/39738.
The `axis` attribute of QuantizeLinear and DequantizeLinear, which is required for per-channel support, was added in opset 13 by https://github.com/onnx/onnx/pull/2772.
[update 1/20/2021]: opset 13 is now supported on master, so the added function is properly tested. The code has also been rebased onto the new master.
The function was also tested offline with the following code:
```python
import torch
from torch import quantization
from torchvision import models
qat_resnet18 = models.resnet18(pretrained=True).eval().cuda()
qat_resnet18.qconfig = quantization.QConfig(
    activation=quantization.default_fake_quant, weight=quantization.default_per_channel_weight_fake_quant)
quantization.prepare_qat(qat_resnet18, inplace=True)
qat_resnet18.apply(quantization.enable_observer)
qat_resnet18.apply(quantization.enable_fake_quant)
dummy_input = torch.randn(16, 3, 224, 224).cuda()
_ = qat_resnet18(dummy_input)
for module in qat_resnet18.modules():
    if isinstance(module, quantization.FakeQuantize):
        module.calculate_qparams()
qat_resnet18.apply(quantization.disable_observer)
qat_resnet18.cuda()
input_names = [ "actual_input_1" ]
output_names = [ "output1" ]
torch.onnx.export(qat_resnet18, dummy_input, "quant_model.onnx", verbose=True, opset_version=13)
```
It can generate the desired graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42835
Reviewed By: houseroad
Differential Revision: D26293823
Pulled By: SplitInfinity
fbshipit-source-id: 300498a2e24b7731b12fa2fbdea4e73dde80e7ea
Summary:
For inputs with unsupported dtypes, we should not run the check inside a parallel region; this PR first does the dtype check and only then enters the parallel for.
Fixes https://github.com/pytorch/pytorch/issues/51352.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51443
Reviewed By: izdeby
Differential Revision: D26305584
Pulled By: ngimel
fbshipit-source-id: 6faa3148af5bdcd7246771c0ecb4db2b31ac82c6
Summary:
Previously, TorchScript allowed an ignore-all type-check suppression comment that looks like
```
code code code # type: ignore
```
But a more common use case is
```
code code code # type: ignore[specific-rule]
```
This PR allows the more common use case as well.
Fixes https://github.com/pytorch/pytorch/issues/48643
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51675
Reviewed By: ansley
Differential Revision: D26304870
Pulled By: gmagogsfm
fbshipit-source-id: 0ac9ee34f0219c86e428318a69484d5aa3ec433f
Summary:
With zasdfgbnm's help and his small TensorIterator kernel repro https://github.com/zasdfgbnm/tensoriterator, we've found a workaround for what looks like a compiler bug in multi_output_kernel that manifests with CUDA 10.2 and CUDA 11 when there is a non-trivial OffsetCalculator.
It looks like those nvcc versions cannot handle inheritance in device structs, so instead of inheriting `multi_outputs_unroll` from `unroll`, we make it independent.
cc vkuzo, haichuan-fb: I verified that reverting https://github.com/pytorch/pytorch/issues/49315 to bring back multi_output_kernel makes the `test_learnable_backward_per_channel_cuda` test pass, but I didn't do it in this PR; can you take it up as a follow-up?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51827
Reviewed By: izdeby
Differential Revision: D26305559
Pulled By: ngimel
fbshipit-source-id: 1168e7c894d237a954abfd1998eaad54f0ce40a7
Summary:
The overloads are a little tricky here. It's important that the overloads make it unambiguous what
`torch.nonzero(x)` resolves to, so we just specify defaults for one of the overloads. Also, `out` is left out of the second overload
because a non-None value for `out` is not valid in combination with `as_tuple=True`.
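For illustration, the two call forms that the overloads must disambiguate:
```python
import torch

x = torch.tensor([[1, 0], [0, 2]])
torch.nonzero(x)                 # Tensor overload: one (num_nonzero, ndim) index tensor
torch.nonzero(x, as_tuple=True)  # tuple overload: one 1-D index tensor per dim; `out=` not accepted
```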
Closes gh-51434
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51635
Reviewed By: zhangguanheng66
Differential Revision: D26279203
Pulled By: walterddr
fbshipit-source-id: 8459c04fc9fbf7fc5f31b3f631aaac2f98b17ea6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51589
Dropout operators are only needed in training. Remove them for frozen models.
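A minimal sketch of the intended effect (illustrative, assuming the removal runs when a module is frozen via `torch.jit.freeze`):
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, x):
        return self.dropout(self.linear(x))

m = torch.jit.script(M())
m.eval()  # freezing requires eval mode; dropout is a no-op here anyway
frozen = torch.jit.freeze(m)
print(frozen.graph)  # with this pass, aten::dropout should no longer appear
```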
Test Plan: Imported from OSS
Reviewed By: eellison
Differential Revision: D26214259
fbshipit-source-id: 3ab05869e1e1f6c57498ba62bf40944f7c2189aa
Summary:
Toward fixing https://github.com/pytorch/pytorch/issues/47624
~~Step 1: add `TORCH_WARN_MAYBE` which can either warn once or every time in C++, and add a C++ function to toggle the value.
Step 2 will be to expose this to Python for tests. Should I continue in this PR, or should we take a different approach: add the Python-level exposure without changing any C++ code and then, over a series of PRs, change each call site to use the new macro and change the tests to make sure it is being checked?~~
Step 1: add a Python and C++ toggle to convert TORCH_WARN_ONCE into TORCH_WARN so the warnings can be caught in tests
Step 2: add a Python-level decorator to use this toggle in tests
Step 3 (in future PRs): use the decorator to catch the warnings instead of `maybeWarnsRegex`
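A minimal sketch of how the toggle is meant to be used in a test (assuming it is exposed as `torch.set_warn_always`; `op_that_warns_once` is a hypothetical stand-in):
```python
import warnings
import torch

def op_that_warns_once() -> None:
    # hypothetical stand-in for any op that emits TORCH_WARN_ONCE;
    # a real test would call the actual operator under test
    warnings.warn("expected warning text", UserWarning)

torch.set_warn_always(True)  # assumed toggle name; TORCH_WARN_ONCE now warns every time
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    op_that_warns_once()
assert any("expected warning text" in str(w.message) for w in caught)
```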
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48560
Reviewed By: ngimel
Differential Revision: D26171175
Pulled By: mruberry
fbshipit-source-id: d83c18f131d282474a24c50f70a6eee82687158f
Summary:
Fixes https://github.com/pytorch/pytorch/issues/49683
This PR fixes the backward-through-`sparse_coo_tensor` bug by implementing a `sparse_mask_helper` function for n-dimensional sparse tensors on CPU and CUDA, which is used to reimplement the `sparse_constructor_values_backward` function.
A `sparse_mask` function was implemented before for the backward of sparse-sparse matmul. However, the algorithm here is a little different because it must be applicable not only to matrices but to n-dimensional tensors. Thankfully it was not too hard to extend, and now both share the same code base.
Note that no new tests are required, because the backward for sparse-sparse matmul now uses the new `sparse_mask_helper`.
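For context, the `sparse_mask` semantics being generalized (an illustrative snippet, not from the PR):
```python
import torch

dense = torch.arange(9, dtype=torch.float).reshape(3, 3)
mask = torch.sparse_coo_tensor([[0, 2], [1, 2]], torch.zeros(2), (3, 3)).coalesce()
masked = dense.sparse_mask(mask)  # keeps dense's values only at mask's nonzero positions
print(masked.to_dense())
```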
ngimel, mruberry - kindly review this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50361
Reviewed By: zhangguanheng66
Differential Revision: D26270483
Pulled By: ngimel
fbshipit-source-id: ee4bda49ff86e769342674b64d3c4bc34eae38ef
Summary: As titled.
Test Plan: successful test flow with A* setup: f245569242
Reviewed By: anurag16
Differential Revision: D25966283
fbshipit-source-id: ef9945d5039933df44c2c3c26ca149f47538ff31
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51757
Enables backend preprocessing to take place outside of the backend interface.
What's new:
* A new definition for backend preprocessing (i.e. BackendPreprocessFunction).
* Registration of the backend's PyTorchBackendInterface interface implementation is augmented to take the BackendPreprocessFunction.
* A new registry is created to handle the BackendPreprocessFunction functions, using the backend's name as key.
* When a BackendPreprocessFunction is used, the PyTorchBackendInterface's "preprocess" method is not added to the LoweredModule. Instead, the BackendPreprocessFunction is called and its output is used to set the LoweredModule's `__processed_module`.
Why?:
These changes are needed to avoid forcing backend preprocessing to be part of the LoweredModule, and in the future be able to eliminate "preprocess" from the PyTorchBackendInterface.
This is important for Mobile use cases where "preprocess" can take the bulk of the compilation process, and thus contain code dependencies that we do not want to bring (or cannot bring) to the Mobile binary.
What didn't change:
* Everything is backwards compatible:
  * The existing "preprocess" method in PyTorchBackendInterface is still there.
  * When backend registration is done without the BackendPreprocessFunction, as before, things work the same way: "preprocess" is added to LoweredModule and invoked through the module's instance of the backend interface.
Longer term, the plan is to refactor existing users to move to the new backend registration.
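A rough Python sketch of the registry idea (the PR implements this in C++; all names below are hypothetical illustrations):
```python
from typing import Any, Callable, Dict

# Preprocess functions live in their own registry, keyed by backend name,
# so lowering no longer needs "preprocess" on the backend interface itself.
BackendPreprocessFunction = Callable[[Any, Dict[str, Any]], Any]
_preprocess_registry: Dict[str, BackendPreprocessFunction] = {}

def register_backend_preprocess(name: str, fn: BackendPreprocessFunction) -> None:
    _preprocess_registry[name] = fn

def lower_module(backend_name: str, module: Any, compile_spec: Dict[str, Any]) -> Any:
    # With a registered function, preprocessing happens here, outside the
    # LoweredModule; its result becomes the module's __processed_module.
    if backend_name in _preprocess_registry:
        return _preprocess_registry[backend_name](module, compile_spec)
    # Otherwise fall back to the old path: the interface's own "preprocess".
    raise NotImplementedError("fallback to PyTorchBackendInterface.preprocess")
```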
ghstack-source-id: 121190883
Test Plan:
Updated existing tests (test_backend.py) to use the new registration mechanism.
Verified test ran and passed (in my OSS build).
Reviewed By: iseeyuan
Differential Revision: D26261042
fbshipit-source-id: 0dc378acd5f2ab60fcdc01f7373616d1db961e61
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51119
Adds an asm kernel for 8x1 block sparsity. Since the ukernel still
produces 4x8 blocks, similar to the 1x4 sparsity pattern, we can use the
same prepacking kernel for the activation. It gets a tiny bit hacky but
allows us to reuse the kernel.
Test Plan:
q8gemm-sparse-test
fully-connected-sparse-test
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D26077765
fbshipit-source-id: cc087b0ff717a613906d442ea73680e785e0ecc2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51118
Modify BCSR to pack a generic block sparsity pattern, and modify the
rest of the code to accommodate the change.
This is in preparation for supporting 8x1 sparsity.
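A rough Python sketch of the BCSR packing idea (illustrative only; the actual implementation is QNNPACK C/C++):
```python
import torch

def pack_bcsr(mat: torch.Tensor, bh: int, bw: int):
    """Pack a dense 2-D matrix into BCSR with a generic (bh x bw) block
    pattern, e.g. bh=8, bw=1 for the 8x1 sparsity this change prepares for."""
    rows, cols = mat.shape
    row_ptr, col_idx, values = [0], [], []
    for br in range(rows // bh):
        for bc in range(cols // bw):
            block = mat[br * bh:(br + 1) * bh, bc * bw:(bc + 1) * bw]
            if torch.any(block != 0):   # keep only nonzero blocks
                col_idx.append(bc)
                values.append(block.clone())
        row_ptr.append(len(col_idx))    # one entry per block row
    return row_ptr, col_idx, values
```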
Test Plan:
q8gemm-sparse-test
Imported from OSS
Reviewed By: AshkanAliabadi
Differential Revision: D26077767
fbshipit-source-id: 7179975b07a1cb76ef26896701d782fb04638743
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51748
Adding docs for `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine`
functions.
Note: not documenting `fake_quantize_per_tensor_affine_cachemask` and
`fake_quantize_per_channel_affine_cachemask` since they are implementation details
of `fake_quantize_per_tensor_affine` and `fake_quantize_per_channel_affine`,
and do not need to be exposed to the user at the moment.
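For reference, a quick illustration of the per-tensor function being documented (the scale and zero-point values here are arbitrary):
```python
import torch

x = torch.randn(2, 3)
# quantize to the uint8 range and immediately dequantize, simulating quantization error
y = torch.fake_quantize_per_tensor_affine(x, scale=0.1, zero_point=128, quant_min=0, quant_max=255)
# the per-channel variant instead takes per-axis scale/zero_point tensors plus an `axis` argument
```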
Test Plan: Built the docs locally on macOS; they look good.
Reviewed By: supriyar
Differential Revision: D26270514
Pulled By: vkuzo
fbshipit-source-id: 8e3c9815a12a3427572cb4d34a779e9f5e4facdd
Summary:
Replacing CUDA 11.0 with 11.2 in our nightlies.
(I am slightly uncertain why the manywheel Linux tests worked before we added the GPU driver for 11.2.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51611
Reviewed By: malfet, seemethere, zhangguanheng66
Differential Revision: D26282829
Pulled By: janeyx99
fbshipit-source-id: b15380e5c44a957e6a85e4f5fb9691ab9c6103a5
Summary:
The new profiler API was added in PR #48280. This PR adds FLOPS support to the new profiler API.
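An illustrative use of the flag in the new API (assuming it is exposed as `with_flops` on `torch.profiler.profile`):
```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)
with profile(activities=[ProfilerActivity.CPU], with_flops=True) as prof:
    model(x)
# the table gains a FLOPs column for operators with known formulas (e.g. matmul)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```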
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51734
Test Plan:
```
python test/test_profiler.py -k test_flops
```
Reviewed By: xuzhao9
Differential Revision: D26261851
Pulled By: ilia-cher
fbshipit-source-id: dbeba4c197e6f51a9a8e640e8bb60ec38df87f73
Summary: Moving caffe2_core_gpu_python contbuild to use GPU/RE
Test Plan: CI
Reviewed By: malfet
Differential Revision: D26261826
fbshipit-source-id: a6f8c7bd8368c1cb69499ea0ea7d5add0956a7ad
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50924
`clamp_min` seems slightly faster than `threshold` (on AVX2 CPUs)
because it compiles down to vmaxps rather than vcmpps+vblendv.
I see the biggest perf difference (about 20% faster) with float
tensors at 32k-64k elements. Bigger tensors are more memory bound,
although it looks like there might still be a tiny win (2%).
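For reference, the two formulations compute the same result for ReLU; a quick sanity check (illustrative, not from the PR):
```python
import torch
import torch.nn.functional as F

x = torch.randn(64 * 1024)
# relu can be computed either way; this change prefers the clamp_min form
assert torch.equal(torch.clamp_min(x, 0.0), F.threshold(x, 0.0, 0.0))
```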
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D26009829
Pulled By: bertmaher
fbshipit-source-id: 7bb1583ffb3ee242e347f59be82e0712c7631f7e