Compare commits


946 Commits

Author SHA1 Message Date
7cedac341d Merge branch 'master' into issue#58739 2021-06-17 15:55:42 +02:00
7781b9f16f add blank lines for readability 2021-06-17 15:55:35 +02:00
2119e032fb revert submodule update 2021-06-17 15:54:36 +02:00
8a2bcf9ebf revert submodule update 2021-06-17 16:40:38 +05:30
1ee019fa52 reverted changes 2021-06-17 16:13:25 +05:30
0864aaaeb7 add support for constant 2021-06-17 14:00:51 +05:30
eb36f67dcc [TensorExpr] Minor cleanup in TensorExprKernel::computeValue (#60041)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60041

Differential Revision: D29146709

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 49ac919c18f669d7fda1a26c5a74e62ea752df4f
2021-06-17 01:23:24 -07:00
6b1712019a Revert D29132955: Pass RequestCallback to FaultyPG RPC agent
Test Plan: revert-hammer

Differential Revision:
D29132955 (cbbb7e145e)

Original commit changeset: bb7554b84bcb

fbshipit-source-id: 4dfa2fbe7b8f58c951991c79aa9e2aa819793013
2021-06-17 00:50:32 -07:00
3c3bb91103 Revert D29132956: Add some TORCH_API annotations to RPC
Test Plan: revert-hammer

Differential Revision:
D29132956 (04ec122868)

Original commit changeset: 8637640d56a1

fbshipit-source-id: f497adcbfd5a6b5a46b8689b1943ae2687ea737b
2021-06-17 00:50:30 -07:00
f233274f30 Revert D28875276: Move RPC agents to libtorch
Test Plan: revert-hammer

Differential Revision:
D28875276 (fc50f91929)

Original commit changeset: f2f6970fd74d

fbshipit-source-id: 3c52af652579733ebea8ddfb06576a0ce262bf78
2021-06-17 00:48:58 -07:00
e5c99d9908 Revert D29147009: [pytorch][PR] refine disabled test
Test Plan: revert-hammer

Differential Revision:
D29147009 (5fd6ead097)

Original commit changeset: 37e01ac6e8d6

fbshipit-source-id: e9cd819fd819e3d653deda3b7a981c39ec0452f4
2021-06-17 00:45:21 -07:00
a0ad4c24d1 MAINT Migrates rrelu_with_noise from THC to ATen on Cuda (#57864)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/24618
Related to https://github.com/pytorch/pytorch/issues/24507

<details><summary>Benchmark script:</summary>

```py
import torch
import torch.nn as nn
import time

torch.manual_seed(0)
def _time():
    torch.cuda.synchronize()
    return time.time()

device = "cuda"
m = nn.RReLU().cuda()

for n in [100, 10_000, 100_000]:
    fwd_t = 0
    bwd_t = 0
    input = torch.randn(128, n, device=device)
    grad_output = torch.ones(128, n, device=device)
    for i in range(10000):
        t1 = _time()
        output = m(input)
        t2 = _time()
        fwd_t = fwd_t + (t2 - t1)
    fwd_avg = fwd_t / 10000 * 1000
    print(f"input size(128, {n}) forward time is {fwd_avg:.2f} (ms)")
```

</details>

### Results from benchmark:

#### This PR

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.06 (ms)
input size(128, 100000) forward time is 0.54 (ms)
```

#### On master

```
input size(128, 100) forward time is 0.01 (ms)
input size(128, 10000) forward time is 0.08 (ms)
input size(128, 100000) forward time is 0.66 (ms)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57864

Reviewed By: H-Huang

Differential Revision: D29177169

Pulled By: ngimel

fbshipit-source-id: 4572133db06f143d27e70a91ade977ea962c8f77
2021-06-17 00:35:16 -07:00
9e79a8a54f [iOS GPU][MaskRCNN] Force the temporaryImage to become static when doing synchronization (#60155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60155

For intermediate tensors, we need to convert them to static images when doing GPU -> CPU synchronization.
ghstack-source-id: 131540760

Test Plan:
- CI
- buck test pp-macos

Reviewed By: SS-JIA

Differential Revision: D29126278

fbshipit-source-id: cd50b5f104e0161ec7fcfcc2c51785f241e48704
2021-06-17 00:25:14 -07:00
0e7b5ea6c0 nonzero: Default to transposed output strides (#59370)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46224

cc ailzhang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59370

Reviewed By: ezyang

Differential Revision: D29143842

Pulled By: ngimel

fbshipit-source-id: 5aa7a247b4a70cd816d0eed368ab4c445568c986
2021-06-16 22:50:38 -07:00
c0b7c59e55 [quant] Equalization Observer modifications (#59953)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59953

The following modifications were made to the equalization
observers due to design changes:
- [InputEqualizationObserver] Replaced `calculate_qparams()` with
`calculate_scaled_minmax()` since we will need to return the scaled
min/max values to update the following input quantization observer
- [WeightEqualizationObserver] We no longer need a row observer since
this will be taken care of by the following weight quantization observer
- [WeightEqualizationObserver] Following the previous comment, we no
longer need to calculate the scaled qparam values. Instead, we will use
the equalization scale to later scale the weights and the qparams will
be taken care of by the weight quantization observer.

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_eq_observer`

Imported from OSS

Reviewed By: supriyar

Differential Revision: D29135332

fbshipit-source-id: be7e468273c8b62fc183b1e1ec50f6bd6d8cf831
2021-06-16 22:32:30 -07:00
45c31cabb5 [quant] Input Weight Equalization - prepare modifications (#59747)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59747

Modifies prepare_fx for input-weight equalization. If a current
node is being equalized (there exists a EqualizationQConfig), then the
EqualizationObserver will be inserted before its quantization observer.

For a singular linear layer, the general flow looks like:
Original graph: `x0 -> linear -> x1`, `w -> linear`
After prepare: `x0 -> InpEqObs -> MinMaxObs -> linear1 -> MinMaxObs -> x1`
  `w -> WeightEqObs -> MinMaxObs -> linear1`

For two connected linear layers, the general flow looks like:
Original graph: `x0 -> linear1 -> linear2 -> x1`,
  `w1 -> linear1`, `w2 -> linear2`
After prepare: `x0 -> InpEqObs -> MinMaxObs -> linear1 -> MinMaxObs -> InpEqObs -> linear2 -> MinMaxObs -> x1`
  `w1 -> WeightEqObs -> MinMaxObs -> linear1`, `w2 -> WeightEqObs -> MinMaxObs -> linear2`

Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_equalization_prepare`

Original model with one `nn.Linear` layer
```
LinearModule(
  (linear): Linear(in_features=1, out_features=1, bias=True)
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear : [#users=1] = call_module[target=linear](args = (%x_activation_post_process_0,), kwargs = {})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    return linear_activation_post_process_0
```
--------------------------------------

Original model with two connected functional linear layers
```
FunctionalLinearModule(
  (linear1): Linear()
  (linear2): Linear()
)
```

Graph after `prepare_fx`:
```
graph():
    %x : [#users=1] = placeholder[target=x]
    %x_equalization_process_0 : [#users=1] = call_module[target=x_equalization_process_0](args = (%x,), kwargs = {})
    %x_activation_post_process_0 : [#users=1] = call_module[target=x_activation_post_process_00](args = (%x_equalization_process_0,), kwargs = {})
    %linear1_w : [#users=1] = get_attr[target=linear1.w]
    %linear1_w_equalization_process_0 : [#users=1] = call_module[target=linear1_w_equalization_process_0](args = (%linear1_w,), kwargs = {})
    %linear1_w_activation_post_process_0 : [#users=1] = call_module[target=linear1_w_activation_post_process_00](args = (%linear1_w_equalization_process_0,), kwargs = {})
    %linear1_b : [#users=1] = get_attr[target=linear1.b]
    %linear : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%x_activation_post_process_0, %linear1_w_activation_post_process_0), kwargs = {bias: %linear1_b})
    %linear_activation_post_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0](args = (%linear,), kwargs = {})
    %linear_activation_post_process_0_equalization_process_0 : [#users=1] = call_module[target=linear_activation_post_process_0_equalization_process_0](args = (%linear_activation_post_process_0,), kwargs = {})
    %linear2_w : [#users=1] = get_attr[target=linear2.w]
    %linear2_w_equalization_process_0 : [#users=1] = call_module[target=linear2_w_equalization_process_0](args = (%linear2_w,), kwargs = {})
    %linear2_w_activation_post_process_0 : [#users=1] = call_module[target=linear2_w_activation_post_process_00](args = (%linear2_w_equalization_process_0,), kwargs = {})
    %linear2_b : [#users=1] = get_attr[target=linear2.b]
    %linear_1 : [#users=1] = call_function[target=torch.nn.functional.linear](args = (%linear_activation_post_process_0_equalization_process_0, %linear2_w_activation_post_process_0), kwargs = {bias: %linear2_b})
    %linear_1_activation_post_process_0 : [#users=1] = call_module[target=linear_1_activation_post_process_0](args = (%linear_1,), kwargs = {})
    return linear_1_activation_post_process_0
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29135316

fbshipit-source-id: 91697e805ede254dbb2a42ee4c23eb1c1c64590e
2021-06-16 22:32:28 -07:00
7ce74f3339 [quant] EqualizationQConfig to distinguish input/output activations (#59739)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59739

Created an EqualizationQConfig specifically for equalization.
This inherits from QConfig and is used to distinguish between inserting
an input observer and an output observer. Since the output observer
field is included in the EqualizationQConfig, we no longer need an
output observer field in the _InputEqualizationObserver.
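
As a rough sketch of that shape (field and helper names below are hypothetical, not the actual API), the config can mirror QConfig's namedtuple-of-observer-factories pattern:

```python
from collections import namedtuple
from functools import partial

from torch.quantization import MinMaxObserver

# Illustrative only: like QConfig, the config stores observer factories --
# one for the input activation and one for the weight -- so the input
# equalization observer no longer needs its own output-observer field.
EqualizationQConfigSketch = namedtuple(
    "EqualizationQConfigSketch", ["input_activation", "weight"]
)

eq_qconfig = EqualizationQConfigSketch(
    input_activation=partial(MinMaxObserver),
    weight=partial(MinMaxObserver),
)
input_obs = eq_qconfig.input_activation()  # instantiated when inserted into the graph
```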

Test Plan:
compiles

Imported from OSS

Reviewed By: ezyang

Differential Revision: D29135298

fbshipit-source-id: 3dde9c029c291467ff0a0845f0fc9c44573fc6f6
2021-06-16 22:31:18 -07:00
c6cdb4f113 Refactor ZeroRedundancyOptimizer Assuming SPSD (#59834)
Summary:
**Overview:**
This refactors the `ZeroRedundancyOptimizer` implementation to assume single-process single-device (SPSD) instead of accommodating single-process multiple-device (SPMD). `DistributedDataParallel` [retired SPMD recently](https://github.com/pytorch/pytorch/issues/47012), so this change follows the same spirit.

**Changes:**
The parent-class `Optimizer` constructor permits the input argument `params` to be both an `iterable` of `torch.Tensor` and an `iterable` of `dict`. The latter usage is for initializing the optimizer with multiple `param_group`s to start. However, currently, `ZeroRedundancyOptimizer` only supports the former usage, requiring explicit calls to `add_param_group()` for multiple `param_group`s. Given the existing implementation, the type error would be silent and not manifest until much later (e.g. since `super().__init__()` would have no issue). Hence, I added a series of checks to begin the `__init__()` function (encapsulated in `_verify_and_init_params()`). A postcondition of this validation is that `self._all_params` is a non-empty list of all model parameters.

Additionally, I added a check for SPSD usage assuming that all model parameters exist on the same device. This logic is included in `_verify_same_param_device()` and is called immediately after the `params` type-checking.  Support for SPSD with model parameters sharded across devices may be added in the future.

Related to that aforementioned post-condition on `self._all_params`, previously there was undefined behavior resulting from different typing of the passed in `params` input argument. If `params` was a `List`, then the usage of `self._reference_is_trainable_mask` was as expected. However, if `params` was a generator (e.g. as in the canonical usage of passing `model.parameters()`), then the ensuing behavior was divergent. This is because after a generator is iterated over, it is empty. As a result, when we set `self._all_params = params` [in the old code](68d690ffbd/torch/distributed/optim/zero_redundancy_optimizer.py (L165)), `self._all_params` is empty, reducing `training_mask` to always be the empty list. This causes missed calls to `_update_trainable()` in `step()`. (A consequence of this is that `test_pytorch_parity()`, which is renamed to `test_local_optimizer_parity()`, now outputs warnings about the trainable parameters changing.)
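
A minimal, self-contained illustration of that generator pitfall (plain Python, not the ZeRO code itself):

```python
import torch

model = torch.nn.Linear(4, 4)
params = model.parameters()          # a generator, not a list

consumed = [p for p in params]       # e.g. super().__init__() iterates it
leftover = list(params)              # the generator is now exhausted

print(len(consumed))                 # 2 (weight and bias)
print(len(leftover))                 # 0 -> an "all params" list built from
                                     # `params` here would silently be empty
```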

The existing implementation assumes that all parameters share the same dense type when allocating the bucket buffers. This change preserves this assumption, which may be removed in the future. I added a check for this in `_verify_same_dense_param_type()` to avoid erroring silently later on. Note that it is insufficient to simply check for the same `dtype` since dense and sparse tensors may share the same `dtype` but require differing storage sizes. One solution is to use `torch.typename()` as the means for comparison.
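
For example, a dense and a sparse tensor can share a `dtype` while `torch.typename()` still tells them apart:

```python
import torch

dense = torch.randn(2, 2)
sparse = dense.to_sparse()

print(dense.dtype == sparse.dtype)   # True -- both are torch.float32
print(torch.typename(dense))         # torch.FloatTensor
print(torch.typename(sparse))        # torch.sparse.FloatTensor
```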

 ---

The primary change in this refactor is with respect to `self._per_device_params` and `self.buckets`. `self._per_device_params` mapped `torch.device` to `List[List[Parameter]]`. The keys were the devices that the model parameters exist on, and the values designated which ranks are assigned to updating those parameters. `self.buckets` mapped `torch.device` to `List[torch.Tensor]`. The keys were the same as `self._per_device_params`, and the values were the buckets for that device. The usage of these two data structures was confined to each other only. Hence, because the notions of device and rank are now in 1:1 correspondence, we can eliminate the former completely and only use rank. As such, I removed `self._per_device_params` and made `self.buckets` directly a list of buckets (i.e. `torch.Tensor`s).

Iteration over the parameters of a rank for a given device could be simplified to just iteration over the parameters of a rank. Hence, I relied on `self.partition_parameters()` now for that iteration. Refer to `_setup_flat_buffers()` and `step()` for these changes.

One convenient side effect of removing `self._per_device_params` is that there is no longer the re-computation of the parameter partitions mentioned at the end of this [PR](https://github.com/pytorch/pytorch/pull/59410).

 ---

I changed the data structure `self._index_to_param_cache` from a `dict` to a `List` because the domain is `0`, `1`, ..., `k-1` where `k` is the number of parameters. This should yield marginal improvements in memory usage and access speed.

`_sync_param_groups()` is a static method, meaning it can be called either via `self._sync_param_groups()` or `ZeroRedundancyOptimizer._sync_param_groups()` when inside the class. I made the usage consistently `self._sync_param_groups()` rather than have instances of both.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59834

Test Plan:
I ran through the existing test suite on an AI AWS cluster:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py
```
Note: The only test where `parameters_as_bucket_view` is `True` is `test_step_with_closure()`, meaning that that is the test that exercises the core changes of removing `self._per_device_params` and changing `self.buckets`.

Also, I added tests for the `ZeroRedundancyOptimizer` constructor changes and the assumption checks.

Reviewed By: mrshenli

Differential Revision: D29177065

Pulled By: andwgu

fbshipit-source-id: 0ff004ae3959d6d3b521024028c7156bfddc93d8
2021-06-16 20:52:13 -07:00
85517a2b70 [TensorExpr] More python binding cleanups (#60058)
Summary:
A few more quality of life improvements for NNC's python bindings:
- Use standard `torch.dtype`s (rather than `te.Dtype`)
- Make names optional (they don't seem to matter)
- Make shapes optional
- A few implicit conversions to make code cleaner

Followup to https://github.com/pytorch/pytorch/issues/59920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60058

Reviewed By: bertmaher

Differential Revision: D29151953

Pulled By: jansel

fbshipit-source-id: c8286e329eb4ee3921ca0786e17248cf6a898bd8
2021-06-16 20:06:08 -07:00
c01939a9b1 [JIT] Handle modules that already have __constants__ (#60003)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60003

**Summary**
`infer_concrete_type_builder` in `_recursive.py` assumes `__constants__`
is a `set` if it exists as an attribute on the module being scripted.
Instead, it should create a set out of whatever `__constants__` is.
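
A minimal sketch of that fix (illustrative, not the literal patch to `_recursive.py`):

```python
import torch

class M(torch.nn.Module):
    # a list, not a set -- previously this tripped up scripting
    __constants__ = ["alpha"]

    def __init__(self):
        super().__init__()
        self.alpha = 2

    def forward(self, x):
        return x * self.alpha

# The infer step can simply coerce whatever __constants__ is into a set:
constants = set(getattr(M(), "__constants__", ()))

scripted = torch.jit.script(M())   # works once the coercion is in place
print(scripted(torch.ones(2)))     # tensor([2., 2.])
```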

**Test Plan**
Ran code from the issue.

**Fixes**
This commit fixes #59947.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D29174243

Pulled By: SplitInfinity

fbshipit-source-id: aeb8bded80038da35478714b6a697a766ac447f5
2021-06-16 20:01:18 -07:00
d99a8a31b1 Fix version comparison for defining CUDA11OrLater (#60010)
Summary:
Before this PR `CUDA11OrLater` was incorrectly set to `False` when `torch.version.cuda == "11.0"`.
`torch.version.cuda` returns major and minor CUDA versions, it doesn't return patch info.
LooseVersion comparison was calling `[11, 0] >= [11, 0, 0]` which evaluates to `False`.
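
The comparison can be reproduced directly (illustration only):

```python
from distutils.version import LooseVersion

cuda_version = "11.0"                      # what torch.version.cuda reports
print(LooseVersion(cuda_version).version)  # [11, 0] -- no patch component

# Python compares lists element-wise, and a shorter prefix sorts first:
print(LooseVersion("11.0") >= LooseVersion("11.0.0"))  # False (the bug)
print(LooseVersion("11.0") >= LooseVersion("11.0"))    # True  (the fix)
```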

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60010

Reviewed By: mruberry

Differential Revision: D29147107

Pulled By: ezyang

fbshipit-source-id: bd9ed076337b4d32bf1c3376b8f7ae15dbc4d08d
2021-06-16 18:04:29 -07:00
c458bb985e make it easier to grep for unary/binary op kernels (#60128)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60128

Test Plan: Imported from OSS

Reviewed By: wenleix

Differential Revision: D29175499

Pulled By: bdhirsh

fbshipit-source-id: 1838900276e0b956edf25cdddcff438ff685a50e
2021-06-16 17:49:21 -07:00
3288c9d304 [numpy] mvlgamma: int -> float promotion (#59934)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/42515

Last int->float promotion as per the tracker!
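
A small illustration, assuming the change promotes integer inputs the same way as the other ops in the tracker:

```python
import torch

x = torch.tensor([2, 3, 4])        # int64 input
out = torch.mvlgamma(x, 1)         # previously required a floating-point tensor

print(out.dtype)                                       # torch.float32 (default dtype)
print(torch.allclose(out, torch.lgamma(x.float())))    # True: mvlgamma(x, p=1) == lgamma(x)
```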

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59934

Reviewed By: H-Huang

Differential Revision: D29160008

Pulled By: mruberry

fbshipit-source-id: 389a5a7683e0c00d474da913012768bf2a212ef0
2021-06-16 17:44:20 -07:00
f65793507d [fx][Transformer] Add override for call_function (#60057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60057

This ensures that if a function was `wrap`'d before symbolic tracing and the traced module is then passed into the transformer, the function will still be wrapped.
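
A rough sketch of the scenario (function and module names are made up):

```python
import torch
import torch.fx

def my_helper(x):
    return x + 1

# keep my_helper as an opaque call_function node during symbolic tracing
torch.fx.wrap("my_helper")

class M(torch.nn.Module):
    def forward(self, x):
        return my_helper(x) * 2

traced = torch.fx.symbolic_trace(M())
# With the call_function override, the wrapped call survives a Transformer pass
transformed = torch.fx.Transformer(traced).transform()
print(transformed.graph)
```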

Test Plan: Added test to `test_fx.py`

Reviewed By: jamesr66a

Differential Revision: D29151191

fbshipit-source-id: 93560be59505bdcfe8d4f013e21d4719788afd59
2021-06-16 17:25:55 -07:00
cyy
5f017e91b8 don't use moved field in the second lambda (#59914)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59914

Reviewed By: H-Huang

Differential Revision: D29147018

Pulled By: ezyang

fbshipit-source-id: 04fe52fb8cf3cc8f3a538a2dddb13c52cf558549
2021-06-16 17:22:15 -07:00
64aec8d2ca [testing] OpInfoHelper tool (#58698)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/57577

Usage:
Add an OpInfo entry to `common_methods_invocations` with `dtypes=_DYNAMIC_DTYPES`
Eg.
```
OpInfo('atan2',
        dtypes=_DYNAMIC_DTYPES,
        sample_inputs_func=sample_inputs_atan2,)
```

Run the helper with `python -m torch.testing._internal.opinfo_helper`

Output
```
OpInfo(atan2,
       # hint: all_types + (torch.bool,),
       dtypes=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool],
       # hint: all_types + (torch.bool, torch.bfloat16, torch.float16),
       dtypesIfCUDA=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool, torch.bfloat16, torch.float16],
       sample_inputs_func=sample_inputs_atan2)
```

Output without CUDA (run with `$ CUDA_VISIBLE_DEVICES=-1 python -m torch.testing._internal.opinfo_helper`)
```
UserWarning: WARNING: CUDA is not available, information pertaining to CUDA could be wrong
  warnings.warn("WARNING: CUDA is not available, information pertaining to CUDA could be wrong")
OpInfo(atan2,
       # hint: all_types + (torch.bool,),
       dtypes=[torch.float32, torch.float64, torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64, torch.bool],
       sample_inputs_func=sample_inputs_atan2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58698

Reviewed By: H-Huang

Differential Revision: D29160668

Pulled By: mruberry

fbshipit-source-id: 707370a83b451b02ad2fe539775c8c50ecf90be8
2021-06-16 17:17:03 -07:00
0bf1260795 Fix Python 3.8 expecttest machinery again, this time for good. (#60044)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60044

In #59709 I attempted to fix the expecttest machinery to work in Python
3.8.  However, I noticed that it would fail to do substitutions in this
case:

```
    self.assertExpectedInline(
        foo(),
        """bar"""
    )
```

This is because the triple quoted string is not on the same line as the
backtrace line number (at the very beginning), and for safety reasons
the preexisting regex refused to search beyond the first line.  This
wasn't a big deal prior to Python 3.8 because the flipped version of
the regex simply required the triple quoted string to be flush with
the end of the statement (which it typically was!)  But it is a big deal
now that we only have the start of the statement.

I couldn't think of a way to fix this in the current model, so I decided
to call in the big guns.  Instead of trying to do the regex with only
the start xor end line number, I now require you provide BOTH line numbers,
and we will only regex within this range.  The way we compute these line
numbers is by parsing the Python test file with ast, and then searching
through statements until we find one that is consistent with the line
number reported by the backtrace.  If we don't find anything, we
conservatively assume that the string lies exactly in the backtrace
(and you'll probably fail the substitution in that case.)

The resulting code is quite a lot simpler (no more reversed regex) and
hopefully more robust, although I suppose we are going to have to do
some field testing.
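
A minimal sketch of that ast-based lookup (the helper name is made up, not the actual expecttest code):

```python
import ast

def enclosing_statement_range(source: str, lineno: int):
    """Find the (start, end) lines of the narrowest statement containing lineno."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt) and node.lineno <= lineno <= node.end_lineno:
            if best is None or node.end_lineno - node.lineno < best.end_lineno - best.lineno:
                best = node
    # If nothing matches, conservatively assume the string sits on the backtrace line
    return (best.lineno, best.end_lineno) if best else (lineno, lineno)

src = 'self.assertExpectedInline(\n    foo(),\n    """bar"""\n)\n'
print(enclosing_statement_range(src, 1))  # (1, 4): the substitution regex stays in range
```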

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D29146943

Pulled By: ezyang

fbshipit-source-id: 2c24abc3acd4275c5b3a8f222d2a60cbad5e8c78
2021-06-16 17:10:16 -07:00
dab1e59652 Remove dead code in SavedVariable (#59838)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59838

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29069214

fbshipit-source-id: 5debf93a6c3d1c3d585efbe54438e8df92646d62
2021-06-16 16:44:16 -07:00
1efa863837 Avoid un-necessary unwrapping of Tensor in SavedVariable (#59837)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59837

Fixes #58500

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D29069215

fbshipit-source-id: 603db3c8a64b729e86385ed774825f01c6ce0f20
2021-06-16 16:43:04 -07:00
5948e6f653 removed gelu from autocast fp32 list (#59639)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59639

Reviewed By: H-Huang

Differential Revision: D29155914

Pulled By: ezyang

fbshipit-source-id: feb117181894c2355768d5b1189b3d5f1649fc0b
2021-06-16 16:29:57 -07:00
a95207dad4 [quant] Add a quantize_per_tensor overload that takes Tensor quantization parameters (#59773)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59773

Current quantize_per_tensor takes float scale and int zero_point, which does not work with Proxy,
this PR adds a quantize_per_tensor overload that takes Tensor scale and zero_point instead.

Test Plan:
Tested locally that following runs without errors:

```python
import torch
from torch.quantization.quantize_fx import prepare_fx, convert_fx
from torch.fx.experimental import normalize

class TestModule(torch.nn.Module):
    def forward(self, x):
        return x + x

mod = TestModule()
mod.eval()
config = {"": torch.quantization.get_default_qconfig("fbgemm")}
mod = prepare_fx(mod, config)
mod = convert_fx(mod)
mod = torch.fx.Transformer(mod).transform()
```

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D29019862

fbshipit-source-id: c0176040f3b73f0a30516ed17d261b44cc658407
2021-06-16 16:07:20 -07:00
5686fe5817 Revert D29154971: Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Test Plan: revert-hammer

Differential Revision:
D29154971 (9f68f93aca)

Original commit changeset: d534d830020f

fbshipit-source-id: a3d16acc8e6b66a6010b501c28dbe295f573bc86
2021-06-16 15:33:14 -07:00
4c8c61f200 Some fixes to vec256_bfloat16.h (#59957)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59957

Test Plan: Sandcastle

Reviewed By: VitalyFedyunin

Differential Revision: D29073913

fbshipit-source-id: dc01a2015e4ff42daa1d69443460182744c06e90
2021-06-16 15:17:15 -07:00
8ce6d0c42f [torch deploy] add register_module_source (#58290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58290

this is a helper function to get some python source code loaded
on each interpreter without having to use the standard import system
or packages. Useful for debugging or for writing wrapper classes for
handling loaded modules.

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D28435306

Pulled By: zdevito

fbshipit-source-id: b85c16346b9001cd7350d65879cb990098060813
2021-06-16 14:41:13 -07:00
fd1e9253ff [Profiler] Fix timestamp discrepancy in profiler_kineto.cpp (#60070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60070

PyTorch pull request https://github.com/pytorch/pytorch/pull/57333 changed high_resolution_clock to system_clock but missed one location in profiler_kineto.cpp.

On some platforms (e.g. Windows), high_resolution_clock and system_clock do not map to the same underlying clock, so we get mixed timestamps on those platforms.

Reviewed By: wesolwsk

Differential Revision: D29155809

fbshipit-source-id: a6de6b4d550613f26f5577487c3c53716896e219
2021-06-16 14:25:24 -07:00
9d7764642b Use GitHub's diff directly in clang-tidy (#60048)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60048

This changes clang-tidy in lint.yml to pull the raw diff from GitHub and parse that rather than use the PR's base revision. The base revision can cause the spurious inclusion of files not changed in the PR, as in https://github.com/pytorch/pytorch/pull/59967/checks?check_run_id=2832565901. We could be smarter about how we query git, but this approach ends up being simpler since we just need to search for the diff headers in the .diff file.
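
A small sketch of that idea; the regex below targets the standard `diff --git` headers found in a unified `.diff` file (illustrative, not the actual lint.yml code):

```python
import re

def changed_files_from_diff(diff_text: str):
    """Collect paths from the 'diff --git a/<path> b/<path>' headers of a unified diff."""
    return re.findall(r"^diff --git a/\S+ b/(\S+)$", diff_text, flags=re.MULTILINE)

sample = (
    "diff --git a/.github/workflows/lint.yml b/.github/workflows/lint.yml\n"
    "index 111..222 100644\n"
    "--- a/.github/workflows/lint.yml\n"
    "+++ b/.github/workflows/lint.yml\n"
)
print(changed_files_from_diff(sample))  # ['.github/workflows/lint.yml']
```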

See https://github.com/pytorch/pytorch/pull/60049/checks?check_run_id=2834140350 for an example CI run with this on

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29148886

Pulled By: driazati

fbshipit-source-id: ca23446d5cc8938d1345f272afe77b9ee8898b74
2021-06-16 13:40:09 -07:00
b2fc6de2c4 support parsing of PR stats in run_test.py (#60026)
Summary:
Currently S3 test stats doesn't support PR stats parsing.

Changes to s3_stats_parser:
1. Test stats are uploaded to `test_times/{sha1}/{job}` and `pr_test_times/{pr}/{sha1}/{job}` separately, so we need parsing logic for both (see the sketch after this list).
2. PR stats parsing needs to attach a timestamp for ordering, since PR commits can be force-pushed.
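
A rough sketch of how the two key layouts might be assembled and ordered (names and helpers are hypothetical, not the actual S3 stats parser code):

```python
from datetime import datetime, timezone

def master_key(sha1: str, job: str) -> str:
    return f"test_times/{sha1}/{job}"

def pr_key(pr: str, sha1: str, job: str) -> str:
    return f"pr_test_times/{pr}/{sha1}/{job}"

# PR commits can be force-pushed, so each report carries an upload time and
# the newest one wins when reordering tests.
reports = [
    {"key": pr_key("60026", "abc123", "linux_test"), "uploaded": datetime(2021, 6, 15, tzinfo=timezone.utc)},
    {"key": pr_key("60026", "def456", "linux_test"), "uploaded": datetime(2021, 6, 16, tzinfo=timezone.utc)},
]
latest = max(reports, key=lambda r: r["uploaded"])
print(latest["key"])
```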

Changes to run_test.py:
1. Reorder tests based on previous PR stats if available.
2. Fall back to the file-change option if PR-based reordering is not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60026

Test Plan:
- CI.
- local repro: plz run:
```
CIRCLE_JOB="pytorch_linux_bionic_py3_6_clang9_noarch_test" CIRCLE_PR_NUMBER=60057 IN_CI=1 ENABLE_PR_HISTORY_REORDERING=1 python test/run_test.py
```

Reviewed By: samestep

Differential Revision: D29164754

Pulled By: walterddr

fbshipit-source-id: 206688e0fb0b78d1c9042c07243da1fbf88a924b
2021-06-16 13:32:31 -07:00
691183bb74 Fix compile failure on CUDA92 (#60017)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/60016

For CUDA 9.2:
- OptionalBase did not check `is_arrayref`
- constexpr does not seem to be allowed to raise an exception on CUDA 9.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60017

Reviewed By: malfet

Differential Revision: D29139515

Pulled By: ejguan

fbshipit-source-id: 4f4f6d9fe6a5f2eadf913de0a9781cc9f2e6ac6f
2021-06-16 12:23:08 -07:00
15dbc566c5 [torch][segment_reduce] Add missing cuda kernel launch check (#60114)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60114

Same as title.

Test Plan: Unit test (test_kernel_launch_checks.py) is passing.

Reviewed By: ngimel

Differential Revision: D29169538

fbshipit-source-id: ba4518dcb1a4713144d92faec2bb5bdf656ff7c5
2021-06-16 12:19:12 -07:00
2c5db9a40a Add c10d filestore functionality to the current c10d_rendezvous_backend (#59719)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59719

Added filestore functionality to the c10d backend. FileStore will create a temporary file in the /tmp directory to use if it is selected as the store type. Appropriate tests were added as well.
FileStore was modified to expose the path field for testing. It was also modified so that the numWorkers field in the constructor is optional (defaulting to -1). A negative value indicates there is no fixed number of workers; in this case, no attempt is made to clean up the file at the end.
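
A small usage sketch of those semantics (the path is arbitrary; passing -1 reflects the "no fixed number of workers" case described above, assuming the Python binding forwards the numWorkers value as-is):

```python
import os
import tempfile
import torch.distributed as dist

path = os.path.join(tempfile.gettempdir(), "c10d_rendezvous_store")

# -1 workers: the backend does not know how many peers will join, so the
# store file is left in place rather than cleaned up on the last exit.
store = dist.FileStore(path, -1)
store.set("rendezvous/rank0", "ready")
print(store.get("rendezvous/rank0"))  # b'ready'
```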

Test Plan: Unit tests for creating a c10d backend with filestore and simple error handling.

Reviewed By: cbalioglu, H-Huang

Differential Revision: D28997436

fbshipit-source-id: 24c9b2c9b13ea6c947e8b1207beda892bdca2217
2021-06-16 12:13:36 -07:00
84688b0c40 ci: Add note about file_diff_from_base for GHA (#60110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60110

file_diff_from_base is currently bugged for ghstack PRs since it fails
to find a merge base

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D29168767

Pulled By: seemethere

fbshipit-source-id: 580a909aa392541769cbbfdc6acce1e6c5d1c341
2021-06-16 11:31:02 -07:00
15f236f3e3 [package] fix tutorial link (#60113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60113

The tutorial link in the docs was to an fb-only colab.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D29169818

Pulled By: suo

fbshipit-source-id: 374807c234a185bd515b8ffe1300e6cf8d821636
2021-06-16 11:27:25 -07:00
9f68f93aca Training resnext with msuru_suru_union and ig_msuru_suru_union datasets
Summary: We updated the training scripts and re-trained the Resnext model with msuru_suru_union and ig_msuru_suru_union datasets

Test Plan:
Main command line to run:
*./deeplearning/projects/classy_vision/fb/projects/msuru_suru/scripts/train_cluster.sh*

Config we used is *msuru_suru_config.json*, which is "Normal ResNeXt101 with finetunable head".

Experiments:
- msuru_suru_union f279939874
    - Train/test split
        - msuru_suru_union_dataset_train_w_shard: 143,632,674 rows
        - msuru_suru_union_dataset_test_w_shard: 1,831,236  rows
    - Results
       {F625232741}
       {F625232819}
- ig_msuru_suru_union f279964200
    - Train/test split
        - ig_msuru_suru_union_dataset_train_w_shard: 241,884,760 rows
        - ig_msuru_suru_union_dataset_test_w_shard: 3,477,181 rows
    - Results
{F625234126}
{F625234457}

Differential Revision: D29154971

fbshipit-source-id: d534d830020f4f8e596bb6b941966eb84a1e8adb
2021-06-16 11:22:50 -07:00
8c4e78129e .circleci: Disable Windows GPU jobs (#60024)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60024

Disables windows GPU jobs on CircleCI since they have been migrated to
GHA

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D29137287

Pulled By: seemethere

fbshipit-source-id: 204e0c9232201a36a557cd0843e31d34269cc722
2021-06-16 10:45:14 -07:00
74ea1f23b4 Revert D29148233: [pytorch][PR] Add GITHUB_HEAD_REF in check for IN_PULL_REQUEST
Test Plan: revert-hammer

Differential Revision:
D29148233 (241aac3ef8)

Original commit changeset: 7c8c1866f39c

fbshipit-source-id: f32c6c6decd737ef290d3e83c9d021475aabaab0
2021-06-16 10:41:30 -07:00
bac6bcd6d8 Update call site for FBGemm quantization util functions. (#624)
Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59637

Replace FloatToFusedNBitRowwiseQuantizedSBHalf, FusedNBitRowwiseQuantizedSBHalfToFloat, FloatToFused8BitRowwiseQuantizedSBFloat, and Fused8BitRowwiseQuantizedSBFloatToFloat with newer version.

Test Plan: CI tests.

Reviewed By: dskhudia

Differential Revision: D28918581

fbshipit-source-id: a21274add71439c5e51287a0e2ec918a8d8e5392
2021-06-16 10:15:34 -07:00
d88fbf0fbc fix minor typo in run_test.py (#60055)
Summary:
Fixes typo in run_test.py for option use_specified_test_cases_by

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60055

Reviewed By: walterddr

Differential Revision: D29150156

Pulled By: janeyx99

fbshipit-source-id: 375e594d09c83188bfa80762c8b833a0b7c5cca4
2021-06-16 09:30:45 -07:00
241aac3ef8 Add GITHUB_HEAD_REF in check for IN_PULL_REQUEST (#60047)
Summary:
I believe IN_PULL_REQUEST is unset for some GHA test runs because we don't also check GITHUB_HEAD_REF. This PR is a small fix for that.

Example: https://github.com/pytorch/pytorch/pull/60023/checks?check_run_id=2831813860 doesn't set it properly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60047

Reviewed By: walterddr

Differential Revision: D29148233

Pulled By: janeyx99

fbshipit-source-id: 7c8c1866f39ce8af8d13c34ddc0c5786a829321e
2021-06-16 08:57:49 -07:00
a6ecfb3296 Update lint.yml to use custom clang-tidy build (#59967)
Summary:
Related: https://github.com/pytorch/pytorch/issues/59815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59967

Reviewed By: samestep

Differential Revision: D29164686

Pulled By: 1ntEgr8

fbshipit-source-id: b6f9fb6fa4280f757a54a37b30b027b7504bef63
2021-06-16 08:45:24 -07:00
842a831f53 [nnc] Move batchnorm to operators library (#59992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59992

Wrapped batch norm in function `computeBatchNorm`.
ghstack-source-id: 131407851

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D29116661

fbshipit-source-id: 2873a9a3e70f31db1988787160fc96c388ea3d4a
2021-06-16 05:09:59 -07:00
bda40639c5 [nnc] Move operator implementations into a subdirectory (#59988)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59988

As we broaden operator support, putting all the implementations into
kernel.cpp is getting unwieldy.  Let's factor them out into the "operators"
subdirectory.

This diff is big but it's entirely code movement; I didn't change anything,
other than to expose a few utilities in kernel.h.
ghstack-source-id: 131405139

Test Plan: CI

Reviewed By: ZolotukhinM

Differential Revision: D29115916

fbshipit-source-id: ba0df1d8dd4a108b584da3baf168407e966b2c78
2021-06-16 05:08:50 -07:00
f43ff754ca [docs] Correct errata in linalg.eigh and add a bit more information (#59784)
Summary:
Add extra information about the returned elements of the spectral
decompositions

Resolves https://github.com/pytorch/pytorch/issues/59718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59784

Reviewed By: soulitzer

Differential Revision: D29088998

Pulled By: mruberry

fbshipit-source-id: 58a191c41ff5e4c9d9675e5b3d7cbbcf16be4da1
2021-06-16 01:21:09 -07:00
36a5647e30 Handle exceptions from THPModule_setQEngine (#60073)
Summary:
Prevents Python runtime crashes when calling `torch._C._set_qengine(2**65)` or setting `torch.backends.quantized.engine="fbgemm"` when PyTorch was compiled without fbgemm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60073

Reviewed By: supriyar

Differential Revision: D29156430

Pulled By: malfet

fbshipit-source-id: 95b97352a52a262f1634b72da64a0c950eaf2373
2021-06-16 00:40:59 -07:00
9fbbab88da [fx-acc] Saturate host by replicating partitions onto idle devices (#60064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60064

This implements a host saturation optimization to maximize the utilization of the available devices.
It uses a greedy heuristic to replicate all partitions on the used devices to another set of idle devices with enough memory.

The added unittest shows an example as follows:

```
partition_0: 192 bytes; partition_1: 48 bytes
dev_0: 200 bytes, [partition_0]
dev_1: 200 bytes, [partition_1]
dev_2: 100 bytes,
dev_3: 100 bytes,
dev_4: 200 bytes,
dev_5: 100 bytes
```

Before host saturation, `partition_0` is assigned to dev_0 and `partition_1` is assigned to dev_1.
After host saturation, `partition_0` is replicated to dev_4 simply because it's the only device that can hold all partitions on dev_0. `partition_1` is replicated to dev_2 because it has minimal but large enough memory to hold all partitions on dev_1.

Test Plan:
```
buck test mode/opt //caffe2/test:test_fx_experimental -- --exact 'caffe2/test:test_fx_experimental - test_saturate_host (test_fx_experimental.TestFXExperimental)'

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/8444249343103429
    ✓ ListingSuccess: caffe2/test:test_fx_experimental - main (1.322)
    ✓ Pass: caffe2/test:test_fx_experimental - test_saturate_host (test_fx_experimental.TestFXExperimental) (1.322)
Summary
  Pass: 1
  ListingSuccess: 1
```

An e2e test will be added to `test_fx_glow.py` in a followup diff.

Reviewed By: gcatron

Differential Revision: D29039998

fbshipit-source-id: 57518aadf668f7f05abd6ff73224c16b5d2a12ac
2021-06-15 23:04:46 -07:00
a344b09db2 [quant][fx][graphmode] Remove Quantizer class (#59606)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59606

Test Plan:
python test/test_quantization.py TestQuantizeFx

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28951432

fbshipit-source-id: 3301f7200a4c7166673c27f9ac7ff559f1e6935d
2021-06-15 21:54:57 -07:00
78011bc0ce typofix (torch.zero to torch.zeros) in docstring (#59703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59703

Reviewed By: ezyang

Differential Revision: D29145998

Pulled By: H-Huang

fbshipit-source-id: f2670502170aa100fb02408046b7f6850f9379cf
2021-06-15 21:12:42 -07:00
e50f264b51 [caffe2] make MulGradient implementation in-place compatible (#60035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60035

In Caffe2, the operator schema for the MulGradient op indicates that MulGradient may be performed in-place, overwriting one of its inputs as the output. The implementation is not safe to perform in-place, however, due to an accidentally-introduced write-read dependency on the overwritten input in the in-place case. We fix it here.
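
A toy illustration of the write-read hazard (plain tensors, not the Caffe2 kernel): for C = A * B, dA = dC * B and dB = dC * A; in the in-place case dA shares its buffer with dC, so writing dA first clobbers the value dB still needs.

```python
import torch

a, b = torch.tensor([2.0]), torch.tensor([3.0])

# buggy in-place order: grad_a aliases grad_c
grad_c = torch.tensor([1.0])
grad_a = grad_c                 # same storage (the "in-place" output)
grad_a.mul_(b)                  # grad_a = dC * B, but this also overwrites dC
wrong_grad_b = grad_c * a       # reads the clobbered buffer -> 6.0

# safe order: read everything that depends on dC before overwriting it
grad_c = torch.tensor([1.0])
right_grad_b = grad_c * a       # 2.0
grad_c.mul_(b)                  # now it is fine to reuse the buffer for dA

print(wrong_grad_b.item(), right_grad_b.item())  # 6.0 2.0
```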

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test
```

Note that the newly added test fails without this change, but passes with this change:

```
    ✓ ListingSuccess: caffe2/caffe2/python/operator_test:elementwise_ops_test - main (24.992)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_exp (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_log1p (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_abs (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_and (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_reciprocal (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sqr (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_rsqrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_mul (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sqrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_add (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_swish_gradient_inplace (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sigmoid (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_or (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cbrt_grad (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_not (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_sub (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_div (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_eq (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_softsign (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_eq_bcast (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_powt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
*************************************************************************************************************************************************************************************
***********************************<NEW_TEST_YAY>************************************************************************************************************************************
*************************************************************************************************************************************************************************************

   ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_mul_gradient_inplace (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)

*************************************************************************************************************************************************************************************
***********************************</NEW_TEST_YAY>***********************************************************************************************************************************
*************************************************************************************************************************************************************************************
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_hard_sigmoid (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_bitwise_xor (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_log (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cube (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_swish (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_cbrt (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - test_div_legacy_grad (caffe2.caffe2.python.operator_test.elementwise_ops_test.TestElementwiseOps) (125.898)
    ✓ Pass: caffe2/caffe2/python/operator_test:elementwise_ops_test - main (125.898)
Summary
  Pass: 30
  ListingSuccess: 1
```

Reviewed By: clrfb

Differential Revision: D29034265

fbshipit-source-id: 98550e1d5976398e45d37ff2120591af1439c42a
2021-06-15 20:26:04 -07:00
eda2ddb5b0 [ATen] Fix aten::to schema (#60001)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60001

Fix the aten::to schema to reflect that the output may alias input.

Test Plan: Added new unit tests.

Reviewed By: ezyang

Differential Revision: D29121620

fbshipit-source-id: c29b6aa22d367ffedf06e47116bc46b3e188c39c
2021-06-15 20:04:20 -07:00
95257e8a62 [fx-acc] Fix wrong device assignment in find_single_partition (#60056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60056

Previously we put the whole graph as a single partition onto a device with maximum memory if possible, but the code assumed that the first logical device always has the maximum memory.

This diff fixes this issue and updates the unittest to reflect such a corner case.

Test Plan:
```
buck test mode/opt //caffe2/test:test_fx_experimental -- --exact 'caffe2/test:test_fx_experimental - test_find_single_partition (test_fx_experimental.TestFXExperimental)'

Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/6473924507772744
    ✓ ListingSuccess: caffe2/test:test_fx_experimental - main (1.357)
    ✓ Pass: caffe2/test:test_fx_experimental - test_find_single_partition (test_fx_experimental.TestFXExperimental) (1.206)
Summary
  Pass: 1
  ListingSuccess: 1

```

Reviewed By: gcatron

Differential Revision: D29118715

fbshipit-source-id: cac6a1f0d2f47717446dcc80093bbcf362663859
2021-06-15 19:36:38 -07:00
469f0e42d6 [nnc] Handle more cases of excessive # of cat args (#60043)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60043

And add a unit test

Test Plan: new unit test

Reviewed By: navahgar

Differential Revision: D29146547

fbshipit-source-id: 31532926032dbef70d163930f3d8be160f5eacc3
2021-06-15 18:19:52 -07:00
1207745e98 fixing illegal memory access on NHWC BN kernel (#59981)
Summary:
Adds an early exit in the kernel to avoid reading out of bounds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59981

Reviewed By: ezyang

Differential Revision: D29147349

Pulled By: ngimel

fbshipit-source-id: b36a6a9e2526c609ff98fb5a44468f3257e0af67
2021-06-15 16:57:41 -07:00
27a3204982 generate C++ API for meta functions using at::meta:: (#58570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58570

**What the PR does**
Generate a fast-path `at::meta::{op}` API for calling meta functions without having to go through the dispatcher. This will be important for perf for external backends that want to use meta functions for shape checking (which seems likely to be what we end up doing for LazyTensorCore).

**Details**
In order to avoid naming collisions I had to make two small changes:
- rename `MetaFunctions.h` template -> `NativeMetaFunctions.h` (this is the file that declares the impl() function for every structured operator).
- rename the meta class: `at::meta::{op}::meta()` -> `at::meta::structured_{op}::meta()`

I also deleted a few unnecessary includes, since any file that includes NativeFunctions.h will automatically include NativeMetaFunctions.h.

**Why I made the change**
This change isn't actually immediately used anywhere; I already started writing it because I thought it would be useful for structured composite ops, but that isn't actually true (see [comment](https://github.com/pytorch/pytorch/pull/58266#issuecomment-843213147)). The change feels useful and unambiguous though so I think it's safe to add. I added explicit tests for C++ meta function calls just to ensure that I wrote it correctly - which is actually how I hit the internal linkage issue in the PR below this in the stack.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711299

Pulled By: bdhirsh

fbshipit-source-id: d410d17358c2b406f0191398093f17308b3c6b9e
2021-06-15 16:54:46 -07:00
e341bab8ae bugfix: ensure that at::{dispatch_key}:: API gets external linkage (#58569)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58569

This should allow external C++ files that aren't compiled into `libtorch.so`/`libtorch_cpu.so` (including all of fbcode) to use fast path functions like `at::cpu::add()`, which skip the dispatcher.

So, after spending way too much time trying to figure out why I was getting linker errors when calling `at::meta::{op}` and `at::cpu::{op}` from C++ test files, I realized that we're not including the header files for C++ for the namespaced operator definitions. I.e. `RegisterCPU.cpp`, which provides definitions for the `at::cpu::{op}` fast path functions, wasn't including the `CPUFunctions.h` header.

Why that breaks stuff: the `CPUFunctions.h` header file is what marks each function with the `TORCH_API` macro, so without including it, when we build `libtorch.so` and `libtorch_cpu.so`, the compiler will look at the definition in `RegisterCPU.cpp`, not see a `TORCH_API`, and decide that the function should get internal linkage.

An alternative would be to directly mark the function definitions in `RegisterCPU.cpp` with `TORCH_API`, but this seemed cleaner.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711300

Pulled By: bdhirsh

fbshipit-source-id: 535f245c20e977ff566d6da0757b3cefa137040b
2021-06-15 16:53:22 -07:00
5fd6ead097 refine disabled test (#60040)
Summary:
This is to refine:
https://github.com/pytorch/pytorch/pull/60029

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60040

Reviewed By: ezyang

Differential Revision: D29147009

Pulled By: Krovatkin

fbshipit-source-id: 37e01ac6e8d6f7e6b5c517f7804704f9136a56f5
2021-06-15 16:22:29 -07:00
fc50f91929 Move RPC agents to libtorch (#59939)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59939

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28875276

fbshipit-source-id: f2f6970fd74de5f112636e78edaa4410c61d8c45
2021-06-15 16:20:53 -07:00
04ec122868 Add some TORCH_API annotations to RPC
Summary: They will be needed when RPC gets merged into libtorch

Test Plan: CI later in the stack

Reviewed By: mrshenli

Differential Revision: D29132956

fbshipit-source-id: 8637640d56a1744a5dca5eb7d4b8ad0860c6b67c
2021-06-15 16:20:51 -07:00
cbbb7e145e Pass RequestCallback to FaultyPG RPC agent
Summary: This is needed to avoid FaultyPG from including and depending on RequestCallbackImpl, which is Python-only. The other RPC agents accept an explicit (upcast) pointer as an argument, and we can do the same for FaultyPG.

Test Plan: Later in the stack.

Reviewed By: mrshenli

Differential Revision: D29132955

fbshipit-source-id: bb7554b84bcbf39750af637e6480515ac8b92b86
2021-06-15 16:19:50 -07:00
f232b052a6 [fx-acc][easy] Format FX experimental partitioner code (#60030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60030

As titled. Non-functional re-format.

Test Plan: NA

Reviewed By: gcatron

Differential Revision: D29038449

fbshipit-source-id: a7c94eaab86850ef57b51ec66bfe8ea0e68d2dc8
2021-06-15 16:14:33 -07:00
50229b5250 Fix some typing issues (#59952)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59952

Test Plan: Sandcastle

Reviewed By: swolchok

Differential Revision: D29083423

fbshipit-source-id: 7a13d6ba60808bcf88d809db194d0f873605172c
2021-06-15 14:11:06 -07:00
1d5a577f04 Fix some items identified as problematic by Wextra and other clean-up (#59909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59909

Test Plan: Sandcastle

Reviewed By: vkuzo

Differential Revision: D29073150

fbshipit-source-id: 500a92ccb57b0e40277863a3b235099fd66ab8ad
2021-06-15 13:42:32 -07:00
dc1f60a9a2 [sparsity][refactor] Restructure the tests folders (#60032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60032

There will be more sparse tests coming. This PR creates a separate folder for the sparse tests

Test Plan: `python test/test_ao.py`

Reviewed By: raghuramank100

Differential Revision: D29139265

fbshipit-source-id: d0db915f00e6bc8d89a5651f08f72e362a912a6b
2021-06-15 13:37:19 -07:00
8dd0570b34 Reuse build_torch_xla from pytorch/xla repo. (#59989)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59989

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29138211

Pulled By: ailzhang

fbshipit-source-id: 349d307c510e7fad266822e320f0d6904fa00239
2021-06-15 13:19:54 -07:00
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
a0e62c4da4 Reuse run_torch_xla_tests from pytorch/xla (#59888)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59888

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29114274

Pulled By: ailzhang

fbshipit-source-id: d2845c7fc95d038cd68c10e22b68be8ad3cae736
2021-06-15 13:00:09 -07:00
c23624351a disable test_sparse_allreduce_basics (#60029)
Summary:
This test will be disabled due to intermittent failures in https://circleci.com/gh/pytorch/pytorch/14155828?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link
as per https://hud.pytorch.org/build2/pytorch-master

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60029

Reviewed By: seemethere

Differential Revision: D29139042

Pulled By: Krovatkin

fbshipit-source-id: 105000e8636f17846be31f517abdf56ea0a994e9
2021-06-15 12:35:11 -07:00
044b519a80 Symbolic for ReLu6 (#58560) (#59538)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59538

Four mealv2 models could be exported with torch 1.8.1, but export started failing after torch master introduced relu6 a few months back.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046607

Pulled By: SplitInfinity

fbshipit-source-id: d9cf7050e4ac0dad892441305ffebc19ba84e2be

Co-authored-by: David <jiafa@microsoft.com>
2021-06-15 12:24:17 -07:00
5d00c374dd [ONNX] Sum empty tensor could not be exported to ONNX successfully. (#58141) (#59537)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59537

PyTorch sum over an empty tensor gives 0, while ONNX produces an error.

torch.sum is translated into the onnx::ReduceSum op. Per the definition of ReduceSum, the keepdims attribute is updated for this scenario.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046604

Pulled By: SplitInfinity

fbshipit-source-id: 6f5f3a66cb8eda8b5114b8474dda6fcdbae73469

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-06-15 12:24:16 -07:00
83450aa11d [ONNX] Add support for torch.bernoulli() export (#57003) (#59536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59536

Support export HuggingFace - Training DeBERTa model.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046609

Pulled By: SplitInfinity

fbshipit-source-id: df87e0c6ed0f13463297bdeba73967fcf2aa37ca

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-06-15 12:24:14 -07:00
cd5f142af4 fix error message for type_as (#57948) (#59535)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59535

Improve error message for type_as and add unit test.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046605

Pulled By: SplitInfinity

fbshipit-source-id: 978bceeb62e4d3c68815cd5fdf160909a99d00f2

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-06-15 12:24:12 -07:00
55530e2276 Update Autograd Export Docs (#56594) (#59534)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59534

Update autograd export docs

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb, ansley

Differential Revision: D29046606

Pulled By: SplitInfinity

fbshipit-source-id: 36057f6bdfd3e5c071dbca05d327de7952904120

Co-authored-by: neginraoof <neginmr@utexas.edu>
2021-06-15 12:23:00 -07:00
a120a12ab4 [Bootcamp][pytorch]Add WebIterDataPipe and ToBytesIterDataPipe to the datapipes. (#59816)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59816

Add two new DataPipes, one for getting web file urls to yield streams and one for getting streams to yield bytes.

Test Plan:
Add test_web_iterable_datapipe in test/test_datapipes.py. The test starts a local HTTP server for serving test files. The following were tested locally and passed:
1. create and load 16M localhost file URLs (each 10 bytes in size)
2. create and load a 64 GB localhost file
In the unit test, to keep testing time reasonable, both the stress test and the large-file test are disabled.

Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D29051186

fbshipit-source-id: f8e44491e670560bf445af96f94d98230436f396
2021-06-15 11:43:26 -07:00
79d7c15dc5 [PyTorch] Add ExclusivelyOwned (#59419)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59419

This introduces ExclusivelyOwned, which allows isolated
pieces of code that can make ownership guarantees to opt out of
reference counting operations on `intrusive_ptr` and `Tensor`
entirely. To elaborate, if you know you are the exclusive owner of an
`intrusive_ptr` or `Tensor`, moving it into an `ExclusivelyOwned` will
avoid performing atomic reference counting operations at destruction
time. The documentation comment should provide sufficient explanation; please request changes if not.
ghstack-source-id: 131376658

Test Plan:
Added `ExclusivelyOwned_test.cpp`. It passes. When I ran it
under valgrind, valgrind reported no leaks.

Inspected assembly from `inspect` functions in
`ExclusivelyOwned_test.cpp` in an optimized (opt-clang) build. As
expected, `ExclusivelyOwned` calls `release_resources()` and the
`TensorImpl` virtual destructor without including any atomic reference
counting operations.

Reviewed By: ezyang

Differential Revision: D28885314

fbshipit-source-id: 20bf6c82b0966aaa635ab0233974781ed15f93c1
2021-06-15 11:26:25 -07:00
d7eb5836bb Add RRef support to ShardedTensor. (#59776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59776

Overall design: https://github.com/pytorch/pytorch/issues/55207.

In this PR, I've added support to ShardedTensor such that it also creates RRefs
pointing to the remote shards if the RPC framework is initialized.

As a result, this provides more flexibility for ShardedTensor, such that users can use collectives with local shards or use the RPC framework to interact with remote shards.
ghstack-source-id: 131381914

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D29020844

fbshipit-source-id: acb308d0029a5e486c464d93189b5de1ba680c85
2021-06-15 10:49:31 -07:00
20460b0c05 [nnc] Removed setBufferMap method from LoopNest (#59496)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59496

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915958

Pulled By: navahgar

fbshipit-source-id: 71e649c93fc67b36c37373f043c729aa835968a0
2021-06-15 10:37:48 -07:00
b822928e33 [nnc] Removed setGPUBlockIndex and setGPUThreadIndex methods from LoopNest (#59495)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59495

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915960

Pulled By: navahgar

fbshipit-source-id: 20a4032b031aba6e43d85433ade5f0680c65fbc0
2021-06-15 10:37:46 -07:00
aa163aeff5 [nnc] Made several LoopNest APIs static (#59494)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59494

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28915959

Pulled By: navahgar

fbshipit-source-id: bf52e30d893f4d86812219b538a14307f347f10b
2021-06-15 10:36:31 -07:00
4afd0b7952 .github: Add Windows CUDA 11.1 workflow (#59960)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59960

Adds the CUDA 11.1 workflow to GHA

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29116814

Pulled By: seemethere

fbshipit-source-id: 90601610e481e1f70a60eaa1b640373ecb89bdb9
2021-06-15 10:22:30 -07:00
1c502d1f8e Don't run_build when run_binary_tests (#59982)
Summary:
https://github.com/pytorch/pytorch/issues/59889 wasn't a proper revert of https://github.com/pytorch/pytorch/issues/58778. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59982

Reviewed By: seemethere

Differential Revision: D29114129

Pulled By: samestep

fbshipit-source-id: b40563db6ff1153a5f759639978279f5fcbccaa9
2021-06-15 07:39:38 -07:00
90cf76dde5 Support torch.nn.parameter type for PDT (#59249)
Summary:
=========

Support torch.nn.parameter type for PDT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59249

Test Plan:
====
with-proxy python test/test_jit.py -k TestPDT

Reviewed By: ZolotukhinM

Differential Revision: D29124413

Pulled By: nikithamalgifb

fbshipit-source-id: b486b82c897dbc2b55fbacd5d610bdb700ddc9fa
2021-06-15 07:22:33 -07:00
f9445c8a6b [torch][segment_reduce] Add cuda support for mean reduction (#59543)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59543

Building on top of previous PR: https://github.com/pytorch/pytorch/pull/59521

This diff adds support for mean reduction for CUDA (forward only currently).
Will add the CUDA backward implementation in a subsequent PR.
Next steps:
- CUDA backward support for mean
- 2d data input support
- more testing
- benchmarking

Test Plan: update unit test to cover this part as well.

Reviewed By: ngimel

Differential Revision: D28922838

fbshipit-source-id: 72b7e5e79db967116b96ad010f290c9f057232d4
2021-06-15 07:00:45 -07:00
f4f7950812 Prepare for TensorPipe separating its CUDA-specific headers (#59788)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59788

This one line is all we need to "migrate" PyTorch to the "new API" of TensorPipe that splits the CUDA-specific stuff in a separate top-level header. (The motivation behind that is that it will allow us to "stack" the CUDA code on top of the CPU one).
ghstack-source-id: 131326166

Test Plan: None yet

Reviewed By: beauby

Differential Revision: D28875277

fbshipit-source-id: ecfd0b7fc0218ab7899bfe64ffe73c1417b897db
2021-06-15 03:28:39 -07:00
5e5ca0682b Move CUDA-related stuff of TP agent to separate file (#59377)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59377

This PR demonstrates that now the CUDA parts of the TensorPipe agent just "plug on top" of the CPU-only parts. Thus ideally the CPU-only parts could go in libtorch while the CUDA-only parts could go in libtorch_cuda. Unfortunately we can't do that just yet, because the TensorPipe agent depends on c10d (for its Store and its ProcessGroup), which lives in libtorch_python.
ghstack-source-id: 131326168

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28796429

fbshipit-source-id: 41b2eb8400c0da282f3750a4eea21ad83ee4a175
2021-06-15 03:28:38 -07:00
83ba71aa0e Make CUDA serde support for TP agent pluggable (#59376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59376

This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow to move the TP agent to libtorch (because libtorch is split into a CPU and a CUDA part; now it's in libtorch_python), although unfortunately other conditions need to also be met for this to happen.

The only instance where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to "decentralize it". It's not a c10::Registry, because that's overkill (it uses an unordered_map, with strings as keys): here we can just use an array with integers as "keys".
ghstack-source-id: 131326167

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28796428

fbshipit-source-id: b52df832e0c0abf489a9e418353103496382ea41
2021-06-15 03:27:40 -07:00
cf63893211 Enable implicit operator versioning via number of arguments (#58852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58852

Enable implicit operator versioning via number of arguments from Mobile.
1. By default, TS doesn't emit instructions for trailing default args, and the number of specified args provided is serialized to bytecode. In the interpreter, the default values are fetched from the operator schema. The implementation landed in #56845; please refer to #56845 for details. A small sketch follows after this list.
2. Since there is a bytecode schema change, the bytecode version is bumped from 5 to 6.
3. The corresponding backport function is provided for forward-compatibility use. Note that because there is an instruction change, a global flag is used as the switch to control the two versions.
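
A small sketch of the default-argument behaviour described in item 1 (illustrative only; the bytecode/serialization details are internal):

```python
import torch

# Only the explicitly specified arguments are recorded; the trailing default
# (here aten::clamp's `max`) is omitted and later filled in from the operator
# schema by the interpreter.
@torch.jit.script
def clamp_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.clamp(x, min=0.0)

print(clamp_relu(torch.tensor([-1.0, 2.0])))  # tensor([0., 2.])
```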

Test Plan: Imported from OSS

Reviewed By: raziel

Differential Revision: D28789746

Pulled By: iseeyuan

fbshipit-source-id: 6e5f16460c79b2bd3312de02d0f57b79f50bf66b
2021-06-15 02:07:40 -07:00
a1780432fa Move c10d to libtorch(_cuda) (#59563)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59563

ghstack-source-id: 131331264

Test Plan: CI

Reviewed By: malfet

Differential Revision: D28932239

fbshipit-source-id: 5df6cdfa5253b15cbbc97039fe672d6d97321e34
2021-06-15 02:01:31 -07:00
8d50a4e326 Add support for embeddingBagBytewise in FXGlow
Summary: This adds support for embeddingBagBytewise with fp32 scale/bias to FXGlow.

Test Plan: buck run  //glow/fb/fx/fx_glow:test_fx_glow

Reviewed By: jfix71

Differential Revision: D29075288

fbshipit-source-id: 4145486505a903129678216b133bbb8ad71f4fef
2021-06-14 23:31:29 -07:00
cbd1e8c335 [Static Runtime] Fix bug in aten::to (#59995)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59995

Reviewed By: ajyu

Differential Revision: D29083106

fbshipit-source-id: 687ffb121af2716d606c145474942650a2d9ac7e
2021-06-14 22:54:43 -07:00
087ac75b26 Fix quantized mean operator in QNNPACK backend (#59761)
Summary:
cc: kimishpatel

Fixes https://github.com/pytorch/pytorch/issues/58668

Test it with `pytest -k test_quantized_mean test/test_quantization.py` or `buck test //caffe2/test:quantization -- test_quantized_mean`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59761

Reviewed By: bdhirsh

Differential Revision: D29013271

Pulled By: kimishpatel

fbshipit-source-id: 020956fb63bd5078856ca17b137be016d3fc29b8
2021-06-14 17:30:21 -07:00
5b9fced70a add output_process_fn_grad before sum().backward() (#59971)
Summary:
This should fix `to_sparse` test issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59971

Test Plan:
CI

Also: directly examine the RuntimeError thrown from test_unsupported_backward
- Before:
```
NotImplementedError: Could not run 'aten::sum' with arguments from the 'SparseCPU' backend.
```
- After:
```
to_dense() not supported for float16 on CPU
```

Reviewed By: soulitzer

Differential Revision: D29112558

Pulled By: walterddr

fbshipit-source-id: c2acd22cd18d5b34d25209b8415feb3ba28fa104
2021-06-14 16:20:03 -07:00
117b7ae38a Remove update-disabled-tests workflow as it is migrated to test-infra (#59986)
Summary:
Will be replaced by https://github.com/pytorch/test-infra/pull/37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59986

Reviewed By: seemethere, soulitzer

Differential Revision: D29115397

Pulled By: janeyx99

fbshipit-source-id: 2c1a88d6a3fec8cef57818a360884644ec2c7b79
2021-06-14 15:25:34 -07:00
c2098487e8 [c10d] Move pg wrapper tests to their own file. (#59840)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59840

moving these tests to their own standalone file. No meaningful code changes.
ghstack-source-id: 131359162

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D29012664

fbshipit-source-id: 348870016509a6ed7e69240fa82bccef4a12d674
2021-06-14 15:05:55 -07:00
5c1d17e697 Revert D29100708: [pytorch][PR] Parametrizations depending on several inputs
Test Plan: revert-hammer

Differential Revision:
D29100708 (061e71b199)

Original commit changeset: b9e91f439cf6

fbshipit-source-id: bff6d8a3d7b24f4beb976383912033c250d91a53
2021-06-14 14:08:50 -07:00
5e993e6c81 [fx2trt] Make TRTInterpreter don't need concrete tensor as arg (#59948)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59948

1. We have two Interpreters, one for vanilla ops and one for acc ops. Some of the logic between them is similar, and in this diff we extract the shared logic into a base Interpreter so that any future general feature change can benefit both Interpreters.

2. Make TRTInterpreter not depend on a concrete tensor arg. We will use `InputTensorSpec` to create the necessary inputs for the acc tracer.

3. Add unittests for acc op converter.

Test Plan:
```
buck test mode/opt caffe2/torch/fb/fx2trt:test_linear
buck test mode/opt caffe2/torch/fb/fx2trt:test_batchnorm
buck test mode/opt caffe2/torch/fb/fx2trt:test_convolution
buck test mode/opt caffe2/torch/fb/fx2trt:test_reshape
buck test mode/opt caffe2/torch/fb/fx2trt:test_relu
buck test mode/opt caffe2/torch/fb/fx2trt:test_add
buck test mode/opt caffe2/torch/fb/fx2trt:test_maxpool
```

Reviewed By: jackm321

Differential Revision: D28749682

fbshipit-source-id: 830d845aede7203f6e56eb1c4e6776af197a0fc3
2021-06-14 14:03:26 -07:00
c645d39a77 Implementation of torch.isin() (#53125)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/3025

## Background

This PR implements a function similar to numpy's [`isin()`](https://numpy.org/doc/stable/reference/generated/numpy.isin.html#numpy.isin).

The op supports integral and floating point types on CPU and CUDA (+ half & bfloat16 for CUDA). Inputs can be one of:
* (Tensor, Tensor)
* (Tensor, Scalar)
* (Scalar, Tensor)

Internally, one of two algorithms is selected based on the number of elements vs. test elements. The heuristic for deciding which algorithm to use is taken from [numpy's implementation](fb215c7696/numpy/lib/arraysetops.py (L575)): if `len(test_elements) < 10 * len(elements) ** 0.145`, then a naive brute-force checking algorithm is used. Otherwise, a stablesort-based algorithm is used.
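
A brief usage sketch (which of the two algorithms runs is an internal detail chosen by the heuristic above):

```python
import torch

elements = torch.tensor([1, 2, 3, 4])
test_elements = torch.tensor([2, 4, 6])
print(torch.isin(elements, test_elements))                # tensor([False,  True, False,  True])
print(torch.isin(elements, test_elements, invert=True))   # tensor([ True, False,  True, False])
```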

I've done some preliminary benchmarking to verify this heuristic on a devgpu, and determined for a limited set of tests that a power value of `0.407` instead of `0.145` is a better inflection point. For now, the heuristic has been left to match numpy's, but input is welcome for the best way to select it or whether it should be left the same as numpy's.

Tests are adapted from numpy's [isin and in1d tests](7dcd29aaaf/numpy/lib/tests/test_arraysetops.py).

Note: my locally generated docs look terrible for some reason, so I'm not including the screenshot for them until I figure out why.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53125

Test Plan:
```
python test/test_ops.py   # Ex: python test/test_ops.py TestOpInfoCPU.test_supported_dtypes_isin_cpu_int32
python test/test_sort_and_select.py   # Ex: python test/test_sort_and_select.py TestSortAndSelectCPU.test_isin_cpu_int32
```

Reviewed By: soulitzer

Differential Revision: D29101165

Pulled By: jbschlosser

fbshipit-source-id: 2dcc38d497b1e843f73f332d837081e819454b4e
2021-06-14 13:50:53 -07:00
f9ec86a6c6 External stream (#59527)
Summary:
Previous is https://github.com/pytorch/pytorch/issues/57781

We now add two CUDA bindings to avoid using ctypes, which fixes a Windows issue. However, we still use ctypes to allocate the stream and create its pointer (we could do this with a 0-dim tensor too if that feels better).
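
A hedged sketch of the intended usage, assuming a CUDA runtime is available and that `torch.cuda.ExternalStream` accepts a raw `cudaStream_t` pointer value (a sketch, not a definitive implementation):

```python
import ctypes
import torch

# Create a raw CUDA stream outside of PyTorch (here via the CUDA runtime
# loaded through ctypes) and wrap it so PyTorch work can be enqueued on it.
cudart = ctypes.CDLL("libcudart.so")
raw_stream = ctypes.c_void_p()
cudart.cudaStreamCreate(ctypes.byref(raw_stream))

ext_stream = torch.cuda.ExternalStream(raw_stream.value)
with torch.cuda.stream(ext_stream):
    y = torch.ones(4, device="cuda") * 2
torch.cuda.synchronize()
```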

CC. ezyang rgommers ngimel mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59527

Reviewed By: albanD

Differential Revision: D29053062

Pulled By: ezyang

fbshipit-source-id: 661e7e58de98b1bdb7a0871808cd41d91fe8f13f
2021-06-14 13:46:11 -07:00
8e92a3a8b0 [docs] Add pickle security warning to package docs (#59959)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59959

**Summary**
This commit replaces the warning on the `torch.package` documentation
page about the module not being publicly released (which will no longer
be true as of 1.9) with one that warns about security issues caused by
the use of the `pickle` module.

**Test Plan**
1) Built the docs locally.
2) Continuous integration.

<img width="877" alt="Captura de Pantalla 2021-06-14 a la(s) 11 22 05 a  m" src="https://user-images.githubusercontent.com/4392003/121940300-c98cab00-cd02-11eb-99dc-08e29632079a.png">

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D29108429

Pulled By: SplitInfinity

fbshipit-source-id: 3a0aeac0dc804a31203bc5071efb1c5bd6ef9725
2021-06-14 13:03:05 -07:00
ef13341a8d upgrade onednn to v2.2.3 (#57928)
Summary:
This PR upgrades oneDNN to v2.2.3 (including the v2.2 and v2.2.3 changes), which has the following main CPU-related changes:

v2.2 changes:
Improved performance of compute functionality for future Intel Core processor with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
Improved fp32 inner product forward propagation performance for processors with Intel AVX-512 support.
Improved dnnl_gemm performance for cases with n=1 on all supported processors.

v2.2.3 changes:
Fixed a bug in the int8 depthwise convolution primitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support
Fixed correctness issue for PReLU primitive on Intel Processor Graphics
Fixed correctness issue in reorder for blocked layouts with zero padding
Improved performance of weights reorders used by BRGEMM-based convolution primitive for processors with Intel AVX-512 support

More changes can be found in https://github.com/oneapi-src/oneDNN/releases.

Ideep used version is pytorch-rls-v2.2.3.
OneDNN used version is v2.2.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57928

Reviewed By: bdhirsh

Differential Revision: D29037857

Pulled By: VitalyFedyunin

fbshipit-source-id: db74534858bdcf5d6c7dcf58e224fc756188bc31
2021-06-14 11:57:45 -07:00
061e71b199 Parametrizations depending on several inputs (#58488)
Summary:
Makes it possible for the first registered parametrization to depend on a number of parameters rather than just one. Examples of these types of parametrizations are `torch.nn.utils.weight_norm` and low-rank parametrizations via the multiplication of an `n x k` tensor by a `k x m` tensor with `k <= m, n`.

Follows the plan outlined in https://github.com/pytorch/pytorch/pull/33344#issuecomment-768574924. A short summary of the idea is: we call `right_inverse` when registering a parametrization to generate the tensors that we are going to save. If `right_inverse` returns a sequence of tensors, then we save them as `original0`, `original1`...  If it returns a `Tensor` or a sequence of length 1, we save it as `original`.

We only allow many-to-one parametrizations as the first parametrization registered; subsequent parametrizations need to be one-to-one.

There were a number of choices in the implementation:

If `right_inverse` returns a sequence of tensors, then we unpack it in the forward. This allows writing code such as:
```python
class Sum(nn.Module):
  def forward(self, X, Y):
    return X + Y

  def right_inverse(self, Z):
    return Z, torch.zeros_like(Z)
```
rather than having to manually unpack a list or a tuple within the `forward` function.
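
A hedged, self-contained usage sketch of this many-to-one case (assuming the `torch.nn.utils.parametrize.register_parametrization` API and the `original0`/`original1` naming described in this PR; not a definitive implementation):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize

class Sum(nn.Module):
    def forward(self, X, Y):
        return X + Y

    def right_inverse(self, Z):
        # Decompose the current weight into two tensors that sum back to it.
        return Z, torch.zeros_like(Z)

layer = nn.Linear(3, 3)
parametrize.register_parametrization(layer, "weight", Sum())
# The two tensors returned by right_inverse are stored as original0/original1,
# and layer.weight is recomputed as Sum()(original0, original1) on access.
print(layer.parametrizations.weight.original0.shape)  # torch.Size([3, 3])
print(layer.weight.shape)                             # torch.Size([3, 3])
```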

At the moment the errors are a bit all over the place. This is to avoid having to check some properties of `forward` and `right_inverse` when they are registered. I left this like this for now, but I believe it'd be better to call these functions when they are registered to make sure the invariants hold and throw errors as soon as possible.

The invariants are the following:
1. The following code should be well-formed
```python
X = module.weight
Y = param.right_inverse(X)
assert isinstance(Y, Tensor) or isinstance(Y, collections.Sequence)
Z = param(Y) if isinstance(Y, Tensor) else param(*Y)
```
in other words, if `Y` is a `Sequence` of `Tensor`s (we also check that the elements of the sequence are Tensors), then it has the same length as the number of parameters `param.forward` accepts.

2. Always: `X.dtype == Z.dtype and X.shape == Z.shape`. This is to protect the user from shooting themselves in the foot, as it's too odd for a parametrization to change the metadata of a tensor.
3. If it's one-to-one: `X.dtype == Y.dtype`. This is to be able to do `X.set_(Y)` so that if a user first instantiates the optimiser and then puts the parametrisation, then we reuse `X` and the user does not need to add a new parameter to the optimiser. Alas, this is not possible when the parametrisation is many-to-one. The current implementation of `spectral_norm` and `weight_norm` does not seem to care about this, so this would not be a regression. I left a warning in the documentation though, as this case is a bit tricky.

I still need to go over the formatting of the documentation; I'll do that tomorrow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58488

Reviewed By: soulitzer

Differential Revision: D29100708

Pulled By: albanD

fbshipit-source-id: b9e91f439cf6b5b54d5fa210ec97c889efb9da38
2021-06-14 11:11:47 -07:00
ab70e1e984 [TensorExpr] Add error checking in mem_arena (#59922)
Summary:
Gives an error message (rather than a segfault) if you forget `KernelScope()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59922

Reviewed By: bertmaher

Differential Revision: D29091303

Pulled By: jansel

fbshipit-source-id: a24ee2385cae1f210b0cbc3f8860948fc052b655
2021-06-14 10:37:32 -07:00
9ad0de3c6f Rework requires_grad on DifferentiableGraphOp (#57575)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57575

This PR does two things:

1. reverts "Manual revert of D27369251 (f88a3fff65) (#56080)" in commit
   92a09fb87a567100122b872613344d3a422abc9f.

2. fixes DifferentiableGraph outputs carrying the wrong requires_grad flag

To fix requires_grad on outputs from DifferentiableGraph, the proper flag is retrieved from profiling information. We previously only retrieved the profiling information from the first profile node among all its uses. However, when control flow is present, we need to iteratively search for a profile node with profiling information available, in case the first use is in an inactive code path.

e.g.
```
  graph(%0 : Tensor,
        %1 : Bool):
  ..., %2 : Tensor = prim::DifferentiableGraph_0(%0)
  %3 : Tensor = prim::If(%1)
    block0():
      %4 : Tensor = prim::DifferentiableGraph_1(%2)
      -> (%4)
    block1():
      %5 : Tensor = prim::DifferentiableGraph_2(%2)
      -> (%5)
  -> (%3)
with prim::DifferentiableGraph_0 = graph(%0 : Tensor):
  ...
  %out : Tensor = aten::operation(...)
  ...
  return (..., %out)
with prim::DifferentiableGraph_1 = graph(%0 : Tensor):
  %temp : Tensor = prim::profile[profiled_type=Tensor](%0)
  ...
with prim::DifferentiableGraph_2 = graph(%0 : Tensor):
  %temp : Tensor = prim::profile[profiled_type=Float(...)](%0)
  ...
```

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29038773

Pulled By: Krovatkin

fbshipit-source-id: 6c0a851119f6b8f2f1afae5c74532407aae238fe
2021-06-14 10:37:31 -07:00
1f7251df90 fixing DifferentiableGraphOp updating requires_grad on input tensor list; python test added to verify the test (#57574)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57574

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29038774

Pulled By: Krovatkin

fbshipit-source-id: cb342c1b04fa3713a8166b39213437bc9f2d8606
2021-06-14 10:36:26 -07:00
cyy
c50c77b444 remove unused variables (#59912)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59912

Reviewed By: soulitzer

Differential Revision: D29100518

Pulled By: albanD

fbshipit-source-id: b86a4aa9050e4fa70a0872c1d8799e5953cd2bc8
2021-06-14 10:33:48 -07:00
580a20f33b [reland] torch/lib/c10d: Use torch_check instead of throwing runtime_error (#59918)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59918

Reland of https://github.com/pytorch/pytorch/pull/59684
ghstack-source-id: 131303057

Test Plan: ci

Reviewed By: cbalioglu

Differential Revision: D29081452

fbshipit-source-id: 419df79341f702e796f7adf5f1071a6cd1dcd8d1
2021-06-14 09:52:54 -07:00
3d90c82a5c [TensorExpr] Python binding improvements (#59920)
Summary:
Some minor quality of life improvements for the NNC python bindings:
- expose `call_raw()`
- support passing integers to `call()` (for dynamic shapes)
- implicit conversions to cleanup `[BufferArg(x) for x in [A, B, C]]` into just `[A, B, C]`
- don't silently default to "ir_eval" for unknown mode (e.g. "LLVM")

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59920

Reviewed By: ZolotukhinM

Differential Revision: D29090904

Pulled By: jansel

fbshipit-source-id: 154ace82725ae2046cfe2e6eb324fd37f5d209a7
2021-06-14 09:31:40 -07:00
68d690ffbd Vectorize the softmax calculation when not along the last dim (#59195)
Summary:
Currently, if we do a softmax that is not along the last dim, the calculation falls back to a [scalar version](d417a094f3/aten/src/ATen/native/SoftMax.cpp (L14-L64)). We find that we actually have the chance to vectorize the calculation along the inner_size dim.

Changes we made:

- Use the vectorized softmax_kernel instead of host_softmax when not along the last dim.

Performance data on a 28-core Intel 8280 CPU when the input size is [32, 81, 15130] and softmax is computed along the second dim (81):

- FP32 Baseline: 24.67 ms
- FP32 optimized: 9.2 ms
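
For reference, the benchmarked call corresponds to something like the following (same shape as above; the vectorized kernel is picked automatically when the reduction dim is not the last one):

```python
import torch

x = torch.randn(32, 81, 15130)
y = torch.softmax(x, dim=1)  # softmax along a non-last dim
```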

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59195

Reviewed By: ailzhang

Differential Revision: D28854796

Pulled By: cpuhrsch

fbshipit-source-id: 18477acc3963754c59009b1794f080496ae16c3d
2021-06-14 07:54:11 -07:00
d60d81b5a7 Make PyObject_FastGetAttrString accept const char* (#59758)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59758

The underlying call to tp_getattr is const safe but CPython
has not fixed it due to BC problems.  No reason not to advertise
the better type here though!

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D29017911

Pulled By: ezyang

fbshipit-source-id: 8d55983fe6416c03eb69c6367bcc431c30000133
2021-06-14 07:24:16 -07:00
700add0737 Fix expecttest accept on Python 3.8 and later (#59709)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59709

Fixes #59705.

Python 3.8 fixed tracebacks to report the beginning of the line that raised an error, rather than the end. This allows for a simpler implementation (no more string reversing), but it needed to actually be implemented. This wasn't caught by tests because we hard-coded line numbers to do substitutions, so I also added a little smoke test to detect future changes to traceback line-number behavior.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28994919

Pulled By: ezyang

fbshipit-source-id: 1fb0a782e17c55c13d668fabd04766d2b3811962
2021-06-14 07:23:12 -07:00
cf38b20c61 Alias for digamma as psi to special namespace (#59143)
Summary:
See https://github.com/pytorch/pytorch/issues/50345
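
A minimal sketch of the alias (assuming the `torch.special.psi` name added here):

```python
import torch

x = torch.tensor([0.5, 1.0, 2.0])
print(torch.special.psi(x))   # same values as torch.digamma(x)
print(torch.digamma(x))
```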

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59143

Reviewed By: jbschlosser

Differential Revision: D28986909

Pulled By: mruberry

fbshipit-source-id: bc8ff0375de968f3662b224689fa0a6b117f9c4e
2021-06-14 03:05:14 -07:00
ff15d93b88 Improve numerical stability of GroupNorm (#54921)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54921

Improve numerical stability of GroupNorm

Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "GroupNorm"

Reviewed By: ngimel

Differential Revision: D27414438

fbshipit-source-id: 815517240ca5ea3e2beb77ced3bd862e9c83d445
2021-06-13 16:13:32 -07:00
095cd6a0da MemoryOverlap: Avoid has_storage calls (#59013)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59013

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29040929

Pulled By: ngimel

fbshipit-source-id: 69745e7abbaf523795a90f68cf01d3d94508210e
2021-06-13 12:31:22 -07:00
be038d8989 [CUDA graphs] Make stream semantics of backward calls consistent with other cuda ops (ci-all edition) (#57833)
Summary:
ci-all resubmit of https://github.com/pytorch/pytorch/pull/54227.

Tests look good except for a few distributed autograd failures (pytorch_linux_xenial_cuda10_2_cudnn7_py3_multigpu_test) and rocm failures (pr/pytorch-linux-bionic-rocm4.1-py3.6).

The common denominator in rocm failures appears to be multi-gpu activity: some [multiprocess DDP failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test1/8115/console), some [single-process failures](https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.1-py3.6-test2/8115/console) where the single process has autograd ops that span devices. jeffdaily jithunnair-amd sunway513, could one of you take a look? The streaming backward change is also beneficial to rocm, I expect.

For debugging rocm failures, I think we should ignore the multiprocess/DDP tests and focus on the single process cases. The root cause is probably the same and the single process cases are simpler.

----------------------------------

Update: Rocm failures are due to https://github.com/pytorch/pytorch/issues/59750.
2718a54032 is a workaround, to be updated once https://github.com/pytorch/pytorch/issues/59750 is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57833

Reviewed By: mruberry

Differential Revision: D28942391

Pulled By: ngimel

fbshipit-source-id: d6047e971c5f1c6386334bf3641402a92f12e2f8
2021-06-13 12:09:56 -07:00
92513038e8 Revert D28994140: [pytorch][PR] Implemented torch.cov
Test Plan: revert-hammer

Differential Revision:
D28994140 (23c232554b)

Original commit changeset: 1890166c0a9c

fbshipit-source-id: 73dfe1b00464e38f004f99960cdeeb604ed4b20a
2021-06-13 02:33:37 -07:00
0ceea7faf4 Refactor SavedVariable (#59836)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59836

Preparing for #58500

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29069159

fbshipit-source-id: dd4d870c8ae10a4bd7f12be127e093f60fa072fa
2021-06-12 23:21:36 -07:00
d03ff1a17d pre compute regex and match simple signature autograd codegen 15s -> 12s (#59852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59852

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D29063814

Pulled By: albanD

fbshipit-source-id: a751047526f8d58f4760ee6f9ae906675bed5d75
2021-06-12 06:58:36 -07:00
30a18fe318 refactor yaml loader import, no runtime change (#59850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59850

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063816

Pulled By: albanD

fbshipit-source-id: ca3067443d8e6282c1077d3dafa3b4f330d43b28
2021-06-12 06:58:34 -07:00
c60d1ac9cf Use C dumper if possible aten codegen 23s -> 13s (#59849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59849

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063815

Pulled By: albanD

fbshipit-source-id: c4baa72594bd2fe50ac67f513916f2b2ccb7488c
2021-06-12 06:58:32 -07:00
504ec30109 avoid error string formatting aten codegen 28s -> 23s (#59848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59848

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063818

Pulled By: albanD

fbshipit-source-id: c68734672eeacd212d7bd9bebe3d53aaa20c3c24
2021-06-12 06:58:31 -07:00
7143a6a189 Avoid unnecessary re-computation autograd codegen 21s -> 15s (#59847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59847

This whole stack does not change anything to the codegened code

Test Plan: Imported from OSS

Reviewed By: ailzhang

Differential Revision: D29063817

Pulled By: albanD

fbshipit-source-id: 284c3e057029b7a67f43a1b034bb30863bd68c71
2021-06-12 06:57:19 -07:00
1f6e39336f Simplify parametrizations.SpectralNorm and improve its initialization (#59564)
Summary:
Implements a number of changes discussed with soulitzer offline.
In particular:
- Initialise `u`, `v` in `__init__` rather than in `_update_vectors`
- Initialise `u`, `v` to some reasonable vectors by doing 15 power iterations at the start
- Simplify the code of `_reshape_weight_to_matrix` (and make it faster) by using `flatten`
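
A short usage sketch of the parametrization being changed (assuming the `torch.nn.utils.parametrizations.spectral_norm` entry point; with this PR, `u` and `v` are initialised with a few power iterations at registration time rather than lazily):

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

layer = spectral_norm(nn.Linear(5, 5))
print(layer.weight.shape)  # weight is recomputed from the parametrization on access
```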

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59564

Reviewed By: ailzhang

Differential Revision: D29066238

Pulled By: soulitzer

fbshipit-source-id: 6a58e39ddc7f2bf989ff44fb387ab408d4a1ce3d
2021-06-11 19:52:44 -07:00
10a3a3d363 Fix bad change in a CUDACachingAllocator loop (#59903)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59903

D29034650 (cf0c4ac258) probably breaks something because it changes a `for` loop on ~Line 1200 from `[size,max)` to `[0,max)`. This fixes that

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29081688

fbshipit-source-id: 21f08e3f244fc02cf97d137b3cc80d4378d17185
2021-06-11 18:20:07 -07:00
e49f0f4ffd Automated submodule update: FBGEMM (#59874)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ae8ad8fd04

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59874

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D29064980

fbshipit-source-id: 593f08361817fb771afcf2732f0f647d7c2c72c3
2021-06-11 17:50:40 -07:00
3529a48ebb Revert D28981326: torch/lib/c10d: Use torch_check instead of throwing runtime_error
Test Plan: revert-hammer

Differential Revision:
D28981326 (6ea6075002)

Original commit changeset: 264a7f787ea8

fbshipit-source-id: 75625b76dfbd0cbaf59705d621ef9e2d1677c482
2021-06-11 17:17:10 -07:00
f3218568ad optimize channels last for BatchNorm2d on CPU (#59286)
Summary:
Replacement of https://github.com/pytorch/pytorch/issues/48919.
Optimizes channels-last performance for BatchNorm2d on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59286

Reviewed By: bdhirsh

Differential Revision: D29008198

Pulled By: VitalyFedyunin

fbshipit-source-id: 8a7d020bd6a42ab5c21ffe788b79a22f4ec82ac0
2021-06-11 16:30:16 -07:00
864d129bae [quant][fx] Remove extra q-dq for weight bias in normalization ops (#59882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59882

Currently for normalization ops, the weight and bias arguments are treated as activation inputs which require observers.
This results in adding extra quant-dequant ops for the weight and bias inputs.

This PR adds support to skip observing the weight/bias inputs of norm operators, thus removing the redundant q-dq ops.

Quantized graph with F.layer_norm
Before this PR
```
def forward(self, x):
    _input_scale_0 = self._input_scale_0
    _input_zero_point_0 = self._input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, _input_scale_0, _input_zero_point_0, torch.quint8);  x = _input_scale_0 = _input_zero_point_0 = None
    scale = self.scale
    _input_scale_1 = self._input_scale_1
    _input_zero_point_1 = self._input_zero_point_1
    quantize_per_tensor_1 = torch.quantize_per_tensor(scale, _input_scale_1, _input_zero_point_1, torch.quint8);  scale = _input_scale_1 = _input_zero_point_1 = None
    bias = self.bias
    _input_scale_2 = self._input_scale_2
    _input_zero_point_2 = self._input_zero_point_2
    quantize_per_tensor_2 = torch.quantize_per_tensor(bias, _input_scale_2, _input_zero_point_2, torch.quint8);  bias = _input_scale_2 = _input_zero_point_2 = None
    _scale_0 = self._scale_0
    _zero_point_0 = self._zero_point_0
    dequantize = quantize_per_tensor_1.dequantize();  quantize_per_tensor_1 = None
    dequantize_1 = quantize_per_tensor_2.dequantize();  quantize_per_tensor_2 = None
    layer_norm = torch.ops.quantized.layer_norm(quantize_per_tensor, [2, 5, 5], weight = dequantize, bias = dequantize_1, eps = 1e-05, output_scale = _scale_0, output_zero_point = _zero_point_0);  quantize_per_tensor = dequantize = dequantize_1 = _scale_0 = _zero_point_0 = None
    dequantize_2 = layer_norm.dequantize();  layer_norm = None
    return dequantize_2
```
After
```
def forward(self, x):
    _input_scale_0 = self._input_scale_0
    _input_zero_point_0 = self._input_zero_point_0
    quantize_per_tensor = torch.quantize_per_tensor(x, _input_scale_0, _input_zero_point_0, torch.quint8);  x = _input_scale_0 = _input_zero_point_0 = None
    scale = self.scale
    bias = self.bias
    _scale_0 = self._scale_0
    _zero_point_0 = self._zero_point_0
    layer_norm = torch.ops.quantized.layer_norm(quantize_per_tensor, [2, 5, 5], weight = scale, bias = bias, eps = 1e-05, output_scale = _scale_0, output_zero_point = _zero_point_0);  quantize_per_tensor = scale = bias = _scale_0 = _zero_point_0 = None
    dequantize = layer_norm.dequantize();  layer_norm = None
    return dequantize
```

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_norm_weight_bias

Imported from OSS

Reviewed By: HDCharles, ailzhang

Differential Revision: D29068203

fbshipit-source-id: 24b5c38bbea5fd355d34522bfa654c9db18607da
2021-06-11 16:22:36 -07:00
60eb22e45e Build an -Wextra around c10 (#59853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59853

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29016682

fbshipit-source-id: f6c5f32464d57dbd60b59b5f9e2234ef2c39f1c1
2021-06-11 16:12:21 -07:00
e41bc31eb2 make --run-specified-test-case use --include (#59704)
Summary:
Instead of having specific logic to handle --run-specified-test-case, we provide the flag to override --include or bring-to-front with the SPECIFIED_TEST_CASES_FILE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59704

Reviewed By: janeyx99

Differential Revision: D29038425

Pulled By: walterddr

fbshipit-source-id: 803d3555813437c7f287a22f7704106b0c609919
2021-06-11 13:57:13 -07:00
cf0c4ac258 Fix some issues in CUDACachingAllocator (#59819)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59819

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29034650

fbshipit-source-id: 7e9689fc1ae121432e9421fa4a9ae00f7f78caca
2021-06-11 13:15:27 -07:00
b83ac0cc4e [nnc] Added a check to vectorize only those loops that are normalized. (#59423)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59423

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886979

Pulled By: navahgar

fbshipit-source-id: edfc61feaf5efe22d4f367ac718b83b3d0f47cb3
2021-06-11 12:03:34 -07:00
30e24b2d2b [nnc] Modified vectorize API to return bool (#59422)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59422

Test Plan: Imported from OSS

Reviewed By: huiguoo

Differential Revision: D28886980

Pulled By: navahgar

fbshipit-source-id: 58cc3ecd86564a312a132f8260d836b096505095
2021-06-11 12:02:19 -07:00
a9e136a61e Remove ci/no-build (#59889)
Summary:
This reverts https://github.com/pytorch/pytorch/issues/58778, since triggering our primary CircleCI workflow only via pytorch-probot has been causing more problems than it's worth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59889

Reviewed By: walterddr, seemethere

Differential Revision: D29070418

Pulled By: samestep

fbshipit-source-id: 0b47121b190c2e9efa27f38000ca362e634876dc
2021-06-11 11:55:56 -07:00
f4fdc49957 [NNC] Add python bindings for loopnest.compress_buffer (#59681)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59681

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28981573

Pulled By: huiguoo

fbshipit-source-id: 003d66df576903c71bf46c95851fe6ccbba76f29
2021-06-11 11:28:39 -07:00
ee3025f734 Give clearer lint error messages (#59876)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59876

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29067747

Pulled By: samestep

fbshipit-source-id: cce7195467b5f9286d55a9d0c1655b4f92d4fbaf
2021-06-11 11:25:42 -07:00
6ea6075002 torch/lib/c10d: Use torch_check instead of throwing runtime_error (#59684)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59684

Same reasoning as in the below diff.
ghstack-source-id: 131167212

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981326

fbshipit-source-id: 264a7f787ea8be76f743a2eaca67ae1d3bd8073a
2021-06-11 11:16:58 -07:00
d433a55c94 Replace throw std::runtime_error with torch_check in torch/csrc/distributed (#59683)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59683

Replaces usages of throw std::runtime_error("foo") with the better
torch_check(false, "foo") which allows C++ stacktraces to show up when
TORCH_SHOW_CPP_STACKTRACES=1. This will hopefully provide much better debugging
information when debugging crashes/flaky tests.
ghstack-source-id: 131167210

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28981327

fbshipit-source-id: 677f569e28600263cab18759eb1b282e0391aa7b
2021-06-11 11:15:49 -07:00
9cdbddb3f7 Fix Vectorize<float>::trunc on ARM platform (#59858)
Summary:
Use `vrndq_f32`, which corresponds to the `VRINTZ` instruction; it rounds floating-point values towards zero, matching `std::trunc` behaviour.
This makes the trunc implementation correct even for values that fit into float32 but cannot be converted to int32, for example `-1.0e+20`; see the following [gist](https://gist.github.com/malfet/c612c9f4b3b5681ca1b2a69930825871):
```
inp= 3.1 2.7 -2.9 -1e+20
old_trunc= 3 2 -2 -2.14748e+09
new_trunc= 3 2 -2 -1e+20
```

Fixes `test_reference_numerics_hard_trunc_cpu_float32` on M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59858

Reviewed By: kimishpatel

Differential Revision: D29052008

Pulled By: malfet

fbshipit-source-id: 6b567f39151538be1aa3890e3b4e1e978e598657
2021-06-11 10:55:45 -07:00
2ce21b2e61 [Pytorch backend delegation] Preprocess to accept (#58873)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58873

BackendDebugInfoRecorder

Prior to this PR:
In order to generate debug handles corresponding to the graph being lowered, the backend's preprocess will call generate_debug_handles and will get a map of Node*-to-debug_handles. To facilitate this, to_backend will own a BackendDebugInfoRecorder and initialize a thread-local pointer to it. The generate_debug_handles function will query the thread-local pointer to see if there is a valid BackendDebugInfoRecorder for the context. If there is, it will generate debug handles.

After this PR:
The signature of preprocess is changed such that backends have to register a preprocess function that accepts an instance of BackendDebugInfoRecorder by reference. generate_debug_handles is no longer a free function; it becomes part of the API of BackendDebugInfoRecorder. A backend's preprocess function will now call generate_debug_handles on the BackendDebugInfoRecorder instead of the free function.

Reason for this change:
With RAII that initializes a thread-local pointer, the contract with backends is loose, which may result in backends not storing debug information. Making it part of the API means backends have to be aware of BackendDebugInfoRecorder and must explicitly choose not to generate/store debug information if they wish to do so.

Test Plan:
backend tests

Imported from OSS

Reviewed By: jbschlosser, raziel

Differential Revision: D28648613

fbshipit-source-id: c9b7e7bf0f78e87023ea7bc08612cf893b08cb98
2021-06-11 10:16:00 -07:00
23c232554b Implemented torch.cov (#58311)
Summary:
Based on https://github.com/pytorch/pytorch/pull/50466

Adds the initial implementation of `torch.cov`, similar to `numpy.cov`. For simplicity, we removed support for many parameters of `numpy.cov` that are either redundant, such as `bias`, or have simple workarounds, such as `y` and `rowvar`.
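
A minimal sketch of the added API (rows are variables, columns are observations, mirroring `numpy.cov`):

```python
import torch

x = torch.randn(3, 100)      # 3 variables, 100 observations each
print(torch.cov(x).shape)    # torch.Size([3, 3])
```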

cc PandaBoi

TODO

- [x] Improve documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58311

Reviewed By: mruberry

Differential Revision: D28994140

Pulled By: heitorschueroff

fbshipit-source-id: 1890166c0a9c01e0a536acd91571cd704d632f44
2021-06-11 09:40:50 -07:00
ba09355b12 Upgrade Windows CI Python to 3.8 (#59729)
Summary:
Python 3.6 reaches EOL at the end of this year; we should use a newer Python in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59729

Reviewed By: bdhirsh

Differential Revision: D29006807

Pulled By: janeyx99

fbshipit-source-id: c79214b02a72656058ba5d199141f8838212b3b6
2021-06-11 09:09:24 -07:00
d75e99b709 fx quant: enable qconfig_dict to target function invocations by order (#59605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59605

Enables targeting of individual function invocations by execution order.
For example, given a module such as

```
class M1(torch.nn.Module):
  def forward(self, x):
    x = torch.add(x, x)
    x = torch.add(x, x)
    return x

class M2(torch.nn.Module):
  def __init__(self):
    self.m1 = M1()

  def forward(self, x):
    x = self.m1(x)
    return x
```

We can now target the first add of `m1` with

```
qconfig_dict = {
  "module_name_function_order": ("m1", torch.add, 0, custom_qconfig),
}
```

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_qconfig_module_name_function_order
```

Imported from OSS

Reviewed By: hx89

Differential Revision: D28951077

fbshipit-source-id: 311d423724a31193d4fa4bbf3a712b46464b5a29
2021-06-11 08:53:40 -07:00
e6110d4d5d Fix input_buffer check if inplace update is valid (#59817)
Summary:
Fixes an issue introduced in  https://github.com/pytorch/pytorch/issues/17182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59817

Reviewed By: bdhirsh

Differential Revision: D29040738

Pulled By: albanD

fbshipit-source-id: 67fd4e9fa0dadf507ddd954d20e119d8781c4de0
2021-06-11 07:29:03 -07:00
c9e4d1372f Add guards for USE_C10D_FOO in relevant c10d files (#59697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59697

The c10d build process selectively adds files based on the `USE_C10D_FOO` flags (where `FOO` is one of `GLOO`, `NCCL` or `MPI`). Replicating this logic inside libtorch will be harder, since libtorch uses a simpler approach (i.e., it lists the files in `build_variables.bzl`). So instead we could always include all files, and "disable" each file as needed using `#ifdef`s. Note that this is not a new approach: we already do the same for all the files of the TensorPipe agent based on the flag `USE_TENSORPIPE`.
ghstack-source-id: 131169540

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28987577

fbshipit-source-id: 4c6195de4e9a58101dad9379537e8d055dfd38af
2021-06-11 05:06:42 -07:00
773b56e719 Fix Windows guards in c10d (#59696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59696

Some files in c10d refer to dist autograd. However, on Windows, dist autograd isn't built. Hence we need to "mask out" those references under Windows. This was already partly done, but when moving c10d to libtorch some issues came up, possibly due to the different way in which linking happens. Hence I masked out the remaining references.
ghstack-source-id: 131169541

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28987579

fbshipit-source-id: c29c5330f8429d699554972d30f99a89b2e3971d
2021-06-11 05:06:40 -07:00
cbcae46fa5 Remove USE_CUDA from c10d reducer/logger (#59562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59562

Needed to merge c10d into libtorch(_cuda).

ghstack-source-id: 131169542

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28931378

fbshipit-source-id: 71376b862ff6ef7dbfa7331ec8d269bd3fcc7e0d
2021-06-11 05:06:39 -07:00
b4c35d7ae7 Remove USE_CUDA from ProcessGroupGloo (#59561)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59561

Needed to merge c10d into libtorch(_cuda).

ghstack-source-id: 131169544

Test Plan: CI

Reviewed By: agolynski

Differential Revision: D28931379

fbshipit-source-id: 9bd68477ae7bb870b6737a555edd5696149ff5d6
2021-06-11 05:05:31 -07:00
b5e832111e [nnc] Limit the number of inputs to a fusion group.
Summary:
nvrtc has a hard limit on the size of kernel parameters, and llvm has
a tendency to OOM with huge parameter lists, so let's limit the number of
inputs to something sensible.

Test Plan:
tested on pyper OOM test case:
```
flow-cli test-locally --mode=opt-split-dwarf f278102738 --name "PyPer OOM repro f277966799 f63b1f9c5c0c" --run-as-secure-group oncall_pytorch_jit --entitlement default
```

Reviewed By: ZolotukhinM

Differential Revision: D29019751

fbshipit-source-id: b27f2bb5000e31a7b49ea86a6928faa0ae2ead24
2021-06-11 02:25:16 -07:00
df759a3d9e [nnc] Do not fuse matmul/conv2d if inputs are discontiguous. (#59754)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59754

Also, if inputs are contiguous, use their Placeholders
directly rather than generating contiguous Tensors from them.

The rationale for this change is that aten::matmul and aten::conv2d
support transposed inputs; if NNC generates a physical transpose to
perform an external call, performance will be strictly worse than not
fusing (sometimes dramatically so, as in the attached benchmark).

Test Plan: benchmark

Reviewed By: ZolotukhinM

Differential Revision: D29010209

fbshipit-source-id: da6d71b155c83e8d6e306089042b6b0af8f80900
2021-06-11 02:23:47 -07:00
4b91355232 [ONNX] remove raw export type (#59160)
Summary:
[ONNX] remove raw export type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59160

Reviewed By: tugsbayasgalan

Differential Revision: D28937039

Pulled By: SplitInfinity

fbshipit-source-id: 79bf91605526aa32a7304e75f50fe55d872bd4e8
2021-06-11 00:08:06 -07:00
2112074f25 [Static Runtime] Add schema check to several aten ops (#59603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59603

D28698997 (10345010f7) was reverted because I forgot to replace the
```
  VLOG(1) << "Found schema mismatch";
  n->schema().dump();
```
block in `aten::clamp_min` with `LogAndDumpSchema(n)`, which led the bazel build to fail. I don't know why it breaks the bazel build, though.

Test Plan: OSS CI.

Reviewed By: ajyu

Differential Revision: D28950177

fbshipit-source-id: 9bb1c6619e6b68415a3349f04933c2fcd24cc9a2
2021-06-10 23:39:00 -07:00
6eabbea47c Disable cuDNN persistent RNN on A30 (#59830)
Summary:
https://github.com/pytorch/pytorch/issues/59829

Cherry-picked from ptrblck's change. CC ngimel xwang233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59830

Reviewed By: bdhirsh

Differential Revision: D29046145

Pulled By: ngimel

fbshipit-source-id: 270ab3bb6c1c7c759497a15eb38b20a177c94adb
2021-06-10 22:07:28 -07:00
455afdf974 Automated submodule update: FBGEMM (#59715)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59715

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 0520ad5f95

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59687

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jianyuh

Differential Revision: D28986238

Pulled By: jspark1105

fbshipit-source-id: 12f68830b5b7a858fbc301af50593281852af51f
2021-06-10 21:53:30 -07:00
c7890b4a8e [package] doc string cleanup extravaganza (#59843)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59843

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D29049342

Pulled By: Lilyjjo

fbshipit-source-id: 3330fb439f28dda0cafef5797ff61311f4afbf76
2021-06-10 21:21:48 -07:00
54bfd41a2e Fix torch.angle on aarch64 (#59832)
Summary:
angle should return 0 for positive values, pi for negative ones, and keep NaNs in place, which can be accomplished using two blendv functions.
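
A quick check of that behaviour from Python:

```python
import torch

x = torch.tensor([1.0, -2.0, float("nan")])
print(torch.angle(x))  # tensor([0.0000, 3.1416,    nan])
```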

Fixes number of unary test failures on M1/aarch64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59832

Reviewed By: kimishpatel

Differential Revision: D29046402

Pulled By: malfet

fbshipit-source-id: cb93ad2de140f7a54796387fc11053c507a1d4e9
2021-06-10 20:48:41 -07:00
4025f95a20 [docs] Add table of contents to torch.package docs (#59842)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59842

Test Plan:
Continuous integration.

<img width="544" alt="Captura de Pantalla 2021-06-10 a la(s) 5 13 07 p  m" src="https://user-images.githubusercontent.com/4392003/121612390-2ccec280-ca0f-11eb-87ad-fef632ba05ca.png">

Reviewed By: Lilyjjo

Differential Revision: D29050627

Pulled By: SplitInfinity

fbshipit-source-id: 76c25ed4002cbaf072036e2e14e7857c15077df7
2021-06-10 19:52:50 -07:00
0e222db087 [docs] Add explanation section to torch.package docs (#59833)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59833

**Summary**
This commit adds an explanation section to the `torch.package`
documentation. This section clarifies and illuminates various aspects of
the internals of `torch.package` that might be of interest to users.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050626

Pulled By: SplitInfinity

fbshipit-source-id: 78e0cda00f69506ef2dfc52d6df63694b502269e
2021-06-10 19:52:48 -07:00
062dde7285 [docs] Add "how do I" section to torch.package docs (#59503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59503

**Summary**
This commit adds a "how do I..." section to the `torch.package`
documentation. This section contains short guides about how to solve
real-world problems that frequently recur while using `torch.package`.

**Test Plan**
Continuous integration.

<img width="877" alt="Captura de Pantalla 2021-06-04 a la(s) 9 19 54 p  m" src="https://user-images.githubusercontent.com/4392003/120879911-98321380-c57b-11eb-8664-c582c92b7837.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050629

Pulled By: SplitInfinity

fbshipit-source-id: 2b7800732e0a3c1c947f110c05562aed5174a87f
2021-06-10 19:52:47 -07:00
6a18ca7a07 [docs] Add tutorials section to torch.package docs (#59499)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59499

**Summary**
This commit adds a tutorials section to the torch.package docs.

**Test Plan**
Continuous integration.

<img width="870" alt="Captura de Pantalla 2021-06-04 a la(s) 5 10 31 p  m" src="https://user-images.githubusercontent.com/4392003/120874257-b9ced300-c55a-11eb-84dd-721cb7ac73ab.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050628

Pulled By: SplitInfinity

fbshipit-source-id: c17ab0100a9d63e7af8da7a618143cedbd0a5872
2021-06-10 19:52:45 -07:00
a3db8e0a26 [docs] Add torch.package documentation preamble (#59491)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59491

**Summary**
This commit adds a preamble to the `torch.package` documentation page
that explains briefly what `torch.package` is.

**Test Plan**
Continuous integration.

<img width="881" alt="Captura de Pantalla 2021-06-04 a la(s) 3 57 01 p  m" src="https://user-images.githubusercontent.com/4392003/120872203-d535e000-c552-11eb-841d-b38df19bc992.png">

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D29050630

Pulled By: SplitInfinity

fbshipit-source-id: 70a3fd43f076751c6ea83be3ead291686c641158
2021-06-10 19:51:37 -07:00
a524ee00ca Forward AD formulas batch 3 (#59711)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59711

This is the exact same PR as before.
This was reverted before because the PR below was faulty.

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28995762

Pulled By: albanD

fbshipit-source-id: 65940ad93bced9b5f97106709d603d1cd7260812
2021-06-10 19:30:02 -07:00
8a7c0d082f ger is an alias to outer, not the other way around (#59710)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59710

This is the exact same PR as before.
The version that landed was actually outdated compared to the github PR and that's why it failed on master... Sorry for the noise.
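
For context, the aliasing relationship as seen from Python (a minimal sketch; the PR only changes which function is defined in terms of the other):

```py
import torch

a = torch.arange(1.0, 4.0)   # [1., 2., 3.]
b = torch.arange(1.0, 3.0)   # [1., 2.]
# torch.ger is the (deprecated) alias; torch.outer is the primary name.
assert torch.equal(torch.ger(a, b), torch.outer(a, b))
```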

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28995764

Pulled By: albanD

fbshipit-source-id: 8f7ae3356a886d45787c5e6ca53a4e7b033e306e
2021-06-10 19:28:53 -07:00
c2c35c0170 [Binary] Link whole CuDNN for CUDA-11.1 (#59802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59802

Reviewed By: driazati, seemethere

Differential Revision: D29033537

Pulled By: malfet

fbshipit-source-id: e816fc71f273ae0b4ba8a0621d5368a2078561a1
2021-06-10 16:54:53 -07:00
60ba451731 [torch] Remove using directive from header (#59728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59728

I noticed Sandcastle jobs failing with:

```
fbcode/caffe2/torch/csrc/api/include/torch/nn/modules/rnn.h:19:35: error: using namespace directive in global context in header [-Werror,-Wheader-hygiene]
using namespace torch::nn::utils::rnn;
```

(cf. V3 of D28939167 or https://www.internalfb.com/intern/sandcastle/job/36028797455955174/).

Removing `using namespace ...` fixes the problem.

~~... also applied code formatting ...~~

Test Plan: Sandcastle

Reviewed By: jbschlosser

Differential Revision: D29000888

fbshipit-source-id: 10917426828fc0c82b982da435ce891dc2bb6eec
2021-06-10 15:13:07 -07:00
e9e9291dc1 [After fix] Reuse constant and bump bytecode to v5 (#59722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59722

Reintroduce sharing constants between bytecode and TorchScript (same as #58629) after the fix in #59642

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29002345

Pulled By: cccclai

fbshipit-source-id: d9c8e474ff57d0509580183206df038a24ad27e3
2021-06-10 15:03:16 -07:00
ac6b5beade [torch][segment_reduce] Add support for mean reduction (cpu) (#59521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59521

This diff adds support for mean reduction on CPU (forward + backward).

A CUDA implementation will be added in a subsequent PR. We are using "cub::DeviceSegmentedReduce" for the other aggregations and are investigating how to support mean with it; otherwise we will write a custom kernel.

Next Steps:
- cuda support for mean
- 2d data input support
- more testing
- benchmarking

Test Plan: updated unit test. Still relying on manual data for ease of debugging. Will add more tests that covers edge cases once major features are complete.

Reviewed By: ngimel

Differential Revision: D28922547

fbshipit-source-id: 2fad53bbad2cce714808ff95759cbdbd45bb4ce6
2021-06-10 14:21:31 -07:00
e71db0bb82 .jenkins: Ignore exit code of nvidia-smi (#59826)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59826

It's only informational and will run on Windows CPU executors as well

Fixes issues found in https://github.com/pytorch/pytorch/runs/2797531966

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29042951

Pulled By: seemethere

fbshipit-source-id: 862094e53417c0a59d7728bf680be37b806b5a6f
2021-06-10 14:16:32 -07:00
e7ad82eb2f [DataLoader] Add option to refine type during runtime validation for DP instance (#56066)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56066

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D27776646

Pulled By: ejguan

fbshipit-source-id: 695ff7775177653d809c5917d938c706281e1298
2021-06-10 14:04:24 -07:00
e2c784d940 [reland] .github: Add Windows GPU workflow (#58782) (#59752)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59752

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29009775

Pulled By: seemethere

fbshipit-source-id: 5be1b818b5653a4fdbfe4a79731317068dc1a5d1
2021-06-10 13:38:32 -07:00
54cc477ea3 .github: Ensure cleaner windows workspace (#59742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59742

It looks like Windows workers were failing due to some leftovers from previous builds; this should hopefully remedy some of those errors.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29009076

Pulled By: seemethere

fbshipit-source-id: 426d54df14ec580cb24b818c48e2f4bd36159181
2021-06-10 13:37:22 -07:00
0099c25b85 fx quant: remove some dead code in observer insertion (redo) (#59799)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59799

This is a redo of #58574, easier to create a new PR than to fix rebase
conflicts, as there have been a large number of refactors to the
underlying code.

Removes some code which was incorrectly added by #57519 but never
actually used for anything.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D29031955

fbshipit-source-id: f407d181070cb283382965952821e3647c705544
2021-06-10 12:57:09 -07:00
fb620a27d0 [WIP] Add slow gradcheck build for the ci/slow-gradcheck label (#59020)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59020

Reviewed By: bdhirsh

Differential Revision: D29036891

Pulled By: albanD

fbshipit-source-id: b1f87b2cb38642097ad4079d1e818fa5997bedb4
2021-06-10 12:29:57 -07:00
cc32dcadd9 Fix Error when run python setup.py install again on Windows (#59689)
Summary:
Fix https://github.com/pytorch/pytorch/issues/59688

For now, .build.ninja should always be removed before rebuilding the source code on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59689

Reviewed By: bdhirsh

Differential Revision: D29032960

Pulled By: walterddr

fbshipit-source-id: 2b8162cd119820d3b6d8715745ec29b9c381e01f
2021-06-10 12:22:21 -07:00
1fc3576d97 Fixing and enabling tests that check fake_quant matches quant+dequant (#59095)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59095

These tests were disabled; I'm unsure why. I've re-enabled them and reworked them to expand testing to different devices and dtypes.

Test Plan:
python test/test_quantization.py TestFakeQuantizeOps.test_numerical_consistency

Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29018745

fbshipit-source-id: 28188f32bafd1f1704c00ba49d09ed719dd1aeb2
2021-06-10 12:16:54 -07:00
c90260905f [fix] torch.{lin, log}space(): properly examine passed dtype (#53685)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53685

Reviewed By: jbschlosser

Differential Revision: D28331863

Pulled By: anjali411

fbshipit-source-id: e89359b607d058158cfa1c9a82389d9a4a71185b
2021-06-10 11:59:54 -07:00
9bcef86d18 Split slow gradcheck periodic CI job so that it does not time out (#59736)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59736

Reviewed By: albanD

Differential Revision: D29008100

Pulled By: soulitzer

fbshipit-source-id: 76da971356fd985dfbfa56d3573f31ef04701773
2021-06-10 11:32:36 -07:00
f240624080 displays graph node's info (#59679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59679

Displays info about graph's nodes

Test Plan:
Expected view:

%wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	i2: int {1}
	o0: Tensor CPUFloatType {32, 50}
%wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	o0: Tensor CPUFloatType {32, 50}
%wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
	i0: Tensor CPUFloatType {32, 50}
	i1: double {0}
	i2: double {10}
	o0: Tensor CPUFloatType {32, 50}
%user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: int {1}
	i2: int {2}
	o0: Tensor CPUFloatType {32, 32, 1}
%dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: Tensor CPUFloatType {32, 32, 1}
	o0: Tensor CPUFloatType {32, 1, 1}
%31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
	i0: Tensor CPUFloatType {32, 1, 1}
	i1: int {1}
	i2: int {-1}
	o0: Tensor CPUFloatType {32, 1}
%19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
	i0: Tensor CPUFloatType {32, 1}
	i1: Tensor CPUFloatType {32, 50}
	o0: TensorList {2}
%input.1 : Tensor = aten::cat(%19, %4)
	i0: TensorList {2}
	i1: int {1}
	o0: Tensor CPUFloatType {32, 51}
%fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
	i0: Tensor CPUFloatType {1}
	i1: Tensor CPUFloatType {32, 51}
	i2: Tensor CPUFloatType {51, 1}
	i3: int {1}
	i4: int {1}
	o0: Tensor CPUFloatType {32, 1}
%23 : Tensor = aten::sigmoid(%fc1.1)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tensor CPUFloatType {32, 1}
%24 : (Tensor) = prim::TupleConstruct(%23)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tuple {1}

Reviewed By: hlu1

Differential Revision: D28592852

fbshipit-source-id: 09174014f7d0ce25c511025d2b376f14e16c8a4a
2021-06-10 10:33:30 -07:00
7af9252ed7 [skip ci] export_slow_tests.py - Add option to ignore small differences (#59759)
Summary:
This lowers the number of unnecessary commits to pytorch/test-infra by only exporting a new stats file when the stats vary enough. With the --ignore-small-diffs option enabled, if the slow test cases we gather from S3 are the same and their times differ only trivially, we do not export a different stats file.

Instead, we export the stats already in test-infra, so that when it tries to commit, it sees an empty commit and does not add to the git history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59759

Test Plan: Run `python tools/export_slow_tests.py --ignore-small-diffs <threshold>`.

Reviewed By: walterddr

Differential Revision: D29032712

Pulled By: janeyx99

fbshipit-source-id: 41d522a4c5f710e776acd1512d41be9791d0cf63
2021-06-10 09:44:33 -07:00
51d954e8e4 Link ATEN tests with OpenMP runtime (#59733)
Summary:
Even if OpenMP extensions are supported by the compiler, the OpenMP runtime library is not always implicitly added as a dependency by the linker.
This fixes linker problems on Apple M1 when libomp.dylib is installed via conda and tests that directly use OpenMP pragmas fail to link with the following errors:
```
/Library/Developer/CommandLineTools/usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -Wno-deprecated-declarations -DUSE_PTHREADPOOL -Xpreprocessor -fopenmp -I/Users/nshulga/miniforge3/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-typedef-redefinition -Wno-unknown-warning-option -Wno-unused-private-field -Wno-inconsistent-missing-override -Wno-aligned-allocation-unavailable -Wno-c++14-extensions -Wno-constexpr-not-const -Wno-missing-braces -Qunused-arguments -fcolor-diagnostics -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-unused-private-field -Wno-missing-braces -Wno-c++14-extensions -Wno-constexpr-not-const -O3 -DNDEBUG -DNDEBUG -arch arm64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX11.3.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names -rdynamic caffe2/CMakeFiles/test_parallel.dir/__/aten/src/ATen/test/test_parallel.cpp.o -o bin/test_parallel  -Wl,-rpath,/Users/nshulga/git/pytorch/build/lib  lib/libgtest_main.a  lib/libtorch.dylib  lib/libtorch_cpu.dylib  lib/libprotobuf.a  lib/libc10.dylib  lib/libgtest.a && :
Undefined symbols for architecture arm64:
  "___kmpc_fork_call", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
  "_omp_get_max_threads", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
  "_omp_get_num_threads", referenced from:
      _.omp_outlined. in test_parallel.cpp.o
      _.omp_outlined..31 in test_parallel.cpp.o
  "_omp_get_thread_num", referenced from:
      _.omp_outlined. in test_parallel.cpp.o
      _.omp_outlined..31 in test_parallel.cpp.o
  "_omp_in_parallel", referenced from:
      TestParallel_NestedParallel_Test::TestBody() in test_parallel.cpp.o
      TestParallel_Exceptions_Test::TestBody() in test_parallel.cpp.o
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59733

Reviewed By: walterddr, seemethere

Differential Revision: D29005511

Pulled By: malfet

fbshipit-source-id: daab5e1b0a58d9b60a8992ef40c743e4b619dac7
2021-06-10 09:41:24 -07:00
4f79270b89 [PyTorch ] Thread parallel bmm across batch dim (#59596)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59596

Parallelize batch matmul across batch dim. This was found to improve perf for
some usecases on mobile.
ghstack-source-id: 130989569

Test Plan: CI unit tests

Reviewed By: albanD

Differential Revision: D26833417

fbshipit-source-id: 9b84d89d29883a6c9d992d993844dd31a25f76b1
2021-06-10 08:25:40 -07:00
3176f16691 [Pytorch benchmark] Add BMM benchmark (#59595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59595

ghstack-source-id: 130946743

Test Plan: bmm_test

Reviewed By: mingzhe09088

Differential Revision: D28873228

fbshipit-source-id: 6e4cb04bb6c63f5f68d8f23c13738e2d58ab499c
2021-06-10 08:24:29 -07:00
58412740ae Added doc for torch.einsum sublist format (#57038)
Summary:
Adds documentation for the new sublist format for `torch.einsum`
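
A short example of the sublist format being documented (a sketch; see the rendered docs for the authoritative description):

```py
import torch

x, y = torch.randn(2, 3), torch.randn(3, 4)
# Sublist format: operands interleaved with integer index lists, plus an optional
# output index list as the last argument.
out = torch.einsum(x, [0, 1], y, [1, 2], [0, 2])
assert torch.allclose(out, torch.einsum("ab,bc->ac", x, y))
```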

closes https://github.com/pytorch/pytorch/issues/21412

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57038

Reviewed By: mruberry

Differential Revision: D28994431

Pulled By: heitorschueroff

fbshipit-source-id: 3dfb154fe6e4c440ac67c2dd92727bb5ecfe289e
2021-06-10 08:01:56 -07:00
5e3e504728 Update TensorPipe submodule (#59789)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59789

The bot messed up in D28867855 (96651458eb) so I've got to do it manually.

Test Plan: CI

Reviewed By: beauby

Differential Revision: D29027901

fbshipit-source-id: 9438e0cfbe932fbbd1e252ab57e2b1b23f9e44cf
2021-06-10 06:36:46 -07:00
96651458eb Automated submodule update: tensorpipe (#59374)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: e942ea1513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59374

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D28867855

fbshipit-source-id: e1325046003f5c546f02024ff4c427c91721cd7e
2021-06-10 04:41:02 -07:00
0d7d316dc1 [fx ir] Support lists and dicts in FX IR GraphDrawer (#58775)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58775

Reviewed By: RoshanPAN

Differential Revision: D28613939

fbshipit-source-id: 4164e2dd772b59240ea3907001fe4ebddb003060
2021-06-10 01:55:53 -07:00
e7cccc23b9 Add query and synchronize to c10::Stream (#59560)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59560

`at::cuda::CUDAStream` has the `query` and `synchronize` methods, but `c10::Stream` does not, and I couldn't find any generic way to accomplish this. Hence I added helpers to do this to the DeviceGuardImpl interface, and then defined these methods on `c10::Stream`. (I had to do it out-of-line to circumvent a circular dependency).
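
For reference, the Python-level `torch.cuda.Stream` already exposes the same pair of operations; a small sketch of the intended semantics (this assumes a CUDA device is available and is not the new C++ API itself):

```py
import torch

s = torch.cuda.Stream()
with torch.cuda.stream(s):
    y = torch.ones(1 << 20, device="cuda") * 2
print(s.query())   # True if all work queued on the stream has completed, else False
s.synchronize()    # block the calling host thread until the stream's work is done
```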
ghstack-source-id: 130932249

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28931377

fbshipit-source-id: cd0c19cf021e305d0c0cf9af364afb445d010248
2021-06-10 01:42:40 -07:00
f11120967e Support EnumerableShardingSpec in ShardedTensor. (#59061)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59061

Overall Design: https://github.com/pytorch/pytorch/issues/55207

This PR builds upon https://github.com/pytorch/pytorch/pull/58517 and
https://github.com/pytorch/pytorch/pull/57409 to support creating a
ShardedTensor using EnumerableShardingSpec.
ghstack-source-id: 130780376

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28734551

fbshipit-source-id: 656f5f2b22041dae071bc475f19fe94c969716e8
2021-06-09 23:21:14 -07:00
48ea7c808d [C10d] Support subgroups (#59111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59111

Create a util function for initializing subgroups. By default, each subgroup contains all the ranks within a machine. This util function can be used by both local SGD and SyncBatchNorm optimization.

Additionally, clang format `distributed/__init__.py` after importing `_rank_not_in_group` which is used by the unit test, and also clang format `distributed_c10d.py`.

Note that this API does not accept another overall main group. Like the APEX API `create_syncbn_process_group` [here](https://nvidia.github.io/apex/_modules/apex/parallel.html), it always uses the global world size and should only be applied when CUDA is available.
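
A hedged usage sketch of the new utility (the exact signature and return value are assumptions inferred from the tests listed in the Test Plan):

```py
import torch
import torch.distributed as dist

# Assumes torch.distributed is already initialized with a NCCL backend and one GPU
# per rank; new_subgroups() is assumed to return the current rank's subgroup plus
# the list of all subgroups.
cur_subgroup, subgroups = dist.new_subgroups()  # by default, one subgroup per machine
t = torch.ones(1, device="cuda")
dist.all_reduce(t, group=cur_subgroup)          # e.g. intra-machine averaging for local SGD
```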

#Closes: https://github.com/pytorch/pytorch/issues/53962
ghstack-source-id: 130975027

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_group_size_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_world_size_not_divisible_by_group_size

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_by_enumeration_input_rank_exceeds_world_size
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_new_subgroups_overlap_not_allowed

Reviewed By: rohan-varma

Differential Revision: D28495672

fbshipit-source-id: fdcc405411dd409634eb51806ee0a320d1ecd4e0
2021-06-09 22:35:11 -07:00
fc0582ee95 [c10d] Use TORCH_CHECK for monitored barrier error (#59667)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59667

Use TORCH_CHECK instead of throwing std::runtime_error in the monitored barrier so that it works with TORCH_SHOW_CPP_STACKTRACES to reveal the entire call stack where the monitored barrier failed, which can help determine where a particular rank encountered an issue.
ghstack-source-id: 130993689

Test Plan: CI

Reviewed By: cbalioglu

Differential Revision: D28974510

fbshipit-source-id: 6a6958995c1066cddcd647ca88c74473079b69fc
2021-06-09 22:31:33 -07:00
12b9e99e0d Bump the bytecode reading version kMaxSupportedBytecodeVersion to 6 (#59714)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59714

Bytecode v6 uses implicit operator versioning through the number of specified arguments. Both the read and write code paths are available. This PR enables reading v6 models. The default writing format is not changed yet and will be bumped in a later PR.

Test: CI.
Local: change the writing version to 6 temporarily and run the unit tests in LiteInterpreterTest. There are a number of end-to-end tests that write v6 bytecode, then read and run it.

Test Plan: Imported from OSS

Reviewed By: raziel, cccclai

Differential Revision: D29007538

Pulled By: iseeyuan

fbshipit-source-id: cb089d5d4c5b26c5b5cd3a5e0954e8c7c4c69aac
2021-06-09 21:58:31 -07:00
3c6ae6f181 [OSS CI][iOS] Use LibTorch-Lite.h for nightly builds (#59762)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59762

Test Plan: Imported from OSS

Reviewed By: cccclai

Differential Revision: D29018267

Pulled By: xta0

fbshipit-source-id: 10213a6811b4e6b33bd13355a7a7af85d21d48d4
2021-06-09 21:38:32 -07:00
a62f6b6d04 ci: Add skipIfOnGHA util (#59748)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59748

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D29008217

Pulled By: seemethere

fbshipit-source-id: ffc2f7935df722f26c1252e3833085430ada7433
2021-06-09 21:19:26 -07:00
1ea5c19c19 Add USE_WHOLE_CUDNN option (#59744)
Summary:
It is only enabled if USE_STATIC_CUDNN is enabled

Next step after https://github.com/pytorch/pytorch/pull/59721 towards resolving fast kernels stripping reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59744

Reviewed By: seemethere, ngimel

Differential Revision: D29007314

Pulled By: malfet

fbshipit-source-id: 7091e299c0c6cc2a8aa82fbf49312cecf3bb861a
2021-06-09 21:12:42 -07:00
bb19dc14cc add channels last support for AvgPool2d on CPU (#58725)
Summary:
replacement of: https://github.com/pytorch/pytorch/pull/48918

Enables the AvgPool2d channels-last test case for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58725

Reviewed By: ngimel

Differential Revision: D28593169

Pulled By: VitalyFedyunin

fbshipit-source-id: 5de870fe1d9dd961fb0dab5f9d531ab14614a160
2021-06-09 21:06:45 -07:00
52b2ed65c0 Revert D29007258: Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D29007258

Original commit changeset: c15f51661641

fbshipit-source-id: 98236153136a5c6b6c2911079b7bd214da6cb424
2021-06-09 21:02:56 -07:00
827e00c914 Update Kineto to fix fd leak (#59755)
Summary:
Update to commit containing pytorch/kineto#281
Fixes https://github.com/pytorch/pytorch/issues/59746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59755

Reviewed By: seemethere, ngimel

Differential Revision: D29011069

Pulled By: malfet

fbshipit-source-id: 4c7b09ce5d497634f9927c330713c7404d892912
2021-06-09 20:47:04 -07:00
a4e0368c99 Comment on tests reliance on ZeRO's partitioning algo (#59713)
Summary:
Addresses https://github.com/pytorch/pytorch/issues/59548

**Overview:**
Recently, we changed ZeRO's partitioning algorithm to first sort the parameters by decreasing size and then greedily allocate to shards. See [here](ea1de87f4b).

The current tests `test_sharding()` and `test_add_param_group()` check for a uniform partitioning, which is not achieved with the old naive greedy partitioning algorithm for general world sizes but is achieved with the new sorted-greedy algorithm. This reliance is not ideal, but for now, we opt to simply add comments to document the dependency.
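
For illustration, a tiny standalone sketch of the sorted-greedy idea described above (an assumption-laden toy, not ZeRO's actual code): sort parameters by decreasing size, then assign each one to the currently least-loaded shard.

```py
def partition(sizes, world_size):
    shards = [[] for _ in range(world_size)]
    totals = [0] * world_size
    for idx, size in sorted(enumerate(sizes), key=lambda kv: kv[1], reverse=True):
        shard = totals.index(min(totals))   # least-loaded shard so far
        shards[shard].append(idx)
        totals[shard] += size
    return shards

# e.g. with two parameters of each size and world_size=2, every shard gets one 9, 7, 5 and 3
print(partition([9, 9, 7, 7, 5, 5, 3, 3], world_size=2))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```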

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59713

Test Plan:
I tested for world sizes of 1, 2, 3, and 4 via the AI AWS cluster:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_sharding

srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=1 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_add_param_group
```
However, because the train queue (which offers instances with 8 GPUs) is not working at the moment, I was unable to test for world sizes of 5+. Nonetheless, I believe that they should still work.

First, consider `test_sharding()`. Given the sorted-greedy algorithm, each shard will be assigned one of the parameters with size `9`, then one of the parameters with size `7`, then `5`, and finally `3`. Hence, each will have a uniform partition. Now, consider `test_add_param_group()`. Similarly, the same allocation behavior occurs, only the last shard is not assigned the final parameter with size `3` to begin. However, after adding the new `param_group` with the parameter with size `3`, a re-partitioning occurs. The first `param_group` is partitioned as before, and the parameter with size `3` in the new `param_group` is assigned to the last shard since it has the minimal total size. Thus, in the end, all shards have a uniform partition.

Reviewed By: mrshenli

Differential Revision: D28996460

Pulled By: andwgu

fbshipit-source-id: 22bdc638d8569ed9a20836812eac046d628d6df2
2021-06-09 19:56:28 -07:00
25179ecb63 [caffe2] Fix verbose templated signed/unsigned comparison warning (#59578)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59578

This is verbose warning formed from one `CAFFE_ENFORCE_GT()` check:
```
third-party\toolchains\vs2017_15.9\buildtools\vc\tools\msvc\14.16.27023\include\xstddef(271): warning C4018: '>': signed/unsigned mismatch
xplat\caffe2\c10\util\logging.h(208): note: see reference to function template instantiation 'bool std::greater<void>::operator ()<const T1&,const T2&>(_Ty1,_Ty2) const' being compiled
        with
        [
            T1=int,
            T2=unsigned int,
            _Ty1=const int &,
            _Ty2=const unsigned int &
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(539): note: see reference to function template instantiation 'void c10::enforce_detail::enforceThatImpl<std::greater<void>,int,unsigned int,>(Pred,const T1 &,const T2 &,const char *,int,const char *,const void *)' being compiled
        with
        [
            Pred=std::greater<void>,
            T1=int,
            T2=unsigned int
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(536): note: while compiling class template member function 'std::vector<caffe2::TensorShape,std::allocator<_Ty>> caffe2::ConvPoolOpBase<caffe2::CPUContext>::TensorInferenceForSchema(const caffe2::OperatorDef &,const std::vector<_Ty,std::allocator<_Ty>> &,int)'
        with
        [
            _Ty=caffe2::TensorShape
        ]
xplat\caffe2\caffe2\operators\conv_pool_op_base.h(631): note: see reference to function template instantiation 'std::vector<caffe2::TensorShape,std::allocator<_Ty>> caffe2::ConvPoolOpBase<caffe2::CPUContext>::TensorInferenceForSchema(const caffe2::OperatorDef &,const std::vector<_Ty,std::allocator<_Ty>> &,int)' being compiled
        with
        [
            _Ty=caffe2::TensorShape
        ]
xplat\caffe2\caffe2\operators\pool_op.cc(1053): note: see reference to class template instantiation 'caffe2::ConvPoolOpBase<caffe2::CPUContext>' being compiled
xplat\caffe2\c10\core\memoryformat.h(63): note: see reference to class template instantiation 'c10::ArrayRef<int64_t>' being compiled
```
Use a signed `0` because `.dims_size()` returns a signed integer.

Test Plan: Confirm warning no longer present in Windows build logs

Reviewed By: simpkins

Differential Revision: D28941905

fbshipit-source-id: acdc1281df2fe7f30b14cfad917cbbe8f3336d29
2021-06-09 19:48:29 -07:00
b0fd3ca542 [sparse] Add the AO namespace to torch (#58703)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58703

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28970962

Pulled By: z-a-f

fbshipit-source-id: 0d0f62111a0883af4143a933292dfaaf8fae220d
2021-06-09 19:47:21 -07:00
3dfb94c17c Construct a -Wall around Torch (#59668)
Summary:
Removes unused variables and functions and performs other minor modifications sufficient to introduce `-Wall` as a default build flag. This should enhance code safety going forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59668

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28974453

fbshipit-source-id: 011c720dd6e65fdbbd87aa90bf57d67bfef32216
2021-06-09 19:42:43 -07:00
fa030d1213 [DataPipes] Add simple unbatch to DataPipe (#59610)
Summary:
Implements the simple unbatch feature for DataPipe https://github.com/pytorch/pytorch/issues/58148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59610

Reviewed By: VitalyFedyunin

Differential Revision: D28994180

Pulled By: NivekT

fbshipit-source-id: 4bafe6e26c4f95a808c489b147369413a196fa1c
2021-06-09 16:53:31 -07:00
2f395f3b54 [reland] Document debugability features in torch.distributed (#59726)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59726

Reland of https://github.com/pytorch/pytorch/pull/59604 with indentation fix
ghstack-source-id: 130979356

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D29001923

fbshipit-source-id: 225d9dc5054c223b453f3b39749e2b62f61b9a2c
2021-06-09 16:40:11 -07:00
c5bee1ec4f [PyTorch] Parallelize gelu via tensoriterator (#58950)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58950

Use tensor iterator's API to set grain size in order to parallelize gelu op.
ghstack-source-id: 130947174

Test Plan: test_gelu

Reviewed By: ezyang

Differential Revision: D28689819

fbshipit-source-id: 0a02066d47a4d9648323c5ec27d7e0e91f4c303a
2021-06-09 16:09:38 -07:00
8b63573c31 [PyTorch Operator Benchmark] gelu benchmark (#59334)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59334

Add gelu op benchmark
ghstack-source-id: 130947172

Test Plan: gelu_test

Reviewed By: hl475

Differential Revision: D28842959

fbshipit-source-id: 93e23e027a488412488ecf22335d7d915f6cc3b4
2021-06-09 16:09:37 -07:00
874e7b889d [PyTorch] Expose interface to set grain size on tensor iterator (#58949)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58949

To parallelize ops, the grain-size setting is exposed at the for_each level. That is too deep in the stack: cpu_kernel_vec does not know what the op is, and you would want to parallelize depending on the op type. Non-trivial ops can benefit from threads even when the number of elements in the tensor is not high. This API exposes the grain-size setting at the tensor-iterator level so that the operator creating the iterator has control over it.
ghstack-source-id: 130947175

Test Plan: CI + will add more test

Reviewed By: ezyang

Differential Revision: D26857523

fbshipit-source-id: 09fc2953061069967caa9c78b010cb1b68fcc6c9
2021-06-09 16:08:25 -07:00
1735775662 [Torch] Cast timestamp type to int (#59712)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59712

When a worker process fails in fb due to a signal failure, the TerminationHandler writes an error reply file. Recently the error reply file was changed for MAST jobs: the JSON value of ``timestamp`` is a string, even though in the thrift struct it is an int: https://fburl.com/diffusion/upa228u5

This diff adds support for casting the str timestamp to an int.
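
A hedged illustration of the idea only (not the actual TerminationHandler code; the helper name is made up):

```py
def _coerce_timestamp(value):
    # Accept the timestamp as either an int or a decimal string and return an int.
    return int(value) if isinstance(value, str) else value

assert _coerce_timestamp("1623280000") == _coerce_timestamp(1623280000) == 1623280000
```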

Test Plan: buck test mode/dev-nosan //caffe2/test/distributed/elastic/multiprocessing/errors:api_test

Reviewed By: suphoff

Differential Revision: D28995827

fbshipit-source-id: 333448cfb4d062dc7fe751ef5839e66bfcb3ba00
2021-06-09 15:56:37 -07:00
44c442293f [torch/elastic] Fix the edge case when no node is alive (#59663)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59663

This PR fixes an edge case bug in `DynamicRendezvousHandler` where the state of the rendezvous is not always entirely updated when one or more nodes are not alive anymore.

Test Plan: Run the existing and newly-introduced unit tests.

Reviewed By: tierex

Differential Revision: D28971809

fbshipit-source-id: ebbb6a5f2b04f045c3732d6cf0f8fdc7c2381a7c
2021-06-09 15:31:50 -07:00
0fa3db5594 Fix subgradient for element-wise max and min (#59669)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59669

Fixes #56734

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28975531

fbshipit-source-id: 4e774dc8c6e095bc66962ce2411466de3880c2d3
2021-06-09 15:21:45 -07:00
e3d75b8475 irange for PyTorch sans jit (#59481)
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.

Generated with D28874212.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28909681

fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
2021-06-09 14:46:11 -07:00
804f924504 Fix accuraccy failures when running test_nn on A100s (#59624)
Summary:
Make sure tests that explicitly run without TF32 don't use TF32 operations.

Fixes https://github.com/pytorch/pytorch/issues/52278

After the TF32 accuracy tolerance was increased to 0.05, this is the only remaining change required to fix the above issue (for TestNN.test_Conv3d_1x1x1_no_bias_cuda).
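
For context, these are the standard global switches for running without TF32 on Ampere GPUs such as the A100 (a sketch for orientation, not necessarily the exact mechanism the test suite uses):

```py
import torch

torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
```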

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59624

Reviewed By: heitorschueroff

Differential Revision: D28996279

Pulled By: ngimel

fbshipit-source-id: 7f1b165fd52cfa0898a89190055b7a4b0985573a
2021-06-09 14:38:34 -07:00
47e286d024 Merge c10d elastic agent tests into local_elastic_agent_test.py file (#59657)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59657

Introduce tests that exercise the elastic agent with the c10d and etcd-v2 rendezvous backends.
Added a port allocation method that uses sockets to find an available port for the c10d backend. This way, agents that are created will all share the specified address/port and can communicate.
Added a method that abstracts the backend to use when running a test. This way, any test can quickly be switched to run on the backend of choice (c10d, etcd, or etcd-v2).

Test Plan: Tests various components of the elastic agent with 3 different backends: etcd, etcd-v2, and c10d.

Reviewed By: tierex

Differential Revision: D28972604

fbshipit-source-id: fd4cff6417fefdf0de9d7a114820914b968006a8
2021-06-09 14:28:59 -07:00
13a2025469 Delete empty caffe2/quantization/CMakeLists.txt (#59717)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59717

Reviewed By: walterddr

Differential Revision: D28997598

Pulled By: malfet

fbshipit-source-id: ef2654577c0784254f3d74bc340cdabc76fa345c
2021-06-09 14:20:33 -07:00
171142f9cc Revert D28926135: [pytorch][PR] Refactor Foreach Tests: Unary Functions
Test Plan: revert-hammer

Differential Revision:
D28926135 (0897df18a3)

Original commit changeset: 4eb21dcebbff

fbshipit-source-id: c15f51661641f455ae265cdf048051a3c01198f9
2021-06-09 14:05:56 -07:00
9bb5663979 Use commit stats from viable/strict instead of nightlies for sharding (#59727)
Summary:
Currently, not all of CI runs on nightlies, so it's better to use viable/strict.

For example, current 11.1 test jobs do not get to use automatic sharding because of the lack of stats: https://app.circleci.com/jobs/github/pytorch/pytorch/14010983?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-build-link

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59727

Reviewed By: heitorschueroff

Differential Revision: D29004910

Pulled By: janeyx99

fbshipit-source-id: eb0c54a7e7947decba8134a1d67e4b0434151a06
2021-06-09 13:52:15 -07:00
8845cbabf0 [CMake] Split caffe2::cudnn into public and private (#59721)
Summary:
This is only important for builds where cuDNN is linked statically into libtorch_cpu.
Before this PR, PyTorch wheels often accidentally contained several partial copies of the cudnn_static library.
Splitting the interface into header-only (cudnn-public) and library+headers (cudnn-private) prevents that from happening.
This is a preliminary step towards optionally linking the whole cudnn library to work around the issue reported in https://github.com/pytorch/pytorch/issues/50153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59721

Reviewed By: ngimel

Differential Revision: D29000967

Pulled By: malfet

fbshipit-source-id: f054df92b265e9494076ab16c247427b39da9336
2021-06-09 13:18:48 -07:00
c738c13304 Fix typo in checkpoint docs (#59646)
Summary:
This small typo caused this valuable piece of information to be excluded from the docs.

<img width="876" alt="image" src="https://user-images.githubusercontent.com/8812459/121240517-47f2d400-c84f-11eb-9288-23c551c1591a.png">

The last "warning" is missing a second ":", so it doesn't render in the docs {emoji:1f447}

<img width="875" alt="image" src="https://user-images.githubusercontent.com/8812459/121240467-39a4b800-c84f-11eb-9dd6-ec26754c43d3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59646

Reviewed By: mruberry

Differential Revision: D28972541

Pulled By: jbschlosser

fbshipit-source-id: d10c6688d8db4d4ec4b02858a4c7b352365219c0
2021-06-09 12:48:18 -07:00
51af772937 [jit] Set debug name for value coming out of GetAttr nodes. (#59123)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59123

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28766023

fbshipit-source-id: 0919f4318fb5a7b1d5adc8f976dfc9309e233d13
2021-06-09 12:24:55 -07:00
bbd58d5c32 fix :attr: rendering in F.kl_div (#59636)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59636

Fixes #57538

Test Plan:
Rebuilt docs to verify the fix:

{F623235643}

Reviewed By: zou3519

Differential Revision: D28964825

fbshipit-source-id: 275c7f70e69eda15a807e1774fd852d94bf02864
2021-06-09 12:20:14 -07:00
e385be7611 .circleci: Disable pytorch_windows_test_multigpu (#59725)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59725

These are failing on CircleCI with no apparent debug messages, see https://github.com/pytorch/pytorch/issues/59724

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D29001353

Pulled By: seemethere

fbshipit-source-id: ce0f4fbcfc7918824f6bad47b922d914eeb2f5a6
2021-06-09 12:12:13 -07:00
f8bb7e2f7c Magma isn't needed in cpu build (#59619)
Summary:
Fix incorrect logic in the Windows CPU build script: VERSION_SUFFIX shouldn't be set to cpu.

https://github.com/pytorch/pytorch/pull/59618/checks?check_run_id=2771591019
![image](https://user-images.githubusercontent.com/16190118/121158840-3f18f700-c87d-11eb-9c03-277856afb1b2.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59619

Reviewed By: samestep

Differential Revision: D29000213

Pulled By: seemethere

fbshipit-source-id: fcc474967e281fbf9be69f14c0aedfd01820573f
2021-06-09 12:06:33 -07:00
ed3884c3e9 Fix timeout with ZeRO test_step() and test_step_with_closure() (#59648)
Summary:
Partially fixes https://github.com/pytorch/pytorch/issues/59548

**Overview:**
This fixes the timeout issues with `test_step()` and `test_step_with_closure()` for the `ZeroRedundancyOptimizer`.

The existing tests partially assumed a `world_size` of `2` (hence [this](https://github.com/pytorch/pytorch/pull/59622) seems to be a temporary fix). This change avoids baking in that assumption and allows `world_size` to be flexible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59648

Test Plan:
I tested with 2, 3, and 4 GPUs (and hence `world_size`s of 2, 3, and 4, respectively) via the AI AWS cluster.
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step

srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=2 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=3 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python test/distributed/optim/test_zero_redundancy_optimizer.py -- TestZeroRedundancyOptimizerDistributed.test_step_with_closure
```

Reviewed By: jbschlosser

Differential Revision: D28975035

Pulled By: andwgu

fbshipit-source-id: 2cbaf6a35e22a95e19fc97e1b64e585e452e774e
2021-06-09 12:03:05 -07:00
61965abad7 Move _PartialWrapper to module scope (#59660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59660

Context https://github.com/pytorch/pytorch/issues/57352

Test Plan: Pytorch CI tests

Reviewed By: vkuzo

Differential Revision: D28972991

fbshipit-source-id: efc9dd3e90e18e1cdf27d5ef0f168abd8169bc42
2021-06-09 11:55:04 -07:00
0f6bd550a4 Revert D28981443: reland D28645531: .github: Add Windows GPU workflow
Test Plan: revert-hammer

Differential Revision:
D28981443 (21121675b3)

Original commit changeset: 5d24cccfb8c8

fbshipit-source-id: 14e5b610978882bace2f834f61e5457f62b11290
2021-06-09 11:43:10 -07:00
167477329d [Reland] adding base commit to scribe report (#59677)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/59570.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59677

Reviewed By: janeyx99

Differential Revision: D28980356

Pulled By: walterddr

fbshipit-source-id: 9c4671d18ce00fda98d774d1b2aa556662ecddfe
2021-06-09 11:06:01 -07:00
d42e6c7f70 Clang format distributed_test.py (#59693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59693

ghstack-source-id: 130931133

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D28987619

fbshipit-source-id: 3681cc262b889653615ec64da8c23c96cc0d997b
2021-06-09 10:58:48 -07:00
68f74966fc [ttk] Store float64 in tensorboard instead of float32 (#59435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59435

Sometimes we need to compare 10+ digits. Currently tensorboard only saves float32. This provides an option to save float64.

Reviewed By: yuguo68

Differential Revision: D28856352

fbshipit-source-id: 05d12e6f79b6237b3497b376d6665c9c38e03cf7
2021-06-09 10:42:37 -07:00
3271853912 hold references to storages during TorchScript serialization (#59642)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59642

Test Plan: Imported from OSS

Reviewed By: jbschlosser, cccclai

Differential Revision: D28968947

Pulled By: Lilyjjo

fbshipit-source-id: 0046da8adb3a29fb108965a1d2201749fe2d0b41
2021-06-09 10:12:07 -07:00
21121675b3 reland D28645531: .github: Add Windows GPU workflow (#59678)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59678

This reverts commit 2956bbaf2388d424ef986c22fac8287f7c345978.

Reland of https://github.com/pytorch/pytorch/pull/58782

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28981443

Pulled By: seemethere

fbshipit-source-id: 5d24cccfb8c87832fa0233d0b524575dc04f8f05
2021-06-09 09:51:29 -07:00
0897df18a3 Refactor Foreach Tests: Unary Functions (#58960)
Summary:
Related issue: https://github.com/pytorch/pytorch/issues/58833

__changes__
- slowpath tests: pass tensors of every dtype & device and compare the behavior with regular functions, including inplace
- check the number of cudaLaunchKernel calls
- rename `ForeachUnaryFuncInfo` -> `ForeachFuncInfo`: this change is mainly for the future binary/pointwise test refactors

cc: ngimel ptrblck mcarilli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58960

Reviewed By: ejguan

Differential Revision: D28926135

Pulled By: ngimel

fbshipit-source-id: 4eb21dcebbffffaf79259e31961626e0707fb8d1
2021-06-09 09:45:16 -07:00
62583e51a5 [reland] Add a ci/no-build label (#58778)
Summary:
Depends on https://github.com/pytorch/pytorch-probot/pull/22. Adds a new label called `ci/no-build` that disables the CircleCI `build` workflow on PRs. The current behavior should be the same in the absence of `ci/no-build`.

Specifically, after this PR lands, for anyone who isn't rebased onto the latest `master`, I believe this will happen:
- when they push to their PR, the CircleCI app triggers CI
- the `pytorch-probot` app sees that their PR doesn't have the `ci/no-build` tag, so it also triggers CI
- the latter should auto-cancel the former

After checking with https://github.com/pytorch/pytorch/issues/59087, it looks like this would cause the "errored" number to go up and then go down as Circle jobs are canceled (saying "Your CircleCI tests were canceled") and then restarted:

<img width="868" alt="Screen Shot 2021-05-27 at 12 39 20 PM" src="https://user-images.githubusercontent.com/8246041/119887123-9667b080-bee8-11eb-8acb-e1967899c9d5.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58778

Reviewed By: malfet

Differential Revision: D28995335

Pulled By: samestep

fbshipit-source-id: 8d7543b911e4bbbeef14639baf9d9108110b97c8
2021-06-09 09:05:44 -07:00
b844fd11ee Allow tools/test_history.py to be piped to head (#59676)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59676

Test Plan:
```
tools/test_history.py --mode=columns --ref=3cf783 --test=test_set_dir --job pytorch_linux_xenial_py3_6_gcc5_4_test --job pytorch_linux_xenial_py3_6_gcc7_test | head -n10
```
Before this PR, the above command seems to just hang. After this PR, it nicely prints the following, line by line, and then exits:
```
2021-02-10 12:18:50Z 3cf78395cbc32fa9c83b585c9ec63f960b32d17f    0.644s    0.312s
2021-02-10 11:13:34Z 594a66d778a660faed0b0fbbe1dd8c2c318707ff    0.360s  errored
2021-02-10 10:13:25Z 9c0caf0384690cb67dcccb7066ece5184f72ca78    0.819s    0.449s
2021-02-10 10:09:14Z 602434bcbebb82c6f3741b2a3d5ebac7ee482268    0.361s    0.454s
2021-02-10 10:09:10Z 2e35fe953553247d8a22fc38b039374e426f13b8
2021-02-10 10:09:07Z ff73be7e45616fe106b9e5040bc091ca5cdbfc7f
2021-02-10 10:05:39Z 74082f0d6f8dfd60f28c0de0fe43bcb97b95ee5a
2021-02-10 07:42:29Z 0620c96fd6a140e68c49d68ed14721b1ee108ecc    0.414s    0.377s (2 job re-runs omitted)
2021-02-10 07:27:53Z 33afb5f19f4e427f099653139ae45b661b8bc596    0.381s    0.294s
2021-02-10 07:05:15Z 5f9fb93c1423814a20007faa506ceb8b4828c8d1    0.461s    0.361s
```

Reviewed By: seemethere

Differential Revision: D28978017

Pulled By: samestep

fbshipit-source-id: 021e634bbf40eb1d3b131fac574343dd5cef5deb
2021-06-09 08:42:05 -07:00
26beda8ed5 [BE] unsupported backward failing on single sample (#59455)
Summary:
Echo on https://github.com/pytorch/pytorch/pull/58260#discussion_r637467625

Similar to `test_unsupported_dtype`, which only checks the exception raised on the first sample, we should do the same for unsupported_backward. The goal of both tests is to remind developers to
1. add a new dtype to the support list if it is fully runnable without failure (over all samples)
2. replace the skip mechanism, which will indefinitely ignore tests without warning

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59455

Test Plan: CI.

Reviewed By: mruberry

Differential Revision: D28927169

Pulled By: walterddr

fbshipit-source-id: 2993649fc17a925fa331e27c8ccdd9b24dd22c20
2021-06-09 08:17:03 -07:00
12b4e8996f [DataLoader] Add nesting_level argument to map and filter (#59498)
Summary:
This PR makes the .map and .filter APIs of IterDataPipe sensitive to the nesting_level argument.

[DataPipes] Make .map of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58145
[DataPipes] Make .filter of DataPipe sensitive to nested_level argument https://github.com/pytorch/pytorch/issues/58147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59498

Reviewed By: ejguan

Differential Revision: D28964280

Pulled By: NivekT

fbshipit-source-id: b1ee6cafa3953093ebd7bf30eacc80c3ef7cd190
2021-06-09 07:40:53 -07:00
2693b0bef3 Fix compile error when debugging (#59616)
Summary:
Signed-off-by: caozhong <zhong.z.cao@intel.com>

Triggered this probably because of my full-debug build of Python. ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59616

Reviewed By: jbschlosser

Differential Revision: D28958685

Pulled By: albanD

fbshipit-source-id: fdab622c4d1be93eb27e9006dcf3db7c5b44a04b
2021-06-09 06:34:06 -07:00
f1786b293d Revert D28972444: [pytorch][PR] Document debugability features in torch.distributed
Test Plan: revert-hammer

Differential Revision:
D28972444 (a9d2810817)

Original commit changeset: da5e8ee84f0d

fbshipit-source-id: 94d3b3b75ddec74ea5b2b76f6a7519dc921ee2a7
2021-06-09 03:04:36 -07:00
a56c89a160 Revert D28918331: [pytorch][PR] Automated submodule update: FBGEMM
Test Plan: revert-hammer

Differential Revision:
D28918331 (cc840cf544)

Original commit changeset: def60efe5584

fbshipit-source-id: 88101feb87ebfbd38cf10b45d09af309e9759852
2021-06-09 01:36:06 -07:00
a9d2810817 Document debugability features in torch.distributed (#59604)
Summary:
Adds comprehensive documentation around debugability features added to `torch.distributed` recently, including the `monitored_barrier` and TORCH_DISTRIBUTED_DEBUG env variable.

![dist_one](https://user-images.githubusercontent.com/8039770/121102672-0f052180-c7b3-11eb-974c-81dbbe102cb6.png)
![dist_two](https://user-images.githubusercontent.com/8039770/121102734-39ef7580-c7b3-11eb-94f7-c75469351440.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59604

Reviewed By: jbschlosser, SciPioneer

Differential Revision: D28972444

Pulled By: rohan-varma

fbshipit-source-id: da5e8ee84f0d6f252c703c4d70ff2a0d5817cc4e
2021-06-08 23:52:19 -07:00
daa35141e8 Reland: "[TensorExpr] Fix handling of 0-dim tensors." (#59508)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59508

An assert that was triggering in a previous version is now relaxed to
take 0-dim tensors into account.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918342

Pulled By: ZolotukhinM

fbshipit-source-id: c09b62c9725d1603b0ec11fcc051e7c932af06ae
2021-06-08 22:48:17 -07:00
9f9904969f Reland: "[TensorExpr] Fix printing of Bool dtype." (#59507)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59507

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918344

Pulled By: ZolotukhinM

fbshipit-source-id: b75aa9f316e4f3f648130a3171a35bfbbf1f397d
2021-06-08 22:48:16 -07:00
0b6ec32004 Reland: "[TensorExpr] Improve debug messages." (#59506)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59506

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28918343

Pulled By: ZolotukhinM

fbshipit-source-id: 168703f6368f5182cf9762600d7f0f6ea5b20280
2021-06-08 22:47:06 -07:00
04986b909f [package] Add docstring for PackageExporter.intern (#59602)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59602

**Summary**
This commit adds a docstring for `PackageExporter.intern`.

**Test Plan**
Continuous integration.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28972939

Pulled By: SplitInfinity

fbshipit-source-id: 1765541aa2ed88e01beb48c08b90f56df3a591b7
2021-06-08 19:53:36 -07:00
f52e202840 Add warning when accessing Tensor::grad() in the C++ API (#59362)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35379

 - Adds  `retains_grad` attribute backed by cpp as a native function. The python bindings for the function are skipped to be consistent with `is_leaf`.
   - Tried writing it without native function, but the jit test `test_tensor_properties` seems to require that it be a native function (or alternatively maybe it could also work if we manually add a prim implementation?).
 - Python API now uses `retain_grad` implementation from cpp
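
A small Python-side illustration of the new attribute (a sketch, assuming a build that includes this change):

```py
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2                       # non-leaf tensor: its grad is not retained by default
y.retain_grad()
y.sum().backward()
print(y.retains_grad, y.grad)   # True tensor([1., 1., 1.])
```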

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59362

Reviewed By: jbschlosser

Differential Revision: D28969298

Pulled By: soulitzer

fbshipit-source-id: 335f2be50b9fb870cd35dc72f7dadd6c8666cc02
2021-06-08 19:43:21 -07:00
90303157ab Enable complex dtypes for coo_sparse-coo_sparse matmul [CPU] (#59554)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59554

This PR enables complex numbers supports for matrix-matrix
multiplication of COO sparse matrices.

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28968309

Pulled By: anjali411

fbshipit-source-id: 4fd471e76a5584366aabc86c08b4564667ee54ca
2021-06-08 19:34:41 -07:00
b386ed6f9b Fix some compiler warnings (#59643)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59643

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D28916206

fbshipit-source-id: 4f6c8e0faeb76848f5951ff85db7c9da7fe9bf54
2021-06-08 18:22:57 -07:00
02d380450d [FX][docs][EZ] Fix link to fuser example (#59670)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59670

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D28975704

Pulled By: jamesr66a

fbshipit-source-id: 2fb759224b5b1ecc62c0ab26563d2a35ed422794
2021-06-08 17:32:55 -07:00
1733d10399 Warn when backward() is called with create_graph=True (#59412)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/4661
- Add warnings in engine's `execute` function so it can be triggered through both cpp and python codepaths
- Adds an RAII guard version of `c10::Warning::set_warnAlways` and replaces all prior usages of the set_warnAlways with the new one
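
A minimal sketch of the behavior being added (assuming a build that includes this change):

```py
import torch

x = torch.ones(3, requires_grad=True)
loss = (x * x).sum()
loss.backward(create_graph=True)   # now emits a warning about potential reference cycles

# The commonly recommended alternative computes higher-order grads without
# accumulating into the leaves:
(grad_x,) = torch.autograd.grad((x * x).sum(), x, create_graph=True)
```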

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59412

Reviewed By: jbschlosser

Differential Revision: D28969294

Pulled By: soulitzer

fbshipit-source-id: b03369c926a3be18ce1cf363b39edd82a14245f0
2021-06-08 17:19:04 -07:00
82466e0605 Revert D28900487: ger is an alias to outer, not the other way around
Test Plan: revert-hammer

Differential Revision:
D28900487 (4512d75063)

Original commit changeset: e9065c5b2907

fbshipit-source-id: 712c05d2fba28c83958ef760290e1e08c147a907
2021-06-08 17:09:15 -07:00
cc840cf544 Automated submodule update: FBGEMM (#59505)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 77a4792062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59505

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: dskhudia

Differential Revision: D28918331

fbshipit-source-id: def60efe55843023e70b94726cde1faf6857be0b
2021-06-08 17:03:26 -07:00
2956bbaf23 Revert D28645531: .github: Add Windows GPU workflow
Test Plan: revert-hammer

Differential Revision:
D28645531 (51884c6479)

Original commit changeset: 6ed1a2dead9c

fbshipit-source-id: e082d7d50de77d0572596111e95a3da3a350a319
2021-06-08 16:59:56 -07:00
97dfc7e300 [Reland] Adding run specified tests option to run_test.py (#59649)
Summary:
Reland of https://github.com/pytorch/pytorch/issues/59487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59649

Reviewed By: samestep

Differential Revision: D28970751

Pulled By: janeyx99

fbshipit-source-id: 6e28d4dcfdab8a49da4b6a02c57516b08bacd7b5
2021-06-08 16:04:46 -07:00
51884c6479 .github: Add Windows GPU workflow (#58782)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58782

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28645531

Pulled By: seemethere

fbshipit-source-id: 6ed1a2dead9cca29e26e613afdbcf46ba7cee88c
2021-06-08 16:00:21 -07:00
6104ac5aaf [libkineto] Refactor trace activities (#59360)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59360

Pull Request resolved: https://github.com/pytorch/kineto/pull/206

Replace ClientTraceActivity with GenericActivity.
In addition:
* Add a couple of new activity types for user annotations
* Simplify code for GPU-side user annotations
* Add accessor to containing trace span object in activities. Later we can replace this with a trace context / trace session object.
* Simplified MemoryTraceLogger
* Added early exit for cupti push/pop correlation ID

Reviewed By: ilia-cher

Differential Revision: D28231675

fbshipit-source-id: 7129f2493016efb4d3697094f24475e2c39e6e65
2021-06-08 15:49:35 -07:00
acc47357b5 Fix torch.conj for zero-dimensional sparse coo matrix (#59553)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59553

Added a test for 0x0 sparse coo input for sparse_unary_ufuncs.
This test fails for `conj` on master.

Modified `unsupportedTypes` for test_sparse_consistency, complex dtypes
pass, but float16 doesn't pass for `conj` because `to_dense()` doesn't
work with float16.

Fixes https://github.com/pytorch/pytorch/issues/59549

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28968215

Pulled By: anjali411

fbshipit-source-id: 44e99f0ce4aa45b760d79995a021e6139f064fea
2021-06-08 15:46:49 -07:00
894aaa3997 Revert D28943928: [pytorch][PR] adding base commit to scribe report
Test Plan: revert-hammer

Differential Revision:
D28943928 (92ed70a048)

Original commit changeset: ae3d279005f5

fbshipit-source-id: fda98b6c54425bba2f937a1cb921027531d61842
2021-06-08 15:43:57 -07:00
6ca141fe6c Make detach return an alias even under inference mode (#59633)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59633

Fixes #59614

This fix isn't 100% correct but it appears to stem the bleeding.
A better fix would be to understand how to detect when function
implementations don't uphold the required invariants, leading to
refcount disasters.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28962183

Pulled By: ezyang

fbshipit-source-id: 6ec71994666289dadef47bac363e6902df90b094
2021-06-08 15:31:29 -07:00
14f4c8d333 Revert D28387762: Forward AD formulas batch 3
Test Plan: revert-hammer

Differential Revision:
D28387762 (58348bea06)

Original commit changeset: fc395c92af7e

fbshipit-source-id: 608d704ff5bc560714790a576eaf9ed7f1f44e13
2021-06-08 15:19:26 -07:00
528d82d6a6 [torch] Add debug name to assert message for useOf
Summary:
Make an assert message in PyTorch's JIT provide better information by
printing the debug name of a value in `PythonPrintImpl::useOf` if it's not
found in any tables.

Test Plan:
Tested printing a `module.code` where the module had an invalid value used
as an operand. Before, it asserted without any further details; afterwards, it
printed the debug name, which made it easy to track down the offending value.

Reviewed By: SplitInfinity

Differential Revision: D28856026

fbshipit-source-id: 479f66c458a0a2d9a161ade09f20382e7b19d60e
2021-06-08 15:03:58 -07:00
9d533ef3ac Renorm fix (#59615)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59584
albanD, soulitzer, `renorm` grad was completely busted. Fast gradcheck is definitely not doing its job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59615

Reviewed By: jbschlosser

Differential Revision: D28964271

Pulled By: ngimel

fbshipit-source-id: b6878cd24db9189b64b67eb58bd2cd8956cda78a
2021-06-08 14:59:24 -07:00
67b8e6410d [OSS] Add podspec for libtorch-lite (#59638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59638

ghstack-source-id: 130847775

Test Plan: .

Reviewed By: husthyc, cccclai

Differential Revision: D28966693

fbshipit-source-id: 1b82623279709d0118c0967e2ba730d5dec040cc
2021-06-08 14:46:23 -07:00
1bb1a9e22b [ROCm] enable test_cufft_plan_cache test (#57520)
Summary:
This PR enables the test_cufft_plan_cache test in the test_spectral suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57520

Reviewed By: ejguan

Differential Revision: D28936128

Pulled By: ngimel

fbshipit-source-id: c843ab31c50855b624a986155c17c8d24e89a2ac
2021-06-08 14:42:01 -07:00
43274ca145 test_store multiworker remove multiprocessing (#59599)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59599

This will fix the flakiness for these tests internally when running under TSAN. We don't need multiprocessing since we should restrict the testing to the `wait_for_workers` and `world_size` parameters of the tcp store master store.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28947838

Pulled By: H-Huang

fbshipit-source-id: d3e3904aa7ac81ae4c744a193a3b7167c2227bc8
2021-06-08 14:38:42 -07:00
40cbf342d3 Fix vectorized calculations on POWER (#59382)
Summary:
This fixes multiple bugs introduced by the VSX optimized code in https://github.com/pytorch/pytorch/pull/41541

- min/max/clamp now consistently return NaN when any value is NaN, matching other architectures (see the example below)
- The non-complex angle functions now return PI for negative values
- The complex angle functions have been corrected and optimized
- The float32 log implementation returned a wrong result when inf was passed (and possibly other inputs); it is replaced by the sleef function, just as for float64
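
A minimal illustration of the NaN propagation the first bullet refers to (assuming the semantics described above; this snippet is illustrative and not part of the PR):

```python
import torch

x = torch.tensor([0.5, float('nan'), 2.0])
# NaN is expected to propagate through clamp and elementwise min/max.
print(torch.clamp(x, min=0.0, max=1.0))   # tensor([0.5000,    nan, 1.0000])
print(torch.minimum(x, torch.ones(3)))    # tensor([0.5000,    nan, 1.0000])
```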

Fixes https://github.com/pytorch/pytorch/issues/59248
Fixes https://github.com/pytorch/pytorch/issues/57537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59382

Reviewed By: jbschlosser

Differential Revision: D28944626

Pulled By: ezyang

fbshipit-source-id: 1ae2782b9e34e458a19cec90617037654279e0e0
2021-06-08 14:18:47 -07:00
ea3b2fd0fa Throw RunTimeError using TORCH_CHECK (#59485)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59485

... when a variable is not allowed to require grad

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933808

fbshipit-source-id: ef3536049d3a4a2f6e2f4b1787f0c17763f5828c
2021-06-08 14:03:21 -07:00
5fc105b323 Raise NotImplementedError on forward passes (#59483)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59483

... for functions that are not implemented

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933806

fbshipit-source-id: dadae1af6609f15419cf0f47a98361dc87dff849
2021-06-08 14:03:19 -07:00
c268eefe96 Use TORCH_CHECK_NOT_IMPLEMENTED for AD not implemented (#59482)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59482

Fixes #53398

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D28933809

fbshipit-source-id: 53387ec9690fc235b0622b50800feced706ea1ee
2021-06-08 14:02:04 -07:00
84061dadad Add reduce variants for scatter operation. (#57015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56463 and #56464

- Add reduce variants for `scatter` in both `native_functions.yaml` and `TensorAdvancedIndexing.cpp`
- Add `OpInfo` tests and reduce tests in `test_torch.py`
- Fix default reduce argument for `scatter_` in `_tensor_docs.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57015

Reviewed By: mrshenli

Differential Revision: D28162657

Pulled By: ezyang

fbshipit-source-id: 4d37ed1569ce8560aca1085c9cf5349f11427c4f
2021-06-08 13:37:26 -07:00
9de0c214bd [quant] Fix dimension for output of batchnorm 1d (#59264)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59264

Previously, batchnorm 1d unsqueezed twice but only squeezed once before returning when the
input Tensor has 2 dimensions; this PR adds the missing extra squeeze
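
A hypothetical sketch of the shape bookkeeping described above (illustrative only, not the actual quantized batchnorm code):

```python
import torch

x = torch.randn(4, 8)               # 2-D input to a 1d batchnorm: (N, C)
y = x.unsqueeze(-1).unsqueeze(-1)   # unsqueezed twice -> (N, C, 1, 1) for the underlying op
out = y                             # stand-in for the batchnorm computation
out = out.squeeze(-1)               # squeezing only once leaves (N, C, 1) -- the old behavior
out = out.squeeze(-1)               # the extra squeeze restores the expected (N, C)
assert out.shape == x.shape
```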

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D28810597

fbshipit-source-id: 879873bbf39ed3607762684694f6e81b423740c2
2021-06-08 13:07:00 -07:00
58348bea06 Forward AD formulas batch 3 (#58094)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58094

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28387762

Pulled By: albanD

fbshipit-source-id: fc395c92af7ebb5ebae95c40f6c76273047f4097
2021-06-08 13:00:21 -07:00
4512d75063 ger is an alias to outer, not the other way around (#59448)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59448

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28900487

Pulled By: albanD

fbshipit-source-id: e9065c5b29078d92ea9b746e188ebc1e62a407a0
2021-06-08 12:59:06 -07:00
d0e84c2f23 Revert D28961233: [pytorch][PR] Adding run-specified-test-cases option in run_test.py
Test Plan: revert-hammer

Differential Revision:
D28961233 (a6c9483c2f)

Original commit changeset: 6b7ddc6e6185

fbshipit-source-id: 4f8471df987a03d5c928a04f989d5d43f9cc47e9
2021-06-08 12:04:15 -07:00
0208e604e3 seems os.environ.get() not working well on windows (#59634)
Summary:
Replace it with `os.getenv()` instead.

For some reason this was intermittently failing Azure pipelines. I can't log into the pipeline itself for debugging, but here are 2 examples: [successful](https://app.circleci.com/pipelines/github/pytorch/pytorch/332405/workflows/944609ad-5dcf-49da-984f-26c381d1f16c/jobs/13969059) vs [failed](https://app.circleci.com/pipelines/github/pytorch/pytorch/332518/workflows/21f8a5a6-3b95-432e-be42-ac98008c671b/jobs/13975637)

However, given that the other constants exposed by common_utils.py via `os.getenv()` were working, I am making them consistent.
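
A minimal sketch of the change in pattern (illustrative only; `SOME_FLAG` is a made-up variable name, not one of the real constants):

```python
import os

# Pattern that was intermittently problematic on the Windows pipelines:
slow_tests = os.environ.get('SOME_FLAG', '0') == '1'

# Pattern already used by the other common_utils.py constants, adopted for consistency:
slow_tests = os.getenv('SOME_FLAG', '0') == '1'
```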

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59634

Test Plan: CI/master

Reviewed By: jbschlosser

Differential Revision: D28966412

Pulled By: walterddr

fbshipit-source-id: 7bcb9adf06df0acabd9574459eb6637c3e6a2947
2021-06-08 11:59:39 -07:00
1242dd1357 Remove cancel_redundant_workflows job (#59608)
Summary:
After https://github.com/pytorch/pytorch/issues/59019 this workflow itself is redundant, so we don't need it anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59608

Reviewed By: jbschlosser, seemethere

Differential Revision: D28952314

Pulled By: driazati

fbshipit-source-id: 41aa33164be8271210ec23b9641e74596114416d
2021-06-08 11:38:29 -07:00
7949fdd2b6 ninja 1.9.0 couldn't be installed, CI might be broken (#59625)
Summary:
I suddenly found that `pip install ninja==1.9.0` failed in CI.
I also tested locally and on another colleague's machine.
It looks like it conflicts with the cmake installed in conda.

https://app.circleci.com/pipelines/github/pytorch/pytorch/332470/workflows/d8b6ed30-1c7e-4863-898a-7f067c6202e1/jobs/13972409
![image](https://user-images.githubusercontent.com/16190118/121175743-02a1c700-c88e-11eb-9596-97b903b727f9.png)

1.10.0 couldn't be installed either.
![image](https://user-images.githubusercontent.com/16190118/121176606-fbc78400-c88e-11eb-931c-aa65bad080f8.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59625

Reviewed By: jbschlosser

Differential Revision: D28966699

Pulled By: seemethere

fbshipit-source-id: a1150e411ba3b4ab65448a087aa65f4ebe6c3596
2021-06-08 11:07:14 -07:00
13917bab7f [Torch] Correct launcher tests (#59635)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59635

The diff corrects the launcher tests. The follow-up would be to determine why the tests succeeded during the `use_env` diff removal.

Test Plan: buck test mode/dev-tsan //caffe2/test/distributed/launcher:run_test -- --exact 'caffe2/test/distributed/launcher:run_test - test_launch_user_script_python_caffe2_bc (run_test.ElasticLaunchTest)' --run-disabled

Reviewed By: cbalioglu

Differential Revision: D28963813

fbshipit-source-id: a9f9b80787fb5c2f40a69ce31c8c2f3138654cad
2021-06-08 11:05:57 -07:00
3b0c6a7b50 fix AddPadding tensor shape inference (#59572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59572

fix AddPadding tensor shape inference

Test Plan: sandcastle

Reviewed By: dehuacheng

Differential Revision: D28686983

fbshipit-source-id: 03f70335fcfd94a1241562f8fbf12043a0deac2b
2021-06-08 11:02:33 -07:00
7dac2987ce [quant][eager][fix] Fix a typo in convert function in eager mode quantization (#59571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59571

Test Plan:
python test/test_quantization.py TestPostTrainingStatic.test_custom_module_class

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28938355

fbshipit-source-id: 566daeb07d616ae40e52754d3d4581f75f248f04
2021-06-08 10:24:22 -07:00
31d136c81f [DDP] Rename the member divFactor_ as div_factor for naming consistency in reducer (#59523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59523

Use snake case instead of camel case for consistency.
ghstack-source-id: 130759655

Test Plan: buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: cbalioglu

Differential Revision: D28922896

fbshipit-source-id: e04298284a78b2e71b562f790a878731962f873a
2021-06-08 10:04:20 -07:00
b7ee164456 [DDP] Remove the duplicate parseHookResult in reducer (#59510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59510

Address the comment in https://github.com/pytorch/pytorch/pull/58937#discussion_r645822768

#Closes: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130758758

Test Plan: waitforbuildbot

Reviewed By: cbalioglu

Differential Revision: D28918694

fbshipit-source-id: 7ac4e4e6268e220adefed230bdb377ab3b25e302
2021-06-08 10:04:18 -07:00
2b398d0537 [Reland][Gradient Compression] Apply division first to avoid overflow (#59576)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59576

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
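
A minimal sketch of the idea as a stand-alone helper (assumed names; the actual fix lives in the built-in C++ and Python comm hooks):

```python
import torch
import torch.distributed as dist

def allreduce_divide_first(grad: torch.Tensor, world_size: int) -> torch.Tensor:
    # Divide before the allreduce so that summing FP16 gradients stays in range;
    # dividing only after the sum may already have overflowed.
    grad = grad / world_size
    dist.all_reduce(grad)
    return grad
```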
ghstack-source-id: 130754510

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28941327

fbshipit-source-id: 932e8ddbdb2bfd609a78943f6dc390d3d6ca333f
2021-06-08 10:03:21 -07:00
92ed70a048 adding base commit to scribe report (#59570)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59570

Test Plan: CI

Reviewed By: samestep

Differential Revision: D28943928

Pulled By: walterddr

fbshipit-source-id: ae3d279005f54d83d7a3acae508d3ccdf1cd46b8
2021-06-08 09:58:38 -07:00
a6c9483c2f Adding run-specified-test-cases option in run_test.py (#59487)
Summary:
The run-specified-test-cases option would allow us to specify a list of test cases to run by having a CSV with minimally two columns: test_filename and test_case_name.

This PR also adds .json to some files we use for better clarity.

Usage:
`python test/run_test.py --run-specified-test-cases <csv_file>` where the csv file can look like:
```
test_filename,test_case_name,test_total_time,windows_only_failure_sha_count,total_sha_count,windows_failure_count,linux_failure_count,windows_total_count,linux_total_count
test_cuda,test_cudnn_multiple_threads_same_device,8068.8409659525,46,3768,53,0,2181,6750
test_utils,test_load_standalone,8308.8062920459,14,4630,65,0,2718,8729
test_ops,test_forward_mode_AD_acosh_cuda_complex128,91.652619369806,11,1971,26,1,1197,3825
test_ops,test_forward_mode_AD_acos_cuda_complex128,91.825633094915,11,1971,26,1,1197,3825
test_profiler,test_source,60.93786725749,9,4656,21,3,2742,8805
test_profiler,test_profiler_tracing,203.09352795241,9,4662,21,3,2737,8807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59487

Test Plan:
Without specifying the option, everything should be as they were before.

Running `python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv` resulted in this paste P420276949 (you can see internally). A snippet looks like:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv
Loading specified test cases to run from windows_smoke_tests.csv.
Processed 28 test cases.
Running test_cpp_extensions_jit ... [2021-06-04 17:24:41.213644]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_cpp_extensions_jit.py', '-k', 'test_jit_cuda_archflags'] ... [2021-06-04 17:24:41.213781]
s
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)
...
```
With pytest, an example executable would be:
`Running test_dataloader ... [2021-06-04 17:37:57.643039]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', '-m', 'pytest', 'test_dataloader.py', '-v', '-k', 'test_segfault or test_timeout'] ... [2021-06-04 17:37:57.643327]`

Reviewed By: jbschlosser

Differential Revision: D28961233

Pulled By: janeyx99

fbshipit-source-id: 6b7ddc6e61856aa0002e1a0afc845770e4f8400b
2021-06-08 09:49:10 -07:00
ea1de87f4b Sort params by size (decreasing)
Summary:
Pull Request: https://github.com/pytorch/pytorch/pull/59586
Task: https://www.internalfb.com/tasks/?t=90847711

**Overview:**
Suppose we have `n` items with positive integer sizes and `k` buckets. We want to assign items to buckets with the goal of uniformity. The precise criteria for uniformity can vary: e.g. minimize the maximum size, maximize the minimum size, etc. This is known as [multiway number partitioning](https://en.wikipedia.org/wiki/Multiway_number_partitioning). ZeRO's partitioning task reduces to solving this problem. In particular, this is the subproblem to be solved for each `param_group` in `self.param_groups`, where the parameters are the items and the ranks give the buckets.

The existing implementation uses the linear-time [greedy number partitioning algorithm](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Linear-time_algorithm), which assigns the next tensor-parameter to the process with the smallest total parameter size so far. In this task, I explore the [extension](https://en.wikipedia.org/wiki/Greedy_number_partitioning#Improved_algorithm) where each parameter group is sorted by decreasing size before applying the greedy algorithm, requiring linearithmic time (as dominated by the sort).
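
A minimal sketch of the two strategies compared here, with integer sizes standing in for parameter numels (illustrative only, not the ZeRO implementation):

```python
import heapq

def greedy_partition(sizes, k, sort_desc=False):
    """Assign each item to the currently smallest bucket; optionally sort by size first."""
    if sort_desc:
        sizes = sorted(sizes, reverse=True)
    heap = [(0, rank) for rank in range(k)]   # (bucket total, bucket id)
    buckets = [[] for _ in range(k)]
    for size in sizes:
        total, rank = heapq.heappop(heap)
        buckets[rank].append(size)
        heapq.heappush(heap, (total + size, rank))
    return buckets

sizes = [1, 2, 1, 9, 2]
print([sum(b) for b in greedy_partition(sizes, 2)])                  # plain greedy  -> [11, 4]
print([sum(b) for b in greedy_partition(sizes, 2, sort_desc=True)])  # greedy-sorted -> [9, 6]
```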

**Experiments**
The mean number of parameters represents a perfectly uniform allocation and hence the ideal allocation (which may be even better than the optimal partition). In the following tables, I present the maximum number of parameters for any one process and the difference from the mean in parentheses for ResNet-50, ResNet-152, and BERT (the bare BERT model). The best-performing partitioning strategy for each model is bolded.

Two processes:
| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 13,249,600 (471,084) | **12,794,816 (16,300)** | 12,778,516 |
| ResNet-152 | 30,567,488 (471,084) | **30,111,424 (15,020)** | 30,096,404 |
| BERT | **54,749,184 (8,064)** | 55,327,488 (586,368) | 54,741,120 |

Four processes:
| Model | Max Num Params - Greedy (Diff) | Max Num Params - Greedy-Sorted (Diff) | Mean Num Params |
| --- | --- | --- | --- |
| ResNet-50 | 7,524,864 (1,135,606) |  **6,436,864 (47,606)** | 6,389,258 |
| ResNet-152 | 16,232,192 (1,183,990) | **15,090,152 (41,950)** | 15,048,202 |
| BERT | **28,151,040 (780,480)** | 28,352,256 (981,696)  | 27,370,560 |

 ---

I also investigated the latency of `optimizer.step()` for the different partitioning algorithms. I measured the latency for 30 iterations and took the mean latency per process (excluding the first iteration due to cache coldness). In the following tables, I present the maximum of those mean latencies over all processes and the standard deviation of the latencies contributing to that maximum. Again, the best-performing partitioning strategy for each model is bolded. All entries are presented in seconds and used `gloo` backend.

Two processes:
| Model | Max `optimizer.step()` Time - Greedy (Std.) | Max `optimizer.step()` Time - Greedy-Sorted (Std.) |
| --- | --- | --- |
| ResNet-50 | **0.060 (0.002)** | 0.061 (0.002) |
| ResNet-152 | 0.166 (0.003) | **0.160 (0.004)** |
| BERT | 0.220 (0.009) | **0.199 (0.006)** |

Four processes:
| Model | Max `optimizer.step()` Time - Greedy | Max `optimizer.step()` Time - Greedy-Sorted |
| --- | --- | --- |
| ResNet-50 | 0.094 (0.004) | **0.093 (0.004)** |
| ResNet-152 | **0.228 (0.011)** | 0.231 (0.009) |
| BERT | **0.328 (0.015)** | 0.329 (0.021) |

Based on the standard deviations, the differences in the latency measurements across the different algorithms appear to be within the uncertainty in the measurement itself. Hence, it is difficult to argue that one algorithm is clearly the fastest.

 ---

`zero.py` is my experiment script, and I use the AI AWS cluster. The run command looks like:
```
srun -p $DEV_QUEUE --cpus-per-task=16 -t 5:00:00 --gpus-per-node=4 python zero.py -b nccl greedy 2 4
```
This runs the experiment script on an instance with 4 GPUs using `nccl` backend, outputting to a directory named `greedy/`, and using world sizes of 2 and 4. An analogous command can be used after modifying `partition_parameters()`, e.g. replacing `greedy` with `greedy_sorted` as the output directory name. Then, to run the analysis script:
```
python analyze.py greedy greedy_sorted
```
For more details on the experiment code, refer to: https://www.internalfb.com/diff/D28946756

**Notes:**
There exists an optimal solution to this partitioning problem. An algorithm that finds such a solution is the [complete greedy algorithm (CGA)](https://en.wikipedia.org/wiki/Greedy_number_partitioning#An_exact_algorithm), which reduces to the brute-force combinatorial search in the worst case. There exist heuristics to improve the `k = 2` case (i.e. when there are two processes); however, given that `n` in typical use cases is very large, any algorithm that is quadratic or slower is unrealistic. Other exact algorithms are similarly exponential in the worst case, rendering them intractable. Given this, I do not currently see a need for future proofing the partitioning algorithm against the introduction of algorithms beyond the naive greedy and the sorted greedy algorithms.

 ---

In the current ZeRO implementation, the core `partition_parameters()` computation happens twice upon initialization (i.e. call to `__init__()`): first from a call to `_param_to_rank()` (i.e. an access to `_param_to_rank`) and then from a call to `_update_trainable()`. `_update_trainable()` sees that no optimizer has been constructed yet, so it clears the cache, eliminating the first `partition_parameters()` computation and performing a redundant re-computation.

Here is a typical trace:
- [The ZeRO optimizer object is initialized, calling `__init__()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L142))
- [In `__init__()`, `self._device` is set, so it accesses `self._per_device_params`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L182))
- [`self._per_device_params` is not cached, so it accesses `self._param_to_rank`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L340))
- [`self._param_to_rank` is not cached, so it calls `partition_parameters()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L353)) (first call to `partition_parameters()`)
- [`__init__()` later calls `_update_trainable()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L185))
- [In `_update_trainable()`, `self` does not have `attr` `"optim"`, so it clears the cached objects (notably, `self._partition_parameters_cache`).](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L591))
- [`_update_trainable()` calls `self.partition_parameters()`.](d125694d0b/torch/distributed/optim/zero_redundancy_optimizer.py (L593)) (second call to `partition_parameters()`)

Based on the discussion [here](https://github.com/pytorch/pytorch/pull/59410), this recomputation is unintentional and should be addressed in a future diff.

Test Plan: I verified that the total number of parameters across the processes was consistent after the partitioning algorithm change. Otherwise, no additional modifications were made to existing tests.

Reviewed By: mrshenli

Differential Revision: D28946755

fbshipit-source-id: 7ad66a21a963555b3b2e693ba8069d2dddc94c60
2021-06-08 09:47:35 -07:00
935057fc74 [package] turn MockZipReader into DirectoryReader and add test coverage (#59107)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59107

Adding documentation, test coverage, and a missing method to the `DirectoryReader` class. `DirectoryReader` was previously named `MockZipReader`, and is used for operating on opened package archives via a `PackageImporter`.

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28760410

Pulled By: Lilyjjo

fbshipit-source-id: aa9d0a68e19738a6d5555bb04ce33af6a53f1268
2021-06-08 08:02:34 -07:00
693b2696f8 add dispatch for bitwise_and (#59388)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59388

Reviewed By: agolynski

Differential Revision: D28891985

Pulled By: ezyang

fbshipit-source-id: 4f8b301ba615f1e21a920f02166d64c978204adb
2021-06-08 07:51:47 -07:00
4920d5a05a Temporarily add skip to fix slow gradcheck failure on master (#59585)
Summary:
Related https://github.com/pytorch/pytorch/issues/59584

Failure https://app.circleci.com/pipelines/github/pytorch/pytorch/331771/workflows/fed7923c-3490-490f-8769-81a71beae558/jobs/13940286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59585

Reviewed By: albanD

Differential Revision: D28945267

Pulled By: soulitzer

fbshipit-source-id: 72ae4b6c9a04fe9fdfb89888e12bae25c78be23c
2021-06-08 07:21:30 -07:00
5c7e14d2bc [DataLoader] Switch NotImplementedError to TypeError for len (#59464)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59464

Fixes #59378

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28944447

Pulled By: ejguan

fbshipit-source-id: 8b3d53a1863b41e578d56f219e452d18d7eae0d8
2021-06-08 07:16:18 -07:00
1b578c4bf5 [DataLoader] Close byte stream explicitly (#58938)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58938

When running `test_datapipe.py`, Python's `gc` would report lots of `ResourceWarning`s due to unclosed streams. Besides being annoying, there are two potential problems:
- Performance regression, because `gc` requires additional memory and computation to track references
- Python's `gc` only runs periodically, so we may encounter a "too many open files" error due to the OS limit
To reduce the warnings:
- Explicitly close the byte stream (see the sketch at the end of this summary)
- Modify `test_datapipe.py` to use a context manager

Small fix:
- Reorder imports in `test_datapipe.py`

Further investigation:
Can we directly use a context manager in `LoadFileFromDisk` and `ReadFileFromTar` to eliminate this error?
- Probably not. It's feasible only if the pipeline is synchronous and without prefetching. When we enable those two features, the scope guard of the context manager doesn't work.
- We may need to implement some reference counter attached to these file byte streams so they close themselves.
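
A minimal sketch of the explicit-close pattern referred to above (a generic file-reading iterator, not the actual DataPipe implementation):

```python
def read_files(paths):
    # Yield (path, contents) pairs while guaranteeing the stream is closed,
    # so gc never has to report an unclosed-stream ResourceWarning.
    for path in paths:
        stream = open(path, 'rb')
        try:
            yield path, stream.read()
        finally:
            stream.close()

# On the consumer/test side, a context manager gives the same guarantee:
# with open(path, 'rb') as stream:
#     data = stream.read()
```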

Test Plan: Imported from OSS

Reviewed By: jbschlosser

Differential Revision: D28689862

Pulled By: ejguan

fbshipit-source-id: bb2a85defb8a4ab5384db902ef6ad062185c2653
2021-06-08 07:15:08 -07:00
90c5b74e47 Back out "[PyTorch Edge] bytecode version bump to v5 and enable share constant table" (#59432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59432

Original commit changeset: 6f5cf4296eaa
ghstack-source-id: 130805860

Test Plan: CI

Reviewed By: raziel, iseeyuan

Differential Revision: D28892955

fbshipit-source-id: ce414a4c7a18001bdd27333cea03c6403b39d146
2021-06-08 07:11:26 -07:00
5d6a10a765 Revert D28913223: [pytorch][PR] Adding run-specified-test-cases option in run_test.py
Test Plan: revert-hammer

Differential Revision:
D28913223 (24432eaa29)

Original commit changeset: 0d1f99109734

fbshipit-source-id: 47c073720cff23a5d4cb64556381c46025e90937
2021-06-08 02:18:16 -07:00
010bcb4c2d Fix xnnpack hardswish memory issue (#59577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59577

Collapse all dimensions of the tensor into the batch dimension and use 1 as the channel count. This fixes the 1D over-calculation case (see the sketch below).
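
A hypothetical illustration of the reshaping idea (not the actual XNNPACK integration code):

```python
import torch

x = torch.randn(2, 3, 5)            # any input shape
batch = x.numel()                    # collapse every dimension into the batch
channels = 1                         # treat the channel count as 1
flat = x.reshape(batch, channels)    # the op now sees exactly numel elements, no over-counting
assert flat.numel() == x.numel()
```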

Test Plan:
buck test fbandroid/mode/server fbandroid/mode/asan_ubsan fbsource//xplat/caffe2:pt_xnnpack_test

buck test fbsource//xplat/caffe2:pt_xnnpack_test

Reviewed By: kimishpatel

Differential Revision: D28942141

fbshipit-source-id: b36f820a900b6a2ed649d6b9bac79d3392d3537c
2021-06-07 21:56:05 -07:00
1faba1e4cc [Pytorch Edge] Make RegisterBackendSelect Selective (#59096)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59096

RegisterBackendSelect brings ~100 extra ops into the runtime. This interferes with the compatibility API and also adds a nontrivial amount of binary size.

Test Plan: Model Unittests/CI

Reviewed By: iseeyuan

Differential Revision: D28588100

fbshipit-source-id: ffd0b5b9cbe20f27dbf3be418a6c1f80c7396fdb
2021-06-07 19:48:46 -07:00
501320ed81 [pytorch] deprecate default_op_deps.yaml (#59573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573

To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;

We are developing 3. For open source, we used to only support 1, and
currently we support both 1 and 2.

This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can try if they don't bother to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.

Since the workflow to automatically update the file has been broken
for a while, it started to confuse other pytorch developers as people
are already manually editing it, and it might be broken for some models
already.

We reintroduced the static dispatch recently, so we decide to deprecate
this file now and automatically turn on static dispatch if users run
selective build without providing the static analysis graph.

The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.

Differential Revision:
D28941020
D28941020

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: ljk53

fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
2021-06-07 19:37:37 -07:00
c436426be8 [fbgemm] fix gconv + acc16 (#59541)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59541

Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/621

Fixing 2 issues. These are actually 2 independent issues, one in Caffe2 and another in FBGEMM, so there is no need to wait until FBGEMM is synchronized with PyTorch:

1) conv 16-bit accumulation doesn't support the fast gconv path, so TakeGConvFastPath_ should honor that
2) packed_index_ generates indices up to (G/GTogether_) * F * R * S * OC_per_G * GTogether_ * paddedICPerG, which can exceed the G * kernel_prod * OC_per_G * paddedICPerG allocated in PackWeightMatrixForGConv (kernel_prod = F * R * S): e.g., when G=3, GTogether_=2, we allocate 3 * F * R * S * OC_per_G * paddedICPerG but we access up to 2 * F * R * S * OC_per_G * 2 * paddedICPerG

BTW, not sure how this issue went unnoticed for so long. Any ideas would be really appreciated.

Test Plan:
In a BDW machine,
buck test //caffe2/caffe2/quantization/server:conv_groupwise_dnnlowp_acc16_op_test -- --run-disabled

Reviewed By: dskhudia

Differential Revision: D28927214

fbshipit-source-id: 3ec98ea2fc177545392a0148daca592d80f40ad3
2021-06-07 19:20:59 -07:00
57d8bccd00 only reorder tests based on git diff if IN_CI (#59565)
Summary:
Do not reorder tests based on the git diff unless IN_CI is set; doing so makes local test ordering nondeterministic. Most of us branch out from viable/strict, not the head of master.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59565

Reviewed By: ejguan

Differential Revision: D28943906

Pulled By: walterddr

fbshipit-source-id: e742e7ce4b3fc017d7563b01e93c4cd774d0a537
2021-06-07 17:54:19 -07:00
dafa4b3517 quantization: improve documentation on natively supported backends (#58925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58925

Cleans up documentation on natively supported backends.  In particular:
* adds a section title
* deduplicates information about fbgemm/qnnpack
* clarifies what `torch.backends.quantized.engine` does
* adds code samples with default settings for `fbgemm` and `qnnpack` (a minimal sketch of such a sample follows this list)
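
A minimal sketch of what such a default-settings sample might look like (assumed content, not copied from the docs):

```python
import torch

# server / x86 backend
torch.backends.quantized.engine = 'fbgemm'
qconfig = torch.quantization.get_default_qconfig('fbgemm')

# mobile / ARM backend
torch.backends.quantized.engine = 'qnnpack'
qconfig = torch.quantization.get_default_qconfig('qnnpack')
```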

Test Plan: Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28681840

Pulled By: vkuzo

fbshipit-source-id: 51a6ab66934f657553351f6c84a638fd5f7b4e12
2021-06-07 17:29:03 -07:00
6575975da9 [Reland2][DDP] Merge work and future_work in reducer (#59574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59574

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method since it is now only a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

1) Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, updated `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply division first and hence avoid FP16 overflow.

2) Compared with the reverted https://github.com/pytorch/pytorch/pull/59520, disabled `test_DistributedDataParallel_non_default_stream` on AMD, because now applying division first hurts the gradient averaging accuracy on AMD.
See [07:48:26]:
https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.2-py3.6-test1/1129/console

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130752393

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork --  test_DistributedDataParallel_non_default_stream

Reviewed By: rohan-varma

Differential Revision: D28940800

fbshipit-source-id: 1ba727ac951ebc1e7875dc1a1be8108a2c8d9462
2021-06-07 16:52:20 -07:00
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
ff553e5b09 enable upload test stats on PR (#59567)
Summary:
Enable test stats upload on PR.

Uses PR number as part of the key so that it can be properly indexed and later parsed if PR has been merged/closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59567

Reviewed By: ejguan

Differential Revision: D28943654

Pulled By: walterddr

fbshipit-source-id: f3a7a25ae14c6877067e1b347e3a8658d80d1544
2021-06-07 16:45:10 -07:00
24432eaa29 Adding run-specified-test-cases option in run_test.py (#59487)
Summary:
The run-specified-test-cases option would allow us to specify a list of test cases to run by having a CSV with minimally two columns: test_filename and test_case_name.

This PR also adds .json to some files we use for better clarity.

Usage:
`python test/run_test.py --run-specified-test-cases <csv_file>` where the csv file can look like:
```
test_filename,test_case_name,test_total_time,windows_only_failure_sha_count,total_sha_count,windows_failure_count,linux_failure_count,windows_total_count,linux_total_count
test_cuda,test_cudnn_multiple_threads_same_device,8068.8409659525,46,3768,53,0,2181,6750
test_utils,test_load_standalone,8308.8062920459,14,4630,65,0,2718,8729
test_ops,test_forward_mode_AD_acosh_cuda_complex128,91.652619369806,11,1971,26,1,1197,3825
test_ops,test_forward_mode_AD_acos_cuda_complex128,91.825633094915,11,1971,26,1,1197,3825
test_profiler,test_source,60.93786725749,9,4656,21,3,2742,8805
test_profiler,test_profiler_tracing,203.09352795241,9,4662,21,3,2737,8807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59487

Test Plan:
Without specifying the option, everything should be as they were before.

Running `python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv` resulted in this paste P420276949 (you can see internally). A snippet looks like:
```
(pytorch) janeyx@janeyx-mbp pytorch % python test/run_test.py --run-specified-test-cases windows_smoke_tests.csv
Loading specified test cases to run from windows_smoke_tests.csv.
Processed 28 test cases.
Running test_cpp_extensions_jit ... [2021-06-04 17:24:41.213644]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', 'test_cpp_extensions_jit.py', '-k', 'test_jit_cuda_archflags'] ... [2021-06-04 17:24:41.213781]
s
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK (skipped=1)
...
```
With pytest, an example executable would be:
`Running test_dataloader ... [2021-06-04 17:37:57.643039]
Executing ['/Users/janeyx/miniconda3/envs/pytorch/bin/python', '-m', 'pytest', 'test_dataloader.py', '-v', '-k', 'test_segfault or test_timeout'] ... [2021-06-04 17:37:57.643327]`

Reviewed By: samestep

Differential Revision: D28913223

Pulled By: janeyx99

fbshipit-source-id: 0d1f9910973426b8756815c697b483160517b127
2021-06-07 16:27:43 -07:00
caf76c2445 Move sharding to after all tests have been excluded (#59583)
Summary:
It would be most accurate if sharding occurred after all other changes to selected_tests were complete.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59583

Reviewed By: ejguan

Differential Revision: D28944737

Pulled By: janeyx99

fbshipit-source-id: a851473948a5ec942ffeeedeefdc645536a3d9f7
2021-06-07 15:04:36 -07:00
93140a31e2 Use irange in a few places (#55325)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55325

Test Plan: Sandcastle

Reviewed By: SciPioneer

Differential Revision: D27573006

fbshipit-source-id: 647b5da3901e92c23e95b2fe5e833e9081d72837
2021-06-07 14:53:41 -07:00
737d920b21 Strictly type everything in .github and tools (#59117)
Summary:
This PR greatly simplifies `mypy-strict.ini` by strictly typing everything in `.github` and `tools`, rather than picking and choosing only specific files in those two dirs. It also removes `warn_unused_ignores` from `mypy-strict.ini`, for reasons described in https://github.com/pytorch/pytorch/pull/56402#issuecomment-822743795: basically, that setting makes life more difficult depending on what libraries you have installed locally vs in CI (e.g. `ruamel`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59117

Test Plan:
```
flake8
mypy --config mypy-strict.ini
```

Reviewed By: malfet

Differential Revision: D28765386

Pulled By: samestep

fbshipit-source-id: 3e744e301c7a464f8a2a2428fcdbad534e231f2e
2021-06-07 14:49:36 -07:00
6ff001c125 DOC Improve documentation for LayerNorm (#59178)
Summary:
Closes https://github.com/pytorch/pytorch/issues/51455

I think the current implementation is aggregating over the correct dimensions. The shape of `normalized_shape` is only used to determine the dimensions to aggregate over. The actual values of `normalized_shape` are used when `elementwise_affine=True` to initialize the weights and biases.

This PR updates the docstring to clarify how `normalized_shape` is used. Here is a short script comparing the implementations for tensorflow and pytorch:

```python
import numpy as np
import torch
import torch.nn as nn

import tensorflow as tf
from tensorflow.keras.layers import LayerNormalization

rng = np.random.RandomState()
x = rng.randn(10, 20, 64, 64).astype(np.float32)
# slightly non-trivial data
x[:, :10, ...] = x[:, :10, ...] * 10 + 20
x[:, 10:, ...] = x[:, 10:, ...] * 30 - 100

# Tensorflow Layer norm
x_tf = tf.convert_to_tensor(x)
layer_norm_tf = LayerNormalization(axis=[-3, -2, -1], epsilon=1e-5)
output_tf = layer_norm_tf(x_tf)
output_tf_np = output_tf.numpy()

# PyTorch Layer norm
x_torch = torch.as_tensor(x)
layer_norm_torch = nn.LayerNorm([20, 64, 64], elementwise_affine=False)
output_torch = layer_norm_torch(x_torch)
output_torch_np = output_torch.detach().numpy()

# check tensorflow and pytorch
torch.testing.assert_allclose(output_tf_np, output_torch_np)

# manual computation
manual_output = ((x_torch - x_torch.mean(dim=(-3, -2, -1), keepdims=True)) /
                 (x_torch.var(dim=(-3, -2, -1), keepdims=True, unbiased=False) + 1e-5).sqrt())

torch.testing.assert_allclose(output_torch, manual_output)
```

To get to the layer normalization as shown here:

<img width="157" alt="Screen Shot 2021-05-29 at 2 13 52 PM" src="https://user-images.githubusercontent.com/5402633/120080691-1e37f100-c088-11eb-9060-4f263e4cd093.png">

One needs to pass in `normalized_shape` with shape `x.dim() - 1` with the size of the channels and all spatial dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59178

Reviewed By: ejguan

Differential Revision: D28931877

Pulled By: jbschlosser

fbshipit-source-id: 193e05205b9085bb190c221428c96d2ca29f2a70
2021-06-07 14:34:10 -07:00
a30b359590 fix double backward for binary_cross_entropy loss function when reduction=sum. (#59479)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59477.

```python
In [1]: import torch

In [2]: x = torch.rand(3, 3, dtype=torch.double, requires_grad=True)

In [3]: y = torch.rand(3, 3, dtype=torch.double)

In [4]: torch.autograd.gradgradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='sum'), [x, y])
Out[4]: True

In [5]: torch.autograd.gradgradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='mean'), [x, y])
Out[5]: True

In [6]: torch.autograd.gradcheck(lambda x, y: torch.nn.functional.binary_cross_entropy(x, y, reduction='sum'), [x, y])
Out[6]: True

```

More comprehensive testing could be added in https://github.com/pytorch/pytorch/pull/59447 where explicit `gradcheck` and `gradgradcheck` tests are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59479

Reviewed By: ejguan

Differential Revision: D28934354

Pulled By: albanD

fbshipit-source-id: 12ce68e3c5c499b2531f7cdba3c22548d67e07e9
2021-06-07 14:14:08 -07:00
77dde35f1a Fix error message formatting in _make_grads (#59532)
Summary:
- TORCH_CHECK doesn't handle printf-style format strings, so it outputs something like: `got %ld tensors and %ld gradients21`
- `got 2 tensors and 1 gradients` is the expected message here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59532

Reviewed By: ejguan

Differential Revision: D28934680

Pulled By: albanD

fbshipit-source-id: 2d27a754ae81310b9571ae2a2ea09d0f8d8a3d81
2021-06-07 14:05:24 -07:00
24e27af683 [ROCm] enable kernel asserts (#49624)
Summary:
Addresses missing ROCm feature indicated in https://github.com/pytorch/pytorch/issues/38943.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49624

Reviewed By: agolynski

Differential Revision: D28902459

Pulled By: malfet

fbshipit-source-id: 29c9b552770241a0ec52cd057ea45efc4389d838
2021-06-07 13:43:07 -07:00
05b571ee8e fix name of 'dims' kwarg in torch.tile docs (#59471)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59471

Fixes #59150

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28908569

Pulled By: saketh-are

fbshipit-source-id: 57d0e75d899a1d9979e8bdb20dfd2b136dd63d1b
2021-06-07 13:18:19 -07:00
b0ac9bfb2b Add warning about should_drop for JIT coverage plug-in (#57961)
Summary:
This adds a comment above `should_drop` to prevent someone from inadvertently breaking JIT coverage by renaming the function without updating the correct references.

The current JIT plug-in uses `should_drop` to figure out which code is going to be JIT'd. If the function is named differently, the plug-in would also need to be updated.

Question: I understand this may not be the cleanest solution. Would a cleaner solution be to create a dummy function that would simply exist for the JIT plug-in? I did not immediately do that as that may be adding unnecessary code complexity in torch.jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57961

Reviewed By: samestep

Differential Revision: D28933587

Pulled By: janeyx99

fbshipit-source-id: 260aaf7b11f07de84a81d6c3554c4a5ce479d623
2021-06-07 12:48:01 -07:00
8693e288d7 DOC Small rewrite of interpolate recompute_scale_factor docstring (#58989)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55909

This PR looks to improve the documentation to describe the following behavior:

8130f2f67a/torch/nn/functional.py (L3673-L3685)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58989

Reviewed By: ejguan

Differential Revision: D28931879

Pulled By: jbschlosser

fbshipit-source-id: d1140ebe1631c5ec75f135c2907daea19499f21a
2021-06-07 12:40:05 -07:00
1798ff02e4 [PyTorch] Optimize c10::optional<ArrayRef<T>> for size (#59333)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59333

Code comment should explain this in sufficient detail. In brief, making it 16 bytes should get it to be passed in registers.
ghstack-source-id: 130631329

Test Plan: Updated optional_test and added static_assert in Optional.cpp.

Reviewed By: ezyang

Differential Revision: D28843027

fbshipit-source-id: 3029f05e03a9f04ca7337962e7770cdeb9a608d9
2021-06-07 11:35:17 -07:00
cc03ea2c47 [quant] Implemented InputWeightObserver for Linear inputs
Summary: Implemented two observers (InputEqualObserver and WeightEqualObserver) which will be inserted into the graph during prepare_fx().

Test Plan: python test/test_quantization.py TestEqualizeFx

Reviewed By: supriyar

Differential Revision: D28836954

fbshipit-source-id: 25517dc82ae67698ed8b2dc334e3323286976104
2021-06-07 11:19:43 -07:00
c51abf8fca Make binary_cross_entropy differentiable wrt target (#59447)
Summary:
As per title. Resolves https://github.com/pytorch/pytorch/issues/56683.
`gradgradcheck` will fail once `target.requires_grad() == True` because of the limitations of the current double backward implementation.
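
For reference, the per-element gradient with respect to the target follows directly from the loss definition (a standard derivation, not taken from the PR):

```latex
\ell(x, t) = -\bigl(t \log x + (1 - t)\log(1 - x)\bigr),
\qquad
\frac{\partial \ell}{\partial t} = \log(1 - x) - \log x
```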

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59447

Reviewed By: agolynski

Differential Revision: D28910140

Pulled By: albanD

fbshipit-source-id: 20934880eb4d22bec34446a6d1be0a38ef95edc7
2021-06-07 09:20:17 -07:00
94cc681fc2 Revert D28922305: [Reland][DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28922305 (3137bbeb1a)

Original commit changeset: 6388a96eda7a

fbshipit-source-id: bc150672e857286eeb129ea683b1cfd2034f0564
2021-06-07 03:58:20 -07:00
f998e63dca Revert D28922548: [Gradient Compression] Apply division first to avoid overflow
Test Plan: revert-hammer

Differential Revision:
D28922548 (459270ac01)

Original commit changeset: 442bd3cc7a35

fbshipit-source-id: 7e4361b4eb283cdb21f15a36d6eebf558dd7386f
2021-06-07 03:57:10 -07:00
459270ac01 [Gradient Compression] Apply division first to avoid overflow (#59522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59522

If the gradients before allreduce are large, then the sum after allreduce may overflow, especially for FP16. Therefore, apply the division before allreduce.

This fix is applied to both C++ and Python comm hooks.
ghstack-source-id: 130686229

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_ddp_comm_hook_allreduce_hook_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_default_ddp_comm_hooks_nccl_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_compress_wrapper_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_builtin_ddp_comm_hooks_nccl_grad_is_view
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl_grad_is_view

Reviewed By: rohan-varma

Differential Revision: D28922548

fbshipit-source-id: 442bd3cc7a35a8b948f626062fa7ad2e3704c5be
2021-06-07 01:43:10 -07:00
a2e56fa0dc Adding users of a node to the serialized JSON. (#59357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59357

Adding users of a node to the serialized JSON. Illustrated in the example:

JSON:
P419734894

Examples:
    {
      "shape": "[7]",
      "dtype": "torch.float16",
      "stride": "[1]",
      "is_quantized": false,
      "target": "conv.bias",
      "op_code": "get_attr",
      "name": "conv_bias",
      "args": [],
      "kwargs": {},
      "users": [
        {
          "is_node": true,
          "name": "to_dtype"
        }
      ]
    }

    {
      "target": "output",
      "op_code": "output",
      "name": "output",
      "args": [
        {
          "is_node": true,
          "name": "fba_layout_transform_1",
          "shape": "[3, 7, 12, 12]",
          "dtype": "torch.float16",
          "stride": "[1008, 144, 12, 1]",
          "is_quantized": false
        }
      ],
      "kwargs": {},
      "users": []
    }

Test Plan: buck test //caffe2/test:test_fx_experimental

Reviewed By: gcatron, jfix71

Differential Revision: D28857487

fbshipit-source-id: a3bac6bdb21ce10ba4a0d170c809aef13e6174a6
2021-06-06 23:15:32 -07:00
de40c8e495 Adds remaining OpInfos and removes redundant test generators (#55558)
Summary:
Per title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55558

Reviewed By: ngimel

Differential Revision: D28922522

Pulled By: mruberry

fbshipit-source-id: 89cefd93788bc8aa0683f4583cf5caa81aa2dc93
2021-06-06 14:52:26 -07:00
8c852de54d [PyTorch Edge] Remove legacy and kineto profilers from mobile build (#58730)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58730

The sources for the profilers are not needed in the mobile build, and unnecessarily add weight to the build. Remove them from the lite-interpreter build.

ghstack-source-id: 130684568

Test Plan: Build + BSB

Reviewed By: kimishpatel, raziel

Differential Revision: D28563725

fbshipit-source-id: 9d6f76176c2d2bbc25703281af1a076b1f2b4f19
2021-06-06 13:16:07 -07:00
3137bbeb1a [Reland][DDP] Merge work and future_work in reducer (#59520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59520

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove the `copy_grad_to_bucket` method since it is now only a one-line implementation, and create a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven inputs.

Compared with the reverted https://github.com/pytorch/pytorch/pull/58937, updated `_AllReduceCommHookWithDivFactor` in `default_comm_hooks.cpp` to apply division first and hence avoid FP16 overflow.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130685351

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_fp16_grad_is_view

Reviewed By: walterddr

Differential Revision: D28922305

fbshipit-source-id: 6388a96eda7a06f292873afed6d1362096c13e1c
2021-06-06 09:49:08 -07:00
390fe74944 Migrate torch.lstsq to ATen (#59400)
Summary:
Closes  https://github.com/pytorch/pytorch/issues/24726, closes https://github.com/pytorch/pytorch/issues/44011

This builds on the port from https://github.com/pytorch/pytorch/issues/44011. I've rebased on master and addressed mruberry's comments. There were also some unnecessary copies of `B` taking place that I've cleaned up. This function is already deprecated, but since it's the last lapack routine in TH, it's still worth porting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59400

Reviewed By: mruberry

Differential Revision: D28922060

Pulled By: ngimel

fbshipit-source-id: cfd7ec8b50d2ab886f0e04a2a557e4e410ee8184
2021-06-06 02:18:17 -07:00
da972afdcd OpInfo: to_sparse (#59445)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59445

Reviewed By: ngimel

Differential Revision: D28920866

Pulled By: mruberry

fbshipit-source-id: ba8d3071d9937096288b69511000eeb007f53434
2021-06-05 19:13:58 -07:00
96ac0e0340 OpInfo: t (#59442)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59442

Reviewed By: agolynski

Differential Revision: D28898946

Pulled By: mruberry

fbshipit-source-id: be32429fa7306554e4912fdcc382593d00c9f4ad
2021-06-05 18:59:38 -07:00
0a5bfa9919 Support __rmod__ (#58476)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58035.

This PR implements `torch.Tensor.__rmod__` and `torch.remainder(scalar, tensor)` for compatibility with NumPy’s interface (see the usage example below).
(cc: mruberry, rgommers, emcastillo, kmaehashi)
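
A minimal usage sketch of the behavior this enables:

```python
import torch

t = torch.tensor([2, 3, 4])
print(5 % t)                  # dispatches to Tensor.__rmod__ -> tensor([1, 2, 1])
print(torch.remainder(5, t))  # scalar-first overload         -> tensor([1, 2, 1])
```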

TODO:
  - [x] Update `tensor_binary_op` in test/test_binary_ufuncs.py after https://github.com/pytorch/pytorch/issues/58216 is merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58476

Reviewed By: ngimel

Differential Revision: D28776810

Pulled By: mruberry

fbshipit-source-id: 74f8aea80f439ef2cc370333524e39971eeb7bf4
2021-06-05 16:19:24 -07:00
344ecb2e71 flip via TI (#59509)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59509

Reviewed By: mruberry

Differential Revision: D28918665

Pulled By: ngimel

fbshipit-source-id: b045c7b35eaf22e53b1bc359ffbe5a4fda05dcda
2021-06-05 15:43:29 -07:00
1be7ca71ee OpInfo: log_softmax (#59336)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59336

Reviewed By: agolynski

Differential Revision: D28899052

Pulled By: mruberry

fbshipit-source-id: 60a9a4ffbca5a0f2c899d4d83500dcab4555ffb0
2021-06-05 13:51:50 -07:00
1dcc034fba [caffe2] Avoid attempt to use undefined preprocessor directive
Summary:
This is somewhat more verbose, but it's more correct and addresses this warning on Visual Studio 2017:
```
xplat\caffe2\caffe2\core\common.h(76): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```

Test Plan: Built locally with fix

Reviewed By: simpkins

Differential Revision: D28868632

fbshipit-source-id: f6a583e8275162adedb2a4bc5ed0f64847020871
2021-06-05 09:22:52 -07:00
1d9c1cc00a [4/n] [c10d] Introduce the multi-tenancy feature in TCPStore (#58331)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58331

This PR is the final part of a stack that addresses the GitHub issue #41614; it introduces the multi-tenancy feature to the `TCPStore` class, allowing two server stores to be instantiated with the same host:port pair.
ghstack-source-id: 130676394

Test Plan:
- Run the existing and newly-introduced tests.
- Run several smoke tests including the short code snippet referred in GitHub issue #41614.

Reviewed By: H-Huang

Differential Revision: D28453850

fbshipit-source-id: f9066b164305de0f8c257e9d5736e93fd7e21ec6
2021-06-05 07:50:07 -07:00
844a98758a [3/n] [c10d] Revise the implementation of TCPStore (#58330)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58330

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a major refactoring of the `TCPStore` class in preparation for the multi-tenancy feature.

- All TCP sockets are wrapped with a new `TCPSocket` RAII type.
- `BackgroundThread` and daemon types are moved from header to cpp file.
- Server, client, and callback sockets are refactored into their own internal types `TCPServer`, `TCPClient` and `TCPCallbackClient`.
- Calls to `tcputil::send*` and `tcputil::recv*` are wrapped in `TCPClient` for easier readability and maintenance purposes.
- Two `TODO` comments are added to reference future improvements. Based on feedback, I will either create separate GitHub issues for them or address them as part of this stack.
ghstack-source-id: 130676392

Test Plan: Run the existing tests since there are no user-facing behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28448981

fbshipit-source-id: 415b21e74b3cd51d673c1d5c349c6a2cb21dd667
2021-06-05 07:50:06 -07:00
4ee761c2c5 [2/n] [c10d] Introduce the 'multiTenant' constructor parameter in TCPStore (#58329)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58329

This PR is part of a stack that addresses the GitHub issue #41614; it introduces:

- A new `multiTenant` constructor option for the `TCPStore` class indicating whether multiple store instances can be initialized with the same host:port pair.

- Updates to the C10d distributed (elastic) rendezvous and the `init_process_group` method to leverage the new `multiTenant` feature.

Note that the multi-tenancy feature itself is implemented in the fourth PR of this stack. In this PR, passing `true` for `multiTenant` only results in a warning being output.
ghstack-source-id: 130676389

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: rohan-varma

Differential Revision: D28424978

fbshipit-source-id: fb1d1d81b8b5884cc5b54486700a8182a69c1f29
2021-06-05 07:50:04 -07:00
cf408c3743 [1/n] [c10d] Introduce a new TCPStore constructor (#58328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58328

This PR is part of a stack that addresses the GitHub issue #41614; it introduces a new `TCPStore` constructor that takes its optional parameters via a newly introduced `TCPStoreOptions` structure. This gives the API callers the flexibility to specify only the desired options while skipping the rest.

The main motivation behind this change is the introduction of the `multiTenant` constructor option in the second PR of this stack.
ghstack-source-id: 130676384

Test Plan: Run the existing tests since there are no behavioral changes.

Reviewed By: H-Huang

Differential Revision: D28417742

fbshipit-source-id: e6ac2a057f7ad1908581176ee6d2c2554c3c74a9
2021-06-05 07:50:02 -07:00
91eb831422 Revert D28698997: [Static Runtime] Add schema check to aten ops
Test Plan: revert-hammer

Differential Revision:
D28698997 (10345010f7)

Original commit changeset: 232fc60c0321

fbshipit-source-id: e351df62779fea85b7afe5160d3c40c4e7cee4ed
2021-06-05 07:48:49 -07:00
c88a0b55b3 Revert D28677383: [DDP] Merge work and future_work in reducer
Test Plan: revert-hammer

Differential Revision:
D28677383 (f8bebade47)

Original commit changeset: 85e0620378b7

fbshipit-source-id: ef3c65b88c375aa9a6befe2ab004ec37ae7eb587
2021-06-05 07:25:44 -07:00
f8bebade47 [DDP] Merge work and future_work in reducer (#58937)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58937

Remove `work` attribute from Reducer class in favor of `future_work`.

Additionally, remove `copy_grad_to_bucket` method since now it's only one-line implementation, and created a new C++ comm hook called `_AllReduceCommHookWithDivFactor` to replace allreduce and also support handling uneven input.

#Original PR Issue: https://github.com/pytorch/pytorch/issues/41266
ghstack-source-id: 130673249

Test Plan:
buck test caffe2/test/distributed:distributed_gloo_fork --  test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_accumulate_gradients_no_sync
buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_ddp_grad_div_uneven_inputs

Reviewed By: agolynski

Differential Revision: D28677383

fbshipit-source-id: 85e0620378b7e9d837e436e94b9d807631d7d752
2021-06-05 01:18:30 -07:00
5117ac3bb4 Revert D28877076: [pytorch][PR] torch.flip via TI
Test Plan: revert-hammer

Differential Revision:
D28877076 (d82bc3feb8)

Original commit changeset: 4fa6eb519085

fbshipit-source-id: c81e7d3283ff6822db913bf9f49a1533268755d0
2021-06-04 23:03:53 -07:00
10345010f7 [Static Runtime] Add schema check to aten ops (#59426)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59426

Reviewed By: ajyu

Differential Revision: D28698997

fbshipit-source-id: 232fc60c0321b8e68e4f1b6705233485260c281d
2021-06-04 21:38:45 -07:00
d82bc3feb8 torch.flip via TI (#58747)
Summary:
Implements an idea by ngimel to improve the performance of `torch.flip` via a clever hack into TI to bypass the fact that TI is not designed to work with negative indices.

Something that might be added is vectorisation support on CPU, given how simple the implementation is now.

Some low-hanging fruits that I did not implement:
- Write it as a structured kernel
- Migrate the tests to opinfos
- Have a look at `cumsum_backward` and `cumprod_backward`,  as I think that they could be implemented faster with `flip`, now that `flip` is fast.

**Edit**
This operation already has OpInfos, and it cannot be migrated to a structured kernel because it implements quantisation.

Summary of the PR:
- x1.5-3 performance boost on CPU
- x1.5-2 performance boost on CUDA
- Comparable performance across dimensions, regardless of the strides (thanks TI)
- Simpler code

<details>
<summary>
Test Script
</summary>

```python
from itertools import product

import torch
from torch.utils.benchmark import Compare, Timer

def get_timer(size, dims, num_threads, device):
    x = torch.rand(*size, device=device)

    timer = Timer(
        "torch.flip(x, dims=dims)",
        globals={"x": x, "dims": dims},
        label=f"Flip {device}",
        description=f"dims: {dims}",
        sub_label=f"size: {size}",
        num_threads=num_threads,
    )

    return timer.blocked_autorange(min_run_time=5)

def get_params():
    sizes = ((1000,)*2, (1000,)*3, (10000,)*2)
    for size, device in product(sizes, ("cpu", "cuda")):
        threads = (1, 2, 4) if device == "cpu" else (1,)
        list_dims = [(0,), (1,), (0, 1)]
        if len(size) == 3:
            list_dims.append((0, 2))
        for num_threads, dims in product(threads, list_dims):
            yield size, dims, num_threads, device

def compare():
    compare = Compare([get_timer(*params) for params in get_params()])
    compare.trim_significant_figures()
    compare.colorize()
    compare.print()

compare()
```
</details>

<details>
<summary>
Benchmark PR
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139954-81e46d80-ba3b-11eb-9aad-e825e515d41b.png)

</details>

<details>
<summary>
Benchmark master
</summary>

![image](https://user-images.githubusercontent.com/3291265/119139915-76914200-ba3b-11eb-9aa8-84b3ca220c93.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58747

Reviewed By: agolynski

Differential Revision: D28877076

Pulled By: ngimel

fbshipit-source-id: 4fa6eb519085950176cb3a9161eeb3b6289ec575
2021-06-04 20:13:38 -07:00
bca25d97ad [itemwise-dropout][1/x][low-level module] Implement Itemwise Sparse Feature Dropout in Dper3 (#59322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59322

Implement sparse feature dropout (with replacement) that can drop out individual items in each sparse feature. For example, the existing sparse feature dropout with replacement drops out the whole feature (e.g., a list of page ids) when the feature is selected for dropout. This itemwise dropout assigns a dropout probability to, and drops out, individual items in sparse features.

Test Plan:
```
buck test mode/dev caffe2/torch/fb/sparsenn:test
```

https://www.internalfb.com/intern/testinfra/testrun/281475166777899/

```
buck test mode/dev //dper3/dper3/modules/tests:sparse_itemwise_dropout_with_replacement_test
```
https://www.internalfb.com/intern/testinfra/testrun/6473924504443423

```
buck test mode/opt caffe2/caffe2/python:layers_test
```
https://www.internalfb.com/intern/testinfra/testrun/2533274848456607

```
buck test mode/opt caffe2/caffe2/python/operator_test:sparse_itemwise_dropout_with_replacement_op_test
```
https://www.internalfb.com/intern/testinfra/testrun/8725724318782701

Reviewed By: Wakeupbuddy

Differential Revision: D27867213

fbshipit-source-id: 8e173c7b3294abbc8bf8a3b04f723cb170446b96
2021-06-04 19:59:17 -07:00
68df4d40d2 show_pickle/model_dump: Handle invalid UTF-8 in pickles (#57661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57661

The Pickle "specification" (pickletools.py) states that the argument to
a BINUNICODE opcode must be UTF-8 encoded.  However, if a PyTorch custom
class returns a non-UTF-8 std::string from its pickle method, the
libtorch Pickler will write it to the output pickle without complaining.
Python's _Unpickler (the Python implementation of Unpickler) always
throws an exception when trying to deserialize these invalid pickles.

We still want to be able to dump these pickle files.  Update
DumpUnpickler to create its own opcode dispatch table (initialized as a
clone of the _Unpickler dispatch table) and patch in a custom function
for the BINUNICODE op.  We try to emulate the default behavior, but any
UnicodeDecodeError is caught and replaced with a dummy object.  This
could violate the assumptions of a user that expects a str in that
position, so we disable this behavior by default.

Update model_dump to recognize this special object and allow it to be
rendered.
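
A rough Python sketch of the dispatch-table patching idea described above (class and helper names are illustrative, not the actual DumpUnpickler code):

```python
import pickle
import struct

class InvalidString:
    """Stand-in for a BINUNICODE payload that is not valid UTF-8."""

class TolerantUnpickler(pickle._Unpickler):
    # Clone the pure-Python Unpickler's opcode dispatch table.
    dispatch = dict(pickle._Unpickler.dispatch)

def _load_binunicode(self):
    # BINUNICODE: 4-byte little-endian length followed by the raw bytes.
    length, = struct.unpack("<I", self.read(4))
    data = self.read(length)
    try:
        self.append(data.decode("utf-8"))
    except UnicodeDecodeError:
        self.append(InvalidString())

# Patch in the custom handler for the BINUNICODE opcode only.
TolerantUnpickler.dispatch[pickle.BINUNICODE[0]] = _load_binunicode
```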

Test Plan: Dumped and viewed a model with an invalid string in an object state.

Reviewed By: malfet

Differential Revision: D28531392

Pulled By: dreiss

fbshipit-source-id: ab5aea20975a0ef53ef52a880deaa2c5a626e4a2
2021-06-04 19:42:25 -07:00
ba3a90b55e Revert D28819780: [TensorExpr] Fix handling of 0-dim tensors.
Test Plan: revert-hammer

Differential Revision:
D28819780

Original commit changeset: f3feff35a1ce

fbshipit-source-id: 1dca4ac9cea0b67e9f02800f6d5b3c7e4ae1d81a
2021-06-04 19:25:30 -07:00
88fb5ee84c Revert D28819779: [TensorExpr] Improve debug messages.
Test Plan: revert-hammer

Differential Revision:
D28819779

Original commit changeset: 2eaa0b78fb30

fbshipit-source-id: babc22f75d87b1ba25f78ffe59266560413778ce
2021-06-04 19:20:31 -07:00
aa66990ef1 Automated submodule update: kineto (#54604)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: 88e3332ab9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54604

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: malfet

Differential Revision: D27297755

fbshipit-source-id: 5f5dd2429fb561530e6a59285c6ae708e5818ce9
2021-06-04 18:54:32 -07:00
18848d55b7 Do not use gold linker for CUDA builds (#59490)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59490

Reviewed By: agolynski, seemethere

Differential Revision: D28913160

Pulled By: malfet

fbshipit-source-id: d27092c252fc86424028abe146cf5f33a2f74544
2021-06-04 18:12:26 -07:00
a682ff7ef1 Add kMaxSupportedBytecodeVersion for Lite Interpreter (#59472)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59472

Previously, the lite interpreter would refuse to load any model
with a version greater than kProducedBytecodeVersion.  Now, we're
able to independently advance the loading and saving code, so we
can roll out changes without breaking forward compatibility.

Test Plan:
CI.
Loaded a bytecode v5 model even with setting kProducedBytecodeVersion
to v4.

Reviewed By: raziel

Differential Revision: D28904350

fbshipit-source-id: 598c22f0adf47d4ed3e976bcbebdf3959dacb1df
2021-06-04 17:55:02 -07:00
d125694d0b Move CUDA async warning to suffix (#59467)
Summary:
After the change async error warnings look as follows:
```
$ python -c "import torch;torch.eye(3,3,device='cuda:777')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59467

Reviewed By: ngimel

Differential Revision: D28904360

Pulled By: malfet

fbshipit-source-id: 2a8fa5affed5b4ffcaa602c8ab2669061cde7db0
2021-06-04 17:26:28 -07:00
f23c45bd04 Revert D28841011: [TensorExpr] Fix printing of Bool dtype.
Test Plan: revert-hammer

Differential Revision:
D28841011 (19985d6f84)

Original commit changeset: 9f68dd47e14a

fbshipit-source-id: ff517cfff49e46ed513e79eabbe9e9fd246ccce8
2021-06-04 16:27:14 -07:00
6309b342c3 [nnc] Enable CPU fuser inside FB, take 5 (#59461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59461

long tail test failures
ghstack-source-id: 130607578

Test Plan: fixed T92123560

Reviewed By: navahgar

Differential Revision: D28892885

fbshipit-source-id: 762a275b5aa14af0847e46cbf4036d3342b82189
2021-06-04 16:26:46 -07:00
f5e3eae82a [nnc] Infer device type from nodes if inputs are all scalars (#59430)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59430

With constant support added, we can now have fusion groups with only
scalar inputs.  So, we need to get the device type from the nodes in the graph
rather than just the inputs.
ghstack-source-id: 130613871

Test Plan: new unit test; also see test_tracer test_trace_of_script

Reviewed By: navahgar

Differential Revision: D28891989

fbshipit-source-id: f9e824acbd4856216b85a135c8cb60a2eac3c628
2021-06-04 16:25:33 -07:00
a776072de6 .github: Switch windows instance types (#59473)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59473

Switches windows instance types to prep for usage of the AWS built
Windows AMI with pre-installed Tesla Driver.

Unfortunately, neither c5.4xlarge nor g3.8xlarge is supported by this
AMI, but luckily we can swap those out for pretty comparable
alternatives like c5d.4xlarge and p3.2xlarge.

For CPU workflows this shouldn't make any real difference since the
CPU / memory is the same with c5d.4xlarge. For GPU workflows the GPU
on the p3.2xlarge is a Tesla V100, which should suit our needs.

<details>
<summary> nvidia-smi.exe (p3.2xlarge) </summary>

```
PS C:\Users\Administrator> & 'C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe'
Fri Jun  4 18:53:10 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 462.31       Driver Version: 462.31       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  TCC  | 00000000:00:1E.0 Off |                    0 |
| N/A   42C    P0    23W / 300W |      0MiB / 16258MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

</details>

It might eventually make sense to also switch Linux to these instance types, but do bear in mind that p3.2xlarge for Linux is ~$0.75 more expensive per hour than g3.8xlarge.

* [Price comparison for g3.8xlarge vs. p3.2xlarge](https://instances.vantage.sh/?compare_on=true&selected=p3.2xlarge,g3.8xlarge)
* [Price comparison for c5.4xlarge vs. c5d.4xlarge](https://instances.vantage.sh/?compare_on=true&selected=c5.4xlarge,c5d.4xlarge)

AMI that I'm planning on using as the new base AMI with included Tesla driver: https://aws.amazon.com/marketplace/pp/prodview-jrxucanuabmfm?qid=1622830809415&sr=0-2&ref_=srh_res_product_title#pdp-pricing

Info about c5 instances can be found here: https://aws.amazon.com/ec2/instance-types/c5/

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: agolynski

Differential Revision: D28913659

Pulled By: seemethere

fbshipit-source-id: 11b4d332e82b078a6801b312dc4ace2928838fc8
2021-06-04 16:22:05 -07:00
bbf7eceaf0 Refactor c10d and dist aliases for torch.distributed (#59456)
Summary:
**Overview:**
This consolidates `c10d` and `dist` to only `dist` as the alias for `torch.distributed` in `test_store.py`. Both aliases were most likely used due to incremental additions to the test file rather than intentionally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59456

Test Plan:
```
python test/distributed/test_store.py
```

Reviewed By: agolynski

Differential Revision: D28910169

Pulled By: andwgu

fbshipit-source-id: f830dead29e9de48aaf2845dfa5861c9cccec15d
2021-06-04 16:07:44 -07:00
1183fa3817 Switch PG::Work to Future in default_comm_hooks.cpp (#59398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59398

Test Plan: Imported from OSS

Reviewed By: SciPioneer

Differential Revision: D28876182

Pulled By: agolynski

fbshipit-source-id: 9d8f09ffa2f40bb0fb25c626b52678a1597a797e
2021-06-04 15:27:13 -07:00
aa27136e3c Fix test_randperm_device_compatibility for 1 GPU (#59484)
Summary:
Do not try to create tensors on 2nd device if device_count() == 1

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59484

Reviewed By: ngimel

Differential Revision: D28910673

Pulled By: malfet

fbshipit-source-id: e3517f31a463dd049ce8a5155409b7b716c8df18
2021-06-04 14:41:06 -07:00
a7c8c56b7f torchdeploy allow embedded cuda interp use without cuda (#59459)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59459

For any binary that can be used both with and without cuda, it's better to allow just including the cuda flavor of the interpreter.  The previous logic would fail in this case, as it only allows using the cuda flavor if torch::cuda::is_available() reports true.  Now, we unconditionally allow the cuda flavor to be used if it's present.

Test Plan: Added new unit test to exercise this scenario, ran locally on devvm without cuda.

Reviewed By: dzhulgakov

Differential Revision: D28902176

fbshipit-source-id: 5c7c90d84987848471bb6dd5318db15314e0b442
2021-06-04 14:37:39 -07:00
aeb55225e0 [caffe2] add a basic implementation of run-time feature rollout checks (#59355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59355

Add a `CheckKnob()` function for doing run-time checks of feature roll-out
knobs.  This provides an API for safely controlling the roll-out of new
functionality in the code.

Test Plan: Included some basic unit tests.

Reviewed By: voznesenskym

Differential Revision: D26536430

fbshipit-source-id: 2e53234c6d9ce624848fc8b2c76f6833f344f48b
2021-06-04 14:34:41 -07:00
90ad0f316f try fixing checkout dirty issue (#59450)
Summary:
Testing to see if checking out submodules during the build phase will help.

This tentatively addresses https://github.com/pytorch/pytorch/issues/58867, but since the repro is not reliable, we can't be sure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59450

Reviewed By: malfet

Differential Revision: D28908537

Pulled By: walterddr

fbshipit-source-id: 21ad1392a5066554b5c633f31616ab3e6541c54d
2021-06-04 14:31:43 -07:00
c4349bfa84 [GHA] add upload binary size step (#58341)
Summary:
GHA upload working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58341

Test Plan: Internal table pytorch_binary_size row for this PR: https://github.com/pytorch/pytorch/issues/58341

Reviewed By: agolynski

Differential Revision: D28908549

Pulled By: walterddr

fbshipit-source-id: 313e5b2c5ce2a47af3c37652612af922a68fd246
2021-06-04 14:17:13 -07:00
3607478ecd Conjugate View (#54987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54987

Based off of ezyang (https://github.com/pytorch/pytorch/pull/44799) and bdhirsh (https://github.com/pytorch/pytorch/pull/43702) 's prototype:

Here's a summary of the changes in this PR:
This PR adds a new dispatch key called Conjugate. This enables us to make the conjugate operation a view and leverage specialized library functions that fast-path the Hermitian operation (conj + transpose).

1. The conjugate operation will now return a view with the conj bit set (1) for complex tensors, and return self for non-complex tensors as before. This also means `torch.view_as_real` will no longer be a view on conjugated complex tensors and is hence disabled. To fill the gap, we have added `torch.view_as_real_physical`, which returns the real tensor agnostic of the conjugate bit on the input complex tensor. The information about conjugation on the old tensor can be obtained by calling `.is_conj()` on the new tensor.
2. NEW API (see the usage sketch after the follow-up list below):
    a) `.conj()` -- now returning a view.
    b) `.conj_physical()` -- does the physical conjugate operation. If the conj bit for input was set, you'd get `self.clone()`, else you'll get a new tensor with conjugated value in its memory.
    c) `.conj_physical_()`, and `out=` variant
    d) `.resolve_conj()`  -- materializes the conjugation. returns self if the conj bit is unset, else returns a new tensor with conjugated values and conj bit set to 0.
    e) `.resolve_conj_()` in-place version of (d)
    f) `view_as_real_physical` -- as described in (1), it's functionally same as `view_as_real`, just that it doesn't error out on conjugated tensors.
    g) `view_as_real` -- existing function, but now errors out on conjugated tensors.
3. Conjugate Fallback
    a) Vast majority of PyTorch functions would currently use this fallback when they are called on a conjugated tensor.
    b) This fallback is well equipped to handle the following cases:
        - functional operation e.g., `torch.sin(input)`
        - Mutable inputs and in-place operations e.g., `tensor.add_(2)`
        - out-of-place operation e.g., `torch.sin(input, out=out)`
        - Tensorlist input args
        - NOTE: Meta tensors don't work with conjugate fallback.
4. Autograd
    a) `resolve_conj()` is an identity function w.r.t. autograd
    b) Everything else works as expected.
5. Testing:
    a) All method_tests run with conjugate view tensors.
    b) OpInfo tests that run with conjugate views
        - test_variant_consistency_eager/jit
        - gradcheck, gradgradcheck
        - test_conj_views (that only run for `torch.cfloat` dtype)

NOTE: functions like `empty_like`, `zeros_like`, `randn_like`, `clone` don't propagate the conjugate bit.

Follow up work:
1. conjugate view RFC
2. Add neg bit to re-enable view operation on conjugated tensors
3. Update linalg functions to call into specialized functions that fast path with the hermitian operation.
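
A brief usage sketch of the conjugate-view API listed in (2) above (illustrative only; it uses just the methods named there):

```python
import torch

z = torch.tensor([1 + 2j, 3 - 4j])
c = z.conj()               # lazy view sharing memory with z; conj bit is set
print(c.is_conj())         # True
m = c.resolve_conj()       # materializes conjugated values; conj bit cleared
print(m.is_conj())         # False
p = z.conj_physical()      # eager conjugation into freshly allocated memory
```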

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28227315

Pulled By: anjali411

fbshipit-source-id: acab9402b9d6a970c6d512809b627a290c8def5f
2021-06-04 14:12:41 -07:00
19985d6f84 [TensorExpr] Fix printing of Bool dtype. (#59328)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59328

Before the change we printed:
```
aten_eq[0] = decltype(::c10::impl::ScalarTypeToCPPType< ::c10::ScalarType::Bool>::t)((targ_0[0])==(targ_1[0]) ? 1 : 0);
```
After the change we print:
```
aten_eq[0] = bool((targ_0[0])==(targ_1[0]) ? 1 : 0);
```

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28841011

Pulled By: ZolotukhinM

fbshipit-source-id: 9f68dd47e14a7bc28156b56414c2d5c0aad6b2d4
2021-06-04 13:59:38 -07:00
285b8a5252 [TensorExpr] Improve debug messages. (#59280)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59280

Differential Revision:
D28819779
D28819779

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 2eaa0b78fb309cccb0efe9025a5c3b039e717027
2021-06-04 13:59:36 -07:00
d60efd8207 [TensorExpr] Fix handling of 0-dim tensors. (#59279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59279

There were some issues with how we handle 0-dim cases in lowerings and
also in how we generate reductions in that special case. This PR fixes
those issues and reenables a bunch of tests.

Differential Revision:
D28819780
D28819780

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f3feff35a1ce11821ada2f8d04ae9d4be10dc736
2021-06-04 13:58:15 -07:00
dce8697aea [PyTorch][vulkan] Unify convert as vTensor& convert(const Tensor&) (#59268)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59268

There's no reason we can't give `convert` this signature: `Tensor::unsafeGetTensorImpl() const` returns a non-const TensorImpl pointer. (See https://github.com/zdevito/ATen/issues/27#issuecomment-330717839)
ghstack-source-id: 130548716

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D28811477

fbshipit-source-id: 269f58980c1f68b29d4be3cba4cd340299ce39af
2021-06-04 13:16:14 -07:00
c99d6254fb remove THCReduce.cuh (#59431)
Summary:
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59431

Reviewed By: malfet

Differential Revision: D28904504

Pulled By: ngimel

fbshipit-source-id: 25d98b736d74d64fd20a40e0d9c773332f56cc30
2021-06-04 12:57:07 -07:00
780faf52ca [profile] Clarify record_shapes=True docstring (#59469)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59469

Clarify that using record_shapes=True may cause extra tensor copies.
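
For context, a minimal profiler invocation that opts into shape recording (and therefore the potential extra copies mentioned above); the workload is purely illustrative:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# record_shapes=True records input shapes, which may force extra tensor copies.
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(torch.randn(64, 64), torch.randn(64, 64))

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total"))
```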

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28905089

Pulled By: ilia-cher

fbshipit-source-id: 7642cb16f6697b6d255a2b82348d4c17486680d0
2021-06-04 12:01:35 -07:00
b3ee645cbf Migrate _th_std_var to ATen (#59258)
Summary:
Ref https://github.com/pytorch/pytorch/issues/49421

This migrates `std`/`var`'s special-case all-reduction from TH to ATen. Using the benchmark from gh-43858 that was used to justify keeping the TH version, I find this PR has similar (slightly better) performance single-threaded. And unlike the TH version, this is multi-threaded and thus much faster for large tensors.

TH Results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       3.6   |       3.8    |     8.2  |      1.2
      80       |       3.7   |       3.8    |     8.4  |      1.2
      800      |       4.2   |       4.3    |     8.7  |      1.2
      8000     |       9.0   |       9.1    |    11.2  |      1.5
      80000    |      58.3   |      59.0    |    30.6  |      4.2
      800000   |     546.9   |     546.9    |   183.4  |     31.3
      8000000  |    5729.7   |    5701.0    |  6165.4  |    484.1
```

ATen results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       4.0   |       4.0    |     8.7  |      1.2
      80       |       3.6   |       3.8    |     9.0  |      1.2
      800      |       4.1   |       4.3    |     8.9  |      1.2
      8000     |       8.9   |       9.2    |    10.6  |      1.5
      80000    |      57.0   |      57.4    |    28.8  |      4.3
      800000   |     526.9   |     526.9    |   178.3  |     30.2
      8000000  |    5568.1   |    5560.6    |  6042.1  |    453.2

[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
8 threads: ---------------------------------------------------------
      8        |      3.9    |      3.8     |     9.1  |      1.2
      80       |      3.8    |      3.9     |     8.8  |      1.2
      800      |      4.2    |      4.3     |     8.9  |      1.3
      8000     |      9.0    |      9.2     |    10.4  |      1.5
      80000    |     26.0    |     26.8     |    26.4  |      4.4
      800000   |     92.9    |     87.3     |    72.1  |     22.4
      8000000  |    793.5    |    791.8     |  5334.8  |    115.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59258

Reviewed By: jbschlosser

Differential Revision: D28860353

Pulled By: ngimel

fbshipit-source-id: 80c9fe1db84dbc864eeb1a319076c7aaff0a04e5
2021-06-04 11:58:12 -07:00
689a5edd0a Revert D28326365: [pytorch][PR] Add torch.cuda.streams.ExternalStream
Test Plan: revert-hammer

Differential Revision:
D28326365 (d7ef9b73fb)

Original commit changeset: b67858c80339

fbshipit-source-id: 337588d40b96cf04e46e554fa481ae7fd4254478
2021-06-04 11:19:36 -07:00
3472f0c94d Enable torch::deploy GPU tests in sandcastle (#59460)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59460

Original commit changeset: 6e01a96d3746

Test Plan: Verify new tests run in sandcastle and existing CI is OK

Reviewed By: H-Huang

Differential Revision: D28900869

fbshipit-source-id: a8962ec48c66bba3b4b8f001ece7231953b29e82
2021-06-04 11:13:43 -07:00
ed993f3243 [CODEOWNERS] spandantiwari -> shubhambhokare1 (#59427)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59427

Reviewed By: agolynski

Differential Revision: D28902131

Pulled By: SplitInfinity

fbshipit-source-id: 6a583c5087caf147f9033b73765b1dd3f59a405c
2021-06-04 11:06:55 -07:00
e90caac676 Port gelu_backward to structured (#58665)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58665

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28572527

Pulled By: ezyang

fbshipit-source-id: 0cb286f20c5f91453594a7dfe39ae4e4d24a13a1
2021-06-04 11:06:54 -07:00
153a96054b Port gelu to structured (#58664)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58664

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572533

Pulled By: ezyang

fbshipit-source-id: 8be00ecdcc224b516de28bf5f43ec308174053db
2021-06-04 11:06:52 -07:00
5f824ef437 Port hardshrink to structured (#58663)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58663

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572531

Pulled By: ezyang

fbshipit-source-id: 3fc8c33445adeae1789774fb6d8099278b93f8f8
2021-06-04 11:06:50 -07:00
b4fa4c86f7 Port hardshrink_backward and softshrink_backward to structured (#58662)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58662

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28572532

Pulled By: ezyang

fbshipit-source-id: 8ecebc1090d884ee579f5d04a46f1e60a2dd978e
2021-06-04 11:05:44 -07:00
2119efd234 reflection_pad1d_backward: Port to structured (#59103)
Summary:
Tracking Issue: https://github.com/pytorch/pytorch/issues/55070
Port `reflection_pad1d_backward` to structured kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59103

Test Plan: Pre-existing tests

Reviewed By: jbschlosser

Differential Revision: D28836043

Pulled By: ezyang

fbshipit-source-id: 4c3b0880edf305896f540113dcab70c8af24253b
2021-06-04 10:23:53 -07:00
a6bd6b9ca5 [NNC] Fix the uninitialized pointer in loopnest.fuse_loops (#59411)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59411

Bug: the uninitialized For* caused a casting error in pybind11.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28882635

Pulled By: huiguoo

fbshipit-source-id: e3f2b659bae94e9617936b1b2368157bed73c2fe
2021-06-04 10:04:34 -07:00
aa06bc0731 OpInfo: minor fix in sample_inputs_diff (#59181)
Summary:
sample_inputs_diff constructs all five positional arguments for [diff](https://pytorch.org/docs/stable/generated/torch.diff.html) but uses only the first three. This doesn't seem to be intentional.
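
For reference, a small sketch with all five positional arguments in use (values are illustrative):

```python
import torch

x = torch.tensor([1, 3, 6, 10])
# Positional arguments: input, n, dim, prepend, append
torch.diff(x, 1, -1, torch.tensor([0]), torch.tensor([15]))
# -> tensor([1, 2, 3, 4, 5])
```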

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59181

Test Plan: This change expands coverage of diff's OpInfo sample inputs. Related tests still pass.

Reviewed By: mruberry

Differential Revision: D28878359

Pulled By: saketh-are

fbshipit-source-id: 1466f6c6c341490885c85bc6271ad8b3bcdf3a3e
2021-06-04 09:53:31 -07:00
b99523832b Remove use_env from torch.distributed.run, clarify bc around that parameter in comment. (#59409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59409

Remove use_env from torch.distributed.run, and clarify bc around that parameter in comment.

Test Plan: n/a

Reviewed By: cbalioglu

Differential Revision: D28876485

fbshipit-source-id: 5f10365968d204985ce517b83c392c688995d76e
2021-06-04 09:02:47 -07:00
4ae5764d47 Add is_inference to native functions (#58729)
Summary:
Adds `is_inference` as a native function w/ manual cpp bindings.
Also changes instances of `is_inference_tensor` to `is_inference` to be consistent with other properties such as `is_complex`.
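
A small sketch of the exposed property (a minimal example, assuming inference mode is available):

```python
import torch

x = torch.ones(3)
print(x.is_inference())        # False
with torch.inference_mode():
    y = torch.ones(3)
print(y.is_inference())        # True, y is an inference tensor
```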

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58729

Reviewed By: mruberry

Differential Revision: D28874507

Pulled By: soulitzer

fbshipit-source-id: 0fa6bcdc72a4ae444705e2e0f3c416c1b28dadc7
2021-06-04 08:59:11 -07:00
fa597ee17f Fix torch.randperm for CUDA (#59352)
Summary:
Context https://github.com/pytorch/pytorch/issues/58545

The logic is that we are going to keep it consistent for both
torch.randperm and torch.randint

1. Generators can have either a fully specified or a non-fully-specified device
2. As long as the generator's device type matches the result's, we don't error out (see the sketch below)
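
A sketch of a combination the rule above accepts (illustrative; assumes a CUDA device is available):

```python
import torch

g = torch.Generator(device="cuda")   # device type only, index not fully specified
# Device types match (both CUDA), so this no longer errors out.
x = torch.randperm(10, generator=g, device="cuda:0")
```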

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59352

Test Plan:
```
python test/test_tensor_creation_ops.py -k TestRandomTensorCreation
```

Reviewed By: ngimel

Differential Revision: D28855920

Pulled By: zhouzhuojie

fbshipit-source-id: f8141a2c4b2f177e1aa7baec6999b65916cba02c
2021-06-04 08:56:18 -07:00
202b2c9fc2 Remove many unnecessary constructor calls of Vectorized<T> (#58875)
Summary:
Refresh https://github.com/pytorch/pytorch/issues/56241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58875

Reviewed By: mruberry

Differential Revision: D28892034

Pulled By: ezyang

fbshipit-source-id: 21074e45f29a780168852be5305420a3cc1148fc
2021-06-04 08:50:53 -07:00
d7ef9b73fb Add torch.cuda.streams.ExternalStream (#57781)
Summary:
This is required in https://github.com/pytorch/pytorch/pull/57110#issuecomment-828357947

We need to provide means to synchronize on externally allocated streams for dlpack support in python array data api.
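
A hedged sketch of the intended usage, assuming a raw CUDA stream handle obtained from another library (CuPy is used purely as an illustration):

```python
import torch
import cupy

cupy_stream = cupy.cuda.Stream()
# Wrap the externally allocated stream so PyTorch work can be queued on it.
ext = torch.cuda.streams.ExternalStream(cupy_stream.ptr)
with torch.cuda.stream(ext):
    x = torch.ones(4, device="cuda") * 2
ext.synchronize()
```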

cc mruberry rgommers leofang asi1024 kmaehashi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57781

Reviewed By: mrshenli

Differential Revision: D28326365

Pulled By: ezyang

fbshipit-source-id: b67858c8033949951b49a3d319f649884dfd0a91
2021-06-04 08:47:09 -07:00
c769300301 Fix MaxPool default pad documentation (#59404)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59404

Reviewed By: albanD

Differential Revision: D28879049

Pulled By: Varal7

fbshipit-source-id: 03a86cd347d53ac2d06028b3f213c5b5d5ab7e91
2021-06-04 08:32:03 -07:00
6d51a89778 Fix broken hyperlinks (#59425)
Summary:
**Overview:**
A number of the hyperlinks in the [`CONTRIBUTING.md` file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken since they include an extraneous `/torch/`. This PR fixes those links.

The files whose links are broken are
- `ProcessGroupNCCL.hpp`
- `Store.hpp`
- `FileStore.hpp`
- `TCPStore.hpp`
- `PrefixStore.hpp`
- `rref_impl.h`
- `rref_context.h`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59425

Test Plan:
The `CONTRIBUTING.md` file is at https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md.

`ProcessGroupNCCL.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/ProcessGroupGloo.hpp, which is equivalent to `../lib/c10d/ProcessGroupGloo.hpp`.

`Store.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/Store.hpp, which is equivalent to `../lib/c10d/Store.hpp`.

`FileStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/FileStore.hpp, which is equivalent to `../lib/c10d/FileStore.hpp`.

`PrefixStore.hpp` should have link https://github.com/pytorch/pytorch/blob/master/torch/lib/c10d/PrefixStore.hpp, which is equivalent to `../lib/c10d/PrefixStore.hpp`.

`rref_interface.h` should have link https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/rref_interface.h, which is equivalent to `../../aten/src/ATen/core/rref_interface.h`.

`rref_context.h` should have link https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/rpc/rref_context.h, which is equivalent to `../csrc/distributed/rpc/rref_context.h`.

Reviewed By: mruberry

Differential Revision: D28888188

Pulled By: andwgu

fbshipit-source-id: 023219184d42284ea1cbfcf519c1b4277dd5a02b
2021-06-04 08:27:26 -07:00
63956610a7 Search for static OpenBLAS compiled with OpenMP (#59428)
Summary:
Before this change, only a dynamically linked OpenBLAS compiled with OpenMP
could be found.

Also get rid of hardcoded codepath for libgfortran.a in FindLAPACK.cmake

Only affects aarch64 linux builds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59428

Reviewed By: agolynski

Differential Revision: D28891314

Pulled By: malfet

fbshipit-source-id: 5af55a14c85ac66551ad2805c5716bbefe8d55b2
2021-06-04 08:09:21 -07:00
c7a3a13bab .circleci: Disable USE_GOLD_LINKER for CUDA 10.2 (#59413)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59413

For CUDA 10.2 builds linked with the gold linker, we were observing
crashes when exceptions were being raised.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28888054

Pulled By: seemethere

fbshipit-source-id: f9b38147591721803ed3cac607510fe5bbc49d6d
2021-06-04 07:02:54 -07:00
06ed658358 Merge TensorPipe's CPU and CUDA channel registry (#59375)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59375

The CPU and CUDA channels used to be separate classes in TensorPipe, but they recently got merged in order to support cross-device-type channels. We used to need two separate registries in PyTorch, but now we can merge them. This simplifies some registration logic, and will help in future PRs.
ghstack-source-id: 130583770

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28796427

fbshipit-source-id: b7db983293cbbddd1aedec6428de08d8944b0344
2021-06-04 06:53:49 -07:00
c09beaaf4a Remove LazyStreamContext (2 out of 2) (#59299)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59299

After recent changes, LazyStreamContext had in fact become always-eager and was equivalent to a vector of streams. So it makes more sense now to remove this abstraction and use a more self-descriptive type.

This PR migrates the TensorPipe agent. The previous PR migrated the RequestCallback internals.
ghstack-source-id: 130583773

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28789174

fbshipit-source-id: a27d2b1f40ab3cf2ac0dd946232fd0eecda6d450
2021-06-04 06:53:47 -07:00
03a5c6ea99 Remove LazyStreamContext (1 out of 2) (#59298)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59298

After recent changes, LazyStreamContext had in fact become always-eager and was equivalent to a vector of streams. So it makes more sense now to remove this abstraction and use a more self-descriptive type.

This PR migrates the RequestCallback internals. The next PR migrates the TensorPipe agent.
ghstack-source-id: 130583774

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28789175

fbshipit-source-id: fa581a50f9a6a1e42c2ad8c808a9b099bea7433e
2021-06-04 06:53:46 -07:00
3e7396f99d Fix CUDA sync when switching streams in RPC tests (#59297)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59297

PyTorch requires users to manually record tensors with the CUDA caching allocator when switching streams. We weren't doing it.

Also, the usage of an Event can be simplified by using `s1.wait(s2)`.
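
For reference, the general pattern in the Python API (a sketch of the idiom, not the test code itself):

```python
import torch

s_prod = torch.cuda.Stream()
s_cons = torch.cuda.Stream()

with torch.cuda.stream(s_prod):
    t = torch.ones(4, device="cuda")

s_cons.wait_stream(s_prod)      # consumer stream waits for producer's queued work
with torch.cuda.stream(s_cons):
    out = t * 2
t.record_stream(s_cons)         # tell the caching allocator t is also used on s_cons
```
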
ghstack-source-id: 130583777

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28832902

fbshipit-source-id: cd4f40ff811fa1b0042deedda2456e22f33b92bd
2021-06-04 06:53:44 -07:00
8f4cfaa9db Fix race condition in TP agent (#58753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58753

TSAN was (rightfully!) detecting and complaining about a race due to the fact that upon init the TP agent exchanges the device maps between nodes using RPC requests (and by doing so it accesses the device maps) and then sets the reverse device maps (thus possibly modifying the set of devices). This resulted in a data race, i.e., simultaneously reading and writing the set of devices without synchronizing.

One solution is to add a mutex around the devices, which works, but is "annoying". An alternative solution is to make the set of devices immutable (i.e., `const`). For that to work, we need to exchange the device maps without using RPC calls. We can do so using the process group that we need to create anyways.

Since now there's a lot more logic in Python, I've moved (and restructured) all safety checks over there, and removed them from C++.
ghstack-source-id: 130583775

Test Plan: Unit tests

Reviewed By: mrshenli

Differential Revision: D28603754

fbshipit-source-id: 88533e65d72d1eb806dc41bec8d55def5082e290
2021-06-04 06:53:42 -07:00
c0acffa6ef Ensure async_execution works with CUDAFuture (#56863)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56863

ghstack-source-id: 130583772

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D27985908

fbshipit-source-id: 09469ee0eb70b8e3b61f6278f2c881ce7f5244d6
2021-06-04 06:53:40 -07:00
7bcd8f94a5 Avoid re-doing CUDA stream sync in OwnerRRef (#57355)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57355

We had started fixing OwnerRRef to make it CUDA-compatible, by properly synchronizing CUDA streams/events where appropriate. However, since we started using CUDAFuture (or, well, ivalue::Future nowadays, after they got merged) this is all done automatically for us, hence we can undo these "fixes" as they're now duplicated.
ghstack-source-id: 130583771

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28118182

fbshipit-source-id: 4b1dd9fe88c23802b1df573941d1b73af48bb67b
2021-06-04 06:52:33 -07:00
d009c9c129 [RPC Framework] Separate initialize_from_module_rref method out of RemoteModule constructor (#59292)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59292

#Closes: https://github.com/pytorch/pytorch/issues/58274

Create an alternate initialization method, and also create a few util functions to avoid duplicate code.
ghstack-source-id: 130575373

Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_from_module_rref

Reviewed By: vipannalla

Differential Revision: D28825895

fbshipit-source-id: 87803e94d9b50f94e1b7b2c99b9bf1634e20d065
2021-06-04 03:43:36 -07:00
c3bf42e0d8 Fix symbolic derivative of hardswish (#59405)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59405

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28879698

Pulled By: bertmaher

fbshipit-source-id: 2f2d9836bf592b18ed9a19aab4f5967e653b5898
2021-06-03 23:12:18 -07:00
9ac954789d [nnc] Add hardsigmoid (#59069)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59069

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28738166

Pulled By: bertmaher

fbshipit-source-id: d9f5b87ef1f2323a3631add79c2670ce794f911e
2021-06-03 23:10:36 -07:00
c717ce6771 [NNC] Add python bindings for Compute2 (#59350)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59350

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28854806

Pulled By: huiguoo

fbshipit-source-id: b9091f9183249257aedc1eafb1992e0faf5dea82
2021-06-03 22:37:08 -07:00
db90533b9e Make JIT not assume that the device is CUDA. (#54238)
Summary:
Decouple the JIT argument spec and shape analysis from CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54238

Reviewed By: ngimel

Differential Revision: D28802085

Pulled By: Krovatkin

fbshipit-source-id: 4068c9460cdec2d80733f001ca90ea3f5e6d3a7e
2021-06-03 22:21:27 -07:00
7c4ac9e3ee [NNC] Fix loopnest.cache_accesses for reduce ops (fixed #59002) (#59136)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59136

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28768598

Pulled By: huiguoo

fbshipit-source-id: 99ab8430bc0ba395e2a041b03a7761de335ddda5
2021-06-03 21:04:14 -07:00
d9d7d5e24a [torch] Remove migration warning for ScriptDict
Summary:
This commit removes the warning that suggests that users script their
dictionaries before passing them into TorchScript code. The ScriptDict feature
is not fully ready, so it does not make sense to recommend this yet.

Test Plan:
Sandcastle.

In addition, the PyPER test broken by the original diff passes:

```
buck test mode/opt //caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_lwt -- --exact 'caffe2/torch/fb/training_toolkit/backend/tests:test_model_materializer_full_sync_lwt - caffe2.torch.fb.training_toolkit.backend.tests.test_model_materializer_full_sync_lwt.ModelMaterializerFullSyncLwtTest: test_materialization_determinism_cpu' --run-disabled
```

Differential Revision: D28891351

fbshipit-source-id: 2a3a00cde935d670fb1dc7fd8c709ae9c2ad8cdc
2021-06-03 20:55:40 -07:00
6627c00e63 [Static Runtime] Fix bug in quantized::linear wrapper (#59407)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59407

Reviewed By: ajyu

Differential Revision: D28881307

fbshipit-source-id: 46c169f783cf05c585871c2e074d52255116b9c3
2021-06-03 19:18:04 -07:00
7d38901e7c [NNC] Fix BufHandle arguments in loopnest python API (#59348)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59348

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28854233

Pulled By: huiguoo

fbshipit-source-id: 2484249992903ed7af0de504ac27f96f30e993d1
2021-06-03 17:34:17 -07:00
77de640f4b [torch distributed] Implementing reduce_scatter_base (#57567)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57567

Support flattened reduce_scatter.

Test Plan:
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/torch/lib/c10d:ProcessGroupNCCLTest
buck test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:c10d

Reviewed By: zhaojuanmao

Differential Revision: D27876281

fbshipit-source-id: 58e2edfb1baff5cdc083dbaaba9f19502ef0b298
2021-06-03 17:17:53 -07:00
46d724c919 Revert D28859795: [nnc] Enable CPU fusion inside Facebook, take 4
Test Plan: revert-hammer

Differential Revision:
D28859795 (6baa66ece9)

Original commit changeset: 826801db24e8

fbshipit-source-id: c85a0fc7e88c95af939d5c0f50c0c8878e1174d3
2021-06-03 16:29:51 -07:00
526445dfa8 Update reviewer list for the distributed package (#59417)
Summary:
Added cbalioglu to the default reviewer list of the distributed package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59417

Reviewed By: mruberry

Differential Revision: D28883997

Pulled By: cbalioglu

fbshipit-source-id: 0ed9638f25bd914b71d96203579507af3b830df4
2021-06-03 15:38:07 -07:00
aa4f27c12a Prefer accurate reciprocal on ARMv8 (#59361)
Summary:
The default NEON-accelerated implementation of reciprocal uses vrecpeq_f32, which yields a Newton-Raphson approximation rather than the actual value.
Use regular NEON-accelerated division for the reciprocal and reciprocal square root operations instead.

This fixes `test_reference_numerics_hard_frac_cpu_float32`, `test_reference_numerics_normal_rsqrt_cpu_float32`, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59361

Reviewed By: mruberry

Differential Revision: D28870456

Pulled By: malfet

fbshipit-source-id: e634b0887cce7efb046ea1fd9b74424e0eceb164
2021-06-03 15:28:36 -07:00
3416b8dd70 Automated submodule update: FBGEMM (#59337)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 9cb33bcfe5

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59337

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: caogao

Differential Revision: D28846199

fbshipit-source-id: b78f087129edef97247d4ceea77cfede0c6800fe
2021-06-03 14:45:32 -07:00
1aa14fcb14 Fix the "tensors to be on the same device" error in HistogramObserver (#59234)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59075

This PR fixes the "tensors to be on the same device" error in `HistogramObserver`.
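
A minimal sketch of the scenario this targets (observer and input both on CUDA; the import path and workload are assumptions for illustration):

```python
import torch
from torch.quantization import HistogramObserver

obs = HistogramObserver().to("cuda")
obs(torch.randn(32, device="cuda"))   # previously tripped the same-device error internally
scale, zero_point = obs.calculate_qparams()
```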

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59234

Reviewed By: jbschlosser

Differential Revision: D28837572

Pulled By: vkuzo

fbshipit-source-id: ff7c3229ced7de2cdd8f76d526f0fd33ac643216
2021-06-03 13:30:56 -07:00
2aa463d931 Support switching RemoteModule between train/eval (#59026)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59026

#Closes: https://github.com/pytorch/pytorch/issues/51480

Enabled the train and eval methods in RemoteModule to call the underlying train/eval methods on the actual nn.Module.
ghstack-source-id: 130421137
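
A hedged sketch of the new calls (assumes RPC is already initialized and a worker named "worker1" exists; module choice is illustrative):

```python
import torch.nn as nn
from torch.distributed.nn.api.remote_module import RemoteModule

remote_linear = RemoteModule("worker1/cpu", nn.Linear, args=(8, 4))
remote_linear.train()   # sets training mode to True on the remote nn.Module
remote_linear.eval()    # sets training mode to False on the remote nn.Module
```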

Test Plan:
Call these two updated methods in test_send_remote_module_over_the_wire in remote_module_test.py. To test correctness: after calling train, the training mode should be set to True; after calling eval, the training mode of the remote module should be set to False.

	Related test output:

    ✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (fb.test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithFork) (23.059)
    ✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithFork) (27.965)
    ✓ Pass: caffe2/test/distributed/rpc:process_group_agent - test_send_remote_module_over_the_wire (test_process_group_agent.ProcessGroupThreeWorkersRemoteModuleTestWithSpawn) (74.481)
    ✓ Pass: caffe2/test/distributed/rpc:thrift_agent - test_send_remote_module_over_the_wire (fb.test_thrift_agent.ThriftThreeWorkersRemoteModuleTestWithSpawn) (77.243)
    ✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (fb.test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithFork) (58.644)
    ✓ Pass: caffe2/test/distributed/rpc:tensorpipe_agent - test_send_remote_module_over_the_wire (test_tensorpipe_agent.TensorPipeThreeWorkersRemoteModuleTestWithSpawn) (90.229)

Reviewed By: pritamdamania87, SciPioneer

Differential Revision: D28721078

fbshipit-source-id: aa45c1e5755f583200144ecfec3704f28221972c
2021-06-03 13:13:58 -07:00
c1c9774acb Revert D28538996: Enable torch::deploy GPU tests in sandcastle
Test Plan: revert-hammer

Differential Revision:
D28538996 (4b74c848aa)

Original commit changeset: 1a6ccea07cfe

fbshipit-source-id: 6e01a96d3746d3ca3e4e792a7b623ef960c9d2d6
2021-06-03 13:00:25 -07:00
e66015dadf Add build support for kineto + rocm (#58401)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58399

CMake changes to allow kineto to build with rocm support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58401

Reviewed By: mruberry

Differential Revision: D28479807

Pulled By: walterddr

fbshipit-source-id: fc01f05b2a5592ee1d1dbd71d2d4f7aec1bd74f7
2021-06-03 12:15:20 -07:00
332b01e93f [DDP] log usage of torch_distributed_debug (#59351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59351

Logging PT distributed debug level to track usage internally.
ghstack-source-id: 130443122

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28854914

fbshipit-source-id: a8e85ca4a3c9ac2f18d13190e87c0ebc4a8e7ea2
2021-06-03 11:49:23 -07:00
6408cbd918 Migrate renorm to ATen (CPU and CUDA) (#59250)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/59108, closes https://github.com/pytorch/pytorch/issues/24754, closes https://github.com/pytorch/pytorch/issues/24616

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns  the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
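
A small eager-mode sketch mirroring the description above (not the actual kernel; results may differ by the tiny epsilon the kernel adds to the norm):

```python
import torch

x = torch.arange(12., dtype=torch.float32).reshape(3, 4)
y = torch.renorm(x, p=2, dim=0, maxnorm=1.0)

# Norm per sub-tensor along dim 0, turned into a normalization factor,
# then applied with a broadcasted multiply.
norms = torch.linalg.vector_norm(x, ord=2, dim=1, keepdim=True)
factor = torch.where(norms > 1.0, 1.0 / norms, torch.ones_like(norms))
print(torch.allclose(y, x * factor, rtol=1e-5))  # True, up to small eps differences
```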

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59250

Reviewed By: mruberry

Differential Revision: D28820359

Pulled By: ngimel

fbshipit-source-id: 572486adabac8135d52a9b8700f9d145c2a4ed45
2021-06-03 11:43:27 -07:00
2ad4b8e58c Extract c10d Store tests to dedicated test file (#59271)
Summary:
Partially addresses https://github.com/pytorch/pytorch/issues/55340

**Overview**
This factors out `FileStoreTest`, `HashStoreTest`, `PrefixFileStoreTest`, `TCPStoreTest`, `PrefixTCPStoreTest`, `PythonStoreTest`, `RendezvousTest`, `RendezvousEnvTest`, `RendezvousFileTest`, and `RendezvousTCPTest` from `test_c10d_common.py` to a new file `test_store.py`.

Additionally, unused import/initialization statements are removed from `test_c10d_common.py`, and the minimal set of import/initialization statements are used for `test_store.py`.

Also, this changes `.jenkins/pytorch/multigpu-test.sh`, `.jenkins/pytorch/win-test-helpers/test_distributed.bat`, and `test/run_test.py` to include the new `test_store.py`.

**Testing**
All commands shown are run on an AI AWS cluster.

I check the Store tests:
```
python test/distributed/test_store.py
```

I also check `test_c10d_common.py` since it is the source of the refactored code. In addition, I check `test_c10d_nccl.py` and `test_c10d_gloo.py` since they import from `test_c10d_common.py`; those two should be the only test files depending on `test_c10d_common.py`.
```
python test/distributed/test_c10d_common.py
python test/distributed/test_c10d_nccl.py
python test/distributed/test_c10d_gloo.py
```
`test_c10d_gloo.py` produces warnings about how using sparse tensors in TorchScript is experimental, but the warnings do not result from this PR's changes.

**Testing Issues** (To Be Revisited)
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py
```
Running the above command fails three tests (written as `[Test]`: `[Error]`):
- `ProcessGroupGlooWrapperTest.test_collective_hang`: `RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:598] Connection closed by peer [10.200.24.101]:15580`
- `CommTest.test_broadcast_coalesced_gloo_cuda`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
- `CommTest.test_sequence_num_incremented_gloo_default`: `RuntimeError: cuda runtime error (3) : initialization error at ../aten/src/THC/THCGeneral.cpp:54`
However, running each of the following yields no errors:
```
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_collective_hang
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_broadcast_coalesced_gloo_cuda
WORLD_SIZE=4 BACKEND=gloo gpurun pytest test/distributed/test_c10d_gloo.py -k test_sequence_num_incremented_gloo_default
```
This suggests the existence of some inadvertent state dependency between tests (e.g. improper cleanup). I have not explored this further yet. In particular, I do not have a solid understanding of the tests to be able to explain why using `pytest` and `gpurun` induces the failure (since notably, running the `.py` directly shows no issue).

Similarly, running the following yields 47 errors:
```
WORLD_SIZE=4 BACKEND=nccl gpurun pytest test/distributed/test_c10d_nccl.py
```
The errors seem to all be simply complaining about the usage of `fork()` instead of `spawn()` for CUDA multiprocessing. Though, most of the tests in `test_c10d_nccl.py` ask for at least 2 CUDA devices, so I think that the `gpurun` is warranted (assuming that the test file does not need to be run partially on different machines).

Both `test_c10d_common.py` and `test_store.py` work fine with `pytest`.

**Other Notes**
I noticed that `torch.distributed` is imported both as `dist` and as `c10d` and that `c10d` is used throughout the Store tests. I was curious if this is intentional (as opposed to using `dist` to refer to `torch.distributed`). Also, the original [issue](https://github.com/pytorch/pytorch/issues/55340) suggests that the Store tests do not use multiprocessing, but I saw that `torch.multiprocessing` is still used in `TCPStoreTest`.

The links for the Store files in the `CONTRIBUTING.md` [file](https://github.com/pytorch/pytorch/blob/master/torch/distributed/CONTRIBUTING.md) are broken. This can fixed in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59271

Reviewed By: jbschlosser, mrshenli

Differential Revision: D28856920

Pulled By: andwgu

fbshipit-source-id: 630950cba18d34e6b5de661f5a748f2cddc1b446
2021-06-03 10:53:33 -07:00
f05d5bec48 Preserve PyObject even when it goes dead (#56017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56017

Fixes #55686

This patch is seemingly straightforward but some of the changes are very
subtle.  For the general algorithmic approach, please first read the
quoted issue.  Based on the algorithm, there are some fairly
straightforward changes:

- New boolean on TensorImpl tracking if we own the pyobj or not
- PythonHooks virtual interface for requesting deallocation of pyobj
  when TensorImpl is being released and we own its pyobj, and
  implementation of the hooks in python_tensor.cpp
- Modification of THPVariable to MaybeOwned its C++ tensor, directly
  using swolchok's nice new class

And then, there is python_variable.cpp.  Some of the changes follow the
general algorithmic approach:

- THPVariable_NewWithVar is simply adjusted to handle MaybeOwned and
  initializes as owend (like before)
- THPVariable_Wrap adds the logic for reverting ownership back to
  PyObject when we take out an owning reference to the Python object
- THPVariable_dealloc attempts to resurrect the Python object if
  the C++ tensor is live, and otherwise does the same old implementation
  as before
- THPVariable_tryResurrect implements the resurrection logic.  It is
  modeled after CPython code so read the cited logic and see if
  it is faithfully replicated
- THPVariable_clear is slightly updated for MaybeOwned and also to
  preserve the invariant that if owns_pyobj, then pyobj_ is not null.
  This change is slightly dodgy: the previous implementation has a
  comment mentioning that the pyobj nulling is required to ensure we
  don't try to reuse the dead pyobj.  I don't think, in this new world,
  this is possible, because the invariant says that the pyobj only
  dies if the C++ object is dead too.  But I still unset the field
  for safety.

And then... there is THPVariableMetaType.  colesbury explained in the
issue why this is necessary: when destructing an object in Python, you
start off by running the tp_dealloc of the subclass before moving up
to the parent class (much in the same way C++ destructors work).  The
deallocation process for a vanilla Python-defined class does irreparable
harm to the PyObject instance (e.g., the finalizers get run) making it
no longer valid attempt to resurrect later in the tp_dealloc chain.
(BTW, the fact that objects can resurrect but in an invalid state is
one of the reasons why it's so frickin' hard to write correct __del__
implementations).  So we need to make sure that we actually override
the tp_dealloc of the bottom most *subclass* of Tensor to make sure
we attempt a resurrection before we start finalizing.  To do this,
we need to define a metaclass for Tensor that can override tp_dealloc
whenever we create a new subclass of Tensor.  By the way, it was totally
not documented how to create metaclasses in the C++ API, and it took
a good bit of trial error to figure it out (and the answer is now
immortalized in https://stackoverflow.com/q/67077317/23845 -- the things
that I got wrong in earlier versions of the PR included setting
tp_basicsize incorrectly, incorrectly setting Py_TPFLAGS_HAVE_GC on
the metaclass--you want to leave it unset so that it inherits, and
determining that tp_init is what actually gets called when you construct
a class, not tp_call as another not-to-be-named StackOverflow question
suggests).

Aside: Ordinarily, adding a metaclass to a class is a user visible
change, as it means that it is no longer valid to mixin another class
with a different metaclass.  However, because _C._TensorBase is a C
extension object, it will typically conflict with most other
metaclasses, so this is not BC breaking.

The desired new behavior of a subclass tp_dealloc is to first test if
we should resurrect, and otherwise do the same old behavior.  In an
initial implementation of this patch, I implemented this by saving the
original tp_dealloc (which references subtype_dealloc, the "standard"
dealloc for all Python defined classes) and invoking it.  However, this
results in an infinite loop, as it attempts to call the dealloc function
of the base type, but incorrectly chooses subclass type (because it is
not a subtype_dealloc, as we have overridden it; see
b38601d496/Objects/typeobject.c (L1261) )
So, with great reluctance, I must duplicate the behavior of
subtype_dealloc in our implementation.  Note that this is not entirely
unheard of in Python binding code; for example, Cython
c25c3ccc4b/Cython/Compiler/ModuleNode.py (L1560)
also does similar things.  This logic makes up the bulk of
THPVariable_subclass_dealloc

To review this, you should pull up the CPython copy of subtype_dealloc
b38601d496/Objects/typeobject.c (L1230)
and verify that I have specialized the implementation for our case
appropriately.  Among the simplifications I made:

- I assume PyType_IS_GC, because I assume that Tensor subclasses are
  only ever done in Python and those classes are always subject to GC.
  (BTW, yes!  This means I have broken anyone who has extend PyTorch
  tensor from C API directly.  I'm going to guess no one has actually
  done this.)

- I don't bother walking up the type bases to find the parent dealloc;
  I know it is always THPVariable_dealloc.  Similarly, I can get rid
  of some parent type tests based on knowledge of how
  THPVariable_dealloc is defined

- The CPython version calls some private APIs which I can't call, so
  I use the public PyObject_GC_UnTrack APIs.

- I don't allow the finalizer of a Tensor to change its type (but
  more on this shortly)

One alternative I discussed with colesbury was instead of copy pasting
the subtype_dealloc, we could transmute the type of the object that was
dying to turn it into a different object whose tp_dealloc is
subtype_dealloc, so the stock subtype_dealloc would then be applicable.
We decided this would be kind of weird and didn't do it that way.

TODO:

- More code comments

- Figure out how not to increase the size of TensorImpl with the new
  bool field

- Add some torture tests for the THPVariable_subclass_dealloc, e.g.,
  involving subclasses of Tensors that do strange things with finalizers

- Benchmark the impact of taking the GIL to release C++ side tensors
  (e.g., from autograd)

- Benchmark the impact of adding a new metaclass to Tensor (probably
  will be done by separating out the metaclass change into its own
  change)

- Benchmark the impact of changing THPVariable to conditionally own
  Tensor (as opposed to unconditionally owning it, as before)

- Add tests that this actually indeed preserves the Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D27765125

Pulled By: ezyang

fbshipit-source-id: 857f14bdcca2900727412aff4c2e2d7f0af1415a
2021-06-03 10:50:36 -07:00
fa72d9a379 [quant] Fix use after free (#59267)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59267

fixes: https://github.com/pytorch/pytorch/issues/58868

Test Plan: Imported from OSS

Reviewed By: jbschlosser, supriyar

Differential Revision: D28811529

fbshipit-source-id: f27018ae0a02d1dd229d1ff7638f130c38a00986
2021-06-03 10:35:48 -07:00
6baa66ece9 [nnc] Enable CPU fusion inside Facebook, take 4
Summary:
Fixed the awkward Configerator initialization issue that broke some
tests. Trying again.

Test Plan: predictor comparisons

Reviewed By: ZolotukhinM

Differential Revision: D28859795

fbshipit-source-id: 826801db24e86b1c3594a86e3ac32f0a84c496f7
2021-06-03 09:33:13 -07:00
57e452ff5d Revert D28856713: [PyTorch Edge] Add proper error message when loading incompatible model with lite interpreter
Test Plan: revert-hammer

Differential Revision:
D28856713

Original commit changeset: c3f9a3b64459

fbshipit-source-id: cc6ba8ec1047f29e62061107a2e5f245981b8039
2021-06-03 08:40:28 -07:00
6620d7d688 OpInfo: norm (#59259)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

EDIT:
~~Test takes a whopping 4 mins to run 😓~~ (Filtered tests also included linalg norm)

Newly added tests take around 2 mins.
```
==================================================== 193 passed, 224 skipped, 27224 deselected, 5 warnings in 138.87s (0:02:18) ====================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59259

Reviewed By: jbschlosser

Differential Revision: D28833962

Pulled By: mruberry

fbshipit-source-id: 40b24d6a8cb8b7d231b2f6b34b87cee4f136c5f9
2021-06-03 08:25:58 -07:00
4b74c848aa Enable torch::deploy GPU tests in sandcastle
Summary:
Added GPU tests in previous diffs but had to disable them as they only
pass locally on devgpu, but not in sandcastle.

note: local testing requires mode/dev-nosan or else ASAN interferes with CUDA.

Test Plan: Verify tests passing in sandcastle.

Reviewed By: malfet

Differential Revision: D28538996

fbshipit-source-id: 1a6ccea07cfe2f150eee068594e636add620cd91
2021-06-03 08:10:19 -07:00
f1ce7f4b7f Update PyTorch version to 0.10.0a (#59345)
Summary:
Also fixes `TestProducerVersion` by removing the assumption that the major and minor versions are single digits
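
A hypothetical illustration of the version-handling point above (not the actual test code): derive the expected "major.minor" prefix by splitting the version string rather than slicing a fixed number of characters, which breaks for versions like `0.10.0a0`.

```python
import torch

major, minor = torch.__version__.split(".")[:2]
expected_prefix = f"{major}.{minor}"  # e.g. "0.10" -- works for multi-digit minors too
```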

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59345

Reviewed By: robieta

Differential Revision: D28853720

Pulled By: malfet

fbshipit-source-id: 4b6d03c6b0c9d652a5aef792aaa84eaa522d10e8
2021-06-03 07:55:44 -07:00
c829095590 Revert D28802058: [pytorch][PR] add dispatch for bitwise_and
Test Plan: revert-hammer

Differential Revision:
D28802058 (874f287c52)

Original commit changeset: cccbbff46df5

fbshipit-source-id: 1675fe42966278aa446496445342d6d8a92ecea0
2021-06-03 07:38:13 -07:00
d095ec75a1 Forward AD formulas batch 2 (#57863)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57863

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28387763

Pulled By: albanD

fbshipit-source-id: e1b60ab728bb05b9e3323ee0dc7e401aaf5b8817
2021-06-03 07:33:04 -07:00
add291cf66 [JIT] Add a phase to perform inplace<->functional conversion for activation operators (#57477)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57477

Currently the conversion only deals with activation operators. The legality check is somewhat strict for now.

Test Plan:
```
python test/test_jit.py -k test_functional_to_inplace_activation
python test/test_jit.py -k test_inplace_to_functional_activation
```

Reviewed By: mrshenli

Differential Revision: D28155153

Pulled By: desertfire

fbshipit-source-id: df092830c4dff3ce9578ff76285eb7a566b7d81b
2021-06-03 06:43:23 -07:00
91b7bcf4c0 [PyTorch Edge] Add proper error message when loading incompatible model with lite interpreter (#59354)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59354

Check if the model has `bytecode.pkl` and provide a proper error message before loading the model. Tested by loading a model.pt and a model.ptl.
```
>>> from torch.jit.mobile import _load_for_lite_interpreter
>>> _load_for_lite_interpreter("/Users/chenlai/Documents/pytorch/data/mobilenet_v2.pt")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chenlai/pytorch/torch/jit/mobile/__init__.py", line 48, in _load_for_lite_interpreter
    cpp_module = torch._C._load_for_lite_interpreter(f, map_location)  # type: ignore[attr-defined]
RuntimeError: The model is not generated from the api _save_for_lite_interpreter. Please regenerate the module by scripted_module._save_for_lite_interpreter('model.ptl'). Refer to https://pytorch.org/tutorials/prototype/lite_interpreter.html for more details.
```

iOS:
![image](https://user-images.githubusercontent.com/16430979/120593077-cbe23180-c3f3-11eb-9745-ee2b04b78c6c.png)

Android:
![image](https://user-images.githubusercontent.com/16430979/120594357-af46f900-c3f5-11eb-9fb0-500a038148e3.png)

Differential Revision:
D28856713
D28856713

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: cccclai

fbshipit-source-id: c3f9a3b64459dda6811d296371c8a2eaf22f8b20
2021-06-03 03:18:14 -07:00
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
f914ab193e Use irange in a few places in torch/csrc (#55100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55100

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27447708

fbshipit-source-id: 4f21133bd76f29d73a51befcae649ab55637b36e
2021-06-03 00:58:51 -07:00
18642e664a [quant][graphmode][fx][refactor] Split quantize.py to prepare.py and convert.py (#59353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59353

Next: remove Quantizer class

Test Plan: Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28856277

fbshipit-source-id: 25f5502be387dbe9706780f667501b46b82789a5
2021-06-02 23:52:39 -07:00
8b4784a9c6 Revert D28821216: [pytorch][PR] Migrate _th_std_var to ATen
Test Plan: revert-hammer

Differential Revision:
D28821216 (1fb5cf5a71)

Original commit changeset: f35992c21f08

fbshipit-source-id: d068a62b7fa941188591a74dcb5d1a24719af7b3
2021-06-02 21:18:26 -07:00
eb55b086b7 [DDP] Log some python-side errors (#59284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59284

Logs a few python-side errors to DDP logging.

TODO: Most python errors actually have to do with user input correctness, so they throw before reducer is constructed and thus there is no logger. For this case, should we allow `logger` to be created optionally without a reducer, just for the purpose of logging errors, so that we can gain insight into these errors in scuba?
ghstack-source-id: 130412973

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28820290

fbshipit-source-id: 610e5dba885b173c52351f7ab25c923edce639e0
2021-06-02 19:49:26 -07:00
79aeca0b00 [DDP] Log when errors happen (#59281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59281

Adds the ability to log when the reducer/DDP encounters an error. We add the fields "has_error" and "error" to indicate that an error has
occurred in this iteration; the other fields (performance stats) are not
guaranteed to be updated.

Errors encountered in python-side DDP will be added in the next diff.
ghstack-source-id: 130412974

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28652717

fbshipit-source-id: 9772abc2647a92dac6a325da6976ef5eb877c589
2021-06-02 19:48:26 -07:00
d2e03051e0 Fix fetcher continuing after StopIteration (#59313)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59312

cc VitalyFedyunin dzhulgakov

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59313

Reviewed By: jbschlosser

Differential Revision: D28837762

Pulled By: dzhulgakov

fbshipit-source-id: 95cc29359aaba0f24ca169c5495ab5c6c53a0dce
2021-06-02 19:14:25 -07:00
1fb5cf5a71 Migrate _th_std_var to ATen (#59258)
Summary:
Ref https://github.com/pytorch/pytorch/issues/49421

This migrates `std`/`var`'s special case all-reduction from TH to ATen. Using the benchmark from gh-43858 that was used to justify keeping the TH version, I find this PR has similar (slightly better) single-threaded performance. And unlike the TH version, this is multi-threaded and so much faster for large tensors.

TH Results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       3.6   |       3.8    |     8.2  |      1.2
      80       |       3.7   |       3.8    |     8.4  |      1.2
      800      |       4.2   |       4.3    |     8.7  |      1.2
      8000     |       9.0   |       9.1    |    11.2  |      1.5
      80000    |      58.3   |      59.0    |    30.6  |      4.2
      800000   |     546.9   |     546.9    |   183.4  |     31.3
      8000000  |    5729.7   |    5701.0    |  6165.4  |    484.1
```

ATen results:
```
[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
1 threads: ---------------------------------------------------------
      8        |       4.0   |       4.0    |     8.7  |      1.2
      80       |       3.6   |       3.8    |     9.0  |      1.2
      800      |       4.1   |       4.3    |     8.9  |      1.2
      8000     |       8.9   |       9.2    |    10.6  |      1.5
      80000    |      57.0   |      57.4    |    28.8  |      4.3
      800000   |     526.9   |     526.9    |   178.3  |     30.2
      8000000  |    5568.1   |    5560.6    |  6042.1  |    453.2

[----------------------------- Index ------------------------------]
               |  torch_var  |  torch_var0  |  stdfn   |  torch_sum0
8 threads: ---------------------------------------------------------
      8        |      3.9    |      3.8     |     9.1  |      1.2
      80       |      3.8    |      3.9     |     8.8  |      1.2
      800      |      4.2    |      4.3     |     8.9  |      1.3
      8000     |      9.0    |      9.2     |    10.4  |      1.5
      80000    |     26.0    |     26.8     |    26.4  |      4.4
      800000   |     92.9    |     87.3     |    72.1  |     22.4
      8000000  |    793.5    |    791.8     |  5334.8  |    115.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59258

Reviewed By: mruberry

Differential Revision: D28821216

Pulled By: ngimel

fbshipit-source-id: f35992c21f08a0a8878053680dc0ca7a8facd155
2021-06-02 19:01:39 -07:00
c03cae49fc [DDP] Remove unused initialize_buckets (#59066)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59066

Per title
ghstack-source-id: 130338812

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28734666

fbshipit-source-id: 89ca7f8e625c4068ba0ed9800be2619e469ae515
2021-06-02 17:20:22 -07:00
2a78e896a0 [DDP] use work.result() in _check_global_requires_backward_grad_sync (#59065)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59065

Cleaner to use work.result() instead of sending back the tensor from
this function.
ghstack-source-id: 130338813

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28551203

fbshipit-source-id: d871fed78be91f0647687ea9d6fc86e576dc53a6
2021-06-02 17:19:07 -07:00
517ea26eee [deploy] Make load_library a no-op inside a package (#58933)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58933

**Summary**
This commit makes load_library calls no-ops inside packages run with
deploy. Libraries containing custom C++ operators and classes are statically linked in C++
and don't need to be loaded. This commit takes advantage of the fact that sys.executable is
set to torch_deploy in deploy and uses that to exit early in load_library if
the program is running inside deploy.
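
A conceptual sketch of the early-exit check described above (simplified; the real logic lives inside torch's `load_library` implementation):

```python
import ctypes
import sys

def load_library_sketch(path):
    # Inside torch::deploy, custom-op/class libraries are statically linked,
    # so loading them again is unnecessary -- make this call a no-op.
    if "torch_deploy" in sys.executable:
        return
    ctypes.CDLL(path)  # stand-in for the normal dynamic-loading path
```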

**Test Plan**
This commit adds a test to `generate_examples`/`test_deploy` that
packages and runs a function that calls `load_library`. The library
doesn't exist, but that's okay because the function should be a no-op
anyway.

Test Plan: Imported from OSS

Reviewed By: Lilyjjo

Differential Revision: D28687159

Pulled By: SplitInfinity

fbshipit-source-id: 4a61fc636698e44f204334e338c5ce35257e7ae2
2021-06-02 17:01:31 -07:00
dfe85d6fd7 Revert D28840199: [pytorch][PR] Update version to 1.10
Test Plan: revert-hammer

Differential Revision:
D28840199 (3453aa44c1)

Original commit changeset: acc5a93e12a3

fbshipit-source-id: a41eb7c882fe0bf8f9a35ef180e99a7e72f6857d
2021-06-02 16:25:51 -07:00
2ce23136d0 Use irange in torch/csrc utils (#55556)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55556

Test Plan: Sandcastle

Reviewed By: ezyang

Differential Revision: D27625936

fbshipit-source-id: 79065438f582a6f5fe6f1f796b6984767605197e
2021-06-02 15:47:00 -07:00
e6c8e9497c Small fix type hints in mobile optimizer (#59282)
Summary:
Adjusts type hints for optimize_for_mobile to be consistent with the default. Right now, using optimize_for_mobile and only passing a script_module gives a type error complaining that preserved_methods can't be None.
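
An illustrative (simplified, not the exact upstream signature) version of the annotation fix: since the default is `None`, the parameter should be typed `Optional[List[str]]` rather than `List[str]`.

```python
from typing import List, Optional

import torch

def optimize_for_mobile_stub(
    script_module: torch.jit.ScriptModule,
    preserved_methods: Optional[List[str]] = None,  # was effectively List[str] before
) -> torch.jit.ScriptModule:
    ...
```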

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59282

Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Open source tests ran the lints. Internal CI should be enough here.

Reviewed By: jbschlosser

Differential Revision: D28838159

Pulled By: JacobSzwejbka

fbshipit-source-id: dd1e9aff00a759f71d32025d8c5b01e612c869a5
2021-06-02 15:32:16 -07:00
318c858eb5 [fx2trt] Organize converters and add unittests (#59261)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59261

Split converters to different files instead of putting them in a single file.

Reviewed By: jackm321

Differential Revision: D28613989

fbshipit-source-id: f25ca3732c457af51a07ef466915a4a08bd45e6e
2021-06-02 15:22:15 -07:00
0eafef5031 Fix internal assert location in custom Function binding (#59301)
Summary:
For facebook employees, this fix some internal failures from https://www.internalfb.com/tasks/?t=92100671

This was not a problem before https://github.com/pytorch/pytorch/pull/58271 because these cycles used to just be leaked (so nothing was cleared/dealloced).
Now that we properly clean up these cycles, we have to fix the assert in the clear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59301

Reviewed By: jbschlosser

Differential Revision: D28841564

Pulled By: albanD

fbshipit-source-id: e2ec51f6abf44c4e3a83c293e90352295a43ba37
2021-06-02 15:09:51 -07:00
c3745dc580 Small change for torch.distributed launcher (#59152)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59152

Small change for https://fb.workplace.com/groups/319878845696681

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D28773682

Pulled By: H-Huang

fbshipit-source-id: acf82273e8622b7ffd3088d8d766bdf49273754c
2021-06-02 15:05:41 -07:00
3453aa44c1 Update version to 1.10 (#59325)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59325

Reviewed By: jbschlosser, seemethere

Differential Revision: D28840199

Pulled By: malfet

fbshipit-source-id: acc5a93e12a3db47d6103ea064bec9e40320f708
2021-06-02 15:00:33 -07:00
7ee68363a8 Add new rpc.barrier API (#53423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53423

closes #40166

This change exposes a new API, rpc.barrier() which blocks the main processes of all workers running RPC until the whole group completes this function. Optionally rpc.barrier can take in a set of worker_names and only synchronize across those worker names.

Example:
```python
import os
import torch.multiprocessing as mp
import torch.distributed.rpc as rpc
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "5678"

world_size = 4
odd_num_workers = [f"worker{i}" for i in range(world_size) if i % 2]
even_num_workers = [f"worker{i}" for i in range(world_size) if not i % 2]

def worker(i):
    print(i)
    rpc.init_rpc(f"worker{i}", rank=i, world_size=world_size)
    if i % 2:
        print(f"start barrier {i}")
        rpc.barrier(set(odd_num_workers))
    else:
        print(f"start barrier {i}")
        rpc.barrier(set(even_num_workers))
    rpc.shutdown()
    print(f"shutdown{i}")

if __name__ == '__main__':
    with mp.Pool(processes=world_size) as pool:
        pool.map(worker, range(world_size))
```

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D27737145

Pulled By: H-Huang

fbshipit-source-id: 369196bc62446f506d1fb6a3fa5bebcb0b09da9f
2021-06-02 14:20:16 -07:00
1765f51618 [iOS GPU] [BE] use channel-last to transform the weights (#59113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59113

Manually permuting the weights is slower than calling `at::contiguous()`
ghstack-source-id: 130374487

Test Plan: CI

Reviewed By: SS-JIA

Differential Revision: D28762278

fbshipit-source-id: 1dde3ef82018bc2507d0ca5132b1ee97dc99787f
2021-06-02 14:02:11 -07:00
1968efa2dd [c10d] Remove verbose log (#59070)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59070

This log is too verbose, especially in the case we call monitored
barrier before every collective as we do in ProcessGroupWrapper.
ghstack-source-id: 130052822

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28738189

fbshipit-source-id: f2899537caa4c13508da31134d5dd0f4fd6a1f3a
2021-06-02 13:50:11 -07:00
7f2e620105 FIX Validates that weights are 2d in embedding (#59314)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59314

Reviewed By: H-Huang

Differential Revision: D28837753

Pulled By: jbschlosser

fbshipit-source-id: 683378244c61b0937c95563f91ef87ab09fd1653
2021-06-02 12:52:21 -07:00
fb709a8ca5 Build with USE_GLOO_WITH_OPENSSL=1 (#59274) (#59323)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59323

Reviewed By: jbschlosser

Differential Revision: D28839920

Pulled By: malfet

fbshipit-source-id: 63cffa6fe25cf354966354e5dd5490ba6e5b3d11
2021-06-02 12:51:00 -07:00
f7097b0c0b Make unary tests runnable if SCIPY is not installed (#59304)
Summary:
By adding `if TEST_SCIPY else _NOTHING` to special.i1 and special.i1e

Discovered while running tests on M1
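
A hedged sketch of the guard pattern described above (names approximate the test-suite helpers, not exact code):

```python
try:
    from scipy import special as scipy_special
    TEST_SCIPY = True
except ImportError:
    scipy_special = None
    TEST_SCIPY = False

_NOTHING = object()  # sentinel used when the SciPy reference impl is unavailable

ref_i1 = scipy_special.i1 if TEST_SCIPY else _NOTHING
ref_i1e = scipy_special.i1e if TEST_SCIPY else _NOTHING
```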

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59304

Reviewed By: jbschlosser

Differential Revision: D28835693

Pulled By: malfet

fbshipit-source-id: e4fde6584da29fa43bc6da75eebe560512754ed0
2021-06-02 12:47:30 -07:00
eae84f0d5d Fix ONNX forward compatibility (#59327)
Summary:
Fixes the `onnx.utils.polish_model` not-found exception when executing with onnx-1.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59327

Reviewed By: H-Huang

Differential Revision: D28840563

Pulled By: malfet

fbshipit-source-id: 403a29a88e7dee8b3414602b9fe2b31baf737dce
2021-06-02 12:39:56 -07:00
c22ac14969 [Error-reporting] Set upper boundary on border element (#59311)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59311

The diff sets an upper boundary on the border element when presenting the error message. This is required in order to avoid unnecessary log contamination.

Test Plan: Example of log contamination: https://www.internalfb.com/fblearner/details/276849996/operator/2942475685?tab=try_27021599785797968

Reviewed By: d4l3k

Differential Revision: D28812745

fbshipit-source-id: 4f491b9acc8cc9831d763f185022879bbbfb4c8a
2021-06-02 12:28:54 -07:00
99f2000a99 Migrate nonzero from TH to ATen (CPU) (#59149)
Summary:
Resubmit of https://github.com/pytorch/pytorch/issues/58811, Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm.  As part of this, I wrote a custom `count_nonzero` that gives per-thread counts which is necessary to write the outputs in the right location.

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2150 us |            551 us |
| 128,128,32 | 1250 us |          1020 us |            197 us |
|  64,128,32 |  581 us |           495 us |             99 us |
|  32,128,32 |  292 us |           255 us |             83 us |
|  16,128,32 |  147 us |           126 us |             75 us |
|  8,128,32  |   75 us |            65 us |             65 us |
|  4,128,32  |   39 us |            33 us |             33 us |
|  2,128,32  |   20 us |            18 us |             18 us |
|  1,128,32  |   11 us |             9 us |              9 us |
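
As a side illustration of the iteration-order point above, here is a small runnable check: `nonzero` should return indices in row-major (lexicographic) order regardless of the input's memory format, which is what the linear-iteration guarantee provides.

```python
import torch

x = (torch.rand(2, 3, 4, 5) > 0.5).float()
x_channels_last = x.to(memory_format=torch.channels_last)

idx_contiguous = torch.nonzero(x)
idx_channels_last = torch.nonzero(x_channels_last)
assert torch.equal(idx_contiguous, idx_channels_last)  # same ordering for both layouts
```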

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59149

Reviewed By: mruberry

Differential Revision: D28817466

Pulled By: ngimel

fbshipit-source-id: f08f6c003c339368fd53dabd28e9ada9e59de732
2021-06-02 12:26:29 -07:00
b4d30bb583 [PyTorch] Use expect_contiguous in CPU matmul (#58895)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58895

There doesn't seem to be any reason we can't use expect_contiguous here.
ghstack-source-id: 130283300

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28666399

fbshipit-source-id: b4a9bcb01ff1c30d991765140c8df34c3ac3a89b
2021-06-02 12:04:18 -07:00
0528325b5f [iOS GPU] Raise the minimum OS support version to 11.0 (#59310)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59310

We recently updated the GK to deliver GPU models only to 11.0+ devices. Follow-up diffs will clean up the shader functions written for iOS 10.0.
ghstack-source-id: 130374598

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28805864

fbshipit-source-id: 4cde34ff9fbbe811a69686a0f29b56d69aeefbee
2021-06-02 11:53:45 -07:00
f8f06e7099 [iOS GPU] Fix the OSS macos build (#59102)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59102

ghstack-source-id: 130374334

Test Plan:
- On the OSS side
    - CI
    - `USE_PYTORCH_METAL=ON python setup.py install --cmake`

Reviewed By: IvanKobzarev

Differential Revision: D28757412

fbshipit-source-id: 2efea9dfe7361a73c02d1ca5fbf587835d39d325
2021-06-02 11:47:11 -07:00
874f287c52 add dispatch for bitwise_and (#59125)
Summary:
ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59125

Reviewed By: ngimel

Differential Revision: D28802058

Pulled By: ezyang

fbshipit-source-id: cccbbff46df552235072fa38fea1c19b068991ea
2021-06-02 11:42:49 -07:00
484d53f4a0 [torch][JIT] Warn only once when using unscripted dictionary (#59287)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59287

D27211605 added a warning in `toIValue` that warns users to script their
dictionaries before passing them to TorchScript functions in order to get some
performance benefits and reference semantics. However, this warning is emitted
every time `toIValue` is called (e.g. when a dictionary is passed to
TorchScript function), which can lead to noisy log output. This diff changes
it to use `TORCH_WARN_ONCE` instead.

Test Plan: Sandcastle, OSS CI.

Reviewed By: hyuen

Differential Revision: D28824468

fbshipit-source-id: e651eade4380abaf77c6c8a81ec4e565b0c2c714
2021-06-02 11:41:37 -07:00
82052b0a76 [vulkan] Remove constant duplication for Vulkan optimize_for_mobile (#59276)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59276

Test Plan: Imported from OSS

Reviewed By: cccclai, ngimel

Differential Revision: D28814072

Pulled By: IvanKobzarev

fbshipit-source-id: d5cfd1352a2e07cdd4708d19fe4320444521db78
2021-06-02 11:38:18 -07:00
3ec0904718 docs: Add note about nightly versions bump (#59324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59324

Also updates section on pinning pytorch/builder with an example

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28840049

Pulled By: seemethere

fbshipit-source-id: e5d6722713680e969893d9df97ec269fc9c00411
2021-06-02 11:29:41 -07:00
5386f6935a avg_pool3d: port to structured (#59083)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59083

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28802620

Pulled By: ezyang

fbshipit-source-id: 1e890af3c37912447198aa2f20914b99decda8b2
2021-06-02 11:29:39 -07:00
5dc426a6f6 avg_pool2d_backward: Port to structured (#59082)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59082

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28802621

Pulled By: ezyang

fbshipit-source-id: 15b8ba562eee132ef8390a7de520bdd8e15d0f86
2021-06-02 11:28:25 -07:00
eb1adc4c5e cmake: Add USE_GLOO_WITH_OPENSSL to Summary.cmake (#59321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59321

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28839370

Pulled By: seemethere

fbshipit-source-id: 0d4b35c05c2b1a78b752088cd16cd6263958e7f6
2021-06-02 11:10:55 -07:00
afd5237a4f Revert D28800692: [nnc] Enable CPU fusion inside Facebook, take 3
Test Plan: revert-hammer

Differential Revision:
D28800692 (6e7dae9cec)

Original commit changeset: d791c3b2ccd7

fbshipit-source-id: 5042fecfbab59181572013bf39760bc716e86430
2021-06-02 10:07:46 -07:00
a7aeaaf99e Added missing namespaces for C++ API (#45736)
Summary:
Hello,

depending on the build environment you may encounter
```c++
error: reference to 'optional' is ambiguous
```
when using the Torch-C++-API.

This PR adds `c10::` to avoid possible ambiguities with **std::optional** and does not introduce any functional change.

Fixes https://discuss.pytorch.org/t/linker-failed-with-ambiguous-references/36255 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45736

Reviewed By: dzhulgakov

Differential Revision: D24125123

Pulled By: VitalyFedyunin

fbshipit-source-id: df21420f0a2d0270227c28976a7a4218315cc107
2021-06-02 09:46:20 -07:00
87a25e09f4 [quant][graphmode][fx][refactor] Remove _convert from Quantizer class (#59042)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59042

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724867

fbshipit-source-id: 9f87d51020caa20d5408cb2820947e23d92d5fc3
2021-06-02 08:50:56 -07:00
580831bfbb Add support for MatMul to BatchMatMulFP16Acc{16,32}Fake Op Mapping
Test Plan: f276981395

Reviewed By: hx89

Differential Revision: D28815646

fbshipit-source-id: c16b081bf3da2b157b9d42ea67b03dae88e82c6d
2021-06-02 08:32:21 -07:00
599f5058cf [ONNX] Update ONNX to rel-1.9 (#55889) (#57080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57080

ONNX optimizer is removed in ONNX 1.9
This PR removes the ONNX optimizer from a C++ code path and uses a `try-except` block in Python to keep it compatible with both ONNX 1.8 and 1.9.
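
A hedged sketch of the `try-except` compatibility pattern described above (module and function names assume the pre-1.9 `onnx.optimizer` API):

```python
try:
    from onnx import optimizer as onnx_optimizer  # available in ONNX <= 1.8
except ImportError:
    onnx_optimizer = None  # removed in ONNX >= 1.9

def maybe_optimize(model_proto):
    # Only run optimizer passes when the module is actually available.
    if onnx_optimizer is None:
        return model_proto
    return onnx_optimizer.optimize(model_proto)
```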

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D28467330

Pulled By: malfet

fbshipit-source-id: 5e4669dd0537648898e593f9e253da18d6dc7568

Co-authored-by: neginraoof <neginmr@utexas.edu>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
2021-06-02 08:27:17 -07:00
f87aa23125 .github: Remove windows dependency installs (#59283)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59283

We were observing 403s when attempting to install dependencies from
chocolatey, leading us to believe that we were being rate limited by
chocolatey.

We've instead opted to install our dependencies in our base AMIs,
considering we would install them on every workflow anyway. This also
includes moving the Windows 10 SDK installation to the base AMI,
since we were observing failures there as well due to failed
dependency installations.

Also moves the Windows 10 SDK installation to our Visual Studio installation script, which is activated by passing an environment variable.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: janeyx99

Differential Revision: D28822962

Pulled By: seemethere

fbshipit-source-id: b5e35ffe4537db55deb027376bd2d418683707a5
2021-06-02 08:16:21 -07:00
3a2149a4ce [reland] Make TP agent use streams from Future when sending response (#59212)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59212

Reland of https://github.com/pytorch/pytorch/pull/58428

Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 130202842

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623885

fbshipit-source-id: 29333bcb75d077ab801eac92017d0e381e8f5569
2021-06-02 05:46:05 -07:00
258a991027 [reland] Set and propagate devices in RRef completion future (#59211)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59211

Reland of https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.
I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 130202843

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28623886

fbshipit-source-id: 6c82ed7c785ac3bf32fff7eec67cdd73b96aff28
2021-06-02 05:46:04 -07:00
a3392cafe0 [reland] Set streams when invoking UDFs (#59210)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59210

Reland of https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 130202847

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623889

fbshipit-source-id: ed38242f813dac075d162685d52ae89f408932f9
2021-06-02 05:46:02 -07:00
f8a3fd4e34 [reland] Create CUDA-aware futures in RequestCallback (#59209)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59209

Reland of https://github.com/pytorch/pytorch/pull/58426

The operations in RequestCallback can return CUDA tensors, thus the futures used to hold them must be CUDA-aware.
ghstack-source-id: 130202844

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623887

fbshipit-source-id: 53561b8ae011458d8f848f0a03830925aff2f0c2
2021-06-02 05:46:00 -07:00
3af6ff98ff [reland] Provide pre-extracted DataPtrs when completing a Future with a Message (#59208)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59208

Reland of https://github.com/pytorch/pytorch/pull/58425

Now that callbacks can provide pre-extracted DataPtrs, let's do so. This will become of crucial importance in the next PR, where some of these futures will become CUDA-aware, and thus they will try to extract DataPtrs on their own, but they would fail to do so here because Message isn't "inspectable".
ghstack-source-id: 130202845

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623888

fbshipit-source-id: 1aa4bde8014870c071685ba8f72d5f3f01f0a512
2021-06-02 05:45:59 -07:00
1adc289e10 [reland] Allow Future::then to return pre-extracted DataPtrs (#59207)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59207

Reland of https://github.com/pytorch/pytorch/pull/58424

In CUDA mode, Future must inspect its value and extract DataPtrs. However some types are not supported, for example the C++/JIT custom classes, which include Message, which is widely used in RPC. Hence for these scenarios we allow the user to perform the custom DataPtr extraction on their own, and pass the pre-extracted DataPtrs.

Note that `markCompleted` already allowed users to pass in pre-extracted DataPtrs, hence this PR simply extends this possibility to the `then` method too.
ghstack-source-id: 130202846

Test Plan: Used in next PR.

Reviewed By: mrshenli

Differential Revision: D28623890

fbshipit-source-id: 468c5308b40774ba0a778b195add0e0845c1929e
2021-06-02 05:45:57 -07:00
b07d68e24c [reland] Always use intrusive_ptr for Message (2 out of 2) (#59206)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59206

Reland of https://github.com/pytorch/pytorch/pull/58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones within toMessageImpl. And since they're the last ones, we make the constructor of Message private, to prevent new usages from emerging.
ghstack-source-id: 130202848

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623892

fbshipit-source-id: f815cf6b93e488c118e5d2298473e6e9d9f4c132
2021-06-02 05:45:55 -07:00
5ec169b4c3 [reland] Always use intrusive_ptr for Message (1 out of 2) (#59205)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205

Reland of https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate and simplify many of the problems above.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 130202849

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28623891

fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
2021-06-02 05:44:49 -07:00
44c20ce676 Alias for i0 to special namespace (#59141)
Summary:
See https://github.com/pytorch/pytorch/issues/50345

cc: mruberry kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59141

Reviewed By: ngimel

Differential Revision: D28784097

Pulled By: mruberry

fbshipit-source-id: 9b61a21906ef337292686fd40e328502a79e6f09
2021-06-01 23:04:09 -07:00
059a717c9e Fix breakpad build and add to more images (#59236)
Summary:
This PR
* adds the breakpad build to most of the remaining docker images (except the mobile + slim ones)
* pins to a [fork of breakpad](https://github.com/google/breakpad/compare/master...driazati:master?expand=1) to enable daisy chaining on signal handlers
* renames the API to be nicer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59236

Reviewed By: malfet

Differential Revision: D28792511

Pulled By: driazati

fbshipit-source-id: 83723e74b7f0a00e1695210ac2620a0c91ab4bf2
2021-06-01 22:47:14 -07:00
dbe629c51d [RPC Framework] Support creating a RemoteModule by RRef (#59242)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59242

Original PR Issue: https://github.com/pytorch/pytorch/issues/58274

This can be a workaround: Instead of passing a script `RemoteModule` over RPC, pass its `module_rref` field over RPC, and then construct a new `RemoteModule` on the receiver end.
ghstack-source-id: 130268018

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_send_remote_module_over_the_wire_script_not_supported

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_remote_module_py_pickle_not_supported_script

buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_create_remote_module_by_module_rref

Reviewed By: vipannalla

Differential Revision: D28794905

fbshipit-source-id: 1a677ff0d4b47c078ad47b50d7102a198a1fc39b
2021-06-01 22:35:03 -07:00
3218d890dd [quant][graphmode][fx][fix] Fix support for custom module (#59041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59041

Static quantization support for custom modules was removed in a previous refactor
(https://github.com/pytorch/pytorch/pull/57519) since it was not covered by the test case.
This PR re-enables the test case and fixes the support.

Test Plan: Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724866

fbshipit-source-id: 1974675b88b56a2173daf86965d6f3fb7ebd783b
2021-06-01 22:31:15 -07:00
06af7618e7 [quant][graphmode][fx][refactor] Remove Quantizer class from convert (QuantizeHandler) (#59040)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59040

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724870

fbshipit-source-id: c0f748711b825cd46bdfcc05c054c77a41e8207a
2021-06-01 22:00:49 -07:00
0a26781966 fix numpy compatibility in test for torch.kthvalue (#59214)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/59201. Should be merged after https://github.com/pytorch/pytorch/issues/59067 to ensure this is actually working correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59214

Reviewed By: albanD

Differential Revision: D28792363

Pulled By: mruberry

fbshipit-source-id: 0cf613463139352906fb567f1efcc582c2c25de8
2021-06-01 21:57:09 -07:00
e9e1bb1a4e Fix device of info tensor for torch.linalg.inv_ex with MAGMA backend (#59223)
Summary:
This PR fixes `torch.linalg.inv_ex` with the MAGMA backend.
The `info` tensor was returned on the CPU device even for CUDA inputs.
Now it's on the same device as the input.
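
A small usage check of the fixed behavior (requires a CUDA build; MAGMA is assumed to be the backend in play):

```python
import torch

if torch.cuda.is_available():
    a = torch.randn(3, 3, device="cuda")
    inv, info = torch.linalg.inv_ex(a)
    assert info.device == a.device  # previously `info` came back on the CPU
```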

Fixes https://github.com/pytorch/pytorch/issues/58769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59223

Reviewed By: ngimel

Differential Revision: D28814876

Pulled By: mruberry

fbshipit-source-id: f66c6f06fb8bc305cb2e22b08750a25c8888fb65
2021-06-01 21:49:57 -07:00
50e6ee3ca2 [quant][graphmode][fx][refactor] Remove Quantizer class from quantize_node (#59039)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59039

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724874

fbshipit-source-id: bd984716b2da1d6879c3e92fa827574783a41567
2021-06-01 21:40:08 -07:00
2d8f0d966f CUDA support in the CSR layout: CUDA addmm/matvec (#59012)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59012

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719631

Pulled By: bhosmer

fbshipit-source-id: 43e2004a61e114aeb0a7c6ad8a25fedda238c6da
2021-06-01 21:16:42 -07:00
3efefc4016 [CUDA graphs] Makes sure all graphs tests call empty_cache() at some point before capture (#59233)
Summary:
Graphs tests are sometimes flaky in CI ([example](https://app.circleci.com/pipelines/github/pytorch/pytorch/328930/workflows/0311199b-a0be-4802-a286-cf1e73f96c70/jobs/13793451)) because when the GPU runs near its max memory capacity (which is not unusual during a long test), sometimes, to satisfy new allocations that don't match any existing unused blocks, the caching allocator may call `synchronize_and_free_events` to wait on block end-of-life events and cudaFree unused blocks, then re-cudaMalloc a new block. For ungraphed ops this isn't a problem, but synchronizing or calling cudaFree while capturing is illegal, so `synchronize_and_free_events` raises an error if called during capture.

The graphs tests themselves don't use much memory, so calling torch.cuda.empty_cache() at some point before their captures should ensure memory is available and the captures never need `synchronize_and_free_events`.

I was already calling empty_cache() near the beginning of several graphs tests. This PR extends it to the ones I forgot.
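
A minimal sketch of the pattern (assuming a build that exposes `torch.cuda.CUDAGraph` / `torch.cuda.graph`): free cached blocks before capture so the allocator never needs to synchronize or `cudaFree` while a capture is in progress.

```python
import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # make memory available before capture begins

    static_x = torch.zeros(8, device="cuda")
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_y = static_x * 2  # captured work; no allocator sync allowed here
    g.replay()
```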

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59233

Reviewed By: mruberry

Differential Revision: D28816691

Pulled By: ngimel

fbshipit-source-id: 5cd83e48e43b1107daed5cfa2efff0fdb4f99dff
2021-06-01 21:05:46 -07:00
1d37f41567 [quant][graphmode][fx][refactor] Remove _prepare from Quantizer class (#59038)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59038

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724869

fbshipit-source-id: e8501c9720b5ddb654e78bc8fa08de0466c1d52b
2021-06-01 18:01:22 -07:00
970096b624 [Reland] Adds an aten::_ops namespace with unambiguous function names (#59018)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59018

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
an non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: bdhirsh

Differential Revision: D28721206

Pulled By: zou3519

fbshipit-source-id: a070017f98e8f4038cb0c64be315eef45d264217
2021-06-01 17:19:06 -07:00
8805093ec5 use long index type for index_add_cuda deterministic path (#59254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59254

index_add can take int or long index tensor whereas index_put only takes long indices tensor.

In the deterministic path of index_add_cuda, we use index_put. Hence we should convert the index tensor to long first.
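
A small runnable illustration of the dtype point above (on CPU, for simplicity): `index_add_` accepts int32 indices, while `index_put_` (used on the deterministic CUDA path) requires int64, so the index is cast to long first.

```python
import torch

x = torch.zeros(5)
idx = torch.tensor([0, 2, 4], dtype=torch.int32)
src = torch.ones(3)

x.index_add_(0, idx, src)                          # int32 index is accepted here
x.index_put_((idx.long(),), src, accumulate=True)  # index_put_ needs int64 indices
```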

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_index_add_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (14.748)
    ✓ Pass: caffe2/test:torch_cuda - test_index_add_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (27.717)
    ✓ Pass: caffe2/test:torch_cuda - main (27.717)

Reviewed By: ngimel

Differential Revision: D28804038

fbshipit-source-id: de12932a7738f2805f3bceb3ec024497625bce6a
2021-06-01 16:28:18 -07:00
20348fb32e [quant][graphmode][fx][refactor] Remove find_matches from Quantizer class (#59037)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59037

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724865

fbshipit-source-id: 6c6824d0af7dd47d4c111d6a08e373bc65f33e08
2021-06-01 16:07:07 -07:00
7d64fc675b [quant][graphmode][fx][refactor] Remove fold_weights from Quantizer class (#59036)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59036

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724862

fbshipit-source-id: 5900420127fcc14846bc34c9ac29ff7e6a703f1e
2021-06-01 15:52:57 -07:00
8af6281201 DOC Adds register_module_full_backward_hook into docs (#58954)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/54443

Adds `register_module_full_backward_hook` into the index so it is rendered in the html docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58954

Reviewed By: ngimel

Differential Revision: D28801816

Pulled By: jbschlosser

fbshipit-source-id: a2e737fe983e5d7e4e26d7639183bca34b571cb8
2021-06-01 15:47:10 -07:00
6e7dae9cec [nnc] Enable CPU fusion inside Facebook, take 3 (#59253)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59253

Fixed a miscompilation exposed by multithreaded profiling collection; let's try again.
ghstack-source-id: 130286580

Test Plan: servicelab

Reviewed By: navahgar, huiguoo

Differential Revision: D28800692

fbshipit-source-id: d791c3b2ccd75fe5e6eca0859083d4cd67460147
2021-06-01 15:42:22 -07:00
cc4891804c [quant][graphmode][fx][refactor] Remove save_state and restore_state from Quantizer class (#59035)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59035

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724872

fbshipit-source-id: d32752c635917c9820e5e7cc414ba9d48a258a19
2021-06-01 15:38:36 -07:00
336ac9496f Fix mismatch in README.md Docker Image section (#59199)
Summary:
docker.Makefile has CUDNN_VERSION=8 as the default, but README.md states cuDNN v7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59199

Reviewed By: mruberry

Differential Revision: D28808611

Pulled By: ngimel

fbshipit-source-id: 96cea32bfe33184b2bff69b7bb7f3e50a2b9c6aa
2021-06-01 15:22:30 -07:00
95c26b2806 [ROCm] disable test test_Conv2d_groups_nobias for ROCm (#59158)
Summary:
Disabling the test since it's failing on ROCm 4.2

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59158

Reviewed By: mruberry

Differential Revision: D28808953

Pulled By: ngimel

fbshipit-source-id: 134f147ead6dc559d2cde49cf8343cd976e6c224
2021-06-01 15:10:06 -07:00
3d521e8b40 [quant][graphmode][fx][refactor] Remove prepare_custom_config from Quantizer class (#59034)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59034

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724873

fbshipit-source-id: 870e0822843ad1d035f41eaa015bdde9ccf6ec23
2021-06-01 14:52:22 -07:00
a5dcd3c4b7 Revert D28240105: [pytorch][PR] Fix DistributedSampler mem usage on large datasets
Test Plan: revert-hammer

Differential Revision:
D28240105 (a0ce8da26e)

Original commit changeset: 4c6aa493d0f7

fbshipit-source-id: 8a0e17764c2f26c8316f88ad6c8772b08883ceee
2021-06-01 14:44:23 -07:00
a0ce8da26e Fix DistributedSampler mem usage on large datasets (#51841)
Summary:
The current implementation of DistributedSampler generates a python list to hold all of the indices, and then returns a slice of this list for the given rank (creating a partial copy of the list). When the underlying dataset is large, both of these choices waste a large amount of memory. It is much more efficient to create a tensor to hold the indices, and then index into that tensor instead of creating slices.

In the case of a sampler with `shuffle=False`, it would be possible to avoid creating the `indices` tensor entirely (since the index will always match the value), but I have opted instead here to keep the implementation as similar to the existing version as possible. One possible benefit of this approach is that memory usage will not significantly change based on changing this parameter. Still, it might be better to simply return the indices directly without the underlying array.

Additionally, the logic around calculating the number of samples is unnecessarily complex. When dropping the last batch, this can be a simple floor division.

In a simple test script that creates a sampler for a dataset with 100,000,000 items, memory usage is reduced by 98% compared to the existing implementation.
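
A rough sketch of the idea (illustrative helper, not the DistributedSampler code in this PR): keep the shuffled indices in one int64 tensor and slice into it per rank, rather than building a Python list and copying a slice of it.

```python
import torch

def rank_indices(dataset_len: int, num_replicas: int, rank: int,
                 seed: int = 0, shuffle: bool = True) -> torch.Tensor:
    if shuffle:
        g = torch.Generator()
        g.manual_seed(seed)
        indices = torch.randperm(dataset_len, generator=g)
    else:
        indices = torch.arange(dataset_len)
    # drop_last-style sizing reduces to a simple floor division.
    num_samples = dataset_len // num_replicas
    total_size = num_samples * num_replicas
    # Strided indexing into the tensor replaces list slicing and copying.
    return indices[rank:total_size:num_replicas]
```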

Fixes https://github.com/pytorch/pytorch/issues/45427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51841

Reviewed By: albanD

Differential Revision: D28240105

Pulled By: rohan-varma

fbshipit-source-id: 4c6aa493d0f75c07ec14c98791b3a531300fb1db
2021-06-01 14:15:14 -07:00
5a42a97c49 Add NCCL_ASYNC_ERROR_HANDLING as an environment variable (#59109)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57878.

This adds `NCCL_ASYNC_ERROR_HANDLING` as a DDP relevant environment variable and includes a check for that variable in the test `test_dump_DDP_relevant_env_vars()`. Notably, the modified test now checks for the new variable but does not check for any of the other previously-existing relevant environment variables that were not already tested for (e.g. `NCCL_BLOCKING_WAIT`).

The change was tested via the following on an AI AWS cluster:
`WORLD_SIZE=2 BACKEND=nccl gpurun pytest test/distributed/test_distributed_spawn.py -k test_dump_DDP_relevant_env_vars -vs`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59109

Reviewed By: H-Huang, SciPioneer

Differential Revision: D28761148

Pulled By: andwgu

fbshipit-source-id: 7be4820e61a670b001408d0dd273f65029b1d2fe
2021-06-01 14:02:41 -07:00
5f1117226f DOC Update register_buffer/parameter docstring explaining None (#59015)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/40977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59015

Reviewed By: ngimel

Differential Revision: D28797948

Pulled By: jbschlosser

fbshipit-source-id: 3bf60af5c1cfc5f1786b4975b48f093391374503
2021-06-01 13:55:07 -07:00
e4b2684331 [quant][graphmode][fx][refactor] Remove patterns from Quantizer class (#59033)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59033

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724861

fbshipit-source-id: 97b38e851b6bf581510a24636b1d8d6f1d977f5a
2021-06-01 13:44:08 -07:00
83892c1861 [quant][graphmode][fx][refactor] Remove node_name_to_scope from Quantizer (#59032)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59032

To remove Quantizer class and split prepare and convert functions to different files

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724868

fbshipit-source-id: 6df639f20076b480812b6dcf0fc7d2c87ca29d8b
2021-06-01 13:26:09 -07:00
3826f7e8e0 [quant][graphmode][fx][refactor] Remove quantized_graph from Quantizer (#59031)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59031

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724871

fbshipit-source-id: dad0332ba271c4cfb6ec1e8f2036443149b5bea4
2021-06-01 13:01:54 -07:00
1b4586ee20 [quant][fx][graphmode][refactor] Remove modules from Quantizer (#59030)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59030

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724875

fbshipit-source-id: d6610c1d5eb7755331252be9e348a230abf4175c
2021-06-01 12:42:28 -07:00
aa857850bb Add check_env, getenv api (#59052)
Summary:
Related Issue: https://github.com/pytorch/pytorch/issues/57691
This PR introduces an API for checking environment variables:

```c++
optional<bool> check_env(const char *name)
```
Reads the environment variable `name` and returns:
- `optional<true>` if it is set to "1"
- `optional<false>` if it is set to "0"
- `nullopt` otherwise

Issues a warning if the environment variable is set to any value other than 0 or 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59052

Test Plan:
Manually run the following test case:

- Apply this diff to the repo
```
 diff --git a/torch/csrc/Exceptions.cpp b/torch/csrc/Exceptions.cpp
index d008643f70..990d254f0d 100644
 --- a/torch/csrc/Exceptions.cpp
+++ b/torch/csrc/Exceptions.cpp
@@ -9,6 +9,9 @@

 #include <torch/csrc/THP.h>

+#include <c10/util/Optional.h>
+#include <c10/util/env.h>
+
 // NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)
 PyObject *THPException_FatalError;

@@ -23,18 +26,7 @@ bool THPException_init(PyObject *module)
 namespace torch {

 static bool compute_cpp_stack_traces_enabled() {
-  auto envar = std::getenv("TORCH_SHOW_CPP_STACKTRACES");
-  if (envar) {
-    if (strcmp(envar, "0") == 0) {
-      return false;
-    }
-    if (strcmp(envar, "1") == 0) {
-      return true;
-    }
-    TORCH_WARN("ignoring invalid value for TORCH_SHOW_CPP_STACKTRACES: ", envar,
-               " valid values are 0 or 1.");
-  }
-  return false;
+ return c10::utils::check_env("TORCH_SHOW_CPP_STACKTRACES").value_or(false);
 }

 bool get_cpp_stacktraces_enabled() {
```
This patch replaces the prior `std::getenv` usage in `torch/csrc/Exceptions.cpp` to use the new api.
- Run the following python3 script
```python
import torch

print(torch.__version__) # should print local version (not release)

a1 = torch.tensor([1,2,3])
a2 = torch.tensor([2])

a1 @ a2
```
using the following commands
```bash
python3 test.py # should not output CPP trace
TORCH_SHOW_CPP_STACKTRACES=1 python3 test.py # should output CPP trace
```

Reviewed By: ngimel

Differential Revision: D28799873

Pulled By: 1ntEgr8

fbshipit-source-id: 3e23353f48679ba8ce0364c049420ba4ff86ff09
2021-06-01 12:24:14 -07:00
fd2a36369a Fixed torch.nn.MultiMarginLoss equation format error (#59188)
Summary:
Removed the extra parenthesis from the right side
Fixes https://github.com/pytorch/pytorch/issues/58634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59188

Reviewed By: ngimel

Differential Revision: D28797720

Pulled By: jbschlosser

fbshipit-source-id: 47e3084526389e7d1cc17c1a01b253e666c58784
2021-06-01 12:04:34 -07:00
06399d441d Create EngineHolder for serializing and running TRT Engines with PyTorch
Test Plan:
**python tests**
`buck test mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_test`

**python tests to generate test models** (this outputs the jit model files for use with cpp tests)
`buck run mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_generate_test_models`

**cpp tests**
`buck test mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 deeplearning/trt/EngineHolder:engine_holder_test_cpp`

**run service locally**

*build service*
`buck build mode/opt-split-dwarf -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 smart/inference_platform_sp/predictor_gpu:service`

*run service*
`buck-out/gen/smart/inference_platform_sp/predictor_gpu/service --model_dir="/home/jackmontgomery" --model_id=123_0 --pytorch_predictor_use_cuda`

*build requester*
`buck build mode/opt -c python.package_style=inplace -c fbcode.platform=platform009 -c fbcode.enable_gpu_sections=true -j 20 glow/fb/test:invoke_cv_pt_predictor`

*run requester*
`buck-out/gen/glow/fb/test/invoke_cv_pt_predictor.par --model_id=123_0 --port=33131 --host="2401:db00:eef0:1100:3560:0:1c02:2115" --num_parallel_requesters=1`

Reviewed By: 842974287

Differential Revision: D28581591

fbshipit-source-id: 7738b05543c2c840ee6b8f0d4818f21dc7f61b19
2021-06-01 11:41:33 -07:00
e9e5588588 Improve Tensor traverse to traverse its grad_fn when possible (#58271)
Summary:
There are two main changes here:
- THPVariable will actually visit its grad_fn if there is no other reference to the c++ Tensor and no other reference to the grad_fn. The critical observation compared to the existing comment (thanks Ed!) is that if we also check that the c++ Tensor object is not referenced somewhere else, we're sure that no one can change the grad_fn refcount between the traverse and the clear.
- THPVariable doesn't need a special clear for this new case: since we're the only owner of the c++ Tensor, cdata.reset() will necessarily free the Tensor and all its resources.

The two tests are to ensure:
- That the cycles are indeed collectible by the gc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58271

Reviewed By: ngimel

Differential Revision: D28796461

Pulled By: albanD

fbshipit-source-id: 62c05930ddd0c48422c79b03118db41a73c1355d
2021-06-01 10:27:52 -07:00
65748f81c9 Un-verbose the build (#59235)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59235

Reviewed By: zou3519

Differential Revision: D28792468

Pulled By: driazati

fbshipit-source-id: 98f730ea0ee28b4b5c13198879bee8f586c0c14c
2021-06-01 10:14:26 -07:00
7523728368 [quant][graphmode][fx] Factor out run_weight_observer (#59029)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59029

Trying to remove Quantizer class and split prepare and convert code

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724864

fbshipit-source-id: 67ac5e7eb351970fdf46532c3c2ac6ac831bc697
2021-06-01 10:01:42 -07:00
10fc42eacc [quant][graphmode][fx] Merge quant_env and env (#59028)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59028

Previously we had both an env and a quant_env in convert, which was a bit confusing; in this PR we merge them into a single Dict[str, Tuple[Node, torch.dtype]].

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28724863

fbshipit-source-id: 722a682c70d300a6ccd2b988786a1ac2d45e880e
2021-06-01 09:21:38 -07:00
afdfd2288a Revert D28767060: [pytorch][PR] Migrate renorm to ATen (CPU and CUDA)
Test Plan: revert-hammer

Differential Revision:
D28767060 (74ec50893d)

Original commit changeset: 93dcbe5483f7

fbshipit-source-id: ae85d90212df4e6bb3a5da310e97ad1c06aa9a77
2021-06-01 05:15:21 -07:00
0b040e17e5 More user-friendly error messages (#59106)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59106

Should make debugging a bit easier

Test Plan:
Example error in https://www.internalfb.com/intern/aibench/details/884106485190261 (open log for Portal or Portal+):
```
The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 29, in forward
    _0 = uninitialized(__torch__.torch.classes._nnapi.Compilation)
    if torch.__is__(self.comp, None):
      _1 = (self).init(args, )
            ~~~~~~~~~~ <--- HERE
    else:
      pass
  File "code/__torch__/torch/backends/_nnapi/prepare.py", line 97, in init
    comp = __torch__.torch.classes._nnapi.Compilation.__new__(__torch__.torch.classes._nnapi.Compilation)
    _22 = (comp).__init__()
    _23 = (comp).init(self.ser_model, self.weights, )
           ~~~~~~~~~~ <--- HERE
    self.comp = comp
    return None

Traceback of TorchScript, original code (most recent call last):
  File "/data/users/dhaziza/fbsource/fbcode/buck-out/dev/gen/mobile-vision/d2go/projects/facegen/tools/export_to_app#link-tree/torch/backends/_nnapi/prepare.py", line 47, in forward
    def forward(self, args: List[torch.Tensor]) -> List[torch.Tensor]:
        if self.comp is None:
            self.init(args)
            ~~~~~~~~~ <--- HERE
        comp = self.comp
        assert comp is not None
  File "/data/users/dhaziza/fbsource/fbcode/buck-out/dev/gen/mobile-vision/d2go/projects/facegen/tools/export_to_app#link-tree/torch/backends/_nnapi/prepare.py", line 42, in init
        self.weights = [w.contiguous() for w in self.weights]
        comp = torch.classes._nnapi.Compilation()
        comp.init(self.ser_model, self.weights)
        ~~~~~~~~~ <--- HERE
        self.comp = comp
RuntimeError: [enforce fail at nnapi_model_loader.cpp:171] result == ANEURALNETWORKS_NO_ERROR. NNAPI returned error: 4
```

Reviewed By: axitkhurana

Differential Revision: D28287450

fbshipit-source-id: ccd10301e1492f8879f9d6dd57b60c4e683ebb9e
2021-06-01 02:05:24 -07:00
cab4849463 [caffe2][glow] Share info about current batch_size (#58902)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58902

Pull Request resolved: https://github.com/pytorch/glow/pull/5681

Reviewed By: ChunliF

Differential Revision: D28665162

fbshipit-source-id: 39e173a24ee247bc6fee44009798c74dddb27648
2021-06-01 01:21:42 -07:00
7fb3385f4b Automated submodule update: FBGEMM (#59170)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59170

This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: ffc2e1a91e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58874

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: hx89

Differential Revision: D28648577

Pulled By: jspark1105

fbshipit-source-id: 0ad1a6fdf27cd3f05f9e342030461cb7caa9986b
2021-05-31 23:18:58 -07:00
74ec50893d Migrate renorm to ATen (CPU and CUDA) (#59108)
Summary:
Closes https://github.com/pytorch/pytorch/issues/24754, closes https://github.com/pytorch/pytorch/issues/24616, closes https://github.com/pytorch/pytorch/issues/50874

This reuses `linalg_vector_norm` to calculate the norms. I just add a new kernel that turns the norm into a normalization factor, then multiply the original tensor using a normal broadcasted `mul` operator. The result is less code, and better performance to boot.
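
A rough Python-level sketch of the computation described above (not the kernel added here; the `eps` value and exact broadcasting details are assumptions):

```python
import torch

def renorm_reference(t, p, dim, maxnorm, eps=1e-7):
    # Norm of each slice along `dim`, computed over all the other dimensions.
    reduce_dims = [d for d in range(t.dim()) if d != dim]
    norms = torch.linalg.vector_norm(t, ord=p, dim=reduce_dims, keepdim=True)
    # Slices over the limit are scaled down to maxnorm; the rest are untouched.
    factor = torch.where(norms > maxnorm, maxnorm / (norms + eps),
                         torch.ones_like(norms))
    return t * factor  # broadcasted mul, as described above

x = torch.randn(10, 10, 10)
print(torch.allclose(renorm_reference(x, 2, 0, 1.0),
                     torch.renorm(x, 2, 0, 1.0), atol=1e-5))
```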

#### Benchmarks (CPU):
|     Shape    | Dim |  Before | After (1 thread) | After (8 threads) |
|:------------:|:---:|--------:|-----------------:|------------------:|
| (10, 10, 10) | 0   | 11.6 us |           4.2 us |            4.2 us |
|              | 1   | 14.3 us |           5.2 us |            5.2 us |
|              | 2   | 12.7 us |           4.6 us |            4.6 us |
| (50, 50, 50) | 0   |  330 us |           120 us |           24.4 us |
|              | 1   |  350 us |           135 us |           28.2 us |
|              | 2   |  417 us |           130 us |           24.4 us |

#### Benchmarks (CUDA)
|     Shape    | Dim |  Before |   After |
|:------------:|:---:|--------:|--------:|
| (10, 10, 10) | 0   | 12.5 us | 12.1 us |
|              | 1   | 13.1 us | 12.2 us |
|              | 2   | 13.1 us | 11.8 us |
| (50, 50, 50) | 0   | 33.7 us | 11.6 us |
|              | 1   | 36.5 us | 15.8 us |
|              | 2   | 41.1 us |   15 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59108

Reviewed By: mrshenli

Differential Revision: D28767060

Pulled By: ngimel

fbshipit-source-id: 93dcbe5483f71cc6a6444fbd5b1aa1f29975d857
2021-05-31 22:38:16 -07:00
223725cfb0 OpInfo: div - port pending method_tests entry (#59173)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Depends on: https://github.com/pytorch/pytorch/issues/59154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59173

Reviewed By: ngimel

Differential Revision: D28785178

Pulled By: mruberry

fbshipit-source-id: 902310f2d77e499a2355a23b2d5a8c0b21b8c5bb
2021-05-31 17:32:27 -07:00
6d45d7a6c3 Enables previously "slow" gradgrad checks on CUDA (#57802)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57508

Earlier, a few CUDA `gradgrad` checks (see the list of ops below) were disabled because of them being too slow. There have been improvements (see https://github.com/pytorch/pytorch/issues/57508 for reference) and this PR aimed on:

1. Time taken by `gradgrad` checks on CUDA for the ops listed below.
2. Enabling the tests again if the times sound reasonable

Ops considered: `addbmm, baddbmm, bmm, cholesky, symeig, inverse, linalg.cholesky, linalg.cholesky_ex, linalg.eigh, linalg.qr, lu, qr, solve, triangular_solve, linalg.pinv, svd, linalg.svd, pinverse, linalg.householder_product, linalg.solve`.

For numbers (on time taken) on a separate CI run: https://github.com/pytorch/pytorch/pull/57802#issuecomment-836169691.

cc: mruberry albanD pmeier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57802

Reviewed By: ngimel

Differential Revision: D28784106

Pulled By: mruberry

fbshipit-source-id: 9b15238319f143c59f83d500e831d66d98542ff8
2021-05-30 22:16:46 -07:00
ef40757de3 OpInfo: zero_ (#58731)
Summary:
See https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58731

Reviewed By: ngimel

Differential Revision: D28784083

Pulled By: mruberry

fbshipit-source-id: f06de8045afd3728b1fedc014c091d8fd1955a9f
2021-05-30 21:49:29 -07:00
2aeb16c13a [fix] i1-i1e ROCm failure: mark array as const so that it is available for host and device (#59187)
Summary:
Fix failing ROCm build introduced by https://github.com/pytorch/pytorch/issues/56352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59187

Reviewed By: ngimel

Differential Revision: D28784072

Pulled By: mruberry

fbshipit-source-id: 36a5bd11ad2fe80a81aae6eb8b21f0901c842ddc
2021-05-30 21:44:54 -07:00
fea7a79e0b [special] Add ndtr (#58126)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

Plot:
![image](https://user-images.githubusercontent.com/19503980/117942099-54efd680-b328-11eb-8948-c3080779ce19.png)
https://colab.research.google.com/drive/1Of67A042rOImj8wrLF_fUTgoy_wVEOZS?usp=sharing

TODO:
* [x] Add docs (https://13385714-65600975-gh.circle-artifacts.com/0/docs/special.html#torch.special.ndtr)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58126

Reviewed By: anjali411

Differential Revision: D28700957

Pulled By: mruberry

fbshipit-source-id: 5b9991e97ec1e8fd01518cc9d9849108d35fe406
2021-05-30 21:12:04 -07:00
2a78f6376c TensorIterator: Reduce serial_for_each static overhead (#58909)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58909

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28776507

Pulled By: ngimel

fbshipit-source-id: 4f0283d03b26aa5785b687b78d77e6b0efcbaf65
2021-05-30 21:08:54 -07:00
445e838210 OpInfo: resize_, resize_as_ (#59176)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59176

Reviewed By: ngimel

Differential Revision: D28780083

Pulled By: mruberry

fbshipit-source-id: 472584e8faa4cb1031908df097849d2d4167fdf5
2021-05-30 18:53:17 -07:00
ea465f7378 OpInfo: true_divide and minor fix (#59154)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59154

Reviewed By: ngimel

Differential Revision: D28780115

Pulled By: mruberry

fbshipit-source-id: 91e254698597fa0c7d4df6053ec017a85e180304
2021-05-30 18:35:30 -07:00
aaccdc3996 SparseCsr: Fix some uses of deprecated Tensor methods (#58990)
Summary:
This fixes some deprecation warnings in the build that were introduced by https://github.com/pytorch/pytorch/issues/58768.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58990

Reviewed By: ngimel

Differential Revision: D28776804

Pulled By: mruberry

fbshipit-source-id: 8abf75ea8f7adca537f9c808e68356829407665e
2021-05-30 03:58:19 -07:00
6ee9466d3a OpInfo: tensor_split: port remaining method_test entries (#59133)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59133

Reviewed By: ngimel

Differential Revision: D28776470

Pulled By: mruberry

fbshipit-source-id: 975a7062788de514f214f8c4ef0146eaf6b407f7
2021-05-30 00:40:29 -07:00
96c549d1c6 Replace dim_apply with TensorIterator (#58656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58656

Ref gh-56794

`dim_apply` is problematic because it calls `Tensor.select` inside of a parallel
region. Instead, replace it with `TensorIterator` by squashing the
apply-dimension. This is similar to the `_dim_apply` function already used by
the sort kernels:

8c91acc161/aten/src/ATen/native/cpu/SortingKernel.cpp (L27)

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28776441

Pulled By: ngimel

fbshipit-source-id: 14449d4b12ed4576f879bb65a35e881ce1a953b1
2021-05-30 00:09:14 -07:00
cab65ea3b9 OpInfo: renorm (#59079)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59079

Reviewed By: ngimel

Differential Revision: D28776789

Pulled By: mruberry

fbshipit-source-id: ca46f2debe918c3de1f3b5bbc9924b7ddfe9442a
2021-05-29 22:38:15 -07:00
5c18994674 [special] Add i1 and i1e (#56352)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/50345

* [x] Check Docs https://12721710-65600975-gh.circle-artifacts.com/0/docs/special.html
* [x] Investigate fp32 failure on CI?! (Fails on clang. Reproduced locally with clang-11)
* [ ] Kernel vs Composite?
* [x] Autograd for `i0e` for zero?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56352

Reviewed By: anjali411

Differential Revision: D28700888

Pulled By: mruberry

fbshipit-source-id: 91a3cbb94f5b8a3b063589ec38179848c11def83
2021-05-29 20:55:23 -07:00
27009d6129 [TensorExpr] Add NNC lowerings for aten::view, aten::reshape and aten::expand_as. (#59157)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59157

Currently view is represented as a copy since we don't support inplace
operations in NNC (similar to `aten::reshape`). Lowering for
`aten::expand_as` is exactly the same as for `aten::expand`, since
we're building the TE expression based on the output shape anyway.

Differential Revision:
D28774224
D28774224

Test Plan: Imported from OSS

Reviewed By: Chillee

Pulled By: ZolotukhinM

fbshipit-source-id: 0a1593c4c6500dcc5a374213adb734180ae1f72e
2021-05-29 20:36:32 -07:00
355b24438c make vector_norm backward call norm_backward (#59135)
Summary:
Per title. Remove duplicated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59135

Reviewed By: mruberry

Differential Revision: D28775716

Pulled By: ngimel

fbshipit-source-id: 50dc77590db15976453fc41c3657a77198749849
2021-05-29 12:14:46 -07:00
9fc0c5a54a OpInfo: tril, triu (#59145)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59145

Reviewed By: ngimel

Differential Revision: D28776433

Pulled By: mruberry

fbshipit-source-id: 2ff11a5202af1e73ffc2b242035c990646bd2259
2021-05-29 02:55:50 -07:00
1871d4e604 avoid explicitly casting low precision inputs to fp32 in norm (#59134)
Summary:
Per title. Now `norm` with fp16/bfloat16 inputs and fp32 outputs on CUDA won't do an explicit cast of the inputs to fp32.
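
A small usage sketch of the case this affects (illustrating the behavior, not the kernel change itself):

```python
import torch

x = torch.randn(1 << 20, device="cuda", dtype=torch.float16)
# fp16 input, fp32 output: after this change the CUDA reduction accumulates
# into fp32 directly instead of first materializing an fp32 copy of `x`.
out = torch.norm(x, p=2, dtype=torch.float32)
print(out.dtype)  # torch.float32
```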

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59134

Reviewed By: mruberry

Differential Revision: D28775729

Pulled By: ngimel

fbshipit-source-id: 896daa4f02e8a817cb7cb99ae8a93c02fa8dd5e9
2021-05-29 00:48:18 -07:00
d68df54269 OpInfo: fill_ (#59138)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59138

Reviewed By: ngimel

Differential Revision: D28776451

Pulled By: mruberry

fbshipit-source-id: 2e8e9f1805ec7d900223ea749a4a0b86a1bedb54
2021-05-29 00:35:02 -07:00
a427820350 [NNC] Added triangular_solve external call + fixed permute (#59131)
Summary:
The triangular_solve only returns the first input, since the second input is just a copy of the first one. Why does that exist?

Also, I fixed the permute lowering - I was previously doing the inverse application of the permute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59131

Reviewed By: ansley

Differential Revision: D28768169

Pulled By: Chillee

fbshipit-source-id: 8e78611c6145fb2257cb409ba98c14ac55cdbccf
2021-05-28 22:29:30 -07:00
c9af4c2636 OpInfo: where (#58349)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58349

Reviewed By: mrshenli

Differential Revision: D28744220

Pulled By: mruberry

fbshipit-source-id: 893a2fb88a48a60df75c7d6e2f58a42ca949daa7
2021-05-28 18:22:03 -07:00
b977a3b66d [c10d] Split custom class bindings out of python binding code (#58992)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58992

Currently, we define Torchbind custom classes in the same place that we define Python bindings.

This is nice from a code location perspective, but has two downsides:
1. These custom classes are not available in a C++-only build.
2. These break when included in torch::deploy.

Some explanation on the second issue: torch::deploy creates many Python
interpreters, and creates a full copy of all the bindings for each one. This
will run the static initialization code once for each copy of the bindings,
leading to multiple registration of the custom classes (and therefore an
error).

This PR splits out the relevant custom class binding code into its own source
file to be included in libc10d, which can be compiled and statically
initialized a single time and linked against from the c10d python bindings.
ghstack-source-id: 130168942

Test Plan: CI

Reviewed By: wconstab

Differential Revision: D28690832

fbshipit-source-id: 3c5e3fff28abb8bcdb4a952794c07de1ee2ae5a8
2021-05-28 15:35:23 -07:00
ab372ba510 [iOS GPU] Add debug information to track memory allocation exception (#59112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59112

ghstack-source-id: 130027273

Test Plan: CI

Reviewed By: linbinyu

Differential Revision: D28730604

fbshipit-source-id: 2ec7ca1b722a9fe496635cb6eea7e0d88b0c18b1
2021-05-28 12:16:29 -07:00
41054f2ab5 CUDA support in the CSR layout: sparse_to_dense/add_sparse_csr (#59011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59011

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719550

Pulled By: bhosmer

fbshipit-source-id: 530c7cd1b20ae6d8865fd414afaf6fab27a643e6
2021-05-27 20:59:22 -07:00
9c83e4160d Use some c10::ThreadLocal to avoid crashes on old Android toolchains (#59017)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59017

See the comment in ThreadLocal.h for context.
I used a slightly dirty preprocessor hack to minimize the number of changes.
The hope is that we'll be able to revert all of these soon.

Test Plan:
CI.
Built FB4A with gnustl and saw no references to cxa_thread_atexit
in the PyTorch libraries.

Reviewed By: ilia-cher

Differential Revision: D28720762

fbshipit-source-id: 0f13c7ac5a108b95f8fde6dbc63c6b8bdb8599de
2021-05-27 20:49:03 -07:00
4b3d17c0a2 Include Macros.h in ThreadLocal
Summary: This wasn't picking up C10_ANDROID.  Not sure how to prevent stuff like this.

Test Plan: Build for Android+gnustl, saw proper ThreadLocal being defined.

Reviewed By: swolchok

Differential Revision: D28720763

fbshipit-source-id: 58eb4ea80ad32a856fcea6d65e5c1c37ebf3bd55
2021-05-27 20:47:56 -07:00
0c1420aa3c OpInfo: fmod and remainder (#57941)
Summary:
See https://github.com/pytorch/pytorch/issues/54261

cc: mruberry Lezcano kshitij12345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57941

Reviewed By: mrshenli

Differential Revision: D28744464

Pulled By: mruberry

fbshipit-source-id: 19847277d4f8d3a39a706c2b3c9eddf0dedcb20c
2021-05-27 20:32:56 -07:00
657b75d155 Revert D28700259: [pytorch][PR] Migrate nonzero from TH to ATen (CPU)
Test Plan: revert-hammer

Differential Revision:
D28700259 (95b1bc1009)

Original commit changeset: 9b279ca7c36d

fbshipit-source-id: 267afe63376be598d24c862e02e3b4b3ea75f77c
2021-05-27 20:07:30 -07:00
4e543d017f Move remaining \*Sort\* in THC to ATen (#58953)
Summary:
https://github.com/pytorch/pytorch/issues/24637

CC zasdfgbnm ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58953

Reviewed By: mrshenli

Differential Revision: D28749713

Pulled By: ngimel

fbshipit-source-id: 33ce87cf77e23d5d67d193d6368131cb8dab39ae
2021-05-27 18:35:42 -07:00
f3aa61b9ed Add peephole for len(x.size()) (#59051)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59051

Test Plan: Imported from OSS

Reviewed By: Krovatkin

Differential Revision: D28727247

Pulled By: eellison

fbshipit-source-id: 6474d39773b640992bdaf261575a3dbd48c6d56c
2021-05-27 17:57:53 -07:00
b9dc51863c Add more shape functions for mobilenet (#58932)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58932

This adds all the operators necessary for mobilenet. I kind of wanted to get these landed to unblock ZolotukhinM, but I'm happy to split these up into multiple PRs if it makes reviewing easier. In terms of testing, I'm going to add an automated shape analysis OpInfo test.

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28727246

Pulled By: eellison

fbshipit-source-id: c17f9b7bdf7a43ddf99212b281ae2dd311259374
2021-05-27 17:57:52 -07:00
0ebc665305 Switch symbolic shape registry to operator map' (#58890)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58890

'

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28663681

Pulled By: eellison

fbshipit-source-id: 5b44fdf14a8ffe05606cc12897e366a64259650d
2021-05-27 17:57:50 -07:00
d8cbba3ee2 [JIT] Disable Complete Shape Inlining For Testing Purposes (#56966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56966

This PR adds a toggle to shape analysis which won't inline complete tensor shapes as constants into the shape compute graph, which is a good stress test on the partial evaluation pipeline.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28444664

Pulled By: eellison

fbshipit-source-id: a62e424515a8837a4b596546efa93af5e8e61f10
2021-05-27 17:57:48 -07:00
f66fbb1e2e Add unary/binary ops necessary for mobilenet (#56828)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56828

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28444660

Pulled By: eellison

fbshipit-source-id: 656673e6139550f2752c0d3ac2fb8731f4bf9bbb
2021-05-27 17:56:30 -07:00
40f851c53e Use dataclasses to simplify ShardingSpec (#58893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58893

Leverage dataclasses to simplify some of the ShardingSpec classes.
ghstack-source-id: 130041687

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28665137

fbshipit-source-id: da37517cf2bd8c65d4a5b7cae171fa460e6b0946
2021-05-27 17:33:28 -07:00
89d78851e6 [quant][refactor tests] Move qtensor serialization tests from test_deprecated_jit (#59089)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59089

Move these tests into test_quantized_tensor

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28750065

fbshipit-source-id: 5c4350d49dd07710b86ba330de80369403c6013c
2021-05-27 17:04:15 -07:00
886a2ddc83 [quant][refactor tests] Clean up test_quantization.py (#59088)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59088

Clean up comments and organize the tests better

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28750064

fbshipit-source-id: 4c36922e25e3adea3aaa8b4d9185dc28b17aa57c
2021-05-27 17:03:01 -07:00
f993ceffb5 TensorIteratorReduce: Avoid tensor operations in parallel_for (#58655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58655

Ref gh-56794

The two pass reduction calls `copy_` and `select` inside a parallel region. The
`copy_` can just be moved outside of the parallel region, but avoiding the
`select` call is more complicated because it's needed to construct the
`TensorIterator`. Instead, I factor out a `serial_for_each` free-function that
just takes pointers and strides. Then manually advance the pointer to the
thread-specific slice of data.

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D28735330

Pulled By: ngimel

fbshipit-source-id: 8e096eb5801af9381ebd305e3ae7796a79b86298
2021-05-27 15:58:03 -07:00
ef32a29c97 Back out "[pytorch][PR] ENH Adds dtype to nn.functional.one_hot" (#59080)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59080

Original commit changeset: 3686579517cc

Test Plan: None; reverting diff

Reviewed By: albanD

Differential Revision: D28746799

fbshipit-source-id: 75a7885ab0bf3abadde9a42b56d479f71f57c89c
2021-05-27 15:40:52 -07:00
3d2b55553b Retiring _module_copies field in DDP reducer. (#59094)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59094

Commented out _module_copies fields and changed for loops accordingly

Test Plan: Test cases mentioned in T91292908 passed successfully

Reviewed By: SciPioneer

Differential Revision: D28736135

fbshipit-source-id: 1857102f0c57a734026f3025e9653d8fad57d0b6
2021-05-27 15:09:14 -07:00
c6c563fc26 Added minor fixes to Az DevOps Build Logic (#59016)
Summary:
This PR also adds a few minor logic changes to the custom PyTorch PR test logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59016

Reviewed By: mrshenli

Differential Revision: D28732437

Pulled By: malfet

fbshipit-source-id: 14b7ed837209d77e0e175d92959aeb0f086e6737
2021-05-27 14:35:11 -07:00
61f946bba6 don't copy indices to the self device in dispatch_index (#59059)
Summary:
Let the index/index_put implementation in ATen take care of moving the indices to the correct device instead of having the Python wrapper do it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59059

Reviewed By: mruberry

Differential Revision: D28750562

Pulled By: ngimel

fbshipit-source-id: 2f2b5f875733898f1c0b30b544c89808f91e4a6f
2021-05-27 14:19:59 -07:00
16ae6cad3d Revert D28615349: [pytorch][PR] Add a ci/no-build label
Test Plan: revert-hammer

Differential Revision:
D28615349 (bae06a0293)

Original commit changeset: 1ed521761ca4

fbshipit-source-id: 987439c2570bbffc0f0f8517d82970a3a4add789
2021-05-27 14:17:06 -07:00
bae06a0293 Add a ci/no-build label (#58778)
Summary:
Depends on https://github.com/pytorch/pytorch-probot/pull/22. Adds a new label called `ci/no-build` that disables the CircleCI `build` workflow on PRs. The current behavior should be the same in the absence of `ci/no-build`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58778

Reviewed By: malfet

Differential Revision: D28615349

Pulled By: samestep

fbshipit-source-id: 1ed521761ca4ffa32db954a51918f693beddb3f3
2021-05-27 14:03:03 -07:00
3e2db56dcf [docs] document dim argument to tensor.size() (#58777)
Summary:
[docs] document dim argument to tensor.size()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58777

Reviewed By: gchanan

Differential Revision: D28641109

Pulled By: zou3519

fbshipit-source-id: 5cb46bb8abe45ed299843af38515e5db89ad02a1
2021-05-27 13:51:56 -07:00
18302bcdf3 Add script to cancel workflows (#59019)
Summary:
This removes our cancel_redundant_workflows job in favor of GitHub's built in [`concurrency`](https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#concurrency) keyword which limits runs of a particularly named group. Since the group names have to be unique per job per PR, it should end up looking something like `filename-job_name-{pr number | sha (for non-PR workflows)}`. There's also a script to check workflows and ensure that it is being properly gated so people don't forget to add the key in the future.
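
A hypothetical sketch (not the actual script added in this PR) of what such a gating check could look like:

```python
from pathlib import Path

import yaml  # PyYAML

def workflows_missing_concurrency(workflows_dir: str = ".github/workflows") -> list:
    """Return workflow files that do not define a `concurrency` group."""
    missing = []
    for path in sorted(Path(workflows_dir).glob("*.yml")):
        doc = yaml.safe_load(path.read_text()) or {}
        if "concurrency" not in doc:
            missing.append(path.name)
    return missing

if __name__ == "__main__":
    print(workflows_missing_concurrency())
```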

`ruamel.YAML` also didn't like some of the spacing, so that is changed, which also makes the workflow files more consistent.

This also has a minor change of renaming the workflow templates from `.in` to `.j2` which is the standard Jinja2 extension that the VSCode extension automatically picks up for syntax highlighting / errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59019

Test Plan: pushed a commit `reset` and then immediately another commit `test`: the jobs from `reset` are cancelled: https://github.com/pytorch/pytorch/actions/runs/880099510

Reviewed By: samestep

Differential Revision: D28722419

Pulled By: driazati

fbshipit-source-id: c547a161877a0583be9d7edb29244b086b6bcad1
2021-05-27 12:32:15 -07:00
920619dc2b [PyTorch] Save a refcount bump in meta functions for addmm and mm (#59063)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59063

`TensorMeta::maybe_get_output()` returns `const Tensor&`, so there is no need to copy the Tensor.
ghstack-source-id: 130044287

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28735225

fbshipit-source-id: f2bdf39b28de245ec4664718490e7e0b36bc8819
2021-05-27 12:15:52 -07:00
2c17b6a0fe [ONNX] Enable support for roll() op. (#58389) (#58697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58697

1. Add a symbolic function for aten::roll() op in symbolic_opset9.py.
2. Add a test with multiple scenarios as well.

Test Plan: Imported from OSS

Reviewed By: driazati, bhosmer

Differential Revision: D28714807

Pulled By: SplitInfinity

fbshipit-source-id: eae85f2dcf02737c9256a180f6905a935ca3f57e

Co-authored-by: fatcat-z <jiz@microsoft.com>
2021-05-27 12:06:45 -07:00
1aabb8f98c [ONNX] handle aten::_set_item on Dict in convertInplaceOpsAndTrackAlias (#58317) (#58696)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58696

It seems the JIT produces an output for aten::_set_item on lists but
not on dicts. Previously the code would crash because it assumed it
was operating on a list.

The different behavior can be seen with the following test:

```python
class DictModule(torch.nn.Module):
    def forward(self, x_in: torch.Tensor) -> typing.Dict[str, torch.Tensor]:
        x_out = {}
        x_out["test_key_out"] = x_in
        return x_out

x_in = torch.tensor(1)
dms = torch.jit.script(DictModule())
torch.onnx.export(dms, (x_in,), "/dev/null", example_outputs=(dms(x_in),))
```

Before this change:
`RuntimeError: outputs_.size() == 1INTERNAL ASSERT FAILED at "../torch/csrc/jit/ir/ir.h":452, please report a bug to PyTorch.`

After this change:
`RuntimeError: Exporting the operator prim_DictConstruct to ONNX opset version 9 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub.`

This is a more useful error message.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714804

Pulled By: SplitInfinity

fbshipit-source-id: 1e5dc5fb44d1e3f971a22a79b5cf009d7590bf84

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:44 -07:00
0a6828a306 [ONNX] use consistent quoting for string literals (#57757) (#58695)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58695

As PEP8 says: "Pick a rule and stick to it." [1]

[1] https://www.python.org/dev/peps/pep-0008/#string-quotes

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714811

Pulled By: SplitInfinity

fbshipit-source-id: c95103aceb1725c17c034dc6fc8216627f189548

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:42 -07:00
b27fc0ff85 [ONNX] Improve lower tuples and handle control flow (#57650) (#58694)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58694

Improving the logic for finding tuple patterns within control flow.
Also fixes: https://github.com/pytorch/pytorch/issues/56914

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714806

Pulled By: SplitInfinity

fbshipit-source-id: 1552100cf9cc88e6f58df2e90758e8898ba0a9b3

Co-authored-by: neginraoof <neginmr@utexas.edu>
2021-05-27 12:06:40 -07:00
57c9355a0d [ONNX] Update special post process for SequenceInsert after SequenceEmpty (#56965) (#58693)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58693

`ONNX::SequenceEmpty` requires a dtype to be provided and defaults to float. We update the dtype of a previously created `ONNX::SequenceEmpty` node when the dtype is later discovered, through a downstream `ONNX::SequenceInsert` node, to be something other than float. This PR improves the algorithm to cover the nested-loop case.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714808

Pulled By: SplitInfinity

fbshipit-source-id: e45ab3a12d0fec637733acbd3cd0438ff80d2cd4

Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-27 12:06:39 -07:00
b8c96e6b08 Support symbolic for conv_tbc (#58359) (#58692)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58692

This is a fix for exporting fairseq models, see:
```python
model = torch.hub.load(github, 'conv.wmt14.en-fr', tokenizer='moses', bpe='subword_nmt')
model = torch.hub.load(github, 'conv.wmt17.en-de', tokenizer='moses', bpe='subword_nmt')
```
With this fix, and comment out model script one line `GradMultiply`, these two models can be exported successfully with perf met.

The original PR https://github.com/pytorch/pytorch/pull/57708 has merging issue, use this one instead.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714809

Pulled By: SplitInfinity

fbshipit-source-id: 71c2de6cec7ee05af68560996acf47d97af46fb2

Co-authored-by: David <jiafa@microsoft.com>
2021-05-27 12:06:37 -07:00
d101816fdd [ONNX] RNN scripting (#57564) (#58691)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58691

Note the first commit in this PR has its own pull request here since it seemed self-contained:
https://github.com/pytorch/pytorch/pull/57082

* [ONNX] simplify batch_first logic in RNN tests

* [ONNX] support GRU with packed input in scripting mode

This required two changes:
* Add as_tensor to symbolic_opset9.py
* Change torch::jit::pushPackingPastRnn to recognize and properly
  replace another use of the batch_sizes output of prim::PackPadded.
  Previously the code assumed that the first use was as input to the
  RNN operator. However in some cases, it is also used to compute
  max_batch_size. For example in this code:
  https://github.com/pytorch/pytorch/blob/febff45/torch/nn/modules/rnn.py#L815-L815

With these changes the GRU tests now pass in scripting mode for opset
version >= 11.

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714805

Pulled By: SplitInfinity

fbshipit-source-id: f19647a04533d9ec76399a8793b3f712ea0337d2

Co-authored-by: Gary Miguel <garymiguel@microsoft.com>
2021-05-27 12:06:35 -07:00
51d14b6859 [ONNX] Update instance_norm2 symbolic to handle track_running_stats=True (#55051) (#58690)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58690

Fixes [#53887](https://github.com/pytorch/pytorch/issues/53887)
Not input calling running_mean and running_var when track_running_stats=True

Test Plan: Imported from OSS

Reviewed By: driazati

Differential Revision: D28714812

Pulled By: SplitInfinity

fbshipit-source-id: 3f2f2ec9a7eaf8a1432a552d751cbd5974b20195

Co-authored-by: hwangdeyu <deyhuang@qq.com>
2021-05-27 12:05:26 -07:00
ba694520e5 [ROCm] fix JIT codegen (#57400)
Summary:
Fixes upcoming changes that are part of ROCm 4.2 and affect PyTorch JIT.

- ROCM_VERSION macro must be available to both device and host compilation passes.
- Unifies some of CUDA and HIP differences in the code generated.
  - NAN / POS_INFINITY / NEG_INFINITY
  - Do not hipify `extern __shared__` -> `HIP_DYNAMIC_SHARED()` macro [deprecated]
- Differentiates bf16 codegen for HIP.
- Optionally provides missing macros when using hiprtc precompiled header feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57400

Reviewed By: ejguan

Differential Revision: D28421065

Pulled By: malfet

fbshipit-source-id: 215f476773c61d8b0d9d148a4e5f5d016f863074
2021-05-27 11:45:07 -07:00
7e4e648c2a Enable NNC fusion for relu6 (#58773)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58773

Test Plan:
```
python test/test_ops.py -k relu6
python test/test_jit_fuser_te.py
```

Reviewed By: bertmaher

Differential Revision: D28721791

Pulled By: desertfire

fbshipit-source-id: a94f711977afd080faae052f66eb8dded3cdc79e
2021-05-27 10:54:02 -07:00
0106fe3934 avg_pool2d: port to structured (#58987)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58987

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28717067

Pulled By: ezyang

fbshipit-source-id: 984a8ae8bc05811b787fdac565566f359b55a3d6
2021-05-27 10:51:11 -07:00
d935259171 Remove redundant code from LayerNorm Fake Op. (#59054)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59054

to handle elementwise_affine

Test Plan: GLOW_NNPI=1 buck run -c fbcode.platform=platform009 //caffe2/caffe2/contrib/fakelowp/test:test_layernorm_nnpi_fp16nnpi

Reviewed By: hx89

Differential Revision: D28726804

fbshipit-source-id: b03485e98d490d4e9e1b178a8c50677b77e27596
2021-05-27 10:35:32 -07:00
b14c3205fd [JIT] Add torch._C.ScriptDict (#52659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52659

**Summary**
This commit adds `torch._C.ScriptDict`, a dictionary type that has reference
semantics across the Python/TorchScript boundary. That is, modifications
made to instances of `torch._C.ScriptDict` in TorchScript are visible in
Python even when it is not returned from the function. Instances can be
constructed by passing an instance of a Python dictionary to
`torch.jit.script`. In the case of an empty dictionary, its type is
assumed to be `Dict[str, Tensor]` to be consistent with the handling of
empty dictionaries in TorchScript source code.

`torch._C.ScriptDict` is implemented using a modified version of pybind's `stl_bind.h`-style bindings attached to `ScriptDict`, `ScriptDictIterator` and `ScriptDictKeyIterator`, wrapper classes around `c10::impl::GenericDict` and `c10::impl::GenericDict::iterator`. These bindings allow instances of `torch._C.ScriptDict` to be used as if it were a regular `dict` Python. Reference semantics are achieved by simply retrieving the `IValue` contained in `ScriptDict` in `toIValue` (invoked when converting Python arguments to `IValues` before calling TorchScript code).
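
A minimal usage sketch based on the description above (assuming `torch.jit.script` accepts a plain Python dict, as stated):

```python
from typing import Dict

import torch

@torch.jit.script
def add_entry(d: Dict[str, torch.Tensor]) -> None:
    d["x"] = torch.ones(2)

# Scripting a Python dict yields a torch._C.ScriptDict with reference semantics.
d = torch.jit.script({"y": torch.zeros(2)})
add_entry(d)
print("x" in d)  # expected: True, the mutation made in TorchScript is visible here
```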

**Test Plan**
This commit adds `TestScriptDict` to `test_list_dict.py`, a set of tests
that check that all of the common dictionary operations are supported
and that instances have reference semantics across the
Python/TorchScript boundary.

Differential Revision:
D27211605
D27211605

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Pulled By: SplitInfinity

fbshipit-source-id: 446d4e5328375791aa73eb9e8b04dfe3465af960
2021-05-27 10:25:30 -07:00
95b1bc1009 Migrate nonzero from TH to ATen (CPU) (#58811)
Summary:
Closes gh-24745

The existing PR (gh-50655) has been stalled because `TensorIterator` doesn't guarantee iteration order in the same way that `TH_TENSOR_APPLY` does. For contiguous test cases this isn't an issue; but it breaks down for example with channels last format. I resolve this by adding a new `TensorIteratorConfig` parameter, `enforce_linear_iteration`, which disables dimension reordering. I've also added a test case for non-contiguous tensors to verify this works.

This PR also significantly improves performance by adding multithreading support to the algorithm. As part of this, I wrote a custom `count_nonzero` that gives per-thread counts, which is necessary for each thread to write its outputs to the right location.
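
A quick sketch of the ordering guarantee that motivates `enforce_linear_iteration` (assuming `nonzero` should return indices in row-major order of the logical shape even for non-contiguous layouts, as described above):

```python
import torch

x = torch.zeros(2, 3, 4, 4).to(memory_format=torch.channels_last)
x[0, 1, 2, 3] = 1
x[1, 0, 0, 0] = 1
# Indices should come back in row-major (C) order of the logical shape,
# regardless of the channels-last physical layout of the storage.
print(torch.nonzero(x))
```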

|    Shape   |  Before | After (1 thread) | After (8 threads) |
|:----------:|--------:|-----------------:|------------------:|
| 256,128,32 | 2610 us |          2220 us |            496 us |
| 128,128,32 | 1250 us |           976 us |            175 us |
|  64,128,32 |  581 us |           486 us |             88 us |
|  32,128,32 |  292 us |           245 us |             80 us |
|  16,128,32 |  147 us |           120 us |             71 us |
|  8,128,32  |   75 us |            61 us |             61 us |
|  4,128,32  |   39 us |            32 us |             32 us |
|  2,128,32  |   20 us |            17 us |             17 us |
|  1,128,32  |   11 us |             9 us |              9 us |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58811

Reviewed By: anjali411

Differential Revision: D28700259

Pulled By: ngimel

fbshipit-source-id: 9b279ca7c36d8e348b7e5e4be0dd159e05aee159
2021-05-27 10:06:54 -07:00
934f6dca65 Fix pthreadpool guard test (#58977)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58977

* The test was flaky because part of it ran asynchronously
* Remove the async part so that only the added functionality is tested

Test Plan:
regular test:

`buck test mode/dev //caffe2/aten:test_thread_pool_guard -- --exact 'caffe2/aten:test_thread_pool_guard - TestThreadPoolGuard.TestRunWithGuard' --run-disabled`

stress test:

`buck test mode/dev //caffe2/aten:test_thread_pool_guard -- --exact 'caffe2/aten:test_thread_pool_guard - TestThreadPoolGuard.TestRunWithGuard' --run-disabled --jobs 18 --stress-runs 10 --record-results`

Reviewed By: kimishpatel

Differential Revision: D28703064

fbshipit-source-id: be19da3f42f44288afc726bdb2f40342eee26e01
2021-05-27 09:49:52 -07:00
e89b150a39 [typing] Pyre fixes for remote_module (#59046)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59046

Correcting type hint for _RemoteModule to pass Pyre checks.

Test Plan: N/A

Reviewed By: walterddr, SciPioneer

Differential Revision: D28725237

fbshipit-source-id: 1ca714bbf1a597a29850f70bac826a0c95a4019f
2021-05-27 09:44:50 -07:00
11aa5e4f66 Add underscores to some internal names (#59027)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59027

Add underscores to some of the internal names

Test Plan:
python test/test_profiler.py -v

Imported from OSS

Reviewed By: mrshenli

Differential Revision: D28724294

fbshipit-source-id: 1f6252e4befdf1928ac103d0042cbbf40616f74a
2021-05-27 09:39:28 -07:00
617b74aa35 [nnc] LLVMCodeGen for any target (#58713)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58713

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28585722

Pulled By: bertmaher

fbshipit-source-id: 82885b9780dc1a8610660a90969d8d2baad97920
2021-05-27 09:25:15 -07:00
a1806134a7 [QAT] Fix the runtime run cannot resize variables that require grad (#57068)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57068

When training with histogram observer on, we got this runtime error:
```
torch/quantization/observer.py", line 942, in forward
                    self.bins)

            self.histogram.resize_(combined_histogram.shape)
            ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            self.histogram.copy_(combined_histogram)
            self.min_val.resize_(combined_min.shape)
RuntimeError: cannot resize variables that require grad
```

Since this is the histogram observer, which is only used to collect histogram statistics, it should not need gradients. So turn off grad before resizing by using the `detach_()` method.
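
A minimal standalone sketch of the failure and the fix (not the observer code itself):

```python
import torch

hist = torch.zeros(4, requires_grad=True)
try:
    hist.resize_(8)            # raises: cannot resize variables that require grad
except RuntimeError as e:
    print(e)

hist.detach_()                 # histogram stats don't need gradients
hist.resize_(8)                # now succeeds
print(hist.shape)              # torch.Size([8])
```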

Test Plan:
- arc lint
- Train with histogram observer turned on, training finished successfully

f264139727

Reviewed By: supriyar

Differential Revision: D27147212

fbshipit-source-id: abed5b9c4570ffc6bb60e58e64791cfce66856cd
2021-05-27 09:12:06 -07:00
25ac647f64 [QAT] Auto format the torch/quantization/observer.py` (#57067)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57067

auto format the code

Test Plan: lint

Reviewed By: jerryzh168

Differential Revision: D27147213

fbshipit-source-id: 008871d276c8891b2411549e17617e5c27d16ee3
2021-05-27 09:10:34 -07:00
9baf75c86e add test_filename field in scribe upload (#59024)
Summary:
Add test filename dimension to scribe upload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59024

Test Plan: CI - validate result on scuba table

Reviewed By: janeyx99

Differential Revision: D28726711

Pulled By: walterddr

fbshipit-source-id: 78a1708787f0507d1171800f633e1f7137f629cd
2021-05-27 08:21:05 -07:00
7461792c4a adding upload step on all build jobs (#58998)
Summary:
Relates to https://github.com/pytorch/pytorch/issues/58826.

Currently we don't collect the exact build time for non-binary jobs. Collecting this reports the exact build time from when the PyTorch checkout finishes until the build stage succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58998

Test Plan: CI - validate result on scuba table

Reviewed By: janeyx99

Differential Revision: D28747962

Pulled By: walterddr

fbshipit-source-id: 715d91d597bc004977fdceaf245263c9c8aacc84
2021-05-27 08:17:01 -07:00
3d70ab08ae bump out repeat_interleave BC allow date (#59057)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59057

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Differential Revision: D28732990

Pulled By: bhosmer

fbshipit-source-id: 27a9fe9169b2da9405d2c3f235e7c015896fe7fc
2021-05-26 23:32:05 -07:00
74089a0d34 [quant][refactor tests] Move quantization tests into subfolders (#59007)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59007

Create folders for each test category and move the tests.
Will follow-up with a cleanup of test_quantization.py

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28718742

fbshipit-source-id: 4c2dbbf36db35d289df9708565b7e88e2381ff04
2021-05-26 23:02:12 -07:00
e146ed21fb [quant][refactor tests] Move TestModelNumerics to a separate file (#59000)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59000

These tests span both QAT and PTQ APIs so factor them out

Test Plan:
python test/test_quantization.py TestModelNumericsEager

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28713910

fbshipit-source-id: b2ad27cf59abb7cc0c4e4da705f8c9220410f8ad
2021-05-26 23:02:11 -07:00
b6c5c5d90e [quant][refactor tests] Rename test_numeric_suite and equalization tests (#58999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58999

Rename the test files to be more explicit that they are for eager mode

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28713909

fbshipit-source-id: b4ccd06c841fe96edf8c065a0bceae15fed260f9
2021-05-26 23:02:09 -07:00
82d587f434 [quant][refactor tests] split test_workflow_module into test_workflow_ops and test_workflow_module (#58963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58963

some tests are used to check the op level numerics of the fake quantize operations

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: HDCharles

Differential Revision: D28696599

fbshipit-source-id: 98f9b0c993dd43050176125461ddd5288142989b
2021-05-26 23:01:08 -07:00
0e9a295b41 Refactor GlooDeviceFactory::makeDeviceFor... (#58996)
Summary:
`makeDeviceForHostname` and `makeDeviceForInterface` are almost
duplicates, differing only in their default argument values.

Create a generic `makeGlooDevice` anonymous function that takes both a host
name and an interface name, and call it from both
makeDeviceFor[Hostname|Interface].

Also solve two other minor issues:
 - do not call `getenv("GLOO_DEVICE_TRANSPORT")` during library load
   time
 - raise an exception rather than crashing if GLOO_DEVICE_TRANSPORT is set to an unknown value

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58996

Reviewed By: pbelevich

Differential Revision: D28713324

Pulled By: malfet

fbshipit-source-id: cb33b438078d163e3ec6f047f2e5247b07d94f8d
2021-05-26 20:33:11 -07:00
9e60c7dee3 Add docstring for is_inference_mode_enabled (#59047)
Summary:
Fixes #{issue number}

Testing:
```
>>> import torch
>>> torch.is_inference_mode_enabled.__doc__
'\nis_inference_mode_enabled(input) -> (bool)\n\nReturns True if inference mode is currently enabled.\n\nArgs:\n    input (Tensor): the input tensor.\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59047

Reviewed By: ailzhang

Differential Revision: D28726991

Pulled By: soulitzer

fbshipit-source-id: c117c7d73e551a1b5f0e215f2aed528bf558ef7c
2021-05-26 19:27:33 -07:00
1bd22e28b3 BFloat16 support for torch.sort (#58196)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58196

Reviewed By: anjali411

Differential Revision: D28721364

Pulled By: ngimel

fbshipit-source-id: 0785f7100fb76d69da7a73022c7d2eb43c91fa6e
2021-05-26 16:49:03 -07:00
f4a890d7c6 fix unique for discontiguous inputs (#59003)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58959
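
A repro-style sketch of the case being fixed (assuming a CUDA device is available; this is not the actual test from the PR):

```python
import torch

x = torch.randint(0, 5, (4, 6), device="cuda").t()   # transpose -> non-contiguous
assert not x.is_contiguous()

# unique on the discontiguous view must match unique on a contiguous copy
assert torch.equal(torch.unique(x), torch.unique(x.contiguous()))
```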

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59003

Reviewed By: mruberry

Differential Revision: D28714534

Pulled By: ngimel

fbshipit-source-id: d9bf82f54be5b5919e27281e49fad74e00d8b766
2021-05-26 16:43:19 -07:00
b435a27fb7 CUDA support in the CSR layout: constructors (#59010)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59010

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28719287

Pulled By: bhosmer

fbshipit-source-id: fbb5784ccb5ce19dcca1f2f95c4ee16f9b7680c4
2021-05-26 16:39:43 -07:00
7c17e1dd90 [fx2trt] Quantized uru on gpu (#58965)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58965

Test Plan:
```
// This script is just for playing around
buck run mode/opt -c python.package_style=inplace deeplearning/trt/fx2trt:fx2trt_quantized_test

// To check accuracy
buck run mode/opt -c python.package_style=inplace deeplearning/trt/fx2trt:uru_10x10_to_trt_eval.py
```

Reviewed By: mortzur

Differential Revision: D28445702

fbshipit-source-id: 5357a02a78cb7f9cf772e7a91a08166ef90cc4f8
2021-05-26 16:00:34 -07:00
58d1b3639b fix nn.MHA scriptability (#58727)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58727

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D28593830

Pulled By: bhosmer

fbshipit-source-id: 37dee9efededaea9985a2bf040df1ba4b46f6580
2021-05-26 15:29:49 -07:00
ac67cda272 [PyTorch] Rename TI::add_borrowed_{in,out}put to TI::add_{in,out}put (#58608)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58608

D28523254 (705dd9ffac) ensures that this was safe: we renamed away all the internal uses of add_input/add_output. (Also, practically everything I found internally could borrow, and the stuff that couldn't wouldn't compile because it was passed unnamed temporaries.)
ghstack-source-id: 129882758

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28524585

fbshipit-source-id: 437235d5cc55c3737c928991a996b8f5e1c5beaa
2021-05-26 15:06:28 -07:00
7db36c0792 [PyTorch] Add temporary guardrail to borrowing_ op variants on TensorIterator (#58607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58607

Don't let code that tries to pass temporaries to these variants compile.
ghstack-source-id: 129882759

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D28524227

fbshipit-source-id: e5ce80f048480c67645198eaa0e43532567d4adb
2021-05-26 15:06:27 -07:00
bed0eb5428 [PyTorch] Add TI::add_owned_{in,out}put for clarity & use them (#58606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58606

Removes the pit of non-success around using the owning variants; gives us the option to make add_{in,out}put borrow in the future as a pit of success if we decide that's not bc-breaking.
ghstack-source-id: 129882760

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28523976

fbshipit-source-id: ab5eb7bf5d672a0f8c4a50eb8a21c156d4189709
2021-05-26 15:05:08 -07:00
4f390eb6b6 Document factory_kwargs in nn.Quantize + remove Attributes section (#59025)
Summary:
The `factory_kwargs` kwarg was previously undocumented in `nn.Quantize`. Further, the `Attributes` section of the docs was improperly filled in, resulting in bad formatting. This section doesn't apply since `nn.Quantize` doesn't have parameters, so it has been removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59025

Reviewed By: anjali411

Differential Revision: D28723889

Pulled By: jbschlosser

fbshipit-source-id: ba86429f66d511ac35042ebd9c6cc3da7b6b5805
2021-05-26 14:40:48 -07:00
a749e8edf5 Add UninitializedBuffer to nn docs (#59021)
Summary:
The `UninitializedBuffer` class was previously left out of `nn.rst`, so it was not included in the generated documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59021

Reviewed By: anjali411

Differential Revision: D28723044

Pulled By: jbschlosser

fbshipit-source-id: 71e15b0c7fabaf57e8fbdf7fbd09ef2adbdb36ad
2021-05-26 14:36:05 -07:00
de22657e1c [PyTorch] Replace RecordFunction shouldRun callback with atomic bools (#56504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56504

Having callbacks registered but disabled via their
`shouldRun` callback defeats the `shouldRunRecordFunction`
optimization (no relation between the two things, despite the
shared prefix on the names) that aims to skip `RecordFunction`
construction.

This diff attempts to safely rectify this issue: we drop support for
`shouldRun` callbacks (this is bc-breaking; does anything use these
externally? do I need to add the support back and just stop using it
internally?), add support for enabling and disabling callbacks, and
(for global callbacks) make doing so thread-safe.

There is an interesting subtlety with `std::atomic` that came up: it
is neither copyable nor movable, which precludes putting it into
`std::vector`. I manually overrode this because the thread safety
reasons it is neither copyable nor movable don't apply here; we
already state that adding or removing callbacks (the operations that
might copy/move an atomic) are not thread-safe and should be done at
initialization time.
ghstack-source-id: 129614296

Test Plan:
Existing CI should cover correctness, right?  Inspected
perf report of a simple benchmark that runs nn.Linear in a loop on
CUDA, where internally have Kineto initialized and thus had a
shouldRun observer previously; we are no longer going through the
dispatcher's slow RecordFunction path or spending measurable time
constructing RecordFunction instances.

Reviewed By: ilia-cher

Differential Revision: D27834944

fbshipit-source-id: 93db1bc0a28b5372f7307490c908457e7853fa92
2021-05-26 14:31:33 -07:00
ac07c6451e [nnc] Use BufHandle in loopnest.cache_accesses python API (#59006)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59006

Related github issue: https://github.com/pytorch/pytorch/issues/59002

Test Plan:
Imported from OSS

Tested in the github issue: https://github.com/pytorch/pytorch/issues/59002

Reviewed By: bertmaher

Differential Revision: D28714829

Pulled By: huiguoo

fbshipit-source-id: 5fd7d5426c5cdb5af30731f662b083d2bd611bc4
2021-05-26 13:58:55 -07:00
b93e7a7602 concurrency fixes (#58961)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58961

Reviewed By: anjali411

Differential Revision: D28700307

Pulled By: Krovatkin

fbshipit-source-id: 1fed90c64e88aaf2824c48e006f66a9266d1e163
2021-05-26 13:53:44 -07:00
97c1179c9d Revert D28549240: [typing] Pyre fixes for batch_distributed_inference
Test Plan: revert-hammer

Differential Revision:
D28549240 (671c224b0a)

Original commit changeset: dadfedf93aae

fbshipit-source-id: 820fefccf2b4c6368defd762ce55245dd35505ca
2021-05-26 13:39:30 -07:00
0d5527de7a Back out "Back out "[ONNX] Process const folding progressively when converts to ONNX (#54569)"" (#58923)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58923

Original commit changeset: c54597b2048e
ghstack-source-id: 129842041

Test Plan: Sandcastle and OSS CI.

Reviewed By: snisarg

Differential Revision: D28432555

fbshipit-source-id: 2a9ec22cc004c7c6979f1cc8f3124b833cdc6634
2021-05-26 13:29:07 -07:00
b420ded66f ShardedTensor framework for ChunkedShardingSpec (#58517)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58517

Building upon the sharding specifications, this PR introduces the
initial skeleton of ShardedTensor and allows building a ShardedTensor by
specifying ChunkedShardingSpec.

In follow up PRs, I'll add further support for GenericShardingSpec.
ghstack-source-id: 129917841

Test Plan:
1) unit tests.
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28526012

fbshipit-source-id: 8e62847b58957d284e40f57a644302c171289138
2021-05-26 13:24:23 -07:00
671c224b0a [typing] Pyre fixes for batch_distributed_inference
Summary:
Pyre does not support dynamic imports, so we can leave the pyre-ignores for those. (https://fb.workplace.com/groups/pyreqa/permalink/3119812734775204/)

Parameterized pyre-ignore are also necessary as explained by [this Q&A](https://www.internalfb.com/intern/qa/109058/pyre-says-undefined-attribute-16-module-parameteri)

Test Plan:
- `pyre -l .`
- `pyre check`
- `buck test //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test`

Reviewed By: vipannalla

Differential Revision: D28549240

fbshipit-source-id: dadfedf93aae860fe6d0a112002bdfe743139b1e
2021-05-26 13:08:19 -07:00
be47060af9 [remove xla from codegen] rename aten_xla_type.h -> DispatchKeyNativeFunctions.h (#58568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58568

I split out the file rename into a separate commit to make the diff easier. The template file name is `aten_xla_type.h` -> `{DispatchKey}NativeFunctions.h`

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D28711298

Pulled By: bdhirsh

fbshipit-source-id: 2fa7d2abede560a2c577300f0b5a1f7de263d897
2021-05-26 12:53:19 -07:00
86ce2950f6 remove xla-specific stuff from codegen (minus CPU fallback) (#58064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58064

**Summary**
This PR tries to remove all xla-specific logic from the codegen except for two places:
- renaming the `aten_xla_type.h/cpp` template files; Going to do that in a separate PR just to make the diff easier to understand
- CPU fallback logic (everything in `aten_xla_type_default.h/cpp` and `gen_external_aten_fallbacks.py`). I'm trying to kill all of that logic in a subsequent PR by making the CPU fallback a boxed kernel, so it felt unnecessary to go through it all and remove the xla references here.

**Notable changes**
The xla codegen includes some custom logging in each kernel wrapper, so I added a few new knobs to the external yaml, that we now test. I have a corresponding [xla-side PR](https://github.com/pytorch/xla/pull/2944) with the new yaml changes, which look like this:
```
per_op_log: XLA_FN_TRACK(3)
per_argument_log: TF_VLOG(3)
cpu_fallback_counter: XLA_COUNTER("aten::{name}", 1)
extra_headers: >
     #include <tensorflow/compiler/xla/xla_client/debug_macros.h>
     #include <tensorflow/compiler/xla/xla_client/metrics.h>
     #include <tensorflow/compiler/xla/xla_client/tf_logging.h>
     #include <torch_xla/csrc/function_call_tracker.h>
     #include <torch_xla/csrc/aten_xla_type.h>
     #include <torch_xla/csrc/aten_xla_type_default.h>
```

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D28711095

Pulled By: bdhirsh

fbshipit-source-id: 90a48440f2e865a948184e2fb167ea240ada47bb
2021-05-26 12:52:13 -07:00
511979df85 Define the SYCL device version __assert_fail when the NDEBUG defined. (#58906)
Summary:
## Motivation
The utils in the `c10` namespace require `__assert_fail` when NDEBUG is defined in kernel code.

The `__assert_fail` declaration in PyTorch is not compatible with the SYCL specification.

This causes a compile error when using these utils in SYCL kernels.

## Solution
Add the `__assert_fail` declaration for SYCL kernels to pytorch when compiling the SYCL kernels with `c10` utils.

## Additional context
`__assert_fail` in SYCL kernel

`extern SYCL_EXTERNAL void __assert_fail(const char *expr, const char *file, unsigned int line, const char *func);`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58906

Reviewed By: anjali411

Differential Revision: D28700863

Pulled By: ezyang

fbshipit-source-id: 81896d022b35ace8cd16474128649eabedfaf138
2021-05-26 12:48:37 -07:00
2e2a75720b [structured] remainder (#58732)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58732

Reviewed By: gchanan

Differential Revision: D28666480

Pulled By: ezyang

fbshipit-source-id: f247365f2e6b3cdf29f7cc242f179041b968e75a
2021-05-26 12:44:32 -07:00
29487ac7ff Add 11.3 binaries without conda (#58877)
Summary:
Tested specifically for cuda 11.3 in https://github.com/pytorch/pytorch/pull/57522.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58877

Reviewed By: walterddr

Differential Revision: D28697703

Pulled By: janeyx99

fbshipit-source-id: 08ae7f7d023cb93e47a2e0a4f115cee9e8a6156a
2021-05-26 12:40:13 -07:00
24508337f4 Revert D28643215: Adds an aten::_ops namespace with unambiguous function names
Test Plan: revert-hammer

Differential Revision:
D28643215 (28740869a1)

Original commit changeset: 7b2b8459f1b2

fbshipit-source-id: ea869bf4cfde7038087e990b2cff5a86f9e2a531
2021-05-26 12:35:34 -07:00
12418a4f86 Back out "Revert D28664514: [pytorch][PR] various TensorIterator speed improvements"
Summary: Original commit changeset: fcad039b7dc8

Test Plan: Existing tests

Reviewed By: mruberry

Differential Revision: D28720186

fbshipit-source-id: 14ac99ee2d7cafb86b20c979f8917beeefd616e1
2021-05-26 12:22:48 -07:00
17fb651a3b Make torch.Tensor(torch.tensor(1.0)) work (#58885)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58885

Fixes #58884

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28687510

Pulled By: ezyang

fbshipit-source-id: 81325f501cc3e83cbac02f7c44ded9d396356bb8
2021-05-26 11:33:05 -07:00
e24362746a [nnc] Concat input shapes must be known to fuse (#58974)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58974

I don't know how we overlooked this for so long...
ghstack-source-id: 129932134

Test Plan:
Predictor test of model 184778294_0 using multiple request replay
threads.  It's not clear to me why multithreading matters, except that perhaps
it makes it easier to get an unknown shape in the profile.

Reviewed By: navahgar

Differential Revision: D28702660

fbshipit-source-id: 565550b1d2e571d62d0c8b21150193f2a7ace334
2021-05-26 11:29:26 -07:00
8398ebaa86 Revert D28664514: [pytorch][PR] various TensorIterator speed improvements
Test Plan: revert-hammer

Differential Revision:
D28664514 (8a28bbeeb9)

Original commit changeset: 2e03cf90b37a

fbshipit-source-id: fcad039b7dc823fec8afa694ab74a7ac5011f8ab
2021-05-26 10:49:58 -07:00
c06d2afa99 [caffe2] Add support for int32 lengths in BatchSparseToDense (#58062)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58062

Make templated function to make sure BatchSparseToDense supports int32 lengths/indices

Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:batch_sparse_to_dense_op_test
```

Reviewed By: khabinov

Differential Revision: D28271423

fbshipit-source-id: 41b88b7a3663616b533aaf4731ff35cdf6ec4c85
2021-05-26 10:33:32 -07:00
444e195b6d Use docker base for clang-lint in CI (#58964)
Summary:
This PR introduces a docker base image to speed up the `clang-tidy` dependencies stage. Originally I was looking into using the native GitHub Actions cache, but the dependencies are spread across many apt and pip installation steps, so consolidating them into a docker image works better. It shortens the deps installation time from 4min down to 1min by pulling from the docker base image.

Base image used: https://github.com/pytorch/test-infra/pull/15

```
FROM nvidia/cuda:10.2-devel-ubuntu18.04

RUN apt-get update && apt-get upgrade -y
RUN apt install -y software-properties-common wget
RUN wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
RUN apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-11 main"
RUN apt-add-repository ppa:git-core/ppa

RUN apt-get update && apt-get upgrade -y && apt-get install -y git python3-dev python3-pip build-essential cmake clang-tidy-11
RUN update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-11 1000
RUN pip3 install pyyaml typing_extensions dataclasses

```

Previous successful run of clang-tidy: https://github.com/pytorch/pytorch/runs/2671193875?check_suite_focus=true

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58964

Reviewed By: samestep

Differential Revision: D28712536

Pulled By: zhouzhuojie

fbshipit-source-id: 0c48a605efe8574c104da6a0cad1a8b7853ba35e
2021-05-26 10:15:24 -07:00
b8d56572a1 Open json config file in context manager (#58077)
Summary:
* Open the json config file safely using a context manager (a `with` block).
* This makes sure that the file is closed even if an exception is raised.
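
A minimal sketch of the pattern (the file name is just a placeholder):

```python
import json

# The "with" block guarantees the file handle is closed even if json.load raises.
with open("config.json") as f:
    config = json.load(f)
```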

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58077

Reviewed By: anjali411

Differential Revision: D28711177

Pulled By: H-Huang

fbshipit-source-id: 597ba578311b1f1d6706e487872db4e784c78c3c
2021-05-26 08:58:40 -07:00
8130f2f67a DOC Adds code comment for _ConvNd.reset_parameters (#58931)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55741 by adding a comment regarding the behavior of `kaiming_uniform_`

The docstring is correct in this case. For example:

```python
import math
import matplotlib.pyplot as plt

import torch
import torch.nn as nn

in_channels = 120
groups = 2
kernel = (3, 8)
m = nn.Conv2d(in_channels=in_channels, groups=groups,
              out_channels=100, kernel_size=kernel)

k = math.sqrt(groups / (in_channels * math.prod(kernel)))
print(f"k: {k:0.6f}")

print(f"min weight: {m.weight.min().item():0.6f}")
print(f"max weight: {m.weight.max().item():0.6f}")
```

outputs:
```
k: 0.026352
min weight: -0.026352
max weight: 0.026352
```

And when we plot the distribution, it is uniform with the correct bounds:

```python
_ = plt.hist(m.weight.detach().numpy().ravel())
```

![Unknown](https://user-images.githubusercontent.com/5402633/119552979-21ba3800-bd69-11eb-8e10-e067c943abe3.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58931

Reviewed By: anjali411

Differential Revision: D28689863

Pulled By: jbschlosser

fbshipit-source-id: 98eebf265dfdaceed91f1991fc4b1592c0b3cf37
2021-05-26 08:39:40 -07:00
950e67fa43 [quant][refactor tests] Move test_qat_module into test_quantize_eager_qat (#58928)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58928

Test Plan:
python test/test_quantization.py TestConvBNQATModule

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28683925

fbshipit-source-id: 59d240d521c8067a344c9bdf4bec94e82f52e76f
2021-05-26 07:49:59 -07:00
cc07825a21 [quant][refactor tests] Split test_quantize into test_quantize_eager_ptq, test_quantize_eager_qat and test_fusion (#58927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58927

Part of larger re-factor of quantization tests to make it clearer as to which test belongs where.

proposed folder structure
```
test/quantization
         - bc/
            - test_backward_compatibility.py
         - core/
            - test_quantized_kernels.py
            - test_quantized_workflow_ops.py
            - test_quantized_tensor.py
            - test_workflow_module.py
         - eager/
            - test_quantize_eager_ptq.py
            - test_quantize_eager_qat.py
            - test_fusion.py
         - equalization/
            - test_equalize_eager.py
            - test_bias_correction_eager.py
         - fx/
           - test_quantize_fx.py
         - jit/
            - test_quantize_jit.py
            - test_fusion_passes.py
         - numeric_suite/
            - test_numeric_suite_fx.py
            - test_numeric_suite_eager.py
```

Test Plan:
python test/test_quantization.py

Imported from OSS

Reviewed By: raghuramank100

Differential Revision: D28683926

fbshipit-source-id: f84a4271c77c418ce9751196241933ea8cc14913
2021-05-26 07:48:28 -07:00
28740869a1 Adds an aten::_ops namespace with unambiguous function names (#58092)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58092

Fixes #58044.

This PR:
- adds `ATEN_FN(op)` and `ATEN_FN2(op, overload)` macros that resolve to
an non-overloaded function in aten::_ops that calls the desired operator
(without default arguments).

The motivation for this is two-fold:
1) Using aten operators with templates is hard if the operator is
overloaded (e.g. add.Tensor and add.Scalar).
2) Method-only operators require special handling; pointers-to-method
are different from function pointers. `ATEN_FN2(add_, Tensor)` returns
a function instead of a method.

There is some interesting behavior for out= operations.
`ATEN_FN2(sin, "out")` gives a function that is *faithful* to the schema;
that is, the order of arguments is exactly what it looks like in the
schema. This makes it so that you can directly register
`ATEN_FN2(sin,"out")` (or a function wrapping it using the same signature)
as an override for a DispatchKey.

Test Plan:
- New tests that ATEN_FN2 works on function and method-only operators
- New test that ATEN_FN works
- New test that ATEN_FN macro returns a "faithful" function.

Codegen output:
Operators.h and Operators.cpp are both here:
https://gist.github.com/zou3519/c2c6a900410b571f0d7d127019ca5175

Reviewed By: mruberry

Differential Revision: D28643215

Pulled By: zou3519

fbshipit-source-id: 7b2b8459f1b2eb5ad01ee7b0d2bb77639f77940e
2021-05-26 07:29:15 -07:00
032d6b0643 Revert D28112689: CUDA support in the CSR layout: constructors
Test Plan: revert-hammer

Differential Revision:
D28112689 (1416e57465)

Original commit changeset: f825cd4bce40

fbshipit-source-id: 421fc590797ac5fab6a55ac6f213361fbba7cd5b
2021-05-26 06:15:05 -07:00
bbdc428db2 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28704311

fbshipit-source-id: f089266771c1ceba127116638a4dd87aa21e2e27
2021-05-26 03:19:49 -07:00
9ba9a16700 [PyTorch Edge] Use stream as backport_vi_to_vi-1 interface (#58790)
Summary:
Two main changes:
1. Change the argument of the collection of backport_v{i}_to_v{i-1} from (reader, writer) to (input_model_stream, output_model_stream), so it's easier to backport a model in option 2.

>  2) [Both format and content change] ]Use torch.jit.load() to load the stream,
 and save it to output_model_stream.

2. Fix an issue in the test `backportAllVersionCheck`. Previous it declares `std::ostringstream oss` and uses `oss.clear()` to reset the stringstream. However, the `clear()` function doesn't reset the stream content, and causes problematic stream. As a mitigation, checks are added to prevent corrupted stream for each iteration in while loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58790

ghstack-source-id: 129929960

Test Plan:
CI
```
buck test mode/dev //caffe2/test/cpp/jit:jit
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28620961

fbshipit-source-id: b0cbe0e88645ae278eb3999e2a84800702b5f985
2021-05-26 02:07:46 -07:00
1416e57465 CUDA support in the CSR layout: constructors (#57274)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57274

Test Plan: Imported from OSS

Reviewed By: astaff

Differential Revision: D28112689

Pulled By: bhosmer

fbshipit-source-id: f825cd4bce402dd4c3f71db88854f77830b687b8
2021-05-26 01:36:20 -07:00
be4ba29d49 Detect overflow in numel of sparse COO tensor (#57492)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57492

Reviewed By: albanD

Differential Revision: D28273649

Pulled By: mruberry

fbshipit-source-id: 08ba50509556df1981d7ede025d84a836d2e8e5e
2021-05-25 22:16:21 -07:00
948df6c7a9 [numpy] torch.i0: promote integer inputs to float (#52735)
Summary:
Reference : https://github.com/pytorch/pytorch/issues/42515
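
A small sketch of the new behavior (assuming the default float dtype is float32): integer inputs are now promoted instead of being rejected.

```python
import torch

x = torch.arange(4)                 # int64 input
print(torch.i0(x).dtype)            # torch.float32 (promoted)
print(torch.i0(x.double()).dtype)   # torch.float64 (unchanged)
```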

Pull Request resolved: https://github.com/pytorch/pytorch/pull/52735

Reviewed By: zou3519

Differential Revision: D28630505

Pulled By: mruberry

fbshipit-source-id: e81a35dfc1a322daf0c44718901470fac677bc94
2021-05-25 22:02:00 -07:00
49c2da0ee0 [testing] improve broadcasts_input error message (#58295)
Summary:
Context:
The error message when `broadcasts_input` is marked incorrectly is uninformative (see the current error below):
https://github.com/pytorch/pytorch/pull/57941#discussion_r631749435

Error Currently
```
Traceback (most recent call last):
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 326, in test_variant_consistency_eager
    _test_consistency_helper(samples, variants)
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 310, in _test_consistency_helper
    variant_forward = variant(cloned,
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 227, in __exit__
    self._raiseFailure("{} not raised".format(exc_name))
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 164, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: RuntimeError not raised
```

Error After PR
```
Traceback (most recent call last):
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 329, in test_variant_consistency_eager
    _test_consistency_helper(samples, variants)
  File "/home/kshiteej/Pytorch/pytorch_i0_promotion/test/test_ops.py", line 313, in _test_consistency_helper
    variant_forward = variant(cloned,
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 227, in __exit__
    self._raiseFailure("{} not raised".format(exc_name))
  File "/home/kshiteej/.conda/envs/pytorch-cuda-dev/lib/python3.8/unittest/case.py", line 164, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: RuntimeError not raised : inplace variant either allowed resizing or you have marked the sample SampleInput(input=Tensor, args=(tensor([[[ 2.1750, -8.5027, -3.1403, -6.9942,  3.2609],
         [-2.5057, -5.9123, -5.4633,  6.1203, -8.2124],
         [-3.5802, -8.4869, -6.0700,  2.3431, -8.1955],
         [-7.3316,  1.3248, -6.8661,  7.1483, -8.0719],
         [ 4.5977, -4.0448, -6.2044, -2.1314, -8.4956]],

        [[ 3.2769, -8.4360,  1.2826,  7.1749,  4.7653],
         [-0.2816, -2.5997, -4.7659, -3.7814,  3.9704],
         [-2.1778, -3.8117, -6.0276, -0.8423, -5.9646],
         [ 8.6544, -3.0922,  0.2558, -4.9318, -4.7596],
         [ 4.5583,  4.3830,  5.8793,  0.9713, -2.1481]],

        [[-1.0447,  0.9334,  7.6405, -4.8933, -7.4010],
         [ 7.7168, -8.4266, -5.5980, -6.9368,  7.1309],
         [-8.7720, -5.0890, -0.4975,  1.9518,  1.7074],
         [-8.5783,  8.5510, -8.5459, -3.5451,  8.4319],
         [ 8.5052, -8.9149, -6.6298, -1.2750, -5.7367]],

        [[-6.5625,  8.2795, -4.9311,  1.9501, -7.1777],
         [-8.4035,  1.1136, -7.6418, -7.0726, -2.8281],
         [ 4.2668, -0.2883, -6.2246,  2.3396,  1.2911],
         [ 4.6550, -1.9525,  4.4873, -3.8061, -0.8653],
         [-3.4256,  4.4423,  8.2937, -5.3456, -4.2624]],

        [[ 7.6128, -6.3932,  4.7131, -5.4938,  6.4792],
         [-6.5385,  2.4385,  4.5570,  3.7803, -8.3281],
         [-2.9785, -4.4745, -1.1778, -8.9324,  1.3663],
         [ 3.7437,  3.5171, -6.3135, -8.4519, -2.7033],
         [-5.0568, -8.4630, -4.2870, -3.7284, -1.5238]]], device='cuda:0',
       dtype=torch.float32, requires_grad=True),), broadcasts_input=True) incorrectly with `broadcasts_self=True
```

**NOTE**:
Printing the sample looks very verbose and it may be hard to figure out which sample is incorrectly configured if there are multiple samples with similar input shapes.

Two Options to make this error less verbose
* Don't print the sample and just print `inplace variant either allowed resizing or you have marked one of the sample incorrectly with broadcasts_self=True`
* Have some mechanism to name samples which will be printed in the `repr` (which will need extra machinery)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58295

Reviewed By: ngimel

Differential Revision: D28627308

Pulled By: mruberry

fbshipit-source-id: b3bdeacac3cf9c0d984f0b85410ecce474291d20
2021-05-25 21:14:17 -07:00
083d3bb93b [torch][repeat_interlaeve] Add to exception list in backward compat check (#58966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58966

Same as title.

Test Plan: CI since updated the check

Reviewed By: ngimel

Differential Revision: D28699577

fbshipit-source-id: 436fdc648a4c653081ff0e1b6b809c4af742055a
2021-05-25 20:04:50 -07:00
26c1f0f72e [skip ci] Skip debug info on PRs (#58897)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58897

We don't need to be building debug info on PRs since it's just filling up S3/CircleCI storage with useless 800 MB zips. This flips it so it's only run on master + release branches. See #58898 for CI signal.

Also see pytorch/builder counterpart (unlike the last debuginfo PR there is no hard dependency between these two so there won't be any churn on un-rebased PRs): https://github.com/pytorch/builder/pull/778

Test Plan: Imported from OSS

Reviewed By: seemethere, samestep

Differential Revision: D28689413

Pulled By: driazati

fbshipit-source-id: 77a37e84afe492215008d5e023ceab0c24adb33c
2021-05-25 17:01:51 -07:00
32273e806a Ensure NativeFunctions.h codegen output is deterministic (#58889)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58889

fixes https://github.com/pytorch/pytorch/issues/58796

Planning on re-testing locally tomorrow morning to confirm, but this change should fix the non-determinism in the codegen output that was causing `ccache` not to re-use its cached output.

I built from the commit referenced in https://github.com/pytorch/pytorch/issues/58796 a few times and ran `diff -Naur` on the codegen output in `build/aten/src/ATen`. After a few tries, `NativeFunctions.h` had a few diffs. The diffs were all related to the ordering of functional/inplace/out variants of a NativeFunctionGroup, which looked non-deterministic.

That looks like it's coming from my calling `set()` to filter out duplicate NativeFunction declarations. The earlier version of the codegen also called `set()` to filter out duplicates, but it did so individually for each `NativeFunction` object, before merging the groups (I'm not too sure why this didn't introduce non-determinism before, though). With the refactor from https://github.com/pytorch/pytorch/pull/57361, we're calling `set()` on the declarations from every operator for a given DispatchKey, which is probably what introduced the nondeterminism.
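
A self-contained illustration of the underlying pitfall (not the actual codegen code): iterating over a `set()` of strings is not guaranteed to give the same order across interpreter runs, while sorting the deduplicated items is stable.

```python
decls = ["func_out", "func_", "func", "func_"]

nondeterministic = list(set(decls))   # order can differ between interpreter runs
deterministic = sorted(set(decls))    # always ['func', 'func_', 'func_out']

print(nondeterministic)
print(deterministic)
```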

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D28675941

Pulled By: bdhirsh

fbshipit-source-id: bb66de00aafeeb9720d85e8156ac9f7539aed0d6
2021-05-25 16:33:03 -07:00
db5e5781ad replace all remaining occurrences of deadline=1000, to prevent test flakiness
Summary: Per title

Test Plan: Fixes existing tests

Reviewed By: robieta

Differential Revision: D28690296

fbshipit-source-id: d7b5b5065517373b75d501872814c89b24ec8cfc
2021-05-25 15:55:30 -07:00
60af6e928a [PyTorch Edge][Version] Fix torchscript model after backport (#58892)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58892

The torchscript model after backport is missing the `constants` archive. Add it back, and extend the unit test to run the torchscript part.
ghstack-source-id: 129853819

Test Plan:
```
buck test mode/dev //caffe2/test/cpp/jit:jit -- --exact 'caffe2/test/cpp/jit:jit
- LiteInterpreterTest.BackPortByteCodeModelAllVersions'
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28664507

fbshipit-source-id: 5f98723231cc64ed203c062ee6f00d8adbdccf77
2021-05-25 15:36:56 -07:00
fb120493b1 Make Scalar.to<> for invalid types a compile-time error (#58726)
Summary:
Currently calling `scalar.to<std::complex<double>>()` for example compiles but throws an error at runtime. Instead, marking the non-specialized cases as `= delete` means the code fails to compile and you catch the error sooner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58726

Reviewed By: zou3519, seemethere

Differential Revision: D28646057

Pulled By: ezyang

fbshipit-source-id: 9e4e3d1b4586eeecbb73db61bba56560b2657351
2021-05-25 15:34:01 -07:00
36a77580f5 [docs] Clarify batch_first behavior for nn.LSTM, nn.RNN, and nn.GRU (#58809)
Summary:
Fixes the high-pri doc component of https://github.com/pytorch/pytorch/issues/4145.

To make the input / output shapes more readable for both `batch_first` states, this PR also introduces short dim names. Opinions welcome on the readability of the restructured docs!
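
A quick sketch of the shape convention the restructured docs describe:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)        # (N=batch, L=seq, H_in=feature) with batch_first=True
out, (h, c) = lstm(x)
print(out.shape)                 # torch.Size([4, 10, 16]) -> (N, L, H_out)
print(h.shape)                   # torch.Size([1, 4, 16])  -> (num_layers, N, H_out)
```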

Screenshot for `nn.LSTM`:
<img width="791" alt="Screen Shot 2021-05-24 at 5 11 39 PM" src="https://user-images.githubusercontent.com/75754324/119408130-389e5300-bcb3-11eb-9a4f-1df96a0a4d70.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58809

Reviewed By: gchanan

Differential Revision: D28685415

Pulled By: jbschlosser

fbshipit-source-id: e8c92e3d7e052071a505b55dca976fd2ef5a8307
2021-05-25 15:27:17 -07:00
7179e7ea7b [CMake] Prefer third_party/pybind11 by default (#58951)
Summary:
To make the build behaviour aligned with other third_party/ libraries,
introduce the `USE_SYSTEM_PYBIND11 (d55b25a633)` build option, which is set to OFF by
default, meaning PyTorch will be built with the bundled pybind11 even if
another version is already installed locally.

Fixes https://github.com/pytorch/pytorch/issues/58750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58951

Reviewed By: driazati

Differential Revision: D28690411

Pulled By: malfet

fbshipit-source-id: e56b5a8f2a23ee1834b2a6d3807f287149decf8c
2021-05-25 15:10:17 -07:00
45aa54d83c relax test deadlines
Summary: Relax test deadlines for c2 tests. We run on loaded machines, and timings are unreliable.

Test Plan: Fixes existing tests

Reviewed By: mruberry

Differential Revision: D28690006

fbshipit-source-id: 457707e81a1ec92548c1f23ea7a0022fa0a3bfda
2021-05-25 15:02:52 -07:00
b4b95fc87a Expose cudaMemGetInfo (#58635)
Summary:
This PR resolves the second issue outlined in https://github.com/pytorch/pytorch/issues/58376, which has previously been discussed in https://github.com/pytorch/pytorch/issues/50722.

`cudaMemGetInfo` is bound/exposed to the Python API. An example function call is provided below:

```
device_free, device_total = torch.cuda.mem_get_info(torch.device('cuda:0'))
print(device_free, device_total)
```

In `CUDACachingAllocator.cpp`, in contrast to my initial PR, the newly defined function `std::pair<size_t, size_t> raw_cuda_mem_get_info(int device)` has been moved from the `CUDACaching` namespace to the `cuda` namespace. In addition, as suggested by ezyang, `det` has been removed from all function names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58635

Reviewed By: zou3519

Differential Revision: D28649093

Pulled By: ezyang

fbshipit-source-id: d8b7c53e52cf73f35495d8651863c5bb408d7a6a
2021-05-25 14:58:35 -07:00
133133afa8 [PyTorch] Extract non-template parts of torch::class_ (#54548)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54548

We don't need to inline most of this class; doing so bloats code size and build time.
ghstack-source-id: 129765666

Test Plan:
Existing CI

buildsizebot some mobile apps

Reviewed By: jamesr66a

Differential Revision: D27277317

fbshipit-source-id: 7643aa35e4d794fee0a48a3bbe0890c2e428ae78
2021-05-25 14:47:00 -07:00
ec89bf6535 .github: Ensure 7zip install for windows (#58924)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58924

We were observing behavior where 7zip was nowhere to be found after a build
completed. Let's just have 7zip installed within the workflow as
well, just to be completely sure it is there.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28681241

Pulled By: seemethere

fbshipit-source-id: f649c1713edcdeb82c84fd67866700caa2726d71
2021-05-25 13:58:35 -07:00
ede3f5421f [Pytorch Delegated Backend] Save function name in debug info (#57481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57481

This diff introduces the function name to InlinedCallStack.
Since we are using InlinedCallStack for debug information in the lite
interpreter as well as in delegate backends, where an InlinedCallStack cannot
be constructed from model source code, we need to save the function name.
In the absence of a function name, Function* is used to get the name of the
function; this is the case when the JIT compiles code at runtime.
When that is not possible, this diff introduces a way to obtain the function
name.

Test Plan:
test_backend
test_cs_debug_info_serialization

test_backend
test_cs_debug_info_serialization

Imported from OSS

Differential Revision:
D28159097
D28159097

Reviewed By: raziel, ZolotukhinM

Pulled By: kimishpatel

fbshipit-source-id: deacaea3325e27273f92ae96cf0cd0789bbd6e72
2021-05-25 13:19:02 -07:00
813adf1076 [Pytorch Delegated Backend] Save operator name and function name in (#57441)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57441

debug info

Previous diffs did not save the operator name in debug info. For delegated
backends that only identify an op for profiling via its debug handle, the operator
name should be stored as well.
Furthermore, to complete the debug information, also serialize the function name.

Test Plan:
Existing lite interpreter and backend tests

Existing lite interpreter and backend tests

Imported from OSS

Differential Revision:
D28144581
D28144581

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 415210f147530a53b444b07f1d6ee699a3570d99
2021-05-25 13:17:54 -07:00
a7a5992d7d Add no-grad inference mode note (#58513)
Summary:
Adds a note explaining the difference between several often conflated mechanisms in the autograd note
Also adds a link to this note from the docs in `grad_mode` and `nn.module`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58513

Reviewed By: gchanan

Differential Revision: D28651129

Pulled By: soulitzer

fbshipit-source-id: af9eb1749b641fc1b632815634eea36bf7979156
2021-05-25 13:06:54 -07:00
5268b5a29a Add parsing logic for Tuple[()] annotation (#58340)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58340

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28459502

Pulled By: ansley

fbshipit-source-id: 4bb188448d66269b42b068858b895debac86e9ee
2021-05-25 12:12:43 -07:00
b9d1ad9c78 OpInfo: diag_embed, diagonal (#58642)
Summary:
See: https://github.com/pytorch/pytorch/issues/54261.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58642

Reviewed By: ngimel

Differential Revision: D28627226

Pulled By: mruberry

fbshipit-source-id: b96fa8410bd53937ddb72a46c02b949691ee9458
2021-05-25 11:52:53 -07:00
f976275858 Run pthreadpool with _NoPThreadPoolGuard on the same thread (#58759)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58759

* Makes `pthreadpool()->run` respect `_NoPThreadPoolGuard`
   Runs tasks on the same thread instead of parallelizing when guard is present

Test Plan:
buck build //xplat/caffe2:aten_test_test_thread_pool_guard
./buck-out/last/aten_test_test_thread_pool_guard

Reviewed By: kimishpatel

Differential Revision: D28597425

fbshipit-source-id: 0365ad9947c239f5b37ce682802d4d401b8b0a48
2021-05-25 11:39:05 -07:00
b703f1e02d [NNC] Add documentation for splitWith APIs (#58270)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58270

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427226

Pulled By: navahgar

fbshipit-source-id: 39635e985095c7b581452464d7a515c6f86b24e8
2021-05-25 11:32:53 -07:00
dd7bbe1a63 [NNC] Make splitWithMask transform in-place (#58269)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58269

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427227

Pulled By: navahgar

fbshipit-source-id: 4e38a436abcf4752fd7ef6ab3666876eec6ea5ba
2021-05-25 11:32:51 -07:00
e2467cc43e [NNC] Make splitWithTail transform in-place (#58268)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58268

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D28427228

Pulled By: navahgar

fbshipit-source-id: 270b62c4e83739ad21dd68f375120e56881b394f
2021-05-25 11:31:14 -07:00
6b6a27e430 [jit] Add Python API for ScriptProfile (#57398)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57398

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28133577

fbshipit-source-id: dcb8338159a24b00b5c495ecec66a3303d9b4aba
2021-05-25 11:09:18 -07:00
c88333484f [resubmit] masked_scatter thrust->cub (#58865)
Summary:
See ae7760cf50bb2cddff4663a07b9d68decf4b6c75 for the fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58865

Reviewed By: mruberry

Differential Revision: D28657940

Pulled By: ngimel

fbshipit-source-id: 9155c710b0e18ebb3bfa2dabfdd117355ac30840
2021-05-25 11:00:50 -07:00
fedf6f2db2 Check memory overlap in sort for large input sizes (#58327)
Summary:
The downstream cub sort doesn't support inplace sorting; this PR adds a check to bail out to allocating a new tensor instead of silently corrupting the returned indices.

CC ngimel zasdfgbnm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58327

Reviewed By: mruberry

Differential Revision: D28661244

Pulled By: ngimel

fbshipit-source-id: 40617a7d3bfcebbe187bb706b6b753371bb99097
2021-05-25 10:57:31 -07:00
7eade660c6 [PyTorch] Reduce errors of foreach functions (#56993)
Summary:
This is based on  https://github.com/pytorch/pytorch/issues/48224.

To make `foreach` more flexible, this PR pushes unsupported cases to the slow path.
Also, this adds some tests to verify that
- `foreach` functions work with tensors of different dtypes and/or memory layouts in 7bd4b2c89f
- `foreach` functions work with tensors on different devices in a list, but are on the same device if the indices are the same: def4b9b5a1
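
A small sketch of the added flexibility, using the private `torch._foreach_add` entry point (assumed here for illustration): tensor lists with mixed dtypes now take the slow per-tensor path instead of erroring out.

```python
import torch

ts = [torch.ones(3, dtype=torch.float32), torch.ones(3, dtype=torch.int64)]
out = torch._foreach_add(ts, 1)           # dispatched to the slow path
print([t.dtype for t in out])             # [torch.float32, torch.int64]
```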

Future plans:
1. Improve the coverage of unittests using `ops` decorator & updating `foreach_unary_op_db` and creating `foreach_(binary|pointwise|minmax)_db`.
2. Support broadcasting in slow path. Ref:  https://github.com/pytorch/pytorch/pull/52448
3. Support type promotion in fast path. Ref https://github.com/pytorch/pytorch/pull/52449

CC: ngimel mcarilli  ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56993

Reviewed By: zou3519

Differential Revision: D28630580

Pulled By: ngimel

fbshipit-source-id: e26ee74a39a591025e18c1ead48948cb7ec53c19
2021-05-25 10:50:20 -07:00
8a28bbeeb9 various TensorIterator speed improvements (#58810)
Summary:
1) remove pushing back to strides vector for 1D tensors, those strides are never used in the loop anyway
2) avoid calling get_data_ptrs unless necessary
3) don't call into assert_no_partial_overlap if tensorImpls are the same (assert_no_partial_overlap has this comparison too, but after a couple of nested function calls)
4) is_non_overlapping_and_dense instead of is_contiguous in memory overlap (which, for some reason, is faster than is_contiguous, though I hoped after is_contiguous is non-virtualized, it should be the same).

Altogether, brings instruction count down from ~110K to 102735 for the following binary inplace benchmark:
```
In [2]:  timer = Timer("m1.add_(b);", setup="at::Tensor m1=torch::empty({1}); at::Tensor b = torch::empty({1});", language="c++", timer=timeit.default_timer)
   ...:  stats=timer.collect_callgrind(number=30, repeats=3)
   ...:  print(stats[1].as_standardized().stats(inclusive=False))
```
similar improvements for unary inplace.

Upd: returned stride packing for now, counts is now 104295, so packing is worth ~ 52 instructions, we should think about how to remove it  safely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58810

Reviewed By: bhosmer

Differential Revision: D28664514

Pulled By: ngimel

fbshipit-source-id: 2e03cf90b37a411d9994a7607402645f1d8f3c93
2021-05-25 10:44:51 -07:00
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375
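
A short sketch of the new activation, mish(x) = x * tanh(softplus(x)):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
print(F.mish(x))
print(torch.allclose(F.mish(x), x * torch.tanh(F.softplus(x))))  # True
```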

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
bf269fdc98 Re-enable torchdeploy oss tests and move to per-PR cuda11 job (#58872)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58872

Test Plan: verify tests running on CI as expected

Reviewed By: suo

Differential Revision: D28646660

fbshipit-source-id: eb7d784844fb7bc447b4232e2f1e479d4d5aa72f
2021-05-25 10:05:32 -07:00
19bcbfc5cf [c10d] Use pg wrapper in detailed debug mode (#58281)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58281

When TORCH_DISTRIBUTED_DEBUG=DETAIL is enabled, this PR causes process groups created by `new_group` and `init_process_group` that are nccl or gloo to be wrapped in `ProcessGroupWrapper`.

As a result, the user will get back a `ProcessGroupWrapper` that they can use in the exact same way as a regular nccl/gloo pg, but will be more helpful in terms of debugging desync/hangs.

Besides doing collective desync checks, which should be transparent if there are indeed no issues in the user application, there are no semantic differences in using the wrapper pg. Note that there is a performance implication here but that is a tradeoff we are making when DETAIL debug mode is enabled.

Open to suggestions on how to test better. Currently I verified locally that enabling TORCH_DISTRIBUTED_DEBUG=detail creates the wrapper and all tests still pass, but that doesn't run in CI. On the other hand testing everything with debug=detail and the regular tests might be too much, so we have only added it to a few tests for now. We also do have tests in the below diff.
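
A minimal single-process usage sketch (gloo, world size 1; assumes the env var is exported before the process starts, e.g. `TORCH_DISTRIBUTED_DEBUG=DETAIL python script.py`):

```python
import os
import torch
import torch.distributed as dist

# Rendezvous settings for the default env:// init method (placeholder values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group("gloo", rank=0, world_size=1)
t = torch.ones(4)
dist.all_reduce(t)   # with DETAIL debug mode, the wrapper checks the collective first
print(t)
dist.destroy_process_group()
```
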
ghstack-source-id: 129817857

Test Plan: ci

Reviewed By: SciPioneer

Differential Revision: D28402301

fbshipit-source-id: c4d3438320f6f0986e128c738c9d4a87bbb6eede
2021-05-25 09:55:52 -07:00
aad2ad883a Disable test_nccl_errors_blocking_abort (#58921)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58921

Reviewed By: ezyang

Differential Revision: D28680061

Pulled By: malfet

fbshipit-source-id: bab4a28f054ed26bcd6431576b60268ba4db8e6b
2021-05-25 09:50:24 -07:00
470160ad78 [Pytorch] Update FuseLinear to map source range information (#58492)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58492

Update graph rewrite to specify how values in replacement pattern should
map to values in original pattern for fuse_linear pass

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_quantization.py TestQuantizeJitPasses.test_fuse_linear

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28512464

fbshipit-source-id: 250a69cebc11eb4328a34c8f685b36e337439aae
2021-05-25 09:19:57 -07:00
e067675167 [Pytorch] Provide API to preserve source range and callstack information during graph rewrite (#58300)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58300

Current state: During graph rewriting that can fuse nodes or add nodes
result in new nodes without debug information that was available in
original node. Thus we lose this information during graph rewrite.

This PR changes graph rewriting API to let user specify how the values
in the replacement pattern map to values in the pattern to be matched.
Then the graph rewriting will copy source range and inlined callstack
from the matched nodes onto the nodes being inserted.

(Note: this ignores all push blocking failures!)

Test Plan:
python test/test_jit.py
TestJit.test_pattern_based_rewrite_with_source_range_preserved

Imported from OSS

Reviewed By: malfet

Differential Revision: D28512465

fbshipit-source-id: 863173c29de726be85b3acbd3ddf3257eea36d13
2021-05-25 09:18:59 -07:00
2ef9a1df22 Increase mimimum number of warmup runs to 2 (#58801)
Summary:
The JIT will typically need two warmup runs to do profiling and optimization.
This is not the perfect solution but it will substantially reduce the number of surprised people when the docs say torch.utils.benchmark.Timer takes care of warmup.
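
A minimal usage sketch (not from the PR): `Timer` performs its warmup internally before the timed iterations, so short JIT-compiled statements get profiled and optimized first.

```python
import torch
from torch.utils.benchmark import Timer

t = Timer(
    stmt="y = x * x",
    setup="import torch; x = torch.randn(1000)",
)
print(t.timeit(100))   # warmup runs happen before the timed iterations
```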

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58801

Reviewed By: desertfire

Differential Revision: D28644244

Pulled By: robieta

fbshipit-source-id: cc54ed019e882a379d6e4a0c6a01fd5873dd41c3
2021-05-25 08:38:52 -07:00
09a1b1cf87 Forward AD formulas batch 1 (#57768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57768

Note that this PR implements formulas only for ops that are supported by OpInfo.

Test Plan: Imported from OSS

Reviewed By: zou3519, malfet

Differential Revision: D28387766

Pulled By: albanD

fbshipit-source-id: b4ba1cf1ac1dfd46cdd889385c9c2d5df3cf7a71
2021-05-25 07:29:25 -07:00
b4f3a989da [torch][repeat_interleave] Fix ambigious function call (#58881)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58881

A recent PR added a new parameter to the function: https://github.com/pytorch/pytorch/pull/58417

However, this introduced ambiguity when making the call below:
  some_tensor.repeat_interleave(some_integer_value)

Making the new parameter optional avoids the issue.

Reviewed By: ezyang, ngimel

Differential Revision: D28653820

fbshipit-source-id: 5bc0b1f326f069ff505554b51e3b24d60e69c843
2021-05-25 00:31:32 -07:00
3dbfaddfa1 Port elu_backward to structured (#58660)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58660

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572528

Pulled By: ezyang

fbshipit-source-id: 12265c287f178f9435d5d96f3bba49082d9e7f2c
2021-05-25 00:14:13 -07:00
5850553bc0 Port hardsigmoid_backward to strucutred (#58484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58484

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572529

Pulled By: ezyang

fbshipit-source-id: aee125aa59a1f2b1ddb0c29a287097d866121379
2021-05-25 00:14:12 -07:00
3f0b7e0feb Port leaky_relu_backward to structured (#58483)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58483

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572526

Pulled By: ezyang

fbshipit-source-id: a73bdf06967687dbb1d4fbb0f2ca80115db57a07
2021-05-25 00:14:10 -07:00
ad27513430 Port softplus to structured (#58482)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58482

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28571059

Pulled By: ezyang

fbshipit-source-id: a1065294f3c459e7c99aaed9edb09f88705f58e9
2021-05-25 00:12:57 -07:00
0b8931fe4b [torch][JIT] Predicate uses of RPC APIs on torch.distributed.rpc.is_available() (#58887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58887

There are some callsites of `torch.distributed.rpc.XXX` APIs that are compiled
or not based on `USE_RPC`. However, `torch::deploy`, at least for now,
is compiled with `USE_RPC=1`, but the `torch.distributed.rpc.XXX` APIs used by
the aforementioned pieces of code are not available (i.e.
`torch.distributed.rpc.is_available()` returns `False`). This can cause
Torchscript compilation to fail, even if the code being compiled doesn't use
RPC.

This commit fixes this problem (at least temporarily) by predicating the use
of all these `torch.distributed.rpc` APIs on the value of
`torch.distributed.rpc.is_available()`.
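
A rough sketch of the predication pattern (illustrative only; the actual call sites and the specific guarded APIs differ, and `rref_type` is a hypothetical name):

```python
import torch.distributed.rpc as rpc

# Only reference torch.distributed.rpc APIs when the bindings are actually
# usable; under torch::deploy, is_available() returns False even though the
# code was built with USE_RPC=1.
if rpc.is_available():
    rref_type = rpc.RRef
else:
    rref_type = None
```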

Test Plan: Ran packaged XLM-R model with C++ benchmark.

Reviewed By: suo

Differential Revision: D28660925

fbshipit-source-id: fbff7c7ef9596549105e79f702987a53b04ba6f9
2021-05-24 21:53:53 -07:00
c502f49535 Fix failing torch deploy tests and reenable. (#58871)
Summary:
Fix is simple; alias inputs before feeding them to distinct
torchdeploy interpreters.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Fixes https://github.com/pytorch/pytorch/issues/58832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58871

Reviewed By: wconstab, zou3519

Differential Revision: D28646784

Pulled By: ezyang

fbshipit-source-id: 6d2850f3226b5b99468d1465723b421ce4d7ab89
2021-05-24 20:27:41 -07:00
cf395c0718 [c10d] Introduce ProcessGroupWrapper (#58224)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58224

Adds C++ implementation of ProcessGroupWrapper. It wraps
an underlying ProcessGroup and does debug checks before dispatching the
collective to the underlying pg. The design mostly follows https://github.com/pytorch/pytorch/issues/22071.

Concretely, on each collective, we:
1. Verify op type consistency. This can help catch mismatched ops in the user application (e.g. allreduce on one rank and allgather on another)
2. Verify tensor shapes. This can help catch bugs where the tensor inputs are malformed, whereas normally in NCCL this would just lead to a hang. The shapes verification for allgather/allreduce_coalesced is omitted because they actually accept different shape tensors and don't error out.

This is done through an abstraction called `CollectiveFingerPrint` which uses a helper process group to do the above verification. Concretely, we gather the data we need for each of the above checks into tensors, and allgather them, and verify their equivalence.

Once all of this passes we simply dispatch the collective to the underlying pg.
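
For intuition, a rough Python sketch of the fingerprint check (the real implementation is in C++ and uses a helper Gloo process group; the helper function below and its fixed-size encoding are illustrative assumptions):

```python
import torch
import torch.distributed as dist

def checked_allreduce(tensor, group=None, op_id=0):
    # Encode the collective type and the tensor shape into a fixed-size
    # fingerprint (assumes dim() <= 6 for this sketch), allgather it, and
    # verify every rank agrees before dispatching the real collective.
    fp = torch.zeros(8, dtype=torch.long)
    fp[0] = op_id
    fp[1] = tensor.dim()
    fp[2:2 + tensor.dim()] = torch.tensor(tensor.shape)
    gathered = [torch.empty_like(fp) for _ in range(dist.get_world_size(group))]
    dist.all_gather(gathered, fp, group=group)
    if any(not torch.equal(fp, other) for other in gathered):
        raise RuntimeError("Detected mismatched collective across ranks")
    dist.all_reduce(tensor, group=group)
```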

Added `ProcessGroupWrapperTest` in python to comprehensively test these changes.
ghstack-source-id: 129735687

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28023981

fbshipit-source-id: 1defc203c5efa72ca0476ade0d1d8d05aacd4e64
2021-05-24 20:09:51 -07:00
c00eefb6c7 [Static Runtime] Clean up and fix bugs in Static Runtime (#58829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829

- Delete copying and moving of MemoryPlanner.
- Remove `inline` in some of the member functions because member functions implemented in classes are inline by default.
- Clean up and update comments.
- Reorganize some code.

Reviewed By: edvgha

Differential Revision: D28555476

fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
2021-05-24 19:46:58 -07:00
de845020a0 fix docstring for fusing functions (#58638)
Summary:
This PR fixes docstrings of fusing functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58638

Reviewed By: H-Huang

Differential Revision: D28584501

Pulled By: jerryzh168

fbshipit-source-id: 77a53a709d968df8ba8f5b613ad7cf225ba2826a
2021-05-24 18:27:22 -07:00
2b0ec9c3cf Reapply "[jit] Implement ScriptProfile to collect instruction profiles." (#58783)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58783

This reverts commit fc804b5def5e7d7ecad24c4d1ca4ac575e588ae8.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28617037

Pulled By: zhxchen17

fbshipit-source-id: 645de2ede20500a5c218d6ec3c7faae94de37a14
2021-05-24 18:23:21 -07:00
705dd9ffac [PyTorch] Migrate remaining stray uses of TI:add_output to borrowing (#58605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58605

Found a few more by grepping.
ghstack-source-id: 129730281

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D28523254

fbshipit-source-id: 317baea88885586c5106c8335ebde0d8802a3532
2021-05-24 17:34:54 -07:00
12bb1e86ed Make c10::ThreadPool::available_ atomic. (#58457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58457

This variable had concurrent read/write access without any
synchronization. The issue was caught and reported by TSAN.
ghstack-source-id: 129311384

Test Plan:
1) Verify test locally.
2) waitforbuildbot.

Reviewed By: ezyang

Differential Revision: D28498116

fbshipit-source-id: 89af068467fed64c131d743504c0cecf3017d638
2021-05-24 17:29:44 -07:00
a5250425e0 [quant] Eager mode equalization support for ConvReLU and LinearReLU (#58792)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58792

Enabling support for fused modules like ConvReLU or LinearReLU on eager mode cross-layer equalization.

Test Plan:
`python test/test_quantization.py TestEqualizeEager`

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28647242

fbshipit-source-id: 286e057ce70aa7de45d575afd6c13e55120ff18a
2021-05-24 17:25:13 -07:00
b593dd2027 [Gradient Compression] Re-enable test_ddp_hook_parity_powerSGD on Gloo backend (#58882)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58882

Re-enable this test since https://github.com/facebookincubator/gloo/pull/309 is already picked up by Gloo submodule.
ghstack-source-id: 129760436

Test Plan: waitforbuildbot

Reviewed By: agolynski

Differential Revision: D28654433

fbshipit-source-id: dfc002936e88c074be529d6024f889214130b1b9
2021-05-24 16:52:54 -07:00
a566005679 [skip ci] Update readme to use hud.pytorch.org (#58835)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58835

Pulled By: driazati

Reviewed By: seemethere

Differential Revision: D28632504

fbshipit-source-id: 867f061be039bc63c1478b1b1eed8c0380e94faa
2021-05-24 15:02:18 -07:00
f29e75c4dc [reland][quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer (#58455) (#58756)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58756

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: supriyar

Differential Revision: D28607564

fbshipit-source-id: 979cf165941bb3a9044d03077a170b5ea64dc36a
2021-05-24 14:57:45 -07:00
76f03bc42f Fix torch.finfo.bits typo in stub (#58819)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58818.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58819

Reviewed By: walterddr, malfet

Differential Revision: D28641587

Pulled By: ezyang

fbshipit-source-id: b19b519db43f2075c64f4f9ba922310f2561ca70
2021-05-24 14:52:49 -07:00
bc2ee078d1 Update Gloo submodule (#58853)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58853

Reviewed By: pbelevich, SciPioneer

Differential Revision: D28642642

Pulled By: agolynski

fbshipit-source-id: 8c31f9ab86c5f3063733199474022e7e2c6e9a2f
2021-05-24 14:23:41 -07:00
51b7224f8f [vulkan] Add max_pool2d op (#58806)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58806

Adds the max_pool2d op to Vulkan.

Test Plan:
```
cd ~/fbsource
buck build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 //xplat/caffe2:pt_vulkan_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_api_test
adb shell "/data/local/tmp/vulkan_api_test"
cd -
```

Reviewed By: IvanKobzarev

Differential Revision: D28625049

fbshipit-source-id: 75c82a84f0eca51627e33a6182ef51cb7e82e068
2021-05-24 14:16:19 -07:00
a679bb5ecf Refactor local lint (#58798)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58798

In #58623 there was a bug in `make quicklint` where ShellCheck would run on the entire repo when there were no files. This PR fixes that by refactoring out common stuff (like skipping quicklint when there are no files and letting checks do their own file filtering) and pushing the logic into a runner class.

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28649889

Pulled By: driazati

fbshipit-source-id: b19f32cdb63396c806cb689b2f6daf97e1724d44
2021-05-24 13:52:53 -07:00
a7f4f80903 ENH Adds dtype to nn.functional.one_hot (#58090)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/33046
Related to https://github.com/pytorch/pytorch/issues/53785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58090

Reviewed By: zou3519

Differential Revision: D28640893

Pulled By: jbschlosser

fbshipit-source-id: 3686579517ccc75beaa74f0f6d167f5e40a83fd2
2021-05-24 13:48:25 -07:00
e4be80c1b8 simplify cpu_kernel to not have contiguous special case (#58830)
Summary:
Per title
`unroll_contiguous_scalar_checks` tries to verify that all arguments (including outputs) are contiguous except for at most one scalar (with stride 0). It then calls the passed lambda with the index of the scalar arg if this verification succeeded, or with 0 if the args were not contiguous or there was no scalar. Depending on the value of this index (0 = not found), a different function can be called; in vectorized kernels it is the vectorized loop if the args are contiguous plus a scalar, and the basic loop otherwise. This makes sense for the vectorized kernel (the vectorized loop can still be used in some broadcasted cases), but all the other kernels (cpu_kernel, serial_cpu_kernel, cpu_kernel_multiple_outputs) don't even use the idx argument in their lambdas, so regardless of what `unroll_contiguous_scalar_checks` does they'll do the same thing. There is no point in calling `unroll_contiguous_scalar_checks` for them then.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58830

Reviewed By: zou3519, mruberry

Differential Revision: D28632668

Pulled By: ngimel

fbshipit-source-id: c6db3675933184e17cc249351c4f170b45d28865
2021-05-24 12:07:29 -07:00
1c5f63d86d [Pytorch Edge] Model Ops compatibility api (#57501)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57501

Add an API, `_get_model_ops_and_info`, to get the root operators and versioning info of a model in both C++ and Python; the input can be a file path or a buffer.
ghstack-source-id: 129620112

Test Plan: unit test.

Reviewed By: xcheng16, raziel

Differential Revision: D28162765

fbshipit-source-id: 4413c1e906b8a872e4a717d849da37347adbbea4
2021-05-24 12:00:06 -07:00
2a456e4f49 [skip ci] Move debug wheels out of package dir before test (#58685)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58685

This moves debug packages out of the artifacts dir before running tests (as a counterpart to https://github.com/pytorch/builder/pull/770). Doing it this way allows us to keep the CI configs simple since there's one directory to use for artifacts / upload to S3.

See #58684 for actual CI signals (the ones on this PR are all cancelled since it depends on the builder branch set in the next PR up the stack)

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28646995

Pulled By: driazati

fbshipit-source-id: 965265861968906770a6e6eeecfe7c9458631b5a
2021-05-24 11:46:37 -07:00
2733555ed1 replace all_gather with more efficient collective api _all_gather_base (#57769)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57769

_all_gather_base saved copies in all_gather, so it is more efficient
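
A minimal sketch contrasting the two (illustrative; `_all_gather_base` is a private API whose exact signature may vary across versions):

```python
import torch
import torch.distributed as dist

def gather_flat(inp):
    world_size = dist.get_world_size()

    # all_gather fills a list of per-rank tensors, which costs extra copies.
    out_list = [torch.empty_like(inp) for _ in range(world_size)]
    dist.all_gather(out_list, inp)

    # _all_gather_base gathers directly into a single flat output tensor.
    flat = torch.empty(world_size * inp.numel(), dtype=inp.dtype, device=inp.device)
    dist._all_gather_base(flat, inp)
    return flat
```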

Test Plan: unit test

Reviewed By: SciPioneer

Differential Revision: D28227193

fbshipit-source-id: ddd8590095a5b45676497a71ed792a457f9825c6
2021-05-24 11:34:45 -07:00
c58709b7bb Helper function for skipping module parameter / buffer initialization (#57555)
Summary:
This PR introduces a helper function named `torch.nn.utils.skip_init()` that accepts a module class object + `args` / `kwargs` and instantiates the module while skipping initialization of parameter / buffer values. See discussion at https://github.com/pytorch/pytorch/issues/29523 for more context. Example usage:

```python
import torch

m = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1)
print(m.weight)

m2 = torch.nn.utils.skip_init(torch.nn.Linear, 5, 1, device='cuda')
print(m2.weight)

m3 = torch.nn.utils.skip_init(torch.nn.Linear, in_features=5, out_features=1)
print(m3.weight)
```
```
Parameter containing:
tensor([[-3.3011e+28,  4.5915e-41, -3.3009e+28,  4.5915e-41,  0.0000e+00]],
       requires_grad=True)
Parameter containing:
tensor([[-2.5339e+27,  4.5915e-41, -2.5367e+27,  4.5915e-41,  0.0000e+00]],
       device='cuda:0', requires_grad=True)
Parameter containing:
tensor([[1.4013e-45, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00]],
       requires_grad=True)
```

Bikeshedding on the name / namespace is welcome, as well as comments on the design itself - just wanted to get something out there for discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57555

Reviewed By: zou3519

Differential Revision: D28640613

Pulled By: jbschlosser

fbshipit-source-id: 5654f2e5af5530425ab7a9e357b6ba0d807e967f
2021-05-24 11:28:32 -07:00
277f587496 rename benchmark_cpp_extension (#58708)
Summary:
Currently the cpp_extension build in benchmarks is misleading, as it has the same name as torch.utils.cpp_extension.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58708

Test Plan:
Run from `./benchmarks/operator_benchmark/pt_extension` folder:
```
python setup.py install
python cpp_extension_test.py
```

Note: CI doesn't matter, as the benchmarks/ folder is currently not compiled/tested against CI.

Reviewed By: robieta

Differential Revision: D28585582

Pulled By: walterddr

fbshipit-source-id: fc071040cf3cb52ee6c9252b2c5a0c3043393f57
2021-05-24 11:04:02 -07:00
a083933d2a .github: Add windows.8xlarge.nvidia.gpu (#58781)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58781

Adds windows GPU workers to our GHA self hosted infra

[skip ci]

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: walterddr

Differential Revision: D28645532

Pulled By: seemethere

fbshipit-source-id: b00d0caef727c597ee15d19c76bda384231f42c9
2021-05-24 10:40:46 -07:00
8ae4d07dac .circleci: Disable windows CPU builds for CircleCI (#58855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58855

We have successfully migrated windows CPU builds to Github Actions so
let's go ahead and disable them in CircleCI

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D28642875

Pulled By: seemethere

fbshipit-source-id: 8ffe9338e58952531a70002891a19ea33363d958
2021-05-24 10:28:41 -07:00
1fca1545d4 fixing csr addmm bug (#58768)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58768

Fixes gh-58757

This PR fixes the CPU version of the addmm op. Just for context, before this PR only CSR @ vector was supported. I found a minor bug in `addmm_out_sparse_csr_dense_cpu` in the non-MKL code, which is solved in this PR.

Moreover, I discovered a limitation in the current MKL implementation: it only works well (acceptable tolerance for the output error) with square matrices. I looked deeply into this issue and found that it could be a limitation of the MKL API.

I used this [gist code](https://gist.github.com/aocsa/0606e833cd16a8bfb7d37a5fbb3a5b14) based on [this](https://github.com/baidu-research/DeepBench/blob/master/code/intel/spmm/spmm_bench.cpp) to test this behavior.

As you can see, the output error (last column) is acceptable when the matrices are square and not acceptable when they are not square. I reported the issue here: https://github.com/pytorch/pytorch/issues/58770

Looking forward to your comments.
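
A minimal sketch of the path exercised here, CSR matrix times a dense matrix on CPU via addmm (the values are arbitrary):

```python
import torch

# 2x2 CSR matrix [[1, 2], [3, 4]]
a = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 4]),        # crow_indices
    torch.tensor([0, 1, 0, 1]),     # col_indices
    torch.tensor([1., 2., 3., 4.]),
    size=(2, 2),
)
b = torch.randn(2, 3)
c = torch.randn(2, 3)

# Before this PR only CSR @ vector was supported on CPU.
out = torch.addmm(c, a, b)  # out = c + a @ b
```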

Test Plan: Imported from OSS

Reviewed By: zou3519

Differential Revision: D28629563

Pulled By: malfet

fbshipit-source-id: 5ee00ae667336e0d9301e5117057213f472cbc86
2021-05-24 09:54:07 -07:00
2dda8d7571 Move cublas dependency after CuDNN (#58287)
Summary:
Library linking order matters during static linking.
Not sure whether it's a bug or a feature, but if cublas is referenced
before CuDNN, it will be partially statically linked into the library,
even if it is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58287

Reviewed By: janeyx99

Differential Revision: D28433165

Pulled By: malfet

fbshipit-source-id: 8dffa0533075126dc383428f838f7d048074205c
2021-05-24 09:39:09 -07:00
bb4770462f .github: Enable Windows workflow for pull_request (#58418)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58418

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28483418

Pulled By: seemethere

fbshipit-source-id: c9f5a4df5a308e0ac6fc6fdc1a26d04723ffded7
2021-05-24 09:34:47 -07:00
007fe949aa Adding a new include directory in BLIS search path (#58166)
Summary:
While trying to build PyTorch with BLIS as the backend library,
we found a build issue due to some missing include files.
This was caused by a missing directory in the search path.
This patch adds that path in FindBLIS.cmake.

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58166

Reviewed By: zou3519

Differential Revision: D28640460

Pulled By: malfet

fbshipit-source-id: d0cd3a680718a0a45788c46a502871b88fbadd52
2021-05-24 08:57:02 -07:00
0e16087064 [DataLoader] Fix bugs for typing (#58450)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58450

Test Plan: Imported from OSS

Reviewed By: VitalyFedyunin

Differential Revision: D28507877

Pulled By: ejguan

fbshipit-source-id: f4051ff51ce77ef45214f11cba10c8a7e1da4dad
2021-05-24 07:14:40 -07:00
5c7dace309 Automated submodule update: FBGEMM (#58161)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: 4b8aaad426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58161

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: jspark1105

Differential Revision: D28385619

fbshipit-source-id: ace938b1e43760b4bedd596ebbd355168a8706b7
2021-05-23 23:33:19 -07:00
74c12da451 add deterministic path for scatter_add_cuda for 1D tensors (#58761)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58761

Previously we implemented a deterministic path for gather_backward in https://github.com/pytorch/pytorch/pull/55573, which replaced the non-deterministic scatter_add_cuda.

It's better to move it inside scatter_add so scatter_add can benefit from the deterministic path.
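
A small sketch of the behavior this enables (illustrative values; requires a CUDA device):

```python
import torch

torch.use_deterministic_algorithms(True)

src = torch.randn(6, device="cuda")
index = torch.tensor([0, 1, 0, 1, 2, 2], device="cuda")
out = torch.zeros(3, device="cuda")

# With deterministic algorithms enabled, 1D scatter_add_ on CUDA now takes
# the deterministic path instead of tripping the nondeterministic alert.
out.scatter_add_(0, index, src)
```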

Test Plan:
buck test mode/opt //caffe2/test:torch_cuda -- test_scatter_add_one_dim_deterministic

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.063)
    ✓ Pass: caffe2/test:torch_cuda - test_scatter_add_one_dim_deterministic_cuda (test_torch.TestTorchDeviceTypeCUDA) (30.909)
    ✓ Pass: caffe2/test:torch_cuda - main (30.909)
Summary
  Pass: 2
  ListingSuccess: 1

buck test mode/opt //caffe2/test:torch_cuda -- test_gather_backward

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (4.613)
    ✓ Pass: caffe2/test:torch_cuda - test_gather_backward_deterministic_path_cuda (test_torch.TestTorchDeviceTypeCUDA) (25.369)

buck test mode/opt //caffe2/test:torch_cuda -- test_nondeterministic_alert

    ✓ ListingSuccess: caffe2/test:torch_cuda - main (5.356)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_CTCLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_accumulate_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_scatter_add_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_NLLLoss_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_put_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_median_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_gather_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_bincount_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_histc_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad1d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_bicubic_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_grid_sample_3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_MaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveAvgPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_EmbeddingBag_max_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_trilinear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_AdaptiveMaxPool2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReflectionPad2d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_FractionalMaxPool3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_kthvalue_cuda_float64 (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_interpolate_linear_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - test_nondeterministic_alert_ReplicationPad3d_cuda (test_torch.TestTorchDeviceTypeCUDA) (28.146)
    ✓ Pass: caffe2/test:torch_cuda - main (28.146)
Summary
  Pass: 30
  ListingSuccess: 1

Reviewed By: ngimel

Differential Revision: D28585659

fbshipit-source-id: 1ad003d4130501ceff5f6a7a870ca3dbc9a3f1f2
2021-05-23 21:36:02 -07:00
50ded095e4 [deploy] temporarily disable deploy tests (#58832)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58832

While we investigate breakage.

Differential Revision:
D28631469
D28631469

Test Plan: Imported from OSS

Reviewed By: SplitInfinity

Pulled By: suo

fbshipit-source-id: 43d51c1c9d81e951074824ccf624e42f6bec4242
2021-05-23 19:26:06 -07:00
a7fdd487e5 Port kthvalue tests to OpInfo (#58654)
Summary:
Tracking issue https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58654

Reviewed By: ngimel

Differential Revision: D28627207

Pulled By: mruberry

fbshipit-source-id: f662f178ab87a9d461f1e0c91b02942c64125e73
2021-05-23 16:44:16 -07:00
4709fdb117 Add GenericShardingSpec for generic tensor sharding. (#57409)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57409

Full design: https://github.com/pytorch/pytorch/issues/55207

In https://github.com/pytorch/pytorch/issues/55207, we proposed
`MeshShardingSpec` as a generic sharding mechanism. However, that proposal does
not provide the flexibility to specify shards which have uneven
sizes/partitions and assumes even partitioning. Uneven partitioning is one of
the requirements of an internal use case.

As a result, instead of that we introduce a `GenericShardingSpec`, which allows
specifying an arbitrary partitioning of a multi-dimensional tensor. Basically,
it specifies the start offsets of each shard and the length of each dim of the
shard, allowing for greater flexibility.
ghstack-source-id: 129604155

Test Plan:
1) unit tests
2) waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D28137616

fbshipit-source-id: 61255762485fb8fa3ec3a43c27bbb222ca25abff
2021-05-23 16:06:05 -07:00
0d6fa1adc5 Introduce ChunkShardingSpec as a model sharding specification. (#55728)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55728

Full design: https://github.com/pytorch/pytorch/issues/55207

This PR introduces ChunkShardingSpec (SingleShardingSpec in the design). We used
the name ChunkShardingSpec since it is very similar to `torch.chunk` in terms
of how a Tensor is split up, and it feels clearer than SingleShardingSpec.
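
For intuition only, the splitting behavior the name refers to, shown with plain `torch.chunk` (not the new spec itself):

```python
import torch

x = torch.arange(10)
# chunk splits along a dim into (nearly) equal pieces; the last chunk may be
# smaller. This is the partitioning scheme ChunkShardingSpec mirrors.
print(torch.chunk(x, 4))
# (tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9]))
```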
ghstack-source-id: 129603318

Test Plan: waitforbuildbot

Reviewed By: SciPioneer

Differential Revision: D27694108

fbshipit-source-id: c8764abe6a4d5fc56d023fda29b74b5af2a73b49
2021-05-23 16:04:57 -07:00
c5a1f04367 Enabled BFloat16 support for cumsum, logcumsumexp, cumprod, cummin & cummax on CUDA (#57904)
Summary:
Enabled BFloat16 support for `cumsum`, `logcumsumexp`, `cumprod`, `cummin` & `cummax` on CUDA
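
A quick sketch of what now works (requires a CUDA device):

```python
import torch

x = torch.randn(5, device="cuda").bfloat16()
# These cumulative ops now accept bfloat16 inputs on CUDA.
print(x.cumsum(0))
print(x.cumprod(0))
print(x.cummax(0).values)
print(torch.logcumsumexp(x, 0))
```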

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57904

Reviewed By: ailzhang

Differential Revision: D28558722

Pulled By: ngimel

fbshipit-source-id: 2a8e49c271e968f841d24534b6cc7be162d3a5aa
2021-05-23 15:51:23 -07:00
ee3ea31f12 OpInfo: split, split_with_sizes (#58184)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58184

Reviewed By: ngimel

Differential Revision: D28627271

Pulled By: mruberry

fbshipit-source-id: e6c0d2b005904ddebc9dab76685403530a6f6519
2021-05-23 15:47:35 -07:00
52a8031e8c [ROCm] disable test test_Conv2d_groups_nobias_v2 for ROCm (#58701)
Summary:
Disable test_Conv2d_groups_nobias_v2 test because it is failing on ROCm 4.2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58701

Reviewed By: ngimel

Differential Revision: D28626651

Pulled By: mruberry

fbshipit-source-id: a74bdf45335ae2afee0aa5e3bece6e208e75a63f
2021-05-23 15:43:36 -07:00
fa0b89bbf7 Change list striding kernel implementation to handle optional integers (#58536)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58536

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28531720

Pulled By: tugsbayasgalan

fbshipit-source-id: c06a8933aa9b4aa562ea65ac2558353b05d0f624
2021-05-23 12:34:22 -07:00
28840b9a44 [Gradient Compression] Disable test_ddp_hook_parity_powerSGD on Gloo backend (#58802)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58802

This test can only be re-enabled once https://github.com/facebookincubator/gloo/pull/309 is picked up by Gloo submodule.
ghstack-source-id: 129661729

Test Plan: unit test.

Reviewed By: rohan-varma

Differential Revision: D28623214

fbshipit-source-id: 0249ae816469c3e8cabd08db415821091a064d58
2021-05-22 23:41:27 -07:00
4ca4640bae [torch][repeat_interleave] remove stream synchronization if output size is given (#58417)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58417

Same as title.
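
A minimal sketch of the code path this affects (illustrative values; requires a CUDA device):

```python
import torch

x = torch.arange(3, device="cuda")
repeats = torch.tensor([1, 2, 3], device="cuda")

# Passing output_size lets the CUDA kernel skip the device-to-host sync that
# would otherwise be needed to compute the output length from `repeats`.
y = torch.repeat_interleave(x, repeats, output_size=6)
```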

Test Plan:
Rely on CI signal.

Update unit test to exercise new code path as well.

Reviewed By: ngimel

Differential Revision: D28482927

fbshipit-source-id: 3ec8682810ed5c8547b1e8d3869924480ce63dcd
2021-05-22 20:53:28 -07:00
c1c9be16c4 port mm to structure kernel (#57755)
Summary:
relate to https://github.com/pytorch/pytorch/issues/57417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57755

Reviewed By: ezyang

Differential Revision: D28426111

Pulled By: walterddr

fbshipit-source-id: 943d3e36433ca846990b940177fb040553961156
2021-05-22 19:24:14 -07:00
f9e8dc005a OpInfo: clone, contiguous (#58390)
Summary:
Reference: https://github.com/pytorch/pytorch/issues/54261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58390

Reviewed By: soulitzer

Differential Revision: D28567821

Pulled By: mruberry

fbshipit-source-id: bcf42cb4a9a57d8a15a76819b8a9e2df97cf00be
2021-05-22 18:25:31 -07:00
a70020465b adding test_sparse_csr to run_test (#58666)
Summary:
fixes https://github.com/pytorch/pytorch/issues/58632.

Added several skips that relate to test asserts and MKL. Will address them in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58666

Reviewed By: seemethere, janeyx99

Differential Revision: D28607966

Pulled By: walterddr

fbshipit-source-id: 066d4afce2672e4026334528233e69f68da04965
2021-05-22 13:17:46 -07:00
22776f0857 [PyTorch] Remove device check from a few indexing methods (#58800)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58800

These methods leverage TensorIterator, which will handle
(or skip) the device check.
ghstack-source-id: 129654358

Test Plan: CI && sandcastle

Reviewed By: ngimel

Differential Revision: D28622626

fbshipit-source-id: 6153299780d4f7bf286423520ba4cb60b554335e
2021-05-22 13:13:39 -07:00
796c97a88f [Pytorch Delegated Backend] Add python binding for (#57156)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57156

generate_debug_handles

To be able to generate debug handles for preprocess written in Python.

Test Plan:
CI

Imported from OSS

Differential Revision:
D28062328
D28062328

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 8795d089edc00a292a2221cfe80bbc671468055c
2021-05-22 08:34:19 -07:00
d6d726f781 [Pytorch Backend delegation] Add api for backend lowering to query debug (#55462)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55462

handles and symbolicate exception callstack thrown from backend.

The objective of this diff is to improve error reporting when
exceptions are raised from a lowered backend. We would effectively like to
get the same model-level stack trace that you would get without having
lowered some module to a backend.

For example:
```
class AA(nn.Module):
  def forward(self, x, y):
    return x + y

class A(nn.Module):
  def __init__(...):
    self.AA0 = AA()
  def forward(self, x, y):
    return self.AA0.forward(x, y) + 3

class B(nn.Module):
  def forward(self, x):
    return x + 2

class C(nn.Module):
  def __init__(...):
    self.A0 = A()
    self.B0 = B()
  def forward(self, x, y):
    return self.A0.forward(x, y) + self.B0.forward(x)
```
If we then do C().forward(torch.rand((2,3)), torch.rand((14,2))) we
will likely see an error stack like:
```
C++ exception with description "The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<string>", line 3, in forward

    def forward(self, x, y):
      return self.A0.forward(x, y) + self.B0.forward(x)
             ~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in forward

    def forward(self, x, y):
      return self.AA0.forward(x, y) + 3
             ~~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in forward

    def forward(self, x, y):
      return x + y
             ~~~~~ <--- HERE
```

We would like to see the same error stack if we lowered C.A0 to some
backend.

With this diff we get something like:
```
  Module hierarchy:top(C).A0(backend_with_compiler_demoLoweredModule).AA0(AA)
Traceback of TorchScript (most recent call last):
  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return self.A0.forward(x, y) + self.B0.forward(x)
             ~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 5, in FunctionName_UNKNOWN
                typed_inputs: List[Any] = [x, y, ]
                if self.__backend.is_available() :
                  _0, = self.__backend.execute(self.__handles["forward"], typed_inputs)
                        ~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
                  assert isinstance(_0, Tensor)
                  return _0
  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return self.AA0.forward(x, y) + 3
             ~~~~~~~~~~~~~~~~ <--- HERE

  File "<string>", line 3, in FunctionName_UNKNOWN

    def forward(self, x, y):
      return x + y
             ~~~~~ <--- HERE
```
This is achieved in 3 parts:
Part 1:
A. BackendDebugInfoRecorder:
   Created during backend lowering, in `to_backend`, before calling the preprocess
   function corresponding to the backend. This will facilitate recording of
   debug info (such as source range + inlined callstack) for the lowered module.
B. Instantiate WithBackendDebugInfoRecorder with BackendDebugInfoRecorder.
   This initializes thread local pointer to BackendDebugInfoRecorder.
C. generate_debug_handles:
   In preprocess function, the backend will call generate_debug_handles
   for each method being lowered separately. generate_debug_handles
   takes `Graph` of the method being lowered and returns a map
   of Node*-to-debug_handles. Backend is responsible for storing debug
   handles appropriately so as to raise exception (and later profiling)
   using debug handles when the exception being raised corresponds to
   particular Node that was lowered.
   Inside generate_debug_handles, we will query the current
   BackendDebugHandleInfoRecorder, that is issuing debug handles. This debug
   handle manager will issue debug handles as well as record
   debug_handles-to-<source range, inlined callstack> map.
D. Back in `to_backend`, once the preprocess function has finished
   lowering the module, we will call `stopRecord` on
   BackendDebugInfoRecorder. This will return the debug info map. This
   debug info is then stored inside the lowered module.

Part 2:
Serialization:
During serialization for bytecode (lite interpreter), we will do two
things:
1. Extract all the source ranges that are contained inside the
debug_handles-to-<source range, inlined callstack> map for the lowered
module. These are the source ranges corresponding to the debug handles,
including those in the inlined callstack. Since we replaced the original
module with the lowered module, we won't be serializing code for the original
module, and thus there are no source ranges for it. That is why the source
ranges have to be stored separately. We lump all the source ranges for all the
lowered modules into one single debug_pkl file.
2. Then we will serialize debug_handles-to-<source range, inlined
callstack> map.

Now during deserialization we will be able to reconstruct
debug_handles-to-<source range, inlined callstack> map. Given all
debug_handles are unique we would not need any module information.

Test Plan:
Tests are added in test_backend.cpp

Imported from OSS

Differential Revision:
D27621330
D27621330

Reviewed By: raziel

Pulled By: kimishpatel

fbshipit-source-id: 0650ec68cda0df0a945864658cab226a97ba1890
2021-05-22 08:33:07 -07:00
e7c35a3363 Revert D28617214: [Gradient Compression] Do not skip the comm hook tests on Gloo backend
Test Plan: revert-hammer

Differential Revision:
D28617214 (3e88acbf05)

Original commit changeset: 3bafb0c837a1

fbshipit-source-id: 0b6254e9766436633faea63cd64c454b739f74b4
2021-05-22 07:47:18 -07:00
6093161158 Separated out working tests from not working tests for NNC OpInfo (#58788)
Summary:
This gets rid of a lot of the try/else rigamarole.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58788

Reviewed By: ZolotukhinM

Differential Revision: D28621054

Pulled By: Chillee

fbshipit-source-id: d0d8a1b6466eb318d939a1ed172b78f492ee0d5b
2021-05-22 02:24:23 -07:00
dc8bc6ba4b [PyTorch Edge] Check if open paren ( occurs in an operator name string (#58687)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58687

We want to validate if the usages are all okay.
ghstack-source-id: 129639560

Test Plan: Tested on master: the build fails. Then tested with D28549578 (db67699ae6) applied, and the build succeeds.

Reviewed By: JacobSzwejbka

Differential Revision: D28579734

fbshipit-source-id: 1ac65474762855562109adc0bac2897b59f637ce
2021-05-21 20:23:42 -07:00
4c961beacb Revert D28474878: Always use intrusive_ptr for Message (1 out of 2)
Test Plan: revert-hammer

Differential Revision:
D28474878 (4d704e607d)

Original commit changeset: 5b76d45e05f6

fbshipit-source-id: 677c5bc7f02dca23213f778eb0e626a2f6600f3b
2021-05-21 19:24:22 -07:00
a6b9268f31 Revert D28474879: Always use intrusive_ptr for Message (2 out of 2)
Test Plan: revert-hammer

Differential Revision:
D28474879 (ebf55a7d13)

Original commit changeset: 498652a8b80a

fbshipit-source-id: 4d81e9769699356bf2a2ffc14b26f480bfeef9a1
2021-05-21 19:24:20 -07:00
c1a9befba2 Revert D28474880: Allow Future::then to return pre-extracted DataPtrs
Test Plan: revert-hammer

Differential Revision:
D28474880 (a0ee299d92)

Original commit changeset: 91a0dde5e29d

fbshipit-source-id: fabf7b0bcbd41342553660a4d1e4bfc3d1dd2d41
2021-05-21 19:24:19 -07:00
a1719be07f Revert D28474877: Provide pre-extracted DataPtrs when completing a Future with a Message
Test Plan: revert-hammer

Differential Revision:
D28474877 (bdf6a4bffd)

Original commit changeset: e68d7d45f1c1

fbshipit-source-id: b89858b4e82f4f766031cfaad9fc736cf8097816
2021-05-21 19:24:17 -07:00
341f83d6a2 Revert D28474981: Create CUDA-aware futures in RequestCallback
Test Plan: revert-hammer

Differential Revision:
D28474981 (027c68ef00)

Original commit changeset: 492b8e71a43d

fbshipit-source-id: 0697c0922cd6bcbea2505efeecbbcbb3ffcfff2b
2021-05-21 19:24:15 -07:00
7a8336a5a7 Revert D28474983: Set streams when invoking UDFs
Test Plan: revert-hammer

Differential Revision:
D28474983 (ab1e958d20)

Original commit changeset: 358292764d0a

fbshipit-source-id: b4d4c25fe551d83848a9d023c139a9f1acc4c23d
2021-05-21 19:24:14 -07:00
89c81c5bba Revert D28574083: Set and propagate devices in RRef completion future
Test Plan: revert-hammer

Differential Revision:
D28574083 (23df70359a)

Original commit changeset: 5c89902cdc5c

fbshipit-source-id: e48043b6c4fb8a6f383f78e1aa88f7614f9fa13a
2021-05-21 19:24:12 -07:00
b8a04e25ec Revert D28474982: Make TP agent use streams from Future when sending response
Test Plan: revert-hammer

Differential Revision:
D28474982 (19a7472702)

Original commit changeset: c0034eb3f2a2

fbshipit-source-id: fb260c71e6c9dd5a2c44121fe4729a4f4418532b
2021-05-21 19:23:01 -07:00
dceaf98e79 [torch][package] Fix importlib.resources.path for python <3.8.8 (#58718)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58718

`PackageImporter` does not populate `module.__spec__.origin`, which causes an
unhandled `Exception` to be raised when using `importlib.resources.path` to get
a path to a binary file resource in the package in python <3.8.6.

This commit fixes this issue by setting `module.__spec__.origin` to
"<package_importer>". The actual value is not important as far as I can tell;
the simple fact that it is not `None` allows `importlib` to avoid raising an
`Exception` in `importlib.resources.path`.
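
For reference, the kind of call that used to raise (the package and resource names here are hypothetical):

```python
import importlib.resources

# Inside a module loaded by PackageImporter on the affected Python versions,
# this used to raise because module.__spec__.origin was None.
with importlib.resources.path("my_package", "weights.bin") as p:
    data = p.read_bytes()
```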

Test Plan:
This commit adds a unit test to `test_resources.py` that tests that
`importlib.resources.path` can be used within a package.

Reviewed By: suo

Differential Revision: D28589117

fbshipit-source-id: 870d606a30fce6884ae48b03ff71c0864e4b325f
2021-05-21 19:16:54 -07:00
071d49a970 Document monitored barrier (#58322)
Summary:
This will not land before the release, but it would be good to have this function documented in master for its use in distributed debuggability.
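
A minimal usage sketch (assumes a Gloo-backed process group has already been initialized):

```python
import datetime
import torch.distributed as dist

# Unlike a plain barrier, monitored_barrier reports which rank(s) failed to
# reach it within the timeout instead of hanging silently.
dist.monitored_barrier(timeout=datetime.timedelta(seconds=30))
```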

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58322

Reviewed By: SciPioneer

Differential Revision: D28595405

Pulled By: rohan-varma

fbshipit-source-id: fb00fa22fbe97a38c396eae98a904d1c4fb636fa
2021-05-21 19:04:57 -07:00
84b6c629d3 [lint] Move shellcheck to its own step (#58623)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58623

This splits out everything shellcheck related into its own job that generates and checks GHA workflows, then shellchecks those + jenkins scripts. This PR also integrates shellcheck into the changed-only stuff in `actions_local_runner.py` so that shellcheck won't do anything unless someone edits a shell script in their local checkout. This is the final piece to clean up the output of `make quicklint` and speeds it up by a good bit (before it was shellchecking everything which took a few seconds):

```
$ make quicklint -j $(nproc)
✓ quick-checks: Ensure no unqualified noqa
✓ quick-checks: Ensure canonical include
✓ quick-checks: Ensure no unqualified type ignore
✓ quick-checks: Ensure no direct cub include
✓ quick-checks: Ensure no tabs
✓ quick-checks: Ensure no non-breaking spaces
✓ shellcheck: Regenerate workflows
✓ quick-checks: Ensure no versionless Python shebangs
✓ quick-checks: Ensure correct trailing newlines
✓ shellcheck: Assert that regenerating the workflows didn't change them
✓ mypy (skipped typestub generation)
✓ cmakelint: Run cmakelint
✓ quick-checks: Ensure no trailing spaces
✓ flake8
✓ shellcheck: Extract scripts from GitHub Actions workflows
✓ shellcheck: Run Shellcheck
real 0.92
user 6.12
sys 2.45
```

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28617293

Pulled By: driazati

fbshipit-source-id: af960ed441db797d07697bfb8292aff5010ca45b
2021-05-21 18:23:40 -07:00
b842351b4f Skip SVE acceleration on M1 (#58785)
Summary:
SVE is not supported by the chip and also crashes the compiler; see https://bugs.llvm.org/show_bug.cgi?id=50407

Fixes https://github.com/pytorch/pytorch/issues/58653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58785

Reviewed By: zhouzhuojie, driazati

Differential Revision: D28619231

Pulled By: malfet

fbshipit-source-id: 34367c074f9624b21d239eec757891cbb51f5bed
2021-05-21 18:08:30 -07:00
3e88acbf05 [Gradient Compression] Do not skip the comm hook tests on Gloo backend (#58784)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58784

DDP communication hooks are already supported on Gloo backend. No longer need to skip these tests on Gloo.

Original PR issue: https://github.com/pytorch/pytorch/issues/58467
ghstack-source-id: 129635828

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_comm_hook_logging
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce_process_group
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_powerSGD

Reviewed By: rohan-varma

Differential Revision: D28617214

fbshipit-source-id: 3bafb0c837a15ad203a8570f90750bc5177d5207
2021-05-21 17:47:52 -07:00
041bff77b6 Make tools/actions_local_runner.py PY-3.X compatible (#58787)
Summary:
Do not use `shlex.join`, which is a simple join over quoted args, i.e.
a9e43615c2/Lib/shlex.py (L318-L320)
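
A portable equivalent on older 3.x, sketched from the linked implementation (the argument list below is just an example):

```python
import shlex

args = ["flake8", "--config", ".flake8"]
# shlex.join only exists on Python >= 3.8; this join over quoted args is the
# same thing spelled out by hand.
cmd = " ".join(shlex.quote(a) for a in args)
```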

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58787

Reviewed By: driazati

Differential Revision: D28619996

Pulled By: malfet

fbshipit-source-id: dd4e939a88e2923b41084da2b5fbdbee859c0104
2021-05-21 17:40:48 -07:00
829a096cd7 Fix arange functions for VSX specializations of Vec256 (#58553)
Summary:
Need a templated 2nd parameter to support e.g. double steps even for int vectors.

This extends https://github.com/pytorch/pytorch/pull/34555 x86 specific fix to VSX instruction set.

Fixes https://github.com/pytorch/pytorch/issues/58551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58553

Reviewed By: ailzhang

Differential Revision: D28551266

Pulled By: malfet

fbshipit-source-id: de7d23685da06b1b3089933d74398667cfb43c9f
2021-05-21 17:30:12 -07:00
e094980060 Makefile should use python3 instead of python alias (#58786)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58786

Reviewed By: driazati

Differential Revision: D28619802

Pulled By: malfet

fbshipit-source-id: 8f81298d39ba89c4e007f537ec2dd64bb23338af
2021-05-21 17:23:27 -07:00
1d885fbd0e Update GraphTask::owner_ in a single thread for DistEngine. (#58625)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58625

Several TSAN tests were failing for distributed since `owner_` was not
atomic and was being accessed by several threads. As an example:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/autograd/engine/dist_engine.cpp#L333.

To fix this, I've set the owner_ only once when the graphTask is created.

Test Plan:
1) Validated change fixes failing TSAN test.
2) waitforbuildbot

Reviewed By: albanD

Differential Revision: D28496878

fbshipit-source-id: 473f4f6d859595749a02563a204ba7aa35ea19e3
2021-05-21 17:12:27 -07:00
d9aa0b53eb [PyTorch] Migrate TI usage in ATen/native/quantized to borrowing (#58307)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58307

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129598791

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28445922

fbshipit-source-id: ce12743980296bab72a0cb83a8baff0bb6d80091
2021-05-21 16:31:01 -07:00
3ddb4b3e68 [PyTorch] Migrate TI usage in ATen/native to borrowing (#58305)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58305

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129598793

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28445712

fbshipit-source-id: 0822f1408a0a71c8f8934e6d90659ae3baa085ac
2021-05-21 16:29:50 -07:00
e574c2c025 [quant][fx] Validate qconfig_dict keys (#58566)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58566

Validates the keys of the qconfig_dict, prepare_custom_config_dict, convert_custom_config_dict, and
fuse_custom_config_dict. If the user passes in an invalid key or makes a typo, we will throw an error and let the user know what keys are supported.
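
A minimal sketch of the qconfig_dict shape being validated (the exact set of supported keys is enumerated in the PR):

```python
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx

model = torch.nn.Sequential(torch.nn.Linear(4, 4)).eval()
qconfig_dict = {
    "": get_default_qconfig("fbgemm"),  # global qconfig under a valid key
    # a misspelled key such as "object_typ" now raises an informative error
}
prepared = prepare_fx(model, qconfig_dict)
```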

Test Plan:
Imported from OSS

python test/test_quantization.py

Reviewed By: jerryzh168

Differential Revision: D28540923

fbshipit-source-id: 5958c32017b7d16abd219aefc8e92c42543897c2
2021-05-21 15:20:05 -07:00
ed4cda0183 [pkg] opt into autoformat
Summary: woooo

Test Plan: arc lint --apply-patches --take BLACK --paths-cmd 'hg files -I "caffe2/**/*.py"'

Reviewed By: SplitInfinity

Differential Revision: D28608934

fbshipit-source-id: 7768fed50a87883a95319376c0a6d73a9492bdcc
2021-05-21 15:03:52 -07:00
e5ba9307b7 catch exception when running print regression (#58751)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58751

Test Plan: https://github.com/pytorch/pytorch/issues/58752

Reviewed By: samestep

Differential Revision: D28605667

Pulled By: walterddr

fbshipit-source-id: 3796c924df8e50849dd08ecbeab612ba4f0c569b
2021-05-21 14:59:42 -07:00
378b2af93d T90561249: Enforce kernel launch checks for OSS CI (#58465)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58465

Test Plan: how to test?

Reviewed By: r-barnes

Differential Revision: D28500258

fbshipit-source-id: 19e56d5e18d77b951acb510e1e7ac834ce1ffc9b
2021-05-21 14:03:48 -07:00
19a7472702 Make TP agent use streams from Future when sending response (#58428)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58428

Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 129567045

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474982

fbshipit-source-id: c0034eb3f2a2ea525efb63a31b839bc086060e7e
2021-05-21 13:15:35 -07:00
23df70359a Set and propagate devices in RRef completion future (#58674)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58674

I found this missing parameter while debugging failures in the next PR.

I'm very unhappy about this change. I think this future, which we know for sure won't contain tensors, shouldn't have to worry about CUDA devices. And yet, it does. This means that basically any future anywhere might have to worry about it, and this just doesn't scale, and thus it's bad.
ghstack-source-id: 129567042

Test Plan: Should fix the next diff.

Reviewed By: mrshenli

Differential Revision: D28574083

fbshipit-source-id: 5c89902cdc5cc12f1ebeea860b90cd9c3d7c7da1
2021-05-21 13:15:34 -07:00
ab1e958d20 Set streams when invoking UDFs (#58427)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58427

Running the UDF (be it Python or JIT) is the first step of (most?) RPC calls, which is where the inputs are consumed. The lazy stream context contains the streams used by the inputs, thus it must be made current before any UDF call. I opt to do this as "close" as possible to the place the UDF is invoked, to make the relationship as explicit as possible.
ghstack-source-id: 129567052

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474983

fbshipit-source-id: 358292764d0a6832081c34bf6736f0961475ff3d
2021-05-21 13:15:32 -07:00
027c68ef00 Create CUDA-aware futures in RequestCallback (#58426)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58426

The operations in RequestCallback can return CUDA tensors, thus the futures used to hold them must be CUDA-aware.
ghstack-source-id: 129567051

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474981

fbshipit-source-id: 492b8e71a43da5f63b4b7a31f820427cde9736e4
2021-05-21 13:15:30 -07:00
bdf6a4bffd Provide pre-extracted DataPtrs when completing a Future with a Message (#58425)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58425

Now that callbacks can provide pre-extracted DataPtrs, let's do so. This will become of crucial importance in the next PR, where some of these futures will become CUDA-aware, and thus they will try to extract DataPtrs on their own, but they would fail to do so here because Message isn't "inspectable".
ghstack-source-id: 129567057

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474877

fbshipit-source-id: e68d7d45f1c1dc6daa5e05cf984cfc93d2dce0d0
2021-05-21 13:15:29 -07:00
a0ee299d92 Allow Future::then to return pre-extracted DataPtrs (#58424)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58424

In CUDA mode, Future must inspect its value and extract DataPtrs. However some types are not supported, for example the C++/JIT custom classes, which include Message, which is widely used in RPC. Hence for these scenarios we allow the user to perform the custom DataPtr extraction on their own, and pass the pre-extracted DataPtrs.

Note that `markCompleted` already allowed users to pass in pre-extracted DataPtrs, hence this PR simply extends this possibility to the `then` method too.
ghstack-source-id: 129567044

Test Plan: Used in next PR.

Reviewed By: mrshenli

Differential Revision: D28474880

fbshipit-source-id: 91a0dde5e29d1afac55650c5dfb306873188d785
2021-05-21 13:15:27 -07:00
ebf55a7d13 Always use intrusive_ptr for Message (2 out of 2) (#58423)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58423

This is part 2 of the previous PR. Here we address the remaining occurrences of "raw" Message, namely the ones within toMessageImpl. And since they're the last ones, we make the constructor of Message private, to prevent new usages from emerging.
ghstack-source-id: 129567049

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474879

fbshipit-source-id: 498652a8b80a953396cd5d4b275c0b2e869c9ecf
2021-05-21 13:15:25 -07:00
4d704e607d Always use intrusive_ptr for Message (1 out of 2) (#58422)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422

Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).

By moving to intrusive_ptr everywhere I think we eliminate and simplify many of the problems above.

In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 129567053

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474878

fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
2021-05-21 13:15:24 -07:00
35ea8779da Prevent using anything other than intrusive_ptr for Future (#58421)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58421

Here I make it impossible to create Futures that do not use intrusive_ptr, by making the constructor private. This makes it safer (by "forcing" people to do the right thing) and prevents a proliferation of new shared_ptrs or of accidental copies/moves.
ghstack-source-id: 129567047

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28474484

fbshipit-source-id: 82c487e1bb7c27a2e78cb5d594e00e54c752bf09
2021-05-21 13:15:22 -07:00
44daf1930b Migrate remaining shared_ptr<Future> to intrusive_ptr (#58420)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58420

In https://github.com/pytorch/pytorch/pull/57636 I migrated most uses of Future to an intrusive_ptr. I thought I had all of them but I missed a couple. These are the remaining ones. (The next PR will make it impossible to add new usages of shared_ptr).
ghstack-source-id: 129567071

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28477285

fbshipit-source-id: 75008276baa59e26b450e942c009ec7e78f89b13
2021-05-21 13:15:20 -07:00
59454ce36e Make remaining autograd methods return futures (#57861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57861

The very last methods left that still didn't return Futures were the autograd ones, but they're very easy to port.

We've now finished the conversion of RequestCallback to be fully Future-based!
ghstack-source-id: 129567055

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286173

fbshipit-source-id: 1de58cee1b4513fb25b7e089eb9c45e2dda69fcb
2021-05-21 13:15:19 -07:00
d6d2fb3323 Make remaining RRef methods return futures (#57860)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57860

The other methods for RRefs just did bookkeeping and are trivially easy to migrate to Futures (which is done mainly for consistency at this point).
ghstack-source-id: 129567068

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286175

fbshipit-source-id: 1d97142803f73fe522435ca75200403c78babc68
2021-05-21 13:15:17 -07:00
797dff55b5 Unify fetching RRefs (#57859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57859

Just like with assigning OwnerRRefs, we can also deduplicate the code paths for fetching their values. In fact this was duplicated three times, with different ways of post-processing the value (once for JIT, once for Python, once for autograd). Thanks to futures, we can have that logic in one place and then connect it to different follow-up steps.
ghstack-source-id: 129567050

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286172

fbshipit-source-id: e0742a99cf555755e848057ab6fee5285ff0df2a
2021-05-21 13:15:15 -07:00
b9b41f6d1b Deduplicate Python object serialization (#57858)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57858

Just a small deduplication, which moves complexity out of the way and ensures consistent error checking.
ghstack-source-id: 129567056

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28286174

fbshipit-source-id: 6eab8d3f30405d49c51f8b9220453df8773ff410
2021-05-21 13:15:14 -07:00
cd9dbbd93a Simplify process(Script|Python)(Remote)?Call (#57857)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57857

There used to be a whole lot of methods: `processPythonCall`, `processScriptCall`, `processScriptRemoteCall`, `processPythonRemoteCall`, `processScriptCallOp`, `processBaseScriptRemoteCall` and `processScriptRemoteCallOp`. Thanks to the previous simplification, we can now drop all but the first four, which map nicely 1:1 to the four message types we need to handle. Also their signatures become much simpler: they take an RPC command and return a future.
ghstack-source-id: 129567070

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253848

fbshipit-source-id: e0e45345c414a96900f9d70ee555359d28908833
2021-05-21 13:15:12 -07:00
c96a05d148 Unify assignment of OwnerRRef result (#57856)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57856

Thanks to Futures providing a "common language" between various steps, we can now deduplicate the creation of OwnerRRef, by having two different ways of creating the result (JIT and Python) but then connecting them to a single method that wraps and stores that result in an OwnerRRef.
ghstack-source-id: 129567072

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253845

fbshipit-source-id: a156e56cac60eb22f557c072b61ebac421cfad43
2021-05-21 13:15:10 -07:00
e220a1bbcd Make processPythonExecution return a future (#57855)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57855

We already had a helper to run Python functions, which was nice (it de-duplicated some code). However, this helper took a callback, which, as I said, isn't as nice as having it return a Future. Hence here I change this.
ghstack-source-id: 129567054

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253846

fbshipit-source-id: d854d4aa163798fb015cd6d46932f9ff1d18262e
2021-05-21 13:15:09 -07:00
20d02cb7dd Remove getScriptRemoteCallType (#57854)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57854

Because OwnerRRefs used to be created before their value was computed, we had to figure out their type ahead of time. After the previous diff, we inverted the order of operations, and we can now first compute the result and then create the OwnerRRef. Which means we can just inspect the value to get its type. Much simpler, and much less likely to get it wrong.
ghstack-source-id: 129567060

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253843

fbshipit-source-id: f13c9b294f477ae66fcbdbc85c642fdc69b2740f
2021-05-21 13:15:07 -07:00
60fc37393e Simplify OwnerRRef completion (#57853)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57853

A bunch of methods received an OwnerRRef to "fill in". I think it will be more flexible to do it the other way around, and have these methods return a value (wrapped in a Future), which can then be "connected" to an OwnerRRef, but which can also potentially be consumed in different ways.
ghstack-source-id: 129567059

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253844

fbshipit-source-id: 7e3772312dbacfc75a6ac0f62189fc9828001fc7
2021-05-21 13:15:05 -07:00
ea2f5bbb4c Unify async execution for JIT functions (#57852)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57852

Another great example of the benefits of Futures. Thanks to the "right abstraction" (i.e., the `thenAsync` method), adding support for async execution becomes trivial, and the code is much simpler than it used to be.
ghstack-source-id: 129567063

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253842

fbshipit-source-id: b660151ca300f3d6078db0f3e380c80a4d8f5190
2021-05-21 13:15:04 -07:00
bfdc279134 Unify invoking JIT functions (#57851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57851

The same as the previous PR, but for JIT functions.
ghstack-source-id: 129567069

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253841

fbshipit-source-id: 2b8affde16c106f5c76efa8be49af070213708bf
2021-05-21 13:15:02 -07:00
77428159f5 Unify invoking JIT operands (#57850)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57850

What I envision is modular, decomposed code, with separate steps that each consume/produce Futures and that can be chained together to obtain the desired results. One common "starting point" for these chains is the execution of a remote function (Python or JIT or otherwise). I'm thus creating a helper function for one of these, the JIT operators (by deduplicating the places where we used to run them). More will follow.

This deduplication will also help to add CUDA support to JIT RPC, since the execution of the JIT function/operators is where we need to set our custom streams.
ghstack-source-id: 129567058

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28253847

fbshipit-source-id: 24ab67ad89c8796861e9bbcb78878b26704c0c48
2021-05-21 13:15:00 -07:00
f94f1db938 Make some methods of RequestCallback return void instead of bool (#57849)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57849

Some methods are currently returning bool, but I'll soon want them to return a Future. I could have them return a tuple of bool and Future, but that's a bit heavy. Instead it turns out we can very easily make them return void, which will simplify things.
ghstack-source-id: 129567061

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224476

fbshipit-source-id: 26dc796b7e38f03aa269cf0731b0059d58e57e94
2021-05-21 13:14:59 -07:00
4ac18f6710 Centralize setting messageId in RequestCallback (#57848)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57848

This PR looks large, but all it does is add a dozen lines and remove a lot of other ones.

One first advantage of using Futures is that we can easily chain some "post-processing" to them. Until now we needed to pass the message ID around everywhere because it was set separately by each method. Instead, we could simply add a follow-up step to the final future which sets this ID, and remove all the former logic.
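
A rough sketch of the idea, using the Python `torch.futures` API (which wraps the same C++ Future class); the real code attaches the id to a C++ Message, so the dict and the helper name below are only illustrative stand-ins:

```python
import torch

def attach_message_id(ret_fut, message_id):
    # Post-process the final future so the response carries the request id,
    # instead of threading message_id through every handler.
    return ret_fut.then(lambda f: {**f.value(), "id": message_id})

response = torch.futures.Future()
final = attach_message_id(response, message_id=7)
response.set_result({"payload": "ok"})
print(final.wait())  # {'payload': 'ok', 'id': 7}
```
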
ghstack-source-id: 129567065

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224477

fbshipit-source-id: 7b6e21646262abe5bbbf268897e2d792e5accc27
2021-05-21 13:14:57 -07:00
f6844eafce Make RequestCallback collect Futures from methods, rather than providing them (#57847)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57847

This is the first PR of a stack that aims to simplify RequestCallback, and I want to start by explaining my intentions.

With the introduction of CUDA support in the TensorPipe agent, we found out that other layers higher up in the stack (RRefs, dist autograd, ...) were not "ready" to support CUDA. One cause of this was that in PyTorch most CUDA state is thread-local, and the RequestCallback class (and others) might execute different steps of an operation on multiple threads. The solution to this problem is to preserve or recreate the CUDA state when switching between threads (propagating streams, or recording events and then waiting on them). If we were to manually do this everywhere it would be tedious, error-prone, and hard to maintain.

In fact, we already have a primitive that can do this for us: CUDAFuture (now known as just Future). If whenever we switch threads we were to pack the values in a CUDAFuture and then unpack them on the other threads, all CUDA stuff would be taken care of for us.

If our code leveraged CUDAFuture at its core, this would become the "idiomatic" thing to do, the natural behavior. Future changes would thus also be inclined to follow this pattern, and hence automatically do the right thing.

I also think that, even without these concerns about CUDA, there are benefits to use Futures more extensively. Currently RequestCallback uses a mix of Futures and callbacks. These are two tools for the same job, and thus mixing them creates confusion. Futures are more powerful than simple callbacks (they can be passed around, inspected, chained, waited on, ...) and thus should be preferred. They also lead to more readable code, as each step can be defined and chained in logical order, whereas callbacks must either be nested, or defined inline, or defined before and used later (thus making the code out-of-order).

In short: I intend to rework RequestCallback to use Futures much more. I believe it will greatly simplify the code, help readability, and prove invaluable to support CUDA.

 ---

Until now, we had the final result future being created at the very beginning, and then passed around everywhere, so that the various methods could "fill in" its value. I think it's much lighter to instead allow each method to create or obtain its futures however it wants, and have it return them. I.e., have these futures "bubble up" from the lower layers, rather than being "pushed down" from the upper ones.

In this initial PR, I move the place where we create this "final result future", but I still keep it around. I will then, in later PRs, slowly migrate each method so that it returns a future, and in the end I will avoid creating the final result future.
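
To make the intended pattern concrete, here is a minimal sketch using the Python `torch.futures.Future` API (which wraps the same C++ Future class this stack revolves around). The step names are illustrative, not the actual RequestCallback methods:

```python
import torch

def deserialize(fut):
    # Stand-in for unpacking the request.
    return fut.value() + 1

def execute(fut):
    # Stand-in for running the actual operation.
    return fut.value() * 2

request = torch.futures.Future()

# Each step consumes the previous Future and produces a new one, chained in
# logical order instead of nested callbacks.
response = request.then(deserialize).then(execute)

request.set_result(torch.tensor([1.0, 2.0]))
print(response.wait())  # tensor([4., 6.])
```
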
ghstack-source-id: 129567062

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28224478

fbshipit-source-id: dbdc66b6458645a4a164c02f00d8618fa64da028
2021-05-21 13:14:55 -07:00
7e1f2b33ce Add helpers to manipulate futures (#57846)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57846

In later PRs I'll need to create already-completed futures (it'll make sense then, I hope). Here are a few helpers for that, which I'm adding separately to reduce the noise later.
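
For illustration, the Python-side equivalent of such a helper (the actual helpers in this PR live on the C++ side and their names may differ) would look like this:

```python
import torch

def completed_future(value):
    # Create a Future that is already marked complete with `value`; useful
    # when an API expects a Future but the result is known synchronously.
    fut = torch.futures.Future()
    fut.set_result(value)
    return fut

print(completed_future(torch.ones(2)).wait())  # tensor([1., 1.])
```
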
ghstack-source-id: 129567064

Test Plan: See later.

Reviewed By: mrshenli

Differential Revision: D28253664

fbshipit-source-id: f091e1d3ea353bb5bfbd2f582f1b8f84e4b0114f
2021-05-21 13:14:54 -07:00
1d7cf4b248 Reduce overhead when Future invokes callbacks inline (#57638)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57638

In RPC there are a few instances of "fastpaths" which do `if (fut.isCompleted()) { do_sth(); } else { fut.addCallback(do_sth); }`. I intend to get rid of them, for reasons I'll clarify later but which in a nutshell have to do with CUDA correctness and readability. Note that dropping the fastpath introduces no change in behavior (because `addCallback` invokes the callback inline anyways), thus the only perf concern comes from the fact that the fastpath avoids constructing and passing around a `std::function`. I don't think this is a significant performance hit. Regardless, this PR preemptively addresses this concern, by tweaking `addCallback` (and similar methods) so they can handle raw lambdas, and so that they do _not_ wrap them into `std::function`s if they are invoked inline.
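
A sketch of the pattern being removed, illustrated with the Python `torch.futures` API (which wraps the same C++ Future; the commit itself concerns the C++ `addCallback`):

```python
import torch

def handle(fut):
    print("value:", fut.value())

fut = torch.futures.Future()
fut.set_result(42)

# The "fastpath" pattern described above:
if fut.done():
    handle(fut)
else:
    fut.add_done_callback(handle)

# Without the fastpath: add_done_callback (like the C++ addCallback) runs the
# callback inline when the future is already complete, so the branch buys
# nothing in terms of behavior.
fut.add_done_callback(handle)
```
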
ghstack-source-id: 129567067

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D28222808

fbshipit-source-id: eb1c7114cf7aca3403cb708f14287cab0907ecfa
2021-05-21 13:14:52 -07:00
ce2f1c29f9 Introduce thenAsync method on Future (#57637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57637

I had proposed a similar method in https://github.com/pytorch/pytorch/pull/48790, although that PR also exposed it to Python and thus required a bit more work. This PR only introduces this method as a C++ API. Python support can be added later.

This new method is useful when one wants to use `then` but the callback itself performs some async operation, and one wants to "reconcile" the future produced inside the callback with the one produced outside.
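
A minimal sketch of the semantics, using the Python `torch.futures` API (which does not itself expose `thenAsync`); `then_async` and `start_async_square` are hypothetical helpers written only to illustrate the "flattening" behavior described above:

```python
import torch

def then_async(fut, callback):
    # `callback` itself returns a Future; the resulting Future completes only
    # once that inner Future completes (i.e. the nested Future is flattened).
    result = torch.futures.Future()

    def outer(completed_fut):
        inner = callback(completed_fut)  # async step, returns a Future
        inner.then(lambda f: result.set_result(f.value()))

    fut.then(outer)
    return result

def start_async_square(x):
    # Stand-in for an asynchronous operation that hands back a Future.
    inner = torch.futures.Future()
    inner.set_result(x * x)
    return inner

f = torch.futures.Future()
chained = then_async(f, lambda done: start_async_square(done.value()))
f.set_result(3)
print(chained.wait())  # 9
```
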
ghstack-source-id: 129567066

Test Plan: Used (and thus tested) later in the stack.

Reviewed By: mrshenli

Differential Revision: D28222809

fbshipit-source-id: 869f11ab390b15e80c0855750e616f41248686c5
2021-05-21 13:13:02 -07:00
d7d0fa2069 Fix typo. (#58728)
Summary:
Fix typo in docs and comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58728

Reviewed By: mruberry

Differential Revision: D28603612

Pulled By: H-Huang

fbshipit-source-id: b3cd8f6f19354201d597254d0b3cb4e2062471ab
2021-05-21 11:45:10 -07:00
13c975684a c10/util/thread_name.cpp: pthread_setname_np requires Glibc 2.12 (#55063)
Summary:
`pthread_setname_np` requires Glibc 2.12. The patch reproduces what numactl does: 93867c59b0/syscall.c (L132-L136)

Related to issue https://github.com/pytorch/pytorch/issues/23482 and the `pthread_setname_np.patch` patch that adamjstewart shared.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55063

Reviewed By: soulitzer

Differential Revision: D28577146

Pulled By: malfet

fbshipit-source-id: 85867b6f04795b1ae7bd46dbbc501cfd0ec9f163
2021-05-21 10:26:51 -07:00
76ce925257 [c10d] Fix monitored_barrier with wait_all_ranks (#58702)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58702

Off by one error when determining if some ranks failed or not with
`wait_all_ranks=True`. This wasn't caught by tests because the tests only
tested failure scenarios, not success scenarios with `wait_all_ranks=True`.
ghstack-source-id: 129559840

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28583235

fbshipit-source-id: a8f376efb13a3f36c788667acab86543c80aff59
2021-05-21 09:40:50 -07:00
9e261de630 Revert D28547564: [pytorch][PR] masked_scatter thrust->cub
Test Plan: revert-hammer

Differential Revision:
D28547564 (5152cf8647)

Original commit changeset: 83aeddfaf702

fbshipit-source-id: d5259afb584e0f6c0a11de4d4cb3d56a2a562eb7
2021-05-21 09:18:34 -07:00
5313bafd31 [JIT] integer value refinement (#56438)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56438

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27924239

Pulled By: eellison

fbshipit-source-id: ace54fcb594853f30c242369ea203b0eb5527ac1
2021-05-21 08:51:01 -07:00
483ea176b3 Factor out isDominatedBy (#56437)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/56437

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27924240

Pulled By: eellison

fbshipit-source-id: d600f895bfb06304957fe65155fceab0e5f873ea
2021-05-21 08:50:59 -07:00
0d9f1c1ec6 Add Value * == Value * peephole (#55978)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55978

This is needed for broadcasting two of the same symbolic shape

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27755328

Pulled By: eellison

fbshipit-source-id: d38d9458a9e28d31558f0bc55206516b78131032
2021-05-21 08:50:57 -07:00
391603d883 Factor out non tensor peephole (#55977)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55977

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D27755329

Pulled By: eellison

fbshipit-source-id: 0e8948c0607fa59133310e4db8e05ac6847c9f8b
2021-05-21 08:50:55 -07:00
5cebf29b4e Add list len refinement (#55926)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55926

This is necessary for code like conv2d, where we wish to share the generic convolution shape-function logic but, for conv2d specifically, always infer that the output has dimension 4. I'm also hoping the refinement algorithm here can be refactored out and used to support refining tensor types from user annotations. I have a lengthy comment explaining how this works, and the logic outside of the data structures is pretty small and contained. Additionally, you might check out https://fb.quip.com/X7EVAdQ99Zzm for a very similar description of how to refine values based on comparison operators.
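
A hedged sketch of the kind of TorchScript code this refinement targets (the exact rules live in the pass itself): inside the `len(...) == 4` branch, the analysis can treat the list as having exactly four elements.

```python
from typing import List

import torch

@torch.jit.script
def last_spatial_dim(out_shape: List[int]) -> int:
    if len(out_shape) == 4:
        # Within this branch the length of out_shape is refined to 4,
        # so indexing the last dimension is known to be in bounds.
        return out_shape[3]
    return -1

print(last_spatial_dim([1, 3, 8, 8]))  # 8
```
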

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27750997

Pulled By: eellison

fbshipit-source-id: d962415af519ac37ebc9de88f2e1ea60a1374f7c
2021-05-21 08:50:54 -07:00
9fd2306036 Add handling of symbolic shapes (#55925)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55925

This sets up the initial handling of symbolic shapes. As in the test, it doesn't work perfectly yet because it needs a couple other optimization passes. The basic description is pretty simple: we resolve tensor dimension indices to the same Value *, and before extracting out the output Tensor shape we substitute in symbolic shapes. We don't substitute during optimization because they are represented as negative numbers so we don't want them inadvertently used in Constant prop or something else.

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D27750996

Pulled By: eellison

fbshipit-source-id: 6984e7276b578f96b00fc2025cef0e13f594b6e6
2021-05-21 08:50:52 -07:00
f39471a171 Initial Symbolic Shape Analysis (#54809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54809

I'm going to post on dev-discuss soon with a more thorough explanation of the design and advantages of this shape analysis, so I'm leaving out that for now.

There is still a ton left to do; I'm posting this initial version so we can get something on master that multiple people can work on. List of remaining steps:

- [ ] Add symbolic shapes support
- [ ] Bind shape functions for operators in C++
- [ ] Make classes of operators share the same shape function (e.g. pointwise, broadcast two inputs)
- [ ] Refactor APIs
- [ ] Only iteratively optimize shape function while a change has been made
- [ ] Expand coverage to common ops
- [ ] Add shape analysis pass on Graph that handles Ifs and Loops
- [ ] Allow concurrent reads to the operator map
- [ ] Successive applications of same inputs to same shape function (e.g. series of pointwise ops)

For this review, I am mostly looking for comments related to the implementation of symbolic_shape_analysis.cpp, with the caveats listed above. I am not really looking for comments related to API/registration/graph-level analysis, as those are all planned to be changed. I am fine landing this as-is or waiting until the necessary components of the TODOs above are finished.

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D27750998

Pulled By: eellison

fbshipit-source-id: 4338b99e8651df076291c6b781c0e36a1bcbec03
2021-05-21 08:49:46 -07:00
72ae924fad Added sublist support for torch.einsum (#56625)
Summary:
This PR adds an alternative way of calling `torch.einsum`. Instead of specifying the subscripts as letters in the `equation` parameter, one can now specify the subscripts as a list of integers as in `torch.einsum(operand1, subscripts1, operand2, subscripts2, ..., [subscripts_out])`. This would be equivalent to `torch.einsum('<subscripts1>,<subscripts2>,...,->[<subscript_out>]', operand1, operand2, ...)`
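
A small example of the two equivalent call forms (matrix multiplication):

```python
import torch

a = torch.randn(2, 3)
b = torch.randn(3, 4)

# Classic equation-string form.
out_str = torch.einsum('ij,jk->ik', a, b)

# New sublist form: subscripts are given as lists of integers after each
# operand, with an optional output subscript list at the end.
out_sub = torch.einsum(a, [0, 1], b, [1, 2], [0, 2])

print(torch.allclose(out_str, out_sub))  # True
```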

TODO
- [x] Update documentation
- [x] Add more error checking
- [x] Update tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56625

Reviewed By: zou3519

Differential Revision: D28062616

Pulled By: heitorschueroff

fbshipit-source-id: ec50ad34f127210696e7c545e4c0675166f127dc
2021-05-21 08:36:45 -07:00
fc804b5def Revert D28133579: [jit] Implement ScriptProfile to collect instruction profiles.
Test Plan: revert-hammer

Differential Revision:
D28133579 (034a238bab)

Original commit changeset: e7e30e961513

fbshipit-source-id: 5a7756468b4f2eeed24d2abb7b52ab46d081a95e
2021-05-21 08:18:40 -07:00
e56d3b0238 Added OpInfo tests for NNC (#58719)
Summary:
Finds a couple of bugs:

1. permute needs to wrap dimensions
2. slice needs to wrap dimensions
3. frac doesn't work correctly for negative values
4. Permute has some other failures.

This PR also fixes 1 + 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58719

Reviewed By: SplitInfinity

Differential Revision: D28590457

Pulled By: Chillee

fbshipit-source-id: a67fce67799602f9396bfeef615e652364918fbd
2021-05-21 01:41:28 -07:00
d88d321ee3 More robust slicing logic for nn.ModuleList (#58361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58361

Fixes: https://github.com/pytorch/pytorch/issues/16123

Test Plan: Imported from OSS

Reviewed By: ppwwyyxx

Differential Revision: D28464855

Pulled By: tugsbayasgalan

fbshipit-source-id: db8c41b15dbe6550035e8230dea68ce60e5a6f9a
2021-05-20 23:00:17 -07:00
b301558410 [Reducer] Remove replica size == 1 checks (#58603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58603

No longer need these checks
ghstack-source-id: 129498227

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28549893

fbshipit-source-id: a89bf8c3fc3aba311a70fd37e5a6aa5dc14b41b9
2021-05-20 22:34:23 -07:00
1d67c6d639 [DDP] Remove train call to module copies (#58595)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58595

No longer needed since this list is always of size 1.
ghstack-source-id: 129498229

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28548426

fbshipit-source-id: 7d6dba92fff685ec7f52ba7a3d350e36405e2578
2021-05-20 22:34:20 -07:00
88c76b43fb [Reducer] move comment to the right place (#58594)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58594

This comment was misplaced after some changes, move it to the right
place.
ghstack-source-id: 129498228

Test Plan: ci

Reviewed By: zhaojuanmao

Differential Revision: D28548100

fbshipit-source-id: a9163fc3b25a9d9b8b6d4bfa2a77af290108fc09
2021-05-20 22:34:17 -07:00
d83c5a5c7f Format reducer.cpp, hpp (#58593)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58593

Per title
ghstack-source-id: 129498230

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528465

fbshipit-source-id: 89e4bfcb4a0275dc17090a934d4c0a41a3c54046
2021-05-20 22:32:30 -07:00
6d97a80dd2 [fx][graph_drawer] Improve graph drawer coloring and tensor_meta handling (#58699)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58699

Give `call_function`/`call_method` nodes random colors based on their target name; this coloring is stable for a given target name. Also handle tensor_meta more elegantly for quantized types, including printing q_scale/q_zero_point if they're used.

Test Plan: Tested locally

Reviewed By: chenccfb, 842974287

Differential Revision: D28580333

fbshipit-source-id: ad9961e1106a1bfa5a018d009b0ddb8802d2163c
2021-05-20 21:26:04 -07:00
5455df2b99 [codemod][dirsync] Apply clang-format
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D28477071

fbshipit-source-id: e844e0fad2f5599fd27e0fd113a328031cb63aa7
2021-05-20 21:23:24 -07:00
21a9334034 Revert D28497967: [quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer
Test Plan: revert-hammer

Differential Revision:
D28497967 (1cf8f7a439)

Original commit changeset: 421ce3d86fad

fbshipit-source-id: b1b290be47d847ab0e0128e3ae89f528578550ee
2021-05-20 20:56:12 -07:00
1cf8f7a439 [quant][fx][graphmode][refactor] Remove qconfig_map from Quantizer (#58455)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58455

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497967

fbshipit-source-id: 421ce3d86fadd3d92f4120b850b0167270509189
2021-05-20 20:34:47 -07:00
62adf9e1c9 [Reducer] Completely remove VariableIndex (#58592)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58592

Completely removes VariableIndex from reducer code, as it is not
needed. replica_index is always 0 so simplify the code to only use the
parameter index. Next, we should also remove all of the nested data structures
that were needed when num_replicas > 1 was possible.
ghstack-source-id: 129498226

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28528440

fbshipit-source-id: e0568399264ab4f86de3b7a379a4f0831f8f42e9
2021-05-20 19:47:50 -07:00
8e4fc0063a [Try] [PyTorch Edge] Trim unused code related to CUDA and HIP Interfaces (#58689)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58689

This doesn't seem to be mobile related, but ends up getting called from multiple places, so is hard to get rid of entirely.
ghstack-source-id: 129413850

Test Plan: Build

Reviewed By: iseeyuan

Differential Revision: D28543374

fbshipit-source-id: 867b3e2fafdcbf6030d7029a82a2b711bcecefc5
2021-05-20 18:23:36 -07:00
773cfae93b Tag PyObject on TensorImpl per torchdeploy interpreter (#57985)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57985

Fixes https://github.com/pytorch/pytorch/issues/57756

This PR introduces a new `pyobj_interpreter_` field on TensorImpl which tracks what Python interpreter (if any) owns the TensorImpl. This makes it illegal to bind a TensorImpl from multiple Python interpreters, and means that we can now directly store PyObject pointer on TensorImpl even in the presence of multiple Python interpreters, as is the case in torchdeploy. This is a necessary step for PyObject preservation, which cannot be easily implemented when there are multiple Python interpreters.

Although the PR is not that long, there is a very subtle portion of the implementation devoted to ensuring that the tagging process is thread safe, since multiple threads can concurrently try to tag a PyObject. Check Note [Python interpreter tag] and Note [Memory ordering on Python interpreter tag] for detailed discussion of how this is handled. You will have to check this code carefully in code review; I did not torture test the multithreaded paths in any meaningful way.

In a follow up PR, I will pack the interpreter and PyObject fields into single atomic word on 64-bit.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: wconstab

Differential Revision: D28390242

Pulled By: ezyang

fbshipit-source-id: a6d9b244ee6b9c7209e1ed185e336297848e3017
2021-05-20 18:18:39 -07:00
fe8e5eb260 Change native functions to take c10::string_view args instead of std::string (#57680)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/53546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57680

Reviewed By: malfet

Differential Revision: D28511799

Pulled By: ezyang

fbshipit-source-id: 43142f994d048b28b3279ccdb7a28cbaa3190973
2021-05-20 18:15:45 -07:00
d1d24304ee [Caffe2] [Easy] Fix comment on caffe2_serialize_using_bytes_as_holder to reflect correct types
Summary:
the logic is:

```
template <typename T>
typename std::enable_if<
    std::is_same<T, bool>::value || std::is_same<T, uint8_t>::value ||
        std::is_same<T, int8_t>::value || std::is_same<T, uint16_t>::value ||
        std::is_same<T, int16_t>::value,
    void>::type
```

Test Plan: N/A

Reviewed By: simpkins

Differential Revision: D28587311

fbshipit-source-id: 970c673a9c1256600ec8bdd5f9ca53333a60d588
2021-05-20 18:03:34 -07:00
db67699ae6 [Pytorch Edge] NAME -> SCHEMA (#58604)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58604

Minor bug fix. Schemas should be defined with the schema macro not the name one.

Test Plan: ci and buck test fbsource//xplat/pytorch_models/build/cair_messaging_2021_05_17/v2:cair_messaging_2021_05_17_test

Reviewed By: dhruvbird, iseeyuan

Differential Revision: D28549578

fbshipit-source-id: 0c64eb8c60f1aee8213a1fc1fb7231226b905795
2021-05-20 17:51:38 -07:00
0ede83db7a enable torch.cpu.amp.autocast (#57386)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57386

Here is the PR for what's discussed in the RFC https://github.com/pytorch/pytorch/issues/55374 to enable autocast for the CPU device. Currently, this PR only enables BF16 as the lower-precision datatype.

Changes:
1.  Enable the new API `torch.cpu.amp.autocast` for autocast on the CPU device, including the Python API, C++ API, a new DispatchKey, etc.
2.  Consolidate the implementation so that each cast policy is shared between CPU and GPU devices.
3.  Add the operation lists for the corresponding cast policies for CPU autocast.
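
A minimal usage sketch of the new API (assuming matmul is on the lower-precision list for CPU autocast):

```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)

with torch.cpu.amp.autocast():
    c = torch.mm(a, b)

# Ops on the lower-precision list run in bfloat16 under CPU autocast.
print(c.dtype)  # torch.bfloat16
```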

Test Plan: Imported from OSS

Reviewed By: soulitzer

Differential Revision: D28572219

Pulled By: ezyang

fbshipit-source-id: db3db509973b16a5728ee510b5e1ee716b03a152
2021-05-20 17:48:36 -07:00
b6dcdeacc9 [quant][graphmode][fx] Move qat_swap_modules outside of Quantizer (#58454)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58454

The goal is to eventually remove the Quantizer object entirely.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497966

fbshipit-source-id: 800f8e4afd99918d7330345f8ae7bcf018a5bde7
2021-05-20 17:27:49 -07:00
fdc5dfdd50 [PyTorch] Migrate TI usage in ATen/native/cpu to borrowing (#58303)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58303

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129471191

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28444032

fbshipit-source-id: f6a9e9effb43c273f464ef6ff410274962f3ab23
2021-05-20 17:24:13 -07:00
7c15d3206d [PyTorch] Add TI::borrowing_nullary_op and use it (#58280)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58280

All core PyTorch uses of TensorIterator::nullary_op look like they can safely borrow.
ghstack-source-id: 129471193

Test Plan: Existing CI

Reviewed By: bhosmer

Differential Revision: D28429695

fbshipit-source-id: 404cf6db31e45e5cf7ae6d2f113c5a8eff6f7c3d
2021-05-20 17:22:58 -07:00
618be18a41 Enable the quantization on XPU devices (#54857)
Summary:
Enable quantization on XPU devices. Keep the model as-is if it is on an XPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54857

Reviewed By: ailzhang

Differential Revision: D28501381

Pulled By: jerryzh168

fbshipit-source-id: 6d3e9b04075393248b30776c69881f957a1a837c
2021-05-20 17:02:13 -07:00
ce3788d6a5 Add #pragma once to CUDA foreach headers (#58209)
Summary:
Per the title, adding `#pragma once` to cuda headers related to foreach functions.

cc: ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58209

Reviewed By: ailzhang

Differential Revision: D28558620

Pulled By: ngimel

fbshipit-source-id: 195f68435999eb7409ba904daf6fc5f0962d375d
2021-05-20 16:35:43 -07:00
f879e70fc1 [quant][fx][graphmode][refactor] Factor out generate_qconfig_map to qconfig_utils.py (#58453)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58453

Move the class method generate_qconfig_map to qconfig_utils. More PRs will follow to move functions out of Quantizer and eventually remove the Quantizer object.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28497965

fbshipit-source-id: 3c78cfe676965d20a8834a859ffed4d8e9ecade4
2021-05-20 16:26:24 -07:00
bf1c936e06 [static runtime] out variant for full_like (#58079)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58079

Support full_like

Test Plan:
`buck test mode/dev caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.IndividualOps_FullLike`

Test on regenerated local inline_cvr model
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --scripted_model=/data/users/ansha/tmp/adfinder/dec_6x/266377643_shrunk.predictor.disagg.local.regenerated.pt --pt_inputs=/data/users/ansha/tmp/adfinder/dec_6x/local_inputs --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=5000 --warmup_iters=5000 --num_threads=1 --do_profile=0 --do_benchmark=1 --adsfinder_compatibility=1 --v=1
```

`V0511 10:59:57.187054 1911683 impl.cpp:1229] Switch to out variant for node: %5571 : Tensor = aten::full_like(%blob_for_shape.1, %235, %654, %75, %75, %75, %75)`

Reviewed By: hlu1

Differential Revision: D28361997

fbshipit-source-id: 89c41e37ce23d6008cfe4d80536832ee76d3405e
2021-05-20 16:17:40 -07:00
5211eeb22b Support aten::leaky_relu for TE (#58464)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58464

Test Plan:
./bin/test_tensorexpr

python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops

Reviewed By: Krovatkin

Differential Revision: D28499776

fbshipit-source-id: 20094a1bc78aa485f76aec4e065ff69e43d692d7
2021-05-20 16:12:03 -07:00
4668d09ca6 [quant][graphmode][fx] Quantize the output of statically quantized fp16 op in QuantizeHandler (#58445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58445

Previously, the output of a statically quantized fp16 operator was not quantized in QuantizeHandler, which is inconsistent with the behavior of static int8 operators, and it also did not work well with reference functions. This PR changes the fp16 static QuantizeHandler to quantize (i.e., call to(torch.float16)) inside the QuantizeHandler, which also makes future support for reference functions easier.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D28495830

fbshipit-source-id: 2140eab8ab2dd08f6570d9e305485e3029e1f47d
2021-05-20 16:03:42 -07:00
6edd49a8e8 [Android]Removed dependency with AppCompat. (#58527)
Summary:
I build using [Bazel](https://bazel.build/).

When I use `pytorch_android` in latest Android app, I get the following error due to dependencies:

```
$ bazel build //app/src/main:app
WARNING: API level 30 specified by android_ndk_repository 'androidndk' is not available. Using latest known API level 29
INFO: Analyzed target //app/src/main:app (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
ERROR: /home/H1Gdev/android-bazel-app/app/src/main/BUILD.bazel:3:15: Merging manifest for //app/src/main:app failed: (Exit 1): ResourceProcessorBusyBox failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/bazel_tools/src/tools/android/java/com/google/devtools/build/android/ResourceProcessorBusyBox --tool MERGE_MANIFEST -- --manifest ... (remaining 11 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox ResourceProcessorBusyBox failed: error executing command bazel-out/k8-opt-exec-2B5CBBC6/bin/external/bazel_tools/src/tools/android/java/com/google/devtools/build/android/ResourceProcessorBusyBox --tool MERGE_MANIFEST -- --manifest ... (remaining 11 argument(s) skipped)

Use --sandbox_debug to see verbose messages from the sandbox
Error: /home/H1Gdev/.cache/bazel/_bazel_H1Gdev/29e18157a4334967491de4cc9a879dc0/sandbox/linux-sandbox/914/execroot/__main__/app/src/main/AndroidManifest.xml:19:18-86 Error:
	Attribute application@appComponentFactory value=(androidx.core.app.CoreComponentFactory) from [maven//:androidx_core_core] AndroidManifest.xml:19:18-86
	is also present at [maven//:com_android_support_support_compat] AndroidManifest.xml:19:18-91 value=(android.support.v4.app.CoreComponentFactory).
	Suggestion: add 'tools:replace="android:appComponentFactory"' to <application> element at AndroidManifest.xml:5:5-19:19 to override.
May 19, 2021 10:45:03 AM com.google.devtools.build.android.ManifestMergerAction main
SEVERE: Error during merging manifests
com.google.devtools.build.android.AndroidManifestProcessor$ManifestProcessingException: Manifest merger failed : Attribute application@appComponentFactory value=(androidx.core.app.CoreComponentFactory) from [maven//:androidx_core_core] AndroidManifest.xml:19:18-86
	is also present at [maven//:com_android_support_support_compat] AndroidManifest.xml:19:18-91 value=(android.support.v4.app.CoreComponentFactory).
	Suggestion: add 'tools:replace="android:appComponentFactory"' to <application> element at AndroidManifest.xml:5:5-19:19 to override.
	at com.google.devtools.build.android.AndroidManifestProcessor.mergeManifest(AndroidManifestProcessor.java:186)
	at com.google.devtools.build.android.ManifestMergerAction.main(ManifestMergerAction.java:217)
	at com.google.devtools.build.android.ResourceProcessorBusyBox$Tool$5.call(ResourceProcessorBusyBox.java:93)
	at com.google.devtools.build.android.ResourceProcessorBusyBox.processRequest(ResourceProcessorBusyBox.java:233)
	at com.google.devtools.build.android.ResourceProcessorBusyBox.main(ResourceProcessorBusyBox.java:177)

Warning:
See http://g.co/androidstudio/manifest-merger for more information about the manifest merger.
Target //app/src/main:app failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 2.221s, Critical Path: 1.79s
INFO: 2 processes: 2 internal.
FAILED: Build did NOT complete successfully
```

This is due to conflict between `AndroidX` and `Support Library` on which `pytorch_android_torch` depends.
(In the case of `Gradle`, it is avoided by `android.useAndroidX`.)

I created [Android application](https://github.com/H1Gdev/android-bazel-app) for comparison.

At first, I updated `AppCompat` from `Support Library` to `AndroidX`, but `pytorch_android` and `pytorch_android_torchvision` didn't seem to need any dependencies, so I removed dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58527

Reviewed By: xta0

Differential Revision: D28585234

Pulled By: IvanKobzarev

fbshipit-source-id: 78aa6b1525543594ae951a6234dd88a3fdbfc062
2021-05-20 15:49:19 -07:00
d84121421e [third-party] Update nccl to 2.9.8 (#58667)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58667

Reviewed By: ngimel

Differential Revision: D28577042

Pulled By: malfet

fbshipit-source-id: 62f1c67f35bf5a004852806c1a74bb068cefb79b
2021-05-20 15:42:17 -07:00
bbf92e6176 Add missing .to_sparse(ndim) gradient (#58413)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46720, extends PR https://github.com/pytorch/pytorch/issues/46825 by adding test requested in [this comment](https://github.com/pytorch/pytorch/pull/46825#issuecomment-842304079).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58413

Reviewed By: ailzhang

Differential Revision: D28540550

Pulled By: albanD

fbshipit-source-id: d7e292e09b5402336c43844ee233b83b0a095035
2021-05-20 15:08:34 -07:00
8a3d9962e0 Enable ceil, floor, frac, round & trunc for BFloat16 on CUDA (#57910)
Summary:
Enable `ceil`, `floor`, `frac`, `round` & `trunc` for BFloat16 on CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57910

Reviewed By: soulitzer

Differential Revision: D28579486

Pulled By: ngimel

fbshipit-source-id: 2f90354339dbccb69cea7ec9caf9b066ea13a666
2021-05-20 14:52:45 -07:00
034a238bab [jit] Implement ScriptProfile to collect instruction profiles. (#57397)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57397

Introduces two main classes in C++ runtime:

ScriptProfile is the implementation for enabling and disabling interpreter profiling in C++. It should only be used from Python, and we will add the corresponding Python API in the next diff.

InstructionSpan is a utility class to instrument the execution of each single instruction. A start timestamp is recorded in the constructor, and an end timestamp is recorded in the destructor. During destruction, it sends runtime data to all enabled ScriptProfile instances.

Test Plan:
build/bin/test_jit --gtest_filter='ScriptProfileTest.Basic'

Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28133579

fbshipit-source-id: e7e30e96151367022793ab3ad323f01c51ad4a3b
2021-05-20 14:11:03 -07:00
e8c6a65074 Adds grid_sampler to autocast fp32 list for 1.9 (#58679)
Summary:
Temporary fix for https://github.com/pytorch/pytorch/issues/42218.

Numerically, grid_sampler should be fine in fp32 or fp16. So grid_sampler really belongs on the promote list. But performancewise, native grid_sampler backward kernels use gpuAtomicAdd, which is notoriously slow in fp16. So the simplest functionality fix is to put grid_sampler on the fp32 list.

In https://github.com/pytorch/pytorch/pull/58618 I implement the right long-term fix (refactoring kernels to use fp16-friendly fastAtomicAdd and moving grid_sampler to the promote list). But that's more invasive, and for 1.9 ngimel says this simple temporary fix is preferred.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58679

Reviewed By: soulitzer

Differential Revision: D28576559

Pulled By: ngimel

fbshipit-source-id: d653003f37eaedcbb3eaac8d7fec26c343acbc07
2021-05-20 14:05:09 -07:00
691c139144 Do not use TF32 matmul in linalg and DDP tests (#56114)
Summary:
This PR does several things to relax test tolerance

- Do not use TF32 in cuda matmul in test_c10d. See https://github.com/pytorch/pytorch/issues/52941.
- Do not use TF32 in cuda matmul in test_linalg. Increase atol for float and cfloat. See https://github.com/pytorch/pytorch/issues/50453
    The tolerance is increased because most linear algebra operators are not that stable in single precision.
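
For reference, TF32 matmuls can be disabled with the documented backend flag; this is a sketch of what the affected tests do rather than a copy of the test code:

```python
import torch

# Disable TF32 for matmuls so results are computed in full fp32 precision
# (relevant on Ampere and newer GPUs).
torch.backends.cuda.matmul.allow_tf32 = False

if torch.cuda.is_available():
    a = torch.randn(128, 128, device='cuda')
    b = torch.randn(128, 128, device='cuda')
    c = a @ b
```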

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56114

Reviewed By: ailzhang

Differential Revision: D28554467

Pulled By: ngimel

fbshipit-source-id: 90416be8e4c048bedb16903b01315584d344ecdf
2021-05-20 14:01:19 -07:00
a7f06e1e55 Added statistic related to out variant nodes
Summary: added more statistic info for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
2021-05-20 13:57:07 -07:00
056287aec4 turn off deadline for adagrad test
Summary: Tests are frequently failing with "exceeded the deadline of 1000.00ms"; we expect this to happen, so remove the deadline.

Test Plan: N/A: Fix breakages

Reviewed By: robieta

Differential Revision: D28581051

fbshipit-source-id: 4825ada9af151fa5d57c45c549138c15ba613705
2021-05-20 13:47:02 -07:00
9db64e6e56 Revert "Striding for lists Part 2 (#49352)" (#58523)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58523

This reverts commit fee7e8b91d4434b976a339330bfa89bd827ab9ec.

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D28528023

Pulled By: tugsbayasgalan

fbshipit-source-id: 9fa1d86f0c81fcc6fd3798e0d51a712a3c9b3952
2021-05-20 13:20:33 -07:00
9123229684 Cleanup functional.py after lu_unpack was removed (#58669)
Summary:
Remove code in functional.py that became unused after PR c790fd2bf8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58669

Reviewed By: driazati

Differential Revision: D28572377

Pulled By: heitorschueroff

fbshipit-source-id: c90d80ead5f3d69100667488bc6b14ef54b95b54
2021-05-20 13:06:30 -07:00
0e1bed364d [nnc] Use int64 to compute matmul flops heuristic (#58676)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58676

We only generate asm for small matmuls, but we were computing the # of
flops using an int32, which is too small.

Test Plan:
```
buck test mode/dev //caffe2/test:static_runtime -- --exact 'caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule)'
```

Reviewed By: navahgar

Differential Revision: D28562157

fbshipit-source-id: a07ceba5209ef6022ead09140380c116994755cf
2021-05-20 13:05:21 -07:00
a60ce98a2e Remove opinfo warning from floor_divide (#58682)
Summary:
This warning makes downstream users of OpInfo error when they use this opinfo, unless they actually run the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58682

Reviewed By: mruberry

Differential Revision: D28577334

Pulled By: Chillee

fbshipit-source-id: f10e64f8ad3fb50907531d8cb89ce5b0d06ac076
2021-05-20 12:57:58 -07:00
1981904c8d [Static Runtime] Check input container type in aten::__getitem__ (#58639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58639

Fix two tests in `//caffe2/test:static_runtime` that were previously broken.

Reviewed By: ajyu, edvgha

Differential Revision: D28561185

fbshipit-source-id: 3cfb0960666c808523d65da267f70bd51e828313
2021-05-20 12:47:01 -07:00
84500d03d2 .github: Upload /download large artifacts to s3 (#58506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58506

We were experiencing 500 errors when downloading large artifacts, so let's just use S3 for those larger artifacts, just in case.

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: zhouzhuojie

Differential Revision: D28520792

Pulled By: seemethere

fbshipit-source-id: 3aa15c4872fe46c9491ac31dc969bf71175378aa
2021-05-20 11:52:05 -07:00
151ec56311 ENH Adds check for input sizes in cosine_similarity (#58559)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/55273

Adds check for input sizes to be consistent with the docstring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58559

Reviewed By: soulitzer

Differential Revision: D28562376

Pulled By: ailzhang

fbshipit-source-id: f292e8a26f11a40d146fbed94a28025794808216
2021-05-20 11:40:06 -07:00
3c55db8065 Add Deploy to PredictorContainer (#58503)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58503

add gflags to force using deploy for torchscript models

Test Plan: Add parametrization to PredictorContainer test to exercise gflag override and test deploy codepath.  Add test case to exercise new torch.package codepath.

Reviewed By: suo

Differential Revision: D28246793

fbshipit-source-id: 88a2c8322c89284e3c8e14fee5f20e9d8a4ef300
2021-05-20 11:29:31 -07:00
1fc3e1e1fb Abladawood patch 1 (#58496)
Summary:
Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58496

Reviewed By: soulitzer

Differential Revision: D28562333

Pulled By: ailzhang

fbshipit-source-id: aa9fcc03ba7ffe03db6cc5da353d37d679a0a160
2021-05-20 10:32:18 -07:00
5152cf8647 masked_scatter thrust->cub (#56750)
Summary:
Benchmark:

```python
import torch
import itertools

def run50_sync(f):
    for _ in range(50):
        f()
    torch.cuda.synchronize()

run50_sync(lambda: torch.randperm(1000000, device='cuda'))

def benchmark(M):
    a = torch.randn(M, device='cuda')
    m = torch.randint(1, (M,), dtype=torch.long, device='cuda').bool()
    v = torch.randn(M, device='cuda')

    torch.cuda.synchronize()

    %timeit run50_sync(lambda:a.masked_scatter_(m, v))

for M in (100, 1000, 100000, 10000000):
    print(M)
    benchmark(M)
```

Before:
```
100
8.65 ms ± 80.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.75 ms ± 72.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
9.27 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
33.6 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

After
```
100
8.04 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1000
8.09 ms ± 38.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
100000
8.63 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
10000000
31.9 ms ± 298 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56750

Reviewed By: ailzhang

Differential Revision: D28547564

Pulled By: ngimel

fbshipit-source-id: 83aeddfaf7023f9f9501c6b1e2faf91e8b6277b1
2021-05-20 10:27:58 -07:00
4942fe0290 [DataLoader] Introduce MapMapDataPipe functional datapipe (#58258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58258

As part of https://github.com/pytorch/pytorch/issues/57031, this PR adds the `MapMapDataPipe` functional datapipe for the `MapDataPipe` class.

Usage:
```
def fn(x):
    return x * 10

dp = CountingDataset(n=10)
dp.map(fn)
```

Reviewed By: ejguan

Differential Revision: D28394510

fbshipit-source-id: 8d71b1f5723dff52385c3ce753944304896af678
2021-05-20 09:00:21 -07:00
faa7d3793d [DDP] Support not all outputs used in loss calculation (#57081)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57081

Changes in this diff:

1. Enable the passthrough autograd function when find_unused_parameters=True.
2. With the above, move prepare_for_backward, which does the unused-parameter checking, to the beginning of the backwards pass, only when find_unused_parameters=True.
3. Enhance the unused-parameter checking to account for outputs not being used in the loss.

(3) is implemented by triggering the autograd hook corresponding to parameters that did not participate in the loss computation. Since they did not participate, the autograd hook is triggered with a gradient of None, and the reducer handles this appropriately to ensure that the gradient is not touched.

Tested by ensuring that when a model output is not used in loss, the corresponding grad is not modified. Also verified that the grads are the same in local vs DDP training case. Also verified that gradients are not touched in this case, i.e. if grad is originally None, it stays as None, not zero, after.

Note that in this diff we are not enabling the pass through autograd function for regular case find_unused_parameters=False because that has a much bigger blast radius and needs additional careful analysis especially with regard to the performance.
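
A toy sketch of the scenario this enables (model and attribute names are illustrative; running the DDP lines requires an initialized process group, so they are left commented):

```python
import torch
import torch.nn as nn

class TwoHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)

    def forward(self, x):
        # Both outputs are returned, but only the first participates in the loss.
        return self.used(x), self.unused(x)

model = TwoHead()
# ddp = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True)
# out_used, out_unused = ddp(torch.randn(2, 4))
# out_used.sum().backward()  # grads of self.unused are left untouched
```
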
ghstack-source-id: 129425139

Test Plan: CI

Reviewed By: zhaojuanmao

Differential Revision: D28048628

fbshipit-source-id: 71d7b6af8626804710017a4edd753787aa9bba61
2021-05-20 08:34:33 -07:00
abb215e229 Fix dtype inference in sparse_csr_tensor_ctor (#58631)
Summary:
A `NULL` return from `PyObject_GetAttrString` should never be ignored without handling the exception, as the behavior of subsequent Python C API calls is undefined until `PyErr_Fetch` or `PyErr_Clear` is called.

This accidentally led to the `list` type being incorrectly identified as `Tensor`.

Fixes https://github.com/pytorch/pytorch/issues/58520
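
A small illustration of the affected constructor path; the dtype noted in the comment is what one would expect after the fix, mirroring `torch.tensor`'s inference for Python lists:

```python
import torch

crow_indices = [0, 2, 4]
col_indices = [0, 1, 0, 1]
values = [1, 2, 3, 4]

# With the fix, the plain Python lists are no longer misidentified as Tensors,
# so dtype inference follows the usual rules for list inputs.
t = torch.sparse_csr_tensor(crow_indices, col_indices, values)
print(t.values().dtype)  # expected: an integer dtype for integer values
```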

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58631

Reviewed By: albanD

Differential Revision: D28559454

Pulled By: malfet

fbshipit-source-id: 46f044b5f0f94264779a6108474d04a8ba851c53
2021-05-20 08:02:05 -07:00
9ac0bd23a2 Fix bug in test_fx_experimental codegen (#58587)
Summary:
This PR fixes a bug in test_fx_experimental where code generated for ops with kwarg-only Tensor parameters would fail to execute because they would be called as positional parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58587

Reviewed By: ailzhang

Differential Revision: D28548365

Pulled By: heitorschueroff

fbshipit-source-id: 8f1746053cbad1b11e817b0099db545d8dd22232
2021-05-20 07:49:08 -07:00
bf00d26deb Enables builds with Compute Library backend for oneDNN (#55913)
Summary:
Since v1.7, oneDNN (MKL-DNN) has supported the use of Compute Library
for the Arm architecture to provide optimised convolution primitives
on AArch64.

This change enables the use of Compute Library in the PyTorch build.
Following the approach used to enable the use of CBLAS in MKLDNN,
it is enabled by setting the env vars USE_MKLDNN and USE_MKLDNN_ACL.
The location of the Compute Library build must be set using `ACL_ROOT_DIR`.

This is an extension of the work in https://github.com/pytorch/pytorch/pull/50400
which added support for the oneDNN/MKL-DNN backend on AArch64.

_Note: this assumes that Compute Library has been built and installed at
ACL_ROOT_DIR. Compute library can be downloaded here:
`https://github.com/ARM-software/ComputeLibrary`_

Fixes #{issue number}

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55913

Reviewed By: ailzhang

Differential Revision: D28559516

Pulled By: malfet

fbshipit-source-id: 29d24996097d0a54efc9ab754fb3f0bded290005
2021-05-20 07:43:56 -07:00
145a6f7985 DOC Adds code comment to clarify nn.Linear.reset_parameters (#58487)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57109

Adds comment to clarify `a=sqrt(5)` in `nn.Linear.reset_parameters`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58487

Reviewed By: ailzhang

Differential Revision: D28548391

Pulled By: jbschlosser

fbshipit-source-id: 2d5910b2576a04f19edbd8b8515cdb55fc249ce5
2021-05-20 06:15:47 -07:00
5caccbe39e [pkg] Catch exceptions where dependency resolution gets invalid imports (#58573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58573

Users can create invalid imports, like:
```
HG: in a top-level package
if False:
  from .. import foo
```

Since this code is never executed, it will not cause the module to fail to
load. But our dependency analysis walks every `import` statement in the AST,
and will attempt to resolve the (incorrectly formed) import, throwing an exception.

For posterity, the code that triggered this: https://git.io/JsCgM

Differential Revision: D28543980

Test Plan: Added a unit test

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: 03b7e274633945b186500fab6f974973ef8c7c7d
2021-05-19 23:04:21 -07:00
703f24397b [pkg] simplifications to broken dependency handling (#58572)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58572

Right now, we have three categories of error (broken, denied, unhandled). This
PR unifies them into a single "error" field in the node, with optional context.
It also generalizes how formatting of the error in PackagingError occurs.

Differential Revision: D28543982

Test Plan: sandcastle

Reviewed By: Chillee

Pulled By: suo

fbshipit-source-id: d99d37699ec2e172e3798763e60aafe9a66ed6f4
2021-05-19 23:03:12 -07:00
c4f0c5ee50 Quote in setup-ci-env (#58637)
Summary:
Do not put quotes around arguments that have no spaces in them in add_to_env_file.

The ENV file is used both by bash and by docker; docker does not strip the quotes when they are present.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58637

Reviewed By: wconstab

Differential Revision: D28561159

Pulled By: malfet

fbshipit-source-id: 0843aad22703b6c3adebeb76175de1cfc1a974b5
2021-05-19 22:20:13 -07:00
8615fd65e3 Fix GIL issue when acquiring multiple sessions. (#58584)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58584

Test Plan: buck test //caffe2/torch/csrc/deploy:test_deploy

Reviewed By: wconstab

Differential Revision: D28545314

fbshipit-source-id: 45cb0e4d80d4766ec1aed6a51679af3424cb0878
2021-05-19 22:05:52 -07:00
24786bd6ef Make torch::deploy work with or without cuda (#58493)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58493

In fbcode, we want torch::deploy to be a target that works with or without cuda, depending only on whether cuda is linked in the final binary.  To enable this, we build both flavors of libinterpreter,  and choose which to load at runtime depending on whether cuda is available in the application.  This comes at a cost to binary size, as it includes two copies of libinterpreter instead of one.  However, it does not require _loading_ two copies of libinterpreter into memory at runtime, so the memory footprint of the interpreter (which we make N copies of) is not impacted.

In oss/cmake, this change is a no-op. cuda is already handled there by building just one libinterpreter, with cuda enabled or not for the whole pytorch build based on a global cmake flag.

Test Plan: test in fbcode with new gpu mode unit tests, verify existing oss CI passes

Reviewed By: suo

Differential Revision: D28512178

fbshipit-source-id: 61354bf78b1932605a841388fcbc4bafc0c4bbb4
2021-05-19 21:44:23 -07:00
fbc235c226 port sgn to structured (#58197)
Summary:
https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58197

Reviewed By: ejguan

Differential Revision: D28416538

Pulled By: ezyang

fbshipit-source-id: bd78172ff4b11bfc69304c426d5817a47bcbb567
2021-05-19 20:10:01 -07:00
b5e39bceec Port fmax & fmin to structured kernel (#58458)
Summary:
Port fmax & fmin to structured kernel
Related https://github.com/pytorch/pytorch/issues/55070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58458

Reviewed By: ailzhang

Differential Revision: D28509263

Pulled By: ezyang

fbshipit-source-id: 3fccb46746e5c0695fe8fa498ce32f8ab4609f04
2021-05-19 20:06:06 -07:00
e179a56839 [FX Splitter] dump final graph and print operator stats via to_glow API
Summary:
- dump final graph in glow
- print operator stats via to_glow API
   - 1) node stats for final glow graph
   - 2) operator stats in TorchGlowBackend for torch::jit::graph to lower

Reviewed By: khabinov

Differential Revision: D28444501

fbshipit-source-id: 743755c320071edc4c045ad004adeb16b4a9c323
2021-05-19 19:16:19 -07:00
9a622f4cd9 refactor ASGD to use functional API (#58410)
Summary:
The functional API is used in large-scale distributed training to enable multithreaded training instead of multiprocess training, as it gives better resource utilization and efficiency.

In this PR, we provide the code migration and refactoring of the functional API for the ASGD algorithm.
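As a rough illustration of the pattern (this is not the actual torch.optim code, and the update shown is a plain SGD-style step, not ASGD), a functional optimizer is a stateless function that takes parameters, gradients, and hyperparameters explicitly, so multiple threads can drive their own copies without sharing an Optimizer object:
```
from typing import List
import torch

def functional_sgd_step(params: List[torch.Tensor],
                        grads: List[torch.Tensor],
                        lr: float = 0.01,
                        weight_decay: float = 0.0) -> None:
    # Hypothetical stateless update: everything the step needs is an argument.
    with torch.no_grad():
        for p, g in zip(params, grads):
            if weight_decay != 0.0:
                g = g.add(p, alpha=weight_decay)
            p.add_(g, alpha=-lr)
```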

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58410

Reviewed By: ailzhang

Differential Revision: D28546702

Pulled By: iramazanli

fbshipit-source-id: 4f62b6037d53f35b19f98340e88af2ebb6243a4f
2021-05-19 18:55:52 -07:00
208b36f109 remove redundant getDispatchKeySetUnboxed(eligibleKeys) (#58535)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58535

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D28531377

Pulled By: bhosmer

fbshipit-source-id: ade1427c8c9ada10ecdc69ef80c5d90be23f5787
2021-05-19 17:08:03 -07:00
47c566ebb1 Rename namespace vec256 to vec, struct Vec256 to Vectorized (and other related classes/structs) (#58438)
Summary:
In order to make it more convenient for maintainers to review the ATen AVX512 implementation, the namespace `vec256` is being renamed to `vec` in this PR. Modifying 77 files & creating 2 new files only took a few minutes, and the changes aren't significant, so fewer files will have to be reviewed while reviewing https://github.com/pytorch/pytorch/issues/56992.
The struct `Vec256` is being renamed to `Vectorized` rather than `Vec`, because there are some `using Vec=` statements in the codebase, so renaming it to `Vectorized` was more convenient. However, I can still rename it to `Vec` if required.

### Changes made in this PR -
Created `aten/src/ATen/cpu/vec` with subdirectory `vec256` (vec512 would be added via https://github.com/pytorch/pytorch/issues/56992).
The changes were made in this manner -

1. First, a script was run to rename `vec256` to `vec` & `Vec` to `Vectorized` -
```
# Ref: https://stackoverflow.com/a/20721292
cd aten/src
grep -rli 'vec256\/vec256\.h' * | xargs -i@ sed -i 's/vec256\/vec256\.h/vec\/vec\.h/g' @
grep -rli 'vec256\/functional\.h' * | xargs -i@ sed -i 's/vec256\/functional\.h/vec\/functional\.h/g' @
grep -rli 'vec256\/intrinsics\.h' * | xargs -i@ sed -i 's/vec256\/intrinsics\.h/vec\/vec256\/intrinsics\.h/g' @
grep -rli 'namespace vec256' * | xargs -i@ sed -i 's/namespace vec256/namespace vec/g' @
grep -rli 'Vec256' * | xargs -i@ sed -i 's/Vec256/Vectorized/g' @
grep -rli 'vec256\:\:' * | xargs -i@ sed -i 's/vec256\:\:/vec\:\:/g' @
grep -rli 'at\:\:vec256' * | xargs -i@ sed -i 's/at\:\:vec256/at\:\:vec/g' @
cd ATen/cpu
mkdir vec
mv vec256 vec
cd vec/vec256
grep -rli 'cpu\/vec256\/' * | xargs -i@ sed -i 's/cpu\/vec256\//cpu\/vec\/vec256\//g' @
grep -rli 'vec\/vec\.h' * | xargs -i@ sed -i 's/vec\/vec\.h/vec\/vec256\.h/g' @
```

2. `vec256` & `VEC256` were replaced with `vec` & `VEC` respectively in 4 CMake files.

3. In `pytorch_vec/aten/src/ATen/test/`, `vec256_test_all_types.h` & `vec256_test_all_types.cpp` were renamed.

4. `pytorch_vec/aten/src/ATen/cpu/vec/vec.h` & `pytorch_vec/aten/src/ATen/cpu/vec/functional.h` were created.
Both currently have one line each & would have 5 when AVX512 support would be added for ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58438

Reviewed By: malfet

Differential Revision: D28509615

Pulled By: ezyang

fbshipit-source-id: 63840df5f23b3b59e203d25816e2977c6a901780
2021-05-19 16:04:36 -07:00
a6b358d53b Revert D28461013: [nnc] Enable CPU fusion inside Facebook, take 2
Test Plan: revert-hammer

Differential Revision:
D28461013 (c76405d3b1)

Original commit changeset: 79a80b6ffb65

fbshipit-source-id: d9cc5c512542153f39664635fb080d797a9de7d0
2021-05-19 15:27:38 -07:00
36adc3f04d [FX] Add APIs to mutate specific args/kwargs (#58571)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58571

Test Plan: Imported from OSS

Reviewed By: jansel

Differential Revision: D28543359

Pulled By: jamesr66a

fbshipit-source-id: 44812d04886e653b5439c880dd831ecbc893fe23
2021-05-19 14:54:16 -07:00
296d2a4399 [THC] Rename THCTensorMathMagma from cu to cpp (#58521)
Summary:
This is supposed to be a no-op (as the .cu file does not contain any cuda code)
that reduces compilation time 2.5x:
```
$ time /usr/local/cuda/bin/nvcc /home/nshulga/git/pytorch/aten/src/THC/THCTensorMathMagma.cu -c ...
real	0m7.701s
$ time /usr/local/cuda/bin/nvcc /home/nshulga/git/pytorch/aten/src/THC/THCTensorMathMagma.cpp -c ...
real	0m2.657s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58521

Reviewed By: ngimel

Differential Revision: D28526946

Pulled By: malfet

fbshipit-source-id: ed42a9db3349654b75dcf63605bb4256154f01ff
2021-05-19 14:26:21 -07:00
ae99640a78 Added publishing of test results and minor fixes to Az DevOps Build Logic (#58436)
Summary:
This PR adds the ability to publish the xml test data of custom PyTorch PR tests. It also adds a few fixes to the custom PyTorch PR test logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58436

Reviewed By: seemethere, mruberry

Differential Revision: D28512958

Pulled By: malfet

fbshipit-source-id: d3a1a251d3d126c923d5f733dccfb31a4b701b7e
2021-05-19 14:17:48 -07:00
b9b8522e00 [profile] fix recorded data type (#58531)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58531

fix data type of alltoall(v) when recording communication metadata via DebugInfo in NCCL PG

Reviewed By: chaekit

Differential Revision: D28529372

fbshipit-source-id: 2917653f73f5fe4f6dc901803235994ca042bba2
2021-05-19 14:14:54 -07:00
8de8b492f7 Revert "Move Azure MultiGPU tests back to nightly (#58242)" (#58451)
Summary:
This reverts commit 2afcb7e8fde0476db2e32feae9a80e36f23c1b19.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58451

Reviewed By: ailzhang

Differential Revision: D28497920

Pulled By: malfet

fbshipit-source-id: 7e9e4f1e3e6e46d8d2a4cba2e6147e0b50d27f6d
2021-05-19 13:55:26 -07:00
3113a1de4a Fix some tensor operators to return NotImplemented for invalid inputs (#58216)
Summary:
Same as https://github.com/pytorch/pytorch/issues/57934. (cc/ albanD)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58216

Reviewed By: ailzhang

Differential Revision: D28494886

Pulled By: albanD

fbshipit-source-id: 380205867ee1cde90e1c6fcfe2a31749e1243530
2021-05-19 13:09:57 -07:00
6c70cbedb6 step 0 of cuDNN v8 convolution API integration (#51390)
Summary:
This PR is step 0 of adding PyTorch convolution bindings using the cuDNN frontend. The cuDNN frontend is the recommended way of using the cuDNN v8 API. It is supposed to have faster release cycles, so that, for example, if people find that a specific kernel has a bug, they can report it, that kernel will be blocked in the cuDNN frontend, and frameworks can just update the submodule without waiting for a whole cuDNN release.

The work is not complete, and this PR is only step 0.

**What this PR does:**
- Add cudnn-frontend as a submodule.
- Modify cmake to build that submodule.
- Add bindings for convolution forward in `Conv_v8.cpp`, which is disabled by a macro by default.
- Tested manually by enabling the macro and run `test_nn.py`. All tests pass except those mentioned below.

**What this PR doesn't:**
- Only convolution forward, no backward. The backward will use v7 API.
- No 64bit-indexing support for some configurations. This is a known issue of cuDNN, and will be fixed in a later cuDNN version. PyTorch will not implement any workaround for this issue; instead, the v8 API should be disabled on problematic cuDNN versions.
- No test beyond PyTorch's unit tests.
  - Not tested for correctness on real models.
  - Not benchmarked for performance.
- Benchmark cache is not thread-safe. (This is marked as `FIXME` in the code, and will be fixed in a follow-up PR)
- cuDNN benchmark is not supported.
- There are failing tests, which will be resolved later:
  ```
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float16 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0.001 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (in...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_cudnn_nhwc_cuda_float32 - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 32 element(s) (out of 32) whose difference(s) exceeded the margin of error (...
  FAILED test/test_nn.py::TestNNDeviceTypeCUDA::test_conv_large_cuda - RuntimeError: CUDNN_BACKEND_OPERATION: cudnnFinalize Failed cudnn_status: 9
  FAILED test/test_nn.py::TestNN::test_Conv2d_depthwise_naive_groups_cuda - AssertionError: False is not true : Tensors failed to compare as equal!With rtol=0 and atol=1e-05, found 64 element(s) (out of 64) whose difference(s) exceeded the margin of error (including 0 an...
  FAILED test/test_nn.py::TestNN::test_Conv2d_deterministic_cudnn - RuntimeError: not supported yet
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_fp32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  FAILED test/test_nn.py::TestNN::test_ConvTranspose2d_groups_cuda_tf32 - RuntimeError: cuDNN error: CUDNN_STATUS_BAD_PARAM
  ```

Although this is not a complete implementation of cuDNN v8 API binding, I still want to merge this first. This would allow me to do small and incremental work, for the ease of development and review.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/51390

Reviewed By: malfet

Differential Revision: D28513167

Pulled By: ngimel

fbshipit-source-id: 9cc20c9dec5bbbcb1f94ac9e0f59b10c34f62740
2021-05-19 12:54:09 -07:00
954d39ba38 [ATen][Quant] Pass at::Tensor by reference (#58284)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58284

- Passing at::Tensor by value can incur a lot of refcount-bump overhead. Passing by reference is much more efficient.
- Use Tensor::expect_contiguous() where possible to remove refcount-bump overhead when the input tensor is already contiguous.

Reviewed By: supriyar, swolchok

Differential Revision: D28432300

fbshipit-source-id: 089ceed08f0d54f109e441f8a1314d726e8481ce
2021-05-19 12:36:50 -07:00
a91375432a model_dump: Accept variable-length debug info (#57660)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57660

Ignore trailing elements so we're compatible with both old and new
models.

Test Plan: Dumped and old model.  Unit test.

Reviewed By: malfet

Differential Revision: D28531391

Pulled By: dreiss

fbshipit-source-id: 197a55ab0e6a7d8e25cbee83852e194afacc988e
2021-05-19 12:25:27 -07:00
ab1fdbefe1 model_dump: Use DumpUnpickler.load instead of .dump (#57659)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57659

Faster since we don't do an automatic pprint, and shorter, simpler code.

Test Plan: Dumped some models.

Reviewed By: malfet

Differential Revision: D28531398

Pulled By: dreiss

fbshipit-source-id: 47f1f646d4576af9f7e680933e0512f616dab5c0
2021-05-19 12:25:25 -07:00
53078924ad model_dump: Add a section that summarizes tensor memory usage (#57658)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57658

Since there is no Python change here and we only do the analysis when
rendering the open section, this should have no impact on page size or
load time!  (Well, a constant impact on page size due to the added
code.)  Before I made it lazy, I observed that it increased load time by
over 100ms for a large model.

Test Plan: Dumped a CUDA model and saw the size summary.

Reviewed By: malfet

Differential Revision: D28531394

Pulled By: dreiss

fbshipit-source-id: f77012b7bab069de861a4ba23486c665e1306aa0
2021-05-19 12:25:23 -07:00
ef4e6036bc model_dump: Handle dict rendering (#57657)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57657

Test Plan: Clicked around a model with some dicts in it.

Reviewed By: malfet

Differential Revision: D28531397

Pulled By: dreiss

fbshipit-source-id: 069690f147e91eadd76fec5f5ca4eec057abcb98
2021-05-19 12:25:21 -07:00
72ff3163bd model_dump: Handle torch.device objects (#57656)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57656

This came up when dumping a CUDA model.

Test Plan: Dumped a CUDA model.

Reviewed By: malfet

Differential Revision: D28531396

Pulled By: dreiss

fbshipit-source-id: fe0e94248c8085a8b760d253ba0b517f153b3442
2021-05-19 12:25:19 -07:00
a380575f5b model_dump: Refactor renderTensor into a helper method (#57655)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57655

Now lots of code is shared between tensor and qtensor rendering.  Net
lines of code is actually +1, but it should result in a savings if/when
we implement some of those todos.

Test Plan: Clicked around in Chrome.

Reviewed By: malfet

Differential Revision: D28531395

Pulled By: dreiss

fbshipit-source-id: 190a04ed587b54d27f3410246763cd636c0634be
2021-05-19 12:25:17 -07:00
3ff76af23c model_dump: Implement "Hider" properly (#57654)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57654

I learned how to use children in React/Preact. :)  Now it's not
necessary to give every hidable section its own id and synchronize the
"shown=false" with "style='display:none;'".

This also means that the hidden elements aren't rendered to the DOM
unless the hider is open.

Test Plan: Clicked around in Chrome.

Reviewed By: malfet

Differential Revision: D28531393

Pulled By: dreiss

fbshipit-source-id: bc86c823ae4b7e80c000f50c5429d89dff6ae64d
2021-05-19 12:23:59 -07:00
3f0b081636 move code to Blas.cpp, clean up THC magma (#58526)
Summary:
To improve compilation times

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58526

Reviewed By: malfet

Differential Revision: D28540035

Pulled By: ngimel

fbshipit-source-id: 01a6b1e2b12aa246c5ecfa810ad4e87bde040553
2021-05-19 12:04:18 -07:00
703cfdc9ed [JIT] improve documentation (#57991)
Summary:
* Fix lots of links.
* Minor improvements for consistency, clarity or grammar.
* Update jit_python_reference to note the limitations on __exit__.
  (Related to https://github.com/pytorch/pytorch/issues/41420).
* Fix a comment in exit_transforms.cpp: removed the word "not" which
  made the comment say the opposite of the truth.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57991

Reviewed By: malfet

Differential Revision: D28522247

Pulled By: SplitInfinity

fbshipit-source-id: fc63a59d19ea6c89f957c9f7d451be17d1c5fc91
2021-05-19 11:47:32 -07:00
79a258f448 s/foward/forward/g (#58497)
Summary:
Annoying typo.

Prompted by these profiling results: https://github.com/pytorch/pytorch/issues/56419#issuecomment-825787828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58497

Reviewed By: malfet

Differential Revision: D28521081

Pulled By: Chillee

fbshipit-source-id: ab91a2e167dd7d3387fd56106a6cff81f7a32f10
2021-05-19 11:42:42 -07:00
ccad77aa22 Added OperatorMap for mapping Operator to any template <T> (#58060)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58060

A generic way to check whether an Operator belongs to a predefined map and, if so, to access the map value via public method(s). In general the value can be anything, for example the Operator's schema.

Test Plan: buck test caffe2/test/cpp/jit:jit -- OperatorMap

Reviewed By: Krovatkin

Differential Revision: D28357933

fbshipit-source-id: ba3248cf06c07f16aebafccb7ae71c1245afb083
2021-05-19 11:38:49 -07:00
1ba05efd26 [Reducer] Remove some unused variables (#58524)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58524

Per title
ghstack-source-id: 129311600

Test Plan: CI

Reviewed By: SciPioneer

Differential Revision: D28528223

fbshipit-source-id: 239a15de4b602e35ed9b15b8a4bea3c28b61de12
2021-05-19 09:55:04 -07:00
4cf9b11022 Fix issues regarding binary_checkout (#58558)
Summary:
Cherry-pick of https://github.com/pytorch/pytorch/issues/58495 back to master

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes https://github.com/pytorch/pytorch/issues/58557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58558

Reviewed By: albanD

Differential Revision: D28538867

Pulled By: malfet

fbshipit-source-id: 3517d8729df7c0c0a221d26f6966c8dcef2f3076
2021-05-19 08:24:34 -07:00
baf05c3f5e Split CUDA SpectralOp (#58459)
Summary:
Move all cuFFT related parts to SpectralOps.cpp
Leave only _fft_fill_with_conjugate_symmetry_cuda_ in SpecralOps.cu

Keep `CUDAHooks.cpp` in torch_cuda_cpp by introducing `at::cuda::detail::THCMagma_init` functor and registering it from global constructor in `THCTensorMathMagma.cu`

Move entire detail folder to torch_cuda_cpp library.

This is a no-op that helps greatly reduce binary size for CUDA-11.x builds by avoiding cufft/cudnn symbol duplication between torch_cuda_cpp(that makes most of cuFFT calls) and torch_cuda_cu (that only needed it to compile SpectralOps.cu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58459

Reviewed By: ngimel

Differential Revision: D28499001

Pulled By: malfet

fbshipit-source-id: 425a981beb383c18a79d4fbd9b49ddb4e5133291
2021-05-19 07:59:03 -07:00
029bec4505 [lint] Fix uninitialized variable lint error in Module.cpp (#58499)
Summary:
This PR fixes two uninitialized variable lint warnings in `Module.cpp` by initializing them to `nullptr`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58499

Reviewed By: driazati, samestep

Differential Revision: D28519192

Pulled By: 1ntEgr8

fbshipit-source-id: 293cd4b296eea70b72adf02cd73f354063b124c6
2021-05-19 07:55:24 -07:00
b45a105acb Automated submodule update: tensorpipe (#58477)
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: a0c6aa1422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58477

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D28506522

fbshipit-source-id: 2da92feae212a568cfe441d33e4966ffe6c182e5
2021-05-19 05:49:29 -07:00
4d7abdbdad [Quant] Add out variant for int8 quantized::linear (#58282)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58282

Reviewed By: ajyu

Differential Revision: D28428734

fbshipit-source-id: f25243cdbc220e59659605a3a29e2b161dd7c1f2
2021-05-19 00:24:23 -07:00
c76405d3b1 [nnc] Enable CPU fusion inside Facebook, take 2 (#58347)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58347

Back out "Revert D27652484 (ac04cc775b): [nnc] Enable CPU fusion inside Facebook"
Original commit changeset: ecfef3ee1e71
ghstack-source-id: 129279584

Test Plan: Tests for bugfix included in this stack

Reviewed By: navahgar

Differential Revision: D28461013

fbshipit-source-id: 79a80b6ffb653ab952ff5efaa143d3362bb7d966
2021-05-18 21:45:48 -07:00
dcfc2050bd VaryingShape<Strides>::isComplete() needs to consider whether each Stride is complete (#58510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58510

In some case that I don't fully understand we're getting a stride that is:
```
{2:1, 1:1, 0:*}
```
(in this debug output, M:N means stride index M, stride value N).  This shape
should be considered incomplete, since we don't actually know the values of the
stride, but VaryingShape::isComplete considers it complete because it only
checks the presence of elements in the vector, not whether those elements are
themselves complete.
ghstack-source-id: 129279583

Test Plan:
new unit test in test/cpp/jit

To see the failure in the context of a real model:
```
./fblearner/predictor/loadgen/download-requests.sh 272478342_0 10 ~/local/requests/272478342_0.recordio

buck-out/gen/fblearner/predictor/loadgen/replay_model_requests --model_id=272478342_0 --replay_record_source=recordio:/data/users/bertrand/requests/272478342_0.recordio --remote_port=9119 --output_file=/data/users/bertrand/responses/272478342_0_actual.recordio --output_type=recordio

buck-out/gen/fblearner/predictor/loadgen/replay_model_requests --model_id=272478342_0 --replay_record_source=recordio:/data/users/bertrand/requests/272478342_0.recordio --remote_port=9119 --output_file=/data/users/bertrand/responses/272478342_0_actual.recordio --output_type=recordio
```

Reviewed By: Krovatkin

Differential Revision: D28520062

fbshipit-source-id: 3ca900337d86480a40fbd90349a698cbb2fa5f11
2021-05-18 21:45:46 -07:00
3d20ddfe92 [nnc] Do not fuse unsqueeze with variable dim (#58346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58346

If `dim` is a variable, NNC doesn't know how to translate the result,
since the shape is unknown.  This issue manifested as a `bad_variant_access`
when we try to pull an int constant out of that arg.

Note that, while the PE will pick up the resultant shape, it won't set guards accordingly.
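A minimal sketch (an assumed example, not taken from the PR) of the shape of code that hits this: when `dim` is a graph input rather than a constant, the output shape of `unsqueeze` is not known at compile time.
```
import torch

@torch.jit.script
def f(x: torch.Tensor, d: int):
    # `d` is a runtime value, so the result shape of unsqueeze is unknown
    # and the op should not be pulled into a fusion group.
    return x.unsqueeze(d).relu()
```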
ghstack-source-id: 129078971

Test Plan: new fuser test

Reviewed By: navahgar

Differential Revision: D28460956

fbshipit-source-id: 57ef918ef309ee57bfdf86717b910b6549750454
2021-05-18 21:44:37 -07:00
2ddd841635 [nnc] Make the pretty printer prettier (#57874)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57874

Before:
```
{
  for (int v = 0; v < 100; v++) {
    aten_sin[v] = sin(x_1[v]);
  }
{
    sum = float(0);
    for (int v_1 = 0; v_1 < 100; v_1++) {
      sum = ReduceOp((sum) + float(aten_sin[v_1]), reduce_args={});
    }
  }  for (int v_2 = 0; v_2 < 100; v_2++) {
    aten_cos[v_2] = cos(x_1[v_2]);
  }
  for (int v_3 = 0; v_3 < 100; v_3++) {
    aten_mul[v_3] = (_tensor_constant0[v_3]) * (aten_cos[v_3]);
  }
}
```

After:
```
{
  for (int v = 0; v < 100; v++) {
    aten_sin[v] = sin(x_1[v]);
  }
  {
    sum = float(0);
    for (int v_1 = 0; v_1 < 100; v_1++) {
      sum = ReduceOp((sum) + float(aten_sin[v_1]), reduce_args={});
    }
  }
  for (int v_2 = 0; v_2 < 100; v_2++) {
    aten_cos[v_2] = cos(x_1[v_2]);
  }
  for (int v_3 = 0; v_3 < 100; v_3++) {
    aten_mul[v_3] = (_tensor_constant0[v_3]) * (aten_cos[v_3]);
  }
}
```

Test Plan: Imported from OSS

Reviewed By: navahgar, malfet

Differential Revision: D28455842

Pulled By: bertmaher

fbshipit-source-id: 6d5ca9be12afd66a9ba32c129a3f4d618247cd35
2021-05-18 18:26:58 -07:00
3a3959d253 [jit] Add a utility class SourceRef to represent Source as keys (#57396)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57396

A new type SourceRef is introduced to represent a unique identifier to source
text. The type holds refcount to underlying source, and supports comparators
and hash functions, such that it can be used in C++ and Python maps. In later
diffs we will use this to aggregate and print profiling information.

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D28133578

fbshipit-source-id: c3d5199a8269c5006c85a145b281bcaaf3e2dc1c
2021-05-18 18:20:53 -07:00
0362b753db [BE] Use __func__ as checkAllSameGPU() 1st arg (#58502)
Summary:
Hardcoded names often get out of date; for example, in AdaptiveAveragePooling those names contained a cudnn_ prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58502

Reviewed By: samestep

Differential Revision: D28518917

Pulled By: malfet

fbshipit-source-id: 9b16adae85a179e335da4facb4e769b9f67824bc
2021-05-18 16:45:54 -07:00
ea0f7c4720 move unused parameters to end of bucket orders when rebuild buckets for static graph (#58097)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58097

Move unused parameters to the end of the bucket order when rebuilding buckets for static graph.

Test Plan: unit tests

Reviewed By: rohan-varma

Differential Revision: D28366689

fbshipit-source-id: fbd224aeb761d5aa3bab35a00d64974eb4455b2e
2021-05-18 16:36:40 -07:00
a7b62abeb0 [PyTorch Edge] bytecode version bump to v5 and enable share constant table (#57888)
Summary:
As title. Main changes:
1. Enable sharing the constant table, reducing model size by up to 50%.
2. Bump bytecode version from v4 to v5.
3. Add the unittest back. (It was partially removed because `script_module_v5.ptl` uses bytecode version v5; when the current runtime is v4 and tries to load a v5 model, it raises an error because the version is not within the supported range.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57888

As title
ghstack-source-id: 129255867

Test Plan:
CI
```
buck test papaya/toolkit/frontend/torch/...
buck test mode/opt papaya/integration/service/test/smartkeyboard:smartkeyboard_system_test
```

Reviewed By: raziel, iseeyuan

Differential Revision: D28309381

fbshipit-source-id: 6f5cf4296eaadde913d55f27d5bfb9d1dea2fbaf
2021-05-18 16:17:13 -07:00
9eee782cb6 [nnc][scripts] Add a script for bisecting the TE fuser pass (#58357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58357

Finding a miscompilation in a large program can be tedious; this
script automates the process of bisecting based on the number of fused
instructions.  Since fusing aten::cat without the corresponding
prim::ListConstruct will cause an assertion failure, we treat that case as a
"skip" and ignore it for the purpose of bisection.
ghstack-source-id: 129079484

Test Plan:
Tried it on some failing testcases, plus I wrote a simple bash
script to simulate "failure" and "skip" and verified a few different cases.

Reviewed By: huiguoo

Differential Revision: D28463808

fbshipit-source-id: 64836f1d37a573549179410316ea7168e3dc1f23
2021-05-18 16:10:20 -07:00
7d78d72d7b removing old comment (#56430)
Summary:
Removing a comment which is no longer relevant after
https://github.com/pytorch/pytorch/pull/56089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56430

Reviewed By: desertfire

Differential Revision: D28515547

Pulled By: Krovatkin

fbshipit-source-id: c4e62741a872fef015248cd7ab1b3213d35109ee
2021-05-18 14:56:22 -07:00
a07cd22efb Comment why render_test_results is its own step (#58505)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58505

Reviewed By: seemethere

Differential Revision: D28520332

Pulled By: samestep

fbshipit-source-id: 6637b58b399caf6019d6fd8bfab21646cbd219b6
2021-05-18 14:40:32 -07:00
8efaab1b83 Add long tensor type to AddFakeFp16 Op (#58504)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58504

.. to support QRT inline_CVR models to avoid failure
```
[DataPreproc] User preprocessing error: c10::Error: [enforce fail at operator.h:1307] . Unsupported type of tensor: long (Error from operator:
input: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/cast_22/cast_22_5" input: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/mul_5/Mul" output: "sparse_nn_2/HistogramBinningCalibrationByFeature_2/add_7/Add_2" name: "" type: "AddFakeFp16" arg { name: "broadcast" i: 1 } device_option { extra_info: "inference_split:force_merge" extra_info: "inference_split:force_merge" })
```
f273407515

Test Plan: f273692411

Reviewed By: hx89

Differential Revision: D28513550

fbshipit-source-id: 86892e1a98b5219cd187731018ce2692b231fb58
2021-05-18 14:25:56 -07:00
4b859cbca1 [NNC] Do not optimize conditionals when the corresponding loop is not normalized (#57675)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57675

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231375

Pulled By: navahgar

fbshipit-source-id: bcbcebca25577744c7190a0aa9fa376f76dea77d
2021-05-18 14:25:53 -07:00
a71b99b50d [NNC] Add a method to check if a loop is normalized (#57674)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57674

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231377

Pulled By: navahgar

fbshipit-source-id: 3d92d532f1e1f78c9d94619980340622b73f99ec
2021-05-18 14:25:50 -07:00
3fe72d30dc [NNC] Optimize conditionals that correspond to the form generated for aten::cat op. (#57673)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57673

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28231374

Pulled By: navahgar

fbshipit-source-id: 1777a63df4e5ebed6d515683bd772a88be465b3a
2021-05-18 14:23:48 -07:00
db42ec4297 [Pytorch Sparsity] Add sparse sources to build target
Summary:
This adds to internal build target and makes it ready for selective build
workflow.

Test Plan: CI builds

Reviewed By: z-a-f

Differential Revision: D28103697

fbshipit-source-id: 19c8b27aae4de1cece8d88d13ea51ca4ac7d79b6
2021-05-18 14:19:14 -07:00
ad97fd8031 Support symbolic diff for leaky_relu (#58337)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58337

supports  symbolic differentiation for leaky_relu

Test Plan:
test/test_jit.py
test/test_ops.py

Reviewed By: Krovatkin

Differential Revision: D28458898

fbshipit-source-id: bdde74d689d2c2ea1f59507456c2efa4e38de1cc
2021-05-18 14:13:40 -07:00
e1551f1678 Clarify .github/scripts/generate_ci_workflows.py (#58498)
Summary:
Followup to https://github.com/pytorch/pytorch/issues/58491:

- use f-string to remove the literal `generated` string from the generator script, so Phabricator no longer thinks it is a generated file
- remove the special logic for `test_runner_type` and instead explicitly specify for every workflow

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58498

Test Plan:
```
make generate-gha-workflows
```
Also, check that Phabricator doesn't classify `.github/scripts/generate_ci_workflows.py` as "Generated changes" in this diff.

Reviewed By: seemethere

Differential Revision: D28516291

Pulled By: samestep

fbshipit-source-id: 8736eaad5d28082490be0a9b2e271c9493c2ba9d
2021-05-18 12:50:00 -07:00
5fcf49f596 [PyTorch] Add a guard rail to TensorIterator::add_borrowed_{in,out}put (#58279)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58279

See comment in source code.
ghstack-source-id: 129002040

Test Plan: CI

Reviewed By: wenleix

Differential Revision: D28428962

fbshipit-source-id: e011819e5579396f3ca2d87978c84965260adb1b
2021-05-18 12:46:33 -07:00
03f2f0f88f [PyTorch] Migrate remaining CUDA TI usage to borrowing where possible (#58278)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58278

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002042

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28428809

fbshipit-source-id: 23ccf508c4413371a88085271f11c7d0cc861a9e
2021-05-18 12:46:32 -07:00
1fd256dc3b [PyTorch] Migrate CUDA indexing TI usage to borrowing (#58277)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58277

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002044

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D28428441

fbshipit-source-id: 243b746aeb5fdf8b95c8e591c066c5eab140deb6
2021-05-18 12:46:30 -07:00
029289bd6c [PyTorch] Migrate TensorAdvancedIndexing TI usage to borrowing where possible (#58276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58276

Borrowing is more efficient, and we can see in all these cases that the TensorIterator doesn't outlive the input & output Tensors.
ghstack-source-id: 129002045

Test Plan: Existing CI

Reviewed By: ngimel

Differential Revision: D28428234

fbshipit-source-id: 9eada7725a070799b55e6683509e359505a2b80a
2021-05-18 12:46:28 -07:00
439ba27dea [PyTorch] Migrate all extant uses of build_binary_float_op to build_borrowing_binary_float_op (#58273)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58273

Borrowing is more efficient, and structured kernels can always borrow.
ghstack-source-id: 129002041

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28427914

fbshipit-source-id: eed27a10603b412af5357d3554477ba407abba73
2021-05-18 12:46:26 -07:00
8a4a511ff5 [PyTorch] Migrate all extant uses of build_binary_op to build_borrowing_binary_op (#58272)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58272

Borrowing is more efficient, and structured kernels can always borrow.
ghstack-source-id: 129002046

Test Plan: Existing CI

Reviewed By: ezyang

Differential Revision: D28427768

fbshipit-source-id: 6314a682556c6914c843aaacf2d75b2adb164e9a
2021-05-18 12:44:50 -07:00
07da584dbd Fix KeyError returned by _maybe_get_last_node_only_observer (#58443)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58443

Test Plan: arc lint

Reviewed By: vkuzo

Differential Revision: D28494119

fbshipit-source-id: 05abf4e12051afc237096812fb0ee08a8b9447f9
2021-05-18 12:41:19 -07:00
46484e8dfe Simplify .github/scripts/generate_ci_workflows.py (#58491)
Summary:
This PR simplifies `.github/scripts/generate_ci_workflows.py` by using the same strategy as https://github.com/pytorch/pytorch/issues/54344, representing workflows as plain data to avoid duplicating the definition of the `generate_workflow_file` function. This will make the script easier to maintain if/when that function is modified and/or more workflow types are added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58491

Test Plan:
The Lint job in CI; specifically:
```
make generate-gha-workflows
mypy --config mypy-strict.ini
```

Reviewed By: malfet, seemethere

Differential Revision: D28511918

Pulled By: samestep

fbshipit-source-id: aaf415a954d938a29aee7c9367c9bc2b9f44bb01
2021-05-18 11:49:51 -07:00
f7c15610aa Collect kernel version (#58485)
Summary:
Collect env should collect the kernel and glibc versions

Fixes https://github.com/pytorch/pytorch/issues/58387

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58485

Reviewed By: walterddr

Differential Revision: D28510564

Pulled By: malfet

fbshipit-source-id: ad3d4b93f51db052720bfaa4322138c55816921b
2021-05-18 10:57:59 -07:00
92e36240f5 fix nonzero perf regression (#58468)
Summary:
https://github.com/pytorch/pytorch/issues/55292 introduced a perf regression for nonzero on CUDA; this fixes it. nvcc is still pretty bad at unrolling loops whose boundaries are not known at compile time, which makes the `write_indices` kernels ~5x slower than they should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58468

Reviewed By: mruberry

Differential Revision: D28511147

Pulled By: ngimel

fbshipit-source-id: fe7303ec77da1abbe5e874093eca247b3919616f
2021-05-18 10:33:10 -07:00
4ce8378ec5 [local lint] Remove success checks in tests (#58490)
Summary:
Testing for both that a lint job ran and that it was successful depends
on having lint pass for the PR, which can create confusion if it doesn't
(i.e. a flake8 failure also causes this job to fail, and it's not
immediately clear why). With this PR we just check for the presence of
job names to see that something ran.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58490

Reviewed By: samestep

Differential Revision: D28511229

Pulled By: driazati

fbshipit-source-id: 3036deff9f9d0ef2e78b44a9a43b342acdcfa296
2021-05-18 09:31:13 -07:00
afe23b8f8b Fix alpine image (#58462)
Summary:
Fixes the dockerhub rate-limiting issue by using the ECR image instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58462

Reviewed By: malfet

Differential Revision: D28510603

Pulled By: zhouzhuojie

fbshipit-source-id: 2cac59da1d1efdf31df71e9f76d802f8e9a0bfd5
2021-05-18 09:22:28 -07:00
821a97595b fx quant: improve performance of all_node_args_have_no_tensors (#58461)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58461

Improves the logic which calculates whether a node has any tensors
in its arguments by terminating the recursion early when possible.
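The idea, sketched very loosely below (this is not the actual all_node_args_have_no_tensors implementation), is to return as soon as a tensor-typed argument is found instead of exhaustively walking every nested argument:
```
import torch

def any_tensor_in_args(args) -> bool:
    # Illustrative early-exit recursion over (possibly nested) arguments.
    for arg in args:
        if isinstance(arg, torch.Tensor):
            return True  # stop immediately; no need to look at the rest
        if isinstance(arg, (list, tuple)) and any_tensor_in_args(arg):
            return True
    return False
```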

In a future PR, we should probably ditch this entire approach and switch to
using dtype propagation.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D28499455

fbshipit-source-id: bedd844022b90e1fcb7d7a3cb4cc65440dc9cc59
2021-05-18 07:19:59 -07:00
e059fd40a8 Remove master documentation from being indexable by search engines (#58056)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58056

This PR addresses an action item in #3428: disabling search engine
indexing of master documentation. This is desirable because we want to
direct users to our stable documentation (instead of master
documentation) because they are more likely to have a stable version of
PyTorch installed.

Test Plan:
1. run `make html`, check that the noindex tags are there
2. run `make html-stable`, check that the noindex tags aren't there

Reviewed By: bdhirsh

Differential Revision: D28490504

Pulled By: zou3519

fbshipit-source-id: 695c944c4962b2bd484dd7a5e298914a37abe787
2021-05-18 06:20:09 -07:00
52b45b7655 Revert D28494073: [Gradient Compression] Do not skip the comm hook tests for Gloo/MPI backends
Test Plan: revert-hammer

Differential Revision:
D28494073 (df44f015fe)

Original commit changeset: 6ba14082f986

fbshipit-source-id: 0e094f09b59c93f5ee13a667aacfb3ccf608547e
2021-05-18 05:39:09 -07:00
34d6618386 [NNC] Fixing a bug in simplifier (#58291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28435393

Pulled By: navahgar

fbshipit-source-id: 517e47385a93a43d2ddf054382adc81c18484066
2021-05-18 01:28:33 -07:00
df44f015fe [Gradient Compression] Do not skip the comm hook tests for Gloo/MPI backends (#58444)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58444

DDP communication hooks are already supported on Gloo and MPI backends. No longer need to skip these tests on Gloo/MPI backends.

TODO: `test_ddp_hook_parity_powerSGD` failes on Gloo backend. Filed a bug #58467.
ghstack-source-id: 129209528

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_comm_hook_logging
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_allreduce_process_group
buck test mode/dev-nosan caffe2/test/distributed:distributed_gloo_fork -- test_ddp_hook_parity_powerSGD

Reviewed By: rohan-varma

Differential Revision: D28494073

fbshipit-source-id: 6ba14082f98696bc4bd8c02395cb58b9c1795015
2021-05-17 23:05:01 -07:00
c38616491f Conservatively move all suitable prim ops from full-jit to mobile, and make them selective. (#58353)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58353

There are long-tail operators in register_prim_ops_fulljit.cpp that may be used in the mobile build. In this PR:
1. All of the ops that are likely to be used in mobile are moved to register_prim_ops.cpp.
2. Note that this move is conservative. If an op is likely to have a fulljit dependency, or cannot be made selective, it is kept. Later, if it needs to be used in mobile (rare), it will be adapted and moved case by case.
3. All the moved ops are marked selective. The registration function is changed from `Operator()` to `OperatorGenerator()`. Size regression is not expected.

Test Plan:
* Internal size tests
* CI

Reviewed By: dhruvbird

Differential Revision: D28463158

Pulled By: iseeyuan

fbshipit-source-id: 34536b8a569f1274329ccf1dac809fe9b891b4ff
2021-05-17 23:01:22 -07:00
b5a834a739 [Pytorch] Build lite interpreter as default for iOS
Summary:
Two changes:
1. Build lite interpreter as default for iOS
2. Switch the previous lite interpreter test to full jit build test

Test Plan: Imported from OSS

Differential Revision: D27698039

Reviewed By: xta0

Pulled By: cccclai

fbshipit-source-id: 022b554f4997ae577681f2b79a9ebe9236ca4f7d
2021-05-17 22:36:05 -07:00
8a3fb2689f Wrap torch::deploy API functions in safe rethrow macros (#58412)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58412

Second try: avoid ctor/dtor handling this time, as it is kind of
pointless if the rethrow will still terminate(), and it upsets -Werror=terminate.
Original commit changeset: 1775bed18269

Test Plan: existing unit tests and CI

Reviewed By: suo

Differential Revision: D28478588

fbshipit-source-id: 84191cecc3ef52e23f11bfea07bbb9773ebc5df4
2021-05-17 22:09:19 -07:00
7b73fdf597 [FX] Fix retracing wrapped functions (#58061)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58061

Test Plan: Imported from OSS

Reviewed By: yuhc

Differential Revision: D28358801

Pulled By: jamesr66a

fbshipit-source-id: c7c9a8a80e5bfe1eb1f6d2cf858ac7e57153a860
2021-05-17 19:50:16 -07:00
5fa4541c65 Make new_ones an operator (#58405)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/58394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58405

Reviewed By: HDCharles

Differential Revision: D28480075

Pulled By: Chillee

fbshipit-source-id: bd29399867e2a002a2f395554621761d3c701f68
2021-05-17 19:24:34 -07:00
0547a3be63 Change link order for BUILD_SPLIT_CUDA option (#58437)
Summary:
torch_cuda_cu depends on torch_cuda_cpp, so it should be linked first;
otherwise the linker keeps lots of cudnn symbols for no good reason.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58437

Reviewed By: janeyx99

Differential Revision: D28496472

Pulled By: malfet

fbshipit-source-id: 338605ff755591476070c172a6ea0a0dcd0beb23
2021-05-17 18:38:04 -07:00
af463d2235 Add shape documentation for CosineEmbeddingLoss (#58403)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/52732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58403

Reviewed By: HDCharles

Differential Revision: D28480076

Pulled By: jbschlosser

fbshipit-source-id: c2c51e9da86e274e80126bbcabebb27270f2d2d0
2021-05-17 18:14:16 -07:00
e24dee00d4 add kernel launch checks after each kernel launch to silence the check (#58432)
Summary:
T90898552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58432

Reviewed By: r-barnes

Differential Revision: D28487446

Pulled By: ngimel

fbshipit-source-id: 3a756ffa3cd68720e132af27cd5ae36f7fd4a2d8
2021-05-17 18:03:19 -07:00
7dd08504f6 [package] fix persistent_load error (#58439)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58439

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D28494250

Pulled By: Lilyjjo

fbshipit-source-id: c068760db9c25dcbf5a88ea9343eab11f0e7736a
2021-05-17 17:38:53 -07:00
314a578154 Clang format distributed_c10d.py (#58435)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58435

Prepare for #53962

ghstack-source-id: 129171617

Test Plan: N/A

Reviewed By: zhaojuanmao

Differential Revision: D28490326

fbshipit-source-id: 2ed3c5850788b9702a8020f6ee6d0b579625bf89
2021-05-17 16:47:35 -07:00
b6d3929b51 [ATen] Use MaybeOwned<T> in at::argmin/argmax (#58338)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58338

Test Plan: CI

Reviewed By: swolchok

Differential Revision: D28458968

fbshipit-source-id: 2c759bdb9fbdbef32d804f6d8efb09fb1d2bb30a
2021-05-17 16:42:52 -07:00
6989eb60e5 Remove timeouts for C2 tests
Summary: When run on very heavily loaded machines, some of these tests are timing out. It's not an issue with the test, it's an issue with the environment. I've removed the timeout so we at least keep unit test coverage.

Test Plan: N/A: Fix breakages

Reviewed By: ngimel

Differential Revision: D28492334

fbshipit-source-id: aed3ee371763161aab2d356f5623c7df053fda6f
2021-05-17 16:39:30 -07:00
4310decfbf .github: Add intial Windows CPU GHA workflow (#58199)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58199

Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28465272

Pulled By: seemethere

fbshipit-source-id: d221ad71d160088883896e018c58800dae85ff2c
2021-05-17 15:04:16 -07:00
c156a4ffaa fx quant: fix crash on output dicts and lists (#58416)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58416

https://github.com/pytorch/pytorch/pull/57519 had a regression not
caught by CI, it added an assertion which failed on various model
output types.

This PR removes the assertion and adds the logic to observe graph
outputs in a way that supports arbitrary output formats.
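Conceptually (a sketch only; the names here are invented, not the quantization code's API), supporting arbitrary output formats means recursing through dicts, lists, and tuples and handling each Tensor leaf:
```
import torch

def map_output_tensors(output, observe):
    # Recursively apply `observe` to every Tensor leaf of a nested output.
    if isinstance(output, torch.Tensor):
        return observe(output)
    if isinstance(output, dict):
        return {k: map_output_tensors(v, observe) for k, v in output.items()}
    if isinstance(output, (list, tuple)):
        return type(output)(map_output_tensors(v, observe) for v in output)
    return output  # non-tensor leaves (ints, strings, ...) pass through
```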

Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_output_lists_and_dicts
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D28479946

fbshipit-source-id: bcce301f98a057b134c0cd34ab0ca96ba457863f
2021-05-17 15:02:09 -07:00
a1cacf3b5d fx quant: remove test debug logs (#58415)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58415

Removes test debugging logs which were committed; probably someone forgot to remove them before landing.

Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```

Imported from OSS

Reviewed By: z-a-f

Differential Revision: D28479947

fbshipit-source-id: 3adba87c51652e3353f455b293abc90debe3dd7d
2021-05-17 15:01:03 -07:00
3d12ab452e [ONNX] Fix split export in opset13 (#56277) (#57605)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57605

Fix split export in opset13

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D28393522

Pulled By: SplitInfinity

fbshipit-source-id: 4de83345ec7bc9bafe778fe534d9a8760ce16ab3

Co-authored-by: Ksenija Stanojevic <ksenija.stanojevic@gmail.com>
Co-authored-by: BowenBao <bowbao@microsoft.com>
2021-05-17 14:50:33 -07:00
0c3db1cb33 [Pytorch] Build lite interpreter as default for Android
Summary:
Build the lite interpreter as the default for Android; this should wait until https://github.com/pytorch/pytorch/pull/56002 lands.
Mainly two changes:
1. Use lite interpreter as default for Android
2. Switch the lite interpreter build test to full jit build test

Test Plan: Imported from OSS

Differential Revision: D27695530

Reviewed By: IvanKobzarev

Pulled By: cccclai

fbshipit-source-id: e1b2c70fee6590accc22c7404b9dd52c7d7c36e2
2021-05-17 14:12:48 -07:00
d645088f2f [torch] Format repeat_interleave op files (#58313)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58313

Same as title.

I am planning to send a follow-up diff to this op, so sending formatting diff ahead to keep PR simple.

Test Plan: Rely on existing signals since this is simple formatting diff.

Reviewed By: ngimel

Differential Revision: D28447685

fbshipit-source-id: c7cd473b61e40e6f50178aca88b9af197a759099
2021-05-17 13:51:53 -07:00
06c1094ea0 Merge CreationMeta MULTI_OUTPUT_SAFE with MULTI_OUTPUT_NODE (#58285)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/57679

##### Release Notes
This is part of the end of the deprecation of inplace/view:
- `detach_` will now raise an error when invoked on any view created by `split`, `split_with_sizes`, or `chunk`. You should use the non-inplace `detach` instead (see the sketch after this list).
- The error message for when an in-place operation (that is not detach) is performed on a view created by `split`, `split_with_sizes`, and `chunk` has been changed from "This view is **an** output of a function..." to "This view is **the** output of a function...".
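A small sketch of the first bullet's new behavior (assuming the semantics described above):
```
import torch

a = torch.rand(4, requires_grad=True)
left, right = a.split(2)   # views produced by split

detached = left.detach()   # OK: out-of-place detach still works
# left.detach_()           # now raises a RuntimeError for views created by split
```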

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58285

Reviewed By: bdhirsh

Differential Revision: D28441980

Pulled By: soulitzer

fbshipit-source-id: e2301d7b8cbc3dcdd328c46f24bcb9eb7f3c0d87
2021-05-17 13:48:39 -07:00
3507ca320b Remove unused python2 shebang (#58409)
Summary:
This is the only line (not in `third_party`) matching the regex `^#!.*python2`, and [it is not the first line of its file](https://github.com/koalaman/shellcheck/wiki/SC1128), so it has no effect. As a followup to https://github.com/pytorch/pytorch/issues/58275, this PR removes that shebang to reduce confusion, so now all Python shebangs in this repo are `python3`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58409

Reviewed By: walterddr

Differential Revision: D28478469

Pulled By: samestep

fbshipit-source-id: c17684c8651e45d3fc383cbbc04a31192d10f52f
2021-05-17 13:19:32 -07:00
98cc0aa6b0 Use torch.allclose to check tensor equality (#58429)
Summary:
This fixes test_lkj_cholesky_log_prob when the default codepath is used, i.e. when the test is executed as follows:
```
 ATEN_CPU_CAPABILITY=default python3 distributions/test_distributions.py -v -k test_lkj_cholesky_log_prob
```
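The motivation for `torch.allclose` over exact comparison, as a generic illustration (not the test's actual tensors): different CPU codepaths may reorder floating-point operations, so results can differ in the last bits while still being numerically equal.
```
import torch

a = torch.full((10,), 0.1).sum()   # roughly 1.0000001 in float32
b = torch.tensor(1.0)

print(torch.equal(a, b))      # may be False: bitwise comparison
print(torch.allclose(a, b))   # True: within default rtol/atol
```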

Fixes https://github.com/pytorch/pytorch/issues/58381

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58429

Reviewed By: neerajprad

Differential Revision: D28484340

Pulled By: malfet

fbshipit-source-id: 32afcc75e5250f5a11d66b4fa194ea1c784454a6
2021-05-17 13:16:35 -07:00
50f9a1812e Enable NNAPI in internal build (#58324)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58324

Test Plan: Build Size Bot.  Segmentation in Spark Player.

Reviewed By: axitkhurana

Differential Revision: D28435176

fbshipit-source-id: f2fb25e3cd331433e7a3156a528811abd3bcbf3a
2021-05-17 12:52:56 -07:00
532632ca26 Don't bind Android NNAPI on Apple platforms (#58323)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58323

Currently there is no way to run NNAPI on Apple platforms.
Disabling the binding with the preprocessor makes it easier
to enable NNAPI in the internal build without affecting iOS size.

This should be reverted soon and migrated to selective build.

Test Plan: Build Size Bot on later diff.

Reviewed By: axitkhurana

Differential Revision: D28435179

fbshipit-source-id: 040eeb74532752630d329b15d5f95c538c2e3f9e
2021-05-17 12:51:46 -07:00
1891e4bf1e [Pytorch] Remove run_on_bundled_input (#58344)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58344

Remove a helper function that's more trouble than it's worth.

ghstack-source-id: 129131889

Test Plan: ci and {P414950111}

Reviewed By: dhruvbird

Differential Revision: D28460607

fbshipit-source-id: 31bd6c1cc169785bb360e3113d258b612cad47fc
2021-05-17 12:44:00 -07:00
443ce1e8a1 Improve error message when Proxy object is iterated (#58302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58302

Test Plan: Imported from OSS

Reviewed By: jamesr66a

Differential Revision: D28444030

Pulled By: ansley

fbshipit-source-id: ee29b0f7b2199f8590de4c5945b0d4ce59230ce2
2021-05-17 12:42:23 -07:00
a4ce85ad68 Chown workspace in calculate-docker-image (#58398)
Summary:
Since https://github.com/pytorch/pytorch/issues/58299 changed the calculate-docker-image job from `ubuntu-18.04` to `linux.2xlarge`, it has been sometimes failing with this message:

```
Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch'
Error: Command failed: rm -rf "/home/ec2-user/actions-runner/_work/pytorch/pytorch/.azure_pipelines"
```

- https://github.com/pytorch/pytorch/runs/2587348894
- https://github.com/pytorch/pytorch/runs/2592943274
- https://github.com/pytorch/pytorch/runs/2600707737

This PR hopes to fix that issue by adding the "Chown workspace" step that we already use for the other jobs in the Linux CI workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58398

Reviewed By: seemethere

Differential Revision: D28476902

Pulled By: samestep

fbshipit-source-id: a7dbf0ad9c18ac44cc1a3cef7647f56489958fe6
2021-05-17 12:40:55 -07:00
e8981e7c5d Improve CONTRIBUTING.md (#58396)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58396

Test Plan: Imported from OSS

Reviewed By: samestep

Differential Revision: D28476510

Pulled By: ansley

fbshipit-source-id: 3f45bee93dfeda06a44570305f9699bcafc45d2e
2021-05-17 12:36:38 -07:00
9afe9fba29 Reland OpInfo support for forward AD (#58304)
Summary:
Try 3 to land this.
Trying ci-all label to ensure we test everything.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58304

Reviewed By: heitorschueroff

Differential Revision: D28474343

Pulled By: albanD

fbshipit-source-id: 8230fa3c0a8d3633f09999e7c2f47dbdc5fe57e9
2021-05-17 12:33:27 -07:00
1a9efbbc92 generate inplace/out kernels for xla (#57510)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57510

This is a re-write of https://github.com/pytorch/pytorch/pull/56835, which is significantly shorter thanks to the data model change in the PR below this one in the stack. See the original description in the linked PR for details.

The functional changes in this PR are the same as in the above linked one, so the description is the same with a few small changes:
- I don't bother generating `at::xla::{op}` entries for CPU fallbacks. After looking around, I see precedent for that. For example, we don't have `at::cpu::{op}` entries for composite ops- if you really want to bypass the dispatcher you need to call `at::compositeimplicitautograd::{op}`. Maybe we should revisit that later if we find an important use case for having full namespace coverage, but that doesn't seem worth half-fixing for external backends in this PR.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474364

Pulled By: bdhirsh

fbshipit-source-id: 4d58b60e5debad6f1ff06420597d8df8505b2876
2021-05-17 12:25:38 -07:00
9354a68e7d [codegen] split out backend-specific information from NativeFunction in the model (#57361)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57361

Data model change in the codegen, which splits backend-specific information out of `NativeFunction`

### Overview
Currently in the codegen, native_functions.yaml has backend-specific information about each operator that is encoded directly into the data model, in the `NativeFunction` object. That's reasonable, since the native_functions.yaml is the source of truth for information about an operator, and the data model encodes that information into types.

Now that external backends can use the codegen though, that information is technically incomplete/inaccurate. In another PR, I tried patching the information on the `NativeFunction` object with the additional external information, by updating the `dispatch` entry to contain the external backend kernel name and dispatch key.

Instead, this PR tries to split out that information. The `NativeFunction` class contains all information about an operator from native_functions.yaml that's backend-independent and is known never to change regardless of what extra information backends provide. We also build up a backend "index", which is basically a mapping from [backend] -> [backend-specific-metadata]. Reading in an external backend yaml just involves updating that index with the new backend.

There were a few places where `NativeFunction` used the dispatch table directly, that I encoded as properties directly on the NativeFunction object (e.g. `is_abstract`). They were mostly around whether or not the operator has a composite kernel, which isn't something that's going to change for any external backends.

This has a few advantages:
- We can more easily re-use the existing logic in `native_function.py` and `register_dispatch_key.py` for both native and external backends, since they both involve a NativeFunction + a particular backend index
- The data in the data model will be the same regardless of how the codegen is run. Running the codegen with a new external backend doesn't change the data inside of NativeFunction or an existing backend index. It just adds a new index for that backend.
- There are several codegen areas that don't care about backend-specific information: mostly the tracing and autograd codegen. We can reason about the codegen there more easily, knowing that backend-specific info is entirely uninvolved.

An alternative to this split would be to augment the NativeFunction objects with external backend information at the time that we create them. So the external codegen could read both native_functions.yaml and the external backend's yaml at the same time, and construct a NativeObject with a full dispatch table (including the XLA entry), and the correct setting of structured (taking into account both yamls). One disadvantage to this approach is that NativeFunction objects now contain different stuff depending on how you ran the codegen, and you have to make sure that any changes to the codegen can properly handle all the different variants.

### Data Model Changes
Removed 3 classes, which are used by the external codegen:
- ExternalBackendFunction
- ExternalBackendFunctionsGroup
- ExternalBackendMetadata

And added two new ones:
- BackendIndex
- BackendMetadata

`BackendIndex` contains any info that's specific to that backend, plus a mapping from operator names to backend specific metadata about the operator. One example of backend-specific info that's not operator-dependent is the fact that XLA prefers to implement functional kernels instead of out kernels (and so when they eventually mark an op as structured, they're going to mark the functional op and not the out op).

`BackendMetadata` contains info specific to an (operator, backend) pair. Right now, that's just (a) the name of the kernel, and (b) whether or not that operator is structured.
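As a rough sketch of the split (field and helper names here are illustrative, not necessarily the exact ones that landed in the codegen):

```py
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative sketch only - field names may not match the real data model.

@dataclass(frozen=True)
class BackendMetadata:
    kernel: str       # name of the kernel this backend registers for the op
    structured: bool  # has this backend opted the op into structured kernels?

@dataclass(frozen=True)
class BackendIndex:
    dispatch_key: str
    external: bool    # e.g. XLA; also carries backend-wide preferences like
                      # "implement functional kernels instead of out kernels"
    index: Dict[str, BackendMetadata]  # operator name -> metadata

    def get_kernel(self, op_name: str) -> Optional[BackendMetadata]:
        return self.index.get(op_name)

# Reading an external backend's yaml just adds another entry here;
# it never mutates NativeFunction or any existing backend's index.
backend_indices: Dict[str, BackendIndex] = {}
```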

### Questions
I wanted to get this PR up earlier so I could get feedback, but there are a few things I want to call out:

**Dealing with `structured`.**
This PR separates out the notion of `structured` into two bits of information:
- Does [operator] have a meta() function? This is backend-agnostic, and is represented by the `structured` property on `NativeFunction`, same as before. This is used, e.g., to decide what signatures to add to `MetaFunctions.h`.
- Does [operator, backend] have an impl() function? This is backend-dependent; even though technically all in-tree backends are forced to write impl() functions for an operator when we port the op to structured in native_functions.yaml, out-of-tree backends can decide to opt in independently. This is represented as a property on `BackendMetadata`. This is used in most other cases, e.g. in `RegisterDispatchKey` when we're deciding whether to gen a structured or unstructured wrapper.

I also baked `is_structured_dispatch_key` directly into each BackendIndex. So for operators marked "structured" in native_functions.yaml, their corresponding CPU/CUDA BackendIndex entries will be marked structured, and all others (except for potentially external backends) will not.

I ended up trying to deal with `structured` in this change since it's technically backend-dependent (XLA can opt kernels into structured separately from in-tree ops), but that may have been too ambitious: it's technically not relevant until we actually add support for structured external kernels. If it's not clear that this is the right path for dealing with structured and we want to push that off, I'm fine with backing out the bits of this PR that make `structured` backend-dependent. I don't see anything *too* controversial related to structured in the change, but I tried to call out any such areas in the comments.

**Localizing the fact that external backends follow Dispatcher convention.**
Another thing that's sort of backend-specific that I didn't totally address in this PR is the fact that in-tree backends follow the Native API while external backends follow the Dispatcher API. I painted over that in `native_functions.py` by adding a helper, `kernel_signature`, that takes in a native function and gives you the "correct" signature for the specified backend: NativeSignature for in-tree backends, and DispatcherSignature for out-of-tree backends. In order to make that fully usable though, we'll need `NativeSignature` and `DispatcherSignature` to have matching interfaces. I didn't bother with that in this PR, which is why `gen_external_aten_fallbacks.py` still has a bunch of direct references to the dispatcher API. I'm thinking of adding it in a later PR, but wanted to see if anyone has other opinions.
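For illustration, a stripped-down version of that helper might look like the sketch below; the two signature classes are stand-ins for the real ones in the codegen, which have richer interfaces:

```py
class NativeSignature:
    """Stand-in for the real native-convention signature class."""
    def __init__(self, func: str) -> None:
        self.func = func

class DispatcherSignature:
    """Stand-in for the real dispatcher-convention signature class."""
    def __init__(self, func: str) -> None:
        self.func = func

def kernel_signature(func: str, backend_is_external: bool):
    # in-tree backends (CPU/CUDA/...) write kernels against the Native API;
    # external backends (e.g. XLA) write them against the Dispatcher API
    return DispatcherSignature(func) if backend_is_external else NativeSignature(func)
```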

Maybe `is_external()` shouldn't even be a property on the BackendMetadata, and anything the codegen does that requires asking for that information should just be better abstracted away.

**Thoughts on the `BackendIndex` / `BackendMetadata` breakdown.**
One thing that's annoying right now is that to query for various pieces of metadata, you call helper functions like `backend_index.structured(f)`, which queries that particular backend and tells you if that specific NativeFunctionGroup is structured for that backend. It has to return an `Optional[bool]` though, since you have to handle the case where that operator doesn't have a kernel for that backend at all. So users of those helpers end up with a bunch of optionals that they need to unpack, even if they know at some point that the result isn't None. I think it would be easier instead to just store the NativeFunction object as a field directly on the BackendMetadata. Curious if there are any other opinions on a better way to model it though.
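Concretely, the kind of helper being described ends up shaped like this (illustrative only, reusing the BackendIndex sketch above), and every caller pays the Optional tax:

```py
from typing import Optional

def structured(backend_index: "BackendIndex", op_name: str) -> Optional[bool]:
    # None means "this backend has no kernel for the op at all", which is
    # exactly what forces callers to unpack an Optional everywhere
    m = backend_index.get_kernel(op_name)
    return None if m is None else m.structured

# Even a caller that knows the kernel exists still has to unwrap:
#   maybe_structured = structured(cuda_index, "add.out")
#   assert maybe_structured is not None
```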

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474362

Pulled By: bdhirsh

fbshipit-source-id: 41a00821acf172467d764cb41e771e096542f661
2021-05-17 12:25:35 -07:00
0db33eda2a remove bridge API from codegen (#55796)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55796

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474361

Pulled By: bdhirsh

fbshipit-source-id: c7f5ce35097f8eaa514f3df8f8559548188b265b
2021-05-17 12:25:32 -07:00
3d9f10f530 [external codegen] better yaml error messaging, added explicit error message tests (#56597)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56597

3 small changes, all centered around error messaging.

1) Improved error messages when `gen_backend_stubs.py` receives invalid yaml

2) Added error message tests. I wasn't sure if there was a canonical way to do this, so I just wrote a test that takes in a list of (yaml input, expected error message) pairs and runs the codegen pipeline on each of them.

3) I also removed the LineLoader from the yaml parsing bit that reads in the external backend yaml file. Two reasons that I took it out:
 - The main reason we use it with native_functions.yaml is to easily pinpoint problems that the codegen can pick up as new ops are added. 99% of these problems have to do with schema, which is irrelevant to the external yaml since it pulls the schema from native_functions.
 - Not all operators have to appear in the external yaml. We could do something like "line: -1", but that's kind of weird.

If you think the line numbers would actually be of more use than I'm thinking of in the external yaml though, let me know!
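Roughly, the test is just a table of (bad yaml, expected error substring) pairs fed through the parsing entry point. A self-contained sketch of the pattern, with a toy stand-in for the real parser (so the error strings below are made up, not the actual messages):

```py
import unittest
import yaml

def parse_backend_yaml(text: str) -> dict:
    """Toy stand-in for the real external-backend yaml parser."""
    data = yaml.safe_load(text)
    assert isinstance(data, dict), "expected the yaml to contain a dict"
    assert "backend" in data, 'You must provide a value for "backend"'
    assert isinstance(data.get("supported", []), list), \
        'expected "supported" to be a list of operator names'
    return data

class TestBackendYamlErrors(unittest.TestCase):
    # (invalid yaml input, substring expected in the error message)
    CASES = [
        ("supported:\n- abs", 'value for "backend"'),
        ("backend: XLA\nsupported: abs", 'to be a list'),
    ]

    def test_error_messages(self) -> None:
        for text, expected in self.CASES:
            with self.assertRaises(AssertionError) as ctx:
                parse_backend_yaml(text)
            self.assertIn(expected, str(ctx.exception))

if __name__ == "__main__":
    unittest.main()
```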

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474363

Pulled By: bdhirsh

fbshipit-source-id: 8b5ec804b388dbbc0350a20c053da657fad0474f
2021-05-17 12:25:29 -07:00
4dc1b8e06b add _to_cpu() operator (#55795)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55795

description coming soon

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D28474365

Pulled By: bdhirsh

fbshipit-source-id: 0704d7ce354308601a0af9ab48851459f34ce7a0
2021-05-17 12:23:35 -07:00
1562 changed files with 66690 additions and 48492 deletions

View File

@ -48,4 +48,14 @@ jobs:
_TS_CLONE_P: $(TS_CLONE_PASSWORD)
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
_AZUREML_CLONE_PASSWORD: $(AZUREML_CLONE_PASSWORD)
_SPPASSWORD: $(SPPASSWORD)
displayName: Run PyTorch Unit Tests
# Tests results are available outside the docker container since
# the current directory is mounted as a volume of the container.
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFiles: '**/test-*.xml'
testRunTitle: 'Publish test results for Python'

View File

@ -47,3 +47,11 @@ jobs:
_TS_P: $(TS_PAT)
_TS_SM_P: $(TS_SM_PAT)
displayName: Run PyTorch Unit Tests
# Tests results are available outside the docker container since
# the current directory is mounted as a volume of the container.
- task: PublishTestResults@2
condition: always()
inputs:
testResultsFiles: '**\test-*.xml'
testRunTitle: 'Publish test results for Python'

View File

@ -8,7 +8,7 @@ steps:
connectionType: 'connectedServiceName'
serviceConnection: circleciconn
method: 'POST'
headers: '{"Content-Type":"application/json", "BranchName":"$(TARGET_BRANCH_TO_CHECK_PR)", "JobName":"$(TARGET_CIRCLECI_PR)", "PlanUrl":"$(System.CollectionUri)", "ProjectId":"$(System.TeamProjectId)", "HubName":"$(System.HostType)", "PlanId":"$(System.PlanId)", "JobId":"$(System.JobId)", "TimelineId":"$(System.TimelineId)", "TaskInstanceId":"$(System.TaskInstanceId)", "AuthToken":"$(System.AccessToken)"}'
headers: '{"Content-Type":"application/json", "BranchName":"$(_TARGET_BRANCH_TO_CHECK)", "JobName":"$(TARGET_CIRCLECI_BUILD_PR)", "PRNumber":"$(_NUMBER_BUILD_PR)", "TargetCommit":"$(_TARGET_COMMIT)", "PlanUrl":"$(System.CollectionUri)", "ProjectId":"$(System.TeamProjectId)", "HubName":"$(System.HostType)", "PlanId":"$(System.PlanId)", "JobId":"$(System.JobId)", "TimelineId":"$(System.TimelineId)", "TaskInstanceId":"$(System.TaskInstanceId)", "AuthToken":"$(System.AccessToken)"}'
body: ''
urlSuffix: 'api/JobStatus'
waitForCompletion: true

View File

@ -1,6 +1,6 @@
# Initiate 5 agentless-server waiting jobs to check on the
# status of PR artifact builds, for a maximum wait time of
# 5 * 60 min =300 minutes. These jobs will pass immediately
# 11*60 min=660 mins. These jobs will pass immediately
# once targeted CircleCI build is ready.
jobs:
@ -8,7 +8,6 @@ jobs:
pool: server
timeoutInMinutes: 60
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -17,7 +16,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob1
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -26,7 +24,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob2
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -35,7 +32,6 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob3
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
@ -44,6 +40,53 @@ jobs:
timeoutInMinutes: 60
dependsOn: checkjob4
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob6
pool: server
timeoutInMinutes: 60
dependsOn: checkjob5
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob7
pool: server
timeoutInMinutes: 60
dependsOn: checkjob6
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob8
pool: server
timeoutInMinutes: 60
dependsOn: checkjob7
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob9
pool: server
timeoutInMinutes: 60
dependsOn: checkjob8
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob10
pool: server
timeoutInMinutes: 60
dependsOn: checkjob9
continueOnError: true
steps:
- template: wheel-wait-job-template.yml
- job: checkjob11
pool: server
timeoutInMinutes: 60
dependsOn: checkjob10
continueOnError: true
steps:
- template: wheel-wait-job-template.yml

View File

@ -7,14 +7,28 @@
# 2) runs custom PyTorch unit-tests on PyTorch
# wheels generated during PR builds.
resources:
webhooks:
- webhook: GitHubPyTorchPRTrigger
connection: GitHubPyTorchPRTriggerConnection
filters:
- path: repositoryName
value: pytorch_tests
stages:
- stage: 'EnsureArtifactsReady'
displayName: 'Ensure PyTorch PR Artifacts are ready'
jobs:
- template: job_templates/wheel-wait-template.yml
variables:
_TARGET_BRANCH_TO_CHECK: ${{parameters.GitHubPyTorchPRTrigger.TARGET_BRANCH_TO_CHECK_AZ_DEVOPS_PR}}
_NUMBER_BUILD_PR: ${{parameters.GitHubPyTorchPRTrigger.PR_NUMBER}}
_TARGET_COMMIT: ${{parameters.GitHubPyTorchPRTrigger.TARGET_COMMIT}}
- stage: 'PRCustomTests'
displayName: 'Run custom unit tests on PyTorch wheels'
dependsOn: EnsureArtifactsReady
condition: succeeded()
jobs:
- template: job_templates/pytorch-template-unix.yml
parameters:
@ -24,7 +38,9 @@ stages:
PR_Custom_Tests:
_PYTHON_VERSION: $(PYTHON_VERSION_PR)
_CUDA_BUILD_VERSION: $(CUDA_BUILD_VERSION_PR)
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_PR)
_TARGET_BRANCH_TO_CHECK: $(TARGET_BRANCH_TO_CHECK_PR)
_TARGET_CIRCLECI_BUILD: $(TARGET_CIRCLECI_BUILD_PR)
_TARGET_BRANCH_TO_CHECK: ${{parameters.GitHubPyTorchPRTrigger.TARGET_BRANCH_TO_CHECK_AZ_DEVOPS_PR}}
_NUMBER_BUILD_PR: ${{parameters.GitHubPyTorchPRTrigger.PR_NUMBER}}
_TARGET_COMMIT: ${{parameters.GitHubPyTorchPRTrigger.TARGET_COMMIT}}
_DOCKER_IMAGE: $(DOCKER_IMAGE_PR)
_RUN_TESTS: $(RUN_TESTS_PR)

View File

@ -126,6 +126,10 @@ class PackageFormatConfigNode(ConfigNode):
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
# XXX Disabling conda for 11.3 as there's currently no appropriate cudatoolkit available
if package_format == "conda":
self.props["gpu_versions"] = filter(lambda x: x != "cuda113", self.find_prop("gpu_versions"))
def get_children(self):
if self.find_prop("os_name") == "linux":
return [LinuxGccConfigNode(self, v) for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]]

View File

@ -3,6 +3,7 @@ PHASES = ["build", "test"]
CUDA_VERSIONS = [
"102",
"111",
"113",
]
ROCM_VERSIONS = [

View File

@ -35,6 +35,11 @@ CONFIG_TREE_DATA = [
("10.2", [
("3.6", [
("shard_test", [X(True)]),
("slow_gradcheck", [
(True, [
('shard_test', [XImportant(True)]),
]),
]),
("libtorch", [
(True, [
('build_only', [X(True)]),
@ -176,10 +181,18 @@ class ExperimentalFeatureConfigNode(TreeConfigNode):
"cuda_gcc_override": CudaGccOverrideConfigNode,
"coverage": CoverageConfigNode,
"pure_torch": PureTorchConfigNode,
"slow_gradcheck": SlowGradcheckConfigNode,
}
return next_nodes[experimental_feature]
class SlowGradcheckConfigNode(TreeConfigNode):
def init2(self, node_name):
self.props["is_slow_gradcheck"] = True
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)

View File

@ -258,7 +258,7 @@ def gen_tree():
return configs_list
def instantiate_configs():
def instantiate_configs(only_slow_gradcheck):
config_list = []
@ -277,8 +277,12 @@ def instantiate_configs():
is_onnx = fc.find_prop("is_onnx") or False
is_pure_torch = fc.find_prop("is_pure_torch") or False
is_vulkan = fc.find_prop("is_vulkan") or False
is_slow_gradcheck = fc.find_prop("is_slow_gradcheck") or False
parms_list_ignored_for_docker_image = []
if only_slow_gradcheck ^ is_slow_gradcheck:
continue
python_version = None
if compiler_name == "cuda" or compiler_name == "android":
python_version = fc.find_prop("pyver")
@ -342,6 +346,10 @@ def instantiate_configs():
if build_only or is_pure_torch:
restrict_phases = ["build"]
if is_slow_gradcheck:
parms_list_ignored_for_docker_image.append("old")
parms_list_ignored_for_docker_image.append("gradcheck")
gpu_resource = None
if cuda_version and cuda_version != "10":
gpu_resource = "medium"
@ -381,7 +389,7 @@ def instantiate_configs():
tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
if cuda_version == "10.2" and python_version == "3.6" and not is_libtorch:
if cuda_version == "10.2" and python_version == "3.6" and not is_libtorch and not is_slow_gradcheck:
c.dependent_tests = gen_dependent_configs(c)
if (
@ -408,9 +416,9 @@ def instantiate_configs():
return config_list
def get_workflow_jobs():
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs()
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:

View File

@ -40,11 +40,13 @@ class WindowsJob:
target_arch = self.cuda_version.render_dots() if self.cuda_version else "cpu"
python_version = "3.8"
base_name_parts = [
"pytorch",
"windows",
self.vscode_spec.render(),
"py36",
"py" + python_version.replace(".", ""),
target_arch,
]
@ -65,7 +67,7 @@ class WindowsJob:
["pytorch", "win"]
+ self.vscode_spec.get_elements()
+ arch_env_elements
+ ["py3"]
+ ["py" + python_version.split(".")[0]]
)
is_running_on_cuda = bool(self.cuda_version) and not self.force_on_cpu
@ -75,7 +77,7 @@ class WindowsJob:
else:
props_dict = {
"build_environment": build_environment_string,
"python_version": miniutils.quote("3.6"),
"python_version": miniutils.quote(python_version),
"vc_version": miniutils.quote(self.vscode_spec.dotted_version()),
"vc_year": miniutils.quote(str(self.vscode_spec.year)),
"vc_product": self.vscode_spec.get_product(),
@ -145,18 +147,11 @@ _VC2019 = VcSpec(2019)
WORKFLOW_DATA = [
# VS2019 CUDA-10.1
WindowsJob(None, _VC2019, CudaVersion(10, 1), master_only=True),
WindowsJob(1, _VC2019, CudaVersion(10, 1), master_only=True),
WindowsJob(2, _VC2019, CudaVersion(10, 1), master_only=True),
# VS2019 CUDA-11.1
WindowsJob(None, _VC2019, CudaVersion(11, 1)),
WindowsJob(1, _VC2019, CudaVersion(11, 1), master_only=True),
WindowsJob(2, _VC2019, CudaVersion(11, 1), master_only=True),
WindowsJob('_azure_multi_gpu', _VC2019, CudaVersion(11, 1), multi_gpu=True, nightly_only=True),
# VS2019 CPU-only
WindowsJob(None, _VC2019, None),
WindowsJob(1, _VC2019, None),
WindowsJob(2, _VC2019, None),
# VS2019 CUDA-10.1 force on cpu
WindowsJob(1, _VC2019, CudaVersion(10, 1), force_on_cpu=True, master_only=True),
# TODO: This test is disabled due to https://github.com/pytorch/pytorch/issues/59724
# WindowsJob('_azure_multi_gpu', _VC2019, CudaVersion(11, 1), multi_gpu=True, master_and_nightly=True),
]

File diff suppressed because it is too large

View File

@ -2,9 +2,15 @@
set -ex
git clone https://github.com/malfet/breakpad.git -b pytorch/release-1.9
git clone https://github.com/driazati/breakpad.git
pushd breakpad
# breakpad has no actual releases, so this is pinned to the top commit from
# main when this was forked (including the one patch commit). This uses a fork
# of the breakpad mainline that automatically daisy-chains out to any previously
# installed signal handlers (instead of overwriting them).
git checkout 5485e473ed46d065e05489e50dfc59d90dfd7e22
git clone https://chromium.googlesource.com/linux-syscall-support src/third_party/lss
pushd src/third_party/lss
# same as with breakpad, there are no real releases for this repo so use a

View File

@ -61,10 +61,6 @@ RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
ENV INSTALLED_VISION ${VISION}
ADD ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
@ -97,5 +93,9 @@ ENV TORCH_NVCC_FLAGS "-Xfatbin -compress-all"
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
ADD ./common/install_openssl.sh install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
RUN bash ./install_openssl.sh
USER jenkins
CMD ["bash"]

View File

@ -113,10 +113,6 @@ ADD ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
# Install ccache/sccache (do this last, so we get priority in PATH)
ADD ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
@ -134,5 +130,9 @@ ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh
ENV OPENSSL_ROOT_DIR /opt/openssl
USER jenkins
CMD ["bash"]

View File

@ -145,14 +145,17 @@ def gen_build_workflows_tree():
binary_build_definitions.get_post_upload_jobs,
binary_build_definitions.get_binary_smoke_test_jobs,
]
build_jobs = [f() for f in build_workflows_functions]
master_build_jobs = filter_master_only_jobs(build_jobs)
binary_build_functions = [
binary_build_definitions.get_binary_build_jobs,
binary_build_definitions.get_nightly_tests,
binary_build_definitions.get_nightly_uploads,
]
build_jobs = [f() for f in build_workflows_functions]
master_build_jobs = filter_master_only_jobs(build_jobs)
slow_gradcheck_jobs = pytorch_build_definitions.get_workflow_jobs(only_slow_gradcheck=True)
return {
"workflows": {
"binary_builds": {
@ -167,6 +170,10 @@ def gen_build_workflows_tree():
"when": r"<< pipeline.parameters.run_master_build >>",
"jobs": master_build_jobs,
},
"slow_gradcheck_build": {
"when": r"<< pipeline.parameters.run_slow_gradcheck_build >>",
"jobs": slow_gradcheck_jobs,
},
}
}

View File

@ -62,7 +62,6 @@ popd
# Clone the Builder master repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
git checkout release/1.9
pushd "$BUILDER_ROOT"
echo "Using builder from "
git --no-pager log --max-count 1

View File

@ -24,7 +24,7 @@ do
done
lipo -i ${ZIP_DIR}/install/lib/*.a
# copy the umbrella header and license
cp ${PROJ_ROOT}/ios/LibTorch.h ${ZIP_DIR}/src/
cp ${PROJ_ROOT}/ios/LibTorch-Lite.h ${ZIP_DIR}/src/
cp ${PROJ_ROOT}/LICENSE ${ZIP_DIR}/
# zip the library
ZIPFILE=libtorch_ios_nightly_build.zip

View File

@ -4,10 +4,14 @@ echo "RUNNING ON $(uname -a) WITH $(nproc) CPUS AND $(free -m)"
set -eux -o pipefail
source /env
# Defaults here so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( $(nproc) - 2 ))}
# Because most Circle executors only have 20 CPUs, using more causes OOMs w/ Ninja and nvcc parallelization
MEMORY_LIMIT_MAX_JOBS=18
NUM_CPUS=$(( $(nproc) - 2 ))
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
# Defaults here for **binary** linux builds so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( ${NUM_CPUS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${NUM_CPUS} ))}
if [[ "${DESIRED_CUDA}" == "cu111" || "${DESIRED_CUDA}" == "cu113" ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
@ -22,5 +26,9 @@ else
build_script='manywheel/build.sh'
fi
if [[ "$CIRCLE_BRANCH" == "master" ]] || [[ "$CIRCLE_BRANCH" == release/* ]]; then
export BUILD_DEBUG_INFO=1
fi
# Build the package
SKIP_ALL_TESTS=1 "/builder/$build_script"

View File

@ -9,6 +9,10 @@ python_nodot="\$(echo $DESIRED_PYTHON | tr -d m.u)"
# Set up Python
if [[ "$PACKAGE_TYPE" == conda ]]; then
# There was a bug that was introduced in conda-package-handling >= 1.6.1 that makes archives
# above a certain size fail out when attempting to extract
# see: https://github.com/conda/conda-package-handling/issues/71
conda install -y conda-package-handling=1.6.0
retry conda create -qyn testenv python="$DESIRED_PYTHON"
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then

View File

@ -85,7 +85,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="1.9.0.dev$DATE"
BASE_BUILD_VERSION="1.10.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
@ -148,7 +148,7 @@ if [[ "${BUILD_FOR_SYSTEM:-}" == "windows" ]]; then
fi
export DATE="$DATE"
export NIGHTLIES_DATE_PREAMBLE=1.9.0.dev
export NIGHTLIES_DATE_PREAMBLE=1.10.0.dev
export PYTORCH_BUILD_VERSION="$PYTORCH_BUILD_VERSION"
export PYTORCH_BUILD_NUMBER="$PYTORCH_BUILD_NUMBER"
export OVERRIDE_PACKAGE_VERSION="$PYTORCH_BUILD_VERSION"

View File

@ -15,7 +15,7 @@ else
export VC_YEAR=2019
fi
if [[ "${DESIRED_CUDA}" == "cu111" ]]; then
if [[ "${DESIRED_CUDA}" == "cu111" || "${DESIRED_CUDA}" == "cu113" ]]; then
export BUILD_SPLIT_CUDA="ON"
fi

View File

@ -142,8 +142,8 @@ if __name__ == "__main__":
report_android_sizes(file_dir)
else:
size = get_size(file_dir)
if size != 0:
try:
send_message([build_message(size)])
except Exception:
logging.exception("can't send message")
# Sending the message anyway if no size info is collected.
try:
send_message([build_message(size)])
except Exception:
logging.exception("can't send message")

View File

@ -14,6 +14,10 @@ $VS_INSTALL_ARGS = @("--nocache","--quiet","--wait", "--add Microsoft.VisualStud
"--add Microsoft.VisualStudio.Component.VC.Tools.x86.x64",
"--add Microsoft.VisualStudio.ComponentGroup.NativeDesktop.Win81")
if (${env:INSTALL_WINDOWS_SDK} -eq "1") {
$VS_INSTALL_ARGS += "--add Microsoft.VisualStudio.Component.Windows10SDK.19041"
}
curl.exe --retry 3 -kL $VS_DOWNLOAD_LINK --output vs_installer.exe
if ($LASTEXITCODE -ne 0) {
echo "Download of the VS 2019 Version 16.8.5 installer failed"

View File

@ -25,7 +25,7 @@ else
exit 1
fi
if [[ "$cuda_major_version" == "11" && "${JOB_EXECUTOR}" == "windows-with-nvidia-gpu" ]]; then
if [[ "$cuda_major_version" == "11" && "${JOB_EXECUTOR:-}" == "windows-with-nvidia-gpu" ]]; then
cuda_install_packages="${cuda_install_packages} Display.Driver"
fi

View File

@ -20,9 +20,13 @@ else
fi
cudnn_installer_link="https://ossci-windows.s3.amazonaws.com/${cudnn_installer_name}.zip"
cudnn_install_folder="C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/"
curl --retry 3 -O $cudnn_installer_link
7z x ${cudnn_installer_name}.zip -ocudnn
cp -r cudnn/cuda/* "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v${CUDA_VERSION}/"
curl --retry 3 -O "$cudnn_installer_link"
7z x "${cudnn_installer_name}.zip" -ocudnn
# shellcheck recommends to use '${var:?}/*' to avoid potentially expanding to '/*'
# Remove all of the directories before attempting to copy files
rm -rf "${cudnn_install_folder:?}/*"
cp -rf cudnn/cuda/* "${cudnn_install_folder}"
rm -rf cudnn
rm -f ${cudnn_installer_name}.zip
rm -f "${cudnn_installer_name}.zip"

View File

@ -84,7 +84,7 @@ pytorch_windows_params: &pytorch_windows_params
default: "10.1"
python_version:
type: string
default: "3.6"
default: "3.8"
vc_version:
type: string
default: "14.16"

View File

@ -17,6 +17,9 @@ parameters:
run_master_build:
type: boolean
default: false
run_slow_gradcheck_build:
type: boolean
default: false
executors:
windows-with-nvidia-gpu:

View File

@ -22,7 +22,7 @@
command: |
ls -lah /final_pkgs
- run:
name: save binary size
name: upload build & binary data
no_output_timeout: "5m"
command: |
source /env

View File

@ -74,6 +74,14 @@ jobs:
docker commit "$id" ${COMMIT_DOCKER_IMAGE}
time docker push ${COMMIT_DOCKER_IMAGE}
fi
- run:
name: upload build & binary data
no_output_timeout: "5m"
command: |
cd /pytorch && export COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
python3 -mpip install requests && \
SCRIBE_GRAPHQL_ACCESS_TOKEN=${SCRIBE_GRAPHQL_ACCESS_TOKEN} \
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- store_artifacts:
path: /home/circleci/project/dist
@ -245,7 +253,7 @@ jobs:
default: "10.1"
python_version:
type: string
default: "3.6"
default: "3.8"
vc_version:
type: string
default: "14.16"
@ -312,7 +320,7 @@ jobs:
default: "10.1"
python_version:
type: string
default: "3.6"
default: "3.8"
vc_version:
type: string
default: "14.16"

View File

@ -35,7 +35,7 @@
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
name: periodic_pytorch_windows_cuda11.3_build
python_version: "3.6"
python_version: "3.8"
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
@ -45,7 +45,7 @@
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: periodic_pytorch_windows_cuda11.3_test1
python_version: "3.6"
python_version: "3.8"
requires:
- periodic_pytorch_windows_cuda11.3_build
test_name: pytorch-windows-test1
@ -58,7 +58,7 @@
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: periodic_pytorch_windows_cuda11.3_test2
python_version: "3.6"
python_version: "3.8"
requires:
- periodic_pytorch_windows_cuda11.3_build
test_name: pytorch-windows-test2
@ -116,8 +116,8 @@
- pytorch_windows_build:
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
name: pytorch_windows_vs2019_py36_cuda11.3_build
python_version: "3.6"
name: pytorch_windows_vs2019_py38_cuda11.3_build
python_version: "3.8"
use_cuda: "1"
vc_product: BuildTools
vc_version: "14.28.29333"
@ -131,10 +131,10 @@
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: pytorch_windows_vs2019_py36_cuda11.3_test1
python_version: "3.6"
name: pytorch_windows_vs2019_py38_cuda11.3_test1
python_version: "3.8"
requires:
- pytorch_windows_vs2019_py36_cuda11.3_build
- pytorch_windows_vs2019_py38_cuda11.3_build
test_name: pytorch-windows-test1
use_cuda: "1"
vc_product: BuildTools
@ -149,10 +149,10 @@
build_environment: pytorch-win-vs2019-cuda11-cudnn8-py3
cuda_version: "11.3"
executor: windows-with-nvidia-gpu
name: pytorch_windows_vs2019_py36_cuda11.3_test2
python_version: "3.6"
name: pytorch_windows_vs2019_py38_cuda11.3_test2
python_version: "3.8"
requires:
- pytorch_windows_vs2019_py36_cuda11.3_build
- pytorch_windows_vs2019_py38_cuda11.3_build
test_name: pytorch-windows-test2
use_cuda: "1"
vc_product: BuildTools
@ -186,10 +186,18 @@
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-build"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
- pytorch_linux_test:
name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_old_gradcheck_tests
name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_old_gradcheck_test1
requires:
- periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-old-gradcheck-tests"
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-old-gradcheck-test1"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium
- pytorch_linux_test:
name: periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_old_gradcheck_test2
requires:
- periodic_pytorch_xenial_cuda10_2_cudnn7_gcc7_build
build_environment: "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-old-gradcheck-test2"
docker_image: "308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
use_cuda_docker_runtime: "1"
resource_class: gpu.medium

View File

@ -13,3 +13,5 @@ labels_to_circle_params:
- run_build
ci/master:
parameter: run_master_build
ci/slow-gradcheck:
parameter: run_slow_gradcheck_build

View File

@ -28,7 +28,12 @@ runner_types:
max_available: 50
disk_size: 150
windows.4xlarge:
instance_type: c5.4xlarge
instance_type: c5d.4xlarge
os: windows
max_available: 200
disk_size: 256
windows.8xlarge.nvidia.gpu:
instance_type: p3.2xlarge
os: windows
max_available: 25
disk_size: 256

.github/scripts/ensure_actions_will_cancel.py vendored Executable file (54 lines)
View File

@ -0,0 +1,54 @@
#!/usr/bin/env python3
import argparse
import sys
import yaml
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parent.parent.parent
WORKFLOWS = REPO_ROOT / ".github" / "workflows"
def concurrency_key(filename: Path) -> str:
workflow_name = filename.with_suffix("").name.replace("_", "-")
return f"{workflow_name}-${{{{ github.event.pull_request.number || github.sha }}}}"
def should_check(filename: Path) -> bool:
with open(filename, "r") as f:
content = f.read()
data = yaml.safe_load(content)
on = data.get("on", data.get(True, {}))
return "pull_request" in on
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Ensure all relevant GitHub actions jobs will be cancelled based on a concurrency key"
)
args = parser.parse_args()
files = list(WORKFLOWS.glob("*.yml"))
errors_found = False
files = [f for f in files if should_check(f)]
for filename in files:
with open(filename, "r") as f:
data = yaml.safe_load(f)
expected = {
"group": concurrency_key(filename),
"cancel-in-progress": True,
}
if data.get("concurrency", None) != expected:
print(
f"'concurrency' incorrect or not found in '{filename.relative_to(REPO_ROOT)}'",
file=sys.stderr,
)
errors_found = True
if errors_found:
sys.exit(1)

View File

@ -1,6 +1,7 @@
#!/usr/bin/env python3
from pathlib import Path
from typing import Any, Dict
import jinja2
@ -8,141 +9,202 @@ DOCKER_REGISTRY = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
GITHUB_DIR = Path(__file__).parent.parent
CPU_TEST_RUNNER = "linux.2xlarge"
CUDA_TEST_RUNNER = "linux.8xlarge.nvidia.gpu"
# it would be nice to statically specify that build_environment must be
# present, but currently Python has no easy way to do that
# https://github.com/python/mypy/issues/4617
PyTorchWorkflow = Dict[str, Any]
WINDOWS_CPU_TEST_RUNNER = "windows.4xlarge"
WINDOWS_CUDA_TEST_RUNNER = "windows.8xlarge.nvidia.gpu"
class PyTorchLinuxWorkflow:
def __init__(
self,
build_environment: str,
docker_image_base: str,
on_pull_request: bool = False,
enable_doc_jobs: bool = False,
):
self.build_environment = build_environment
self.docker_image_base = docker_image_base
self.test_runner_type = CPU_TEST_RUNNER
self.on_pull_request = on_pull_request
self.enable_doc_jobs = enable_doc_jobs
if "cuda" in build_environment:
self.test_runner_type = CUDA_TEST_RUNNER
def generate_workflow_file(
self, workflow_template: jinja2.Template, jinja_env: jinja2.Environment
) -> Path:
output_file_path = GITHUB_DIR.joinpath(
f"workflows/{self.build_environment}.yml"
)
with open(output_file_path, "w") as output_file:
output_file.writelines(["# @generated DO NOT EDIT MANUALLY\n"])
output_file.write(
workflow_template.render(
build_environment=self.build_environment,
docker_image_base=self.docker_image_base,
test_runner_type=self.test_runner_type,
enable_doc_jobs=self.enable_doc_jobs,
on_pull_request=self.on_pull_request,
)
)
output_file.write('\n')
return output_file_path
def PyTorchWindowsWorkflow(
*,
build_environment: str,
test_runner_type: str,
cuda_version: str,
on_pull_request: bool = False
) -> PyTorchWorkflow:
return {
"build_environment": build_environment,
"test_runner_type": test_runner_type,
"cuda_version": cuda_version,
"on_pull_request": on_pull_request,
}
WORKFLOWS = [
LINUX_CPU_TEST_RUNNER = "linux.2xlarge"
LINUX_CUDA_TEST_RUNNER = "linux.8xlarge.nvidia.gpu"
def PyTorchLinuxWorkflow(
*,
build_environment: str,
docker_image_base: str,
test_runner_type: str,
on_pull_request: bool = False,
enable_doc_jobs: bool = False,
) -> PyTorchWorkflow:
return {
"build_environment": build_environment,
"docker_image_base": docker_image_base,
"test_runner_type": test_runner_type,
"on_pull_request": on_pull_request,
"enable_doc_jobs": enable_doc_jobs,
}
def generate_workflow_file(
*,
workflow: PyTorchWorkflow,
workflow_template: jinja2.Template,
) -> Path:
output_file_path = GITHUB_DIR / f"workflows/{workflow['build_environment']}.yml"
with open(output_file_path, "w") as output_file:
GENERATED = "generated"
output_file.writelines([f"# @{GENERATED} DO NOT EDIT MANUALLY\n"])
output_file.write(workflow_template.render(**workflow))
output_file.write("\n")
return output_file_path
WINDOWS_WORKFLOWS = [
PyTorchWindowsWorkflow(
build_environment="pytorch-win-vs2019-cpu-py3",
cuda_version="cpu",
test_runner_type=WINDOWS_CPU_TEST_RUNNER,
on_pull_request=True
),
PyTorchWindowsWorkflow(
build_environment="pytorch-win-vs2019-cuda10-cudnn7-py3",
cuda_version="10.1",
test_runner_type=WINDOWS_CUDA_TEST_RUNNER,
),
PyTorchWindowsWorkflow(
build_environment="pytorch-win-vs2019-cuda11-cudnn8-py3",
cuda_version="11.1",
test_runner_type=WINDOWS_CUDA_TEST_RUNNER,
)
]
LINUX_WORKFLOWS = [
PyTorchLinuxWorkflow(
build_environment="pytorch-linux-xenial-py3.6-gcc5.4",
docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
test_runner_type=LINUX_CPU_TEST_RUNNER,
on_pull_request=True,
enable_doc_jobs=True,
),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-paralleltbb-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-parallelnative-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-pure_torch-linux-xenial-py3.6-gcc5.4",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc5.4",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3.6-gcc7",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-asan",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-asan",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang7-onnx",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang7-onnx",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
PyTorchLinuxWorkflow(
build_environment="pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7",
docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7",
test_runner_type=LINUX_CUDA_TEST_RUNNER,
),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-cuda11.1-cudnn8-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
# test_runner_type=LINUX_CUDA_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-libtorch-linux-xenial-cuda11.1-cudnn8-py3.6-gcc7",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7",
# test_runner_type=LINUX_CUDA_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-py3.6-clang9-noarch",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-xla-linux-bionic-py3.6-clang9",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-vulkan-linux-bionic-py3.6-clang9",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.6-clang9",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-py3.8-gcc9-coverage",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-py3.8-gcc9",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-bionic-rocm3.9-py3.6",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-bionic-rocm3.9-py3.6",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-x86_32",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-x86_64",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-arm-v7a",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-android-ndk-r19c-arm-v8a",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-asan",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-custom-dynamic",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-custom-static",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
# PyTorchLinuxWorkflow(
# build_environment="pytorch-linux-xenial-py3.6-clang5-mobile-code-analysis",
# docker_image_base=f"{DOCKER_REGISTRY}/pytorch/pytorch-linux-xenial-py3-clang5-android-ndk-r19c",
# test_runner_type=LINUX_CPU_TEST_RUNNER,
# ),
]
@ -151,11 +213,10 @@ if __name__ == "__main__":
variable_start_string="!{{",
loader=jinja2.FileSystemLoader(str(GITHUB_DIR.joinpath("templates"))),
)
workflow_template = jinja_env.get_template("linux_ci_workflow.yml.in")
for workflow in WORKFLOWS:
print(
workflow.generate_workflow_file(
workflow_template=workflow_template,
jinja_env=jinja_env
)
)
template_and_workflows = [
(jinja_env.get_template("linux_ci_workflow.yml.j2"), LINUX_WORKFLOWS),
(jinja_env.get_template("windows_ci_workflow.yml.j2"), WINDOWS_WORKFLOWS)
]
for template, workflows in template_and_workflows:
for workflow in workflows:
print(generate_workflow_file(workflow=workflow, workflow_template=template))

View File

@ -16,12 +16,12 @@ LEGACY_BASE_VERSION_SUFFIX_PATTERN = re.compile("a0$")
class NoGitTagException(Exception):
pass
def get_pytorch_root():
def get_pytorch_root() -> Path:
return Path(subprocess.check_output(
['git', 'rev-parse', '--show-toplevel']
).decode('ascii').strip())
def get_tag():
def get_tag() -> str:
root = get_pytorch_root()
# We're on a tag
am_on_tag = (
@ -46,7 +46,7 @@ def get_tag():
tag = re.sub(TRAILING_RC_PATTERN, "", tag)
return tag
def get_base_version():
def get_base_version() -> str:
root = get_pytorch_root()
dirty_version = open(root / 'version.txt', 'r').read().strip()
# Strips trailing a0 from version.txt, not too sure why it's there in the
@ -54,29 +54,34 @@ def get_base_version():
return re.sub(LEGACY_BASE_VERSION_SUFFIX_PATTERN, "", dirty_version)
class PytorchVersion:
def __init__(self, gpu_arch_type, gpu_arch_version, no_build_suffix):
def __init__(
self,
gpu_arch_type: str,
gpu_arch_version: str,
no_build_suffix: bool,
) -> None:
self.gpu_arch_type = gpu_arch_type
self.gpu_arch_version = gpu_arch_version
self.no_build_suffix = no_build_suffix
def get_post_build_suffix(self):
def get_post_build_suffix(self) -> str:
if self.gpu_arch_type == "cuda":
return f"+cu{self.gpu_arch_version.replace('.', '')}"
return f"+{self.gpu_arch_type}{self.gpu_arch_version}"
def get_release_version(self):
def get_release_version(self) -> str:
if not get_tag():
raise NoGitTagException(
"Not on a git tag, are you sure you want a release version?"
)
return f"{get_tag()}{self.get_post_build_suffix()}"
def get_nightly_version(self):
def get_nightly_version(self) -> str:
date_str = datetime.today().strftime('%Y%m%d')
build_suffix = self.get_post_build_suffix()
return f"{get_base_version()}.dev{date_str}{build_suffix}"
def main():
def main() -> None:
parser = argparse.ArgumentParser(
description="Generate pytorch version for binary builds"
)

View File

@ -14,19 +14,19 @@ is simply to make sure that there is *some* configuration of ruamel that can rou
the YAML, not to be prescriptive about it.
'''
import ruamel.yaml
import ruamel.yaml # type: ignore[import]
import difflib
import sys
from pathlib import Path
from io import StringIO
def fn(base):
def fn(base: str) -> str:
return str(base / Path("aten/src/ATen/native/native_functions.yaml"))
with open(Path(__file__).parent.parent.parent / fn('.'), "r") as f:
contents = f.read()
yaml = ruamel.yaml.YAML()
yaml = ruamel.yaml.YAML() # type: ignore[attr-defined]
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']

View File

@ -1,37 +0,0 @@
#!/usr/bin/env python3
'''
This file verifies that the workflows that are potentially canceled in our cancel_redundant_workflow.yml
match the workflows we have running on pull requests (found in .github/workflows). This way, anytime a
workflow is added or removed, people can be reminded to modify the cancel_redundant_workflow.yml accordingly.
'''
import ruamel.yaml
from pathlib import Path
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.boolean_representation = ['False', 'True']
yaml.default_flow_style = False
if __name__ == '__main__':
workflow_paths = (Path(__file__).parent.parent / 'workflows').rglob('*')
workflows = []
for path in workflow_paths:
if path.suffix in {'.yml', '.yaml'}:
with open(path) as f:
data = yaml.load(f)
assert 'name' in data, 'Every GHA workflow must have a name.'
if 'pull_request' in data['on']:
workflows.append(data['name'])
with open('.github/workflows/cancel_redundant_workflows.yml', 'r') as f:
data = yaml.load(f)
# Replace workflows to cancel
data['on']['workflow_run']['workflows'] = sorted(workflows)
with open('.github/workflows/cancel_redundant_workflows.yml', 'w') as f:
yaml.dump(data, f)

View File

@ -1,5 +1,5 @@
#!/usr/bin/env bash
CHANGES=$(git status --porcelain)
CHANGES=$(git status --porcelain "$1")
echo "$CHANGES"
git diff
git diff "$1"
[ -z "$CHANGES" ]

View File

@ -31,7 +31,7 @@ direction: decrease
timeout: 720
tests:"""
def gen_abtest_config(control: str, treatment: str, models: List[str]):
def gen_abtest_config(control: str, treatment: str, models: List[str]) -> str:
d = {}
d["control"] = control
d["treatment"] = treatment
@ -43,7 +43,7 @@ def gen_abtest_config(control: str, treatment: str, models: List[str]):
config = config + "\n"
return config
def deploy_torchbench_config(output_dir: str, config: str):
def deploy_torchbench_config(output_dir: str, config: str) -> None:
# Create test dir if needed
pathlib.Path(output_dir).mkdir(exist_ok=True)
# TorchBench config file name
@ -71,7 +71,7 @@ def extract_models_from_pr(torchbench_path: str, prbody_file: str) -> List[str]:
return []
return model_list
def run_torchbench(pytorch_path: str, torchbench_path: str, output_dir: str):
def run_torchbench(pytorch_path: str, torchbench_path: str, output_dir: str) -> None:
# Copy system environment so that we will not override
env = dict(os.environ)
command = ["python", "bisection.py", "--work-dir", output_dir,

View File

@ -1,5 +1,5 @@
# Template is at: .github/templates/linux_ci_workflow.yml
# Generation script: .github/scripts/generate_linux_ci_workflows.py
# Template is at: .github/templates/linux_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Linux CI (!{{ build_environment }})
on:
@ -23,6 +23,10 @@ env:
CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
concurrency:
group: !{{ build_environment }}-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
calculate-docker-image:
runs-on: linux.2xlarge
@ -32,6 +36,15 @@ jobs:
outputs:
docker_image: ${{ steps.calculate-tag.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -49,7 +62,6 @@ jobs:
DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker_tag }}
BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
eval "$(aws ecr get-login --no-include-email --region us-east-1)"
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then
@ -84,6 +96,7 @@ jobs:
run: |
export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/}
cd .circleci/docker && ./build_docker.sh
build:
runs-on: linux.2xlarge
needs: calculate-docker-image
@ -128,6 +141,22 @@ jobs:
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh'
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
@ -135,11 +164,22 @@ jobs:
- name: Archive artifacts into zip
run: |
zip -r artifacts.zip dist/ build/
# Upload to github so that people can click and download artifacts
- uses: actions/upload-artifact@v2
name: Store PyTorch Build Artifacts
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
name: Store PyTorch Build Artifacts on Github
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
- uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
name: Store PyTorch Build Artifacts on S3
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
@ -148,6 +188,7 @@ jobs:
run: |
# Prune all of the docker images
docker system prune -af
test:
runs-on: !{{ test_runner_type }}
needs:
@ -167,6 +208,8 @@ jobs:
docker run --rm -v "$(pwd)/../":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
@ -187,7 +230,7 @@ jobs:
;;
esac
echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}"
- uses: actions/download-artifact@v2
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
@ -209,6 +252,7 @@ jobs:
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e GITHUB_ACTIONS \
-e IN_CI \
-e MAX_JOBS="$(nproc --ignore=2)" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -231,7 +275,7 @@ jobs:
if: always()
with:
name: test-reports
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
@ -242,6 +286,12 @@ jobs:
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
# Prune all of the docker images
docker system prune -af
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
@ -289,6 +339,7 @@ jobs:
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
{%- if enable_doc_jobs %}
pytorch_python_doc_build:
runs-on: linux.2xlarge
needs:
@ -305,7 +356,7 @@ jobs:
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -317,7 +368,7 @@ jobs:
- name: Preserve github env variables for use in docker
run: |
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
- uses: actions/download-artifact@v2
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
@ -350,7 +401,7 @@ jobs:
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Archive artifacts into zip
run: |
zip -r pytorch_github_io.zip "${GITHUB_WORKSPACE}/pytorch.github.io"

View File

@ -0,0 +1,196 @@
# Template is at: .github/templates/windows_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Windows CI (!{{ build_environment }})
on:
{%- if on_pull_request %}
pull_request:
{%- endif %}
push:
branches:
- master
- release/*
workflow_dispatch:
env:
BUILD_ENVIRONMENT: !{{ build_environment }}
BUILD_WHEEL: 1
CUDA_VERSION: "!{{ cuda_version }}"
IN_CI: 1
INSTALL_WINDOWS_SDK: 1
JOB_BASE_NAME: test
PYTHON_VERSION: "3.8"
SCCACHE_BUCKET: "ossci-compiler-cache"
VC_PRODUCT: "BuildTools"
VC_VERSION: ""
VC_YEAR: "2019"
{%- if cuda_version != "cpu" %}
TORCH_CUDA_ARCH_LIST: "7.0"
USE_CUDA: 1
{%- endif %}
concurrency:
group: !{{ build_environment }}-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
build:
runs-on: "windows.4xlarge"
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
{%- if cuda_version != "cpu" %}
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
{%- endif %}
- name: Build
shell: bash
run: |
.jenkins/pytorch/win-build.sh
# Upload to github so that people can click and download artifacts
- name: Upload artifacts to Github
if: always()
uses: actions/upload-artifact@v2
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
- name: Upload artifacts to s3
if: always()
uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
test:
runs-on: !{{ test_runner_type }}
env:
JOB_BASE_NAME: !{{ build_environment }}-test
needs:
- build
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
{%- if cuda_version != "cpu" %}
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
{%- endif %}
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\${{ github.run_id }}\build-results
- name: Check build-results folder
shell: powershell
run: |
tree /F C:\$Env:GITHUB_RUN_ID\build-results
# Needed for coverage in win-test.sh
- uses: actions/setup-python@v2
name: Setup Python3
with:
python-version: '3.x'
- name: Run test scripts
shell: bash
env:
PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/
run: |
.jenkins/pytorch/win-test.sh
- uses: actions/upload-artifact@v2
name: Store PyTorch Test Reports
if: always()
with:
name: test-reports
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
- test
runs-on: ubuntu-18.04
# TODO: Make this into a composite step
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow tools/print_test_stats.py to use Git commands
fetch-depth: 0
- uses: actions/download-artifact@v2
name: Download PyTorch Test Reports
with:
name: test-reports
path: test/test-reports
- uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
# boto3 version copied from .circleci/docker/common/install_conda.sh
run: |
pip install -r requirements.txt
pip install boto3==1.16.34 junitparser rich
- name: Output Test Results (Click Me)
run: |
python tools/render_junit.py test
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Display and upload test statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_JOB: !{{ build_environment }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
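The template above uses `!{{ ... }}` for values filled in at generation time (so they don't collide with GitHub's runtime `${{ ... }}` expressions) and ordinary `{%- if %}` blocks for per-workflow conditionals. The real logic lives in `.github/scripts/generate_ci_workflows.py`, which is not part of this diff; the following is only a minimal sketch of how such a generator might render this template with custom Jinja2 delimiters — the parameter values and output path are illustrative assumptions.

```python
# Hypothetical sketch: render a workflow template that uses !{{ ... }} as its
# variable delimiters so GitHub's ${{ ... }} expressions pass through untouched.
from pathlib import Path

import jinja2  # the Lint job pins Jinja2==3.0.1


def render_windows_workflow(build_environment: str, cuda_version: str,
                            test_runner_type: str) -> str:
    env = jinja2.Environment(
        loader=jinja2.FileSystemLoader(".github/templates"),
        variable_start_string="!{{",        # generation-time substitution
        variable_end_string="}}",
        undefined=jinja2.StrictUndefined,   # fail loudly on missing values
    )
    template = env.get_template("windows_ci_workflow.yml.j2")
    return template.render(
        build_environment=build_environment,
        cuda_version=cuda_version,
        test_runner_type=test_runner_type,
        # Assumption: only the CPU config is triggered on pull_request.
        on_pull_request=(cuda_version == "cpu"),
    )


if __name__ == "__main__":
    rendered = render_windows_workflow(
        "pytorch-win-vs2019-cpu-py3", "cpu", "windows.4xlarge")
    Path(".github/workflows/pytorch-win-vs2019-cpu-py3.yml").write_text(rendered)
```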

View File

@ -29,8 +29,7 @@ jobs:
needs: generate-build-matrix
runs-on: linux.2xlarge
strategy:
matrix:
${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
matrix: ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
fail-fast: false
container:
image: ${{ matrix.container_image }}
@ -41,7 +40,7 @@ jobs:
DESIRED_CUDA: ${{ matrix.gpu_arch_version }}
GPU_ARCH_VERSION: ${{ matrix.GPU_ARCH_VERSION }}
GPU_ARCH_TYPE: ${{ matrix.gpu_arch_type }}
NO_BUILD_SUFFIX: True
NO_BUILD_SUFFIX: true
# TODO: This is a legacy variable, we should just default all build to use
# this folder within the conda/build_pytorch.sh script
TORCH_CONDA_BUILD_FOLDER: pytorch-nightly
@ -92,4 +91,23 @@ jobs:
with:
name: pytorch-conda-py${{ matrix.python_version }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
path: /remote/**/*.bz2
# TODO: Add a step here for uploading binaries
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
concurrency:
group: build-linux-conda-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
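The `matrix: ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}` change above consumes a JSON build matrix produced by the upstream `generate-build-matrix` job. The actual generator is not shown in this diff; the sketch below only illustrates how such a job could emit the matrix as a step output using the `::set-output` workflow command that GitHub Actions supported at the time. The field names match how the matrix is consumed later in the file (`python_version`, `gpu_arch_type`, `gpu_arch_version`, `container_image`), but the concrete values are assumptions.

```python
# Hypothetical sketch: emit a JSON build matrix for fromJson() to consume.
import json


def build_matrix():
    include = []
    for python_version in ("3.6", "3.7", "3.8", "3.9"):
        for gpu_arch_type, gpu_arch_version in (("cpu", ""), ("cuda", "11.1")):
            include.append({
                "python_version": python_version,
                "gpu_arch_type": gpu_arch_type,
                "gpu_arch_version": gpu_arch_version,
                # Illustrative image name only.
                "container_image": f"pytorch/manylinux-{gpu_arch_type}",
            })
    return {"include": include}


if __name__ == "__main__":
    # The step output becomes a job output and is then read via
    # ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}.
    print(f"::set-output name=matrix::{json.dumps(build_matrix())}")
```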

View File

@ -29,8 +29,7 @@ jobs:
needs: generate-build-matrix
runs-on: linux.2xlarge
strategy:
matrix:
${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
matrix: ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
fail-fast: false
container:
image: ${{ matrix.container_image }}
@ -91,4 +90,23 @@ jobs:
with:
name: pytorch-libtorch-${{ matrix.libtorch_variant }}-${{ matrix.devtoolset }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
path: /remote/**/*.zip
# TODO: Add a step here for uploading binaries
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
concurrency:
group: build-linux-libtorch-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -29,8 +29,7 @@ jobs:
needs: generate-build-matrix
runs-on: linux.2xlarge
strategy:
matrix:
${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
matrix: ${{ fromJson(needs.generate-build-matrix.outputs.matrix) }}
fail-fast: false
container:
image: ${{ matrix.container_image }}
@ -90,4 +89,23 @@ jobs:
with:
name: pytorch-wheel-py${{ matrix.python_version }}-${{matrix.gpu_arch_type}}-${{ matrix.gpu_arch_version }}
path: /remote/**/*.whl
# TODO: Add a step here for uploading binaries
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
concurrency:
group: build-linux-wheels-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -1,24 +0,0 @@
name: Cancel redundant workflows
on:
workflow_run:
types:
- requested
# NOTE: Make sure to add to this list as you add more workflows running on 'pull_request'
workflows:
- Lint
- Linux CI (pytorch-linux-xenial-py3.6-gcc5.4)
- Test tools
- TorchBench CI (pytorch-linux-py3.7-cu102)
- clang-format
jobs:
cancel:
# We do not want to cancel reruns on master
if: github.event.workflow_run.head_branch != 'master'
runs-on: ubuntu-18.04
steps:
- name: Cancel duplicate workflow runs
uses: potiuk/cancel-workflow-runs@a81b3c4d59c61e27484cfacdc13897dd908419c9
with:
cancelMode: duplicates
token: ${{ secrets.GITHUB_TOKEN }}
sourceRunId: ${{ github.event.workflow_run.id }}

View File

@ -42,3 +42,7 @@ jobs:
fi
echo "$GIT_DIFF"
exit 1
concurrency:
group: clang-format-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -3,7 +3,7 @@ name: Lint
on:
push:
branches:
- master
- master
pull_request:
jobs:
@ -23,39 +23,11 @@ jobs:
- name: Ensure consistent CircleCI YAML config
if: always() && steps.requirements.outcome == 'success'
run: cd .circleci && ./ensure-consistency.py
- name: Ensure consistent GHA workflows in cancel_redundant_workflows.yml
if: always() && steps.requirements.outcome == 'success'
run: |
pip install ruamel.yaml==0.17.4
echo "Please locally run .github/scripts/regenerate_cancel_redundant_workflow.py and commit if this step fails."
.github/scripts/regenerate_cancel_redundant_workflow.py
git diff --exit-code .github/workflows/cancel_redundant_workflows.yml
- name: Lint native_functions.yaml
if: always() && steps.requirements.outcome == 'success'
run: |
pip install ruamel.yaml==0.17.4
.github/scripts/lint_native_functions.py
- name: Extract scripts from GitHub Actions workflows
if: always() && steps.requirements.outcome == 'success'
run: |
# For local lints, remove the .extracted_scripts folder if it was already there
rm -rf .extracted_scripts
tools/extract_scripts.py --out=.extracted_scripts
- name: Install ShellCheck
id: install_shellcheck
if: always()
# https://github.com/koalaman/shellcheck/tree/v0.7.2#installing-a-pre-compiled-binary
run: |
set -x
scversion="v0.7.2"
wget -qO- "https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz" | tar -xJv
sudo cp "shellcheck-${scversion}/shellcheck" /usr/bin/
rm -r "shellcheck-${scversion}"
shellcheck --version
- name: Run ShellCheck
if: always() && steps.install_shellcheck.outcome == 'success'
run: |
tools/run_shellcheck.sh .jenkins/pytorch .extracted_scripts
- name: Ensure correct trailing newlines
if: always() && steps.requirements.outcome == 'success'
run: |
@ -109,7 +81,7 @@ jobs:
if: always() && steps.requirements.outcome == 'success'
run: |
set -eux
python torch/testing/check_kernel_launches.py |& tee "${GITHUB_WORKSPACE}"/cuda_kernel_launch_checks.txt
python torch/testing/_check_kernel_launches.py |& tee "${GITHUB_WORKSPACE}"/cuda_kernel_launch_checks.txt
- name: Ensure no direct cub include
if: always()
run: |
@ -127,9 +99,12 @@ jobs:
uses: actions/checkout@v2
- name: Attempt to run setup.py
run: |
python2 setup.py | grep "Python 2 has reached end-of-life and is no longer supported by PyTorch."
if ! python2 setup.py | grep -q "Python 2 has reached end-of-life and is no longer supported by PyTorch."; then
echo 'Running setup.py with Python 2 did not give the expected error message.'
false
fi
templates:
shellcheck:
runs-on: ubuntu-18.04
steps:
- name: Setup Python
@ -137,14 +112,68 @@ jobs:
with:
python-version: 3.x
architecture: x64
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Install requirements
id: requirements
run: |
pip install -r requirements.txt
- name: Install Jinja2
run: pip install Jinja2
run: |
pip install Jinja2==3.0.1
- name: Checkout PyTorch
uses: actions/checkout@v2
- name: Regenerate workflows
run: .github/scripts/generate_linux_ci_workflows.py
id: generate_workflows
run: .github/scripts/generate_ci_workflows.py
- name: Assert that regenerating the workflows didn't change them
run: .github/scripts/report_git_status.sh
run: |
if ! .github/scripts/report_git_status.sh .github/workflows; then
echo
echo 'As shown by the above diff, the committed .github/workflows'
echo 'are not up to date according to .github/templates.'
echo 'Please run this command, commit, and push again to your PR:'
echo
echo ' .github/scripts/generate_ci_workflows.py'
echo
echo 'If running that command does nothing, you may need to rebase'
echo 'onto a more recent commit from the PyTorch master branch.'
false
fi
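`report_git_status.sh` now takes a path argument (`.github/workflows` here, `.` in the ToC check further down) and fails when that path has uncommitted changes after regeneration. The script itself is not part of this diff; conceptually the check amounts to something like the following rough Python equivalent, not the actual shell implementation.

```python
# Rough equivalent of a "fail if the given path is dirty after regeneration" check.
import subprocess
import sys


def report_git_status(path: str) -> int:
    status = subprocess.run(
        ["git", "status", "--porcelain", path],
        capture_output=True, text=True, check=True,
    )
    if not status.stdout.strip():
        return 0  # nothing changed: the committed files are up to date
    # Show the diff so the CI log explains exactly what is out of date.
    subprocess.run(["git", "diff", "--", path], check=False)
    return 1


if __name__ == "__main__":
    sys.exit(report_git_status(sys.argv[1] if len(sys.argv) > 1 else "."))
```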
- name: Install ShellCheck
id: install_shellcheck
if: always()
# https://github.com/koalaman/shellcheck/tree/v0.7.2#installing-a-pre-compiled-binary
run: |
set -x
scversion="v0.7.2"
wget -qO- "https://github.com/koalaman/shellcheck/releases/download/${scversion?}/shellcheck-${scversion?}.linux.x86_64.tar.xz" | tar -xJv
sudo cp "shellcheck-${scversion}/shellcheck" /usr/bin/
rm -r "shellcheck-${scversion}"
shellcheck --version
- name: Extract scripts from GitHub Actions workflows
if: always() && steps.install_shellcheck.outcome == 'success'
run: |
# For local lints, remove the .extracted_scripts folder if it was already there
rm -rf .extracted_scripts
tools/extract_scripts.py --out=.extracted_scripts
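`tools/extract_scripts.py` pulls the embedded `run:` scripts out of the workflow YAML into `.extracted_scripts/` so ShellCheck can lint them as ordinary files (the error message in the next step explains how extracted paths map back to the workflows). The sketch below assumes a simple `<workflow>/<job>/<step>.sh` layout rather than the tool's real naming scheme.

```python
# Hypothetical sketch: extract `run:` blocks from workflow YAML for ShellCheck.
from pathlib import Path

import yaml  # PyYAML, already in requirements.txt


def extract_scripts(workflows_dir: str = ".github/workflows",
                    out_dir: str = ".extracted_scripts") -> None:
    out = Path(out_dir)
    for workflow in sorted(Path(workflows_dir).glob("*.yml")):
        doc = yaml.safe_load(workflow.read_text())
        for job_name, job in (doc.get("jobs") or {}).items():
            for i, step in enumerate(job.get("steps", [])):
                script = step.get("run")
                if not script or step.get("shell") == "powershell":
                    continue  # only bash-style scripts are ShellCheck targets
                name = step.get("name", f"step-{i}").replace(" ", "_")
                dest = out / workflow.stem / job_name / f"{name}.sh"
                dest.parent.mkdir(parents=True, exist_ok=True)
                dest.write_text("#!/usr/bin/env bash\n" + script)


if __name__ == "__main__":
    extract_scripts()
```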
- name: Run ShellCheck
if: always() && steps.install_shellcheck.outcome == 'success'
run: |
if ! tools/run_shellcheck.sh .extracted_scripts .jenkins/pytorch; then
echo
echo 'ShellCheck gave a nonzero exit code. Please fix the warnings'
echo 'listed above. Note that if a path in one of the above warning'
echo 'messages starts with .extracted_scripts/ then that means it'
echo 'is referring to a shell script embedded within another file,'
echo 'whose path is given by the path components immediately'
echo 'following the .extracted_scripts/ prefix.'
false
fi
- name: Check that jobs will be cancelled
if: always() && steps.generate_workflows.outcome == 'success'
run: |
.github/scripts/ensure_actions_will_cancel.py
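This new step replaces the deleted `cancel_redundant_workflows.yml` approach shown above: instead of a separate canceller workflow, every workflow is expected to declare its own `concurrency` group keyed on `${{ github.event.pull_request.number || github.sha }}` with `cancel-in-progress: true`. The checker script itself is not in this diff; presumably it validates exactly that, along the lines of this sketch.

```python
# Hypothetical sketch: verify every workflow opts into cancel-in-progress concurrency.
import sys
from pathlib import Path

import yaml


def ensure_actions_will_cancel(workflows_dir: str = ".github/workflows") -> int:
    bad = []
    for workflow in sorted(Path(workflows_dir).glob("*.yml")):
        doc = yaml.safe_load(workflow.read_text())
        concurrency = doc.get("concurrency") or {}
        if "group" not in concurrency or not concurrency.get("cancel-in-progress"):
            bad.append(workflow.name)
    for name in bad:
        print(f"{name}: missing a concurrency block with cancel-in-progress: true")
    return 1 if bad else 0


if __name__ == "__main__":
    sys.exit(ensure_actions_will_cancel())
```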
toc:
runs-on: ubuntu-18.04
@ -160,13 +189,27 @@ jobs:
run: npm install -g markdown-toc
- name: Regenerate ToCs and check that they didn't change
run: |
set -eux
set -eu
export PATH=~/.npm-global/bin:"$PATH"
for FILE in $(git grep -Il '<!-- toc -->' -- '**.md'); do
markdown-toc --bullets='-' -i "$FILE"
done
.github/scripts/report_git_status.sh
if ! .github/scripts/report_git_status.sh .; then
echo
echo 'As shown by the above diff, the table of contents in one or'
echo 'more Markdown files is not up to date with the file contents.'
echo 'You can either apply that Git diff directly to correct the'
echo 'table of contents, or if you have npm installed, you can'
echo 'install the npm package markdown-toc and run the following'
# shellcheck disable=SC2016
echo 'command (replacing $FILE with the filename for which you want'
echo 'to regenerate the table of contents):'
echo
# shellcheck disable=SC2016
echo " markdown-toc --bullets='-' -i \"\$FILE\""
false
fi
flake8-py3:
runs-on: ubuntu-18.04
@ -214,20 +257,21 @@ jobs:
path: flake8-output/
- name: Fail if there were any warnings
run: |
set -eux
set -eu
# Re-output flake8 status so GitHub logs show it on the step that actually failed
cat "${GITHUB_WORKSPACE}"/flake8-output.txt
[ ! -s "${GITHUB_WORKSPACE}"/flake8-output.txt ]
if [ -s "${GITHUB_WORKSPACE}"/flake8-output.txt ]; then
echo 'Please fix the above Flake8 warnings.'
false
fi
clang-tidy:
if: github.event_name == 'pull_request'
runs-on: ubuntu-18.04
container:
# ubuntu18.04-cuda10.2-py3.6-tidy11
image: ghcr.io/pytorch/cilint-clang-tidy:52a8ad78d49fc9f40241fee7988db48c920499df
steps:
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: 3.x
architecture: x64
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -236,58 +280,46 @@ jobs:
env:
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
run: |
cd "${GITHUB_WORKSPACE}"
mkdir clang-tidy-output
cd clang-tidy-output
echo "$HEAD_SHA" > commit-sha.txt
- name: Install dependencies
run: |
set -eux
# Install CUDA
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
sudo apt-get update
sudo apt-get --no-install-recommends -y install cuda-toolkit-10-2
# Install dependencies
pip install pyyaml typing_extensions
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
sudo apt-add-repository "deb http://apt.llvm.org/bionic/ llvm-toolchain-bionic-11 main"
sudo apt-get update
sudo apt-get install -y clang-tidy-11
sudo update-alternatives --install /usr/bin/clang-tidy clang-tidy /usr/bin/clang-tidy-11 1000
- name: Generate build files
run: |
cd "${GITHUB_WORKSPACE}"
set -eux
git remote add upstream https://github.com/pytorch/pytorch
git fetch upstream "$GITHUB_BASE_REF"
if [[ ! -d build ]]; then
if [ ! -d build ]; then
git submodule update --init --recursive
export USE_NCCL=0
export USE_DEPLOY=1
# We really only need compile_commands.json, so no need to build!
time python setup.py --cmake-only build
time python3 setup.py --cmake-only build
# Generate ATen files.
time python -m tools.codegen.gen \
time python3 -m tools.codegen.gen \
-s aten/src/ATen \
-d build/aten/src/ATen
# Generate PyTorch files.
time python tools/setup_helpers/generate_code.py \
time python3 tools/setup_helpers/generate_code.py \
--declarations-path build/aten/src/ATen/Declarations.yaml \
--native-functions-path aten/src/ATen/native/native_functions.yaml \
--nn-path aten/src
fi
- name: Run clang-tidy
env:
BASE_SHA: ${{ github.event.pull_request.base.sha }}
HEAD_SHA: ${{ github.event.pull_request.head.sha }}
PR_NUMBER: ${{ github.event.pull_request.number }}
run: |
cd "${GITHUB_WORKSPACE}"
set -eux
wget -O pr.diff "https://patch-diff.githubusercontent.com/raw/pytorch/pytorch/pull/$PR_NUMBER.diff"
# Run Clang-Tidy
# The negative filters below are to exclude files that include onnx_pb.h or
# caffe2_pb.h, otherwise we'd have to build protos as part of this CI job.
@ -296,27 +328,28 @@ jobs:
# /torch/csrc/generic/*.cpp is excluded because those files aren't actually built.
# deploy/interpreter files are excluded due to using macros and other techniques
# that are not easily converted to accepted c++
python tools/clang_tidy.py \
--verbose \
--paths torch/csrc/ \
--diff "$BASE_SHA" \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp"\
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp" \
"$@" > "${GITHUB_WORKSPACE}"/clang-tidy-output.txt
python3 tools/clang_tidy.py \
--verbose \
--paths torch/csrc/ \
--diff-file pr.diff \
-g"-torch/csrc/jit/passes/onnx/helper.cpp" \
-g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
-g"-torch/csrc/jit/serialization/onnx.cpp" \
-g"-torch/csrc/jit/serialization/export.cpp" \
-g"-torch/csrc/jit/serialization/import.cpp" \
-g"-torch/csrc/jit/serialization/import_legacy.cpp" \
-g"-torch/csrc/onnx/init.cpp" \
-g"-torch/csrc/cuda/nccl.*" \
-g"-torch/csrc/cuda/python_nccl.cpp" \
-g"-torch/csrc/autograd/FunctionsManual.cpp" \
-g"-torch/csrc/generic/*.cpp" \
-g"-torch/csrc/jit/codegen/cuda/runtime/*" \
-g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
-g"-torch/csrc/deploy/interpreter/interpreter.h" \
-g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
-g"-torch/csrc/deploy/interpreter/test_main.cpp" \
"$@" >"${GITHUB_WORKSPACE}"/clang-tidy-output.txt
cat "${GITHUB_WORKSPACE}"/clang-tidy-output.txt
@ -380,3 +413,7 @@ jobs:
run: |
set -eux
for CONFIG in mypy*.ini; do mypy --config="$CONFIG"; done
concurrency:
group: lint-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -1,6 +1,6 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_ci_workflow.yml
# Generation script: .github/scripts/generate_linux_ci_workflows.py
# Template is at: .github/templates/linux_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Linux CI (pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7)
on:
@ -21,6 +21,10 @@ env:
CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
concurrency:
group: pytorch-linux-xenial-cuda10.2-cudnn7-py3.6-gcc7-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
calculate-docker-image:
runs-on: linux.2xlarge
@ -30,6 +34,15 @@ jobs:
outputs:
docker_image: ${{ steps.calculate-tag.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -47,7 +60,6 @@ jobs:
DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker_tag }}
BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
eval "$(aws ecr get-login --no-include-email --region us-east-1)"
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then
@ -82,6 +94,7 @@ jobs:
run: |
export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/}
cd .circleci/docker && ./build_docker.sh
build:
runs-on: linux.2xlarge
needs: calculate-docker-image
@ -126,6 +139,22 @@ jobs:
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh'
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
@ -133,11 +162,22 @@ jobs:
- name: Archive artifacts into zip
run: |
zip -r artifacts.zip dist/ build/
# Upload to github so that people can click and download artifacts
- uses: actions/upload-artifact@v2
name: Store PyTorch Build Artifacts
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
name: Store PyTorch Build Artifacts on Github
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
- uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
name: Store PyTorch Build Artifacts on S3
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
@ -146,6 +186,7 @@ jobs:
run: |
# Prune all of the docker images
docker system prune -af
test:
runs-on: linux.8xlarge.nvidia.gpu
needs:
@ -165,6 +206,8 @@ jobs:
docker run --rm -v "$(pwd)/../":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
@ -185,7 +228,7 @@ jobs:
;;
esac
echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}"
- uses: actions/download-artifact@v2
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
@ -207,6 +250,7 @@ jobs:
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e GITHUB_ACTIONS \
-e IN_CI \
-e MAX_JOBS="$(nproc --ignore=2)" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -229,7 +273,7 @@ jobs:
if: always()
with:
name: test-reports
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
@ -240,6 +284,12 @@ jobs:
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
# Prune all of the docker images
docker system prune -af
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:

View File

@ -1,6 +1,6 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_ci_workflow.yml
# Generation script: .github/scripts/generate_linux_ci_workflows.py
# Template is at: .github/templates/linux_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Linux CI (pytorch-linux-xenial-py3.6-gcc5.4)
on:
@ -22,6 +22,10 @@ env:
CUSTOM_TEST_ARTIFACT_BUILD_DIR: build/custom_test_artifacts
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
concurrency:
group: pytorch-linux-xenial-py3.6-gcc5.4-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
calculate-docker-image:
runs-on: linux.2xlarge
@ -31,6 +35,15 @@ jobs:
outputs:
docker_image: ${{ steps.calculate-tag.outputs.docker_image }}
steps:
- name: Log in to ECR
run: |
aws ecr get-login --no-include-email --region us-east-1 > /tmp/ecr-login.sh
bash /tmp/ecr-login.sh
rm /tmp/ecr-login.sh
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -48,7 +61,6 @@ jobs:
DOCKER_TAG: ${{ steps.calculate-tag.outputs.docker_tag }}
BASE_REVISION: ${{ github.event.pull_request.base.sha || github.sha }}
run: |
eval "$(aws ecr get-login --no-include-email --region us-east-1)"
set -x
# Check if image already exists, if it does then skip building it
if docker manifest inspect "${DOCKER_IMAGE_BASE}:${DOCKER_TAG}"; then
@ -83,6 +95,7 @@ jobs:
run: |
export IMAGE_NAME=${DOCKER_IMAGE_BASE#308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/}
cd .circleci/docker && ./build_docker.sh
build:
runs-on: linux.2xlarge
needs: calculate-docker-image
@ -127,6 +140,22 @@ jobs:
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
sh -c 'sudo chown -R jenkins . && .jenkins/pytorch/build.sh'
- name: Display and upload binary build size statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
COMMIT_TIME=$(git log --max-count=1 --format=%ct || echo 0)
export COMMIT_TIME
pip3 install requests
python3 .circleci/scripts/upload_binary_size_to_scuba.py || exit 0
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
@ -134,11 +163,22 @@ jobs:
- name: Archive artifacts into zip
run: |
zip -r artifacts.zip dist/ build/
# Upload to github so that people can click and download artifacts
- uses: actions/upload-artifact@v2
name: Store PyTorch Build Artifacts
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
name: Store PyTorch Build Artifacts on Github
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
- uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
name: Store PyTorch Build Artifacts on S3
with:
name: ${{ env.BUILD_ENVIRONMENT }}
retention-days: 14
if-no-files-found: error
path:
artifacts.zip
@ -147,6 +187,7 @@ jobs:
run: |
# Prune all of the docker images
docker system prune -af
test:
runs-on: linux.2xlarge
needs:
@ -166,6 +207,8 @@ jobs:
docker run --rm -v "$(pwd)/../":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Pull docker image
run: |
docker pull "${DOCKER_IMAGE}"
@ -186,7 +229,7 @@ jobs:
;;
esac
echo "SHM_SIZE=${shm_size}" >> "${GITHUB_ENV}"
- uses: actions/download-artifact@v2
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
@ -208,6 +251,7 @@ jobs:
${GPU_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e CUSTOM_TEST_ARTIFACT_BUILD_DIR \
-e GITHUB_ACTIONS \
-e IN_CI \
-e MAX_JOBS="$(nproc --ignore=2)" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
@ -230,7 +274,7 @@ jobs:
if: always()
with:
name: test-reports
retention-days: 30
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
@ -241,6 +285,12 @@ jobs:
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
# Prune all of the docker images
docker system prune -af
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
@ -287,6 +337,7 @@ jobs:
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test
pytorch_python_doc_build:
runs-on: linux.2xlarge
needs:
@ -303,7 +354,7 @@ jobs:
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
@ -315,7 +366,7 @@ jobs:
- name: Preserve github env variables for use in docker
run: |
env | grep '^GITHUB' > "/tmp/github_env_${GITHUB_RUN_ID}"
- uses: actions/download-artifact@v2
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
@ -348,7 +399,7 @@ jobs:
- name: Chown workspace
run: |
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "$(pwd)":/v -w /v alpine chown -R "$(id -u):$(id -g)" .
docker run --rm -v "$(pwd)":/v -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- name: Archive artifacts into zip
run: |
zip -r pytorch_github_io.zip "${GITHUB_WORKSPACE}/pytorch.github.io"

View File

@ -0,0 +1,171 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Windows CI (pytorch-win-vs2019-cpu-py3)
on:
pull_request:
push:
branches:
- master
- release/*
workflow_dispatch:
env:
BUILD_ENVIRONMENT: pytorch-win-vs2019-cpu-py3
BUILD_WHEEL: 1
CUDA_VERSION: "cpu"
IN_CI: 1
INSTALL_WINDOWS_SDK: 1
JOB_BASE_NAME: test
PYTHON_VERSION: "3.8"
SCCACHE_BUCKET: "ossci-compiler-cache"
VC_PRODUCT: "BuildTools"
VC_VERSION: ""
VC_YEAR: "2019"
concurrency:
group: pytorch-win-vs2019-cpu-py3-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
build:
runs-on: "windows.4xlarge"
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- name: Build
shell: bash
run: |
.jenkins/pytorch/win-build.sh
# Upload to github so that people can click and download artifacts
- name: Upload artifacts to Github
if: always()
uses: actions/upload-artifact@v2
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
- name: Upload artifacts to s3
if: always()
uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
test:
runs-on: windows.4xlarge
env:
JOB_BASE_NAME: pytorch-win-vs2019-cpu-py3-test
needs:
- build
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\${{ github.run_id }}\build-results
- name: Check build-results folder
shell: powershell
run: |
tree /F C:\$Env:GITHUB_RUN_ID\build-results
# Needed for coverage in win-test.sh
- uses: actions/setup-python@v2
name: Setup Python3
with:
python-version: '3.x'
- name: Run test scripts
shell: bash
env:
PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/
run: |
.jenkins/pytorch/win-test.sh
- uses: actions/upload-artifact@v2
name: Store PyTorch Test Reports
if: always()
with:
name: test-reports
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
- test
runs-on: ubuntu-18.04
# TODO: Make this into a composite step
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow tools/print_test_stats.py to use Git commands
fetch-depth: 0
- uses: actions/download-artifact@v2
name: Download PyTorch Test Reports
with:
name: test-reports
path: test/test-reports
- uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
# boto3 version copied from .circleci/docker/common/install_conda.sh
run: |
pip install -r requirements.txt
pip install boto3==1.16.34 junitparser rich
- name: Output Test Results (Click Me)
run: |
python tools/render_junit.py test
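`tools/render_junit.py` is not included in this diff; given that the previous step installs `junitparser` and `rich` right before it runs, a plausible minimal version would read the downloaded XML reports and print a summary table. The sketch below is written under that assumption and is not the actual tool.

```python
# Hypothetical sketch: summarize JUnit XML test reports with junitparser + rich.
from pathlib import Path

from junitparser import JUnitXml, TestSuite
from rich.console import Console
from rich.table import Table


def render_junit(report_dir: str = "test") -> None:
    table = Table(title="Test results")
    table.add_column("Suite")
    table.add_column("Tests", justify="right")
    table.add_column("Failures", justify="right")
    table.add_column("Time (s)", justify="right")
    for report in sorted(Path(report_dir).rglob("*.xml")):
        xml = JUnitXml.fromfile(str(report))
        # fromfile returns a bare TestSuite when the XML root is <testsuite>.
        suites = [xml] if isinstance(xml, TestSuite) else list(xml)
        for suite in suites:
            table.add_row(
                suite.name or report.name,
                str(suite.tests or 0),
                str((suite.failures or 0) + (suite.errors or 0)),
                f"{suite.time or 0:.1f}",
            )
    Console().print(table)


if __name__ == "__main__":
    render_junit()
```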
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Display and upload test statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_JOB: pytorch-win-vs2019-cpu-py3
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test

View File

@ -0,0 +1,188 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Windows CI (pytorch-win-vs2019-cuda10-cudnn7-py3)
on:
push:
branches:
- master
- release/*
workflow_dispatch:
env:
BUILD_ENVIRONMENT: pytorch-win-vs2019-cuda10-cudnn7-py3
BUILD_WHEEL: 1
CUDA_VERSION: "10.1"
IN_CI: 1
INSTALL_WINDOWS_SDK: 1
JOB_BASE_NAME: test
PYTHON_VERSION: "3.8"
SCCACHE_BUCKET: "ossci-compiler-cache"
VC_PRODUCT: "BuildTools"
VC_VERSION: ""
VC_YEAR: "2019"
TORCH_CUDA_ARCH_LIST: "7.0"
USE_CUDA: 1
concurrency:
group: pytorch-win-vs2019-cuda10-cudnn7-py3-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
build:
runs-on: "windows.4xlarge"
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
- name: Build
shell: bash
run: |
.jenkins/pytorch/win-build.sh
# Upload to github so that people can click and download artifacts
- name: Upload artifacts to Github
if: always()
uses: actions/upload-artifact@v2
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
- name: Upload artifacts to s3
if: always()
uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
test:
runs-on: windows.8xlarge.nvidia.gpu
env:
JOB_BASE_NAME: pytorch-win-vs2019-cuda10-cudnn7-py3-test
needs:
- build
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\${{ github.run_id }}\build-results
- name: Check build-results folder
shell: powershell
run: |
tree /F C:\$Env:GITHUB_RUN_ID\build-results
# Needed for coverage in win-test.sh
- uses: actions/setup-python@v2
name: Setup Python3
with:
python-version: '3.x'
- name: Run test scripts
shell: bash
env:
PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/
run: |
.jenkins/pytorch/win-test.sh
- uses: actions/upload-artifact@v2
name: Store PyTorch Test Reports
if: always()
with:
name: test-reports
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
- test
runs-on: ubuntu-18.04
# TODO: Make this into a composite step
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow tools/print_test_stats.py to use Git commands
fetch-depth: 0
- uses: actions/download-artifact@v2
name: Download PyTorch Test Reports
with:
name: test-reports
path: test/test-reports
- uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
# boto3 version copied from .circleci/docker/common/install_conda.sh
run: |
pip install -r requirements.txt
pip install boto3==1.16.34 junitparser rich
- name: Output Test Results (Click Me)
run: |
python tools/render_junit.py test
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
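`.github/scripts/parse_ref.py` supplies the `steps.parse-ref.outputs.branch` and `.tag` values consumed by the stats-upload step below. It is not part of this diff; a minimal sketch of what it presumably does, using the `::set-output` command current at the time:

```python
# Hypothetical sketch: derive branch/tag step outputs from GITHUB_REF.
import os


def main() -> None:
    ref = os.environ.get("GITHUB_REF", "")
    if ref.startswith("refs/heads/"):
        print(f"::set-output name=branch::{ref[len('refs/heads/'):]}")
    elif ref.startswith("refs/tags/"):
        print(f"::set-output name=tag::{ref[len('refs/tags/'):]}")


if __name__ == "__main__":
    main()
```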
- name: Display and upload test statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_JOB: pytorch-win-vs2019-cuda10-cudnn7-py3
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test

View File

@ -0,0 +1,188 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_ci_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: Windows CI (pytorch-win-vs2019-cuda11-cudnn8-py3)
on:
push:
branches:
- master
- release/*
workflow_dispatch:
env:
BUILD_ENVIRONMENT: pytorch-win-vs2019-cuda11-cudnn8-py3
BUILD_WHEEL: 1
CUDA_VERSION: "11.1"
IN_CI: 1
INSTALL_WINDOWS_SDK: 1
JOB_BASE_NAME: test
PYTHON_VERSION: "3.8"
SCCACHE_BUCKET: "ossci-compiler-cache"
VC_PRODUCT: "BuildTools"
VC_VERSION: ""
VC_YEAR: "2019"
TORCH_CUDA_ARCH_LIST: "7.0"
USE_CUDA: 1
concurrency:
group: pytorch-win-vs2019-cuda11-cudnn8-py3-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true
jobs:
build:
runs-on: "windows.4xlarge"
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
- name: Build
shell: bash
run: |
.jenkins/pytorch/win-build.sh
# Upload to github so that people can click and download artifacts
- name: Upload artifacts to Github
if: always()
uses: actions/upload-artifact@v2
# Don't fail on upload to GH since it's only for user convenience
continue-on-error: true
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
- name: Upload artifacts to s3
if: always()
uses: seemethere/upload-artifact-s3@9d7ceb0ab39c2c88d93ef7792b27425b27d59162
with:
retention-days: 14
if-no-files-found: error
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\w\build-results
test:
runs-on: windows.8xlarge.nvidia.gpu
env:
JOB_BASE_NAME: pytorch-win-vs2019-cuda11-cudnn8-py3-test
needs:
- build
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
submodules: recursive
- name: Clean workspace (including things in .gitignore)
shell: bash
run: |
git clean -xdf
- name: Install Visual Studio 2019 toolchain
shell: powershell
run: |
.\.circleci\scripts\vs_install.ps1
- name: Install Cuda
shell: bash
run: |
.circleci/scripts/windows_cuda_install.sh
- name: Install Cudnn
shell: bash
run: |
.circleci/scripts/windows_cudnn_install.sh
- uses: seemethere/download-artifact-s3@0504774707cbc8603d7dca922e8026eb8bf3b47b
name: Download PyTorch Build Artifacts
with:
name: ${{ env.BUILD_ENVIRONMENT }}
path: C:\${{ github.run_id }}\build-results
- name: Check build-results folder
shell: powershell
run: |
tree /F C:\$Env:GITHUB_RUN_ID\build-results
# Needed for coverage in win-test.sh
- uses: actions/setup-python@v2
name: Setup Python3
with:
python-version: '3.x'
- name: Run test scripts
shell: bash
env:
PYTORCH_FINAL_PACKAGE_DIR: /c/${{ github.run_id }}/build-results/
run: |
.jenkins/pytorch/win-test.sh
- uses: actions/upload-artifact@v2
name: Store PyTorch Test Reports
if: always()
with:
name: test-reports
retention-days: 14
if-no-files-found: error
path:
test/**/*.xml
# this is a separate step from test because the log files from test are too
# long: basically, GitHub tries to render all of the log files when you click
# through an action causing extreme slowdown on actions that contain too many
# logs (like test); we can always move it back to the other one, but it
# doesn't create the best experience
render_test_results:
if: always()
needs:
- test
runs-on: ubuntu-18.04
# TODO: Make this into a composite step
steps:
- name: Checkout PyTorch
uses: actions/checkout@v2
with:
# deep clone, to allow tools/print_test_stats.py to use Git commands
fetch-depth: 0
- uses: actions/download-artifact@v2
name: Download PyTorch Test Reports
with:
name: test-reports
path: test/test-reports
- uses: actions/setup-python@v2
with:
python-version: 3.9
- name: Install dependencies
# boto3 version copied from .circleci/docker/common/install_conda.sh
run: |
pip install -r requirements.txt
pip install boto3==1.16.34 junitparser rich
- name: Output Test Results (Click Me)
run: |
python tools/render_junit.py test
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Display and upload test statistics (Click Me)
# temporary hack: set CIRCLE_* vars, until we update
# tools/print_test_stats.py to natively support GitHub Actions
env:
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
AWS_ACCESS_KEY_ID: ${{ secrets.AWS_OSSCI_METRICS_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_OSSCI_METRICS_SECRET_ACCESS_KEY }}
CIRCLE_BRANCH: ${{ steps.parse-ref.outputs.branch }}
CIRCLE_JOB: pytorch-win-vs2019-cuda11-cudnn8-py3
CIRCLE_PR_NUMBER: ${{ github.event.pull_request.number }}
CIRCLE_SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
CIRCLE_TAG: ${{ steps.parse-ref.outputs.tag }}
CIRCLE_WORKFLOW_ID: ${{ github.run_id }} # dunno if this corresponds
run: |
export PYTHONPATH=$PWD
python tools/print_test_stats.py --upload-to-s3 --compare-with-s3 test

View File

@ -64,3 +64,7 @@ jobs:
with:
name: TorchBench result
path: ~/.torchbench/bisection/pr${{ github.event.number }}
concurrency:
group: run-torchbench-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -29,3 +29,7 @@ jobs:
make setup_lint
- name: Run tests
run: python -m unittest discover -vs tools/test -p 'test_*.py'
concurrency:
group: test-tools-${{ github.event.pull_request.number || github.sha }}
cancel-in-progress: true

View File

@ -1,30 +0,0 @@
name: Update disabled tests
on:
issues:
types: [opened, edited, labeled, unlabeled, closed, reopened]
# Have the ability to trigger this job manually through the API
workflow_dispatch:
jobs:
update-disabled-tests:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: ubuntu-18.04
steps:
- name: Generate new disabled test list
run: |
# score changes every request, so we strip it out to avoid creating a commit every time we query.
curl 'https://api.github.com/search/issues?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+flaky-tests%22+repo:pytorch/pytorch+in%3Atitle+DISABLED' \
| sed 's/"score": [0-9\.]*/"score": 0.0/g' > disabled-tests.json
- name: Push file to test-infra repository
uses: dmnemec/copy_file_to_another_repo_action@5f40763ccee2954067adba7fb8326e4df33bcb92
env:
API_TOKEN_GITHUB: ${{ secrets.TEST_INFRA_TOKEN }}
with:
source_file: 'disabled-tests.json'
destination_repo: 'pytorch/test-infra'
destination_folder: 'stats'
destination_branch: master
user_email: 'test-infra@pytorch.org'
user_name: 'Pytorch Test Infra'
commit_message: 'Updating disabled tests stats'
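The workflow deleted above kept `stats/disabled-tests.json` in pytorch/test-infra in sync with open "DISABLED" flaky-test issues; the `sed` call zeroes out the search API's relevance `score` field so the committed file only changes when the issue list actually changes. For reference, a rough Python equivalent of that query step (the workflow itself only used `curl` and `sed`):

```python
# Rough equivalent of the deleted "Generate new disabled test list" step.
import json
import urllib.request

SEARCH_URL = (
    "https://api.github.com/search/issues"
    "?q=is%3Aissue+is%3Aopen+label%3A%22module%3A+flaky-tests%22"
    "+repo:pytorch/pytorch+in%3Atitle+DISABLED"
)


def fetch_disabled_tests(path: str = "disabled-tests.json") -> None:
    with urllib.request.urlopen(SEARCH_URL) as resp:
        data = json.load(resp)
    # The relevance score changes on every request; pin it so the committed
    # file is only updated when the set of issues changes.
    for item in data.get("items", []):
        item["score"] = 0.0
    with open(path, "w") as f:
        json.dump(data, f, indent=2)


if __name__ == "__main__":
    fetch_disabled_tests()
```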

.gitignore
View File

@ -15,8 +15,8 @@ coverage.xml
.hypothesis
.mypy_cache
/.extracted_scripts/
**/.pytorch-test-times
**/.pytorch-slow-tests
**/.pytorch-test-times.json
**/.pytorch-slow-tests.json
*/*.pyc
*/*.so*
*/**/__pycache__

.gitmodules
View File

@ -130,6 +130,9 @@
ignore = dirty
path = third_party/tensorpipe
url = https://github.com/pytorch/tensorpipe.git
[submodule "third_party/cudnn_frontend"]
path = third_party/cudnn_frontend
url = https://github.com/NVIDIA/cudnn-frontend.git
[submodule "third_party/kineto"]
path = third_party/kineto
url = https://github.com/pytorch/kineto

View File

@ -24,7 +24,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-mobile-code-analysis* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile-code-analysis.sh" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" == pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7* ]]; then
if [[ "$BUILD_ENVIRONMENT" == pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7* ]]; then
# Enabling DEPLOY build (embedded torch python interpreter, experimental)
# only on one config for now, can expand later
export USE_DEPLOY=ON
@ -200,8 +200,10 @@ fi
# Patch required to build xla
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
git clone --recursive -b r1.9 https://github.com/pytorch/xla.git
./xla/scripts/apply_patches.sh
clone_pytorch_xla
# shellcheck disable=SC1091
source "xla/.circleci/common.sh"
apply_patches
fi
if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3.6-gcc7-build || "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-py3.6-gcc5.4-build ]]; then
@ -311,36 +313,10 @@ fi
# Test XLA build
if [[ "${BUILD_ENVIRONMENT}" == *xla* ]]; then
# TODO: Move this to Dockerfile.
pip_install lark-parser
pip_install cloud-tpu-client
sudo apt-get -qq update
sudo apt-get -qq install npm nodejs
# XLA build requires Bazel
# We use bazelisk to avoid updating Bazel version manually.
sudo npm install -g @bazel/bazelisk
sudo ln -s "$(command -v bazelisk)" /usr/bin/bazel
# Install bazels3cache for cloud cache
sudo npm install -g bazels3cache
BAZELS3CACHE="$(which bazels3cache)"
if [ -z "${BAZELS3CACHE}" ]; then
echo "Unable to find bazels3cache..."
exit 1
fi
bazels3cache --bucket="${XLA_CLANG_CACHE_S3_BUCKET_NAME}" --maxEntrySizeBytes=0
pushd xla
export CC=clang-9 CXX=clang++-9
# Use cloud cache to build when available.
# shellcheck disable=SC1003
sed -i '/bazel build/ a --remote_http_cache=http://localhost:7777 \\' build_torch_xla_libs.sh
python setup.py install
popd
XLA_DIR=xla
# These functions are defined in .circleci/common.sh in pytorch/xla repo
install_deps_pytorch_xla $XLA_DIR
build_torch_xla $XLA_DIR
assert_git_not_dirty
fi

View File

@ -52,9 +52,9 @@ function get_exit_code() {
function file_diff_from_base() {
# The fetch may fail on Docker hosts, this fetch is necessary for GHA
set +e
git fetch origin release/1.9 --quiet
git fetch origin master --quiet
set -e
git diff --name-only "$(git merge-base origin/release/1.9 HEAD)" > "$1"
git diff --name-only "$(git merge-base origin/master HEAD)" > "$1"
}
function get_bazel() {
@ -86,3 +86,7 @@ function checkout_install_torchvision() {
time python setup.py install
popd
}
function clone_pytorch_xla() {
git clone --recursive https://github.com/pytorch/xla.git
}

View File

@ -28,13 +28,7 @@ fi
export PATH="${WORKSPACE_DIR}/miniconda3/bin:$PATH"
# shellcheck disable=SC1091
source "${WORKSPACE_DIR}"/miniconda3/bin/activate
# NOTE: mkl 2021.3.0+ cmake requires sub-command PREPEND, may break the build
retry conda install -y \
mkl=2021.2.0 mkl-include=2021.2.0 \
numpy=1.18.5 pyyaml=5.3 setuptools=46.0.0 \
cmake cffi ninja typing_extensions dataclasses pip
retry conda install -y mkl mkl-include numpy=1.18.5 pyyaml=5.3 setuptools=46.0.0 cmake cffi ninja typing_extensions dataclasses pip
# The torch.hub tests make requests to GitHub.
#
# The certifi package from conda-forge is new enough to make the

View File

@ -53,7 +53,8 @@ test_python_all() {
# Try to pull value from CIRCLE_PULL_REQUEST first then GITHUB_HEAD_REF second
# CIRCLE_PULL_REQUEST comes from CircleCI
# GITHUB_HEAD_REF comes from Github Actions
# NOTE: file_diff_from_base is currently bugged for GHA due to an issue finding a merge base for ghstack PRs
# see https://github.com/pytorch/pytorch/issues/60111
IN_PULL_REQUEST=${CIRCLE_PULL_REQUEST:-${GITHUB_HEAD_REF:-}}
if [ -n "$IN_PULL_REQUEST" ]; then
DETERMINE_FROM=$(mktemp)

View File

@ -25,6 +25,8 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_process_group_agent
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
assert_git_not_dirty

View File

@ -37,8 +37,6 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
# mainly used so that we're not spending extra cycles testing cpu
# devices on expensive gpu machines
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
elif [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xla"
fi
if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
@ -126,7 +124,8 @@ fi
# Try to pull value from CIRCLE_PULL_REQUEST first then GITHUB_HEAD_REF second
# CIRCLE_PULL_REQUEST comes from CircleCI
# GITHUB_HEAD_REF comes from Github Actions
# NOTE: file_diff_from_base is currently bugged for GHA due to an issue finding a merge base for ghstack PRs
# see https://github.com/pytorch/pytorch/issues/60111
IN_PULL_REQUEST=${CIRCLE_PULL_REQUEST:-}
if [ -n "$IN_PULL_REQUEST" ] && [[ "$BUILD_ENVIRONMENT" != *coverage* ]]; then
DETERMINE_FROM=$(mktemp)
@ -339,23 +338,9 @@ test_torch_function_benchmark() {
}
test_xla() {
export XLA_USE_XRT=1 XRT_DEVICE_MAP="CPU:0;/job:localservice/replica:0/task:0/device:XLA_CPU:0"
# Issue #30717: randomize the port of XLA/gRPC workers is listening on to reduce flaky tests.
XLA_PORT=$(shuf -i 40701-40999 -n 1)
export XRT_WORKERS="localservice:0;grpc://localhost:$XLA_PORT"
pushd xla
echo "Running Python Tests"
./test/run_tests.sh
# Disabled due to MNIST download issue.
# See https://github.com/pytorch/pytorch/issues/53267
# echo "Running MNIST Test"
# python test/test_train_mnist.py --tidy
echo "Running C++ Tests"
pushd test/cpp
CC=clang-9 CXX=clang++-9 ./run_tests.sh
popd
# shellcheck disable=SC1091
source "./xla/.circleci/common.sh"
run_torch_xla_tests "$(pwd)" "$(pwd)/xla"
assert_git_not_dirty
}
@ -368,7 +353,7 @@ test_backward_compatibility() {
python -m venv venv
# shellcheck disable=SC1091
. venv/bin/activate
pip_install --pre torch -f https://download.pytorch.org/whl/test/cpu/torch_test.html
pip_install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
pip show torch
python dump_all_function_schemas.py --filename nightly_schemas.txt
deactivate
@ -452,7 +437,7 @@ elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"
elif [[ "${BUILD_ENVIRONMENT}" == *-test1 || "${JOB_BASE_NAME}" == *-test1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7-test1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" == pytorch-linux-xenial-cuda11.1-cudnn8-py3-gcc7-test1 ]]; then
test_torch_deploy
fi
test_without_numpy

View File

@ -22,7 +22,7 @@ call %INSTALLER_DIR%\install_miniconda3.bat
:: Install ninja and other deps
if "%REBUILD%"=="" ( pip install -q "ninja==1.9.0" dataclasses typing_extensions )
if "%REBUILD%"=="" ( pip install -q "ninja==1.10.0.post1" dataclasses typing_extensions )
:: Override VS env here
pushd .
@ -38,7 +38,15 @@ if not "%USE_CUDA%"=="1" goto cuda_build_end
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v%CUDA_VERSION%
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
)
rem version transformer, for example 10.1 to 10_1.
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
)
set VERSION_SUFFIX=%CUDA_VERSION:.=_%
set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
@ -119,6 +127,6 @@ python setup.py install --cmake && sccache --show-stats && (
7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\caffe2 && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
:: export test times so that potential sharded tests that'll branch off this build will use consistent data
python test/run_test.py --export-past-test-times %PYTORCH_FINAL_PACKAGE_DIR%/.pytorch-test-times
python test/run_test.py --export-past-test-times %PYTORCH_FINAL_PACKAGE_DIR%/.pytorch-test-times.json
)
)

View File

@ -1,4 +1,14 @@
rem remove dot in cuda_version, for example 11.1 to 111
if not "%USE_CUDA%"=="1" (
exit /b 0
)
if x%CUDA_VERSION:.=%==x%CUDA_VERSION% (
echo CUDA version %CUDA_VERSION% format isn't correct, which doesn't contain '.'
exit /b 1
)
set VERSION_SUFFIX=%CUDA_VERSION:.=%
set CUDA_SUFFIX=cuda%VERSION_SUFFIX%

View File

@ -20,9 +20,7 @@ if NOT "%BUILD_ENVIRONMENT%"=="" (
)
call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3
if NOT "%BUILD_ENVIRONMENT%"=="" (
:: We have to pin Python version to 3.6.7, until mkl supports Python 3.7
:: Numba is pinned to 0.44.0 to avoid https://github.com/numba/numba/issues/4352
call conda install -y -q python=3.6.7 numpy mkl cffi pyyaml boto3 protobuf numba==0.44.0 scipy==1.5.0 typing_extensions dataclasses libuv
call conda install -y -q python=3.8 numpy mkl cffi pyyaml boto3 protobuf numba scipy typing_extensions dataclasses libuv
if %errorlevel% neq 0 ( exit /b %errorlevel% )
call conda install -y -q -c conda-forge cmake
if %errorlevel% neq 0 ( exit /b %errorlevel% )

View File

@ -19,3 +19,9 @@ if %errorlevel% neq 0 ( exit /b %errorlevel% )
%1\python.exe test/run_test.py --verbose -i distributed/test_data_parallel
if %errorlevel% neq 0 ( exit /b %errorlevel% )
%1\python.exe test/run_test.py --verbose -i distributed/test_store
if %errorlevel% neq 0 ( exit /b %errorlevel% )
%1\python.exe test/run_test.py --verbose -i distributed/test_pg_wrapper
if %errorlevel% neq 0 ( exit /b %errorlevel% )

View File

@ -1,7 +1,7 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times" "%TEST_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%TEST_DIR_WIN%"
pushd test

View File

@ -1,7 +1,7 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times" "%TEST_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%TEST_DIR_WIN%"
pushd test

View File

@ -1,7 +1,7 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times" "%TEST_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%TEST_DIR_WIN%"
cd test && python run_test.py --exclude-jit-executor --shard 2 2 --verbose --determine-from="%1" && cd ..

View File

@ -23,7 +23,7 @@ export PROJECT_DIR_WIN
export TEST_DIR="${PWD}/test"
TEST_DIR_WIN=$(cygpath -w "${TEST_DIR}")
export TEST_DIR_WIN
export PYTORCH_FINAL_PACKAGE_DIR="/c/users/circleci/workspace/build-results"
export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/users/circleci/workspace/build-results}"
PYTORCH_FINAL_PACKAGE_DIR_WIN=$(cygpath -w "${PYTORCH_FINAL_PACKAGE_DIR}")
export PYTORCH_FINAL_PACKAGE_DIR_WIN
export PYTORCH_TEST_SKIP_NOARCH=1
@ -42,10 +42,10 @@ fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
# Try to pull value from CIRCLE_PULL_REQUEST first then GITHUB_HEAD_REF second
# CIRCLE_PULL_REQUEST comes from CircleCI
# GITHUB_HEAD_REF comes from Github Actions
IN_PULL_REQUEST=${CIRCLE_PULL_REQUEST:-${GITHUB_HEAD_REF:-}}
# Try to pull value from CIRCLE_PULL_REQUEST
# NOTE: file_diff_from_base is currently bugged for GHA due to an issue finding a merge base for ghstack PRs
# see https://github.com/pytorch/pytorch/issues/60111
IN_PULL_REQUEST=${CIRCLE_PULL_REQUEST:-}
if [ -n "$IN_PULL_REQUEST" ]; then
DETERMINE_FROM="${TMP_DIR}/determine_from"
file_diff_from_base "$DETERMINE_FROM"
@ -57,9 +57,9 @@ fi
run_tests() {
# Run nvidia-smi if available
for path in /c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe /c/Windows/System32/nvidia-smi.exe; do
if [ -x $path ]; then
$path;
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do
if [[ -x "$path" ]]; then
"$path" || echo "true";
break
fi
done

View File

@ -132,17 +132,22 @@ genrule(
"aten/src/ATen/RegisterSparseCPU.cpp",
"aten/src/ATen/RegisterSparseCsrCPU.cpp",
"aten/src/ATen/RegisterCompositeImplicitAutograd.cpp",
"aten/src/ATen/RegisterMeta.cpp",
"aten/src/ATen/RegisterCompositeExplicitAutograd.cpp",
"aten/src/ATen/RegisterMeta.cpp",
"aten/src/ATen/RegisterSchema.cpp",
"aten/src/ATen/CPUFunctions.h",
"aten/src/ATen/CUDAFunctions.h",
"aten/src/ATen/CompositeExplicitAutogradFunctions.h",
"aten/src/ATen/CompositeImplicitAutogradFunctions.h",
"aten/src/ATen/Functions.h",
"aten/src/ATen/Functions.cpp",
"aten/src/ATen/RedispatchFunctions.h",
"aten/src/ATen/RedispatchFunctions.cpp",
"aten/src/ATen/Operators.h",
"aten/src/ATen/Operators.cpp",
"aten/src/ATen/NativeFunctions.h",
"aten/src/ATen/MetaFunctions.h",
"aten/src/ATen/NativeMetaFunctions.h",
"aten/src/ATen/core/TensorBody.h",
"aten/src/ATen/core/TensorMethods.cpp",
"aten/src/ATen/core/ATenOpList.cpp",
@ -326,12 +331,8 @@ filegroup(
"aten/src/TH/THAllocator.cpp",
"aten/src/TH/THBlas.cpp",
"aten/src/TH/THGeneral.cpp",
"aten/src/TH/THLapack.cpp",
"aten/src/TH/THStorageFunctions.cpp",
"aten/src/TH/THTensor.cpp",
"aten/src/TH/THTensorEvenMoreMath.cpp",
"aten/src/TH/THTensorLapack.cpp",
"aten/src/TH/THTensorMath.cpp",
"aten/src/TH/THTensorMoreMath.cpp",
],
)
@ -385,7 +386,6 @@ filegroup(
"aten/src/THC/THCTensorMath.cu.cc",
"aten/src/THC/THCTensorMathMagma.cu.cc",
"aten/src/THC/THCTensorMathPairwise.cu.cc",
"aten/src/THC/THCTensorMathReduce.cu.cc",
"aten/src/THC/THCTensorMathScan.cu.cc",
"aten/src/THC/THCTensorScatterGather.cu.cc",
"aten/src/THC/THCTensorSort.cu.cc",
@ -398,16 +398,6 @@ filegroup(
"aten/src/THC/generated/THCTensorMathPointwiseInt.cu.cc",
"aten/src/THC/generated/THCTensorMathPointwiseLong.cu.cc",
"aten/src/THC/generated/THCTensorMathPointwiseShort.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceBFloat16.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceBool.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceByte.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceChar.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceDouble.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceFloat.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceHalf.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceInt.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceLong.cu.cc",
"aten/src/THC/generated/THCTensorMathReduceShort.cu.cc",
"aten/src/THC/generated/THCTensorSortByte.cu.cc",
"aten/src/THC/generated/THCTensorSortChar.cu.cc",
"aten/src/THC/generated/THCTensorSortDouble.cu.cc",
@ -431,7 +421,6 @@ filegroup(
"aten/src/THCUNN/LogSigmoid.cu.cc",
"aten/src/THCUNN/MultiLabelMarginCriterion.cu.cc",
"aten/src/THCUNN/MultiMarginCriterion.cu.cc",
"aten/src/THCUNN/RReLU.cu.cc",
"aten/src/THCUNN/SoftMarginCriterion.cu.cc",
"aten/src/THCUNN/SoftPlus.cu.cc",
"aten/src/THCUNN/SoftShrink.cu.cc",
@ -1888,8 +1877,6 @@ cc_library(
"torch/lib/c10d/*.hpp",
],
exclude = [
"torch/lib/c10d/ProcessGroupMPI.hpp",
"torch/lib/c10d/ProcessGroupNCCL.hpp",
"torch/csrc/autograd/generated/VariableType.h",
"torch/csrc/autograd/generated/RegistrationDeclarations.h",
"torch/csrc/autograd/generated/variable_factories.h",

View File

@ -197,6 +197,9 @@ cmake_dependent_option(
cmake_dependent_option(
USE_WHOLE_CUDNN "Use whole-library linking for cuDNN" OFF
"USE_STATIC_CUDNN" OFF)
cmake_dependent_option(
USE_EXPERIMENTAL_CUDNN_V8_API "Use experimental cuDNN v8 API" OFF
"USE_CUDNN" OFF)
option(USE_FBGEMM "Use FBGEMM (quantized 8-bit server operators)" ON)
option(USE_KINETO "Use Kineto profiling library" ON)
option(USE_CUPTI_SO "Use CUPTI as a shared library" OFF)
@ -286,6 +289,12 @@ cmake_dependent_option(
cmake_dependent_option(
USE_GLOO_WITH_OPENSSL "Use Gloo with OpenSSL. Only available if USE_GLOO is on." OFF
"USE_GLOO AND LINUX AND NOT INTERN_BUILD_MOBILE" OFF)
cmake_dependent_option(
USE_C10D_GLOO "USE C10D GLOO" ON "USE_DISTRIBUTED;USE_GLOO" OFF)
cmake_dependent_option(
USE_C10D_NCCL "USE C10D NCCL" ON "USE_DISTRIBUTED;USE_NCCL" OFF)
cmake_dependent_option(
USE_C10D_MPI "USE C10D MPI" ON "USE_DISTRIBUTED;USE_MPI" OFF)
cmake_dependent_option(
USE_TENSORPIPE "Use TensorPipe. Only available if USE_DISTRIBUTED is on." ON
"USE_DISTRIBUTED" OFF)
@ -351,6 +360,7 @@ option(USE_SYSTEM_CPUINFO "Use system-provided cpuinfo." OFF)
option(USE_SYSTEM_SLEEF "Use system-provided sleef." OFF)
option(USE_SYSTEM_GLOO "Use system-provided gloo." OFF)
option(USE_SYSTEM_FP16 "Use system-provided fp16." OFF)
option(USE_SYSTEM_PYBIND11 "Use system-provided PyBind11." OFF)
option(USE_SYSTEM_PTHREADPOOL "Use system-provided pthreadpool." OFF)
option(USE_SYSTEM_PSIMD "Use system-provided psimd." OFF)
option(USE_SYSTEM_FXDIV "Use system-provided fxdiv." OFF)
@ -371,6 +381,7 @@ if(USE_SYSTEM_LIBS)
set(USE_SYSTEM_BENCHMARK ON)
set(USE_SYSTEM_ONNX ON)
set(USE_SYSTEM_XNNPACK ON)
set(USE_SYSTEM_PYBIND11 ON)
endif()
# Used when building Caffe2 through setup.py

View File

@ -19,20 +19,20 @@
# Distributed package
# This list is mostly if you'd like to be tagged as reviewer, feel free to add
# or remove yourself from it.
/torch/lib/c10d/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang
/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang
/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang
/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang
/torch/lib/c10d/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang @cbalioglu
/torch/csrc/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang @cbalioglu
/torch/distributed/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang @cbalioglu
/torch/nn/parallel/ @mrshenli @zhaojuanmao @pritamdamania87 @rohan-varma @SciPioneer @mingzhe09088 @H-Huang @cbalioglu
# Distributed tests
# This list is mostly if you'd like to be tagged as reviewer, feel free to add
# or remove yourself from it.
/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @SciPioneer @H-Huang
/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @SciPioneer @H-Huang
/test/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @SciPioneer @H-Huang @cbalioglu
/torch/testing/_internal/distributed @mrshenli @pritamdamania87 @zhaojuanmao @rohan-varma @SciPioneer @H-Huang @cbalioglu
# ONNX Export
/torch/csrc/jit/passes/onnx.h @bowenbao @neginraoof @spandantiwari
/torch/csrc/jit/passes/onnx.cpp @bowenbao @neginraoof @spandantiwari
/torch/csrc/jit/passes/onnx/ @bowenbao @neginraoof @spandantiwari
/torch/onnx/ @bowenbao @neginraoof @spandantiwari
/test/onnx/ @bowenbao @neginraoof @spandantiwari
/torch/csrc/jit/passes/onnx.h @bowenbao @neginraoof @shubhambhokare1
/torch/csrc/jit/passes/onnx.cpp @bowenbao @neginraoof @shubhambhokare1
/torch/csrc/jit/passes/onnx/ @bowenbao @neginraoof @shubhambhokare1
/torch/onnx/ @bowenbao @neginraoof @shubhambhokare1
/test/onnx/ @bowenbao @neginraoof @shubhambhokare1

View File

@ -77,11 +77,17 @@ https://github.com/pytorch/pytorch#from-source
To develop PyTorch on your machine, here are some tips:
1. Uninstall all existing PyTorch installs:
1. Uninstall all existing PyTorch installs. You may need to run `pip
uninstall torch` multiple times. You'll know `torch` is fully
uninstalled when you see `WARNING: Skipping torch as it is not
installed`. (You should only have to `pip uninstall` a few times, but
you can always `uninstall` with `timeout` or in a loop if you're feeling
lazy.)
```bash
conda uninstall pytorch
pip uninstall torch
pip uninstall torch # run this command twice
conda -y uninstall pytorch
yes | pip uninstall torch
```
2. Clone a copy of PyTorch from source:
@ -134,8 +140,10 @@ For example:
You do not need to repeatedly install after modifying Python files (`.py`). However, you would need to reinstall
if you modify Python interface (`.pyi`, `.pyi.in`) or non-Python files (`.cpp`, `.cc`, `.cu`, `.h`, ...).
In case you want to reinstall, make sure that you uninstall PyTorch first by running `pip uninstall torch`
and `python setup.py clean`. Then you can install in `develop` mode again.
In case you want to reinstall, make sure that you uninstall PyTorch
first by running `pip uninstall torch` until you see `WARNING: Skipping
torch as it is not installed`; next run `python setup.py clean`. After
that, you can install in `develop` mode again.
### Tips and Debugging
* A prerequisite to installing PyTorch is CMake. We recommend installing it with [Homebrew](https://brew.sh/)
@ -902,7 +910,7 @@ tensor([1., 2., 3., 4.], dtype=torch.float64)
```
GDB tries to automatically load `pytorch-gdb` thanks to the
[.gdbinit](.gdbinit) at the root of the pytorch repo. Howevever, auto-loadings is disabled by default, because of security reasons:
[.gdbinit](.gdbinit) at the root of the pytorch repo. However, auto-loadings is disabled by default, because of security reasons:
```
$ gdb

View File

@ -1,7 +1,8 @@
# This makefile does nothing but delegating the actual building to cmake.
PYTHON = python3
all:
@mkdir -p build && cd build && cmake .. $(shell python ./scripts/get_python_cmake_flags.py) && $(MAKE)
@mkdir -p build && cd build && cmake .. $(shell $(PYTHON) ./scripts/get_python_cmake_flags.py) && $(MAKE)
local:
@./scripts/build_local.sh
@ -28,16 +29,35 @@ shellcheck-gha:
tools/run_shellcheck.sh $(SHELLCHECK_GHA_GENERATED_FOLDER)
generate-gha-workflows:
./.github/scripts/generate_linux_ci_workflows.py
.github/scripts/generate_ci_workflows.py
$(MAKE) shellcheck-gha
shellcheck:
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'shellcheck' \
--step "Regenerate workflows"
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'shellcheck' \
--step "Assert that regenerating the workflows didn't change them"
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'shellcheck' \
--step 'Extract scripts from GitHub Actions workflows'
@$(PYTHON) tools/actions_local_runner.py \
$(CHANGED_ONLY) \
--job 'shellcheck'
setup_lint:
python tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'flake8-py3' --step 'Install dependencies' --no-quiet
python tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'cmakelint' --step 'Install dependencies' --no-quiet
python tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'mypy' --step 'Install dependencies' --no-quiet
$(PYTHON) tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'flake8-py3' --step 'Install dependencies' --no-quiet
$(PYTHON) tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'cmakelint' --step 'Install dependencies' --no-quiet
$(PYTHON) tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'mypy' --step 'Install dependencies' --no-quiet
$(PYTHON) tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'shellcheck' --step 'Install Jinja2' --no-quiet
@if [ "$$(uname)" = "Darwin" ]; then \
if [ -z "$$(which brew)" ]; then \
@ -46,20 +66,15 @@ setup_lint:
fi; \
brew install shellcheck; \
else \
python tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'quick-checks' --step 'Install ShellCheck' --no-quiet; \
$(PYTHON) tools/actions_local_runner.py --file .github/workflows/lint.yml \
--job 'shellcheck' --step 'Install ShellCheck' --no-quiet; \
fi
pip install jinja2
quick_checks:
@python tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'quick-checks' \
--step 'Extract scripts from GitHub Actions workflows'
# TODO: This is broken when 'git config submodule.recurse' is 'true' since the
# lints will descend into third_party submodules
@python tools/actions_local_runner.py \
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'quick-checks' \
--step 'Ensure no trailing spaces' \
@ -70,23 +85,20 @@ quick_checks:
--step 'Ensure no unqualified noqa' \
--step 'Ensure no unqualified type ignore' \
--step 'Ensure no direct cub include' \
--step 'Run ShellCheck' \
--step 'Ensure correct trailing newlines'
flake8:
@python tools/actions_local_runner.py \
--file-filter '.py' \
@$(PYTHON) tools/actions_local_runner.py \
$(CHANGED_ONLY) \
--job 'flake8-py3'
mypy:
@python tools/actions_local_runner.py \
--file-filter '.py' \
@$(PYTHON) tools/actions_local_runner.py \
$(CHANGED_ONLY) \
--job 'mypy'
cmakelint:
@python tools/actions_local_runner.py \
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'cmakelint' \
--step 'Run cmakelint'
@ -96,12 +108,12 @@ clang_tidy:
exit 1
toc:
@python tools/actions_local_runner.py \
@$(PYTHON) tools/actions_local_runner.py \
--file .github/workflows/lint.yml \
--job 'toc' \
--step "Regenerate ToCs and check that they didn't change"
lint: flake8 mypy quick_checks cmakelint generate-gha-workflows
lint: flake8 mypy quick_checks cmakelint shellcheck
quicklint: CHANGED_ONLY=--changed-only
quicklint: mypy flake8 mypy quick_checks cmakelint generate-gha-workflows
quicklint: mypy flake8 mypy quick_checks cmakelint shellcheck

View File

@ -48,7 +48,7 @@ You can reuse your favorite Python packages such as NumPy, SciPy, and Cython to
| Linux (ppc64le) GPU | <center></center> | [![Build Status](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/badge/icon)](https://powerci.osuosl.org/job/pytorch-master-nightly-py3-linux-ppc64le-gpu/) | <center></center> |
| Linux (aarch64) CPU | [![Build Status](http://openlabtesting.org:15000/badge?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py36)](https://status.openlabtesting.org/builds/builds?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py36) | [![Build Status](http://openlabtesting.org:15000/badge?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py37)](https://status.openlabtesting.org/builds/builds?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py37) | [![Build Status](http://openlabtesting.org:15000/badge?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py38)](https://status.openlabtesting.org/builds/builds?project=pytorch%2Fpytorch&job_name=pytorch-arm64-build-daily-master-py38) |
See also the [ci.pytorch.org HUD](https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master).
See also the [ci.pytorch.org HUD](https://hud.pytorch.org/build2/pytorch-master).
## More About PyTorch
@ -270,13 +270,13 @@ Sometimes there are regressions in new versions of Visual Studio, so
it's best to use the same Visual Studio Version [16.8.5](https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/vs_install.ps1) as Pytorch CI's.
You can use Visual Studio Enterprise, Professional or Community though PyTorch CI uses Visual Studio BuildTools.
If you want to build legacy python code, please refert to [Building on legacy code and CUDA](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#building-on-legacy-code-and-cuda)
If you want to build legacy python code, please refer to [Building on legacy code and CUDA](https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#building-on-legacy-code-and-cuda)
Build with CPU
It's fairly easy to build with CPU.
Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the buliding environment by tweaking `CMAKE_INCLUDE_PATH` and `LIB`. The instruction [here](https://github.com/pytorch/pytorch/blob/master/docs/source/notes/windows.rst#building-from-source) is an example for setting up both MKL and Intel OpenMP. Without these configuraions for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used.
Note on OpenMP: The desired OpenMP implementation is Intel OpenMP (iomp). In order to link against iomp, you'll need to manually download the library and set up the building environment by tweaking `CMAKE_INCLUDE_PATH` and `LIB`. The instruction [here](https://github.com/pytorch/pytorch/blob/master/docs/source/notes/windows.rst#building-from-source) is an example for setting up both MKL and Intel OpenMP. Without these configurations for CMake, Microsoft Visual C OpenMP runtime (vcomp) will be used.
Build with CUDA
@ -353,7 +353,7 @@ should increase shared memory size either with `--ipc=host` or `--shm-size` comm
**NOTE:** Must be built with a docker version > 18.06
The `Dockerfile` is supplied to build images with Cuda support and cuDNN v7.
The `Dockerfile` is supplied to build images with CUDA 11.1 support and cuDNN v8.
You can pass `PYTHON_VERSION=x.y` make variable to specify which Python version is to be used by Miniconda, or leave it
unset to use the default.
```bash

View File

@ -37,7 +37,7 @@ An example of this would look like:
release/1.8
```
Please make sure to create branch that pins divergent point of release branch from the main thunk, i.e. `orig/release/{MAJOR}.{MINOR}`
Please make sure to create branch that pins divergent point of release branch from the main branch, i.e. `orig/release/{MAJOR}.{MINOR}`
### Making release branch specific changes
These are examples of changes that should be made to release branches so that CI / tooling can function normally on
@ -51,8 +51,13 @@ them:
* Example: https://github.com/pytorch/pytorch/pull/40706
* Add `release/{MAJOR}.{MINOR}` to list of branches in [`browser-extension.json`](https://github.com/pytorch/pytorch/blob/fb-config/browser-extension.json) for FaceHub integrated setups
* Example: https://github.com/pytorch/pytorch/commit/f99fbd94d18627bae776ea2448e075ca4d5e37b2
* A release branch should also be created in [`pytorch/builder`](https://github.com/pytorch/builder) repo and pinned in `pytorch/pytorch`
* Example: https://github.com/pytorch/pytorch/pull/58514
> TODO: Create release branch in [`pytorch/builder`](https://github.com/pytorch/builder) repo and pin release CI to use that branch rather than HEAD of builder repo.
These are examples of changes that should be made to the *default* branch after a release branch is cut
* Nightly versions should be updated in all version files to the next MINOR release (i.e. 0.9.0 -> 0.10.0) in the default branch:
* Example: https://github.com/pytorch/pytorch/pull/51891
### Getting CI signal on release branches:
Create a PR from `release/{MAJOR}.{MINOR}` to `orig/release/{MAJOR}.{MINOR}` in order to start CI testing for cherry-picks into release branch.

View File

@ -34,8 +34,8 @@ repositories {
dependencies {
...
implementation 'org.pytorch:pytorch_android:1.9.0-SNAPSHOT'
implementation 'org.pytorch:pytorch_android_torchvision:1.9.0-SNAPSHOT'
implementation 'org.pytorch:pytorch_android:1.10.0-SNAPSHOT'
implementation 'org.pytorch:pytorch_android_torchvision:1.10.0-SNAPSHOT'
...
}
```
@ -95,13 +95,12 @@ dependencies {
implementation(name:'pytorch_android', ext:'aar')
implementation(name:'pytorch_android_torchvision', ext:'aar')
...
implementation 'com.android.support:appcompat-v7:28.0.0'
implementation 'com.facebook.soloader:nativeloader:0.8.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.0.3'
}
```
We also have to add all transitive dependencies of our aars.
As `pytorch_android` [depends](https://github.com/pytorch/pytorch/blob/master/android/pytorch_android/build.gradle#L62-L63) on `'com.android.support:appcompat-v7:28.0.0'`, `'com.facebook.soloader:nativeloader:0.8.0'` and 'com.facebook.fbjni:fbjni-java-only:0.0.3', we need to add them.
As `pytorch_android` [depends](https://github.com/pytorch/pytorch/blob/master/android/pytorch_android/build.gradle#L76-L77) on `'com.facebook.soloader:nativeloader:0.8.0'` and `'com.facebook.fbjni:fbjni-java-only:0.0.3'`, we need to add them.
(In case of using maven dependencies they are added automatically from `pom.xml`).
You can check out [test app example](https://github.com/pytorch/pytorch/blob/master/android/test_app/app/build.gradle) that uses aars directly.

View File

@ -12,7 +12,6 @@ allprojects {
rulesVersion = "1.2.0"
junitVersion = "4.12"
androidSupportAppCompatV7Version = "28.0.0"
fbjniJavaOnlyVersion = "0.0.3"
soLoaderNativeLoaderVersion = "0.8.0"
}

View File

@ -1,6 +1,6 @@
ABI_FILTERS=armeabi-v7a,arm64-v8a,x86,x86_64
VERSION_NAME=1.9.0-SNAPSHOT
VERSION_NAME=1.10.0-SNAPSHOT
GROUP=org.pytorch
MAVEN_GROUP=org.pytorch
SONATYPE_STAGING_PROFILE=orgpytorch

View File

@ -74,7 +74,6 @@ android {
dependencies {
implementation 'com.facebook.fbjni:fbjni-java-only:' + rootProject.fbjniJavaOnlyVersion
implementation 'com.android.support:appcompat-v7:' + rootProject.androidSupportAppCompatV7Version
implementation 'com.facebook.soloader:nativeloader:' + rootProject.soLoaderNativeLoaderVersion
testImplementation 'junit:junit:' + rootProject.junitVersion

View File

@ -42,7 +42,6 @@ android {
dependencies {
implementation project(':pytorch_android')
implementation 'com.android.support:appcompat-v7:' + rootProject.androidSupportAppCompatV7Version
implementation 'com.facebook.soloader:nativeloader:' + rootProject.soLoaderNativeLoaderVersion
testImplementation 'junit:junit:' + rootProject.junitVersion

View File

@ -149,8 +149,8 @@ dependencies {
//nativeBuildImplementation(name: 'pytorch_android_torchvision-release', ext: 'aar')
//extractForNativeBuild(name: 'pytorch_android-release', ext: 'aar')
nightlyImplementation 'org.pytorch:pytorch_android:1.9.0-SNAPSHOT'
nightlyImplementation 'org.pytorch:pytorch_android_torchvision:1.9.0-SNAPSHOT'
nightlyImplementation 'org.pytorch:pytorch_android:1.10.0-SNAPSHOT'
nightlyImplementation 'org.pytorch:pytorch_android_torchvision:1.10.0-SNAPSHOT'
aarImplementation(name:'pytorch_android', ext:'aar')
aarImplementation(name:'pytorch_android_torchvision', ext:'aar')

View File

@ -120,7 +120,7 @@ set(ATen_HIP_TEST_SRCS ${ATen_HIP_TEST_SRCS} PARENT_SCOPE)
set(ATen_VULKAN_TEST_SRCS ${ATen_VULKAN_TEST_SRCS} PARENT_SCOPE)
set(ATen_MOBILE_BENCHMARK_SRCS ${ATen_MOBILE_BENCHMARK_SRCS} PARENT_SCOPE)
set(ATen_MOBILE_TEST_SRCS ${ATen_MOBILE_TEST_SRCS} PARENT_SCOPE)
set(ATen_VEC256_TEST_SRCS ${ATen_VEC256_TEST_SRCS} PARENT_SCOPE)
set(ATen_VEC_TEST_SRCS ${ATen_VEC_TEST_SRCS} PARENT_SCOPE)
set(ATen_CPU_INCLUDE ${ATen_CPU_INCLUDE} PARENT_SCOPE)
set(ATen_CUDA_INCLUDE ${ATen_CUDA_INCLUDE} PARENT_SCOPE)
set(ATen_HIP_INCLUDE ${ATen_HIP_INCLUDE} PARENT_SCOPE)

View File

@ -87,8 +87,6 @@ static void warnFallback(const c10::FunctionSchema& schema, bool is_inplace) {
// the operator, and then pop the results off the stack.
void batchedTensorInplaceForLoopFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
const auto& schema = op.schema();
// NOLINTNEXTLINE(clang-analyzer-deadcode.DeadStores,clang-diagnostic-unused-variable)
const auto num_returns = schema.returns().size();
warnFallback(schema, /*in_place*/true);
const auto num_arguments = schema.arguments().size();

View File

@ -2,6 +2,7 @@
#include <ATen/WrapDimUtils.h>
#include <c10/util/Exception.h>
#include <c10/util/irange.h>
namespace at {
@ -23,8 +24,7 @@ BatchedTensorImpl::BatchedTensorImpl(Tensor value, BatchDims bdims)
const auto value_sizes = value_.sizes();
const auto value_strides = value_.strides();
sizes_and_strides_.resize(public_dims);
// NOLINTNEXTLINE(clang-diagnostic-sign-compare)
for (int64_t dim = 0; dim < public_dims; dim++) {
for (const auto dim : c10::irange(public_dims)) {
auto actual_dim = actualDim(dim, /*wrap_dim=*/false);
sizes_and_strides_.size_at_unchecked(dim) = value_sizes.at(actual_dim);
sizes_and_strides_.stride_at_unchecked(dim) = value_strides.at(actual_dim);
@ -51,7 +51,7 @@ int64_t BatchedTensorImpl::actualDim(int64_t dim, bool wrap_dim) const {
// but it might require newer (>= ~2015) CPUs. We should clean this up
// if/when we have dropped support for older CPUs.
int64_t non_bdim_count = 0;
for (int64_t actual_dim = 0; actual_dim < kVmapMaxTensorDims; actual_dim++) {
for (const auto actual_dim : c10::irange(kVmapMaxTensorDims)) {
if (is_bdim[actual_dim]) {
continue;
}
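
The hunks above swap raw `for (int64_t i = 0; ...)` loops for `c10::irange`, made available by the new `#include <c10/util/irange.h>`. A minimal standalone sketch of the idiom (the variable name below is illustrative, not from the PR):

```cpp
#include <c10/util/irange.h>
#include <cstdint>
#include <iostream>

int main() {
  const int64_t public_dims = 4;
  // c10::irange(n) yields 0, 1, ..., n-1 using n's own integer type, so the
  // loop no longer needs the NOLINT(clang-diagnostic-sign-compare) annotation
  // that the raw loop carried.
  for (const auto dim : c10::irange(public_dims)) {
    std::cout << dim << ' ';
  }
  return 0;
}
```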

View File

@ -1031,7 +1031,6 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) {
m.impl("sum.dim_IntList", sum_batching_rule);
m.impl("is_complex", native::is_complex);
m.impl("conj", native::conj);
// inplace operations
m.impl("fill_.Scalar", fill_inplace_scalar_batching_rule);
@ -1085,7 +1084,7 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) {
UNARY_POINTWISE(ceil);
UNARY_POINTWISE(cos);
UNARY_POINTWISE(cosh);
UNARY_POINTWISE(_conj);
UNARY_POINTWISE(conj_physical);
UNARY_POINTWISE(digamma);
UNARY_POINTWISE(exp);
UNARY_POINTWISE(expm1);
@ -1144,10 +1143,10 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) {
BINARY_POINTWISE(mul);
BINARY_POINTWISE(div);
{
using Binop = Tensor (*)(const Tensor&, const Tensor&, c10::optional<std::string>);
using Unop = Tensor (*)(const Tensor&, const Scalar&, c10::optional<std::string>);
m.impl("div.Tensor_mode", binary_pointwise_batching_rule<Binop, at::div, c10::optional<std::string>>);
m.impl("div.Scalar_mode", unwrap_and_call<Unop, at::div, const Scalar&, c10::optional<std::string>>);
using Binop = Tensor (*)(const Tensor&, const Tensor&, c10::optional<c10::string_view>);
using Unop = Tensor (*)(const Tensor&, const Scalar&, c10::optional<c10::string_view>);
m.impl("div.Tensor_mode", binary_pointwise_batching_rule<Binop, at::div, c10::optional<c10::string_view>>);
m.impl("div.Scalar_mode", unwrap_and_call<Unop, at::div, const Scalar&, c10::optional<c10::string_view>>);
}
// at::pow has three out-of-place overloads
@ -1181,6 +1180,10 @@ TORCH_LIBRARY_IMPL(aten, Batched, m) {
TRIVIAL_OP(imag)
TRIVIAL_OP(real);
TRIVIAL_OP(view_as_real);
TRIVIAL_OP(_view_as_real_physical);
TRIVIAL_OP(conj);
TRIVIAL_OP(_conj);
TRIVIAL_OP(resolve_conj);
m.impl("view_as_complex", view_as_complex_batching_rule);
#undef TRIVIAL

View File

@ -41,7 +41,7 @@ if(NOT BUILD_LITE_INTERPRETER)
endif()
EXCLUDE(ATen_CORE_SRCS "${ATen_CORE_SRCS}" ${ATen_CORE_TEST_SRCS})
file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec256/*.h" "quantized/*.h")
file(GLOB base_h "*.h" "detail/*.h" "cpu/*.h" "cpu/vec/vec256/*.h" "cpu/vec/*.h" "quantized/*.h")
file(GLOB base_cpp "*.cpp" "detail/*.cpp" "cpu/*.cpp")
file(GLOB cuda_h "cuda/*.h" "cuda/detail/*.h" "cuda/*.cuh" "cuda/detail/*.cuh")
file(GLOB cuda_cpp "cuda/*.cpp" "cuda/detail/*.cpp")
@ -309,6 +309,9 @@ if(NOT MSVC AND NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
set(BUILD_GNUABI_LIBS OFF CACHE BOOL "Don't build sleef gnuabi libs" FORCE)
set(BUILD_TESTS OFF CACHE BOOL "Don't build sleef tests" FORCE)
set(OLD_CMAKE_BUILD_TYPE ${CMAKE_BUILD_TYPE})
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64" AND CMAKE_SYSTEM_NAME STREQUAL "Darwin")
set(DISABLE_SVE ON CACHE BOOL "Xcode's clang-12.5 crashes while trying to compile SVE code" FORCE)
endif()
if("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU" AND
CMAKE_C_COMPILER_VERSION VERSION_GREATER 6.9 AND CMAKE_C_COMPILER_VERSION VERSION_LESS 8)
set(GCC_7 True)
@ -497,7 +500,7 @@ set(ATen_HIP_TEST_SRCS ${ATen_HIP_TEST_SRCS} PARENT_SCOPE)
set(ATen_VULKAN_TEST_SRCS ${ATen_VULKAN_TEST_SRCS} PARENT_SCOPE)
set(ATen_MOBILE_BENCHMARK_SRCS ${ATen_MOBILE_BENCHMARK_SRCS} PARENT_SCOPE)
set(ATen_MOBILE_TEST_SRCS ${ATen_MOBILE_TEST_SRCS} ${ATen_VULKAN_TEST_SRCS} PARENT_SCOPE)
set(ATen_VEC256_TEST_SRCS ${ATen_VEC256_TEST_SRCS} PARENT_SCOPE)
set(ATen_VEC_TEST_SRCS ${ATen_VEC_TEST_SRCS} PARENT_SCOPE)
set(ATen_QUANTIZED_TEST_SRCS ${ATen_QUANTIZED_TEST_SRCS} PARENT_SCOPE)
set(ATen_CPU_INCLUDE ${ATen_CPU_INCLUDE} PARENT_SCOPE)
set(ATen_THIRD_PARTY_INCLUDE ${ATen_THIRD_PARTY_INCLUDE} PARENT_SCOPE)

View File

@ -0,0 +1,152 @@
#include <ATen/ATen.h>
#include <ATen/core/op_registration/op_registration.h>
#include <torch/library.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/native/UnaryOps.h>
#include <ATen/NativeFunctions.h>
namespace at {
void conjugateFallback(const c10::OperatorHandle& op, DispatchKeySet dispatch_keys, torch::jit::Stack* stack) {
// Situations to handle:
// 1. Out-of-place operation. Easy: materialize all inputs and
// call it a day.
// 2. Inplace operation. Desugar x.add_(2) into x.conj_().add_(2).conj_().
// Materialize other inputs as in (1).
// 3. out= operation. Desugar add(x, 2, out=y) into y.copy_(add(x, 2))
// Materialize other inputs as in (1).
//
// It is important to be able to tell if we READ from an argument and if we
// WRITE from an argument. Conservative approach is to assume that we always
// READ from an argument, but in out-of-place operations you can skip
// conjugating inputs on entry that never get used. In current schema we
// can't easily tell if inplace situation has happened, so don't do it.
const auto& arguments = op.schema().arguments();
const auto num_arguments = arguments.size();
const auto stack_start = stack->size() - num_arguments;
c10::optional<bool> is_write;
for (int64_t i = 0; i < num_arguments; ++i) {
const auto& alias_info = arguments[i].alias_info();
// Three possible states:
// 1. alias_info has no value --> out-of-place operation
// 2. alias_info does have a value, alias_info->is_write=True --> in-place or out= operation
// 3. alias_info does have a value, alias_info->is_write=False --> view operation
if (alias_info.has_value()) {
if (is_write.has_value()) {
TORCH_CHECK(*is_write == alias_info->isWrite(),
"Unsupported operator for conjugate fallback: ", op.schema().name(),
"Conjugate fallback doesn't work for operators with a mix "
"mutable and non-mutable inputs that alias with outputs, "
"this must be implemented manually. "
"If you got this error on a core op, please report a bug to PyTorch.");
} else {
is_write = alias_info->isWrite();
}
}
}
if (is_write.has_value() && !*is_write) {
// We assume that view operators automatically handle conjugation
// correctly by propagating the Conjugate dispatch key in key_set.
// This is not necessarily always right, so you should test these cases.
op.redispatchBoxed(dispatch_keys & c10::DispatchKeySet(DispatchKeySet::FULL_AFTER, DispatchKey::Conjugate), stack);
return;
}
// Mutable inputs to be tracked separately
std::vector<Tensor> mutable_inputs;
for (int64_t i = 0; i < num_arguments; ++i) {
auto& ivalue = (*stack)[stack_start + i];
if (!(ivalue.isTensor() || ivalue.isTensorList())) {
continue;
}
const auto& argument = arguments[i];
bool mut_arg = false;
if (argument.alias_info()) {
// View operations were already filtered above, so only in-place/out= operations should get here.
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(argument.alias_info()->isWrite());
mut_arg = true;
}
if (ivalue.isTensor()) {
auto* impl = ivalue.unsafeToTensorImpl();
if (!impl->is_conj()) {
continue;
}
auto tensor = std::move(ivalue).toTensor();
TORCH_CHECK_NOT_IMPLEMENTED(!tensor.is_meta(), "Conjugate Fallback does not support meta tensors.");
if (mut_arg) {
// TODO: This is a waste if the argument is write only
tensor._set_conj(false);
at::conj_physical_(tensor);
mutable_inputs.emplace_back(tensor);
} else {
tensor = at::resolve_conj(tensor);
}
(*stack)[stack_start + i] = std::move(tensor);
} else if (ivalue.isTensorList()) {
auto tensors = std::move(ivalue).toTensorList();
if (mut_arg) {
for(const auto j : c10::irange(tensors.size())) {
Tensor t = tensors[j];
t._set_conj(false);
at::conj_physical_(t);
mutable_inputs.emplace_back(t);
}
} else {
for(const auto j : c10::irange(tensors.size())) {
tensors[j] = at::resolve_conj(tensors[j]);
}
}
(*stack)[stack_start + i] = std::move(tensors);
}
}
op.redispatchBoxed(dispatch_keys & c10::DispatchKeySet(DispatchKeySet::FULL_AFTER, DispatchKey::Conjugate), stack);
for (auto& mutable_input : mutable_inputs) {
at::conj_physical_(mutable_input);
mutable_input._set_conj(true);
}
}
TORCH_LIBRARY_IMPL(_, Conjugate, m) {
m.fallback(torch::CppFunction::makeFromBoxedFunction<&conjugateFallback>());
}
TORCH_LIBRARY_IMPL(aten, Conjugate, m) {
m.impl("requires_grad_", torch::CppFunction::makeFallthrough());
m.impl("set_.source_Storage_storage_offset", torch::CppFunction::makeFallthrough());
m.impl("set_.source_Tensor", torch::CppFunction::makeFallthrough());
m.impl("set_", torch::CppFunction::makeFallthrough());
m.impl("copy_", torch::CppFunction::makeFallthrough());
m.impl("clone", torch::CppFunction::makeFallthrough());
m.impl("conj", torch::CppFunction::makeFallthrough());
m.impl("_conj", torch::CppFunction::makeFallthrough());
m.impl("_conj_physical", torch::CppFunction::makeFallthrough());
m.impl("conj_physical", torch::CppFunction::makeFallthrough());
m.impl("conj_physical_", torch::CppFunction::makeFallthrough());
m.impl("resolve_conj", torch::CppFunction::makeFallthrough());
m.impl("empty_like", torch::CppFunction::makeFallthrough());
m.impl("empty.memory_format", torch::CppFunction::makeFallthrough());
m.impl("empty.out", torch::CppFunction::makeFallthrough());
m.impl("empty_strided", torch::CppFunction::makeFallthrough());
m.impl("full_like", torch::CppFunction::makeFallthrough());
m.impl("stride.int", torch::CppFunction::makeFallthrough());
m.impl("stride.Dimname", torch::CppFunction::makeFallthrough());
m.impl("size.int", torch::CppFunction::makeFallthrough());
m.impl("size.Dimname", torch::CppFunction::makeFallthrough());
m.impl("is_complex", torch::CppFunction::makeFallthrough());
m.impl("_view_as_real_physical", torch::CppFunction::makeFallthrough());
m.impl("view_as_real", torch::CppFunction::makeFallthrough());
m.impl("imag", torch::CppFunction::makeFallthrough());
m.impl("real", torch::CppFunction::makeFallthrough());
m.impl("view", torch::CppFunction::makeFallthrough());
m.impl("reshape", torch::CppFunction::makeFallthrough());
}
} // namespace at
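
For readers skimming this new fallback, here is a hedged sketch of the user-visible behaviour it supports, assuming the Tensor-level `conj()` / `is_conj()` / `resolve_conj()` API from this stack is available:

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  auto x = at::randn({3}, at::kComplexFloat);

  // conj() only flips the conjugate bit; no data is copied at this point.
  auto xc = x.conj();
  std::cout << "lazy conj view? " << xc.is_conj() << "\n";   // 1

  // resolve_conj() materializes the conjugation into an ordinary tensor.
  // This is what conjugateFallback does for the inputs of any operator that
  // has no conjugate-aware kernel and is not registered as a fallthrough above.
  auto materialized = at::resolve_conj(xc);
  std::cout << "still flagged? " << materialized.is_conj() << "\n";  // 0
  return 0;
}
```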

View File

@ -106,6 +106,9 @@ inline constexpr bool should_include_kernel_dtype(
int bit_width = bitwidth; \
int64_t quant_min = qmin; \
int64_t quant_max = qmax; \
(void)bit_width; /* Suppress unused variable warning */ \
(void)quant_min; /* Suppress unused variable warning */ \
(void)quant_max; /* Suppress unused variable warning */ \
return __VA_ARGS__(); \
}
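
The three `(void)` casts added above are the usual C++ idiom for marking names bound inside a macro expansion as intentionally unused. The same pattern outside the macro, as a self-contained sketch:

```cpp
// Compile with -Wunused-parameter -Wunused-variable to see the warnings
// the casts suppress; the casts themselves have no runtime effect.
void configure(int bit_width) {
  long quant_min = 0;
  long quant_max = 255;
  (void)bit_width;  // silences -Wunused-parameter
  (void)quant_min;  // silences -Wunused-variable
  (void)quant_max;  // the expression is evaluated and the value discarded
}

int main() {
  configure(8);
  return 0;
}
```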

View File

@ -35,253 +35,6 @@ namespace {
}
}
Tensor & _th_nonzero_out(const Tensor & self, Tensor & result) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
switch (dispatch_scalar_type) {
case ScalarType::Bool: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THBoolTensor_nonzero(result_, self_);
break;
}
case ScalarType::Byte: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THByteTensor_nonzero(result_, self_);
break;
}
case ScalarType::Char: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THCharTensor_nonzero(result_, self_);
break;
}
case ScalarType::Double: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THDoubleTensor_nonzero(result_, self_);
break;
}
case ScalarType::Float: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THFloatTensor_nonzero(result_, self_);
break;
}
case ScalarType::Int: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THIntTensor_nonzero(result_, self_);
break;
}
case ScalarType::Long: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THLongTensor_nonzero(result_, self_);
break;
}
case ScalarType::Short: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THShortTensor_nonzero(result_, self_);
break;
}
case ScalarType::Half: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THHalfTensor_nonzero(result_, self_);
break;
}
case ScalarType::BFloat16: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THBFloat16Tensor_nonzero(result_, self_);
break;
}
case ScalarType::ComplexDouble: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THComplexDoubleTensor_nonzero(result_, self_);
break;
}
case ScalarType::ComplexFloat: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_nonzero_out", false, DeviceType::CPU, ScalarType::Long);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero_out", false, DeviceType::CPU, dispatch_scalar_type);
THComplexFloatTensor_nonzero(result_, self_);
break;
}
default:
AT_ERROR("_th_nonzero_out not supported on CPUType for ", dispatch_scalar_type);
}
return result;
}
Tensor _th_nonzero(const Tensor & self) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
auto result_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(c10::Storage::use_byte_size_t(), 0, allocator(), true),DispatchKey::CPU, scalarTypeToTypeMeta(ScalarType::Long)).release();
auto result = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(result_));
switch (dispatch_scalar_type) {
case ScalarType::Bool: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THBoolTensor_nonzero(result_, self_);
break;
}
case ScalarType::Byte: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THByteTensor_nonzero(result_, self_);
break;
}
case ScalarType::Char: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THCharTensor_nonzero(result_, self_);
break;
}
case ScalarType::Double: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THDoubleTensor_nonzero(result_, self_);
break;
}
case ScalarType::Float: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THFloatTensor_nonzero(result_, self_);
break;
}
case ScalarType::Int: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THIntTensor_nonzero(result_, self_);
break;
}
case ScalarType::Long: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THLongTensor_nonzero(result_, self_);
break;
}
case ScalarType::Short: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THShortTensor_nonzero(result_, self_);
break;
}
case ScalarType::Half: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THHalfTensor_nonzero(result_, self_);
break;
}
case ScalarType::BFloat16: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THBFloat16Tensor_nonzero(result_, self_);
break;
}
case ScalarType::ComplexDouble: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THComplexDoubleTensor_nonzero(result_, self_);
break;
}
case ScalarType::ComplexFloat: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_nonzero", false, DeviceType::CPU, dispatch_scalar_type);
THComplexFloatTensor_nonzero(result_, self_);
break;
}
default:
AT_ERROR("_th_nonzero not supported on CPUType for ", dispatch_scalar_type);
}
return result;
}
Scalar _th_std_var(const Tensor& self, int64_t correction, bool take_sqrt) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_var", false, DeviceType::CPU, dispatch_scalar_type);
return convert<double>(THDoubleTensor_std_var_all(self_, correction, take_sqrt));
break;
}
case ScalarType::Float: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_var", false, DeviceType::CPU, dispatch_scalar_type);
return convert<float>(THFloatTensor_std_var_all(self_, correction, take_sqrt));
break;
}
default:
AT_ERROR("_th_var not supported on CPUType for ", dispatch_scalar_type);
}
}
Tensor & _th_renorm_out(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm, Tensor & result) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_renorm_out", false, DeviceType::CPU, dispatch_scalar_type);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm_out", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toDouble();
auto maxnorm_ = maxnorm.toDouble();
THDoubleTensor_renorm(result_, self_, p_, dim, maxnorm_);
break;
}
case ScalarType::Float: {
auto result_ = checked_dense_tensor_unwrap(result, "result", 0, "_th_renorm_out", false, DeviceType::CPU, dispatch_scalar_type);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm_out", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toFloat();
auto maxnorm_ = maxnorm.toFloat();
THFloatTensor_renorm(result_, self_, p_, dim, maxnorm_);
break;
}
default:
AT_ERROR("_th_renorm_out not supported on CPUType for ", dispatch_scalar_type);
}
return result;
}
Tensor _th_renorm(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
auto result_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(c10::Storage::use_byte_size_t(), 0, allocator(), true),DispatchKey::CPU, scalarTypeToTypeMeta(dispatch_scalar_type)).release();
auto result = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(result_));
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toDouble();
auto maxnorm_ = maxnorm.toDouble();
THDoubleTensor_renorm(result_, self_, p_, dim, maxnorm_);
break;
}
case ScalarType::Float: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toFloat();
auto maxnorm_ = maxnorm.toFloat();
THFloatTensor_renorm(result_, self_, p_, dim, maxnorm_);
break;
}
default:
AT_ERROR("_th_renorm not supported on CPUType for ", dispatch_scalar_type);
}
return result;
}
Tensor & _th_renorm_(Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm_", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toDouble();
auto maxnorm_ = maxnorm.toDouble();
THDoubleTensor_renorm(self_, self_, p_, dim, maxnorm_);
break;
}
case ScalarType::Float: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_renorm_", false, DeviceType::CPU, dispatch_scalar_type);
auto p_ = p.toFloat();
auto maxnorm_ = maxnorm.toFloat();
THFloatTensor_renorm(self_, self_, p_, dim, maxnorm_);
break;
}
default:
AT_ERROR("_th_renorm_ not supported on CPUType for ", dispatch_scalar_type);
}
return self;
}
Tensor & _th_histc_out(const Tensor & self, int64_t bins, const Scalar& min, const Scalar& max, Tensor & result) {
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
@ -334,84 +87,6 @@ Tensor _th_histc(const Tensor & self, int64_t bins, const Scalar& min, const Sca
return result;
}
std::tuple<Tensor &,Tensor &> _th_gels_out(const Tensor & self, const Tensor & A, Tensor & res1, Tensor & res2) {
TORCH_WARN_ONCE(
"torch.lstsq is deprecated in favor of torch.linalg.lstsq and will be removed in a future PyTorch release.\n",
"torch.linalg.lstsq has reversed arguments and does not return the QR decomposition in "
"the returned tuple (although it returns other information about the problem).\n",
"To get the qr decomposition consider using torch.linalg.qr.\n",
"The returned solution in torch.lstsq stored the residuals of the solution in the ",
"last m - n columns of the returned value whenever m > n. In torch.linalg.lstsq, the ",
"residuals in the field 'residuals' of the returned named tuple.\n",
"The unpacking of the solution, as in\n",
"X, _ = torch.lstsq(B, A).solution[:A.size(1)]\n",
"should be replaced with\n",
"X = torch.linalg.lstsq(A, B).solution"
);
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto res1_ = checked_dense_tensor_unwrap(res1, "res1", 0, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto res2_ = checked_dense_tensor_unwrap(res2, "res2", 0, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto A_ = checked_dense_tensor_unwrap(A, "A", 2, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
THDoubleTensor_gels(res1_, res2_, self_, A_);
break;
}
case ScalarType::Float: {
auto res1_ = checked_dense_tensor_unwrap(res1, "res1", 0, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto res2_ = checked_dense_tensor_unwrap(res2, "res2", 0, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
auto A_ = checked_dense_tensor_unwrap(A, "A", 2, "_th_gels_out", false, DeviceType::CPU, dispatch_scalar_type);
THFloatTensor_gels(res1_, res2_, self_, A_);
break;
}
default:
AT_ERROR("_th_gels_out not supported on CPUType for ", dispatch_scalar_type);
}
return std::tuple<Tensor &, Tensor &>(res1, res2);
}
std::tuple<Tensor,Tensor> _th_gels(const Tensor & self, const Tensor & A) {
TORCH_WARN_ONCE(
"torch.lstsq is deprecated in favor of torch.linalg.lstsq and will be removed in a future PyTorch release.\n",
"torch.linalg.lstsq has reversed arguments and does not return the QR decomposition in "
"the returned tuple (although it returns other information about the problem).\n",
"To get the qr decomposition consider using torch.linalg.qr.\n",
"The returned solution in torch.lstsq stored the residuals of the solution in the ",
"last m - n columns of the returned value whenever m > n. In torch.linalg.lstsq, the ",
"residuals in the field 'residuals' of the returned named tuple.\n",
"The unpacking of the solution, as in\n",
"X, _ = torch.lstsq(B, A).solution[:A.size(1)]\n",
"should be replaced with\n",
"X = torch.linalg.lstsq(A, B).solution"
);
// DeviceGuard omitted
auto dispatch_scalar_type = infer_scalar_type(self);
auto res1_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(c10::Storage::use_byte_size_t(), 0, allocator(), true),DispatchKey::CPU, scalarTypeToTypeMeta(dispatch_scalar_type)).release();
auto res1 = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(res1_));
auto res2_ = c10::make_intrusive<TensorImpl, UndefinedTensorImpl>(c10::Storage(c10::Storage::use_byte_size_t(), 0, allocator(), true),DispatchKey::CPU, scalarTypeToTypeMeta(dispatch_scalar_type)).release();
auto res2 = Tensor(c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl>::reclaim(res2_));
switch (dispatch_scalar_type) {
case ScalarType::Double: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_gels", false, DeviceType::CPU, dispatch_scalar_type);
auto A_ = checked_dense_tensor_unwrap(A, "A", 2, "_th_gels", false, DeviceType::CPU, dispatch_scalar_type);
THDoubleTensor_gels(res1_, res2_, self_, A_);
break;
}
case ScalarType::Float: {
auto self_ = checked_dense_tensor_unwrap(self, "self", 1, "_th_gels", false, DeviceType::CPU, dispatch_scalar_type);
auto A_ = checked_dense_tensor_unwrap(A, "A", 2, "_th_gels", false, DeviceType::CPU, dispatch_scalar_type);
THFloatTensor_gels(res1_, res2_, self_, A_);
break;
}
default:
AT_ERROR("_th_gels not supported on CPUType for ", dispatch_scalar_type);
}
return std::tuple<Tensor, Tensor>(res1, res2);
}
} // namespace th
} // namespace legacy
} // namespace native

View File

@ -20,16 +20,8 @@ namespace cpu {
Tensor & _th_masked_scatter_(Tensor & self, const Tensor & mask, const Tensor & source);
Tensor & _th_masked_scatter_bool_(Tensor & self, const Tensor & mask, const Tensor & source);
Tensor& _th_nonzero_out(const Tensor& self, Tensor& result);
Tensor _th_nonzero(const Tensor & self);
Scalar _th_std_var(const Tensor& self, int64_t correction, bool take_sqrt);
Tensor & _th_renorm_out(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm, Tensor & result);
Tensor _th_renorm(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm);
Tensor & _th_renorm_(Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm);
Tensor & _th_histc_out(const Tensor & self, int64_t bins, const Scalar& min, const Scalar& max, Tensor & result);
Tensor _th_histc(const Tensor & self, int64_t bins, const Scalar& min, const Scalar& max);
std::tuple<Tensor &,Tensor &> _th_gels_out(const Tensor & self, const Tensor & A, Tensor & res1, Tensor & res2);
std::tuple<Tensor,Tensor> _th_gels(const Tensor & self, const Tensor & A);
} // namespace th
} // namespace legacy

View File

@ -20,9 +20,6 @@ namespace cuda {
Tensor & _th_masked_fill_(Tensor & self, const Tensor & mask, const Scalar& value);
Tensor & _th_masked_fill_bool_(Tensor & self, const Tensor & mask, const Scalar& value);
Tensor & _th_renorm_out(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm, Tensor & result);
Tensor _th_renorm(const Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm);
Tensor & _th_renorm_(Tensor & self, const Scalar& p, int64_t dim, const Scalar& maxnorm);
Tensor & _th_cross_kernel_out(Tensor & result, const Tensor & self, const Tensor & other, int64_t dim);
Tensor _th_cross_kernel(const Tensor & self, const Tensor & other, int64_t dim);
std::tuple<Tensor &,Tensor &> _th_gels_out(const Tensor & self, const Tensor & A, Tensor & res1, Tensor & res2);
@ -54,10 +51,7 @@ std::tuple<Tensor &,Tensor &> _thnn_log_sigmoid_forward_out(const Tensor & self,
std::tuple<Tensor,Tensor> _thnn_log_sigmoid_forward(const Tensor & self);
Tensor & _thnn_log_sigmoid_backward_out(const Tensor & grad_output, const Tensor & self, const Tensor & buffer, Tensor & grad_input);
Tensor _thnn_log_sigmoid_backward(const Tensor & grad_output, const Tensor & self, const Tensor & buffer);
Tensor & _thnn_rrelu_with_noise_forward_out(const Tensor & self, const Tensor & noise, const Scalar& lower, const Scalar& upper, bool training, c10::optional<at::Generator> generator, Tensor & output);
Tensor _thnn_rrelu_with_noise_forward(const Tensor & self, const Tensor & noise, const Scalar& lower, const Scalar& upper, bool training, c10::optional<at::Generator> generator);
Tensor _thnn_rrelu_with_noise_backward(const Tensor & grad_output, const Tensor & self, const Tensor & noise, const Scalar& lower, const Scalar& upper, bool training);
Tensor & _thnn_rrelu_with_noise_forward_(Tensor & self, const Tensor & noise, const Scalar& lower, const Scalar& upper, bool training, c10::optional<at::Generator> generator);
std::tuple<Tensor &,Tensor &,Tensor &> _thnn_conv2d_forward_out(const Tensor & self, const Tensor & weight, IntArrayRef kernel_size, const c10::optional<Tensor>& bias_opt, IntArrayRef stride, IntArrayRef padding, Tensor & output, Tensor & columns, Tensor & ones);
std::tuple<Tensor,Tensor,Tensor> _thnn_conv2d_forward(const Tensor & self, const Tensor & weight, IntArrayRef kernel_size, const optional<Tensor> & bias, IntArrayRef stride, IntArrayRef padding);
std::tuple<Tensor &,Tensor &,Tensor &> _thnn_conv2d_backward_out(Tensor & grad_input, Tensor & grad_weight, Tensor & grad_bias, const Tensor & grad_output, const Tensor & self, const Tensor & weight, IntArrayRef kernel_size, IntArrayRef stride, IntArrayRef padding, const Tensor & columns, const Tensor & ones);

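The `_thnn_rrelu_with_noise_*` declarations dropped above are legacy THC bindings; the native ATen entry point `at::rrelu_with_noise` presumably takes over. A minimal usage sketch, assuming the standard signature (self, noise, lower, upper, training, generator) and a CUDA build; illustrative only:

```cpp
#include <ATen/ATen.h>

// Illustrative only; assumes a CUDA build and the at::rrelu_with_noise
// signature (self, noise, lower, upper, training, generator).
void rrelu_example() {
  auto input = at::randn({128, 100}, at::kCUDA);
  auto noise = at::empty_like(input);  // receives the sampled slopes in training mode
  auto output = at::rrelu_with_noise(
      input, noise,
      /*lower=*/0.125, /*upper=*/0.3333,
      /*training=*/true, /*generator=*/c10::nullopt);
}
```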

@@ -8,9 +8,9 @@ MemOverlap has_internal_overlap(const Tensor& tensor) {
}
MemOverlap has_internal_overlap(TensorImpl* t) {
AT_ASSERT(t->layout() == kStrided);
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(t->layout() == kStrided);
if (t->is_contiguous()) {
if (t->is_non_overlapping_and_dense()) {
return MemOverlap::NO;
}
@@ -45,18 +45,16 @@ MemOverlapStatus get_overlap_status(TensorImpl* a, TensorImpl* b) {
if (a->numel() == 0 || b->numel() == 0) {
return MemOverlapStatus::NO;
}
if (!a->is_contiguous() || !b->is_contiguous()) {
if (!a->is_non_overlapping_and_dense() || !b->is_non_overlapping_and_dense()) {
return MemOverlapStatus::TOO_HARD;
}
if (!a->has_storage() || !b->has_storage()) {
return MemOverlapStatus::NO;
}
// Test for storage equality, rather than pointer equality.
// This reduces precision, but if people are aliasing the
// same pointer across multiple storages there are many
// similar situations (e.g., storage().data() == storage().data()+1)
// which we will miss.
if (a->storage().is_alias_of(b->storage())) {
auto a_storage = a->unsafe_storage();
if (a_storage && a_storage.is_alias_of(b->unsafe_storage())) {
const auto a_begin = static_cast<char*>(a->data());
const auto a_end = a_begin + a->numel() * a->itemsize();
const auto b_begin = static_cast<char*>(b->data());

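The change above relaxes the fast path from `is_contiguous()` to `is_non_overlapping_and_dense()` and routes the storage comparison through `unsafe_storage()`. A small sketch of the kind of tensor these checks exist for, using the public `at::has_internal_overlap` helper; illustrative only, not code from this diff:

```cpp
#include <ATen/ATen.h>
#include <ATen/MemoryOverlap.h>

// An expanded view carries a zero stride, so several logical elements map to
// the same memory; writes into such a view would be ambiguous, which is what
// the overlap checks guard against.
void overlap_example() {
  auto base = at::ones({1, 3});
  auto expanded = base.expand({4, 3});  // stride 0 along dim 0
  auto status = at::has_internal_overlap(expanded);
  // status is expected to be MemOverlap::YES here; non-overlapping, dense
  // tensors take the early-return path changed in the hunk above.
  TORCH_CHECK(status != at::MemOverlap::NO);
}
```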

@@ -128,7 +128,7 @@ void launch_no_thread_state(std::function<void()> fn);
TORCH_API void intraop_launch(std::function<void()> func);
// Launches intra-op parallel task, returns a future
TORCH_API std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
TORCH_API c10::intrusive_ptr<c10::ivalue::Future> intraop_launch_future(
std::function<void()> func);
// Returns number of intra-op threads used by default


@@ -271,10 +271,10 @@ void intraop_launch(std::function<void()> func) {
#endif // C10_MOBILE
}
std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
c10::intrusive_ptr<c10::ivalue::Future> intraop_launch_future(
std::function<void()> func) {
#ifndef C10_MOBILE
auto future = std::make_shared<c10::ivalue::Future>(c10::NoneType::get());
auto future = c10::make_intrusive<c10::ivalue::Future>(c10::NoneType::get());
if (!in_parallel_region() && get_num_threads() > 1) {
_get_intraop_pool().run(
[func, future]() {
@@ -290,7 +290,7 @@ std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
#else
// TODO: caffe2::PThreadPool only provides a data-parallel API.
// Task parallelism is not currently supported.
auto future = std::make_shared<c10::ivalue::Future>(NoneType::get());
auto future = c10::make_intrusive<c10::ivalue::Future>(NoneType::get());
func();
future->markCompleted();
return future;

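The hunks above migrate `intraop_launch_future` from `std::shared_ptr` to `c10::intrusive_ptr<c10::ivalue::Future>`. A minimal usage sketch of the public API; illustrative only, and `auto` keeps it valid on either side of the change:

```cpp
#include <ATen/Parallel.h>

void launch_example() {
  // Schedules the closure on the intra-op thread pool when one is available,
  // otherwise runs it inline; the returned Future completes afterwards.
  auto future = at::intraop_launch_future([] {
    /* some independent work */
  });
  future->wait();  // c10::ivalue::Future::wait() blocks until markCompleted()
}
```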

@@ -70,11 +70,12 @@ int get_num_threads() {
}
int get_thread_num() {
return tbb::this_task_arena::current_thread_index();
auto tid = tbb::this_task_arena::current_thread_index();
return std::max(tid, 0);
}
bool in_parallel_region() {
return tbb::this_task_arena::current_thread_index() != -1;
return tbb::this_task_arena::current_thread_index() >= 0;
}
void intraop_launch(std::function<void()> func) {
@@ -85,9 +86,9 @@ void intraop_launch(std::function<void()> func) {
}
}
std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
c10::intrusive_ptr<c10::ivalue::Future> intraop_launch_future(
std::function<void()> func) {
auto future = std::make_shared<c10::ivalue::Future>(NoneType::get());
auto future = c10::make_intrusive<c10::ivalue::Future>(NoneType::get());
if (get_num_threads() > 1) {
tg_.run(
[func, future]() {


@@ -101,10 +101,10 @@ void intraop_launch(std::function<void()> func) {
func();
}
std::shared_ptr<c10::ivalue::Future> intraop_launch_future(
c10::intrusive_ptr<c10::ivalue::Future> intraop_launch_future(
std::function<void()> func) {
func();
auto future = std::make_shared<c10::ivalue::Future>(NoneType::get());
auto future = c10::make_intrusive<c10::ivalue::Future>(NoneType::get());
future->markCompleted();
return future;
}


@@ -17,16 +17,22 @@ inline void parallel_for(
const int64_t end,
const int64_t grain_size,
const F& f) {
TORCH_CHECK(grain_size >= 0);
at::internal::lazy_init_num_threads();
TORCH_INTERNAL_ASSERT_DEBUG_ONLY(grain_size >= 0);
if (begin >= end) {
return;
}
if (end - begin == 1) {
#ifdef _OPENMP
at::internal::lazy_init_num_threads();
const auto numiter = end - begin;
const bool use_parallel = (
numiter > grain_size && numiter > 1 &&
omp_get_max_threads() > 1 && !omp_in_parallel());
if (!use_parallel) {
f(begin, end);
return;
}
#ifdef _OPENMP
std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
std::exception_ptr eptr;
// Work around memory leak when using 1 thread in nested "omp parallel"
@@ -34,7 +40,7 @@ inline void parallel_for(
// returns false when omp_get_max_threads() == 1 inside nested "omp parallel"
// See issue gh-32284
#pragma omp parallel if (omp_get_max_threads() > 1 && !omp_in_parallel() && ((end - begin) > grain_size))
#pragma omp parallel
{
// choose number of tasks based on grain size and number of threads
// can't use num_threads clause due to bugs in GOMP's thread pool (See #32008)
@@ -76,7 +82,8 @@ inline scalar_t parallel_reduce(
at::internal::lazy_init_num_threads();
if (begin >= end) {
return ident;
} else if (in_parallel_region() || get_num_threads() == 1) {
} else if ((end - begin) <= grain_size || in_parallel_region() ||
get_num_threads() == 1) {
return f(begin, end, ident);
} else {
const int64_t num_results = divup((end - begin), grain_size);
@@ -84,7 +91,7 @@ inline scalar_t parallel_reduce(
scalar_t* results_data = results.data();
std::atomic_flag err_flag = ATOMIC_FLAG_INIT;
std::exception_ptr eptr;
#pragma omp parallel for if ((end - begin) >= grain_size)
#pragma omp parallel for
for (int64_t id = 0; id < num_results; id++) {
int64_t i = begin + id * grain_size;
try {
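With the changes above, both templates fall back to a serial call when the range is no larger than `grain_size` (and, in the OpenMP `parallel_for` path, when OpenMP reports a single thread or an enclosing parallel region). A minimal usage sketch of `at::parallel_for` and `at::parallel_reduce`; illustrative only:

```cpp
#include <ATen/Parallel.h>

#include <functional>
#include <vector>

void parallel_example() {
  const int64_t n = 1 << 20;
  std::vector<float> data(n, 1.0f);

  // Each worker receives a half-open [begin, end) sub-range of roughly
  // grain_size iterations; smaller ranges now run serially on the caller.
  at::parallel_for(0, n, /*grain_size=*/2048, [&](int64_t begin, int64_t end) {
    for (int64_t i = begin; i < end; ++i) {
      data[i] *= 2.0f;
    }
  });

  // parallel_reduce: per-chunk partial results are combined with the final
  // reduction functor (std::plus here).
  double total = at::parallel_reduce(
      0, n, /*grain_size=*/2048, /*ident=*/0.0,
      [&](int64_t begin, int64_t end, double partial) {
        for (int64_t i = begin; i < end; ++i) {
          partial += data[i];
        }
        return partial;
      },
      std::plus<double>());
  (void)total;
}
```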

Some files were not shown because too many files have changed in this diff.