Compare commits

...

6626 Commits

Author SHA1 Message Date
138e2895d0 Enable tuple operands for cond (#108026)
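A hedged sketch of what the change enables, based only on the title; the import path is an assumption taken from the companion PR #108025 below, and the exact call pattern may differ from the PR's tests:

```python
import torch
from torch._higher_order_ops.cond import cond  # import path assumed from #108025 below

def true_fn(x, y):
    return x + y

def false_fn(x, y):
    return x - y

x, y = torch.randn(3), torch.randn(3)
# Per the title, operands can now be passed as a tuple rather than only a list.
out = cond(torch.tensor(True), true_fn, false_fn, (x, y))
```
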
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108026
Approved by: https://github.com/zou3519
ghstack dependencies: #108025
2023-08-28 00:17:54 +00:00
8688965337 Move cond to torch/_higher_order_ops/ (#108025)
1. Move cond to torch/_higher_order_ops
2. Fix a bug in map, which didn't respect the tensor dtype when creating a new tensor from an existing one. We cannot directly use empty_strided because a boolean tensor created by empty_strided is not properly initialized, which causes the error "load of value 190, which is not a valid value for type 'bool'" in the clang ASAN environment on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108025
Approved by: https://github.com/zou3519
2023-08-28 00:01:35 +00:00
cyy
1fd4e787ce [2/N] fix clang-tidy warnings in torch/csrc (#107966)
Apply fixes to some issues found by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107966
Approved by: https://github.com/Skylion007
2023-08-27 18:06:21 +00:00
9ae3d7ca90 [reland][quant][pt2e][xnnpack_quantizer] Add support for mul and mul_relu (#107930) (#107992)
Summary: att

Test Plan: buck2 run executorch/examples/quantization:example -- -m=mv3 --verify

Differential Revision: D48588121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107992
Approved by: https://github.com/digantdesai, https://github.com/mcr229
2023-08-27 14:50:03 +00:00
a432f37e49 Serialize pytree to json string (#106116)
Fixes https://github.com/pytorch/pytorch/pull/102577#issuecomment-1650905536

Serializing to JSON is more stable, so the API has also been renamed:

```
# Takes in a treespec and returns the serialized treespec as a string. Also optionally takes in a protocol version number.
def treespec_dumps(treespec: TreeSpec, protocol: Optional[int] = None) -> str:
# Takes in a serialized treespec and outputs a TreeSpec
def treespec_loads(data: str) -> TreeSpec:
```

If users want to register their own serialization format for a given pytree, they can go through the `_register_treespec_serializer` API which optionally takes in a `getstate` and `setstate` function.
```
_register_treespec_serializer(type_, *, getstate, setstate)
# Takes in the context, and outputs a json-dumpable context
def getstate(context: Context) -> DumpableContext:
# Takes in a json-dumpable context, and reconstructs the original context
def setstate(dumpable_context: DumpableContext) -> Context:
```

We will serialize to the following dataclass, and then json.dumps it to a string.
```
class TreeSpec:
    type: Optional[str]  # a string name of the type; None for a LeafSpec
    context: Optional[Any]  # optional, a json-dumpable form of the context
    children_specs: List[TreeSpec]
```

If no getstate/setstate function is registered, we will by default serialize the context using `json.dumps/loads`. We will also serialize the type through `f"{typ.__module__}.{typ.__name__}"`.
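
A minimal round-trip sketch of the renamed API, assuming the functions live in `torch.utils._pytree`:

```python
import torch.utils._pytree as pytree

leaves, spec = pytree.tree_flatten({"a": 1, "b": [2, 3]})
serialized = pytree.treespec_dumps(spec)   # JSON string
restored = pytree.treespec_loads(serialized)
assert restored == spec
```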

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106116
Approved by: https://github.com/zou3519
2023-08-27 14:34:49 +00:00
4b27e46ddb [Quant][Inductor] add UT of dequant promotion for linear (#106935)
**Summary**
Previously the UT of dequant promotion in Inductor only tested convolution. Now add a linear case to the UT. This is for quantization PT2E with Inductor.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_dequant_promotion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106935
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #105818, #106781, #106782, #106934
2023-08-27 13:53:13 +00:00
f3adbab4bb [Quant][Inductor] Enable quantization linear pattern fusion inside inductor (#106934)
**Summary**
Enable lowering of quantized linear in Inductor

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_unary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106934
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
ghstack dependencies: #105818, #106781, #106782
2023-08-27 13:00:16 +00:00
15ceafb5c5 [Quant][Inductor] Enable qlinear weight prepack inside inductor constant folding (#106782)
**Summary**
To realize weight prepack for quantized linear, we replace the following pattern
```
int8 activation
      |
dequant_per_tensor
      |
mm/addmm <- t <- dequant_per_channel <- int8_weight
```
with
```
int8 activation
  |
onednn.qlinear_pointwise <- onednn.qlinear_prepack <- int8_weight
```
We register the weight prepack path inside inductor constant folding. Constant folding evaluates the prepack op and replaces it with the prepacked weight (a constant parameter).

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_unary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106782
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/eellison
ghstack dependencies: #105818, #106781
2023-08-27 12:53:44 +00:00
e9b0f62a19 [Quant][PT2E] Enable linear and linear-unary post-op quant recipe for x86 inductor quantizer (#106781)
**Summary**
Add linear and linear-unary post-op quantization recipes to the x86 inductor quantizer, for PT2E with Inductor. With this, the quantization path will add the `quant-dequant` pattern for linear and linear-unary post ops.

**Test plan**
python test/test_quantization.py -k test_linear_with_quantizer_api
python test/test_quantization.py -k test_linear_unary_with_quantizer_api

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106781
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #105818
2023-08-27 10:50:17 +00:00
2179ebde1f [inductor] correctly handle resize for AOTInductor wrapper calls (#107848)
When generating a wrapper call, we may have implicit resize applied to
the kernel's output. For example, for addmm(3d_tensor, 2d_tensor),
its output buffer is resized to a 2d tensor. This triggers a warning from
Aten's resize_output op:

    "UserWarning: An output with one or more elements was resized since it had...
    This behavior is deprecated, and in a future PyTorch release outputs will
    not be resized unless they have zero elements..."

More importantly, the output shape is not what we would expect, i.e. a 2d
tensor vs. a 3d tensor.

This PR fixes the issue by injecting resize_(0) before calling the relevant
kernel and resize_(expected_shape) after the kernel call.
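
A hypothetical illustration of this pattern in plain Python (not the actual AOTInductor-generated wrapper code):

```python
import torch

def call_with_resize(kernel, out, expected_shape, *args):
    out.resize_(0)                # injected before the kernel call: no implicit-resize warning
    kernel(*args, out=out)        # the kernel may write a differently-shaped result into `out`
    out.resize_(expected_shape)   # injected after the kernel call: restore the expected shape
    return out

a, b = torch.randn(4, 8), torch.randn(8, 16)
buf = torch.empty(1, 4, 16)       # 3d buffer, while mm produces a 2d result
res = call_with_resize(torch.mm, buf, (1, 4, 16), a, b)
print(res.shape)                  # torch.Size([1, 4, 16])
```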

We also fixed a minor typo in the PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107848
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-08-27 09:56:16 +00:00
a6d3da1835 [Quant] Add int8 linear op impl for quantization PT2E with Inductor. input is an int8 CPU tensor; weight is an int8 MkldnnCPU tensor. (#105818)
**Summary**
Add a new onednn qlinear op for quantization PT2E with Inductor. input is an int8 CPU tensor and weight is an int8 MkldnnCPU tensor.

**Test plan**
python test/test_quantization.py -k test_qlinear_pt2e

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105818
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2023-08-27 08:13:12 +00:00
bad3f2db40 [vision hash update] update the pinned vision hash (#108011)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108011
Approved by: https://github.com/pytorchbot
2023-08-27 03:29:53 +00:00
808e088615 Update writing_batching_rules.md (#108007)
Was reading through the batching rules info which is very cool and just saw a couple of typos 😊.

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108007
Approved by: https://github.com/msaroufim
2023-08-26 19:07:36 +00:00
a18ee0c6ec [ROCm] ROCm compatible configs for triton kernels (#107584)
This PR brings in a few inductor changes required for ROCm

~**1 - Introduction of a toggle for enforced channel last convolution fallbacks**~
This addition is split off into its own PR after some cleanup by @pragupta  https://github.com/pytorch/pytorch/pull/107812

**2 - Addition of ROCm specific block sizes**
We are now able to support the MAX_AUTOTUNE mode on ROCm, so we are proposing conditions that allow us to fine-tune our own block sizes. Currently Triton on ROCm does not benefit from pipelining, so we are setting all configs to `num_stages=1`, and we have removed some upstream tunings on ROCm to avoid running out of shared memory resources.

In the future we will provide more optimised tunings for ROCm, but for now this should mitigate any issues.

~**3 - Addition of device_type to triton's compile_meta**~
~Proposing this addition to `triton_heuristics.py`, Triton on ROCm requires device_type to be set to hip https://github.com/ROCmSoftwarePlatform/triton/pull/284 suggesting to bring this change in here so we can pass down the correct device type to triton.~
This change is split off and will arrive in the wheel update PR https://github.com/pytorch/pytorch/pull/107600 leaving this PR to focus on the ROCm specific block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107584
Approved by: https://github.com/jithunnair-amd, https://github.com/jansel, https://github.com/eellison
2023-08-26 18:24:55 +00:00
15e5bd5103 [ONNX] Support torch.compile(backend="onnxrt", options=OrtBackendOptions(...)) (#107973)
This reworks the DORT backend factory function to support the options kwarg of torch.compile, and defines a concrete OrtBackendOptions type that can be used to influence the backend.

Caching is also implemented in order to reuse backends with equal options.

Wrapping the backend in auto_autograd also becomes an option, which allows `OrtBackend` to always be returned as the callable for torch.compile; wrapping happens internally if opted into (True by default).

Lastly, options are exposed for configuring preferred execution providers (attempted first), whether to attempt to infer an ORT EP from a torch device found in the graph or inputs, and finally the default/fallback EPs.

### Demo

The following demo runs `Gelu` through `torch.compile(backend="onnxrt")` using various backend options through a dictionary form and a strongly typed form. It additionally exports the model through both the ONNX TorchScript exporter and the new TorchDynamo exporter.

```python
import math

import onnx.inliner
import onnxruntime
import torch
import torch.onnx

torch.manual_seed(0)

class Gelu(torch.nn.Module):
    def forward(self, x):
        return x * (0.5 * torch.erf(math.sqrt(0.5) * x) + 1.0)

@torch.compile(
    backend="onnxrt",
    options={
        "preferred_execution_providers": [
            "NotARealEP",
            "CPUExecutionProvider",
        ],
        "export_options": torch.onnx.ExportOptions(dynamic_shapes=True),
    },
)
def dort_gelu(x):
    return Gelu()(x)

ort_session_options = onnxruntime.SessionOptions()
ort_session_options.log_severity_level = 0

dort_gelu2 = torch.compile(
    Gelu(),
    backend="onnxrt",
    options=torch.onnx._OrtBackendOptions(
        preferred_execution_providers=[
            "NotARealEP",
            "CPUExecutionProvider",
        ],
        export_options=torch.onnx.ExportOptions(dynamic_shapes=True),
        ort_session_options=ort_session_options,
    ),
)

x = torch.randn(10)

torch.onnx.export(Gelu(), (x,), "gelu_ts.onnx")

export_output = torch.onnx.dynamo_export(Gelu(), x)
export_output.save("gelu_dynamo.onnx")
inlined_model = onnx.inliner.inline_local_functions(export_output.model_proto)
onnx.save_model(inlined_model, "gelu_dynamo_inlined.onnx")

print("Torch Eager:")
print(Gelu()(x))

print("DORT:")
print(dort_gelu(x))
print(dort_gelu2(x))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107973
Approved by: https://github.com/BowenBao
2023-08-26 18:20:18 +00:00
c85c5954f2 [Quant][PT2E]Make _fuse_conv_bn_ support graph capture by torch._dynamo.export (#107951)
**Summary**
The latest check-in a0cfaf0688 for conv-bn folding assumes the graph is captured by the new graph capture API `torch._export.capture_pre_autograd_graph`. Since we still need to use the original graph capture API `torch._dynamo.export` in the 2.1 release, this check-in had a heavy negative impact on workloads' performance. This PR fixes the issue by making the conv-bn folding function work with both the new and the original graph capture API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107951
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #106836, #106838, #106958
2023-08-26 17:19:41 +00:00
fdbc2ec5cb [Quant][Inductor] Fix the non contiguous load with uint8 data type (#106958)
**Summary**
Currently, the load vectorization code generation for non-contiguous `uint8` data has an issue in determining the data type. It caused wrong results in the `shufflenet_v2_x1_0` model after we enabled the `cat` quantization recipe.

- Previously code gen with the example in this PR:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = flag_to_float_scalar(in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]); return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

- After this PR, the code gen is:

```
cpp_fused_clone_view_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(232L); i0+=static_cast<long>(1L))
            {
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(784L); i1+=static_cast<long>(16L))
                {
                    auto tmp0 = ([&]() { __at_align__ unsigned char tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((116L*(static_cast<long>(i0) % static_cast<long>(2L))) + (232L*i1) + (232L*i1_inner) + (at::native::div_floor_integer(i0, 2L)))]; return at::vec::Vectorized<uint8_t>::loadu_one_fourth(tmpbuf); })();
                    auto tmp1 = at::vec::convert_uint8_to_float(tmp0);
                    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(0.0));
                    auto tmp3 = tmp1 - tmp2;
                    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(1.0));
                    auto tmp5 = tmp3 * tmp4;
                    auto tmp6 = tmp5 * tmp4;
                    auto tmp7 = tmp6.round();
                    auto tmp8 = tmp7 + tmp2;
                    auto tmp9 = at::vec::maximum(tmp8, tmp2);
                    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(255.0));
                    auto tmp11 = at::vec::minimum(tmp9, tmp10);
                    auto tmp12 = at::vec::convert_float_to_uint8(tmp11);
                    auto tmp13 = at::vec::convert_uint8_to_float(tmp12);
                    auto tmp14 = tmp13 - tmp2;
                    auto tmp15 = tmp14 * tmp4;
                    tmp15.store(out_ptr0 + static_cast<long>(i1 + (784L*i0)));
                }
            }
        }
    }
}
''')
```

**Test Plan**
```
clear && python -m pytest test_cpu_repro.py -k test_non_contiguous_load_buf_quant
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106958
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #106836, #106838
2023-08-26 16:58:45 +00:00
9e3f3f0b3d [Quant][Inductor] Enable lowering of qcat (#106838)
**Summary**
Enable the lowering of `qcat` inside inductor as an extern kernel.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qcat
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106838
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #106836
2023-08-26 16:56:27 +00:00
1147a28b0b [Quant][PT2E] Add cat and avg_pool2d recipe into x86InductorQuantizer (#106836)
**Summary**
Add `cat` and `avg_pool2d` quantization recipes to `x86InductorQuantizer`, annotated so that input and output share the observer.

**Test Plan**
```
clear && python -m pytest test_x86inductor_quantizer.py -k test_cat_recipe
clear && python -m pytest test_x86inductor_quantizer.py -k test_cat_recipe_same_inputs
clear && python -m pytest test_x86inductor_quantizer.py -k test_cat_recipe_single_input
clear && python -m pytest test_x86inductor_quantizer.py -k test_avg_pool2d_recipe
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106836
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-08-26 16:51:13 +00:00
15d4dedbbf [quant][pt2e] Add reference representation rewrite for statically quantized linear (#107994)
Summary: att

Test Plan:
```
python test/test_quantization.py TestQuantizePT2E.test_representation_linear
buck2 test 'fbcodemode/opt' fbcodecaffe2/test:quantization_pt2e -- 'test_representation_linear'
```

Reviewed By: kimishpatel

Differential Revision: D48674862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107994
Approved by: https://github.com/mcr229, https://github.com/guangy10
2023-08-26 15:39:52 +00:00
162109f6c2 [export] Don't save example_inputs for now. (#107978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107978
Approved by: https://github.com/angelayi
2023-08-26 14:36:56 +00:00
d4a99631dd Handle 2D blocking with foreach (#107840)
Previously blocking in foreach ops was only 1D. This PR allows handling kernels with 2D blocking with foreach as well.

Code when at least one dim matches:
[example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196)

Code when neither X or Y dim matches:
[example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840
Approved by: https://github.com/jansel
2023-08-26 11:02:46 +00:00
558a9501fa [cuda] remove dead CUDA code in layer_norm_kernel.cu (#107976)
Removing CUDA kernels which are not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107976
Approved by: https://github.com/Skylion007
2023-08-26 09:18:05 +00:00
9fa5283401 [dynamo+aten] Enable embedding_bag_byte_unpack + meta kernel impl (#107937)
Summary:
```
torch._dynamo.exc.Unsupported: unsupported operator: quantized.embedding_bag_byte_unpack.default
```

Differential Revision: D48652953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107937
Approved by: https://github.com/houseroad
2023-08-26 08:52:42 +00:00
2bddfb0263 [submodule][Quant][PT2E] Upgrade IDeep to remove redundant QConv weight scale reciprocal calculation (#107565)
**Summary**
Upgrade IDeep, which includes one IDeep change (IDeep PR: https://github.com/intel/ideep/pull/226).

- IDeep PR https://github.com/intel/ideep/pull/226 does two things:

  - Remove the redundant QConv weight scale reciprocal calculation.
  - Bump IDEEP_VERSION_REVISION from 0 to 1.

  So only QConv-related calculations will be impacted, and we already use the IDeep version API in https://github.com/pytorch/pytorch/pull/105996 to make the corresponding change in PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107565
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #104580, #104581, #104588, #104590, #105455, #105456, #105639, #105906, #105996
2023-08-26 08:42:12 +00:00
780a5a0c7d [Quant][PT2E] Enable weight scale optimization in QConv PT2E (#105996)
**Summary**
After the oneDNN 3.1 upgrade, we no longer need the weight scale reciprocal calculation. So, this PR removes the redundant reciprocal calculation to optimize QConv performance, using the IDeep version API to implement it:

- This QConv implementation is expected to work functionally both with the current IDeep version and with the following IDeep upgrade in PR: https://github.com/pytorch/pytorch/pull/107565.
- With the following IDeep upgrade in PR: https://github.com/pytorch/pytorch/pull/107565, QConv has better performance since the redundant reciprocal calculation is removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105996
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #104580, #104581, #104588, #104590, #105455, #105456, #105639, #105906
2023-08-26 08:39:18 +00:00
9319dd1c7c [Quant][Inductor] Enable the lowering of quantized maxpool2d (#105906)
**Summary**
Enable the `dq-maxpool2d-q` pattern match and lower into `torch.ops.quantized.max_pool2d`.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qmaxpool2d
python -m pytest test_quantized_op.py -k test_max_pool2d_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105906
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588, #104590, #105455, #105456, #105639
2023-08-26 08:36:47 +00:00
70ca18f8a0 [Quant][PT2E] Enable X86InductorQuantizer single quantizable op(maxpool2d) (#105639)
**Summary**
In this PR, we mainly enable two things:

- Enable the skeleton of the quantization recipe for single quantizable operators in `X86InductorQuantizer`.
- Add the quantization recipe for `maxpool2d` and annotate it so that input and output share the observer.

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_maxpool2d_recipe
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105639
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #104580, #104581, #104588, #104590, #105455, #105456
2023-08-26 08:34:15 +00:00
c5ad44be1d Add torch.sparse.as_sparse_gradcheck decorator of gradcheck that allows gradcheck input function to receive and return sparse tensors (#107150)
Compared to #104848, this PR goes a step further: when the enable_sparse_support decorator is applied to `torch.autograd.gradcheck`, the resulting callable is equivalent to `torch.autograd.gradcheck` with the extra feature of supporting functions that take and/or return sparse tensors.

At the same time, the underlying call to `torch.autograd.gradcheck` will operate on strided tensors only. This basically means that torch/autograd/gradcheck.py can be cleaned up by removing the code that deals with sparse tensors.
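
A minimal sketch of the decorator named in the title, assuming it wraps `torch.autograd.gradcheck` and performs the sparse/strided conversion internally; shown only to illustrate the call pattern:

```python
import torch

gradcheck = torch.sparse.as_sparse_gradcheck(torch.autograd.gradcheck)

def fn(x):
    # a function that receives a sparse tensor and returns a strided one
    return (x * x).to_dense()

x = torch.randn(3, 3, dtype=torch.float64).to_sparse().requires_grad_()
gradcheck(fn, (x,))
```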

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107150
Approved by: https://github.com/albanD, https://github.com/amjames, https://github.com/cpuhrsch
ghstack dependencies: #107638, #107777
2023-08-26 07:24:31 +00:00
e4b38b9ce9 Support torch.sparse_mask on strided input with sparse CSR, CSC, BSR, and BSC mask. (#107777)
While `input.sparse_mask(mask)` can be defined as `input.mul(ones_like(mask))`, implementing this definition leads to a chicken-and-egg problem because the multiplication of dense and sparse_compressed tensors relies on the `sparse_mask` support.

This PR implements `sparse_mask` support for sparse compressed masks using utility functions from sparse compressed tensor conversions support.
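
A small sketch of the newly supported case (strided input, sparse CSR mask), assuming the result adopts the mask's layout:

```python
import torch

dense = torch.randn(3, 3, dtype=torch.float64)
mask = torch.eye(3, dtype=torch.float64).to_sparse_csr()
out = dense.sparse_mask(mask)   # dense's values at the mask's specified elements
print(out.layout)               # torch.sparse_csr
```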

Fixes https://github.com/pytorch/pytorch/issues/107373
Fixes https://github.com/pytorch/pytorch/issues/107370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107777
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
ghstack dependencies: #107638
2023-08-26 07:24:31 +00:00
fe3309b4b8 Add optional is_coalesced argument to sparse coo tensor factory function. (#107638)
Resolves https://github.com/pytorch/pytorch/issues/107097

After this PR, instead of
```python
torch.sparse_coo_tensor(indices, values, size)._coalesced_(is_coalesced)
```
(that does not work in the autograd context, see #107097), use
```python
torch.sparse_coo_tensor(indices, values, size, is_coalesced=is_coalesced)
```

All sparse coo factory functions that take indices as input support the `is_coalesced` argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107638
Approved by: https://github.com/cpuhrsch
2023-08-26 07:24:29 +00:00
781b7ebe91 [DeviceMesh] Expose init_device_mesh (#107969)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107969
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-08-26 06:48:17 +00:00
35f4bb9a25 [ONNX] Return input itself for non-fp inputs and support decimals for aten::round op (#107920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107920
Approved by: https://github.com/justinchuby
2023-08-26 05:54:52 +00:00
ed8f21282f Minor fixes to make torchbench runnable on torch/xla (#107919)
`import torch_xla.core.xla_model as xm` no longer triggers the xla runtime init, hence we explicitly create the device here. This is a workaround for https://github.com/pytorch/xla/issues/4174.

The `is_correct` reference has been deleted; I think it is dead code.

After this patch, I am able to run

```
python benchmarks/dynamo/torchbench.py --randomize-input --performance --trace-on-xla --training --backend=openxla --only resnet50
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107919
Approved by: https://github.com/shunting314, https://github.com/wconstab
2023-08-26 03:34:54 +00:00
95cacb7fa9 [reland][inductor] make thread order consistent with loop order (#107902)
This PR relands https://github.com/pytorch/pytorch/pull/106827, which got reverted because it caused a compilation error for some ads models.

Yanbo provided a repro in one of the 14k models (`pytest ./generated/test_KaiyangZhou_deep_person_reid.py -k test_044`). This is also the model I used to confirm the fix and come up with a unit test. In this model, we call `triton_heuristics.triton_config` with size_hints [2048, 2]. Previously this would result in a triton config with XBLOCK=2048 and YBLOCK=2. But since we changed the mapping between size_hints and the XYZ dimensions, we now generate a triton config with XBLOCK=2 and YBLOCK=2048. This fails compilation since we set the max YBLOCK to 1024.

My fix is to make sure we never generate a triton config that exceeds the maximum block size.
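
A purely illustrative sketch of the idea (names and the ZBLOCK limit are hypothetical, not inductor's actual code):

```python
MAX_BLOCK = {"XBLOCK": 2048, "YBLOCK": 1024, "ZBLOCK": 1024}

def clamp_block_sizes(cfg):
    # never emit a block size above the per-dimension maximum
    return {dim: min(size, MAX_BLOCK[dim]) for dim, size in cfg.items()}

print(clamp_block_sizes({"XBLOCK": 2, "YBLOCK": 2048}))  # {'XBLOCK': 2, 'YBLOCK': 1024}
```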

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107902
Approved by: https://github.com/jansel
2023-08-26 02:56:20 +00:00
4e9d7f878b [export] Serialize getattr nodes (#107924)
Turns out some graphs will result in getattr nodes...so let's serialize them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107924
Approved by: https://github.com/zhxchen17, https://github.com/avikchaudhuri
2023-08-26 02:41:49 +00:00
27afb1c61f Disable Constraint constructor (#107918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107918
Approved by: https://github.com/zhxchen17
2023-08-26 02:12:47 +00:00
f877d0a4bf [dynamo] Treat monkey patched .forward as dynamic (#107104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107104
Approved by: https://github.com/anijain2305
2023-08-26 01:41:29 +00:00
240bdbea61 [quant][pt2e] Fix annotation for conv no bias case (#107971)
Summary: This fixes the no bias case for conv annotations.
Previously this would result in an index out of bounds, since
the new aten.conv2d op may not have the bias arg (unlike the
old aten.convolution op). This was not caught because of a lack
of test cases, which are added in this commit.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_no_bias
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_relu_fusion_no_conv_bias

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel

Differential Revision: [D48696874](https://our.internmc.facebook.com/intern/diff/D48696874)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107971
Approved by: https://github.com/jerryzh168
2023-08-26 01:01:54 +00:00
25d98a3e3b [ONNX] Remove API reference for TorchScript export diagnostics (#107979)
Remove both api reference and rules specific to TorchScript ONNX export. The page should display only info related to `torch.onnx.dynamo_export` diagnostics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107979
Approved by: https://github.com/justinchuby
2023-08-26 00:52:59 +00:00
52eb773e9c Add runtime assertions for prim values (#107939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107939
Approved by: https://github.com/gmagogsfm
2023-08-26 00:51:28 +00:00
f92f69dbfb [quant][pt2e] Enable testing for reference quant model representations (#107474)
Summary:
Previously these tests were disabled due to a timeout in dynamo export in fbcode;
this might have been resolved, so we are trying to enable them again.

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Differential Revision: [D48619072](https://our.internmc.facebook.com/intern/diff/D48619072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107474
Approved by: https://github.com/andrewor14
2023-08-26 00:37:45 +00:00
8d44b0f5a5 Revert "[quant][pt2e][xnnpack_quantizer] Add support for mul and mul_relu (#107930)"
This reverts commit 1d1739dc6d7365c28719cd0175081f9d9aab0324.

Reverted https://github.com/pytorch/pytorch/pull/107930 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107930#issuecomment-1694069330))
2023-08-26 00:37:02 +00:00
3267996372 add channel last 3d support for maxpool3d on CPU (#97775)
### Testing
Single socket (28 cores):

shape | fp32 forward / ms | bf16 forward / ms | fp32 backward / ms  | bf16 backward / ms
-- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 3.959584 | 5.493402 | 0.557232 | 0.568485
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 0.815511 | 1.351261 | 5.710506 | 10.57506
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig  | 10.63426 | 15.28637 | 2.67656 | 1.71365
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.63570 | 2.05532 | 2.55452 | 2.33923
size: (4, 19, 10, 16, 16), kernel:   3, stride: 1, mem_format: contig | 0.375469 | 0.479748 | 0.066364 | 0.065155
size: (4, 19, 10, 16, 16), kernel:   3, stride: 1, mem_format: CL3d | 0.112197 | 0.112326 | 0.111697 | 0.145364

Single core:

shape | fp32 forward / ms | bf16 forward / ms | fp32 backward / ms | bf16 backward / ms
-- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 92.16582 | 128.6513 | 6.684325 | 12.21541
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 10.14318 | 29.80297 | 7.350142 | 11.25323
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 238.55453 | 331.89967 | 19.694657 | 32.78853
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 30.17079 | 32.75628 | 22.44543 | 30.17796
size: (4, 19, 10, 16, 16), kernel:   3, stride: 1, mem_format: contig | 7.474389 | 9.937217 | 0.236015 | 0.434229
size: (4, 19, 10, 16, 16), kernel:   3, stride: 1, mem_format: CL3d | 2.318954 | 2.469444 | 0.262125 | 0.401361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97775
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-08-26 00:21:27 +00:00
ee171465ad [ONNX] Support constant tensors in FakeMode exporting (#107836)
Fixes https://github.com/pytorch/pytorch/issues/107475

- Constant tensors were wrongly recognized as weights and buffers, and were then detached from their default values during `to_model_proto`. This PR fixes the bug and brings the Bloom CI test back successfully. NOTE: non-persistent buffers and weights are in a different situation and are not fixed by this PR.
- Reduce transformers model size by modifying their config parameters to speed up CI tests. (Unrelated to this PR title)

Corresponding change with https://github.com/microsoft/onnxscript/pull/1023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107836
Approved by: https://github.com/BowenBao, https://github.com/justinchuby
2023-08-26 00:06:49 +00:00
42d60d012e Bias overflow fix mem eff bias (#107968)
Fixes #107959
This should have been fixed here https://github.com/pytorch/pytorch/pull/103201
Edit:
Looking at git blame, it appears the dropout revert squashed the changes from that PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107968
Approved by: https://github.com/cpuhrsch
2023-08-26 00:00:49 +00:00
1d1739dc6d [quant][pt2e][xnnpack_quantizer] Add support for mul and mul_relu (#107930)
Summary: att

Test Plan: buck2 run executorch/examples/quantization:example -- -m=mv3 --verify

Differential Revision: D48588121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107930
Approved by: https://github.com/kimishpatel
2023-08-25 23:36:19 +00:00
d35d7de60e Revert "Handle 2D blocking with foreach (#107840)"
This reverts commit f87ffe473d4825c15eaea0360baf08cad49979de.

Reverted https://github.com/pytorch/pytorch/pull/107840 on behalf of https://github.com/huydhn due to Sorry for reverting this, but test_2d_blocking is failing in trunk, probably a landrace as PR was green ([comment](https://github.com/pytorch/pytorch/pull/107840#issuecomment-1694009217))
2023-08-25 22:49:15 +00:00
af229ecd34 [RFC] Change --standalone to bind to a random port (#107734)
Given standalone generates args anyways, it seems like it would be more convenient if it explicitly used a random port by default instead of trying to use 29400.

That way users can directly go with `--standalone` instead of having to spell out `--rdzv-backend=c10d --rdzv-endpoint=localhost:0`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107734
Approved by: https://github.com/H-Huang
2023-08-25 22:13:44 +00:00
7ef13b1831 [TP][2D][EZ] Fix Error in FSDP 2D test (#107975)
As title, the TP dimension should be the second dim, so we need to pass tp_degree to the second rather than the first dim of the mesh tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107975
Approved by: https://github.com/wz337, https://github.com/awgu
2023-08-25 21:56:03 +00:00
08e49fe97a Make openxla and opexla_eval backend show up in list_backends (#107905)
The reason to keep the non-aot(openxla_eval) backend is discussed in https://github.com/pytorch/xla/issues/5430#issuecomment-1683191662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107905
Approved by: https://github.com/jansel
2023-08-25 21:52:17 +00:00
6c0ce03b1f [inductor] WeakDep should not prevent dead node elimination (#107813)
A WeakDep is classed as a read dependency but the buffer is never actually read.
Instead it only effects schedule ordering. So for the purposes of dead node
elimination we should ignore WeakDeps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107813
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-08-25 21:20:04 +00:00
71045f4885 AOTInductor: error: ‘c10::Dispatcher’ has not been declared for CPU model (#107935)
Differential Revision: D48675055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107935
Approved by: https://github.com/houseroad
2023-08-25 21:17:28 +00:00
1374974d60 [Quant][Inductor] Enable quantization conv_binary(add/add_relu) pattern fusion inside inductor (#105456)
**Summary**
Enable the `dequant-conv2d-binary_postop(add)-unary_postop(relu)-quant` pattern fusion and lowering inside inductor.

**Test Plan**
```
clear && python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105456
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588, #104590, #105455
2023-08-25 21:16:02 +00:00
d2105a8688 inductor: support masked load for cpu path (#107670)
For max_pooling code:

```

#pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(56L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(64L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i1)));
                            auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                            auto tmp2 = to_float_mask(tmp0 >= tmp1);
                            auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(112));
                            auto tmp4 = to_float_mask(tmp0 < tmp3);
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = at::vec::Vectorized<int>(static_cast<int>((-1L) + (2L*i2)));
                            auto tmp7 = to_float_mask(tmp6 >= tmp1);
                            auto tmp8 = to_float_mask(tmp6 < tmp3);
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>((-7232L) + i3 + (128L*i2) + (14336L*i1) + (802816L*i0)), 16);
                                auto tmp13 = cvt_lowp_fp_to_fp32<bfloat16>(tmp12);

                                return tmp13;
                            }
                            ;
                            auto tmp14 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10));
```

the index used by ```tmp12``` may not be valid: for example, with ```i1=0, i2=0, i3=0```, the index is ```-7232L```, which is out of bounds. We may hit a segmentation fault when we call ```tmp11()```. The original intent is that we only read the value when ```tmp10``` (the index-check variable) is true, so we can safely get the value; this PR adds masked-load support to fix the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107670
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-25 21:11:09 +00:00
f87ffe473d Handle 2D blocking with foreach (#107840)
Previously blocking in foreach ops was only 1D. This PR allows handling kernels with 2D blocking with foreach as well.

Code when at least one dim matches:
[example code + output](https://gist.github.com/mlazos/9f100b21cfe2540f0a24303a8349c196)

Code when neither X or Y dim matches:
[example code + output](https://gist.github.com/mlazos/14e2a455f635896dface09be601595dd)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107840
Approved by: https://github.com/jansel
2023-08-25 20:32:36 +00:00
cyy
d9fb7166d6 [BE] use DeviceIndex instead of int64_t for related device interfaces (#103068)
This PR unifies the device interfaces in aten/*cpp and torch/csrc/*cpp to use  **c10::DeviceIndex**.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103068
Approved by: https://github.com/malfet
2023-08-25 20:16:14 +00:00
4656e09431 Fixes #107737 SGD doc blank line (#107738)
docs preview brings joy
<img width="774" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/1bfaae64-16f2-448a-8af2-36303d2845db">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107738
Approved by: https://github.com/mikaylagawarecki
2023-08-25 19:48:30 +00:00
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787a9d21b064b664c5344376265feea6c.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
c68d0a7042 [ATen] Update pre-compiled header (#106915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106915
Approved by: https://github.com/lezcano
ghstack dependencies: #106914
2023-08-25 18:24:05 +00:00
a6c29b7227 Remove some unnecessary <iostream> includes from headers (#106914)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream` so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
2023-08-25 18:24:05 +00:00
78a053bad7 [activation checkpointing] Add default autocast keys to functional rng wrappers (#107934)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107934
Approved by: https://github.com/xw285cornell
2023-08-25 18:22:02 +00:00
3992450e8d Add backward check for test_memory_format (#106104)
Add backward check for test_memory_format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106104
Approved by: https://github.com/mikaylagawarecki
2023-08-25 18:11:54 +00:00
c1e0fb7ff0 [Quant][Inductor] Enable quantization conv_unary(relu) pattern fusion inside inductor (#105455)
**Summary**
Enable the `dequant-conv2d-unary_postop(relu)-quant` pattern fusion and lowering inside inductor.

**Test Plan**
```
clear && python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105455
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588, #104590
2023-08-25 18:07:29 +00:00
4f3ff16baf [Quant][Inductor] Enable dequant promotion inside inductor (#104590)
**Summary**
Enable the `dequant pattern` promotion pass in inductor. In the qconv weight prepack pass, we match the `dequant->conv2d` pattern; if the `dequant` pattern has multiple user nodes, it will fail to be matched.
Taking the example of
```
        conv1
       /     \
   conv2    conv3
```
After quantization flow, it will generate pattern as
```
      dequant1
          |
        conv1
          |
        quant2
          |
       dequant2
       /     \
   conv2    conv3
```
We need to duplicate `dequant2` into `dequant2` and `dequant3`, in order to make `dequant2->conv2` and  `dequant3->conv3`  pattern matched.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_dequant_promotion
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104590
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581, #104588
2023-08-25 18:01:06 +00:00
087c0613c3 Implement size checking for copy_ with meta tensors (#107779)
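A hedged illustration of the behavior implied by the title; whether this exact case raises is an assumption:

```python
import torch

dst = torch.empty(4, 5, device="meta")
src = torch.empty(2, 3, device="meta")   # not broadcastable to dst's shape
try:
    dst.copy_(src)
except RuntimeError as e:
    print("copy_ rejected the size mismatch:", e)
```
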
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107779
Approved by: https://github.com/ezyang
2023-08-25 17:59:16 +00:00
46f63e283b [Quant][Inductor] Enable quantization conv pattern fusion inside inductor (#104588)
**Summary**
Enable the `dequant-conv2d-quant` pattern fusion and lowering inside inductor.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104588
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580, #104581
2023-08-25 17:57:13 +00:00
572bc4817d Fix how DDPOptimizer clones dynamo callback (#107834)
Instead of hardcoding a new callback creation using 'convert_frame',
add an attribute to both callbacks that implement 'self cloning with new
backend', so DDPOptimizer can invoke this in a consistent way.

Fixes #107686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107834
Approved by: https://github.com/ezyang
2023-08-25 17:46:36 +00:00
25678e31dc [Quant][Inductor] Enable quantized conv weight prepack inside inductor constant folding (#104581)
**Summary**
Enable quantization conv weight prepack inside inductor constant folding.

**Test Plan**
```
python -m pytest test_mkldnn_pattern_matcher.py -k test_qconv2d_unary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104581
Approved by: https://github.com/jgong5, https://github.com/eellison
ghstack dependencies: #104580
2023-08-25 17:37:41 +00:00
2b7271c703 Support cond and out_dtype for predispatch (#107941)
Summary: Title

Test Plan: CI

Differential Revision: D48675742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107941
Approved by: https://github.com/jerryzh168
2023-08-25 17:37:16 +00:00
8ef057255d [Quant][PT2E] Enable qconv for quantization 2.0 export (#104580)
**Summary**
Enable `qconv1d/2d/3d`, `qconv2d_relu`, `qconv2d_add`, and `qconv2d_add_relu` operator for quantization 2.0 export with oneDNN library.

**Test Plan**
```
python -u -m pytest -s -v test_quantized_op.py -k test_qconv1d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv3d_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_relu_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_add_pt2e
python -u -m pytest -s -v test_quantized_op.py -k test_qconv2d_add_relu_pt2e
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104580
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-08-25 17:34:45 +00:00
679e8e9d48 [cuda] Fix the incorrect types in int8_gemm (#107895)
Fixes #107671

From cublas team: alpha and beta need to be of the same C++ type as of scaleType, which is `int32_t` here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107895
Approved by: https://github.com/Skylion007, https://github.com/cpuhrsch
2023-08-25 17:30:54 +00:00
d24c457b30 [inductor] Add cat + split_with_sizes elimination pass (#107956)
Summary:

When the `cat` inputs' sizes and the `split_sizes` of the downstream `split_with_sizes` match, the `cat` + `split_with_sizes` constellation can be eliminated. E.g. here:

```
@torch.compile
def fn(a, b, c):
    cat = torch.ops.aten.cat.default([a, b, c], 1)
    split_with_sizes = torch.ops.aten.split_with_sizes.default(cat, [2, 3, 5], 1)
    return [s ** 2 for s in split_with_sizes]

inputs = [
    torch.randn(2, 2, device="cuda"),
    torch.randn(2, 3, device="cuda"),
    torch.randn(2, 5, device="cuda"),
]
output = fn(*inputs)
```

This PR adds a new fx pass for such elimination. The new pass is similar to the existing [`splitwithsizes_cat_replace`](b18e1b684a/torch/_inductor/fx_passes/post_grad.py (L508)), but considers the ops in the opposite order.

Test Plan:

```
$ python test/inductor/test_pattern_matcher.py

...

----------------------------------------------------------------------
Ran 21 tests in 46.450s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107956
Approved by: https://github.com/jansel
2023-08-25 17:17:19 +00:00
9af0e47653 Hide transform method by renaming it (#107940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107940
Approved by: https://github.com/tugsbayasgalan
2023-08-25 16:31:44 +00:00
598babf017 Added normal op decomposition for specializations of the normal op (#106792)
This fixes running normal with the meta key.

```
import torch

t = torch.tensor(4.0, device='meta')
torch.normal(0.5, t)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106792
Approved by: https://github.com/lezcano
2023-08-25 16:18:28 +00:00
b4c6c4da88 Revert "[Dynamo] cache_size policy (#107496)"
This reverts commit 4175a6e9443c58d57e0fc595f506c43aec8cb477.

Reverted https://github.com/pytorch/pytorch/pull/107496 on behalf of https://github.com/ZainRizvi due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/107496#issuecomment-1693590121))
2023-08-25 16:07:14 +00:00
4b44b1861d [export] Store the arguments used to trace the exported program in itself (#107906)
The proper fix would be to do something like https://github.com/pytorch/pytorch/pull/107877, but since that depends on internal changes and would take too long for the diff train to land, we will first just make OSS work using torch.save.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107906
Approved by: https://github.com/gmagogsfm
2023-08-25 16:04:58 +00:00
48e05d5d44 [ROCm] enable missed cpp tests - test_libtorch_jit (test_jit and test_lazy) (#107234)
[ROCm] enable missed cpp tests - test_libtorch_jit (test_jit and test_lazy)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107234
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/ezyang
2023-08-25 16:03:58 +00:00
b382d55338 [core aten] Remove split.Tensor from core aten (#107938)
Removing split.Tensor from core aten as it can be decomposed into split_with_sizes
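
A hedged sketch of that decomposition, expressing a fixed `split_size` in terms of `split_with_sizes`:

```python
import torch

def split_via_split_with_sizes(x, split_size, dim=0):
    n = x.size(dim)
    sizes = [split_size] * (n // split_size)
    if n % split_size:
        sizes.append(n % split_size)   # trailing remainder chunk
    return x.split_with_sizes(sizes, dim)

x = torch.arange(10)
assert all(torch.equal(a, b)
           for a, b in zip(split_via_split_with_sizes(x, 3), torch.split(x, 3)))
```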

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107938
Approved by: https://github.com/larryliu0820
2023-08-25 15:53:04 +00:00
b445ed3158 Cleanup RUNNER_TEMP folder (#107868)
Cleanup RUNNER_TEMP folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107868
Approved by: https://github.com/atalman
2023-08-25 15:09:15 +00:00
3a3cf0e09d Revert "[optim] Make casting to match params a hook (#106725)"
This reverts commit 9f86d8517201a3d473a8e80d12e22b46570c88a2.

Reverted https://github.com/pytorch/pytorch/pull/106725 on behalf of https://github.com/janeyx99 due to We acknowledge this is a huge risk because people do not remember to call super().__init__ from their Optimizer subclasses and so this will break lots of load_state_dict behavior ([comment](https://github.com/pytorch/pytorch/pull/106725#issuecomment-1693386137))
2023-08-25 13:47:19 +00:00
b9472decf8 Initial Python 3.12 build fixes (#106083)
This compiles with python 3.12
You can get numpy from https://anaconda.org/scientific-python-nightly-wheels/numpy/files so that you don't need to remove numpy from test files.

Basic core tests work but obviously dynamo and first class dims don't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106083
Approved by: https://github.com/ezyang
2023-08-25 13:23:48 +00:00
97a291f6bd [ONEDNN][BC-breaking] update onednn from v2.7.3 to v3.1.1 (#97957)
**Summary**
Update onednn from v2.7.3 to v3.1.1.
It is bc-breaking as some APIs are changed on oneDNN side. Changes include:
- PyTorch code where oneDNN is directly called
- Submodule `third_party/ideep` to adapt to oneDNN's new API.
- CMAKE files to fix build issues.

**Test plan**
Building issues and correctness are covered by CI checks.
For performance, we have run TorchBench models to ensure there is no regression. Below is the comparison before and after oneDNN update.
![image](https://github.com/pytorch/pytorch/assets/12522207/415a4ff0-7566-40c6-aed0-24997a475b0e)

Note:
- Base commit of PyTorch: da322ea
- CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97957
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-08-25 12:13:18 +00:00
ff37f6018d Enable custom device support in fsdp checkpoint (#107289)
Fixes https://github.com/pytorch/pytorch/issues/104390
Enable custom device (privateuse1 backend) support in checkpointing via a dynamic abstract device module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107289
Approved by: https://github.com/wz337
2023-08-25 11:50:03 +00:00
b18e1b684a Bump scipy from 1.9.3 to 1.10.1 in /.ci/docker (#104746)
* Bump scipy from 1.9.3 to 1.10.1 in /.ci/docker

Bumps [scipy](https://github.com/scipy/scipy) from 1.8.1 to 1.10.0.
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](https://github.com/scipy/scipy/compare/v1.8.1...v1.10.0)

---
updated-dependencies:
- dependency-name: scipy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .ci/docker/requirements-ci.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
2023-08-25 20:20:20 +09:00
196ef78b90 [ROCm] Use rocm manylinux builder image for triton wheels (#107600)
Update to ROCm triton pinned commit for the 2.1 branch cut off.

As part of this, we are updating `build_triton_wheel.py` and `build-triton-wheel.yml` to support building ROCm triton wheels through pytorch/manylinux-rocm, avoiding the need to slowly download ROCm rpm libraries in the cpu manylinux builder image and the need to maintain a conditional file with hard-coded repositories from radeon.org for every ROCm release.

This new approach will allow us to build wheels faster in a more easily maintainable way.

This PR also brings in a required change as Triton on ROCm requires device_type to be set to hip so we can pass down the correct device type to triton (https://github.com/ROCmSoftwarePlatform/triton/pull/284).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107600
Approved by: https://github.com/jansel, https://github.com/jithunnair-amd
2023-08-25 10:25:29 +00:00
39854df1d3 Make validate private by renaming validate to _validate (#107927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107927
Approved by: https://github.com/tugsbayasgalan
2023-08-25 08:14:56 +00:00
f2f82855e2 Add tests for foreach copy (#107860)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107860
Approved by: https://github.com/eellison, https://github.com/jansel
2023-08-25 07:32:11 +00:00
925d71e72e [core][sparse][pruning] cuSPARSELt Kernels and ops. (#107398)
Summary:
This is a duplicate PR of 102133, which was reverted because it was
failing internal tests.

It seems that internal builds did not like my guard checking whether
cuSPARSELt was available or not.

Test Plan: python test/test_sparse_semi_structured.py

Differential Revision: D48440330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107398
Approved by: https://github.com/cpuhrsch
2023-08-25 07:04:15 +00:00
c2ac0da445 Enhance fakepg: add fsdp+tp tests (#107626)
from working on a starter task with @wanchaol (T161350434):
Add the previously unsupported fsdp+tp example in FakePG, which required scatter and broadcast, as a unit test: https://github.com/pytorch/examples/blob/main/distributed/tensor_parallelism/two_d_parallel_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107626
Approved by: https://github.com/wanchaol
ghstack dependencies: #107625
2023-08-25 06:17:54 +00:00
bfcd26459c improved error message for IO mismatch (#107907)
Previously when we found some input or output mismatch between original args / traced result vs. graph-captured input / output, we would have a pretty sparse error message. (This might be partly due to the urge to reuse the same code for matching both inputs and outputs.)

With this PR we now point out which input or output is problematic, what its type is, and also present the expected types along with descriptions of what they mean. We don't suggest any fixes, but the idea is that it should be evident what went wrong looking at the error message.

Differential Revision: [D48668059](https://our.internmc.facebook.com/intern/diff/D48668059/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107907
Approved by: https://github.com/gmagogsfm
2023-08-25 06:08:44 +00:00
bfb09204bd Expose torch.export.{save,load} APIs (#107888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107888
Approved by: https://github.com/angelayi
2023-08-25 06:06:36 +00:00
4f2ff1d019 add get buffer from exported program (#107809)
Summary: We have a util function to get params; for parity, we also need a util function to get buffers.

Test Plan:
```
buck test //caffe2/test:test_export
```

Differential Revision: D48610877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107809
Approved by: https://github.com/JacobSzwejbka
2023-08-25 05:46:04 +00:00
a0cfaf0688 [quant][pt2e] Make sure XNNPACKQuantizer works with the pre_dispatch=True (#107872)
Summary: att

Test Plan:
```
buck test //executorch/backends/xnnpack/test:test_xnnpack_quantized_models -- test_resnet18

buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D48415977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107872
Approved by: https://github.com/andrewor14
2023-08-25 05:04:01 +00:00
86f9fec3ac Avoid decomposing _unsafe_index in Inductor (#107882)
`_unsafe_index` was previously added to the core ATen decomp table in https://github.com/pytorch/pytorch/pull/106814, but this has performance ramifications for Inductor. Therefore, this diff removes it from the decomposition table used by Inductor.

Differential Revision: [D48649210](https://our.internmc.facebook.com/intern/diff/D48649210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107882
Approved by: https://github.com/SherlockNoMad
2023-08-25 04:51:53 +00:00
e00bd83124 Fix the example of torch.slice_scatter (#107849)
Fixes #107681
fix the example of torch.slice_scatter
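
For context, a small usage sketch of torch.slice_scatter (not the exact documentation example touched by this PR):

```python
import torch

a = torch.zeros(8, 8)
b = torch.ones(2, 8)
# Overwrite rows 6 and 7 of `a` with `b`; `a` itself is left untouched.
out = torch.slice_scatter(a, b, dim=0, start=6)
print(out[5:])
```
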
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107849
Approved by: https://github.com/drisspg
2023-08-25 04:19:49 +00:00
8b7b824dca [inductor][ac] preserve recompute tags through pattern matching (#107742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107742
Approved by: https://github.com/eellison
2023-08-25 03:48:26 +00:00
df2ca1871d [vision hash update] update the pinned vision hash (#107911)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107911
Approved by: https://github.com/pytorchbot
2023-08-25 03:27:42 +00:00
6e85a68829 [MPS] Implement polar via metal shader (#107324)
Use `view_as_real` to cast complex into a pair of floats and then it becomes just another binary operator.
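
A small illustration of the polar / view_as_real relationship relied on above (plain CPU example, not the Metal shader itself):

```python
import torch

mag = torch.tensor([1.0, 2.0])
angle = torch.tensor([0.0, 3.14159265 / 2])
z = torch.polar(mag, angle)    # mag * (cos(angle) + i*sin(angle)), complex dtype
pairs = torch.view_as_real(z)  # shape (2, 2): real/imag parts as plain floats
print(z)
print(pairs)
```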

Enable `polar` and `view_as_complex` consistency tests, but skip `test_output_grad_match_polar_cpu` as the `mul` operator is not yet supported

Remove the redundant `#ifdef __OBJC__` and capture and re-throw exceptions raised during the `createCacheBlock` block.
Fixes https://github.com/pytorch/pytorch/issues/78503

TODOs(in followup PRs):
  - Implement backwards (requires complex mul and sgn)
  - Measure the perf impact of computing the strides on the fly rather than ahead of time (unrelated to this PR)

Partially addresses https://github.com/pytorch/pytorch/issues/105665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107324
Approved by: https://github.com/albanD
2023-08-25 03:16:23 +00:00
00e9735ee3 [ONNX] Enable 'ExportOutput.save' for models larger than 2GB (#107904)
Previously it fails during serialization, despite onnxscript graph_building managed to return ModelProto > 2GB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107904
Approved by: https://github.com/abock
2023-08-25 03:08:38 +00:00
fc33dc014a [inductor][fx passes] batch tanh in pre grad (#107881)
Summary:
Daohang reported this pattern in f469463749
{F1074472207}
{F1074473348}
Hence, we can fuse the tanh ops that follow the same split.
Typically the pattern looks like split -> getitem 0,...,n -> tanh(getitem 0,...,n). Hence, we search for the parent node of each tanh node, which should be getitem(parent, index). If the tanh ops follow the same split node, the parent nodes of their getitem nodes should all be the same.
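
A small eager-mode illustration of the pattern and its batched equivalent (this is not the fusion pass itself, just the shape of the transformation):

```python
import torch

x = torch.randn(6, 4)
parts = torch.split(x, 2, dim=0)           # split -> getitem 0..n
outs = [torch.tanh(p) for p in parts]      # one tanh per getitem

# Batched form: a single tanh over the stacked inputs, unbound back into chunks.
batched = torch.tanh(torch.stack(parts)).unbind()
assert all(torch.allclose(a, b) for a, b in zip(outs, batched))
```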

Test Plan:
```
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (c78736187)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/df87affc-d294-4663-a50d-ebb71b98070d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149208311124
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 1:19.9s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D48581140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107881
Approved by: https://github.com/yanboliang
2023-08-25 03:02:30 +00:00
0a9778a372 Expose cudaStreamCaptureMode in CUDA Graphs, use local setting in inductor (#107407)
>  capture_error_mode (str, optional): specifies the cudaStreamCaptureMode for the graph capture stream.
Can be "global", "thread_local" or "relaxed". During cuda graph capture, some actions, such as cudaMalloc,
 may be unsafe. "global" will error on actions in other threads, "thread_local" will only error for
 actions in the current thread, and "relaxed" will not error on these actions.

Inductor codegen is single-threaded, so it should be safe to enable "thread_local" for inductor's cuda graph capturing. We have seen errors when inductor cudagraphs has been used concurrently with data preprocessing in other threads.
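
A minimal sketch of capturing a graph with this mode (requires a CUDA device; the `capture_error_mode` argument name is taken from the description quoted above):

```python
import torch

g = torch.cuda.CUDAGraph()
x = torch.zeros(4, device="cuda")
with torch.cuda.graph(g, capture_error_mode="thread_local"):
    y = x + 1  # captured, not executed eagerly
g.replay()
print(y)  # result of the replayed capture
```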

Differential Revision: [D48656014](https://our.internmc.facebook.com/intern/diff/D48656014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107407
Approved by: https://github.com/albanD, https://github.com/eqy
2023-08-25 01:44:26 +00:00
c18d2a3c05 profiler tree test: skip cudaGetDeviceProperties_v2, cudaGetDeviceCount (#107887)
I don't know why these are getting called. But, they only get called on cuda machines, which is breaking tests (https://github.com/pytorch/pytorch/issues/100728).

We can just prune them so that the same result is shown for both CPU and CUDA tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107887
Approved by: https://github.com/aaronenyeshi
2023-08-25 01:32:25 +00:00
ec10b17cfb [FSDP] verify backward_prefetch works correctly with unit test (#107058)
issue resolved: https://github.com/pytorch/pytorch/pull/105984

context:
* CI did not catch the commit that breaks backward_prefetch https://github.com/pytorch/pytorch/pull/105006
* we had an action item to add unit test to prevent similar cases: https://github.com/pytorch/pytorch/pull/105984

What's included in this unit test:
* monkey patch torch.distributed.fsdp._runtime_utils._get_handle_to_prefetch and check which handles are prefetched (a minimal sketch of this monkey patching appears after the lists below)

for backward_prefetch = BackwardPrefetch.BACKWARD_PRE
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* pre-backward hook order: root -> decoder 5...0 -> encoder 5...0
* prefetch order: decoder 5...0 -> encoder 5...0 -> None
  * when current_handle=encoder 0, _get_handle_to_prefetch returns None

for backward_prefetch = BackwardPrefetch.BACKWARD_POST
* state._exec_order_data.handles_post_forward_order equals forward order: encoder 0...5 -> decoder 0...5 -> root
* post-backward hook (AccumulateGrad) order: decoder 5, 4...0 -> encoder 5...0 -> root
* prefetch order: decoder 4...0 -> encoder 5...0 -> None -> None
  * 1st None: when current_handle=encoder 0, _get_handle_to_prefetch returns None
  * 2nd None: when current_handle=root, we get decoder 5 inside _get_handle_to_prefetch, but it is not needed, so None is returned
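
A minimal sketch of the monkey patching described above (not the actual test; it only records which handle each call returns):

```python
import torch.distributed.fsdp._runtime_utils as runtime_utils

prefetched_handles = []
_orig_get_handle = runtime_utils._get_handle_to_prefetch

def _spy(*args, **kwargs):
    handle = _orig_get_handle(*args, **kwargs)
    prefetched_handles.append(handle)
    return handle

runtime_utils._get_handle_to_prefetch = _spy
# Run a forward/backward pass with FSDP(..., backward_prefetch=...) and then compare
# prefetched_handles against the expected orders listed above.
```
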
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107058
Approved by: https://github.com/awgu
2023-08-25 01:12:43 +00:00
485de73004 Improve unbacked symint error msg (#107806)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107806
Approved by: https://github.com/avikchaudhuri
2023-08-25 01:07:09 +00:00
cyy
49eeca00d1 [1/N] fix clang-tidy warnings in torch/csrc (#107648)
Apply fixes to some found issues by clang-tidy in torch/csrc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107648
Approved by: https://github.com/Skylion007
2023-08-25 00:30:09 +00:00
7dd1113463 Expose ExportedProgram and related classes (#107852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107852
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2023-08-25 00:07:00 +00:00
49fbaa29e6 [c10d] Increase socket buffer size to allow ProcessGroup init up to 12k ranks (#107878)
The c10d socket and gloo listener both set their buffer size to 2048, which causes connection issues at 4k scale. This diff sets the buffer size to `-1`, which uses `somaxconn` as the actual buffer size, aiming to enable 24k PG init without crashes. The experiment shows successful creation of 12k ranks without a crash.

split the original diff for OSS vs. internal.

Caution: we need the change on both gloo and c10d to enable 12k PG init. Updating only one side may not offer the benefit.

Differential Revision: [D48634654](https://our.internmc.facebook.com/intern/diff/D48634654/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107878
Approved by: https://github.com/H-Huang, https://github.com/fduwjj
2023-08-25 00:06:30 +00:00
8a7a6867b9 [PyTorch][Tensor] Introduce tensor.dim_order (#106835)
Summary:
This is a stride based attribute for a tensor available in Python.

This can help inspect tensors generated using `torch.empty_permuted(.., physical_layout, ...)`, where physical_layout should match the dim_order returned here. `empty_permuted` will be renamed to use dim_order as the param name in the future. It also helps the Executorch export pipeline implement dim_order-based tensors.
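
A small illustration, assuming the semantics described above (the dim_order of a tensor created with empty_permuted should match the physical_layout passed in):

```python
import torch

t = torch.empty_permuted((2, 3, 4, 5), (0, 2, 3, 1))  # NHWC-style physical layout
print(t.dim_order())  # expected: (0, 2, 3, 1)
print(t.shape)        # logical shape stays (2, 3, 4, 5)
```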

Differential Revision: D48134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106835
Approved by: https://github.com/ezyang
2023-08-25 00:06:03 +00:00
2fbe6ef2f8 [pytorch][Quant] Fix bias quant bug (#107810)
Summary: Bias should be quantized by act_scale * weight_scale in conv and linear
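
A worked illustration of the convention (made-up scale values, not code from this PR):

```python
act_scale = 0.02
weight_scale = 0.005
bias_scale = act_scale * weight_scale     # 1e-4: the bias quantization scale
bias_fp = 0.37
bias_int32 = round(bias_fp / bias_scale)  # 3700: the quantized (int32) bias value
print(bias_scale, bias_int32)
```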

Test Plan: Rewrite tests

Differential Revision: D48606828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107810
Approved by: https://github.com/jerryzh168
2023-08-24 23:44:19 +00:00
497571df58 [aot_inductor] fix hardcoded output dtype (#107825)
Summary: as titled

Reviewed By: chenyang78

Differential Revision: D47779519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107825
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-08-24 22:16:13 +00:00
870fa460be Enhance fakepg: send and recv (#107625)
from working on a starter task with @wanchaol (T161350434):
Enhance Fake Process Group by adding missing collectives: send, recv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107625
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2023-08-24 22:06:34 +00:00
66dc1aba03 [Inductor][MacOS] resolve macos openmp problem and provide a holistic instruction (#107111)
There have been several reports of difficulty in using OpenMP on MacOS, e.g.: https://github.com/pytorch/pytorch/issues/95708 . And there are several PRs to fix it, e.g.: https://github.com/pytorch/pytorch/pull/93895 and https://github.com/pytorch/pytorch/pull/105136 .

This PR tries to explain the root cause, and provide a holistic and systematic way to fix the problem.

For the OpenMP program below to run, the compiler must:
- Be able to process macros like `#pragma omp parallel`
- Be able to find header files like `<omp.h>`
- Be able to link to a library file like `libomp`

```C++
#include <omp.h>

int main()
{
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        int y = id * nthrds;
    }
}
```

In MacOS, there might be different compiler tools:
- Apple builtin `clang++`, installed with `xcode commandline tools`. The default `g++` and `clang++` commands both point to the Apple version, as can be confirmed by `g++ --version`
- Public `clang++`, can be installed via `brew install llvm`.
- Public GNU compiler `g++`, can be installed via `brew install gcc`.

Among these compilers, public `clang++` from LLVM and `g++` from GNU both support OpenMP with the flag `-fopenmp`. They have shipped with `<omp.h>` and `libomp` support. The only problem is that Apple builtin `clang++` does not contain `<omp.h>` or `libomp`. Therefore, users can follow the steps to enable OpenMP support:

- Use a compiler other than Apple builtin clang++ by specifying the `CXX` environment variable
- Use `conda install llvm-openmp` to place the header files and lib files inside conda environments (and can be discovered by `CONDA_PREFIX`)
- Use `brew install libomp` to place the header files and lib files inside brew control (and can be discovered by `brew --prefix libomp`)
- Use a custom install of OpenMP by specifying an `OMP_PREFIX` where header files and lib files can be found.

This PR reflects the above logic, and might serve as a final solution for resolving OpenMP issues in MacOS.

This PR also resolves the discussion raised in https://dev-discuss.pytorch.org/t/can-we-add-a-default-backend-when-openmp-is-not-available/1382/5 with @jansel, provides a way for brew users to automatically find the installation via `brew --prefix libomp`, and provides instructions to switch to another compiler via the `CXX` environment variable.

I have tested the following code in different conditions:
- Use `CXX` to point to an LLVM-clang++, works fine.
- Use `CXX` to point to a GNU g++: not working, because of the compiler flag `-Xclang`. Manually removing the code `base_flags += " -Xclang"` makes it work.
- Use default compiler and `conda install llvm-openmp`, works fine
- Use default compiler and `brew install libomp`, works fine
- Do nothing, compiler complains `omp.h` not found.
```python
import torch

@torch.compile
def f(x):
    return x + 1

f(torch.randn(5, 5))
```

If we want the code to be more portable, we can also deal with the `-Xclang` issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107111
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-24 21:58:27 +00:00
4175a6e944 [Dynamo] cache_size policy (#107496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107496
Approved by: https://github.com/ezyang
ghstack dependencies: #107645
2023-08-24 21:50:00 +00:00
d7130e9704 Add SingletonSymIntNode (#107089)
Adds `SingletonSymNodeImpl` (alternatively, `SkolemSymNodeImpl`). This is an int-like object that only allows the `eq` operation; any other operation produces an error.

The main complexity is that we require operations that dispatch to SymNode must take and return SymNodes, but when performing operations involving `SingletonSymNodeImpl`, operations involving SymNode can return non-SymNode bools.  For more discussion see [here](https://docs.google.com/document/d/18iqMdnHlUnvoTz4BveBbyWFi_tCRmFoqMFdBHKmCm_k/edit)
- Introduce `ConstantSymNodeImpl` a generalization of `LargeNegativeIntSymNodeImpl` and replace usage of `LargeNegativeIntSymNodeImpl`  in SymInt.
- Also use ConstantSymNodeImpl to enable SymBool to store its data on a SymNode. Remove the  assumption that if SymBool holds a non-null SymNode, it must be symbolic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107089
Approved by: https://github.com/ezyang
ghstack dependencies: #107839
2023-08-24 21:38:47 +00:00
a41d15e458 Update nccl submodule to 2.18.5 (#107883)
Updates NCCL submodule to v2.18.5. It's exactly the same as 2.18.3 except for a few bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107883
Approved by: https://github.com/ezyang
2023-08-24 21:30:27 +00:00
5c133e91c3 Add guidance on the tutorials release process (#107871)
Clarify the tutorials release process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107871
Approved by: https://github.com/atalman
2023-08-24 21:27:35 +00:00
1ef4bd169d [ROCm] Add conditions for channels last logic (#107812)
Although enforcing NHWC convolutions as inductor's fallback layout brings performance benefits on some hardware, this may not be the case everywhere. Currently on ROCm we are seeing slowdowns on gcnArchs that do not have optimal NHWC implementations, and we would like to introduce some control over this behavior in pytorch. On the ROCm MI200 series we will default to the enforced channels-last behavior, aligned with the rest of pytorch, but on non-MI200 series we will disable the forced layout.

For now we are using torch.cuda.get_device_name(0) for this control but we will replace with gcnArchName when https://github.com/pytorch/pytorch/pull/107477 lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107812
Approved by: https://github.com/jataylo, https://github.com/eellison
2023-08-24 19:39:49 +00:00
40cbda274b document memory snapshotting (#107660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107660
Approved by: https://github.com/albanD
ghstack dependencies: #107171, #107399
2023-08-24 19:20:03 +00:00
cd031f13ba [security] Move s3-html-update workflow into its own environment (#107889)

Add environment name for S3 HTMLs workflow. This allows secure and controlled access to the secrets and approval for updating the PyTorch whl indexes on S3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107889
Approved by: https://github.com/huydhn
2023-08-24 19:08:51 +00:00
6c508e0be4 refactor common code, fix test discovery (#107506)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107506
Approved by: https://github.com/voznesenskym
2023-08-24 18:14:38 +00:00
969bf8a054 Fix the document of torch.nn.functional.conv2d (#107851)
Fixes #107692

Fix the document of torch.nn.functional.conv2d
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107851
Approved by: https://github.com/mikaylagawarecki
2023-08-24 18:02:03 +00:00
f6cce3c468 Fix sym_{sizes,strides} slow path (#107839)
Previously, when a SymInt was returned from the sym_sizes slow path, it would segfault.

This is useful for tensors that have symbolic sizes and use the sym_sizes slow path, e.g. NestedTensor returning SingletonSymInt as its sizes in the slow path.

See also: https://github.com/pytorch/pytorch/pull/106405/files#r1303714865
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107839
Approved by: https://github.com/ezyang
2023-08-24 17:28:05 +00:00
35de780aa6 Fix Inplace tensor update on transpose (#104689)
Fixes https://github.com/pytorch/pytorch/issues/103650

- To align with HPU device backend architecture.
   Ensure all non-view ops return contiguous fake tensor outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104689
Approved by: https://github.com/ezyang
2023-08-24 16:58:50 +00:00
3cc5c42a23 Fix aot sequence_nr to reset bwd flag (#107210)
The way the aot autograd sequence_nr tracking works is that when we run the aot export logic, the dynamo-captured forward graph is run under an fx.Interpreter, which iterates through the nodes of the forward graph while setting the `current_metadata`.
Since what runs during backward doesn't correspond to any node from the forward, we fall back to the global `current_metadata`. And since this global metadata ends up being shared between runs, that leads to weirdness if we forget to reset things, e.g., the printed results will differ depending on whether this is the first test run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107210
Approved by: https://github.com/bdhirsh
2023-08-24 16:58:12 +00:00
eefce56b66 Revert "[dynamo] Treat monkey patched .forward as dynamic (#107104)"
This reverts commit 79b3a9f94537677f9079915001c88bb0745c1e52.

Reverted https://github.com/pytorch/pytorch/pull/107104 on behalf of https://github.com/ZainRizvi due to Breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/107104#issuecomment-1692072018))
2023-08-24 16:55:33 +00:00
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
918df10198 [Easy] use dtype.itemsize in partitions (#107749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107749
Approved by: https://github.com/davidberard98
2023-08-24 16:07:05 +00:00
0156eeb564 [dynamo] bugfix - make module setattr more restrictive (#107828)
A check got missed in https://github.com/pytorch/pytorch/pull/106092

Fixes https://github.com/pytorch/pytorch/issues/107721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107828
Approved by: https://github.com/eellison
2023-08-24 16:00:29 +00:00
85b0e03df8 Default permissions for torch.hub downloads (#82869)
### Description
The `download_url_to_file` function in torch.hub uses a temporary file to prevent overriding a local working checkpoint with a broken download. This temporary file is created using `NamedTemporaryFile`. However, since `NamedTemporaryFile` creates files with overly restrictive permissions (0600), the resulting download will not have default permissions and will not respect umask on Linux (since moving the file will retain the restrictive permissions of the temporary file). This is especially problematic when trying to share model checkpoints between multiple users, as other users will not even have read access to the file.

The change in this PR fixes the issue by using custom code to create the temporary file without changing the permissions to 0600 (unfortunately there is no way to override the permissions behaviour of existing Python standard library code). This ensures that the downloaded checkpoint file correctly has the default permissions applied. If a user wants to apply more restrictive permissions, they can do so via the usual means (i.e. by setting umask).
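
A small illustration of the underlying behaviour (the exact mode printed can vary by platform):

```python
import os
import stat
import tempfile

# Files made by NamedTemporaryFile get mode 0600 regardless of the process umask,
# so moving one into place as the final checkpoint keeps those permissions.
with tempfile.NamedTemporaryFile(delete=False) as f:
    tmp_path = f.name
print(oct(stat.S_IMODE(os.stat(tmp_path).st_mode)))  # typically 0o600
os.unlink(tmp_path)
```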

See these similar issues in other projects for even more context:
* https://github.com/borgbackup/borg/issues/6400
* https://github.com/borgbackup/borg/issues/6933
* https://github.com/zarr-developers/zarr-python/issues/325

### Issue
https://github.com/pytorch/pytorch/issues/81297

### Testing
Extended the unit test `test_download_url_to_file` to also check permissions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82869
Approved by: https://github.com/vmoens
2023-08-24 15:48:24 +00:00
64d5851b1f make python decomp for native_batch_norm CompositeImplicitAutograd, remove native_batch_norm from core aten opset (#107791)
Summary:
(From Brian Hirsh)

Description copied from what I put in a comment in this PR: https://github.com/pytorch/pytorch/pull/106329

So, the slightly-contentious idea behind this PR is that lower in the stack, I updated torch._decomps.get_decomps() to check not only the decomp table to see if a given op has a decomposition available, but to also check the dispatcher for any decomps registered to the CompositeImplicitAutograd key (link: https://github.com/pytorch/pytorch/pull/105865/files#diff-7008e894af47c01ee6b8eb94996363bd6c5a43a061a2c13a472a2f8a9242ad43R190)

There's one problem though: we don't actually make any hard guarantees that a given key in the dispatcher does or does not point to a decomposition. We do rely pretty heavily, however, on the fact that everything registered to the CompositeImplicitAutograd key is in fact a decomposition into other ops.

QAT would like this API to faithfully return "the set of all decomps that would have run if we had traced through the dispatcher". However, native_batch_norm is an example of an op that has a pre-autograd decomp registered to it (through op.py_impl()), but the decomp is registered directly to the Autograd key instead of being registered to the CompositeImplicitAutograd key.

If we want to provide a guarantee to QAT that they can programmatically access all decomps that would have run during tracing, then we need to make sure that every decomp we register to the Autograd key is also registered to the CompositeImplicitAutograd key.

This might sound kind of painful (since it requires auditing), but I think in practice this basically only applies to native_batch_norm.

Test Plan: python test/test_decomp.py

Differential Revision: D48607575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107791
Approved by: https://github.com/jerryzh168, https://github.com/SherlockNoMad
2023-08-24 15:19:07 +00:00
91a674ccd4 Fix docstring for shape of target for MultiLabelSoftMarginLoss (#107817)
Fixes #92000

The documentation at https://pytorch.org/docs/stable/generated/torch.nn.MultiLabelSoftMarginLoss.html#multilabelsoftmarginloss states:
> label targets padded by -1 ensuring same shape as the input.

However, the shapes of the input and target tensors are compared, and an exception is raised if they differ in either dimension 0 or 1, meaning the label targets are never padded. See the code snippet below and the resulting output. The documentation is therefore adjusted to:
> label targets must have the same shape as the input.

```
import torch
import torch.nn as nn

# Create some example data
input = torch.tensor(
    [
        [0.8, 0.2, -0.5],
        [0.1, 0.9, 0.3],
    ]
)
target1 = torch.tensor(
    [
        [1, 0, 1],
        [0, 1, 1],
        [0, 1, 1],
    ]
)
target2 = torch.tensor(
    [
        [1, 0],
        [0, 1],
    ]
)
target3 = torch.tensor(
    [
        [1, 0, 1],
        [0, 1, 1],
    ]
)
loss_func = nn.MultiLabelSoftMarginLoss()
try:
    loss = loss_func(input, target1).item()
except RuntimeError as e:
    print('target1 ', e)
try:
    loss = loss_func(input, target2).item()
except RuntimeError as e:
    print('target2 ', e)
loss = loss_func(input, target3).item()
print('target3 ', loss)
```

output:
```
target1  The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 0
target2  The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 1
target3  0.6305370926856995
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107817
Approved by: https://github.com/mikaylagawarecki
2023-08-24 15:13:46 +00:00
256fed02e9 Check tensors are defined before attempting to access their impl (#106787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106787
Approved by: https://github.com/albanD
2023-08-24 11:38:35 +00:00
c91d2f5bf6 Remove CUTLASS extensions merged upstream (#107612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107612
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-08-24 11:34:59 +00:00
cf76938f70 remove redundant dynamic_dim (#107815)
Differential Revision: D48618472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107815
Approved by: https://github.com/tugsbayasgalan, https://github.com/gmagogsfm
2023-08-24 10:46:24 +00:00
8354d32f6b Ensure optimizer in backward works with 2d parallel (#107748)
Summary: Test to ensure optimizer in backward works with 2D parallel.

Test Plan: CI

Differential Revision: D48508057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107748
Approved by: https://github.com/awgu, https://github.com/fduwjj
2023-08-24 09:20:00 +00:00
1491bae277 [reland][inductor] Adjust dynamic SMEM limit when above default in AOT (#107814)
Summary:

This relands #107601, which was reverted due to the new test failing in the internal CI. Here we skip the new test (as well as the existing tests in `test_aot_inductor.py`, as those are also failing in the internal CI).

Test Plan:

```
$ python test/inductor/test_aot_inductor.py
...
----------------------------------------------------------------------
Ran 5 tests in 87.309s

OK
```

Differential Revision: [D48623171](https://our.internmc.facebook.com/intern/diff/D48623171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107814
Approved by: https://github.com/eellison
2023-08-24 07:59:51 +00:00
16fcb07846 [quant][pt2e] Add support for channel in DerivedQuantizationSpec (#107833)
Summary:
att

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_derived_qspec_per_channel

Differential Revision: [D48630535](https://our.internmc.facebook.com/intern/diff/D48630535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107833
Approved by: https://github.com/andrewor14
2023-08-24 07:45:13 +00:00
387556318e [ONNX] Cap opset version at 17 for torch.onnx.export (#107829)
Cap opset version at 17 for torch.onnx.export and suggest that users use the dynamo exporter. Warn users instead of failing hard, because we should still allow users to create custom symbolic functions for opset > 17.

Also updates the default opset version by running `tools/onnx/update_default_opset_version.py`.
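
A minimal export call pinned at opset 17, the highest opset supported by the TorchScript-based exporter after this change (a sketch, assuming the `onnx` package is installed):

```python
import torch

model = torch.nn.Linear(4, 2)
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "linear.onnx", opset_version=17)
```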

Fixes #107801 Fixes #107446
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107829
Approved by: https://github.com/BowenBao
2023-08-24 07:21:10 +00:00
444875cd25 constraint violation error messages (#107790)
Currently there are 4 cases where constraint violation errors are raised, but the error messages are (a) inconsistent in their information content (b) worded in ways that are difficult to understand for the end user.

This diff cuts one of the cases that can never be reached, and makes the other 3
(a) consistent, e.g. they all point out that some values in the given range may not work, citing a reason and asking the user to run with logs to follow up
(b) user-friendly, e.g., compiler-internal info is cut out or replaced with user-facing syntax.

Differential Revision: D48576608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107790
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-08-24 06:58:25 +00:00
1e71c51350 [export] Serialize map correctly (#107837)
Summary: Previously serializing graphs using map would error
because map returns a singleton tensor list rather than a
single tensor. So this diff adds support for the case where a higher order operator
returns a list of tensors as output.

We also run into an issue with roundtripping the source_fn on
map nodes/subgraphs. The source_fn originally is
<functorch.experimental._map.MapWrapper object at 0x7f80a0549930>, which
serializes to `functorch.experimental._map.map`. However, we are unable
to construct the function from this string. This should be fixed once
map becomes a fully supported operator like
torch.ops.higher_order.cond.

Differential Revision: [D48631302](https://our.internmc.facebook.com/intern/diff/D48631302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107837
Approved by: https://github.com/zhxchen17
ghstack dependencies: #107818
2023-08-24 06:47:50 +00:00
1166f9a02c [export] Custom object serialization (#107666)
Some Nvidia TRT folks were asking for a way to integrate the serialization of custom objects with export's serialization. After some discussion (more background [here](https://docs.google.com/document/d/1lJfxakmgeoEt50inWZ53MdUtOSa_0ihwCuPy_Ak--wc/edit)), we settled on a way for users to register their custom object's serializer/deserializer functions.

Since TorchScript's `.def_pickle` already exists for [registering custom classes](https://pytorch.org/tutorials/advanced/torch_script_custom_classes.html), and `tensorrt.ICudaEngine` already contains a `.def_pickle` implementation, we'll start off by reusing the existing framework and integrating it with export's serialization.

TorchScript's `.def_pickle` requires users to register two functions, which end up being the `__getstate__` and `__setstate__` methods on the class. The semantics of `__getstate__` and `__setstate__` in TorchScript are equivalent to that of Python pickle modules. This is then registered using pybind's `py::pickle` function [here](https://www.internalfb.com/code/fbsource/[f44e048145e4697bccfaec300798fce7daefb858]/fbcode/caffe2/torch/csrc/jit/python/script_init.cpp?lines=861-916) to be used with Python's pickle to initialize a ScriptObject with the original class, and set the state back to what it used to be.

I attempted to call `__getstate__` and `__setstate__` directly, but I couldn't figure out how to initialize the object to be called with `__setstate__` in python. One option would be to create a `torch._C.ScriptObject` and then set the class and call `__setstate__`, but there is no constructor initialized for ScriptObjects. Another option would be to construct an instance of the serialized class itself, but if the class constructor required arguments, I wouldn't know what to initialize it with. In ScriptObject's `py::pickle` registration it directly creates the object [here](https://www.internalfb.com/code/fbsource/[f44e048145e4697bccfaec300798fce7daefb858]/fbcode/caffe2/torch/csrc/jit/python/script_init.cpp?lines=892-906), which is why I was thinking that just directly using Python's `pickle` will be ok since it is handled here.

So, what I did is that I check if the object is pickle-able, meaning it contains `__getstate__` and `__setstate__` methods, and if so, I serialize it with Python's pickle. TorchScript does have its own implementation of [pickle/unpickle](https://www.internalfb.com/code/fbsource/[59cbc569ccbcaae0db9ae100c96cf0bae701be9a][history]/fbcode/caffe2/torch/csrc/jit/serialization/pickle.h?lines=19%2C82), but it doesn't seem to have pybinded functions callable from python.

A question is -- is it ok to combine this pickle + json serialization?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107666
Approved by: https://github.com/gmagogsfm
2023-08-24 06:36:23 +00:00
6ec2ec845c [exportdb] Fix generating docs (#107838)
Previously I accidentally replaced all `=` with `-`, resulting in clowny code rendering like:
![image](https://github.com/pytorch/pytorch/assets/10901756/738eaf92-8cc6-43bd-b531-224ec44afa9f)

The purpose of replacing the `=` with `-` is to change the RST heading size of modules. So now, I replace strings with more than 3 `=`'s with `-`. This should avoid incorrectly replacing code where we set variables with `=` and do equality checks with `==`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107838
Approved by: https://github.com/gmagogsfm
2023-08-24 06:32:51 +00:00
2fcda650cf Revert "inductor: remove conv_bn folding from pre_grad pass (#106686)"
This reverts commit 22bc08da29ca8900ff877bb8a20f8369894c4c68.

Reverted https://github.com/pytorch/pytorch/pull/106686 on behalf of https://github.com/XiaobingSuper due to Has big accuracy drop for internal models test ([comment](https://github.com/pytorch/pytorch/pull/106686#issuecomment-1690972043))
2023-08-24 04:19:11 +00:00
3af04ce0ff Revert "enable conv+bn folding for mixed-dtype when bn has post activation (#107142)"
This reverts commit 29813c61ea61d616230d3251ade81f56472ecb77.

Reverted https://github.com/pytorch/pytorch/pull/107142 on behalf of https://github.com/XiaobingSuper due to [Depends on reverted https://github.com/pytorch/pytorch/pull/106576](https://github.com/pytorch/pytorch/pull/106686) ([comment](https://github.com/pytorch/pytorch/pull/107142#issuecomment-1690968509))
2023-08-24 04:14:00 +00:00
6178022aac [vision hash update] update the pinned vision hash (#107831)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107831
Approved by: https://github.com/pytorchbot
2023-08-24 03:44:33 +00:00
9bda8f1e16 [inductor][fx passes]batch linear in pre grad (#107759)
Summary:
After we compile the dense arch, we observe a split-linear-cat pattern. Hence, we want to use bmm fusion + the split cat pass to fuse the pattern into torch.baddbmm (see the sketch after this list).
Some explanation of why we prefer pre grad:
1) We need to add bmm fusion before the split cat pass, which is in the pre grad pass, so that the newly added stack and unbind nodes can be removed together with the original cat/split nodes.
2) Post grad does not support torch.stack/unbind. There is a hacky workaround, but it may not land in the short term.
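
A small numeric sketch of the split-linear-cat pattern and its batched equivalent (made-up shapes, not the fusion pass itself):

```python
import torch

x = torch.randn(4, 3, 8)                      # batch of 4, 3 feature groups of width 8
ws = [torch.randn(8, 16) for _ in range(3)]   # one linear weight per group
bs = [torch.randn(16) for _ in range(3)]      # one bias per group

# Unfused: split into groups, apply a separate linear per group, concatenate.
chunks = torch.unbind(x, dim=1)
unfused = torch.cat([c @ w + b for c, w, b in zip(chunks, ws, bs)], dim=-1)

# Fused: a single batched matmul over the stacked weights and biases.
fused = torch.baddbmm(torch.stack(bs).unsqueeze(1), x.transpose(0, 1), torch.stack(ws))
assert torch.allclose(unfused, fused.transpose(0, 1).reshape(4, 48), atol=1e-5)
```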

Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
[jackiexu0313@devgpu005.cln5 ~/fbsource/fbcode (f0ff3e3fc)]$ buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/189dd467-d04d-43e5-b52d-d3b8691289de
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5910974704097734
Network: Up: 0B  Down: 0B
Jobs completed: 14. Time elapsed: 1:05.4s.
Tests finished: Pass 5. Fail 0. Fatal 0. Skip 0. Build failure 0
```
# local test
```
=================Single run start========================
enable split_cat_pass for control group
================latency analysis============================
latency is : 73.79508209228516 ms

=================Single run start========================
enable batch fusion for control group
enable split_cat_pass for control group
================latency analysis============================
latency is : 67.94447326660156 ms
```
# e2e test
todo add e2e test

Differential Revision: D48539721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107759
Approved by: https://github.com/yanboliang
2023-08-24 03:42:09 +00:00
f8119f8bda Move Constraint class to torch.export() to avoid circular dependency in _dynamo package (#107750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107750
Approved by: https://github.com/tugsbayasgalan
2023-08-24 03:07:28 +00:00
7bab98f161 [export] Serialize cond submodules (#107818)
Cond submodules only return a single tensor, which was not supported by the serializer, since the serializer assumes that the graph always returns a list; this is true for the toplevel graph from dynamo, but not for the subgraphs.

Differential Revision: [D48622687](https://our.internmc.facebook.com/intern/diff/D48622687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107818
Approved by: https://github.com/avikchaudhuri
2023-08-24 02:29:26 +00:00
a560135516 [Inductor] Add new fused_attention pattern matcher (#107578)
Add new fused_attention pattern matcher for Inductor, in order to make more models call the op SDPA.

The following models would call SDPA due to the added pattern:

For HuggingFace
- AlbertForMaskedLM
- AlbertForQuestionAnswering
- BertForMaskedLM
- BertForQuestionAnswering
- CamemBert
- ElectraForCausalLM
- ElectraForQuestionAnswering
- LayoutLMForMaskedLM
- LayoutLMForSequenceClassification
- MegatronBertForCausalLM
- MegatronBertForQuestionAnswering
- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- RobertaForCausalLM
- RobertaForQuestionAnswering
- YituTechConvBert

For TorchBench
- llama

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107578
Approved by: https://github.com/mingfeima, https://github.com/XiaobingSuper, https://github.com/jgong5, https://github.com/eellison, https://github.com/jansel
2023-08-24 01:45:41 +00:00
9b2d43df93 Handle empty lists properly (#107803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107803
Approved by: https://github.com/ezyang
2023-08-24 01:42:29 +00:00
d707724ac9 [DeviceMesh] init_device_mesh dosctring update to include one d mesh initialization (#107805)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107805
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2023-08-24 01:28:22 +00:00
26ae48832e Remove run torchbench. Torchbench runs are now part of the dynamo ci. (#107826)
As the title says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107826
Approved by: https://github.com/huydhn
2023-08-24 01:19:49 +00:00
4fdfe33ae6 Bump scipy from 1.9.0 to 1.10.1 in /.github/requirements (#104763)
* Bump scipy from 1.9.0 to 1.10.1 in /.github/requirements

Bumps [scipy](https://github.com/scipy/scipy) from 1.9.0 to 1.10.0.
- [Release notes](https://github.com/scipy/scipy/releases)
- [Commits](https://github.com/scipy/scipy/compare/v1.9.0...v1.10.0)

---
updated-dependencies:
- dependency-name: scipy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .github/requirements/pip-requirements-macOS.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
2023-08-24 10:10:40 +09:00
96c27c2d81 Support is_mtia() in TensorBase.h (#107723)
Summary: As title

Test Plan: CI tests.

Differential Revision: D48477102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107723
Approved by: https://github.com/cx-yin, https://github.com/ezyang
2023-08-24 00:26:12 +00:00
4fd42e62c6 Remove unnecessary import in python_variable.cpp (#107794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107794
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2023-08-24 00:24:25 +00:00
6e71ad0509 Add tensor post accumulate grad hook API (#107063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107063
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-08-24 00:19:35 +00:00
3828cd4b79 [TP][EZ] Update doc for TP parallel style (#107819)
We need to update the doc for PairwiseParallel and SequenceParallel so that users don't get the wrong impression that these work for ``nn.Transformer``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107819
Approved by: https://github.com/awgu, https://github.com/wanchaol
2023-08-24 00:13:52 +00:00
432fce4e0d Revert "Add tensor post accumulate grad hook API (#107063)"
This reverts commit 3f655277d44909e0770e77e1b4fe1c9b0f39d7b9.

Reverted https://github.com/pytorch/pytorch/pull/107063 on behalf of https://github.com/ZainRizvi due to Diff train weirdness. Need to temporarily revert this PR and will right land it soon afterwards ([comment](https://github.com/pytorch/pytorch/pull/107063#issuecomment-1690799057))
2023-08-24 00:12:34 +00:00
bc0790559b Revert "Remove unnecessary import in python_variable.cpp (#107794)"
This reverts commit 9d23b8b3eabe2cd38eb5a11cc49cda6970675595.

Reverted https://github.com/pytorch/pytorch/pull/107794 on behalf of https://github.com/ZainRizvi due to Diff train weirdness. Need to temporarily revert this PR and will right land it soon afterwards ([comment](https://github.com/pytorch/pytorch/pull/107794#issuecomment-1690798855))
2023-08-24 00:10:18 +00:00
2c45a579ca Add wait_tensor so print always has a correct result for AsyncCollectiveTensor (#107808)
As the title says, I was trying to test the functional collectives, and, when printing the resulting tensors, sometimes they wouldn't have finished the Async operation yet. According to the comments in the file, "AsyncTensor wrapper applied to returned tensor, which issues wait_tensor() at the time of first use". This is true in most cases, but not when print() is your first use. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107808
Approved by: https://github.com/fduwjj
2023-08-24 00:00:23 +00:00
3d3f18260f Move conda uploads into environment (#107807)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107807
Approved by: https://github.com/atalman
2023-08-23 23:50:03 +00:00
9a365fe914 Use docker-build env to access GHCR_PAT (#107655)
This will restrict the access to GHCR_PAT to only [docker-build](https://github.com/pytorch/pytorch/settings/environments/1258682414/edit) env.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107655
Approved by: https://github.com/clee2000, https://github.com/atalman
2023-08-23 23:45:41 +00:00
b74b8e33db [Redo] Enhance fakepg: alltoall and alltoall_base (#107798)
[ghstack-poisoned]

Redo https://github.com/pytorch/pytorch/pull/107624, previously tried to land via `ghstack land`, but that doesn't work with pytorch repo which has protected main branch. As a result, that PR was only merged to [gh/xmfan/1/base](https://github.com/pytorch/pytorch/tree/gh/xmfan/1/base).

This PR manually merges [gh/xmfan/1/base](https://github.com/pytorch/pytorch/tree/gh/xmfan/1/base) into main, via pytorchbot

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107798
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-08-23 23:45:11 +00:00
726b7ff608 Support integer implementations for padding(cpu and cuda) (#107755)
Fixes #107733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107755
Approved by: https://github.com/albanD
2023-08-23 23:45:00 +00:00
8c66f97c9b [profiler] move _enable_dynamo_cache_lookup_profiler (#107720)
_enable_dynamo_cache_lookup_profiler used to get turned on when running `__enter__` or `__exit__` with the profiler. But it's possible to turn the profiler on and off without the context manager (e.g. with a schedule and calling `.step()`). Instead, we should put these calls (which are supposed to be executed when the profiler turns on/off) where `_enable_profiler()` and `_disable_profiler()` are called.

This puts `_enable_dynamo_cache_lookup_profiler` and `_set_is_profiler_enabled` into `_run_on_profiler_(start|stop)` and calls that on the 3 places where `_(enable|disable)_profiler` get called.

Differential Revision: [D48619818](https://our.internmc.facebook.com/intern/diff/D48619818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107720
Approved by: https://github.com/wconstab
2023-08-23 23:41:35 +00:00
cb107c74bb [profiler] DISABLE_CUPTI_LAZY_REINIT for CUDA 12 as well (#107744)
Summary:
Apparently CUDA 12 + CUPTI can fail with an illegal memory access, similar to what we saw with CUDA 11 (https://github.com/pytorch/pytorch/issues/75504).

For now we'll just turn on DISABLE_CUPTI_LAZY_REINIT, which will fix this internally. In OSS, this will probably still break - which will hopefully give us a repro.

Differential Revision: D48568888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107744
Approved by: https://github.com/aaronenyeshi
2023-08-23 23:28:15 +00:00
4a022e2185 Update unary_ufuncs groupings to include primtorch types. (#107345)
Fixes #107335. The skips were updated for the _ref ops to match those for eager mode where necessary. Part of breakdown of https://github.com/pytorch/pytorch/pull/104489.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107345
Approved by: https://github.com/ezyang
2023-08-23 22:45:19 +00:00
9f86d85172 [optim] Make casting to match params a hook (#106725)
Moves the logic to casting state to match parameters into a hook so that users can choose to enable their hooks before or after the casting has happened.

With this, there is a little bit of redundancy of the id_map building and the check that the param groups are still aligned in length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106725
Approved by: https://github.com/albanD
2023-08-23 22:25:33 +00:00
92f6454ff8 [export][reland] ExportedProgram.transform updates graph_signature automatically (#107792)
Summary: Reland of https://github.com/pytorch/pytorch/pull/107080

Test Plan: CI

Differential Revision: D48533622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107792
Approved by: https://github.com/gmagogsfm
2023-08-23 22:16:56 +00:00
2515ab93c4 [FSDP][Docs] Add note on NCCL_CROSS_NIC=1 for HSDP (#107784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107784
Approved by: https://github.com/fegin
ghstack dependencies: #106068, #106080
2023-08-23 22:00:50 +00:00
c0ba9a7840 Fix docs, missed a // in LaTeX for nadam (#107736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107736
Approved by: https://github.com/mikaylagawarecki
2023-08-23 21:36:27 +00:00
36399d067a Port existing heuristics to TD framework (#107071)
This PR looks big, but it's mostly just refactorings with a bit of dead code deletion. Exceptions are:
- Some metric emissions were changed to comply with the new TD format
- Some logging changes
- We now run tests in three batches (highly_relevant, probably_relevant, unranked_relevance) instead of the previous two (prioritized and general)

Refactorings done:
- Moves all test reordering code to the new TD framework
- Refactors run_test.py to cleanly support multiple levels of test priorities
- Deletes some dead code that was originally written for logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107071
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-08-23 21:23:23 +00:00
d7f943ec82 [mergebot] Flaky and broken trunk should take precedence over ic (#107761)
I noticed a curious case on https://github.com/pytorch/pytorch/pull/107508 where there was one broken trunk failure and the PR was merged with `merge -ic`.  Because the failure had been classified as unrelated, I expected to see a no-op force merge here.  However, it showed up as a force merge with failure.

![Screenshot 2023-08-22 at 20 01 10](https://github.com/pytorch/pytorch/assets/475357/b9c93e24-8da8-4fc6-9b3d-61b6bd0a8937)

The record on Rockset reveals https://github.com/pytorch/pytorch/pull/107508 has:

* 0 broken trunk check (unexpected, this should be 1 as Dr. CI clearly says so)
* 1 ignore current check (unexpected, this should be 0 and the failure should be counted as broken trunk instead)
* 3 unstable ROCm jobs (expected)

It turns out that ignore current takes precedence over the flaky and broken trunk classifications.  This might have been the expectation in the past, but I think that's not the case now.  The bot should be consistent with what is shown on Dr. CI.  The change here is to make the flaky, unstable, and broken trunk classifications take precedence over ignore current.  Basically, we only need to ignore new or unrecognized failures that have not yet been classified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107761
Approved by: https://github.com/clee2000
2023-08-23 21:22:56 +00:00
ee4b99cc3a Decomp for aten.dropout (#106274)
When exporting dropout with a CPU tensor, we get the following graph module
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            empty_memory_format: f32[512, 10] = torch.ops.aten.empty.memory_format([512, 10], dtype = torch.float32, layout = torch.strided, device = device(type='cpu'), pin_memory = False, memory_format = torch.contiguous_format)
            bernoulli_p: f32[512, 10] = torch.ops.aten.bernoulli.p(empty_memory_format, 0.9);  empty_memory_format = None
            div_scalar: f32[512, 10] = torch.ops.aten.div.Scalar(bernoulli_p, 0.9);  bernoulli_p = None
            mul_tensor: f32[512, 10] = torch.ops.aten.mul.Tensor(arg0_1, div_scalar);  arg0_1 = div_scalar = None
            return (mul_tensor,)
```

In addition, if we export with eval() mode, we will have an empty graph.

However, when exporting with a CUDA tensor, we get
```
    class GraphModule(torch.nn.Module):
        def forward(self, arg0_1: f32[512, 10]):
            native_dropout_default = torch.ops.aten.native_dropout.default(arg0_1, 0.1, True);  arg0_1 = None
            getitem: f32[512, 10] = native_dropout_default[0];  native_dropout_default = None
            return (getitem,)
```
and exporting under eval() mode will still have a dropout node in the graph.

This PR makes exporting with a CPU tensor also produce aten.native_dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106274
Approved by: https://github.com/ezyang
2023-08-23 21:12:37 +00:00
50024d04a8 [core aten] Add ops to core aten set (#107766)
Update the list of core aten ops with the ones we determined [here](https://docs.google.com/spreadsheets/d/1u9jQ-uGlKu-fe9nLy-jS2AIPtpE8sGTmELOFYgKOhXU/edit#gid=1098862752).

```
aten::adaptive_avg_pool1d
aten::_adaptive_avg_pool3d
aten::_cdist_forward
aten::_embedding_bag
aten::_local_scalar_dense
aten::_native_batch_norm_legit_no_training
aten::_native_batch_norm_legit
aten::_pdist_forward
aten::any
aten::any.dim
aten::avg_pool1d
aten::avg_pool3d
aten::bitwise_and.Scalar
aten::bitwise_or.Scalar
aten::bitwise_xor.Scalar
aten::ceil
aten::clamp.Tensor
aten::cumsum
aten::embedding
aten::floor
aten::fmod.Scalar
aten::index_put
aten::index.Tensor
aten::logical_xor
aten::mean
aten::mean.dim
aten::pixel_shuffle
aten::prod
aten::prod.dim_int
aten::rand
aten::randperm
aten::reflection_pad_1d
aten::reflection_pad_2d
aten::reflection_pad_3d
aten::remainder.Scalar
aten::roll
aten::round
aten::scatter.src
aten::scatter.value
aten::select_scatter
aten::sort
aten::split.Tensor
aten::split_with_sizes
aten::squeeze.dims
aten::tan
aten::unsqueeze
aten::var.correction
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107766
Approved by: https://github.com/kirklandsign, https://github.com/SS-JIA, https://github.com/SherlockNoMad
2023-08-23 21:05:25 +00:00
8c62f01cb7 [dynamo][guards] Use dict for storing weakrefs (#107645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107645
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-08-23 20:52:38 +00:00
221daeb1a7 Fix deepcopy for tensor with MTIA device key. (#107427)
Summary: Tensors with the MTIA device type don't have storage, and we need to treat them the same as other tensors which don't have storage.

Test Plan: CI tests.

Differential Revision: D48456004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107427
Approved by: https://github.com/cx-yin, https://github.com/ezyang
2023-08-23 20:47:36 +00:00
42b6ba3484 Use TORCH_SYM_CHECK for check_size_nonnegative on SymIntArrayRef (#107785)
See https://github.com/pytorch/pytorch/pull/106788 for context.

I think I don't actually need this for anything real, but this is pretty
mild so might as well.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107785
Approved by: https://github.com/albanD
2023-08-23 20:46:02 +00:00
cdd0821f00 [2/N][DeviceMesh] Overriding __getitem__ for DeviceMesh to support Mesh Slicing (#107730)
Add support for DeviceMesh slicing by overloading __getitem__ for DeviceMesh.

With this change, you can do:
```
mesh_shape = (2, 4)
mesh_dim_names = ("DP", "TP")
two_d_mesh = init_device_mesh(
    self.device_type, mesh_shape, mesh_dim_names=mesh_dim_names
)
tp_mesh = two_d_mesh["TP"]
```

cc. @wanchaol, @fduwjj
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107730
Approved by: https://github.com/wanchaol
2023-08-23 20:35:30 +00:00
652ccfadc1 Expose torch.export.constrain_as_{size,value} APIs (#107735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107735
Approved by: https://github.com/avikchaudhuri
2023-08-23 20:13:40 +00:00
9d23b8b3ea Remove unnecessary import in python_variable.cpp (#107794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107794
Approved by: https://github.com/albanD, https://github.com/ZainRizvi
2023-08-23 19:43:39 +00:00
79b3a9f945 [dynamo] Treat monkey patched .forward as dynamic (#107104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107104
Approved by: https://github.com/anijain2305
2023-08-23 19:03:02 +00:00
977aba7cfe Revert the removal of a SampleInput for gather (#107776)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107776
Approved by: https://github.com/peterbell10
2023-08-23 19:01:03 +00:00
c9b5e9d7a8 [allocator] register oom observers on every device (#107399)
This change is to match the behavior of _record_memory_history which was
recently changed to enable history recording on all devices rather than
the current one. It prevents confusing situations where the observer
was registered before the device was set for the training run.

It also ensures the allocators have been initialized in the python binding just in case this is the first call to the CUDA API.
Fixes #107330
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107399
Approved by: https://github.com/eellison
ghstack dependencies: #107171
2023-08-23 18:57:24 +00:00
cc54448a07 [memory snapshot] add 'address' key to block (#107171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107171
Approved by: https://github.com/ngimel
2023-08-23 18:57:24 +00:00
2b964d6efd [FSDP] Enable async all-reduce for HSDP (#106080)
**Overview**
This PR runs the HSDP all-reduce asynchronously so that it can overlap with both all-gather and reduce-scatter, which can lead to slight end-to-end speedups when the sharding process group is fully intra-node. Previously, the all-reduce serialized with reduce-scatter, so it could only overlap with one all-gather.
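A hedged sketch of the overlap idea (not the actual FSDP internals; the helper name and arguments are illustrative): issuing the inter-node all-reduce with `async_op=True` returns a work handle immediately, so later collectives can be launched while the all-reduce is still in flight.

```python
import torch.distributed as dist

def reduce_sharded_grad(shard_out, flat_grad, shard_group, replicate_group):
    # Intra-node reduce-scatter produces this rank's gradient shard.
    dist.reduce_scatter_tensor(shard_out, flat_grad, group=shard_group)
    # The inter-node all-reduce is issued asynchronously so the all-gather /
    # reduce-scatter of other FSDP units can overlap with it.
    work = dist.all_reduce(shard_out, group=replicate_group, async_op=True)
    return work  # call work.wait() only when the reduced shard is needed
```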

For some clusters (e.g. our AWS cluster), `NCCL_CROSS_NIC=1` improves inter-node all-reduce times when overlapped with intra-node all-gather/reduce-scatter.

**Experiment**
<details>
<summary> Example 'before' trace </summary>
<img width="559" alt="hsdp_32gpus_old" src="https://github.com/pytorch/pytorch/assets/31054793/15222b6f-2b64-4e0b-a212-597335f05ba5">

</details>

<details>
<summary> Example 'after' trace </summary>
<img width="524" alt="hsdp_32gpus_new" src="https://github.com/pytorch/pytorch/assets/31054793/94f63a1d-4255-4035-9e6e-9e10733f4e44">

</details>

For the 6-encoder-layer, 6-decoder layer transformer with `d_model=8192`, `nhead=64` on 4 nodes / 32 40 GB A100s via AWS, the end-to-end iteration times are as follows (with AG == all-gather, RS == reduce-scatter, AR == all-reduce; bandwidth reported as algorithmic bandwidth):
- Reference FSDP:
    - **1160 ms / iteration**
    - ~23 ms / encoder AG/RS --> 24.46 GB/s bandwidth
    - ~40 ms / decoder AG/RS --> 26.5 GB/s bandwidth
    - 50 GB/s theoretical inter-node bandwidth
- Baseline 8-way HSDP (only overlap AR with AG) -- intra-node AG/RS, inter-node AR:
    - **665 ms / iteration**
    - ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
    - ~5 ms / decoder AG/RS --> 212 GB/s bandwidth
    - ~30 ms / encoder AR --> 2.34 GB/s bandwidth
    - ~55 ms / decoder AR --> 2.65 GB/s bandwidth
    - 300 GB/s theoretical intra-node bandwidth
- New 8-way HSDP (overlap AR with AG and RS) -- intra-node AG/RS, inter-node AR:
    - **597 ms / iteration**
    - ~3 ms / encoder AG/RS --> 187.5 GB/s bandwidth
    - ~6.2 ms / decoder AG/RS --> 170.97 GB/s bandwidth (slower)
    - ~23 ms / encoder AR (non-overlapped) --> 3.057 GB/s bandwidth (faster)
    - ~49 ms / decoder AR (non-overlapped) --> 2.70 GB/s bandwidth (faster)
    - ~100 ms / decoder AR (overlapped) --> 1.325 GB/s bandwidth (slower)
    - Overlapping with reduce-scatter reduces all-reduce bandwidth utilization even though the all-reduce is inter-node and reduce-scatter is intra-node!
- New 8-way HSDP (overlap AR with AG and RS) with `NCCL_CROSS_NIC=1`:
    - **556 ms / iteration**
    - Speedup comes from faster overlapped AR

Thus, for this particular workload, the async all-reduce enables 16% iteration-time speedup compared to the existing HSDP and 52% speedup compared to FSDP. These speedups are pronounced due to the workload being communication bound, so any communication time reduction translates directly to speedup.

**Unit Test**
This requires >= 4 GPUs:
```
python -m pytest test/distributed/fsdp/test_fsdp_hybrid_shard.py -k test_fsdp_hybrid_shard_parity
```

Differential Revision: [D47852456](https://our.internmc.facebook.com/intern/diff/D47852456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106080
Approved by: https://github.com/ezyang
ghstack dependencies: #106068
2023-08-23 18:36:15 +00:00
50e1378680 [FSDP] Break up _post_backward_hook into smaller funcs (#106068)
The post-backward hook has some complexity due to the different paths: {no communication hook, communication hook} x {`NO_SHARD`, `FULL_SHARD`/`SHARD_GRAD_OP`, `HYBRID_SHARD`/`_HYBRID_SHARD_ZERO2`} plus some options like CPU offloading and `use_orig_params=True` (requiring using sharded gradient views).

The PR following this one that adds async all-reduce for HSDP further complicates this since the bottom-half after all-reduce must still be run in the separate all-reduce stream, making it more unwieldy to unify with the existing bottom-half.

Nonetheless, this PR breaks up the post-backward hook into smaller logical functions to hopefully help readability.

Differential Revision: [D47852461](https://our.internmc.facebook.com/intern/diff/D47852461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106068
Approved by: https://github.com/ezyang, https://github.com/fegin
2023-08-23 18:36:15 +00:00
55d6b80188 torch._numpy: keep f16 CUDA tensors in f16 where possible (#107768)
Contains workarounds for _RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'_ that are applied only to CPU tensors; computations on CUDA tensors stay in f16.

Fixes https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/170

We do not really systematically test CUDA tensors in torch._numpy, so I only spot-checked locally that the affected functions work with `tensor.to("cuda")`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107768
Approved by: https://github.com/lezcano
2023-08-23 18:35:47 +00:00
61fe49b8ed pt2: make aot_eager backend handle basic float8 operations (#107783)
Summary:

Reland of https://github.com/pytorch/pytorch/pull/107642 with a fix for tests on Windows.

Makes aot_eager backend of torch.compile handle basic float8 operations.

This is useful for float8 training UX.

Test Plan:

```
python test/test_quantization.py -k test_pt2_traceable_aot_eager
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107783
Approved by: https://github.com/albanD
2023-08-23 18:10:53 +00:00
5b632bf7a6 [ONNX] More debug logging from fx to onnx (#107654)
Summary:
- Log fx graph name for 'fx-graph-to-onnx' diagnostic.
- Log fx graph and onnx graph under DEBUG verbosity level for 'fx-graph-to-onnx' diagnostic.
- Adjust unittest to run with diagnostics verbosity level logging.DEBUG.
- Sarif logs will be saved for unittest when `TORCH_LOGS="onnx_diagnostics"` is set.

<img width="640" alt="image" src="https://github.com/pytorch/pytorch/assets/9376104/a5718530-3594-46fb-85a2-b8bcc8ba01c7">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107654
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #107408, #107409, #107653
2023-08-23 18:05:15 +00:00
bb1852fb9e [ONNX] Clean up diagnostic rules (#107653)
Summary:

- Remove experimental rules that were never used.
- Fill in detailed rule descriptions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107653
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #107408, #107409
2023-08-23 18:05:14 +00:00
c3c1b68ae8 [ONNX] Enclose package info for modules exported as local functions (#107409)
Record the source package of modules that are exported as ONNX local functions in the exported ONNX model. GPT2 model example:

<img width="350" alt="image" src="https://github.com/pytorch/pytorch/assets/9376104/5e361bdd-ca24-45e7-a9ba-191c35acf3bb">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107409
Approved by: https://github.com/justinchuby
ghstack dependencies: #107408
2023-08-23 18:05:13 +00:00
7a8db57e37 [ONNX] Re-purpose 'name' field of GraphProto (#107408)
Previously, the top-level GraphProto was hardcoded with the name "torch_jit", and the subgraphs with "torch_jit_{count}". This does not offer any insight into the graph, but rather encodes the graph producer as jit (torchscript), which is no longer true now that the graph can also be produced from dynamo.

As a naive first step, this PR re-purposes the names to "main_graph" and "sub_graph_{count}" respectively. More refined processing could name the subgraphs after their parent node or module; this can be done as a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107408
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2023-08-23 18:05:11 +00:00
398f4ae451 Back out "[inductor] make thread order consistent with loop order (#106827)" (#107796)
Summary:
D48295371 causes a batch fusion failure, which blocks mc proposals on all mc models.
e.g. cmf f470938179

Test Plan: Without revert, f469732293. With revert diff f472266199.

Differential Revision: D48610062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107796
Approved by: https://github.com/yanboliang
2023-08-23 18:02:54 +00:00
f7a51c4208 fix pad_sequence docstring (#107669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107669
Approved by: https://github.com/mikaylagawarecki
2023-08-23 18:01:39 +00:00
42738c56a0 Skip the extra copy operation in broadcast_object_list if tensor_list has only one element (#107509)
The `broadcast_object_list` function can easily broadcast the state_dict of models/optimizers. However, the `torch.cat` operation performed within `broadcast_object_list` temporarily doubles the memory footprint, so only objects that occupy at most half of the device capacity can be broadcast. This PR improves usability by skipping the `torch.cat` operation for object_lists with only a single element.

Before (30G tensor):
<img width="607" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/c0c67931-0851-4f27-81c1-0119c6cd2944">

After (46G tensor):
<img width="600" alt="image" src="https://github.com/pytorch/pytorch/assets/22362311/90cd1536-be7c-43f4-82ef-257234afcfa5">

Test Code:
```python
if __name__ == "__main__":
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    fake_tensor = torch.randn(30 * 1024 * 1024 * 1024 // 4)

    if dist.get_rank() == 0:
        state_dict = {"fake_tensor": fake_tensor}
    else:
        state_dict = {}
    object_list = [state_dict]
    dist.broadcast_object_list(object_list, src=0)
    print("Rank: ", dist.get_rank(), " Broadcasted Object: ", object_list[0].keys())
    dist.barrier()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107509
Approved by: https://github.com/awgu
2023-08-23 17:19:10 +00:00
ecde622649 Revert "reseed all Generators in Dataloader's _worker_loop() -- via GC (#107131)"
This reverts commit 42625da5e1c29d710abf6db01c2506898043fdb2.

Reverted https://github.com/pytorch/pytorch/pull/107131 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107131#issuecomment-1690325745))
2023-08-23 17:08:07 +00:00
3f2ecf7755 [inductor] Separate to_{dtype,device} from lowering to avoid copying (#107640)
These lowerings must copy even when they are no-ops in order to preserve
correctness in the presence of mutations. However, `to_dtype` and `to_device`
are also used in various lowerings as a helper function where it is okay to alias.

So, I've split these into two functions and allow the helper functions to alias
which saves some unnecessary copies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107640
Approved by: https://github.com/lezcano
2023-08-23 16:56:39 +00:00
3022a395f3 test_memory_format test now passes on rocm (#107696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107696
Approved by: https://github.com/pruthvistony, https://github.com/albanD
2023-08-23 16:39:19 +00:00
469e7479e8 [CI] Delete .github/ci_commit_pins/huggingface.txt (#107729)
Summary: .github/ci_commit_pins/huggingface.txt is not needed since CI now installs huggingface as a part of the CI docker image by looking up .ci/docker/ci_commit_pins/huggingface.txt.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107729
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-08-23 16:26:28 +00:00
9f5c705806 [CODEOWNERS] Add wz337 as a reviewer for Distributed Package and Distributed Tests. (#107747)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107747
Approved by: https://github.com/fduwjj, https://github.com/awgu
2023-08-23 15:58:22 +00:00
6f0d0b3850 fix type check of overflow (#107579)
Fixes #95451 and removes a duplicate check

**Code:**
```python
import torch
import sys

i = sys.maxsize + 1

input = torch.full((1, 32, 32,), 0.5)
torch.max_pool1d(input, kernel_size=[i] , stride=[i], padding=0, dilation=[i], ceil_mode=True)
```

**Result:**
```shell
Traceback (most recent call last):
  File "/root/Git.d/pytorch/samples/src/simple.py", line 13, in <module>
    torch.max_pool1d(input, kernel_size=[i] , stride=[i], padding=0, dilation=[i], ceil_mode=True)
TypeError: max_pool1d(): argument 'dilation' failed to unpack the object at pos 1 with error "Overflow when unpacking long"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107579
Approved by: https://github.com/albanD
2023-08-23 15:34:40 +00:00
48b1208e05 Disable nn.MHA fastpath for floating point masks (#107641)
Fixes https://github.com/pytorch/pytorch/issues/107084 by disabling the fast path when floating point masks (which should be additive) are passed

- [We claim in our docs for MHA that float masks will be added to the attention](https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html) (be it `key_padding_mask` or `attn_mask`)
- We always canonicalize any mask at the start of MHA in python by converting it to float
- My understanding from Driss is that SDPA properly supports additive masking (but there are many special cases of mask shape for MHA that don't currently work properly (BxT, TxT), so [we're turning this off for now](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/attention.cu#L531-L532))
- More broadly, the problem isn't with the SDPA path, but that things are broken for the path it falls back to
- Right now MHA "fast path" code with non-None masks always goes through [this path](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/attention.cu#L554-L640), which has a call to `masked_softmax` that [converts the masks back to bool](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/attention.cpp#L154-L156)
- The implication here is that **additive floating point attn_mask and additive key_padding_mask on the nn.MHA fastpath are broken**
- This wasn't broken for the user in [#107084](https://github.com/pytorch/pytorch/issues/107084) in 1.13.1 because of [this check which bypassed the fast path if attn_mask was defined](https://github.com/pytorch/pytorch/blob/v1.13.1/torch/nn/modules/activation.py#L1096-L1097) (though, as Driss pointed out, additive key_padding_mask with the fast path was probably broken in 1.13.1)
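A small standalone sketch (not PyTorch-internal code) of the additive-mask semantics described in the list above: the float mask is added to the attention scores before softmax, so `-inf` suppresses a position and `0.0` leaves it untouched.

```python
import torch

q = k = v = torch.randn(1, 4, 8)
scores = q @ k.transpose(-2, -1) / (8 ** 0.5)

attn_mask = torch.zeros(4, 4)
attn_mask[:, -1] = float("-inf")      # additively mask out the last key

weights = torch.softmax(scores + attn_mask, dim=-1)
out = weights @ v                      # the masked key contributes zero weight
```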

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107641
Approved by: https://github.com/drisspg, https://github.com/jbschlosser
2023-08-23 15:08:18 +00:00
207b06d099 [dynamo] Wrap ndarray dunder methods (#107689)
Fixes https://github.com/pytorch/pytorch/issues/107437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107689
Approved by: https://github.com/ezyang
ghstack dependencies: #107687, #107688, #107710, #107711, #107746
2023-08-23 13:55:36 +00:00
b5c90ba7e7 [dynamo] Fix ndarray.__pow__ (#107746)
As per title. Tests in the next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107746
Approved by: https://github.com/ezyang
ghstack dependencies: #107687, #107688, #107710, #107711
2023-08-23 13:55:36 +00:00
2b6249e209 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue once this works on CPU as well.

This fix works with dynamic shapes as well.

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-23 11:59:20 +00:00
c81c217a2f Make ExportedProgram valid tracing callable (#107657)
In this PR, we make ExportedProgram a valid callable for export, enabling re-export. Note that we don't allow any new constraints to be specified by the user, as we have no way of handling them right now. There are some caveats worth mentioning in this PR.
Today, graph_module.meta is not preserved (note that this is different from node-level meta, which we do preserve). Our export logic relies on this meta to process the constraints, so if we skip dynamo, we have to preserve the constraints stored in graph_module.meta. Once dynamo supports retraceability, we won't need to do this anymore. I currently manually save graph_module.meta at the following places:
1. After ExportedProgram.module()
2. After ExportedProgram.transform()
3. At construction site of ExportedProgram.

Jerry will add the update on the quantization side as well.
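A minimal re-export sketch (hypothetical toy function, not from this PR's tests); note that no new constraints are passed when re-exporting:

```python
import torch
from torch.export import export

def f(x):
    return x + 1

ep = export(f, (torch.randn(2),))
# With this PR, the ExportedProgram itself is a valid tracing callable,
# so it can be fed back to export for re-exporting.
ep2 = export(ep, (torch.randn(2),))
```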

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107657
Approved by: https://github.com/gmagogsfm
2023-08-23 08:01:57 +00:00
400c4de53b [ONNX] Add huggingface models into CI tests (#107247)
1. Add a list of HF models to CI tests. The PR intends to build them from Config, but some of them are not supported with Config. NOTE: Loading from a pre-trained model could potentially hit the [uint8/bool conflict](https://github.com/huggingface/transformers/issues/21013) when a newer version of transformers is used.
    - Dolly has torch.fx.Node in an OnnxFunction attribute, which is currently not supported.
    - Falcon and MPT have user code that Dynamo does not support.
2. Only update GPT2 exporting with real tensors to Config, as FakeMode raises unequal-input errors between PyTorch and ORT. The reason is that [non-persistent buffers are not supported](https://github.com/pytorch/pytorch/issues/107211).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107247
Approved by: https://github.com/wschin, https://github.com/BowenBao
2023-08-23 07:28:26 +00:00
610f64d72a inductor: also check index_exp when select tiling var (#106765)
When selecting the tiling var, we currently only consider loads and stores, not the index expressions, which leads to accuracy issues:

before (the index expression `i1-1` cannot be vectorized):
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = at::vec::Vectorized<int>(static_cast<int>((-1L) + i1));
                        auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                        auto tmp2 = to_float_mask(tmp0 >= tmp1);
                        auto tmp3 = [&]
                        {
                            auto tmp4 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (8L*i1_inner) + (25088L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                            auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0)));
                            auto tmp6 = tmp4 * tmp5;
                            return tmp6;
                        }
                        ;
                        auto tmp7 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2));
                        { __at_align__ float tmpbuf[16*sizeof(float)/sizeof(float)]; tmp7.store(tmpbuf); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>(i2 + (8L*i1) + (8L*i1_inner) + (25096L*i0))] = tmpbuf[i1_inner]; }
                    }
                }
                #pragma GCC ivdep
                for(long i1=static_cast<long>(3136L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
```

after:
```
cpp_fused_constant_pad_nd_mul_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(3137L); i1+=static_cast<long>(1L))
                {
                    #pragma omp simd simdlen(8)
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(8L); i2+=static_cast<long>(1L))
                    {
                        auto tmp0 = static_cast<long>((-1L) + i1);
                        auto tmp1 = static_cast<long>(0);
                        auto tmp2 = tmp0 >= tmp1;
                        auto tmp3 = [&]
                        {
                            auto tmp4 = in_ptr0[static_cast<long>((-8L) + i2 + (8L*i1) + (25088L*i0))];
                            auto tmp5 = in_ptr1[static_cast<long>((-1L) + i1 + (3136L*i2) + (25088L*i0))];
                            auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                            return tmp6;
                        }
                        ;
                        auto tmp7 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                        out_ptr0[static_cast<long>(i2 + (8L*i1) + (25096L*i0))] = tmp7;
                    }
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106765
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-23 07:16:14 +00:00
4a40e27583 Enable mypy check in torch/_inductor/config.py (#107448)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/config.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/config.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107448
Approved by: https://github.com/ezyang
2023-08-23 07:11:31 +00:00
d0f8ee45bd [ONNX] Exclude FXSymbolicTracer from _assert_fake_tensor_mode (#107712)
Prior to this PR, `_assert_fake_tensor_mode` checked that every exporting tracer enabled fake mode "from" the exporter API whenever fake tensors appeared in args/buffers/weights. However, FXSymbolicTracer doesn't use the exporter API to create fake mode, so it hit the raised RuntimeError every time we ran it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107712
Approved by: https://github.com/BowenBao
2023-08-23 05:51:51 +00:00
31b0445702 Fix torch.compile with FakeTensor that has SymInt sizes (#107662)
**Motivation:**
When input FakeTensor to torch.compile has SymInt sizes (e.g. make_fx(opt_f, tracing_mode="symbolic"):
1. We cannot create a FakeTensor from that input in dynamo due to the SymInts.
2. We cannot check input tensors in guard check function and will abort due to tensor check calls sizes/strides.

For 1, we specialize the FakeTensor's SymInts using their hints. This is mostly safe since inputs mostly have concrete shapes and are not computed from DynamicOutputShape ops. We'll throw a data-dependent error if the SymInt is unbacked.

For 2, we replace size/stride calls with the sym_* variants in TENSOR_CHECK guards' check function.
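A small sketch of the motivating setup (a toy function, not from the PR's test suite): tracing a compiled function under `tracing_mode="symbolic"` feeds FakeTensors whose sizes are SymInts into dynamo, which is the combination that points 1 and 2 above make work.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(x):
    return x * 2

opt_f = torch.compile(f)
# The symbolic tracer passes FakeTensors with SymInt sizes to opt_f.
gm = make_fx(opt_f, tracing_mode="symbolic")(torch.randn(3, 4))
```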

**Test Plan:**
See added tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107662
Approved by: https://github.com/ezyang
2023-08-23 05:27:57 +00:00
83517c8dba Enable Mypy Check in torch/_inductor/virtualized.py (#107127)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/virtualized.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/virtualized.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107127
Approved by: https://github.com/eellison
2023-08-23 04:54:32 +00:00
4cc05c41fa [MPS] Fix torch.std for negative dimensions (#107754)
By simply comparing output dimensions to a properly wrapped dim.
Add a regression test to OpInfo.
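A tiny illustration (not the MPS kernel code) of the dim wrapping being compared against: a negative dim indexes from the end and maps to `dim + ndim`.

```python
import torch

x = torch.randn(2, 3, 4)
dim = -1
wrapped = dim if dim >= 0 else dim + x.dim()   # -1 -> 2
assert torch.allclose(torch.std(x, dim=dim), torch.std(x, dim=wrapped))
```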

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ca98536</samp>

> _`reduceTensor` bug_
> _negative dimensions wrapped_
> _autumn tests added_

Fixes https://github.com/pytorch/pytorch/issues/107116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107754
Approved by: https://github.com/kit1980
2023-08-23 03:50:02 +00:00
17675cb1f5 [vision hash update] update the pinned vision hash (#107757)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107757
Approved by: https://github.com/pytorchbot
2023-08-23 03:24:29 +00:00
09c642bfc8 Fix the use of head_branch in filter-test-configs action (#107753)
This addresses https://github.com/pytorch/pytorch/security/advisories/GHSA-hw6r-g8gj-2987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107753
Approved by: https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/malfet
2023-08-23 03:14:36 +00:00
cbcd551045 Fix torch.compile FunctionalTensor inputs for higherOrderOps (#107604)
Before this PR, the added [test](https://github.com/pytorch/pytorch/pull/107604/files#diff-c618f2274b6b5ccc533c580549d2e552edbd9fc5ac0da1aa4b00338525c8f78dR224), which feeds FunctionalTensorWrapper inputs to a higherOrderOperator, hit an assertion error at this line of [code](https://github.com/pytorch/pytorch/pull/107604/files#diff-9f0663783bcd93e948e0491ef61b48123bdc9977bcc632fd707da578df13bfa1R1284).

The key difference in this PR is this [line](https://github.com/pytorch/pytorch/pull/107604/files#diff-9f0663783bcd93e948e0491ef61b48123bdc9977bcc632fd707da578df13bfa1L1263) of the check:
```python
        elif (
            isinstance(example_value, FakeTensor)
            and example_value.fake_mode is tx.fake_mode
        ):
```
The original intention of it seems to be dealing with case where we want to wrap an fx proxy for an intermediate fake tensor that's produced by some tensor ops and an example value is provided (as is the case for higherOrderOps [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/higher_order_ops.py#L85)). A fakified FunctionalTensorWrapper(FakeTensor) always fails this check. This PR changes it to checking whether it's already fakified by tx.fake_mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107604
Approved by: https://github.com/zou3519
ghstack dependencies: #107569
2023-08-23 02:42:18 +00:00
fc380a2b5a [ez] Minor refactors (#107656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107656
Approved by: https://github.com/angelayi
2023-08-23 02:27:47 +00:00
d395088dc8 Add _native_batch_norm_legit_no_training to core IR (#107732)
Summary: Added because of how common the op is. For performance reasons, users may not want to decompose the batch_norm op. batch_norm is also part of StableHLO.

Test Plan: After adding to IR, we can enable _check_ir_validity in exir.EdgeCompileConfig for models like MV2, MV3, IC3, IC4

Reviewed By: guangy10

Differential Revision: D48576866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107732
Approved by: https://github.com/manuelcandales, https://github.com/guangy10
2023-08-23 02:24:43 +00:00
d34cf147d1 MatMul heuristics for aarch64 (#107167)
This PR focuses on improving MatMul performance for aarch64 only. It introduces a light-weight heuristic that dispatches small or tall/flat MatMul operations to OpenBLAS, while dispatching other shapes to MKLDNN/ACL.

On average, the proposed heuristics improve MatMul operator latency by 1.03x / 1.04x / 1.05x / 1.09x / 1.22x for 1 / 2 / 4 / 8 / 16 threads, respectively (baseline is using ACL for all MatMuls on AWS Graviton c7g instances).

Fixes #107168

<details>
<summary>Full MatMul benchmark script and result</summary>

Run this following script `run.sh` with `ABt.py` under the same directory:
```shell
#!/bin/bash

script=ABt.py
prefix=acl

OMP_NUM_THREADS=1 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=0 TORCH_MKLDNN_MATMUL_MIN_SIZE=0 python ${script} > ${prefix}_t1.csv
OMP_NUM_THREADS=2 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=0 TORCH_MKLDNN_MATMUL_MIN_SIZE=0 python ${script} > ${prefix}_t2.csv
OMP_NUM_THREADS=4 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=0 TORCH_MKLDNN_MATMUL_MIN_SIZE=0 python ${script} > ${prefix}_t4.csv
OMP_NUM_THREADS=8 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=0 TORCH_MKLDNN_MATMUL_MIN_SIZE=0 python ${script} > ${prefix}_t8.csv
OMP_NUM_THREADS=16 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=0 TORCH_MKLDNN_MATMUL_MIN_SIZE=0 python ${script} > ${prefix}_t16.csv

prefix=heur
OMP_NUM_THREADS=1 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=8 TORCH_MKLDNN_MATMUL_MIN_SIZE=8192 python ${script} > ${prefix}_t1.csv
OMP_NUM_THREADS=2 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=8 TORCH_MKLDNN_MATMUL_MIN_SIZE=8192 python ${script} > ${prefix}_t2.csv
OMP_NUM_THREADS=4 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=8 TORCH_MKLDNN_MATMUL_MIN_SIZE=8192 python ${script} > ${prefix}_t4.csv
OMP_NUM_THREADS=8 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=8 TORCH_MKLDNN_MATMUL_MIN_SIZE=8192 python ${script} > ${prefix}_t8.csv
OMP_NUM_THREADS=16 DNNL_DEFAULT_FPMATH_MODE=BF16 TORCH_MKLDNN_MATMUL_MIN_DIM=8 TORCH_MKLDNN_MATMUL_MIN_SIZE=8192 python ${script} > ${prefix}_t16.csv
```

`ABt.py`:
```python
import argparse
import timeit
import torch
import numpy as np

BATCH = 1
DIM_MIN = 8
DIM_MAX = 256

M_MIN = DIM_MIN
K_MIN = DIM_MIN
N_MIN = DIM_MIN

M_MAX = DIM_MAX
K_MAX = DIM_MAX
N_MAX = DIM_MAX

min_iterations = 1000
min_time = 0.1

def get_stats(timings):
    times = np.array(timings)
    time_avg = np.average(times) * 1000
    time_med = np.median(times) * 1000
    time_90th = np.percentile(times, 90) * 1000
    time_99th = np.percentile(times, 99) * 1000

    return time_avg, time_med, time_90th, time_99th

def bench(M, K, N, min_iterations, min_time):
    a = torch.randn(M, K, dtype=torch.float32)
    b = torch.randn(N, K, dtype=torch.float32)
    timings = []

    with torch.no_grad():
        for _ in range(max(1, min_iterations // 100)):
            c = torch.matmul(a, b.transpose(0, 1))

        bench_time = timeit.default_timer()

        while True:
            for _ in range(min_iterations):
                start_time = timeit.default_timer()
                c = torch.matmul(a, b.transpose(0, 1))
                end_time = timeit.default_timer()
                timings.append(end_time - start_time)

            if timeit.default_timer() - bench_time > min_time:
                break

    return get_stats(timings)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-l', '--loop', dest='loop', action='store_true')
    flags = parser.parse_args()

    if flags.loop:
        while True:
            for M in range(M_MAX//2, M_MAX+1, 8):
                for K in range(K_MAX//2, K_MAX+1, 8):
                    for N in range(N_MAX//2, N_MAX+1, 8):
                        stats = bench(M, K, N, min_iterations, min_time)
    else:
        torch.manual_seed(0)
        print(f"M, K, N, latency")

        for M in range(M_MIN, M_MAX+1, 8):
            for K in range(K_MIN, K_MAX+1, 8):
                for N in range(N_MIN, N_MAX+1, 8):
                    stats = bench(M, K, N, min_iterations, min_time)
                    print(f"{M}, {K}, {N}, {stats[2]}")
```

Here's the benchmark result collected on AWS Graviton c7g instance. Due to the size of the table, I can only attach the result in this text file:

[table.txt](https://github.com/pytorch/pytorch/files/12374265/table.txt)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107167
Approved by: https://github.com/malfet
2023-08-23 02:21:06 +00:00
fada0527fa Dispatch take_along_axis to gather (#107711)
Gather does the same thing, but it's much better supported in the
`torch.compile` stack.
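A hedged sketch of the equivalence being relied on (the simple case where the index array already has the same shape as the input):

```python
import numpy as np
import torch

a = np.arange(12).reshape(3, 4)
idx = np.argsort(a, axis=1)           # int64, same shape as `a`

ref = np.take_along_axis(a, idx, axis=1)
# The same selection expressed with torch.gather, which the torch.compile
# stack already handles well.
out = torch.gather(torch.from_numpy(a), dim=1, index=torch.from_numpy(idx))

assert np.array_equal(ref, out.numpy())
```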

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107711
Approved by: https://github.com/ezyang
ghstack dependencies: #107687, #107688, #107710
2023-08-23 01:21:23 +00:00
62113a2361 [dynamo] np.sort(complex) is not implemented (#107710)
This issue was discovered once we were able to trace without breaking
in https://github.com/pytorch/pytorch/pull/107689. Same for the next
one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107710
Approved by: https://github.com/ezyang
ghstack dependencies: #107687, #107688
2023-08-23 01:21:23 +00:00
2fc828312c Support negative indices in ndarray.__getitem__ (#107688)
In this case, we copy, but this is part of the set of divergences
described in https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/73.

This does not work with dynamic shapes, but it's not clear to me what
would be the best fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107688
Approved by: https://github.com/ezyang
ghstack dependencies: #107687
2023-08-23 01:21:23 +00:00
db39a81e1e Add a flag that allows breaking on NumPy ops (#107687)
This was removed in 63d406a6a9
Restoring it, as it's rather useful for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107687
Approved by: https://github.com/larryliu0820
2023-08-23 01:21:22 +00:00
e573abec12 Revert "[ATen] Update pre-compiled header (#106915)"
This reverts commit 4f3284e3edd41b883f8bb347fbe33532b2485f53.

Reverted https://github.com/pytorch/pytorch/pull/106915 on behalf of https://github.com/ZainRizvi due to reverting the full stack. I missed that the iostream PR was stacked under this one and its builds are also failing internally ([comment](https://github.com/pytorch/pytorch/pull/106915#issuecomment-1689087860))
2023-08-23 00:25:31 +00:00
874d1b18b0 [BE] reorganize opt disables in dynamo for clarity (#107709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107709
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2023-08-23 00:17:34 +00:00
0c4fa02296 fallback to cpu_kernel for VSX (#98511)
Attempt to fix https://github.com/pytorch/pytorch/issues/97497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98511
Approved by: https://github.com/quickwritereader, https://github.com/ezyang
2023-08-22 23:44:30 +00:00
42897e8127 Revert "[inductor] Adjust dynamic SMEM limit when above default in AOT (#107601)"
This reverts commit 3920ce2f6ef7f93dd121f86371c1b35697e2e744.

Reverted https://github.com/pytorch/pytorch/pull/107601 on behalf of https://github.com/ZainRizvi due to Sorry, but the test added in this PR breaks when run internally. See D48549503 for more details ([comment](https://github.com/pytorch/pytorch/pull/107601#issuecomment-1689049609))
2023-08-22 23:26:50 +00:00
68c941d228 [Mypy] move inductor to exclude list (#107741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107741
Approved by: https://github.com/ezyang
2023-08-22 23:19:55 +00:00
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
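A small illustration (not from this PR) of the pattern RUF017 is meant to catch, plus a linear alternative:

```python
import itertools

lists = [[1, 2], [3, 4], [5, 6]]

flat_slow = sum(lists, [])  # repeated concatenation: O(n^2), flagged by RUF017
flat_fast = list(itertools.chain.from_iterable(lists))  # linear

assert flat_slow == flat_fast == [1, 2, 3, 4, 5, 6]
```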

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
e8278d6058 Support graphs which return get_attr nodes directly as output (#107610)
Summary: Currently, serializing graphs that return get_attr nodes directly as output fails. This diff adds support for that only in the EXIR serializer while we still support unlifted params.

Test Plan: Added test case.

Differential Revision: D48258552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107610
Approved by: https://github.com/angelayi
2023-08-22 23:16:10 +00:00
979e706f8e [dtensor] update some comments (#107608)
This update some comments from the follow up of https://github.com/pytorch/pytorch/pull/107305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107608
Approved by: https://github.com/fduwjj
ghstack dependencies: #107606
2023-08-22 23:08:13 +00:00
945fa7e8a8 [dtensor] fix requires_grad in distribute_tensor (#107606)
This PR fixes how requires_grad is set when calling distribute_tensor: we
should set requires_grad on the local tensor after the detach call to make
sure we create the leaf correctly; otherwise it would raise warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107606
Approved by: https://github.com/fduwjj
2023-08-22 23:08:13 +00:00
8367cf1220 Add Early Link To Intro Doc (#107344)
Fixes #106600 by including an early link to the intro docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107344
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-08-22 23:00:52 +00:00
b115da8361 [MPS][BE] Refactor atan2_out_mps (#107334)
It's the only function at the moment that has an int64 exception, but the
check from the preprocessor define was unnecessarily applied to all binary functions.

Also, rename `atan2_mps_out` to `atan2_out_mps` to match the common pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107334
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-22 22:54:07 +00:00
d9460bb8f8 Update test_MaxUnpool_index_errors XFAIL after #107483 (#107658)
After https://github.com/pytorch/pytorch/pull/107483 which reverted https://github.com/pytorch/pytorch/pull/95300, these tests are not XFAIL anymore.  So now we know the root cause of https://github.com/pytorch/pytorch/issues/103854.

As this is failing slow jobs in trunk atm, i.e. 6981bcbc35, I'm moving these tests back.

### Testing

Run locally and all tests passes.

```
PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1 python test/nn/test_pooling.py -k test_MaxUnpool_index_errors
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107658
Approved by: https://github.com/PaliC
2023-08-22 22:36:35 +00:00
a711679527 Add skipLazy marker for tests and use it for tests not working with LazyTensor (#107382)
[This PR](https://github.com/pytorch/pytorch/pull/80251/files#diff-87e1d4e98eab994c977a57be29c716d3dc0f76d5b5e98cbf23cfcbd48ae625a4) marked some tests in `test/test_view_ops.py` with `@onlyNativeDeviceTypes`, because they'd fail if run on the `'lazy'` device type.
However, that marker is overly restrictive, because it prevents all devices outside of the native ones to run those tests.
This PR adds a `@skipLazy` marker (analogous to the existing ones for the other devices), and marks the tests from the mentioned PR so that they're skipped only for the `'lazy'` device type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107382
Approved by: https://github.com/ezyang
2023-08-22 22:34:36 +00:00
4d13422997 fix errors about mypy check in torch/_inductor/compile_fx.py (#107508)
Mypy errors in `compile_fx.py` blocked the merging of [PR1](https://github.com/pytorch/pytorch/pull/107127) and [PR2](https://github.com/pytorch/pytorch/pull/107448).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107508
Approved by: https://github.com/ezyang
2023-08-22 22:33:37 +00:00
5025fb9213 Revert "pt2: make aot_eager backend handle basic float8 operations (#107642)"
This reverts commit 24147a8e1c6855489c1669c612ff5cb1b09a09dd.

Reverted https://github.com/pytorch/pytorch/pull/107642 on behalf of https://github.com/huydhn due to Sorry for reverting this, but it is failing Windows CPU test in trunk. The Windows failures on your PR looks related I think ([comment](https://github.com/pytorch/pytorch/pull/107642#issuecomment-1688999380))
2023-08-22 22:17:36 +00:00
9c56ca80f3 Delete accidental print statement (#107745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107745
Approved by: https://github.com/angelayi
2023-08-22 22:09:36 +00:00
c093fdf924 Fix wrong hardcoded value for _scaled_mm (#107719)
## Summary
Sneaky little bug where we were accidentally fusing relu into the epilogue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107719
Approved by: https://github.com/vkuzo
2023-08-22 21:52:20 +00:00
c14f4d66c3 [pytorch][export] Move is_param and get_param out of exir and into export (#107264)
Summary: These don't feel edge-specific, so moving them out of exir.

Test Plan: ci

Differential Revision: D48361384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107264
Approved by: https://github.com/angelayi
2023-08-22 21:41:51 +00:00
8fb6416bfa Revert "Remove CUTLASS extensions merged upstream (#107612)"
This reverts commit cfd98d3c429b0de8a634a843c4551ee86d0084f3.

Reverted https://github.com/pytorch/pytorch/pull/107612 on behalf of https://github.com/ZainRizvi due to Sorry, this breaks internal builds which still depend on these files ([comment](https://github.com/pytorch/pytorch/pull/107612#issuecomment-1688936837))
2023-08-22 21:11:41 +00:00
bcee3d6fa4 [BE] Make nadam decoupled_weight_decay clearer, add missing setstate (#107706)
Inspired by @blizard's changes in #107507

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107706
Approved by: https://github.com/Skylion007
2023-08-22 20:48:28 +00:00
b282787409 Revert "Wrap indirect indexing on CUDA (#105055)"
This reverts commit 85c673e6b25173e2697a0dd741a9b2ebb33dec1d.

Reverted https://github.com/pytorch/pytorch/pull/105055 on behalf of https://github.com/peterbell10 due to Causes failure in inductor_torchbench ([comment](https://github.com/pytorch/pytorch/pull/105055#issuecomment-1688871947))
2023-08-22 20:24:41 +00:00
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e43228b7440a33bf534cde493446a31538c.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please help them get unblocked? It seems like one of the strings was probably accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
1e9b590df9 Optimize Net._get_next_net_name (#107479)
Summary: This is surprisingly expensive and can be easily optimized.

Differential Revision: D48440000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107479
Approved by: https://github.com/kit1980
2023-08-22 19:15:11 +00:00
24147a8e1c pt2: make aot_eager backend handle basic float8 operations (#107642)
Summary:

Makes aot_eager backend of torch.compile handle basic float8 operations.

This is useful for float8 training UX.

Test Plan:

```
python test/test_quantization.py -k test_pt2_traceable_aot_eager
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107642
Approved by: https://github.com/albanD
2023-08-22 18:57:14 +00:00
ba5eeed4ac [inductor] Add CPU-side profiler event for triton kernels w/ python wrapper (#106351)
This allows you to view the original kernel names (e.g. to reference the triton kernel implementation in the python wrapper code / TORCH_COMPILE_DEBUG logs). `torch._inductor.config.unique_kernel_names=True` does this too, but leaving unique_kernel_names=False will increase triton caching.

Another benefit to this approach is that we can attach additional information to this profiler event in the future. For example, we could attach input shapes/strides (i.e. record_shapes=True for profiler), or possibly paths to the files where the code was dumped.

<img width="435" alt="Screenshot 2023-07-31 at 5 34 25 PM" src="https://github.com/pytorch/pytorch/assets/5067123/839b752f-3907-4f29-9038-9d1822222b45">

^ in the trace above, the pink "triton_poi_fused_add_cos_sin_0" kernel is the new trace event which is added by this PR.

**Performance impact**: [dashboard run](https://hud.pytorch.org/benchmark/compilers?startTime=Thu%2C%2010%20Aug%202023%2000%3A52%3A06%20GMT&stopTime=Thu%2C%2017%20Aug%202023%2000%3A52%3A06%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/davidberard98/216/orig&lCommit=90c4212a7993c3660e7ea53bcd9d21160be31d1a&rBranch=main&rCommit=35cca799ff42182a1b7f1ee4d0225ee879b7c924). There are some regressions, including a 1.72x -> 1.71x on huggingface and 1.30x -> 1.29x on dynamic; however, locally I can't reproduce the results on any of the individual models (differences look like they are within noise). I think the perf impact is likely < 1% overall.

Differential Revision: [D47941809](https://our.internmc.facebook.com/intern/diff/D47941809)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106351
Approved by: https://github.com/eellison, https://github.com/albanD
ghstack dependencies: #107195
2023-08-22 18:48:30 +00:00
614b865721 [profiler] _RecordFunctionFast - faster python bindings for record_function (#107195)
torch.profiler.record_function is relatively slow; for example, in some benchmarks I was running, x.view_as(x) was ~2us, and ~16-17us when wrapped in a record_function context. The reasons for this are: dispatcher overhead from going through an op (the main source of overhead), python binding / python conversion overhead, and some overhead from the context manager.

This new implementation is faster, but it won't work with torchscript. Based on the benchmarks I was running, it adds 0.5-0.7us overhead per call when the profiler is turned off. To use it, you can just:

```python
with torch._C._profiler_manual._RecordFunctionFast("title"):
    torch.add(x, y)
```

It implements a context manager in python which directly calls the record_function utilities, instead of calling through an op.
* The context manager is implemented directly in python because the overhead from calling a python function seems non-negligible
* All the record_function calls, python object conversions are guarded on checks for whether the profiler is enabled or not. It seems like this saves a few hundred nanoseconds.

For more details about the experiments I ran to choose this implementation, see [my record_functions experiments branch](https://github.com/pytorch/pytorch/compare/main...davidberard98:pytorch:record-function-fast-experiments?expand=1).

This also adds a `torch.autograd.profiler._is_profiler_enabled` global variable that can be used to check whether a profiler is currently enabled. It's useful for further reducing the overhead, like this:

```python
if torch.autograd.profiler._is_profiler_enabled:
    with torch._C._profiler_manual._RecordFunctionFast("title"):
        torch.add(x, y)
else:
    torch.add(x, y)
```

On BERT_pytorch (CPU-bound model), if we add a record_function inside CachedAutotuning.run:
* Naive torch.profiler.record_function() is a ~30% slowdown
* Always wrapping with RecordFunctionFast causes a regression of ~2-4%.
* Guarding with an if statement - any regression is within noise

**Selected benchmark results**: these come from a 2.20GHz machine, GPU build but only running CPU ops; running `x.view_as(x)`, with various record_functions applied (with profiling turned off). For more detailed results see "record_functions experiments branch" linked above (those results are on a different machine, but show the same patterns). Note that the results are somewhat noisy, assume 0.05-0.1us variations

```
Baseline:: 1.7825262546539307 us  # Just running x.view_as(x)
profiled_basic:: 13.600390434265137 us  # torch.profiler.record_function(x) + view_as
precompute_manual_cm_rf:: 2.317216396331787 us  # torch._C._profiler_manual._RecordFunctionFast(), if the context is pre-constructed + view_as
guard_manual_cm_rf:: 1.7994389533996582 us  # guard with _is_profiler_enabled + view_as
```

Differential Revision: [D48421198](https://our.internmc.facebook.com/intern/diff/D48421198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107195
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-08-22 18:48:30 +00:00
137d96a26e Expose torch.export.dynamic_dim() API (#107635)
With updated doc
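A minimal usage sketch (hypothetical toy model; assumes the `constraints=` keyword of the export API from this era): `dynamic_dim` marks a dimension of an example input as dynamic so the exported program is not specialized to its concrete size.

```python
import torch
from torch.export import export, dynamic_dim

def f(x):
    return x.sin()

x = torch.randn(4, 8)
# Dim 0 of x is dynamic, with an assumed upper bound of 1024.
constraints = [dynamic_dim(x, 0) <= 1024]
ep = export(f, (x,), constraints=constraints)
```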

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107635
Approved by: https://github.com/avikchaudhuri
2023-08-22 18:40:49 +00:00
515aa993e3 Document post acc grad hooks in backward hooks execution (#107323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107323
Approved by: https://github.com/soulitzer, https://github.com/albanD
2023-08-22 18:37:03 +00:00
b0e93e206c Grant upload-stats jobs access to S3 (#107717)
These jobs have write access to S3 when they are running on our self-hosted runners.  On the other hand, they would need the AWS credential to run if they are run on GitHub ephemeral runner.

### Testing

Use the AWS credential in upload-stats environment to run the test command successfully (currently failing in trunk due to the lack of permission a5f83245fd)

```
python3 tools/alerts/upload_alerts_to_aws.py --alerts '[{"AlertType": "Recurrently Failing Job", "AlertObject": "Upload Alerts to AWS/Rockset / upload-alerts", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "c8a6c74443f298111fd6568e2828765d87b69c98", "branch": "main"}, {"AlertType": "Recurrently Failing Job", "AlertObject": "inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 1, linux.g5.4xlarge.nvidia.gpu)", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "f13101640f548f8fa139c03dfa6711677278c391", "branch": "main"}, {"AlertType": "Recurrently Failing Job", "AlertObject": "slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test (slow, 1, 2, linux.g5.4xlarge.nvidia.gpu)", "OncallTeams": [], "OncallIndividuals": [], "Flags": [], "sha": "6981bcbc35603e5d8ac7d00a2032925239009db5", "branch": "main"}]' --org "pytorch" --repo "pytorch"
Writing 138 documents to S3
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107717
Approved by: https://github.com/clee2000
2023-08-22 18:31:02 +00:00
2e054037da fixing named tensor unflatten example (#106921)
Fixes an example from the documentation [here](https://pytorch.org/docs/stable/named_tensor.html#manipulating-dimensions).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106921
Approved by: https://github.com/zou3519
2023-08-22 18:00:10 +00:00
28dc1a093f Revert "Remove some unnecessary <iostream> includes from headers (#106914)"
This reverts commit 60936e4c296e79f56cac2431a560970bb4529d03.

Reverted https://github.com/pytorch/pytorch/pull/106914 on behalf of https://github.com/ZainRizvi due to Sorry, but this is breaking internal builds. Seems like a lot of internal code depends on some of the removed imports ([comment](https://github.com/pytorch/pytorch/pull/106914#issuecomment-1688605975))
2023-08-22 17:16:48 +00:00
c8a6c74443 Remove aws ossci metrics upload keys from rocm (#107613)
@huydhn
Our current workflow is to upload to GH and then upload from GH to S3 when uploading test stats at the end of a workflow.

I think these keys could be used to directly upload from the runner to S3 but we don't do that right now.

I'm not sure how high-priority the keys are.

Rocm artifacts can still be seen on the HUD page

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107613
Approved by: https://github.com/huydhn
2023-08-22 17:12:49 +00:00
a5f83245fd Access ROCKSET_API_KEY from ephemeral runners (#107652)
Hardening the access to ROCKSET_API_KEY by only using this key from ephemeral runners `ubuntu-22.04`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107652
Approved by: https://github.com/clee2000
2023-08-22 17:02:44 +00:00
de8a91f40a [ROCm] Remove expected inductor UT fails for batch norm (#107027)
Removing expected failures relating to inductor batch_norm on ROCm

Also removing the addition of `tanh` to expected failures list as this is a cuda exclusive failure already captured here (cc: @peterbell10)
```
if not TEST_WITH_ROCM:
    inductor_gradient_expected_failures_single_sample["cuda"]["tanh"] = {f16}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107027
Approved by: https://github.com/peterbell10
2023-08-22 16:39:11 +00:00
e0238577b6 Always import test selection tools (#107644)
https://github.com/pytorch/pytorch/pull/107070 made emit_metrics importable without boto3, so we could just import all the files without the try catch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107644
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-08-22 16:36:20 +00:00
4dc9df2f87 Slightly more flexible naming system for disable + slow tests (#104002)
Sometimes test suite names include file/module names because they were imported from another file (e.g. _nvfuser.test_dynamo.TestNvFuserDynamo). This can make the names autogenerated by the disable bot and by the disable-test button on HUD incorrect, which is annoying to track down and leads to issues that are open but don't actually do anything. My solution is to make the check between the issue name and the test more flexible: instead of comparing the entire test suite name, we chop off the file/module names, look only at the last part (e.g. TestNvFuserDynamo), and check whether those are equal.

Also bundle both the check against the names in the slow test json and disable test issue names into one function for no reason other than less code.

Looked through logs to see what tests are skipped with this vs the old one and it looked the same.

The diff looks like a big change, but it's mostly a change in indentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104002
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-08-22 16:35:54 +00:00
e740491674 [caffe2][cuda] Trace allocate and local_raw_delete events with PyTorch USDTs (#107322)
Summary: Adds new tracepoints to CUDA allocator code for tracking alloc and dealloc events in the allocator code.

Test Plan: This change simply adds static tracepoints to CUDA allocator code, and does not otherwise change any logic. Testing is not required.

Reviewed By: chaekit

Differential Revision: D48229150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107322
Approved by: https://github.com/chaekit
2023-08-22 16:31:30 +00:00
a408920817 Reland fakify FunctionalTensor (#107569)
Rebase and reland https://github.com/pytorch/pytorch/pull/107062. The one difference compared with the previous attempt is that the DTensor logic in _clone_input is kept the same as before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107569
Approved by: https://github.com/zou3519
2023-08-22 15:46:25 +00:00
02d41b7afd allow result of at::for_blob to advertise as resizeable (for tensor subclasses) (#107416)
Previously, the first overload of `_make_wrapper_subclass` returned a tensor that **always** advertised as having a non-resizeable storage. Eventually, we'll need it to advertise as resizeable for functionalization to work (since functionalization occasionally needs to resize storages).

Not directly tested in this PR (tested more heavily later in aot dispatch, but if someone wants me to write a more direct test I can add one).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107416
Approved by: https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #107417
2023-08-22 15:25:31 +00:00
2c8759df9d Allow storage() to work on python tensor subclasses, but error on future data accesses (#107417)
This was discussed in feedback from the original version of my "reorder proxy/fake" PR. This PR allows calls to `tensor.untyped_storage()` to **always** return a python storage object to the user. Previously, we would error loudly if we detected that the storage had a null dataptr.

Instead, I updated the python bindings for the python storage methods that I saw involve data access, to throw an error later, only if you try to access those methods (e.g. `storage.data_ptr()` will now raise an error if the data ptr is null).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107417
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
2023-08-22 15:25:31 +00:00
df42f15e28 Improve generate_opcheck_tests, add opcheck utility (#107597)
Summary:
This PR improves `generate_opcheck_tests`:
- We shouldn't run automated testing through operators called in
  torch.jit.trace / torch.jit.script
- I improved the error message and added a guide on what to do if one of the
  tests fail.
- While dogfooding this, I realized I wanted a way to reproduce the failure
  without using the test suite. If you pass `PYTORCH_OPCHECK_PRINT_REPRO`, it
  will now print a minimal repro on failure. This involves serializing some
  tensors to disk.
- The minimal repro includes a call to a new API called `opcheck`.

The opcheck utility runs the same checks as the tests generated
by `generate_opcheck_tests`. It doesn't have a lot of knobs on it for
simplicity. The general workflow is: if an autogenerated test fails, then the
user may find it easier to reproduce the failure without the test suite by
using opcheck
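
A hypothetical sketch of that workflow follows; the import location and exact signature of `opcheck` are assumptions here, not taken from this PR:

```python
import torch
from torch.testing._internal.optests import opcheck  # assumed location of the utility

x = torch.randn(3)
# Run the same suite of checks that generate_opcheck_tests would run for this op call.
opcheck(torch.ops.aten.sin.default, (x,), {})
```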

Test Plan: - new tests

Differential Revision: D48485013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107597
Approved by: https://github.com/ezyang
2023-08-22 15:16:04 +00:00
3f655277d4 Add tensor post accumulate grad hook API (#107063)
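
A minimal sketch of how such a hook might be used, assuming the API is exposed as `Tensor.register_post_accumulate_grad_hook` as the title suggests (details are illustrative):

```python
import torch

p = torch.randn(3, requires_grad=True)

def hook(param):
    # called after param's gradient has been fully accumulated during backward
    print("accumulated grad:", param.grad)

handle = p.register_post_accumulate_grad_hook(hook)
p.sum().backward()
handle.remove()
```
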
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107063
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-08-22 15:15:57 +00:00
bcede143bd Do not mutate SymNode expression. (#107492)
This PR stops `SymNode` from mutating (i.e. simplifying) its expression. Instead, the
simplification (without mutation) is deferred to the `SymNode.maybe_as_int` method.

```python
- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)

- Eq(s0, s1 + s2 + s3)

- FakeTensor(size=(s0,), ...)
- FakeTensor(size=(s1, s2, s3), ...)
```

In summary, this PR:
- Replaces `SymNode._expr` by `SymNode.expr`, removing the old property function
    - This makes it so `SymNode` instances never update their expression
- Creates `SymNode.simplified_expr()` method for actually calling `ShapeEnv.replace` on
  its expression. Note that this doesn't update `SymNode.expr`
- Changes how `tensor.size()` gets converted to its Python `torch.Size` type
    - Instead of calling `SymInt::maybe_as_int()` method, we create a new
      `SymInt::is_symbolic()` method for checking whether it is actually a symbolic value
    - This is needed so that when we call `tensor.size()` in the Python side, the returned
      sequence is faithful to the actual data, instead of possibly simplifying it and
      returning an integer
    - 2 files needs this modification:
        - _torch/csrc/Size.cpp_: for handling `torch.Tensor.size` Python calls
        - _torch/csrc/utils/pybind.cpp_: for handling `symint.cast()` C++ calls

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107492
Approved by: https://github.com/ezyang
ghstack dependencies: #107523
2023-08-22 12:38:05 +00:00
d2215f14ba Fix: transactional translation validation insertion. (#107523)
This PR fixes transactional behavior of translation validation insertion.

Previously, this transactional behavior was implemented by removing the FX node if any
issues occurred until the end of `evaluate_expr`. However, since we cache FX nodes, we
might end up removing something that wasn't inserted in the same function call.

**Solution:** when creating an FX node for `call_function`, we also return whether this is
a fresh FX node or not. Then, we can appropriately handle each case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107523
Approved by: https://github.com/ezyang
2023-08-22 12:38:05 +00:00
3f3479e85e reduce header file to boost cpp_wrapper build. (#107585)
1. Reduce the unused header files included by cpp_wrapper.
2. Clean the PCH cache when use_pch is False.

The first change reduces the build time from 7.35s to 4.94s.

Before change:
![image](https://github.com/pytorch/pytorch/assets/8433590/fc5c1d37-ec40-44f3-8d4d-bf26bdc674bb)
After change:
![image](https://github.com/pytorch/pytorch/assets/8433590/c7ccadd2-bf3a-4d30-bf56-6e3b0230a194)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107585
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/jgong5
2023-08-22 11:58:47 +00:00
94d85f18c9 Enable Mypy Check in torch/_inductor/triton_heuristics.py (#107135)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/triton_heuristics.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/triton_heuristics.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107135
Approved by: https://github.com/ezyang
2023-08-22 09:51:30 +00:00
431d25a141 [export] Add save/load function (#107309)
Added the following APIs:

```
def save(
    ep: ExportedProgram,
    f: Union[str, pathlib.Path, io.BytesIO],
    extra_files: Optional[Dict[str, Any]] = None,
    opset_version: Optional[Dict[str, int]] = None,
) -> None:
    """
    Saves a version of the given exported program for use in a separate process.
    Args:
        ep (ExportedProgram): The exported program to save.
        f (str): A file-like object (has to implement write and flush)
            or a string containing a file name.
        extra_files (Optional[Dict[str, Any]]): Map from filename to contents
            which will be stored as part of f.
        opset_version (Optional[Dict[str, int]]): A map of opset names
            to the version of this opset
    """

def load(
    f: Union[str, pathlib.Path, io.BytesIO],
    extra_files: Optional[Dict[str, Any]] = None,
    expected_opset_version: Optional[Dict[str, int]] = None,
) -> ExportedProgram:
    """
    Loads an ExportedProgram previously saved with torch._export.save
    Args:
        f (str): A file-like object (has to implement write and flush)
            or a string containing a file name.
        extra_files (Optional[Dict[str, Any]]): The extra filenames given in
            this map would be loaded and their content would be stored in the
            provided map.
        expected_opset_version (Optional[Dict[str, int]]): A map of opset names
            to expected opset versions
    Returns:
        An ExportedProgram object
    """
```

Example usage:
```
# With buffer
buffer = io.BytesIO()
torch._export.save(ep, buffer)
buffer.seek(0)
loaded_ep = torch._export.load(buffer)

# With file
with tempfile.NamedTemporaryFile() as f:
    torch._export.save(ep, f)
    f.seek(0)
    loaded_ep = torch._export.load(f)

# With Path
with TemporaryFileName() as fname:
    path = pathlib.Path(fname)
    torch._export.save(ep, path)
    loaded_ep = torch._export.load(path)

# Saving with extra files
buffer = io.BytesIO()
save_extra_files = {"extra.txt": "moo"}
torch._export.save(ep, buffer, save_extra_files)
buffer.seek(0)
load_extra_files = {"extra.txt": ""}
loaded_ep = torch._export.load(buffer, load_extra_files)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107309
Approved by: https://github.com/avikchaudhuri, https://github.com/gmagogsfm, https://github.com/tugsbayasgalan
2023-08-22 08:25:19 +00:00
134d415615 Unlift mutated buffers (#107643)
In this PR, we extend ExportedProgram.module() functionality by also unlifting the mutated buffers. We only really care about top level buffers as we don't allow any buffer mutation inside HigherOrderOps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107643
Approved by: https://github.com/avikchaudhuri
2023-08-22 05:16:27 +00:00
8ed169b162 Re-enable AVX512 ATen kernels for compute-intensive ops (#104165)
## Summary

Enables AVX512 dispatch by default for some kernels, for which AVX512 performs better than AVX2.
For other kernels, their AVX2 counterparts are used.

## Implementation details

`REGISTER_DISPATCH` should now only be used for non-AVX512 dispatch.
`ALSO_REGISTER_AVX512_DISPATCH` should be used when AVX512 dispatch should also be done for a kernel.

## Benchmarking results with #104655

[Raw data at GitHub Gist (Click on `Download ZIP`)](https://gist.github.com/sanchitintel/87e07f84774fca8f6b767aeeb08bc0c9)

| Op | Speedup of AVX512 over AVX2 |
|----|------------------------------------|
|sigmoid|~27%  with FP32|
|sign| ~16.6%|
|sgn|~15%|
|sqrt|~4%|
|cosh|~37%|
|sinh|~37.5%|
|acos| ~8% with FP32 |
|expm1| ~30% with FP32|
|log|~2%|
|log1p|~16%|
|erfinv|~6% with FP32|
|LogSigmoid|~33% with FP32|
|atan2|~40% with FP32|
|logaddexp| ~24% with FP32|
|logaddexp2| ~21% with FP32|
|hypot| ~24% with FP32|
|igamma|~4% with FP32|
|lgamma| ~40% with FP32|
|igammac|3.5%|
|gelu|~3% with FP32|
|glu|~20% with FP32|
|SiLU|~35% with FP32|
|Softplus|~33% with FP32|
|Mish|~36% with FP32|
|Hardswish|~7% faster with FP32 when tensor can fit in L2 cache|
|Hardshrink|~8% faster with FP32 when tensor can fit in L2 cache|
|Softshrink|~10% faster with FP32 when tensor can fit in L2 cache|
|Hardtanh|~12.5% faster with FP32 when tensor can fit in L2 cache|
|Hardsigmoid|~7% faster with FP32 when tensor can fit in L2 cache|
|hypot|~35%|
|atan2|~37%|
|dequantize per channel|~10%|

## Insights gleaned through collected data (future action-items):

1. Inplace variants of some ops are faster with AVX512 although the functional variant may be slower for FP32. Will enable AVX512 dispatch for the inplace variants of such kernels.
2. Almost all BF16 kernels are faster with AVX512, so after PyTorch 2.1 release, will enable AVX512 dispatch for BF16 kernels whose corresponding FP32 kernel doesn't perform well with AVX512.
3. Some kernels rely on auto-vectorization & might perform better with AVX512 once explicit vectorization would be enabled for them.

Data was collected with 26 physical threads of one socket of Intel Xeon 8371HC. Intel OpenMP & tcmalloc were preloaded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104165
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/kit1980
2023-08-22 04:26:28 +00:00
ee72071fc7 Avoid executing side-effectful graph_module as validation step (#107271)
Dynamo currently runs the real graph module with real inputs as a way to match the return result of the graph module with the eager return type. This is unsafe when the graph module is side-effectful. In the long term, we will get rid of this step; in the short term, we just fakify the graph module again and run it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107271
Approved by: https://github.com/ezyang
2023-08-22 04:22:31 +00:00
155d12856c Update utils.h and correct misleading error messages (#107602)
Fixes #107410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107602
Approved by: https://github.com/ezyang
2023-08-22 03:55:46 +00:00
f9f88f2d31 [ONNX] Add unittest for exporting embedding_bag (#105862)
Issue list:
* Unsupported FX nodes: {'call_function': ['aten.embedding_renorm.default', ~~'aten._embedding_bag_forward_only.default'~~]}.
* aten._embedding_bag.default is not captured by the test. Hence this test does not reflect the pattern seen in the model from onnxbench. Update: needs validation again; unsure if this is still the case.
* `padding_idx` is always emitted for `aten._embedding_bag` and `aten._embedding_bag_forward_only`. This overload is unsupported by Torchlib.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105862
Approved by: https://github.com/justinchuby
2023-08-22 03:52:38 +00:00
849fbc6929 [vision hash update] update the pinned vision hash (#107649)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107649
Approved by: https://github.com/pytorchbot
2023-08-22 03:17:50 +00:00
a506d0ad8f [dynamo] Store originating source in the Guard object (#107634)
Many times, I find myself wanting to know the source for the guard. This PR adds that as a field of guard itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107634
Approved by: https://github.com/voznesenskym
ghstack dependencies: #107622
2023-08-22 02:16:31 +00:00
12b0372a75 [dynamo] Continue on fbgemm import fail (#107622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107622
Approved by: https://github.com/voznesenskym
2023-08-22 02:16:31 +00:00
3361fae89b Fix FP16Planner documentation (#107620)
Fixes #107619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107620
Approved by: https://github.com/awgu
2023-08-22 02:05:27 +00:00
f13101640f Quick return when there's nothing to bound in bound_sympy (#107549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107549
Approved by: https://github.com/ezyang, https://github.com/eellison
ghstack dependencies: #105055
2023-08-22 01:06:35 +00:00
85c673e6b2 Wrap indirect indexing on CUDA (#105055)
Lifting this to CPU should be rather easy. @jgong5
Partially fixes https://github.com/pytorch/pytorch/issues/97365. I'd wait to close that issue until this works on CPU as well.

This fix works with dynamic shapes as well.
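
For context, "indirect indexing" refers to indexing a tensor with another tensor inside a compiled region, as in the sketch below (illustrative only; it shows the pattern the fix applies to, not the generated kernel code):

```python
import torch

def gather_rows(x, idx):
    # idx values are only known at runtime, so the generated kernel must handle
    # them safely -- this is the "indirect indexing" case
    return x[idx]

if torch.cuda.is_available():
    compiled = torch.compile(gather_rows)
    x = torch.randn(8, 4, device="cuda")
    idx = torch.tensor([0, 3, 7], device="cuda")
    print(compiled(x, idx))
```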

@voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105055
Approved by: https://github.com/peterbell10, https://github.com/jansel
2023-08-22 01:06:35 +00:00
8292b03c47 Use fast traceback for symbolic shapes (#107439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107439
Approved by: https://github.com/voznesenskym
ghstack dependencies: #107505, #107516, #107530, #107532, #107562, #107471
2023-08-22 01:03:13 +00:00
072bb06117 Change how caching/cleanup for CapturedTraceback works (#107471)
CapturedTraceback is fast, but one downside is that it has strong references to code objects, which via `co_extra` can cause uncollectable cycles. This means it is important to clear out CapturedTraceback when you are done with it; e.g., if you collect tracebacks during compilation, you need to explicitly clear them out at the end of compilation to make sure they actually deallocate promptly.

Instead of caching `summary` on the CapturedTraceback, we simply allow for tracebacks to have `tb = None`. Tracebacks get dropped if you pickle the traceback, or if you explicitly call cleanup().

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107471
Approved by: https://github.com/voznesenskym
ghstack dependencies: #107505, #107516, #107530, #107532, #107562
2023-08-22 01:03:13 +00:00
bbb216bca4 Move torch.export() to torch.export.export() (#107609)
New plan:

torch.export.export() as the main API

All other utilities will be torch.export.foo_utilities
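
A minimal sketch of the new entry point (the module and inputs below are illustrative):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

# torch.export.export() is now the public entry point
ep = torch.export.export(M(), (torch.randn(3),))
print(ep)
```
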
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107609
Approved by: https://github.com/tugsbayasgalan, https://github.com/msaroufim
2023-08-22 00:38:32 +00:00
2e73c86d45 [fx][split] make sure we copy node.meta over during split (#107248)
Summary: Previously when we create placeholder nodes for sub graph modules, we didn't copy node.meta over.

Test Plan: CI

Differential Revision: D48330866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107248
Approved by: https://github.com/zyan0, https://github.com/houseroad, https://github.com/Neilblaze
2023-08-22 00:06:45 +00:00
9c2b4a35a3 [dtensor] group all dynamo tests together (#107487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107487
Approved by: https://github.com/fduwjj
ghstack dependencies: #107472, #107473
2023-08-21 23:56:00 +00:00
42f25d49f8 [dynamo] enable 2d torch.compile test (#107473)
This PR adds a 2D parallel torch.compile test on a simple MLP model and verifies that the dynamo changes work. Once @bdhirsh's aot autograd enablement is done, we can switch this test to exercise the e2e torch.compile workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107473
Approved by: https://github.com/fduwjj
ghstack dependencies: #107472
2023-08-21 23:56:00 +00:00
8c10be28a1 Update reduce_scatter_tests to work for world_size > 1 (#104424)
These tests only worked since `world_size==1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104424
Approved by: https://github.com/awgu
2023-08-21 23:13:56 +00:00
1641d671e5 [optim] FusedAdam/W accepts lr: Tensor without h2ds (#106916)
Starts addressing #106802

This PR also conveniently does some BE:
- Fixes a bug in adamw where we use amsgrad instead of per group amsgrad
- Brings the impls of adamw and adam closer to correctness and to each other

I couldn't fully remove the .pyi's because mypy would then complain about the entire files, which scared me and shouldn't go in this PR anyway.

Test plan:
- Add tests to ensure that lr could be passed as a Tensor
- Did some profiling of the below code (runs 1k iterations of step for Adam)

```
import torch
from torch.testing._internal.common_utils import TestCase

param = torch.rand(2, 3, dtype=torch.float, device='cuda:0', requires_grad=True)
param.grad = torch.rand_like(param)

lr = torch.tensor(.001, device='cuda:0')
opt = torch.optim.Adam([param], lr=lr, fused=True)

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    for _ in range(1000):
        opt.step()

print(p.key_averages().table(sort_by="cpu_time_total"))

```

Before my change:
<img width="1381" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/cfc5175a-0f41-4829-941f-342554f3b152">

After my change (notice there are no d2h syncs and the CPU time is lower!):
![image](https://github.com/pytorch/pytorch/assets/31798555/726d7e66-dcff-4a4f-8a75-e84329961989)

Next steps long term:
- have all capturable foreach + forloop impls in Adam(W) handle tensor LR
- have all capturable impls handle tensor LR
- have all impls handle tensor LR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106916
Approved by: https://github.com/albanD
2023-08-21 23:00:44 +00:00
350fb16f47 Add space to merge cancel comment (#107603)
Minor QoL improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107603
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2023-08-21 21:43:15 +00:00
da67b414d9 torch._numpy: remove noops and half-implemented nan-functions (#107596)
As discussed in the review of https://github.com/pytorch/pytorch/pull/106211, remove several noops (https://github.com/pytorch/pytorch/pull/106211#pullrequestreview-1559806543 and https://github.com/pytorch/pytorch/pull/106211#pullrequestreview-1559809287).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107596
Approved by: https://github.com/lezcano
2023-08-21 21:17:55 +00:00
f5d1df3c2f [1/N] Introduce init_device_mesh() (#107254)
This PR introduces init_device_mesh() as an API to standardize UX device_mesh initialization.

The functionality of slicing out a submesh from a given mesh would come in later PRs.
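
A rough sketch of the intended UX (the import path and keyword names are assumptions, and this would run under torchrun with 8 ranks):

```python
# Assumed import location; at the time of this PR the module path may differ.
from torch.distributed.device_mesh import init_device_mesh

# Build a 2-D mesh over 8 ranks: 2-way data parallel x 4-way tensor parallel.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
print(mesh_2d)
```
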
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107254
Approved by: https://github.com/wanchaol
2023-08-21 21:13:47 +00:00
5ddb8ef827 Make emit_metrics importable without having boto3 installed (#107070)
Make it so that scripts can import and run the `emit_metrics` function even if they don't have boto3 installed, in which case it will still validate the inputs but skip the actual metric emission part.

It's purely a refactor without any real logic changes

Motivation: So that run_test.py and the target determination code can use this library easily without worrying about if it was imported or if it's dependencies are installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107070
Approved by: https://github.com/huydhn
2023-08-21 21:13:01 +00:00
3920ce2f6e [inductor] Adjust dynamic SMEM limit when above default in AOT (#107601)
Summary:

When AOT Inductor runs a Triton matmul kernel (generated from the Triton mm template) on large inputs of particular shape, the `RuntimeError: CUDA driver error: 1` may happen. E.g., when `x @ y` is compiled with AOT Inductor and run on the input shapes `[10285, 96]` and `[96, 1]`. Digging deeper into the generated AOT Inductor wrapper code, we see this line:

```
launchKernel(triton_unk_fused_mm_0, 81, 1, 1, 4, 55296, kernel_args_var_0, stream);
```

`55296` is the required amount (in bytes) of dynamic shared memory. This is larger than the default dynamic shared memory on A100: `49152` bytes. In these cases, `cudaFuncSetAttribute` must be called explicitly to set  the`cudaFuncAttributeMaxDynamicSharedMemorySize` attribute of the kernel before launching it. Or, because AOT Inductor wrapper relies on the CUDA Driver API, the equivalent [`cuFuncSetAttribute`](https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g0e37dce0173bc883aa1e5b14dd747f26) function can be called to set the `CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES` attribute.

This PR adds the above call in the AOT Inductor codegen for every case when the required amount of dynamic SMEM is > 0. The call is done *within* the `launchKernel` function, meaning that it will happen only once per kernel and not affect the subsequent AOT Inductor-compiled model performance (after the first run).

P.S. One could, in principle, call the `cuFuncSetAttribute` only when the required amount of dynamic SMEM is above the default limit, but that would require detecting the default limit which is different on different devices. Assuming that the `cuFuncSetAttribute` is relatively lightweight and because it's performed only once per kernel, for simplicity, the suggestion is to call the function in every non-zero dynamic SMEM case.

Test Plan:

```
$ python test/inductor/test_aot_inductor.py

...

----------------------------------------------------------------------
Ran 5 tests in 100.177s

OK
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107601
Approved by: https://github.com/jansel
2023-08-21 21:06:09 +00:00
cfd98d3c42 Remove CUTLASS extensions merged upstream (#107612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107612
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-08-21 20:55:21 +00:00
6981bcbc35 fixing bug with non-contiguous mixed_mm [inductor] (#107495)
Summary: this PR detects
https://github.com/pytorch/pytorch/issues/107423 and falls back to the
non-triton kernel. It also adds a check for non-contiguous issues in
uint4x2 in the unit tests, though it's not an issue in this case.

Test Plan: python pytorch/test/inductor/test_pattern_matcher.py -k
"test_mixed_mm_bad_cases"

python pytorch/test/inductor/test_pattern_matcher.py -k
"test_uint4x2_mixed_mm"

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107495
Approved by: https://github.com/davidberard98
2023-08-21 20:44:19 +00:00
977a77ca2c Manually enable capture_func_transforms for testing (#107122)
Manually enable `capture_func_transforms` for testing, as the plan is to default `capture_func_transforms` to False in 2.1 (enable it so that we still test the support on the release branch).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107122
Approved by: https://github.com/zou3519
2023-08-21 20:38:33 +00:00
a816aa785b Implement autograd support for sparse compressed tensor constructors (#107384)
Fixes https://github.com/pytorch/pytorch/issues/107126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107384
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #107447
2023-08-21 20:26:39 +00:00
04a7915dbc Run check api rate limit on ephemeral runner (#107621)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107621
Approved by: https://github.com/huydhn
2023-08-21 20:20:31 +00:00
a250cc9bd7 Update persons_of_interest.rst (#107592)
Updating the state of PyTorch Audio.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107592
Approved by: https://github.com/cpuhrsch
2023-08-21 20:01:46 +00:00
d7c0c5de2d Set crow_indices outputs as non-differentiable. (#107447)
Fixes https://github.com/pytorch/pytorch/issues/107083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107447
Approved by: https://github.com/cpuhrsch
2023-08-21 19:52:32 +00:00
a4eae43315 [ONNX] Update xfail reasons in fx runtime tests (#107257)
1. Update xfail reasons in fx runtime
2. Enable bloom-560m in the runtime test. However, it's blocked by the unsupported constant tensor case. The previous error occurred because when the model loads with external data, it surpasses 2GB and couldn't be inlined. The fix is to inline the model itself and then replace the original one. Pointing ORT to the path allows it to load the external data into the model at runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107257
Approved by: https://github.com/justinchuby
2023-08-21 19:21:56 +00:00
612c8a8c84 Guard numpy imports in the dynamo folder (#107299)
Fixes https://github.com/pytorch/pytorch/issues/107228

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107299
Approved by: https://github.com/atalman
2023-08-21 19:07:20 +00:00
79d35bfc01 [BE]: Add PYI files to ruff lintrunner (#107524)
Due to a bug with the lintrunner yaml, PYI files were not convered. This PR builds off #107519 covers all PYI files and adds one noqa to fix a B006 bugprone error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107524
Approved by: https://github.com/ezyang
2023-08-21 18:55:41 +00:00
e201e3ffa1 [dynamo][eval frame] Make CacheEntry a PyObject (#107405)
This PR makes CacheEntry a PyObject. This is a prep PR for the cache size changes. As CacheEntry is a Python object, we can now traverse the linked list in Python and write cache size policies. It was possible to do this in C, but Python is just easier to iterate upon. We call convert_frame only when we (re)compile, so a small bump in latency going from C to Python is acceptable here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107405
Approved by: https://github.com/ezyang
ghstack dependencies: #106917, #107117
2023-08-21 18:47:53 +00:00
3b2c5d47c0 Use default build env and test config for test times (#107325)
Redo of #107312

Pairs with https://github.com/pytorch/test-infra/pull/4476

If the build env and test config combo cannot be found in the test times, use the default. Then we don't have to manually change the test-times.json when a new job is added or we update the jobs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107325
Approved by: https://github.com/huydhn
2023-08-21 18:39:55 +00:00
ad07a4bc56 Print per-tensor guard messages for TENSOR_MATCH (#107562)
The new guard messages look like:

```
check_tensor(L['y'], Tensor, DispatchKeySet(CPU, BackendSelect, ADInplaceOrView, AutogradCPU), torch.float32, device=None, requires_grad=False, size=[3], stride=[1])  # _dynamo/variables/builder.py:1237 in wrap_fx_proxy_cls
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107562
Approved by: https://github.com/anijain2305, https://github.com/jansel
ghstack dependencies: #107505, #107516, #107530, #107532
2023-08-21 18:00:00 +00:00
3336aa191c Adding allocated and reserved memory values to memory timline view. (#107056)
Summary: This diff adds the max allocated and max reserved memory values to the memory timeline plot.

Test Plan:
Executed

`buck run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test -- --enable_profiling --profile_memory --trace_handler=auto_trace --with_stack --record_shapes` on my devgpu.

The generated output is at
https://www.internalfb.com/manifold/explorer/ai_efficiency/tree/traces/dynocli/devgpu020.odn1.facebook.com/rank-0/rank-0.Aug_10_16_50_50.236946.pt.memorytl.html

 {F1067885545}
Screenshot of the html above
 {F1067886350}

Reviewed By: aaronenyeshi

Differential Revision: D48251791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107056
Approved by: https://github.com/aaronenyeshi, https://github.com/davidberard98
2023-08-21 17:20:13 +00:00
da765995fb [2d] remove ShardedTensor from fsdp extension (#107472)
2D Parallel won't use ShardedTensor, and it causes headaches for dynamo to recognize it, so remove it from the runtime flatten/unflatten path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107472
Approved by: https://github.com/fduwjj
2023-08-21 17:16:07 +00:00
e0f1fe102a Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 1d6a44656755c89f4f9a878865dcb0ac39af9a74.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/ZainRizvi due to Sorry, this is still breaking internal builds. Specifically, the dynamo test test_repros.py::DynamicShapesReproTests::test_odict_get_item_index_name ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1686684117))
2023-08-21 16:51:50 +00:00
df16b1ed53 [dynamo+aten] Enable embedding_bag_byte_rowwise_offsets + meta kernel impl (#106105)
Differential Revision: D47007550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106105
Approved by: https://github.com/gmagogsfm
2023-08-21 16:33:42 +00:00
d5b8c71112 [inductor] Revert inductor changes in #105977 (#107468)
Reverts inductor changes in #105977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107468
Approved by: https://github.com/jansel
2023-08-21 15:50:03 +00:00
a5efb5eb84 [export] Serialize constrain_as_size ops (#107386)
Since constrain_as_size has been fixed, I tried serializing it, but ran into some issues.
Notably, after each `.transform` call, I added a helper `_get_updated_range_constraints` to update the range constraints list. This is because when we retrace in a pass, the symbolic values being used change, so we need to update this dictionary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107386
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-08-21 15:24:11 +00:00
5f56c4fb32 [torch.compile x autograd.Function] More test cases (#107467)
I pulled a bunch of autograd.Function from test_autograd.py and added a
smoke test for them. Ideally we would actually run test_autograd.py as a
part of the Dynamo test suite, but we have excluded it due to there
being too many errors and I don't have time to figure that out at the
moment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107467
Approved by: https://github.com/ydwu4
ghstack dependencies: #107459, #107461
2023-08-21 13:39:36 +00:00
72de9b2ec2 [HigherOrderOp] stop erroring out on non-Tensor returns (#107461)
If map or autograd.Function is given a callable input that returns a non-Tensor, the code just errors out. Instead of erroring out, we should graph break by raising Unsupported so users aren't confused. The better thing to do is to actually support non-Tensor returns, but that requires more work.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107461
Approved by: https://github.com/ydwu4
ghstack dependencies: #107459
2023-08-21 13:39:36 +00:00
c5c41f9601 [HigherOrderOps] Saner error message (#107459)
Sometimes the Unsupported error messages can be pretty opaque (see
https://github.com/pytorch/pytorch/issues/106390 for an example). This
PR ensures the error message says something sane by raising a new
Unsupported exception (that includes the older one in the stack trace)
with a description of what's going on.

Test Plan:
- new test utility to check that a dictionary matches a regex so we
don't need to write out this super long error message every time.
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107459
Approved by: https://github.com/ydwu4, https://github.com/kshitij12345
2023-08-21 13:39:34 +00:00
796ce67229 Single source of truth for guard logging (#107532)
Instead of (poorly) reconstructing the guard list from the guards on OutputGraph, we log them at the horse's mouth: when we actually codegen the guard. This only requires very modest refactoring: as we translate guards into code parts, we also have to pass the source guard along so we can use it to give stack information.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107532
Approved by: https://github.com/anijain2305
ghstack dependencies: #107505, #107516, #107530
2023-08-21 13:02:12 +00:00
8316affc45 Add frame/recompile counter to all log messages in tracing context (#107530)
All log messages that occur while running Dynamo compilation now have `[X/Y]` added to the beginning of their message. X represents the frame being compiled, while Y says which compilation of the frame. For example, if you are debugging a frame that is repeatedly recompiling, you can look for N/0, N/1, N/2, etc. for the same N.  Here is what the logs look like as you transition from one frame to another:

<img width="1372" alt="image" src="https://github.com/pytorch/pytorch/assets/13564/4897e368-1e50-4807-b342-54e911bcf087">

To accurately get this prefix added to all messages, I had to expand the scope of the `tracing` context manager. Its scope now coincides with `log_compilation_event`. To do this, I had to populate fake mode lazily in the TracingContext, since it isn't created until later, inside the OutputGraph.

This subsumes the previous X.Y logging that was solely for dynamic shapes.

Unfortunately I had to reindent some stuff. Review the diff with whitespace off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107530
Approved by: https://github.com/anijain2305
ghstack dependencies: #107505, #107516
2023-08-21 13:02:12 +00:00
5ed60477a7 Optimize load inline via pch (#106696)
Add a precompiled header (PCH) to reduce load_inline build time.
PCH is a GCC built-in mechanism: https://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Precompiled-Headers.html

Add a PCH for '#include <torch/extension.h>'. This header is used in all load_inline modules, so all load_inline modules can benefit from this PR.
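
For reference, a minimal load_inline module of the kind whose builds the PCH speeds up (illustrative; requires a working C++ toolchain):

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
torch::Tensor add_one(torch::Tensor x) { return x + 1; }
"""

# Every such module includes <torch/extension.h>, which the PCH now pre-compiles.
mod = load_inline(name="add_one_ext", cpp_sources=cpp_source, functions=["add_one"])
print(mod.add_one(torch.zeros(3)))
```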

Changes:
1. Add a PCH signature to guarantee the PCH (gch) file takes effect.
2. Unify the get-C++-compiler functions.
3. Unify the get-build-flags functions.

Before this PR:
![image](https://github.com/pytorch/pytorch/assets/8433590/f190cdcb-236c-4312-b165-d419a7efafe3)

Added this PR:
![image](https://github.com/pytorch/pytorch/assets/8433590/b45c5ad3-e902-4fc8-b450-743cf73505a4)

Compiling time is reduced from 14.06s to 7.36s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106696
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-21 10:08:30 +00:00
24968383b5 Fix RenamePlanner documentation (#107535)
Fixes #107490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107535
Approved by: https://github.com/awgu, https://github.com/fduwjj
2023-08-21 07:51:57 +00:00
7ba513b6e4 [FSDP][state_dict] Expose optimizer state_dict config (#105949)
Optimizer state_dict configs were not exposed. This PR exposes the two dataclasses.

Differential Revision: [D47766024](https://our.internmc.facebook.com/intern/diff/D47766024/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105949
Approved by: https://github.com/rohan-varma
2023-08-21 07:29:49 +00:00
63e9b5481d [export] Add schema version to serializer/deserializer (#107420)
Added a version number to the schema for BC issues. We will add this number to the serialized ExportedProgram, and then when deserializing, if the number does not match up with the existing deserializer, we will error. We should update the number if there are any major changes to the schema.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107420
Approved by: https://github.com/zhxchen17
2023-08-21 06:56:46 +00:00
6dea9927a8 Don't use thrust::log(complex) in CUDA as it takes a FOREVER to compile (#107559)
As per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107559
Approved by: https://github.com/peterbell10
2023-08-21 05:47:49 +00:00
5ce88e7e71 remove unnecessary import introduced in PR 106535 (#107440)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107440
Approved by: https://github.com/fduwjj
ghstack dependencies: #106535
2023-08-21 05:29:31 +00:00
b9befc53a6 benchmark: higher tolerance for RobertaForQuestionAnswering (#107376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107376
Approved by: https://github.com/kit1980, https://github.com/XiaobingSuper, https://github.com/jansel
ghstack dependencies: #107375
2023-08-21 04:34:24 +00:00
1ea83f04d2 benchmark: convert output of fp64 to torch.float64 (#107375)
This PR adds converting the output of fp64 to torch.float64 before checking for accuracy.

Why do we need this change?
For llama in torchbench, the model converts its output to float before returning it.
bad4e9ac19/torchbenchmark/models/llama/model.py (L241)

Meanwhile, the correctness checker will not compare the `res` results with `fp64_ref` if `fp64_ref.dtype` is not torch.float64. So llama fails the accuracy check in the low-precision case, even though `res` is closer to `fp64_ref` than `ref`.
e108f33299/torch/_dynamo/utils.py (L1025)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107375
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/jansel
2023-08-21 04:34:23 +00:00
d77e95c3bf [Compiled Autograd] Improve nyi error messages (#106176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106176
Approved by: https://github.com/eellison
2023-08-21 04:31:13 +00:00
59c5424654 [inductor] Improve handling of index_expr with floating point dtypes (#105021)
I found that the upsample bicubic lowering was generating this line

```python
ops.index_expr(0.244094488188976*x0, torch.float32)
```

which is not good because triton's `ops.index_expr` expects integer expressions and dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105021
Approved by: https://github.com/lezcano
2023-08-21 03:09:53 +00:00
3b160ecc71 Fix wrong error messages with torch.nn.AdaptiveMaxPool1d (#107450)
Fixes #104822

A duplicate check is introduced in `adaptive_max_pool1d`, but this is probably a reasonably good approach.

Of course, it is also possible to transparently pass a flag from `adaptive_max_pool1d` to `adaptive_max_pool2d` (no new parameter needed) and then add the relevant checks in `adaptive_max_pool2d`, but that approach is less clear and requires a larger modification.

As a result, `output_size` is currently checked twice, once in each of `adaptive_max_pool1d` and `adaptive_max_pool2d`.

If you have better advice, please let me know. Thank you.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107450
Approved by: https://github.com/ezyang
2023-08-21 02:04:37 +00:00
96c5be8bc4 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 3349725766c229b3ead0fb692197d11bd8a85957.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ydwu4 due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph. Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1685447747))
2023-08-21 00:30:16 +00:00
c1cc74c7da Enable a number inductor of tests on CPU (#107465)
There were many tests whose `_cuda` variants were not running on CUDA. I fixed a few of these, but I'm sure there are plenty more.
It'd be great to have a way to test that we're indeed compiling
something in these tests, but I don't know how to do this off the top of
my head.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107465
Approved by: https://github.com/ezyang
2023-08-20 21:44:21 +00:00
71632d4d24 [cpu] add sdpa choice and UT (#105131)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

This adds an SDPA selection function for CPU that automatically chooses among several SDPA implementations. There are two CPU implementations to choose from: the unfused SDPA and flash attention. In general, flash attention has higher priority than the unfused SDPA. For cases where flash attention is not applicable, such as when flash attention is manually disabled or the inputs are not 4-dimensional, the unfused SDPA is chosen.
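
A minimal sketch of invoking SDPA on CPU; with this change the selection logic described above should pick flash attention for eligible inputs such as these (shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- 4-D fp32 CPU inputs are eligible for flash attention
q = torch.randn(1, 25, 1024, 64)
k = torch.randn(1, 25, 1024, 64)
v = torch.randn(1, 25, 1024, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```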

## Performance of the stack

### NanoGPT's SDPA kernel
Using benchmark [repo](https://github.com/mingfeima/bench_sdpa/blob/main/README.md), with one socket.
Shape: Batch size 1, Sequence length 1024, Head number 25, Head size 64.
Machine: SPR.

| Dtype    | Causal   | Mode      | SDPA            | Time (ms per iter) | Speedup |
| -------- | -------- | -------   | -------         | -------            | ------- |
| float32  | FALSE    | Inference | Unfused         | 3.081              |         |
|          |          |           | Flash attention | 1.665              | **1.85045** |
| float32  | TRUE     | Inference | Unfused         | 3.463              |         |
|          |          |           | Flash attention | 1.662              | **2.083634**|
| bfloat16 | FALSE    | Inference | Unfused         | 1.203              |         |
|          |          |           | Flash attention | 1.154              | **1.042461**|
| bfloat16 | TRUE     | Inference | Unfused         | 1.543              |         |
|          |          |           | Flash attention | 1.154              | **1.337088**|
| float32  | FALSE    | Training  | Unfused         | 54.938             |         |
|          |          |           | Flash attention | 23.029             | **2.385601**|
| float32  | TRUE     | Training  | Unfused         | 58.266             |         |
|          |          |           | Flash attention | 17.835             | **3.266947**|
| bfloat16 | FALSE    | Training  | Unfused         | 18.924             |         |
|          |          |           | Flash attention | 18.886             | **1.002012**|
| bfloat16 | TRUE     | Training  | Unfused         | 21.08              |         |
|          |          |           | Flash attention | 14.172             | **1.48744** |

### Stable Diffusion
Following model's [BKM](https://github.com/intel-innersource/frameworks.ai.models.intel-models/blob/develop/quickstart/diffusion/pytorch/stable_diffusion/inference/cpu/README.md).
Mode: Inference; Machine: SPR.

| Dtype    | SDPA                    | Throughput (fps) | Speedup SDPA | Total Time (ms) | Speedup |
| -------- | --------                | -------          | -------      | -------         | ------- |
| float32  | Unfused                 | 1.63             |              | 1139            |         |
|          | Flash attention         | 1.983            | 1.216564     | 547.488         | **2.080411**|
| bfloat16 | Flash attention in IPEX | 4.784            |              | 429.051         |         |
|          | Flash attention         | 4.857            | 1.015259     | 408.823         | **1.049479**|

### LLM models of Torchbench

Dtype: float32; Mode: Inference, single socket; Machine: CPX.
Model   name | SDPA | Inductor_new | Inductor_old | Inductor   Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.048629309 | 0.05591545 | **1.14983024**
hf_Bert | Unfused -> Flash attention | 0.053156243 | 0.060732115 | **1.142520841**
hf_Bert_large | Unfused -> Flash attention | 0.141089502 | 0.155190077 | **1.099940636**
llama | Unfused -> Flash attention | 0.033250106 | 0.033720745 | **1.01415451**

Dtype: bfloat16; Mode: Inference, single socket; Machine: SPR.
Model   name | SDPA | Inductor_new | Inductor_old | Inductor   Ratio(old/new)
-- | -- | -- | -- | --
hf_Albert | Unfused -> Flash attention | 0.020681298 | 0.020718282 | **1.001788324**
hf_Bert | Unfused -> Flash attention | 0.019932816 | 0.019935424 | **1.000130842**
hf_Bert_large | Unfused -> Flash attention | 0.047949174 | 0.048312502 | **1.007577355**
llama | Unfused -> Flash attention | 0.018528057 | 0.01861126 | **1.0044907**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105131
Approved by: https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693, #104863, #107128
2023-08-20 08:56:21 +00:00
a46217d2ef [CPU] Enable fused_attention pattern matcher (#107128)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

Enable the SDPA graph rewriting for Inductor CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107128
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #104583, #104584, #103826, #104693, #104863
2023-08-20 08:53:24 +00:00
6d647762d0 [cpu] enable bfloat16 and refactor for flash attention (#104863)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

Support for BF16 is added to the flash attention CPU kernel, for both the forward and backward paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104863
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826, #104693
2023-08-20 08:50:56 +00:00
3fc321f342 [cpu] implement flash attention backward (#104693)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

The flash attention CPU kernel is added for the FP32 backward path. Parallelization is over the batch size and head number dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104693
Approved by: https://github.com/jgong5, https://github.com/drisspg
ghstack dependencies: #104583, #104584, #103826
2023-08-20 08:48:12 +00:00
5516fe12ec [cpu] implement scaled dot product flash attention (#103826)
Feature RFC: https://github.com/pytorch/rfcs/pull/56.

The flash attention CPU kernel is added for the FP32 forward path. Blocking is applied over the query-length and kv-length dimensions, and the fusion of gemm + softmax update + gemm is done at once for each block. Parallelization is over the batch size, head number, and query length dimensions. In addition, the causal attention mask is supported: since attention is masked for unseen tokens, early termination is applied and we only compute the blocks in the lower-triangular part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103826
Approved by: https://github.com/drisspg, https://github.com/jgong5
ghstack dependencies: #104583, #104584
2023-08-20 08:43:48 +00:00
02dfacb1ec expand functional map for reduced floating points on CPU (#104584)
Return the output in the accumulation dtype in the vec::reduce functions when the input is float16 or bfloat16, to reduce rounding error.
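
A small illustration (not from the PR, and not the vec::reduce API itself) of why accumulating reductions over low-precision inputs in a higher-precision dtype matters:

```python
import torch

x = torch.full((10000,), 0.01, dtype=torch.bfloat16)

# Naive accumulation in bfloat16: once the running sum is large enough,
# adding 0.01 rounds away and the result stalls far below the true value.
acc = torch.tensor(0.0, dtype=torch.bfloat16)
for v in x:
    acc = acc + v

print(acc)              # much less than 100
print(x.float().sum())  # accumulating in float32 stays close to 100
```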

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104584
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #104583
2023-08-20 08:40:56 +00:00
68b9bf9671 Simplify verbose error guard printing (#107516)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107516
Approved by: https://github.com/anijain2305
ghstack dependencies: #107505
2023-08-20 06:50:27 +00:00
d6d485fa8c Revamp guard debug logging (#107505)
The new guard printout looks like this:

```
[DEBUG] GUARDS:
[DEBUG]   ___check_type_id(L['name'], 7605632)                          # if name == "special_attr":  # test/dynamo/test_misc.py:1155 in __getattribute__
[DEBUG]   L['name'] == '_backward_pre_hooks'                            # if name == "special_attr":  # test/dynamo/test_misc.py:1155 in __getattribute__
[DEBUG]   ___check_obj_id(L['self'], 139746432564960)                   # return super().__getattribute__(name)  # test/dynamo/test_misc.py:1157 in __getattribute__
[DEBUG]   ___check_obj_id(L['__class__'], 1451499216)                   # return super().__getattribute__(name)  # test/dynamo/test_misc.py:1157 in __getattribute__
[DEBUG]   ___is_grad_enabled()                                          # _dynamo/output_graph.py:346 in init_ambient_guards
[DEBUG]   not ___are_deterministic_algorithms_enabled()                 # _dynamo/output_graph.py:342 in init_ambient_guards
[DEBUG]   ___is_torch_function_enabled()                                # _dynamo/output_graph.py:350 in init_ambient_guards
[DEBUG]   utils_device.CURRENT_DEVICE == None                           # _dynamo/output_graph.py:348 in init_ambient_guards
```

Along with the guards, we also print what line of user code caused the guard to be added, or what line of Dynamo internal code added the guard (if there is no user stack trace, which is typically the case for ambient guards.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107505
Approved by: https://github.com/mlazos, https://github.com/voznesenskym, https://github.com/anijain2305
2023-08-20 06:50:27 +00:00
db3a199b2c fix symint meta val (#107491)
`aot_export` adds metadata for int inputs as symints. This diff turns such metadata into ints since they will be specialized anyway. We don't turn these into runtime assertions yet (but should, as future work).

Differential Revision: D48487562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107491
Approved by: https://github.com/gmagogsfm
2023-08-20 06:05:04 +00:00
4d0e7908c3 disable multi_linear_share_same_input for dynamic shape case (#107123)
`reshape_linear_reshape_pattern` will fail for dynamic shapes and break the fusion.
We will disable this optimization for dynamic shapes, since the shape might change at runtime, so we cannot compare the size hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107123
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5, https://github.com/jansel
2023-08-20 05:58:13 +00:00
e21ca06f46 [BE]: Update cudnn_frontend submodule to v0.9.2. (#107525)
Updates the cudnn_frontend submodule to v0.9.2, which mainly consists of bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107525
Approved by: https://github.com/ezyang
2023-08-20 05:26:42 +00:00
2c3d2fa2d2 do not raise constraint violation on trivial guards (#107470)
Differential Revision: D48475543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107470
Approved by: https://github.com/tugsbayasgalan
2023-08-20 03:35:27 +00:00
b1e8e01e50 [BE]: Apply PYI autofixes to various types (#107521)
Applies some autofixes from the ruff PYI rules to improve the typing of PyTorch. I haven't enabled most of these ruff rules yet as they do not have autofixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107521
Approved by: https://github.com/ezyang
2023-08-20 02:42:21 +00:00
24f0b552e1 [EASY] Use runtime_var_to_range for guards (#107329)
We sometimes allow compile-time reasoning to diverge from runtime
reasoning.  When we check guards, we are testing for *runtime*
properties.  Thus we should use those ranges, not the compile time
ones.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107329
Approved by: https://github.com/tugsbayasgalan
2023-08-20 02:16:56 +00:00
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285 which is faster, better, and have fixes a bunch of false negatives with regards to fstrings.

I also enabled RUF017 which looks for accidental quadratic list summation. Luckily, seems like there are no instances of it in our codebase, so enabling it so that it stays like that. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
02c2b750c5 Add support for GET_YIELD_FROM_ITER, YIELD_FROM, SEND (#106986)
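
For context (not part of the change itself), these are the bytecodes CPython emits for `yield from`; a quick way to see them:

```python
import dis

def gen(n):
    yield from range(n)

# CPython 3.10 emits GET_YIELD_FROM_ITER + YIELD_FROM for `yield from`;
# 3.11+ replaces YIELD_FROM with a SEND loop.
dis.dis(gen)
```
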
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106986
Approved by: https://github.com/jansel
2023-08-19 20:38:16 +00:00
4f3284e3ed [ATen] Update pre-compiled header (#106915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106915
Approved by: https://github.com/lezcano
ghstack dependencies: #106914
2023-08-19 20:21:58 +00:00
60936e4c29 Remove some unnecessary <iostream> includes from headers (#106914)
In almost all cases this is only included for writing the output formatter, which
only uses `std::ostream` so including `<ostream>` is sufficient.

The istream header is ~1000 lines so the difference is non-trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106914
Approved by: https://github.com/lezcano
2023-08-19 20:21:58 +00:00
eee2f57257 Raise TypeError for calling moduletype in dynamo (#107393)
Fixes #107314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107393
Approved by: https://github.com/williamwen42
2023-08-19 20:04:33 +00:00
3349725766 Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing, and wrapping again, so that a FunctionalTensorWrapper can be passed as input to dynamo.optimize and we can support something like this:
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR doesn't follow the `__tensor_flatten__` and `__tensor_unflatten__` protocol right now because we're not sure about the plan for doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-19 17:33:42 +00:00
11602ac564 [dynamo] fix disable_saved_tensors_hooks - graph break (#106875)
```python
def wrapper_fn(x):
    with torch.autograd.graph.disable_saved_tensors_hooks("ERROR"):
        y = x + 1
        print("HI")
        return y + 2

x = torch.randn(())

a = wrapper_fn(x)
opt = torch.compile(wrapper_fn, backend='eager', fullgraph=False)
e = opt(x)
```

Without the fix fails with,
```
Traceback (most recent call last):
  File "/home/kshiteej/Pytorch/pytorch_functorch/test/test_trace_grad.py", line 182, in <module>
    e = opt(x)
  File "/home/kshiteej/Pytorch/pytorch_functorch/torch/_dynamo/eval_frame.py", line 333, in _fn
    return fn(*args, **kwargs)
  File "/home/kshiteej/Pytorch/pytorch_functorch/test/test_trace_grad.py", line 165, in wrapper_fn
    def wrapper_fn(x):
AttributeError: module 'torch.autograd.graph' has no attribute 'disable_saved_tensors_hook'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106875
Approved by: https://github.com/zou3519
2023-08-19 11:41:40 +00:00
4eac43d046 Trace through Tensor slots (#107159)
Namely
```
__delattr__
__delitem__
__getattribute__
__getitem__
__setattr__
__setitem__
__str__
```

We don't trace through `__init__`.

Fixes https://github.com/pytorch/pytorch/issues/106648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107159
Approved by: https://github.com/Skylion007
2023-08-19 08:56:25 +00:00
8df298bc1e [functorch] vmap-dynamo: run vmap_impl under fake_mode (#107462)
Fixes https://github.com/pytorch/pytorch/issues/107050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107462
Approved by: https://github.com/zou3519
2023-08-19 08:32:01 +00:00
871d7d242d Silu support Complex for CUDA (#106854)
Fixes #89382

Add complex support for SiLU on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106854
Approved by: https://github.com/albanD
2023-08-19 06:57:09 +00:00
3ddf30505f fixing internal test failure on non sm_80 machines (#107340)
Summary:
These tests were failing on non-sm_80+ machines used for internal CI; added a check to skip them.

D48295360 added new tests that work in OSS but not in phabricator CI

https://www.internalfb.com/intern/test/562950057441807?ref_report_id=0

https://www.internalfb.com/intern/test/281475080709193?ref_report_id=0

Test Plan: see phabricator result

Differential Revision: D48417499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107340
Approved by: https://github.com/davidberard98
2023-08-19 04:27:15 +00:00
b5642f0b02 [vision hash update] update the pinned vision hash (#107498)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107498
Approved by: https://github.com/pytorchbot
2023-08-19 03:22:33 +00:00
302278b4d5 [pytorch][fakepg] enhance fakepg: broadcast and scatter (#107480)
Summary:
Add support for broadcast and scatter in FakeProcessGroup.

As a side note, we can't easily support broadcast_object_list or
scatter_object_list since they rely on actual broadcasted/scattered
values for pickle object deserialization. We could add support for rank 0, but
supporting other ranks may need additional changes outside of
FakeProcessGroup.

Test Plan:
`buck2 run mode/dev-nosan -c fbcode.enable_gpu_sections=true
//caffe2/test/distributed:fake_pg`, on top of the TARGETS diff: D48481513

`python test/distributed/test_fake_pg.py` after github sync

Differential Revision: D48481512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107480
Approved by: https://github.com/wanchaol
2023-08-19 02:36:45 +00:00
017499b078 Update reduction_ops groupings to include primtorch types (#107338)
Fixes https://github.com/pytorch/pytorch/issues/107335. The skips were updated for the _ref ops to match those for eager mode where necessary. Part of breakdown of https://github.com/pytorch/pytorch/pull/104489.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107338
Approved by: https://github.com/ezyang
2023-08-19 02:09:11 +00:00
93f2a64d4d Update submodule NCCL to v2.18.3 (#104993)
Update NCCL submodule to v2.18.3 which fixes numerous bugs and performance issues, particularly on newer GPUs: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-18-3.html#rel_2-18-3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104993
Approved by: https://github.com/malfet
2023-08-18 23:43:01 +00:00
64e02de93c Revert "Use CUDA DSA in ATen (#95300)" (#107483)
This reverts commit 93b0410eef57ebf038c12ed2fa1d4018a24096b7.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107483
Approved by: https://github.com/ngimel
2023-08-18 23:33:07 +00:00
2d7a062db0 Update shape_funcs to test primtorch operators (#107336)
Fixes #107335. Part of breakdown of https://github.com/pytorch/pytorch/pull/104489.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107336
Approved by: https://github.com/ezyang
2023-08-18 23:18:48 +00:00
5814380e7b Revert "Revert "Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)""" (#106320)
Fixed a typo specifying the number of tensors and elements in the test that had been failing in slow gradcheck.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106320
Approved by: https://github.com/soulitzer
2023-08-18 23:01:42 +00:00
bc662ffff9 [ROCm] Update ROCm skip decorators (#106138)
This PR adds a msg argument for skipIfRocm and skipCUDAIfRocm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106138
Approved by: https://github.com/jataylo, https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/albanD
2023-08-18 22:02:06 +00:00
28be2c674a [quant][pt2e] Move specific quantizer related things outside of main quant code base (#106806) (#107259)
Summary:

Currently, quantizer/quantize_pt2e imports things from specific quantizers (XNNPACKQuantizer, QuantizationConfig, etc.).
This PR removes them so it's clearer that they are not part of the core quantization code base.

This PR also removes get_supported_operators from the main Quantizer since we haven't seen a clear need for this API.

Test Plan:
CIs

Imported from OSS

Differential Revision: D48340367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107259
Approved by: https://github.com/kimishpatel
2023-08-18 21:29:09 +00:00
4ee6224767 Remove jbschlosser from symbolic-shapes auto request list (#107482)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107482
Approved by: https://github.com/jbschlosser
2023-08-18 20:51:19 +00:00
35e222e152 Enable mypy check in torch/_inductor/fx_passes/post_grad.py (#107449)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/fx_passes/post_grad.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/fx_passes/post_grad.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107449
Approved by: https://github.com/ezyang
2023-08-18 20:48:19 +00:00
77f080ee29 [pt2] test if core decomps are differentiable (#107241)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107241
Approved by: https://github.com/ezyang
2023-08-18 20:47:58 +00:00
5b7b9e7896 Update binary_ufuncs groupings to include primtorch types (#107419)
Fixes #107335. The skips were updated for the _ref ops to match those for eager mode where necessary. Part of breakdown of https://github.com/pytorch/pytorch/pull/104489.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107419
Approved by: https://github.com/ezyang
2023-08-18 20:45:36 +00:00
af0ed25ea8 Change >= in the GRU and the LSTM document to \ge (#107379)
Change >= in the GRU and LSTM documentation to \ge.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107379
Approved by: https://github.com/ezyang
2023-08-18 20:44:51 +00:00
c2706e5b5d Enable mypy check in torch/_inductor/kernel/unpack_mixed_mm.py (#107445)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/kernel/unpack_mixed_mm.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/kernel/unpack_mixed_mm.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107445
Approved by: https://github.com/ezyang
2023-08-18 20:44:21 +00:00
2d2d43d9fb add more check on LSTMCell (#107380)
Just like #107223, the ``LSTMCell`` operator has the same problems as ``GRUCell``; this PR adds some checks and related tests to fix it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107380
Approved by: https://github.com/ezyang
2023-08-18 20:44:17 +00:00
bdecdfd202 [Compiled Autograd] Fix duplicate visits of same node (#105887)
The error fixed here happened when we had multiple autograd::Edge objects pointing to the same autograd::Node, causing before() to get called multiple times on the same object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105887
Approved by: https://github.com/albanD
2023-08-18 19:47:34 +00:00
67bb3c05b0 Add verbose_guards logging artifact (#107388)
It looks like this:

```
[DEBUG] GUARD: ___check_type_id(L['z'][L["MyEnum"].BAR], 7640416) and L['z'][L["MyEnum"].BAR] == 10
[DEBUG] Stack:
[DEBUG]   File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 6657, in <module>
[DEBUG]     run_tests()
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/test_case.py", line 38, in run_tests
[DEBUG]     run_tests()
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 985, in run_tests
[DEBUG]     unittest.main(argv=argv)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
[DEBUG]     self.runTests()
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
[DEBUG]     self.result = testRunner.run(self.test)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
[DEBUG]     test(result)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
[DEBUG]     return self.run(*args, **kwds)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
[DEBUG]     test(result)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
[DEBUG]     return self.run(*args, **kwds)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
[DEBUG]     test(result)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
[DEBUG]     return self.run(*args, **kwds)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2521, in run
[DEBUG]     self._run_with_retry(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2450, in _run_with_retry
[DEBUG]     super_run(result=result)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
[DEBUG]     self._callTestMethod(testMethod)
[DEBUG]   File "/home/ezyang/local/b/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
[DEBUG]     method()
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/testing/_internal/common_utils.py", line 2377, in wrapper
[DEBUG]     method(*args, **kwargs)
[DEBUG]   File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 2529, in test_enum_as_dict_key_with_overloaded_str
[DEBUG]     res = opt_fn(x)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 333, in _fn
[DEBUG]     return fn(*args, **kwargs)
[DEBUG]   File "/data/users/ezyang/b/pytorch/test/dynamo/test_misc.py", line 2519, in fn
[DEBUG]     torch._dynamo.graph_break()
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/eval_frame.py", line 493, in catch_errors
[DEBUG]     return callback(frame, cache_size, hooks, frame_state)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 637, in _convert_frame
[DEBUG]     result = inner_convert(frame, cache_size, hooks, frame_state)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 133, in _fn
[DEBUG]     return fn(*args, **kwargs)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 371, in _convert_frame_assert
[DEBUG]     return _compile(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 567, in _compile
[DEBUG]     guarded_code = compile_inner(code, one_graph, hooks, transform)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/utils.py", line 181, in time_wrapper
[DEBUG]     r = func(*args, **kwargs)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 466, in compile_inner
[DEBUG]     out_code = transform_code_object(code, transform)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
[DEBUG]     transformations(instructions, code_options)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/convert_frame.py", line 416, in transform
[DEBUG]     tracer = InstructionTranslator(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2018, in __init__
[DEBUG]     self.symbolic_locals = collections.OrderedDict(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/symbolic_convert.py", line 2021, in <genexpr>
[DEBUG]     VariableBuilder(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 211, in __call__
[DEBUG]     vt = self._wrap(value).clone(**self.options())
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 404, in _wrap
[DEBUG]     result = {
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 405, in <dictcomp>
[DEBUG]     k: VariableBuilder(
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 211, in __call__
[DEBUG]     vt = self._wrap(value).clone(**self.options())
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 354, in _wrap
[DEBUG]     return type_dispatch(self, value)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 837, in wrap_literal
[DEBUG]     return self.wrap_unspecialized_primitive(value)
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 1073, in wrap_unspecialized_primitive
[DEBUG]     guards=self.make_guards(GuardBuilder.CONSTANT_MATCH),
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 269, in make_guards
[DEBUG]     return {source.make_guard(guard) for guard in guards}
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_dynamo/variables/builder.py", line 269, in <setcomp>
[DEBUG]     return {source.make_guard(guard) for guard in guards}
[DEBUG]   File "/data/users/ezyang/b/pytorch/torch/_guards.py", line 641, in make_guard
[DEBUG]     return Guard(self.name(), self.guard_sou
```

One downside is I can't report *why* the guard was added. I'm not entirely sure how to do this; the problem is guards will propagate to a bunch of variables before finally getting included as part of the final set. Maybe a very very verbose version could report stack traces at every handoff point.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107388
Approved by: https://github.com/mlazos
ghstack dependencies: #107438, #107358
2023-08-18 19:05:54 +00:00
36bb7a1f42 Add fast traceback utilities (#107358)
This adds some utilities for conveniently working with fast combined CapturedTraceback from Python. The main goal of these utilities is to make it easier for people to use CapturedTraceback as a drop-in replacement for `traceback.extract_stack`, which is 20x slower than CapturedTraceback.

I port symbolic shapes to use the new CapturedTraceback code, to validate that the APIs work and are useful.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107358
Approved by: https://github.com/zdevito, https://github.com/albanD
ghstack dependencies: #107438
2023-08-18 19:05:54 +00:00
d5f7df3b8a Hand bind CapturedTraceback (#107438)
I do this instead of pybind11 because I need a custom tp_dealloc to promptly free PyFrames. I also add GC traverse/clear support. This is required to avoid leaking memory from co_extra on code objects in some obscure situations. This is indirectly tested by #107388

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107438
Approved by: https://github.com/albanD
2023-08-18 19:05:52 +00:00
d8f2ef10a6 [dtensor][1/n] refactor op dispatch logic to reduce overhead (#107305)
This PR is the first change of a series of refactors to the op dispatch logic to:
1. remove the redundant logic in the op dispatch and simplify the error checking
2. reduce the number of tree_map/tree_flatten/unflatten calls needed, to cut the overhead coming from those operations
3. remove the CachedShardingPropagator by using lru_cache from functools directly; this not only helps TP, but makes general DTensor operations faster (a sketch follows below)
4. change the view ops' behavior of in-place mutating the op_schema, which is dangerous for sharding-prop caching; model view ops as one type of resharding too
5. enrich the output sharding to include whether the op needs a redistribute, so that we don't need an explicit op schema comparison to know it.

This should help with further reducing the CPU overhead, benchmark
results:
before (without this change), aten.addmm latency: 0.476ms
![Screenshot 2023-08-16 at 10 46 26 AM](https://github.com/pytorch/pytorch/assets/9443650/7692e6c1-1936-4c7f-bf9c-6c8c9b8f6c76)

after (with this change), aten.addmm latency: 0.341ms
![Screenshot 2023-08-16 at 11 05 49 AM](https://github.com/pytorch/pytorch/assets/9443650/15a53f0b-7a95-444e-ab2f-3ee0ad2fa47f)

overall, one MLP layer's time was reduced from 13.535 ms -> 9.665 ms

Apart from the overhead reduction, this PR simplifies the op dispatching logic and the resharding logic (more refactoring is needed to make things cleaner, which will be done in later PRs).
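A minimal sketch of point 3 above (the function name and the propagation logic here are purely illustrative, not the actual DTensor internals):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def propagate_sharding(op_name: str, input_placements: tuple) -> tuple:
    # Stand-in for the (relatively expensive) sharding propagation; because the
    # arguments are hashable, repeated calls with the same schema hit the cache.
    return tuple(f"{op_name}:{p}" for p in input_placements)

propagate_sharding("aten.addmm", ("Shard(0)", "Replicate"))
propagate_sharding("aten.addmm", ("Shard(0)", "Replicate"))  # served from the cache
print(propagate_sharding.cache_info())
```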

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107305
Approved by: https://github.com/fduwjj
2023-08-18 18:30:46 +00:00
8d6a487d69 [dynamo] Make KeyedJaggedTensor a variable. (#107319)
This is extracted from https://github.com/pytorch/pytorch/pull/107156/
to model KeyedJaggedTensor as a first-class concept in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107319
Approved by: https://github.com/ezyang
2023-08-18 17:15:46 +00:00
ea3381d92c Make StarDep.index throw an error (#107092)
There was an issue where `hasattr(dep, "index")` would incorrectly be True because it was picking up `NamedTuple.index` (a method).  We were also comparing that method to a `sympy.Expr` in one place.

As far as I can tell this wasn't actually causing any bugs (the comparison actually did the right thing), but still good to fix it.
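The underlying Python behavior, shown with a hypothetical stand-in class (not the real StarDep):

```python
from typing import NamedTuple

class StarDepLike(NamedTuple):  # illustrative stand-in only
    name: str

dep = StarDepLike("buf0")
print(hasattr(dep, "index"))  # True, but this is tuple.index, a bound method
print(dep.index)              # <built-in method index of StarDepLike object at ...>
```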

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107092
Approved by: https://github.com/eellison
2023-08-18 17:15:43 +00:00
139437bb84 Make Openxla dynamo backend take boxed input (#107260)
Fixes https://github.com/pytorch/xla/issues/5454

Also adding the inference(non-aot) backend back since we see a speed regression when using the aot-backend compared to the non-aot openxla backend. It is being tracked in https://github.com/pytorch/xla/issues/5430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107260
Approved by: https://github.com/shunting314, https://github.com/jansel
2023-08-18 16:58:05 +00:00
3c11184ca8 Revert "Fakify leaf of FunctionalTensor (#107062)"
This reverts commit 6cb0128c8a07d626ab84516df3c9727943469d49.

Reverted https://github.com/pytorch/pytorch/pull/107062 on behalf of https://github.com/ZainRizvi due to This appears to have broken the test TestDTensorCompile.test_dtensor_fullgraph.  Probably a land race ([comment](https://github.com/pytorch/pytorch/pull/107062#issuecomment-1684124230))
2023-08-18 16:02:54 +00:00
c21e9de25d Inductor cpp wrapper: fix optional tensor input (#106847)
Fix cpp wrapper failure on `clip` in Torchbench:
```
RuntimeError: tensor does not have a device
```

An `optional<at::Tensor>` variable whose value equals `at::Tensor()` is still considered to _contain a value_, so converting it to `bool` returns `true`, whereas converting `None` in Python to `bool` returns `false`.
Fix it to be an optional variable that _does not contain a value_.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106847
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-18 13:20:19 +00:00
e10791c0bd enable mkl_gemm_bf16bf16f32 in cpublas::gemm (#107196)
This adds a wrapper around `mkl_gemm_bf16bf16f32`, which is used in the flash attention kernel on Intel 4th gen Xeon.
A fallback path has also been implemented in cpublas::gemm in case `mkl_gemm_bf16bf16f32` is not available.

The primary target of this change is to help build kernels in `scaled_dot_product_attention`, e.g. flash attention and efficient attention. In the attention kernel, `q @ k.T = attn`, where q and k are given as bfloat16 and attn is float32. This is beneficial for both performance and accuracy, since attn is used to compute the lazy softmax, which has to be done in float32.

This patch also adds the OpenBLAS routine `sbgemm_`, which likewise has a bf16 * bf16 -> fp32 signature; but since the OpenBLAS routine has a different name from MKL's, we cannot use `sbgemm_` with MKL.

In the fallback path, the computation takes two steps: first do the gemm with beta = 0, then add beta * C in full precision. The idea (from @peterbell10) is not to truncate C to bfloat16, so as to avoid unnecessary accuracy loss.
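A rough Python analogue of the two-step fallback (purely illustrative; the actual implementation is C++ inside cpublas::gemm):

```python
import torch

def gemm_bf16bf16f32_fallback(a_bf16, b_bf16, c_f32, alpha=1.0, beta=1.0):
    # Step 1: gemm with beta = 0, accumulating the bf16 inputs into an fp32 result.
    ab = alpha * (a_bf16.float() @ b_bf16.float())
    # Step 2: add beta * C in full precision, so C is never truncated to bfloat16.
    return ab + beta * c_f32

a = torch.randn(4, 8, dtype=torch.bfloat16)
b = torch.randn(8, 5, dtype=torch.bfloat16)
c = torch.randn(4, 5, dtype=torch.float32)
print(gemm_bf16bf16f32_fallback(a, b, c).dtype)  # torch.float32
```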

ref: https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-0/cblas-gemm-bf16bf16f32.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107196
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2023-08-18 12:48:10 +00:00
42625da5e1 reseed all Generators in Dataloader's _worker_loop() -- via GC (#107131)
Alternative to https://github.com/pytorch/pytorch/pull/107034, implements @ezyang 's suggestion from https://github.com/pytorch/pytorch/pull/107034#discussion_r1292857201.

This PR addresses https://fb.workplace.com/groups/pytorch.oss.dev/posts/1699944830430051 and does a bunch of stacked changes:

- Make the `Generator` class support GC; this makes all `Generator` instances tracked and accessible through Python's GC.
- Use the GC to retrieve all existing Generator instances in Dataloader's `_worker_loop` and re-seed them (a sketch follows below): this extends the re-seeding that is already applied to the global/default Generator.
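A minimal sketch of the GC-based re-seeding (illustrative only; the helper name and seeding scheme are assumptions, and the real change lives inside `_worker_loop`). It relies on `Generator` being GC-tracked, which is exactly what the first bullet enables:

```python
import gc
import torch

def reseed_all_generators(base_seed: int, worker_id: int) -> None:
    # Walk every object the GC tracks and give each torch.Generator a worker-specific seed.
    for obj in gc.get_objects():
        if isinstance(obj, torch.Generator):
            obj.manual_seed(base_seed + worker_id)

g = torch.Generator().manual_seed(0)
reseed_all_generators(base_seed=1234, worker_id=3)
print(g.initial_seed())  # 1237
```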

~TODO: a bit of docs and justification, which I'll do if this PR is mergeable.~ -- Done

CC @albanD @ezyang  as previously discussed

BC-Breaking Note
-------------------

We now re-seed all `Generator` instances within the `Dataloader` workers' loop to ensure that their RNG is different across workers.
Previously, the RNG of user-defined `Generators` would be the same across workers, which could lead to wrong training procedures. This only affects user-defined `Generators`, not the default `Generator` (which was already re-seeded).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107131
Approved by: https://github.com/ezyang
2023-08-18 10:23:23 +00:00
95f1591acb error on bad input to equality constraint (#107311)
Differential Revision: D48401664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107311
Approved by: https://github.com/angelayi
2023-08-18 09:01:51 +00:00
9c9982a0aa Turn on typechecking for _inductor/kernel/conv.py (#106258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106258
Approved by: https://github.com/Skylion007
ghstack dependencies: #106252
2023-08-18 08:49:18 +00:00
18b1c2907d [inductor] Add ir.WelfordReduction with multiple outputs (#104725)
This replaces `var_unnormalized` reduction type with `welford_reduce` which takes the input data and outputs not just the variance, but also the mean and weights which account for the full welford accumulator state. Thus we can avoid re-computing the mean, and we now have enough information to create a multilayer reduction which I implement here by adding a second reduction type called `welford_combine` which reduces over all three inputs simultaneously.

Multi-layer support is particularly important as normalization operators like BatchNorm are being split in many timm models, which meant `var_unnormalized` had to fall back to two-pass variance calculation.
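For reference, `welford_combine` merges two accumulator states (mean, m2, weight) with the standard parallel Welford update; a scalar Python sketch (not Inductor code):

```python
def welford_combine(mean_a, m2_a, n_a, mean_b, m2_b, n_b):
    # m2 is the sum of squared deviations from the mean within each chunk.
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return mean, m2, n

# Combining the two halves of a list reproduces the full-list statistics.
xs = [1.0, 2.0, 3.0, 4.0]
left = (1.5, 0.5, 2)   # mean, m2, count of xs[:2]
right = (3.5, 0.5, 2)  # mean, m2, count of xs[2:]
mean, m2, n = welford_combine(*left, *right)
print(mean, m2 / n)  # 2.5 1.25 (the mean and population variance of xs)
```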

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104725
Approved by: https://github.com/lezcano
2023-08-18 08:18:01 +00:00
3699c6adaa [DTensor][random] add DTensor constructor: rand (#106535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106535
Approved by: https://github.com/fduwjj, https://github.com/wanchaol
2023-08-18 07:39:34 +00:00
d465d6a838 [inductor] scatter_reduce - skip .item() in backward if GradMode is not enabled (#107353)
Repeats #106429 for scatter_reduce so that the backward will pass for PT2. The .item() call is only needed to make double-backward work, which isn't supported anyway for PT2; so an easy fix is to just skip the .item() call if we know we won't need double-backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107353
Approved by: https://github.com/eellison
2023-08-18 07:17:29 +00:00
a815e719e8 Turn on typechecking for _inductor/utils.py (#106252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106252
Approved by: https://github.com/Skylion007
2023-08-18 04:11:34 +00:00
1d6a446567 Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time consuming compared to BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-08-18 04:07:59 +00:00
b31a357eaa [dynamo][eval_frame] Set destroy_extra_state deleter as part of co_extra (#107117)
Using the `freefunc` facility to free the ExtraState objects - https://peps.python.org/pep-0523/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107117
Approved by: https://github.com/jansel
ghstack dependencies: #106917
2023-08-18 03:52:08 +00:00
4608b9422c [dynamo][eval_frame] Unify cache entry and frame_state on the same co_extra index (#106917)
Handling follow up from https://github.com/pytorch/pytorch/pull/106413#discussion_r1288971923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106917
Approved by: https://github.com/ezyang
2023-08-18 03:52:08 +00:00
fcd1a0e93e [inductor] Use divisor_override param in aten.divisor_override lowering (#107401)
Summary: Just mirrored the treatment of this param from here: https://codebrowser.bddppq.com/pytorch/pytorch/aten/src/ATen/native/cuda/AveragePool2d.cu.html#171

Test Plan:
`python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_nn_functional_avg_pool2d`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107401
Approved by: https://github.com/eellison
2023-08-18 03:52:00 +00:00
2bb59a9ac6 Fix vscode test discovery (#107404)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107404
Approved by: https://github.com/wconstab
2023-08-18 03:50:46 +00:00
6cb0128c8a Fakify leaf of FunctionalTensor (#107062)
This PR allows dynamo to fakify FunctionalTensorWrapper by unwrapping, replacing, and re-wrapping it, so that a FunctionalTensorWrapper can be passed as input to dynamo.optimize and we can support something like this
```python
ff = torch.func.functionalize(f)
torch.compile(ff)(x)
```

This PR didn't follow the \_\_tensor_flatten\_\_ and \_\_tensor_unflatten\_\_ protocol for now because we're not sure of the plan for doing that for FunctionalTensorWrapper (it's implemented in C++).

**Test Plan:**
Add a new test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107062
Approved by: https://github.com/zou3519
ghstack dependencies: #107042
2023-08-18 03:05:45 +00:00
36141de427 Throw error if stateless.functional_call called with nn.DataParallel (#107403)
Part of #77576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107403
Approved by: https://github.com/mikaylagawarecki
2023-08-18 03:02:04 +00:00
600f9ef2ad [nullability] Suppress -Wnullable-to-nonnull-conversion errors in caffe2 (#107418)
Summary: Changelog: Suppresses the nullable to nonnull conversion errors from caffe2.

Test Plan:
```
buck2 build //xplat/caffe2:caffe2Apple
```

Differential Revision: D48453395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107418
Approved by: https://github.com/seemethere
2023-08-18 02:04:37 +00:00
9ca2106e5f Use CUDA 12.1.1 patch version in CI (#107295)
Update CUDA to 12.1.1.

After:
Nightly Linux - https://github.com/pytorch/builder/pull/1476
Nightly Windows - https://github.com/pytorch/builder/pull/1485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107295
Approved by: https://github.com/ZainRizvi
2023-08-18 01:28:16 +00:00
35b2b3ee47 Fix rst formatting in torch.compiler_troubleshooting.rst (#107360)
Fix some rst formatting - mostly around ``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107360
Approved by: https://github.com/kit1980
2023-08-18 01:04:24 +00:00
608afe8083 Added xla friendly codepath to single_tensor_adamw (#102858)
There are extra graph compilations on XLA when beta{1,2} ** step gets too small. This PR addresses the issue by enabling the `capturable` interface for XLA, as well as switching to `torch.float_power`, which preserves the same behaviour as the non-capturable flow on XLA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102858
Approved by: https://github.com/janeyx99, https://github.com/albanD
2023-08-18 00:16:28 +00:00
89de048563 [BE] Use allocator to allocate workspace (#107178)
As suggested in https://github.com/pytorch/pytorch/pull/106844#discussion_r1293839247, it's better to just allocate a DataPtr rather than the whole tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107178
Approved by: https://github.com/albanD
ghstack dependencies: #106977, #106844
2023-08-18 00:15:34 +00:00
3c3874d623 Align formula in Impl::General mode with Impl::Contiguous mode in batch_norm_elementwise_cuda op. (#106943)
Fixes #106941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106943
Approved by: https://github.com/colesbury
2023-08-17 23:46:42 +00:00
02bcaf45f6 Revert "Add backward check for test_memory_format (#106104)"
This reverts commit 2e44adb06608d09a36b899ffdb375cb7d46a78d2.

Reverted https://github.com/pytorch/pytorch/pull/106104 on behalf of https://github.com/huydhn due to Sorry for reverting this but it is failing inductor job in trunk 2e44adb066.  I will add ciflow/inductor label to the PR make sure that the test runs there ([comment](https://github.com/pytorch/pytorch/pull/106104#issuecomment-1683119990))
2023-08-17 23:45:31 +00:00
d3f92ca9e9 Revert "[C10D] Implement new libuv backend for TCPStore. (#105870)"
This reverts commit 3c841163cef9167ea50adbcfc4384b63c0b6e93a.

Reverted https://github.com/pytorch/pytorch/pull/105870 on behalf of https://github.com/huydhn due to I think the distributed failure is related as this is now failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/105870#issuecomment-1683117192))
2023-08-17 23:41:00 +00:00
266772472e Describe the 'float32_matmul_precision' settings in more detail (#107169)
The documentation for `torch.set_float32_matmul_precision()` mentions a datatype called "bfloat16_3x".  This doesn't appear to be a very standard term, and I had a hard time figuring out what exactly it meant.  I now assume it refers to [[Henry2019]](http://arxiv.org/abs/1904.06376), which describes an algorithm by which a float32 multiplication is approximated via three bfloat16 multiplications.  This PR updates the documentation to include this reference and to briefly describe how this algorithm works.

Note that I just learned everything that I wrote here, so I'd appreciate if someone more expert in this topic could check to make sure that I didn't get anything significantly wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107169
Approved by: https://github.com/colesbury
2023-08-17 22:41:22 +00:00
2b3917dc63 [ONNX] Fix memory leak when exporting models (#107244)
This commit fixes a memory leak caused by creating a new PyListObject using PyDict_Items() and not releasing that list later. This often prevented the entire model from being de-allocated even when all Python references to it had gone out of scope.

Here is a repro script:

```python
import psutil, torch, transformers, gc, os, sys
import math

# Size in MB
model_size = 512

kB = 1024
MB = kB * kB
precision_size = 4 # bytes per float
activation_size = math.floor(math.sqrt(model_size * MB / precision_size))

class Net(torch.nn.Module):
    def __init__(self, activation_size):
        super(Net, self).__init__()
        self.linear = torch.nn.Linear(activation_size, activation_size)
    def forward(self, x):
        return {"result": self.linear(x)}

def collect_and_report(s):
    gc.collect()
    print(s)
    #print("psutil: ", psutil.virtual_memory().percent)
    print("CPU MB used by this process: ", psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2)
    print("GPU MB allocated by pytorch: ", torch.cuda.memory_allocated(0) / 1024 ** 2)
    print()

def run_test(device_str):
    device = torch.device(device_str)
    dummy_input = torch.zeros(activation_size, requires_grad=True).to(device)

    collect_and_report("Before loading model: ")
    model = Net(activation_size).to(device)
    collect_and_report("After loading model: ")

    torch.onnx.export(model, dummy_input, "dummy.onnx")
    collect_and_report("After exporting model: ")

    del model
    collect_and_report("After deleting model:")

print("Running CPU test: ")
run_test("cpu")

print("Running GPU test: ")
run_test("cuda")
```

Results without this commit:
```
Running CPU test:
Before loading model:
CPU MB used by this process:  346.5
GPU MB allocated by pytorch:  0.0

After loading model:
CPU MB used by this process:  861.078125
GPU MB allocated by pytorch:  0.0

After exporting model:
CPU MB used by this process:  880.12890625
GPU MB allocated by pytorch:  0.0

After deleting model:
CPU MB used by this process:  880.12890625
GPU MB allocated by pytorch:  0.0

Running GPU test:
Before loading model:
CPU MB used by this process:  991.9375
GPU MB allocated by pytorch:  0.04443359375

After loading model:
CPU MB used by this process:  992.19140625
GPU MB allocated by pytorch:  512.0888671875

After exporting model:
CPU MB used by this process:  1026.64453125
GPU MB allocated by pytorch:  520.25830078125

After deleting model:
CPU MB used by this process:  1026.64453125
GPU MB allocated by pytorch:  520.25830078125
```

With this commit:
```
Running CPU test:
Before loading model:
CPU MB used by this process:  372.7734375
GPU MB allocated by pytorch:  0.0

After loading model:
CPU MB used by this process:  887.18359375
GPU MB allocated by pytorch:  0.0

After exporting model:
CPU MB used by this process:  918.96875
GPU MB allocated by pytorch:  0.0

After deleting model:
CPU MB used by this process:  407.3671875
GPU MB allocated by pytorch:  0.0

Running GPU test:
Before loading model:
CPU MB used by this process:  516.6875
GPU MB allocated by pytorch:  0.04443359375

After loading model:
CPU MB used by this process:  516.75390625
GPU MB allocated by pytorch:  512.0888671875

After exporting model:
CPU MB used by this process:  554.25390625
GPU MB allocated by pytorch:  520.2138671875

After deleting model:
CPU MB used by this process:  554.25390625
GPU MB allocated by pytorch:  8.16943359375
```

Fixes #106976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107244
Approved by: https://github.com/BowenBao, https://github.com/kit1980
2023-08-17 22:15:28 +00:00
d8dadb0f25 aot_inductor: fix compile returning None if cache hits (#107020)
Summary:
Seems like a bug in D47998435: when the cache hits, it returns None.

Repro:

```
class TestModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + 1

mod = TestModule()
inp = torch.rand(1)
out = mod(inp)
mod2 = torch.fx.symbolic_trace(mod, concrete_args=[inp])

so, _ = torch._export.aot_compile(mod2, tuple([inp]))
# 2nd time, it will return None
so, _ = torch._export.aot_compile(mod2, tuple([inp]))
assert so is not None  # FAIL
```

Test Plan: Run the repro

Differential Revision: D48258375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107020
Approved by: https://github.com/angelayi
2023-08-17 22:12:24 +00:00
37eb969939 Update test name in multiGPU test (#107397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107397
Approved by: https://github.com/wanchaol
ghstack dependencies: #107313, #106583
2023-08-17 21:40:50 +00:00
b9c86c521d Make mergebot work with review comments (#107390)
Fixes https://github.com/pytorch/pytorch/issues/100406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107390
Approved by: https://github.com/clee2000
ghstack dependencies: #107385
2023-08-17 21:31:41 +00:00
4874b02379 [BE] Remove deprecated github gql param and disable inconsistent test (#107385)
Two fixes:
- Stops querying `pushDate`, which [has been deprecated](https://docs.github.com/en/graphql/reference/objects) and now always returns null
- Disables the test `test_merge_ghstack_into`, which was recently added in https://github.com/pytorch/pytorch/pull/105251. This test used the results of another person's ghstack PR for testing, but as the dev submitted chunks of their PR, this test's assumptions were broken. cc @izaitsevfb for a long-term fix here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107385
Approved by: https://github.com/clee2000
2023-08-17 21:31:41 +00:00
8ccfd801be Introduce CUDA-only _scaled_mm op (#107341)
Summary:
Based on D48377631 with updates to guard the utilization of cublas features only found after 11.8

According to https://docs.nvidia.com/cuda/cublas/#id99 only FP8 matrix types can be scaled, and `Float8_e4m3` x `Float8_e4m3` results can be returned as the `Float8_e4m3` type, or upcast to `Half`, `BFloat16` or `Float`; but in that case `result_scale` will have no effect and `amax` will not be computed.
An optional `bias` argument can also be passed to the function; it should be a vector of either `Half` or `BFloat16`, whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

Skip the decomposition implementation until the fp8-on-Triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work)

Test Plan: Tests in test_matmul_cuda.py

Differential Revision: D48415871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107341
Approved by: https://github.com/vkuzo
2023-08-17 21:24:43 +00:00
2e44adb066 Add backward check for test_memory_format (#106104)
Add backward check for test_memory_format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106104
Approved by: https://github.com/mikaylagawarecki
2023-08-17 21:19:34 +00:00
c69514ccb2 Update generate_opcheck_tests, also use it to test some internal tests (#107328)
Summary:
We change `generate_opcheck_tests` to be a bit more user-friendly. Note that
there are some internal-only changes, go review them there.

Test Plan: - tests

Differential Revision: D47965247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107328
Approved by: https://github.com/ezyang
2023-08-17 21:18:14 +00:00
bbf03561a9 [functional collectives] Move back to registering finalizers on wrappers. (#107250)
We cannot use inner tensors for finalizers as they are uncollectable until waited.

This PR adds a bunch of tests for the observable behavior we want, including the
necessary scaffolding for us to test code for their waitiness.
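A generic illustration of the wrapper-finalizer pattern (hypothetical class, not the actual functional-collectives code):

```python
import weakref

class CollectiveWrapper:
    def __init__(self, work_id: int):
        self.work_id = work_id
        # Register cleanup on the *wrapper*: it runs when the wrapper is collected,
        # independently of how the inner tensor is handled.
        self._finalizer = weakref.finalize(
            self, print, f"cleaning up / waiting on work {work_id}"
        )

w = CollectiveWrapper(0)
del w  # the finalizer fires here (or at interpreter shutdown at the latest)
```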
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107250
Approved by: https://github.com/wconstab
2023-08-17 21:08:28 +00:00
3c841163ce [C10D] Implement new libuv backend for TCPStore. (#105870)
The new backend is currently gated behind a 'use_libuv' flag in the TCPStore constructor
to reduce the impact on existing users while we test it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105870
Approved by: https://github.com/H-Huang
2023-08-17 20:40:32 +00:00
d86445a506 [cuda] vectorized gamma and beta loading in vectorized_layer_norm (#107287)
Improves the performance of `vectorized_layer_norm` by vectorizing access to the `gamma` and `beta` buffers. This uses 128-bit load instructions, which improves memory bandwidth utilization. The speedup is ~3% on average and there are no obvious regressions on any problem size.

Used the following code to test:

```python
import torch
from torch.utils.benchmark import Compare, Timer  # @manual

l_inputs = [
    (32, 32),
    (64, 32),
    (256, 128),
    (512, 1024),
    (1024, 2048),
    (2048, 2048),
    (4096, 16384),
    (70000, 64),
    (131072, 512),
    (1000, 520),
    (4005, 4005),
    (10000, 1000),
    (1024, 10000),
    (8192, 4096),
    (10000, 10000),
    (3072, 10000),
    (6144, 10000),
    (1024, 20000),
    (1024, 20000),
    (512, 1536),
    (512, 6144),
    (512, 10240),
    (1000, 1000),
    (2000, 2000),
    (10240, 10240),
    (384, 128),
    (2048, 1024),
    (267, 513),
    (67, 123479),
    (1024, 123479),
    (2048, 66679),
    (200, 256),
    (1000, 256),
    (6000, 256),
    (6272, 256),
    (200, 512),
    (1000, 512),
    (6000, 512),
    (6272, 512),
    (200, 1024),
    (1000, 1024),
    (6000, 1024),
    (6272, 1024),
    (200, 2048),
    (1000, 2048),
    (6000, 2048),
    (6272, 2048),
    (200, 3072),
    (1000, 3072),
    (6000, 3072),
    (6272, 3072),
]

def run_model_on_device(fs, X, gO, device_string, numeric_type):
    ln = torch.nn.LayerNorm((fs,), device=device_string, dtype=numeric_type)
    ln.reset_parameters()
    X.grad = None
    ln.zero_grad(set_to_none=True)
    out = ln(X)
    out.backward(gO)
    return (ln.weight.grad, ln.bias.grad)

def run_correctness_test(eps_weight, eps_bias):
    dtype = torch.float
    for val in l_inputs:
        bs = val[0]
        fs = val[1]
        mean_adjustment = torch.randn(fs, device="cpu", dtype=torch.float)
        X = mean_adjustment * torch.randn(
            bs, fs, device="cpu", dtype=torch.float, requires_grad=True
        )

        X = X.detach().requires_grad_()
        gO = torch.rand_like(X)
        X_gpu = X.to("cuda")
        X_gpu = X_gpu.detach().requires_grad_()
        gO_gpu = gO.to("cuda")
        gO_gpu = gO_gpu.detach().requires_grad_()

        grad_cpu_ref = run_model_on_device(fs, X, gO, "cpu", dtype)
        grad_gpu = run_model_on_device(fs, X_gpu, gO_gpu, "cuda", dtype)
        weight_grad_gpu_target = grad_gpu[0].detach().to("cpu")
        bias_grad_gpu_target = grad_gpu[1].detach().to("cpu")

        weight_delta = torch.abs(grad_cpu_ref[0] - weight_grad_gpu_target)
        weight_mismatches = (weight_delta >= eps_weight).nonzero()
        weight_mismatch_pct = len(weight_mismatches) / len(weight_delta) * 100

        bias_delta = torch.abs(grad_cpu_ref[1] - bias_grad_gpu_target)
        bias_mismatches = (bias_delta >= eps_bias).nonzero()
        bias_mismatch_pct = len(bias_mismatches) / len(bias_delta) * 100

        print(
            "Size ({} x {}) mismatch percentage: weight {:3.2f} bias {:3.2f}".format(
                fs, bs, weight_mismatch_pct, bias_mismatch_pct
            )
        )

# Run the correctness tests
run_correctness_test(0.01, 0.01)

# Run the performance tests. We need to run this at global scope because otherwise
# the `ln` and `gO` objects are likely removed by the JIT compiler
results = []
for dtype in (torch.float, torch.half):
    for val in l_inputs:
        bs = val[0]
        fs = val[1]
        ln = torch.nn.LayerNorm((fs,), device="cuda", dtype=dtype)
        X = torch.randn(bs, fs, device="cuda", dtype=dtype, requires_grad=True)
        gO = torch.rand_like(X)
        stmtfwd = "ln(X)"
        stmtfwdbwd = (
            "X.grad=None; ln.zero_grad(set_to_none=True); out = ln(X); out.backward(gO)"
        )
        tfwd = Timer(
            stmt=stmtfwd,
            label="ln",
            sub_label=f"{bs:5}, {fs:5}",
            description=f"fwd, {dtype}",
            globals=globals(),
        )
        tfwdbwd = Timer(
            stmt=stmtfwdbwd,
            label="ln",
            sub_label=f"{bs:5}, {fs:5}",
            description=f"fwdbwd, {dtype}",
            globals=globals(),
        )
        for t in (tfwd, tfwdbwd):
            results.append(t.blocked_autorange())
    print(fs, end="\r")
c = Compare(results)
c.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107287
Approved by: https://github.com/malfet
2023-08-17 19:57:45 +00:00
8a0425fdd6 [export] Remove setter for graph_module (#106651)
Summary: The ExportedProgram should be immutable

Test Plan: CI

Differential Revision: D48086375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106651
Approved by: https://github.com/zhxchen17
2023-08-17 18:38:21 +00:00
2d727c8c3f remove the duplicate method is_private_use1 in class Device (#107198)
In the `Device` class, there are two methods with similar functions called `is_private_use1` and `is_privateuseone`.
ddf36c82b8/c10/core/Device.h (L84-L87)

ddf36c82b8/c10/core/Device.h (L159-L162)
The former is not being utilized and therefore, this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107198
Approved by: https://github.com/bdhirsh
2023-08-17 18:23:29 +00:00
aca3d1433c Estimate Scheduler node runtimes (#106426)
Working on this as a starter task with @Chillee.

This PR adds a method under BaseSchedulerNode to estimate the node's runtime in seconds.

We use a heuristic-based approach, first considering whether the operation is memory-bandwidth-bound or compute-bound:
- memory-bandwidth-bound: we compute the number of bytes that are read/written
- compute-bound: we compute the FLOPs required by the operation

One use case is as a cost model for scheduling: https://github.com/pytorch/pytorch/pull/100762
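A simplified sketch of the heuristic above (the bandwidth and FLOP numbers are illustrative assumptions, not the values Inductor uses):

```python
def estimate_runtime_seconds(bytes_accessed: int, flops: int,
                             mem_bw_bytes_per_s: float = 2.0e12,
                             peak_flops_per_s: float = 300e12) -> float:
    # Roofline-style estimate: a kernel is limited either by memory traffic or by compute.
    memory_time = bytes_accessed / mem_bw_bytes_per_s
    compute_time = flops / peak_flops_per_s
    return max(memory_time, compute_time)

# e.g. an elementwise add over 10M fp32 elements: ~120 MB of traffic, 10 MFLOP
print(estimate_runtime_seconds(bytes_accessed=3 * 10_000_000 * 4, flops=10_000_000))
```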

```
(pytorch-3.10) [14:08:02] ~/local/pytorch (xmfan/estimate_snode_runtime) > python3 test/inductor/test_perf.py -k EstimateSnodeRuntimeTests
[(ExternKernelSchedulerNode(name='buf0'), 400)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000), (SchedulerNode(name='buf1'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26), (SchedulerNode(name='buf1'), 7.187055238190188e-09)]
.[(ExternKernelSchedulerNode(name='buf0'), 3000)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-26)]
.[(ExternKernelSchedulerNode(name='buf0'), 34600)]
[(ExternKernelSchedulerNode(name='buf0'), 3.22687496698039e-24)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 396)]
[(ExternKernelSchedulerNode(name='buf0'), 1.88046326747109e-27)]
.[(ExternKernelSchedulerNode(name='buf0'), 7776176)]
[(ExternKernelSchedulerNode(name='buf0'), 4.63240241413653e-21)]
.[(FusedSchedulerNode(nodes=buf0_buf1), 210)]
[(FusedSchedulerNode(nodes=buf0_buf1), 5.030938666733132e-10)]
.[(ExternKernelSchedulerNode(name='buf0'), 300)]
[(ExternKernelSchedulerNode(name='buf0'), 2.35057908433887e-27)]
.[(SchedulerNode(name='buf0'), 20)]
[(SchedulerNode(name='buf0'), 4.7913701587934585e-11)]
.
----------------------------------------------------------------------
Ran 10 tests in 14.311s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106426
Approved by: https://github.com/Chillee
2023-08-17 17:23:30 +00:00
aa04b0536b Fix inference_mode decorator pass mode as kwarg (#107349)
Fixes https://fb.workplace.com/groups/1405155842844877/permalink/7330520550308347/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107349
Approved by: https://github.com/albanD
ghstack dependencies: #107296
2023-08-17 17:12:31 +00:00
4bfc55ba8b [MPS] Enable forward test for renorm (#106666)
Enabled forward test for renorm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106666
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-17 16:46:06 +00:00
8298720299 Enable Lowering Channels last Conv1x1 when max autotune is set (#107004)
This can lead to a large speedup when max autotune is set, e.g. resnet 2.1x -> 2.5x, particularly in combination with freezing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107004
Approved by: https://github.com/jansel, https://github.com/shunting314, https://github.com/int3
ghstack dependencies: #106911, #106912
2023-08-17 16:05:32 +00:00
f96617fdcd Add deployment environment for docs and upload test stats (#107318)
Many thanks to this discussion comment for explaining why we don't need an environment for the calling workflow but the secret can still be used
https://github.com/orgs/community/discussions/25238#discussioncomment-3247035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107318
Approved by: https://github.com/huydhn, https://github.com/atalman
2023-08-17 15:47:18 +00:00
aa9f6a4335 Fix native_batch_norm_backward returning non-channels_last_3d grad (#107270)
Fix #107199

Checked out https://github.com/pytorch/pytorch/pull/106104, which caught this locally, and verified that 551124f670/torch/testing/_internal/common_modules.py (L2635-L2642), with the `p['device'] == 'cuda'` part shifted to `device_type = 'cuda'`, now succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107270
Approved by: https://github.com/albanD
2023-08-17 14:58:56 +00:00
e108f33299 Update distutils.Version to packaging.version due to the deprecation … (#107207)
Update distutils.Version to packaging.version due to the deprecation warning.

```python
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17136: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17138: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
/root/Git.d/pytorch/pytorch/torch/testing/_internal/common_methods_invocations.py:17140: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  active_if=TEST_SCIPY and LooseVersion(scipy.__version__) < "1.4.0"),
```
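The replacement pattern is roughly the following (assuming the `packaging` package is available, as it is in typical pip/setuptools environments):

```python
# Before (deprecated):
#   from distutils.version import LooseVersion
#   old_scipy = LooseVersion(scipy.__version__) < "1.4.0"
# After:
from packaging import version
import scipy

old_scipy = version.parse(scipy.__version__) < version.parse("1.4.0")
print(old_scipy)
```
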
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107207
Approved by: https://github.com/soulitzer
2023-08-17 11:19:44 +00:00
a98f745c80 Use compiled model in torch.compiler_get_started (#107267)
- The text says `Next, let’s try a real model like resnet50 from the PyTorch` but the code example uses `resnet18`. Fixed the code to use `resnet50` for consistency.
- One of the examples in the TorchDynamo Overview used an uncompiled model; fixed it so it now uses the compiled model.
- Removed an unused import of `_dynamo` in one of the examples.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107267
Approved by: https://github.com/soulitzer
2023-08-17 09:26:54 +00:00
f21b9cb954 Enable mypy check in torch/_inductor/kernel/mm_common.py (#106776)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/kernel/mm_common.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/kernel/mm_common.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106776
Approved by: https://github.com/eellison
2023-08-17 09:19:45 +00:00
11e366943d Fix rst formatting in dynamo/guards-overview doc (#107275)
Fix rst formatting in dynamo/guards-overview doc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107275
Approved by: https://github.com/soulitzer
2023-08-17 09:04:44 +00:00
384e0d104f [vision hash update] update the pinned vision hash (#107342)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107342
Approved by: https://github.com/pytorchbot
2023-08-17 05:58:27 +00:00
29813c61ea enable conv+bn folding for mixed-dtype when bn has post activation (#107142)
For conv+bn+relu6, the joint-graph pass will remove one type of dtype conversion and the graph will look like this:
```
 def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
     convolution: bf16[3, 32, 15, 15] = aten..convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1);  arg6_1 = arg0_1 = None
     # weight upcasting
     convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32);  arg3_1 = None
     convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32);  arg4_1 = None
     ...
     # end of batch norm
      add_1: f32[3, 32, 15, 15] = aten.add.Tensor(mul_2, unsqueeze_7);  mul_2 = unsqueeze_7 = None
     # output downcast
     convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.float32);  add_1 = None
     clamp_min: f32[3, 32, 15, 15] = torch.ops.aten.clamp_min.default(convert_element_type_2, 0.0); convert_element_type_2 = None
     clamp_max: f32[3, 32, 15, 15] = torch.ops.aten.clamp_max.default(clamp_min, 6.0);  clamp_min = None
     convert_element_type_3: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(clamp_max, torch.bfloat16); clamp_max = None
```

the conv+bn folding will fail. This PR moves the joint-graph pass's dtype-conversion removal to run after the conv+bn folding pass.
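
For reference, a minimal sketch (my own reproducer, not code from the PR) of the conv+bn+relu6 pattern the folding pass targets, run in bfloat16 so the mixed-dtype conversions show up:

```python
import torch
import torch.nn as nn

class ConvBNReLU6(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 32, kernel_size=3, stride=2, bias=False)
        self.bn = nn.BatchNorm2d(32)
        self.act = nn.ReLU6()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

model = ConvBNReLU6().eval().to(torch.bfloat16)
compiled = torch.compile(model)
with torch.no_grad():
    out = compiled(torch.randn(3, 3, 32, 32, dtype=torch.bfloat16))
```
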

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107142
Approved by: https://github.com/eellison
2023-08-17 04:17:35 +00:00
f53ecfbcc6 Correctly format original traceback for delayed CUDA error (#107297)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107297
Approved by: https://github.com/albanD
2023-08-17 03:13:31 +00:00
e9af315e02 Fix torch.bucketize docs for "right" (#104474)
The docs correctly (i.e matching actual op behavior) state that

`right = False` means `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]`.

However they previously stated that
`If 'right' is False (default), then the left boundary is closed.`

which contradicts the `boundaries[i-1] < input[m][n]...[l][x] <= boundaries[i]` statement.

This modifies the docs to say `... then the left boundary is OPEN.` and also clarifies that this is the opposite behavior of numpy.digitize.
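
A small illustration of the documented semantics (outputs derived from the rule above):

```python
import torch

boundaries = torch.tensor([1.0, 2.0, 3.0])
x = torch.tensor([1.0, 2.0, 2.5])

# right=False (default): boundaries[i-1] < x <= boundaries[i], left boundary open
print(torch.bucketize(x, boundaries))              # tensor([0, 1, 2])
# right=True: boundaries[i-1] <= x < boundaries[i]
print(torch.bucketize(x, boundaries, right=True))  # tensor([1, 2, 2])
```
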

Fixes #91580
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104474
Approved by: https://github.com/aakhundov, https://github.com/svekars
2023-08-17 03:08:07 +00:00
25d87c8301 torch.ops.aten.*: sort aten ops before jit overloads (#107138)
Summary:
In fbcode, aten and jit ops can get registered in different orders depending on build mode. In dev mode, aten is registered first; in opt mode, jit is registered first.

This causes problems in torch.ops.aten.* calls; these calls use `torch._C._jit_get_operation`, which selects an overload based on the inputs to the call. It searches through the overloads for the op with the given name, and chooses the first one that matches the input types. "First" depends on whether aten or jit ops were registered first - e.g. in `test_both_scalars_cuda` in opt mode, it chooses `add.complex` and returns a complex value.

We also saw this issue in https://github.com/pytorch/pytorch/pull/103576.

This PR sorts the list of overloads first, putting the aten ops first.
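
Conceptually (an illustrative Python sketch only; the real change is in the C++ code behind `torch._C._jit_get_operation`), the fix amounts to a stable sort that puts aten-registered overloads ahead of jit-registered ones:

```python
# Hypothetical data for illustration; names and sources are made up.
overloads = [
    {"name": "add.complex", "source": "jit"},
    {"name": "add.Tensor", "source": "aten"},
]

# Stable sort: aten-registered overloads first, original relative order otherwise kept.
overloads.sort(key=lambda op: op["source"] != "aten")
print([op["name"] for op in overloads])  # ['add.Tensor', 'add.complex']
```
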

Differential Revision: D48304930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107138
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-08-17 03:05:59 +00:00
983fd5ba79 [2D][TP] Enable DDP TP integration with unit test (#106583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106583
Approved by: https://github.com/kumpera, https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #107313
2023-08-17 02:54:17 +00:00
4979a1b8f9 Fix trymerge broken trunk detection when the merge base job was retried (successfully) (#107333)
This fixes a discrepancy bug between Dr.CI and trymerge when detecting broken trunk failures.
 Take https://github.com/pytorch/pytorch/pull/107160 as an example:

* Dr.CI correctly identifies the broken trunk failure
* while trymerge records it as a new failure

The issue is that the merge base [failure](https://github.com/pytorch/pytorch/actions/runs/5833057579/job/15820504498) was flaky.  It was retried successfully and its conclusion went from a failure to a success.  The Rockset query returns all run attempts and while Dr.CI correctly records the failure, trymerge overwrites it with the successful retry.   Thus, the latter saw a new failure.

This change makes trymerge keep the merge base failure similar to what Dr.CI does https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/drci/drci.ts#L158-L168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107333
Approved by: https://github.com/clee2000
2023-08-17 02:09:31 +00:00
5b9b816b17 WAR by avoid querying device before env mutation (#107301)
We should probably fix https://github.com/pytorch/pytorch/issues/107300
properly but this works around the problem

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107301
Approved by: https://github.com/bdhirsh, https://github.com/H-Huang, https://github.com/albanD
2023-08-17 00:31:16 +00:00
b234b94760 Add in-place _foreach_copy (#107226)
Fixes #107162
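
A hedged usage sketch, assuming the new private binding takes a list of destination tensors and a list of source tensors:

```python
import torch

dsts = [torch.zeros(3), torch.zeros(3)]
srcs = [torch.ones(3), torch.full((3,), 2.0)]

# In-place: each dsts[i] is overwritten with srcs[i] in a single foreach call.
torch._foreach_copy_(dsts, srcs)
print(dsts)  # [tensor([1., 1., 1.]), tensor([2., 2., 2.])]
```
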

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107226
Approved by: https://github.com/janeyx99
2023-08-17 00:11:18 +00:00
8507b22fea propagate _GLIBCXX_USE_CXX11_ABI to NVCC (#107209)
Fixes #107161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107209
Approved by: https://github.com/malfet
2023-08-16 22:41:52 +00:00
a4229690e3 Add Some Checks about dim (#107223)
Fixes #106769

As mentioned in [GRUCell](https://pytorch.org/docs/stable/generated/torch.nn.GRUCell.html#grucell), `hidden` should have the same dimension as `input`, and the dimension should be either `1D` or `2D`.

Other aspects are already verified in `C++`, such as that the batch sizes of `input` and `hidden` match, `input`'s dim 1 matches `input_size`, `hidden`'s dim 1 matches `hidden_size`, etc.
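
A short illustration of the contract being checked (my own example, not from the PR):

```python
import torch
import torch.nn as nn

cell = nn.GRUCell(input_size=10, hidden_size=20)

x = torch.randn(3, 10)   # 2-D (batched) input: (batch, input_size)
hx = torch.randn(3, 20)  # hidden must also be 2-D: (batch, hidden_size)
hx = cell(x, hx)         # ok

# Passing, e.g., a 1-D hidden state with a 2-D input should now raise a clear
# dimension-mismatch error instead of failing deep inside the implementation.
```
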
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107223
Approved by: https://github.com/albanD
2023-08-16 22:03:31 +00:00
f3b0d83fe3 [EZ][TP] Refactor FSDP 2D integration extension code so that it can re-used (#107313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107313
Approved by: https://github.com/wz337
2023-08-16 22:01:17 +00:00
b08b0c915f [easy] Fix docs for sd calculation in BatchNorm1d/3d for consistency with BatchNorm2d (#107308)
Fixes https://github.com/pytorch/pytorch/issues/100048

BatchNorm2d docs were updated in https://github.com/pytorch/pytorch/pull/97974. There have been a number of issues filed due to confusion about this so I think we should fix before branch cut
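
For context, a small sketch (my own illustration, not from the PR) of what the clarified docs describe: normalization uses the biased variance estimator.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 4)  # (batch, features) for BatchNorm1d
bn = nn.BatchNorm1d(4, affine=False, track_running_stats=False)

# Normalization uses the biased variance (unbiased=False), i.e. divide by N.
expected = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)
print(torch.allclose(bn(x), expected, atol=1e-5))  # True
```
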

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107308
Approved by: https://github.com/albanD
2023-08-16 21:51:02 +00:00
d3c4ec767b [quant][pt2e] Fix handling for SharedQuantizationSpec (#106922)
Summary:
Previously if we have:
```
conv1 -> cat
conv2  /
```
and we configure the outputs of conv1/conv2 to be int8 quantized, and cat to also be int8 quantized with shared inputs, it will not produce the expected result (the inputs of cat will not be shared).

The problem is that some checks were missing when inserting observers for the inputs of cat.

This PR fixes the problem.

Fixes: https://github.com/pytorch/pytorch/issues/106760
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_shared_qspec

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106922
Approved by: https://github.com/kimishpatel
2023-08-16 21:16:45 +00:00
6bfb4f7c4b [CUDA][Linalg} Patch crash of linalg.eigh when input matrix is ill-conditioned, in some cusolver version (#107082)
Related: https://github.com/pytorch/pytorch/issues/94772, https://github.com/pytorch/pytorch/issues/105359

I can locally reproduce this crash with pytorch 2.0.1 stable pip binary. The test already passes with the latest cuda 12.2 release.

Re: https://github.com/pytorch/pytorch/issues/94772#issuecomment-1658909998
> From discussion in triage review:

- [x] we should add a test to prevent regressions
- [x] properly document support wrt different CUDA versions
- [x] possibly add support using MAGMA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107082
Approved by: https://github.com/lezcano
2023-08-16 21:15:15 +00:00
0434a2c7c8 [BE][PG NCCL] Improve input mismatch error msg (#107281)
Test Plan: CI

Differential Revision: D48363238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107281
Approved by: https://github.com/awgu, https://github.com/H-Huang, https://github.com/fegin
2023-08-16 20:22:22 +00:00
ba6fcc4eae [caffe2][SDT] Check whether TORCH_DISABLE_SDT macro is defined before referencing it (#107149)
Summary:
Some jobs in the next diff in stack (D48229150) fail with the following message:

```
stderr: In file included from xplat/caffe2/c10/cuda/CUDACachingAllocator.cpp:9:
xplat/caffe2/c10/util/static_tracepoint.h:4:6: error: 'TORCH_DISABLE_SDT' is not defined, evaluates to 0 [-Werror,-Wundef]
    !TORCH_DISABLE_SDT
```

When porting USDT macros to PyTorch in D47159249, I must have not hit a codepath which treated warnings as errors during testing.

This diff fixes the issue by first checking whether the `TORCH_DISABLE_SDT` macro is defined before trying to access it in the `static_tracepoint.h` header.

Test Plan:
Similar to D47159249, tested the following macro on test scripts with `libbpf` USDTs:
* `CAFFE_DISABLE_SDT`

Reviewed By: chaekit

Differential Revision: D48251736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107149
Approved by: https://github.com/chaekit
2023-08-16 19:52:12 +00:00
f16be5e0d4 Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file rating generated by the test infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and put the tests in order of that sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.

I'll change the metric name before I merge; I want to quarantine my testing changes from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-16 18:23:09 +00:00
884c03d240 Improve activation checkpoint docs wording (#107296)
This helps eliminate some confusion around "intermediates" and whether module outputs are handled as well. See this internal post https://fb.workplace.com/groups/1405155842844877/permalink/7327505913943144/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107296
Approved by: https://github.com/albanD
2023-08-16 17:36:52 +00:00
e9ae820279 Unfuse bias add before pointwise ops (#106912)
I get a 2% inference speedup in HF with this PR. I checked to see if there are any models where unfusing was slower than the cublas gelu fusion, and I did not see any, which was surprising to me. Sorry for the cublas-activation API churn 😬

Kicking off another run in cublas 12, it's possible that the results have changed since.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106912
Approved by: https://github.com/jansel
ghstack dependencies: #106911
2023-08-16 17:22:24 +00:00
c88775b937 Make Nd tensors hit fused addmm pass (#106911)
Replace https://github.com/pytorch/pytorch/pull/106433 since I had a bad cla commit.

Speeds up eager convnext bfloat16 inference by 35%, and eager timm bfloat16 inference on average by 0.5%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106911
Approved by: https://github.com/ezyang
2023-08-16 17:12:11 +00:00
1c6f39098f [export] avoid calling the callable during export. (#107249)
We avoid calling the user's function f again in export. It's error-prone (due to side effects in f) and time-consuming. Instead, we directly manipulate the out_spec of the graph module to make sure the graph module outputs a tuple so that aot_export is happy.

The out_spec of gm_torch_level is computed from the dynamo-traced result and is guaranteed to describe the same output as eagerly running the user's original callable f.

Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107249
Approved by: https://github.com/tugsbayasgalan
2023-08-16 17:08:30 +00:00
d0e50d9094 Move overloaded_args from FunctionSignature to PythonArgs (#106983)
This moves the `overloaded_args` field from FunctionSignature to PythonArgs. FunctionSignature is shared by all calls and should be immutable. PythonArgs contains the parsing results for a single call to the PyTorch API.

I did not measure a difference in performance in the "overrides_benchmark", although I expect there to be a bit more work in the common case. Note that the noise factor for the benchmark is much larger than the differences reported below:

Before:
```
Type tensor had a minimum time of 2.3615360260009766 us and a standard deviation of 0.7833134150132537 us.
Type SubTensor had a minimum time of 10.473251342773438 us and a standard deviation of 0.1973132457351312 us.
Type WithTorchFunction had a minimum time of 5.484819412231445 us and a standard deviation of 0.13305981701705605 us.
Type SubWithTorchFunction had a minimum time of 11.098146438598633 us and a standard deviation of 0.15598918253090233 us.
```
After:
```
Type tensor had a minimum time of 2.2134780883789062 us and a standard deviation of 0.802064489107579 us.
Type SubTensor had a minimum time of 10.625839233398438 us and a standard deviation of 0.15155907021835446 us.
Type WithTorchFunction had a minimum time of 5.520820617675781 us and a standard deviation of 0.23115111980587244 us.
Type SubWithTorchFunction had a minimum time of 11.227846145629883 us and a standard deviation of 0.23032321769278497 us.
```

Fixes #106974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106983
Approved by: https://github.com/zou3519, https://github.com/ezyang, https://github.com/albanD
2023-08-16 15:59:26 +00:00
1f6c1d9beb Fix inductor torch.cat edge case for empty tensor (#107193)
Align with eager behavior on this edge case: essentially, the empty
tensor is ignored by the operator (see the sketch below).
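
A minimal illustration of the eager behavior being matched (assuming current eager semantics for 1-D empty tensors in `torch.cat`):

```python
import torch

x = torch.ones(2, 3)
empty = torch.tensor([])  # shape (0,), different rank than x

print(torch.cat([x, empty]).shape)  # torch.Size([2, 3]): the empty tensor is skipped

compiled_cat = torch.compile(lambda a, b: torch.cat([a, b]))
print(compiled_cat(x, empty).shape)  # should match eager after this fix
```
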

Fixes #107118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107193
Approved by: https://github.com/wanchaol, https://github.com/eellison, https://github.com/peterbell10
2023-08-16 15:30:44 +00:00
7cb2a6bfab [dynamo][fallback] Fallback to eager when backend fails with fake tensor exceptions (#107179)
Example (I think we should fix this test case for real, but using this to test the ux around fallbacks)

~~~
@torch.compile(backend="aot_eager")
def fn(x):
    return torch.sum(x, dim=1).tolist()

print(fn(torch.rand(4, 4).to(dtype=torch.int64)))
~~~

Running the script as is

~~~
[2023-08-14 14:53:48,863] torch._dynamo.output_graph: [WARNING] Backend compiler failed with a fake tensor exception at
[2023-08-14 14:53:48,863] torch._dynamo.output_graph: [WARNING]   File "/data/users/anijain/pytorch/examples/spl.py", line 5, in fn
[2023-08-14 14:53:48,863] torch._dynamo.output_graph: [WARNING]     return torch.sum(x, dim=1).tolist()
[2023-08-14 14:53:48,863] torch._dynamo.output_graph: [WARNING] Falling back to eager for this frame. Please use TORCH_LOGS=graph_breaks to see the full stack trace.
[0, 0, 0, 0]
~~~

Running the script with TORCH_LOGS="graph_breaks"

~~~
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] WON'T CONVERT fn /data/users/anijain/pytorch/examples/spl.py line 3
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] ========== TorchDynamo Stack Trace ==========
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] Traceback (most recent call last):
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_dynamo/output_graph.py", line 995, in call_user_compiler
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     compiled_fn = compiler_fn(gm, self.example_inputs())
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     compiled_gm = compiler_fn(gm, example_inputs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/__init__.py", line 1586, in __call__
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return self.compiler_fn(model_, inputs_, **self.kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_dynamo/backends/common.py", line 55, in compiler_fn
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     cg = aot_module_simplified(gm, example_inputs, **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_functorch/aot_autograd.py", line 3795, in aot_module_simplified
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     compiled_fn = create_aot_dispatcher_function(
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_dynamo/utils.py", line 194, in time_wrapper
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     r = func(*args, **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_functorch/aot_autograd.py", line 3283, in create_aot_dispatcher_function
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     fw_metadata = run_functionalized_fw_and_collect_metadata(
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_functorch/aot_autograd.py", line 757, in inner
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     flat_f_outs = f(*flat_f_args)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_functorch/aot_autograd.py", line 3400, in functional_call
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     out = Interpreter(mod).run(*args[params_len:], **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/fx/interpreter.py", line 138, in run
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     self.env[node] = self.run_node(node)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/fx/interpreter.py", line 195, in run_node
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return getattr(self, n.op)(n.target, args, kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/fx/interpreter.py", line 289, in call_method
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return getattr(self_obj, target)(*args_tail, **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/utils/_stats.py", line 20, in wrapper
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return fn(*args, **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_subclasses/fake_tensor.py", line 1233, in __torch_dispatch__
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return self.dispatch(func, types, args, kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_subclasses/fake_tensor.py", line 1470, in dispatch
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     op_impl_out = op_impl(self, func, *args, **kwargs)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/torch/_subclasses/fake_tensor.py", line 501, in local_scalar_dense
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     raise DataDependentOutputException(func)
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] torch._subclasses.fake_tensor.DataDependentOutputException: aten._local_scalar_dense.default
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] While executing %item : [num_users=1] = call_method[target=item](args = (%getitem,), kwargs = {})
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG] Original traceback:
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]   File "/data/users/anijain/pytorch/examples/spl.py", line 5, in fn
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]     return torch.sum(x, dim=1).tolist()
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]
[2023-08-14 14:54:15,689] torch._dynamo.output_graph.__graph_breaks: [DEBUG]
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107179
Approved by: https://github.com/ezyang
2023-08-16 14:57:42 +00:00
3577ae3e53 Fix TestDistBackendWithSpawn.test_backend_group and test_backend_full_group (#107231)
Fixes https://github.com/pytorch/pytorch/issues/107078 and allows tests to be run with 2 GPUs only.

testing command:
`BACKEND=gloo WORLD_SIZE=2 pytest test/distributed/test_distributed_spawn.py -vs -k test_backend_group`
`BACKEND=nccl WORLD_SIZE=2 pytest test/distributed/test_distributed_spawn.py -vs -k test_backend_full_group`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107231
Approved by: https://github.com/rohan-varma
2023-08-16 12:01:09 +00:00
ddba7a5a55 Expose torch.export() API (#106904)
Other class definitions and utilities will be moved in subsequent PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106904
Approved by: https://github.com/avikchaudhuri
2023-08-16 10:47:26 +00:00
528a2c0aa9 Fix bug: not creating empty tensor with correct sizes and device. (#106734)
Summary:
logical_and and logical_and_ reuse the implementation of logical_and_out, but `comparison_op` doesn't create an empty tensor with the correct sizes and device type.
```
Tensor& logical_and_out(const Tensor& self, const Tensor& other, Tensor& result) { return comparison_op_out(result, self, other, logical_and_stub); }
Tensor logical_and(const Tensor& self, const Tensor& other) { return comparison_op(self, other, static_cast<OutFunc>(at::logical_and_out)); }
Tensor& logical_and_(Tensor& self, const Tensor& other) { return comparison_op_(self, other, static_cast<OutFunc>(at::logical_and_out)); }
```

Test Plan: CI tests.

Differential Revision: D48134169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106734
Approved by: https://github.com/jackm321
2023-08-16 09:48:35 +00:00
4de5e1775a Improved log1p implementation for complex inputs (#107100)
This PR improves the implementation of `torch.log1p` for complex inputs as mentioned in issue #107022. The new implementation is based on the insights provided in https://github.com/numpy/numpy/pull/22611#issuecomment-1667945354.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107100
Approved by: https://github.com/lezcano
2023-08-16 07:19:11 +00:00
35cca799ff [vision hash update] update the pinned vision hash (#107272)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107272
Approved by: https://github.com/pytorchbot
2023-08-16 04:47:59 +00:00
e0d6072f69 Add API to mark input tensors static for cudagraphs (#107154)
Adds an API to mark a tensor as a static input.
To make this trigger recompiles properly, I'll need to update the tensor-match checks to also check for this new attribute.

An additional concern is memory: the tensors will be kept alive, but this is the current behavior for nn modules and parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107154
Approved by: https://github.com/eellison
2023-08-16 04:38:19 +00:00
9921b48558 Extend Inductor to support the third-party backend (#106874)
## Summary

This is a re-land PR for https://github.com/pytorch/pytorch/pull/100706 that addresses the compilation latency performance regression.

## Root Cause

Regarding the C++/OpenMP backend, `codecache.pick_vec_isa()`, which checks the vectorization ISA, is a time-consuming, one-shot operation. It makes importing the `codegen.cpp` package slower because the package's `LoopLevel` is decorated with `@dataclasses.dataclass`, and the decorator invokes `codecache.pick_vec_isa()` to initialize the `simd_nelements` field of `LoopLevel`.
c14cf312c9/torch/_inductor/codegen/cpp.py (L2883C53-L2883C53)

The Triton backend does not need to touch this at all, but we'd prefer to unify the code. Therefore, the new design simultaneously registers `CppScheduling` for CPU and `TritonScheduling` for Triton regardless of whether the current backend is Triton. This brings additional overhead to the Triton backend.

```python
def init_backend_registration(self):
    if get_scheduling_for_device("cpu") is None:
        from .codegen.cpp import CppScheduling

        register_backend_for_device("cpu", CppScheduling, WrapperCodeGen)

    if get_scheduling_for_device("cuda") is None:
        from .codegen.triton import TritonScheduling

        register_backend_for_device("cuda", TritonScheduling, WrapperCodeGen)
```

## Solution

To resolve the compilation latency regression for the Triton backend, we changed `LoopLevel` a little bit ([new code changes](https://github.com/pytorch/pytorch/pull/106874/files#diff-5ab7b0235e2076a5fc6629ba0b109208940f5b94f5c13babc3e0f87cf4fcec82R2893-R2904)) by moving the `simd_nelements` initialization into `__post_init__`, which brings the compilation performance back.
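
A generic sketch of the pattern (not the actual inductor code): move the expensive call out of the field default and into `__post_init__`, so it only runs when an instance is created rather than at import time.

```python
import dataclasses

def expensive_isa_check() -> int:
    # Stand-in for codecache.pick_vec_isa(); imagine this takes a long time.
    return 16

@dataclasses.dataclass
class LoopLevel:
    var: str = ""
    simd_nelements: int = 0

    def __post_init__(self):
        # Only computed when a LoopLevel is actually instantiated,
        # not when the module defining the class is imported.
        if self.simd_nelements == 0:
            self.simd_nelements = expensive_isa_check()
```
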

## Compilation Latency Performance Result
We ran a single model benchmark and reproduced the compilation regression:

- Run `python benchmarks/dynamo/torchbench.py -dcuda --training --performance --inductor --only hf_Bart`

- W/ PR #100706, the compilation latency is about **57~58**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.556712,109.676554,57.055242,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.646658,109.621747,57.909817,0.936330,5.760698,6.152422,642,1,8,7
```

- W/O PR #100706, the compilation latency is about **46~47**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.599065,108.702480,47.490346,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.588419,108.431411,46.983041,0.936330,5.760698,6.152422,642,1,8,7
```

This PR fixed the compilation performance regression.

- W/ this PR #106874, the compilation latency is about **47~48**
```
dev,name,batch_size,speedup,abs_latency,compilation_latency,compression_ratio,eager_peak_mem,dynamo_peak_mem,calls_captured,unique_graphs,graph_breaks,unique_graph_breaks
cuda,hf_Bart,4,1.586261,108.149467,47.481058,0.936330,5.760698,6.152422,642,1,8,7
cuda,hf_Bart,4,1.758915,108.613899,47.925633,0.936330,5.760698,6.152422,642,1,8,7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106874
Approved by: https://github.com/jansel
2023-08-16 04:11:36 +00:00
6c0bba3daf Revert "Use cpuinfo to determine c10::ThreadPool thread number (#107010)"
This reverts commit ad0476540dcaf07aa6e3639f6c60ee820d5f3f99.

Reverted https://github.com/pytorch/pytorch/pull/107010 on behalf of https://github.com/izaitsevfb due to Breaks internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/107010#issuecomment-1679866821))
2023-08-16 02:20:31 +00:00
1af324b560 Revert "Introduce CUDA-only _scaled_mm op (#106844)"
This reverts commit 9440a8cbec52ce5c2eb9b95b4a8d9f16055d611d.

Reverted https://github.com/pytorch/pytorch/pull/106844 on behalf of https://github.com/izaitsevfb due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/106844#issuecomment-1679858327))
2023-08-16 02:05:29 +00:00
22f5889753 [Dynamo, ONNX] Replace onnxrt backend with new backend from ONNXRuntime team (#106929)
In https://github.com/pytorch/pytorch/pull/106589, a new ONNXRuntime-based Dynamo backend is introduced. As mentioned in that PR, we hope to replace the legacy `onnxrt` with that new backend. This PR removes the legacy `onnxrt` backend and registers the new backend under the same name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106929
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao, https://github.com/abock, https://github.com/msaroufim, https://github.com/jansel
2023-08-15 22:50:46 +00:00
d290511ecd [gtest-static-listing] Enable for cc_test (#107186)
Reviewed By: Nekitosss

Differential Revision: D48323659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107186
Approved by: https://github.com/jeanschmidt
2023-08-15 22:31:32 +00:00
19db570cd9 [ROCm] Add miopen determinism support for convolutions (#107028)
With torchvision installed, many of the test_distributed_spawn tests failed due to divergence between model runs. To resolve this, we are adding the MIOPEN_CONVOLUTION_ATTRIB_DETERMINISTIC attribute to support deterministic convolutions on ROCm.

This means examples such as https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/distributed/distributed_test.py#L4777, which use the `deterministic` flag of `torch.backends.cudnn.flags`, will behave correctly on ROCm:
```
with torch.backends.cudnn.flags(
      enabled=True, deterministic=True, benchmark=False
):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107028
Approved by: https://github.com/jeffdaily, https://github.com/kit1980
2023-08-15 22:18:32 +00:00
bc053070f8 Mark test_gradient_extreme_cases as slow for inductor (#107189)
test_gradient_extreme_cases_* takes ~5 minutes on the inductor sm86 shard, and possibly even longer on the inductor workflow since it's timing out there right now (I'm not sure what the difference between the two is). The automatic slow-test detection also sometimes fails to catch it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107189
Approved by: https://github.com/ZainRizvi
2023-08-15 22:03:00 +00:00
f6a9c15421 [FSDP][state_dict] Make optim_state_dict_to_load work with use_orig_param=False + NO_SHARD (#107185)
Summary: As title

Test Plan: CI

Reviewed By: wz337

Differential Revision: D48329724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107185
Approved by: https://github.com/fegin
2023-08-15 21:42:41 +00:00
f76250f6e3 [ONNX] Relax not exist assertion for 'register_pytree_node' (#107245)
This avoids conflicting with potential existing workarounds or solutions outside of the exporter.
The latest huggingface/transformers main (>4.31) patches PyTorch PyTree with support for the `ModelOutput` class.
`_PyTreeExtensionContext` is kept to support prior versions of transformers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107245
Approved by: https://github.com/titaiwangms
ghstack dependencies: #106741, #107158, #107165
2023-08-15 21:01:17 +00:00
d8a71a6633 [ONNX] Set 'Generic[Diagnostic]' as base class for 'DiagnosticContext' (#107165)
Allows overriding the `Diagnostic` type for DiagnosticContext and enable type checking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107165
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
ghstack dependencies: #106741, #107158
2023-08-15 21:01:17 +00:00
c71828b097 Lift non-FakeTensor restriction for compile (#107042)
Currently, we have the assertion that dynamo won't accept FakeTensor input unless we're exporting. This PR tries to remove this restriction to finish https://github.com/pytorch/pytorch/pull/105679.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107042
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-08-15 20:58:56 +00:00
22cade56ba Revert "[Reland] Upgrade NVTX to NVTX3 (#97582)"
This reverts commit 5bbfb96203370f73b4cd28e6ac766a26debce3df.

Reverted https://github.com/pytorch/pytorch/pull/97582 on behalf of https://github.com/izaitsevfb due to Breaks meta RL builds ([comment](https://github.com/pytorch/pytorch/pull/97582#issuecomment-1679568525))
2023-08-15 20:55:12 +00:00
87cd10bc7b Add basic TD framework (#106997)
Adds a new structure to house all heuristics we use for Target Determination and Test Reordering.  I'm keeping it somewhat minimal for now, to let it evolve more easily as we try new things.

It currently does nothing. The 2nd PR in the stack ports the existing heuristics to actually use this new framework.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106997
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-08-15 20:54:54 +00:00
b860c8c5b8 Revert "ExportedProgram.transform now updates graph_signature automatically (#107080)"
This reverts commit 8c9b2fe8f097cd4b32000e5124232a0047d92234.

Reverted https://github.com/pytorch/pytorch/pull/107080 on behalf of https://github.com/izaitsevfb due to Breaks executorch tests, see D48333170 ([comment](https://github.com/pytorch/pytorch/pull/107080#issuecomment-1679588292))
2023-08-15 20:47:35 +00:00
10ce16bebb Specify if mismatch is input or output in export (#107145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107145
Approved by: https://github.com/suo, https://github.com/gmagogsfm
2023-08-15 20:34:25 +00:00
5673c0874c Use expect_true to make split with unbacked sizes work. (#106788)
This pattern shows up in torchrec KeyedJaggedTensor.  Most
of the change in this PR is mechanical: whenever we failed
an unbacked symint test due to just error checking, replace the
conditional with something that calls expect_true (e.g.,
torch._check or TORCH_SYM_CHECK).

Some of the changes are a bit more nuanced, I've commented on the PR
accordingly.
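
A hedged sketch of the mechanical pattern (hypothetical helper, not code from the PR): replace a trace-time branch on an unbacked SymInt with `torch._check`, which records a runtime assert instead of guarding.

```python
import torch

def narrow_nonnegative(x: torch.Tensor, length):
    # Before: `if length < 0: raise ValueError(...)` could fail to trace when
    # `length` is an unbacked SymInt; torch._check defers it to a runtime assert.
    torch._check(length >= 0, lambda: f"expected non-negative length, got {length}")
    return x.narrow(0, 0, length)
```
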

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106788
Approved by: https://github.com/lezcano
ghstack dependencies: #106720
2023-08-15 20:31:30 +00:00
e1ee10e6f5 Add expect_true for irrefutable guards (#106720)
Here's what it does from the comments:

```
Assume that a boolean is true for the purposes of subsequent symbolic
reasoning.  This will keep track of corresponding runtime checks to verify
that the result is upheld: either as a regular guard, or as a special set
of asserts which are triggered when an unbacked SymInt is allocated.

DO NOT use this function for these cases:

 - This is inappropriate for "branching" conditions (where both
   true and false result in valid programs).  We will always assume
   the condition evaluates true, and so it will never be possible
   to trace the false condition when you use it.  For true branching
   on unbacked SymInts, you must use torch.cond.

 - This is inappropriate for situations where you know some other system
   invariant guarantees that this property holds, since you don't
   really need to insert a runtime check in that case.  Use something
   like constrain_range in that case.

This API has a hitch.  To avoid having to reimplement error reporting
capabilities, this function CAN return False.  The invariant is that
the surrounding code must raise an error when this function returns
False.  This is quite low level, so we recommend using other functions
like check() which enforce this in a more intuitive way.

By the way, this name is a nod to the __builtin_expect likely macro,
which is used similarly (but unlike __builtin_expect, you MUST fail
in the unlikely branch.)
```

We don't do anything with this right now, except use it to discharge regular guards.  Follow up PRs to (1) use it at important error checking sites, (2) actually ensure the runtime asserts make there way into the exported IR / inductor generated code.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106720
Approved by: https://github.com/ysiraichi, https://github.com/voznesenskym
2023-08-15 18:42:22 +00:00
388ba7e5ae [ptd] make multithreaded pg wait for readiness before the 1st collective (#106954)
Summary:
This used to not be a problem because in c10d collective init, a store-based barrier would be applied.

This recently got changed in https://github.com/pytorch/pytorch/pull/103033
where the barrier is not by default applied.

For normal PGs like gloo/nccl, this is not a problem as the rendezvous process is implicitly a barrier anyway.

But for the threaded pg, without the store-based barrier this would lead to a race condition, as the local pg does not wait for the world to be ready before starting collectives.

This fixes the issue by just doing a store-based barrier for each pg created.
The CV attempt wouldn't work since it would still rely on class-level variables, which would break in the device mesh case. See the inline comment for details.

Differential Revision: D48220125

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106954
Approved by: https://github.com/wanchaol, https://github.com/H-Huang, https://github.com/XilunWu
2023-08-15 18:40:49 +00:00
e9cb7179cb [ONNX] Fix diagnostic log and add unittest (#107158)
As title. Previously the message was formatted but mistakenly not logged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107158
Approved by: https://github.com/titaiwangms
ghstack dependencies: #106741
2023-08-15 17:46:15 +00:00
19a76290d8 [ONNX] Public diagnostic options for 'dynamo_export' (#106741)
Generate diagnostic reports to monitor the internal stages of the export process. This tool aids in unblocking model exports and debugging the exporter.

#### Settings

~~1. Choose if you want to produce a .sarif file and specify its location.~~
1. Updated: saving .sarif file should be done by `export_output.save_sarif_log(dst)`, similar to saving exported onnx model `export_output.save(model_dst)`.
2. Customize diagnostic options:
    - Set the desired verbosity for diagnostics.
    - Treat warnings as errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106741
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/malfet
2023-08-15 17:46:15 +00:00
45128ab67c [Reland] Add OnCompletion Hook to ProcessGroup (#106988) (#107233)
This allows infra/trainers to get detailed stats about communication
efficiencies without knowing anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107233
Approved by: https://github.com/kumpera
2023-08-15 17:35:14 +00:00
d9dc4b2b4c [BE] Add missing override to remove build warning spam (#107191)
```
In file included from /local/pytorch3/test/cpp/api/optim.cpp:7:
local/pytorch3/test/cpp/api/support.h:44:3: warning: '~WarningCapture' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  ~WarningCapture() {
  ^
local/pytorch3/c10/util/Exception.h:167:11: note: overridden virtual function is here
  virtual ~WarningHandler() = default;
  ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107191
Approved by: https://github.com/janeyx99
2023-08-15 17:32:34 +00:00
8c44cfef5e Add some support for detecting false aliasing in AOTAutograd (#106461)
This is a partial fix for https://github.com/pytorch/pytorch/issues/106457. In the examples with the shampoo optimizer that i ran, they were enough to remove the parameter aliasing in shampoo.

I added some new logic for detecting if two inputs have overlapping memory in specific cases: if they're both 2D tensors with stride 1. In that case (the case for shampoo), I try to compute a bunch of contiguous intervals on the two tensors, and check if any of the intervals overlap. In theory this is slow, since if our two tensors are e.g. of size (256, N), we'll need to create 256 intervals to check for overlap on. This seems... probably fine, since I think we do more egregious things in the compile stack to cause slowness. Open to suggestions though!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106461
Approved by: https://github.com/albanD
ghstack dependencies: #106460
2023-08-15 17:27:37 +00:00
517ba2add7 AOTAutograd: allow input mutations on inputs that are non-contiguous (#106460)
Fixes https://github.com/pytorch/pytorch/issues/106456

I also had to update the logic in functionalization's resize_() kernel to convey to AOTAutograd that resize_() is a metadata mutation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106460
Approved by: https://github.com/ezyang
2023-08-15 17:27:37 +00:00
cyy
ad0476540d Use cpuinfo to determine c10::ThreadPool thread number (#107010)
This PR prefers "logical processor number" (the cpu cores shown in htop) returned by cpuinfo for determining c10 thread number. If that fails, it uses hardware_concurrency exactly.
The motivation is that in a x86 host with 64 cores and Hyper-Threading disabled, the current behavior uses 32 threads, resulting half of cores being idle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107010
Approved by: https://github.com/ezyang
2023-08-15 17:26:24 +00:00
7fb543e36d [ROCm] Enable hipsolver unit tests for batched linalg drivers (#106620)
This is a follow up to https://github.com/pytorch/pytorch/pull/105881 and replaces https://github.com/pytorch/pytorch/pull/103203

The batched linalg drivers from 103203 were brought in as part of the first PR. This change enables the ROCm unit tests that were enabled as a result of that change. Along with a fix to prioritize hipsolver over magma when the preferred linalg backend is set to `default`
The following 16 unit tests will be enabled for rocm in this change:
- test_inverse_many_batches_cuda*
- test_inverse_errors_large_cuda*
- test_linalg_solve_triangular_large_cuda*
- test_lu_solve_batched_many_batches_cuda*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106620
Approved by: https://github.com/lezcano
2023-08-15 15:54:27 +00:00
ed0782125a [gtest-static-listing] Make decision for caffe2 targets (#107129)
Summary: title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107129
Approved by: https://github.com/atalman
2023-08-15 12:41:47 +00:00
fd214aa8be Revert "Add OnCompletion Hook to ProcessGroup (#106988)"
This reverts commit ba1da47e8fa95ca0dd8b2d63430f7eb54fdbbccb.

Reverted https://github.com/pytorch/pytorch/pull/106988 on behalf of https://github.com/huydhn due to Sorry for reverting you change, but it is failing Windows build with some linker error.  The Windows failures on PR looks legit ([comment](https://github.com/pytorch/pytorch/pull/106988#issuecomment-1678580899))
2023-08-15 08:24:33 +00:00
2abcfc40b0 Enable torchgen for MTIA dispatch key (#107046)
Summary: As title.

Test Plan: See diff D48258693

Differential Revision: D48273743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107046
Approved by: https://github.com/albanD
2023-08-15 07:56:18 +00:00
935f2dd084 adding fused uint4x2_mixed_mm to inductor (#106516)
Summary: this is needed for int4 weight-only quantization. We're
matching on the specific unpack operation that unpacks the uint4x2 into
int4s so we can have a fused kernel for it. Note: even if the user
isn't specifically doing this, the two operations are mathematically
equivalent, so it won't cause issues (for some reason int8 bitwise logic
in triton and pytorch doesn't match, so that's the only exception). Ideally,
at some point, full prologue fusion for the mm arguments would be able to
handle this chain, but until then this type of kernel is needed.
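
For context, an illustrative sketch (my own, not the matched pattern verbatim) of the uint4x2 → int4 unpack: each uint8 element packs two 4-bit values in its low and high nibbles.

```python
import torch

packed = torch.randint(0, 256, (4, 8), dtype=torch.uint8)  # uint4x2 storage
low = packed & 0xF    # first 4-bit value of each pair
high = packed >> 4    # second 4-bit value of each pair
unpacked = torch.stack([low, high], dim=-1).view(4, 16)
```
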

Test Plan:

python test/inductor/test_pattern_matcher.py -k "uint4x2"
python test/inductor/test_torchinductor.py -k "uint4x2"

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106516
Approved by: https://github.com/jansel
2023-08-15 06:58:36 +00:00
eqy
4add06eb5c [CUDNN][CUDNN V8 API] LRU Cache for cuDNN frontend ExecutionPlan (#104369)
Adds LRU functionality to the cuDNN frontend `ExecutionPlan` cache to address high memory usage as observed in #98688, #104122 via the `TORCH_CUDNN_V8_LRU_CACHE_LIMIT` environment variable. By default this limit is set to 10000, which corresponds to about 2GiB of host memory usage as observed empirically. Note that we are still following up with cuDNN to see if the size of an `ExecutionPlan` can be reduced, as it appears to currently be around 200KiB (!!) for a single plan.

This implementation is a bit heavy on the internal asserts for now as it's a bit difficult to directly test the state of the cache without instrumenting it explicitly in tests. Once we are confident that the implementation is stable, we can remove the asserts.

CC @malfet who @ptrblck mentioned may have also been looking into this

CC @colesbury

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104369
Approved by: https://github.com/malfet
2023-08-15 05:52:49 +00:00
cyy
b7431d815f [submodule] Pin fmtlib/fmt to 10.1.0 (#106672)
fmt 10.1.0 fixes a bug in format_string_checker initialisation order, which is important to our improved clang-tidy checks (#103058). This PR upgrades third_party fmt to 10.1.0; in the meantime, kineto is also upgraded to avoid fmt errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106672
Approved by: https://github.com/Skylion007
2023-08-15 05:47:04 +00:00
20c5add133 [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime-assert on the unbacked symint's value range, which would always be [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-15 05:41:43 +00:00
d6c120d7f9 [TP][DTensor Perf]Fix DTensor Spec hash (#107181)
https://github.com/pytorch/pytorch/pull/106524 got merged so fast that we didn't realize we should hash both stride and dtype in DTensorSpec. This is a forward fix.

Here is an analysis of why using just the shape is not enough:
1. We use the hash value for the sharding propagation cache, and the output sharding contains the stride and size of the output DTensor. If we don't consider the stride, we will see errors.
2. One reason can be found below:
```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(128, 1), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```

```
OpSchema(func_schema=aten::t(Tensor(a) self) -> Tensor(a), args_schema=(DTensorSpec(mesh=DeviceMesh:([0, 1, 2, 3, 4, 5, 6, 7]), placements=(Shard(dim=0),), tensor_meta=TensorMetadata(shape=torch.Size([64, 128]), dtype=torch.float32, requires_grad=False, stride=(1, 64), memory_format=None, is_quantized=False, qparams={})),), kwargs_schema={})
```

The only difference between the two op_schemas is the tensor stride, which makes
the transpose op generate wrong results and leads to the add_/addmm_ ops failing with errors:
```
Traceback (most recent call last):
  File "/data/users/fduwjj/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/fduwjj/pytorch/benchmarks/distributed/tensor/tp_benchmark.py", line 210, in run_tp
    output.sum().backward()
  File "/data/users/fduwjj/pytorch/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/data/users/fduwjj/pytorch/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/api.py", line 252, in __torch_dispatch__
    return op_dispatch.operator_dispatch(
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 116, in operator_dispatch
    out, _, _ = _operator_dispatch(op_call, args, kwargs, sharding_propagator)
  File "/data/users/fduwjj/pytorch/torch/distributed/_tensor/dispatch.py", line 246, in _operator_dispatch
    local_results = op_call(*local_tensor_args, **local_tensor_kwargs)
  File "/data/users/fduwjj/pytorch/torch/_ops.py", line 435, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: The size of tensor a (64) must match the size of tensor b (8) at non-singleton dimension 1
```

The same applies to dtype: if we are using DTensor in a mixed-precision environment, we will run into similar situations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107181
Approved by: https://github.com/wanchaol
ghstack dependencies: #106524
2023-08-15 05:33:10 +00:00
2d841bcb9f Remove type promotion workaround (#107202)
Removes old type promotion workaround

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107202
Approved by: https://github.com/xuzhao9, https://github.com/eellison
2023-08-15 05:32:42 +00:00
c9c90765c1 grad_mode decorators without paren (#107086)
This PR implements the feature described in #107036 for `no_grad`, `enable_grad` and `inference_mode`.

Users can still use the above as before but they can also use them without parentheses.

For example:

```python
import torch

a = torch.ones(1, requires_grad=True)

def do_something():
    print(2 * a)

with torch.no_grad():
    do_something()  # tensor([2.])

torch.no_grad()(do_something)()  # tensor([2.])

torch.no_grad(do_something)()  # tensor([2.])

do_something()  # tensor([2.], grad_fn=<MulBackward0>)
```

For `inference_mode`, decorating without parentheses is equivalent to decorating with the default `mode=True`, similar to how dataclasses behave (https://docs.python.org/3/library/dataclasses.html#module-contents)

Closes #107036

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107086
Approved by: https://github.com/albanD
2023-08-15 05:25:33 +00:00
ba1da47e8f Add OnCompletion Hook to ProcessGroup (#106988)
This allows infra/trainers to get detailed stats about communication
efficiencies without knowing anything about what model or distributed
training paradigms have been used. This is helpful as infra/trainer
package usually prefers to be as model/algorithm agnostic as possible.
Therefore, we cannot assume that infra/trainer can have access to all
collectives used by the model authors.

This commit adds an `OnCompletion` hook to `ProcessGroupNCCL` which
will be fired on every work completion event.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106988
Approved by: https://github.com/kumpera, https://github.com/H-Huang
ghstack dependencies: #107140, #107141, #107160
2023-08-15 04:32:23 +00:00
2624da638d Support third-party devices to use the init_process_group method with… (#107113)
…out specifying the Backend

When init_process_group has not been called beforehand, DeviceMesh will automatically call init_process_group without specifying the backend. Thus, when a third-party device wants to use DeviceMesh without calling init_process_group first, there is a problem. In this PR, we add a default_device_backend_map so that third-party device users can add their backends to this map when they first register their backends with PyTorch. When init_process_group is called without the backend parameter, it will initialize the backends in this map. Thus, a third-party user can use the init_process_group method without specifying the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107113
Approved by: https://github.com/wanchaol
2023-08-15 03:46:07 +00:00
574442ba01 CI upgradeapalooza bionic->focal, gcc7->gcc9, clang7->clang10 (#105260)
Bionic support was finished back in April 2023, see https://ubuntu.com/blog/ubuntu-18-04-eol-for-devices

Neither gcc-7 nor clang-7 is fully compatible with C++17, so update the minimal tested gcc to gcc-9 and clang to clang-10.

Note: OpenMP support is  broken in Focal's `clang9`, so move up to a `clang10`

- Suppress `-Wuninitialized` in complex_test as gcc-11 fires a seemingly false-positive warning:
```
In file included from /home/malfet/git/pytorch/pytorch/c10/test/util/complex_test.cpp:1:
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h: In member function ‘virtual void memory::TestMemory_ReinterpretCast_Test::TestBody()’:
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h:38:25: warning: ‘z’ is used uninitialized [-Wuninitialized]
   38 |     c10::complex<float> zz = *reinterpret_cast<c10::complex<float>*>(&z);
      |                         ^~
/home/malfet/git/pytorch/pytorch/c10/test/util/complex_test_common.h:37:25: note: ‘z’ declared here
   37 |     std::complex<float> z(1, 2);
      |                         ^
```
- Downgrade `ucc` to 2.15, as 2.16 brings an incompatible libnccl, that causes crash during the initialization
- Install `pango` from the conda environment for `doctr` torch bench tests to pass, as the one available in the system is too new for conda
- Suppress some functorch tests when used with python-3.8+dynamo, see https://github.com/pytorch/pytorch/issues/107173
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105260
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/ZainRizvi, https://github.com/seemethere
2023-08-15 03:07:01 +00:00
9440a8cbec Introduce CUDA-only _scaled_mm op (#106844)
According to https://docs.nvidia.com/cuda/cublas/#id99 only FP8 matrix types can be scaled, and `Float8_e4m3`x`Float8_e4m3` results can be returned as `Float8_e4m3` type, or upcast to `Half`, `BFloat16` or `Float`, but in that case `result_scale` will have no effect as well as `amax` would not be computed.
Optional `bias` argument can also be passed to a function, which should be a vector of either `Half` or `BFloat16`, whose values are added to each row of the result matrix.

See table below for supported input and output types:
| Mat1 type  | Mat2 type | Bias type | Output types |
| ----------- | ----------- | ----------- | ----------- |
| Float8_e4m3  | Float8_e4m3  | Float16  | Float8_e4m3, Float16 |
| Float8_e4m3  | Float8_e4m3  | BFloat16 | Float8_e4m3, BFloat16, Float |
| Float8_e5m2  | Float8_e4m3  | Float16 |  Float8_e4m3, Float8_e5m2, Float16  |
| Float8_e5m2  | Float8_e4m3  | BFloat16 |  Float8_e4m3, Float8_e5m2, BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Float16 |  Float8_e4m3, Float8_e5m2, Float16 |
| Float8_e4m3  | Float8_e5m2  | BFloat16 |  Float8_e4m3, Float8_e5m2,  BFloat16, Float |
| Float8_e4m3  | Float8_e5m2  | Not supported | Not supported |

Skip the decomposition implementation until the fp8-on-triton story is better defined. A potential decomposition could look something like the following:
```python
@register_decomposition(aten._scaled_mm)
def _scaled_mm(
    mat1: Tensor,
    mat2: Tensor,
    *,
    dtype: Optional[torch.dtype] = None,
    scale_a: Optional[Tensor] = None,
    scale_b: Optional[Tensor] = None,
    scale_result: Optional[Tensor] = None,
) -> Tuple[Tensor, Tensor]:
    rc = torch.mm(mat1.to(torch.float32), mat2.to(torch.float32))
    rc = scale_a * rc if scale_a is not None else rc
    rc = scale_b * rc if scale_b is not None else rc
    rc = scale_result * rc if scale_result is not None else rc
    rc = rc.to(dtype if dtype is not None else mat1.dtype)
    return rc, torch.tensor(0.0, device=mat1.device)
```

Known limitations:
  - Only works for matrix sizes divisible by 16
  - 1st operand must be in row-major and 2nd in column-major order (i.e. if `x` and `y` are contiguous, then only `torch._scaled_mm(x, y.t())` will work); see the sketch below
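
A minimal usage sketch consistent with the constraints above (the fp8 dtype names `torch.float8_e4m3fn`/`torch.float8_e5m2` and the two-output return are assumptions based on the decomposition sketch; the exact `_scaled_mm` signature may differ):
```python
import torch

# shapes divisible by 16; x stays row-major, y.t() is column-major
x = torch.randn(32, 64, device="cuda").to(torch.float8_e4m3fn)
y = torch.randn(16, 64, device="cuda").to(torch.float8_e4m3fn)

out, amax = torch._scaled_mm(x, y.t())  # product plus an amax scalar
```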
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106844
Approved by: https://github.com/albanD
ghstack dependencies: #106977
2023-08-15 02:59:41 +00:00
e4e9aa28a7 Add generate_opcheck_tests, a PT2 crossref testing mechanism (#106903)
This PR adds `generate_opcheck_tests`. This is a utility that adds
additional crossref tests to an existing TestCase that has tests that
invokes operators. The main use case is if you have a large test suite
that already exercises operators and want to add automated testing that
the operators are correct, without actually refactoring your code into
something like OpInfos.

Given a `test_` method of a TestCase, we will generate one new
additional test for each of {schema correctness, autograd registration,
faketensor rule, aot_autograd static shapes, aot_autograd dynamic
shapes}. Each newly generated test runs the original test method under a
special torch_function mode (OpCheckMode) that intercepts
`op(*args, **kwargs)` calls and additionally passes (op, args, kwargs) to
a separate function (e.g. SchemaCheck).
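
A minimal sketch of this interception mechanism, assuming a plain `TorchFunctionMode` (the real `OpCheckMode` and check functions live in PyTorch's test utilities; names below are illustrative):
```python
import torch
from torch.overrides import TorchFunctionMode

def schema_check(op, args, kwargs):
    # stand-in for a real check such as SchemaCheck
    print(f"checking {op}")

class OpCheckModeSketch(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        schema_check(func, args, kwargs)  # hand (op, args, kwargs) to the check
        return func(*args, **kwargs)      # then run the original call

with OpCheckModeSketch():
    torch.cumsum(torch.randn(3), dim=0)  # intercepted like any op call in a test
```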

Nitty-gritty details:
- If a test is named test_cumsum, we end up generating new tests
(`test_schema__test_cumsum`, `test_<something>__test_cumsum`)
- Users can provide a dictionary of expected failures / skips  that is indexed on
operators. This gives us a sense of what operators support PT2 and which
operators require fixing before they support PT2.

Due to some co-dev limitations, I'm planning on landing this PR first
and then using it to add crossref testing for internal tests and
fbgemms. I could squash this PR with the internal changes if we want to
see how that works out, just let me know.

Test Plan:
- We create a mini op test suite called MiniOpTests.
- Then, we use `generate_opcheck_tests` to generate tests onto it.
- We have our own test xfail list to check that the things that should
fail do fail.
- Finally, there is a separate TestGenerateOpcheckTests that checks that
the correct number of tests were generated and also tests some helper
functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106903
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-08-15 02:16:07 +00:00
ddf36c82b8 [PT-D][FSDP] Handle corner case of load with multi-backend PG (#107172)
Summary:
When loading a CPU state_dict with a pg initialized with
cpu:gloo,cuda:nccl, we hit a gloo crash since the dest tensor is on GPU and the input
is on CPU.

As a workaround, just enforce that if local_tensor.is_cpu, the dest tensor is
also on CPU.
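
A minimal sketch of the workaround, with illustrative names rather than the actual FSDP load path:
```python
import torch

def ensure_matching_device(local_tensor: torch.Tensor, dest_tensor: torch.Tensor) -> torch.Tensor:
    # keep gloo happy: if the loaded local tensor lives on CPU, the destination
    # tensor it is copied into must be on CPU as well
    if local_tensor.is_cpu and not dest_tensor.is_cpu:
        dest_tensor = dest_tensor.cpu()
    return dest_tensor
```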

Test Plan: CI

Differential Revision: D48324752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107172
Approved by: https://github.com/fegin
2023-08-14 23:24:44 +00:00
064d813f37 Add distributed/c10d *.hpp files to lintrunner (#107160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107160
Approved by: https://github.com/rohan-varma, https://github.com/fegin, https://github.com/Skylion007
ghstack dependencies: #107140, #107141
2023-08-14 23:16:40 +00:00
facadc6c97 [Easy] Make Work::retrieveOpType a const function (#107141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107141
Approved by: https://github.com/awgu
ghstack dependencies: #107140
2023-08-14 23:16:40 +00:00
dd6319198d Apply clang-format to distributed/c10d folder (#107140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107140
Approved by: https://github.com/H-Huang
2023-08-14 23:16:38 +00:00
858b465d74 fix str splits in single line (#106005)
Simple formatting improvement and two spelling fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106005
Approved by: https://github.com/H-Huang
2023-08-14 23:07:38 +00:00
759c4995e7 [ci] clean up some multigpu tests, and add funcol test (#107153)
Add the funcol tests to multigpu tests to ensure it runs on CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107153
Approved by: https://github.com/kumpera
ghstack dependencies: #107151, #107152
2023-08-14 21:55:31 +00:00
9ae51e3ad9 [thread_pg] fix reduce_scatter to respect different cuda device (#107152)
Same reason as the previous all_reduce PR; see the context in the all_reduce
PR https://github.com/pytorch/pytorch/pull/107151 instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107152
Approved by: https://github.com/kumpera
ghstack dependencies: #107151
2023-08-14 21:55:31 +00:00
4be8fe0f0d [thread_pg] fix all_reduce to respect different cuda device (#107151)
The previous implementation only works on CPU and does not respect
the fact that each rank has its data on a different device (i.e. cuda),
so the implementation will raise an error like the one below:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
```

See report in https://github.com/pytorch/pytorch/pull/105604#issuecomment-1675472670

This PR fixes this issue; the previously failing GPU tests now pass
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107151
Approved by: https://github.com/kumpera
2023-08-14 21:55:29 +00:00
50927e25f7 Correct compile doc string format (#107124)
A blank line should be added between the list items

See the wrong generated doc: https://pytorch.org/docs/main/generated/torch.compile.html#torch-compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107124
Approved by: https://github.com/colesbury
2023-08-14 21:49:12 +00:00
2f0ca722d1 Typo fix in Nonzero.cu (#107090)
Stride of the output that is being produced is (1, num_nonzeros)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107090
Approved by: https://github.com/colesbury
2023-08-14 21:15:41 +00:00
2c5f96deac [Inductor] Make softshrink composite implicit (#107052)
The backward is pretty much equivalent to the one we had written

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107052
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039, #107051
2023-08-14 21:01:50 +00:00
6d899571d6 Simplify sign lowering in triton (#107051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107051
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038, #107039
2023-08-14 21:01:50 +00:00
3b1254e800 Make hardshrink's decomp composite implicit (#107039)
The generated code is the same
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107039
Approved by: https://github.com/peterbell10
ghstack dependencies: #107038
2023-08-14 21:01:50 +00:00
45c7880486 Simplify some decompositions. (#107038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107038
Approved by: https://github.com/peterbell10
2023-08-14 21:01:50 +00:00
80988b6277 Introduce memory stacks for free (#106758)
Previously when we recorded a free action in a memory trace, we would provide
the stack for when the block was allocated. This is faster because we do not
have to record stacks for free, which would otherwise double the number of stacks
collected. However, sometimes knowing the location of a free is useful for
figuring out why a tensor was live. So this PR adds this behavior. If
performance ends up being a concern the old behavior is possible by passing
"alloc" to the context argument rather than "all".

Also refactors some of glue logic to be consistent across C++ and Python and
routes the Python API through the C++ version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106758
Approved by: https://github.com/albanD
2023-08-14 20:38:15 +00:00
df8493455e [ROCm] enable test_api (test_libtorch) cpp unit tests (#106712)
This is part of effort to enable missed cpp tests for ROCm platform.
In this change,
- enabled test_libtorch cpp tests (more than 3107 tests)
- fixed missing dependency: libcaffe2_nvrtc.so required by FunctionalTest.Conv1d
- test_api binary is changed to exclude failed tests InitTest and IntegrationTest - to revisit later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106712
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-08-14 20:09:34 +00:00
1e007d044d [AOTInductor] Prepare for ProxyExecutor, OSS only change (#107065)
Summary: Minor fixes to export schema and serialization

Test Plan: OSS CI

Differential Revision: D48280809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107065
Approved by: https://github.com/zhxchen17
2023-08-14 20:04:45 +00:00
4a6ca4cc05 [TP][DTensor Perf] Some perf improvement to reduce DTensor CPU overhead (#106524)
By inspecting a small TP benchmark, we found a couple of things we can optimize:
1. We call deep_copy many times when we initialize DTensor.
2. Some sharding_prop results are not cached successfully.
3. We are still calling redistribute when it is not necessary.

![image](https://github.com/pytorch/pytorch/assets/6937752/b847d110-eea1-45df-9298-066d0ba07dd7)

![image](https://github.com/pytorch/pytorch/assets/6937752/fc08f564-caed-496b-80d7-275c1dba3806)

![image](https://github.com/pytorch/pytorch/assets/6937752/fdc06cc4-a4ba-48e8-a118-c041bbd04f5e)

So we want to:
1. Remove the deep_copy, and make placements a tuple so we are sure it is immutable.
2. Somehow the op_schema gets changed during sharding propagation, so we store a hashed version of it before passing it to sharding_prop. Ideally we want to figure out why `op_schema` gets changed, but since it changes in both the index and detach/view ops, it might take more time to debug.
3. Also, when we hash op_schema, we want to hash the entire args_schema, not just the args_spec, which only contains the DTensorSpec of the args that are DTensors.
4. It turns out that sometimes DTensor has mem_format set to None (not contiguous), which triggers an unnecessary redistribute, so we now only compare type/shape and stride in the metadata.

Also we need to ensure _Partial and Shard have different hash value in the DTensorSpec.

![image](https://github.com/pytorch/pytorch/assets/6937752/321e6890-1ab6-4975-adc9-524c6ef9a76b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106524
Approved by: https://github.com/wanchaol
2023-08-14 20:03:19 +00:00
00751772e6 Upload perf benchmark to Rockset in batch of at most 5000 records (#107095)
TIL, uploading to Rockset has an upper limit of 5000 records per request. So uploading PT2 perf benchmark results could fail if that limit was reached, for example https://github.com/pytorch/pytorch/actions/runs/5828810421/job/15849232756

```
HTTP response body: {"message":"The number of documents specified in this request exceeds the maximum allowed limit of 5,000 documents.","message_key":"RECEIVER_REQUEST_MAX_DOCUMENT_LIMIT","type":"INVALIDINPUT","line":null,"column":null,"trace_id":"73fc2eb5-cfd1-4baa-8141-47c7cde87812","error_id":null,"query_id":null,"internal_errors":null}
```

The fix is to upload the results in multiple smaller batches of at most 5000 records.
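
A minimal sketch of the batching logic (the `upload_to_rockset` callable is a placeholder, not the actual helper in `tools/stats/`):
```python
BATCH_SIZE = 5000

def upload_in_batches(docs, upload_to_rockset):
    # split the documents into chunks of at most 5000 and upload each chunk separately
    for start in range(0, len(docs), BATCH_SIZE):
        batch = docs[start:start + BATCH_SIZE]
        print(f"Writing {len(batch)} documents to Rockset")
        upload_to_rockset(batch)
        print("Done!")
```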

### Testing

5743 records from https://github.com/pytorch/pytorch/actions/runs/5828810421/job/15849232756 were written in 2 batches (5000 + 743)

```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 5821183777 --workflow-run-attempt 1 --repo pytorch/pytorch --head-branch gh/ezyang/2294/head
...
Writing 5000 documents to Rockset
Done!
Writing 743 documents to Rockset
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107095
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/ZainRizvi
2023-08-14 19:56:42 +00:00
8c9b2fe8f0 ExportedProgram.transform now updates graph_signature automatically (#107080)
Update graph_signature according to the graph after transformation. Transformations can lead to node name changes, which are used in graph_signature to identify inputs and outputs. Therefore, after each transformation, we need to update the graph_signature according to the new node names.

WARNING: This implementation makes a few assumptions:
- The transformation doesn't change the number of inputs/outputs.
- Each input/output still has the same meaning:
  - For inputs, that means the inputs in the transformed graph map to the same lifted parameter/buffer or user input as the input at the same position in the graph before transformation.
  - Similarly for outputs, each output should correspond to the same mutated buffer or user output as the output value at the same position in the graph before transformation.

It is difficult to programmatically validate these assumptions, but they should hold true most of the time as inputs/outputs of the graph rarely need to be changed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107080
Approved by: https://github.com/tugsbayasgalan
2023-08-14 19:52:41 +00:00
05db3d9969 improve doc on how to understand dynamo (#106860)
Per the discussion in https://github.com/pytorch/pytorch/pull/106673#issuecomment-1669939815, I add more documentation to explain the output of dynamo compilation. I didn't find a decompiler library, so I manually decompiled the bytecode. The result looks good.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106860
Approved by: https://github.com/jansel, https://github.com/msaroufim
2023-08-14 19:49:24 +00:00
770a565e26 [dynamo][easy] Only xfail test_dynamic_shapes_float_guard_dynamic_shapes if z3 is available (#107137)
This test only fails when z3 is available. So we should only xfail it if z3 is available, otherwise the test passes with an unexpected success.

Differential Revision: [D48323103](https://our.internmc.facebook.com/intern/diff/D48323103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107137
Approved by: https://github.com/ysiraichi, https://github.com/williamwen42
2023-08-14 19:47:21 +00:00
6af6b8f728 Reland: Remove set_default_dtype from nn tests (#107069)
Part of #68972
Relands #105775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107069
Approved by: https://github.com/ezyang
2023-08-14 17:01:57 +00:00
32f93b1c68 [Security] Use github environment for update-commit-hash workflow (#107060)
Similar to: https://github.com/pytorch/pytorch/pull/101718

https://github.com/pytorch/pytorch/actions/runs/5856611801/job/15876722301

Please note that since we can't specify an environment for a composite workflow, update-commit-hash had to be moved to an action rather than a workflow.

Still todo: Move docs and binary builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107060
Approved by: https://github.com/seemethere
2023-08-14 16:55:37 +00:00
cyy
5bbfb96203 [Reland] Upgrade NVTX to NVTX3 (#97582)
PR #90689 replaces NVTX with NVTX3. However, torch::nvtoolsext is created only when the third-party NVTX is used.
This is clearly a logical error. We now move the creation code out of the branch to cover all cases. This should fix the issues reported in the comments of #90689.

It would be better to move configurations of the failed FRL jobs to CI tests so that we can find such issues early before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97582
Approved by: https://github.com/peterbell10
2023-08-14 16:55:25 +00:00
461c703ee6 Add typecasting for gelu backward kernel (#86673) (#106856)
Fixes #86673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106856
Approved by: https://github.com/janeyx99
2023-08-14 15:02:25 +00:00
2932b0bf37 Extend impl_backward to be usable with torch.library operators (#106817)
- impl_save_for_backward/impl_backward only work for functional,
non-view schemas. We validate this.
- impl_save_for_backward/impl_backward raise if there already exists an
autograd implementation from torch.library / TORCH_LIBRARY.
- Operators constructed via custom_op receive an "autograd indirection
kernel". The "autograd indirection kernel" automatically pulls the
constructed autograd kernel out of a dict. When
impl_save_for_backward/impl_backward get used with torch.library
operators, we also register the "autograd indirection kernel" so we can
reuse the logic.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106817
Approved by: https://github.com/soulitzer
ghstack dependencies: #106799, #106800
2023-08-14 14:33:46 +00:00
db9a0cf689 Extend impl_backward to handle non-Tensor outputs (#106800)
Recall that the user must give us a backward function that accepts
`(ctx, saved, *grads)`, with one grad per output. Previously,
impl_backward only worked for functions that return one or more Tensors.

The new semantics (see the sketch after this list) are that if the output has:
- a TensorList, the backward function provided by the user will receive
a List[Tensor] of grads for that output.
- a number, the backward function provided by the user will receive
None as the grad.
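
A minimal sketch of what the user-provided backward receives under these semantics, assuming an op that returns `(TensorList, int)` such as the `numpy_split_copy_with_int` example mentioned in the test plan (names are illustrative):
```python
def backward(ctx, saved, grad_splits, grad_count):
    # grad_splits: List[Tensor], one grad per tensor in the TensorList output
    # grad_count: None, because number outputs receive no grad
    ...
```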

Also recall that impl_backward is implemented by registering an
autograd.Function to the autograd dispatch key.
We needed to make the following changes:
- If an output is a TensorList, autograd.Function will ignore it. So we
need to tree-flatten it before returning it from the autograd.Function
- This means that the autograd.Function receives a flat list of grad
during the backwards pass. We need to tree-unflatten it into the correct
shape before passing it to the user-defined backward
- We modify the logic of output_differentiability. Only
Tensor/TensorList outputs can be marked as differentiable. If a
TensorList is marked as non-differentiable, then this is equivalent to
all Tensors in the list being non-differentiable. There is no
finer-grain control over this (to match derivatives.yaml).

Test Plan:
- There are new `numpy_split_copy` (returns TensorList) and
`numpy_split_copy_with_int` (returns (TensorList, int)) operators in
custom_op_db
- Added tests for output_differentiability into test/test_custom_ops.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106800
Approved by: https://github.com/soulitzer
ghstack dependencies: #106799
2023-08-14 14:33:46 +00:00
9fcce1baf1 [custom_op] Allow constructor to infer more types (#106799)
This expands the torch._custom_ops.custom_op API to be able to construct
operators that return (int, bool, float, Scalar, List[Tensor]) to make
it more in-line with our torch.library API.

NB: there needs to be updates to our custom_op autograd registration
API. For ease of review those changes will go in the next PR up but I
can squash if requested.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106799
Approved by: https://github.com/soulitzer
2023-08-14 14:33:43 +00:00
d8ad74857c Run translation validation on tracing error. (#106645)
This PR wraps `InstructionTranslator` run with a try-catch block so as to run the
translation validation (TV) if it ends up raising an error.

In this context, we run TV so as to catch simplification errors. These may turn
`ShapeEnv.divisible` and `ShapeEnv.replacements` incorrect.

For example: #101173 describes a SymPy simplification bug that doesn't reach TV, since
it's run only in the end of the tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106645
Approved by: https://github.com/ezyang
2023-08-14 13:43:34 +00:00
937cd3742b [xla hash update] update the pinned xla hash (#107120)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107120
Approved by: https://github.com/pytorchbot
2023-08-14 10:54:47 +00:00
2b1058c542 Enable mypy check in torch/_inductor/wrapper_benchmark.py (#106775)
Fixes #105230

```shell
$ lintrunner init && lintrunner -a torch/_inductor/wrapper_benchmark.py
...
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/wrapper_benchmark.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106775
Approved by: https://github.com/eellison
2023-08-14 04:32:08 +00:00
d392963ac4 [fbcode] Use FastCat in PT Concat implementation (#106727)
Summary: Reimplement D48081898 and PR https://github.com/pytorch/pytorch/pull/106518 in fbcode first to accelerate the launching process

Test Plan:
All checks have been passed: https://github.com/pytorch/pytorch/actions/runs/5758987335/job/15612600466?pr=106518

(For my own learning purpose)
Check out OSS PyTorch repo and test following the instructions in https://www.internalfb.com/intern/wiki/PyTorch/PyTorchDev/Workflow/PyTorch_environment_setup/
and https://www.internalfb.com/intern/wiki/PyTorch/PyTorchDev/Workflow/PyTorch_environment_setup/oss_setup_on_devserver

Then run:
```
pytest -k test_cat_out test/test_tensor_creation_ops.py -v -s
```
To submit to GitHub
```
hg amend; jf submit; ghexport
```

Differential Revision: D48082741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106727
Approved by: https://github.com/ezyang, https://github.com/houseroad
2023-08-13 22:36:51 +00:00
e7a3fb13e7 [pt2] add Python metas for special ops (#106683)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106683
Approved by: https://github.com/ezyang
2023-08-13 14:12:21 +00:00
b897c57d47 [TGIF][Inplace][Perf] Copy tensor to device with pinned memory & move copy weight sleep to getRecord (#106849)
Summary:
There are 2 changes in the diff that help optimize perf during inplace update (a sketch of the first one follows the list):
1. Read data with pinned memory.
2. Move the copy-weight sleep from between copying the whole Tensor to between copying chunks.
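
A minimal sketch of the pinned-memory idea from change (1), using illustrative names rather than the actual TGIF code:
```python
import torch

def copy_weight_to_device(weight_cpu: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    # stage the weight in page-locked (pinned) host memory so the host-to-device
    # copy is faster and can be issued asynchronously
    staging = weight_cpu.pin_memory()
    return staging.to(device, non_blocking=True)
```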

Test Plan:
**Local Test**
```
./ai_infra/inference_platform/test_platform/script/run_sigrid_4card.sh --port 7451 --local_model_dir /home/lujia/script --cuda_devices 6 --bind_node 3 --model_id 962549778_514 --gflag_config_path sigrid/predictor/predictor_x_gflags_mrs_prospector_gpu_torchscript_fusedsolution_1card_opt_fm -- --enable_thrift_warmup=false --tgif_replicate_merge_by_tempfile=false --enable_inplace_snapshot_transition --model_version_config_path sigrid/predictor/models_version/lujia_test --inplace_update_max_retries 0 --submod_to_device="merge|cuda0"
```

**Load test on job  tsp_eag/smart/inference_platform_sp__sigrid_predictor_gpu_adhoc_realtimetest_m962549778_latest.s3**

Before:
(p99 latency)
{F1066957232}

(SR error rate)
 {F1066957650}

After:
(p99 latency)
 {F1066957141}

(SR error rate)
{F1066957376}

Differential Revision: D48182533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106849
Approved by: https://github.com/842974287, https://github.com/kit1980
2023-08-13 07:37:46 +00:00
ddd2f682b9 [executorch] Let custom ops registration code only import ATen headers (#107064)
Summary: Basically we generate `CustomOpsNativeFunctions.h` for registering custom ops into the PyTorch JIT runtime. This header needs to hook up with the C++ kernel implementations of all the custom ops. For this reason it should include ATen headers instead of Executorch headers. This PR changes that.

Test Plan: Rely on existing CI jobs

Differential Revision: D48282828

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107064
Approved by: https://github.com/kirklandsign
2023-08-13 00:34:34 +00:00
f26aa2dcd9 Keep fx node name consistent with aot_export (#107068)
torch.export() initially starts with the node names from aot_export; if we don't make this change, any no-op transformation would break name consistency, thus breaking GraphSignature correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107068
Approved by: https://github.com/tugsbayasgalan
2023-08-12 23:12:03 +00:00
8472c24e3b [inductor] Optimize away zero-element loads (#107074)
Fixes #107066, closes #107008

This replaces loads to zero-element `Loops` or `Buffer`s with `ops.constant`
calls. This both avoids the issue of masked loads under triton, and also means
the buffer is not listed as a dependency for downstream users which may improve
performance generally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107074
Approved by: https://github.com/davidberard98
2023-08-12 07:58:14 +00:00
aa36e16f95 Add gfx90a target for ROCm CI (#106879)
...in preparation for upgrading CI runners to MI2xx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106879
Approved by: https://github.com/seemethere
2023-08-12 07:23:20 +00:00
6f83382161 [inductor][easy] add a missing parenthesis (#107001)
If I understand the code correctly, we want to add a fusion choice if
- node2 is template or foreach
and
- can_fuse returns true for (node2, node1)

But the code misses a pair of parentheses, since in Python 'and' has higher precedence than 'or'. This does not cause much damage since even if we add a pair of nodes that cannot be fused, we will skip them later when we call can_fuse again (in fuse_nodes_once). Fixing this mainly to avoid confusion; a small illustration of the pitfall follows.
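
A schematic illustration of the precedence pitfall (condition names are simplified, not the actual scheduler code):
```python
# values picked to show the difference: node2 is a template, but fusion is not allowed
is_template, is_foreach, can_fuse = True, False, False

# missing parentheses: parses as  is_template or (is_foreach and can_fuse)  -> True
buggy = is_template or is_foreach and can_fuse

# intended condition: (is_template or is_foreach) and can_fuse  -> False
correct = (is_template or is_foreach) and can_fuse

assert buggy != correct
```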

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107001
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-08-12 06:26:06 +00:00
5b04e9b6ce Install torchrec/fbgemm from source in CI (#106808)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106808
Approved by: https://github.com/malfet, https://github.com/xuzhao9
2023-08-12 02:08:44 +00:00
9858edd99f Revert "Reordering tests experiment (#106347)"
This reverts commit 7dfab082be9eaeeee95c7b0363e59c824c6a9009.

Reverted https://github.com/pytorch/pytorch/pull/106347 on behalf of https://github.com/clee2000 due to probably broke sharding ([comment](https://github.com/pytorch/pytorch/pull/106347#issuecomment-1675542738))
2023-08-11 23:59:48 +00:00
c9cbcb2449 [device_mesh] move remaining collectives to a separate file (#107012)
Move the remaining collectives to a separate file to prepare device mesh
to become a public distributed API

For those remaining utils, we need to upstream them to functional
collectives with proper implementation, added TODO there for a follow up
PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107012
Approved by: https://github.com/fduwjj
2023-08-11 23:49:27 +00:00
22095acfd7 [ONNX] Migrate to PT2 logging (#106592)
Summary
- The 'dynamo_export' diagnostics leverages the PT2 artifact logger to handle the verbosity
level of logs that are recorded in each SARIF log diagnostic. In addition to SARIF log,
terminal logging is by default disabled. Terminal logging can be activated by setting
the environment variable `TORCH_LOGS="onnx_diagnostics"`. When the environment variable
is set, it also fixes logging level to `logging.DEBUG`, overriding the verbosity level
specified in the diagnostic options.
See `torch/_logging/__init__.py` for more on PT2 logging.
- Replaces 'with_additional_message' with 'Logger.log' like apis.
- Introduce 'LazyString', adopted from 'torch._dynamo.utils', to skip
evaluation if the message will not be logged into diagnostic.
- Introduce 'log_source_exception' for easier exception logging.
- Introduce 'log_section' for easier markdown title logging.
- Updated all existing code to use new api.
- Removed 'arg_format_too_verbose' diagnostic.
- Rename legacy diagnostic classes for TorchScript Onnx Exporter to avoid
confusion.

Follow ups
- The 'dynamo_export' diagnostic now will not capture python stack
information at point of diagnostic creation. This will be added back in
follow up PRs for debug level logging.
- There is type mismatch due to subclassing 'Diagnostic' and 'DiagnosticContext'
for 'dynamo_export' to incorporate with PT2 logging. Follow up PR will
attempt to fix it.
- More docstrings with examples.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106592
Approved by: https://github.com/titaiwangms
2023-08-11 23:27:00 +00:00
5d09e49947 Make the __call__ op of ExportedProgram follow calling convention. (#106186)
Convention documented here: 01069ad4be/torch/_functorch/aot_autograd.py (L1034)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106186
Approved by: https://github.com/zhxchen17
2023-08-11 23:12:37 +00:00
42660015b4 [Dynamo x FSDP][2/x] Small changes to distributed to make it dynamo friendly (#106886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106886
Approved by: https://github.com/awgu, https://github.com/wconstab
ghstack dependencies: #106884
2023-08-11 22:35:50 +00:00
91778ada87 [inductor] graph replayer (#106952)
Recently I have found it a bit painful to run benchmark scripts in my dev environment. E.g., the command below
```
 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only YituTechConvBert --training
```
took about 2 minutes to run. It may take even longer for some other models.

The command is slow since it
- needs to do the dynamo work
- verifies the model on CPU
- runs perf tests
- compiles all the graphs

However, often times I only need to debug inductor-specific logic like loop ordering and fusion. A lot of the things the script does are useless for me. Also I only need to test one graph at a time (e.g. check the fwd graph first and, when I'm done, continue to the bwd graph) rather than compiling all the graphs.

The graph replayer adds a `@save_args` decorator to the compile_fx_inner function. When `config.save_args` is true, it will pickle all the arguments to `compile_fx_inner` to the file system.  Later on, we can call `load_args_and_run_compile_fx_inner("/tmp/inductor_saved_args/compile_fx_inner_0.pkl")` to replay the graph and compile it with inductor.

Replaying the fwd graph took around 60 seconds (maybe this can be further reduced, but this is already a 2x speedup for dev efficiency), and it only took around 20 seconds to reach the `Scheduler.__init__` method.

I also checked the `TORCH_COMPILE_DEBUG` flag that already exists. The most similar part of `TORCH_COMPILE_DEBUG` is that it can save a graph and its arguments and later on rerun it. But the difference here is, rather than running the model, we want to call the inductor API to compile the model (without even going through dynamo or aot-autograd).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106952
Approved by: https://github.com/jansel
ghstack dependencies: #106990
2023-08-11 22:28:20 +00:00
6730d5f9a0 [inductor][easy] show kernel name in str(ExternKernel) (#106990)
The string representation of an ExternKernel does not show the kernel name. Since the kernel name is such important information for an ExternKernel, this PR adds it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106990
Approved by: https://github.com/eellison
2023-08-11 22:27:28 +00:00
2c8f24829f Decomposition of bmm and mm for dot product (#106593)
Summary: Decomposition of bmm and mm for dot product

Test Plan: sandcastle, and Bert

Differential Revision: D48055141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106593
Approved by: https://github.com/jansel
2023-08-11 22:23:04 +00:00
ec0f3fda7d Revert "Remove set_default_dtype from nn tests (#105775)"
This reverts commit 4d6a891baf2224cfa81bfe7632cf08be50812216.

Reverted https://github.com/pytorch/pytorch/pull/105775 on behalf of https://github.com/huydhn due to Sorry for reverting you change, it is failing one of the slow test in trunk ([comment](https://github.com/pytorch/pytorch/pull/105775#issuecomment-1675460195))
2023-08-11 22:14:17 +00:00
3d00170b20 [inductor] fix test_dim_function_empty (#106994)
Summary: Looks like the assert syntax was just wrong

Test Plan:
PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py -k test_dim_function_empty
PYTORCH_TEST_WITH_AOT_EAGER=1 python test/test_torch.py -k test_dim_function_empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106994
Approved by: https://github.com/eellison
2023-08-11 21:38:53 +00:00
547ccae0db [export] Support preserving calling convention to some modules. (#106798)
Summary: APS uses this feature to swap out some submodules after unflattening.

Test Plan: test_export_preserve_signature

Differential Revision: D48154341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106798
Approved by: https://github.com/tugsbayasgalan
2023-08-11 21:17:45 +00:00
354484ea6d Revert "Add _foreach_clamp (#106574)"
This reverts commit 2b560d3c3a9b34cd11fc9ff9e3a0be6a81d47968.

Reverted https://github.com/pytorch/pytorch/pull/106574 on behalf of https://github.com/kit1980 due to breaking internal windows builds ([comment](https://github.com/pytorch/pytorch/pull/106574#issuecomment-1675400335))
2023-08-11 21:05:04 +00:00
c9cdcb299a Remove ExclusivelyOwned from register_dispatch_key (#106791)
This fixes a bug that could occur with python decompositions.

When an operation is intercepted in the C++ code in pytorch, the outputs are created as `ExclusivelyOwned<at::Tensor>`s. Later on, when it dispatches back to Python for the decomposition, these tensors have their ownership shared with Python. In a normal use case the exclusively owned tensor is released and its value returned as a non-exclusively owned tensor from the operation. However, if the Python decomposition throws an error, the `ExclusivelyOwned` wrapper destroys the `at::Tensor`, leading to a Python reference to a tensor which isn't alive (and meaning pytorch falls over in debug mode).

Note this will be a performance hit when handling errors.

Fixes #106790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106791
Approved by: https://github.com/ezyang
2023-08-11 21:04:33 +00:00
d97b18d769 Propose nkaretnikov as general PrimTorch/meta/decomp reviewer (#106970)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106970
Approved by: https://github.com/albanD
2023-08-11 20:50:31 +00:00
fbfb9a1648 [Dynamo] Improve PT2 fbcode logging observability (#106932)
Summary:
https://docs.google.com/document/d/1D5K3_ELsda3tIUeSyNL_2yee-M3jVWbirqSQ5BDNvHQ/edit

This is the revamped version of D47908299.

For each frame, we will record a list of compilation metrics: e.g, backend_compile time, entire_frame_compile time, cache_size, co_filename, co_firstlineno, co_name, guards, graph input_count, graph node_count, graph op_count.

With the help of job info: mast_job_name, global_rank, we can satisfy the requirements from `Things I’ve used/wanted to use our logging to determine` in https://docs.google.com/document/d/1D5K3_ELsda3tIUeSyNL_2yee-M3jVWbirqSQ5BDNvHQ/edit (or add more metrics for this framework)

Test Plan:
```
buck2 test //caffe2/test:test_dynamo
```

Differential Revision: D48142400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106932
Approved by: https://github.com/anijain2305
2023-08-11 20:46:04 +00:00
1cfe292061 Mark test_lstm_packed as slow (#107048)
The test takes >30 minutes to run on some configurations and keeps getting unmarked as slow by the automatic slow test detection.
Examples:
https://ossci-raw-job-status.s3.amazonaws.com/log/15824750763
https://ossci-raw-job-status.s3.amazonaws.com/log/15802766247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107048
Approved by: https://github.com/huydhn
2023-08-11 20:35:14 +00:00
22a20d0850 Add isFloat8Type predicate (#106977)
And make float8 dtypes part of `isReducedFloatingType()` predicate
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106977
Approved by: https://github.com/albanD
2023-08-11 20:13:57 +00:00
5c48ff20b5 AsyncCollectiveTensor: dont sync on view ops (#105240)
AsyncCollectiveTensor is a tensor subclass that is meant to "delay synchronization" when you call into the functional collectives API's. It does this (if I understand correctly) by internally holding an "unsynchronized" version of the tensor, which is the result of the communication op, and internally calling `.wait()` to synchronize the data the next time it is used.

Previously, these wait() calls would happen immediately, because `AsyncCollectiveTensor` gets wrapped by `DTensor()`, which calls `.detach()` on its inner tensor, immediately causing the sync (code: 1518d5eec4/torch/distributed/_tensor/api.py (L207))

AsyncCollectiveTensor shouldn't need to do a synchronization if you try to detach() it though - in fact, it should be fine to avoid synchronizing if you perform any view ops on it (which just require viewing metadata, but not actual data). This PR tries to update `AsyncCollectiveTensor` to delay `wait()` calls whenever the subclass encounters a view op.

Added some light testing, that just runs some DTensor compute followed by view ops, and confirms that the output is still an `AsyncCollectiveTensor` when we call `.to_local()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105240
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/wconstab
2023-08-11 19:20:25 +00:00
e165938853 Implement decomposition for aten.rrelu_with_noise (#106812)
Test Plan:
* Primarily, added new test in test/test_decomp.py
* Updated existing tests, e.g., to NOT expect failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106812
Approved by: https://github.com/eellison
2023-08-11 19:18:29 +00:00
b1b3f61f2c Skip Triton templates in MM max autotune with zero-size inputs (#106865)
Summary:

MM max autotune (and friends) crash when one of the inputs is zero-size.

E.g., running this code:

```
import torch

@torch.compile()
def fn(x, y):
    return torch.mm(x, y)

inps = [torch.rand([0, 30]), torch.rand([30, 40])]
inps = [x.to(device="cuda") for x in inps]
out = fn(*inps)
```

with this command:

```
TORCHINDUCTOR_MAX_AUTOTUNE=1 python test.py
```

raises this error (the top of the stack trace omitted for brevity):

```
...
  File "/data/users/aakhundov/pytorch/torch/_inductor/kernel/mm.py", line 119, in tuned_mm
    return autotune_select_algorithm("mm", choices, [mat1, mat2], layout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 960, in autotune_select_algorithm
    return _ALGORITHM_SELECTOR_CACHE(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 787, in __call__
    timings = self.lookup(
              ^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/codecache.py", line 267, in lookup
    timings[choice] = benchmark(choice)
                      ^^^^^^^^^^^^^^^^^
  File "/data/users/aakhundov/pytorch/torch/_inductor/select_algorithm.py", line 774, in autotune
    raise ErrorFromChoice(msg, choice, benchmark_fn.debug_str())
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: ErrorFromChoice: Please run `ptxas /tmp/compile-ptx-src-bfb1c6` to confirm that this is a bug in `ptxas`

From choice TritonTemplateCaller(/tmp/torchinductor_aakhundov/z7/cz7n7nn6rdlaelu4pbaaurgmu74ikl6g76lkngwawrevlfxlc6re.py, ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=32, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, num_stages=2, num_warps=4)
inputs = [
    torch.empty_strided((0, 30), (30, 1), dtype=torch.float32, device='cuda'),
    torch.empty_strided((30, 40), (40, 1), dtype=torch.float32, device='cuda'),
]
out = torch.empty_strided((0, 40), (40, 1), dtype=torch.float32, device='cuda')

  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda', torch.float32, size=[0, s0], stride=[s0, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg3_1', layout=FixedLayout('cuda', torch.float32, size=[s0, s1], stride=[s1, 1]))
  ))
```

This PR adds a check to skip Triton templates in the `mm`, `addmm`, `mm_plus_mm` autotuning when the product of the MM problem shape (`m * n * k`) is zero.

Additionally, early exit conditions have been added to the mm and mm_plus_mm Triton templates on `M * N * K == 0`, to prevent issues when autotuning was done on non-zero-size inputs with dynamic shapes and zero-size inputs are then encountered by the compiled model.

Test Plan:

```
$ python test/inductor/test_max_autotune.py -v

...

----------------------------------------------------------------------
Ran 16 tests in 29.569s

OK
```

Reviewers: @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106865
Approved by: https://github.com/jansel
2023-08-11 19:10:16 +00:00
656412f0cb Add multiprocess option to dynamo benchmarks (#106394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106394
Approved by: https://github.com/XilunWu
2023-08-11 18:34:09 +00:00
3fe1dba068 Fix test_cond_functionalized_aot_func_check_functional (#106889)
Fix a typo in unit test.

Test Plan:
Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106889
Approved by: https://github.com/tugsbayasgalan
2023-08-11 18:31:12 +00:00
a926be39d4 torch.jit.script escape hatch (#106229)
Although the sun is setting for torchscript, it is not [officially deprecated](https://github.com/pytorch/pytorch/issues/103841#issuecomment-1605017153) since nothing currently fully replaces it. Thus, "downstream" libraries like TorchVision that started offering torchscript support still need to support it for BC.

torchscript has forced us to use workaround after workaround since forever. Although this makes the code harder to read and maintain, we made our peace with it. However, we are currently looking into more elaborate API designs that are severely hampered by our torchscript BC guarantees.

Although likely not intended as such, while looking for ways to enable our design while keeping a subset of it scriptable, we found the undocumented `__prepare_scriptable__` escape hatch:

0cf918947d/torch/jit/_script.py (L977)

One can define this method and if you call `torch.jit.script` on the object, the returned object of the method will be scripted rather than the original object. In TorchVision we are using exactly [this mechanism to enable BC](3966f9558b/torchvision/transforms/v2/_transform.py (L122-L136)) while allowing the object in eager mode to be a lot more flexible (`*args, **kwargs`, dynamic dispatch, ...).
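
A minimal sketch of how the hook behaves today for `nn.Module`s (class names are illustrative):
```python
import torch

class ScriptableImpl(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1

class FlexibleTransform(torch.nn.Module):
    def forward(self, *args, **kwargs):
        # eager path: free to use *args/**kwargs and dynamic dispatch
        return args[0] + 1

    def __prepare_scriptable__(self):
        # torch.jit.script(FlexibleTransform()) scripts this object instead
        return ScriptableImpl()

scripted = torch.jit.script(FlexibleTransform())  # actually scripts ScriptableImpl
```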

Unfortunately, this escape hatch is only available for `nn.Module`'s

0cf918947d/torch/jit/_script.py (L1279-L1283)

This was fine for the example above since we were subclassing from `nn.Module` anyway. However, we recently also hit a case [where this wasn't the case](https://github.com/pytorch/vision/pull/7747#issuecomment-1642045479).

Given the frozen state on JIT, would it be possible to give us a general escape hatch so that we can move forward with the design unconstrained while still keeping BC?

This PR implements just this by re-using the `__prepare_scriptable__` hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106229
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-08-11 18:24:46 +00:00
71be8f2223 Revert "Add initial support for FP8 ONNX export (#106379)"
This reverts commit 08704f96f08da5a52f65a7c3001d6ce4aae0102e.

Reverted https://github.com/pytorch/pytorch/pull/106379 on behalf of https://github.com/kit1980 due to breaking multiple internal builds ([comment](https://github.com/pytorch/pytorch/pull/106379#issuecomment-1675192700))
2023-08-11 18:22:35 +00:00
e18ca4028b [indcutor] add one triton config for reduction (#106925)
This config found by coordinate descent tuning improves kernel https://gist.github.com/shunting314/189a8ef69f90db9d614a823385147a72 from
- 10.008ms    5.993GB    598.83GB/s
to
- 6.170ms    5.993GB    971.28GB/s .

It should only affect reduction with hint ReductionHint.DEFAULT or when max autotune is enabled.

(It's funny that before I upgraded my triton version, the improvement was from 9.076ms -> 5.692ms.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106925
Approved by: https://github.com/jansel
2023-08-11 17:15:03 +00:00
6696a75ea8 [inductor] make thread order consistent with loop order (#106827)
I found that for a tiled kernel for a tensor with shape [a, b], we map 'a' to XBLOCK and 'b' to YBLOCK. However, 'a' should actually be the outer loop while 'b' corresponds to the inner loop. This order is picked by our loop ordering algorithm. Mapping 'a' to XBLOCK has semantics like assigning 'a' to the inner loop instead.

For a simple 'A + B.t()' kernel, making the loop order consistent brings a 1.027x speedup (1.938ms -> 1.887ms). Here are the dumps of the kernels:

- before fix: https://gist.github.com/shunting314/4dacf73cf495cdd7e84dede7c3e0872d
- after fix (this one is done manually): https://gist.github.com/shunting314/441e8839d24e1878c313e539b1ebd551

I tried this on DistillGPT2 and found perf is neutral. But that's because DistillGPT2 has only a single tiled pointwise kernel in its backward graph. Will check the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106827
Approved by: https://github.com/jansel
2023-08-11 17:05:21 +00:00
745d29b0cc Revert "[export] Refactor constrain_as_value and constrain_as_size (#106591)"
This reverts commit 18989890bfc4d74dbf4a175d425b5b291e09cb8b.

Reverted https://github.com/pytorch/pytorch/pull/106591 on behalf of https://github.com/izaitsevfb due to Breaks inductor test on trunk ([comment](https://github.com/pytorch/pytorch/pull/106591#issuecomment-1675069091))
2023-08-11 16:37:47 +00:00
0b05aef8d0 Add ONNX export support for huggingface's bigscience/bloom-560m model (#106930)
Port fix from https://github.com/huggingface/safetensors/pull/318 into ONNX exporter until it is merged

* This adds support for safetensors to be loaded within a FakeTensorMode, which results in creating `torch.empty((shape,), dtype=)`. This is done through a monkeypatch for the in-progress https://github.com/huggingface/safetensors/pull/318
* Adds a test for the HF bloom model (bigscience/bloom-560m)
* This PR also fixes existing fake tensor unit tests by moving the `torch.onnx.dynamo_export` to be inside the `enable_fake_mode()` context. Although calling `torch.onnx._dynamo_export` works for several models, the right way of using fake mode is calling the exporter within the context manager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106930
Approved by: https://github.com/BowenBao
2023-08-11 16:34:24 +00:00
9f26503bf0 SymInt'ify tile (#106933)
When auditing before I was deceived by the argument name "dims". Actually, this is saying how many times to replicate each dim, so definitely can be symbolic.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106933
Approved by: https://github.com/nkaretnikov, https://github.com/lezcano
2023-08-11 15:28:28 +00:00
a5d841ef01 asarray: take the default device into consideration. (#106779)
Fix: #106773

This PR makes it so `asarray` takes the default device into consideration when called with
a Python sequence as the data.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106779
Approved by: https://github.com/rgommers, https://github.com/lezcano
2023-08-11 13:16:42 +00:00
171341ee65 Support complex inputs in nan_to_num (#106944)
Fixes #105462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106944
Approved by: https://github.com/lezcano
2023-08-11 09:15:57 +00:00
7db6eb7156 [test_nn] add custom device support for dropout tests、lazy_modules te… (#106609)
add custom device support for dropout tests, lazy_modules tests, and multihead_attention tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106609
Approved by: https://github.com/mikaylagawarecki
2023-08-11 09:14:34 +00:00
03414081ff adding mixed_dtype_mm to torch._inductor (#106443)
Summary: if torch._inductor.config.use_mixed_mm is set, then we can convert
torch.mm(a, b.to(some_dtype)) into a triton kernel where the cast of b
is fused into the matmul rather than needing to instantiate the cast b
tensor. If use_mixed_mm is set, this fused kernel will be autotuned
against the default two-kernel fallback option. If force_mixed_mm is set, the
fused kernel will always be used. The force option is needed for weight-only quantization, where we are in
some cases relying on the superior memory characteristics of the fused
kernel rather than the perf numbers (when we can't afford to fill memory
with a tensor 4x the size of our quantized one).
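
A minimal usage sketch of the described config flag (flag name taken from the summary; exact defaults and autotuning behavior may differ):
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.use_mixed_mm = True  # allow the fused cast+matmul kernel

@torch.compile
def mixed_mm(a: torch.Tensor, b_int8: torch.Tensor) -> torch.Tensor:
    # the pattern targeted by the fusion: mm with an upcast of the second operand
    return torch.mm(a, b_int8.to(a.dtype))

a = torch.randn(64, 128, device="cuda")
b_int8 = torch.randint(-128, 127, (128, 32), dtype=torch.int8, device="cuda")
out = mixed_mm(a, b_int8)
```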

Test Plan: python test/inductor/test_pattern_matcher.py -k "mixed_mm"

python test/inductor/test_torchinductor.py -k "mixed_mm"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106443
Approved by: https://github.com/jansel
2023-08-11 05:34:54 +00:00
18989890bf [export] Refactor constrain_as_value and constrain_as_size (#106591)
Some notable changes:
1. `constrain_as_size` allows the min value to be less than 2, as it will unconditionally assume min >= 2 for compiler purposes. Instead, we add an additional check to make sure the max value is always greater than 2.
2. Previously, we used to runtime assert on the unbacked symint's value range, which would always be between [2, max]. I modified this logic to assert on [0, max] unless the user explicitly specifies the min range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106591
Approved by: https://github.com/gmagogsfm, https://github.com/ezyang
2023-08-11 05:29:22 +00:00
df6aaf7bc2 inductor: fix compile error for inplace variable multi-defined (#106852)
When removing an inplace buffer, we just mark it as ```REMOVED```. If, after some inplace buffers have been removed, we then mark another buffer as an inplace buffer and use the length of ```self.inplace_buffers.values()``` to create its name, there may be an issue where we define an inplace buffer name that already exists in ```self.inplace_buffers.values()```:

Before removing some inplace buffers, ```self.inplace_buffers``` may look like:

```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf7': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf9': InplacedBuffer(inner_name='in_out_ptr1', other_names=['buf5', 'buf7', 'buf9']), 'buf12': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf13': InplacedBuffer(inner_name='in_out_ptr2', other_names=['buf12', 'buf13']), 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf25': InplacedBuffer(inner_name='in_out_ptr4', other_names=['buf21', 'buf25']), 'buf20': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf26': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf31': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32']), 'buf32': InplacedBuffer(inner_name='in_out_ptr5', other_names=['buf20', 'buf26', 'buf31', 'buf32'])}
```
After removing some inplace buffers, ```self.inplace_buffers``` may look like:

```
{'buf0': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf2': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf4': InplacedBuffer(inner_name='in_out_ptr0', other_names=['buf0', 'buf2', 'buf4']), 'buf5': 'REMOVED', 'buf7': 'REMOVED', 'buf9': 'REMOVED', 'buf12': 'REMOVED', 'buf13': 'REMOVED', 'buf17': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf19': InplacedBuffer(inner_name='in_out_ptr3', other_names=['buf17', 'buf19']), 'buf21': 'REMOVED', 'buf25': 'REMOVED', 'buf20': 'REMOVED', 'buf26': 'REMOVED', 'buf31': 'REMOVED', 'buf32': 'REMOVED', 'buf16': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38']), 'buf38': InplacedBuffer(inner_name='in_out_ptr6', other_names=['buf16', 'buf38'])}
```
And then if we mark some buffer as an inplace buffer and its name uses ```in_out_ptr{len(unique(self.inplace_buffers.values()))}```, the buffer name may be ```in_out_ptr6``` even though this name already exists in ```self.inplace_buffers```.

After this PR, we change ```REMOVED``` to ```REMOVED{1, 2, 3, ...}```, which avoids defining a duplicate name. With this change, ```pyhpc_equation_of_state``` from ```torchbench``` works for the CPU backend:

```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/torchbench.py --performance --inference --float32 -dcpu -n50 --inductor --freezing --no-skip --dashboard --only pyhpc_equation_of_state  --cold_start_latency```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106852
Approved by: https://github.com/lezcano
2023-08-11 04:06:58 +00:00
7460adf7f3 Causing internal clang tidy to error (#106895)
Summary:
This was causing an error with clang tidy for internal version of PyTorch:
https://www.internalfb.com/diff/D47044755?dst_version_fbid=1190859734932683&transaction_fbid=1553883761684581

Test Plan: See Summary

Reviewed By: dmpolukhin

Differential Revision: D48202402

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106895
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-08-11 03:54:05 +00:00
71a336ef75 [Dynamo x FSDP][1/x] Builder support for deque, appendleft (#106884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106884
Approved by: https://github.com/ezyang
2023-08-11 03:26:12 +00:00
4df84c3b4d make sure mkldnn convolution given same stride as ref path for nc11 contiguous input (#106966)
On SPR machines, the mkldnn bfloat16 convolution always returns a channels-last output, and we convert it to channels-first if the input and weight are channels-first. There is an issue with such a conversion when the output is nc11 (4*512*1*1): we always mark it as a public-format ideep tensor, and even when calling ```to_dense``` before returning the output, the output's stride is still a channels-last stride (512, 1, 512, 512). This PR calls ```resize_``` to make sure the stride is a contiguous stride.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106966
Approved by: https://github.com/mingfeima
2023-08-11 00:59:48 +00:00
a9dca53438 NumPy support in torch.compile (#106211)
RFC: https://github.com/pytorch/rfcs/pull/54
First commit is the contents of https://github.com/Quansight-Labs/numpy_pytorch_interop/

We have already been using this in core for the last few months as an external dependency. This PR pulls all of it into core.
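
For instance, with this support NumPy code inside a compiled function can be traced instead of falling back to eager (a small illustrative example, not taken from the PR's test suite):
```python
import numpy as np
import torch

@torch.compile
def numpy_fn(x: np.ndarray) -> np.ndarray:
    # NumPy calls are traced through the torch_np compatibility layer
    return np.sin(x) ** 2 + np.cos(x) ** 2

print(numpy_fn(np.ones(4)))  # ~array([1., 1., 1., 1.])
```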

In the next commits, I do a number of things in this order
- Fix a few small issues
- Make the tests that this PR adds pass
- Bend backwards until lintrunner passes
- Remove the optional dependency on `torch_np` and simply rely on the upstreamed code
- Fix a number of dynamo tests that were passing before (they were not testing anything, I think) and are not passing now.

Missing from this PR (but not blocking):
- Have a flag that deactivates tracing NumPy functions and simply breaks. There used to be one, but it stopped working after the merge and I removed it. @lezcano to investigate.
- https://github.com/pytorch/pytorch/pull/106431#issuecomment-1667079543. @voznesenskym to submit a fix after we merge.

All the tests in `tests/torch_np` take about 75s to run.

This was a work by @ev-br, @rgommers @honno and I. I did not create this PR via ghstack (which would have been convenient) as this is a collaboration, and ghstack doesn't allow for shared contributions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106211
Approved by: https://github.com/ezyang
2023-08-11 00:39:32 +00:00
8f774330af [inductor] Use shape env bounds in inductor bounds.py (#106175) (#106568)
Summary: If constrained ranges are available, use them in bounds.py before value range analysis (to enable Int64 -> Int32 optimization).

Test Plan: New unit test in test_torchinductor.py to mark a tensor as dynamic, then constrain with constrain_as_size (as outlined in https://github.com/pytorch/pytorch/issues/106175)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106568
Approved by: https://github.com/eellison, https://github.com/lezcano
2023-08-11 00:16:09 +00:00
62b3018024 [Vulkan] Introduce GPU Memory Layout qualifier (#106978)
Summary:
Introduce a GPU memory Layout qualifier in `vTensor`, which will allow more efficient memory layouts when storing Tensors on the GPU.

The plan is for shaders to use the memory layout qualifier to convert between logical tensor coordinates and physical texel positions.

Test Plan:
As-is, this diff should be a no-op. Run standard tests to make sure everything works as expected.

```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

```

Reviewed By: kimishpatel

Differential Revision: D48129905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106978
Approved by: https://github.com/liuk22
2023-08-10 23:45:54 +00:00
8c8477e55a Add _unsafe_index decomp (#106814)
Summary:
Redirect `aten._unsafe_index` to `aten.index` through a decomposition.

Also add it to the list of core decompositions.
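
A hedged sketch of what such a decomposition can look like (the decorator usage and the private registry are illustrative assumptions, not the PR's exact code):

```python
# Sketch only: register a decomposition that forwards _unsafe_index to index.
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten
my_decomp_table = {}  # use a private registry so this sketch doesn't clash with builtins

@register_decomposition(aten._unsafe_index.Tensor, registry=my_decomp_table)
def _unsafe_index_decomp(self, indices):
    return aten.index.Tensor(self, indices)
```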

Test Plan: contbuild and OSS CI (similar to D40075277)

Differential Revision: D48163393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106814
Approved by: https://github.com/SherlockNoMad
2023-08-10 23:23:37 +00:00
152203d3c3 [pytorch][ao] Add torch.matmul in FloatFunctional/QFunctional (#106831)
Summary: As title
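
A hedged usage sketch (assuming the new method mirrors the existing FloatFunctional.add/mul API):

```python
# Sketch only: matmul routed through FloatFunctional so its output can be
# observed for quantization, analogous to ff.add / ff.mul.
import torch
from torch.ao.nn.quantized import FloatFunctional

ff = FloatFunctional()
a, b = torch.randn(2, 3), torch.randn(3, 4)
out = ff.matmul(a, b)
print(out.shape)  # torch.Size([2, 4])
```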

Test Plan: new unit tests

Differential Revision: D48172841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106831
Approved by: https://github.com/jerryzh168
2023-08-10 22:43:36 +00:00
dfb1b95919 [caffe2] Add enforce inside ScatterAssignOp (#106882)
Summary: Adding an enforce gives better error information than raising SIGFPE when division by zero happens. We'll get the actual BlobRef names as well as the error categories.

Test Plan:
Ran a local worker and client using DPP session with empty tensors and checked the error:

`../buck-out/v2/gen/fbcode/data_preproc/perf_test/client --sr2_event_base_pool_size=24`

`../buck-out/v2/gen/fbcode/data_preproc/perf_test/worker --dpp_session_id=5D49F56C98CC95BD97027BC0DDB38D8F`

```{dpp_internal_errorcategory : user_error,
ONCALL : MLDP_CONTROL,
CATEGORY : INPUT_ERROR,
errorsubsystemtags : [DPP_WORKER],
errorcause : USER_ERROR,
RETRYABILITY : 0}F0806 17:47:52.607200 2280375 SchedRuntimeEnv.cpp:385] facebook::data_preproc::NonRetryableGenericUser
Error: User preprocessing error c10::Error: [enforce fail at utility_ops.h:730] input.numel() > 0. 0 vs 0. tensor has t
o be nonempty (Error from operator:
input: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_
features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCOR
ELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Concat:0" input:
"preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_feature
s_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_E
NCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/Mul_2" input: "preproc_d
ata_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processo
r_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST_ENCODED_1/sequential_1019/id_score_list_quantization_decode_1/encoded_id_lengths" output: "preproc_data_pipeline/preproc/features/default_feature_preproc/normalization/dper_feature_normalization/sparse_features_processor_1/sparse_feature_transform/F3_ADFINDER_USER_ADS_COFFEE_LSF_FLEXIBLE_BATCH_USER_FB_UIP_FEATURE_IDSCORELIST_ENCODED_FB_UIP_TOP100_IDSCORELIST```

Differential Revision: D48104430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106882
Approved by: https://github.com/kit1980
2023-08-10 21:46:13 +00:00
aef27bdbe7 Reword release version: major->minor in README.md (#106980)
Correct wording to reflect reality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106980
Approved by: https://github.com/huydhn, https://github.com/albanD
2023-08-10 21:32:30 +00:00
a62de2d5ec [inductor] Enable multilayer reductions with dynamic shapes (#106747)
Currently multilayer reduction (aka split reductions) are only used with static
shapes which results in worse performance and accuracy when dynamic shapes are
enabled. Instead, this only requires that the shape has a hint value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106747
Approved by: https://github.com/lezcano
ghstack dependencies: #106626, #106870
2023-08-10 21:07:25 +00:00
fa65df3745 [inductor] Type triton size arguments in the kernel index_dtype (#106870)
`JITFunction._key_of` uses the value of the argument to distinguish between
i32 and i64, but this fails if the value is used in indexing calculations where
the value exceeds `INT_MAX`.

Instead, we should use `index_dtype` which means all indexing calculations are
performed in the same dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106870
Approved by: https://github.com/lezcano
ghstack dependencies: #106626
2023-08-10 21:07:25 +00:00
3b2cb459fc [inductor] Fix reference_as_float gradcheck (#106626)
When `reference_as_float` is true, reference gradients will not have the same
dtype as the actual computed gradients. This fixes the issue by downcasting
before doing the comparison.
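
The idea in a tiny, illustrative form (not the test suite's actual code):

```python
# Sketch only: cast the fp32 reference gradient down to the actual gradient's
# dtype before comparing, so the dtypes match.
import torch

actual_grad = torch.randn(4, dtype=torch.bfloat16)  # gradient the kernel produced
ref_grad = actual_grad.to(torch.float32)             # fp32 reference gradient
torch.testing.assert_close(ref_grad.to(actual_grad.dtype), actual_grad)
```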

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106626
Approved by: https://github.com/lezcano
2023-08-10 21:07:25 +00:00
02abbb8109 Fix some typos, mostly "that that" (#106901)
Fix some typos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106901
Approved by: https://github.com/janeyx99
2023-08-10 19:46:53 +00:00
7b94d93431 [FSDP] Fix train -> EMA -> eval with mixed precision (#106858)
This fixes a pretty vicious bug relating to `SHARD_GRAD_OP`, mixed precision, EMA, and eval.

**Bug Explanation**
The model has a main module and an EMA module, where the main module is used for training and the EMA module is used for eval. The model has FSDP's fp16 mixed precision enabled. The flow consists of (1) training forward/backward/optimizer -> (2) EMA update (copy main module to EMA module) -> (3) eval forward in `torch.no_grad()`, where this repeats for many iterations.

Consider the _second_ iteration.
- From the first iteration's eval forward, the EMA module has the fp16 unsharded parameters in memory (not freed due to `SHARD_GRAD_OP`).
- In this second iteration's step (2), we perform the EMA update under the `summon_full_params()` context, where FSDP specially forces full precision.  This means that the EMA module now uses fp32 unsharded parameters, distinct from the fp16 unsharded parameters still in memory. The EMA update modifies those fp32 parameters, and upon exiting the context, FSDP correctly writes the modifications back to the fp32 sharded parameters.
- In the second iteration's step (3) (eval forward), FSDP checks whether it needs to run the unshard op (including all-gather) but sees it does not since the fp16 unsharded parameters are still in memory. Thus, FSDP uses those fp16 unsharded parameters directly without all-gather. However, these fp16 unsharded parameters are stale and do not include the EMA update!
- In other words, at this point, the fp32 sharded parameters are correct, the fp16 unsharded parameters are stale, and FSDP chooses _not_ to re-all-gather since the fp16 unsharded parameters are in memory.

**Fix Explanation**
This PR fixes this by freeing the fp16 unsharded parameters if they are still allocated when forcing full precision, i.e. using fp32 unsharded parameters in `summon_full_params()`. This ensures that any modifications written back to the fp32 sharded parameters will be persisted via the next all-gather.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106858
Approved by: https://github.com/kumpera
ghstack dependencies: #106857
2023-08-10 19:32:43 +00:00
79449e6272 [quant][pt2e][fix] Remove the requirement of using no_grad for reference model that contains quantized conv2d (#106924)
Summary:
att

We don't actually need the gradient for conv2d, we just need it to run without error, so we delay the out_dtype gradient error to the time when the user actually requests it.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_conv2d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106924
Approved by: https://github.com/zou3519, https://github.com/kimishpatel
2023-08-10 19:16:10 +00:00
1afbc985fe Make RNGStateTracker support cuda-like device (#106771)
Replace `CudaRNGStateTracker` with `RNGStateTracker` by rewriting some CUDA-binding code to use `device_handle`.
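
A minimal sketch of the device-handle pattern this refers to (names are illustrative, not the PR's code):

```python
# Sketch only: look up the device module ("device handle") by its type string
# instead of hard-coding torch.cuda, so cuda-like backends can reuse the code.
import torch

def get_device_handle(device_type: str = "cuda"):
    return getattr(torch, device_type)

handle = get_device_handle("cuda")
if handle.is_available():
    rng_state = handle.get_rng_state()
```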

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106771
Approved by: https://github.com/wanchaol
2023-08-10 19:14:33 +00:00
bb6b157458 Fix IndexKernel.cu build (#104423)
Fixes `signed-unsigned comparison` warnings introduced by https://github.com/pytorch/pytorch/pull/106809 (previously by  <s> https://github.com/pytorch/pytorch/pull/104054 </s> ) that changed type of `num_indices` to unsigned.

Before the change warnings looks as follows:
```
/tmp/tmpxft_00194ca7_00000000-6_IndexKernel.cudafe1.stub.c:31:580:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:58:63: warning: comparison of integer expressions of different signedness: ‘const long unsigned int’ and ‘int’ [-Wsign-compare]
   58 |   AT_ASSERT(num_indices == iter.ntensors() - 2);
      |                                                               ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:74:19: warning: comparison of integer expressions of different signedness: ‘int’ and ‘const long unsigned int’ [-Wsign-compare]
   74 |   for (int i = 0; i < num_indices; i++) {
      |                 ~~^~~~~~~~~~~~~
```
TODO: Turn those warning into errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104423
Approved by: https://github.com/Skylion007
2023-08-10 18:32:47 +00:00
393e9eed90 [inductor] modify index_reduce to pass opinfo tests (#106429)
1. add a python meta registration, to fix an issue with the forward pass. The problem was that previously, the C++ meta registration calls [numel()](7b14a14e27/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L329)) which fails (LMK if it's better to fix the C++ implementation to not do this check)
2. Modify the backward to fix an issue there. The backward is not a custom op - it's a custom manual backward implementation. In particular, there are some situations that don't support double backward; the check for whether double backward is allowed requires a .item() call. To fix the meta/fake tensor case, this PR will avoid setting the double backward error only if `GradMode::is_enabled()` - which shouldn't be turned on in PT2.
3. Update skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106429
Approved by: https://github.com/zou3519
2023-08-10 18:14:00 +00:00
a14d99bb6c Close non existent disable issues complete rollout (#106923)
Follow-up to https://github.com/pytorch/pytorch/pull/105096.
It seems fine; anecdotally, I have seen some issues closed and they haven't been reopened.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106923
Approved by: https://github.com/huydhn
2023-08-10 16:48:14 +00:00
c0f80c6696 [forward-fix] Fix multigpu varying tensor optim tests (#106887)
Forward fixes https://github.com/pytorch/pytorch/pull/106615 by increasing tolerance in the test.

The capturable implementation for foreach simply varies due to a different order of operations when updating params. I had also attempted to compare against fp64 but that introduced more disparity in the other optimizer configs. It is worth trying the fp64 comparison at a later point, but let's get the test passing first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106887
Approved by: https://github.com/izaitsevfb
2023-08-10 16:35:38 +00:00
149e458846 [BE] RPC is missing RRef docs (#106902)
The current `RRef` class derives from `PyRRef`, which has all the method definitions and documentation, yet we don't see any of this in the current documentation:

<img width="891" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/62897766-a660-4846-97bf-182e4aa45079">

Changing to `:inherited-members:` so Sphinx can pick up these methods.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106902
Approved by: https://github.com/svekars
2023-08-10 16:26:27 +00:00
89fd1b8717 Upgrade all inductor workflows to CUDA 12.1 / gcc9 (#106876)
gcc7 is too old to build fbgemm

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106876
Approved by: https://github.com/msaroufim, https://github.com/anijain2305
ghstack dependencies: #106900
2023-08-10 15:02:20 +00:00
4d6a891baf Remove set_default_dtype from nn tests (#105775)
Part of #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105775
Approved by: https://github.com/ezyang
2023-08-10 14:56:13 +00:00
22bc08da29 inductor: remove conv_bn folding from pre_grad pass (#106686)
The freezing pass already supports conv+bn folding, so we don't need to do that in the pre_grad pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106686
Approved by: https://github.com/eellison
2023-08-10 12:25:05 +00:00
35a1913370 [inductor] Added affine_grid_generator decomposition (#104709)
Description:
- Added affine_grid_generator decomposition

Related to https://github.com/pytorch/pytorch/issues/104296

Fixes https://github.com/pytorch/pytorch/issues/105565

Perfs:
- speed-up on cuda with bilinear and nearest modes

```
Speed-up PR vs Nightly = ratio between columns "Compiled (2.1.0a0+git3ed904e) PR-afgg" and "Compiled (2.1.0a0+gitbcdd413) Nightly"

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cpu ------------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           7.467 (+-0.036)            |             11.905 (+-0.276)            |             13.391 (+-0.051)            |     1.125 (+-0.000)      |           7.343 (+-0.036)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           7.722 (+-0.168)            |             14.371 (+-0.035)            |             15.899 (+-0.038)            |     1.106 (+-0.000)      |           7.870 (+-0.043)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |           7.710 (+-0.051)            |             11.354 (+-0.053)            |             13.376 (+-0.045)            |     1.178 (+-0.000)      |           7.698 (+-0.061)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |           7.870 (+-0.050)            |             13.744 (+-0.237)            |             15.206 (+-0.102)            |     1.106 (+-0.000)      |           7.912 (+-0.039)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           4.738 (+-0.015)            |             4.508 (+-0.005)             |             6.566 (+-0.027)             |     1.456 (+-0.000)      |           4.630 (+-0.022)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           4.391 (+-0.010)            |             4.860 (+-0.390)             |             6.438 (+-0.047)             |     1.325 (+-0.000)      |           4.458 (+-0.010)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |           4.279 (+-0.008)            |             4.127 (+-0.010)             |             6.598 (+-0.709)             |     1.599 (+-0.000)      |           5.064 (+-0.025)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |           4.537 (+-0.010)            |             4.593 (+-0.006)             |             6.365 (+-0.104)             |     1.386 (+-0.000)      |           4.480 (+-0.011)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           26.411 (+-0.066)           |             62.275 (+-0.436)            |             64.486 (+-0.353)            |     1.035 (+-0.000)      |           26.210 (+-0.110)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           26.457 (+-0.096)           |             72.887 (+-0.247)            |             74.207 (+-0.337)            |     1.018 (+-0.000)      |           25.995 (+-0.120)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |           26.457 (+-0.086)           |             64.110 (+-0.233)            |             66.340 (+-0.406)            |     1.035 (+-0.000)      |           26.145 (+-0.085)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |           26.536 (+-0.094)           |             73.742 (+-0.483)            |             71.946 (+-0.460)            |     0.976 (+-0.000)      |           26.457 (+-0.166)

Times are in milliseconds (ms).

[------------------------------------------------------------------------------------------------------------------------------------ Affine grid sampling, cuda -----------------------------------------------------------------------------------------------------------------------------------]
                                                                                                          |  Eager (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git1afae24) PR-afgg  |  Compiled (2.1.0a0+git16df542) Nightly  |  speed-up PR vs Nightly  |  Eager (2.1.0a0+git16df542) Nightly
1 threads: ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bilinear   |           91.971 (+-0.253)           |             90.570 (+-0.193)            |            137.206 (+-0.214)            |     1.515 (+-0.000)      |           84.280 (+-0.241)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bilinear       |           91.893 (+-0.361)           |             89.866 (+-0.170)            |            136.678 (+-0.471)            |     1.521 (+-0.000)      |           84.573 (+-0.214)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bilinear  |          116.967 (+-0.481)           |            110.468 (+-0.326)            |            223.770 (+-0.334)            |     2.026 (+-0.000)      |          108.098 (+-0.392)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bilinear      |          117.563 (+-0.546)           |            111.438 (+-0.212)            |            223.101 (+-0.350)            |     2.002 (+-0.000)      |          108.225 (+-0.395)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=nearest    |           80.706 (+-0.289)           |             70.525 (+-0.204)            |            143.697 (+-0.311)            |     2.038 (+-0.000)      |           74.485 (+-0.258)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=nearest        |           80.955 (+-0.208)           |             69.986 (+-0.250)            |            143.658 (+-0.244)            |     2.053 (+-0.000)      |           74.163 (+-0.238)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=nearest   |          117.576 (+-0.435)           |             71.179 (+-0.412)            |            178.515 (+-0.539)            |     2.508 (+-0.000)      |          108.394 (+-0.473)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=nearest       |          117.441 (+-0.205)           |             70.313 (+-0.170)            |            178.664 (+-0.555)            |     2.541 (+-0.000)      |          108.098 (+-0.416)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=True, mode=bicubic    |           92.962 (+-0.509)           |            1740.964 (+-0.597)           |            1785.401 (+-0.369)           |     1.026 (+-0.000)      |           92.638 (+-0.539)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=True, mode=bicubic        |           92.928 (+-0.493)           |            1401.146 (+-0.732)           |            1453.229 (+-0.628)           |     1.037 (+-0.000)      |           92.458 (+-0.428)
      Input: (2, 3, 345, 456) torch.float32, torch.contiguous_format, align_corners=False, mode=bicubic   |          118.152 (+-0.442)           |            1740.644 (+-0.480)           |            1793.475 (+-0.458)           |     1.030 (+-0.000)      |          107.962 (+-0.548)
      Input: (2, 3, 345, 456) torch.float32, torch.channels_last, align_corners=False, mode=bicubic       |          118.182 (+-0.425)           |            1400.621 (+-0.624)           |            1461.796 (+-0.630)           |     1.044 (+-0.000)      |          107.894 (+-0.994)

Times are in microseconds (us).
```

[Source](https://raw.githubusercontent.com/vfdev-5/pth-inductor-dev/master/output/20230801-220216-affine-grid-sampler-PR-afgg-vs-Nightly-speedup.md), [script](https://github.com/vfdev-5/pth-inductor-dev/blob/master/perf_affine_grid_sampler.py)
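
For context, a minimal sketch of the kind of workload benchmarked above (settings are illustrative only):

```python
# Sketch only: affine_grid + grid_sample compiled with inductor, which now
# exercises the affine_grid_generator decomposition.
import torch
import torch.nn.functional as F

def affine_sample(img, theta):
    grid = F.affine_grid(theta, list(img.shape), align_corners=True)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

compiled = torch.compile(affine_sample)
img = torch.randn(2, 3, 345, 456)
theta = torch.eye(2, 3).unsqueeze(0).repeat(2, 1, 1)
out = compiled(img, theta)
print(out.shape)  # torch.Size([2, 3, 345, 456])
```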

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104709
Approved by: https://github.com/lezcano
2023-08-10 09:52:48 +00:00
bb2fcc7659 unify TEST_CUDA (#106685)
Fixes #ISSUE_NUMBER
As title, unify TEST_CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106685
Approved by: https://github.com/zou3519
2023-08-10 09:01:36 +00:00
2b560d3c3a Add _foreach_clamp (#106574)
Rel:
- #106221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106574
Approved by: https://github.com/janeyx99
2023-08-10 05:26:09 +00:00
3495f0c999 Generate mypy hints for torch.Tag, add a couple of pointwise ops (#106910)
Replaces https://github.com/pytorch/pytorch/pull/106739, since I had a bad CLA commit.

- adds clone, and convert_element_dtype to pointwise
- adds codegen for mypy hints of torch.Tag and removes existing ignores for them

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106910
Approved by: https://github.com/mlazos
2023-08-10 05:12:27 +00:00
606e3c297b conv-bn folding in low precision (#106576)
Batchnorm inference is done in fp32 if the inputs are in fp16/bf16, and the output is cast back down to its original precision. This causes the batchnorm weights to get constant-folded to fp32, which prevented Conv-BN folding from firing.
```
 def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
     convolution: bf16[3, 32, 15, 15] = aten..convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1);  arg6_1 = arg0_1 = None
     # weight upcasting
     convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32);  arg3_1 = None
     convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32);  arg4_1 = None
     ...
     # end of batch norm
     add_1: f32[3, 32, 15, 15] = aten..add.Tensor(mul_2, unsqueeze_7);  mul_2 = unsqueeze_7 = None
     # output downcast
     convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.bfloat16);  add_1 = None
```

I mark the convolutions that are followed by binary foldable ops in a higher precision which then get converted back down to the original conv dtype. We fold the weights in fp32 because it gives slightly better accuracy, then at the end of the pass convert the weights back to their original dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106576
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
2023-08-10 05:12:04 +00:00
4afab40b56 [quant][pt2e] Removed mean/hardtanh annotations and refactored adaptive_avg_pool annotation (#106805)
Summary:
Removed annotations for some ops, since they are handled in torch/ao/quantization/pt2e/_propagate_annotation.py

Test Plan:
CIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106805
Approved by: https://github.com/kimishpatel
2023-08-10 04:51:06 +00:00
dfd441a12c [BE] Use nested namespaces in torch/csrc/cuda (#106928)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 6b1dde1</samp>

> _`namespace` syntax_
> _Simplified with C++17_
> _Code is more readable_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106928
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb
2023-08-10 03:56:09 +00:00
e34a05b960 [ez][inductor][fx pass] strengthen numerical check for batch fusion (#106744)
Summary:
As title.
For batch fusion, we use a torch op to fuse, and the result should be exactly the same as the original ones.
pull request: https://github.com/pytorch/pytorch/pull/106731#issuecomment-1668662078

Test Plan:
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
File changed: fbsource//xplat/caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/cf14a2dd-faee-417a-8d26-0b9326c944e4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/6755399617159540
Network: Up: 0B  Down: 0B
Jobs completed: 12. Time elapsed: 2:55.5s.
Tests finished: Pass 4. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: dshi7

Differential Revision: D48132255

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106744
Approved by: https://github.com/kit1980
2023-08-10 03:49:23 +00:00
83b5027027 Enable Mypy Check in torch/_inductor/select_algorithm.py (#106701)
Fixes #105230 to enable mypy check.

```
$ mypy  torch/_inductor/select_algorithm.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106701
Approved by: https://github.com/eellison
2023-08-10 03:19:50 +00:00
8093349d42 Enable mypy check in torch/_inductor/fx_passes/post_grad.py (#106839)
Fixes #105230

```shell
$ lintrunner -a torch/_inductor/fx_passes/post_grad.py
  FLAKE8 success!
  CLANGFORMAT success!
  MYPYNOFOLLOW success!
  MYPY success!
  MYPYSTRICT success!
  CLANGTIDY success!
  TYPEIGNORE success!
  NOQA success!
  CIRCLECI success!
  SPACES success!
  NEWLINE success!
  CONSTEXPR success!
  NATIVEFUNCTIONS success!
  INCLUDE success!
  TABS success!
  PYBIND11_INCLUDE success!
  ERROR_PRONE_ISINSTANCE success!
  PYBIND11_SPECIALIZATION success!
  PYPIDEP success!
  RAWCUDA success!
  CUBINCLUDE success!
  EXEC success!
  RAWCUDADEVICE success!
  ROOT_LOGGING success!
  DEPLOY_DETECTION success!
  ACTIONLINT success!
  CALL_ONCE success!
  TESTOWNERS success!
  WORKFLOWSYNC success!
  CMAKE success!
  COPYRIGHT success!
  BAZEL_LINTER success!
  SHELLCHECK success!
  LINTRUNNER_VERSION success!
  UFMT success!
  ONCE_FLAG success!
  RUFF success!
ok No lint issues.
Successfully applied all patches.
```

```shell
$ mypy torch/_inductor/fx_passes/post_grad.py
Success: no issues found in 1 source file
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106839
Approved by: https://github.com/Skylion007
2023-08-10 03:11:00 +00:00
e93a90bdd5 [ONNX] Refactor perfect/nearest match criteria to allow optional inputs and disallow mismatch attributes (#106478)
Fix #106057, except **Attribute dtype mismatch. E.g., alpha of aten.add.Tensor. -> Attribute: alpha INT vs FLOAT**.

Summarized the change
* Fill in attribute defaults when `param_schema` is applied. This relaxes the matching on default attributes.
* Fill in None for optional inputs when `param_schema` is applied.
* Keep extra kwargs in attributes to make the matching strict.
* Allow an input to be None when its dtype is `optional[INPUT]`

The change comes with the guarantee from torchlib that an attribute will never be None. For example, if `memory_format` is needed, the function should specify it like this:
```python
@torch_op("aten::clone")
def aten_clone(
    self: TTensor, memory_format: str = ""  # pylint: disable=unused-argument
) -> TTensor:
    """clone(Tensor self, *, MemoryFormat? memory_format=None) -> Tensor"""

    return op.Identity(self)
```

Prior to this PR, opSchema matching didn't strictly guard the number of inputs/attributes when allowing a nearest match, which introduced the bug of dispatching `aten::div.Tensor` to `aten::div.default`, disregarding the fact that `aten::div.Tensor` has an extra attribute `rounding_mode`. This PR fixes the issue with new perfect/nearest-match logic. In particular, it strictly restricts the qualification for being a nearest-match candidate.

For each ONNX variant, we check these steps one by one:
1. Check if the number of inputs in the function signature matches the provided inputs.
2. Check if the attribute names in the function signature are the same set as the provided attributes.

If either of the above two criteria is not met, the ONNX variant is neither a perfect match nor a nearest-match candidate (match_score=None).

3. Check if input dtype matches
4. Check if attribute dtype matches

If 3 and 4 are met, then this is a perfect match; otherwise, it's still considered a nearest-match candidate with a matching score.

## Case Study

### Optional Input
The dispatcher recognizes optional inputs. However, the input can't be ignored. None must be provided.
```python
# Perfect match is found
inputs = (Tensor([2, 3]), None)
aten_op(X: TTensor, Y: Optional[INT64]):
    ...
```
Real Case: aten::convolution
NOTE: There are no optional attributes in torchlib, nor will there be.

### Different attributes
If an attribute is provided with value, it's a must to match the attribute in function signature.
```python
# Not perfect match, nor nearest match
inputs = (Tensor([2, 3]),)
attributes = {"a":1, "b":2}
aten_op(X: TTensor, a: int):
    ...
```
Real Case: aten::div and aten::div.Tensor_mode

### Default attribute
A default attribute will have its value filled into the inputs/attributes.
```python
# Perfect match is found
inputs = (Tensor([2, 3]),)
attributes = {}
aten_op(X: TTensor, a: int = 3):
    ...
```
Real case: aten::clone

### Ignore attribute with None value
The attributes with None value will be ignored in matching.
```python
# Perfect match
inputs = (Tensor([2, 3]),)
attributes = {"a": None}
aten_op(X: TTensor):
    ...

# Not perfect match, but eligible for nearest match
inputs = (Tensor([2, 3]),)
attributes = {"a": None}
aten_op(X: TTensor, a: int = 3):
    ...
```
Real case: aten::div and aten::div.Tensor_mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106478
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
2023-08-10 03:08:23 +00:00
4c1d8ab272 [vision hash update] update the pinned vision hash (#106926)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106926
Approved by: https://github.com/pytorchbot
2023-08-10 02:58:34 +00:00
9891c6aa15 [export] cleanup pass base. [2/n] (#106905)
Test Plan: CI

Differential Revision: D48004717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106905
Approved by: https://github.com/angelayi
2023-08-10 02:49:58 +00:00
08704f96f0 Add initial support for FP8 ONNX export (#106379)
Add support for ONNX_NAMESPACE::TensorProto_DataType_FLOAT8E5M2 and ONNX_NAMESPACE::TensorProto_DataType_FLOAT8E4M3FN to enable export of torch models that use FP8 (E4M3 and E5M2) to ONNX (opset 19)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106379
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi, https://github.com/malfet
2023-08-10 01:02:45 +00:00
526d93bba3 Add _onnx.pyi to ONNX merge rules (#106927)
Followup after https://github.com/pytorch/pytorch/pull/106379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106927
Approved by: https://github.com/izaitsevfb
2023-08-10 00:51:49 +00:00
b9ad7bc533 Don't run test/autograd/test_fallback.py in parallel (#106866)
Fixes https://github.com/pytorch/pytorch/issues/106754

This PR:
- moves test/autograd/test_fallback.py to test_autograd_fallback.py and
removes it from test_autograd.py (necessary for the next step)
- adds test_autograd_fallback.py to parallel test blocklist.
- lintrunner really wanted to make changes to the files, but other than
that, it is a move.

The problem is that we set a global option (the autograd fallback mode)
during these tests which may cause the tests to interfere with each
other.

Test Plan:
- python test/run_test.py -i test_autograd_fallback

NOTE to diff train oncall:
- You'll also need to modify the test/autograd/test_fallback.py TARGET in
caffe2/test/TARGETS since we renamed the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106866
Approved by: https://github.com/soulitzer
2023-08-10 00:26:23 +00:00
0b57581dec [pytorch] Disable fast path in MultiheadAttention in Export (#106824)
Summary:
We are seeing that the `aten._native_multi_head_attention` op (not in the core ATen op set) is left in the exported graph and causes problems downstream at runtime.

Two proposed solutions:
 1. Disable the fast path while tracing to leverage the non-optimized path and get the decomposition; that way, the blamed op won't show up in the exported graph
 2. Add a decomp rule for `aten._native_multi_head_attention`

After discussing with kimishpatel and bdhirsh, #1 is preferred, and it is verified that it could immediately unblock the critical model enablement work for PP.

Test Plan: CI

Differential Revision: D48169806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106824
Approved by: https://github.com/kimishpatel
2023-08-10 00:18:37 +00:00
7f9d1cacca [export] Minor fixes to contrain_as_size (#106737)
Fixed some minor issues with the constraint APIs while helping enable another model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106737
Approved by: https://github.com/tugsbayasgalan
2023-08-10 00:13:08 +00:00
99a10da295 [Dynamo] a dyanmo backend based on ONNXRuntime (#106589)
This PR migrates the dynamo backend developed under ONNXRuntime into PyTorch. The ultimate goal is to replace the legacy `onnxrt` backend in dynamo with the dynamo compiler from the ONNXRuntime team.
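
Hedged usage sketch (the legacy backend name `onnxrt` is assumed here, and onnxruntime must be installed):

```python
# Sketch only: select the ONNX Runtime-based compiler through torch.compile.
import torch

def f(x):
    return torch.relu(x) + 1.0

compiled = torch.compile(f, backend="onnxrt")
print(compiled(torch.randn(4)))
```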

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106589
Approved by: https://github.com/abock, https://github.com/thiagocrepaldi
2023-08-10 00:09:19 +00:00
4dc66a4b5c [BE] fix type iteration typo in test_lrscheduler.py (#106908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106908
Approved by: https://github.com/clee2000, https://github.com/soulitzer
2023-08-09 23:56:06 +00:00
7b3d50e4cc [Pytorch][Vulkan] Set global and local sizes for image->bool copy (#106752)
Summary:
1. Add bool to quantized flow
2. Add support for cases where channel is *not* a multiple of 4 to the shader `image_to_nchw_quantized_mul4.glsl`. Note that the `mul4` in the shader name refers to height * width % 4 == 0.

Add test cases.

See: D48082479

Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*copy_to_texture_bool*"

Downloaded 1/3 artifacts, 1.74 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 14.4 sec (100%) 474/474 jobs, 3/474 updated
  Total time: 14.4 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *copy_to_texture_bool*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.copy_to_texture_bool_mul4_hw
VUID-VkDeviceCreateInfo-pProperties-04451(ERROR / SPEC): msgNum: 976972960 - Validation Error: [ VUID-VkDeviceCreateInfo-pProperties-04451 ] Object 0: handle = 0x10bf61020, type = VK_OBJECT_TYPE_PHYSICAL_DEVICE; | MessageID = 0x3a3b6ca0 | vkCreateDevice: VK_KHR_portability_subset must be enabled because physical device VkPhysicalDevice 0x10bf61020[] supports it The Vulkan spec states: If the [VK_KHR_portability_subset] extension is included in pProperties of vkEnumerateDeviceExtensionProperties, ppEnabledExtensions must include "VK_KHR_portability_subset". (https://vulkan.lunarg.com/doc/view/1.2.182.0/mac/1.2-extensions/vkspec.html#VUID-VkDeviceCreateInfo-pProperties-04451)
    Objects: 1
        [0] 0x10bf61020, type: 2, name: NULL
[       OK ] VulkanAPITest.copy_to_texture_bool_mul4_hw (114 ms)
[ RUN      ] VulkanAPITest.copy_to_texture_bool_mul4_chw
[       OK ] VulkanAPITest.copy_to_texture_bool_mul4_chw (4 ms)
[ RUN      ] VulkanAPITest.copy_to_texture_bool
[       OK ] VulkanAPITest.copy_to_texture_bool (7 ms)
[----------] 3 tests from VulkanAPITest (126 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (127 ms total)
[  PASSED  ] 3 tests.

```

All tests:
```
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 331 tests from VulkanAPITest (7327 ms total)

[----------] Global test environment tear-down
[==========] 331 tests from 1 test suite ran. (7327 ms total)
[  PASSED  ] 330 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
Quantized tests:
```
[----------] 63 tests from VulkanAPITest (2009 ms total)

[----------] Global test environment tear-down
[==========] 63 tests from 1 test suite ran. (2009 ms total)
[  PASSED  ] 63 tests.

  YOU HAVE 8 DISABLED TESTS
```

Differential Revision: D48086455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106752
Approved by: https://github.com/SS-JIA
2023-08-09 23:37:13 +00:00
eefe06ef56 [BE] Move common logic into cublasCommonArgs (#106842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106842
Approved by: https://github.com/vkuzo
ghstack dependencies: #106843
2023-08-09 23:30:15 +00:00
4bc846c101 [FSDP] Ignore buffer type casting in ignored modules (#106766)
issue resolved: https://github.com/pytorch/pytorch/issues/97791

Before this PR, mixed_precision applied to buffers from ignored modules; see `test_state_dict_with_ignored_modules(mixed_precision=True)` for a repro.

After this PR, we avoid applying mixed_precision semantics to buffers from ignored modules:
* step 1 initialization: state._ignored_buffer_names contains all the buffers from ignored modules
* step 2 lazy init at runtime: skip ignored buffers in ```_get_buffers_and_dtypes_for_computation```
* step 3 skip upcasting in state_dict hook: avoid upcasting for ignored buffers in ```_get_buffers_and_dtypes_for_computation```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106766
Approved by: https://github.com/awgu
2023-08-09 23:09:43 +00:00
97ce979e5d [quant][pt2e] Add reference representation for quantized conv2d (#105784)
Summary:
Implementing reference representation for quantized ops we decided in https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_quantize_dequantize_per_channel

Although right now it is not really testing things since there is some problem with dynamo export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105784
Approved by: https://github.com/kimishpatel
ghstack dependencies: #105783
2023-08-09 22:41:35 +00:00
02e4415315 Attempt to pin opencv-python version (#106900)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106900
Approved by: https://github.com/voznesenskym, https://github.com/huydhn, https://github.com/malfet
2023-08-09 22:38:16 +00:00
787d5259fa Include fused nodes' debug_str in FusedSchedulerNode::debug_str_extra (#106356)
Currently, there's no way to print the debug information of fused scheduler nodes. I'm adding this to inspect the individual nodes' IR type (e.g. ComputedBuffer), but I'm not sure whether this would be useful for more use cases.

FusedSchedulerNode::debug_str_extra only prints its fused nodes' names
```
# calling .debug_str() on a FusedSchedulerNode
buf0_buf1: FusedSchedulerNode(NoneType)
buf0_buf1.writes = [MemoryDep('buf0', c0, {c0: 10}), MemoryDep('buf1', c0, {c0: 10})]
buf0_buf1.unmet_dependencies = []
buf0_buf1.met_dependencies = [MemoryDep('arg0_1', c0, {c0: 100}), MemoryDep('arg1_1', c0, {c0: 10})]
buf0_buf1.users = None
buf0_buf1.snodes = ['buf0', 'buf1']
```

This PR adds support to print the fused nodes' debug_str
```
buf0_buf1: FusedSchedulerNode(NoneType)
buf0_buf1.writes = [MemoryDep('buf0', c0, {c0: 10}), MemoryDep('buf1', c0, {c0: 10})]
buf0_buf1.unmet_dependencies = []
buf0_buf1.met_dependencies = [MemoryDep('arg0_1', c0, {c0: 100}), MemoryDep('arg1_1', c0, {c0: 10})]
buf0_buf1.users = None
    buf0_buf1.snodes[0] =
    buf0: SchedulerNode(ComputedBuffer)
    buf0.writes = [MemoryDep('buf0', c0, {c0: 10})]
    buf0.unmet_dependencies = []
    buf0.met_dependencies = [MemoryDep('arg0_1', c0, {c0: 100})]
    buf0.users = [NodeUser(node=SchedulerNode(name='buf1'), can_inplace=True)]
    buf0.group.device = cuda:0
    buf0.group.iteration = (10, 10)
    buf0.sizes = ([10], [10])
    class buf0_loop_body:
        var_ranges = {z0: 10, z1: 10}
        index0 = 10*z0 + z1
        index1 = z0
        def body(self, ops):
            get_index = self.get_index('index0')
            load = ops.load('arg0_1', get_index)
            reduction = ops.reduction(torch.float32, torch.float32, 'sum', load)
            get_index_1 = self.get_index('index1')
            store_reduction = ops.store_reduction('buf0', get_index_1, reduction)
            return store_reduction
    buf0_buf1.snodes[1] =
    buf1: SchedulerNode(ComputedBuffer)
    buf1.writes = [MemoryDep('buf1', c0, {c0: 10})]
    buf1.unmet_dependencies = [MemoryDep('buf0', c0, {c0: 10})]
    buf1.met_dependencies = [MemoryDep('arg1_1', c0, {c0: 10})]
    buf1.users = [NodeUser(node=OUTPUT, can_inplace=False)]
    buf1.group.device = cuda:0
    buf1.group.iteration = (10, 1)
    buf1.sizes = ([10], [])
    class buf1_loop_body:
        var_ranges = {z0: 10}
        index0 = z0
        def body(self, ops):
            get_index = self.get_index('index0')
            load = ops.load('arg1_1', get_index)
            cos = ops.cos(load)
            get_index_1 = self.get_index('index0')
            load_1 = ops.load('buf0', get_index_1)
            add = ops.add(cos, load_1)
            get_index_2 = self.get_index('index0')
            store = ops.store('buf1', get_index_2, add, None)
            return store
```

I'm assuming that a FusedSchedulerNode cannot itself be fused, i.e. FusedSchedulerNode::snodes cannot contain any FusedSchedulerNode.

# Tests
Changes were tested ad hoc by printing debug_str in GraphLowering::count_bytes and running `python3 test/inductor/test_perf.py -k test_fusion_choice3`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106356
Approved by: https://github.com/peterbell10
2023-08-09 21:19:07 +00:00
77acb04a00 [dynamo] Readability - Rename name to get_frame_name (#106880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106880
Approved by: https://github.com/jansel
ghstack dependencies: #106878
2023-08-09 21:15:41 +00:00
8aca724312 [dynamo] use cache size to detect recompilation (#106878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106878
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/mlazos
2023-08-09 21:15:40 +00:00
c2ddb71aba Add F8 BLAS data types conversion (#106843)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 54e843d</samp>

> _`Float8` types added_
> _to switch on `ScalarType` -_
> _CUDA version checked._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106843
Approved by: https://github.com/Skylion007
2023-08-09 20:49:40 +00:00
0b88007540 Adding release compatibility matrix for release 2.1 (#106891)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 3c5a179</samp>

Update `RELEASE.md` with compatibility information for PyTorch 2.1. This file documents the supported versions of Python, CUDA, and CUDNN for each PyTorch release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106891
Approved by: https://github.com/kit1980
2023-08-09 20:30:45 +00:00
861ae39938 [aarch64] Add PT Docker build image for aarch64 (#106881)
# Changes
* Update `generate_binary_build_matrix.py` for aarch64 to use `pytorch/manylinuxaarch64-builder:cpu` when it is created
* Executed `generate_binary_build_matrix.py` to update `generated-linux-aarch64-binary-manywheel-nightly.yml`

Aarch64 build/test will fail until the new docker image is available for consumption.

Builder PR to build docker image : https://github.com/pytorch/builder/pull/1472

This switches nightly to use the docker build : https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=aarch64
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106881
Approved by: https://github.com/atalman
2023-08-09 20:28:04 +00:00
7dfab082be Reordering tests experiment (#106347)
Companion with https://github.com/pytorch/test-infra/pull/4424

Uses the file ratings generated by the test-infra PR to reorder tests. For each test file, sum the file ratings from the changed files in the PR, and order the tests by that sum.

A lot of tests are probably going to end up as "prioritized" since it takes anything with a rating > 0 right now.

Sharding is done twice, once on the prioritized tests, and once on the general/non prioritized tests.  Prioritized tests have an order, so they should be sharded according to that order, while general tests don't have an order and are sharded by test time, which should result in more balanced shards.

I'll change the metric name before I merge; I want to keep my testing changes quarantined from actual results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106347
Approved by: https://github.com/ZainRizvi
2023-08-09 20:11:11 +00:00
a44c072c89 Make InternalModel and Resnet work with rexportable flow (#106676)
Summary: The internal model and Resnet use the "re-export" flow now. Also did some refactoring to make the code a little cleaner.

Some changes for OSS:
1. Correctly use the "cached" fake tensors so that static symbols are still resolved to static
2. Change logic in PassBase to allocate static shapes for parameters
3. Add "is_torch_exported" tag to every node to make it survive during various graph transformations.
4. Added experimental wrapper API for quantization team to get pre_dispatch=True graph. Note that it doesn't actually do that right now. But we plan to switch soon.

Test Plan: CI

Differential Revision: D47890878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106676
Approved by: https://github.com/jerryzh168
2023-08-09 20:10:48 +00:00
8ea13a955a Avoid subtracting by sys.maxsize when something is bounded below by -sys.maxsize - 1 (#106716)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106716
Approved by: https://github.com/albanD
2023-08-09 19:34:03 +00:00
1b32ac3cab Update torchbench.txt (#106761)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106761
Approved by: https://github.com/malfet
2023-08-09 19:01:21 +00:00
47014883a7 Remove unused _add_runtime_assertions (#106759)
`_add_runtime_assertions` is not used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106759
Approved by: https://github.com/tugsbayasgalan
2023-08-09 18:58:32 +00:00
e1a1780626 [quant][pt2e] Move annotate functions in XNNPACKQuantizer to utils (#106642)
Summary:
This is to allow sharing these annotate functions by other quantizers so that writing a new quantizer is easier

note that these annotation functions will be maintained by XNNPACKQuantizer developers instead of AO team

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106642
Approved by: https://github.com/andrewor14
2023-08-09 18:52:39 +00:00
467a2e63f0 [pt2] add Python meta for triangular_solve (#106682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106682
Approved by: https://github.com/ezyang
2023-08-09 18:50:54 +00:00
318fcc5eb9 Change dropout of device Privateuse1 to fused kernel (#106774)
Similar to the issue in #97894, dropout is dispatched to a fused kernel (native_dropout) only for some devices like cuda. It is hard to optimize performance when using AOT with a custom device, as dropout is ultimately decomposed into bernoulli and mul. This PR changes this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106774
Approved by: https://github.com/ezyang
2023-08-09 18:50:28 +00:00
6f036c9637 [FSDP][Easy] zeros -> empty for immediately freed tensors (#106857)
Since we immediately free these tensors' storage (via `_free_storage()`), there is no reason to zero them after allocation:
92e5b124c8/torch/distributed/fsdp/flat_param.py (L1140-L1145)
92e5b124c8/torch/distributed/fsdp/flat_param.py (L1155-L1161)
92e5b124c8/torch/distributed/fsdp/flat_param.py (L1166-L1171)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106857
Approved by: https://github.com/Skylion007
2023-08-09 17:26:33 +00:00
a0c0666fca Add some const to IndexKernel.cu (#106809)
Test Plan: Sandcastle

Differential Revision: D48137853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106809
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-08-09 16:54:47 +00:00
387e3b04fa Reenable torch._int_mm testing on newer CUDAs (#106840)
Looks like "it just works" on SM80+ on CUDA-12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106840
Approved by: https://github.com/vkuzo
2023-08-09 16:23:30 +00:00
046d6178c5 [BE] Add optional t param to CuBlasLtMatrixLayout (#106841)
To avoid writing the same ternary in 4 (and soon to be 6) places
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106841
Approved by: https://github.com/kit1980
2023-08-09 16:14:48 +00:00
fe594ab323 Revert "[core][pruning][feature] cuSPARSELt kernels and ops (#102133)"
This reverts commit ad22f0ffb456fc3f967ad32e09376f7c9cf94a56.

Reverted https://github.com/pytorch/pytorch/pull/102133 on behalf of https://github.com/jcaip due to breaking lots of internal builds, see D48144534 ([comment](https://github.com/pytorch/pytorch/pull/102133#issuecomment-1671707821))
2023-08-09 16:03:14 +00:00
387f1ab104 [inductor] Switch inductor_prims._bucketize over to aten.bucketize (#106658)
inductor_prims._bucketize was added while we worked on hardening the inductor lowering. Now the lowering should be sufficiently tested and should have good enough perf (https://github.com/pytorch/pytorch/pull/104456) - so we can remove the temporary `inductor_prims._bucketize` op and move the lowerings to the `aten.bucketize` op.

Note that we haven't added a CPU implementation yet.
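
A small illustrative example of the op this lowering covers (a sketch only; on CPU, inductor falls back per the note above):

```python
# Sketch only: aten.bucketize under torch.compile with the inductor backend.
import torch

boundaries = torch.tensor([1.0, 2.0, 3.0, 4.0])
values = torch.rand(8) * 5

compiled_bucketize = torch.compile(torch.bucketize)
print(compiled_bucketize(values, boundaries))
```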

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106658
Approved by: https://github.com/eellison
2023-08-09 14:00:22 +00:00
40a15b50a8 Enable mypy checking in compile_fx.py (#105830)
This is part of the effort for issue #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105830
Approved by: https://github.com/eellison
2023-08-09 09:05:23 +00:00
088e316659 add xpu support for foreach kernels (#106021)
We want to add xpu support for foreach kernels, so we add the "xpu" devices to the support list.

Besides, for fused kernels in Adam and AdamW, the device check is driven by the support list in adam.py (lines 44-46) and adamw.py (lines 60-64), so we remove the repetitive check for cuda devices, as it would block the other devices in the support list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106021
Approved by: https://github.com/janeyx99
2023-08-09 07:51:02 +00:00
dc7ec4c843 Revert "conv-bn folding in low precision (#106576)"
This reverts commit c21df02ec0b9fd366bdf203134595664de030758.

Reverted https://github.com/pytorch/pytorch/pull/106576 on behalf of https://github.com/kit1980 due to breaking internal builds, see D48144191 ([comment](https://github.com/pytorch/pytorch/pull/106576#issuecomment-1670768310))
2023-08-09 06:51:54 +00:00
0ce103a0f8 Revert "inductor: remove conv_bn folding from pre_grad pass (#106686)"
This reverts commit 2a16457976c884b9cd1f196120529464c880e7ba.

Reverted https://github.com/pytorch/pytorch/pull/106686 on behalf of https://github.com/kit1980 due to Depends on reverted https://github.com/pytorch/pytorch/pull/106576 ([comment](https://github.com/pytorch/pytorch/pull/106686#issuecomment-1670753365))
2023-08-09 06:37:22 +00:00
c876afea2d [vision hash update] update the pinned vision hash (#106832)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106832
Approved by: https://github.com/pytorchbot
2023-08-09 04:29:46 +00:00
6691413145 export torch/csrc/dynamo/*.h (#106757)
Fixes #ISSUE_NUMBER
As title, we need the header files in torch/csrc/dynamo, so we export them. Could you have a look? @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106757
Approved by: https://github.com/albanD
2023-08-09 03:57:49 +00:00
e1e6bbd889 Update opset version warning text (#106830)
Fix the line break. Remove the mention of "Torchlib" and instead mention `torch.onnx.dynamo_export`, because `Torchlib` seems like a foreign concept in torch. Suggestions welcome.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106830
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2023-08-09 03:42:10 +00:00
cce2c52b0b [pt2] support vmap (#101707)
Teach dynamo about `vmap`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101707
Approved by: https://github.com/zou3519
2023-08-09 03:39:33 +00:00
c379d6283a Don't suppress ModuleNotFoundError if the failure is for an unrelated module (#106807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106807
Approved by: https://github.com/williamwen42, https://github.com/voznesenskym
2023-08-09 01:54:49 +00:00
69ecad6f2b [quant][pt2e] Add reference representation for quantize_per_channel and dequantize_per_channel (#105783)
Summary:
Implementing reference representation for quantized ops we decided in https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_quantize_dequantize_per_channel

Although right now it is not really testing things since there is some problem with dynamo export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105783
Approved by: https://github.com/kimishpatel
2023-08-09 01:39:52 +00:00
c14cf312c9 expandable_segments fix possible assert (#106818)
If record_history is enabled while a block is allocated, record_history is then disabled, and the block is later freed and unmapped, we can hit the `to_map->context_when_allocated == nullptr` assertion.

This change universally clears context_when_allocated on free, which should prevent this sequence of events from happening.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106818
Approved by: https://github.com/eellison
2023-08-09 01:09:03 +00:00
44448754c1 [CI] Fix sccaching of nvcc builds (#106811)
In cmake-3.26 or newer, `--options-file` is used, which renders nvcc outputs uncacheable by `sccache`, which was enabled for CUDA-11 or newer builds by default by 6377a43814

Fix it by disabling RESPONSE_FILE use for CUDA compilation.
Test Plan: Check that `multiple input files` stats in `PyTorch Build Statistics` is down to 13 files again, see https://github.com/pytorch/pytorch/actions/runs/5801865789/job/15727069855?pr=106811#step:10:42423

Fixes https://github.com/pytorch/pytorch/issues/105004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106811
Approved by: https://github.com/seemethere
2023-08-09 00:25:11 +00:00
9e35df4adc [pytorch][ao] force weight observer/fake_quant to be on the same device as the weight tensor (#106755)
Summary:
As title.
There's a corner case where both CPU and GPU are available: although the model is moved to CPU, the newly created PTQ weight observer is still on GPU. Therefore, during convert, this line will fail https://fburl.com/4rhipfvb

Test Plan: CI

Differential Revision: D48141494

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106755
Approved by: https://github.com/jerryzh168
2023-08-09 00:22:49 +00:00
cbcd9083be [DCP] Modify tensor saving logic in DCP (#106415)
Currently, DCP treats tensors as duplicates and only saves them on rank0. This won't work for PiPPy as PiPPy does have unique tensors across different ranks. With the current setup, we would only be saving the tensors on rank0 (coordinator rank).

In this PR, we change to letting each rank create its own WriteItem for tensors. For the ones that are replicated across different ranks, we handle it through dedup_tensors(), which dedups the replicated WriteItems so we only do the actual writing once.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106415
Approved by: https://github.com/wz337
2023-08-09 00:16:10 +00:00
c913f3857f Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 22:29:32 +00:00
32d8de23d4 Enable mypy check for torch/_inductor/codegen/common.py (#106199)
Fixes #105230

Summary:

As suggested in [#105230](https://github.com/pytorch/pytorch/issues/105230) mypy checking is enabled in torch/_inductor/codegen/common.py.

After the fix:
`mypy --follow-imports=skip torch/_inductor/codegen/common.py
Success: no issues found in 1 source file`

Reviewers: @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106199
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-08-08 20:37:47 +00:00
2a138d7f1d [ONNX] Turn on batch norm related unittest (#105769)
As title, add test for ops already supported.
Bump ORT in CI to 1.15.1 release version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105769
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-08-08 19:51:04 +00:00
3c52c6fd53 [pytorch] Disable CUDA sync events by default (#106723)
Summary:
As above, this was missed in a previous diff that accidentally set the default to true.
Internally at Meta this is disabled, but it is enabled in open source PyTorch.

Test Plan: CI

Differential Revision: D48124636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106723
Approved by: https://github.com/davidberard98, https://github.com/aaronenyeshi
2023-08-08 19:30:45 +00:00
d4bc27191a [exir] Update exir.pass_base to use export.pass_base (#106647)
Summary: Also fixed T159713621

Test Plan: CI

Differential Revision: D48068293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106647
Approved by: https://github.com/tugsbayasgalan
2023-08-08 19:27:21 +00:00
2764ead429 Add missing quantize per tensor vulkan backend function (#106641)
Summary: Add missing function for vulkan namespace for quantization

Test Plan:
Check that a quantized model runs on Vulkan.
Notebook tested: https://www.internalfb.com/intern/anp/view/?id=4045081

Differential Revision: D48047516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106641
Approved by: https://github.com/SS-JIA
2023-08-08 17:44:03 +00:00
3a300ed84e [export] refactor and add same_signature flag to dynamo.export (#106569)
This PR adds a **same_signature** flag to dynamo.export.

**Motivation:**
In https://github.com/pytorch/pytorch/pull/105679, we experimented on **using dynamo to inspect the UDFs** for cond in eager mode (without torch.compile). This helps us to normalize the inputs (e.g. lifting closure to inputs) and makes higher order operator more robust (e.g. forbid python side effects) and less error-prone in general.

We decided to use dynamo.export (instead of torch.compile) to do the inspection (pointed out by @voznesenskym @zou3519):
- We'd like a **whole-graph capture** for the UDF.
- We'd like the dynamo inspection to be **stateless**. Using torch.compile would require resetting dynamo context before and after the inspection because the compile flags may be different from users' torch.compile. This will clear all dynamo cache.
- We can still implement some **caching** based on the guards.

However, this requires export to be able to handle the case where it cannot always rewrite signature: e.g. closure lifted as input.

This PR makes the rewrite optional.

**Implementation:**
We just put all the code related to signature rewriting into a function called rewrite_signature and use a same_signature flag to optionally apply the transformation.

**Test Plan:**
existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106569
Approved by: https://github.com/ezyang
2023-08-08 17:16:18 +00:00
bd3b6f1ab4 add a debug api to extract cache entry from code (#106673)
Per the discussion with @jansel  in https://dev-discuss.pytorch.org/t/how-are-guards-installed-on-frames-that-are-transient-objects/1415/7 , guards and compiled code live in `co_extra` field in pycodeobject, which cannot be accessed in a trivial way. This PR tries to add a debug API to extract the data from that field, which can make debugging torchdynamo much easier.

The API is intended to be used for debug only, and should have no compatibility issues with the current system.
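
A minimal sketch of how such a debug helper might be used, assuming it is exposed as `torch._dynamo.eval_frame._debug_get_cache_entry_list` and returns entries with `check_fn`/`code` fields (the name and fields are assumptions, not confirmed by this log):

```python
import torch

def fn(x):
    return x.sin() + 1

compiled = torch.compile(fn)
compiled(torch.randn(4))  # populate the cache on fn.__code__

# hypothetical import path / field names for the debug helper
from torch._dynamo.eval_frame import _debug_get_cache_entry_list

for entry in _debug_get_cache_entry_list(fn.__code__):
    print(entry.check_fn)  # guard check function for this cache entry
    print(entry.code)      # compiled code object stored in co_extra
```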

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106673
Approved by: https://github.com/jansel
2023-08-08 16:33:46 +00:00
bc88028e8e Back out "Reland "Make adding buffers more like adding parameters (#104069)" (#106224)" (#106743)
Summary:
Original commit changeset: 81319beb97f3

Original Phabricator Diff: D47961182

Test Plan: revert to maintain backward compat with legacy ads_dper3 production package. Read details in: S357822

Reviewed By: atuljangra

Differential Revision: D48131623

@diff-train-skip-merge
(D48131623 landed internally)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106743
Approved by: https://github.com/malfet
2023-08-08 15:27:34 +00:00
891bb259f8 Revert "Remove dynamo+nvfuser (#105789)"
This reverts commit 6030151d3758715097b89026e9b3b3f839fbd544.

Reverted https://github.com/pytorch/pytorch/pull/105789 on behalf of https://github.com/DanilBaibak due to Break a lot of tests on main. ([comment](https://github.com/pytorch/pytorch/pull/105789#issuecomment-1669710571))
2023-08-08 14:20:32 +00:00
16b6873885 [custom_ops] extend impl_abstract to work with existing torch.library ops (#106088)
This PR extends impl_abstract to work with existing
torch.library/TORCH_LIBRARY ops.

There's a question of what to do if the user calls impl_abstract
and the op already has a registration for:
- DispatchKey::Meta. We raise.
- DispatchKey::CompositeImplicitAutograd. We raise.
- DispatchKey::CompositeExplicitAutograd. To be pragmatic, we don't
raise, since the user's CompositeExplicitAutograd might work for all
other backends but Meta.
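
A hedged sketch of the workflow this enables, assuming the entry point is `torch._custom_ops.impl_abstract` (the exact module path may differ by version):

```python
import torch
from torch.library import Library

# An existing torch.library op with only a CPU implementation.
lib = Library("mylib", "DEF")
lib.define("my_sin(Tensor x) -> Tensor")
lib.impl("my_sin", torch.sin, "CPU")

import torch._custom_ops as custom_ops

@custom_ops.impl_abstract("mylib::my_sin")
def my_sin_abstract(x):
    # Shape/dtype propagation only; never reads real data.
    return torch.empty_like(x)
```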

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106088
Approved by: https://github.com/soulitzer
ghstack dependencies: #106075, #106076
2023-08-08 13:53:20 +00:00
cebff39fad [custom_ops] make custom_ops.impl work on existing operators (#106076)
The design is that we construct a CustomOp object around the existing
operator and then use it to register things. It is totally OK if the
operator isn't functional (unlike torch._custom_ops.custom_op that can
only construct functional operators).

If the operator already has an implementation from a backend (either via
direct registration to e.g. DispatchKey::CPU, or an indirect
registration like CompositeImplicitAutograd/CompositeExplicitAutograd),
we raise an error.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106076
Approved by: https://github.com/soulitzer
ghstack dependencies: #106075
2023-08-08 13:53:20 +00:00
60a4ac3068 [custom_ops] Block overload names (#106075)
These are valid with the torch.library API, but (1) they add complexity
and (2) I have never seen a custom op actually use an overload name
before. For simplicity we block all overloads.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106075
Approved by: https://github.com/soulitzer
2023-08-08 13:53:18 +00:00
6030151d37 Remove dynamo+nvfuser (#105789)
This PR removes unmaintained Dynamo+nvFuser.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105789
Approved by: https://github.com/jansel, https://github.com/jjsjann123, https://github.com/albanD
2023-08-08 13:29:31 +00:00
ad22f0ffb4 [core][pruning][feature] cuSPARSELt kernels and ops (#102133)
This PR contains two new private ops, added for cuSPARSELt support.

These ops call into the cuSPASRELt kernels using the bindings they
provide. For more information, see the documentation
[here](https://docs.nvidia.com/cuda/cusparselt/index.html).

The two new private ops added are:
```
_cslt_compress()
_cslt_sparse_mm()
```

_cslt_compress is an op that returns the compressed matrix given a
sparse matrix that is passed in.

_cslt_sparse_mm is an op that expects a compressed matrix (the result of
_cslt_compress) and a dense matrix and performs sparse-dense matmul

These ops will throw runtime errors if cuSPARSELt is not present.

This PR also modifies the test and tensor subclass to reflect the new
cuSPARSELt support.
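
A hedged usage sketch of the two private ops, based only on the description above; it assumes a cuSPARSELt-enabled CUDA build and a valid 2:4 semi-structured sparsity pattern in the compressed operand.

```python
import torch

A = torch.randn(128, 128, dtype=torch.float16, device="cuda")
# Crude 2:4 semi-structured pattern: keep the first 2 of every 4 elements per row.
mask = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda").repeat(128, 32)
A = A * mask
B = torch.randn(128, 128, dtype=torch.float16, device="cuda")

A_compressed = torch._cslt_compress(A)        # compress the sparse operand
out = torch._cslt_sparse_mm(A_compressed, B)  # sparse x dense matmul
```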
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102133
Approved by: https://github.com/cpuhrsch
2023-08-08 06:59:22 +00:00
eqy
03c9321722 [CUDA][CUDA Graphs] Fix potential race with autograd thread during a graph capture 2 (#106570)
An alternative to #106235 that just adds our own uid generation so that we can call `beginAllocateStreamToPool` (which notifies the caching allocator that a capture is starting) before actually starting the capture. Note that this does appear to change the behavior of uid generation a bit compared to the CUDA API call (which seems to increment by 3 each time instead of 1).

Looking at the changes again, I'm not sure whether the _begin_ capture ordering change is needed in addition to the _end_ capture ordering change, but it makes me uneasy as I'm not sure anything prevents the autograd thread from running cleanup code "in-between" captures.

CC @zdevito @eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106570
Approved by: https://github.com/zdevito
2023-08-08 06:03:21 +00:00
cc21fa75a3 Enable dynamic shapes of torch.nn.Parameter (#105855)
This PR adds a new configuration that enables shapes of torch.nn.Parameter to be treated as dynamic in order to avoid extensive recompilation when Parameters are used instead of Tensors.

This feature addresses part of issue #105279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105855
Approved by: https://github.com/ezyang
2023-08-08 05:40:01 +00:00
0d57e87000 Fix test_div in caffe2/caffe2/python:hypothesis_test (#106694)
Summary: Suppress the "too_slow" health check for `test_div`.

Test Plan: Sandcastle

Differential Revision: D48105842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106694
Approved by: https://github.com/malfet
2023-08-08 04:50:21 +00:00
cdfd0ea162 [MPS] Introduce torch.mps.Event() APIs (#102121)
- Implement `MPSEventPool` to recycle events.
- Implement python bindings with `torch.mps.Event` class using the MPSEventPool backend. The current member functions of the Event class are `record()`, `wait()`, `synchronize()`, `query()`, and `elapsed_time()`.
- Add API to measure elapsed time between two event recordings.
- Added documentation for Event class to `mps.rst`.
- Added test case to `test_mps.py`.
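
A hedged usage sketch of the Event APIs listed above; the `enable_timing` keyword is an assumption, and this requires an MPS device (Apple Silicon).

```python
import torch

if torch.backends.mps.is_available():
    start = torch.mps.Event(enable_timing=True)  # enable_timing kwarg is an assumption
    end = torch.mps.Event(enable_timing=True)

    x = torch.randn(1024, 1024, device="mps")
    start.record()
    y = x @ x
    end.record()
    end.synchronize()
    print(start.elapsed_time(end))  # elapsed milliseconds between the two recordings
```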

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102121
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-08-08 03:45:45 +00:00
eqy
5f551133dc [NCCL][AVOID_RECORD_STREAMS] Initialize stashed_for_allocator_safety_ in endCoalescing if TORCH_NCCL_AVOID_RECORD_STREAMS=1 (#106166)
Currently `stashed_for_allocator_safety_` is uninitialized in this path, which will crash if another operation assumes a non-nullptr (the case when `TORCH_NCCL_AVOID_RECORD_STREAMS=1` and `avoidRecordStreams_` is set).

CC @kwen2501 @ptrblck

@kwen2501
I'm not familiar with what happens to the coalesced work when `endCoalescing` is called. In theory, if the coalesced work has already been "stashed for allocator safety," can we also avoid the record streams calls here? Or is the coalesced work discarded (and its `_stashed_for_allocator_safety` vectors also destroyed)?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106166
Approved by: https://github.com/kwen2501
2023-08-08 03:03:22 +00:00
9e4e0ecdd9 Add 0-dim Tensor overload to _foreach_mul (#106677)
rel:
- https://github.com/pytorch/pytorch/issues/106427
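
A minimal sketch of the new overload: the multiplier can now be a 0-dim tensor (e.g. a learning rate kept on device) rather than a Python scalar.

```python
import torch

params = [torch.randn(3), torch.randn(5)]
scale = torch.tensor(0.5)  # 0-dim tensor overload added here
out = torch._foreach_mul(params, scale)
```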

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106677
Approved by: https://github.com/janeyx99
2023-08-08 03:00:01 +00:00
90c264c276 sd flaky on cpu skip (#106726)
waiting for update expected script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106726
Approved by: https://github.com/malfet
2023-08-08 02:44:47 +00:00
2a16457976 inductor: remove conv_bn folding from pre_grad pass (#106686)
The freezing pass already supports conv+bn folding, so we don't need to do it in the pre_grad pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106686
Approved by: https://github.com/eellison
2023-08-08 02:01:15 +00:00
8ef7512dc4 create API jit::Module::deepcopy(device) (#106521)
Summary:
Previously we copied a meta merge and used it as a skeleton to do d2d merge replication. However, some models like prospector have the CPU op LongIndex, which takes quite a long time to load. That makes the meta merge copy expensive.

Modify jit::Module::deepcopy() to allow copying to a device. It simplifies user code and removes all unnecessary copies like the tempfile and meta merge copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106521
Approved by: https://github.com/davidberard98
2023-08-08 00:45:49 +00:00
a25eee1d77 _force_original_view_tracking to work as both context manager and function (#106706)
Fix _force_original_view_tracking to work as a function as well as a context manager, as stated by documentation.

Applied similar fixes to PR: https://github.com/pytorch/pytorch/pull/105291
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106706
Approved by: https://github.com/albanD
2023-08-07 23:29:22 +00:00
3cda19c10a [inductor] Fix xpasses being impossible (#106631)
This test raises an error inside the test when an xfailed test succeeds, but
is also decorated with the xfail decorator which converts the error to an xfail.
Instead, this lets the test function pass normally and lets the xfail decorator
raise "Unexpected success".

I also updated the COLLECT_EXPECT code and ran it to get the updated set of
failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106631
Approved by: https://github.com/lezcano
ghstack dependencies: #106319, #106400
2023-08-07 20:59:30 +00:00
ab6efb1649 [pt2] Add reference implementations of torch.{stft,istft} (#106400)
This allows symbolic shapes to be traced through `torch.stft` and `torch.istft`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106400
Approved by: https://github.com/lezcano
ghstack dependencies: #106319
2023-08-07 20:59:30 +00:00
d4d090e2da [FakeTensor] Workaround FFT ops with incorrect meta strides (#106319)
Currently there are FFT operators which raise `UnsupportedOperatorException`
because their meta implementations sometimes give incorrect strides. This works
around the problem for static shapes by falling back to eager. Though we still
don't support calls with dynamic shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106319
Approved by: https://github.com/ezyang
2023-08-07 20:59:30 +00:00
66d90e8054 [inductor][fx passes] add tensor size limit for group fusion and enable batch fusion (#106627)
Summary:
Add a threshold for the max tensor size; if the tensor size > threshold, we will not fuse them.

Enable batch_fusion by default since we have found consistent qps gains and ne is neutral.

Test Plan:
Some local test result in: https://docs.google.com/document/d/1-qNuvGejhGgwKmRVTbz98_-SVu_fMoKgFcxyxrNMH_M/edit
4096 should be a better threshold for ads cmf model.
f465511761
f465519705
4.8% qps gain
 {F1064213077}
ne neutral
 {F1064214423}

Reviewed By: yanboliang

Differential Revision: D48042826

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106627
Approved by: https://github.com/jansel
2023-08-07 20:34:20 +00:00
45c03b1ad4 Better dynamo dict support via SetVariable keys (#106559)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106559
Approved by: https://github.com/ezyang
2023-08-07 20:20:06 +00:00
4639ceb3fd [BE] Use convenience function (#106709)
Rather than typing `TORCH_CUDABLAS_CHECK(cublasLtMatmulDescSetAttribute(desc.descriptor, attr, &value, sizeof(value)))`, introduce a template method `setAttribute` that does the same
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106709
Approved by: https://github.com/Skylion007
ghstack dependencies: #106708
2023-08-07 20:12:23 +00:00
e02a3d4ad5 [BE] Use nested namespace in ATen/cuda (#106708)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ac9bd0c</samp>

> _We're sailing on the CUDA sea, with tensors and graphs aplenty_
> _We're refactoring the code, to make it clear and neat_
> _We're using nested namespaces, like `at::cuda::blas`_
> _So heave away, me hearties, heave away on the count of three_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106708
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2023-08-07 20:12:23 +00:00
1317dbf176 Reland "Add nn.CircularPad{*}d for consistency + fix no_batch_dim support (#106148)" (#106632)
The previous one was reverted because the PR stacked under it, which added error-checking to the Pad variants (https://github.com/pytorch/pytorch/pull/106147), was reverted: internally some people pass 2D inputs to ZeroPad2d (which should actually take 3d or 4d inputs :). But there wasn't actually anything this PR was breaking, according to my understanding.
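
A hedged usage sketch of the re-landed module (its padding semantics follow the circular mode of torch.nn.functional.pad):

```python
import torch
import torch.nn as nn

pad = nn.CircularPad2d(1)  # wrap-around padding of 1 on each side of the last two dims
x = torch.arange(9.0).reshape(1, 1, 3, 3)
y = pad(x)                 # shape becomes (1, 1, 5, 5)
```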

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106632
Approved by: https://github.com/albanD
2023-08-07 20:10:25 +00:00
0208574db9 [NAdam] Add capturable API and tests + fix differentiable (#106615)
This PR:
- adds a capturable API for NAdam similar to Adam(W)
- adds tests accordingly
- discovered and fixed bugs in the differentiable implementation (now tested through the capturable codepath).
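
A hedged sketch of the capturable flag described above, by analogy with the Adam(W) capturable API; capturable=True keeps optimizer state on the device so the step can be CUDA-graph captured.

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.NAdam(model.parameters(), lr=1e-3, capturable=True)

loss = model(torch.randn(4, 8, device="cuda")).sum()
loss.backward()
opt.step()
```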

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106615
Approved by: https://github.com/albanD
2023-08-07 19:49:11 +00:00
3dd8cb12b5 [ROCM] enable test_aten cpp tests (#106476)
This is part of effort to enable missed cpp test for ROCm platform.

In this change, we enabled the test_aten cpp test.
The total number of tests enabled is 214.

**Test plan:**
Tested in the rocm/pytorch-nightly:latest

```
jenkins@xxxxx:/tmp/pytorch$ .ci/pytorch/test.sh &> test_aten.out

jenkins@xxxxx:/tmp/pytorch$ grep PASS test_aten.out  |wc -l
214
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106476
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-07 18:31:30 +00:00
3a07dfde48 Fix lifetime of JITException binding (#106401)
Fix issues with new asserts introduced in 3.12 and pybind gil holding check on destructor.
See https://github.com/pybind/pybind11/pull/4769 for details on why this is a preferred solution rather than skipping the decref in all pybind object destructors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106401
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/Skylion007
2023-08-07 18:00:50 +00:00
af78e139a8 [functorch] fix dynamo support for functorch.grad (#106610)
Ref: https://github.com/pytorch/pytorch/pull/106475#discussion_r1282384503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106610
Approved by: https://github.com/zou3519
2023-08-07 17:44:49 +00:00
slc
2d4b1ae434 [Fix Bug] Cannot assign index like x[[1,2], :] = 2 when torch.use_deterministic_algorithms(True) to main (#105833)
Fixes https://github.com/pytorch/pytorch/issues/105819 and fix https://github.com/pytorch/pytorch/issues/96724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105833
Approved by: https://github.com/kurtamohler, https://github.com/janeyx99
2023-08-07 17:00:19 +00:00
070eb88a96 Handle Rational divisors in FloorDiv. (#106644)
Follow-up: #101173

This PR fixes the bug presented in #101173 by creating a special case for `sympy.Rational`
divisors, inside `FloorDiv` evaluation. In summary:

```python
FloorDiv(a, Rational(1, b))  # evaluates to: a * b
```

Besides that, this PR also does 2 other things:

- Replaces the use of the old `sympy.Mod` by the internal `Mod` (there were a few places
that were still looking for the SymPy one)

- Introduces debugging logs to the translation validator. These can be seen by setting the
environment variable: `TORCH_LOGS=+torch.fx.experimental.validator`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106644
Approved by: https://github.com/ezyang
ghstack dependencies: #106643
2023-08-07 16:52:22 +00:00
33e70e34a3 More readable Z3 expressions printer. (#106643)
This PR makes Z3 expressions easier to read and understand by creating a custom printer
for them.

Z3 expressions can be printed in 2 forms:

1. Using the builtin `str(e)` function
2. Using the `e.sexpr()` method

Problem is that (1) is a bit hard to read because its line breaks are not so
intuitive. (2) is a bit nicer, but the `to_int` and `to_real` functions clutter things up.

The custom printer is an improved `sexpr()` function:

- Leaves everything in one line
- Gets rid of `to_int` and `to_real` functions
- Reconstruct the floor division operations
- Merge commutative operation chains

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106643
Approved by: https://github.com/ezyang
2023-08-07 16:52:22 +00:00
26846546e8 export tools/autograd to torchgen package (#106663)
Fixes #ISSUE_NUMBER
As discussed here https://github.com/pytorch/pytorch/pull/105003, I have exported tools/autograd to the torchgen package; could you have a look? @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106663
Approved by: https://github.com/zou3519
2023-08-07 16:14:51 +00:00
273ad1dd23 Fix typo in jit_opt_limit.h (#106684)
Fix typo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106684
Approved by: https://github.com/zou3519
2023-08-07 13:51:55 +00:00
1cc002621d [xla hash update] update the pinned xla hash (#106695)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106695
Approved by: https://github.com/pytorchbot
2023-08-07 10:55:02 +00:00
416bf4e3e7 [Inductor][FX passes] Pre grad batch linear LHS fusion (#106497)
This is a popular pattern in many internal use cases. We have two versions (pre and post grad) and found the pre grad version has more perf gain, which makes sense in theory, as the corresponding backward graph doesn't have this pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106497
Approved by: https://github.com/jackiexu1992, https://github.com/jansel
2023-08-07 05:52:27 +00:00
e35cb480f4 [DCP][Test]Remove broken 2d checkpoint test (#106640)
Summary: Removing this broken test as we are not going to land the fix for 2D regression. Instead, we are going to migrate to use device_mesh and dtensor state_dict for 2D.

Differential Revision: D48082586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106640
Approved by: https://github.com/fduwjj
2023-08-07 03:13:50 +00:00
aa1b2f16c5 fix upsample_nearest decompositions for uint8 tensors (#106675)
Fixes #106674.

This PR aligns the implementation of `_compute_upsample_nearest_indices` with `UpSampleKernel.cpp`: 68cb854d73/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L1388-L1393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106675
Approved by: https://github.com/albanD
2023-08-07 01:52:41 +00:00
c21df02ec0 conv-bn folding in low precision (#106576)
Batchnorm inference is done in fp32 if the inputs are in fp16/bf16, and the output is cast back down to its original precision. This causes the batchnorm weights to get constant folded to fp32, which prevented Conv-BN folding from firing.
```
 def forward(self, arg0_1: bf16[32, 3, 3, 3], arg1_1: bf16[32], arg2_1: bf16[32], ...)
     convolution: bf16[3, 32, 15, 15] = aten..convolution.default(arg6_1, arg0_1, None, [2, 2], [0, 0], [1, 1], False, [0, 0], 1);  arg6_1 = arg0_1 = None
     # weight upcasting
     convert_element_type: f32[32] = torch.ops.prims.convert_element_type.default(arg3_1, torch.float32);  arg3_1 = None
     convert_element_type_1: f32[32] = torch.ops.prims.convert_element_type.default(arg4_1, torch.float32);  arg4_1 = None
     ...
     # end of batch norm
     add_1: f32[3, 32, 15, 15] = aten..add.Tensor(mul_2, unsqueeze_7);  mul_2 = unsqueeze_7 = None
     # output downcast
     convert_element_type_2: bf16[3, 32, 15, 15] = torch.ops.prims.convert_element_type.default(add_1, torch.bfloat16);  add_1 = None
```

I mark the convolutions which are followed by binary foldable ops in a higher precision that then get converted back down to the original conv dtype. We fold the weights in fp32 because it gives slightly better accuracy, then at the end of the pass convert the weights back to their original dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106576
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
ghstack dependencies: #106471, #106575
2023-08-07 01:30:47 +00:00
7215007f01 [pt2] add Python meta for polygamma (#106681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106681
Approved by: https://github.com/ezyang
2023-08-07 00:59:14 +00:00
12041d8e1f Use default dispatch table for tensordot.out (#106669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106669
Approved by: https://github.com/ezyang
2023-08-07 00:58:17 +00:00
f694bcc9a8 [pt2] add meta for _cdist_backward (#106680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106680
Approved by: https://github.com/Skylion007
2023-08-07 00:58:14 +00:00
05e1a50723 [pt2] remove meta skips for aminmax, decomp exists (#106670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106670
Approved by: https://github.com/ezyang
2023-08-07 00:55:25 +00:00
26e98040da Improve AOTAutograd tests to do something when inputs don't require grad (#106558)
This PR:
- Changes the AOTAutograd tests to also check that the output of the
forward is equal under AOTAutograd and eager-mode PyTorch.
- Adds a "check_gradients" flag to `check_aot_autograd`.
  - If True, then we attempt to compute gradients and check them.
  - If False, then we just check that the outputs are equal
  - If "auto", then we will compute gradients and check them only if
    some input and some output requires grad. This option is useful for
    crossref tests where we don't necessarily have inputs that require
    grad.

1) I need a testing utility to test "AOTAutograd for inference",
   e.g. make_fx + functionalize.
2) I want to run aot_autograd_check in crossref tests for other test
suites (e.g. fbgemm) and not all inputs require grad.

Test Plan:
- existing tests
- new tests to test the degenerate cases
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106558
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2023-08-07 00:11:30 +00:00
8b8f576f56 Minor update to ROCm triton commit pin (#106616)
This is required for a fix in latest nightly wheels in which the hip_runtime header file cannot be found during triton lowering

```
/tmp/tmpqoq6gtl8/main.c:3:14: fatal error: hip/hip_runtime.h: No such file or directory
3 | #include <hip/hip_runtime.h>
| ^~~~~~~~~~~~~~~~~~~
```

This is a single commit update to bring in this change
https://github.com/ROCmSoftwarePlatform/triton/pull/283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106616
Approved by: https://github.com/malfet
2023-08-06 16:56:44 +00:00
68cb854d73 Fix CPUFallback Mechanism on TensorList Type (#105209)
Fixes #104965

Currently, the CPU fallback mechanism lacks handling for the TensorList type, so some operators like _foreach_add_/_foreach_add don't work well.

cc  @bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105209
Approved by: https://github.com/bdhirsh
2023-08-05 15:38:30 +00:00
19621a73c0 [pt2] add metas for grid_sampler_3d ops (#106261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106261
Approved by: https://github.com/ezyang
2023-08-05 14:48:11 +00:00
bd34f85fe5 [pt2] meta for searchsorted.Scalar, tests, and out support (#106283)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106283
Approved by: https://github.com/ezyang
2023-08-05 09:12:29 +00:00
0a4e5e07db [inductor][easy] log the number of fused nodes for each graph (#106653)
This simple PR lets me know how much more fusion the loop ordering PR can bring compared to the baseline. I need this separate PR since I need to include it in both the baseline and test runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106653
Approved by: https://github.com/eellison
2023-08-05 08:34:55 +00:00
6540f92507 Compile AOTInductor in Meta prod env (#106636)
Summary: Reland https://github.com/pytorch/pytorch/pull/106442 (previously reverted with https://github.com/pytorch/pytorch/pull/106492)

Differential Revision: D48036309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106636
Approved by: https://github.com/houseroad
2023-08-05 08:01:24 +00:00
7e55dd7a15 Make NCCL default logging more friendly. (#105695)
The default behavior for a Python library should be to not print anything that's not an error/warning. However, today any 8-GPU task will by default print logs that take up more than a whole screen. This especially hurts user experience for small workloads that don't print much themselves:
```
I0719 10:50:33.485718 219407 ProcessGroupNCCL.cpp:482] [Rank 3] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.485716 219402 ProcessGroupNCCL.cpp:482] [Rank 1] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.485841 220673 ProcessGroupNCCL.cpp:581] [Rank 1] NCCL watchdog thread started!
I0719 10:50:33.485882 220672 ProcessGroupNCCL.cpp:581] [Rank 3] NCCL watchdog thread started!
I0719 105033.485 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 3
I0719 105033.485 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 1
I0719 10:50:33.559300 219400 ProcessGroupNCCL.cpp:482] [Rank 0] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.559444 220675 ProcessGroupNCCL.cpp:581] [Rank 0] NCCL watchdog thread started!
I0719 105033.559 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 0
I0719 10:50:33.577245 219415 ProcessGroupNCCL.cpp:482] [Rank 4] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.577381 220676 ProcessGroupNCCL.cpp:581] [Rank 4] NCCL watchdog thread started!
I0719 105033.577 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 4
I0719 10:50:33.583372 219404 ProcessGroupNCCL.cpp:482] [Rank 2] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.583511 220677 ProcessGroupNCCL.cpp:581] [Rank 2] NCCL watchdog thread started!
I0719 105033.583 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 2
I0719 10:50:33.672052 219421 ProcessGroupNCCL.cpp:482] [Rank 5] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.672153 220684 ProcessGroupNCCL.cpp:581] [Rank 5] NCCL watchdog thread started!
I0719 105033.672 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 5
I0719 10:50:33.844262 219427 ProcessGroupNCCL.cpp:482] [Rank 6] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.844411 220687 ProcessGroupNCCL.cpp:581] [Rank 6] NCCL watchdog thread started!
I0719 105033.844 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 6
I0719 10:50:33.853435 219432 ProcessGroupNCCL.cpp:482] [Rank 7] ProcessGroupNCCL initialized with following options:
NCCL_ASYNC_ERROR_HANDLING: 0
NCCL_BLOCKING_WAIT: 0
TIMEOUT(ms): 1800000
USE_HIGH_PRIORITY_STREAM: 0
NCCL_DEBUG: WARN
I0719 10:50:33.853551 220688 ProcessGroupNCCL.cpp:581] [Rank 7] NCCL watchdog thread started!
I0719 105033.854 distributed_c10d.py:213] Added key: store_based_barrier_key:1 to store for rank: 7
I0719 105033.854 distributed_c10d.py:247] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
I0719 105033.854 distributed_c10d.py:247] Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
```

This PR changes the NCCL init logs from a multi-line to a shorter one-line format, and changes the watchdog logs from LOG(INFO) to VLOG so they can be enabled on demand.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105695
Approved by: https://github.com/fduwjj
2023-08-05 07:40:53 +00:00
1819fe1324 Revert "Extend Inductor to support the third-party backend (#100706)" (#106652)
This reverts commit 05bd24bb3548105776cf73226927cbd0ed575c55.

It caused compilation time regression on torchbench, huggingface and dynamic models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106652
Approved by: https://github.com/davidberard98, https://github.com/voznesenskym
2023-08-05 06:41:08 +00:00
dc22b4fdb1 [vision hash update] update the pinned vision hash (#106654)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106654
Approved by: https://github.com/pytorchbot
2023-08-05 03:42:54 +00:00
4be6b6b673 Add quantization support to reshape and size for the ONNX exporter (#106629)
Fixes https://github.com/microsoft/onnx-converters-private/issues/175

Add quantization support for Reshape-14, Size-9 and Size-11
For Size operators, we don't requantize outputs because we want the original scalar in the graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106629
Approved by: https://github.com/BowenBao
2023-08-05 02:08:52 +00:00
5eb429ac30 Add test support for dynamic shapes for torch.onnx.dynamo_export (#106495)
This PR adds the ability to check whether the resulting ONNX graph has dynamic shapes when dynamic shapes are enabled.

Only test/onnx/test_fx_to_onnx.py and test/onnx/test_fx_op_consistency.py were covered, because test/onnx/test_fx_to_onnx.py does not use any common "run_test" helper to wrap the `dynamo_export` call. Maybe that could be a refactor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106495
Approved by: https://github.com/BowenBao
2023-08-05 01:57:36 +00:00
aa7824867f [ONNX] Remove legacy diagnostic printing (#106498)
As title, these are unused and removed to make way for adoption to PT2 logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106498
Approved by: https://github.com/thiagocrepaldi
2023-08-05 01:29:13 +00:00
136bda2568 fix issue with checking counters in binary folding (#106575)
We were incorrectly determining whether we had found a foldable op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106575
Approved by: https://github.com/XiaobingSuper, https://github.com/yanboliang
ghstack dependencies: #106471
2023-08-05 01:25:35 +00:00
02b9119105 [Pytorch][Vulkan] aten::flip (#106628)
Summary:
https://pytorch.org/docs/stable/generated/torch.flip.html

Implement flip for vulkan.

For batch and channel cases:
- Calculate the logical tensor values of N and C from pos.xyz
- Flip the logical tensor value of N, C or both
- Use `n*[C/4] + i/4, i%4` to get the new tensor value

Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*flip*"
Recommended: Free up disk space to speed up builds.

  Only 17GB is available on disk. Buck is slow when free disk space is under
  50GB.

  Consider running this command (from your home directory) to reclaim purgeable
  space:

  sudo /System/Library/Filesystems/apfs.fs/Contents/Resources/apfs.util -P *

Downloaded 0/53 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 35.3 sec (100%) 536/536 jobs, 6/536 updated
  Total time: 35.3 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *flip*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.flip_1d
[       OK ] VulkanAPITest.flip_1d (117 ms)
[ RUN      ] VulkanAPITest.flip_2d
[       OK ] VulkanAPITest.flip_2d (1 ms)
[ RUN      ] VulkanAPITest.flip_3d
[       OK ] VulkanAPITest.flip_3d (2 ms)
[ RUN      ] VulkanAPITest.flip_4d
[       OK ] VulkanAPITest.flip_4d (10 ms)
[----------] 4 tests from VulkanAPITest (132 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (132 ms total)
[  PASSED  ] 4 tests.
lfq@lfq-mbp fbsource %
```

clang-format on `Flip.cpp` and `flip.glsl`

Reviewed By: SS-JIA

Differential Revision: D47921025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106628
Approved by: https://github.com/SS-JIA
2023-08-05 00:59:29 +00:00
cyy
c287262b02 enable missing-prototypes warnings on MPS backend (#105831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105831
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-08-05 00:22:56 +00:00
e190afb829 [Dynamo] Allow users to patch custom builtin functions and inline them (#106595)
Fixes Meta internal user case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106595
Approved by: https://github.com/jansel
2023-08-04 23:47:09 +00:00
e61558b5fe Test fixes (#106586)
Fix for https://github.com/pytorch/pytorch/issues/106548 and https://github.com/pytorch/pytorch/issues/106299.

The fallback was not actually testing fallback anymore now that we have a fake tensor rule for conv. Memory format fallback testing is also now exercised in test_ops.py `TestFakeTensor`.

A gc.collect() fixes the list_clearing test. I suspect there was a refcycle introduced which is making it flaky.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106586
Approved by: https://github.com/wconstab
2023-08-04 23:23:17 +00:00
786977c647 [easy] Add reset_parameters for nn.PRelu (#106507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106507
Approved by: https://github.com/albanD
2023-08-04 23:22:42 +00:00
45f6ef2597 Expose intended public constraints. Fixes #106386 (#106458)
Fixes #106386

Straightforward change, just exposes the `one_hot` and `nonnegative` distribution constraints that are intended to be public. This fixes downstream pyro usage of these constraints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106458
Approved by: https://github.com/ezyang, https://github.com/kit1980
2023-08-04 23:20:59 +00:00
578969ca61 skip maml (#106471)
This one benchmark distorts the benchmark results because it is so low (.0007, the equivalent of a 1400x speedup). It has also been flaky, which has produced a lot of noise. Disabling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106471
Approved by: https://github.com/anijain2305
2023-08-04 22:14:09 +00:00
a01e795a6d [Compiled Autograd] Fix bug with multithreading check (#106621)
Fixes #106555

There was bug where the multithreading check would fire because of the
`compiled_autograd.disable()` calls in AotAutograd, even though compiled
autograd was already disabled, so that call was doing nothing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106621
Approved by: https://github.com/yanboliang
2023-08-04 20:49:21 +00:00
b782beb18e [ONNX] Expose OnnxRegistry publicly (#106140)
The official move of `OnnxRegistry` to `torch.onnx` allows it to become one of the parameters in `torch.onnx.ExportOption`. By incorporating `OnnxRegistry` in `torch.onnx.ExportOption`, users gain access to various functionalities, including the ability to register custom operators using `register_custom_op`, check whether an operator is supported using `is_registered_op`, and obtain symbolic functions that support specific operators using `get_functions`.

Additionally, `opset_version` is now exclusively available in `torch.onnx.OnnxRegistry` as it is removed from `torch.onnx.ExportOption`. The initialization of the registry with torchlib under the provided opset version ensures that the exporter uses the specified opset version as the primary version for exporting.

These changes encompass scenarios where users can:

1. Register an unsupported ATen operator with a custom implementation using onnx-script.
2. Override an existing symbolic function (onnx invariant).

NOTE: The custom registered function will be prioritized in the onnx dispatcher, and if there are multiple custom ones, the one registered last will be picked.
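
A hedged sketch of the intended wiring; the method names follow the description above (`register_custom_op` is shown commented out since its exact signature is not given here), and `onnx_registry` as an `ExportOptions` field is an assumption.

```python
import torch

registry = torch.onnx.OnnxRegistry()
print(registry.opset_version)  # the opset version now lives on the registry

# registry.register_custom_op(my_onnxscript_fn, "aten", "gelu")  # hypothetical signature
# assert registry.is_registered_op("aten", "gelu")

export_options = torch.onnx.ExportOptions(onnx_registry=registry)
onnx_program = torch.onnx.dynamo_export(
    torch.nn.ReLU(), torch.randn(2, 2), export_options=export_options
)
```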
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106140
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-08-04 20:46:03 +00:00
5dcc85d663 [dynamo, logging] add default pt2 logging group (#106417)
Create a new logging group that enables "interesting" logging: graph breaks, recompiles, symbolic shapes, guards, source trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106417
Approved by: https://github.com/ezyang
2023-08-04 20:34:42 +00:00
2156f0434c [quant][pt2e] Add reference representation for quantized adaptive_avg_pool2d (#105709)
Summary:
Implementing reference representation for quantized ops we decided in https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_adaptive_avg_pool2d

Although right now it is not really testing things since there is some problem with dynamo export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105709
Approved by: https://github.com/andrewor14
ghstack dependencies: #105708
2023-08-04 18:49:14 +00:00
3e6da46aff err on dot product for tensors of different sizes (#106572)
Fixes #106448

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106572
Approved by: https://github.com/ezyang
2023-08-04 18:34:34 +00:00
df8abaaf5f [Dynamo] Revert 'Enable torch._dynamo.config.suppress_errors by default' (#106562)
D47969512 was the original diff to revert this, but the diff train doesn't work well, so I have to split it into two parts: this OSS PR and another separate diff to revert the fbcode change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106562
Approved by: https://github.com/angelayi
2023-08-04 16:46:21 +00:00
d67f4d4e9f Revert "[DCP][Test]Remove broken 2d checkpoint test (#106367)"
This reverts commit d2a9b256f00742b3fd1271ad087fc4e02144aed8.

Reverted https://github.com/pytorch/pytorch/pull/106367 on behalf of https://github.com/jeanschmidt due to Breaking internal builds for diff D48007925 ([comment](https://github.com/pytorch/pytorch/pull/106367#issuecomment-1665901322))
2023-08-04 16:45:28 +00:00
ae4b2d272f Fix the Test of duplicate registration on genarator (#106536)
The duplicate registration test case shown in the figure below has always failed.
3d165dc3f3/test/test_cpp_extensions_open_device_registration.py (L171-L173)

3d165dc3f3/aten/src/ATen/core/GeneratorForPrivateuseone.h (L36-L37)

Because there is a static variable in the ```self.module.register_generator()``` function, it will only be initialized once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106536
Approved by: https://github.com/albanD
2023-08-04 16:09:40 +00:00
8e76c01043 Fix the api of privateuse1 in comment (#106537)
Fix the api of privateuse1 in comment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106537
Approved by: https://github.com/albanD
2023-08-04 16:07:26 +00:00
91afefb55b Fix some fake mode confusion between inner/outer fake mode in export (#106515)
Fixes https://github.com/pytorch/pytorch/issues/106412

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106515
Approved by: https://github.com/voznesenskym, https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-08-04 15:42:23 +00:00
5b13c779d4 [AOTInductor] Remove call to aot_autograd when receiving ExportedProgram (#105977)
https://github.com/pytorch/pytorch/issues/105555

The existing flow first exports and then calls torch._inductor.aot_compile. However, export calls aot_autograd with the core aten decomposition table, and then torch._inductor.aot_compile calls aot_autograd again with the inductor decomposition table. The second call of aot_autograd is supposedly causing some problems and seems excessive, so instead we will create a new function, torch._export.aot_compile, which will export using the inductor decomposition table, pass the result to inductor's compile_fx_aot, and, because it has already been exported, avoid calling aot_autograd again.

```
def aot_compile(
    f: Callable,
    args: Tuple[Any],
    kwargs: Optional[Dict[str, Any]] = None,
    constraints: Optional[List[Constraint]] = None,
) -> Tuple[str, ExportedProgram]:
```
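
A hedged usage sketch of the entry point whose signature is quoted above; the module path (`torch._export.aot_compile`) is an assumption based on the surrounding description.

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) + 1

so_path, exported_program = torch._export.aot_compile(
    M(), (torch.randn(8, 8, device="cuda"),)
)
print(so_path)  # path to the shared library produced by AOT Inductor
```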

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105977
Approved by: https://github.com/desertfire, https://github.com/zhxchen17, https://github.com/eellison
2023-08-04 15:35:23 +00:00
63d45275f4 is causal hints for transformer (#106143)
Summary:
make is_causal hint flags available for the top level transformer module.

It's debatable whether this is useful -- at present we autodetect causal masks for src and tgt masks in the transformer encoder and decoder, respectively. Making the is_causal flags available would enable users to short-cut this check by asserting whether their mask is causal, or not.

I am putting this diff up for discussion, not as a solution.  Not doing anything may be the right solution, unless there is strong (data-driven) user demand. -- it appears the consensus is to move ahead with this, as per discussions below.

@cpuhrsch @mikaylagawarecki @jbschlosser @janEbert

Test Plan: sandcastle

Differential Revision: D47373260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106143
Approved by: https://github.com/mikaylagawarecki
2023-08-04 14:16:48 +00:00
e421edf377 Add utility to test if autograd was registered correctly (#106561)
See docstring for more details. This API is not meant to be directly
user-facing, it is meant to be used as a subtest of D47965247, which is
coming soon.

Test Plan:
- Some new tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106561
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2023-08-04 13:39:10 +00:00
5a9e82fa02 let torch.device be overrideable by TorchFunctionMode (#106514)
Fixes #103828
let torch.device be overrideable by TorchFunctionMode
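
A minimal sketch of what this enables (illustrative only): a TorchFunctionMode can now intercept torch.device construction, e.g. to redirect "cuda" to another device.

```python
import torch
from torch.overrides import TorchFunctionMode

class RedirectDevice(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.device and args and args[0] == "cuda":
            return func("cpu")  # redirect cuda -> cpu for illustration
        return func(*args, **kwargs)

with RedirectDevice():
    print(torch.device("cuda"))  # device(type='cpu')
```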
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106514
Approved by: https://github.com/ezyang
2023-08-04 10:47:43 +00:00
d4d086ce7b [MPS] Fix Clamp with strided outputs/inputs (#97858)
Fixes #94396
Fixes #87348

1. If output is strided, we don't gather input tensors.
2. If output is not strided but min_t or max_t is strided, we make min_t or max_t contiguous.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97858
Approved by: https://github.com/kulinseth
2023-08-04 09:32:12 +00:00
9e301949ec [quant][pt2e] Add reference representation for quantized max_pool2d (#105708)
Summary:
Implementing reference representation for quantized ops we decided in https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_maxpool2d

Although right now it is not really testing things since there is some problem with dynamo export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105708
Approved by: https://github.com/andrewor14
2023-08-04 08:19:52 +00:00
b2d3a2f433 [inductor] Remove ReinterpretView copy_ for AOT Inductor outputs (#106564)
Running benchmark on HF models result in 71% pass rate now: P802905571
Updated [dashboard](https://hud.pytorch.org/benchmark/compilers?startTime=Fri%2C%2028%20Jul%202023%2005%3A02%3A20%20GMT&stopTime=Fri%2C%2004%20Aug%202023%2005%3A02%3A20%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/bench&lCommit=e35a655e59b2038c0395f972a1f567f862093d9c&rBranch=main&rCommit=3e5a52cedd2d586fc6cb40a73a098252b9edc2a1)

Originally, a lot of the HF export-aot-inductor tests are failing with the error message:
```
RuntimeError: unsupported operation: some elements of the input tensor and the written-to tensor refer to a single memory location. Please clone() the tensor before performing the operation.
```

I looked at the result of one of the models, AlbertForMaskedLM, and the error is due to an additional [`copy_`](https://www.internalfb.com/phabricator/paste/view/P802043305?lines=1460%2C1426%2C1438%2C1451%2C1428) being inserted at the end. Looking at the [exported graph](https://www.internalfb.com/phabricator/paste/view/P802908243?lines=1124), `buf237` in the cpp program corresponds to the `view_269` node. During inductor lowering, this `view_269` node will result in a `ir.ReinterpretView` node, and when generating code for the outputs, this [line](https://fburl.com/code/epola0di) will add an additional `copy_`.

I'm unsure if removing this case will result in other errors, but it seems to raise the HF model benchmark pass rate :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106564
Approved by: https://github.com/jansel
2023-08-04 07:51:29 +00:00
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
We use python decompositions instead of writing our own for batching rules.

Fixes https://github.com/pytorch/pytorch/issues/105736
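
A hedged sketch of the case being fixed: vmapping nll_loss with a negative ignore_index, which now goes through the Python decomposition rather than a hand-written batching rule.

```python
import torch
import torch.nn.functional as F
from torch.func import vmap

logits = torch.randn(4, 3, 5).log_softmax(-1)  # (batch, N, C) log-probs
targets = torch.tensor([[0, 1, -1], [2, -1, 3], [4, 0, 1], [-1, 2, 3]])

loss = vmap(lambda l, t: F.nll_loss(l, t, ignore_index=-1))(logits, targets)
```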

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
f8817d8ac8 Remove deepcopy override from ExportedProgram (#106578)
Summary: When we do a deep copy of the ExportedProgram, because of the custom deep copy override, the graph metadata (graph.meta) fails to be copied over. This can be fixed, but overall I don't see a need for a custom deepcopy in ExportedProgram and am thus trying to get rid of it.

Test Plan: CI

Differential Revision: D48043723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106578
Approved by: https://github.com/JacobSzwejbka
2023-08-04 06:31:32 +00:00
0ae7afd14e [MPS] Adding renorm implementation (#106059)
Related to #77764

Adding support for `aten::renorm` in mps backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106059
Approved by: https://github.com/kulinseth
2023-08-04 05:31:30 +00:00
aaa989c244 inductor: support linear fusion when multi linear using same input (#106300)
For the ```llama``` model, there is a pattern where multiple linears use the same input and the input dim > 2:
```input->view->(linear->view->silu, linear->view)```. This PR updates the pattern so that linear+silu can be fused (we first need to remove the view ops, and then apply the fusion patterns).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106300
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-04 03:53:00 +00:00
3c7331742a test_fused_sdp_choice in test_transformers.py fix (#106587)
sdp dispatcher prioritizes flash attention over efficient attention: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L684-L687, and flash attention is enabled for sm75+: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L625. Thus, the unit test `test_fused_sdp_choice` from `test_transformers.py` which is failing on T4 (sm75) should have this `SM80OrLater` check changed to `SM75OrLater`: https://github.com/pytorch/pytorch/blob/main/test/test_transformers.py#L1914-L1917.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106587
Approved by: https://github.com/drisspg
2023-08-04 03:43:56 +00:00
d4271b16ca [vision hash update] update the pinned vision hash (#106588)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106588
Approved by: https://github.com/pytorchbot
2023-08-04 03:07:30 +00:00
8fe5fa8613 Update mobile build docker image to pytorch-linux-jammy-py3-clang12-asan (#106582)
`pytorch-linux-focal-py3-clang7-asan` has been removed by https://github.com/pytorch/pytorch/pull/106355
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106582
Approved by: https://github.com/malfet
2023-08-04 01:21:30 +00:00
b283e93158 Add missing hpu check to is_any_autocast_enabled (#106539)
The HPU check was never added to this function because both commits were delivered on the same day.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106539
Approved by: https://github.com/albanD
2023-08-04 00:40:56 +00:00
93f538db35 Fix nullable-to-nonnull-conversion warnings (#106232)
Summary:
Cleaning up some build warnings when Wnullable-to-nonnull-conversion is enabled.

Changelog: Fixes nullability warnings from `MetalContext.mm`.

Test Plan:
```
buck build fbsource//fbobjc/mode/iphonesimulator //fbobjc/Apps/MSQRD/MSQRDPlayer:ARStudioPlayer
```

Differential Revision: D47886793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106232
Approved by: https://github.com/drisspg
2023-08-03 23:18:30 +00:00
c0b8b7b90c [inductor] Enable mypy checking in torch/_inductor/metrics.py (#105793)
As suggested in https://github.com/pytorch/pytorch/issues/105230

Implements small fix for torch/_inductor/metrics.py

I ran into a circular import, which I handled using `if TYPE_CHECKING` (https://docs.python.org/3/library/typing.html#constant).

There are then two options for describing the types: either use their class names as strings, or use `from __future__ import annotations`.

```
If from __future__ import annotations is used, annotations are not evaluated at function definition time.
Instead, they are stored as strings in __annotations__.
This makes it unnecessary to use quotes around the annotation (see [PEP 563](https://peps.python.org/pep-0563/)).
```

I'm open to suggestions if it does not meet your coding guidelines
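A minimal sketch of the pattern described above (the imported module and class are placeholders, not the actual inductor names):

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # only imported for the type checker, so the runtime circular import goes away
    from mypackage.scheduler import Scheduler


def count_nodes(scheduler: Scheduler) -> int:
    # with `from __future__ import annotations`, the annotation above is stored as a
    # string and never evaluated at function definition time
    return len(scheduler.nodes)
```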

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105793
Approved by: https://github.com/Skylion007
2023-08-03 22:43:57 +00:00
ce608712cb [inductor] don't cache non-static content (#106502)
While working on loop ordering, I happened to find that inductor may cache a stale inner_fn_str and ReadWrites object in a ComputedBuffer.

Let's say we have a producer buffer buf0 and a consumer buffer buf1. Before we call GraphLowering.finalize, the layout for buf0 may be a FlexibleLayout. At that moment, the inner_fn_str or ReadWrites object computed for buf1 will be based on the layout of buf0, which most likely is a contiguous FlexibleLayout. And they will be cached on the buf1 object (or buf1.data).

However after we call GraphLowering.finalize, we may realize it's better to give a non-contiguous layout for buf0 (e.g., if its input has non-contiguous layout or whatever reason). The layout change of buf0 should affect the inner_fn_str and ReadWrites object for buf1. But we may have cached those on buf1. The stale ReadWrites objects for buf1 may result in sub-optimal strides for buf1.

This may affect perf and I'll check the nightly runs.

Here is a dump of `nodes` in `Scheduler.__init__` before the fix as a reference: https://gist.github.com/shunting314/ed2152a08e268f5563fd55398b1392c7
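A toy sketch of the hazard (these are not inductor's real classes; it only illustrates how a cached derived value can go stale when the producer's layout is finalized later):

```python
from functools import cached_property

class Producer:
    def __init__(self, layout):
        self.layout = layout           # may still be "flexible" at this point

class Consumer:
    def __init__(self, producer):
        self.producer = producer

    @cached_property
    def inner_fn_str(self):
        # computed (and cached) from the producer's current layout
        return f"read({self.producer.layout})"

buf0 = Producer("contiguous (flexible)")
buf1 = Consumer(buf0)
print(buf1.inner_fn_str)               # caches "read(contiguous (flexible))"
buf0.layout = "non-contiguous (fixed)" # finalize later picks a different layout
print(buf1.inner_fn_str)               # still the stale cached string
```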

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106502
Approved by: https://github.com/jansel
2023-08-03 22:09:58 +00:00
60121e391b [caffe2] Replace CAFFE_ prefixes in static_tracepoint.h macros with TORCH_ (#106380)
Summary: Rename static tracepoint macros to better describe their targeted usage.

Test Plan:
Same as for D47159249:

Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`

Reviewed By: chaekit

Differential Revision: D47727339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106380
Approved by: https://github.com/chaekit
2023-08-03 21:51:36 +00:00
1642daeaaa [inductor] codegen dynamic shapes tests: reset inductor metrics (#106481)
A bunch of the tests are getting skipped/xfailed because of generated_kernel_count checks. In other tests, inductor metrics automatically get reset in the common() function, so we should do this in the test_torchinductor_codegen_dynamic_shapes tests as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106481
Approved by: https://github.com/eellison
2023-08-03 21:50:25 +00:00
424dc238f4 Fix split module interaction with dead code (#104554)
Summary:
This change fixes split_module's interaction with dead code. Previously, if a dead region was split out, split_module would throw an error while attempting to access the outputs for the partition, even though the partition has no outputs.

This change adds a new unit test to cover the dead code case and changes the output check to allow no outputs. A split module with no outputs will now return None, like a normal Python function.

Unit Test Added:
test_split_module_dead_code

A module with dead code:
```
class ModWithDeadCode(torch.nn.Module):
            def forward(self, x):
                output = x * 2 # we want this
                dead_line = x + 2 # this is dead
                return output
```

Before:
```
torch/fx/passes/split_module.py, line 357, in split_module
base_mod_env[list(partition.outputs)[0]] = output_val
IndexError: list index out of range
```

After:
```
class GraphModule(torch.nn.Module):
    def forward(self, x):
        # No stacktrace found for following nodes
        submod_2 = self.submod_2(x)
        submod_1 = self.submod_1(x);  x = None
        return submod_1

    class GraphModule(torch.nn.Module):
        def forward(self, x):
            # No stacktrace found for following nodes
            add = x + 2;  x = None
            return None

    class GraphModule(torch.nn.Module):
        def forward(self, x):
            # No stacktrace found for following nodes
            mul = x * 2;  x = None
            return mul
```
Submod 2 is correctly extracted

Test Plan: Tested with new unit test

Differential Revision: D47196732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104554
Approved by: https://github.com/yf225
2023-08-03 21:36:35 +00:00
239578beff [ROCm] Enable a few bfloat16 unit tests (#105177)
Currently a few unit tests from **test_matmul_cuda** and **test_sparse_csr** test suites are being skipped on ROCm.

This PR is to enable the following unit tests on ROCm (~30 UTs):

test_cublas_baddbmm_large_input_* (__main__.TestMatmulCudaCUDA)
test_addmm_sizes_all_sparse_csr* (__main__.TestSparseCSRCUDA) when m==0 or n==0 or k==0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105177
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-03 21:17:19 +00:00
236eda4d51 remove jit from torchbench (#106071)
Need to remove jit arguments after changes in https://github.com/pytorch/benchmark/pull/1787

Also curious, is there is a procedure for updating torchbench version in Pytorch CI?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106071
Approved by: https://github.com/xuzhao9, https://github.com/msaroufim, https://github.com/malfet, https://github.com/lezcano
2023-08-03 21:04:43 +00:00
b03505eca8 update expected pass for torchbench dynamic (#106573)
fixes https://github.com/pytorch/pytorch/pull/106009#issuecomment-1664513049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106573
Approved by: https://github.com/cpuhrsch
2023-08-03 20:46:08 +00:00
c9eb95cca4 Update XLA dynamo backend name (#106489)
This is to deprecate the old XLA dynamo backend and rename it to `openxla`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106489
Approved by: https://github.com/jansel, https://github.com/shunting314
2023-08-03 20:00:37 +00:00
a8e3bd97cf [export] cleanup pass base. [1/n] (#106480)
Test Plan: CI

Differential Revision: D48004635

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106480
Approved by: https://github.com/angelayi
2023-08-03 19:48:05 +00:00
c27e15359a use no_grad() consistently for testing transformer trace construction (#106523)
Summary: Whether the check trace runs with no_grad() or with grad enabled impacts transformer trace construction; use no_grad() consistently.

Test Plan:
sandcastle and github ci

```
buck2 run mode/opt mode/inplace //caffe2/test:test_jit_cuda -- --regex test_scriptmodule_transformer_cuda
```

Differential Revision: D48020889

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106523
Approved by: https://github.com/davidberard98
2023-08-03 19:28:20 +00:00
3200f63ee6 Make mocked functions return the proper result structure (a tuple of attn result and attn weights for native MHA) (#106526)
Summary: Make mocked functions return the proper result structure (a tuple of attn result and attn weights for native MHA)

Test Plan: sandcastle

Differential Revision: D48021277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106526
Approved by: https://github.com/davidberard98
2023-08-03 19:27:31 +00:00
d1a99a083f Reland Simplify handle indexing (#105006) (#106357)
This reverts commit a9a3c456495ccddff55e088ebf395c599db62d12.

This PR changes the following:
- `_ExecOrderData.handle_to_handle_index` -> `FlatParamHandle._handle_index`
- `_ExecOrderData.handles_to_pre_forward_order_index` -> `FlatParamHandle._pre_forward_order_index`
- `_ExecOrderData.handles_to_post_forward_order_index` -> `FlatParamHandle._post_forward_index`
- `_FSDPState._needs_pre_forward_unshard` -> `FlatParamHandle._needs_pre_forward_unshard`
- `_FSDPState._needs_pre_backward_unshard` -> `FlatParamHandle._needs_pre_backward_unshard`
- `_FSDPState._handles_prefetched` -> `FlatParamHandle._prefetched`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106357
Approved by: https://github.com/awgu
2023-08-03 19:17:32 +00:00
578d9fee42 [DTensor][EZ] op schema comparison so that no redistribute is called (#106158)
When looking at traces of TP more carefully, I found that for cases where an input reshard is not needed, we also call redistribute within sharding propagation. Upon careful checking, it looks like the way we compare different op_schemas is not correct.

One example can be seen in the following trace:
<img width="1146" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7322d26f-7029-41f9-8f8c-5f27a6bb98f9">

As you can see, no collectives are called, and this redistribute is not needed.

With this change:

<img width="1491" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/eb4a971f-44c1-4d83-8671-fce94cfa926c">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106158
Approved by: https://github.com/Skylion007, https://github.com/wanchaol
2023-08-03 19:17:10 +00:00
4c46ea583f [Export] Support re-exportability (#106531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106531
Approved by: https://github.com/zhxchen17
2023-08-03 18:27:26 +00:00
3db255020b Clarify the clarification (#106358)
Summary: Clarify the clarification

Differential Revision: D47941982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106358
Approved by: https://github.com/mikaylagawarecki
2023-08-03 16:58:36 +00:00
2f281949a5 [dynamo] resolve InlinedClosureVariable in InstructionTranslator stack (#106491)
When inlining a function which loads a closure, its direct parent may not load that closure, so we cannot find the closure name in the parent's symbolic locals. In this PR, we fix it by recursively searching the parent instruction translator stack to resolve the closure.

**Background**
When developing https://github.com/pytorch/pytorch/pull/105679, this corner case was triggered. A small repro is added in the test of this PR, where `outer` is loaded by `deep2` but not by `deep`.
```python
def test_inline_closure_not_loaded_by_parent(self):
    def outer(a):
        return a + 1

    def indirect(x):
        return direct(x)

    def direct(x):
        def deep2(c):
            return outer(c)

        def deep(c):
            return deep2(c)

        return deep(x)

    x = torch.randn(3)
    eager = indirect(x)
    counter = CompileCounter()
    compiled = torch._dynamo.optimize(counter)(indirect)(x)
```

Running the test, we have the following error before the PR:
```
Traceback (most recent call last):
  File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6584, in test_inline_closure_not_loaded_by_parent
    compiled = torch._dynamo.optimize(counter)(indirect)(x)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 321, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/eval_frame.py", line 481, in catch_errors
    return callback(frame, cache_size, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 543, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 130, in _fn
    return fn(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 362, in _convert_frame_assert
    return _compile(
  File "/home/yidi/local/pytorch/torch/_dynamo/utils.py", line 194, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 531, in _compile
    raise InternalTorchDynamoError(str(e)).with_traceback(e.__traceback__) from None
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 432, in _compile
    out_code = transform_code_object(code, transform)
  File "/home/yidi/local/pytorch/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
    transformations(instructions, code_options)
  File "/home/yidi/local/pytorch/torch/_dynamo/convert_frame.py", line 417, in transform
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2067, in run
    super().run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 261, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2279, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2279, in inline_call_
    tracer.run()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 724, in run
    and self.step()
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 688, in step
    getattr(self, inst.opname)(inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 392, in wrapper
    return inner_fn(self, inst)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 1116, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 598, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2172, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/symbolic_convert.py", line 2227, in inline_call_
    sub_locals, closure_cells = func.bind_args(parent, args, kwargs)
  File "/home/yidi/local/pytorch/torch/_dynamo/variables/functions.py", line 471, in bind_args
    result[name] = parent.symbolic_locals[name]
torch._dynamo.exc.InternalTorchDynamoError: outer

from user code:
   File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6570, in indirect
    return direct(x)
  File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6579, in direct
    return deep(x)
  File "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6577, in deep
    return deep2(c)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

To execute this test, run the following from the base repo dir:
     python test/dynamo/test_misc.py -k test_inline_closure_not_loaded_by_parent

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
---------------------------------------------------------------------------------------------------------------------------- Captured stdout call -----------------------------------------------------------------------------------------------------------------------------
frames [('total', 1)]
inline_call []
---------------------------------------------------------------------------------------------------------------------------- Captured stderr call -----------------------------------------------------------------------------------------------------------------------------
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping helper /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __init__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping __enter__ /home/yidi/local/miniconda3/envs/pytorch-3.10/lib/python3.10/contextlib.py
[2023-08-02 15:48:36,560] torch._dynamo.eval_frame: [DEBUG] skipping enable_dynamic /home/yidi/local/pytorch/torch/_dynamo/eval_frame.py
[2023-08-02 15:48:36,561] torch._dynamo.symbolic_convert: [INFO] Step 1: torchdynamo start tracing indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6569
TRACE starts_line indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6569
            def indirect(x):
[2023-08-02 15:48:36,591] torch._dynamo.variables.builder: [DEBUG] wrap_to_fake L['x'] (3,) [<DimDynamic.STATIC: 2>] [None]
TRACE starts_line indirect /home/yidi/local/pytorch/test/dynamo/test_misc.py:6570
                return direct(x)
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_DEREF direct []
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x [UserFunctionVariable()]
[2023-08-02 15:48:36,594] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [UserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object direct at 0x7fbe4d366810, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6572>
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6572 (inline depth: 1)
            def direct(x):
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6573 (inline depth: 1)
                def deep2(c):
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CLOSURE outer []
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE BUILD_TUPLE 1 [InlinedClosureVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST <code object deep2 at 0x7fbe4d3666b0, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6573> [TupleVariable()]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST MiscTests.test_inline_closure_not_loaded_by_parent.<locals>.direct.<locals>.deep2 [TupleVariable(), ConstantVariable(code)]
[2023-08-02 15:48:36,595] torch._dynamo.symbolic_convert: [DEBUG] TRACE MAKE_FUNCTION 8 [TupleVariable(), ConstantVariable(code), ConstantVariable(str)]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_DEREF deep2 [NestedUserFunctionVariable()]
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6576 (inline depth: 1)
                def deep(c):
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CLOSURE deep2 []
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE BUILD_TUPLE 1 [NewCellVariable()]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576> [TupleVariable()]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_CONST MiscTests.test_inline_closure_not_loaded_by_parent.<locals>.direct.<locals>.deep [TupleVariable(), ConstantVariable(code)]
[2023-08-02 15:48:36,597] torch._dynamo.symbolic_convert: [DEBUG] TRACE MAKE_FUNCTION 8 [TupleVariable(), ConstantVariable(code), ConstantVariable(str)]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE STORE_FAST deep [NestedUserFunctionVariable()]
TRACE starts_line direct /home/yidi/local/pytorch/test/dynamo/test_misc.py:6579 (inline depth: 1)
                return deep(x)
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST deep []
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST x [NestedUserFunctionVariable()]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [NestedUserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,598] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576>
TRACE starts_line deep /home/yidi/local/pytorch/test/dynamo/test_misc.py:6576 (inline depth: 2)
                def deep(c):
TRACE starts_line deep /home/yidi/local/pytorch/test/dynamo/test_misc.py:6577 (inline depth: 2)
                    return deep2(c)
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_DEREF deep2 []
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE LOAD_FAST c [NestedUserFunctionVariable()]
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] TRACE CALL_FUNCTION 1 [NestedUserFunctionVariable(), TensorVariable()]
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object deep at 0x7fbe4d366760, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6576>
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
[2023-08-02 15:48:36,599] torch._dynamo.symbolic_convert: [DEBUG] FAILED INLINING <code object direct at 0x7fbe4d366810, file "/home/yidi/local/pytorch/test/dynamo/test_misc.py", line 6572>
[2023-08-02 15:48:36,599] torch._dynamo.output_graph: [DEBUG] restore_graphstate: removed 0 nodes
```

Test Plan:
add new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106491
Approved by: https://github.com/williamwen42, https://github.com/jansel, https://github.com/zou3519
2023-08-03 16:45:42 +00:00
6268ab2c2d torchbench pin upd: hf auth token, clip, whisper, llamav2, sd (#106009)
Includes stable diffusion, whisper, llama7b and clip

To get this to work I had to pass the HF auth token into all CI jobs; GitHub does not pass secrets from parent to child workflows automatically. There's a likelihood HF will rate limit us; in that case please revert this PR and I'll work on adding a cache next - cc @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @aakhundov @malfet

Something upstream changed in torchbench too, where `hf_Bert` and `hf_Bert_large` are now both failing on some dynamic-shape-looking error which I'm not sure how to debug yet, so for now it felt a bit gross but I added a skip since others are building on top of this work @ezyang

`llamav2_7b_16h` cannot pass the accuracy checks because it OOMs when deep-cloning extra inputs; this seems to mean it does not need to show up in the expected-numbers CSV. Will figure this out when we update the pin with https://github.com/pytorch/benchmark/pull/1803 cc @H-Huang @xuzhao9 @cpuhrsch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106009
Approved by: https://github.com/malfet
2023-08-03 16:28:40 +00:00
0dc7f6df9d [inductor] Make AOT CPU Inductor work in fbcode (#106225)
Summary:
This diff has a couple of hacks to make inductor-CPU work for AOT codegen in fbcode:

- We need to add the CUDA link flags; AOT-Inductor is specialized for CUDA
  right now and uses a lot of `at::cuda` stuff.  We should do a proper AOT CPU
  at some point but this unblocks perf measurement.

- Add an include path to the cpp_prefix.  It's kind of hilarious; we remove the
  include path for remote execution, but then for AOT we need it back. 🤷

Test Plan: internal test

Differential Revision: D47882848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106225
Approved by: https://github.com/mikekgfb, https://github.com/bdhirsh, https://github.com/jansel
2023-08-03 13:56:54 +00:00
1f734e03df [pt2] add metas for mode ops (#106273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106273
Approved by: https://github.com/ezyang
ghstack dependencies: #106272
2023-08-03 13:11:10 +00:00
70469e6f04 [pt2] add metas for median ops (#106272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106272
Approved by: https://github.com/ezyang
2023-08-03 13:11:10 +00:00
57fba6fd86 [FSDP][9/N] Introduce CustomPolicy (#104986)
This PR adds a new `CustomPolicy` that acts like the existing `lambda_auto_wrap_policy` except it (1) leverages the new auto wrapping infrastructure and (2) allows overriding FSDP kwargs for particular instances. (1) gives it access to the validation checks (like for frozen parameters), and (2) makes it as expressive as manual wrapping. This should allow us to effectively deprecate manual wrapping if desired.

The API is as follows:
```
def lambda_fn(module: nn.Module) -> Union[bool, Dict[str, Any]]:
    ...
policy = CustomPolicy(lambda_fn)
```
The `lambda_fn` can return:
- `False` or `{}` to indicate no wrapping
- `True` to indicate wrapping while inheriting the root's FSDP kwargs
- Non-empty `dict` to indicate wrapping while overriding the specified FSDP kwargs and inheriting the rest from the root
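A hedged usage sketch based on the contract above (the import path and the particular module types and kwargs are assumptions for illustration):

```python
import torch.nn as nn
from torch.distributed.fsdp import ShardingStrategy
from torch.distributed.fsdp.wrap import CustomPolicy

def lambda_fn(module: nn.Module):
    if isinstance(module, nn.TransformerEncoderLayer):
        # wrap, overriding one FSDP kwarg and inheriting the rest from the root
        return {"sharding_strategy": ShardingStrategy.SHARD_GRAD_OP}
    if isinstance(module, nn.Embedding):
        return True    # wrap with the root's FSDP kwargs
    return False       # do not wrap

policy = CustomPolicy(lambda_fn)
# model = FSDP(model, auto_wrap_policy=policy, ...)
```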

---

After this PR, the follow-up work items for auto wrapping are:
1. Add shared parameter validation
2. (Longer-term / exploratory) Add a policy that provides a reasonable auto wrapping with "minimal" user input

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104986
Approved by: https://github.com/ezyang
ghstack dependencies: #104427, #104967, #104999, #104969
2023-08-03 12:46:36 +00:00
15953fdf35 [FSDP][8/N] Replace _FSDPPolicy.policy with _Policy._run_policy (#104969)
This does some code organization improvement.
- It renames `_FSDPPolicy` to `_Policy` to show that it is not only for FSDP but for any module-level API.
- It formalizes the contract that such a policy should return something like `target_module_to_kwargs: Dict[nn.Module, Dict[str, Any]]` that maps each module to wrap to its kwargs. It does so by requiring a `_run_policy` abstract method (this time private since users do not need to care about it). Then, our auto wrapping can just call `_run_policy()` to generate the dict and do any validation or post-processing.
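A toy sketch of that contract (a standalone class for illustration, not the actual `_Policy` base or its real signature, which are internal details):

```python
from typing import Any, Dict
import torch.nn as nn

class WrapEveryLinearPolicy:
    """Maps each module that should be wrapped to the FSDP kwargs to use for it."""

    def _run_policy(self, root_module: nn.Module, root_kwargs: Dict[str, Any]) -> Dict[nn.Module, Dict[str, Any]]:
        target_module_to_kwargs: Dict[nn.Module, Dict[str, Any]] = {}
        for module in root_module.modules():
            if isinstance(module, nn.Linear):
                target_module_to_kwargs[module] = dict(root_kwargs)  # inherit root kwargs
        return target_module_to_kwargs
```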

This PR is technically BC-breaking because it removes the public `ModuleWrapPolicy.policy`. However, I do not think anyone was using that anyway, so this is a pretty safe breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104969
Approved by: https://github.com/rohan-varma
ghstack dependencies: #104427, #104967, #104999
2023-08-03 12:42:14 +00:00
697893568d Improve error message when export encounters non-local input (#106403)
Previously, you would get an error like

```
Dynamo input and output is a strict subset of traced input/output
```

now you get

```
Cannot export model which references tensors that are neither
buffers/parameters/constants nor are direct inputs.  For each tensor, if you'd
like this tensor to be an explicit input, add it as a dummy argument
to the top-level model definition you are exporting; if you would
like its value to be embedded as an exported constant, wrap its access
in a function marked with @assume_constant_result.

G['bulbous_bouffant'], accessed at:
  File "test_export.py", line N, in f
    return bulbous_bouffant + y
```

This doesn't handle outputs, I'm going to hit that next.
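A hedged sketch of what the two suggested fixes look like in user code (the global name follows the example above; the decorator usage is an assumption based on the message text):

```python
import torch
import torch._dynamo

bulbous_bouffant = torch.randn(3)        # a module-level tensor referenced from the model

def f(y):
    return bulbous_bouffant + y          # this reference triggers the new error message

# fix 1: make the tensor an explicit (dummy) input
def f_explicit(bulbous_bouffant, y):
    return bulbous_bouffant + y

# fix 2: embed its value as an exported constant via a wrapped access
@torch._dynamo.assume_constant_result
def get_bulbous_bouffant():
    return bulbous_bouffant

def f_constant(y):
    return get_bulbous_bouffant() + y
```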

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106403
Approved by: https://github.com/tugsbayasgalan
2023-08-03 12:35:25 +00:00
83e36fe127 Fix vscode test discovery (#106490)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106490
Approved by: https://github.com/eellison
2023-08-03 09:02:22 +00:00
3d165dc3f3 Upgrade expecttest to 0.1.6 (#106314)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106314
Approved by: https://github.com/malfet
2023-08-03 07:06:53 +00:00
bcc0f4bcab Move ASAN to clang12 and Ubuntu-22.04 (Jammy) (#106355)
- Modify `install_conda` to remove libstdc++ from libstdcxx-ng to use one from OS
- Modify `install_torchvision` to work around a weird glibc bug where malloc interposers (such as ASAN) cause a hang in the internationalization library, see https://sourceware.org/bugzilla/show_bug.cgi?id=27653 and https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
- Modify `torch.utils.cpp_extension` to recognize Ubuntu's clang as supported compiler

Extracted from https://github.com/pytorch/pytorch/pull/105260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106355
Approved by: https://github.com/huydhn
ghstack dependencies: #106354
2023-08-03 05:36:04 +00:00
97396cdbb2 Fix undefined behavior detected by clang-12 (#106354)
Compiler behavior when a non-zero offset is added to a null pointer is undefined, and relying on it is a bad habit.

- When `lapackEig` is called to estimate a workspace size, do not add the matrix size to the W pointer.
- When `unpack_pivots_cpu_kernel` is called with zero `dim_size`, exit early.
- When `topk_impl_loop` is called with `k` equal to zero, exit right away as the output tensors are empty anyway.
- Ignore adding a non-zero storage offset in `TensorImpl::data_ptr_impl_impl`, which can be the case if a tensor is created as `torch.empty(3)[4:]`.
- In `s_addmm_out_sparse_dense_worker`, do not call `axpy` over an empty vector.
- In `_sparse_binary_op_intersection_kernel_impl`, skip computing `ptr_indices_dim` when `sparse_dim` is empty.
- Exit `grid_sample` forward/backward kernels earlier if either `input` or `grid` are empty tensors.
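A hedged illustration of the storage-offset case from the list above (Python side only; the guarded code itself is C++):

```python
import torch

t = torch.empty(3)[4:]                 # slicing past the end is legal and yields an empty tensor
print(t.numel(), t.storage_offset())   # 0 elements, but a non-zero storage offset
# data_ptr() must not blindly add that offset to a possibly-null base pointer
```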

Found by asan in clang-12

Before the change UBSan report looks as follows:
```
 ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-12/bin/llvm-symbolizer UBSAN_OPTIONS=print_stacktrace=1 LD_PRELOAD=/usr/lib/llvm-12/lib/clang/12.0.1/lib/linux/libclang_rt.asan-x86_64.so python test_fx_experimental.py -v -k test_normalize_operator_exhaustive_linalg_eig_cpu_float32
Test results will be stored in test-reports/python-unittest/test_fx_experimental

Running tests...
----------------------------------------------------------------------
  test_normalize_operator_exhaustive_linalg_eig_cpu_float32 (__main__.TestNormalizeOperatorsCPU) ... /opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:111: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  torch.has_cuda,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:112: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
  torch.has_cudnn,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:118: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
  torch.has_mps,
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/overrides.py:119: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
  torch.has_mkldnn,
/var/lib/jenkins/workspace/aten/src/ATen/native/BatchLinearAlgebra.cpp:937:17: runtime error: applying non-zero offset 20 to null pointer
    #0 0x7f2025794888 in void at::native::lapackEig<float, float>(char, char, int, float*, int, float*, float*, int, float*, int, float*, int, float*, int*) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9945888)
    #1 0x7f20257da256 in void at::native::(anonymous namespace)::apply_linalg_eig<float>(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x998b256)
    #2 0x7f20257d902d in at::native::(anonymous namespace)::linalg_eig_kernel(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor const&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x998a02d)
    #3 0x7f20257b5b3d in at::native::linalg_eig_out_info(at::Tensor const&, at::Tensor&, at::Tensor&, at::Tensor&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9966b3d)
    #4 0x7f20257b4770 in at::native::linalg_eig_out(at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9965770)
    #5 0x7f20280710e6 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor&, at::Tensor&> (at::Tensor const&, at::Tensor&, at::Tensor&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU_out_linalg_eig_out(at::Tensor const&, at::Tensor&, at::Tensor&))>, std::tuple<at::Tensor&, at::Tensor&>, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor&, at::Tensor&> >, std::tuple<at::Tensor&, at::Tensor&> (at::Tensor const&, at::Tensor&, at::Tensor&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xc2220e6)
    #6 0x7f202727a045 in at::_ops::linalg_eig_out::call(at::Tensor const&, at::Tensor&, at::Tensor&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb42b045)
    #7 0x7f20257b7e29 in at::native::linalg_eig(at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x9968e29)
    #8 0x7f2028070bf0 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor> (at::Tensor const&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CPU__linalg_eig(at::Tensor const&))>, std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<at::Tensor const&> >, std::tuple<at::Tensor, at::Tensor> (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xc221bf0)
    #9 0x7f2026b1f787 in std::tuple<at::Tensor, at::Tensor> c10::Dispatcher::redispatch<std::tuple<at::Tensor, at::Tensor>, at::Tensor const&>(c10::TypedOperatorHandle<std::tuple<at::Tensor, at::Tensor> (at::Tensor const&)> const&, c10::DispatchKeySet, at::Tensor const&) const (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xacd0787)
    #10 0x7f20273230a7 in at::_ops::linalg_eig::redispatch(c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb4d40a7)
    #11 0x7f202c3cc32d in torch::autograd::VariableType::(anonymous namespace)::linalg_eig(c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1057d32d)
    #12 0x7f202c3cba96 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<std::tuple<at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&), &(torch::autograd::VariableType::(anonymous namespace)::linalg_eig(c10::DispatchKeySet, at::Tensor const&))>, std::tuple<at::Tensor, at::Tensor>, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, std::tuple<at::Tensor, at::Tensor> (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1057ca96)
    #13 0x7f20272798e0 in at::_ops::linalg_eig::call(at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xb42a8e0)
    #14 0x7f2043d97ae3 in torch::autograd::THPVariable_linalg_eig(_object*, _object*, _object*) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_python.so+0x23feae3)
    #15 0x5072d6 in cfunction_call /usr/local/src/conda/python-3.9.17/Objects/methodobject.c:543:19
    ...

SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior /var/lib/jenkins/workspace/aten/src/ATen/native/BatchLinearAlgebra.cpp:937:17 in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106354
Approved by: https://github.com/huydhn, https://github.com/lezcano
2023-08-03 05:36:03 +00:00
6e2a2849f0 [Typo]Fix a typo for index. (#106447)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106447
Approved by: https://github.com/awgu
2023-08-03 04:58:15 +00:00
a6f7dd4707 Catch cuda driver shutdown error in NCCLWatchdog (#106503)
There is a design flaw in NCCLWatchdog: it spawns threads that
talk to the CUDA API, but the CUDA API may have been deinitialized,
forming a race.

This is a known issue with widespread impact
(https://github.com/pytorch/pytorch/issues/90848).

I should point out that I tested this fix on the repro command for https://github.com/pytorch/pytorch/issues/82632 by running `NCCL_DESYNC_DEBUG=1 CUDA_LAUNCH_BLOCKING=1 python test/distributed/test_c10d_nccl.py -k test_find_unused_parameters_kwarg_debug_detail` and observing that, instead of crashing, we see log messages with the exception string about the CUDA driver shutdown error.

A partial fix was landed already, but it applied too narrowly:
ec071a0815

This PR is a copy-paste of the previous fix, applying to one more case,
plugging a hole.  We probably need to do a more thorough review and
either plug all the holes, or design this differently.
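A Python-flavored sketch of the defensive pattern being applied (the actual fix is in the C++ ProcessGroupNCCL watchdog; the helper names and the matched error text here are illustrative):

```python
def watchdog_loop(work_items):
    try:
        for work in work_items:
            work.poll_cuda_events()            # may call into the CUDA API
    except RuntimeError as err:
        if "driver shutting down" in str(err):
            # CUDA was deinitialized while the watchdog thread was still running;
            # log and exit the thread instead of crashing the process
            print(f"NCCL watchdog caught CUDA shutdown error: {err}")
            return
        raise
```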
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106503
Approved by: https://github.com/kwen2501
2023-08-03 04:14:52 +00:00
c9c2b14c53 Fix copy_ broadcast behavior on mps (#105617)
Fixes #105277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105617
Approved by: https://github.com/malfet
2023-08-03 04:03:32 +00:00
1c2918647a Revert PR #106442 (#106492)
Revert https://github.com/pytorch/pytorch/pull/106442 to prevent the diff train from going to Meta internal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106492
Approved by: https://github.com/osalpekar
2023-08-03 03:48:34 +00:00
77e369b363 Run minification for TorchDynamo benchmark models that fail evaluation (#106201)
### Description
As an alternative to PR #105774, which provides a standalone, end-to-end minification script that covers all types of failures and has more functionality, this PR adds the ability to minify models when they fail the eval loop (accuracy checks). Both this PR and the other one can be merged without issue.

### Purpose
The goal is to leverage the minifier to minify models that fail accuracy checks, allowing failed models to be debugged more easily. The ideal use-case is trying to run a model suite on a backend where operator coverage is not known or is limited. If a model compiles but fails the eval loop, having the repro script for each model is valuable for any developer who's trying to fix the issue.

### Functionality
- Create minify flag that minifies models when they fail accuracy check
- Produce minified graph for each model, and save it into repro script
- Move repro script to output directory/base Dynamo directory
- Enable functionality for running an entire model suite (Hugging Face, timm, and TorchBench) by prepending model name to repro script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106201
Approved by: https://github.com/ezyang
2023-08-03 03:34:04 +00:00
4b1872f1e1 [vision hash update] update the pinned vision hash (#106500)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106500
Approved by: https://github.com/pytorchbot
2023-08-03 03:29:52 +00:00
e1a0543dac [logs] Share same formatter between trace_source and other Dynamo loggers (#106493)
Earlier we would not have the formatter prefix - like [rank] etc. - on trace_source logs. This makes grepping trace_source output by rank harder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106493
Approved by: https://github.com/williamwen42
2023-08-03 02:53:52 +00:00
3322bfb66e [jit] test_complexity.py - don't set default dtype in global scope (#106486)
Summary:
Depending on import order, this was sometimes causing another assert to fail:
aec8418bd9/torch/testing/_internal/jit_metaprogramming_utils.py (L20)

Differential Revision: D48011132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106486
Approved by: https://github.com/eellison
2023-08-03 02:50:15 +00:00
14266b4955 make sure log tests are run in non-verbose mode (#106496)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106496
Approved by: https://github.com/voznesenskym
2023-08-03 02:45:35 +00:00
410bc558e6 Assert that args is of tuple type. (#106352)
This avoids accidental unpacking of tensor-type inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106352
Approved by: https://github.com/tugsbayasgalan
2023-08-03 01:47:38 +00:00
fd6e052a8a Some minor improvements to FakeTensor testing (#106311)
Summary:
- PyTorch testing chokes sometimes when it sees an exception where the first
  argument is not a string. fake_tensor.UnsupportedOperatorException's first
  arg is an OpOverload. This PR fixes PyTorch testing to not choke. I'm not
  really sure how to reproduce this in OSS.
- It turns out that if an operator does not have a meta kernel, the FakeTensor
  rule is really slow (30ms in OSS in debug mode, 3s on some internal config).
  The thing that is slow (aside from the previous diff) is waiting for the Dispatcher to
  report NotImplemented and then attempting to catch that. I'm not really sure
  why this is slow but it's easy to workaround so I added a workaround.

Test Plan: - existing tests

Differential Revision: D47917554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106311
Approved by: https://github.com/eellison
2023-08-03 01:44:15 +00:00
60237ccbdf fix bf16 constant accuracy (#105827)
This PR aims to sort out the data type for `constant`.

The constant should be promoted to float (https://github.com/pytorch/pytorch/pull/105440), so there are several changes to make:
 - Data type propagation should propagate a constant node to the `float` dtype if its original dtype is `bfloat16`
 - We do not need to insert `to_dtype` after the `constant` node; directly initializing an `fp32` constant is faster.
```
    vectorized<bfloat16> tmp(value);
    vectorized <float> tmp1 = cvt_bf16_fp32(tmp);
->
    vectorized<float> tmp(value);
```
 - move `constant` out of the list of operations that can support bf16 without converting to fp32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105827
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-08-03 01:17:50 +00:00
f82e6ff29e add channel last 3d support for batch_norm on CPU (#97774)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97774
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-08-03 01:16:05 +00:00
719c493f0b MemoryViz: print stream 0 if other streams exist (#106483)
It is confusing to not print stream 0 while printing other streams; it makes stream 0
allocations seem like they are missing a stream annotation. This change will print streams
for everything unless all the events are on stream 0, in which case it will just not print streams.
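A toy sketch of the display rule in Python (MemoryViz itself is JavaScript; this only restates the logic):

```python
def stream_label(event_stream, all_streams):
    if all(s == 0 for s in all_streams):
        return ""                          # everything on stream 0: omit labels entirely
    return f"stream {event_stream}"        # otherwise label every event, including stream 0
```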
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106483
Approved by: https://github.com/albanD
ghstack dependencies: #106328, #106482
2023-08-03 00:42:13 +00:00
6f07c57416 MemoryViz.js: format, move style (#106482)
This updates the JS format of MemoryViz.js to match the internal format.
It also moves the style sheet into the JS so it is easier to package for
both OSS and internal use.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106482
Approved by: https://github.com/aaronenyeshi
ghstack dependencies: #106328
2023-08-03 00:42:13 +00:00
820e68b58a [quant][pt2e] Add reference representation for quantized add - relu (#105707)
Summary:
Implementing the reference representation for quantized ops that we decided on in https://docs.google.com/document/d/17h-OEtD4o_hoVuPqUFsdm5uo7psiNMY8ThN03F9ZZwg/edit#heading=h.ov8z39149wy8

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_representation_add_relu

Although right now it is not really testing things, since there is some problem with dynamo export.
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105707
Approved by: https://github.com/andrewor14
2023-08-03 00:42:06 +00:00
ba387b8830 [easy][be] operator_config -> quantization_config renaming (#106479)
Summary:
att

Test Plan:
CIs

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106479
Approved by: https://github.com/andrewor14
2023-08-03 00:36:44 +00:00
f533791cd0 [SDPA] Mirror c++ implementation in FlashAttention meta func (#106477)
# Summary
Test edge case and update meta function to match the c++ implementation
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106477
Approved by: https://github.com/eellison
2023-08-03 00:28:27 +00:00
b3c29cd1ec Remove unused workflow.py (#106340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106340
Approved by: https://github.com/zhxchen17
2023-08-02 23:42:06 +00:00
cebc11ae8f Register ONNX exporter under PT2 logging (#105989)
As a first step of adopting PT2 logging for the ONNX exporter.
Also adds `torch/_logging` to `.github/merge_rules.yaml` for the ONNX exporter for easier follow-ups.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105989
Approved by: https://github.com/abock, https://github.com/ezyang
2023-08-02 23:33:38 +00:00
640a96dfbb [FSDP][Easy] Allow ModuleWrapPolicy to take Iterable (#104999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104999
Approved by: https://github.com/rohan-varma
ghstack dependencies: #104427, #104967
2023-08-02 22:03:03 +00:00
031ce0fadc [FSDP][7/N] Add warning about frozen params (#104967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104967
Approved by: https://github.com/rohan-varma
ghstack dependencies: #104427
2023-08-02 21:50:38 +00:00
bdcc454be4 [dynamo] Add missing fields for THPPyInterpreterFrame. (#103227)
Fixes https://github.com/pytorch/pytorch/issues/103210
Test Plan:
Before the fix:
```
pytest test/dynamo/test_export.py -k suppress_errors
```
got result:
```
  File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/eval_frame.py", line 295, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/eval_frame.py", line 448, in catch_errors
    return callback(frame, cache_size, hooks, frame_state)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 127, in _fn
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 360, in _convert_frame_assert
    return _compile(
           ^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/utils.py", line 180, in time_wrapper
    r = func(*args, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 511, in _compile
    exception_handler(e, code, frame)
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/convert_frame.py", line 216, in exception_handler
    log.error(format_error_msg(e, code, record_filename, frame))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/exc.py", line 248, in format_error_msg
    stack_above_dynamo = filter_stack(extract_stack(frame))
                                      ^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 231, in extract_stack
    stack = StackSummary.extract(walk_stack(f), limit=limit)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 393, in extract
    return klass._extract_from_extended_frame_gen(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 416, in _extract_from_extended_frame_gen
    for f, (lineno, end_lineno, colno, end_colno) in frame_gen:
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 390, in extended_frame_gen
    for f, lineno in frame_gen:
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/traceback.py", line 334, in walk_stack
    yield f, f.f_lineno
             ^^^^^^^^^^
AttributeError: 'torch._C.dynamo.eval_frame._PyInterpreterFrame' object has no attribute 'f_lineno'
```

After the fix:
```
pytest test/dynamo/test_export.py -k suppress_errors -s
```
Got Result:
```
  File "/data/users/zhxchen17/pytorch/torch/_dynamo/exc.py", line 135, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: map() operator doesn't support scalar or zero-sized tensors during
tracing.

========== The above exception occurred while processing the following code ==========

  File "/data/users/zhxchen17/pytorch/test/dynamo/test_export.py", line 3043, in forward
    def forward(self, xs):
  File "/data/users/zhxchen17/pytorch/test/dynamo/test_export.py", line 3047, in forward
    return map(body, xs)

==========
unimplemented [("map() operator doesn't support scalar or zero-sized tensors during tracing.", 1)]
.

=============================== 1 passed, 133 deselected in 4.60s ================================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103227
Approved by: https://github.com/williamwen42
2023-08-02 21:48:49 +00:00
a8c52863dd [FSDP][6/N] Check valid param freezing for ModuleWrapPolicy (#104427)
This PR adds improved error/warning messaging when auto wrapping with `ModuleWrapPolicy` in the presence of frozen parameters.
- For `use_orig_params=False`, FSDP requires uniform `requires_grad` for each FSDP instance. This PR adds a `ValueError` at wrapping time with a message that mentions the violating module and the frozen/non-frozen parameter names.
- For `use_orig_params=True`, FSDP allows non-uniform `requires_grad` for each FSDP instance. However, it will result in higher-than-expected gradient memory usage. This PR adds a `UserWarning` at wrapping time with a message that mentions the violating module, how much extra gradient memory will be used (in units of numel), and the frozen/non-frozen parameter names.
    - There is a possibility that this warning will be spammy/verbose, but my current thinking is that it is okay for now unless users complain.

<details>
<summary> Why DFS via named_children() vs. Using named_modules()</summary>

```
LoraModel(
  (embed_tokens): Embedding(100, 32)
  (layers): ModuleList(
    (0-3): 4 x LoraDecoder(
      (attn): LoraAttention(
        (q_proj): Linear(in_features=32, out_features=32, bias=False)
        (lora_A): Linear(in_features=32, out_features=8, bias=False)
        (lora_B): Linear(in_features=8, out_features=32, bias=False)
        (k_proj): Linear(in_features=32, out_features=32, bias=False)
        (v_proj): Linear(in_features=32, out_features=32, bias=False)
        (o_proj): Linear(in_features=32, out_features=32, bias=False)
      )
      (mlp): LoraMLP(
        (proj1): Linear(in_features=32, out_features=128, bias=False)
        (proj2): Linear(in_features=128, out_features=32, bias=False)
      )
      (inp_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
      (post_attn_layernorm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
  )
  (norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
)
```
Reverse topological order with stack-based DFS via `named_children()`:
```
[
  'embed_tokens',
  'layers.0.attn.q_proj', 'layers.0.attn.lora_A', 'layers.0.attn.lora_B', 'layers.0.attn.k_proj', 'layers.0.attn.v_proj', 'layers.0.attn.o_proj', 'layers.0.attn', 'layers.0.mlp.proj1', 'layers.0.mlp.proj2', 'layers.0.mlp', 'layers.0.inp_layernorm', 'layers.0.post_attn_layernorm', 'layers.0',
  'layers.1.attn.q_proj', 'layers.1.attn.lora_A', 'layers.1.attn.lora_B', 'layers.1.attn.k_proj', 'layers.1.attn.v_proj', 'layers.1.attn.o_proj', 'layers.1.attn', 'layers.1.mlp.proj1', 'layers.1.mlp.proj2', 'layers.1.mlp', 'layers.1.inp_layernorm', 'layers.1.post_attn_layernorm', 'layers.1',
  'layers.2.attn.q_proj', 'layers.2.attn.lora_A', 'layers.2.attn.lora_B', 'layers.2.attn.k_proj', 'layers.2.attn.v_proj', 'layers.2.attn.o_proj', 'layers.2.attn', 'layers.2.mlp.proj1', 'layers.2.mlp.proj2', 'layers.2.mlp', 'layers.2.inp_layernorm', 'layers.2.post_attn_layernorm', 'layers.2',
  'layers.3.attn.q_proj', 'layers.3.attn.lora_A', 'layers.3.attn.lora_B', 'layers.3.attn.k_proj', 'layers.3.attn.v_proj', 'layers.3.attn.o_proj', 'layers.3.attn', 'layers.3.mlp.proj1', 'layers.3.mlp.proj2', 'layers.3.mlp', 'layers.3.inp_layernorm', 'layers.3.post_attn_layernorm', 'layers.3',
  'layers', 'norm', ''
]
```
Reverse topological order with `named_modules()`:
```
[
  'norm',
  'layers.3.post_attn_layernorm', 'layers.3.inp_layernorm', 'layers.3.mlp.proj2', 'layers.3.mlp.proj1', 'layers.3.mlp', 'layers.3.attn.o_proj', 'layers.3.attn.v_proj', 'layers.3.attn.k_proj', 'layers.3.attn.lora_B', 'layers.3.attn.lora_A', 'layers.3.attn.q_proj', 'layers.3.attn', 'layers.3',
  'layers.2.post_attn_layernorm', 'layers.2.inp_layernorm', 'layers.2.mlp.proj2', 'layers.2.mlp.proj1', 'layers.2.mlp', 'layers.2.attn.o_proj', 'layers.2.attn.v_proj', 'layers.2.attn.k_proj', 'layers.2.attn.lora_B', 'layers.2.attn.lora_A', 'layers.2.attn.q_proj', 'layers.2.attn', 'layers.2',
  'layers.1.post_attn_layernorm', 'layers.1.inp_layernorm', 'layers.1.mlp.proj2', 'layers.1.mlp.proj1', 'layers.1.mlp', 'layers.1.attn.o_proj', 'layers.1.attn.v_proj', 'layers.1.attn.k_proj', 'layers.1.attn.lora_B', 'layers.1.attn.lora_A', 'layers.1.attn.q_proj', 'layers.1.attn', 'layers.1', 'layers.0.post_attn_layernorm', 'layers.0.inp_layernorm', 'layers.0.mlp.proj2', 'layers.0.mlp.proj1', 'layers.0.mlp', 'layers.0.attn.o_proj', 'layers.0.attn.v_proj', 'layers.0.attn.k_proj', 'layers.0.attn.lora_B', 'layers.0.attn.lora_A', 'layers.0.attn.q_proj', 'layers.0.attn', 'layers.0',
  'layers', 'embed_tokens', ''
]
```
With the stack-based DFS via `named_children()`, reversing the topological order gives us each level in the module tree in the registered order, whereas with `named_modules()`, reversing the topological order gives us each level in reverse. Both are valid orders, but we prefer the former since it allows us to error/warn on the _first-registered_ module that violates the frozen/non-frozen condition.
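
For illustration, here is a recursive sketch that reproduces the `named_children()` ordering listed above (the PR's actual implementation is an iterative, stack-based DFS, but the resulting order is the same):

```python
import torch.nn as nn

def reverse_topological_order(root: nn.Module):
    # Post-order DFS over named_children(): every child appears before its
    # parent, and siblings keep their registration order, matching the first
    # listing above. Illustrative sketch only, not FSDP's actual code.
    order = []

    def visit(prefix, module):
        for name, child in module.named_children():
            visit(f"{prefix}.{name}" if prefix else name, child)
        order.append(prefix)

    visit("", root)
    return order

# e.g. reverse_topological_order(LoraModel()) would yield
# ['embed_tokens', 'layers.0.attn.q_proj', ..., 'layers', 'norm', '']
# where LoraModel is the example module shown above.
```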

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104427
Approved by: https://github.com/ezyang
2023-08-02 21:44:44 +00:00
aec8418bd9 Pin conda to 23.5.2 for Docker builds (#106473)
Fixes #106470

Since conda released version 23.7.2 (https://github.com/conda/conda/releases), our nightly Docker build started to fail:
```
#28 12.53 ResolvePackageNotFound:
#28 12.53   - conda==23.5.2
```

This PR pins conda Docker install to 23.5.2 to fix nightly Docker release

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106473
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-08-02 21:04:21 +00:00
1f1dfa9be9 Fix grad higher order handling TupleVariable (#106425)
Previously, we assumed argnums was a **ConstantVariable**. However, I accidentally triggered an error on CI where argnums could be a **TupleVariable**. In that case, we got an attribute error when accessing the .value of argnums.

This PR adds support for the TupleVariable. It allows the following unit test to pass without falling back to eager:
"PYTORCH_TEST_WITH_DYNAMO=1 python test/functorch/test_eager_transforms.py -k test_argnums_cpu"

Test Plan:
see modified test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106425
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/kshitij12345
2023-08-02 20:57:05 +00:00
f998869160 AOTInductor compile in prod env (#106442)
Summary: This diff updates the Inductor internal compile workflow

Reviewed By: wushirong

Differential Revision: D47958727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106442
Approved by: https://github.com/houseroad
2023-08-02 20:39:00 +00:00
fb6652b56e [profiler] add profiler parsing support for custom device. (#106142)
We hope PyTorch's profiling parsing ability can also be applied to custom devices. Building on the previous work in https://github.com/pytorch/pytorch/pull/101554, we have made supplementary updates to PyTorch profiling to extend its parsing capabilities to custom devices. These modifications do not affect the original logic of the code and mainly include the following aspects:
1. Added the relevant logic for use_device in torch.profiler.profiler._KinetoProfile.
2. In torch.autograd.profiler and torch.autograd.profiler_util, added the ability to parse profiling data from custom devices using the privateuse1 and use_device attributes.
3. In torch._C._autograd.pyi, added custom-device-related attributes. The underlying C++ logic will be added in subsequent pull requests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106142
Approved by: https://github.com/aaronenyeshi
2023-08-02 20:23:22 +00:00
6339f57fae Update export/export-aot-inductor benchmark code (#106323)
Update export/export-aot-inductor benchmark code to use recent changes related to kwarg inputs and dataclass outputs.

Updated [dashboard](https://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2031%20Jul%202023%2017%3A28%3A05%20GMT&stopTime=Tue%2C%2001%20Aug%202023%2017%3A28%3A05%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=angelayi/benchmark&lCommit=f0987867a88b0b9510fcaf33307150e61517e7a1&rBranch=main&rCommit=f23d755e1f835485b8fef5661e7f983b520d844e)

80% pass rate on HF for export: P801372961
20% pass rate on HF for export-aot-inductor: [link](https://hud.pytorch.org/benchmark/huggingface/inductor_aot_inductor?startTime=Mon,%2031%20Jul%202023%2017:08:02%20GMT&stopTime=Tue,%2001%20Aug%202023%2017:08:02%20GMT&granularity=hour&mode=inference&dtype=bfloat16&lBranch=angelayi/benchmark&lCommit=f0987867a88b0b9510fcaf33307150e61517e7a1&rBranch=main&rCommit=f23d755e1f835485b8fef5661e7f983b520d844e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106323
Approved by: https://github.com/desertfire
2023-08-02 20:18:37 +00:00
3143d81f6c Add support for edge dialect ops in exir/serde (#106371)
Summary:
Adding support for edge dialect ops in `exir/serde`. This diff does the following:
- Moves the global `serialize_operator/deserialize_operator` implementations in`export/serde/serialize.py` into `GraphModuleSerializer` and `GraphModuleDeserializer`
- Adds implementations of `serialize_operator/deserialize_operator` inside `GraphModuleSerializer` and `GraphModuleDeserializer` in `exir/serde/serialize.py`

Test Plan: CI + Enabled edge dialect ops in `executorch/exir/tests/test_serde.py`

Differential Revision: D47938280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106371
Approved by: https://github.com/angelayi
2023-08-02 20:09:15 +00:00
cc38d40cec Document f parameter of torch.save (#106248)
Fixes #104359
Changing the documentation for a better description of the function torch.save() compared to torch.jit.save()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106248
Approved by: https://github.com/malfet
2023-08-02 19:32:44 +00:00
1534af2a5c Add type annotations to torch/__init__.py (#106214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106214
Approved by: https://github.com/Skylion007
2023-08-02 19:13:31 +00:00
bd84651e19 Replace sympy.solve with a new simplified one. (#105877)
This PR implements `try_solve`: a function that tries to move terms of a relational
expression around, so as to isolate a given variable on the left-hand side.

For example:

```python
>>> try_solve(Eq(a + 5, 3), a)
Eq(a, -2)
>>> try_solve(Gt(Mod(a, 3), 0), a) # returns None
>>> try_solve(Gt(Mod(a, 3), 0), Mod(a, 3))
Gt(Mod(a, 3), 0), Mod(a, 3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105877
Approved by: https://github.com/ezyang
2023-08-02 17:53:29 +00:00
bfed2da2e4 [Quant][PT2E] Re-enable test case of conv add/add_relu recipe for x86inductorquantizer (#105638)
**Summary**
Re-enable the test cases `test_conv2d_binary_with_quantizer_api` and `test_conv2d_binary_unary_with_quantizer_api` for X86InductorQuantizer. We previously disabled these 2 test cases due to a timeout issue in internal CI.

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_unary_with_quantizer_api
```

Differential Revision: [D47745372](https://our.internmc.facebook.com/intern/diff/D47745372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105638
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
2023-08-02 17:26:22 +00:00
7e47343d64 [BE] document more of FSDP checkpointing logic with a sprinkle of cleaning (#106069)
This PR should not make any functional difference. It:
- adds clearer documentation
- clarifies a type
- revises minor typos
- swaps a .keys for a .items call on a dictionary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106069
Approved by: https://github.com/awgu
2023-08-02 17:19:04 +00:00
ae1c0f42a3 update tf32 thresholds for H100 (#105879)
Addresses tf32 threshold related failures from NVIDIA internal testing for following unit tests:

H100:
- test_nn.py: test_ConvTranspose2d_dilated_cuda_tf32, test_ConvTranspose2d_no_bias_cuda_tf32, test_Transformer_multilayer_coder_cuda_tf32
- test_torch.py: test_cdist_non_contiguous_batch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105879
Approved by: https://github.com/ezyang
2023-08-02 16:44:01 +00:00
7820bd8404 Disable TV if Z3 is not found. (#106399)
Fix: #106276

This PR disables translation validation when running _test/dynamo/test_dynamic_shapes.py_
if Z3 is not installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106399
Approved by: https://github.com/ezyang
2023-08-02 16:38:19 +00:00
57f2a8d3a8 freezing w aot (#105497)
Freezing will take parameters and turn them into constants. A couple changes here:

-  move the setting of `flat_params[dropped_index]` before cpp compilation so that cpp_wrapper knows they have been dropped
- compile_fx_aot doesn't use aot_autograd for invocation, so we no longer add the wrapper which discards dropped param indices. Continuing to add arguments everywhere didn't seem great, so I added `_in_aot_compilation`, but maybe reviewers would prefer something else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105497
Approved by: https://github.com/desertfire
2023-08-02 16:30:08 +00:00
63b7be5a6f Use ProcessPoolExecutor in the ufmt adapter (#106123)
When running on a host with multiple CPUs, the ufmt linter was not able to use them very effectively. The biggest single culprit seems to be debug logging inside blib2to3 trying to acquire a lock, but disabling that doesn't help much - I suppose this must be GIL contention. Changing to a ProcessPoolExecutor makes it much faster.

The following timings are on a PaperSpace GPU+ instance with 8 vCPUs (the cores show up as Intel(R) Xeon(R) CPU E5-2623 v4 @ 2.60GHz but I'm not entirely clear if those are shared with other instances).

On main:

```
$ time lintrunner --all-files --take UFMT
ok No lint issues.

real    7m46.140s
user    8m0.828s
sys     0m5.446s
```

On this branch:

```
$ time lintrunner --all-files --take UFMT
ok No lint issues.

real    1m7.255s
user    8m13.388s
sys     0m3.506s
```
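
As a rough, hedged sketch of the change (not the adapter's actual code; names like `format_source` are made up): the formatting work is CPU-bound, so fanning it out to worker processes sidesteps the GIL, whereas threads serialize on it.

```python
import concurrent.futures as futures

def format_source(src: str) -> str:
    # Placeholder for the CPU-bound per-file formatting the adapter does
    # (the real code calls into ufmt/black here).
    return "\n".join(line.rstrip() for line in src.splitlines()) + "\n"

if __name__ == "__main__":
    sources = ["x = 1   \n", "def f():   \n    return 2   \n"]  # toy inputs
    # Previously a ThreadPoolExecutor was used; for CPU-bound work the GIL
    # serializes the threads, so a ProcessPoolExecutor is the better fit.
    with futures.ProcessPoolExecutor() as pool:
        formatted = list(pool.map(format_source, sources))
    print(formatted)
```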
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106123
Approved by: https://github.com/ezyang
2023-08-02 16:28:36 +00:00
e7b2430818 add pruning method: Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration (#95689)
add `class FPGMStructured`
add `function FPGM_structured()`
add `function _validate_distance_type()`
add `function _compute_distance()`

Implement method mentioned in issue #39765

---
FPGMSparsifier is implemented with the new PyTorch pruning API torch.ao.pruning.
It is a structured pruning method, and it is added under torch.ao.pruning._experimental. Test cases are added at `test_structured_sparsifier.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95689
Approved by: https://github.com/jcaip
2023-08-02 16:24:42 +00:00
aa0b4dac46 Check that mypy is installed (#106212)
Otherwise `python -mmypy filenames` just outputs something like "No module named mypy" and the lint run looks successful.

Fixes #78695, or at least one possible cause for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106212
Approved by: https://github.com/ezyang
2023-08-02 16:11:58 +00:00
dfcfd5cedb Revert "Add nn.CircularPad{*}d for consistency + fix no_batch_dim support (#106148)"
This reverts commit 87d253697116eee12d6010233d0a57fd5b152e9e.

Reverted https://github.com/pytorch/pytorch/pull/106148 on behalf of https://github.com/malfet due to Reverting as dependent PR https://github.com/pytorch/pytorch/pull/106147 was reverted as well ([comment](https://github.com/pytorch/pytorch/pull/106148#issuecomment-1662344543))
2023-08-02 14:46:00 +00:00
b37a50afda [ROCm] fix ROCm 5.5 nightly build after hipblas change (#106438)
Fixes ROCm 5.5 nightly build broken by #105881.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106438
Approved by: https://github.com/ezyang
2023-08-02 13:39:34 +00:00
f81f9093ec [core][pruning][feature] cuSPARSELt build integration (#103700)
Summary:

This stack of PR's integrates cuSPARSELt into PyTorch.

This PR adds support for cuSPARSELt into the build process.
It adds a new flag, USE_CUSPARSELT, that defaults to false.

When USE_CUSPARSELT=1 is specified, the user can also specify
CUSPARSELT_ROOT, which defines the path to the library.

Compiling pytorch with cusparselt support can be done as follows:

```
USE_CUSPARSELT=1
CUSPARSELT_ROOT=/path/to/cusparselt

python setup.py develop
```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103700
Approved by: https://github.com/albanD
2023-08-02 12:48:39 +00:00
d83b887f2a Revert "Add error checking for padding modules (#106147)"
This reverts commit 0547b6279d6f7249c0e588508c2561589514d3aa.

Reverted https://github.com/pytorch/pytorch/pull/106147 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal builds, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/106147#issuecomment-1661870970))
2023-08-02 09:37:40 +00:00
fdd4b3aaa8 Revert "faketensor: prevent deepcopy from cloning FakeTensorMode (#104476)"
This reverts commit c54afea6eeb016b0a5ad7006f25746b2c83eaf9a.

Reverted https://github.com/pytorch/pytorch/pull/104476 on behalf of https://github.com/jeanschmidt due to sadly it is breaking internal tests, and I can't coordinate a FF due to timezone differences ([comment](https://github.com/pytorch/pytorch/pull/104476#issuecomment-1661808343))
2023-08-02 08:56:33 +00:00
d528a137e0 [quant][pt2e][quantizer] Support set_module_type in XNNPACKQuantizer (#106094)
Summary:
Added support to allow users to set configurations based on module type in XNNPACKQuantizer; this can also serve as an example
for implementing new quantizers.
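
A hedged usage sketch (the import path and helper names reflect the API around this time and are assumptions; they may have moved since):

```python
import torch
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
# Apply the symmetric config to every nn.Linear in the model, leaving other
# module types to whatever global/default configuration is set.
quantizer.set_module_type(torch.nn.Linear, get_symmetric_quantization_config())
```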

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_xnnpack_quantizer_set_module_type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106094
Approved by: https://github.com/andrewor14
ghstack dependencies: #106087
2023-08-02 08:33:58 +00:00
936333fd5f Fix the Requirement of CMake Version (#106254)
Fix the Requirement of CMake Version

Many CMakeLists.txt require cmake versions greater than 3.18.0, so the cmake version in cmake.py is not correct.
1da4115702/CMakeLists.txt (L1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106254
Approved by: https://github.com/malfet
2023-08-02 08:02:52 +00:00
8ee0b17990 Fix reference cycle in our test suite (#106328)
In certain cases we capture ErrorMeta in a list. The ErrorMeta objects hold
tracebacks which contain a frame with a local variable that refers to that list.
This change mutates the list on exit from the frame so that it doesn't refer
to the ErrorMeta objects, breaking the cycle.
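
A schematic, hedged example of the pattern being broken (names are illustrative, not the test suite's actual code): the captured tracebacks reference the frame whose local list holds them, and clearing that list on exit breaks the cycle.

```python
import sys

def check_all(values):
    error_metas = []                 # local referenced by the tracebacks below
    for v in values:
        try:
            assert v >= 0, f"negative value: {v}"
        except AssertionError:
            # Stand-in for ErrorMeta: keep the traceback for later reporting.
            # The traceback references this frame, whose locals include
            # `error_metas`, which references the traceback -> a cycle.
            error_metas.append(sys.exc_info()[2])
    try:
        return list(error_metas)     # hand the caller a shallow copy
    finally:
        error_metas.clear()          # mutate the list on frame exit so it no
                                     # longer refers to the captured objects
```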
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106328
Approved by: https://github.com/huydhn
2023-08-02 07:58:32 +00:00
30442c039c fix torch.norm for custom device (#106198)
Fixes #ISSUE_NUMBER
as title, fix torch.norm for custom device

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106198
Approved by: https://github.com/ezyang
2023-08-02 06:25:52 +00:00
05bd24bb35 Extend Inductor to support the third-party backend (#100706)
This PR intends to extend Inductor to support the third-party backend that only focuses on the code generation just like what C++/OpenMP and Triton backend have done.

Currently, the generated code by Inductor contains two major parts. One is the kernel, and the other is the Python wrapper to glue the kernel. Therefore, the third-party backend needs to customize the two parts to generate its specific code.

- Python wrapper code generation

  Inductor provides a `WrapperCodeGen` class to generate the Python wrapper code to glue the kernel. Therefore, it is straightforward for the third-party backend to generate the backend-specific Python wrapper code. It just needs to inherit the `WrapperCodeGen` class and purposely override the particular member functions.

- Kernel code generation

  It is driven by different `Scheduling`. Hence, the third-party backend needs to provide a custom `Scheduling` for its specific kernel code generation. Currently, `CppScheduling` and `TritonScheduling` are for C++/OpenMP and Triton backend, respectively. But there is no common `Scheduling` class. Based on the scheduling invocation, this PR abstracts a common `Scheduling` class containing the following member functions.

  -   [group_fn](71c4becda7/torch/_inductor/scheduler.py (LL649C64-L649C64))
  - [flush](71c4becda7/torch/_inductor/scheduler.py (L1150))
  - [can_fuse_vertical](71c4becda7/torch/_inductor/scheduler.py (L1006))
  - [can_fuse_horizontal](71c4becda7/torch/_inductor/scheduler.py (LL1008C45-L1008C64))
  - [codegen_template](71c4becda7/torch/_inductor/scheduler.py (L1234)) _This function is only available for triton. If the third-party backend behaves as a sub-class of `TritonScheduling`, it can override it or reuse it._
  - [codegen_nodes](71c4becda7/torch/_inductor/scheduler.py (L1234))
  - [codegen_sync](71c4becda7/torch/_inductor/scheduler.py (LL1251C1-L1251C1)). _This function is only available for triton debug purpose. But it might also be useful for other computation devices. Therefore, we'd prefer to keep this function._

  The third-party backend needs to inherit from the `Scheduling` class and implement these functions.

Regarding some other classes like `CppKernel` and `TritonKernel` for code generation, they are used by or part of the logic of either `Scheduling` or `WrapperCodeGen`. Hence, this PR does not define the interface and leaves the flexibility to the third-party backend. The third-party backend can decide to implement these classes from scratch or reuse them by inheriting and overriding them.
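
As a self-contained sketch of the extension points described above: the real `WrapperCodeGen` and common `Scheduling` classes live inside torch/_inductor, but here they are replaced by stand-in stubs (with made-up signatures and backend names) so the example stays runnable; only the method names come from the description.

```python
class WrapperCodeGen:        # stand-in for Inductor's Python-wrapper codegen
    def write_header(self): ...
    def generate(self): ...

class Scheduling:            # stand-in for the common Scheduling interface
    def group_fn(self, sizes): ...
    def flush(self): ...
    def can_fuse_vertical(self, node1, node2): ...
    def can_fuse_horizontal(self, node1, node2): ...
    def codegen_nodes(self, nodes): ...
    def codegen_sync(self): ...

class MyBackendWrapperCodeGen(WrapperCodeGen):
    """Emits the Python glue that calls the custom backend's kernels."""
    def write_header(self):
        ...  # e.g. import the backend's runtime instead of the triton/cpp bits

class MyBackendScheduling(Scheduling):
    """Drives kernel code generation for the custom device."""
    def can_fuse_vertical(self, node1, node2):
        return False  # conservative: no fusion in this sketch
    def can_fuse_horizontal(self, node1, node2):
        return False
    def codegen_nodes(self, nodes):
        ...  # lower each scheduled node to the backend's kernel source here
    def codegen_sync(self):
        ...  # emit a device synchronization if the backend needs one
```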

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100706
Approved by: https://github.com/jansel
2023-08-02 05:13:51 +00:00
850ad54139 correct spelling mistake (#106309)
Fixes #ISSUE_NUMBER
correct spelling mistake
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106309
Approved by: https://github.com/kit1980
2023-08-02 04:38:23 +00:00
04f20bb285 Use isinstance(foreach_arg.type, ListType) for correctness (#106428)
The intention here is to check the element type of `List`, not all types with an `elem` attribute, such as Optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106428
Approved by: https://github.com/soulitzer
2023-08-02 04:03:58 +00:00
92cac6bf32 InductorCpp: Fix "call to constructor is ambiguous" error (#106418)
Not sure why `{{}}` is better than just calling a default constructor, but removing it fixes:
```
% python test_cpp_wrapper.py -v -k test_profiler_mark_wrapper_call_cpu_cpp_wrapper
....
clang++ -MMD -MF main.o.d -DTORCH_EXTENSION_NAME=inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_clang\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1002\" -I/var/lib/jenkins/workspace/test/inductor/-I/var/lib/jenkins/workspace/torch/include -I/var/lib/jenkins/workspace/torch/include/torch/csrc/api/include -I/var/lib/jenkins/workspace/torch/include/TH -I/var/lib/jenkins/workspace/torch/include/THC -I/opt/conda/envs/py_3.9/include/python3.9 -isystem /var/lib/jenkins/workspace/torch/include -isystem /var/lib/jenkins/workspace/torch/include/torch/csrc/api/include -isystem /var/lib/jenkins/workspace/torch/include/TH -isystem /var/lib/jenkins/workspace/torch/include/THC -isystem /opt/conda/envs/py_3.9/include/python3.9 -D_GLIBCXX_USE_CXX11_ABI=1 -fPIC -std=c++17 -std=c++17 -Wno-unused-variable -O3 -ffast-math -fno-finite-math-only -march=native -fopenmp -Wall -DCPU_CAPABILITY_AVX512 -D C10_USING_CUSTOM_GENERATED_MACROS -c /tmp/torchinductor_jenkins/py39_cpu/inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v/main.cpp -o main.o
/tmp/torchinductor_jenkins/py39_cpu/inline_extension_cwnqbbq5lr6hsaktauqhm5hulaxgvvwxphzkz3docrqablnmbd4v/main.cpp:41:50: error: call to constructor of 'c10::ArrayRef<c10::IValue>' is ambiguous
        RECORD_FUNCTION("inductor_wrapper_call", c10::ArrayRef<c10::IValue>({{}}));
                                                 ^                          ~~~~
/var/lib/jenkins/workspace/torch/include/ATen/record_function.h:580:38: note: expanded from macro 'RECORD_FUNCTION'
      at::RecordScope::FUNCTION, fn, inputs, ##__VA_ARGS__)
                                     ^~~~~~
/var/lib/jenkins/workspace/torch/include/ATen/record_function.h:561:20: note: expanded from macro 'RECORD_FUNCTION_WITH_SCOPE'
        guard, fn, inputs, ##__VA_ARGS__);                 \
                   ^~~~~~
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:40:7: note: candidate constructor (the implicit move constructor)
class ArrayRef final {
      ^
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:40:7: note: candidate constructor (the implicit copy constructor)
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:71:13: note: candidate constructor
  constexpr ArrayRef(const T& OneElt) : Data(&OneElt), Length(1) {}
            ^
/var/lib/jenkins/workspace/torch/include/c10/util/ArrayRef.h:126:28: note: candidate constructor
  /* implicit */ constexpr ArrayRef(const std::initializer_list<T>& Vec)
                           ^
1 error generated.
```
if clang12 is used as the host compiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106418
Approved by: https://github.com/desertfire
2023-08-02 04:02:15 +00:00
17a3141696 Support is_mtia() (#106396)
Summary: As title.

Test Plan: CI tests.

Reviewed By: yuhc

Differential Revision: D47937061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106396
Approved by: https://github.com/yuhc, https://github.com/ezyang
2023-08-02 03:24:23 +00:00
4c3e137157 [vision hash update] update the pinned vision hash (#106434)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106434
Approved by: https://github.com/pytorchbot
2023-08-02 02:59:09 +00:00
d1a2aa1909 [MPS] Fix MPS clamp issue with different dtypes between input and min/max tensors (#105747)
- Fix the FP16 clamp issue (FP32 and FP16 are not broadcast compatible)
- Fix clamp (cached graph nodes were previously replaced with the cast version)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105747
Approved by: https://github.com/kulinseth
2023-08-02 02:51:34 +00:00
af37608276 Remove duplicate ops tests in test_quantized_op.py (#106398)
The duplicates are after https://github.com/pytorch/pytorch/pull/94170
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106398
Approved by: https://github.com/izaitsevfb, https://github.com/malfet, https://github.com/jerryzh168
2023-08-02 02:37:36 +00:00
dd12c4c2cb Fix wrong class name in comments (#106419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106419
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2023-08-02 02:32:56 +00:00
c29f8ccc02 [inductor][easy] Improved warning message for missing OMP on mac (#106241)
If 'omp.h' file not found is encountered, link to https://github.com/pytorch/pytorch/issues/95708 for suggestions on how to work around this

Error message after this change:
```
(pytorch-docs) dberard@dberard-mbp scripts % python inductor_mac_cpu.py
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] WON'T CONVERT fn /Users/dberard/Documents/scripts/inductor_mac_cpu.py line 3
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] due to:
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] Traceback (most recent call last):
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]   File "/Users/dberard/Documents/pytorch/torch/_inductor/codecache.py", line 953, in compile_file
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]     raise exc.CppCompileError(cmd, output) from e
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] CppCompileError: C++ compile error
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] Command:
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] g++ /var/folders/1k/9l4kxkgx6jn28jn_pp2th63c0000gn/T/torchinductor_dberard/4s/c4sg7sknpldhmeuikjbbjt7lcjvndzrr7h2ml2iqprmzyjjw6sn4.cpp -shared -fPIC -Wall -std=c++17 -Wno-unused-variable -I/Users/dberard/Documents/pytorch/torch/include -I/Users/dberard/Documents/pytorch/torch/include/torch/csrc/api/include -I/Users/dberard/Documents/pytorch/torch/include/TH -I/Users/dberard/Documents/pytorch/torch/include/THC -I/Users/dberard/miniconda3/envs/pytorch-docs/include/python3.9 -I/Users/dberard/miniconda3/envs/pytorch-docs/include -L/Users/dberard/miniconda3/envs/pytorch-docs/lib -lomp -O3 -ffast-math -fno-finite-math-only -Xclang -fopenmp -D C10_USING_CUSTOM_GENERATED_MACROS -o /var/folders/1k/9l4kxkgx6jn28jn_pp2th63c0000gn/T/torchinductor_dberard/4s/c4sg7sknpldhmeuikjbbjt7lcjvndzrr7h2ml2iqprmzyjjw6sn4.so
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] Output:
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] In file included from /var/folders/1k/9l4kxkgx6jn28jn_pp2th63c0000gn/T/torchinductor_dberard/4s/c4sg7sknpldhmeuikjbbjt7lcjvndzrr7h2ml2iqprmzyjjw6sn4.cpp:2:
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] /var/folders/1k/9l4kxkgx6jn28jn_pp2th63c0000gn/T/torchinductor_dberard/i5/ci5uspp363v3ky6jkccllm3bxudy2fkdpqinkqhmpehfihejs7ko.h:8:10: fatal error: 'omp.h' file not found
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] #include <omp.h>
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]          ^~~~~~~
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] 1 error generated.
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING] Try setting OMP_PREFIX; see https://github.com/pytorch/pytorch/issues/95708
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,731] torch._dynamo.convert_frame: [WARNING]
[2023-07-28 17:15:25,732] torch._dynamo.convert_frame: [WARNING] converting frame raised error, suppressing error
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106241
Approved by: https://github.com/jansel
2023-08-02 02:12:27 +00:00
e7115dbecf [pytorch] Suppress C4624 warnings on Windows (#106348)
Summary:
Building on Microsoft Visual Studio can show excessive warnings of the form:
```
caffe2\c10\util\Optional.h(212): warning C4624: 'c10::constexpr_storage_t<T>': destructor was implicitly defined as deleted
        with
        [
            T=std::string
        ]
caffe2\c10\util\Optional.h(411): note: see reference to class template instantiation 'c10::constexpr_storage_t<T>' being compiled
        with
        [
            T=std::string
        ]
caffe2\c10\util\Optional.h(549): note: see reference to class template instantiation 'c10::trivially_copyable_optimization_optional_base<T>' being compiled
        with
        [
            T=std::string
        ]
```

While we have macros such as `C10_CLANG_DIAGNOSTIC_{PUSH,POP,IGNORE}`, there's no equivalent `C10_MSVC_DIAGNOSTIC_*`, so just do the suppressions explicitly.

Test Plan: CI should complete, but Windows build log will no longer contain C4624 warnings

Differential Revision: D47736268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106348
Approved by: https://github.com/albanD
2023-08-02 01:57:21 +00:00
92a22a8098 [quant][pt2e][quantizer] Support set_module_name in XNNPACKQuantizer (#106087)
Summary:
Added support to allow users to set configurations based on module name in XNNPACKQuantizer; this can also serve as an example
for implementing new quantizers.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_xnnpack_quantizer_set_module_name

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106087
Approved by: https://github.com/andrewor14
2023-08-02 01:19:23 +00:00
9ba0558d48 Add sequence_nr to aot_autograd to map forward ops to their corresponding backward ops (#103129)
Fixes #102375

Sequence_nr increments in the forward pass and decrements in the backward pass.  Backward ops with the same sequence_nr as a forward op represent the backward implementation for the op.  The long term goal is to make this information available to the profiler so users can observe which ops are fused by the inductor openai triton kernels.

Added a test for this feature **test/dynamo/test_aot_autograd.py::AotAutogradFallbackTests::test_aot_sequence_nr**.  The test case uses **aot_export_module()** to create a joint fwd/bwd fx graph.  Then it walks all the nodes in fx graph using fx_graph.graph.nodes.   The seq_nr of each node is recorded in node.meta.  During the fwd pass the seq_nr increments and it decrements during the bwd pass.  This allows the user to map forward ops to their corresponding bwd ops which is useful for performance analysis.

Expected output from the test case.

SeqNr|OrigAten|SrcFn
---|---|---
0|aten.convolution.default|l__self___conv1
0|aten.add.Tensor|l__self___bn1
1|aten._native_batch_norm_legit_functional.default|l__self___bn1
2|aten.relu.default|l__self___relu1
3|aten.add.Tensor|add
4|aten.view.default|flatten
5|aten.t.default|l__self___fc1
6|aten.unsqueeze.default|l__self___fc1
7|aten.mm.default|l__self___fc1
8|aten.squeeze.dim|l__self___fc1
9|aten.add.Tensor|l__self___fc1
10|aten.sub.Tensor|l__self___loss_fn
11|aten.abs.default|l__self___loss_fn
12|aten.mean.default|l__self___loss_fn
12|aten.ones_like.default|
12|aten.expand.default|
12|aten.div.Scalar|
11|aten.sgn.default|
11|aten.mul.Tensor|
8|aten.unsqueeze.default|
7|aten.t.default|
7|aten.mm.default|
7|aten.t.default|
7|aten.t.default|
7|aten.mm.default|
6|aten.squeeze.dim|
5|aten.t.default|
4|aten.view.default|
2|aten.threshold_backward.default|
1|aten.native_batch_norm_backward.default|
0|aten.convolution_backward.default|
0|aten.add.Tensor|
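
A hedged sketch of how such a table could be produced by walking the joint graph (the `node.meta` key names below are assumptions chosen to mirror the columns above):

```python
def print_seq_nr_table(fx_graph):
    # fx_graph: the joint fwd/bwd torch.fx.GraphModule from aot_export_module.
    print("SeqNr|OrigAten|SrcFn")
    for node in fx_graph.graph.nodes:
        if node.op != "call_function":
            continue
        print(f'{node.meta.get("seq_nr")}|'
              f'{node.meta.get("original_aten")}|'
              f'{node.meta.get("source_fn", "")}')
```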

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103129
Approved by: https://github.com/soulitzer
2023-08-02 00:52:52 +00:00
0cba33e176 [DTensor]Minor Docstring Update (#106250)
Fix docstring to reflect change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106250
Approved by: https://github.com/wanchaol
2023-08-02 00:27:29 +00:00
5ebb18c5c6 exclude internal folders from lint (#106291)
Title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106291
Approved by: https://github.com/ZainRizvi
2023-08-02 00:15:32 +00:00
76163a56c0 Refactor stack handling to always use TracingContext to populate real stack on exception (#106277)
The basic gist of the PR is simple, but it's accompanied with some careful modifications and unit tests to make sure I got it right. Check inline comments for more details.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106277
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2023-08-02 00:09:16 +00:00
753991b8c5 aot_inductor: properly split code gen compilation command (#105367)
Summary:
Proactively fix it so we don't run into strange things in the future.

```
In [5]: cmd = '''gcc "single arg with space"'''

In [6]: print(cmd)
gcc "single arg with space"

In [7]: cmd.split(' ')
Out[7]: ['gcc', '"single', 'arg', 'with', 'space"']

In [8]: shlex.split(cmd)
Out[8]: ['gcc', 'single arg with space']
```

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D47532486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105367
Approved by: https://github.com/chenyang78
2023-08-01 23:32:44 +00:00
5e3aca6c5c [BE] Input check for torch.nn.MultiheadAttention (#106363)
Summary: Check `embed_dim` and `num_heads` of `torch.nn.MultiheadAttention`.
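
An illustrative sketch of the kind of validation described (exact conditions and messages are assumptions, not the PR's literal diff):

```python
def check_mha_args(embed_dim: int, num_heads: int) -> None:
    # Hypothetical helper mirroring the input checks described above.
    if embed_dim <= 0 or num_heads <= 0:
        raise ValueError(
            f"embed_dim and num_heads must be greater than 0, got "
            f"embed_dim={embed_dim} and num_heads={num_heads} instead"
        )
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
```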

Test Plan: Please see GitHub Actions.

Differential Revision: D47943134

Fix: #105630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106363
Approved by: https://github.com/mikaylagawarecki
2023-08-01 23:28:23 +00:00
ef0576f203 [Benchmarks] Updated CSVs for improvement to visformer_small (#106414)
# Summary
Accuracy improvement from https://github.com/pytorch/pytorch/actions/runs/5730904800/job/15531967776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106414
Approved by: https://github.com/kit1980
2023-08-01 23:02:22 +00:00
40184b28eb [ROCm] enabling miopen_batch_norm lowering in inductor (#105740)
Enabling miopen_batch_norm lowering for inductor only.

This is to avoid errors observed in some models; the perf difference is very small based on initial benchmarks.
```
LoweringException: RuntimeError: Expected contiguous tensor, but got non-contiguous tensor for argument #1 'input' (while checking arguments for miopen_batch_norm)
  target: aten.miopen_batch_norm.default
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105740
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-01 22:39:17 +00:00
7a3503dfd8 Add _foreach_sign (#106343)
Rel:
- #106221

Should we add foreach of [`torch.sgn`](https://pytorch.org/docs/stable/generated/torch.sgn.html) as well?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106343
Approved by: https://github.com/janeyx99
2023-08-01 22:33:34 +00:00
b46a89bcfb Remove skipIfROCm from test_cuda_repro.py (#106416)
This is to fix tests failures after land race of https://github.com/pytorch/pytorch/pull/105662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106416
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-08-01 22:28:42 +00:00
3ce7abe111 Fix whenRegisteringAutogradKernelWithCatchAllKernel_thenCanCallAutogradKernel (#106023)
Both test cases should call catchallkernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106023
Approved by: https://github.com/bdhirsh
2023-08-01 22:27:42 +00:00
97e5055a69 Add cumprod support for device mps (#104688)
Related to #77764

Add support for the cumprod operation (which in turn allows its gradient). This also allows us to compute the gradient of prod since it was blocked behind cumprod in the case where exactly one element of the tensor was 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104688
Approved by: https://github.com/kulinseth
2023-08-01 21:51:20 +00:00
fadd0859ca Expose module method in ExportedProgram (#105575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105575
Approved by: https://github.com/zhxchen17
2023-08-01 21:28:57 +00:00
60e65a70e5 [ROCm] enable cudagraph inductor UTs on ROCm (#105662)
These tests can now be enabled after a hipGraph fix landed in 5.6.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105662
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-08-01 20:55:27 +00:00
506b55fc29 [FSDP][Easy] Move _FSDPState attrs to avoid comment confusion (#106392)
Resubmit of https://github.com/pytorch/pytorch/pull/106333 after rebasing (I lost the original branch locally)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106392
Approved by: https://github.com/kwen2501
2023-08-01 20:39:22 +00:00
5c3aae8385 [ONNX] Support type promoting sym number representing scalar output (#106178)
Summary:
* Add test cases distilled from models that requires setting dynamo config `capture_scalar_outputs` and `capture_dynamic_output_shape_ops` to True. Kudos to #105962 both configs are on by default for export now.
* Improve type promotion to support fx.Node of sym number representing scalar output.
* Bug fix: `onnxfunction_dispatcher` would crash if an input was mis-aligned to be attribute when doing schema matching.
* Misc: re-enable op tests that are already passing.
* Needs https://github.com/microsoft/onnxscript/pull/931. Waiting for merge and the publishing of the new whl.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106178
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-08-01 20:08:49 +00:00
449f481de0 [memory snaphots] record for all devices (#106346)
Previously calling _record_memory_history would only start recording
for a single device because snapshots were also device specific.

Now the visualizer packages all devices into a single page, so snapshot
recording should also enable recording for all devices.

Verified locally that calling the method does not initialize cuda context
on devices that have not previously been used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106346
Approved by: https://github.com/eellison
2023-08-01 19:56:15 +00:00
d2a9b256f0 [DCP][Test]Remove broken 2d checkpoint test (#106367)
Removing this broken test as we are not going to land the fix for 2D regression. Instead, we are going to migrate to use device_mesh and dtensor state_dict for 2D.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106367
Approved by: https://github.com/fduwjj
2023-08-01 19:54:40 +00:00
4b2c6282e0 Modify signature for tensor.tile in doc (#106295)
Fixes #71476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106295
Approved by: https://github.com/soulitzer
2023-08-01 19:51:52 +00:00
05b2a6c8db [ONNX] Do not run 'deduplicate_initializers' when 'keep_initializers_as_inputs' is True (#96320)
### Proposal
When the arg 'keep_initializers_as_inputs' is True, it's quite possible that parameters are set via the initializers of inputs.
Hence we should disable the de-duplicate-initializers optimization when 'keep_initializers_as_inputs==True'.

- [x] Update doc related to `keep_initializers_as_inputs`.
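
For reference, a minimal hedged example of an export where the flag matters (the model and file names are made up):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(3, 4)

    def forward(self, x):
        return self.fc(x)

torch.onnx.export(
    M(), (torch.randn(2, 3),), "m.onnx",
    # With True, the weights stay graph inputs and may be overridden at run
    # time, so the exporter should not de-duplicate them.
    keep_initializers_as_inputs=True,
)
```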
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96320
Approved by: https://github.com/abock, https://github.com/thiagocrepaldi
2023-08-01 19:42:57 +00:00
cfa4edcde0 [SDPA] Update dispatch checks to catch last_dim_stride != 1. Also update mask padding logic (#106102)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at bb1fc29</samp>

This pull request simplifies and refactors the code for fused scaled dot product attention kernels in `attention.cu` and `sdp_utils.cpp`, and adds new input validation checks and tests. It also modifies the `sdp_params` struct to store optional mask tensors directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106102
Approved by: https://github.com/cpuhrsch
2023-08-01 19:13:01 +00:00
1a6f1d816d [Doc] Add proj_size < hidden_size in LSTM (#106364)
Summary:
Add parameter constraint: `proj_size` has to be smaller than `hidden_size` in RNNBase doc.
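
A quick hedged illustration of the constraint:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, proj_size=5)   # 5 < 20: fine

try:
    nn.LSTM(input_size=10, hidden_size=20, proj_size=20)     # 20 is not < 20
except ValueError as e:
    print(e)  # proj_size has to be smaller than hidden_size
```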

Ref:
ceea08a986/torch/nn/modules/rnn.py (L83)

ceea08a986/torch/nn/modules/rnn.py (L458)

Test Plan: Please see GitHub Actions.

Differential Revision: D47943365

Fix: #105628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106364
Approved by: https://github.com/mikaylagawarecki
2023-08-01 18:58:27 +00:00
6d2162e644 Remove fake_mode arg from torch._dynamo.export API (#106345)
#105477 removes the need of explicitly specifying `fake_mode`.
The same effect can be achieved by wrapping `torch._dynamo.export` around a `torch._subclasses.FakeTensorMode` context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106345
Approved by: https://github.com/ezyang
2023-08-01 17:52:06 +00:00
596491f1f5 Propagate dynamic int on __setitem__. (#105923)
Fix: #105533

This PR propagates dynamic ints used as indices for `__setitem__`. In summary, we:

- Replace the integer type for `TensorIndex` (both the enum and the corresponding
functions)
- Accordingly modify _python_variable_indexing.cpp_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105923
Approved by: https://github.com/ezyang
2023-08-01 17:34:03 +00:00
cf012c43f4 Do not call decref if python runtime is already dead (#106334)
Same treatment as many other objects such as https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/python_hook.cpp#L99
This one can outlive the python runtime due to structs like: 2f35715f0d/torch/csrc/autograd/python_cpp_function.cpp (L232)

With the pybind patch and this one, the 3.12 build at https://github.com/pytorch/pytorch/pull/106083 stops segfaulting and runs test_autograd.py just fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106334
Approved by: https://github.com/ezyang
2023-08-01 17:22:42 +00:00
6d86a255e6 Revert "Add scalar conversion using avx instructions for half (#102140)"
This reverts commit 888bdddb1ed0f3bfbbfc964f3b6080b0ea431dfd.

Reverted https://github.com/pytorch/pytorch/pull/102140 on behalf of https://github.com/jeanschmidt due to This is breaking internal tests @cpuhrsch can share more context and help with a follow up ([comment](https://github.com/pytorch/pytorch/pull/102140#issuecomment-1660686075))
2023-08-01 16:35:23 +00:00
aaaafa1bcf [Export] remove unused flags in export (#106336)
Remove unused flags from export_dynamo_config:
Among them:
- capture_scalar_outputs: bool = True. **True by default** in dynamo.export:
- capture_dynamic_output_shape_ops: bool = True.  **True by default** in dynamo.export
- specialize_int: bool = True: **True by default** in dynamo.export.
- guard_nn_modules: bool = True: this flag is **not being used** as we never look at nn module guards and assume modules are frozen. See the [doc](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py#L77) of this flag.
- dynamic_shapes: bool = True: **deprecated by dynamo**:  [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py#L55 )

test plan:
Added new test for allow_rnn to test its effectiveness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106336
Approved by: https://github.com/tugsbayasgalan
2023-08-01 16:10:09 +00:00
e35950cd0d [caffe2] Move CAFFE SDT macros' definitions to c10/util/ (#105856)
Summary: Moving static tracepoint macros header to a location where it can be easily used by various PyTorch components (`c10/utill`).

Test Plan:
Same as for D47159249:

Tested the following macros on test scripts with libbpf USDTs:
* `CAFFE_SDT`
* `CAFFE_DISABLE_SDT`
* `CAFFE_SDT_WITH_SEMAPHORE`

Reviewed By: EDG-GH

Differential Revision: D47636258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105856
Approved by: https://github.com/EDG-GH, https://github.com/chaekit
2023-08-01 14:42:55 +00:00
4dc063821d [inductor] Fix lowerings that create unexpected aliases (#105173)
This may give the wrong result in some cases, e.g.
```python
@torch.compile()
def fn(x):
    tmp = x.ceil()
    x.add_(10)
    return tmp

a = torch.zeros((), dtype=torch.int64)
fn(a)  # tensor(10)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105173
Approved by: https://github.com/lezcano
2023-08-01 14:09:13 +00:00
e514386315 Normalize builtin types to dtypes. (#106074)
Fix: #105052
Follow-up: #105588

This PR normalizes builtin Python types (e.g. `int` and `float`) into PyTorch data types
when these are passed as argument, instead of used as functions.

In summary, we:

- Implement `BuiltinVariable.as_proxy`, mapping Python types into PyTorch data types
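
As a small hedged illustration of the kind of code this enables under torch.compile (the exact torch dtype a builtin maps to follows PyTorch's usual rules):

```python
import torch

@torch.compile()
def fn(x):
    # `float` is passed as a dtype-like argument here, not called as a
    # function; dynamo now normalizes it to the corresponding torch dtype.
    return x.to(float)

out = fn(torch.ones(3))
```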

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106074
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-08-01 13:32:19 +00:00
87d2536971 Add nn.CircularPad{*}d for consistency + fix no_batch_dim support (#106148)
Fixes #105749 https://github.com/pytorch/pytorch/issues/95320

(tldr is that the input should always be `[N, C, H, (W, D)]` where only the H, W and D dimensions get circular padding, so in the 2D case where the user wants both dimensions padded, they should `.unsqueeze(0)` first (as is the case for `Reflection/ReplicationPad`), but we didn't document this for circular padding. [This seems to be the old docstring](277b05014a/torch/nn/functional.py (L4689)) that was somehow lost.)
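
To make the shape convention concrete, a small hedged example with the newly added module (mirroring how Reflection/ReplicationPad are used):

```python
import torch
import torch.nn as nn

pad = nn.CircularPad2d(1)
x = torch.arange(9.).reshape(3, 3)   # a plain (H, W) tensor

# Only the trailing spatial dims get circular padding, so add a leading
# channel dim first; (C, H, W) inputs work thanks to no_batch_dim support.
y = pad(x.unsqueeze(0))
print(y.shape)  # torch.Size([1, 5, 5])
```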

Fixes no_batch_dim support https://github.com/pytorch/pytorch/issues/104860

- Adds missing documentation for circular padding
- Adds missing CircularPad modules
- Migrates legacy test_nn tests from circular padding to ModuleInfo
- Adds no_batch_dim support + sample inputs that test this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106148
Approved by: https://github.com/albanD
ghstack dependencies: #106325, #106147
2023-08-01 12:49:58 +00:00
0547b6279d Add error checking for padding modules (#106147)
Fixes https://github.com/pytorch/pytorch/issues/105627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106147
Approved by: https://github.com/albanD
ghstack dependencies: #106325
2023-08-01 12:49:58 +00:00
c9be60cd0e Add error inputs to ModuleInfo (mirroring OpInfo) (#106325)
Add infra for error inputs to ModuleInfos, migrate first few error inputs tests from test_nn.py (more to come!)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106325
Approved by: https://github.com/albanD
2023-08-01 12:49:56 +00:00
16df54239f remove tensorpipe code which forgot to delete (#106301)
When analyzing the rpc code, I found redundant tensorpipe code, which was submitted synchronously with #40846, but was not deleted synchronously when the code was subsequently deleted.
The tensorpipe namespace is not useful in either utils.h or utils.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106301
Approved by: https://github.com/ezyang, https://github.com/fduwjj
2023-08-01 08:14:50 +00:00
f23d755e1f [pt2] add meta for ormqr (#106278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106278
Approved by: https://github.com/ezyang
2023-08-01 06:47:48 +00:00
780b90ba6c [opinfo] Fix logic in sample_inputs_linspace (#106353)
Previously in `sample_inputs_linspace` the logic

```
dtype == torch.uint8 and end < 0 or start < 0
```

is equivalent to

```
(dtype == torch.uint8 and end < 0) or start < 0
```

which skipped all `start < 0` cases. I think this is unintended and the negative inputs should only be skipped when dtype is `uint8`.
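
For clarity, a sketch of the intended skip condition with the precedence made explicit (illustrative, not the literal diff):

```python
import torch

def should_skip(dtype, start, end):
    # Intended logic: only skip negative endpoints for the unsigned uint8 dtype.
    return dtype == torch.uint8 and (end < 0 or start < 0)

assert should_skip(torch.uint8, -1, 5)
assert not should_skip(torch.float32, -1, 5)  # previously skipped by mistake
```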

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106353
Approved by: https://github.com/BowenBao
2023-08-01 06:04:44 +00:00
59d0dea90f Only make a shallow copy when loading optimizer state_dict (#106082)
The thing we do still deep copy is the param_groups, which is much lighter weight. This should also save memory when loading from a checkpoint.

The deepcopy was introduced in ecfcf39f30, but module.py had only a shallow copy at that point so it did not actually bring parity.

Incorporates an XLA fix, which is why I'm updating the pin to ca5eab87a7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106082
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-08-01 05:33:31 +00:00
ceea08a986 [vision hash update] update the pinned vision hash (#106350)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106350
Approved by: https://github.com/pytorchbot
2023-08-01 03:31:36 +00:00
aa2cee44b7 [Pytorch][Vulkan] Reuse broadcast checks (#105960)
Summary:
Place broadcast checks into `Broadcast.h` and `Broadcast.cpp` for code re-use.

Rename `check_inputs` to `is_broadcastable`

https://pytorch.org/docs/stable/notes/broadcasting.html

Test Plan:
All tests
https://www.internalfb.com/phabricator/paste/view/P797165124
```
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 318 tests from VulkanAPITest (8693 ms total)

[----------] Global test environment tear-down
[==========] 318 tests from 1 test suite ran. (8693 ms total)
[  PASSED  ] 317 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```

Differential Revision: D47741937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105960
Approved by: https://github.com/SS-JIA
2023-08-01 02:48:28 +00:00
f456c504b9 Update kineto submodule to 465ff4cd7 (#106293)
Repeat #106154 but without ghstack, because ghstack isn't letting me import it.

Original commit message:
> Reland update to kineto after https://github.com/pytorch/pytorch/pull/105866 was reverted. This new update contains a patch to check CUPTI_API_VERSION instead of CUDA_VERSION to handle cases where CUPTI_API_VERSION is behind CUDA_VERSION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106293
Approved by: https://github.com/aaronenyeshi
2023-08-01 02:34:30 +00:00
f11090288c create benchmark example tensors with correct sizes (#106238)
We need to consider the node's offset when we create benchmark example
tensors with test_cat_addmm. Otherwise, we would fail when applying
torch.as_strided to the returned tensor value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106238
Approved by: https://github.com/jansel
2023-08-01 01:14:53 +00:00
03e85be9b0 [Inductor][FX passes] New group/batch fusion pattern searching algorithm + group mm fusion + preserve memory layout (#106279)
Summary:

Major changes:
* Implement a new group/batch fusion pattern searching algorithm: only fuse patterns that are in a certain depth difference (locally).
* Search FX graph in reverse order since most of ops have more inputs than outputs.
* Support fuse mm (linear backward)
* Preserve memory layout for fbgemm.gmm.

We tested in Ads models and saw consistent gains.

Test Plan: Unit tests and integration test.

Differential Revision: D47581710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106279
Approved by: https://github.com/jansel, https://github.com/Skylion007
2023-08-01 01:10:44 +00:00
a2e8ac1d34 Update Anaconda download link (#106335)
I found out while reading the README that this link is broken and thought fixing it would be a great first-time contribution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106335
Approved by: https://github.com/msaroufim
2023-08-01 00:57:59 +00:00
e075f91dcc Extend workflow sync to more workflow (#106331)
To `slow.yml` and `mac-mps.yml`, based on the results of the following grep:
```
% grep "sync-tag: " .github/workflows/*.yml
.github/workflows/mac-mps.yml:      sync-tag: macos-12-py3-arm64-build
.github/workflows/mac-mps.yml:      sync-tag: macos-12-py3-arm64-mps-test
.github/workflows/pull.yml:      sync-tag: asan-build
.github/workflows/pull.yml:      sync-tag: asan-test
.github/workflows/pull.yml:      sync-tag: win-cpu-build
.github/workflows/pull.yml:      sync-tag: rocm-build
.github/workflows/slow.yml:      sync-tag: asan-build
.github/workflows/slow.yml:      sync-tag: asan-test
.github/workflows/trunk.yml:      sync-tag: macos-12-py3-arm64-build
.github/workflows/trunk.yml:      sync-tag: macos-12-py3-arm64-mps-test
.github/workflows/trunk.yml:      sync-tag: win-cpu-build
.github/workflows/trunk.yml:      sync-tag: win-cuda-build
.github/workflows/trunk.yml:      sync-tag: rocm-build
```

Allow synced workflows to diverge with regards to `test-matrix`, to allow for both `mac-mps` and slow part of ASAN tests.

Discovered while working on https://github.com/pytorch/pytorch/pull/105260 that slow sync-tag is not checked.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106331
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/seemethere
2023-08-01 00:47:28 +00:00
55f9359d36 fix sdpa math accuracy issue when scale is negative (#105202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105202
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/drisspg
2023-08-01 00:19:14 +00:00
788c825837 Higher order operator util for raising if inputs require grads (#106078)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 08bd685</samp>

Added a utility function `autograd_not_implemented_check` to `torch._higher_order_ops.utils` and used it in `out_dtype_autograd` to simplify and standardize the error handling for higher order operators that do not support autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106078
Approved by: https://github.com/zou3519
2023-08-01 00:13:13 +00:00
57d0bec306 aot_inductor_interface: surface previously eaten error messages (#105366)
Summary:
tldr:

* change glog -> cout for important logging inside aot_inductor.so
* bring a small amount of important python logging from debug to info

Test Plan: CI

Differential Revision: D47464665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105366
Approved by: https://github.com/desertfire
2023-08-01 00:06:53 +00:00
186352a625 [inductor] Make autotune_process.py pass mypy (#105791)
`TensorMeta.from_irnodes` handles either a single `IRNode` or a tuple or list of them. I tried to express this with overloading, but because this file is in MYPYNOFOLLOW, the `IRNode` subclasses become `Any`, which causes the overloads to be overlapping.

This changes the type of the argument to `benchmark_in_sub_process` to the more specific `TritonTemplateCaller`, since that one has the `bmreq` member and existing docstrings indicate that only the triton template benchmark is handled.

The `rand_strided` call caused a mypy error because the default value for device was a string. This is fixed by adding type hints to `rand_strided` in `torch/_dynamo/testing.py`. Likewise, the return value of `PyCodeCache.load_by_key_path` can be inferred from the type hint on `PyCodeCache.cache`.

Fixes one part of #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105791
Approved by: https://github.com/jansel, https://github.com/Skylion007
2023-07-31 23:58:38 +00:00
018ac76362 fix x.numpy() breaks in #106211 (#106327)
Fixes https://github.com/pytorch/pytorch/issues/106316. Need to promote [this](https://dev-discuss.pytorch.org/t/supporting-dynamo-in-python-3-11-null/1393) a little more I guess. I'm going to make a PR soon that will add `push_null` arg to `load_import_from` and other function call codegen methods that are missing the field, so that we can push null as early in the function call sequence as possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106327
Approved by: https://github.com/lezcano
2023-07-31 21:19:27 +00:00
2f35715f0d [onnx] Fix output shape mismatch issue of max_pool (#106270)
For onnx MaxPool with ceil_mode=1, sliding windows that start in the right padded region won't be ignored, which causes a different output shape from torch.
Therefore, we need to add a Pad op before and not set ceil_mode for the MaxPool op, like what is done in symbolic_opset9, when converting torch max_pool to onnx MaxPool.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106270
Approved by: https://github.com/thiagocrepaldi
2023-07-31 21:03:08 +00:00
2138aaa978 [FSDP] Validate buffer dtype in pre-forward hook for FSDP mixed precision tests (#106231)
Summary: https://github.com/pytorch/pytorch/issues/104740

Test Plan: buck2 test mode/dev-nosan caffe2/test/distributed/fsdp:fsdp_mixed_precision -- test_full_precision_in_eval_buffers

Reviewed By: awgu

Differential Revision: D47858769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106231
Approved by: https://github.com/awgu
2023-07-31 21:01:33 +00:00
23f47f746b [optim][rprop] Minimize intermediates=1 for foreach to save memory (#105193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105193
Approved by: https://github.com/albanD
2023-07-31 20:59:26 +00:00
4eeda6616c Correct URL Link for torchDynamo (#105903)
Correct some erroneous or 404 URLs in the TorchDynamo docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105903
Approved by: https://github.com/malfet
2023-07-31 20:50:09 +00:00
5379b5f927 [ROCm] use hipblas instead of rocblas (#105881)
- BatchLinearAlgebraLib.cpp is now split into one additional file
  - BatchLinearAlgebraLib.cpp uses only cusolver APIs
  - BatchLinearAlgebraLibBlas.cpp uses only cublas APIs
  - hipify operates at the file level and cannot mix cusolver and cublas APIs within the same file
- cmake changes to link against hipblas instead of rocblas
- hipify mappings changes to map cublas -> hipblas instead of rocblas

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105881
Approved by: https://github.com/albanD
2023-07-31 20:42:55 +00:00
c9c66819a1 Move more TCPStore state from BackgroundThread to TCPStoreMasterDaemon as it won't be used by the libuv backend. (#105674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105674
Approved by: https://github.com/H-Huang
ghstack dependencies: #105163, #105164, #105184, #105672
2023-07-31 20:10:16 +00:00
57a47ed905 [ONNX] Log instead of crash when 'tabulate' is not installed (#106228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106228
Approved by: https://github.com/justinchuby
2023-07-31 19:16:37 +00:00
196372415a Use nodejs-16 for docs builds (#106312)
As node-12 EOLed a long time ago and is not available for Ubuntu-22.04 (discovered while working on the bionic deprecation).
Remove the artificial constraint on the gcc-10 downgrade (and some in-place patching) for Jammy, as CUDA-11.8+ works perfectly fine with gcc-11.

### 🤖 Generated by Copilot at 6367120

> _`nodejs` version_
> _upgraded for security_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106312
Approved by: https://github.com/DanilBaibak, https://github.com/albanD, https://github.com/atalman
2023-07-31 19:11:47 +00:00
1b757fb60b Enable distributed cpp test for rocm (#106132)
There are some cpp tests that did not run on the ROCm platform. This is part of the effort to enable them. Specifically, this change enables the distributed cpp tests.

Test plan:
Tested by using rocm/pytorch-nightly:latest image, and verified the distributed cpp tests PASSED locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106132
Approved by: https://github.com/huydhn
2023-07-31 19:05:58 +00:00
4a549dd57a AOTAutograd: correctness fix when tracing custom autograd functions that alias inputs (#102992)
Fixes https://github.com/pytorch/pytorch/issues/102970. See the comment [here](https://github.com/pytorch/pytorch/issues/102970#issuecomment-1577223773) for details.

We normally treat "outputs that alias inputs" specially in AOTAutograd, by replaying the views at runtime, instead of baking them into the graph. For views that are part of custom autograd functions though, we can't do that view-replay, since it will clobber the backwards function that the user specified in their custom autograd.Function.

Right now in this PR, I distinguish between "aliased inputs that are normal views" vs. "aliased inputs that are views that came from an autograd.Function call" by checking the output's `.grad_fn` field, to see if it inherits from our custom CBackward function class. Then I added a new `OutputType` enum value that we effectively treat the "normal" way (the same way that we treat ordinary, non-aliased outputs). The new enum value is mostly for debugging - so we can print it and know that our graph had custom autograd.Function aliased outputs in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102992
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-07-31 19:02:12 +00:00
a0b6c0d1da Remove @penguinwu from distributed codeowners (#106322)
@penguinwu said she found a different way to get notified, so she can be removed from codeowners.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106322
Approved by: https://github.com/ezyang
2023-07-31 18:20:45 +00:00
0010a8f753 Deallocate constant when it is no longer needed in constant folding (#106216)
Differential Revision: [D47881214](https://our.internmc.facebook.com/intern/diff/D47881214)

tested locally with :
```
@torch.compile()
def foo():
    size_gb = 1
    size_bytes = size_gb * 1024 * 1024 * 1024 * 20

    # Allocate the tensor on the GPU
    tensor = torch.empty(size_bytes // 4, device='cuda')  # Divide by 4 to allocate float32 elements

    for _ in range(10):
        tensor = tensor + 1

    return tensor

foo()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106216
Approved by: https://github.com/Skylion007
2023-07-31 18:20:22 +00:00
cdc9127733 [ONNX] Perform Shape inference on added "Cast" node (#106093)
This commit fixes a bug where some "If" nodes blocked shape inference during the onnx graph building.

In fixup_onnx_controlflow, a "Cast" node is added to conditions in "If" and "Loop" nodes if the condition type is not bool.

This commit performs shape inference on this new "Cast" node which allows its output to be marked as "reliable" in ConstantValueMap during further shape inference. This would have eventually happened when shape inference is performed on the entire graph, but the inferred shapes are also useful to have during onnx graph building, since it allows some ops (like Squeeze) to export into simpler subgraphs.

Also adds a test for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106093
Approved by: https://github.com/thiagocrepaldi
2023-07-31 18:20:19 +00:00
66c537429e [export] Move attrs to properties and add BC decorator (#106170)
@SherlockNoMad mentioned that it's not bc safe to directly access these attributes, so I moved them to @property fields, and added a `@compatibility` decorator. For now I just set it to True for graph_module/graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106170
Approved by: https://github.com/SherlockNoMad
2023-07-31 18:13:07 +00:00
0dc251323d torch::nn::functional::batch_norm(): add a shape check of input tensor (#105930)
Fixes #105458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105930
Approved by: https://github.com/albanD
2023-07-31 18:03:12 +00:00
1cebfef8a4 sm90 efficient attention test fixes (#105978)
Fixes the following two test cases involving efficient attention on sm90:

Explanations:

functorch/test_ops.py: test_vjp_nn_functional_scaled_dot_product_attention_cuda_float32
* originally the test had xfail for all sm
* in https://github.com/pytorch/pytorch/issues/102029, we found that it was unexpectedly passing on sm90
* I made https://github.com/pytorch/pytorch/pull/102131 to update the test to let it pass
* @drisspg seems to have made changes to the behavior such that the original xfail was getting triggered (https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560071148)
* the CI began complaining about the failure again: https://github.com/pytorch/pytorch/issues/102663
* I'm now reverting https://github.com/pytorch/pytorch/pull/102131 to bring back the original xfail now that the behavior has been fixed by @drisspg to trigger the xfail in sm90 similar to all other sm

test_transformers.py: test_mem_efficient_fail_sm90_cuda
* the test as it's currently written seems to expect the sdp dispatcher to fail for mem efficient attention on sm90; however, testing this on H100, it actually succeeds, so I'm disabling the test for now as the current expected result may be outdated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105978
Approved by: https://github.com/eqy, https://github.com/kshitij12345, https://github.com/zou3519
2023-07-31 17:59:40 +00:00
d8e5f2aa6d Reland "Make adding buffers more like adding parameters (#104069)" (#106224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106224
Approved by: https://github.com/atalman, https://github.com/albanD
2023-07-31 17:18:56 +00:00
50e3f9cbbb [ROCm] HIP stream priority fix post #101956 (#106157)
PR #101956 introduced additional stream priorities for cuda streams. HIP streams have slightly different semantics.
- HIP: 1=low, 0=default, -1=high
- CUDA: 0=default, -1=high, -2=higher, etc.

This PR forces HIP stream priority to just 0 and -1 to match the pytorch semantics.

This fixes a broken unit test.

```
python3 test_cuda_multigpu.py TestCudaMultiGPU.test_streams_priority -v

Test results will be stored in test-reports/python-unittest/test_cuda_multigpu

Running tests...
----------------------------------------------------------------------
  test_streams_priority (__main__.TestCudaMultiGPU) ... ERROR (0.200s)

======================================================================
ERROR [0.200s]: test_streams_priority (__main__.TestCudaMultiGPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2354, in wrapper
    method(*args, **kwargs)
  File "test_cuda_multigpu.py", line 656, in test_streams_priority
    low, high = torch.cuda.Stream.priority_range()
RuntimeError: least_priority == 0 INTERNAL ASSERT FAILED at "/var/lib/jenkins/pytorch-upstream/c10/hip/HIPStream.h":184, please report a bug to PyTorch. Unexpected HIP stream priority range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106157
Approved by: https://github.com/malfet
2023-07-31 16:57:20 +00:00
2b427ae3a7 Revert "Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)"
This reverts commit e773f28ee307e2a246a4b765f3a51117661b45ba.

Reverted https://github.com/pytorch/pytorch/pull/106043 on behalf of https://github.com/DanilBaibak due to Break slow tests ([comment](https://github.com/pytorch/pytorch/pull/106043#issuecomment-1658642734))
2023-07-31 15:50:36 +00:00
c5b9dc1f40 Optimize stack frame inspection in torch._custom_op.impl:CustomOp._register_impl (#105940)
Summary: This is surprisingly expensive when the stack is deep. We can instead just process the specific stack frame that's relevant -- it's much faster.

Test Plan:
```
import inspect
import sys
import time

def make_deep_stack(fn, n: int = 10):
    if n > 0:
        return make_deep_stack(fn, n - 1)

    return fn()

def full_stack():
    return inspect.stack()[1][3]

def via_current_frame():
    return inspect.getframeinfo(sys._getframe(1))[2]

start = time.perf_counter()
for _ in range(1000):
    make_deep_stack(full_stack)
print(f"full_stack took {time.perf_counter() - start}s")

start = time.perf_counter()
for _ in range(1000):
    make_deep_stack(via_current_frame)
print(f"via_current_frame took {time.perf_counter() - start}s")

> full_stack took 31.788201928138733s
> via_current_frame took 2.33455612603575s
```

Differential Revision: D47674015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105940
Approved by: https://github.com/zou3519
2023-07-31 15:49:33 +00:00
c54afea6ee faketensor: prevent deepcopy from cloning FakeTensorMode (#104476)
fixes https://github.com/pytorch/pytorch/issues/104465

A more detailed repro is here, which uses `nn.TransformerLayer` (this breaks with AOTAutograd today, due to the presence of multiple FakeTensorMode objects lying around) https://github.com/pytorch/pytorch/issues/103505#issuecomment-1614817132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104476
Approved by: https://github.com/ezyang
2023-07-31 15:49:08 +00:00
d3b508d068 Fix typo which suppresses user exception reporting (#106289)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106289
Approved by: https://github.com/albanD
2023-07-31 14:35:33 +00:00
0af3203c72 fix torchrun script for custom device (#105443)
As the title says, add torchrun support for custom devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105443
Approved by: https://github.com/kumpera
2023-07-31 05:46:23 +00:00
0a0abd0ecf Revert "Update kineto submodule to 465ff4cd7 (#106154)"
This reverts commit efeb46e507eb7827e0fb5751d9556f31bafa8300.

Reverted https://github.com/pytorch/pytorch/pull/106154 on behalf of https://github.com/PaliC due to breaks diff train importing ([comment](https://github.com/pytorch/pytorch/pull/106154#issuecomment-1657665353))
2023-07-31 05:43:18 +00:00
3c70d4bda7 UFMT torch/jit/frontend.py, manually fix mypy suppression (#106268)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106268
Approved by: https://github.com/Skylion007
ghstack dependencies: #106266, #106267
2023-07-30 19:10:59 +00:00
af88e6d09d UFMT torch/jit/_script.py, manually move mypy suppressions (#106267)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106267
Approved by: https://github.com/Skylion007
ghstack dependencies: #106266
2023-07-30 19:10:59 +00:00
b581e03850 Apply UFMT to torch/distributions/distribution.py, manually resolve fstrings (#106266)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106266
Approved by: https://github.com/Skylion007
2023-07-30 19:10:57 +00:00
eab3b2637a only collect fx node for user_visible_outputs when doing output stride conversion (#106194)
For yolo3, there is a subgraph whose output is an int value, and an `AttributeError: 'int' object has no attribute 'name'` is raised when collecting user_visible_outputs to do output stride conversion. This PR adds a check that an output is an fx node before it is added to user_visible_outputs.
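A hedged sketch of the guard being described (names are illustrative, not the actual inductor code):

```python
import torch.fx

# Only record outputs that are fx nodes; plain ints (as in yolo3's subgraph)
# have no .name and must not participate in output stride conversion.
def collect_user_visible_outputs(graph_outputs):
    return {out.name for out in graph_outputs if isinstance(out, torch.fx.Node)}
```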

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106194
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/shunting314
2023-07-30 13:48:22 +00:00
888bdddb1e Add scalar conversion using avx instructions for half (#102140)
### Motivation

Scalar conversion between Half and Float on CPU is more time consuming than BFloat16 <-> Float. There is no direct data type conversion instruction for a single Half value on CPU, so we add scalar conversion with AVX instructions for Half to speed it up.

### Testing
Tested maxpool and compared with the results of #98819.
Single socket (28 cores):

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 5.07165 | 5.418 | 0.5798 | 0.5123 | 1.373694951 | 3.430786
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 1.37455 | 1.2505 | 8.8336 | 9.7684 | 1.373635008 | 4.132924
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 28.72 | 30.7069 | 3.813 | 3.75 | 1.31977124 | 2.783006
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 4.5783 | 4.703 | 4.703 | 5.1 | 1.028980189 | 3.1293
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 13.896 | 14.8138 | 1.6635 | 1.6274 | 1.298704663 | 2.982699
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 2.11291 | 2.1158 | 2.26778 | 2.272 | 0.951105348 | 3.179012
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 0.4204 | 0.3843 | 0.0649 | 0.0633 | 2.102711703 | 1.779492
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 0.1134 | 0.11 | 0.1476 | 0.143 | 2.23042328 | 3.612398

Single core:

shape | fp16 forward / ms | bf16 forward / ms | fp16 backward / ms | bf16 backward / ms | speedup ratio (fp16 forward) | speedup ratio (fp16 backward)
-- | -- | -- | -- | -- | -- | --
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: contig | 124.413 | 114.44 | 10.553 | 11.2486 | 1.31395433 | 3.923844
size: (1, 56, 264, 264), kernel: 3,   stride: 1, mem_format: CL | 28.99 | 28.0781 | 9.5092 | 10.9258 | 1.324296999 | 3.888377
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: contig | 640.8276 | 591.964 | 59.18776 | 60.854 | 1.334956391 | 3.704458
size: (32, 16, 200, 200), kernel: 3,   stride: 1, mem_format: CL | 88.57 | 90.214 | 54.358 | 59.205 | 1.031258214 | 3.75285
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: contig | 318.6197 | 285.155 | 28.4999 | 29.4387 | 1.315298144 | 3.759747
size: (32, 32, 100, 100), kernel: 3,   stride: 1, mem_format: CL | 31.3981 | 34.0544 | 25.6557 | 28.7811 | 1.068505738 | 3.841587
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: contig | 8.87882 | 8.207 | 0.386056 | 0.3939 | 1.567866 | 3.50387
size: (4, 19, 10, 16, 16), kernel: 3,   stride: 1, mem_format: CL3d | 2.4167 | 2.38295 | 0.3769 | 0.4066 | 1.39402491 | 3.30061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102140
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/cpuhrsch
2023-07-30 11:25:28 +00:00
21fd2bc32e Allow setting TORCH_LINALG_PREFER_CUSOLVER=1 to prefer cusolver as linear algebra library globally (#106226)
Setting TORCH_LINALG_PREFER_CUSOLVER=1 allows users to prefer cuSOLVER as the linear algebra backend, e.g. in container use cases. The switch is not enabled by default, so it won't change any existing default behavior.
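A quick sketch of how this could be used (the env-var name comes from this PR; `torch.backends.cuda.preferred_linalg_library` is the pre-existing runtime knob and is shown only for comparison):

```python
import os
# Launch-time switch introduced by this PR, e.g. set in a container entrypoint.
os.environ["TORCH_LINALG_PREFER_CUSOLVER"] = "1"

import torch
# Existing programmatic alternative for choosing the backend at runtime.
torch.backends.cuda.preferred_linalg_library("cusolver")
```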
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106226
Approved by: https://github.com/lezcano
2023-07-30 09:38:46 +00:00
858ca65c8a [vision hash update] update the pinned vision hash (#106262)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106262
Approved by: https://github.com/pytorchbot
2023-07-30 03:44:11 +00:00
8549abc347 Grab bag of DTensor enablement stuff (Enable whole graph capture for DTensor) (#105787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105787
Approved by: https://github.com/ezyang
2023-07-30 00:17:45 +00:00
3bf922a6ce Apply UFMT to low traffic torch modules (#106249)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249
Approved by: https://github.com/Skylion007
2023-07-29 23:37:30 +00:00
a4ebc61f15 Ignore UFMT revs in blame (#106246)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106246
Approved by: https://github.com/Skylion007
2023-07-29 23:37:21 +00:00
d2aa3f5fa9 [GHF][mergebot] record ghstack dependencies in the commit message (#105251)
Currently all information about the dependencies of ghstack PRs (e.g. #105010) is stripped away:
c984885809/.github/scripts/trymerge.py (L1077-L1078)

This PR adds this information back in a more compact form. All dependencies (PR numbers) of each PR in ghstack are recorded.

The resulting commit message will look like this (the last line is new):

> Mock title (#123)
>
> Mock body text
> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123
> Approved by: https://github.com/Approver1, https://github.com/Approver2
> ghstack dependencies: #1, #2

---

### Testing

Unit tests.

---

### Note Re: `# type: ignore[assignment]` in unit tests.

I did my due diligence to find alternatives. Unfortunately mypy [doesn't](https://github.com/python/mypy/issues/6713) support this [way of patching methods](https://docs.python.org/3/library/unittest.mock-examples.html#mock-patching-methods), and the alternatives are either extremely verbose or don't work for this case. I decided it's not worth the effort (since the problem is limited only to the unit test).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105251
Approved by: https://github.com/huydhn
2023-07-29 20:32:10 +00:00
0ee3b84021 [pt2] add meta for cholesky_inverse (#106120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106120
Approved by: https://github.com/ezyang
2023-07-29 17:16:20 +00:00
80755884be [pt2] add meta for cholesky (#106115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106115
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-07-29 17:16:20 +00:00
329a9a90c0 fix some typos (#106253)
Fixes typos

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106253
Approved by: https://github.com/awgu
2023-07-29 16:11:52 +00:00
3ecd05d9f3 Fix FakeTensor issues with copy_ between devices (#106172)
Used to fail with:
```
RuntimeError: Unhandled FakeTensor Device Propagation for aten.copy_.default, found two different devices cpu, cuda:0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106172
Approved by: https://github.com/eellison
2023-07-29 15:55:32 +00:00
7047d132fd add context support for custom device (#105056)
As the title says, add context support for custom devices and a test case.
In the future, we may want to refactor these hooks for different devices to unify the APIs; would you agree with this idea? @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105056
Approved by: https://github.com/albanD
2023-07-29 12:56:03 +00:00
1da4115702 Make _dynamo.export return a NamedTuple (#106062)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106062
Approved by: https://github.com/voznesenskym
2023-07-29 06:17:33 +00:00
df50f91571 Support fx_pytree in dynamo (#105574)
This PR does two things:
1. Make dynamo trace through fx_pytree (on top of torch.utils._pytree) so that generated graph modules can be retraced.
2. Fix a bug where unflatten was not returning a dynamo VariableTracker.

Differential Revision: [D47734623](https://our.internmc.facebook.com/intern/diff/D47734623)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105574
Approved by: https://github.com/yanboliang, https://github.com/ydwu4
2023-07-29 05:08:15 +00:00
f160a972aa [inductor][easy] "unhandled ValueRange op" - log at debug level (#106215)
Set this log line to debug level - it appears frequently for many ops that don't have implementations following https://github.com/pytorch/pytorch/pull/102611.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106215
Approved by: https://github.com/lezcano
2023-07-29 03:40:40 +00:00
e6ec0efaf8 Apply UFMT to all non test/torch files (#106205)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106205
Approved by: https://github.com/albanD
2023-07-29 02:56:24 +00:00
1163800d0f Upgraded triton pin to allow PTX ISA 8.2 (#106195)
Fixes #105522

@ezyang, could you please review?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106195
Approved by: https://github.com/ezyang
2023-07-29 02:55:15 +00:00
7b14a14e27 [Inductor] Optimize finding users of buffers for mutation (#105882)
Rather than visiting all nodes in the current environment to determine the users of a buffer, register the users of a buffer after node execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105882
Approved by: https://github.com/jansel
2023-07-29 02:04:03 +00:00
9b94dcf2ac [Compiled Autograd] Remove duplicate code from double-merge (#106233)
Something awfully weird is going on. Somehow the changes in #105808 got applied twice, which caused a lint error on main. Notice how the two blocks of code are both copies of #105808:

Line 273:
505dd319ef/test/inductor/test_compiled_autograd.py (L273-L369)

Line 372:
505dd319ef/test/inductor/test_compiled_autograd.py (L372-L479)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106233
Approved by: https://github.com/malfet
2023-07-29 01:57:03 +00:00
c11412b4a8 [DDP] Support optim in backward after DDP init (#105995)
This allows in backward optimizers to be configured after DDP init, in
addition to before as was previously supported.

Differential Revision: [D47783347](https://our.internmc.facebook.com/intern/diff/D47783347/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105995
Approved by: https://github.com/fegin
2023-07-29 01:36:25 +00:00
5d4e170d58 [Optim in backward] API to retrieve in-backward optimizers (#105991)
API to retrieve in backward optimizer for checkpointing purposes

Differential Revision: [D47782225](https://our.internmc.facebook.com/intern/diff/D47782225/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105991
Approved by: https://github.com/awgu
2023-07-29 01:36:25 +00:00
86237dc59b fix typo in serialization.md (#106191)
Found this minor typo while reviewing the TorchScript docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106191
Approved by: https://github.com/Skylion007
2023-07-29 00:01:59 +00:00
bd669d52d2 Print env var name instead of flag name for commandline repros (#106223)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106223
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-07-28 23:22:27 +00:00
52d4b1ae31 [BE]: Enable ruff rules PIE807 and PIE810 (#106218)
* Enables PIE807 + PIE810. PIE807 is "do not reimplement the list builtin using a lambda" and PIE810 is "always fuse startswith/endswith calls" (I applied the autofixes for this before we had ruff enabled).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106218
Approved by: https://github.com/albanD
2023-07-28 22:35:56 +00:00
f3d165bf61 [fake_tensor] Don't run fallback for fbgemm ops (#106210)
Summary:
This diff also adds more warning messages around allowing a namespace into the
fallback. We need to grandfather in an operator to actually merge this diff.

Test Plan: - existing tests

Differential Revision: D47873841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106210
Approved by: https://github.com/eellison
2023-07-28 22:31:54 +00:00
505dd319ef [caffe2] Don't evaluate message in CAFFE_ENFORCE_THAT unless the check fails (#106145)
D26829714 improved CAFFE_ENFORCE_THAT, but made us eagerly evaluate the message, which has costs.

Differential Revision: [D47809432](https://our.internmc.facebook.com/intern/diff/D47809432/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106145
Approved by: https://github.com/davidberard98
2023-07-28 21:54:07 +00:00
26d29d9639 [Compiled Autograd] Support CopySlices and CopyBackwards (#105809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105809
Approved by: https://github.com/albanD
2023-07-28 21:42:51 +00:00
099345f1e5 [Compiled Autograd] Handle aten.sym_size/aten.sym_stride (#105814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105814
Approved by: https://github.com/voznesenskym
2023-07-28 21:42:51 +00:00
6da8825f20 [Pytorch][Vulkan] sum.dim_IntList with keepdim (#106159)
Summary:
Add Vulkan support for [sum.dim_IntList](https://pytorch.org/docs/stable/generated/torch.sum.html) with `keepdim=True`

[sum.dim_IntList](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5466)

```
if keepdim is true, the output tensor is of the same size as input except in the dimension(s) dim, where it is of size 1
otherwise, the dim is squeezed, result in the output tensor having 1 fewer dimension/s.
```
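For reference, a plain (non-Vulkan) illustration of those keepdim semantics:

```python
import torch

x = torch.randn(2, 3, 4)
print(torch.sum(x, dim=[1], keepdim=True).shape)   # torch.Size([2, 1, 4])
print(torch.sum(x, dim=[1], keepdim=False).shape)  # torch.Size([2, 4])
```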

Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.sum*"
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 1.4 sec
Downloaded 4/58 artifacts, 3.08 Mbytes, 50.0% cache miss (for updated rules)
Building: finished in 41.2 sec (100%) 536/536 jobs, 13/536 updated
  Total time: 42.8 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *.sum*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.sum_dim_2d
[       OK ] VulkanAPITest.sum_dim_2d (558 ms)
[ RUN      ] VulkanAPITest.sum_dim_3d
[       OK ] VulkanAPITest.sum_dim_3d (7 ms)
[ RUN      ] VulkanAPITest.sum_dim_4d
[       OK ] VulkanAPITest.sum_dim_4d (14 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_2d
[       OK ] VulkanAPITest.sum_dim_keepdim_2d (4 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_3d
[       OK ] VulkanAPITest.sum_dim_keepdim_3d (7 ms)
[ RUN      ] VulkanAPITest.sum_dim_keepdim_4d
[       OK ] VulkanAPITest.sum_dim_keepdim_4d (18 ms)
[----------] 6 tests from VulkanAPITest (612 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (612 ms total)
[  PASSED  ] 6 tests.
```

Reviewed By: SS-JIA

Differential Revision: D47652931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106159
Approved by: https://github.com/SS-JIA
2023-07-28 21:36:05 +00:00
2ec7cd2db2 [CheckpointWrapper] Test for kwarg propagation, remove checkpoint_fn_arg support (#102679)
Closes https://github.com/pytorch/pytorch/issues/100576

Differential Revision: [D46342398](https://our.internmc.facebook.com/intern/diff/D46342398/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102679
Approved by: https://github.com/awgu
2023-07-28 21:18:35 +00:00
4d3ea5df65 Restructure torch.compile docs (#105376)
Current torch.compile docs have become a bit of a mess, with the docs expanded in the left nav. This PR moves them under the torch.compiler menu item in the left nav. A bunch of rewrites were made in collaboration with @msaroufim to address formatting issues; the latest updates that moved some of the APIs to the public torch.compiler namespace were addressed as well. The documentation is broken down into three categories that address three main audiences: PyTorch users, PyTorch developers, and PyTorch backend vendors. While the user-facing documentation was significantly rewritten, dev docs and vendor docs were left mostly untouched. This can be addressed in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105376
Approved by: https://github.com/msaroufim
2023-07-28 20:58:57 +00:00
0cf918947d [inductor] Support using the 'stream' param in AOT mode (#105589)
Summary:
When in AOT mode, make use of the existing stream param:
- Pass through and use the stream param in the launchKernel helper function.
- In non-AOT mode, assign the stream param in the caller and pass to launchKernel
- Use a CUDAStreamGuard so all fallback ops execute on the stream
- CUDAStreamGuard subsumes CUDAGuard in AOT mode since it sets both stream and device

Test Plan:
- Ran cpp_wrapper tests: pytest test/inductor/test_cpp_wrapper.py
- Manually inspected cpp output from the alexnet benchmark:

  a) In AOT mode:
```
   static inline void launchKernel(
           CUfunction func,
           int gridX,
           int gridY,
           int gridZ,
           int numWraps,
           int sharedMemBytes,
           cudaStream_t stream) {
       AT_CUDA_DRIVER_CHECK_OVERRIDE(cuLaunchKernel(
           func, gridX, gridY, gridZ, 32*numWraps, 1, 1, sharedMemBytes, stream, args, nullptr));

   ...
   at::cuda::CUDAStreamGuard stream_guard(at::cuda::getStreamFromExternal(stream, 0));
   ...
   launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream);
   ...
```

   b) Regular cpp wrapper:
```
   ...
   at::cuda::CUDAGuard device_guard(0);
   cudaStream_t stream0 = at::cuda::getCurrentCUDAStream(0);
   ...
   launchKernel(triton_poi_fused_convolution_0, 1, 784, 1, 4, 4352, kernel_args_var_0, stream0);
   ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105589
Approved by: https://github.com/desertfire
2023-07-28 20:26:27 +00:00
035124774a Enable registering fallthroughs to (op, dk) from torch.library (#106086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106086
Approved by: https://github.com/zou3519, https://github.com/albanD
2023-07-28 19:37:59 +00:00
ad3af0aead Change phrasing on optim state hook docs (#106209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106209
Approved by: https://github.com/albanD
2023-07-28 18:59:21 +00:00
800287fb56 [FSDP] Optimize away intermediate div_ for HSDP (#106034)
### Background: Gradient Pre-Divide
Consider $N$ data parallel workers. Define $g_i$ to be the $i$ th worker's local unsharded gradient. Data parallel gradient reduction computes $\overline g = \frac{1}{N} \sum_{i \in [N]} g_i$.

$\sum_{i \in [N]} g_i$ increases the magnitude by a factor of $N$, which may overflow for fp16. However, if we pre-divide and compute $\sum_{i \in [N]} \frac{g_i}{N}$, then the $\frac{g_i}{N}$ may underflow. The current solution from Myle for FSDP is to pre-divide by $\sqrt{N}$ and post-divide by $\sqrt{N}$:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{i \in [N]} \frac{g_i}{\sqrt{N}}.$$

Now, consider HSDP with $N = S \cdot R$ data parallel workers, sharding over $S$ workers and replicating over $R$ workers. Define $g_{i,j}$ to be the $i \cdot S + j$ th worker's local unsharded gradient (so sharding indexes with $i$ and replication indexes with $j$). The existing implementation computes
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}} } \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{S}},$$
where the $\frac{1}{\sqrt{R}} \frac{1}{\sqrt{S}}$ involves two separate `aten::div_` kernels.

### Revisiting Pre-Divide for HSDP
A minor optimization that we can do is with this intermediate `div_`. There are two options:
1. Compute $\overline{g}$ in the same way as FSDP:
$$\overline{g} = \frac{1}{\sqrt{N}} \sum_{j \in [R]} \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{N}}.$$
2. Compute $\overline{g}$ still with an intermediate division for rescaling but coalescing the two `divs_` into one:
$$\overline{g} = \frac{1}{\sqrt{R}} \sum_{j \in [R]} \textcolor{red}{ \frac{1}{\sqrt{N}} } \sum_{i \in [S]} \frac{g_{i,j}}{\sqrt{S}}$$

This PR goes with the 1st approach, prioritizing performance, because (1) it matches the existing FSDP behavior and (2) it avoids a memory-bandwidth-bound `div_` kernel that blocks the all-reduce launch.
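A minimal numeric sketch of the $\sqrt{N}$ pre/post-divide rescaling (plain tensors, not FSDP code):

```python
import math
import torch

N = 8
grads = [torch.full((4,), 3.0) for _ in range(N)]

# Plain average: the intermediate sum grows by a factor of N (overflow risk in fp16).
avg = sum(grads) / N

# Pre-divide by sqrt(N), reduce, then post-divide by sqrt(N):
# intermediate magnitudes stay within a factor of sqrt(N) of the gradients.
avg_sqrt = sum(g / math.sqrt(N) for g in grads) / math.sqrt(N)

assert torch.allclose(avg, avg_sqrt)
```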

### Implementation Details
In order to accommodate this, we need to refactor the communication hook logic that baked the gradient pre/post-division into the default hook.
- We raise an error if registering a communication hook for HSDP since the current implementation would only apply the hook to the reduce-scatter, not the all-reduce, which may be unexpected.
- We change it so that `state._comm_hook is not None` iff a communication hook is registered. This makes the collectives and the pre/post-division in the default no-communication-hook path more visible in the code.

Differential Revision: [D47852459](https://our.internmc.facebook.com/intern/diff/D47852459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106034
Approved by: https://github.com/rohan-varma
2023-07-28 18:36:26 +00:00
c7b122b2b5 [FSDP] Add HSDP parity unit test (#106131)
With >= 4 GPUs:
```
python -m pytest test/distributed/fsdp/test_fsdp_hybrid_shard.py -k test_fsdp_hybrid_shard_parity
```

Differential Revision: [D47852458](https://our.internmc.facebook.com/intern/diff/D47852458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106131
Approved by: https://github.com/rohan-varma
2023-07-28 18:31:12 +00:00
3841be80de [FSDP] Improve test_fsdp_hybrid_shard_basic_setup (#106072)
Differential Revision: [D47852460](https://our.internmc.facebook.com/intern/diff/D47852460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106072
Approved by: https://github.com/rohan-varma
2023-07-28 18:31:12 +00:00
cyy
b3e24c53eb use performance-unnecessary-value-param in clang-tidy (#102615)
performance-unnecessary-value-param has been disabled in clang-tidy for a long time. However, this check is actually useful and able to some interesting performance problems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102615
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-07-28 17:37:03 +00:00
8f4d8b3773 More descriptive graph diagram names in svg (#106146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106146
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-07-28 17:34:09 +00:00
5237ed55e6 [export] allow register dataclass as pytree node (#106160)
In this PR, we allow users to register customized flatten/unflatten/serialization/deserialization for a dataclass. We provide a default implementation for flatten/unflatten. We could implement a decorator based on it when needed.

## Motivation:
HuggingFace and many internal models return dataclass output and torch.export wants to maintain the invariant that export result (i.e. exported_program) has the same calling convention and result as the original callable.

This is not supported in export yet: we cannot recover the original dataclass from the flattened output produced by the underlying graph module (produced by dynamo and processed further by aot_export). We need to have a place to store the metadata of the dataclass so that we can reconstruct it. To avoid adding hacky code in export and to allow principled extensibility, we think extending pytree may be a good option.

## Implementation:
@zou3519 mentioned https://github.com/pytorch/pytorch/pull/93214/files and [jax-2371](https://github.com/google/jax/issues/2371#issuecomment-805361566), which suggests that it's not a good idea to make dataclass a default pytree node but it could be good to provide a default implementation for dataclass. Since currently, this seems to be an export-only feature, we added this extension point in export.

We also add a "return_none_fields" flag to control whether None fields are returned after flattening; it is expected to be False in produce_matching of dynamo.export.

Also added some tests.
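A hedged sketch of what registering a dataclass as a pytree node looks like via the generic `torch.utils._pytree` hook (the export-specific helper added in this PR may have a different name and signature):

```python
from dataclasses import dataclass
import torch.utils._pytree as pytree

@dataclass
class ModelOutput:
    logits: object
    loss: object

pytree._register_pytree_node(
    ModelOutput,
    lambda o: ([o.logits, o.loss], None),         # flatten: (children, context)
    lambda children, ctx: ModelOutput(*children)  # unflatten
)

flat, spec = pytree.tree_flatten(ModelOutput(logits=1, loss=2))
print(flat)                               # [1, 2]
print(pytree.tree_unflatten(flat, spec))  # ModelOutput(logits=1, loss=2)
```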

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106160
Approved by: https://github.com/zhxchen17
2023-07-28 17:33:13 +00:00
37cfe944bb add support for mutated params (#106098)
Previously, this didn't work because of the warmup run. Now that we no longer run a warmup and then execution within one inductor invocation, this works. llama inference: 1.6x -> 4.4x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106098
Approved by: https://github.com/ezyang
2023-07-28 17:27:06 +00:00
db2239706e Fix TORCH_COMPILE_DEBUG incompatibility with aot inductor (#106169)
Record replay tries to record a module which is already available

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106169
Approved by: https://github.com/anijain2305, https://github.com/jansel
2023-07-28 17:17:58 +00:00
76a2ec49d7 [Dynamo] Ignore no-op tensor assignment (#106092)
Ignore no-op `self.attr = self.attr` on NN Modules when attr is a Tensor attribute.

This comes from a [llama pattern](https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/llama/model.py#L121-L122). Normally, when a setattr occurs on an NN module we turn it into an `UnspecializedNNModuleVariable`, which prevents static buffers and parameters. In a subsequent PR I will add support for cudagraph mutation of buffers/params, which together with this PR takes llama from 1.6x -> 4.4x in inference.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106092
Approved by: https://github.com/yanboliang
2023-07-28 17:16:19 +00:00
7c8efc9049 [PT][FSDP] Combine _utils.py into _common_utils.py [2/2] (#106181)
Summary:
https://github.com/pytorch/pytorch/issues/97813
This diffs moves `_no_dispatch_record_stream` and `_same_storage_as_data_ptr`

Test Plan: CI

Differential Revision: D47706114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106181
Approved by: https://github.com/awgu
2023-07-28 17:15:25 +00:00
e5bd63a7f3 run freezing in nightly (#106097)
We could switch to every other day or something but inference is cheaper than training anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106097
Approved by: https://github.com/desertfire
2023-07-28 17:12:01 +00:00
c2e948edca Fix lint in test/inductor/test_compiled_autograd.py
Regression introduced by https://github.com/pytorch/pytorch/pull/105808
2023-07-28 09:54:22 -07:00
57f23ca58b Bot message changes for -f and rebase (#106150)
* Encourage people to use -i instead of -f for mergebot
* Add additional info for when rebase fails due to lacking permissions

<details><summary>dryrun</summary>

````
csl@csl-mbp ~/zzzzzzzz/pytorch [csl/errormsgs] $
(forpytorch) python3 .github/scripts/tryrebase.py 106089 --branch viable/strict --dry-run
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
@pytorchbot started a rebase job onto [refs/remotes/origin/viable/strict](7c97c943fb). Check the current status [here](None)
+ git -C /Users/csl/zzzzzzzz/pytorch fetch origin pull/106089/head:pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rebase refs/remotes/origin/viable/strict pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify pull/106089/head
+ git -C /Users/csl/zzzzzzzz/pytorch rev-parse --verify refs/remotes/origin/viable/strict
+ git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces
stdout:
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403

stderr:

Rebase failed due to Command `git -C /Users/csl/zzzzzzzz/pytorch push --dry-run -f https://github.com/Lightning-Sandbox/pytorch.git pull/106089/head:fix/spaces` returned non-zero exit code 128
```
remote: Permission to Lightning-Sandbox/pytorch.git denied to clee2000.
fatal: unable to access 'https://github.com/Lightning-Sandbox/pytorch.git/': The requested URL returned error: 403
```
This is likely because the author did not allow edits from maintainers on the PR or because the repo has additional permissions settings that mergebot does not qualify.
````
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106150
Approved by: https://github.com/huydhn
2023-07-28 16:13:51 +00:00
f15b6ec6d6 [Compiled Autograd] Add eager autograd tests (#105808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105808
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-07-28 15:59:49 +00:00
2e02dfae9a [Compiled Autograd] Fix handling of undefined gradients in hooks (#105813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105813
Approved by: https://github.com/albanD
2023-07-28 15:59:35 +00:00
ac6d8fb16e [Compiled Autograd] Add eager autograd tests (#105808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105808
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-07-28 15:59:35 +00:00
23a1eda890 [Compiled Autograd] Inplace updates of gradients (#105713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105713
Approved by: https://github.com/albanD
2023-07-28 15:58:49 +00:00
7b73b1e8a7 Fixed test_get_classifications_pending_unstable (#106203)
Fixed `test_get_classifications_pending_unstable` test. [Broken test](https://github.com/pytorch/pytorch/actions/runs/5690543018/job/15424383198) on main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106203
Approved by: https://github.com/malfet
2023-07-28 14:15:17 +00:00
bb0b283e5a Do not force -Werror on Pooling.cpp
As new versions of compilers are likely to find new types of violations, as shown in https://github.com/pytorch/pytorch/issues/105728
2023-07-28 07:08:59 -07:00
cb6c3cbc91 inductor: enable weight prepack for LSTM (#103071)
- Enabled LSTM weight prepack in inductor.
- Added a mkldnn decomposition for lstm which won't change for different `seq_lens`. With the previous decomposition, for dynamic shapes use case where `seq_lens` changes, the graph will be different.
- Extended several inductor utility functions to support `List(Tensor`) as input. Previously those functions only supported `Tensor` input.

**Update 2023-07-26:**
- https://github.com/pytorch/pytorch/pull/103851 has moved CPU weight packing to be after AOTAutograd. Fixed the support in this PR to follow the same way (mainly in 3b207f7f1c (diff-6dffed1ade0ba3e887f9a4eafa3bfcec267ab2365b8adcb91bd391f49b3fd2e3)).
LSTM is decomposed in `aten.mkldnn_rnn_layer` by layer and by direction. The weight prepack is done at the `mkldnn_rnn_layer` level.
- Add a fix in rnn `__get_state__` function in case we need to recompile an `LSTM` module.
When compiling the module, the weights tensors which are the `named_parameters` of the module are converted to `functional_tensor` here:
76fb72e24a/torch/nn/utils/stateless.py (L125-L128)
The forward function of LSTM will be called:
76fb72e24a/torch/_functorch/aot_autograd.py (L3379-L3381)
In the forward function, the `_flat_weights` are updated to be the same as the weights, thus becoming `functional_tensor`:
76fb72e24a/torch/nn/modules/rnn.py (L775-L778)
The weights tensors are converted back to the original tensors (which are not `functional_tensor` anymore) before exiting the `_reparametrize_module` context here:
76fb72e24a/torch/nn/utils/stateless.py (L130-L142)
But since `_flat_weights` is not in the `named_parameters` of the module, it's still `functional_tensor` ([link of the parameters that will be converted to functional and reverted back](76fb72e24a/torch/_functorch/aot_autograd.py (L3695-L3698))).
At this moment, if we need to recompile the model, `deepcopy` will be called:
76fb72e24a/torch/_dynamo/utils.py (L915-L917)
And it will report `UnImplemented` since we have `functional_tensor` (`_flat_weights`) and will trigger graph break which is not what we expect:
76fb72e24a/torch/_subclasses/meta_utils.py (L514)
Added a fix in the `__get_state__` function to update the `_flat_weights` if the weights have ever changed, fixing this issue. The fix is covered in the `test_lstm_packed` UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103071
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-28 13:54:32 +00:00
dad65d09f2 Update custom op API (#105947)
As described in
https://docs.google.com/document/d/1aGWtgxV3HppuxQAdddyPrs74_aEntpkYt9MalnCKnhk/edit

This PR changes the CustomOp API to be private and adds new public
wrappers around it so that the user does not need to know about the
"CustomOp" object. We've effectively changed the "CustomOp" object to be
some metadata about the operator that the user does not directly
interact with.

The "updated custom op API" is in torch._custom_ops. Pending good customer
feedback, we will promote this module to torch.custom_ops.

NB: I cannot move around the older torch._custom_op APIs yet because
people are already using them.

Test Plan:
- I changed all of our tests to use the new `torch._custom_ops` module
instead of the old CustomOp API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105947
Approved by: https://github.com/soulitzer
2023-07-28 13:30:58 +00:00
6d553a42fe Move most custom op related tests to test_custom_ops.py (#106036)
This PR moves most custom op related tests from
test/test_python_dispatch.py to test/test_custom_ops.py. Motivation is
that I had a difficult time finding the custom op tests inside
test_python_dispatch.py.

This doesn't preserve blame, but it's OK - I'm the only person who has
really touched the moved tests so far :).

Test Plan:
- run tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106036
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-07-28 13:30:58 +00:00
db365e1fb5 Create test/test_custom_ops.py, move test_custom_op_testing to it (#106035)
I'm in the process of putting all the custom op tests into this file. I
got tired of trying to find the custom ops tests in
test_python_dispatch.py, which (1) is getting long and (2) should actually
be the torch_dispatch and python torch.library tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106035
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-07-28 13:30:58 +00:00
0b8fbfe9de automatic_dynamic_shapes is on by default (#106188)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106188
Approved by: https://github.com/albanD
2023-07-28 13:26:54 +00:00
2636751fb9 [C10d] Add skeleton of LibUV backend. (#105672)
This commit hooks up tcpstore creation and build flags.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105672
Approved by: https://github.com/fduwjj
2023-07-28 13:19:06 +00:00
dffa4e14b9 Add Optimizer state_dict hooks (#105953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105953
Approved by: https://github.com/albanD
2023-07-28 11:52:41 +00:00
4fe407ad73 Add details about ic, broken, flaky, and unstable checks to merge records (#106162)
At the moment, we only record the list of pending and failed checks on Rockset merge records. This is enough to compute the force-merge KPI(s), but isn't enough for more in-depth analysis of what happened at the time of the merge:

* If the number of `ok_failed_checks` is less than `ok_failed_checks_threshold`, the list of `failed_checks` would be empty (expectedly).  So Rockset would only record an empty list.
* We support retry in PR, so the classifications on Dr.CI could be different than what dev observed at the time of the merge if retry completed successfully

### Testing

`python .github/scripts/trymerge.py --comment-id 1654010315 106095 --dry-run` (need to comment out some of the code to actually write a test record to Rockset), then manually verify it with

```
SELECT
    *
FROM
    commons.merges
WHERE
    pr_num = 106095
```

to see that `ignore_current_checks`, `broken_trunk_checks`, `flaky_checks`, and `unstable_checks` shows up correctly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106162
Approved by: https://github.com/clee2000
2023-07-28 09:41:02 +00:00
6366ed6edd inductor: using dummy input to pack the linear weight for bfloat16 dynamic shape path (#106122)
For the dynamic-shape bfloat16 path, if we use a plain weight we can't hit the AMX path, so we use a dummy input (given a None value) to do the weight packing for better performance.

before:
```
onednn_verbose,exec,cpu,inner_product,x64:gemm:jit,forward_training,src_bf16::blocked:ab:f0 wei_bf16::blocked:ab:f0 bia_bf16::blocked:a:f0 dst_bf16::blocked:ab:f0,attr-scratchpad:user ,,mb64ic256oc256,9.4292
```
after:
```
onednn_verbose,exec,cpu,inner_product,brgemm:avx512_core_amx_bf16,forward_training,src_bf16::blocked:ab:f0 wei_bf16::blocked:AB16b32a2b:f0 bia_bf16::blocked:a:f0 dst_bf16::blocked:ab:f0,attr-scratchpad:user ,,mb64ic256oc256,0.35498

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106122
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-28 07:57:05 +00:00
359aa17125 [inductor] realize boundaries in bucketize() lowering (#106107)
ops.bucketize() implements a binary search: it takes values and offsets; offsets defines a set of buckets, and ops.bucketize() returns, for each value, the index of the bucket it lies in. The op is elementwise with regard to the values and outputs, but it needs access to the entire offsets tensor in global memory so that it can perform the binary search. So, we need to realize the boundaries into global memory before running the op. The scheduler won't try to fuse the two kernels together because the input to ops.bucketize() is marked as a StarDep.
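The user-facing analogue of the op, for intuition (the lowering operates on `ops.bucketize` IR, not this eager call):

```python
import torch

values = torch.tensor([0.1, 2.5, 7.0])
boundaries = torch.tensor([1.0, 3.0, 5.0])  # must be fully materialized for the binary search
print(torch.bucketize(values, boundaries))  # tensor([0, 1, 3])
```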

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106107
Approved by: https://github.com/jansel
2023-07-28 07:19:35 +00:00
32175d794a No need to wait for pending unstable jobs when merging (#106095)
No need to wait if the job classification is unstable, as it would be ignored anyway. This is useful so we don't need to wait for scarce resources like ROCm, which is also frequently in unstable mode (there is a ROCm queue at the moment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106095
Approved by: https://github.com/clee2000
2023-07-28 07:08:23 +00:00
3b5fb7c0d4 Support regex flaky rules in trymerge (#106103)
This goes together with https://github.com/pytorch/test-infra/pull/4423 to support regex flaky rules in `trymerge`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106103
Approved by: https://github.com/clee2000
2023-07-28 07:05:12 +00:00
eebfb921c6 [ONNX] Support complex in FX exporter (#100554)
Prior to this PR, complex dtypes would simply fail. This PR keeps the torch.fx.Graph with complex dtypes, while mapping them to float dtypes with the real representation in the TorchScript (ONNX) graph (see the small illustration after the list below).

The change happens in multiple files:

1. `placeholder`: Apply torch.view_as_real() before sending fake tensor to graph building.
2. `call_function`: Fill in TorchScriptTensor dtype and shape with real representation dtype and shape.
3. Registry: Add `is_complex`, and supports complex onnxfunction.
4. Dispatcher: Filter with/out complex onnxfunction before opschema matching, based on the dtype in torch args
5. Test cases: input/output view_as_real for result comparisons.
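A small illustration of the real-representation mapping used above:

```python
import torch

c = torch.randn(2, 3, dtype=torch.complex64)
r = torch.view_as_real(c)      # shape (2, 3, 2), dtype float32
print(r.shape, r.dtype)
assert torch.equal(torch.view_as_complex(r), c)
```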
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100554
Approved by: https://github.com/BowenBao
2023-07-28 07:03:07 +00:00
3e5a52cedd [memory snapshot] track context for segments (#106113)
We want to display the stack for the original cudaMalloc that created a segment.
Previously we could only report the last time the segment memory was used,
or the record of the segment_alloc could appear in the list of allocator actions.
This PR ensures that, regardless of whether we still have the segment_alloc action,
the context for a segment is still available. The visualizer is updated to
be able to incorporate this information.

This PR adds a new field to Block. However the previous stacked cleanup PR
 removed a field of the same size, making the change to Block size-neutral.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106113
Approved by: https://github.com/aaronenyeshi
2023-07-28 06:45:48 +00:00
45b564766d [memory snapshots] removed chained history (#106079)
For free blocks of memory in the allocator, we previously kept a linked list
of the stack frames of previous allocations that lived there. This was only
ever used in one flamegraph visualization and never proved useful at
understanding what was going on. When memory history tracing was added, it
became redundant, since we can see the history of the free space from recording
the previous actions anyway.

This patch removes this functionality and simplifies the snapshot format:
allocated blocks directly have a 'frames' attribute rather than burying stack frames in the history.
Previously the memory history tracked the real size of allocations before rounding.
Since history was added, 'requested_size' has been added directly to the block which records the same information,
so this patch also removes that redundancy.

None of this functionality has been part of a PyTorch release with BC guarantees, so it should be safe to alter
this part of the format.

This patch also updates our visualization tools to work with the simplified format. Visualization tools keep
support for the old format in `_legacy` functions so that during the transition old snapshot files can still be read.
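A schematic of what a single allocated block looks like in the simplified format (illustrative values and field names, not an actual dump):

```python
# Illustrative only: one allocated block in the simplified snapshot format.
block = {
    "state": "active_allocated",
    "size": 2097152,            # rounded allocation size
    "requested_size": 2000000,  # pre-rounding size, now stored directly on the block
    "frames": [                 # allocation stack, directly on the block (no chained history)
        {"filename": "model.py", "line": 42, "name": "forward"},
    ],
}
```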

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106079
Approved by: https://github.com/eellison
2023-07-28 06:45:48 +00:00
73a8544d8a [vision hash update] update the pinned vision hash (#106182)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106182
Approved by: https://github.com/pytorchbot
2023-07-28 06:16:51 +00:00
32844be3cf [JIT] Fix getting type for subscript assignments. (#106041)
### Description

Hi! We've been fuzzing `pytorch` with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz) and found an out-of-bounds access error in the `torch::jit` module.

pytorch version: 18bcf62bbcf7ffd47e3bcf2596f72aa07a07d65f

The error occurs in `import_source.cpp:560` when we get the type from `assign.rhs()`. Both `assign.rhs()` and `assign.type()` have `Maybe` type, so either of them may be absent. According to the [grammar](22f93852a2/torch/csrc/jit/frontend/tree_views.h), we can have an `Assign` statement whose `lhs` is a `Subscript`, whose `rhs` is empty (a `Maybe` with no subtrees), and whose `type` is present. But in `import_source.cpp:560` we try to get the `rhs` expression from the assignment without checking whether it is present.

This is an example from the testing input (see the how-to-reproduce section):
```
class Module(Module):
  __parameters__ = ["0", ]
  __buffers__ = []
  __annotations__ = []
  __annotations__["0"] : Tensor
```

When we parse the last statement of the class definition, we set the type of `lhs` to `Subscript`, because the lookahead is `[`
76fb72e24a/torch/csrc/jit/frontend/parser.cpp (L205-L207)

Then in `parseAssignment` we get `maybeOp` and `type` depending on the next symbol (if it is `:`, we get only the type)
76fb72e24a/torch/csrc/jit/frontend/parser.cpp (L437-L447)

So after that, in `import_source.cpp:560`, while parsing attributes, one of which is an assignment whose `lhs` is a `Subscript`, we try to get the type from the `rhs` expression and the out-of-bounds access occurs.

To fix the error, we need to check whether `rhs` or `type` is present and get the type from the corresponding expression.

### How to reproduce

Build docker container from [here](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch):
```bash
$ sudo docker build -t oss-sydr-fuzz-pytorch .
```

Run docker container:
```bash
$ sudo docker run --rm --privileged -v `pwd`:/fuzz -it oss-sydr-fuzz-pytorch /bin/bash
```

Run the `load_fuzz` target on the [input.txt](https://github.com/pytorch/pytorch/files/12173962/input.txt)
```bash
/load_fuzz input.txt
```

You will see the following output:
```
AddressSanitizer:DEADLYSIGNAL
=================================================================
==157==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x00000c163764 bp 0x7ffee71d0070 sp 0x7ffee71d0050 T0)
==157==The signal is caused by a READ memory access.
==157==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0xc163764 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_() /pytorch/c10/util/intrusive_ptr.h:265:54
    #1 0xc1697fd in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::intrusive_ptr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/c10/util/intrusive_ptr.h:354:5
    #2 0xc1697fd in torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/tree_views.h:270:49
    #3 0xc1f02cb in torch::jit::Maybe<torch::jit::Expr>::get() const /pytorch/torch/csrc/jit/frontend/tree_views.h:212:12
    #4 0xd194369 in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) /pytorch/torch/csrc/jit/serialization/import_source.cpp:560:70
    #5 0xd18c701 in torch::jit::SourceImporterImpl::importNamedType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::ClassDef const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:288:5
    #6 0xd18a84c in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:140:5
    #7 0xd1913a8 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
    #8 0xc2e422f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
    #9 0xc2e4697 in torch::jit::ScriptTypeParser::parseType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312:10
    #10 0xd1a37d4 in torch::jit::SourceImporter::loadType(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import_source.cpp:786:27
    #11 0xd121c47 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import.cpp:146:33
    #12 0xd121c47 in c10::StrongTypePtr std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
    #13 0xd121ad0 in std::enable_if<is_invocable_r_v<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>, c10::StrongTypePtr>::type std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
    #14 0xd121926 in std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
    #15 0xd17ec49 in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
    #16 0xd26b802 in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:844:9
    #17 0xd2615fb in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:520:7
    #18 0xd25f917 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:253:27
    #19 0xd25f5b2 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:206:3
    #20 0xd186403 in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
    #21 0xd12152d in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
    #22 0xd117bae in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
    #23 0xd114074 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:389:25
    #24 0xd113a27 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:325:10
    #25 0xd11bb64 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:485:10
    #26 0x610c5c in LLVMFuzzerTestOneInput /load.cc:42:14
    #27 0x537701 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #28 0x52160c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #29 0x52735b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #30 0x550912 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #31 0x7f06e8323082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #32 0x51bf2d in _start (/load_fuzz+0x51bf2d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch/c10/util/intrusive_ptr.h:265:54 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_()
==157==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106041
Approved by: https://github.com/davidberard98
2023-07-28 05:04:00 +00:00
efeb46e507 Update kineto submodule to 465ff4cd7 (#106154)
Reland update to kineto after https://github.com/pytorch/pytorch/pull/105866 was reverted. This new update contains a patch to check CUPTI_API_VERSION instead of CUDA_VERSION to handle cases where CUPTI_API_VERSION is behind CUDA_VERSION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106154
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2023-07-28 05:02:27 +00:00
01069ad4be sparse.mm.backward: fix for non-contiguous grad values on CPU (#106127)
Fixes https://github.com/pytorch/pytorch/issues/102493.
The problem was that the backward implementation assumed inputs to be contiguous.
This might supersede https://github.com/pytorch/pytorch/pull/104520.
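
A minimal sketch of the failure pattern this targets (hypothetical shapes; the exact repro is in the linked issue): a transposed, hence non-contiguous, upstream gradient fed into the backward of `torch.sparse.mm` on CPU.
```python
import torch

a = torch.randn(3, 4).to_sparse().requires_grad_()
b = torch.randn(4, 5, requires_grad=True)
out = torch.sparse.mm(a, b)          # dense output of shape (3, 5)

grad = torch.randn(5, 3).t()         # shape (3, 5) but non-contiguous
out.backward(grad)                   # previously hit the contiguity assumption on CPU
print(a.grad.shape, b.grad.shape)
```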

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106127
Approved by: https://github.com/cpuhrsch
2023-07-28 01:25:00 +00:00
93b2036bef Revert "[quant][pt2e] store scale/zero_point as tensor attributes to support serialization (#105894)"
This reverts commit 3ca71ed735257cb7ad377b57a45057c265893a40.

Reverted https://github.com/pytorch/pytorch/pull/105894 on behalf of https://github.com/huydhn due to breaking executorch tests internally ([comment](https://github.com/pytorch/pytorch/pull/105894#issuecomment-1654831950))
2023-07-28 01:16:02 +00:00
cb14ff294b [inductor] Pass to remove pointless clones (#105994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105994
Approved by: https://github.com/yanboliang, https://github.com/eellison
2023-07-28 00:57:09 +00:00
e31855d0d6 [pytorch profiler] fix profiler test for windows (#106156)
Summary:
Fixes Windows tests; I am not seeing any CUDA events in the output, so this test does not apply there.
https://github.com/pytorch/pytorch/pull/105187#issuecomment-1652669293

Test Plan: buck2 run //caffe2/caffe2:caffe2_test_gpu

Reviewed By: chaekit, aaronenyeshi, huydhn

Differential Revision: D47841663

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106156
Approved by: https://github.com/aaronenyeshi, https://github.com/huydhn
2023-07-28 00:53:09 +00:00
88c400e03b Add @penguinwu to distributed codeowners (#105945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105945
Approved by: https://github.com/ezyang
2023-07-27 23:42:24 +00:00
ce63389246 Allow graph breaks in inductor opinfo tests (#105480)
Previously, we would have test failures for operators that graph-broke because of dynamic shapes or data-dependent ops. Those appeared as failures because we were running with `nopython=True`. Those test "failures" (which are expected behavior) obfuscated the actual correctness errors and made this test lower signal.

If we wanted to do something like full-op export, that should be separate from the inductor opinfo tests.
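
For context, a minimal sketch of why a graph break shows up as a test failure under `nopython=True` (an assumed, illustrative op; not taken from the test suite):
```python
import torch

@torch.compile(fullgraph=True)  # analogous to running dynamo with nopython=True
def fn(x):
    return x.nonzero()          # data-dependent output shape forces a graph break

try:
    fn(torch.tensor([0, 1, 0, 2]))
except Exception as e:
    print(type(e).__name__)     # surfaces as an error instead of silently falling back
```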

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105480
Approved by: https://github.com/desertfire
2023-07-27 23:23:35 +00:00
ec0ffac33b [BE] Document optimizer state_dict better, use example (#105958)
![image](https://github.com/pytorch/pytorch/assets/31798555/50ce293c-d884-47ab-b5f5-9ba41e3b4bad)
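
For context, a rough sketch of the `state_dict` layout the updated docs describe (exact per-parameter keys depend on the optimizer):
```python
import torch

model = torch.nn.Linear(2, 2)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
model(torch.randn(1, 2)).sum().backward()
opt.step()

sd = opt.state_dict()
print(sd.keys())                 # dict_keys(['state', 'param_groups'])
print(sd["param_groups"])        # hyperparameters plus the ids of the params they govern
print(sd["state"][0].keys())     # per-parameter state, e.g. 'step', 'exp_avg', 'exp_avg_sq'
```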

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105958
Approved by: https://github.com/albanD
2023-07-27 23:08:42 +00:00
723bc136a1 Add context for warning about batch_first (#106139)
Summary: Add context for warning about batch_first

Test Plan: sandcastle github

Differential Revision: D47809651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106139
Approved by: https://github.com/mikaylagawarecki
2023-07-27 23:02:05 +00:00
7b9d250f06 Change _dynamo.export to be export(f)(*args, **kwargs) (#106109)
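A sketch of the new calling convention implied by the title (the return value is assumed to still be the `(graph_module, guards)` pair):
```python
import torch
import torch._dynamo as dynamo

def f(x):
    return x.sin() + x.cos()

# before: dynamo.export(f, torch.randn(3))
# after:  export is curried, so the callable comes first and the example
#         inputs are passed to the returned function
gm, guards = dynamo.export(f)(torch.randn(3))
print(gm.graph)
```
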
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106109
Approved by: https://github.com/voznesenskym
2023-07-27 21:41:13 +00:00
5cbd3fc412 [Inductor] Fuse non-foreach ops with foreach ops without iterating over all subnodes (#106008)
Previously, when fusing a single node into a foreach op, the scheduler would iterate over each subnode and check whether it could be fused. This PR adds a mapping so that the node to fuse with can be found more quickly by checking dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106008
Approved by: https://github.com/jansel
2023-07-27 21:40:24 +00:00
d4136c9088 Add pull request target to bc lint (#106065)
In order to get around the approval needed for first-time contributors, add a pull_request_target trigger for BC lint, as is done for check labels.

Documentation about approvals here https://docs.github.com/en/actions/managing-workflow-runs/approving-workflow-runs-from-public-forks specifically:

> Note: Workflows triggered by pull_request_target events are run in the context of the base branch. Since the base branch is considered trusted, workflows triggered by these events will always run, regardless of approval settings. For more information about the pull_request_target event, see "[Events that trigger workflows](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#pull_request_target)."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106065
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-07-27 20:42:01 +00:00
ca7ece9b50 [easy] improve hint on error message in nn.Module.load_state_dict (#106042)
Fix #105963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106042
Approved by: https://github.com/albanD
2023-07-27 19:56:02 +00:00
70bc1b0f48 Tag functions to core IR in native_functions.yaml (#105849)
Summary:
Based on operator review meetings, tag appropriate functions as part of the Core IR.

[Operator Review Tracking Sheet](https://docs.google.com/spreadsheets/d/1u9jQ-uGlKu-fe9nLy-jS2AIPtpE8sGTmELOFYgKOhXU/edit#gid=0)

Test Plan: Use N3940835 to load the YAML and check updated core op list.

Reviewed By: mergennachin, kimishpatel, SherlockNoMad

Differential Revision: D47673670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105849
Approved by: https://github.com/SherlockNoMad
2023-07-27 19:40:18 +00:00
d960664842 Lower batch on cait_m36_384 (#106091)
The memory compression for this model is 0.9839, but we OOM with cudagraphs because we interleave the eager runs with cudagraph runs, which duplicates the memory because of the cudagraph memory pool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106091
Approved by: https://github.com/anijain2305
2023-07-27 19:33:38 +00:00
27ece5fad4 [Easy] remove unneeded sort (#106090)
This isn't needed now that we call stable_topological_sort in `freezing_passes`. The non-stable sort can also hurt perf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106090
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2023-07-27 19:09:48 +00:00
10f55a2a94 [export] Handle the case for no placeholders during in runtime assertion pass. (#106134)
Summary: as title

Differential Revision: D47835210

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106134
Approved by: https://github.com/angelayi
2023-07-27 18:36:51 +00:00
9ff36c2b3f [Pytorch][Vulkan] sum.dim_IntList (#105612)
Summary:
Add Vulkan support for [sum](https://pytorch.org/docs/stable/generated/torch.sum.html).dim_IntList

[sum.dim_IntList](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5466):
```
func: sum.dim_IntList(Tensor self, int[1]? dim, bool keepdim=False, *, ScalarType? dtype=None)
```
Some explanation. For each pos:
 - Iterate over the out_texel and the summed dimension.
 - For H,W: rearrange pos.x, pos.y.
 - For C,H,W: when C,H,W are summed, batch moves into channel; the source N is determined by pos.z * 4 + out_index.

Follow up:
Add support for `keepdim=true`
```
if keepdim is true, the output tensor is of the same size as input except in the dimension(s) dim, where it is of size 1
otherwise, the dim is squeezed, result in the output tensor having 1 fewer dimension/s.
```
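
For illustration, the keepdim semantics on an ordinary CPU tensor (the Vulkan `keepdim=True` path is the follow-up noted above):
```python
import torch

x = torch.randn(2, 3, 4)
print(torch.sum(x, dim=[1], keepdim=True).shape)   # torch.Size([2, 1, 4])
print(torch.sum(x, dim=[1], keepdim=False).shape)  # torch.Size([2, 4])
```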

Add support for [sum](https://www.internalfb.com/code/fbsource/[49b7951b7eb6]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=5457)
```
func: sum(Tensor self, *, ScalarType? dtype=None) -> Tensor
```

Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.sum*"
Downloaded 0/53 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 47.4 sec (100%) 536/536 jobs, 8/536 updated
  Total time: 47.5 sec
BUILD SUCCEEDED
Running main() from third-party/googletest/1.11.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *.sum*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.sum_2d
[       OK ] VulkanAPITest.sum_2d (426 ms)
[ RUN      ] VulkanAPITest.sum_3d
[       OK ] VulkanAPITest.sum_3d (2 ms)
[ RUN      ] VulkanAPITest.sum_4d
[       OK ] VulkanAPITest.sum_4d (3 ms)
[ RUN      ] VulkanAPITest.sum_3d_combined
[       OK ] VulkanAPITest.sum_3d_combined (1 ms)
[ RUN      ] VulkanAPITest.sum_4d_combined
[       OK ] VulkanAPITest.sum_4d_combined (5 ms)
[----------] 5 tests from VulkanAPITest (437 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (438 ms total)
[  PASSED  ] 5 tests.
```

clang-format on Sum.cpp and sum_dim.glsl

Differential Revision: D47580428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105612
Approved by: https://github.com/SS-JIA
2023-07-27 18:35:50 +00:00
78fffe8906 Bump certifi from 2023.5.7 to 2023.7.22 in /tools/build/bazel (#105983)
Bumps [certifi](https://github.com/certifi/python-certifi) from 2023.5.7 to 2023.7.22.
- [Commits](https://github.com/certifi/python-certifi/compare/2023.05.07...2023.07.22)

---
updated-dependencies:
- dependency-name: certifi
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-27 10:23:56 -07:00
487ebcac3b Clean up unsed MHA code to avoid confusion (#105956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105956
Approved by: https://github.com/wz337, https://github.com/ezyang, https://github.com/wanchaol
2023-07-27 17:10:17 +00:00
1a59be2c9e [BE] Use C10_CLANG_DIAGNOSTIC macros (#106084)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106084
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2023-07-27 16:53:50 +00:00
977df45a0f [inductor] Call render() once for templates (#105987)
This is more code, but perhaps easier to understand?  Both @Chillee and @ipiszy expressed confusion that we rendered templates twice to reach a fixed point.  This removes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105987
Approved by: https://github.com/Chillee
2023-07-27 16:34:38 +00:00
377f306b4c [inductor] Add has_mkl check (#106049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106049
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-27 15:34:02 +00:00
6f1042c049 Make sure that little endian is default case when __BYTE_ORDER__ is not defined (#104249)
This is a follow up to discussion
in https://github.com/pytorch/pytorch/pull/96422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104249
Approved by: https://github.com/malfet
2023-07-27 13:33:35 +00:00
7c97c943fb inductor: always convert weight to channels_last for cpu conv (#105517)
For the CPU backend, we always use channels_last to get good performance by avoiding format reorders (blocked to plain or plain to blocked). The weight-packing code also assumes that the weight is channels_last, so we always convert the weight format and do the layout optimization.
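
As a rough illustration of the layout convention (not of the weight-packing internals themselves), this is the channels_last format the backend standardizes on:
```python
import torch

conv = torch.nn.Conv2d(3, 8, 3).eval()
x = torch.randn(1, 3, 32, 32)

conv = conv.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = torch.compile(conv)(x)
print(y.is_contiguous(memory_format=torch.channels_last))
```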

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105517
Approved by: https://github.com/jgong5, https://github.com/shunting314
2023-07-27 08:37:32 +00:00
a1d0db1c60 [pytorch] Fix MSVC unexpected tokens following preprocessor directive (#105922)
Summary:
Fix this warning:
```
caffe2\c10\macros\Macros.h(138): warning C4067: unexpected tokens following preprocessor directive - expected a newline
```
`caffe2/c10/util/variant.h` already has a similar check that defines a stub for `__has_attribute(x)`, so this would not be new to caffe2/pytorch.

Test Plan: CI should complete, still with plenty of caffe2 warnings but this one should be gone from the Windows build log

Differential Revision: D47735319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105922
Approved by: https://github.com/kit1980
2023-07-27 06:03:31 +00:00
b435bff53a [PyTorch] Add tests for empty tensors w/storage null data_ptr (#101426)
Further investigation seems to show that changing this behavior (making empty tensors sometimes have a non-null data_ptr) was the real problem with #98090. Adding tests to lock down this behavior so we don't change it by accident again.

Differential Revision: [D45873002](https://our.internmc.facebook.com/intern/diff/D45873002/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101426
Approved by: https://github.com/zou3519
2023-07-27 05:19:42 +00:00
5d8596292b fix atomic add for bf16/fp16 (#104592)
Enable atomic_add for fp16 and fix atomic_add issue for bf16/fp16.

Previously the constructor `bfloat16(addr->x);` would invoke
https://github.com/pytorch/pytorch/blob/main/c10/util/BFloat16.h#L99
(construct a `bfloat16` from a `float`).
Instead, we actually wish to invoke
https://github.com/pytorch/pytorch/blob/main/c10/util/BFloat16.h#L97
(construct a `bfloat16`/`float16` from `bits`).
Test Plan:
 Remove expected failure for `float16` in `test_torchinductor_opinfo` with op `scatter_reduce, sum`, `scatter_add`, `index_add`, `amax/amin`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104592
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-27 04:57:56 +00:00
707aadeedd Track global Numpy variables as side-effect. (#105959)
Fix: #105074

This PR makes dynamo handle NumPy global variables the same way as PyTorch tensor global
variables, by tracking them as side effects.

In summary, we add `NumpyNdarrayVariable` to the
`VariableBuilder._can_lift_attrs_to_inputs` function.
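
A hedged sketch of the kind of program this affects (my own illustrative example, not taken from the issue): a global ndarray rebound inside a compiled function is now tracked like a global tensor.
```python
import numpy as np
import torch

g = np.zeros(3)

@torch.compile
def fn(x):
    global g
    g = g + 1                      # global ndarray mutation, tracked as a side effect
    return x + torch.from_numpy(g)

out = fn(torch.ones(3))
print(g)    # updated after the compiled call
print(out)
```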

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105959
Approved by: https://github.com/ezyang
2023-07-27 03:49:48 +00:00
b812e35a75 [pt2] add meta for argsort.stable, use sort samples in OpInfo (#106025)
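A small sketch of what the new meta kernel enables (assuming the meta-device path is what exercises it):
```python
import torch

x = torch.randn(8, device="meta")
idx = torch.argsort(x, stable=True)        # previously had no meta implementation
print(idx.shape, idx.dtype, idx.device)    # torch.Size([8]) torch.int64 meta
```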
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106025
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-07-27 03:49:17 +00:00
edebdaf182 Change _dynamo.explain to be explain(f)(*args, **kwargs) (#106066)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106066
Approved by: https://github.com/wanchaol, https://github.com/voznesenskym
2023-07-27 03:21:52 +00:00
e773f28ee3 Reland "Add forward mode AD to out-place foreach functions (#102409) (#106043)
forward-mode AD of out-of-place foreach functions, finally.
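
A minimal usage sketch of the feature (illustrative only; `_foreach_sinh` is one of the covered out-of-place foreach ops):
```python
import torch
import torch.autograd.forward_ad as fwAD

xs = [torch.randn(3) for _ in range(2)]
ts = [torch.randn(3) for _ in range(2)]

with fwAD.dual_level():
    duals = [fwAD.make_dual(x, t) for x, t in zip(xs, ts)]
    outs = torch._foreach_sinh(duals)
    jvps = [fwAD.unpack_dual(o).tangent for o in outs]

# matches the per-tensor JVP: cosh(x) * t
for x, t, jvp in zip(xs, ts, jvps):
    torch.testing.assert_close(jvp, torch.cosh(x) * t)
```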

rel:
- #102409
- #105504
- #58833
- #100695

---

# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachSinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  return result;
}

::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachNormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->ord = ord;
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  if (grad_fn) {
    grad_fn->result = result;
  }
  return result;
}

```

# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<SinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<NormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->p = p;
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "norm");
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106043
Approved by: https://github.com/soulitzer
2023-07-27 03:13:24 +00:00
49e047e0f9 Delete dead summarize_dim_constraints (#106053)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106053
Approved by: https://github.com/ydwu4
2023-07-27 03:08:24 +00:00
076781ba9b Revert "fix building errors on FreeBSD (#105897)"
This reverts commit 5c5eece6d85d5be3485f96a6da3905f2dd28331b.

Reverted https://github.com/pytorch/pytorch/pull/105897 on behalf of https://github.com/PaliC due to causing regressions on internal models ([comment](https://github.com/pytorch/pytorch/pull/105897#issuecomment-1652840218))
2023-07-27 03:01:44 +00:00
c5f6c2de15 Revert "update kineto submodule to a94f97b (#105866)"
This reverts commit 8af25cfc245848708e2f24ab2dbed31f6a34f5dc.

Reverted https://github.com/pytorch/pytorch/pull/105866 on behalf of https://github.com/davidberard98 due to Apparently breaks for some older CUDA versions due to symbols that are not available in CUDA <=11.6, I'll take a look and re-update the module tomorrow ([comment](https://github.com/pytorch/pytorch/pull/105866#issuecomment-1652836973))
2023-07-27 02:56:15 +00:00
457d01bcfd [Compiled Autograd] Remove TORCH_API from generated autograd nodes (#105286)
This works around the Windows symbol count issues in #103822.  Unfortunately, removing TORCH_API only works on Windows, but causes build issues on Linux, so we need the `#ifdef`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105286
Approved by: https://github.com/albanD
2023-07-27 02:33:14 +00:00
952021934f inductor: legalize fp16 (#100857)
This PR vectorizes FP16 for CPU, following what has already been done for BF16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100857
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-27 02:31:40 +00:00
2d41fa9d38 Revise err msgs for weight param of Multimarginloss (#106047)
Summary: fix lint issue of #106019

Fix: https://github.com/pytorch/pytorch/issues/106020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106047
Approved by: https://github.com/Skylion007
2023-07-27 01:44:13 +00:00
fd4f8e194e [dashboard] Replace cpp_wrapper with aot_inductor on the perf dashboard (#106077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106077
Approved by: https://github.com/gmagogsfm, https://github.com/huydhn
2023-07-27 01:39:36 +00:00
f026b32008 [device_mesh][BE] reduce_scatter fallback to funcol and remove from DM (#105642)
For the reason similar to https://github.com/pytorch/pytorch/pull/105605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105642
Approved by: https://github.com/kumpera, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:05 +00:00
2fa063e1e0 [device_mesh][BE] remove allgather from DM (#105614)
For the reason similar to https://github.com/pytorch/pytorch/pull/105605
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105614
Approved by: https://github.com/rohan-varma, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:05 +00:00
4a49f1f46e [device mesh][BE] remove allreduce from DM (#105605)
This PR removes allreduce from DM and uses functional collectives instead.
The rationale is that we don't want to maintain yet another set of
collective APIs, and since DM's collectives are now thin wrappers around functional collectives,
these collectives don't really need to live in DM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105605
Approved by: https://github.com/kumpera, https://github.com/wz337, https://github.com/fduwjj
2023-07-27 01:33:02 +00:00
06dd850dd5 Simplify check (#106044)
Summary: Simplify check / refactor for readability

Test Plan: sandcastle, github

Differential Revision: D47800732

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106044
Approved by: https://github.com/mikaylagawarecki
2023-07-27 01:18:25 +00:00
6847c965f5 Turn on capture_dynamic_output_shape_ops/capture_scalar_outputs by default for export (#105962)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105962
Approved by: https://github.com/tugsbayasgalan
2023-07-27 01:02:09 +00:00
f70844bec7 Enable UFMT on a bunch of low traffic Python files outside of main files (#106052)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106052
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-27 01:01:17 +00:00
5a114f72bf [Compiled Autograd] Move to torch::dynamo::autograd namespace (#105854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105854
Approved by: https://github.com/albanD
2023-07-27 00:36:47 +00:00
f20ead0aea [pt2][inductor] guard for __package__ is None (#106056)
Summary:
Guard against errors when `__package__` is None

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
sandcastle + CI

Sandcastle run

Differential Revision: D47803386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106056
Approved by: https://github.com/jansel
2023-07-27 00:34:47 +00:00
3959695fbd Fix typo ; Update grad_mode.py (#106045)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106045
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-07-27 00:24:50 +00:00
c05eb77f09 Increase ignorable failures limit (#105998)
Given the number of unstable jobs at the moment (rocm, distributed), the limit of 3 for ignorable failures is too low.  When I manually look into force merges, I can find many examples such as https://github.com/pytorch/pytorch/pull/105848 where there are 3+ unrelated failures.  As the classification is getting more accurate, we can aim to ignore all flaky and broken-trunk failures.

* Default `ok_failed_checks_threshold` to `-1` to ignore all unrelated failures
* Increase the `IGNORABLE_FAILED_CHECKS_THESHOLD` to 10.  The only concern I have before setting it to `-1` is the fog of war situation when a sev occurs.  So 10 is a good middle ground before we agree to set it to `-1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105998
Approved by: https://github.com/clee2000
2023-07-27 00:14:37 +00:00
5a2b9ca754 [ONNX] Limit number of elements to display for list/tuple/dict in diagnostics (#106048)
In a recent change, diagnostics started logging the contents of tuple/list/dict
for diagnosed function arguments and return types. This slowed down export
due to some extremely large container instances, such as the fx-to-onnx
node mapping dictionary.

This PR adds a limit to how many elements the diagnostic would record for
these types. Together with https://github.com/microsoft/onnxscript/pull/922, the performance of
export w/ diagnostics is restored and improved. As shown by pyinstrument:

GPT2 time for `fx_to_onnx_interpreter.run` 17.767s -> 1.961s
xcit_large_24_p8_224 time for `fx_to_onnx_interpreter.run` 144.729s -> 4.067s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106048
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-07-26 23:45:34 +00:00
64843993a4 [mypy] autotune_process.py (#105732)
Follows: #105571 / #105230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105732
Approved by: https://github.com/eellison
2023-07-26 22:45:04 +00:00
cb9a4fbbf2 [BE] Improve test_transformers test structure (#105938)
# Summary

The vast majority of tests in this file only run on CUDA. Decorating with @onlycuda causes pytest to instantiate 2x the tests and skip half of them. This overhead is non-trivial when the number of tests grows large, as it has for this file.

This breaks up the CUDA-only tests into a separate class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105938
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2023-07-26 22:16:20 +00:00
cyy
646fa36875 Add const reference in opportunities detected by clang-tidy (#105931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105931
Approved by: https://github.com/Skylion007
2023-07-26 21:38:10 +00:00
b69e5302b5 add skip if sm < 80 check (#105888)
Fix issue where we were testing `test_schema_correctness_nn_functional_scaled_dot_product_attention_cuda_bfloat16` from `test_schema_check.py` on V100, but bfloat16 support on cuda doesn't exist for sm < 80. Added skip if sm < 80 to the failing test. cc @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105888
Approved by: https://github.com/kit1980
2023-07-26 21:25:24 +00:00
38861ba39f Fixes netName assignment for NCCL Config (#105776)
Fixes #104340

The core issue is described here https://github.com/pybind/pybind11/issues/1168#issuecomment-341969643. Note that NCCL calls free on the netName pointer when destroying the communicator. So memory is safely managed here.

CC: @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105776
Approved by: https://github.com/kwen2501
2023-07-26 21:13:56 +00:00
43b3632215 [Composable] Add hybrid shard AC compile test (#105207)
This was requested to ensure hybrid shard + AC + compile works.

Differential Revision: [D47462393](https://our.internmc.facebook.com/intern/diff/D47462393/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105207
Approved by: https://github.com/awgu, https://github.com/fegin
2023-07-26 21:03:55 +00:00
4137d6e499 [Composable FSDP] Enable HSDP (#105206)
We need to pass the sharding strategy to _init_process_group_state to enable HSDP
for composable FSDP.

Differential Revision: [D47462394](https://our.internmc.facebook.com/intern/diff/D47462394/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105206
Approved by: https://github.com/awgu, https://github.com/fegin
2023-07-26 21:03:55 +00:00
dc19f8a6b5 Fix cuSparse CSR SPMM for using nullptr in csrRowOffsets (#105957)
cuSPARSE from CUDA 12.2 no longer allows passing nullptr to csrRowOffsets.

Internal NVIDIA ref: 4208400
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105957
Approved by: https://github.com/IvanYashchuk
2023-07-26 20:15:30 +00:00
3ca71ed735 [quant][pt2e] store scale/zero_point as tensor attributes to support serialization (#105894)
Summary:
Currently, scale/zero_point for per-tensor quant are stored as burnt-in literals, which means these values can't be serialized in state_dict. This
PR changes them to buffers/Tensors so that they can be serialized.

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Differential Revision: [D47770963](https://our.internmc.facebook.com/intern/diff/D47770963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105894
Approved by: https://github.com/kimishpatel
2023-07-26 20:15:06 +00:00
841b4acf1e [FSDP][Easy] Rename to _comm_hook, _comm_hook_state (#106033)
This is just out of preference to make the naming convention consistent with `register_comm_hook()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106033
Approved by: https://github.com/fegin
2023-07-26 19:59:11 +00:00
035704e88d [FSDP][Easy] Move post-bwd hook logging to own func (#106032)
This is to help make `_post_backward_hook()` easier to read. I plan to refactor some other parts in future PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106032
Approved by: https://github.com/fegin
2023-07-26 19:59:11 +00:00
f725e6374d doc: fix fake quantize per channel doc (#105955)
another doc bug for fake_quantize_per_channel

function doc now matches e7142700ed/aten/src/ATen/native/quantized/FakeQuantPerChannelAffine.cpp (L32)
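
For reference, a small usage sketch of the op whose doc is being fixed (illustrative values):
```python
import torch

x = torch.randn(2, 3)
scale = torch.full((3,), 0.1)
zero_point = torch.zeros(3, dtype=torch.int32)

y = torch.fake_quantize_per_channel_affine(
    x, scale, zero_point, axis=1, quant_min=-128, quant_max=127
)
print(y)
```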

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105955
Approved by: https://github.com/kit1980
2023-07-26 19:17:41 +00:00
60ad46f49d [ONNX] Clean up outdated skip ort < 1.15 decorator in tests (#105951)
`skip_min_ort_version` is not needed anymore, as the ort version is now officially > 1.15. But the function is kept for future usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105951
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-07-26 19:04:43 +00:00
9a1cdcb8a0 Format: fixing multiple string concatenation in single line (#106013)
Fixing multiple string concatenation in single line
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106013
Approved by: https://github.com/albanD
2023-07-26 18:39:18 +00:00
3a77f9aaaf [quant][api] Move torch.ao.quantization.pt2e.quantizer to torch.ao.quantization.quantizer (#105885)
Summary: Move quantizer to torch.ao.quantization to make it a public API, since pt2e is a folder for implementations.

Test Plan:
CIs

sanity check: "buck test //executorch/backends/xnnpack/test:test_xnnpack_quantized_models -- test_resnet18"

Differential Revision: D47727838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105885
Approved by: https://github.com/andrewor14
2023-07-26 18:20:09 +00:00
70b0f1b248 fix some typos (#106018)
Fixes #ISSUE_NUMBER
Fix typos in `test_static_module.cc`, `backend_cutting_test.cc` and `types_base.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106018
Approved by: https://github.com/awgu
2023-07-26 18:14:44 +00:00
21ede4547a remove duplicated code in optimizer (#106022)
Fixes #ISSUE_NUMBER
As the title says, the check code has duplicates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106022
Approved by: https://github.com/janeyx99
2023-07-26 17:01:28 +00:00
716f37cef8 If we can't statically prove 32-bit indexing OK, only add guard if hint exists (#106004)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106004
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-07-26 16:36:29 +00:00
c0c208516b [Doc] Add Tensor.Shape (#104750)
Summary:
Add `Tensor.Shape` doc.

Fix: #104038

Ref:

- https://github.com/pytorch/pytorch/issues/5544
- https://github.com/pytorch/pytorch/issues/1980

Differential Revision: D47278630

CC: @svekars @carljparker

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104750
Approved by: https://github.com/mikaylagawarecki
2023-07-26 16:30:15 +00:00
28a4fc8d8a Fixe some typos (#105869)
### Description:
- Fixes for typos in comments
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105869
Approved by: https://github.com/mikaylagawarecki, https://github.com/Skylion007
2023-07-26 16:23:57 +00:00
2dbadd1eae [export] Remove experimental runtime assertion configs from export API. (#105043)
Test Plan: CI

Differential Revision: D47390794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105043
Approved by: https://github.com/larryliu0820
2023-07-26 16:21:29 +00:00
8af25cfc24 update kineto submodule to a94f97b (#105866)
New submodule commit: c23a1fdbf6

Fixes #97167.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105866
Approved by: https://github.com/aaronenyeshi
2023-07-26 16:04:30 +00:00
c4b7311fc2 Meff Attn Bias (#104310)
# Summary

### Review Points
- Automatically pad tensors to create aligned masks when seqlen_kv is not a multiple of 16. This causes a memory spike of ~2x the attn_mask size, which could in theory be big. It appears, though, that doing this + mem_eff is faster than no_pad + math, so it seems worth it (see the sketch after this list).
- Using expand to view the attn_mask in 4d. This is a little different from how we enforce q,k,v to be viewed in 4d prior to calling. Also not supporting the (b*n_heads, seq_len_q, seq_len_kv) case.
- Should enable #96099
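
A minimal sketch of the padded-mask case the first review point describes (assumes a CUDA device; 555 is not a multiple of 16, so the mask gets padded internally):
```python
import torch
import torch.nn.functional as F

q = torch.randn(8, 32, 555, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 32, 555, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 32, 555, 64, device="cuda", dtype=torch.float16)
bias = torch.randn(1, 1, 555, 555, device="cuda", dtype=torch.float16)

with torch.backends.cuda.sdp_kernel(
    enable_flash=False, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
print(out.shape)
```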

### Profiling
I ran a bunch of comparisons between sdpa.MATH and sdp.MemEffAttention.  I added an attn_bias of shape (1, 1, seqlen_q, seqlen_k). For these experiments seqlen_q == seqlen_k. These were all run on an A100 80GB GPU.
Configs:
```
    # Run a bunch of experiments
    batch_sizes = [8, 16, 32]
    num_heads = [16, 32]
    max_seq_lens = [15, 64, 128, 512, 555, 1024]
    embed_dims = [32, 64, 128]
    dtypes = [torch.float16, torch.bfloat16, torch.float32]
    pad_percentages = [None]
    backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
    run_backward = True
    attn_mask = True
```

   The function calls `sdpa(input**).sum().backward()`.

   I calculated the geomean speedup of the efficient attention path over the math path for all these configs:
   `Geomean Speedup: 1.977`

An example comparison with batch size = 8, num_heads = 32, embed_dim = 64, and dtype = torch.float16:
![attn_mask_compare_bsz_8_num_heads_32_embed_dim_64_dtype_fp16](https://github.com/pytorch/pytorch/assets/32754868/0d75bffe-350b-43f2-a37f-514f9158dcff)

 This was done using the current state of the branch, where we force alignment of the mask when the last dim is not divisible by 16, which shows up in the seq_len = 15 and 555 cases.

The full data can be found here:

[attn_mask_sweep.csv](https://github.com/pytorch/pytorch/files/11962399/attn_mask_sweep.csv)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104310
Approved by: https://github.com/cpuhrsch
2023-07-26 15:51:59 +00:00
45322fafd6 [ONNX] Add comment on test_view_dynamic_zero_dim (#105950)
From https://github.com/pytorch/pytorch/issues/105066.

The case was meant to test zero bboxes generated by vision models, but `.view()` is still applied to the bboxes. The case was disabled and is not supported. Added a comment to clear up potential confusion in the future. We will wait for a model example to proceed on this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105950
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-07-26 15:20:40 +00:00
72f2c87a5a [foreach] Set SavedVariable.is_output to true for grad_fn->result_ (#105504)
fixes #105502

The scope of this pull request is out-of-place foreach functions that depend on their output tensorlist for backward such as `_foreach_exp`. An example of the generated code with this update is as follows:

```c++
variable_list ForeachExpBackward0::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!result_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto self_ix = gen.range(self_size_);
  variable_list grad_inputs(gen.size());
  auto result = unpack_list(result_, shared_from_this());
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      if (grads[i].defined()) {
        grad_result.emplace_back(grads[i] * result[i].conj());
      } else {
        grad_result.emplace_back(Tensor());
      }
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}

::std::vector<at::Tensor> _foreach_exp(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::shared_ptr<ForeachExpBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachExpBackward0>(new ForeachExpBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    if ((isFwGradDefinedTensorList(self))) {
      static c10::OperatorName full_name("aten::_foreach_exp", "");
      static c10::optional<c10::OperatorHandle> opt_op = c10::Dispatcher::singleton().findSchema(full_name);
      return impl::run_jit_decomposition_with_args_for_jvp<::std::vector<at::Tensor>>("_foreach_exp", *opt_op, ks, self);
    } else {
      at::AutoDispatchBelowADInplaceOrView guard;
      return at::redispatch::_foreach_exp(ks & c10::after_autograd_keyset, self_);
    }
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  if (grad_fn) {
    grad_fn->result_ = make_saved_variable_list(result, true);
  }
  return result;
}
```

A bit of context:
- https://github.com/pytorch/pytorch/pull/105368#issuecomment-1640912479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105504
Approved by: https://github.com/soulitzer
2023-07-26 14:29:32 +00:00
9d2e15882e Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)
As per title.
Note that the c++ side code for the minidumps part was removed. So trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
2023-07-26 14:24:58 +00:00
6b6702f506 Enhance no_grad-context FSDP backward handling (#105374)
Fixes #105369
Fixes #105371

Addressing two somewhat distinct issues that involve the same test in this PR:

1. To fix #105369:
    - Add a `no_grad` guard to [`_register_post_backward_reshard_only_hooks`](93f852f201/torch/distributed/fsdp/_runtime_utils.py (L1406)) to avoid registering post-backward hooks that would not be removed in that context.

2. To fix #105371:
    - Add a `grad` context condition to [`_use_sharded_flat_param`](93f852f201/torch/distributed/fsdp/flat_param.py (L1645C9-L1645C32)) logic to trigger post-forward `_use_sharded_views` in a `no_grad` context for `NO_RESHARD_AFTER_FORWARD_HANDLE_STRATEGIES`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105374
Approved by: https://github.com/awgu
2023-07-26 14:12:13 +00:00
66b73b08df Allow disabling bias for Transformer (#101687)
As used by T5 and PaLM, citing "increased training stability for large models" (https://arxiv.org/abs/2204.02311).

Depends on #101683, which allows disabling bias for `LayerNorm`s. Marked as draft due to this.
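A minimal sketch of the intended usage, assuming the new flag is exposed as a `bias` keyword argument on the transformer layers (the exact argument name is an assumption here, mirroring the `LayerNorm` change in #101683):

```python
import torch.nn as nn

# Assumed API: a `bias` flag that disables the bias of the internal linear and
# norm submodules (argument name not confirmed by this message).
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, bias=False)
model = nn.TransformerEncoder(layer, num_layers=6)
```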
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101687
Approved by: https://github.com/mikaylagawarecki
2023-07-26 13:50:41 +00:00
76fb72e24a [profiler] Fix profiling shapes with PT2 + lists of dynamic shapes (#105893)
Fixes #105748

Follow-up to https://github.com/pytorch/pytorch/pull/104320. If we have a list that contains tensors with dynamic shapes, just mark the entire list as undefined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105893
Approved by: https://github.com/aaronenyeshi
2023-07-26 13:41:07 +00:00
6938494d03 [jit] move get_annotations out of infer_concrete_type_builder (#105197)
There's an internal use case for get_annotations - basically
1. make a module
2. copy annotations from the module to a fx-traced module
3. script the fx module

This isn't a public API but for internal use it's probably fine to reuse this logic instead of copying.

Differential Revision: [D47460370](https://our.internmc.facebook.com/intern/diff/D47460370/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105197
Approved by: https://github.com/eellison
2023-07-26 13:39:39 +00:00
22f93852a2 Fix error message about enabling InferenceMode in Python (#105948)
Summary:
The old error message shows
```
... add `c10::InferenceMode mode;` before model.forward(). Note this guard is only available in C++ but not Python at present."
```
However, InferenceMode for Python has been enabled since D28390595. It can be used as a context manager with `torch.inference_mode()`. The error message is fixed accordingly.
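For example, a minimal usage sketch of the Python-side guard:

```python
import torch

model = torch.nn.Linear(4, 4)
x = torch.randn(1, 4)

# Python equivalent of the C++ `c10::InferenceMode mode;` guard.
with torch.inference_mode():
    y = model(x)
```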

Test Plan: Easy

Reviewed By: yipjustin

Differential Revision: D47655392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105948
Approved by: https://github.com/albanD
2023-07-26 13:21:11 +00:00
c099b80073 [FSDP] Add record_function for explicit prefetching (#105985)
Example:
<img width="568" alt="Screenshot 2023-07-25 at 7 41 43 PM" src="https://github.com/pytorch/pytorch/assets/31054793/5f3f07b3-97f4-4493-9cab-5619484e2f6d">

This can be particularly helpful when `with_stack=False`, in which case it is harder to tell when the prefetch happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105985
Approved by: https://github.com/fegin
2023-07-26 12:16:35 +00:00
a9a3c45649 Revert "Simplify handle indexing (#105006)" (#105984)
This reverts commit 429d45f91a5b636844954363851be309d8203b56.

Unfortunately, https://github.com/pytorch/pytorch/pull/105006 broke backward prefetching (our unit tests did not capture whether backward prefetching was working correctly).

I need more time to dig into this (tomorrow), but I think the issue is related to:
429d45f91a (diff-9a6937168d232432c34c2c4605b96f3147afa2786e287f74b6074b20aa5980e6R143-R146)

Follow-ups:
1. Investigate this thoroughly
2. Add unit tests to capture backward prefetch functionality
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105984
Approved by: https://github.com/fegin
2023-07-26 12:12:14 +00:00
854fe470cd fix check issue for replace_params_with_constants (#105909)
Fix a check issue for replace_params_with_constants to make llama model const folding work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105909
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-26 12:04:01 +00:00
0616952d13 Merge and improve torch optim optimizer type stubs (#102593)
Fixes #102428

Also improves hook registration type hints:

```python
from typing import Any, Dict, Tuple

from torch import nn
from torch.optim import Adam, Adagrad, Optimizer

linear = nn.Linear(2,2)
optimizer = Adam(linear.parameters(), lr=0.001)

def pre_hook_fn_return_none(optimizer: Adam, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

def pre_hook_fn_return_modified(
    optimizer: Optimizer, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    return inputs, kwargs

def hook_fn(optimizer: Optimizer, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

def hook_fn_other_optimizer(optimizer: Adagrad, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

optimizer.register_step_post_hook(hook_fn)  # OK

optimizer.register_step_pre_hook(pre_hook_fn_return_none)  # OK
optimizer.register_step_pre_hook(pre_hook_fn_return_modified)  # OK

optimizer.register_step_post_hook(hook_fn_other_optimizer)  # Parameter 1: type "Adam" cannot be assigned to type "Adagrad"

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102593
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-07-26 11:56:42 +00:00
1b9faf22ef [vision hash update] update the pinned vision hash (#105988)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105988
Approved by: https://github.com/pytorchbot
2023-07-26 11:39:02 +00:00
46b74ab9cf Bump pygments from 2.12.0 to 2.15.0 in /.github/requirements (#105669)
Bumps [pygments](https://github.com/pygments/pygments) from 2.12.0 to 2.15.0.
- [Release notes](https://github.com/pygments/pygments/releases)
- [Changelog](https://github.com/pygments/pygments/blob/master/CHANGES)
- [Commits](https://github.com/pygments/pygments/compare/2.12.0...2.15.0)

---
updated-dependencies:
- dependency-name: pygments
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-07-26 04:28:15 -07:00
d767cff7c7 [quant][fx] Fix docs for prepare_fx/prepare_qat_fx (#105979)
Summary:
Fixes: https://github.com/pytorch/pytorch/issues/103661

Test Plan:
visual inspection of the docs https://pytorch.org/docs/2.0/generated/torch.ao.quantization.quantize_fx.prepare_fx.html#torch.ao.quantization.quantize_fx.prepare_fx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105979
Approved by: https://github.com/andrewor14
2023-07-26 09:56:18 +00:00
0c65a2d58f [pt2] add meta for _adaptive_avg_pool3d_backward (#105816)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105816
Approved by: https://github.com/ezyang
2023-07-26 09:30:17 +00:00
36ae359655 Update matmul decomp to match eager (#105850)
The decomposition was not updated after https://github.com/pytorch/pytorch/pull/95261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105850
Approved by: https://github.com/Chillee
2023-07-26 09:24:51 +00:00
9bde7f4e27 Fix the docs for cosine_similarity (#104772)
The behaviour of `cosine_similarity` was subtly changed in
https://github.com/pytorch/pytorch/pull/31378, but the docs were not
updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104772
Approved by: https://github.com/albanD, https://github.com/svekars
2023-07-26 09:23:09 +00:00
fff4a9db8a Fuse ops in eager cosine_similarity while keeping the stability and the gradients (#104771)
There was a regression in https://github.com/pytorch/pytorch/pull/31378
which was reported in https://github.com/pytorch/pytorch/issues/104564.
This PR should keep the efficiency and memory usage of the original
implementation, while keeping the stability of the newer one.

This solution was already discussed in https://github.com/pytorch/pytorch/pull/31378,
but it was discarded because it modified the vector_norm in-place. The
only magic ingredient that was missing for that solution to work was to
add a `clone()` after calling the `vector_norm`.

I hope this PR takes less time to land than https://github.com/pytorch/pytorch/issues/104564.

Fixes https://github.com/pytorch/pytorch/issues/104564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104771
Approved by: https://github.com/albanD
2023-07-26 09:23:09 +00:00
a61a0fe490 test_linalg: triangular_solve - make well_conditioned well conditioned (#105919)
`well_conditioned=True` does not guarantee that the samples for `triangular_solve` are actually well-conditioned. This PR fixes that. This issue was discovered in https://github.com/pytorch/pytorch/pull/104425.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105919
Approved by: https://github.com/lezcano
2023-07-26 09:21:12 +00:00
aabdd2b7a1 Add support for tensor.tolist() for static sized int tensors (#105976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105976
Approved by: https://github.com/ezyang
2023-07-26 08:13:22 +00:00
cyy
5c5eece6d8 fix building errors on FreeBSD (#105897)
Although FreeBSD is not officially supported, this PR fixes some errors on FreeBSD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105897
Approved by: https://github.com/kit1980
2023-07-26 08:11:42 +00:00
afd621ddde inductor: fix CSE issue when have symbolic shape input at the freezing path (#105651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105651
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-26 08:07:31 +00:00
9c1802f8e3 inductor: using binary folding path to do conv+bn folding (#105650)
This path will use binary folding to do conv+bn folding, to avoid using `make_fx`, which hits tracing errors in some models' dynamic shape paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105650
Approved by: https://github.com/eellison
2023-07-26 07:37:47 +00:00
920b446da9 dynamo: support disable_saved_tensors_hooks (#104869)
Functorch transforms use this context manager which will lead to graph-breaks.
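A minimal sketch of the pattern that can now be traced, assuming the context manager lives at `torch.autograd.graph.disable_saved_tensors_hooks` (the error-message argument below is illustrative):

```python
import torch

@torch.compile(backend="eager")
def fn(x):
    # Without support for this context manager, dynamo would graph-break here.
    with torch.autograd.graph.disable_saved_tensors_hooks("hooks disallowed in this region"):
        return x.sin()

fn(torch.randn(3))
```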

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104869
Approved by: https://github.com/zou3519
2023-07-26 07:27:37 +00:00
7b31732a6f Delete unused experimental export (#105873)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105873
Approved by: https://github.com/ezyang
2023-07-26 07:22:58 +00:00
03e2ca9d9c [Composable] Add more sharding strategies to runtime test (#105205)
Add more sharding strategies to ensure equivalence

Differential Revision: [D47462392](https://our.internmc.facebook.com/intern/diff/D47462392/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105205
Approved by: https://github.com/awgu
2023-07-26 07:03:09 +00:00
a326f5621e composable fsdp, checkpoint, + compile test (#105180)
Test to ensure that composable FSDP, checkpoint, and compile all work
together. Includes a change from https://github.com/pytorch/pytorch/pull/105090
which we can land in that PR first.

Differential Revision: [D47452973](https://our.internmc.facebook.com/intern/diff/D47452973/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105180
Approved by: https://github.com/awgu
2023-07-26 07:03:09 +00:00
5d70fe0165 [Composable] Use non-reentrant generator, remove reentrant (#105176)
Removes reentrant support for the composable checkpoint, as
non-reentrant is the recommended approach and we should use this when rolling
out composable checkpoint API.

Also removes the standalone implementation for non-reentrant and instead uses
the generator from the below diff to reuse the original implementation.

Differential Revision: [D47451375](https://our.internmc.facebook.com/intern/diff/D47451375/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105176
Approved by: https://github.com/awgu, https://github.com/fegin
2023-07-26 07:03:03 +00:00
c76c84bde4 [dynamo] make ProcessGroupVariable a DistributedVariable (#105593)
This PR moves the ProcessGroupVariable from UDO to DistributedVT
so that distributed VTs are consolidated together.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105593
Approved by: https://github.com/voznesenskym
2023-07-26 06:42:50 +00:00
15442915cf [ONNX] Fix the warnings of aten overload fallback to default in onnx dispatcher (#105972)
Without this PR, the warning message is misleading, as it says the default is found before the error message is raised.
The next PR will start refactoring aten overload fallback by adding overloads supported by torchlib into OpSchema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105972
Approved by: https://github.com/BowenBao
2023-07-26 05:42:33 +00:00
8d9c8897ed [profiler] add option for kineto synchronization events in the trace (#105187)
Summary:
## About Sync Events
For CUDA profiling mode, we can enable tracing CUDA synchronization events.
* This feature captures synchronization events in CUDA including 1) context/device sync, 2) stream sync, 3) CUDA event sync, 4) CUDA stream wait event (inter stream synchronization). Read more
* We add this flag using the profiler's experimental config option.
* This PR relies on 7b003638c6 change in pytorch/kineto

## Usage
Just set the `enable_cuda_sync_events` option in `_ExperimentalConfig`
```
from torch.autograd.profiler import profile, _ExperimentalConfig
with profile(use_kineto=True, use_cuda=True,
   experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
   workload()
```

**Please wait for PyTorch github repo to point to 7b003638c6 or later commit in Kineto**

Test Plan:
## Unit Test
Added a unit test

  buck2 test mode/dev-nosan caffe2/test:profiler --local-only -- test_profiler_cuda_sync_events
  Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
https://www.internalfb.com/intern/testinfra/testrun/281475298097379

Reviewed By: davidberard98

Differential Revision: D46244591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105187
Approved by: https://github.com/aaronenyeshi
2023-07-26 03:45:04 +00:00
a770295af4 Don't alter original node's meta in Interpreter (#105880)
Test Plan: OSS CI

Reviewed By: angelayi

Differential Revision: D47740058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105880
Approved by: https://github.com/angelayi
2023-07-26 03:44:58 +00:00
6dd4b99ec2 Revert "Disable torchrec/sparse from top-level Dynamo tracing (#105733)"
This reverts commit 60d5efdb154da766b9f1c4b39bb6260b1427e45b.

Reverted https://github.com/pytorch/pytorch/pull/105733 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/105733#issuecomment-1650931609))
2023-07-26 03:44:47 +00:00
884cd53e49 Unconditionally record when FakeTensorMode is allocated and report it on inconsistency (#105927)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105927
Approved by: https://github.com/albanD
2023-07-26 03:38:42 +00:00
523100a2f1 Make _CURRENT_TRACING_CONTEXT thread local (#105942)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105942
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2023-07-26 03:38:01 +00:00
0003d5135d [TP] Enable partial tensor add without redistribute (#105939)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105939
Approved by: https://github.com/wanchaol
2023-07-26 03:12:39 +00:00
e18d53e2df Added ModuleInfo test for meta device ctx init (#105871)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105871
Approved by: https://github.com/albanD
2023-07-26 01:57:54 +00:00
837363c72f inductor: support conv+binary folding for freezing path (#105048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105048
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-26 01:50:30 +00:00
78b28e884a Fix error formatting string (#105935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105935
Approved by: https://github.com/Skylion007
2023-07-26 01:20:19 +00:00
4af9a914ab Improve FakeTensor to work with mixed meta-cpu embedding bag arguments (#105924)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105924
Approved by: https://github.com/mikaylagawarecki, https://github.com/eellison
2023-07-26 01:19:08 +00:00
dd3a77bc96 Apply UFMT to all files in benchmarks/ (#105928)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928
Approved by: https://github.com/albanD
2023-07-26 01:18:48 +00:00
a361fceef3 [C10d] Move TCPStoreMasterDaemon to TCPStoreBackend. (#105184)
This makes TCPServer interface to the store server be through BackgroundThread.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105184
Approved by: https://github.com/fduwjj
2023-07-25 21:59:12 +00:00
1880852830 [C10d] Move protocol constants to TCPStoreBackend.hpp (#105164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105164
Approved by: https://github.com/fduwjj
2023-07-25 21:43:32 +00:00
e60af5c8e4 Revert "[Compiled Autograd] Move to torch::dynamo::autograd namespace (#105854)"
This reverts commit 26e3b4020f01d4fc2b7f63e1de4c94d2c8b362b5.

Reverted https://github.com/pytorch/pytorch/pull/105854 on behalf of https://github.com/PaliC due to breaking internal embedded device tests (details shared with author) ([comment](https://github.com/pytorch/pytorch/pull/105854#issuecomment-1650559375))
2023-07-25 21:09:18 +00:00
a4cffaae67 [pt2] add metas for _cholesky_solve_helper and cholesky_solve (#105867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105867
Approved by: https://github.com/ezyang
2023-07-25 20:21:47 +00:00
48cd8e29c1 Revert "Slightly improve AOTAutograd logging with ViewAndMutationMeta (#105702)"
This reverts commit cc137342d0ae3fcc95560bc10699bc829a83bf95.

Reverted https://github.com/pytorch/pytorch/pull/105702 on behalf of https://github.com/PaliC due to breaking internal export tests (relevant details shared with author) ([comment](https://github.com/pytorch/pytorch/pull/105702#issuecomment-1650492077))
2023-07-25 20:17:27 +00:00
3eef86dbf4 Only do TLS access when necessary in basicAutogradNotImplementedFallback (#105845)
This is an optimization that may or may not matter (it's difficult to
see the impact on benchmarks).

This PR is a resubmit of #105737 to go directly through github first.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105845
Approved by: https://github.com/soulitzer
2023-07-25 18:41:22 +00:00
da4f3fdca1 Fix bug in basicAutogradNotImplementedFallback (#105660)
In some situations we were registering a hook on a Tensor that does not
require grad, which immediately raises an error. This PR fixes that by
skipping the hook registration if the Tensor in question does not
require grad.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105660
Approved by: https://github.com/soulitzer
2023-07-25 18:41:22 +00:00
e7142700ed Update expected inference for torchbench sam (#105891)
This is currently failing the `inductor_torchbench` trunk job https://github.com/pytorch/pytorch/actions/runs/5650538848/job/15308150238.  The job was marked as unstable to mitigate another issue a few weeks ago (https://github.com/pytorch/pytorch/issues/104337) but was left open and hid the failure from view.

As the model passes as expected, I just added it to the list with `python benchmarks/dynamo/ci_expected_accuracy/update_expected.py becb8dc91a80e03455f7574dc0739fe93a2d199b`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105891
Approved by: https://github.com/msaroufim
2023-07-25 18:40:33 +00:00
fe284b0d97 [C10D] Extract some bits of TCPStore into TCPStoreBackend. (#105163)
This moves BackgroundThread to TCPStoreBackend.hpp. This will eventually be the
interface shared between the current TCPStore backend and the new libuv one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105163
Approved by: https://github.com/fduwjj, https://github.com/H-Huang
2023-07-25 17:59:15 +00:00
b65b9e6ff4 [PT][FSDP] Combine _utils.py into _common_utils.py [1/3] (#105857)
Summary:
https://github.com/pytorch/pytorch/issues/97813

This diff moves `_override_module_mixed_precision`.

Test Plan: CI

Differential Revision: D47706059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105857
Approved by: https://github.com/awgu
2023-07-25 17:37:08 +00:00
340ec1f460 Revert "Meff Attn Bias (#104310)"
This reverts commit 5453508115a2746eeeaaf306f22b0aec23b543d1.

Reverted https://github.com/pytorch/pytorch/pull/104310 on behalf of https://github.com/DanilBaibak due to PR introduced cuda OOM issue ([comment](https://github.com/pytorch/pytorch/pull/104310#issuecomment-1650171538))
2023-07-25 16:37:32 +00:00
3bc047fb9a [ONNX] Detailed diagnostics for 'perfect_match_inputs' (#105892)
Log reasoning behind each unsuccessful perfect match into diagnostic.
<img width="614" alt="image" src="https://github.com/pytorch/pytorch/assets/9376104/c6eb3012-7175-459f-91a5-653bbdf04eb4">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105892
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2023-07-25 16:35:48 +00:00
8282c53789 [ONNX] Add primitives formatting for diagnostics (#105889)
E.g., `<type: int>` -> `2`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105889
Approved by: https://github.com/thiagocrepaldi, https://github.com/titaiwangms
2023-07-25 16:35:48 +00:00
00c6a2ecd5 [ONNX] Diagnostic option 'warnings_as_errors' (#105886)
If set, diagnostics with level WARNING will be logged at level
ERROR and immediately raised.

TODO: bikeshed public export api.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105886
Approved by: https://github.com/thiagocrepaldi
2023-07-25 16:35:47 +00:00
c9edf11073 [FSDP][Docs] Make model/optim state dict configs visible in docs (#105848)
This closes https://github.com/pytorch/pytorch/issues/104717.

Rendered docs:
![Screenshot 2023-07-25 at 11 15 23 AM](https://github.com/pytorch/pytorch/assets/31054793/3c38166a-70c0-472c-805d-452d3bd9c700)
![Screenshot 2023-07-25 at 11 15 30 AM](https://github.com/pytorch/pytorch/assets/31054793/6d275d94-020a-44a2-a64c-0eeba083d47f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105848
Approved by: https://github.com/rohan-varma
2023-07-25 16:23:53 +00:00
71d18f6105 [DocString] Fix incorrect api Examples (#105911)
Fix incorrect Examples in `torch.linalg.tensorinv`.

- before (bug) : `torch.linalg.inverse`
- after: `torch.linalg.inv`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105911
Approved by: https://github.com/lezcano
2023-07-25 13:03:06 +00:00
cyy
1157b4393b Add const reference and std::move in opportunities detected by clang-tidy (#105815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105815
Approved by: https://github.com/Skylion007
2023-07-25 12:28:14 +00:00
5f6c6ff4cf [inductor] Make OpenMP work in fbcode (#105777)
Since we use libomp instead of libgomp, we need to suppress -fopenmp
at link time (using it only at preproc time), and explicitly link libomp.so.

Differential Revision: [D47692939](https://our.internmc.facebook.com/intern/diff/D47692939/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47692939/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105777
Approved by: https://github.com/jansel
2023-07-25 08:01:40 +00:00
b2b1f2194b [inductor] Enable vectorization in fbcode (#105756)
In fbcode, to run the test python script (with its accompanying test DSO) we
need to invoke the correct python, with the correct PYTHONPATH, so we supply
those by reading the appropriate values out of `sys`.

It's an improvement for OSS too, since the user may not be running the default
python.

My previous attempt of using `torch.backends.cpu.get_cpu_capability()` didn't work out, for two reasons:
1. That function actually refuses to report AVX512 support; it's #ifdef-ed out, for some reason.
2. In CI, we apparently are picking INVALID_VEC_ISA (at least when running
inductor_timm_cpu_accuracy), whereas `get_cpu_capability` reports AVX2.  This
is surprising, and probably indicates a bug (either in cpu capability or our
test binary), but I'd rather not go digging for it.

Differential Revision: [D47678649](https://our.internmc.facebook.com/intern/diff/D47678649/)

Differential Revision: [D47678649](https://our.internmc.facebook.com/intern/diff/D47678649)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105756
Approved by: https://github.com/jansel, https://github.com/mikekgfb
2023-07-25 08:01:40 +00:00
487a33e38a [FSDP x dynamo] simplify registry keys (#104209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104209
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-07-25 07:16:22 +00:00
da8de0455b [ONNX] Support ONNXFakeContext with op_level_debug (#105874)
Prior to this PR, op_level_debug didn't support ONNXFakeContext because it relies on real tensors in args to do shape/type inference propagation in the fx graph to get static shapes, which help simulate the op args/kwargs. However, ONNXFakeContext fakifies the args/kwargs at the very beginning, so op_level_debug can't get the static shapes to utilize.

This PR uses the SymInt APIs `has_hint` and `hint_int` to fully replace the functionality of shape/type inference propagation. The static shapes are obtained through SymInt. Therefore, the pass `ShapeInferenceWithFakeTensor` is eliminated.

Also moved the args/kwargs processing into op_validation to live under the rule `op_level_debug`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105874
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
2023-07-25 07:01:49 +00:00
1032a2541e Add option to disable rewriting index hints in default global save plan (#105861)
With distributed checkpointing in PyTorch/XLA SPMD, the WriteItem index hints should not be modified when creating the global plan. In order to reuse the default planner logic for checkpoint metadata creation, we need to make the behavior of rewriting index hints optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105861
Approved by: https://github.com/kumpera
2023-07-25 06:00:13 +00:00
8bf253ecce [export] Remove eliminate_dead_code (#105875)
Summary: Sometimes the graph that is being serialized contains nodes with side effects + no users (ex. out variants of operators), so we don't want to eliminate those when deserializing.

Test Plan: CI

Differential Revision: D47735009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105875
Approved by: https://github.com/ydwu4
2023-07-25 05:37:44 +00:00
c89aec207a [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)
The current test case produces an edge-case tensor input that causes a single generated tensor to fail the tolerance assertion on ROCm only, and only for float32. We have reviewed the logic with our libraries team and have discovered that the discrepancy is due to a difference in order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generate a single problematic input that causes the assertion to fail the tolerance check by ~0.07. Again, this is not a bug, just a difference in implementation. This PR loosens the tolerance for ROCm only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
2023-07-25 05:03:09 +00:00
9c458942ae [easy] Minor torch.load docs fix (#105876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105876
Approved by: https://github.com/albanD
2023-07-25 03:58:30 +00:00
8b34fa5e9b add basic cuda support for float8 dtypes (#105807)
Summary:

Ensures that creating tensors, copying, filling with zeroes, and checking for nan work on CUDA for the `float8` dtypes.  This should be enough for float8 emulation on CUDA.
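A small sketch of the operations being enabled (the concrete dtype name `torch.float8_e4m3fn` is an assumption; this message only refers to the float8 dtypes generically):

```python
import torch

dtype = torch.float8_e4m3fn  # assumed dtype name

x = torch.zeros(4, dtype=dtype, device="cuda")  # creation / filling with zeroes
y = torch.empty_like(x)
y.copy_(x)                                      # copying
print(torch.isnan(y))                           # nan check
```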

Note that I skipped the mul test - it's less trivial to add (need a new c++ macro), and there is no use case for it. We can follow up on that in the future.

Test Plan:

```
python test/test_quantization.py TestFloat8Dtype
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105807
Approved by: https://github.com/ezyang, https://github.com/jerryzh168, https://github.com/albanD
2023-07-25 03:43:36 +00:00
3a01c056f5 [PyTorch][ET] Collect Process Groups Mapping Info (#104373)
Summary: Add the logic and interface to log ProcessGroup comms configuration (unique ID, type, and ranks info).

Test Plan:
Testing in HPC:
```
TORCH_LOGS=all ../buck-out/v2/gen/fbcode/c8344b52091f4f7f/hpc/models/ads/__ads_10x_launcher__/ads_10x_launcher.par  +launcher=local launcher.num_trainers=4 +data_loader=random data_loader.num_batches=2000
```
Example output in ET:
```
    {
      "name": "## process_group:init ##", "id": 3, "rf_id": 1, "parent": 2, "fw_parent": 0, "seq_id": -1, "scope": 7, "tid": 1, "fw_tid": 0, "op_schema": "",
      "inputs": ["[{'pg_id': 140538064364672, 'backend_id': 140538060772480, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}, {'pg_id': 140538064363904, 'backend_id': 140538042628864, 'backend_config': 'cuda:nccl', 'ranks': {0: 0, 1: 1, 2: 2, 3: 3}}]"], "input_shapes": [[]], "input_types": ["String"],
      "outputs": [], "output_shapes": [], "output_types": []
    },
```

Differential Revision: D46321690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104373
Approved by: https://github.com/kwen2501
2023-07-25 03:34:53 +00:00
00ee38c661 [ONNX] Export module as function (#105618)
Introduce `Modularize` pass that analyzes the flat `fx.GraphModule` and creates nested
layers of sub `fx.GraphModule`s along with the `call_module` fx nodes that invokes them.
The analysis is done on the meta data "nn_module_stack", which captures the `nn.Module`
each flat `fx.Node` belongs to.

`FxOnnxInterpreter` is updated to support `call_module`. The related sub module linked
by `node.target` is exported as an ONNX model local function. The `call_module` node itself
is exported as an ONNX node, associated with the ONNX model local function by op_type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105618
Approved by: https://github.com/justinchuby
2023-07-25 03:28:31 +00:00
ec33733701 [ONNX] Improve shape inference for Slice (#105755)
Previously, if 'starts', 'ends', or 'steps' was dynamic, then shape inference would give up, even for dimensions which are not being sliced.

This commit improves this by setting the output shape to be the same as the input shape for dimensions which are not being sliced. Add a new test to cover this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105755
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
2023-07-25 02:58:20 +00:00
98956c5320 Support dynamic shapes in TritonTemplates (#105295)
Currently when dynamic=True, TritonTemplates won't be used, as the condition `if list(call_args) != expected_args` defined in `TritonTemplate` cannot be satisfied. This PR tries to fix this issue by allowing symbolic variable names to be passed via `extra_args` and replacing all symbolic values in the generated TritonTemplate code with call_arg names.

With this change, a locally compiled mm + epilogue node calls into the Triton kernel successfully.

This PR also introduces a new config "max_autotune_gemm_backends" to allow specifying candidate gemm backends for max autotune. Current choices: combinations of ATEN, TRITON. This makes tests easier, so that we can explicitly test Triton gemm kernels + epilogue fusions + dynamic shapes, without falling back to ATen ops.
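A sketch of how the new config might be exercised (the config name comes from this message; the module path `torch._inductor.config` is an assumption):

```python
import torch
import torch._inductor.config as inductor_config  # assumed location of the config

inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "TRITON"  # restrict autotuning to Triton gemms

@torch.compile(dynamic=True)
def mm_relu(a, b):
    return torch.relu(a @ b)

out = mm_relu(torch.randn(64, 128, device="cuda"), torch.randn(128, 32, device="cuda"))
```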

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105295
Approved by: https://github.com/jansel
2023-07-25 01:41:25 +00:00
26e3b4020f [Compiled Autograd] Move to torch::dynamo::autograd namespace (#105854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105854
Approved by: https://github.com/albanD
2023-07-25 01:14:04 +00:00
5403c7770c Provide a refined upper bound for nonzero when original numel is static (#105843)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105843
Approved by: https://github.com/lezcano
2023-07-25 00:51:35 +00:00
cc137342d0 Slightly improve AOTAutograd logging with ViewAndMutationMeta (#105702)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105702
Approved by: https://github.com/albanD
2023-07-25 00:47:38 +00:00
5fec1f93dc Add meta registration for foreach_maximum_.List (#105864)
Will fix issues compiling for when amsgrad is True for Adam(W), see related failures in https://github.com/pytorch/benchmark/actions/runs/5628705163/job/15252867793

Also did some refactoring where common registrations could be deduplicated.

Test plan:
python test/inductor/test_compiled_optimizers.py -k test_adam

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105864
Approved by: https://github.com/albanD, https://github.com/mlazos
2023-07-25 00:39:13 +00:00
6655b6527a [FSDP][Docs] Tidy up FSDP ctor/api docs (#105847)
- This PR rewords the `BackwardPrefetch` docs to make the tradeoffs clear in the first sentence of each with more technical details after.
- The only supported `_FSDPPolicy` is `ModuleWrapPolicy` at the time of writing this PR. We may add others in the future such as in my other PR stack. This PR removes `_FSDPPolicy` from the public docs.
- This provides some more details around `MixedPrecision` such as explaining that layer norm and batch norm accumulate in fp32.

Follow-ups:
- Why do we force batch norm modules to have FSDP applied separately? (E.g. was this because before batch norm kernels did not support fp16/bf16?) Like layer norm, this just means that the affine parameters are in fp32. Both already accumulate in fp32 even with fp16/bf16 inputs.
- Check the `param_init_fn` + `sync_module_states=True` usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105847
Approved by: https://github.com/rohan-varma
2023-07-25 00:19:08 +00:00
65bce811a6 [ONNX] Passes to reuse existing fake mode if possible (#105764)
Fixes #105467, namely the need of setting `aten_graph=True` in `_dynamo.export`
to make fake mode onnx exporter work.

Previously, `make_fx` called by passes always create new fake mode. Hence it is
missing out info from `shape_env` recorded during dynamo export. This PR tries
to check and fetch existing fake mode from graph node meta.
Also enable python dispatcher context when calling `make_fx`. This is done in
`_dynamo.export(aten_graph=True)` but was missing in our passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105764
Approved by: https://github.com/titaiwangms
2023-07-24 23:42:26 +00:00
8047ce05dd Cleanup calculate-docker-image (#105752)
As this has been replaced by the more generic calculate-docker-image on test-infra https://github.com/pytorch/test-infra/pull/4397 in:

* https://github.com/pytorch/test-infra/pull/4399
* and https://github.com/pytorch/pytorch/pull/105372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105752
Approved by: https://github.com/Skylion007
2023-07-24 23:37:08 +00:00
becb8dc91a [inductor] triton_utils.config_of: check for divisibility by 16, even when expr is not an Integer (#105743)
TL;DR: triton_utils.config_of determines divisibility by 16 for each of the inputs to the kernel (pointer alignment for pointers, and divisibility by 16 for sizes). For sizes, the check previously could only return true if the expr representing the size was an integer. However, it's possible for non-integral exprs to be divisible by 16, e.g. for an expr like 16*s0.

Motivation: Knowledge about divisibility by 16 allows for vectorizing loads and stores, which can improve memory bandwidth. If we have, for example, kernels with shape [s0, 16] (dynamic batch size; static, divisible-by-16 other dimensions), we want to still be able to vectorize those loads and stores.
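A small sketch of that shape pattern (a dynamic batch dimension with a static, divisible-by-16 inner dimension), assuming a CUDA device:

```python
import torch

@torch.compile(dynamic=True)
def scale(x):
    return x * 2.0

# The total numel is 16*s0 for a dynamic batch size s0, so the size expression
# is divisible by 16 even though it is not a plain integer.
scale(torch.randn(8, 16, device="cuda"))
scale(torch.randn(13, 16, device="cuda"))
```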

Dashboard results suggest that this improves dynamic shape training performance for timm, and possibly a small improvement for torchbench as well. More details are provided in a comment below.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105743
Approved by: https://github.com/ezyang, https://github.com/aakhundov
2023-07-24 22:41:50 +00:00
8b94280008 [functional collective] parameterize allreduce tests (#105604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105604
Approved by: https://github.com/rohan-varma
2023-07-24 22:21:19 +00:00
5453508115 Meff Attn Bias (#104310)
# Summary

### Review Points
- Automatically pad tensors to create aligned masks when seqlen_kv is not a multiple of 16. This will cause a memory spike of ~2 * attn_mask size, which could in theory be big. It appears, though, that doing this + mem_eff is faster than no_pad + math, so it seems to be worth it.
- Using expand to view the attn_mask in 4d. This is a little different to how we enforce q, k, v to be viewed in 4d prior to calling. Also not supporting the b*n_heads, seq_len_q, seq_len_kv case.
- Should enable, #96099

### Profiling
I ran a bunch of comparisons between sdpa.MATH and sdp.MemEffAttention. I added an attn_bias of shape (1, 1, seqlen_q, seqlen_k). For these experiments seqlen_q == seqlen_k. These were all run on an A100 80GB GPU.
Configs:
```
    # Run a bunch of experiments
    batch_sizes = [8, 16, 32]
    num_heads = [16, 32]
    max_seq_lens = [15, 64, 128, 512, 555, 1024]
    embed_dims = [32, 64, 128]
    dtypes = [torch.float16, torch.bfloat16, torch.float32]
    pad_percentages = [None]
    backends = [SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]
    run_backward = True
    attn_mask = True
```

   The function calls `sdpa(**inputs).sum().backward()`.

   I calculated the geomean speedup of the efficient attention path over the math path for all these configs:
   `Geomean Speedup: 1.977`

An example comparison with batch size = 8, num_heads = 32, embed_dim = 64, and dtype = torch.float16:
![attn_mask_compare_bsz_8_num_heads_32_embed_dim_64_dtype_fp16](https://github.com/pytorch/pytorch/assets/32754868/0d75bffe-350b-43f2-a37f-514f9158dcff)

 This was done using the current state of the branch, where we force alignment of the mask when the last dim is not divisible by 16, which shows up in the seq_len = 15 and 555 cases.

The full data can be found here:

[attn_mask_sweep.csv](https://github.com/pytorch/pytorch/files/11962399/attn_mask_sweep.csv)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104310
Approved by: https://github.com/cpuhrsch
2023-07-24 22:19:26 +00:00
9d62c5faf6 [exir] Add deepcopy to ExportedProgram (#105852)
Summary: ExirExportedProgram would like to have this feature. Today it does it itself, since it inherits from ExportedProgram, but since we are moving it to composition, I think it would be cleaner to upstream the behavior into the root object anyway.

Test Plan: ci, but todo where are the tests for this file?

Differential Revision: D47645843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105852
Approved by: https://github.com/tugsbayasgalan
2023-07-24 21:15:55 +00:00
c902b84e0b Compiled autograd (#103822)
This branch:
1) converts the autograd tape into an FX graph
2) caches that conversion using a "shadow" graph
3) compiles and runs the generated FX graph instead of the normal autograd

What works currently:
1) Caching, capture, and initial integration
2) Backwards hooks
3) Inlining AotAutograd generated subgraphs
4) torch.compiling the generated FX graph
5) Auto-detecting dynamic shapes based on changes

Future work
1) Larger scale testing
2) Boxed calling convention, so memory can be freed incrementally
3) Support hooks on SavedTensor
4) Additional testing by running eager autograd tests under compiled_autograd.enable()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103822
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-24 21:12:05 +00:00
14304afd76 Remove unnecessary simplification in guard_lt (#105842)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105842
Approved by: https://github.com/Skylion007
2023-07-24 20:58:10 +00:00
b78341dda9 Use hipsolver for default svd case on ROCm (#103540)
Fixes #102678
Fixes #102629
Fixes #102558
hipSOLVER performance on ROCm 5.4.2 and later no longer serves as a massive bottleneck. Additionally, using magma on ROCm in this case caused test_compare_cpu_linalg_pinv_singular_cuda_float32 to fail. Using hipSOLVER, the test now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
2023-07-24 20:50:56 +00:00
bf693f2000 Strengthen ConstantVariable invariants (#105796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105796
Approved by: https://github.com/ezyang
2023-07-24 20:41:12 +00:00
d2ee3d0675 Add a version to this signpost so I can tell if packages have taken updates (#105735)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105735
Approved by: https://github.com/albanD
2023-07-24 19:49:16 +00:00
b0708654c0 Implement NAdamW optimizer (#103881)
NAdamW, which is simply NAdam with the AdamW weight decay term, has shown strong performance in optimizer comparisons such as
1. https://arxiv.org/abs/2211.09760
1. https://arxiv.org/abs/2306.07179

[The VeLO paper](https://arxiv.org/abs/2211.09760) argues its power lies in its ability to act as a superset of other popular optimizers.

This PR adds NAdamW by ~~copying and making very small adaptations to the NAdam implementation (just like AdamW and Adam). To see the small changes in better detail, you can `diff torch/optim/nadam.py torch/optim/nadamw.py`.~~ adding a boolean flag `decoupled_weight_decay` that activates NAdamW behavior (`False` by default) to NAdam.
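A minimal usage sketch of the flag described above (flag name taken from this message; availability depends on the PyTorch version):

```python
import torch
from torch import nn

model = nn.Linear(10, 10)
# NAdam with decoupled (AdamW-style) weight decay, i.e. NAdamW behavior.
optimizer = torch.optim.NAdam(
    model.parameters(), lr=1e-3, weight_decay=1e-2, decoupled_weight_decay=True
)
```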

Interest in the optimizer has also been shown in the PyTorch forums:
https://discuss.pytorch.org/t/nadamw-and-demon-optimizers/179778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103881
Approved by: https://github.com/janeyx99
2023-07-24 19:29:26 +00:00
a54043516f Add SparseCsrCPU and SparseCsrCUDA dispatch to sum.dim_IntList (#99292)
This PR is to add support of sum.dim_IntList for Sparse Tensor, which is exposed in https://github.com/pytorch/pytorch/issues/98796.
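A small sketch of the newly supported call, assuming a CSR tensor built with `torch.sparse_csr_tensor` (the exact output layout/semantics are not spelled out in this message):

```python
import torch

crow_indices = torch.tensor([0, 2, 3])
col_indices = torch.tensor([0, 1, 1])
values = torch.tensor([1.0, 2.0, 3.0])
a = torch.sparse_csr_tensor(crow_indices, col_indices, values, size=(2, 2))

# sum.dim_IntList on a sparse CSR tensor.
print(a.sum(dim=0))
```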

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99292
Approved by: https://github.com/mingfeima, https://github.com/rusty1s, https://github.com/cpuhrsch
2023-07-24 17:30:58 +00:00
fb0ffeece3 Use the newer g5.12xlarge instead of g3.16xlarge for multigpu tests (#105759)
Both have 4 GPUs.  This is an attempt to mitigate the runner issue with `g3.16xlarge`, which has started to crash a lot recently https://github.com/pytorch/pytorch/issues/105721.  So, let's see if switching to a newer runner type helps.

The job also finishes slightly faster, in ~120m https://github.com/pytorch/pytorch/actions/runs/5625775414/job/15246453229 vs. ~140m before https://github.com/pytorch/pytorch/actions/runs/5625238650/job/15244823174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105759
Approved by: https://github.com/atalman
2023-07-24 17:18:22 +00:00
3045e84e67 Tweak dynamic=False behavior (#105715)
Previously, dynamic=False is a no-op, and dynamic=True preemptively
turns on dynamic shapes everywhere.

Now, dynamic=False *disables* automatic dynamic, and an unset dynamic
defaults to dynamic=None (which uses automatic dynamic.)  This
seems to be more intuitive per
https://github.com/pytorch/pytorch/issues/105634#issuecomment-1644883477
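A sketch of the three settings as described above:

```python
import torch

def fn(x):
    return x * 2

opt_default = torch.compile(fn)                 # dynamic unset -> automatic dynamic
opt_static  = torch.compile(fn, dynamic=False)  # disables automatic dynamic
opt_dynamic = torch.compile(fn, dynamic=True)   # preemptively dynamic everywhere
```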

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105715
Approved by: https://github.com/voznesenskym
2023-07-24 16:56:41 +00:00
0ab74044c2 [BE] remove deprecated attributes from distributed_c10d (#105753)
Removing these attributes as they were introduced 5 years ago, before PyTorch 1.0. `Backend` is the only supported use now.

Differential Revision: [D47683717](https://our.internmc.facebook.com/intern/diff/D47683717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105753
Approved by: https://github.com/rohan-varma
2023-07-24 16:35:08 +00:00
cyy
b8eb827d93 use UBSAN on some tests (#103655)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103655
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-07-24 14:24:49 +00:00
68dce23722 [ROCm] Skip test_jit_cudnn_extension on ROCm (#105805)
follow up https://github.com/pytorch/pytorch/pull/105594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105805
Approved by: https://github.com/ezyang
2023-07-24 14:19:56 +00:00
1600585219 Revert "Fix test failure in TestCudaMultiGPU.test_cuda_device_memory_allocated (#105501)"
This reverts commit e6fd8ca3eef2b85b821936829e86beb7d832575c.

Reverted https://github.com/pytorch/pytorch/pull/105501 on behalf of https://github.com/zou3519 due to We've agreed that the PR is wrong. It didn't actually break anything. ([comment](https://github.com/pytorch/pytorch/pull/105501#issuecomment-1648005842))
2023-07-24 14:18:44 +00:00
33b855e906 [xla hash update] update the pinned xla hash (#105828)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105828
Approved by: https://github.com/pytorchbot
2023-07-24 10:54:25 +00:00
80144d9cf9 Implement NEON accelerated implementation of ERF() (#105610)
Fixes #105493

Inspired by the [AVX implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec256/vec256_float.h#L158-L189) for the same.

Perf on a Graviton3 EC2 instance with one OMP thread:
Operation | std math | SLEEF | NEON (this PR)
-- | -- | -- | --
GELU (100 passes) | 1141.897ms | 598.929ms | 515.499ms
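
A rough sketch of the kind of measurement reported above (single thread, 100 GELU passes; the tensor size used for the PR's numbers is not given here, so the one below is an assumption):

```python
import time
import torch

torch.set_num_threads(1)          # one OMP thread, as in the table above
x = torch.randn(1_000_000)

start = time.perf_counter()
for _ in range(100):
    torch.nn.functional.gelu(x)   # exact GELU exercises the erf() path internally
print(f"GELU (100 passes): {(time.perf_counter() - start) * 1000:.1f} ms")
```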

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105610
Approved by: https://github.com/jgong5
2023-07-24 08:45:58 +00:00
54a673bdcf Initial sourceless builder (#104734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104734
Approved by: https://github.com/ezyang
2023-07-24 02:48:32 +00:00
b0816e4714 [inductor] Fix AOTInductor output issues (#105773)
Summary: This is a follow-up on https://github.com/pytorch/pytorch/pull/105496. There are several issues with the previous fix,
1) It explicitly does copy for every output at the end of the main function;
2) When an output is ReinterpretView, no as_strided was generated for it;
3) There can be duplicated buffer declarations.

This PR fixes these by making sure can_reuse behaves consistently between the two AOTInductor passes, and thus always generates the same set of kernels. It also adds handling of ReinterpretView.

Differential Revision: [D47692214](https://our.internmc.facebook.com/intern/diff/D47692214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105773
Approved by: https://github.com/jansel
2023-07-24 01:58:49 +00:00
32b67b5a6b Fix RRef Future type annotation (#105296)
Test Plan: Sandcastle

Differential Revision: D47376236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105296
Approved by: https://github.com/Skylion007
2023-07-23 22:34:09 +00:00
c44ae5544f Skip the source info in the error report if the source code is too large (#105608)
Summary:
A small model (<100MB) took about 20 minutes to load and consumed 16GB of memory.

Strobelight profiling: https://fburl.com/strobelight/abwtz0ry

We realized that calc_line_start_offsets is the culprit; line_starting_offsets_ is a vector of line numbers.

There are >20000 places where we generate such an ErrorReport, and the number of lines is ~100000.

So total memory cost is about 100000 x 20000 x 8 = ~16GB.

We propose to skip the error info for extremely large source files (>1MB), and keep an environment variable to retain the ability to print the source code info for large source files.

Test Plan:
buck run mode/opt-split-dwarf scripts/lufang:load_pt_model -- --model_file_path=/data/local/models/961746678/2/961746678_2.predictor.disagg.gpu.local

before the change, it takes 20mins to load, and the model costs 16GB memory (the model itself is only <100MB)

after the change, it takes 15s to load.

The most of the time / space is spent on calc_line_start_offsets, https://fburl.com/code/2to60zqu

Differential Revision: D47610805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105608
Approved by: https://github.com/hl475
2023-07-23 20:53:14 +00:00
e3539a0e54 [dtensor] forward fix for dynamo import with deploy (#105760)
Summary: forward fix to avoid revert

Differential Revision: D47679598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105760
Approved by: https://github.com/atalman
2023-07-23 07:13:38 +00:00
66fbffce1f Fix unstable CI related to dynamo tests (#105797)
This PR fixes the current unstable CI. The test failure comes from a bad
revert in https://github.com/pytorch/pytorch/pull/105581, which did
not revert the intended PR correctly (there were some merge conflicts
and some logic got deleted during the revert).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105797
Approved by: https://github.com/ezyang
2023-07-23 05:43:54 +00:00
45e4706aff [pt2] add decomps for multilabel_margin_loss_forward ops (#105302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105302
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
944db0357d Unify multilabel_margin_loss_shape_check on CPU and CUDA (#105645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105645
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
eac9e1b35f [OpInfo] add reference and error inputs for multilabel_margin_loss (#105523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
bba06ad751 [MPS] aten::erfinv metal kernel ops (#101507)
I've added the implementation of erfinv using the algorithm from 4154c8ea15/aten/src/ATen/native/Math.h (L152) in order for the MPS-based algorithm to match the CPU automatic test. This PR is using the new Metal API calls from https://github.com/pytorch/pytorch/pull/100661

Testing shows MPS has a decent speedup (~270x) compared to CPU on a tensor of ~200 million elements.
```
import torch
x = torch.arange(-1, 1, 1e-8) # default cpu tensor
#measure CPU compute time by calling torch.erfinv
time = %timeit -o -q -r 5 torch.erfinv(x)
cpu_time = time.average
print("CPU torch.erfinv time: ", cpu_time)
x = x.to("mps")
# measure MPS compute time
time = %timeit -o -q -r 5 torch.erfinv(x)
mps_time = time.average
print("MPS torch.erfinv time: ", mps_time)
print(f"MPS torch.erfinv is {cpu_time/mps_time*100} percent faster than CPU torch.erfinv")

# compute MSE between MPS and CPU torch.erfinv
x = x.to("cpu")
y_cpu = torch.erfinv(x)
x = x.to("mps")
y_mps = torch.erfinv(x)
y_mps = y_mps.to("cpu")
mask = torch.isfinite(y_cpu) & torch.isfinite(y_mps.to("cpu"))
y_mps = y_mps[mask]
y_cpu = y_cpu[mask]
x = x[mask]
print(f"length of y_mps: {len(y_mps)}, length of y_cpu: {len(y_cpu)}, length of x: {len(x)}")
mse = torch.square(y_cpu - y_mps).mean()
print("MSE between MPS and CPU torch.erfinv: ", mse)
diff = torch.abs(y_cpu - y_mps)
print("Largest difference")
print(f"x:  {x[torch.argmax(diff)]}, y_cpu: {y_cpu[torch.argmax(diff)]}, y_mps: {y_mps[torch.argmax(diff)]} , diff = {y_cpu[torch.argmax(diff)] - y_mps[torch.argmax(diff)]}")
```
CPU torch.erfinv time:  2.654937833400254
MPS torch.erfinv time:  0.009831255332002912
MPS torch.erfinv is 27005.07456822776 percent faster than CPU torch.erfinv
length of y_mps: 199999992, length of y_cpu: 199999992, length of x: 199999992
MSE between MPS and CPU torch.erfinv:  tensor(4.2339e-14)
Largest difference
x:  -0.9999980330467224, y_cpu: -3.363569736480713, y_mps: -3.3635685443878174 , diff = -1.1920928955078125e-06

Fixes https://github.com/pytorch/pytorch/issues/86808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101507
Approved by: https://github.com/kulinseth
2023-07-23 01:36:43 +00:00
12ea12d659 [MPS] Fix upsample output size tensor (incorrect result in MacOS 14.0) (#105677)
Fix the output size tensor when passing a batched input tensor; this fixes the upsample test issues on macOS 14.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105677
Approved by: https://github.com/kulinseth
2023-07-22 23:10:17 +00:00
6d43c89f37 [BE]: Update Ruff to 0.0.280 (#105724)
Removes unused loop values in Python dictionary iteration. Automated fix from Ruff master.
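For illustration, a hypothetical example of the pattern this Ruff rule rewrites (not taken from the actual diff):

```python
counts = {"a": 1, "b": 2}

# before: the dict value is bound but never used
for key, value in counts.items():
    print(key)

# after: iterate over keys only
for key in counts:
    print(key)
```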

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105724
Approved by: https://github.com/ezyang, https://github.com/janeyx99
2023-07-22 23:03:34 +00:00
53a4b262d2 Add missing evaluate_expr for slice_scatter, slight refactor (#105714)
The substantive change is adding slice_scatter to use evaluate_expr
(and I add a test for it).

While I'm at it, I do some cleanup: provide sizevars.evaluate_expr
directly, and rewrite all sites to use it consistently.

Fixes https://github.com/pytorch/pytorch/issues/105524

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105714
Approved by: https://github.com/Skylion007
2023-07-22 22:00:47 +00:00
f5def50461 Supress eager fallback suggestions when exporting (#105767)
Previously, during torch.export(), when an exception was raised during tracing, Dynamo displayed this error:

“You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True”

This is not viable in torch.export(), so this diff suppresses the suggestion during export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105767
Approved by: https://github.com/anijain2305
2023-07-22 19:17:08 +00:00
afd955f3de [dynamo][constant] Kwargs already supported for str methods (#105785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105785
Approved by: https://github.com/yanboliang
2023-07-22 09:33:23 +00:00
20fb2ba68d [ONNX] Register list/tuple/dict to format_argumment and refactor fx.Node format_argument in diagnostics (#105263)
Prior to this PR, the SARIF reports didn't include details on torch.fx.Node (shape/dtype) and didn't unpack tuples/lists/dicts. This PR provides thorough information about args/kwargs from torch in the fx.graph expression format: f32[64, 64, 2] (dtype[shape]).

Need https://github.com/microsoft/onnxscript/pull/890

![dispatcher_sarif](https://github.com/pytorch/pytorch/assets/18010845/2567fac6-4154-4ce8-bc34-83950ef1c1d7)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105263
Approved by: https://github.com/BowenBao
2023-07-22 08:53:08 +00:00
0ad93a3d56 Fix aten.logspace decomposition (#105201)
Fixes #104118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105201
Approved by: https://github.com/ezyang
2023-07-22 04:10:20 +00:00
5afc2f5069 Documentation for torch.autocast (#95760)
- [x] Corrected examples for CUDA devices.
- [x] Information about availability of `torch.autocast`.

Fixes #95547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95760
Approved by: https://github.com/leslie-fang-intel, https://github.com/kit1980
2023-07-22 03:56:34 +00:00
09b5c35911 Support torch.onnx.dynamo_export within FakeTensorMode (#105477)
Currently, exporting a model to ONNX with fake tensor mode requires the
user to load data and model within `torch.onnx.enable_fake_mode` context,
but the actual call to `torch.onnx.dynamo_export` is done outside such
context.

With this PR, we enable `torch.onnx.dynamo_export` to be called either
within `torch.onnx.enable_fake_mode` or outside of it. This feature
required changes to the core PyTorch Dynamo, which were greatly
supported by @ezyang

In future steps we will determine which scenario we are going to
support, but for now we can use either to explore different options and
scenarios and asses their pros and cons.

This PR also creates a separate suite of tests for fake mode specific
scenarios (`TestFxToOnnxFakeTensorWithOnnxRuntime`).
It was done separately to decrease the test time, but we
could merge it with the default `TestFxToOnnxWithOnnxRuntime`. The
additional parameters are `load_checkpoint_during_init` and
`export_within_fake_mode`

With the newly added supported of nested export within fake mode, the
following scenarios are now supported:

```python
import torch

with torch.onnx.enable_fake_mode() as fake_context:
    fake_args = create_args()
    fake_kwargs = create_kwargs()
    fake_model = create_model()
    fake_model.load_state_dict(torch.load(tmp_checkpoint_file.name))

    export_options = torch.onnx.ExportOptions(fake_context=fake_context)

    # `torch.onnx.dynamo_export` called WITHIN `torch.onnx.enable_fake_mode`
    export_output = torch.onnx.dynamo_export(
        fake_model,
        *fake_args,
        **fake_kwargs,
        export_options=export_options,
    )

    export_output.save("/path/to/model.onnx", model_state_dict=create_model())
```

If we decide to only support scenarios in which `torch._dynamo.export` is called within `FakeTensorMode`, then we can remove `fake_mode` argument from `torch._dynamo.export` as a follow-up task

ps: This PR is mostly Edward's https://github.com/pytorch/pytorch/pull/105468 + unit tests after an offline discussion
ps: https://github.com/pytorch/pytorch/issues/105464 tracks pending tasks/limitations from this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105477
Approved by: https://github.com/ezyang, https://github.com/BowenBao
2023-07-22 03:50:52 +00:00
0b11da0ccb [partitioners][ac][dynamic] Fix output signature of fwd with symints (#105771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105771
Approved by: https://github.com/Chillee
2023-07-22 03:04:11 +00:00
0148db6765 [ONNX] Support torch.device in FX exporter (#105757)
Fixes https://github.com/pytorch/pytorch/issues/105172

When torch.device appears in kwargs, we ignore it, as it's not needed in ONNX. However, if it appears in args, it's used by the dispatcher, and we didn't handle it. This PR adds torch.device to args processing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105757
Approved by: https://github.com/BowenBao, https://github.com/justinchuby
2023-07-22 02:22:42 +00:00
60d5efdb15 Disable torchrec/sparse from top-level Dynamo tracing (#105733)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105733
Approved by: https://github.com/voznesenskym
2023-07-22 02:00:36 +00:00
45e0193174 Add telemetry for number of nodes being compiled (#105741)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105741
Approved by: https://github.com/Chillee
2023-07-22 01:56:02 +00:00
7b211ff8dd doc: fix fake_quantize_per_channel_affine (#105241)
Fixes #105085

Fix in formula

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105241
Approved by: https://github.com/jcaip
2023-07-22 00:49:28 +00:00
a6b8c30726 [dynamo][higher order ops] Bugfix for kwargs support (#105699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105699
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
2023-07-21 23:44:37 +00:00
1959802548 [AdamW] Fix complex x amsgrad support (#104990)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104990
Approved by: https://github.com/albanD
2023-07-21 23:43:26 +00:00
e1296a7f8d [Adam] Fix complex x amsgrad support (#104989)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104989
Approved by: https://github.com/albanD
2023-07-21 23:43:26 +00:00
a44f8894fa [Inductor] Provenance tracking for wrapper code (#105717)
Summary:
Add comments in wrapper code for better provenance tracking

Sample inductor wrapper output:
```
# Source Nodes: [mm_1], Original ATen: [aten.mm]
extern_kernels.mm(as_strided(tangents_1, (500, 20), (1, 500)), view, out=buf1)

# Source Nodes: [l__self___linear], Original ATen: [aten.addmm]
extern_kernels.addmm(primals_2, as_strided(primals_3, (20, 500), (500, 1)), as_strided(primals_1, (500, 500), (1, 500)), alpha=1, beta=1, out=buf0)
```

in cpp wrapper
```
        // Source Nodes: [bmm_1], Original ATen: bmm
        at::bmm_out(buf0, arg0_1, arg1_1);
```

Test Plan: OSS CI

Differential Revision: D47657260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105717
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-07-21 23:06:43 +00:00
050d3de07d Revert "Correct dynamo logging docs (#105658)"
This reverts commit f3a261e0968b4b2da071734dd749a179f75bceab.

Reverted https://github.com/pytorch/pytorch/pull/105658 on behalf of https://github.com/PaliC due to breaking docs f3a261e096 ([comment](https://github.com/pytorch/pytorch/pull/105658#issuecomment-1646310865))
2023-07-21 22:38:28 +00:00
4d5d4d8b02 [pytorch] Disable new autograd fallback for mobile builds (#105750)
Summary:
To save on binary size, some of the mobile configs don't include the
autograd kernels for built-in operators (VariableTypeEverything.cpp).
For the mobile build:
- we don't care about having a nice autograd fallback that warns if
an operator has incorrect autograd support. If you're running
a custom operator on mobile then it's already too late for us to warn
or error on it.
- for perf reasons, we do not want mobile to go through autograd_fallback
for all operators (the boxing/unboxing adds overhead).

As a result, on mobile we set the fallback to the fallthrough.

Test Plan: existing tests and benchmarks

Differential Revision: D47674272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105750
Approved by: https://github.com/soulitzer
2023-07-21 22:32:50 +00:00
221853af23 [FSDP][Easy] nit follow-ups to handle refactor (#105738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105738
Approved by: https://github.com/fegin, https://github.com/voznesenskym
2023-07-21 22:00:14 +00:00
f3a261e096 Correct dynamo logging docs (#105658)
Fixes #105657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105658
Approved by: https://github.com/zou3519
2023-07-21 21:37:02 +00:00
174b0c22cb [C10D] Remove watchKey functionality from the Store. (#105014)
The feature was never fully finished and never got any adoption, but
TCPStore pays the cost of twice the number of TCP connections anyway.

While the cost of all those idle connections is minimal, it doesn't come for free:

- It increases the likelihood of a connection-refused failure during the initialization stampede.
- TCPStore uses poll for checking socket availability, which scales linearly with the number of sockets regardless of their status.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105014
Approved by: https://github.com/fduwjj
2023-07-21 21:18:55 +00:00
9d2f56fd22 Bump pygments from 2.12.0 to 2.15.0 in /.ci/docker (#105654)
Bumps [pygments](https://github.com/pygments/pygments) from 2.12.0 to 2.15.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/pygments/pygments/releases">pygments's releases</a>.</em></p>
<blockquote>
<h2>2.15.0</h2>
<ul>
<li>
<p>Added lexers:</p>
<ul>
<li>Carbon (<a href="https://redirect.github.com/pygments/pygments/issues/2362">#2362</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2365">#2365</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2366">#2366</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2367">#2367</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2368">#2368</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2369">#2369</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2370">#2370</a>)</li>
<li>Dax (<a href="https://redirect.github.com/pygments/pygments/issues/2335">#2335</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2345">#2345</a>)</li>
<li>MediaWiki Wikitext (<a href="https://redirect.github.com/pygments/pygments/issues/2373">#2373</a>, <a href="https://redirect.github.com/pygments/pygments/issues/827">#827</a>)</li>
<li>PostgreSQL Explain (<a href="https://redirect.github.com/pygments/pygments/issues/2398">#2398</a>)</li>
<li>WGSL (WebGPU Shading Language) (<a href="https://redirect.github.com/pygments/pygments/issues/2386">#2386</a>)</li>
<li>X++ (<a href="https://redirect.github.com/pygments/pygments/issues/2339">#2339</a>)</li>
</ul>
</li>
<li>
<p>Updated lexers:</p>
<ul>
<li>
<p>AMDGPU: Add support for <code>scratch_</code> instructions, the <code>attr*.*</code> argument,
as well as the <code>off</code> modifier (<a href="https://redirect.github.com/pygments/pygments/issues/2327">#2327</a>).</p>
</li>
<li>
<p>APDL: Miscellaneous improvements (<a href="https://redirect.github.com/pygments/pygments/issues/2314">#2314</a>)</p>
</li>
<li>
<p>bash/tcsh:</p>
<ul>
<li>Move <code>break</code> to keywords (<a href="https://redirect.github.com/pygments/pygments/issues/2377">#2377</a>)</li>
<li>Improve bash math expansion lexing (<a href="https://redirect.github.com/pygments/pygments/issues/2255">#2255</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2353">#2353</a>)</li>
</ul>
</li>
<li>
<p>Chapel: Support attributes (<a href="https://redirect.github.com/pygments/pygments/issues/2376">#2376</a>)</p>
</li>
<li>
<p>CMake: Implement bracket style comments (<a href="https://redirect.github.com/pygments/pygments/issues/2338">#2338</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2354">#2354</a>)</p>
</li>
<li>
<p>CSS: Improve lexing of numbers inside function calls (<a href="https://redirect.github.com/pygments/pygments/issues/2382">#2382</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2383">#2383</a>)</p>
</li>
<li>
<p>diff: Support normal diff syntax, as opposed to unified diff syntax (<a href="https://redirect.github.com/pygments/pygments/issues/2321">#2321</a>)</p>
</li>
<li>
<p>GLSL, HLSL:</p>
<ul>
<li>Support line continuations in preprocessor code (<a href="https://redirect.github.com/pygments/pygments/issues/2350">#2350</a>)</li>
<li>Improve preprocessor directive handling (<a href="https://redirect.github.com/pygments/pygments/issues/2357">#2357</a>)</li>
</ul>
</li>
<li>
<p>LilyPond: minor update of builtins</p>
</li>
<li>
<p>PHP: support attributes (<a href="https://redirect.github.com/pygments/pygments/issues/2055">#2055</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2347">#2347</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2360">#2360</a>), fix anonymous classes without
parameters (<a href="https://redirect.github.com/pygments/pygments/issues/2359">#2359</a>), improve lexing of variable variable syntax (<a href="https://redirect.github.com/pygments/pygments/issues/2358">#2358</a>)</p>
</li>
<li>
<p>Python:</p>
<ul>
<li>Add missing builtins (<a href="https://redirect.github.com/pygments/pygments/issues/2334">#2334</a>)</li>
<li>Fix inconsistent lexing of <code>None</code> (<a href="https://redirect.github.com/pygments/pygments/issues/2406">#2406</a>)</li>
</ul>
</li>
<li>
<p>Rebol/Red: Don't require script headers (<a href="https://redirect.github.com/pygments/pygments/issues/2348">#2348</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2349">#2349</a>)</p>
</li>
<li>
<p>Spice: Update keywords (<a href="https://redirect.github.com/pygments/pygments/issues/2336">#2336</a>)</p>
</li>
<li>
<p>SQL+Jinja (<code>analyse_text</code> method): Fix catastrophic backtracking (<a href="https://redirect.github.com/pygments/pygments/issues/2355">#2355</a>)</p>
</li>
<li>
<p>Terraform: Add <code>hcl</code> alias (<a href="https://redirect.github.com/pygments/pygments/issues/2375">#2375</a>)</p>
</li>
</ul>
</li>
<li>
<p>Declare support for Python 3.11 and drop support for Python 3.6 (<a href="https://redirect.github.com/pygments/pygments/issues/2324">#2324</a>).</p>
</li>
<li>
<p>Update <code>native</code> style to improve contrast (<a href="https://redirect.github.com/pygments/pygments/issues/2325">#2325</a>).</p>
</li>
<li>
<p>Update `github-dark`` style to match latest Primer style (<a href="https://redirect.github.com/pygments/pygments/issues/2401">#2401</a>)</p>
</li>
<li>
<p>Revert a change that made guessing lexers based on file names slower
on Python 3.10 and older (<a href="https://redirect.github.com/pygments/pygments/issues/2328">#2328</a>).</p>
</li>
<li>
<p>Fix some places where a locale-dependent encoding could unintentionally
be used instead of UTF-8 (<a href="https://redirect.github.com/pygments/pygments/issues/2326">#2326</a>).</p>
</li>
<li>
<p>Fix Python traceback handling (<a href="https://redirect.github.com/pygments/pygments/issues/2226">#2226</a>, <a href="https://redirect.github.com/pygments/pygments/issues/2329">#2329</a>).</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="6c187ad832"><code>6c187ad</code></a> Prepare 2.15 release.</li>
<li><a href="00b9cb022c"><code>00b9cb0</code></a> Prepare for release.</li>
<li><a href="a0824a45f0"><code>a0824a4</code></a> Update CHANGES</li>
<li><a href="26f9f6c852"><code>26f9f6c</code></a> Merge pull request <a href="https://redirect.github.com/pygments/pygments/issues/2406">#2406</a> from rdbende/fix-fromimport-none</li>
<li><a href="62b1bbbe6e"><code>62b1bbb</code></a> Change token of None after from keyword</li>
<li><a href="acee60e4e8"><code>acee60e</code></a> Update CHANGES</li>
<li><a href="eaca690911"><code>eaca690</code></a> Add lexer for MediaWiki Wikitext (<a href="https://redirect.github.com/pygments/pygments/issues/2373">#2373</a>)</li>
<li><a href="0e9c87bcf0"><code>0e9c87b</code></a> Update CHANGES</li>
<li><a href="ef0abbaece"><code>ef0abba</code></a> Add PostgreSQL Explain lexer (<a href="https://redirect.github.com/pygments/pygments/issues/2398">#2398</a>)</li>
<li><a href="3c6e2af8fb"><code>3c6e2af</code></a> Update CHANGES</li>
<li>Additional commits viewable in <a href="https://github.com/pygments/pygments/compare/2.12.0...2.15.0">compare view</a></li>
</ul>
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105654
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-07-21 20:52:05 +00:00
999ca07ef8 Improve fake mode support by adding fake_context to ExportOutput (#105247)
Prior to this PR, if the user called `fake_model.load_state_dict()` from within `enable_fake_mode`, the initial model state dict (including non-persistent buffers) would not be reused by `ExportOutput.save` during ONNX proto creation.

That is not necessarily a bug, because `ExportOutput.save` has a `model_state_dict` argument through which users can specify any state they want. However, it can be a hassle: if the user doesn't provide the full state, including non-persistent buffers, the resulting ONNX graph requires the missing buffers to be specified as inputs during execution.

With this PR, `enable_fake_mode` is improved to capture the initial model state, including any non-persistent buffers. This reference (not the actual data) is persisted within `ExportOutput` and used by `save` to load the additional `state_dict` captured by `enable_fake_mode`. The result is an ONNX graph with all model state, without the user having to specify the non-persistent buffers.

This helps address https://github.com/pytorch/pytorch/issues/105233 for models that call `fake_model.load_state_dict` under the hood, as buffers not returned by `model.state_dict()` may be captured.
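A hedged usage sketch of the behavior described above; the tiny model, file names, and argument shapes are made up, and the API calls follow the example in #105477 rather than a verified signature:

```python
import torch

def create_model():  # stand-in for the user's real model
    return torch.nn.Linear(16, 4)

with torch.onnx.enable_fake_mode() as fake_context:
    fake_model = create_model()
    # In a real workflow a checkpoint would be loaded here; fake mode now
    # captures the initial state (including non-persistent buffers).
    fake_args = (torch.randn(1, 16),)

export_options = torch.onnx.ExportOptions(fake_context=fake_context)
export_output = torch.onnx.dynamo_export(fake_model, *fake_args, export_options=export_options)

# With this change, save() can reuse the captured state, so passing
# model_state_dict is no longer required for non-persistent buffers.
export_output.save("model.onnx")
```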

ps: https://github.com/pytorch/pytorch/issues/105464 tracks pending tasks/limitations from this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105247
Approved by: https://github.com/BowenBao
2023-07-21 20:36:45 +00:00
803d42e457 add lerp cpu support for half (#105607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105607
Approved by: https://github.com/albanD
2023-07-21 20:29:05 +00:00
d5d6eb2d46 [ONNX] Refactor AvgPool to support dynamic shapes (#105683)
In #87892, to pick up the corner cases found in #71549, the PR fell back the AvgPool implementation to the opset 9 approach. However, that introduced a regression on the dynamic shape cases found in #101397. This PR refactors the AvgPool op with the same implementation we have in onnxscript: https://github.com/microsoft/onnxscript/pull/754.

However, the corner case with `count_include_pad` remains unsolved in onnxruntime: https://github.com/microsoft/onnxruntime/issues/16203. The calculation of the last value of each dimension differs between ORT and PyTorch. The fix can be verified in https://github.com/microsoft/onnxruntime/pull/16752, and it supports AvgPool since opset 19.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105683
Approved by: https://github.com/thiagocrepaldi
2023-07-21 20:22:08 +00:00
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`
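A representative (hypothetical) example of the kind of conversion flynt performs:

```python
name, count = "inductor", 3
old = "compiled {} kernels for {}".format(count, name)  # before
new = f"compiled {count} kernels for {name}"            # after
assert old == new
```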

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
4c73016ff2 [Dynamo] Enable torch._dynamo.config.suppress_errors by default (#105307)
Summary:
We are working toward full model compilation, where, when a compilation error happens, we just fall back to eager mode rather than erroring out.
But at the same time, we should fix these issues if they are bugs. We will:
* 1/ log warnings in OSS;
* 2/ log warnings and write them into Scuba in fbcode;

to prevent us from ignoring these issues.
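For reference, the flag being defaulted on, and how to restore the old fail-fast behavior (a sketch; the config name matches the error message quoted elsewhere in this log):

```python
import torch._dynamo

# New default per this PR: compilation errors log a warning and fall back to eager.
torch._dynamo.config.suppress_errors = True

# Opt back into hard errors while debugging a compilation failure.
torch._dynamo.config.suppress_errors = False
```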

Test Plan: Manual test

Differential Revision: D47506314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105307
Approved by: https://github.com/jansel
2023-07-21 19:17:46 +00:00
de8bd108b4 [BE] Enable ruff's UP rules in pyproject.toml (#105437)
Signed-off-by: Justin Chu <justinchu@microsoft.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105437
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/Skylion007
2023-07-21 19:14:52 +00:00
6b2d48e78c [8/n][FSDP] make use_dtensor=True work with offload_to_cpu=True for optim.load_state_dict() (#105690)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105690
Approved by: https://github.com/fegin
2023-07-21 18:55:01 +00:00
72b223cd1b [Inductor] Optimize read write merging in FusedSchedulerNode ctor (#105693)
Reduced optimizer compilation time by half; I expect it to improve compilation time in general as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105693
Approved by: https://github.com/jansel
2023-07-21 17:26:44 +00:00
842616bcba Allow (temporarily?) non-fake input during ONNX export with fake mode (#105246)
Although the input and model are expected to be fake during ONNX export with fake mode enabled, apparently some models can create new parameters during tracing. That makes internal checks on the dynamo side fail when we don't set `allow_non_fake_input=True` for `torch._dynamo.export`.

https://github.com/pytorch/pytorch/issues/105077 tracks this issue, and once a proper fix lands, we will set `allow_non_fake_input=False` again.

In addition, a possible bug was found in torch.nn.Module.state_dict() in which some registered buffers are not listed.

This is being tracked by https://github.com/pytorch/pytorch/issues/105233, but in the meantime we merge the `state_dict()` and `named_buffers()` results to create a full `state_dict` for the model, as sketched below.
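A minimal sketch of the workaround described above (an illustrative helper, not the actual export code):

```python
import torch

def full_state_dict(model: torch.nn.Module) -> dict:
    state = dict(model.state_dict())
    # Add buffers that state_dict() may have missed (e.g. non-persistent ones)
    # without overwriting entries that are already present.
    for name, buf in model.named_buffers():
        state.setdefault(name, buf)
    return state
```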

Two more complex/larger tests are added to the ONNX export which are the same for the experimental symbolic tracing: tiny gpt2 and toy mlp (https://github.com/pytorch/pytorch/blob/main/test/onnx/test_fx_to_onnx_with_onnxruntime.py#L766-L825)

ps: https://github.com/pytorch/pytorch/issues/105464 tracks pending tasks/limitations from this PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105246
Approved by: https://github.com/BowenBao
2023-07-21 16:05:09 +00:00
04da0c76a0 Improve basicAutogradNotImplementedFallback + new tests (#105591)
This PR:
- removes some reference count bumps (to potentially improve overhead)
- adds some tests for undefined gradients
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105591
Approved by: https://github.com/soulitzer
2023-07-21 14:37:21 +00:00
ed6de45563 Fix Tensor::register_hook behavior on undefined tensors (#105587)
When the hook registered by Tensor::register_hook (in C++) gets passed
an undefined tensor, it raises an internal assert in debug mode.
The cause is that we attempt to construct an OptionalTensorRef
(4448c78a5d/aten/src/ATen/core/Tensor.h (L68))
which asserts that the passed-in TensorBase is defined.

The fix is that we create a new TensorRef class to convert the
TensorBase into a Tensor without bumping the refcount (which is what
OptionalTensorRef does). We cannot reuse OptionalTensorRef because
OptionalTensorRef represents `optional<Tensor>` that cannot hold an
Undefined Tensor.

For some more historical context, it looks like this behavior was introduced
in #63612

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105587
Approved by: https://github.com/soulitzer
2023-07-21 14:37:21 +00:00
eqy
29f856e3e0 Kill process in wait_for_process if SIGINT fails to terminate it (#105625)
#98035 adds some additional logic to `wait_for_process` that includes catching a timeout exception and sending `SIGINT` to the process before waiting on it again with a timeout. However, if the additional wait times out again, the wait call in the `finally` block (which does not have a timeout) has the potential to hang indefinitely.

This PR kills the process if a second timeout exception occurs after the `SIGINT` signal is sent.
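A hedged sketch of the control flow described above; the real helper in the test harness has more bookkeeping, and the timeout value is illustrative:

```python
import signal
import subprocess

def wait_for_process(proc: subprocess.Popen, timeout: float = 60.0) -> int:
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGINT)  # give the test a chance to exit cleanly
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # new behavior: avoid hanging indefinitely
            return proc.wait()
```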

CC @clee2000 @ptrblck @xwang233 @kwen2501

Also hoping that this has the potential to reduce turnaround time for distributed timeouts like those seen in https://hud.pytorch.org/pr/pytorch/pytorch/105274#15148799113
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105625
Approved by: https://github.com/ezyang
2023-07-21 10:11:58 +00:00
ec26947c58 [Inductor] Replace functools.reduce union calls with set unions (#105720)
This improves compilation speed another 30%
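Illustration of the rewrite with hypothetical data:

```python
import functools
import operator

sets = [{1, 2}, {2, 3}, {4}]
merged_reduce = functools.reduce(operator.or_, sets, set())  # builds an intermediate set per step
merged_union = set().union(*sets)                            # single call, no intermediates
assert merged_reduce == merged_union == {1, 2, 3, 4}
```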

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105720
Approved by: https://github.com/lezcano
2023-07-21 09:49:56 +00:00
79c5e33349 [BE] Enable ruff's UP rules and autoformat nn/ mps/ and torch/ (#105436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105436
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-21 07:38:46 +00:00
322dff475c Skip test_cudnn_rnn when cudnn not available (#105701)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105701
Approved by: https://github.com/mlazos
2023-07-21 06:03:50 +00:00
429d45f91a Simplify handle indexing (#105006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105006
Approved by: https://github.com/awgu
2023-07-21 05:53:23 +00:00
b0a04331b4 [dynamo] Fix import if numpy is not installed (#105711)
This [line](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/allowed_functions.py#L18) results in an import issue if numpy is not installed.
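The usual guard pattern for an optional dependency (a sketch; the actual fix in allowed_functions.py may differ):

```python
try:
    import numpy as np
except ModuleNotFoundError:
    np = None  # downstream code must check for None before touching numpy
```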

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105711
Approved by: https://github.com/yanboliang, https://github.com/ezyang
2023-07-21 05:52:32 +00:00
e40f8acef2 [inductor][fx passes] batch layernom (#105492)
Summary: Batch layernorm. Fuse independent horizontal layernorms with the same size into one.

Test Plan:
# unit test
```
buck test mode/dev-nosan //caffe2/test/inductor:group_batch_fusion
File changed: fbcode//caffe2/test/inductor/test_group_batch_fusion.py
Buck UI: https://www.internalfb.com/buck2/68eb51e1-bdbc-4847-aabf-e50737d8485b
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549764442206
Network: Up: 0 B  Down: 0 B
Jobs completed: 10. Time elapsed: 1:07.2s.
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D47447542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105492
Approved by: https://github.com/jansel, https://github.com/xuzhao9
2023-07-21 05:03:04 +00:00
a01a732954 Rename some sizevars methods for clarity (#105585)
The guard functions require you to ALREADY KNOW that a particular
condition holds.  If you don't know (you want to guard on an expression
being a particular value, and then get access to that value), use
the evaluate functions.

I renamed the functions that don't abide by this:

```
guard_min -> evaluate_min
guard_max (deleted, no uses)
guard_static_shape -> evaluate_static_shape
guard_static_shapes -> evaluate_static_shapes
```

Some added comments.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105585
Approved by: https://github.com/voznesenskym
2023-07-21 04:46:23 +00:00
cce2b7e3c9 [dynamo][numpy] Add support for builtin len() on numpy ndarray (#105691)
Issue #105054
```
def fn(x):
  v = x.sum() / len(x)
  return v
```

This creates a graph break because we don't know how to handle the __len__ method.

The solution is to just delegate it back to `TensorVariable`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105691
Approved by: https://github.com/ezyang
2023-07-21 03:50:40 +00:00
fed8d3608d Update core aten decomp table (#105673)
Updated the decomposition table based on the existing [Core ATen IR](https://pytorch.org/docs/stable/ir.html) list, and moved the rest of the decompositions to inductor's decomposition table.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105673
Approved by: https://github.com/SherlockNoMad
2023-07-21 02:45:37 +00:00
c759a57003 Skip deterministic mode for SAM (#105615)
SAM uses cumsum, which doesn't have a deterministic implementation, so this is the only way I can work around https://github.com/pytorch/pytorch/issues/89492.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105615
Approved by: https://github.com/eellison, https://github.com/cpuhrsch
2023-07-21 01:52:08 +00:00
117325862c Revert "Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)"
This reverts commit e985719e98ba02f61438d6a27e29caeaeedb9e6c.

Reverted https://github.com/pytorch/pytorch/pull/105142 on behalf of https://github.com/huydhn due to Sorry for reverting this but it is failing python doc build job in trunk e985719e98 ([comment](https://github.com/pytorch/pytorch/pull/105142#issuecomment-1644874540))
2023-07-21 01:47:49 +00:00
6ed96b9ed8 inductor: fix bug in nn.Linear when in_feature size is zero (#105449)
Fix #104937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105449
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-21 01:10:01 +00:00
cb9abf725c Update torch.compile docstring (#105652)
Update the description of 'mode' parameter for torch.compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105652
Approved by: https://github.com/ezyang
2023-07-21 01:02:31 +00:00
143c83d637 [quant][pt2e][be] Remove unneeded code (#105676)
Summary:
att

Test Plan:
CIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105676
Approved by: https://github.com/andrewor14
2023-07-21 00:51:22 +00:00
a8f568e99b Make recompiles log print stack traces (#105663)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105663
Approved by: https://github.com/voznesenskym
2023-07-21 00:31:22 +00:00
e985719e98 Add torch.utils to the docs page, remove dead code and fix docstrings (#105142)
As per title.
Note that the c++ side code for the minidumps part was removed. So trying to call any of these 3 functions today results in an error saying that `torch._C` doesn't have these attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105142
Approved by: https://github.com/janeyx99
2023-07-21 00:14:59 +00:00
1e87778552 [inductor] refactor wrapper benchmark code out of utils.py (#105584)
Refactor wrapper benchmark out of utils.py since
1. utils.py gets too large
2. I plan to add more code to wrapper benchmark for multi-kernel.

This is split out from https://github.com/pytorch/pytorch/pull/103469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105584
Approved by: https://github.com/jansel
2023-07-21 00:01:35 +00:00
07ea344dcf Fix docs not showing error, remove circleci docs scripts (#105678)
Docs builds were not exiting with failure (for example https://github.com/pytorch/pytorch/actions/runs/5604612586/job/15184094038#step:9:1131) because the if statement returned 0 even when we wanted to exit with an error.

Also get rid of the circleci scripts since they aren't used anywhere.

Example error:
```
copying static files... done
copying extra files... done
dumping search index in English (code: en)... done
dumping object inventory... done
build finished with problems, 1 warning.
make: *** [Makefile:49: html] Error 1
+ code=2
+ '[' 2 -ne 0 ']'
+ set +x
=========================
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/nn/parallel/comm.py:docstring of torch.nn.parallel.comm.scatter:1: WARNING: more than one target found for cross-reference 'Stream': torch.cuda.Stream, torch.cuda.streams.Stream, torch.cpu.Stream
=========================
Docs build failed. If the failure is not clear, scan back in the log
for any WARNINGS or for the line build finished with problems
(tried to echo the WARNINGS above the ==== line)
=========================
+ return 2
+ exit 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105678
Approved by: https://github.com/seemethere
2023-07-20 23:11:31 +00:00
e47fad68a0 [caffe2] Update tracepoint USDT macros (#105232)
Summary:
Fix existing CAFFE static tracepoint macros and make them match the latest FOLLY version.

Per anakryiko, the current `CAFFE_SDT` definition is broken. Quote:
```
"Arguments: -5@-16(%rbp) -4@$100

Arguments: -8@-16(%rbp) -4@$100

#define FOLLY_SDT_IS_ARRAY_POINTER(x)  ((__builtin_classify_type(x) == 14) ||  \
                                        (__builtin_classify_type(x) == 5))

vs

#define CAFFE_SDT_ISARRAY(x)  (__builtin_classify_type(x) == 14)

https://github.com/atgreen/gcc/blob/master/gcc/typeclass.h

that 5 is "pointer_type_class"
so you were right, it's just fixed up version of header
I think it should be 8, not 5
5 is the size of literal, but you don't pass string literal as an argument, you pass its address, so actual argument is a pointer, and so 8 byte long

you can try just fixing up CAFFE_SDT macro
```
 {F1048035373}

Test Plan:

Tested the following macros on test scripts with libbpf USDTs:
CAFFE_SDT
CAFFE_DISABLE_SDT
CAFFE_SDT_WITH_SEMAPHORE

Reviewed By: RihamSelim

Differential Revision: D47159249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105232
Approved by: https://github.com/chaekit, https://github.com/malfet
2023-07-20 22:56:11 +00:00
024d26208c Add Freezing Option to Benchmarking (#105616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105616
Approved by: https://github.com/desertfire
2023-07-20 22:50:51 +00:00
8399cf9bfe Rnn base hidden size type check (#105659)
Fixes #105631

Added a type and value check on `hidden_size` to align behaviour between GPU and CPU modes and alert users when the wrong type is supplied.
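A minimal sketch of the kind of validation added; the exact message and its placement in RNNBase may differ:

```python
def _check_hidden_size(hidden_size) -> None:
    if not isinstance(hidden_size, int):
        raise TypeError(f"hidden_size should be of type int, got: {type(hidden_size).__name__}")
    if hidden_size <= 0:
        raise ValueError("hidden_size must be greater than zero")
```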

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105659
Approved by: https://github.com/albanD, https://github.com/mikaylagawarecki
2023-07-20 22:45:43 +00:00
18d8961d91 [Pytorch][Vulkan] aten::pow (#105550)
Summary:
Add support for [aten::pow](https://pytorch.org/docs/stable/generated/torch.pow.html#torch.pow) in [various forms](https://www.internalfb.com/code/fbsource/[c717e1fa980ed47c6580778dcfa49c21d3270a67]/xplat/caffe2/aten/src/ATen/native/native_functions.yaml?lines=9656%2C9670%2C9685%2C9693%2C9699):

|Not in-place| Base | Exp |
|--| -- | -- |
|pow| Tensor | Tensor |
|pow_tensor_scalar| Tensor | Scalar |
|pow_scalar_tensor| Scalar | Tensor |
|In-place| Base | Exp |
|--| -- | -- |
|pow_ | Tensor | Tensor |
|pow_tensor_scalar_| Tensor | Scalar |
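For reference, the variants in the tables above correspond to these standard torch.pow calls (plain PyTorch usage, not Vulkan-specific code):

```python
import torch

base = torch.rand(2, 3)
exp = torch.rand(2, 3)

torch.pow(base, exp)   # pow: Tensor base, Tensor exponent
torch.pow(base, 2.0)   # pow_tensor_scalar: Tensor base, Scalar exponent
torch.pow(2.0, exp)    # pow_scalar_tensor: Scalar base, Tensor exponent
base.pow_(exp)         # pow_: in-place, Tensor exponent
base.pow_(2.0)         # pow_tensor_scalar_: in-place, Scalar exponent
```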

Test Plan:
pow tests
```
[lfq@35771.od /data/sandcastle/boxes/fbsource (97d4bdf9e)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin  -- --gtest_filter="*pow*"
Building: finished in 0.1 sec (100%) 329/329 jobs, 0/329 updated
  Total time: 0.2 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *pow*
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.pow
[       OK ] VulkanAPITest.pow (255 ms)
[ RUN      ] VulkanAPITest.pow_broadcast
[       OK ] VulkanAPITest.pow_broadcast (3 ms)
[ RUN      ] VulkanAPITest.pow_
[       OK ] VulkanAPITest.pow_ (90 ms)
[ RUN      ] VulkanAPITest.pow_broadcast_other_
[       OK ] VulkanAPITest.pow_broadcast_other_ (0 ms)
[ RUN      ] VulkanAPITest.pow_tensor_scalar
[       OK ] VulkanAPITest.pow_tensor_scalar (57 ms)
[ RUN      ] VulkanAPITest.pow_tensor_scalar_
[       OK ] VulkanAPITest.pow_tensor_scalar_ (83 ms)
[ RUN      ] VulkanAPITest.pow_scalar_tensor
[       OK ] VulkanAPITest.pow_scalar_tensor (50 ms)
[----------] 7 tests from VulkanAPITest (542 ms total)

[----------] Global test environment tear-down
[==========] 7 tests from 1 test suite ran. (542 ms total)
[  PASSED  ] 7 tests.
```

All tests
```
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 317 tests from VulkanAPITest (18448 ms total)

[----------] Global test environment tear-down
[==========] 317 tests from 1 test suite ran. (18448 ms total)
[  PASSED  ] 316 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
clang-format on glsl and cpp files

Reviewed By: SS-JIA

Differential Revision: D46704167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105550
Approved by: https://github.com/SS-JIA
2023-07-20 22:19:25 +00:00
795885d947 [docs] Fix docstring. (#105689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105689
Approved by: https://github.com/clee2000
2023-07-20 22:02:43 +00:00
450c22c311 mypy index propagation (#105622)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105622
Approved by: https://github.com/eellison
2023-07-20 21:37:43 +00:00
fe7187b903 mypy _inductor/cuda_properties (#105620)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105620
Approved by: https://github.com/eellison
2023-07-20 21:13:01 +00:00
75a8c8a538 softshrink lowering (#105603)
Fixes https://github.com/pytorch/pytorch/issues/105563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105603
Approved by: https://github.com/Chillee
2023-07-20 20:26:05 +00:00
6560750d08 [Dynamo] Support list indexed by constant tensor (#105509)
Fixes #104092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105509
Approved by: https://github.com/eellison
2023-07-20 20:14:04 +00:00
e6fd8ca3ee Fix test failure in TestCudaMultiGPU.test_cuda_device_memory_allocated (#105501)
The test

f508d3564c/test/test_cuda_multigpu.py (L1282-L1290)

The torch CUDA caching allocator may cache the allocation and cause the "new_alloc" to be the same as the "old_alloc".
```python
     self.assertGreater(memory_allocated(0), current_alloc[0])
```

I suggest that we use `assertGreaterEqual` instead of `assertGreater` in the test.
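The suggested change, as a one-line sketch in the style of the snippet above:

```python
self.assertGreaterEqual(memory_allocated(0), current_alloc[0])
```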

Running only this test individually does not make it fail, but running it together with other tests from the same test module does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105501
Approved by: https://github.com/zou3519
2023-07-20 19:59:10 +00:00
6abb8c382c [export] add kwargs support for export. (#105337)
Solving #105242.

During export, the exported function's signature changes multiple times. Suppose we'd like to export f as shown in the following example:
```python
def f(arg1, arg2, kw1, kw2):
  pass

args = (arg1, arg2)
kwargs =  {"kw2":arg3, "kw1":arg4}

torch.export(f, args, kwargs)
```
The signature changes multiple times during the export process in the following order:
1. **gm_torch_level = dynamo.export(f, *args, \*\*kwargs)**. In this step, we turn all kinds of parameters such as **positional_only**, **var_positional**, **kw_only**, and **var_kwargs** into **positional_or_kw**. It also preserves the positional and keyword argument names in the original function (i.e. f in this example) [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/export.py#L546C13-L546C27). The order of kwargs will be the **key order** of kwargs (after Python 3.6, this is the insertion order of keys) instead of the order in the original function signature, and that order is baked into an _orig_args variable of gm_torch_level's pytree info. So we'll have:
```python
def gm_torch_level(arg1, arg2, kw2, kw1)
```
Such a difference is acceptable, as it's transparent to users of export.

2. **gm_aot_export = aot_export_module(gm_torch_level, pos_or_kw_args)**. In this step, we need to turn kwargs into positional args in the order gm_torch_level expects, which is stored in _orig_args. The returned gm_aot_export has the graph signature of flat_args, in_spec = pytree.tree_flatten(pos_or_kw_args):
``` python
flat_args, _ = pytree.tree_flatten(pos_or_kw_args)
def gm_aot_export(*flat_args)
```

3. **exported_program(*args, \*\*kwargs)**. The exported artifact is exported_program, which is a wrapper over gm_aot_export and has the same calling convention as the original function "f". To do this, we need to 1. specialize the order of kwargs into pos_or_kw_args and 2. flatten the pos_or_kw_args into what gm_aot_export expects. We can combine the two steps into one with:
```python
_, in_spec = pytree.tree_flatten((args, kwargs))

# Then during exported_program.__call__(*args, **kwargs)
flat_args  = fx_pytree.tree_flatten_spec((args, kwargs), in_spec)
```
, where kwargs is treated as a normal pytree whose key order is preserved in in_spec.

Implementation-wise, we treat _orig_args in the dynamo-exported graph module as the single source of truth, and kwargs are ordered following it.

Test plan:
See added tests in test_export.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105337
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2023-07-20 19:53:08 +00:00
9584d614a1 [inductor] add decompositions for aten.angle (#105609)
Fixes #105564.

Added tests.
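A hedged sketch of what the decomposition computes for real-valued inputs, mirroring the "After decomposition" trace below; the registered decomposition may be written differently:

```python
import math
import torch

def angle_real(x: torch.Tensor) -> torch.Tensor:
    # pi for negative inputs, 0 otherwise
    result = torch.where(x < 0, torch.full_like(x, math.pi), torch.zeros_like(x))
    # adding NaN where the input is NaN propagates it into the result
    return result + torch.where(torch.isnan(x), torch.full_like(x, math.nan), torch.zeros_like(x))
```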

CPU benchmarking result:
Before decomposition:
```
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]  ===== Forward graph 0 =====
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.4 from /home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py:477 in wrapped class <lambda>(torch.nn.Module):
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, arg0_1: f32[100000]):
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/yidi/local/t.py:5, code: return torch.angle(x)
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]         angle: f32[100000] = torch.ops.aten.angle.default(arg0_1);  arg0_1 = None
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return (angle,)
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-07-19 14:59:51,277] torch._functorch.aot_autograd.__aot_graphs: [INFO]
eager:
per-call time (us): 1069.2930221557617
compiled:
per-call time (us): 742.4068450927734
```

After decomposition:
```
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]  ===== Forward graph 0 =====
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]  <eval_with_key>.4 from /home/yidi/local/pytorch/torch/fx/experimental/proxy_tensor.py:477 in wrapped class <lambda>(torch.nn.Module):
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]     def forward(self, arg0_1: f32[100000]):
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         # File: /home/yidi/local/t.py:5, code: return torch.angle(x)
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         lt: b8[100000] = torch.ops.aten.lt.Scalar(arg0_1, 0)
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         scalar_tensor: f32[] = torch.ops.aten.scalar_tensor.default(0.0, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'))
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         scalar_tensor_1: f32[] = torch.ops.aten.scalar_tensor.default(3.141592653589793, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'))
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         where: f32[100000] = torch.ops.aten.where.self(lt, scalar_tensor_1, scalar_tensor);  lt = scalar_tensor_1 = scalar_tensor = None
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         isnan: b8[100000] = torch.ops.aten.isnan.default(arg0_1);  arg0_1 = None
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         scalar_tensor_2: f32[] = torch.ops.aten.scalar_tensor.default(0, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'))
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         scalar_tensor_3: f32[] = torch.ops.aten.scalar_tensor.default(nan, dtype = torch.float32, layout = torch.strided, device = device(type='cpu'))
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         where_1: f32[100000] = torch.ops.aten.where.self(isnan, scalar_tensor_3, scalar_tensor_2);  isnan = scalar_tensor_3 = scalar_tensor_2 = None
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         add: f32[100000] = torch.ops.aten.add.Tensor(where, where_1);  where = where_1 = None
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]         return (add,)
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]
[2023-07-19 14:57:53,849] torch._functorch.aot_autograd.__aot_graphs: [INFO]
eager:
per-call time (us): 1228.0082702636719
compiled:
per-call time (us): 83.6038589477539
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105609
Approved by: https://github.com/jansel
2023-07-20 19:12:20 +00:00
9760ea58a3 fix lint (#105675)
Forward fix of the lint issues introduced by https://github.com/pytorch/pytorch/pull/104242
We are forward fixing as this PR contains Meta internal changes that would be tricky to revert smoothly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105675
Approved by: https://github.com/jerryzh168, https://github.com/albanD, https://github.com/atalman
2023-07-20 18:42:25 +00:00
3464cd6e62 Close non existent disable issues (#105096)
example run https://github.com/pytorch/pytorch/actions/runs/5539549596/jobs/10110608650?pr=105096
I spot checked a few to make sure the tests are gone, and most of them are automatic dynamic shapes tests, which got renamed.

I will remove the pull_request trigger and the dry run before merging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105096
Approved by: https://github.com/huydhn
2023-07-20 18:07:37 +00:00
777fc0bb58 [dynamo] fine-grained bytecode-source attribution in python 3.11 (#104676)
Since Python 3.11 bytecode contains end-line and column information, we can attribute each bytecode to its corresponding source code more accurately. For example, we can highlight a single function call in a series of nested function calls, or highlight a function call spanning multiple lines.

Sample:
```python
import torch
import torch._dynamo
from functorch.experimental.control_flow import cond

def h(x):
    return x * 5

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

def f(pred, x):
    x = h(
        h(h(x))
    )
    x = x[1:][:2]
    torch._dynamo.graph_break()
    x = cond(pred, true_fn, false_fn, [x])

opt_f = torch.compile(f, backend="eager")
opt_f(torch.tensor(True), torch.randn(3, 3, 3, 3))
```

Output:
```
$ TORCH_LOGS="trace_call" python playground9.py
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:16
        h(h(x))
          ~^^^
TRACE FX call mul from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
    return x * 5
           ~~^~~
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:16
        h(h(x))
        ~^^^^^^
TRACE FX call mul_1 from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
    return x * 5
           ~~^~~
TRACE inlined call h from f /scratch/williamwen/work/pytorch/playground9.py:15
    x = h(
        ~^
        h(h(x))
        ^^^^^^^
    )
    ^
TRACE FX call mul_2 from h /scratch/williamwen/work/pytorch/playground9.py:6 (inline depth: 1)
    return x * 5
           ~~^~~
TRACE FX call getitem from f /scratch/williamwen/work/pytorch/playground9.py:18
    x = x[1:][:2]
        ~^^^^
TRACE FX call getitem_1 from f /scratch/williamwen/work/pytorch/playground9.py:18
    x = x[1:][:2]
        ~~~~~^^^^
TRACE inlined call true_fn from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
    x = cond(pred, true_fn, false_fn, [x])
        ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TRACE FX call mul from true_fn /scratch/williamwen/work/pytorch/playground9.py:9 (inline depth: 1)
    return x * 2
           ~~^~~
TRACE inlined call false_fn from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
    x = cond(pred, true_fn, false_fn, [x])
        ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TRACE FX call mul from false_fn /scratch/williamwen/work/pytorch/playground9.py:12 (inline depth: 1)
    return x * 3
           ~~^~~
TRACE FX call cond from <resume in f> /scratch/williamwen/work/pytorch/playground9.py:20
    x = cond(pred, true_fn, false_fn, [x])
        ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104676
Approved by: https://github.com/ezyang
2023-07-20 17:18:52 +00:00
b5d3d58497 Fixed cmake mkl lib path in caffe2 public (#105525)
This small change fixes an Intel MKL linking error for the distributed version of libtorch C++ when building with CMake.

Fixes #105215.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105525
Approved by: https://github.com/albanD
2023-07-20 17:15:09 +00:00
25d80c69ce [foreach] super minor BE: remove unnecessary cast (#105601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105601
Approved by: https://github.com/albanD
2023-07-20 17:06:52 +00:00
e855348cdf [foreach][SGD] minimize intermediates=1 to decrease peak memory (#105599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105599
Approved by: https://github.com/albanD
2023-07-20 17:06:52 +00:00
585ce32ca1 Heap buffer overflow in distributed/rpc module (#105537)
Hi! We've been fuzzing the PyTorch project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).
We've found a couple of heap-buffer-overflows in the `distributed/rpc` module.

PyTorch version: 0f1621df1a

OS: Ubuntu 20.04

### How to reproduce

1.  Build docker from this [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch) and run the container.
2.  Then run `message_deserialize-afl++` fuzzing target on provided crash-inputs ([crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip), [crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/12096160/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)):
```
unzip crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip
/message_deserialize-afl++ crash-4f85db9f19fe152c0018f6675c3b4c122227058f
```

### Heap buffer overflow in torch/csrc/jit/serialization/pickle.cpp:144

[crash-056826339f6da8dbb97c944178e94494369a9e22.zip](https://github.com/pytorch/pytorch/files/12096151/crash-056826339f6da8dbb97c944178e94494369a9e22.zip)

```asan
    "==7614==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60b001b58355 at pc 0x0000005d1147 bp 0x7fffffffa610 sp 0x7fffffff9de0",
    "READ of size 256 at 0x60b001b58355 thread T0",
    "    #0 0x5d1146 in __asan_memcpy /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3",
    "    #1 0xd1cd19f in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3::operator()(char*, unsigned long) const /pytorch/torch/csrc/jit/serialization/pickle.cpp:144:9",
    "    #2 0xd1cd19f in unsigned long std::__invoke_impl<unsigned long, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*, unsigned long>(std::__invoke_other, torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&))::$_3&, char*&&, unsigned long&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14",
    "    #3 0xd27aa48 in std::function<unsigned long (char*, unsigned long)>::operator()(char*, unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14",
    "    #4 0xd27a61c in torch::jit::Unpickler::readSlowWithBuffer(char*, unsigned long) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:1047:23",
    "    #5 0xd2698b8 in unsigned char torch::jit::Unpickler::read<unsigned char>() /pytorch/torch/csrc/jit/serialization/unpickler.h:111:7",
    "    #6 0xd268816 in torch::jit::Unpickler::readOpCode() /pytorch/torch/csrc/jit/serialization/unpickler.h:130:38",
    "    #7 0xd268816 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:238:17",
    "    #8 0xd268522 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
    "    #9 0xd1c8502 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
    "    #10 0xd1c8dbd in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
    "    #11 0xe56b16d in torch::distributed::rpc::readWrappedPayload(std::vector<char, std::allocator<char> >&, torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:515:18",
    "    #12 0xe3d8f29 in torch::distributed::autograd::RpcWithProfilingReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/rpc_with_profiling_req.cpp:112:24",
    "    #13 0xe55f692 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:138:14",
    "    #14 0x6120a8 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
    "    #15 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #16 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #17 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #18 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #19 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #20 0x51a60d in _start (/message_deserialize_fuzz+0x51a60d)",
    "",
    "0x60b001b58355 is located 0 bytes to the right of 101-byte region [0x60b001b582f0,0x60b001b58355)",
    "allocated by thread T0 here:",
    "    #0 0x60c7bd in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
    "    #1 0x62c7fd in std::_Vector_base<char, std::allocator<char> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20",
    "    #2 0x62c7fd in void std::vector<char, std::allocator<char> >::_M_range_initialize<unsigned char const*>(unsigned char const*, unsigned char const*, std::forward_iterator_tag) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1582:14",
    "    #3 0x612913 in std::vector<char, std::allocator<char> >::vector<unsigned char const*, void>(unsigned char const*, unsigned char const*, std::allocator<char> const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:657:4",
    "    #4 0x611c4a in LLVMFuzzerTestOneInput /message_deserialize.cc:181:21",
    "    #5 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #6 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #7 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #8 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #9 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "",
    "SUMMARY: AddressSanitizer: heap-buffer-overflow /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:22:3 in __asan_memcpy",
    "Shadow bytes around the buggy address:",
    "  0x0c1680363010: 00 00 00 fa fa fa fa fa fa fa fa fa 00 00 00 00",
    "  0x0c1680363020: 00 00 00 00 00 00 00 00 00 00 fa fa fa fa fa fa",
    "  0x0c1680363030: fa fa 00 00 00 00 00 00 00 00 00 00 00 00 00 fa",
    "  0x0c1680363040: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00",
    "  0x0c1680363050: 00 00 00 00 00 fa fa fa fa fa fa fa fa fa 00 00",
    "=>0x0c1680363060: 00 00 00 00 00 00 00 00 00 00[05]fa fa fa fa fa",
    "  0x0c1680363070: fa fa fa fa 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c1680363080: 05 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c1680363090: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c16803630a0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c16803630b0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "Shadow byte legend (one shadow byte represents 8 application bytes):",
    "  Addressable:           00",
    "  Partially addressable: 01 02 03 04 05 06 07",
    "  Heap left redzone:       fa",
    "  Freed heap region:       fd",
    "  Stack left redzone:      f1",
    "  Stack mid redzone:       f2",
    "  Stack right redzone:     f3",
    "  Stack after return:      f5",
    "  Stack use after scope:   f8",
    "  Global redzone:          f9",
    "  Global init order:       f6",
    "  Poisoned by user:        f7",
    "  Container overflow:      fc",
    "  Array cookie:            ac",
    "  Intra object redzone:    bb",
    "  ASan internal:           fe",
    "  Left alloca redzone:     ca",
    "  Right alloca redzone:    cb",
    "==7614==ABORTING"
```

### Heap-buffer-overflow in aten/src/ATen/core/ivalue.h:432

[crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip](https://github.com/pytorch/pytorch/files/11553011/crash-4f85db9f19fe152c0018f6675c3b4c122227058f.zip)

```asan
    "==60983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6150001e4108 at pc 0x000000601877 bp 0x7fffffff9fd0 sp 0x7fffffff9fc8",
    "READ of size 4 at 0x6150001e4108 thread T0",
    "    #0 0x601876 in c10::IValue::isTensor() const /pytorch/aten/src/ATen/core/ivalue.h:432:27",
    "    #1 0x601876 in c10::IValue::destroy() /pytorch/aten/src/ATen/core/ivalue.h:1148:9",
    "    #2 0x699f72 in c10::IValue::~IValue() /pytorch/aten/src/ATen/core/ivalue.h:236:5",
    "    #3 0x699f72 in void std::_Destroy<c10::IValue>(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:140:19",
    "    #4 0x699f72 in void std::_Destroy_aux<false>::__destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:152:6",
    "    #5 0x699f72 in void std::_Destroy<c10::IValue*>(c10::IValue*, c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_construct.h:184:7",
    "    #6 0x699f72 in void std::_Destroy<c10::IValue*, c10::IValue>(c10::IValue*, c10::IValue*, std::allocator<c10::IValue>&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/alloc_traits.h:738:7",
    "    #7 0x699f72 in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase_at_end(c10::IValue*) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1796:6",
    "    #8 0x699e4a in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_erase(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, __gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:191:4",
    "    #9 0xea5b11e in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:454:14",
    "    #10 0xea57d97 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27",
    "    #11 0xea579f1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3",
    "    #12 0xe9a435e in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20",
    "    #13 0xe9a471c in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10",
    "    #14 0xfcd034b in torch::distributed::autograd::PropagateGradientsReq::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/autograd/rpc_messages/propagate_gradients_req.cpp:54:18",
    "    #15 0xfe720ff in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:132:14",
    "    #16 0x5c5c93 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
    "    #17 0x5c2bfd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7",
    "    #18 0x5c2a08 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c",
    "    #19 0x5c25c8 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10",
    "    #20 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #21 0x50237d in _start (/message_deserialize_afl+0x50237d)",
    "",
    "0x6150001e4108 is located 8 bytes to the right of 512-byte region [0x6150001e3f00,0x6150001e4100)",
    "allocated by thread T0 here:",
    "    #0 0x5bfbfa in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
    "",
    "SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:432:27 in c10::IValue::isTensor() const",
    "Shadow bytes around the buggy address:",
    "  0x0c2a800347d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a800347e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a800347f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a80034800: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "  0x0c2a80034810: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00",
    "=>0x0c2a80034820: fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034830: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034840: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034850: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034860: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c2a80034870: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "Shadow byte legend (one shadow byte represents 8 application bytes):",
    "  Addressable:           00",
    "  Partially addressable: 01 02 03 04 05 06 07",
    "  Heap left redzone:       fa",
    "  Freed heap region:       fd",
    "  Stack left redzone:      f1",
    "  Stack mid redzone:       f2",
    "  Stack right redzone:     f3",
    "  Stack after return:      f5",
    "  Stack use after scope:   f8",
    "  Global redzone:          f9",
    "  Global init order:       f6",
    "  Poisoned by user:        f7",
    "  Container overflow:      fc",
    "  Array cookie:            ac",
    "  Intra object redzone:    bb",
    "  ASan internal:           fe",
    "  Left alloca redzone:     ca",
    "  Right alloca redzone:    cb",
    "==60983==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105537
Approved by: https://github.com/albanD
2023-07-20 16:56:49 +00:00
0af18f2234 Unify TEST_CUDNN definition (#105594)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105594
Approved by: https://github.com/larryliu0820, https://github.com/voznesenskym
2023-07-20 16:10:26 +00:00
b64bd4a5dd Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf

Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged

TODO:
 - Refactor duplicated code
 - Cleanup unbalanced pragma pop in dtype utils
 - Add a native implementation on the CUDA side

Co-authored-by: Nikita Shulga <nshulga@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
2023-07-20 16:09:11 +00:00
803d58a408 Add TensorPipe header files to Python package (#105521)
This change adds the TensorPipe header files to `torch_package_data` if `USE_DISTRIBUTED` is set to `ON` in the CMake cache. The TensorPipe library and CMake config are already available in the Torch wheel, but the headers are not. This resolves an issue where out-of-tree backends could not implement TensorPipe converters, because the `tensorpipe::Message` struct is defined in the TensorPipe headers.

Fixes #105224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105521
Approved by: https://github.com/albanD
2023-07-20 16:06:00 +00:00
154d89b224 Revert "Unify TEST_CUDNN definition (#105594)"
This reverts commit 1ea153a11d3011b90cdac1f9977889988a0c981f.

Reverted https://github.com/pytorch/pytorch/pull/105594 on behalf of https://github.com/PaliC due to breaks periodic test `distributed/_tensor/test_dtensor.py::TestDynamoDTensor::test_dynamo_dtensor` ([comment](https://github.com/pytorch/pytorch/pull/105594#issuecomment-1644166414))
2023-07-20 15:48:25 +00:00
f2b15772ff Revert "Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)"
This reverts commit a9804130e5a9a982d82934fa9702abd08d6903ce.

Reverted https://github.com/pytorch/pytorch/pull/104242 on behalf of https://github.com/PaliC due to breaks lint (run lintrunner and remerge) ([comment](https://github.com/pytorch/pytorch/pull/104242#issuecomment-1644150284))
2023-07-20 15:37:53 +00:00
02cd971e95 [C10D] Improve MTPG autograd test. Fixes #105106 (#105356)
Explicitly asserts that bwd is running from the same thread as fwd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105356
Approved by: https://github.com/rohan-varma, https://github.com/wanchaol, https://github.com/fduwjj
2023-07-20 13:51:21 +00:00
ded9b94207 Improved error messages for deprecated linalg functions. (#105506)
Fixes #105452

New error messages to point out potentially breaking/annoying changes between the old and new interfaces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105506
Approved by: https://github.com/lezcano
2023-07-20 10:48:06 +00:00
ca126880d9 Enable intellisense for _dynamo, _inductor and onnx by importing under type_checking guard (#105361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105361
Approved by: https://github.com/malfet
2023-07-20 10:40:02 +00:00
a9804130e5 Add torch.float8_e5m2 and torch.float8_e4m3 data types (#104242)
Proposal of two float8 variants - e5m2 and e4m3 - based on https://arxiv.org/pdf/2209.05433.pdf

Hide all Float8 operator implementations behind `#if !defined(C10_MOBILE)` guard to keep Android build size almost unchanged

TODO:
 - Refactor duplicated code
 - Cleanup unbalanced pragma pop in dtype utils
 - Add a native implementation on the CUDA side

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104242
Approved by: https://github.com/albanD
2023-07-20 09:45:45 +00:00
1ea153a11d Unify TEST_CUDNN definition (#105594)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105594
Approved by: https://github.com/larryliu0820, https://github.com/voznesenskym
2023-07-20 08:36:58 +00:00
692e0566d6 Rely on is_expr_static_and_true to test gcd (#105578)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105578
Approved by: https://github.com/voznesenskym
2023-07-20 08:29:04 +00:00
71067631c2 [inductor] Fix an AOTInductor missing output issue (#105496)
Summary: When an output buffer is reused instead of directly referring to the passed-in output, we need to explicitly make a copy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105496
Approved by: https://github.com/jansel
2023-07-20 08:27:31 +00:00
0e4c12157c inductor: add support for 0 repeats (#105446)
Fix #104948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105446
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-20 08:04:47 +00:00
0578732bc3 [inductor] fix duplicate arg handling in triton templates (#105315)
Fixes #105212

De-duplicate kernel args in codegen and autotuning of `torch.mm` and `torch.bmm`.

refer to https://github.com/pytorch/pytorch/issues/105212#issuecomment-1637168866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105315
Approved by: https://github.com/jansel
2023-07-20 07:46:46 +00:00
a5317ae857 Remove unnecessary left == right test. (#105576)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105576
Approved by: https://github.com/voznesenskym
2023-07-20 07:33:08 +00:00
980589b04d [ONNX] Suppress ORT warnings in unit tests (#105624)
As title, these warnings are too noisy and made CI test logs hard to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105624
Approved by: https://github.com/justinchuby
2023-07-20 07:21:21 +00:00
8daed86e4e [Inductor] aten.dist decomposition (#105586)
Fixes #105557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105586
Approved by: https://github.com/desertfire, https://github.com/Chillee
2023-07-20 06:42:44 +00:00
dfc9874740 Revert "inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440)"
This reverts commit 18bcf62bbcf7ffd47e3bcf2596f72aa07a07d65f.

Reverted https://github.com/pytorch/pytorch/pull/105440 on behalf of https://github.com/XiaobingSuper due to introduce core dumped when init bfloat16 zero tensor ([comment](https://github.com/pytorch/pytorch/pull/105440#issuecomment-1643079005))
2023-07-20 03:56:44 +00:00
dff4e034b8 [quant][pt2e][be] Rename qnnpack quantizer to xnnpack quantizer (#105551)
Summary: att

Test Plan: sandcastle CI and OSS CI

Reviewed By: andrewor14

Differential Revision: D47422894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105551
Approved by: https://github.com/andrewor14
2023-07-20 03:52:40 +00:00
c6653b65d8 Back out "Make adding buffers more like adding parameters (#104069)" (#105581)
Summary:
D47537831 is breaking pyper tests: https://fb.workplace.com/groups/802176577445480/posts/1018902842439518/

with `TypeError: register_buffer() takes 3 positional arguments but 4 were given`

Original commit changeset: d4b4069fbd38

Original Phabricator Diff: D47537831

Test Plan:
```
buck2 run //caffe2/torch/fb/training_toolkit/integration_tests/training_lifecycle/cogwheel_tests/pyper_release_v2:cogwheel_smallworld_inline_cvr_infer_pyper_pyper__canary_offline_training-launcher -- --run-harness-in-tupperware --build-fbpkg ads_dper3 --build-fbpkg training_platform
```

Reviewed By: atalman

Differential Revision: D47600140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105581
Approved by: https://github.com/mikaylagawarecki
2023-07-20 03:39:53 +00:00
2e81cdc1dd Remove dead sizevars.__getitem__ (#105579)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105579
Approved by: https://github.com/albanD
2023-07-20 03:06:01 +00:00
43540a1cab Tighten size_hint invariant (#105580)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105580
Approved by: https://github.com/albanD
2023-07-20 03:04:21 +00:00
690ea933ca Enable more e2e foreach optimizer compilation tests (#105438)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105438
Approved by: https://github.com/jansel
2023-07-20 02:41:19 +00:00
0cd51b3df0 Reland: Value range refinement using multi-variate expressions (#105491)
Trying to re-land: #97964.

Test strategy:

```
buck2 test '@fbcode//mode/dev-nosan' fbcode//pye/model_inventory/inside_out_tracking_model:inside_out_tracking_model_test -- --exact 'pye/model_inventory/inside_out_tracking_model:inside_out_tracking_model_test - test_executorch_e2e_output_consistency_aten (pye.model_inventory.inside_out_tracking_model.InsideOutTrackingModelTest.InsideOutTrackingModelTest)'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105491
Approved by: https://github.com/ezyang
2023-07-20 02:38:39 +00:00
3dacc8e847 [PyTorch] [Memory profiler] Early return if qualified name is invalid (#105495)
Summary: Return early if we can easily determine that the operator qualified name is invalid before attempting to retrieve the schema. In particular, "::" should always be present. A quick estimate shows that this is >50x faster (100 us -> 2 us).
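A minimal, self-contained sketch of the early-return idea; the function names here are hypothetical stand-ins, not the actual profiler internals:

```python
def _lookup_schema(qualified_name: str) -> str:
    # Stand-in for the real, comparatively expensive schema retrieval.
    return f"schema({qualified_name})"

def maybe_get_schema(qualified_name: str):
    # A valid operator qualified name always contains "::" (e.g. "aten::add"),
    # so anything without it can be rejected before the expensive lookup.
    if "::" not in qualified_name:
        return None
    return _lookup_schema(qualified_name)

assert maybe_get_schema("not_an_operator") is None
assert maybe_get_schema("aten::add") is not None
```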

Test Plan: CI

Differential Revision: D47562587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105495
Approved by: https://github.com/aaronenyeshi
2023-07-20 00:58:32 +00:00
86076abeff Update slow CI jobs to rocm5.6 (#105516)
Follow-up to https://github.com/pytorch/pytorch/pull/103092, which missed updating the slow CI jobs to ROCm5.6, as they were recently moved to slow.yml by def50d2534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105516
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-07-20 00:56:46 +00:00
93e6fc54fa [PyTorch] Remove device transfers from JNI (#105583)
Summary:
If a model was exported for the Vulkan backend without (automatic or manual) device transfers, then the export is incorrect, and the JNI need not correct for that.
(If this assumption is incorrect, please give feedback.)

Undo the changes from
- D23763771: automatic device transfers in JNI
- D39519168: `"requires_backend_transfers"` logic in JNI

Test Plan: Verify CUNET+ hybrid model from D47488843 works.

Reviewed By: SS-JIA

Differential Revision: D47527244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105583
Approved by: https://github.com/SS-JIA
2023-07-20 00:26:21 +00:00
0b524343be Reenable UFMT on pyi (#105577)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105577
Approved by: https://github.com/albanD
2023-07-20 00:11:45 +00:00
5abc5ab55d [inductor] Disable cudagraphs if index_put_ fallback is encountered (#105439)
**TL;DR**: if lowerings.py encounters aten.index_put, it will set V.graph.cudagraphs_okay = False, which will disable cudagraphs. index_put needs to be disabled because it crashes cuda graphs.

index_put_ fallbacks fail with cuda graphs when `accumulate=True` - likely for the same reason that it fails with deterministic_algorithms_enabled:
fcb7d4b358/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L730)

A first attempt was just to expand the scenarios where `index_put_` is one of the disallowed kernels in utils.py: 2fa7d11b64/torch/_inductor/utils.py (L436-L438)

However this disables cuda graphs in too many scenarios, because index_put doesn't cause issues if it gets fused, it only causes issues if the aten kernel gets called. So in the updated version of this PR, we check for fallbacks in lowerings.py and disable cudagraphs only if a fallback is encountered there.

Example of failure outside of PT2:

```python
import torch

def fn(x, y, z):
    x = torch.zeros_like(x)
    return x.index_put_([y], z, True)
    # return x + 1

x = torch.zeros((512, 512), dtype=torch.bool, device='cuda')
y = torch.arange(512, dtype=torch.int64, device='cuda')
z = torch.ones((512, 512), dtype=torch.bool, device='cuda')

s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for i in range(3):
        fn(x, y, z)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    fn(x, y, z)
```

fails with
```
Traceback (most recent call last):
  File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
    fn(x, y, z)
  File "/data/users/dberard/scripts/graphed_index_put.py", line 8, in fn
    return x.index_put_([y], z, True)
RuntimeError: CUDA error: operation not permitted when stream is capturing
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/users/dberard/scripts/graphed_index_put.py", line 24, in <module>
    fn(x, y, z)
  File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 173, in __exit__
    self.cuda_graph.capture_end()
  File "/data/users/dberard/pytorch/torch/cuda/graphs.py", line 79, in capture_end
    super().capture_end()
RuntimeError: CUDA error: operation failed due to a previous error during capture
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Differential Revision: [D47538548](https://our.internmc.facebook.com/intern/diff/D47538548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105439
Approved by: https://github.com/eellison
2023-07-19 23:38:29 +00:00
bc6bca9d42 [XNNPACK][QS8] torch.slice (#105252)
Differential Revision: [D47487423](https://our.internmc.facebook.com/intern/diff/D47487423/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105252
Approved by: https://github.com/digantdesai
2023-07-19 23:36:02 +00:00
fa6be2fa6f [Quant][PT2E] Remove x86 inductor pt2e backend config (#105039)
**Summary**
For the quantization PT2E path, we recommend using `X86InductorQuantizer` instead of the `x86_inductor_pt2e_backend_config` backend config. Remove `x86_inductor_pt2e_backend_config` and the relevant tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105039
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-07-19 23:18:29 +00:00
af9a4e08fa [dynamo][rewrite_asserts] Insert assertion msg in bytecode only when needed (#105549)
Fixes https://github.com/pytorch/pytorch/issues/105513

The main issue is that we could call `self.LOAD_CONST` and change the Dynamo stack, and then decide later that we can't rewrite the assert. This PR ensures that we change the Dynamo stack only when we decide to rewrite asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105549
Approved by: https://github.com/tugsbayasgalan
2023-07-19 23:14:01 +00:00
6c432381f5 [Quant][Inductor] Use truncate instead of default rounding round when convert float to uint8 (#105109)
**Summary**
When converting a float tensor to uint8 with `tensor.to(dtype=torch.uint8)`, PyTorch simply truncates the decimal part. Previously, `convert_float_to_uint8` used `_mm512_cvtps_epi32`, which applies the default rounding mode (round to nearest) and therefore doesn't match the eager-mode behavior. Change `_mm512_cvtps_epi32` to `_mm512_cvttps_epi32` so the conversion truncates when converting a float tensor to uint8.
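As a minimal eager-mode illustration of the truncation behavior that the vectorized path is being aligned with:

```python
import torch

x = torch.tensor([0.4, 0.6, 1.5, 2.9])
# Eager .to(torch.uint8) truncates toward zero rather than rounding to nearest,
# so 0.6 -> 0, 1.5 -> 1, 2.9 -> 2.
print(x.to(torch.uint8))  # tensor([0, 0, 1, 2], dtype=torch.uint8)
```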

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_to_uint8_rounding_method
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105109
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/jerryzh168
2023-07-19 23:07:16 +00:00
a832967627 Migrate tuple(handle) -> handle (#104488)
We strengthen the invariant that one FSDP-managed module has one flat parameter, and remove unused code that would have supported a 1:many module-to-flat-parameter mapping.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104488
Approved by: https://github.com/awgu
2023-07-19 22:33:35 +00:00
c54f630201 [7/n][FSDP] make use_dtensor=True work with offload_to_cpu=True for load_state_dict (#105378)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105378
Approved by: https://github.com/fegin
2023-07-19 21:36:37 +00:00
5ce5372d70 Create tensor from Numpy in current device. (#105546)
Fix: #105046

This PR changes how tensors are created from NumPy arrays when tracing with
dynamo. Instead of using `from_numpy`, we use `as_tensor`; the latter takes the
current device into consideration.
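A rough eager-mode illustration of the difference between the two APIs (the actual change lives inside dynamo's tracing, so this is only an analogy):

```python
import numpy as np
import torch

arr = np.ones(3, dtype=np.float32)

# from_numpy always yields a CPU tensor sharing memory with the array.
cpu_tensor = torch.from_numpy(arr)

# as_tensor accepts a device, so the result can land where tracing expects it.
if torch.cuda.is_available():
    cuda_tensor = torch.as_tensor(arr, device="cuda")
```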

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105546
Approved by: https://github.com/lezcano
2023-07-19 21:31:52 +00:00
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
7b56238551 fix typo (#105507)
Differential Revision: D47568928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105507
Approved by: https://github.com/awgu, https://github.com/fduwjj
2023-07-19 20:34:43 +00:00
801fb93b0c Update pybind11 submodule to 2.11.0 (#105245)
Update pybind11 submodule to 2.11.0 with better python 3.12 support, bugfixes, a few new features, and more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105245
Approved by: https://github.com/albanD
2023-07-19 19:56:16 +00:00
70b5264ec5 [EZ][BE] Fix the massively annoying strict-weak-ordering issue. (#105189)
Summary:

Running any EgoOCR workflow in non-opt modes was breaking with https://fburl.com/strict-weak-ordering

Painstakingly found out that the stable_sort comparator in the generate_proposals caffe2 op was the issue, due to numerical imprecision. This was causing the Word Detector model to barf with the error. Adding explicit handling for the [irreflexivity property](https://www.boost.org/sgi/stl/StrictWeakOrdering.html) fixes this annoying strict-weak-ordering issue that has bugged me and several others (https://fb.workplace.com/groups/1405155842844877/permalink/7079705785389826/) for a while.
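A small illustration of the irreflexivity requirement, sketched in Python with `cmp_to_key`; the actual fix lives in the C++ comparator of the caffe2 op:

```python
from functools import cmp_to_key

def by_score_desc(a: float, b: float) -> int:
    # Strict weak ordering requires irreflexivity: compare(x, x) must never
    # report "less than". Treating equal scores as equivalent guarantees that.
    if a == b:
        return 0
    return -1 if a > b else 1

print(sorted([0.7, 0.3, 0.7, 0.9], key=cmp_to_key(by_score_desc)))
# [0.9, 0.7, 0.7, 0.3]
```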

We can finally run all OCR workflows in non-opt mode! :)

Test Plan:
Debugged this with `fdb --disable-auto-breakpoints --secondary-debugger=lldb buck2 run mode/dev-sand ai_demos/server_model_zoo/models/ego_ocr_e2e_prod:ego_ocr_e2e_prod_binary`

and running `breakpoint set -E c++` in the lldb terminal.

Differential Revision: D47446816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105189
Approved by: https://github.com/malfet, https://github.com/atalman
2023-07-19 19:37:50 +00:00
4448c78a5d [ONNX] Add missing spaces between sentences in warning text (#105527)
Hello!

I ran into this warning locally and noticed that it was formatted incorrectly. Even a link was wrong because of it: https://github.com/pytorch/pytorch/issues.Error

This should resolve it.

- Tom Aarsen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105527
Approved by: https://github.com/justinchuby
2023-07-19 17:57:27 +00:00
218b5477ea switching NNC as default for TorchScript support (#105185)
Disable nvfuser by default in TorchScript
Add deprecation warning for nvfuser usage via TorchScript and PrimTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105185
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-07-19 16:31:34 +00:00
a10f93f606 [composable API] Fix the replicate_device_id test case to avoid copy replicated models. (#105503)
We should not `replicate` a `copy.deepcopy` of an already replicated model.

Differential Revision: [D47566678](https://our.internmc.facebook.com/intern/diff/D47566678/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105503
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-07-19 16:20:43 +00:00
f139aab2f4 [dynamo] add initial dynamo support for DTensor (#103146)
This PR adds initial dynamo support for DTensor, in particular, it:
- allows a DTensor to be passed into a compiled function, and allows fakifying a
DTensor during dynamo tracing by turning the inner local tensor into a meta
tensor.
- We use `allow_in_graph` to include `DTensor` and `DTensor.from_local` to be represented as `TorchVariable`
- The dtensor created becomes a normal `TensorVariable` and it would insert any tensor operations to the output graph just like torch.Tensor
- note that DTensor has a new instance method `redistribute` compared to a plain tensor, and we currently special-case it in `TensorVariable`

`from_local` and `redistribute` both accept some non-trivial metadata as arguments (i.e. DeviceMesh, Placement) which fx.Graph does not support. In order to let these two APIs appear in the dynamo captured graph, we encoded the metadata into a new_function (like `functools.partial`), and the new function only accepts prim args (i.e. tensors); we then put a `call_function` with this new_function in the graph. This was suggested by @ezyang. The underlying rationale here is that the metadata will not change across graph invocations, so it's safe to encode it.

Captured graph:
```
    def forward(self, L_x_ : torch.Tensor):
        l_x_ = L_x_

        # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:685, code: dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
        prim_from_local = torch__dynamo_variables_torch_prim_from_local(l_x_, run_check = False);  l_x_ = None

        # File: /scratch/wanchaol/work/pytorch/test/distributed/_tensor/test_dtensor.py:686, code: return dt.redistribute(mesh, [Replicate()]).to_local() + 2
        prim_redistribute = torch__dynamo_variables_tensor_prim_redistribute(prim_from_local);  prim_from_local = None
        to_local = prim_redistribute.to_local();  prim_redistribute = None
        add = to_local + 2;  to_local = None
        return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103146
Approved by: https://github.com/voznesenskym
2023-07-19 16:01:12 +00:00
a788365d14 Switch UFMT to denylist rather than allowlist (#105536)
The new denylist was generated with this script: https://gist.github.com/ezyang/851589ac4694ed131feee7ad59281ca9

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105536
Approved by: https://github.com/malfet, https://github.com/albanD
2023-07-19 15:15:28 +00:00
232b96b6e2 [BE] Enable ruff's UP rules and autoformat distributed/ (#105433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105433
Approved by: https://github.com/albanD
2023-07-19 14:27:11 +00:00
2125794c12 [MPS][BE] Use Tensor::copy_ (#105475)
Replace `const_cast<Tensor&>(x) = y;` with `x.copy_(y);`

### <samp>🤖 Generated by Copilot at d2b7a0d</samp>

> _`copy_` not `clone`_
> _MPS backend runs faster_
> _Winter memory_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105475
Approved by: https://github.com/kulinseth
2023-07-19 14:26:36 +00:00
8a688277a2 [BE] Enable ruff's UP rules and autoformat dynamo / functorch and refs (#105432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105432
Approved by: https://github.com/ezyang
2023-07-19 13:48:44 +00:00
88f119775d Upgrade nightly wheels to rocm5.6 (#105076)
Tests https://github.com/pytorch/builder/pull/1442

Fixes #104419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105076
Approved by: https://github.com/atalman
2023-07-19 13:47:58 +00:00
cb7a30f656 [BE] Enable ruff's UP rules and autoformat inductor/ (#105431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105431
Approved by: https://github.com/albanD
2023-07-19 13:45:00 +00:00
c0d8a4af0a [BE] Enable ruff's UP rules and autoformat ao/ (#105430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105430
Approved by: https://github.com/albanD, https://github.com/malfet
2023-07-19 13:44:37 +00:00
0b6de0eb1c Improve validator module behavior if Z3 is not installed. (#105168)
Fixes: #105143

In summary, the changes are:

- Check whether Z3 is installed when the module is loaded (a minimal sketch of this pattern is shown after this list)
- Name the feature consistently as "translation validation" (not "validator")
- Skip tests if Z3 is not installed
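A minimal sketch of the import-time check and the test-skipping pattern described above; the `HAS_Z3` name is illustrative, not necessarily the identifier used in the module:

```python
import unittest

try:
    import z3  # noqa: F401
    HAS_Z3 = True
except ImportError:
    HAS_Z3 = False

class TranslationValidationTest(unittest.TestCase):
    @unittest.skipUnless(HAS_Z3, "translation validation requires Z3")
    def test_something(self):
        self.assertTrue(HAS_Z3)

if __name__ == "__main__":
    unittest.main()
```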

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105168
Approved by: https://github.com/ezyang
2023-07-19 13:11:22 +00:00
e137ac6c59 [dynamo][torch_np] support linalg, random and fft module (#105320)
Support tracing through `np.linalg` with `torch_np` installed. Will update with other modules if this approach makes sense.

TODO:
* [x] Add test for `fft` and `random`.

Fixes https://github.com/pytorch/pytorch/issues/105269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105320
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-07-19 11:06:37 +00:00
18bcf62bbc inductor: promote half/bfloat16 constant to float for cpu vectorization path (#105440)
As in the scalar path, we should also promote half/bfloat16 constants to float for better accuracy. After this PR, the TIMM `dm_nfnet` model amp path passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105440
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-19 06:53:23 +00:00
7ddb66e334 Fix for "AttributeError when attempting to remove inductor buffers twice" (#104901)
Fixes #102857

I added the proposed fix and found a reasonably small test case. I don't have any insight into why this test case was causing the error, but it is fixed now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104901
Approved by: https://github.com/eellison
2023-07-19 06:00:32 +00:00
167eab1cec [inductor] Support OMP on MacOS (#105136)
Fixes Dynamo + MacOS: fatal error: 'omp.h' file not found #95708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105136
Approved by: https://github.com/jansel
2023-07-19 05:58:43 +00:00
0e85c224f8 Use shareable calculate-docker-image GHA (#105372)
Switch from PyTorch `calculate-docker-image` GHA to its shareable version on test-infra https://github.com/pytorch/test-infra/pull/4397.

I will clean up PyTorch `calculate-docker-image` GHA in a separate PR after landing this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105372
Approved by: https://github.com/malfet
2023-07-19 05:02:01 +00:00
554052f321 [quant][pt2e][be] Rename prepare_pt2e_quantizer to prepare_pt2e (#105484)
Summary: att

Test Plan: sandcastle and OSS CI

Reviewed By: andrewor14

Differential Revision: D47422892

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105484
Approved by: https://github.com/andrewor14
2023-07-19 04:51:37 +00:00
5ef023b05a [BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429
Approved by: https://github.com/malfet
2023-07-19 04:46:37 +00:00
9c225c9b9a [pytorch] Change autograd fallback mode to Nothing (#105505)
Summary:
This caused some internal tests to fail. I'm not sure it is possible to easily
revert the original diff. This diff is a hotfix that changes the autograd
fallback behavior to what it was previously.

Test Plan: Existing tests

Differential Revision: D47569822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105505
Approved by: https://github.com/soulitzer
2023-07-19 04:35:37 +00:00
d2fa3f608b Produce more logs from TCPStore init (#105350)
This diff:
1. adds debug logs to TCPStore initialization to better capture the "connection reset by peer" error.

Differential Revision: [D47454956](https://our.internmc.facebook.com/intern/diff/D47454956/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105350
Approved by: https://github.com/kumpera, https://github.com/fduwjj
2023-07-19 04:15:48 +00:00
d2c24eca8a Fix mps unary op issue on non densely stored tensors (#105512)
This PR fixes a bug where non-densely stored tensors were not converted to dense tensors of the correct scalar type in the MPS `unary_op` helper function.

Fixes https://github.com/pytorch/pytorch/issues/105284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105512
Approved by: https://github.com/malfet
2023-07-19 03:56:38 +00:00
133a5e9a7a Upgrade triton pin (#105463)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105463
Approved by: https://github.com/albanD
2023-07-19 03:55:41 +00:00
64c39ece65 Fix a docstring of resolve_neg (#104151)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104151
Approved by: https://github.com/malfet
2023-07-19 03:55:20 +00:00
fe04c6c371 [inductor] Allow specify a subdir to store .so and .cubin files (#105466)
Summary: The subdir is used to store .so and .cubin files generated by AOTInductor. It can either be specified explicitly or created from a hash of the input graph.
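A hypothetical sketch of the "specified or hashed" choice described in the summary (the function and argument names are illustrative, not the actual AOTInductor API):

```python
import hashlib
from typing import Optional

def output_subdir(graph_repr: str, specified: Optional[str] = None) -> str:
    # Use the caller-provided subdir if given; otherwise derive a stable name
    # from a hash of the input graph so repeated builds reuse the same location.
    if specified is not None:
        return specified
    return hashlib.sha256(graph_repr.encode()).hexdigest()[:16]

print(output_subdir("graph(x): return x + 1"))
print(output_subdir("graph(x): return x + 1", specified="my_model"))
```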

Differential Revision: [D47556730](https://our.internmc.facebook.com/intern/diff/D47556730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105466
Approved by: https://github.com/chenyang78
2023-07-19 03:13:50 +00:00
1597dd7a54 Report guard failures with recompiles logging (#105500)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105500
Approved by: https://github.com/Chillee, https://github.com/anijain2305
2023-07-19 02:20:44 +00:00
11b753af01 Refactor causal mask generation and detection for nn.transformer (#105265)
Summary:
* Create a private global-scope function _generate_subsequent because static class-attribute member functions are not supported by TorchScript, resulting in torchscripting errors.
* Make TransformerEncoder and TransformerDecoder consistent w.r.t. is_causal handling by calling _detect_causal_mask
* Clarify in the documentation that is_causal is a hint
* Move causal mask detection into a method _detect_causal_mask
* Only accept an input-size-compatible causal mask as a causal mask
* Update _generate_subsequent_causal_mask to include factory kwargs for dtype and device:
   avoid extra copies & conversions by passing them directly to torch.full (a minimal sketch follows this list).
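A minimal sketch of the kind of mask construction the last bullet describes; the helper in the PR may differ in name and signature:

```python
import torch

def square_subsequent_mask(sz, device=None, dtype=torch.float32):
    # Causal mask: position i may attend only to positions <= i.
    # Passing device/dtype straight to torch.full avoids a later copy/convert.
    return torch.triu(
        torch.full((sz, sz), float("-inf"), device=device, dtype=dtype),
        diagonal=1,
    )

print(square_subsequent_mask(4))
```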

Test Plan: sandcastle & github CICD
Continuation of #101487 (due to a tooling issue) which is a continuation-in-part of https://github.com/pytorch/pytorch/pull/98327 by @janEbert

Differential Revision: D47427117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105265
Approved by: https://github.com/mikaylagawarecki
2023-07-19 01:26:50 +00:00
14d87bb5ff [BE] Enable ruff's UP rules and autoformat tools and scripts (#105428)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105428
Approved by: https://github.com/albanD, https://github.com/soulitzer, https://github.com/malfet
2023-07-19 01:24:44 +00:00
5666d20bb8 Add unlifting pass under private config (#104897)
Summary: We want to do this little by little. For now, I tried it only on DissectedPartsModel, which needs to use the aot_export version.

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D46785735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104897
Approved by: https://github.com/JacobSzwejbka
2023-07-19 01:16:35 +00:00
fbd7e74c92 [inductor] Enable mypy checking in lowering.py (#105317)
Summary:

As suggested in #105230, mypy checking is enabled in `torch/_inductor/lowering.py`.

23 errors fixed; 6 silenced with `# type: ignore[attr-defined]`.

Test Plan:

Before the fix:

```
$ mypy torch/_inductor/lowering.py

torch/_inductor/lowering.py:139:16: error: "Symbol" has no attribute "is_integer"  [attr-defined]
torch/_inductor/lowering.py:263:20: error: Incompatible types in assignment (expression has type "Union[List[Any], Tuple[Any, ...]]", variable has type "List[Any]")  [assignment]
torch/_inductor/lowering.py:427:49: error: "IRNode" has no attribute "get_size"  [attr-defined]
torch/_inductor/lowering.py:439:37: error: "IRNode" has no attribute "get_dtype"  [attr-defined]
torch/_inductor/lowering.py:456:34: error: "IRNode" has no attribute "get_device"  [attr-defined]
torch/_inductor/lowering.py:645:44: error: Need type annotation for "b"  [var-annotated]
torch/_inductor/lowering.py:1321:12: error: "FakeTensor" has no attribute "is_cpu"  [attr-defined]
torch/_inductor/lowering.py:1542:24: error: Argument 3 to "FixedLayout" has incompatible type "List[int]"; expected "List[Expr]"  [arg-type]
torch/_inductor/lowering.py:1542:81: error: Argument "offset" to "FixedLayout" has incompatible type "int"; expected "Expr"  [arg-type]
torch/_inductor/lowering.py:1571:24: error: Argument 3 to "FixedLayout" has incompatible type "List[int]"; expected "List[Expr]"  [arg-type]
torch/_inductor/lowering.py:1571:81: error: Argument "offset" to "FixedLayout" has incompatible type "int"; expected "Expr"  [arg-type]
torch/_inductor/lowering.py:1654:12: error: Incompatible types in assignment (expression has type "List[Any]", variable has type "Tuple[Any, ...]")  [assignment]
torch/_inductor/lowering.py:2009:9: error: Need type annotation for "ranges" (hint: "ranges: List[<type>] = ...")  [var-annotated]
torch/_inductor/lowering.py:2151:16: error: Incompatible types in assignment (expression has type "List[Any]", variable has type "Tuple[Any, ...]")  [assignment]
torch/_inductor/lowering.py:2198:43: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable)  [union-attr]
torch/_inductor/lowering.py:2229:36: error: Argument 1 to "len" has incompatible type "Union[List[Any], type]"; expected "Sized"  [arg-type]
torch/_inductor/lowering.py:2231:38: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable)  [union-attr]
torch/_inductor/lowering.py:2233:35: error: Item "type" of "Union[List[Any], type]" has no attribute "__iter__" (not iterable)  [union-attr]
torch/_inductor/lowering.py:2569:54: error: Incompatible default for argument "reduce" (default has type "None", argument has type "str")  [assignment]
torch/_inductor/lowering.py:2569:54: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2569:54: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2586:59: error: Incompatible default for argument "reduce" (default has type "None", argument has type "str")  [assignment]
torch/_inductor/lowering.py:2586:59: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2586:59: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2720:65: error: Incompatible default for argument "scales_x" (default has type "None", argument has type "Tuple[float]")  [assignment]
torch/_inductor/lowering.py:2720:65: note: PEP 484 prohibits implicit Optional. Accordingly, mypy has changed its default to no_implicit_optional=True
torch/_inductor/lowering.py:2720:65: note: Use https://github.com/hauntsaninja/no_implicit_optional to automatically upgrade your codebase
torch/_inductor/lowering.py:2735:5: error: Name "scale" already defined on line 2731  [no-redef]
torch/_inductor/lowering.py:2758:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float]]"; expected "Tuple[float]"  [arg-type]
torch/_inductor/lowering.py:2765:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float], Optional[float]]"; expected "Tuple[float]"  [arg-type]
torch/_inductor/lowering.py:2776:47: error: Argument 3 to "upsample_nearestnd" has incompatible type "Tuple[Optional[float], Optional[float], Optional[float]]"; expected "Tuple[float]"  [arg-type]
torch/_inductor/lowering.py:2949:13: error: No binding for nonlocal "grad" found  [misc]
torch/_inductor/lowering.py:3063:49: error: Argument 2 to "range_mask_low" has incompatible type "int"; expected "Expr"  [arg-type]
torch/_inductor/lowering.py:3271:48: error: "IRNode" has no attribute "data"  [attr-defined]
torch/_inductor/lowering.py:3272:16: error: "IRNode" has no attribute "data"  [attr-defined]
Found 29 errors in 1 file (checked 1 source file)
```

After the fix:

```
$ mypy torch/_inductor/lowering.py

Success: no issues found in 1 source file
```

Reviewers: @eellison

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105317
Approved by: https://github.com/eellison
2023-07-19 00:33:11 +00:00
88f1885ec9 [XNNPACK][QS8] torch.cat (#104800)
Differential Revision: [D47304143](https://our.internmc.facebook.com/intern/diff/D47304143/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104800
Approved by: https://github.com/digantdesai
2023-07-19 00:15:05 +00:00
39477f7ca9 Remove unnecessary seen check in get_current_graph_task_execution_order (#105487)
https://github.com/pytorch/pytorch/pull/105353#discussion_r1266977015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105487
Approved by: https://github.com/albanD, https://github.com/jansel
2023-07-18 23:49:45 +00:00
78a7684b5b [Pytorch] Unary Ops (#104994)
Summary:
Use templates to generate shaders for the unary operators `exp` and `sqrt`, in both in-place and non-in-place variants.

[sqrt](https://pytorch.org/docs/stable/generated/torch.sqrt.html)
[exp](https://pytorch.org/docs/stable/generated/torch.Tensor.exp.html#torch.Tensor.exp)

Refactor: use 'NAME' field in yaml for generated shader name in `gen_vulkan_spv.py`

Test Plan:
New tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*unary_op*"

Parsing buck files: finished in 16.1 sec
Creating action graph: finished in 0.7 sec
Downloaded 75/3986 artifacts, 248.89 Mbytes, 96.3% cache miss (for updated rules)
Building: finished in 08:24.0 min (100%) 2571/2571 jobs, 2571/2571 updated
  Total time: 08:40.9 min
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *unary_op*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unary_op_exp
[       OK ] VulkanAPITest.unary_op_exp (479 ms)
[ RUN      ] VulkanAPITest.unary_op_exp_
[       OK ] VulkanAPITest.unary_op_exp_ (1 ms)
[ RUN      ] VulkanAPITest.unary_op_sqrt
[       OK ] VulkanAPITest.unary_op_sqrt (2 ms)
[ RUN      ] VulkanAPITest.unary_op_sqrt_
[       OK ] VulkanAPITest.unary_op_sqrt_ (2 ms)
[----------] 4 tests from VulkanAPITest (485 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (485 ms total)
[  PASSED  ] 4 tests.
```

All tests:
https://www.internalfb.com/phabricator/paste/view/P786547213

Run clang-format on shader files and `UnaryOp.cpp`

Differential Revision: D47271856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104994
Approved by: https://github.com/SS-JIA
2023-07-18 23:43:57 +00:00
e983625f22 [FSDP] Fix skip-sharded-views + mixed precision (#105346)
This fixes https://github.com/pytorch/pytorch/issues/104504.

- When not using full-precision eval, the relevant fix is to force `_use_sharded_views()` calls if needed in `SUMMON_FULL_PARAMS` training state.
- When using full-precision in eval, the relevant fix is tracking what was the unsharded flat parameter from which the unsharded views were computed and using that instead of determining the unsharded flat parameter from the calling context via `_get_padded_unsharded_flat_param()`.

This also fixes https://github.com/pytorch/pytorch/issues/104770.
<details>
<summary> Print output showing parity </summary>

```
Key: 0
Model 1: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]
Model 2: [-1.5, 6.40625, -0.9453125, -0.3828125, 0.16015625, -1.5078125]

Key: 1
Model 1: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]
Model 2: [0.0157470703125, -0.8828125, 5.65625, 1.1328125, 0.275390625, 0.11181640625]

Key: 2
Model 1: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]
Model 2: [0.1689453125, -0.00567626953125, -0.09375, 7.34375, -0.18359375, -0.09521484375]

Key: 3
Model 1: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]
Model 2: [0.546875, -0.8984375, 0.228515625, 0.7578125, 6.0625, 0.435546875]

Key: 4
Model 1: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]
Model 2: [-0.66796875, -0.88671875, 0.30078125, 0.06494140625, 0.412109375, 6.9375]

Key: 5
Model 1: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]
Model 2: [0.07763671875, 0.8671875, -0.43359375, 0.5703125, 0.76171875, -0.0089111328125]

Key: 6
Model 1: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]
Model 2: [-0.283203125, -0.361328125, 0.474609375, 0.10205078125, 1.125, -0.0859375]

Key: 7
Model 1: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]
Model 2: [1.140625, 0.62890625, -0.07568359375, -1.0390625, -0.2578125, -0.053955078125]

Key: 8
Model 1: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]
Model 2: [0.68359375, -1.09375, 0.59375, 1.0, -0.23828125, 0.578125]

Key: 9
Model 1: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
Model 2: [0.515625, 0.296875, -0.1826171875, -0.12890625, -0.51953125, -0.3359375]
```

</details>

Follow-ups:
- I suspect that for `SHARD_GRAD_OP`, train forward -> eval forward when using full-precision in eval will not free the low-precision unsharded parameters from the train forward, resulting in 1.5x unsharded parameter memory.

Differential Revision: [D47527597](https://our.internmc.facebook.com/intern/diff/D47527597)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105346
Approved by: https://github.com/fegin, https://github.com/rohan-varma
2023-07-18 23:13:53 +00:00
e645f2adaf [DTensor] Fix device detection logic for TestDTensorPlacementTypes::test_split_tensor. (#105357)
The test should respect self.device_type as it checks whether the environment
has enough GPUs to serve the requested world size.

The test will lead to hangs if we try to run 8 ranks over our 2-4 GPUs CI instances.

Fixes #104769
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105357
Approved by: https://github.com/wanchaol
2023-07-18 21:53:50 +00:00
242a7743f3 [BE] Enable ruff's UP rules and autoformat onnx/ (#105427)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105427
Approved by: https://github.com/malfet
2023-07-18 21:41:24 +00:00
f508d3564c [Pytorch][Vulkan] Templatize BinaryOps (#105380)
Summary:
Use templates to generate the kernels for add, sub, mul, div and their variants (tensor/scalar, in-place/out-of-place).

Rename Arithmetic.cpp to BinaryOp.cpp

Test Plan:
https://www.internalfb.com/phabricator/paste/view/P785131030

```
 buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

...

xplat/caffe2/aten/src/ATen/test/vulkan_api_test.cpp:6377: Skipped
QueryPool is not available
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log (0 ms)
[----------] 307 tests from VulkanAPITest (5427 ms total)

[----------] Global test environment tear-down
[==========] 307 tests from 1 test suite ran. (5427 ms total)
[  PASSED  ] 306 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

  YOU HAVE 5 DISABLED TESTS
```

Differential Revision: D47307169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105380
Approved by: https://github.com/SS-JIA
2023-07-18 21:21:19 +00:00
78829d6e07 Fix isinstance check in quat_utils (#105476)
Calling `isinstance(x, Tuple[Node, Node])` would either fail, or raise a
type error on a more modern Python, as none of the tuples are actually
instances of `Tuple`

```python
>>> from typing import Tuple
>>> from torch.fx import Node
>>> edge_or_node=(Node(None, "foo", "output", "foo", None, None), Node(None, "bar", "output", "bar", None, None))
>>> isinstance(edge_or_node, tuple) and len(edge_or_node) == 2 and all(isinstance(x, Node) for x in edge_or_node)
True
>>> isinstance(edge_or_node, Tuple[Node, Node])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/malfet/miniconda3/lib/python3.10/typing.py", line 994, in __instancecheck__
    return self.__subclasscheck__(type(obj))
  File "/Users/malfet/miniconda3/lib/python3.10/typing.py", line 997, in __subclasscheck__
    raise TypeError("Subscripted generics cannot be used with"
TypeError: Subscripted generics cannot be used with class and instance checks
```

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 40fa451</samp>

> _Fix type annotation_
> _Quantize nodes in the graph_
> _Autumn leaves falling_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105476
Approved by: https://github.com/jerryzh168
2023-07-18 21:16:05 +00:00
3721fa5612 [BE] Enable ruff's UP rules and autoformat optim/ (#105426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105426
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi, https://github.com/janeyx99
2023-07-18 21:07:43 +00:00
be03a56955 [BE] Enable ruff's UP rules and autoformat testing/ (#105425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105425
Approved by: https://github.com/malfet
2023-07-18 21:04:39 +00:00
9e1b07e692 [C10d] Handle bool tensors in gloo. Fixes #103585. (#105354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105354
Approved by: https://github.com/wanchaol
2023-07-18 20:42:58 +00:00
abc1cadddb [BE] Enable ruff's UP rules and autoformat utils/ (#105424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105424
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-18 20:17:25 +00:00
91ab32e4b1 [pt2][inductor] fix LoweringException: TypeError: '<' not supported between instances of 'ExternKernelCaller' and 'ExternKernelCaller' (#105469)
Summary: `sort_keys=True` for autotuning results fails because we can't compare ExternKernelCaller objects. Besides, it isn't really necessary to sort the keys, either for the autotuning results or the sysinfo, so let's just drop sorting altogether.

Test Plan: sandcastle + CI

Reviewed By: aaronenyeshi

Differential Revision: D47544587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105469
Approved by: https://github.com/jansel
2023-07-18 20:08:50 +00:00
8cd94e1eab [MPS] Add lerp implementation (#105470)
lerp.Scalar fits very well into binary op template
Add a very naive implementation for `lerp.Tensor` as `add_out(self, weights.mul(end.sub(self)))`
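A quick sanity sketch of the identity behind that naive implementation (illustrative only; not part of the PR):

```python
import torch

# lerp(start, end, weight) == start + weight * (end - start),
# which is what add_out(self, weights.mul(end.sub(self))) computes.
start, end, weight = torch.rand(4), torch.rand(4), torch.rand(4)
assert torch.allclose(torch.lerp(start, end, weight), start + weight * (end - start))
```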

Enable `lerp` testing in `test_mps`

Fixes https://github.com/pytorch/pytorch/issues/105382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105470
Approved by: https://github.com/albanD
2023-07-18 20:01:04 +00:00
cb23373264 [dynamo] allow tensor subclass fakification in dynamo (#105308)
This PR adds necessary plumbing through torchdynamo to allow tensor
subclasses with certain contract (i.e. with `__tensor_flatten__` and
`__tensor_unflatten__`) to go through the dynamo fakification pass by
fakifying the tensor subclass internal components.

Some of the tensor subclass contract logic mostly borrowed from
https://github.com/pytorch/pytorch/pull/97540

Added some tests to verify simply passing through a tensor subclass
(i.e. DTensor) through dynamo eager works as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105308
Approved by: https://github.com/ezyang
2023-07-18 17:28:04 +00:00
bcb9ca4e5a [dtensor] canonicalize detach callsites and use view_as when appropriate (#105239)
This PR canonicalizes the detach callsites to only call detach
from `distribute_tensor`. Other callsites are changed to view_as, and the
tensor-constructor detach call is removed.

This is so that we don't detach the local tensor for every op run when
rewrapping the DTensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105239
Approved by: https://github.com/albanD
2023-07-18 17:13:37 +00:00
1aba399138 allow set_multithreading_enabled to act as function and context manager (#105291)
Fixes #104985

Implemented a `set_multithreading_enabled` C++ function that directly alters the state, rather than using the `MultithreadingEnabled` class, which automatically reset the state when the object was destroyed. This behavior more closely aligns with `set_grad_enabled`, which works as expected. This allows us to change the Python class `set_multithreading_enabled` to act as both a function and a context manager.

I also added a getter: `torch._C.is_multithreading_enabled`
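A minimal usage sketch of both forms (the `torch.autograd.set_multithreading_enabled` spelling is an assumption; the getter name follows the summary above):

```python
import torch

def f(x):
    return (x * x).sum()

x = torch.randn(4, requires_grad=True)

# As a context manager: disable multithreaded backward just for this block.
with torch.autograd.set_multithreading_enabled(False):
    f(x).backward()

# As a plain function call, mirroring set_grad_enabled.
torch.autograd.set_multithreading_enabled(True)
print(torch._C.is_multithreading_enabled())
```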

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105291
Approved by: https://github.com/albanD
2023-07-18 16:55:40 +00:00
ed2b9f1af1 [quant][pt2e] rename _quantize_pt2e to quantize_pt2e (#105377)
Summary: att

Test Plan: CIs

Reviewed By: andrewor14

Differential Revision: D47234357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105377
Approved by: https://github.com/andrewor14
2023-07-18 16:46:05 +00:00
cyy
8364a5116c Simplify Dispatcher case for zero arguments (#104613)
MSVC detects calling  Dispatcher::callWithDispatchKeySlowPath with zero arguments.  This PR fixes it and simplifies code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104613
Approved by: https://github.com/ezyang
2023-07-18 16:42:57 +00:00
133c5ec997 Add torch.ops.out_dtype (#103333)
https://docs.google.com/document/d/10DYFG2sU3TSvguFP5kYwYLlo45KHFg3BhBOkUk0NKsU/edit#bookmark=id.hgfzmhlzkamk

Renamed mixed_dtype --> out_dtype because "mixed_dtype is not very descriptive in the context of regular pytorch where we support type promotion on most ops"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103333
Approved by: https://github.com/zou3519
2023-07-18 16:25:45 +00:00
1b78f23a1a Allow nn.ChannelShuffle to run without erroring on CUDA tensors (#105351)
Summary: Include GPU support for `nn.ChannelShuffle` & update test.
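A small sketch of the now-working CUDA path (assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

shuffle = nn.ChannelShuffle(2)             # 2 groups
x = torch.randn(1, 4, 3, 3, device="cuda")
y = shuffle(x)                             # previously raised on CUDA tensors
print(y.shape)                             # torch.Size([1, 4, 3, 3])
```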

Fix: #104603

Test Plan: Please see GitHub Actions.

Differential Revision: D47523764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105351
Approved by: https://github.com/mikaylagawarecki
2023-07-18 16:24:30 +00:00
dc1186b0f9 Add peterbell10 to core reviewers (#105461)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105461
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-07-18 13:52:45 +00:00
b10de43c0a Add aot_inductor as a test backend for benchmarking (#105221)
Summary:
Original PR at https://github.com/pytorch/pytorch/pull/104977. Landing from fbcode instead.

Add an aot_inductor backend (Export+AOTInductor) in the benchmarking harness. Note it is not a dynamo backend.

Moved files from torch/_inductor/aot_inductor_include to torch/csrc/inductor as a more standard way for exposing headers
Created a caching function in benchmarks/dynamo/common.py for compiling, loading and caching the .so file, as a proxy for a pure C++ deployment, but easier for benchmarking.

Differential Revision: D47452591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105221
Approved by: https://github.com/jansel
2023-07-18 13:16:36 +00:00
671a21926f [torch_np] update test to use ones_like instead of empty_like (#105453)
This test fails locally (probably because deterministic mode is not on by default).

We replace the use of `empty_like` with `ones_like`, as this test doesn't need `empty_like`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105453
Approved by: https://github.com/lezcano
2023-07-18 13:13:11 +00:00
5e942ac5ec add bfloat16 support for reflection and replication padding (#102949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102949
Approved by: https://github.com/cpuhrsch
2023-07-18 13:01:09 +00:00
1a4ee2a6bb Add XPU support for storage resize_ (#105262)
We'd like to add XPU device support for storage resize_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105262
Approved by: https://github.com/mikaylagawarecki
2023-07-18 12:46:00 +00:00
d09195ce82 inductor: fix fx tracing error for freezing pass (#105133)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105133
Approved by: https://github.com/eellison
2023-07-18 10:40:22 +00:00
38c1e86ee2 inductor: make sure as_strided ops' input layout is not changed after converting conv's weight format (#105122)
For the freezing path, if we convert conv's weight to channels last, we need to make sure as_strided ops' input layout is not changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105122
Approved by: https://github.com/jgong5, https://github.com/shunting314
2023-07-18 09:26:54 +00:00
964d29f312 [BE] Enable ruff's UP rules and autoformat torchgen/ (#105423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105423
Approved by: https://github.com/Skylion007
2023-07-18 06:44:20 +00:00
6ca3d7e1a2 [pt2][inductor] only use global cache on MAST (#105375)
Summary:
Until we can further investigate the autotuning differences between MAST and non-MAST (devserver) environments, turn off the global cache for all non-MAST environments. This ensures we don't see unexpected regressions.

Also update Scuba logging for cache lookups, and add Scuba logging for autotuning results.

Test Plan: sandcastle + CI

Differential Revision: D47516633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105375
Approved by: https://github.com/jansel
2023-07-18 06:16:47 +00:00
8010f6bf48 [dynamo][inductor] Provide public API to get compiler options/configs (#105026)
issues resolved: https://github.com/pytorch/pytorch/issues/101832

**context**: get the torch.compile config for further usage. E.g., the training platform wants to know whether the model is compiled with cudagraphs enabled and trigger further action.

**how it is implemented**
   * the core logic is backend.get_compiler_config() in torch/_dynamo/eval_frame.py
   * for backend='inductor' / _TorchCompileInductorWrapper, we have inductor-specific implementation in get_compiler_config in torch/_inductor/compile_fx.py and torch/__init__.py

**how to use it**: Below is an example.

```
model = DummyModule()
optimized_module = torch.compile(
    model, options={"triton.cudagraphs": True}
)
compiler_config = optimized_module.get_compiler_config()

if compiler_config["triton.cudagraphs"]:
   pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105026
Approved by: https://github.com/yanboliang, https://github.com/jansel
2023-07-18 06:12:06 +00:00
4b3c261a2e inductor: fix issue of vectorization when the store's index is constant value (#105314)
Fix #104515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105314
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-18 04:54:25 +00:00
3201a90428 inductor: don't force convert channels last if one op's user is as_strided ops (#105111)
```aten.unfold``` is decomposed to ```aten.as_strided```, which depends on the input size and stride, so we shouldn't change its stride, to avoid computing a wrong value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105111
Approved by: https://github.com/shunting314, https://github.com/eellison
2023-07-18 04:52:19 +00:00
5d473a950f Make conversions from/to sparse semi-structured always @torch.compile-d (#105272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105272
Approved by: https://github.com/ezyang
2023-07-18 04:51:28 +00:00
ad6dad810e [dynamo][profiler] More verbose profiler warning (#105362)
torch.profiler.record_function and torch.profiler.profile are ignored by dynamo. In the common case, users have `record_function` in the middle of their program in order to annotate a section of the profile.

The previous error message was `Profiler will be ignored`. Users would think that profiling would be completely ignored.

Now the message will look like `Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105362
Approved by: https://github.com/yanboliang, https://github.com/aaronenyeshi
2023-07-18 04:42:13 +00:00
2ba9b56449 [ONNX] Fix aten::cat export when arg include parameters (#105373)
Not all fx.Node are guaranteed to have meta["val"]. 'get_attr' nodes
do not. This PR fixes the callsite checking if meta["val"] is symbol.
Fixes #105370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105373
Approved by: https://github.com/titaiwangms
2023-07-18 04:21:59 +00:00
dc58259746 [Inductor] [FX passes] Group linear fusion (#105116)
Summary:
The draft version of a group + batch fusion framework, and the group linear fusion implementation.
In the future, it's pretty straightforward to add a new group/batch fusion policy by defining a class with match + fuse functions.

Test Plan: buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:group_batch_fusion

Differential Revision: D46956695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105116
Approved by: https://github.com/jansel
2023-07-18 03:56:42 +00:00
ba00b0939e Inductor cpp wrapper: support torch.complex64 (#105305)
Add `torch.complex64` into the supported dtype list of cpp wrapper to fix CPU cpp wrapper failure on llama.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105305
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-07-18 02:37:44 +00:00
fcb7d4b358 Mark bincount CUDA deterministic if weights are not given (#105244)
Fixes #98316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105244
Approved by: https://github.com/mikaylagawarecki
2023-07-18 01:16:51 +00:00
e9fd815226 Misc visibility changes for compiled autograd (#105298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105298
Approved by: https://github.com/albanD, https://github.com/soulitzer
2023-07-18 01:10:04 +00:00
cf404a8ce4 Fix get_current_graph_task_execution_order accumulate_grads ordering (#105353)
Fixes https://github.com/pytorch/pytorch/issues/105293
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105353
Approved by: https://github.com/albanD
2023-07-18 00:59:25 +00:00
750b9b359f fix aot_inductor+dynamo fail on IG_CTR (#105250)
Test Plan: CI.

Reviewed By: chenyang78, houseroad

Differential Revision: D47464664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105250
Approved by: https://github.com/houseroad
2023-07-18 00:26:09 +00:00
a4021af42e [Pytorch] General broadcast for arithmetic operators (#104718)
Summary:
Currently, broadcast is supported for 4D tensors where, if the batch or channel dimensions are not equal, then the batch and channel of one tensor must both be 1, i.e.:
```
tensorA NCHW:
5, 2, 3, 3
tensorB NCHW:
1, 1, 3, 3 --> batch=1, channel=1
```
This diff adds broadcast support for 4D tensors where the batch and channel of a tensor are different, i.e.:
```
tensorA NCHW:
5, 1, 3, 3
tensorB NCHW:
1, 5, 3, 3
```

Broadcast rules:
```
- tensorA.dim()[x] == tensorB.dim()[x]
- tensorA.dim()[x] == 1 || tensorB.dim()[x] == 1
- tensorA.dim()[x] does not exist || tensorB.dim()[x] does not exist
```
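The rules above can be sketched in plain Python as follows (an illustration only; the real check lives in the Vulkan C++ backend):

```python
def vulkan_broadcastable(dims_a, dims_b):
    # Compare trailing dimensions; any leading dims missing on one side are allowed.
    for da, db in zip(reversed(dims_a), reversed(dims_b)):
        if not (da == db or da == 1 or db == 1):
            return False
    return True

assert vulkan_broadcastable((5, 1, 3, 3), (1, 5, 3, 3))      # newly supported case
assert not vulkan_broadcastable((5, 2, 3, 3), (4, 2, 3, 3))  # still rejected
```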

Broadcast method:

1. Pass `output`, `input` and `other` tensors to the shader
2. Iterate through the output texture to calculate the value of each texel (no repeating)
3. Mapping NHW positions: use modulo
4. Mapping C position: divide pos.z by ceil(C/4) to map to original tensor range

 ---
Also some test refactoring to reduce repeated setup code.

Test Plan:
New tests:

Add
```
[ RUN      ] VulkanAPITest.add_broadcast5
[       OK ] VulkanAPITest.add_broadcast5 (0 ms)
[ RUN      ] VulkanAPITest.add_broadcast6
[       OK ] VulkanAPITest.add_broadcast6 (0 ms)
```

Sub
```
[ RUN      ] VulkanAPITest.sub_broadcast5
[       OK ] VulkanAPITest.sub_broadcast5 (0 ms)
[ RUN      ] VulkanAPITest.sub_broadcast6
[       OK ] VulkanAPITest.sub_broadcast6 (0 ms)
```

Mul
```
[ RUN      ] VulkanAPITest.mul_broadcast5
[       OK ] VulkanAPITest.mul_broadcast5 (1 ms)
[ RUN      ] VulkanAPITest.mul_broadcast6
[       OK ] VulkanAPITest.mul_broadcast6 (1 ms)
```

Div
```
[ RUN      ] VulkanAPITest.div_broadcast5
[       OK ] VulkanAPITest.div_broadcast5 (1 ms)
[ RUN      ] VulkanAPITest.div_broadcast6
[       OK ] VulkanAPITest.div_broadcast6 (2 ms)
```

All tests:
https://www.internalfb.com/phabricator/paste/view/P781794761

Run clang-format on glsl files and Arithmetic.cpp

Differential Revision: D46874508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104718
Approved by: https://github.com/SS-JIA
2023-07-18 00:15:19 +00:00
196f2ab014 Log triton random warning to INFO (#105343)
For https://github.com/pytorch/pytorch/issues/105204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105343
Approved by: https://github.com/lessw2020, https://github.com/Skylion007
2023-07-18 00:06:52 +00:00
88aa51fe85 [dynamo] Support defaults for namedtuples (#105341)
Fixes https://github.com/pytorch/pytorch/issues/103008
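A hypothetical illustration of what this enables (names are made up for the example):

```python
from collections import namedtuple

import torch

Pair = namedtuple("Pair", ["x", "y"], defaults=[None])

@torch.compile
def fn(t):
    p = Pair(t)  # `y` falls back to its default inside the compiled region
    return t * 2 if p.y is None else p.x + p.y

fn(torch.ones(3))
```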

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105341
Approved by: https://github.com/jansel
2023-07-17 23:52:57 +00:00
03a4aecf60 Make libtorch CUDA12 builds actually build on CUDA-12 (#105364)
Not sure why https://github.com/pytorch/pytorch/pull/102178 downgraded this build from CUDA-12.1 to CUDA-11.8.

Hattip to @ptrblck for spotting it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105364
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2023-07-17 23:44:39 +00:00
a6758cb304 Revert "Revert "SetVariable in dynamo (#103205)"" + Fix for improved graph breaks (#105345)
This reverts commit 94b3f9f646a84fb7bb0df997a57d410697440210.

Fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105345
Approved by: https://github.com/atalman
2023-07-17 23:21:30 +00:00
b2150b4795 [pt2][inductor] move gemm local cache to cache_dir()/cache/{hash} (#105334)
Summary: move gemm autotuning local cache to `cache_dir()/cache/{hash}` since we might have multiple local caches, i.e. one cache with `allow_tf32=True` and one cache with `allow_tf32=False`

Test Plan: sandcastle + CI

Differential Revision: D47504654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105334
Approved by: https://github.com/jansel
2023-07-17 23:05:50 +00:00
5e6c124649 upgrade multipy to latest master there (#105344)
This is in particular to have https://github.com/pytorch/multipy/pull/325 which will unblock https://github.com/pytorch/pytorch/pull/105245
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105344
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-07-17 22:15:03 +00:00
d623f22b8b Skip frame if the graph is empty (#105228)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105228
Approved by: https://github.com/anijain2305
2023-07-17 21:50:00 +00:00
0af287cef2 Update batch_norm_backward_elemt args in native_functions.yaml (#104529)
The CUDA kernel implementation already changed this name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104529
Approved by: https://github.com/soulitzer
2023-07-17 21:25:47 +00:00
37f5d7866c Remove redundant setuptools in pyproject.toml requires (#105303)
There are two `setuptools` entries in requires. It was a typo in the original PR: e85dfb6203.

This PR just removes the redundant one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105303
Approved by: https://github.com/mikaylagawarecki
2023-07-17 20:31:08 +00:00
a26afb9848 Better comparisons for np.ndarrays in dynamo (#105333)
This takes tolerances into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105333
Approved by: https://github.com/larryliu0820
2023-07-17 20:20:50 +00:00
3fdf365397 Move TypeAndSize out of /generated/ (#105195)
This avoids a circular import in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105195
Approved by: https://github.com/albanD
2023-07-17 19:31:27 +00:00
28d018dafd [inductor] Implement bucketize() for dependencies.py (#105102)
dependencies.py is used for tracking reads and writes, which is used for identifying dependencies between buffers: i.e. if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to correctly track dependencies. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.

Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.

Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
2023-07-17 19:15:00 +00:00
74a08db12b [BE] Speedup bazel builds (#105347)
The way it was set up before this PR was that the results of the `build` step were discarded and repeated anew during the test step, which was executed in a new pristine container instance.

Avoid that by running builds and tests in the same container instance, which is passed from the `Build` to the `Test` step via the `GITHUB_ENV` variable.

Test plan: [linux-bionic-cpu-py3.10-gcc9-bazel-test](https://github.com/pytorch/pytorch/actions/runs/5578973087/jobs/10193974290#logs) finishes in 27 min instead of 40 min on trunk and [linux-focal-cuda12.1-py3.10-gcc9-bazel-test](https://github.com/pytorch/pytorch/actions/runs/5578973087/jobs/10193975032#logs) finishes in 44 min instead of 90 min on trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105347
Approved by: https://github.com/huydhn
2023-07-17 18:50:30 +00:00
32d422f335 Make adding buffers more like adding parameters (#104069)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new `Buffer` class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the `register_buffer` method has not been changed. The `persistent` parameter in the `Buffer` type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new `Buffer` type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the `Buffer` type can be used as a drop-in replacement for `register_buffer`, as it just leads to `register_buffer` being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
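A minimal sketch of the two equivalent styles described above (the `nn.Buffer` name and its `persistent` argument are assumptions based on this summary):

```python
import torch
import torch.nn as nn

class Stats(nn.Module):
    def __init__(self):
        super().__init__()
        # Existing API, unchanged by this PR.
        self.register_buffer("running_mean", torch.zeros(10), persistent=True)
        # New parameter-like assignment, routed through register_buffer under the hood.
        self.running_var = nn.Buffer(torch.ones(10), persistent=True)
```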

Fixes #35735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104069
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:59:05 +00:00
4fc47b4156 Allow disabling bias for LayerNorm (#101683)
Only relevant if `elementwise_affine=True`.
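A short sketch assuming the new `bias` keyword on the constructor:

```python
import torch.nn as nn

# Learned scale, but no learned shift.
ln = nn.LayerNorm(64, elementwise_affine=True, bias=False)
print(ln.weight.shape, ln.bias)  # torch.Size([64]) None
```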
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101683
Approved by: https://github.com/mikaylagawarecki
2023-07-17 17:56:21 +00:00
e0d2ad1a21 [Profiler][Memory] Export raw timestamped events in export_memory_timeline_raw (#105094)
Summary:
Rather than processing the events into a time-and-sizes plot, dump the actual events as (timestamp, action, num of bytes, category) when the output file ends in `raw.json.gz`.

This can allow downstream analysis tools to process these events. It also avoids having to control the granularity of the previous json.gz in memory profiler.
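A rough usage sketch, under the assumption that the raw export is selected by the file suffix via the existing `export_memory_timeline` entry point (the exact profiler flags required, such as `with_stack=True`, are assumptions):

```python
import torch
from torch.profiler import ProfilerActivity, profile

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    a = torch.randn(1024, 1024, device="cuda", requires_grad=True)
    (a @ a).sum().backward()

# Ends in "raw.json.gz" -> dump raw (timestamp, action, num bytes, category) events.
prof.export_memory_timeline("memory_events.raw.json.gz", device="cuda:0")
```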

Test Plan: CI Tests

Differential Revision: D47416544

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105094
Approved by: https://github.com/davidberard98
2023-07-17 17:39:37 +00:00
95232c216b [dynamo] Bugfix for enums (#105306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105306
Approved by: https://github.com/yanboliang
2023-07-17 16:39:16 +00:00
ca47205783 Revert "create mergability check (#105086)"
This reverts commit 0a20233e5bf495d06bb0c47e671cdcbbaede7b79.

Reverted https://github.com/pytorch/pytorch/pull/105086 on behalf of https://github.com/PaliC due to no longer needed ([comment](https://github.com/pytorch/pytorch/pull/105086#issuecomment-1638436771))
2023-07-17 16:05:54 +00:00
07a1c3f7ff Exercise subclass of TransformerEncoderLayer (#105297)
Summary: Exercise subclass of TransformerEncoderLayer
Additional unit tests for change in #102045 to show correct e2e operation (cf. issue #100188)

Also: remove batch_first from the list of TS module constants where it is not used, to resolve a TorchScripting warning.

Test Plan: sandcastle, github

Differential Revision: D47503004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105297
Approved by: https://github.com/davidberard98
2023-07-17 16:03:10 +00:00
e5f5bcf6d4 [inductor] include global cache dir in inductor resources (#102130)
Summary: adding global cache dir glob to inductor resources

Test Plan: sandcastle + CI + tested locally

Differential Revision: D46131451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102130
Approved by: https://github.com/jansel
2023-07-17 15:44:16 +00:00
05854212dd add syncBN support for custom device (#104250)
Fixes #ISSUE_NUMBER
There are some hard-coded checks for `cuda`, so I optimized those checks so that syncBN can run on other devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104250
Approved by: https://github.com/albanD
2023-07-17 15:41:39 +00:00
2fa7d11b64 Immediately compile backwards graph in AOTAutograd if dynamic shapes (#104971)
Previously, we made backwards graph compilation lazy to avoid paying
for compilation if the user didn't actually end up using the backwards
graph.  This was useful in the old days when a lot of things in Inductor
didn't work and we could bypass errors this way.

However, this has a bad implication for dynamic shapes: the backwards
graph compilation can trigger extra guards, which are too late to
install in the Dynamo context if we wait until backwards is being run.
So in this PR I move us back to compiling backwards graph immediately
if we capture any SymInts for backwards.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104971
Approved by: https://github.com/Chillee
2023-07-17 15:37:17 +00:00
94b3f9f646 Revert "SetVariable in dynamo (#103205)"
This reverts commit 82fb5edfc725714d6ccb3cb978a42d29b4c34cc2.

Reverted https://github.com/pytorch/pytorch/pull/103205 on behalf of https://github.com/atalman due to Failing cuda11.8-py3.10-gcc7-sm86 / test (inductor_torchbench_dynamic) with CUDA oom ([comment](https://github.com/pytorch/pytorch/pull/103205#issuecomment-1638115073))
2023-07-17 13:13:47 +00:00
07108ff1e8 Fix typos under _decomp directory (#105210)
Fix typos under _decomp directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105210
Approved by: https://github.com/ezyang, https://github.com/Neilblaze
2023-07-17 11:41:30 +00:00
3152feab07 Assert that we can compute the bounds for guards using rational numbers (#105139)
This makes sure that the bounds are always correct, as we're not losing
precision
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105139
Approved by: https://github.com/ezyang
2023-07-17 11:34:05 +00:00
34c91a7051 Prefer bound_sympy over sympy_interp (#105138)
This is the first PR towards simplifying sympy_interp, and more
generally, simplifying the implementation of ValueRangeAnalysis for
SymPy expressions.

In general, it would be conceptually good to have a minimal subset of
operations that make up our SymPy expressions, be they guards or
indexing expressions. This would allow us to reason better about SymPy
guards and potentially have invariants like knowing that guards are
continuous piecewise rational functions. If this were the case,
we could operate on them using exact arithmetic and completely avoid
precision errors like the one found in https://github.com/pytorch/pytorch/issues/105097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105138
Approved by: https://github.com/ezyang
2023-07-17 11:34:05 +00:00
eae99b0f51 Bound just size variables in bound_sympy (#105155)
We also bound them as starting at 2, because of 0/1 specialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105155
Approved by: https://github.com/ezyang
2023-07-17 11:34:05 +00:00
b5beced299 [xla hash update] update the pinned xla hash (#105312)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105312
Approved by: https://github.com/pytorchbot
2023-07-17 11:09:56 +00:00
7f84d55e58 [BE] Add actual dtype to RuntimeError in ADDMM_META() (#105309)
Summary: Include actual dtype in RuntimeError

Test Plan: Please see GitHub Actions

Fix: #105243

Differential Revision: D47506482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105309
Approved by: https://github.com/IvanYashchuk
2023-07-17 10:54:19 +00:00
8c479d32da [inductor][easy] avoid duplicate kernel definitions (#105099)
When running BertForMaskedLM, I found that if I enable the kernel benchmark, essentially identical kernels will be defined once for each call site. The reason is that the benchmark harness of those kernels uses a different seed_offset for each invocation. We should be safe to just force seed_offset to be 0 so we can deduplicate identical kernel definitions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105099
Approved by: https://github.com/jansel
2023-07-17 05:34:09 +00:00
93f852f201 Add PyObject_CallMethodNoArgs to pythoncapi_compat.h (#105285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105285
Approved by: https://github.com/Skylion007
2023-07-17 03:19:04 +00:00
e68cf02420 Revert "[inductor] Implement bucketize() for dependencies.py (#105102)"
This reverts commit cff5d6a22c8326f3d9fcbed2f68c5aaae9f4523a.

Reverted https://github.com/pytorch/pytorch/pull/105102 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/105102#issuecomment-1637261924))
2023-07-17 01:22:19 +00:00
9adfaf8807 [inductor] Add lowering for aten.unfold (#105165)
The decomposition for unfold uses `as_strided` which forces the input to be
realized. Instead, this implements it as a `GenericView` with reindexing
which removes the need to realize, though it does call `mark_reuse` in case
the input computation is expensive and the windows overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105165
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-16 13:09:23 +00:00
b47d91a537 [dynamo] Reland Move decorators into decorators.py (#105273)
Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105273
Approved by: https://github.com/jansel
2023-07-16 05:33:54 +00:00
cbe0254dc4 Optimize sparse.mm reduce in BFloat16 data type in CPU backend (#103239)
### Description

This PR optimizes sparse.mm reduce for the BFloat16 data type in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. Half support (which needs an addmm Half implementation) will be done once https://github.com/pytorch/pytorch/pull/99498 is upstreamed.
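A minimal BFloat16 sketch of the optimized path (assumes `torch.sparse.mm`'s `reduce` argument with a CSR operand on CPU, as targeted by this PR):

```python
import torch

a = torch.randn(8, 8).relu().to_sparse_csr().to(torch.bfloat16)
b = torch.randn(8, 4, dtype=torch.bfloat16)

out = torch.sparse.mm(a, b, "mean")  # reduce can be "sum", "mean", "amax" or "amin"
print(out.dtype, out.shape)          # torch.bfloat16 torch.Size([8, 4])
```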

Next step:
- [x] Add benchmarks
- [x] Update UTs
- [x] Check backward behaviors
- [x] Refactor code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/509e8482-9160-4b85-bc39-5b6aad510283)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/c953a494-8f8e-4dbd-a8a7-421d8c22e946)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103239
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-16 05:02:10 +00:00
e3c4f2fb59 [vision hash update] update the pinned vision hash (#105282)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105282
Approved by: https://github.com/pytorchbot
2023-07-16 04:05:04 +00:00
efdabbff06 Assert that evaluate_expr matches hint (#97792)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97792
Approved by: https://github.com/avikchaudhuri
2023-07-15 23:57:42 +00:00
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add hack to squash the older libstdc++ from the conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
5cd861fcf7 Add empty/empty_like to core aten decomps (#105158)
Fixes https://github.com/pytorch/pytorch/issues/104871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105158
Approved by: https://github.com/SherlockNoMad
2023-07-15 18:48:55 +00:00
1152e86da1 Transmute refined SymInt into int (#104828)
Previously, x.size(0) could return a SymInt, even when the internal
sympy expression was actually already constant (e.g., due to an
introduced guard.)  We now allow to query the Python object with
maybe_as_int which allows us to transmute these objects back to
int when possible.

It is still possible to end up with a constant SymInt even after this
change, e.g., if you get out a SymInt and while holding onto it
specialize it, but casual users are more likely to get ints when they
want to.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104828
Approved by: https://github.com/Skylion007
2023-07-15 18:46:10 +00:00
66d3729388 Add THPVariable_WrapList helper (#105194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105194
Approved by: https://github.com/soulitzer, https://github.com/albanD
2023-07-15 18:13:35 +00:00
7b4d080496 [quant][pt2e] Rename _pt2e to pt2e (#104668)
Summary:
X-link: https://github.com/pytorch/executorch/pull/3

att

Test Plan: Imported from OSS

Differential Revision: D47202807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104668
Approved by: https://github.com/andrewor14
2023-07-15 06:34:17 +00:00
a63f3f4335 [dynamo] Disable fused adam op compile (#105256)
Don't compile the hand-fused adam implementation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105256
Approved by: https://github.com/Chillee
2023-07-15 06:13:40 +00:00
922a98e693 [ONNX] Enable attribute type checking in onnx dispatcher (#105104)
The dipatcher didn't check attribute dtype, as AttributeProto is a totally different system from InputProto in ONNX. This PR introduces the mapping table for AttributeProto type to python type. And further utilize it in opschema matching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105104
Approved by: https://github.com/thiagocrepaldi
2023-07-15 06:06:39 +00:00
0285366464 Revert "[dynamo] Maintainable code - Move export impl to a different file (#105071)"
This reverts commit 068f163dd3beb5883cda37518017d18fc6a50561.

Reverted https://github.com/pytorch/pytorch/pull/105071 on behalf of https://github.com/clee2000 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/105071#issuecomment-1636654945))
2023-07-15 04:18:07 +00:00
1fdc88f877 Inductor cpp wrapper: fix codegen of FallbackKernel with kwargs (#104575)
Fix cpp wrapper failure on TorchBench model `hf_Reformer` with `randn`:
```
random_rotations = torch.randn(rotations_shape, device=vectors.device, dtype=vectors.dtype)
```

For the cpp wrapper, when `kwargs` is not empty for an `OpOverloadPacket` kernel, we need to know the exact overload schema to handle the `kwargs` properly when calling the cpp kernel: this includes finding the correct order of the kwargs and getting the default value for optional args with no provided value when calling the function (`layout` in the above case).

The current support in this PR is conservative and we'll extend the functionality in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104575
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-15 03:33:44 +00:00
ffce2492af Remove set_default_dtype calls from jit and ops tests (#105072)
Part of #68972

This only attempts to avoid setting the default dtype for `test_jit.py` and `test_ops.py`. There are other tests, like `test_nn.py`, which will be addressed in follow up PRs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105072
Approved by: https://github.com/ezyang
2023-07-15 03:18:33 +00:00
82fb5edfc7 SetVariable in dynamo (#103205)
Set initial
Fixes https://github.com/pytorch/pytorch/issues/94738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103205
Approved by: https://github.com/jansel
2023-07-15 02:25:31 +00:00
a155c68e6d [MPI] Allow previously initialized (#105023)
This pull request fixes #33943 for those applications that initialize MPI before `init_process_group("mpi", ...)` call, including `mpi4py`, some LibTorch applications and beyond.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105023
Approved by: https://github.com/H-Huang
2023-07-15 01:24:56 +00:00
15411b8afd [ONNX] Make unsupported node analysis result deterministic (#105231)
Replace `Set` with `Dict` for node.target to keep insertion order.

Fixes https://github.com/pytorch/pytorch/issues/105200
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105231
Approved by: https://github.com/thiagocrepaldi
2023-07-15 01:15:05 +00:00
d438b99823 [reland][inductor] fix a custom_op test problem (#105234)
Summary: reland https://github.com/pytorch/pytorch/pull/104972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105234
Approved by: https://github.com/clee2000
2023-07-15 01:01:12 +00:00
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3dd4e66059522bf5f5c1ba0431e2069.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
fb376f80a2 [retry][dynamo][numpy] Add support for np.dtype (#105034)
Original PR: #103546

Trying to support numpy function call in dynamo, with numpy dtype as argument.

For example:

```
def fn(x: int):
    return np.empty_like(x, dtype=np.float64)
```

This currently doesn't work because `NumpyVariable` doesn't implement `as_proxy()`. The idea in `as_proxy()` for now is to convert `np.float64` and other np.<dtype> objects into `str` and then feed that into the corresponding `torch_np` method. The assumption here is that all `torch_np` methods that take a `dtype` kwarg will also be able to take a `str` as `dtype`. This assumption holds for `numpy`.

For the previous example, we convert `np.float64` to `"float64"` in `as_proxy()` and then feed it into the `torch_np.empty_like()` method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105034
Approved by: https://github.com/voznesenskym
2023-07-14 21:36:36 +00:00
7e72126487 [pt2] add decomps for multi_margin_loss ops (#104578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104578
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-07-14 21:16:09 +00:00
0a6888243b multi_margin_loss: check weight shape, make contiguous on CPU, add tests (#104852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104852
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
de67b52a88 Unify multi_margin_loss_shape_check on CPU and CUDA (#104851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104851
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
0c89596e4f [OpInfo] add reference and error inputs for multi_margin_loss (#104850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
d06e1df1aa [torchgen] Rename executorch's RuntimeContext to KernelRuntimeContext (#104892)
Rename the context type to match changes in executorch.

Differential Revision: [D46977359](https://our.internmc.facebook.com/intern/diff/D46977359/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104892
Approved by: https://github.com/larryliu0820
2023-07-14 21:15:50 +00:00
99ab2ad677 [nightly] Fix macos nightly conda builds due to miniconda update (#105226)
This is to fix failing macos conda nightly builds: https://github.com/pytorch/pytorch/actions/runs/5551435365/jobs/10147149169

Accompanying builder PR: https://github.com/pytorch/builder/pull/1460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105226
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2023-07-14 21:03:36 +00:00
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export. deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
1518d5eec4 Update Documentation for TripletMarginLoss (#105115)
This PR updates the documentation for `TripletMarginLoss` in `torch.nn`. The previous version of the documentation didn't mention the parameter `eps` used for numerical stability.

This PR does the following:
1. Describes the purpose and use of the `eps` parameter in the `TripletMarginLoss` class documentation.
2. Includes `eps` in the example usage of `TripletMarginLoss`.

Please review this update for completeness with respect to the `TripletMarginLoss` functionality. If there are any issues or further changes needed, please let me know.
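A minimal usage sketch showing where `eps` appears (values here are just the documented defaults):

```python
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=1.0, p=2, eps=1e-6)
anchor, positive, negative = (torch.randn(8, 128, requires_grad=True) for _ in range(3))
loss = loss_fn(anchor, positive, negative)
loss.backward()
```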

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105115
Approved by: https://github.com/mikaylagawarecki
2023-07-14 20:04:25 +00:00
cff5d6a22c [inductor] Implement bucketize() for dependencies.py (#105102)
dependencies.py is used for tracking reads and writes, which is used for identifying dependencies between buffers: i.e. if buffer X reads buffer Y, then X depends on Y. ops.bucketize() reads from an offsets tensor, so we should track it in dependencies.py to correctly track dependencies. Since bucketize performs a binary search over the offsets tensor, the dependency is marked as a StarDep to indicate that the entire tensor is needed.

Use case: we find that jagged tensor dense_to_jagged ops - which use bucketize() to map jagged indices to dense indices - perform better if the bucketize() kernel is separated from the gather kernel. Previously, because bucketize() wasn't marked as reading anything, it would just get inlined.

Differential Revision: [D47422704](https://our.internmc.facebook.com/intern/diff/D47422704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105102
Approved by: https://github.com/eellison
2023-07-14 19:54:06 +00:00
4fc408ded2 [ROCm] Add AMD devs as owners for hipify code (#105080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105080
Approved by: https://github.com/malfet
2023-07-14 19:51:31 +00:00
bf46b6653f [export] Allow optional call-spec (#105179)
Summary: Submodules may have None call-spec values, which is OK. Updating types + serializer to handle this.

Test Plan: CI

Reviewed By: ydwu4, zhxchen17

Differential Revision: D47353101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105179
Approved by: https://github.com/zhxchen17
2023-07-14 19:11:47 +00:00
d3b96969a0 Upgrade CI to ROCm5.6 (#103092)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103092
Approved by: https://github.com/malfet
2023-07-14 19:10:28 +00:00
fcbe1be8f9 Update torchbench.txt to include SAM (#105222)
Goal is to include 745644f391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105222
Approved by: https://github.com/ezyang, https://github.com/cpuhrsch
2023-07-14 18:37:30 +00:00
d855c6c7de [PyTorch-TB] Write full tensor as tensor proto (#105186)
Write full tensor as tensor proto
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105186
Approved by: https://github.com/atalman
2023-07-14 18:04:09 +00:00
233f917c83 Revert "Add torch.ops.out_dtype (#103333)"
This reverts commit 7c10b58c5fb1e007801ff8f781eda72e4724994f.

Reverted https://github.com/pytorch/pytorch/pull/103333 on behalf of https://github.com/atalman due to broke trunk win-vs2019-cpu-py3 ([comment](https://github.com/pytorch/pytorch/pull/103333#issuecomment-1636195679))
2023-07-14 17:59:25 +00:00
0401ffa83f s390x: fix special_hermite_polynomial_h for '+/-inf' and '+/-nan' (#104705)
On s390x, static_cast may return a big positive number; in that case an uninitialized value of 'r' is returned. In case of +/-inf or +/-nan, use -1 explicitly.
Also initialize 'r' to 0 in case 'n+n' overflows anyway.

This change fixes
test_vmap_exhaustive_special_hermite_polynomial_h_cpu_float32 from test/functorch/test_vmap.py on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104705
Approved by: https://github.com/ezyang
2023-07-14 17:55:45 +00:00
17250976f3 correct empty tensor mps all operation (#105218)
Fixes #104694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105218
Approved by: https://github.com/ezyang, https://github.com/kulinseth
2023-07-14 17:42:54 +00:00
15ea0a00cb Fix RRef type annotations (#104876)
Test Plan: Sandcastle

Reviewed By: H-Huang

Differential Revision: D47334579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104876
Approved by: https://github.com/H-Huang
2023-07-14 17:31:51 +00:00
c0a278d6f0 Upload all test stats only if the workflow is from pytorch/pytorch main (#105087)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105087
Approved by: https://github.com/huydhn
2023-07-14 17:07:48 +00:00
7c10b58c5f Add torch.ops.out_dtype (#103333)
https://docs.google.com/document/d/10DYFG2sU3TSvguFP5kYwYLlo45KHFg3BhBOkUk0NKsU/edit#bookmark=id.hgfzmhlzkamk

Renamed mixed_dtype --> out_dtype because "mixed_dtype is not very descriptive in the context of regular pytorch where we support type promotion on most ops"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103333
Approved by: https://github.com/zou3519
2023-07-14 16:40:05 +00:00
10cbc9a063 Enable cuda graphs for dynamic shapes (#105064)
The general idea is to do a separate CUDA graph for each size. Because of cuda graph trees, these graphs will all share the same memory pool, so your memory usage will only be the worst case memory usage of the biggest dynamic size you want. This requires an extra dispatch in the cudagraphified callable. You must pay for a CUDA graph recording for every dynamic size you encounter, but this is MUCH cheaper than running the entire PT2 compile stack, so I expect you to still see benefits.
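A sketch of the combination this enables (assumes a CUDA device; cudagraphs are turned on here via the documented `reduce-overhead` mode):

```python
import torch

@torch.compile(mode="reduce-overhead", dynamic=True)
def step(x):
    return (x * x).sum()

# Each new size pays one CUDA graph recording; the graphs share one memory pool.
for n in (8, 16, 32):
    step(torch.randn(n, device="cuda"))
```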

This was surprisingly easy to do.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105064
Approved by: https://github.com/voznesenskym
2023-07-14 16:13:50 +00:00
9fc3a22731 Turn on typechecking in cudagraph_trees (#105151)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105151
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-07-14 16:13:50 +00:00
1646d6f939 Revert "Merge and improve torch optim optimizer type stubs (#102593)"
This reverts commit 3279f06410032e9798e380cedf552f5b706ac6c1.

Reverted https://github.com/pytorch/pytorch/pull/102593 on behalf of https://github.com/malfet due to There is nothing wrong with this PR, but it fails some internal builds that depend on outdated typing_extensions, will reland when update is done ([comment](https://github.com/pytorch/pytorch/pull/102593#issuecomment-1636062515))
2023-07-14 16:04:54 +00:00
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262f82bbc76aa776119c9fea079fbffe3.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to It's dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
1c69f363c4 Revert "Transmute refined SymInt into int (#104828)"
This reverts commit 0f322a300efe588a4f7ae61dabcfd4fe0aa9e225.

Reverted https://github.com/pytorch/pytorch/pull/104828 on behalf of https://github.com/ezyang due to executorch failure ([comment](https://github.com/pytorch/pytorch/pull/104828#issuecomment-1635997559))
2023-07-14 15:08:11 +00:00
f03a8f0589 [reland] Deprecate registering autograd kernels at not an autograd key (#105078)
Summary:
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.

If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8

It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).

Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.

The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible

There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).

Test Plan:
- bunch of new tests

BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).

If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.

Differential Revision: D47408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105078
Approved by: https://github.com/soulitzer
2023-07-14 15:03:07 +00:00
b4d91b1c5b Revert "[Typing] Fix PEP 484 Violation (#105022)"
This reverts commit 4148b7badacace65b8d6309f3f364569c2b0e6a4.

Reverted https://github.com/pytorch/pytorch/pull/105022 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/105022#issuecomment-1635967734))
2023-07-14 14:45:09 +00:00
528ab477ce [reland][inductor] Register an op for mm_plus_mm (#105153)
Summary: Reland https://github.com/pytorch/pytorch/pull/104835 after fixing internal build issues

Test Plan: CI

Differential Revision: D47442849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105153
Approved by: https://github.com/clee2000
2023-07-14 14:35:29 +00:00
c099b7e07a ValueRange analysis for indirect indexing (#102611)
We do so by forwarding ValueRange analysis from IR buffers to CSEvars

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102611
Approved by: https://github.com/eellison, https://github.com/peterbell10
2023-07-14 13:43:05 +00:00
87a3ed58cb Fix ranges for range vars (#104987)
Ranges are inclusive on both ends...

We take this chance to delete a stale comment

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104987
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-14 13:43:05 +00:00
88dcecdf54 Remove unnecessary casting in triton (#104975)
This used to be necessary before we advanced the pin past https://github.com/openai/triton/pull/1641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104975
Approved by: https://github.com/peterbell10, https://github.com/Chillee
2023-07-14 13:43:05 +00:00
dc4a0582fb fix hash_storage's padding calculation (#105036)
Fixes #105035.

The existing implementation attempts to make `x.numel() % 4 == 0` by appending `x.numel() % 4` zeros. This is backwards, e.g. if `x.numel() % 4 == 1`, we need to append `[0, 0, 0]`, not `[0]`.
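
A small sketch of the corrected arithmetic (hypothetical helper name, not the actual hash_storage code):
```python
# Pad numel up to the next multiple of 4: append (-numel) % 4 zeros,
# not numel % 4 zeros.
def padding_needed(numel: int) -> int:
    return (-numel) % 4

assert padding_needed(5) == 3  # numel % 4 == 1 -> append [0, 0, 0]
assert padding_needed(8) == 0  # already a multiple of 4 -> append nothing
```
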
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105036
Approved by: https://github.com/soulitzer, https://github.com/ezyang
2023-07-14 13:38:08 +00:00
8e01f75b1b New Mod class for SymPy expressions. (#104968)
This PR introduces a new `Mod` class to be used with SymPy expressions, mainly because of SymPy simplification errors (#97792).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104968
Approved by: https://github.com/ezyang
2023-07-14 13:34:52 +00:00
068f163dd3 [dynamo] Maintainable code - Move export impl to a different file (#105071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105071
Approved by: https://github.com/voznesenskym
2023-07-14 09:28:33 +00:00
b7b44e766b [Checkpoint] Separate implementation into generator (#105101)
Separates the non-reentrant AC implementation into a generator so that
other APIs such as composable checkpoint API can use the generator as pre and
post forward logic.

Differential Revision: [D47419387](https://our.internmc.facebook.com/intern/diff/D47419387/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105101
Approved by: https://github.com/soulitzer
2023-07-14 06:27:13 +00:00
6871cf6e1e refactor GeneratorForPrivateuseone (#105038)
1. Modify the usage of std::mutex
2. restrict symbol scope
3. code format

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105038
Approved by: https://github.com/soulitzer
2023-07-14 06:10:11 +00:00
90b50f0303 [quant][pt2e] change internal code to only import from _quantize_pt2e (#105162)
Summary: This is to make public api clear so that we can make implementation details change easier in the future

Test Plan: CIs

Differential Revision: D47445767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105162
Approved by: https://github.com/andrewor14
2023-07-14 05:14:29 +00:00
893983e59f Use GitHub REST API to get the merge base commit SHA (#105098)
Fixes https://github.com/pytorch/pytorch/issues/104713

### Testing
Manual testing locally using #104121 and confirming that the correct merge base commit is returned [803c14490b189f9b755ecb9f2a969876088ea243](1cb87771c1) instead of the wrong value provided by `baseRefOid` (de7b6e55eb0f87f8d822f69bad6b4189a857b460). Here is the JSON output of the GraphQL query for PR info: https://paste.sh/TJ-QQWz4#fvE3Y6qoJ8vDkRBZ3vowkZ3m

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105098
Approved by: https://github.com/malfet
2023-07-14 04:25:45 +00:00
9942a14e96 Fix torch.compile g++ flag error on ppc64le (#104956)
The g++ flag -march is not recognised on ppc64le, so we add a check for the platform machine being ppc64le and use the -mcpu flag instead. Other architectures will still use the -march flag.

This fixes the torch.compile feature failure on ppc64le
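
A hedged sketch of the check described above (the real change lives in the compile-flag handling; the flag values here are illustrative assumptions):
```python
import platform

def vectorization_flag() -> str:
    # ppc64le g++ does not accept -march, so fall back to -mcpu there.
    return "-mcpu=native" if platform.machine() == "ppc64le" else "-march=native"

print(vectorization_flag())
```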

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104956
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-14 04:09:17 +00:00
a66f08d626 enable channels last for replication padding on CPU (#102597)
Enable channels last support for replication padding on CPU. This patch adds channels last support for ReplicationPad2d/3d on the CPU backend. The following test cases will pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReplicationPad3d_cpu_float32
```
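
A short usage sketch for the channels-last (NHWC) path exercised by the benchmark below (standard public API; the memory-format round-trip is the behavior this patch is expected to enable):
```python
import torch

x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)
out = torch.nn.ReplicationPad2d((2, 2, 2, 2))(x)
# With this patch, the output is expected to stay channels-last.
print(out.is_contiguous(memory_format=torch.channels_last))
```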

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NHWC: 0.339 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NHWC: 82.935 ms

(after)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.324 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) ,  NHWC: 16.717 ms
```

### single socket inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NHWC: 0.135 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NHWC: 7.203 ms

(after)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.029 ms
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) ,  NHWC: 3.174 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102597
Approved by: https://github.com/CaoE, https://github.com/cpuhrsch
2023-07-14 03:44:55 +00:00
c1877c741c [vision hash update] update the pinned vision hash (#105191)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105191
Approved by: https://github.com/pytorchbot
2023-07-14 03:38:31 +00:00
4328138c1e AOT inductor: error: ‘c10::Dispatcher’ has not been declared (#104742)
Differential Revision: D47275262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104742
Approved by: https://github.com/desertfire
2023-07-14 01:47:52 +00:00
46104882d7 inductor: enable cpu fusion for dynamic shapes path (#104945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104945
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-14 00:29:55 +00:00
8af8e1fe36 Add torch._dynamo.maybe_mark_dynamic (#105145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105145
Approved by: https://github.com/aakhundov, https://github.com/Chillee
2023-07-14 00:29:16 +00:00
8a6e5d7cc7 CUDAGraph trees real inputs should never be SymInt (#105148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105148
Approved by: https://github.com/Skylion007, https://github.com/eellison
2023-07-14 00:28:31 +00:00
d7e6040efa Update sparse semi-structured linear operator (#104608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104608
Approved by: https://github.com/cpuhrsch
2023-07-13 23:52:39 +00:00
b88b742db8 fixed torch.manual_seed note (#105175)
Fixes https://github.com/pytorch/pytorch/issues/87509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105175
Approved by: https://github.com/ezyang
2023-07-13 23:43:44 +00:00
85745cd3d9 Fix bug in fuse_modules (#105069)
Summary: This diff fixes the issue reported in https://github.com/pytorch/pytorch/issues/105063 and also related to internal caffe2 bug (reproduced error in internal fb pytorch: N3945540)

Test Plan: Wait for sandcastle with the added unit test in caffe2/torch/ao/quantization/eager/test_fuse_eager

Differential Revision: D47402357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105069
Approved by: https://github.com/jerryzh168
2023-07-13 23:39:59 +00:00
b33d63d97b [BE] Use ValueError for input.dim check in torch.nn.modules (#105127)
Summary: Use ValueError for input.dim check instead of Assertion Error.

Fix: #104839

Test Plan: Please see GitHub actions.

Differential Revision: D47427998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105127
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-13 23:20:46 +00:00
cd15229950 [foreach][RMSprop] Minimize intermediates=2 to decrease peak memory (#105161)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105161
Approved by: https://github.com/albanD
2023-07-13 23:18:54 +00:00
219cf2a1c8 [foreach][ASGD] Minimize intermediates=1 to decrease peak memory (#105146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105146
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-07-13 23:18:54 +00:00
3a7d77f704 Serialize empty pytree cases (#105159)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105159
Approved by: https://github.com/zhxchen17
2023-07-13 23:02:59 +00:00
6c10edcb2d move kineto submodule commit (#105144)
kineto submodule pointer moved to 7c2c55054410346f1aa641256b82f6fb31d6c78f
```
Commit 7c2c55054410346f1aa641256b82f6fb31d6c78f (HEAD -> main, origin/main, origin/HEAD)
Author: Aaron Enye Shi <enye.shi@gmail.com>
Date:   Thu Jul 6 08:00:51 2023 -0700

    Update tb_plugin to new python and pytorch version (#778)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105144
Approved by: https://github.com/aaronenyeshi, https://github.com/Skylion007
2023-07-13 22:42:47 +00:00
485cad4a86 Dynamo tensor aliasing guards, dedup graphargs (#104921)
The story here is relatively simple - when we go to wrap a tensor, we (1) ensure that it is a real, not fake tensor (2) check if we have seen it before. (3) If we have seen it, we create a positive alias guard and return the associated variable. If not, we proceed.

By short-circuiting here, we avoid lifting it to a graph input and guarantee that the names passed to tensors are unique. This allows us to guard on the unique relationships (PyObject addresses, aka IDs, cannot match) to give us guards for negative aliases.
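
A conceptual, self-contained sketch of the dedup-by-identity idea (names here are hypothetical, not dynamo's real internals):
```python
import torch

seen = {}  # id(tensor) -> graph-input name

def wrap_tensor(t: torch.Tensor) -> str:
    if id(t) in seen:
        return seen[id(t)]  # reuse instead of lifting a second graph input
    name = f"arg{len(seen)}"
    seen[id(t)] = name
    return name

x = torch.randn(3)
assert wrap_tensor(x) == wrap_tensor(x)  # same object -> same graph input
```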

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104921
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-07-13 22:18:08 +00:00
f987d11fa7 Reland: Make torch.empty* deterministic by filling with NaN or max int (#104995)
Relands #101849 after #104302 reverted it.

torchrec PR https://github.com/pytorch/torchrec/pull/1269 fixes the torchrec failure that caused #101849 to be reverted

Part of #82004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104995
Approved by: https://github.com/albanD
2023-07-13 22:18:03 +00:00
42530c17fc [ONNX] Fix UnsupportedFxNodesAnalysis after onnx dispatcher changes (#105156)
Simplifies the logic so it does not depend on info within the raised exception. Due to changes in the onnx dispatcher, the diagnostic within the raised exception is now different, which broke this pass's retrieval of the unsupported fx node kind. Adds a proper unittest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105156
Approved by: https://github.com/thiagocrepaldi
2023-07-13 21:46:29 +00:00
15c1e44d64 [ONNX] Apply param_manipulation.py from onnx-script to op validation and dispatcher (#104679)
Prior to this PR, we were comparing torch args/kwargs with the OnnxFunction OpSchema without normalizing the args/kwargs first. Essentially, the function signature differs between ATen and OnnxFunction, and onnx-script preprocesses these args/kwargs with an internal tool, `param_manipulation`, for both eager mode and graph mode. This PR uses that internal tool to normalize the torch args/kwargs before feeding them to OnnxFunction during op_level_debug and dispatching. The PR significantly reduces the need to rely on the nearest-matching mechanism when dispatching.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104679
Approved by: https://github.com/BowenBao
2023-07-13 21:16:35 +00:00
fc2f87b281 Add semi-structured sparse conversions (#103830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103830
Approved by: https://github.com/amjames, https://github.com/jcaip, https://github.com/cpuhrsch
2023-07-13 21:09:09 +00:00
15478a50ef Revert "[export] Allow optional call-spec (#105041)"
This reverts commit 194fe1d12f9860734cc28ed21bdabda2fbb06336.

Reverted https://github.com/pytorch/pytorch/pull/105041 on behalf of https://github.com/atalman due to broke lintrunner ([comment](https://github.com/pytorch/pytorch/pull/105041#issuecomment-1634911637))
2023-07-13 21:01:21 +00:00
df3a64fb3e [Dockerfile] Add make triton to the build target (#105114)
The Docker `build` layer is missing the `triton` dependency, so images built for this target cannot be used with `torch.compile`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105114
Approved by: https://github.com/malfet
2023-07-13 20:20:26 +00:00
ef05c5f202 Use plain power operator in Adam/Adamw when capturing (#104254)
The goal is to fix the problem from https://github.com/pytorch/pytorch/pull/102858

The full error this used to raise was:
```
2023-06-27T15:12:15.0663239Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/adamw.py", line 409, in _single_tensor_adamw
2023-06-27T15:12:15.0663699Z     bias_correction1 = 1 - beta1 ** step
2023-06-27T15:12:15.0664200Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 40, in wrapped
2023-06-27T15:12:15.0664547Z     return f(*args, **kwargs)
2023-06-27T15:12:15.0665031Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py", line 882, in __rpow__
2023-06-27T15:12:15.0665483Z     return torch.tensor(other, dtype=dtype, device=self.device) ** self
2023-06-27T15:12:15.0665899Z RuntimeError: CUDA error: operation not permitted when stream is capturing
2023-06-27T15:12:15.0666401Z CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
```

This pow issue was fixed in https://github.com/pytorch/pytorch/pull/104264 and so this problem should be solvable now.
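
A minimal sketch of the pattern the traceback points at (illustration only, not the optimizer code itself):
```python
import torch

beta1 = 0.9
step = torch.tensor(3.0)

# A Python float ** a tensor dispatches to Tensor.__rpow__, which used to
# build a new tensor via torch.tensor(...) -- an allocation that is not
# permitted while a CUDA graph is being captured.
bias_correction1 = 1 - beta1 ** step
print(bias_correction1)
```
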
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104254
Approved by: https://github.com/janeyx99, https://github.com/aws-murandoo
2023-07-13 19:24:25 +00:00
194fe1d12f [export] Allow optional call-spec (#105041)
Summary: Submodules may have None call-spec values, which is ok. Updating the types + serializer to handle this.

Test Plan: CI

Differential Revision: D47353101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105041
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2023-07-13 18:39:54 +00:00
b06a426390 Fix typo (#105130)
casual → causal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105130
Approved by: https://github.com/Skylion007
2023-07-13 18:35:57 +00:00
44c8515d0d SDPA: frontend for BSR masks (#104042)
This PR implements a (yet private) frontend for scaled_dot_product_attention that works with BSR `attn_mask`.

This function is directly comparable (with suitable masks) with `torch.nn.functional.scaled_dot_product_attention` once `attn_mask.dtype == torch.bool`, but its behavior is different when `attn_mask.dtype != torch.bool`. This is because `torch.nn.functional.scaled_dot_product_attention` assumes that irrelevant values are supposed to be filled with `-inf`, while the selected ones should be `0`.
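
A hedged illustration of that convention for the public scaled_dot_product_attention (kept positions carry 0, masked-out positions carry -inf):
```python
import torch

bool_mask = torch.tensor([[True, False], [True, True]])
additive_mask = torch.zeros(bool_mask.shape, dtype=torch.float32)
additive_mask.masked_fill_(~bool_mask, float("-inf"))
print(additive_mask)
```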

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104042
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-07-13 18:01:21 +00:00
05eea20eb9 [dynamo] Simulate torch function enablement state (#105091)
Part of https://github.com/pytorch/pytorch/issues/93723

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105091
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
2023-07-13 17:42:20 +00:00
87cf51cc7f Switch automatic_dynamic_shapes to True by default in fbcode (#104883)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104883
Approved by: https://github.com/xw285cornell
2023-07-13 17:37:57 +00:00
c36dca7bc5 Revert "[inductor] Register an op for mm_plus_mm (#104835)" (#105150)
This reverts commit 9c46a1620c99626ee9db01985a569ba79888508b.

Actual revert referenced in https://github.com/pytorch/pytorch/pull/105149

#104835 is causing internal builds to fail

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105150
Approved by: https://github.com/atalman
2023-07-13 17:13:45 +00:00
91c64f55ab Revert "[inductor] fix a custom_op test problem (#104972)" (#105149)
This reverts commit be76bfb743c941278cc3cf94816d2181f0a30867.

I need to revert https://github.com/pytorch/pytorch/pull/104835 and this is causing a merge conflict

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105149
Approved by: https://github.com/izaitsevfb, https://github.com/atalman
2023-07-13 17:06:09 +00:00
d1fedad080 Perform value range analysis with rationals when possible (#105137)
This is particularly useful for guards to avoid rounding errors, as most
guards (all?) are rational functions.
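
A toy illustration of the rounding issue that exact rational arithmetic avoids (not the actual value-range code):
```python
from fractions import Fraction

print(0.1 + 0.2 <= 0.3)                                      # False (rounding)
print(Fraction(1, 10) + Fraction(2, 10) <= Fraction(3, 10))  # True (exact)
```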

Fixes https://github.com/pytorch/pytorch/issues/105097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105137
Approved by: https://github.com/ezyang
2023-07-13 16:45:47 +00:00
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violation (i.e. when default arg is set to None, but type is not annotated as optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  -
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
f73757d551 enable channels last for reflection padding on CPU (#102518)
Add channels last support for reflection padding on CPU. The following test cases will pass with this patch:
```
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad2d_cpu_float32
python test_modules.py TestModuleCPU.test_memory_format_nn_ReflectionPad3d_cpu_float32
```

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.356 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) ,  NHWC: 86.821 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.328 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) ,  NHWC: 16.806 ms
```

### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.142 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) ,  NHWC: 7.367 ms

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) ,  NHWC: 0.027 ms
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NHWC: 3.181 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102518
Approved by: https://github.com/CaoE, https://github.com/cpuhrsch
2023-07-13 16:22:31 +00:00
d35137cc37 Revert "[PyTorch TB] Write raw tensor as tensor_proto (#104908)"
This reverts commit dceae41c29782399c84304812696a8382e9b4292.

Reverted https://github.com/pytorch/pytorch/pull/104908 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104908#issuecomment-1634532376))
2023-07-13 16:22:04 +00:00
e1502c0cdb Add some more files to ciflow/inductor (#105112)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105112
Approved by: https://github.com/albanD
2023-07-13 14:44:42 +00:00
c6b9c31a2c [inductor] fix incorrect strides in copy() decomp, fix hf_LongFormer + hf_BigBird errors (#100115)
Fixes https://github.com/pytorch/pytorch/issues/100067, https://github.com/pytorch/pytorch/issues/98268 and https://github.com/pytorch/pytorch/issues/93428.

See the comment [here](https://github.com/pytorch/pytorch/issues/100067#issuecomment-1523856970) for details. The bug was that the decomposition that inductor uses for `aten.copy` doesn't respect the strides of the input in all cases. The fixes that I added should work, but will be pretty slow - we allocate a tensor (potentially larger than `self` if `self` is a slice), and perform an `as_strided_scatter` + `as_strided`. Longer term, stride-agnostic IR should let us remove this decomp?  cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @anijain2305 @soumith @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100115
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-07-13 14:40:57 +00:00
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among the threads to accumulate intermediate data in fp32.
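
A conceptual sketch of the accumulate-in-fp32 idea using plain torch ops (not the CPU kernel itself):
```python
import torch

src = torch.randn(10, dtype=torch.bfloat16)
index = torch.randint(0, 4, (10,))

# Accumulate into a float32 buffer, then cast the result back to BFloat16.
buf = torch.zeros(4, dtype=torch.float32)
buf.scatter_add_(0, index, src.float())
out = buf.to(torch.bfloat16)
print(out)
```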

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
735e6ae801 [dynamo] Maintainable code - Move decorators in a separate file (#105070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105070
Approved by: https://github.com/ezyang
2023-07-13 07:41:19 +00:00
4a0d773a08 Update attention.cpp to remove warning about preferring torch.bool type (#103362)
Update attention.cpp to remove warning about preferring torch.bool data type

Fixes #100469 #97532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103362
Approved by: https://github.com/mikaylagawarecki
2023-07-13 07:07:46 +00:00
0f322a300e Transmute refined SymInt into int (#104828)
Previously, x.size(0) could return a SymInt, even when the internal
sympy expression was actually already constant (e.g., due to an
introduced guard.) We now allow querying the Python object with
maybe_as_int which allows us to transmute these objects back to
int when possible.

It is still possible to end up with a constant SymInt even after this
change, e.g., if you get out a SymInt and while holding onto it
specialize it, but casual users are more likely to get ints when they
want to.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104828
Approved by: https://github.com/Skylion007
2023-07-13 07:02:52 +00:00
242fc29c96 [FSDP] Refactor optimizer in backward (#104813)
1) Use zero_grad(set_to_none=True) to set grad to None, 2) call
prepare_grad_for_optim() before call to .step, 3) use
_reset_flat_param_grad_info to set flat param gradient back to None. These
changes should just be refactors and equivalent to how gradient memory was
managed  before.

Differential Revision: [D47310761](https://our.internmc.facebook.com/intern/diff/D47310761/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104813
Approved by: https://github.com/awgu
2023-07-13 06:42:53 +00:00
f2eed129c4 FSDP optimizer overlap (#98667)
constraints:

1. No support for gradient accumulation
2. CPU offload runs step() on CPU. In future PRs ideally we'd run this on GPU.
3. When CPU offload + optimizer overlap, we have to copy the flat_param grad to CPU with non_blocking=False, otherwise step() might run on invalid data.
4. Step is waited on in post backward final cb, when in theory it can wait until the next forward.

Differential Revision: [D44809582](https://our.internmc.facebook.com/intern/diff/D44809582/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98667
Approved by: https://github.com/awgu, https://github.com/fegin
2023-07-13 06:42:53 +00:00
1d02106e03 Preserve source_fn or nn_module_stack in the lifted params (#105017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105017
Approved by: https://github.com/angelayi
2023-07-13 06:03:28 +00:00
dceae41c29 [PyTorch TB] Write raw tensor as tensor_proto (#104908)
This is the first diff to support logging of raw tensors for [TensorBoard Intermediate Logging](https://www.internalfb.com/intern/wiki/TensorBoard/Intermediate_Logging/)

Ultimately, we aim to support the feature where the full tensor is stored as a tensor protobuf to TB. The protobuf contains the shape, dtype, and elements of the given tensor.

1. add `tensor_proto()` to `summary.py` which takes a tensor and convert to protobuf
2. add `add_tensor()` to `writer.py`
3. formatting changes introduced by `arc lint`
-------------

Differential Revision: [D47302017](https://our.internmc.facebook.com/intern/diff/D47302017/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104908
Approved by: https://github.com/kunalb
2023-07-13 05:30:50 +00:00
b99d605a30 Add meta registration for foreach_mul_ (#105107)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105107
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-07-13 04:45:22 +00:00
0faf8ed49f Skip TS backend in FBCODE (#104354)
Summary:
Fixes:
```
Traceback (most recent call last):
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/1a4194a16794cc72/caffe2/test/__torch__/torch#link-tree/torch/testing/_internal/common_device_type.py", line 543, in setUpClass
    torch._lazy.ts_backend.init()
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/1a4194a16794cc72/caffe2/test/__torch__/torch#link-tree/torch/_lazy/ts_backend.py", line 6, in init
    torch._C._lazy_ts_backend._init()
RuntimeError: TorchScript backend not yet supported in FBCODE/OVRSOURCE builds
```

Test Plan: Sandcastle

Differential Revision: D47093028

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104354
Approved by: https://github.com/malfet
2023-07-13 02:46:58 +00:00
0a20233e5b create mergability check (#105086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105086
Approved by: https://github.com/izaitsevfb
2023-07-13 02:21:27 +00:00
eqy
2c85f28c71 [CUDA][cudaMallocAsync] Reduce record-stream warning spam (#105015)
Addresses #104925

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105015
Approved by: https://github.com/eellison
2023-07-13 02:06:14 +00:00
7030403048 Fix initializer naming at torch.onnx.ExportOutput.save_model_with_external_data (#105002)
This PR is only relevant for the Fake tensor Mode ONNX export. For the conventional export, everything is unchanged.

* An optional `rename_initializer=False` argument is added to an internal function `torch/onnx/_internal/fx/serialization.py::save_model_with_external_data` which is used by the public API `ExportOutput.save`.
* The default behavior (`rename_initializer=False`) is meant to be used by the public API `torch.onnx.dynamo_export` with the default Dynamo-based FX tracer (`DynamoExport`). In this scenario, both the ONNX graph inputs and the initializers have matching names with `.` in them (e.g. `linear.weight`).
* `rename_initializer=True` is meant to be used by `torch.onnx.dynamo_export` with a non-publicly-supported FX tracer called `FXSymbolicTracer`. This tracer lifts the FX graph initializers as inputs before the FX->ONNX conversion starts, and because of this, the initializer names must be valid Python identifiers (meaning `.` is not allowed in argument names and must be replaced by `_` or similar). This causes the graph inputs to have names with `_` (e.g. `linear_weight`) while the initializers have `.` (e.g. `linear.weight`) in their names. This flag resolves the mismatch by replacing `.` with `_` when saving the ONNX proto (`save_model_with_external_data`).
* This PR also adds unit tests for numerical validation against pytorch eager for onnx export using dynamo-based fx tracer and fake mode enabled. (There are already tests for export with fx symbolic tracer with fake mode)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105002
Approved by: https://github.com/BowenBao
2023-07-13 02:03:16 +00:00
bf40561ab4 [ONNX] Support 'aten::randint' in torchscript onnx exporter (#105089)
Export as 'ONNX::RandomUniform', which produces a floating-point result, then round it to an integer with 'ONNX::Cast'.

Fixes https://github.com/microsoft/onnx-converters-private/issues/173
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105089
Approved by: https://github.com/thiagocrepaldi
2023-07-13 01:50:03 +00:00
9647a251cb [dynamo] Dataclass variables with default field (#104840)
The main complexity comes from the __init__ function of Dataclass variables, which looks something like this:

```
[2023-07-10 05:01:29,548] torch._dynamo.symbolic_convert: [DEBUG] INLINING <code object __init__ at 0x7f7015154450, file "<string>", line 2>
  3           0 LOAD_FAST                1 (b)
              2 LOAD_FAST                0 (self)
              4 STORE_ATTR               0 (b)

  4           6 LOAD_FAST                2 (named_tensors)
              8 LOAD_DEREF               0 (_HAS_DEFAULT_FACTORY)
             10 IS_OP                    0
             12 POP_JUMP_IF_FALSE       20
             14 LOAD_DEREF               1 (_dflt_named_tensors)
             16 CALL_FUNCTION            0
             18 JUMP_FORWARD             2 (to 22)
        >>   20 LOAD_FAST                2 (named_tensors)
        >>   22 LOAD_FAST                0 (self)
             24 STORE_ATTR               1 (named_tensors)
             26 LOAD_CONST               0 (None)
             28 RETURN_VALUE
```

There are multiple issues:
* The VariableBuilder call in functions.py was wrong. We were calling *options as args.
* We were not setting the source while tracking the new object. This led to no source for the Dataclass variable, which has some new variables in its closures, as seen in the above bytecode.
* There is an IS_OP in the above bytecode, which brings more cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104840
Approved by: https://github.com/jansel
2023-07-13 01:25:57 +00:00
601db856d1 elevated cudagraphs failure to warning, added lineno to recompiles (#105081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105081
Approved by: https://github.com/mlazos
2023-07-13 01:17:58 +00:00
3fe2b73416 Update use_mkldnn in LSTM op to avoid input and parameter not in the same device (#102050)
This PR is to fix https://github.com/pytorch/pytorch/issues/101935.

Only when the input, parameters, and hidden states are all on the CPU device will LSTM take the oneDNN fast-path implementation. Otherwise, it will fall back to the original implementation.

Note that if the input and parameters are indeed not on the same device, it will hit the error `Input and parameter tensors are not at the same device, found input tensor......` in `check_attributes`. Therefore, the proper usage of LSTM is `input.to(device)` and `model.to(device)` together.
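
A usage sketch of that convention (keep the module and its input on the same device):
```python
import torch

device = "cpu"
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, num_layers=1).to(device)
x = torch.randn(5, 2, 8, device=device)  # (seq_len, batch, input_size)
out, (h, c) = lstm(x)
print(out.shape)
```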

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102050
Approved by: https://github.com/XiaobingSuper, https://github.com/albanD
2023-07-13 01:13:59 +00:00
5b4aacd691 Revert "[DCP] Add FsspecReader and FsspecWriter to checkpoint __init__.py (#105088)"
This reverts commit 76a053d55cb23948c7b331a48f921744db24601e.

Reverted https://github.com/pytorch/pytorch/pull/105088 on behalf of https://github.com/atalman due to broke trunk and  linux-focal-py3.9-clang7-asan ([comment](https://github.com/pytorch/pytorch/pull/105088#issuecomment-1633385350))
2023-07-13 00:59:55 +00:00
954bae8e53 [FSDP][Easy] Rename streams; add back stream sharing test (#104966)
Purely out of preference, this PR renames the streams to `_unshard_stream` instead of `_streams_unshard` etc. since the former reads more naturally. The PR also removes some duplicated comments and adds back a unit test that streams are shared.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104966
Approved by: https://github.com/rohan-varma
2023-07-13 00:24:41 +00:00
59bb07ca46 Update vendored version of pythoncapi_compat (#105083)
In preparation for 3.12 support
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105083
Approved by: https://github.com/Skylion007
2023-07-12 23:43:11 +00:00
4f8ba6f8f6 [DeviceMesh]Add validate mesh flag to DeviceMesh (#104807)
When creating DeviceMesh, _init_process_group() would validate that all calling ranks pass in the same `mesh` argument. In FSDP, we are currently creating the DeviceMesh based on the pg of the root state so the mesh will always be valid. Adding the flag to DeviceMesh, so we can skip the all_gather_tensor of the validation during construction time.

_validate_mesh defaults to True, but we manually flip it to False when initializing the device mesh in FSDP's _runtime_utils.py.

We will modify the logic to skip pg creation if one already exists, for both the 1D and 2D cases, and then delete the _init_process_groups flag in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104807
Approved by: https://github.com/wanchaol
2023-07-12 23:42:13 +00:00
76a053d55c [DCP] Add FsspecReader and FsspecWriter to checkpoint __init__.py (#105088)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105088
Approved by: https://github.com/kumpera
2023-07-12 23:40:35 +00:00
15c67ca95c Update troubleshooting.rst (#105018)
Update with `TORCH_LOGS` information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105018
Approved by: https://github.com/mlazos
2023-07-12 21:42:53 +00:00
cf9d784e32 Skip test_indirect_device_assert in fbcode (#105065)
It spawns a python subprocess, but this "python" isn't really what we
want, since it doesn't have torch, etc.  There are probably ways to make this
work but it's not worth figuring it out.

Differential Revision: [D47402347](https://our.internmc.facebook.com/intern/diff/D47402347/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105065
Approved by: https://github.com/ezyang
2023-07-12 21:26:53 +00:00
398606e1c4 Fix bug when an index appears in two expressions (#104886)
We were not adding the bounds to `replacement_vars` in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104886
Approved by: https://github.com/eellison, https://github.com/Skylion007
2023-07-12 21:26:30 +00:00
2563079d59 [ONNX] Allow None as operator argument (#105040)
Needed by 'aten.index.Tensor', where 'indices' is list of optional
tensors.

Related https://github.com/microsoft/onnxscript/pull/862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105040
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-07-12 21:11:25 +00:00
f0ed71273e Make ops functional (#105000)
When you run in DEBUG=1 mode, these ops error because the check thinks the implementation is not functional even though the schema claims it is functional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105000
Approved by: https://github.com/angelayi
2023-07-12 20:46:39 +00:00
d77a2d8fe3 Remove shard ID and unstable suffix when comparing failed job names with the base commit (#104821)
Fixes https://github.com/pytorch/test-infra/issues/4328

This goes together with https://github.com/pytorch/test-infra/pull/4353 and it updates `trymerge` to remove shard ID and the `unstable` suffix when comparing failed job names with the base commit.

### Testing

Add unit tests with the reported issue as a test case https://github.com/pytorch/pytorch/pull/104214 to make sure that the failure there is reported as ignorable broken trunk instead of a new failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104821
Approved by: https://github.com/malfet
2023-07-12 19:51:18 +00:00
64bbe61600 Fix lint: [PyTorch] Add Vulkan support for at::softmax 1,2,3 dimension tensors (#105082)
Fix lint.
Follow up on: https://github.com/pytorch/pytorch/pull/105012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105082
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2023-07-12 19:47:34 +00:00
fc012d716d [core] Bring cpu device module closer to cuda's. (#103172)
By implementing some of the functionality used by CUDA we make
implementing device agnostic code a lot easier.

With this set of changes it's now possible to get FSDP wrap a trivial
module. FWD/BWD still TBD.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103172
Approved by: https://github.com/wz337, https://github.com/wanchaol
2023-07-12 19:43:22 +00:00
66fb83293e [inductor] Add min/max to index propagation pass (#105020)
This allows `ops.minimum` and `ops.maximum` to be hoisted for indirect indexing
into direct indexing expressions. I also add support to the cpp printer for
Min/Max and fix the triton printer to support multi-argument Min/Max.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105020
Approved by: https://github.com/lezcano
2023-07-12 19:03:01 +00:00
06a5df8d31 Revert "Transmute refined SymInt into int (#104828)"
This reverts commit 4694f5435657b157a37ccec0d4a90b27c4b003c7.

Reverted https://github.com/pytorch/pytorch/pull/104828 on behalf of https://github.com/ezyang due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/104828#issuecomment-1633049980))
2023-07-12 18:57:58 +00:00
246dc0d9f2 [MTPG] Use TLS propagation to enable MTPG from bwd. (#104735)
We use PyTorch's built-in tls propagation in ThreadLocalState to forward the world object
from the fwd thread to the bwd thread.

This further closes the gap on enabling FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104735
Approved by: https://github.com/rohan-varma
2023-07-12 18:47:02 +00:00
43c94360e2 [PyTorch] Add Vulkan support for at::softmax 1,2,3 dimension tensors (#105012)
Summary: This rounds out the support for the [softmax function](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) on the Vulkan GPU backend. The test inputs of the 1,2,3 dimension cases are simply the truncated existing 4 dimension inputs. The existing shader algorithms are reused.

Test Plan:
1. `buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the added tests `*softmax*` pass under `-- --gtest_filter="*softmax*"`
2a. All tests P782531732
2b. `softmax` tests P782529114

```
~/fbsource » buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*softmax*"
Buck UI: https://www.internalfb.com/buck2/692eb82d-c2ee-49bb-833f-3c11d6e2fea9
Jobs completed: 4. Time elapsed: 0.1s.
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *softmax*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.softmax
[       OK ] VulkanAPITest.softmax (42 ms)
[ DISABLED ] VulkanAPITest.DISABLED_log_softmax
[----------] 1 test from VulkanAPITest (42 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (42 ms total)
[  PASSED  ] 1 test.

  YOU HAVE 1 DISABLED TEST

```

Reviewed By: SS-JIA

Differential Revision: D46985319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105012
Approved by: https://github.com/SS-JIA
2023-07-12 18:41:03 +00:00
08cbfb2a58 Avoid tensor creation and use scalar overload (#104264)
I would expect this preserves the behavior but there might be weird edge cases?
@mruberry might know?

The aim is to fix https://github.com/pytorch/pytorch/pull/104254 (and make `1 ** t` capturable via cudagraph)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104264
Approved by: https://github.com/zou3519
2023-07-12 18:11:27 +00:00
16d3638c11 Add best practices for CPU backend doc (#105051)
Content same as #103948
@svekars the PR content is updated per your comment, but when trying to solve the conflict the original PR was closed by a mis-operation. Would you help handle this new one? sorry for the inconvenience.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105051
Approved by: https://github.com/svekars
2023-07-12 18:04:51 +00:00
be76bfb743 [inductor] fix a custom_op test problem (#104972)
Summary: https://github.com/pytorch/pytorch/pull/104349 added a test
which sometimes triggers duplicated op registration on CI, e.g.
https://github.com/pytorch/pytorch/issues/104856. This PR fixes it by
only registrating the custom op once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104972
Approved by: https://github.com/eellison
2023-07-12 18:00:28 +00:00
5c7e826f4d [ONNX][TypePromo] Introduce ReductionTypePromotionRule (#104491)
Introduce `ReductionTypePromotionRule` and rename `TypePromotionRule` as
`ElementwiseTypePromotionRule`. Created base abstract class `TypePromotionRule`.
Reduction rules are manually curated because the total number of ops is low, yet
most of them require some special treatment. The list that are covered in our unittest is

    - "all", done
    - "amax", done
    - "amin", done
    - "any", done
    - "cumsum", done
    - "cumprod", no torchlib impl
    - "mean", done
    - "std", no torchlib impl
    - "std_mean", no torchlib impl
    - "sum", done
    - "sum_to_size", no torchlib impl
    - "prod", no torchlib impl
    - "var", no torchlib impl
    - "var_mean", tricky. Node has multiple outputs. Follow up in separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104491
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-07-12 17:53:59 +00:00
0e7529940d Revert "Switch automatic_dynamic_shapes to True by default in fbcode (#104883)"
This reverts commit d1ca98665f15b6d71523048e3d6b0c9cfa3c2d1d.

Reverted https://github.com/pytorch/pytorch/pull/104883 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104883#issuecomment-1632931223))
2023-07-12 17:30:18 +00:00
4694f54356 Transmute refined SymInt into int (#104828)
Previously, x.size(0) could return a SymInt, even when the internal
sympy expression was actually already constant (e.g., due to an
introduced guard.) We now allow querying the Python object with
maybe_as_int which allows us to transmute these objects back to
int when possible.

It is still possible to end up with a constant SymInt even after this
change, e.g., if you get out a SymInt and while holding onto it
specialize it, but casual users are more likely to get ints when they
want to.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104828
Approved by: https://github.com/Skylion007
2023-07-12 16:40:21 +00:00
1ecef7d805 Remove unused private code from ATEN (#104751)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 52dac58</samp>

Add support for `torch.linalg.cholesky_ex` function that returns the Cholesky factorization and an error indicator. Refactor existing `torch.cholesky` and `torch.linalg.cholesky` to use the new function internally. Update tests and documentation accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104751
Approved by: https://github.com/albanD
2023-07-12 16:09:42 +00:00
979f826015 Read out real strides from compilation result, rather than real args (#105010)
This prefigures a refactor that will move the backward compilation
to entirely ahead of time, so I need to extract these strides some
other way.  Straight from the compiler's mouth will do it.

I can't easily get the information via the return result of `fw_compiler` without changing the calling convention, so instead I smuggle it via TracingContext. TracingContext may be None when we are compiling patterns for the joint graph pattern matcher.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105010
Approved by: https://github.com/shunting314
2023-07-12 11:33:08 +00:00
4148b7bada [Typing] Fix PEP 484 Violation (#105022)
Not sure how it worked before, but arguments must be annotated as Optional if they are defaulted to None.

Towards enabling mypy-1.4.1 in lintrunner

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 5e1b9f4</samp>

> _We annotate the arguments of doom_
> _To show the `None` values of gloom_
> _We improve the type checking and readability_
> _With `Optional` annotations of metal-ity_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105022
Approved by: https://github.com/izaitsevfb, https://github.com/huydhn, https://github.com/Skylion007
2023-07-12 10:20:48 +00:00
603a777b09 [PyTorch TB] Refactor formatting (#105027)
This is the first diff to support logging of raw tensors for [TensorBoard Intermediate Logging](https://www.internalfb.com/intern/wiki/TensorBoard/Intermediate_Logging/)

Ultimately, we aim to support the feature where the full tensor is stored as a tensor protobuf to TB. The protobuf contains the shape, dtype, and elements of the given tensor.

This diff only contains formatting changes.
-------------

Differential Revision: [D47302017](https://our.internmc.facebook.com/intern/diff/D47302017/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105027
Approved by: https://github.com/kunalb
2023-07-12 06:08:18 +00:00
c7a76d9be5 Replace use of first_layer in init with encoder_layer argument to init (#104058)
Summary:
Replace use of `first_layer` in init with `encoder_layer` argument to init
(better eng)

Test Plan: sandcastle, github CI

Differential Revision: D46940537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104058
Approved by: https://github.com/mikaylagawarecki
2023-07-12 05:31:15 +00:00
c03558fa8d [doc] apply weight after p in MultiMarginLoss (#104844)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104844
Approved by: https://github.com/lezcano
2023-07-12 03:42:14 +00:00
0bc382ea55 [foreach][Adamax] Minimize intermediates=1 to decrease peak memory (#104991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104991
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:17 +00:00
ea6a563a8c [foreach][Adagrad] Minimize intermediates=2 to decrease peak memory (#104988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104988
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:17 +00:00
455f495f04 [foreach][Adadelta] Minimize intermediates=3 to decrease peak memory (#104983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104983
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-12 03:09:15 +00:00
9c46a1620c [inductor] Register an op for mm_plus_mm (#104835)
Summary: Currently the aten version of mm_plus_mm has no cpp
implementation, and thus cpp_wrapper can not generate the correct cpp
function call for it.

Differential Revision: [D47372057](https://our.internmc.facebook.com/intern/diff/D47372057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104835
Approved by: https://github.com/jansel, https://github.com/SherlockNoMad
2023-07-12 02:34:02 +00:00
5f2a76ddf7 inductor: fix LoweringException of AdaptiveAvgPool with output_size 0 (#104691)
Fix #104618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104691
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
2023-07-12 02:28:13 +00:00
9d1f5a35df Move more stuff into ViewAndMutationMeta (#105009)
The one sort of tricksy thing about this PR is that `num_symints_saved_for_bw` is populated later; we compute the metadata with a forward pass, but we only know `num_symints_saved_for_bw` once we run partitioning. This seems... fine.

Also, by pushing the conditionals into the slices, I can remove the top level if...else branch, for a nice simplification.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105009
Approved by: https://github.com/albanD
2023-07-12 02:22:44 +00:00
5913437a40 aot inductor: opportunistically fix check_output -> check_call (#104743)
Summary:
This usage is not ideal:

    subprocess.check_output(cmd, stderr=subprocess.STDOUT)

* `check_output` will capture the command's stdout, and here we did not return it
* not ideal to redirect the sub-command's stderr to the host process's stdout (with `check_call`, stdout stays stdout, stderr stays stderr).
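
A small sketch of the preferred pattern (illustrative command; `check_call` keeps the child's stdout/stderr on their own streams instead of capturing stdout):
```python
import subprocess
import sys

subprocess.check_call([sys.executable, "--version"])
```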

Differential Revision: D47275261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104743
Approved by: https://github.com/frank-wei
2023-07-12 00:36:27 +00:00
980fb94f9c [Doc] Specify output parameters for FractionalMaxPool2d and FractionalMaxPool3d (#104941)
Summary: Specify that one of the output parameters must be set for FractionalMaxPool2d and FractionalMaxPool3d.
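
A short sketch of the documented requirement (exactly one of output_size / output_ratio must be given):
```python
import torch

pool = torch.nn.FractionalMaxPool2d(kernel_size=3, output_ratio=(0.5, 0.5))
y = pool(torch.randn(1, 3, 32, 32))
print(y.shape)  # spatial size roughly halved
```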

Fix: #104861

Test Plan: Please see GitHub Actions

Differential Revision: D47357240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104941
Approved by: https://github.com/mikaylagawarecki
2023-07-11 23:51:24 +00:00
73e179a5ca Follow file move for functorch bits for ciflow/inductor (#105019)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105019
Approved by: https://github.com/Skylion007
2023-07-11 23:29:08 +00:00
1a6619a830 Added missing whitespace when reporting invalid gradient type (#104992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104992
Approved by: https://github.com/albanD, https://github.com/soulitzer, https://github.com/Skylion007
2023-07-11 22:24:02 +00:00
96b91ab248 Fix merged lintrunner error (#105005)
Fixes lintrunner linter race condition. Follow up to #104917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105005
Approved by: https://github.com/malfet, https://github.com/ezyang
2023-07-11 22:04:49 +00:00
ece19bf018 Update run_test.py to use TEST_WITH_SLOW_GRADCHECK flag (#104819)
Finishes the job from #104537. See https://github.com/pytorch/pytorch/pull/104537#pullrequestreview-1520065008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104819
Approved by: https://github.com/huydhn
2023-07-11 21:58:46 +00:00
24aa8b9b9a Revert "Deprecate registering autograd kernels at not an autograd key (#104481)"
This reverts commit ed13ab666419ae5dd3adbdb048c8f96f62b14b3d.

Reverted https://github.com/pytorch/pytorch/pull/104481 on behalf of https://github.com/atalman due to failed in periodic tests ([comment](https://github.com/pytorch/pytorch/pull/104481#issuecomment-1631552846))
2023-07-11 21:48:22 +00:00
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
c9a806be28 [ROCm] enable additional inductor/dynamo UTs (#104624)
Enables additional inductor UTs on ROCm and un-skips outdated skips.

I have also removed a group of failures in `test_torchinductor_opinfo` which are now passing for CUDA and ROCm

```
-    # The following 3 tests fail on CUDA with AssertionError: expected size 5==5, stride 5==1 at dim=0
-    # linalg._svd's return value has different strides on CUDA vs CPU which causes this
-    # In test_meta.py there is a mechanism to skipping strides checks for some ops
-    # (including _linalg_svd), possibly we should have something similar here
-    "linalg.cond": {f32, f64},
-    "linalg.svdvals": {f32, f64},
-    "linalg.matrix_rank": {f32, f64},
-    "linalg.svd": {f32, f64},
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104624
Approved by: https://github.com/malfet
2023-07-11 20:44:02 +00:00
6f27c5185f Fix broken link to torch.compile docs (#104982)
The existing link https://pytorch.org/docs/master/dynamo/custom-backends.html is 404. Updating to use the new link.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104982
Approved by: https://github.com/msaroufim
2023-07-11 20:35:47 +00:00
c60cb91700 [dynamo] fix bug where trace_source and graph_sizes artifacts were not being printed with TORCH_LOGS='+dynamo' (#104912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104912
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2023-07-11 20:09:44 +00:00
a2f04e9841 Force multi-line messages to still get log format prefix (#104932)
This makes it easier to exclude multi-line messages using single line
grepping.  If your screen is wide enough this should not be a big
problem.

Example of what it looks like:

```
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] GUARDS:
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG]   hasattr(L['x'], '_dynamo_dynamic_indices') == False
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG]   ___is_grad_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG]   not ___are_deterministic_algorithms_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG]   utils_device.CURRENT_DEVICE == None
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104932
Approved by: https://github.com/mlazos, https://github.com/albanD
2023-07-11 20:00:52 +00:00
515e3f2bb9 Add [rankN]: to log messages when distributed is initialized (#104929)
Doing it in the formatter is kind of naughty, but I stared at logging.setLogRecordFactory for a while and decided it was a bit too global for a library to use well.
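
A minimal sketch of the idea (illustrative class, not the actual formatter):

```python
import logging

class RankPrefixFormatter(logging.Formatter):
    """Prepend "[rankN]: " once torch.distributed is initialized; illustrative only."""

    def format(self, record):
        msg = super().format(record)
        import torch.distributed as dist
        if dist.is_available() and dist.is_initialized():
            msg = f"[rank{dist.get_rank()}]: {msg}"
        return msg
```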

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104929
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2023-07-11 20:00:52 +00:00
5e4ee15e85 [MPS] Fix unique flatten logic (#104938)
The tensor must be flattened if `dim` is None before checking whether or not the `dim` dimension is already None

Fixes https://github.com/pytorch/pytorch/issues/104879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104938
Approved by: https://github.com/albanD
2023-07-11 19:55:56 +00:00
ad37dd5155 Make unspecified ints to range over negative and positive. (#104658)
Currently, negative unspecified ints get specialized. This PR creates symbolic values for
unspecified ints (including negative ones).

For example, with this PR, the following code only compiles once, instead of 3 times:

```python
# compiled with torch.compile so that "compiles once" is observable
@torch.compile
def foo(x, y):
    return torch.fill(torch.zeros(x.shape), y)

x = torch.randn(3)
foo(x, 10)
foo(x, -5)
foo(x, -3)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104658
Approved by: https://github.com/ezyang
2023-07-11 19:13:16 +00:00
4b29829ece [quant][pt2] Fix QAT convert for mobilenetv2 (#104110)
Summary:
QAT convert for mobilenetv2 was previously not working
because we incorrectly applied dropout during eval as well as
training. This is because, for exported models, model.eval() does
not change the behavior of dropout, unlike models with torch ops.
This commit simulates the effects of model.eval() for exported
models as well by replacing the aten dropout pattern before eval.
As of this commit, end-to-end QAT numerics now match for
mobilenetv2 between FX and PT2.
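
A minimal sketch of the idea (illustrative helper, not the exact pass added here):

```python
import torch

def move_exported_dropout_to_eval(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Flip the `train` flag on aten dropout calls in an exported graph,
    # mimicking what model.eval() does for a non-exported model.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is torch.ops.aten.dropout.default:
            x, p, _train = node.args
            node.args = (x, p, False)
    gm.recompile()
    return gm
```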

Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2

Differential Revision: D46750343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104110
Approved by: https://github.com/jerryzh168
2023-07-11 18:42:42 +00:00
eb03af44ee Fixes to the torch.compile doc and doctest (#104911)
Fixing the user warning in doctest by removing autosummary from compile/index.rst:
```
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/__init__.py:docstring of torch.compile:1: WARNING: duplicate object description of torch.compile, other instance in compile/generated/torch.compile, use :noindex: for one of them
```
The error is no longer present in the log: https://github.com/pytorch/pytorch/actions/runs/5513741050/jobs/10052379357?pr=104911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104911
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-07-11 17:54:12 +00:00
6abe0b2ee8 Disable translation validation on performance runs. (#104887)
This PR disables translation validation (TV) when running the benchmark suites on
performance workflows: inductor with A100s.

In summary, the changes are:

- Add flag for turning TV on and off on _benchmarks/dynamo/common.py_
- Turn TV on only on CI accuracy builds
- Add `--no-translation-validation` target flag to _.ci/pytorch/test.sh_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104887
Approved by: https://github.com/ezyang
2023-07-11 17:30:40 +00:00
5d4b2fcc6f Updated pillow version to 9.3.0 for Python version <= 3.8 (#104958)
There are several vulnerabilities with pillow version 9.2.0. In the worst case, this can lead to arbitrary code execution - https://security.gentoo.org/glsa/202211-10.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104958
Approved by: https://github.com/jeanschmidt, https://github.com/malfet
2023-07-11 17:27:09 +00:00
f01deb23d5 Revert "[dynamo][numpy] Add support for np.dtype (#103546)"
This reverts commit 07107919297db3f8ab37f11c12666b6d6d5f692e.

Reverted https://github.com/pytorch/pytorch/pull/103546 on behalf of https://github.com/voznesenskym due to Failed on bench, unclear why bench test did not run on CI ([comment](https://github.com/pytorch/pytorch/pull/103546#issuecomment-1631203461))
2023-07-11 17:23:11 +00:00
49a2b72927 [inductor] handle Min and Max in TritonPrinter (#104944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104944
Approved by: https://github.com/ezyang
2023-07-11 17:11:31 +00:00
15aa401baa [foreach][NAdam] Minimize use of intermediates to decrease peak memory (#104910)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104910
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-07-11 17:08:07 +00:00
6878d3a157 [foreach][RAdam] Minimize use of intermediates to decrease peak memory (#104904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104904
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-07-11 17:08:07 +00:00
ed13ab6664 Deprecate registering autograd kernels at not an autograd key (#104481)
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.

If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register "torch::CppFunction::makeFallthrough()" to your Autograd key,
like in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8

It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).

Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at not an autograd
key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator
and is also configurable to either warn or not warn.

The goal of this PR is to
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong.
- be as performant as possible

There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require_grad and install a WarnNotImplemented grad fn
that warns in the backward pass. This is mildly BC-breaking (see next
section).

Test Plan:
- bunch of new tests

BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).

If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
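
As a rough illustration of the affected pattern, a sketch assuming a custom op with only a CPU kernel (not code from this PR):

```python
import torch
from torch.library import Library

lib = Library("my_ns", "DEF")
lib.define("my_op(Tensor x) -> Tensor")

def my_op_cpu(x):
    return x.clone()

lib.impl("my_op", my_op_cpu, "CPU")  # CPU kernel only; no Autograd kernel registered

x = torch.randn(3, requires_grad=True)
y = torch.ops.my_ns.my_op(x)
# Per the description above, backprop through this op now raises the new warning.
y.sum().backward()
```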

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104481
Approved by: https://github.com/soulitzer
2023-07-11 16:48:39 +00:00
e095716161 Add a note for Incorrect signature in nn.Module.register_full_backwar… (#104964)
…d_pre_hook

Fixes #102645

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104964
Approved by: https://github.com/albanD
2023-07-11 16:24:13 +00:00
231364fd06 [optim] use lerp whenever possible (#104796)
This is a better copy (with fixes) of #104781.
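
For reference, the core identity being exploited is that an EMA update can be written as a single `lerp_`; a standalone check (not the actual optimizer code):

```python
import torch

exp_avg, grad, beta1 = torch.rand(4), torch.rand(4), 0.9

# beta1 * exp_avg + (1 - beta1) * grad  ==  exp_avg.lerp(grad, 1 - beta1)
expected = exp_avg.mul(beta1).add(grad, alpha=1 - beta1)
exp_avg.lerp_(grad, 1 - beta1)
torch.testing.assert_close(exp_avg, expected)
```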

Test plan:
CI will pass once https://github.com/pytorch/pytorch/pull/104784 is landed

Internal CI (and the newly enabled compiled optim tests) will pass after https://github.com/pytorch/pytorch/pull/104866 is landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104796
Approved by: https://github.com/albanD
2023-07-11 14:32:59 +00:00
999abd56a7 [BE] Make ONNX imports lazy (#104843)
This reduces total number of imported modules by default from 1419 to 1322 according to
```
time python -c "import sys;before=len(sys.modules);import torch;after=len(sys.modules);print(f'torch-{torch.__version__} imported {after-before} modules')"
```

and slightly reduces import time, while having no effect on UX (i.e. `torch.onnx.` submodule is kept intact)
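
One standard way to make a submodule import lazy while keeping `torch.onnx.` intact is the module-level `__getattr__` pattern (PEP 562); a minimal sketch, not necessarily the exact code in `torch/__init__.py`:

```python
import importlib

def __getattr__(name):
    # Only import the submodule the first time someone actually touches it.
    if name == "onnx":
        return importlib.import_module("torch.onnx")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```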

Suppress lint errors that appear after mypy accidentally starts listing more files, for more details see: https://github.com/pytorch/pytorch/issues/104940

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104843
Approved by: https://github.com/jansel, https://github.com/albanD
2023-07-11 12:54:22 +00:00
26f7f470df Handle empty PR body in filter_test_configs (#104914)
This is a bug discovered by https://github.com/pytorch/pytorch/pull/104810.  Basically, when the PR body is empty, GitHub API returns a None value, which is passed into `parse_reenabled_issues` causing it to fail.

### Testing

```
python3 .github/scripts/filter_test_configs.py \
  --workflow "pull" \
  --job-name "linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit / filter," \
  --test-matrix "{ include: [ { config: 'default', shard: 1, num_shards: 1, runner: 'linux.2xlarge' }, ]}" \
  --pr-number "104810" \
  --tag "" \
  --event-name "pull_request" \
  --schedule "" \
  --branch ""
```

The command works correctly without failing now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104914
Approved by: https://github.com/clee2000
2023-07-11 10:16:58 +00:00
db4aed6a03 Include nn.ParameterDict in dynamo __getitem__ (#99771)
Summary:

Fix: #99735

Test Plan: Please see GitHub tests.

Differential Revision: D45200616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99771
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2023-07-11 08:19:04 +00:00
ba167e6578 Inductor cpp wrapper: fix codegen of ScatterFallback (#104524)
Fix cpp wrapper failure on TorchBench model `basic_gnn_edgecnn` and `hf_Reformer` which contain scatter OP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104524
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-11 08:17:56 +00:00
0710791929 [dynamo][numpy] Add support for np.dtype (#103546)
## Problem

Trying to support numpy function call in dynamo, with numpy dtype as argument.

For example:

```
def fn(x: int):
    return np.empty_like(x, dtype=np.float64)
```

## Solution

This currently doesn't work because `NumpyVariable` doesn't implement `as_proxy()`. The idea in `as_proxy()` for now is to convert `np.float64` and other np.<dtype> into `torch.dtype` and then feed into the corresponding `torch_np` method.

For the previous example, we convert `np.float64` to `torch.float64` in `as_proxy()` and then feed it into the `torch_np.empty_like()` method.
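
A minimal sketch of that mapping (illustrative names, not the actual `NumpyVariable` code):

```python
import numpy as np
import torch

_NP_TO_TORCH_DTYPE = {
    np.float32: torch.float32,
    np.float64: torch.float64,
    np.int32: torch.int32,
    np.int64: torch.int64,
}

def numpy_dtype_to_torch(np_dtype):
    # Fall through unchanged for anything we don't recognize.
    return _NP_TO_TORCH_DTYPE.get(np_dtype, np_dtype)

assert numpy_dtype_to_torch(np.float64) is torch.float64
```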

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103546
Approved by: https://github.com/ezyang
2023-07-11 06:29:15 +00:00
90eaa98d13 dynamo : kwarg support for wrap (higher order op) (#104180)
Ref: https://github.com/pytorch/pytorch/issues/100278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104180
Approved by: https://github.com/zou3519
2023-07-11 06:08:18 +00:00
ed5ea15714 [Easy] remove debug code (#104915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104915
Approved by: https://github.com/mlazos
2023-07-11 04:01:02 +00:00
f1bff6601c [ONNX] Add fake tensor support to torch.onnx.dynamo_export (#103865)
## Context prior to this PR

https://github.com/pytorch/pytorch/pull/100017/ was merged onto PyTorch `main` branch with the goal of enabling `torch._dynamo.export` to perform symbolic tracing.
In that context, symbolic tracing is defined as tracing a model using fake inputs and weights. An input is fake when a `torch.Tensor` is replaced by a `torch._subclasses.FakeTensor`, whereas a weight is fake when a `torch.nn.Parameter` is replaced by a `torch._subclasses.FakeTensor`.

For additional context, several strategies were discussed with Meta to enable this feature, including 1) calling `torch._dynamo.export` within a `torch._subclass.FakeTensorMode` context and 2) **fake**fying input and model as separate step and then call `torch._dynamo.export` without an active `torch._subclass.FakeTensorMode` context. At the end, 2) was preferred and implemented by #100017 to minimize the number of side-effects the fake tensor mode has on the code base.

As a consequence, `torch._dynamo.export` API introduced a new argument called `fake_mode`. When symbolic tracing is used, the user must pass in the `fake_mode` used to fakefy both the input and the model. Internally, `torch._dynamo.export` will adopt this `fake_mode` instead of creating its own instance. This is needed because each instance of `FakeTensorMode` has metadata on the tensor/parameter it fakefied. Thus, using real tensor/model and specify a `fake_mode` to `torch._dynamo.export` is an error. Also, specify a `fake_mode` instance to `torch._dynamo.export` different than the one used to fakefy the model and input is also an error.

## Changes introduced from this PR

This PR is intended to integrate `torch._dynamo.export(fake_mode=...)` through `torch.onnx.dynamo_export`. In essence, it
* Introduces a new public API `ONNXFakeContext` which wraps a `FakeTensorMode` under the hood. This removes complexity from the user side while still allowing the exporter to leverage the fake mode.
* Adds a new public API `enable_fake_mode` *context manager* that instantiates and return a `ONNXFakeContext`.
* Adds a new `ExportOptions.fake_context` that will be used to persist the `ONNXFakeContext` created by `enable_fake_mode` and plumb through until it reaches the call to `torch._dynamo.export`.
* Adds a `model_state_dict` argument to `ExportOutput.save` API.
  * When model is exported with fake tensors, no actual data exist in the FX module and, therefore, in the ONNX graph.
    * In fact, `torch.fx.make_fx` lifts initializers as model input when fake tensors are used
      * https://github.com/pytorch/pytorch/pull/104493 is needed to enforce name matching between Parameters and inputs
    *  A model checkpoint file or state_dict is needed to populate the ONNX graph with real initializers through `export_output.save(model_state_dict=...)` API

Symbolic tracing, or onnx fake mode, is only enabled when the user instantiates the input and model within the `enable_fake_mode` context. Otherwise, real tracing is done, which preserves the current behavior.

## Usability

Because symbolic tracing depends a lot on having changes made on Dynamo side before it can be consumed on ONNX exporter, this feature may have its API and assumptions changed as symbolic tracing matures upstream. Nonetheless, it is still important to have this feature merged ASAP on the ONNX exporter side to "lock" changes on Dynamo that would otherwise break ONNX exporter without warning.

Example:

```python
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        out = self.linear(x)
        return out

with torch.onnx.enable_fake_mode() as fake_context:
    x = torch.rand(5, 2, 2)
    model = Model()

# Export the model with fake inputs and parameters
export_options = ExportOptions(fake_context=fake_context)
export_output = torch.onnx.dynamo_export(
    model, x, export_options=export_options
)

model_state_dict = Model().state_dict()  # optional
export_output.save("/path/to/model.onnx", model_state_dict=model_state_dict)
```

## Next steps

* Add unit tests running the exported model with ORT
Today this is not possible yet because `make_fx` used by our Decomposition pass lifts initializers as model inputs. However, the initializer names are not preserved by FX tracing, causing a mismatch between the initializer and input name.
https://github.com/pytorch/pytorch/pull/104493 and https://github.com/pytorch/pytorch/pull/104741 should fix the initializer mismatch, enabling model execution

* Revisit `ONNXTorchPatcher` and how the ONNX initializers are saved in the graph as external data
We can try to get rid of the PyTorch patcher. If we can't, we might prefer to create specific patchers, say an `FXSymbolicTracePatcher` used specifically during an export using `torch.fx.symbolic_trace` and maybe an `ExportOutputSavePatcher` used specifically for `ExportOutput.save`, to avoid patching more of the PyTorch API than we need.

## References

* [FakeTensor implementation](https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/fake_tensor.py)
* [PR that adds fake tensor support to torch._dynamo.export](https://github.com/pytorch/pytorch/pull/100017)
* [Short fake tensor documentation](https://pytorch.org/torchdistx/latest/fake_tensor.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103865
Approved by: https://github.com/BowenBao
2023-07-11 03:17:17 +00:00
ca8c56ff5d fix QuantizeAvx512 (#104400)
For quantize
```
  for (; i < len / VLEN * VLEN; i += VLEN) {
    __m512 x_vals = _mm512_load_ps(src + i);
    __m512 x_transformed_v = _mm512_mul_ps(x_vals, inverse_scale_v);
    x_transformed_v =
        _mm512_min_ps(x_transformed_v, _mm512_set1_ps(int32_float_max_val));
    __m512i x_rounded_v = _mm512_cvtps_epi32(x_transformed_v);
    x_rounded_v = _mm512_add_epi32(x_rounded_v, _mm512_set1_epi32(zero_point));
    __m512i x_clipped_v =
        _mm512_max_epi32(min_v, _mm512_min_epi32(max_v, x_rounded_v));

    x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
    x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
    _mm_storeu_si128(
        reinterpret_cast<__m128i*>(dst + i),
        _mm512_castsi512_si128(x_clipped_v));
  }
```

```
    x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
    x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
```
aims to cast `int32` to `int8` and shuffle 16 `int8` values into the first 128 bits.

For example, `A1` represents 8 bits
```
    x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
    A1A2A3**A4** B1B2B3**B4** C1C2C3**C4** D1D2D3**D4**            -> D4C4B4A4 other 32 * 3 bit
    E1E2E3**E4** F1F2F3**F4** G1G2G3**G4** H1H2H3**H4**            -> H4G4F4E4 other 32 * 3 bit
    I1I2I3**I4** J1J2J3**J4** K1K2K3**K4** L1L2L3**L4**            -> L4K4J4I4 other 32 * 3 bit
    M1M2M3**M4** N1N2N3**N4** O1O2O3**O4** P1P2P3**P4**            -> P4O4N4M4 other 32 * 3 bit
    x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
    D4C4B4A4 other 32 * 3 bit        -> D4C4B4A4 H4G4F4E4 L4K4J4I4 P4O4N4M4
    H4G4F4E4 other 32 * 3 bit           other 3 * 4 * 32 bits
    L4K4J4I4 other 32 * 3 bit
    P4O4N4M4 other 32 * 3 bit

```

Based on https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_permutexvar_epi32&ig_expand=4966,5088.
```
FOR j := 0 to 15
	i := j*32
	id := idx[i+3:i]*32
	dst[i+31:i] := a[id+31:id]
ENDFOR
dst[MAX:512] := 0
```
the `permute_mask_l8_v` should satisfy
```
permute_mask_l8_v[3:0] = 0
permute_mask_l8_v[3 + 32:0 + 32] = 4
permute_mask_l8_v[3 + 64:0 + 64] = 8
permute_mask_l8_v[3 + 96:0 + 96] = 12
```
The other part of `permute_mask_l8_v` does not matter.

`AVX2` version is correct.

The bug was not exposed before because this path is only called with a fixed length of `64`: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_qint.h#L545-L546.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104400
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/jerryzh168
2023-07-11 02:02:23 +00:00
dbb69f78fe Add assert + test for artifact log booleans (#104907)
Fixes https://github.com/pytorch/pytorch/issues/104885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104907
Approved by: https://github.com/ezyang
2023-07-11 01:59:23 +00:00
d184c81166 Add -fstandalone-debug debug flag (#104475)
# Summary

While debugging something in lldb, I found that the formatter I wrote for c10::intarrayref was not working correctly producing:
`(std::string) $6 = error: summary string parsing error`

Based off of this thread: https://github.com/vadimcn/codelldb/issues/415

I added the standalone-debug information and fixed the std::string formatting issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104475
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-11 01:29:20 +00:00
63d1fb21f5 [FSDP] Default limit_all_gathers=True (#104900)
This PR defaults to `limit_all_gathers=True`.

I included a `record_function()` for the rate limiter synchronization to help with user confusion on the gap in the pre-forward:
<img width="874" alt="Screenshot 2023-07-10 at 3 28 18 PM" src="https://github.com/pytorch/pytorch/assets/31054793/61f55e0e-58d7-4162-9395-bea06d3e8d8a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104900
Approved by: https://github.com/fegin
2023-07-11 01:04:29 +00:00
7c3c3dd7ca [C10D] Reimplement TCPStore wait timeout logic. (#100594)
The current TCPStore wait logic leaves the client socket in a bad state if waiting times out.

This happens because all recv functions raise an exception on timeout and that's it.
The problem is that on timeout we need to unregister the wait.

We implement this with client side cancelation by adding a new CANCEL_WAIT instruction.

So, if no data arrives before the deadline, the client sends a CANCEL_WAIT command.
The server sends a WAIT_CANCELED response to that command, always.

This gets us down to the last issue, which is that there's a race between timing out,
canceling the wait, and the wait completing. The client needs to handle the server sending
a STOP_WAITING followed by a WAIT_CANCELED answer.

This ensures client and server state are synchronized regardless of whether the wait
timeouts or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
2023-07-11 00:36:41 +00:00
332f2057df [XNNPACK][QS8] torch.nn.ELU (#104307)
Differential Revision: [D47075933](https://our.internmc.facebook.com/intern/diff/D47075933/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104307
Approved by: https://github.com/digantdesai
2023-07-11 00:35:13 +00:00
c4e084e3c7 [XNNPACK][QS8] torch.nn.ConstantPad2d (#104306)
Differential Revision: [D47075932](https://our.internmc.facebook.com/intern/diff/D47075932/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104306
Approved by: https://github.com/digantdesai
2023-07-11 00:35:02 +00:00
2c960c73a3 [XNNPACK][QS8] torch.permute (#104305)
Differential Revision: [D47075934](https://our.internmc.facebook.com/intern/diff/D47075934/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104305
Approved by: https://github.com/digantdesai
2023-07-11 00:34:58 +00:00
d41c4a8338 [XNNPACK][QS8] torch.clamp (#104304)
Differential Revision: [D47075935](https://our.internmc.facebook.com/intern/diff/D47075935/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104304
Approved by: https://github.com/digantdesai
2023-07-11 00:34:58 +00:00
66c41e1c5e Avoid generating core dumps when CONTINUE_THROUGH_ERROR is set (#104905)
Fixes https://github.com/pytorch/pytorch/issues/104234. This closes another loophole where multiple core files could be generated when the CONTINUE_THROUGH_ERROR flag is set in CI. This ensures that only one core file is generated in a regular Linux test job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104905
Approved by: https://github.com/clee2000
2023-07-11 00:20:33 +00:00
e940d5d3c3 Disable cudagraphs by default when dynamic shape is enabled. (#104448)
Disable cudagraphs when dynamic shape is enabled (via torch.compile(dynamic=True)).
Otherwise, Inductor recompiles for each new shape, which doesn't seem to be very reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104448
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-07-11 00:16:37 +00:00
3279f06410 Merge and improve torch optim optimizer type stubs (#102593)
Fixes #102428

Also improves hook registration type hints:

```python
from typing import Any, Dict, Tuple

from torch import nn
from torch.optim import Adam, Adagrad, Optimizer

linear = nn.Linear(2,2)
optimizer = Adam(linear.parameters(), lr=0.001)

def pre_hook_fn_return_none(optimizer: Adam, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

def pre_hook_fn_return_modified(
    optimizer: Optimizer, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    return inputs, kwargs

def hook_fn(optimizer: Optimizer, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

def hook_fn_other_optimizer(optimizer: Adagrad, inputs: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    return None

optimizer.register_step_post_hook(hook_fn)  # OK

optimizer.register_step_pre_hook(pre_hook_fn_return_none)  # OK
optimizer.register_step_pre_hook(pre_hook_fn_return_modified)  # OK

optimizer.register_step_post_hook(hook_fn_other_optimizer)  # Parameter 1: type "Adam" cannot be assigned to type "Adagrad"

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102593
Approved by: https://github.com/janeyx99
2023-07-11 00:07:30 +00:00
6059fea760 Make perf_hint_log report at info level (#104873)
If you do it at warning, these log messages will get displayed by
default, which is not the intended behavior.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104873
Approved by: https://github.com/mlazos
2023-07-10 23:46:34 +00:00
4063158df9 Enable running compiled optimizers in CI (#104888)
as title

for reference: this is a followup to https://github.com/pytorch/pytorch/pull/104121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104888
Approved by: https://github.com/janeyx99
2023-07-10 23:45:41 +00:00
7e9c891056 [foreach][AdamW] Minimize intermediates to save peak memory (#104898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104898
Approved by: https://github.com/albanD
2023-07-10 23:40:52 +00:00
d5dbe77629 Fix mod semantics for Z3Ops. (#104827)
Python `mod` semantics are not the same as the mathematical modulus operation. According to
the Python reference: `a = floor(a / b) * b + a % b`.

In other words: `a % b = a - floor(a / b) * b`.
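
A quick numeric check of this identity under Python semantics:

```python
import math

# Holds for negative operands and for floats, unlike the SMT-LIB2 semantics.
for a, b in [(-7, 3), (7, -3), (7.5, 2.0)]:
    assert a % b == a - math.floor(a / b) * b
```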

This PR fixes the old implementation which used SMT-LIB2 semantics for `mod`. In short, it
only worked with integers and had the following guarantee: `0 <= a % b < b`.

In summary, the changes are:
- `a % b = a - floordiv(a, b) * b`
- `a` and `b` can be both integer or real
- The result will be real if any of the arguments is real. Otherwise, it will be integer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104827
Approved by: https://github.com/lezcano
2023-07-10 23:35:04 +00:00
951b9a6a14 Update torchbench pin (#104829)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104829
Approved by: https://github.com/albanD
2023-07-10 23:31:27 +00:00
0300be5b7b Fix AttributeError("'constexpr' object has no attribute 'type'") (#104831)
Fixes #104759

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104831
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-07-10 23:26:42 +00:00
aa84078c6c [PTD][TP] Add BWD support for colwise embedding sharding (#104820)
Originally, we didn't enable BWD for colwise embedding because we thought it was just for inference, but it turns out that we do need it for training. So, let's enable it for now; a unit test is also added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104820
Approved by: https://github.com/fegin
2023-07-10 22:33:20 +00:00
fd378db6a8 Fix lint after 104902 (#104909)
Fix lint after PR: #104902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104909
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/huydhn
2023-07-10 22:17:06 +00:00
9861c4a3f8 Add lerp decomps + meta registrations (#104866)
as title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104866
Approved by: https://github.com/janeyx99
2023-07-10 22:07:57 +00:00
dff42857bd [inductor] update triton pin (#104303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104303
Approved by: https://github.com/desertfire
2023-07-10 21:30:50 +00:00
c2e286daf9 Testing: Print test reproduction command on failure (#104537)
MS2 of the Reproducible Testing BE initiative. For context, this is the ask:

```
Another thing that would be really great as we start to have more dependent
systems or types of tests (functorch, dynamo, crossref) would be to have a
minimally reproducible version of the test (something at the end of the HUD
comment like: "Run python test/test_file.py -k test_name" but also if you need
flags, like crossref it would be like "Run <flag to run crossref> python test/..." ). I'll
often go through the test infra to find the flags that I need to pass when
something only breaks crossref/dynamo tests.
```

Implementation details:
* Adds a new flag `PRINT_REPRO_ON_FAILURE` that is settable through the environment variable `PYTORCH_PRINT_REPRO_ON_FAILURE=1`
    * **Default is ON but I can be persuaded otherwise**
* When the flag is enabled, our base `TestCase` will wrap the test method in a context manager that catches any non-skip exceptions and appends a repro string to the exception message. The repro includes setting of necessary test flags through env vars. Example:

```
To execute this test, run the following from the base repo dir:
    PYTORCH_TEST_WITH_CROSSREF=1 python test/test_ops.py -k test_foo_add_cuda_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
* To keep track of flag settings, this PR introduces a new `TestEnvironment` class that defines global flags by querying related environment variables. Flag and env var names are purposefully kept searchable via full names. Example usages:
```python
TestEnvironment.def_flag("TEST_WITH_TORCHINDUCTOR", env_var="PYTORCH_TEST_WITH_INDUCTOR")
# can track implication relationships to avoid adding unnecessary flags to the repro
TestEnvironment.def_flag(
    "TEST_WITH_TORCHDYNAMO",
    env_var="PYTORCH_TEST_WITH_DYNAMO",
    implied_by_fn=lambda: TEST_WITH_TORCHINDUCTOR or TEST_WITH_AOT_EAGER)
# can use include_in_repro=False to keep the flag from appearing in the repro command
TestEnvironment.def_flag(
    "DISABLE_RUNNING_SCRIPT_CHK", env_var="PYTORCH_DISABLE_RUNNING_SCRIPT_CHK", include_in_repro=False)
# the default default value is False, but this can be changed
TestEnvironment.def_flag(
    "PRINT_REPRO_ON_FAILURE", env_var="PYTORCH_PRINT_REPRO_ON_FAILURE", default=(not IS_FBCODE), include_in_repro=False)
```
* AFAICT it is only feasible to achieve this from within the test framework rather than at the CI level. This is because CI / `run_test.py` are unaware of individual test cases. Implementing it in our base `TestCase` class has the broadest area of effect, as it's not isolated to e.g. OpInfo tests.
* I couldn't find an easy way to test the logic via `test_testing.py`, as the logic for extracting the test filename doesn't work for generated test classes. I'm open to ideas on testing this, however.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104537
Approved by: https://github.com/ezyang, https://github.com/janeyx99, https://github.com/huydhn
2023-07-10 21:24:02 +00:00
912a6a1b5a [pt2][test] Loosen stack trace check in test (#104902)
Depending on inlining and demangling provided by the underlying
compiler, we may get different function names and namespaces in the stack
trace.  Allow everything I've seen so far.

Differential Revision: [D47344213](https://our.internmc.facebook.com/intern/diff/D47344213/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47344213/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104902
Approved by: https://github.com/eellison
2023-07-10 21:12:25 +00:00
86680a6c0b [dynamo] handle calls to typing.cast (#104799)
Fixes #ISSUE_NUMBER
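
A minimal illustration of the construct this handles (hypothetical example, not from the PR); `typing.cast` is a runtime no-op, so it can be traced as an identity call:

```python
import typing
import torch

def fn(x):
    y = typing.cast(torch.Tensor, x)  # no-op at runtime
    return y + 1

compiled = torch.compile(fn)
print(compiled(torch.ones(2)))
```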

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104799
Approved by: https://github.com/jansel
2023-07-10 21:05:17 +00:00
2ee440054b Small tweaks to SDPA docs (#104749)
Fixes #104652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104749
Approved by: https://github.com/mikaylagawarecki
2023-07-10 21:01:45 +00:00
d1ca98665f Switch automatic_dynamic_shapes to True by default in fbcode (#104883)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104883
Approved by: https://github.com/xw285cornell
2023-07-10 20:32:22 +00:00
bcdd4130b4 [inductor] Fix float64 constants in triton codegen (#104830)
Fixes #101684

Before this change, we get a float constant in triton
```
tmp0 = 0.2
```
which in triton IR becomes a float32 value
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf32>
```

After, we get a tensor with explicit type
```
tmp0 = tl.full([1], 0.2, tl.float64)
```
which does generate a float64 in the triton IR
```
%cst_0 = arith.constant dense<2.000000e-01> : tensor<2xf64>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104830
Approved by: https://github.com/lezcano
2023-07-10 19:40:50 +00:00
7b538d8987 [DCP][fsspec] Consolidate OSS FsspecWriter/Reader and internal FsspecWriter/Reader (#104724)
Summary:
This diff does the following:
1. re-enable single_file_per_rank for FsspecWriter, as the issue of file slicing error is resolved because of [https://github.com/pytorch/pytorch/pull/99167]
2. remove sync_files from FsspecWriter as there is no fsspec equivalence.
3. remove the internal implementation of FsspecWriter/Reader, as it has been upstreamed to PyTorch OSS
4. keep the internal test for manifold inside internal as we can only test it in fb environment
5. consolidate test to remove duplicates
6. remove unnecessary TARGETS

Test Plan:
```
buck test @//mode/dev-nosan  //caffe2/test/distributed/checkpoint/fb:test_fsspec_filesystem -- --print-passing-details

----------------------------------------------------------------------
Ran 1 test in 54.894s

OK
/usr/local/fbcode/platform010/lib/python3.8/tempfile.py:818: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmpzomokvh6'>
  _warnings.warn(warn_message, ResourceWarning)

Buck UI: https://www.internalfb.com/buck2/4cb722a2-3ee7-48f2-a9ef-55ee6fb1a498
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724447995201
Network: Up: 8.8 MiB  Down: 1.5 GiB  (reSessionID-04c29f56-ae94-4187-8a1a-c812f432674d)
Jobs completed: 209847. Time elapsed: 1:56.5s.
Cache hits: 100%. Commands: 85687 (cached: 85687, remote: 0, local: 0)
Tests finished: Pass 3. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D47266068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104724
Approved by: https://github.com/fegin, https://github.com/fduwjj
2023-07-10 19:31:01 +00:00
48a49b2683 use more informative error message for ConstantPad2d/3d (#104762)
Fixes #104508

As discussed in #104508, the current error message for `torch.nn.ConstantPad2d` and `torch.nn.ConstantPad3d` is misleading; this PR fixes the problem.
The fixed error message is shown below:
For `torch.nn.ConstantPad2d`:
<img width="619" alt="image" src="https://github.com/pytorch/pytorch/assets/6964699/dd15f42a-b6ad-4c6d-aa41-f26d08144189">
For `torch.nn.ConstantPad3d`:
<img width="630" alt="image" src="https://github.com/pytorch/pytorch/assets/6964699/ac99b80f-73c1-4d7f-b9a1-74bf45ee4c21">

cc:
@mikaylagawarecki Please help me check this PR, thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104762
Approved by: https://github.com/mikaylagawarecki
2023-07-10 19:00:47 +00:00
1ad435772b Added option to always call nn.Module global/non-global forward hooks (#104278)
Fix #103997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104278
Approved by: https://github.com/albanD
2023-07-10 18:58:07 +00:00
0433cb0596 [dynamo] simulate tracing tree_map_only (#104815)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104815
Approved by: https://github.com/voznesenskym
2023-07-10 18:05:35 +00:00
b93590b692 Copy debug artifacts instead of renaming (#104561)
Fixes https://github.com/pytorch/pytorch/issues/100567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104561
Approved by: https://github.com/jansel
2023-07-10 17:44:13 +00:00
35f0e35529 [foreach][Adam] Minimize use of intermediates to decrease peak memory (#104780)
Starts addressing https://github.com/pytorch/pytorch/issues/97712 by
- Minimizing intermediates usage for foreach Adam
- Document the extra memory usage
- Add comments within the code for clarity now that we reuse intermediates
- Add tests
- Did some refactoring

Next steps involve doing this for all other foreach implementations. Note that even after this change, foreach mem usage will be higher than forloop due to the fact that we have a minimum budget of 1 intermediate (to not muddle the input values) and the intermediate will be larger. For capturable, the memory usage is higher due to moving more tensors to CUDA.
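
An illustrative sketch of the kind of rewrite involved (not the exact Adam code): compute the denominator with one reused intermediate instead of two.

```python
import torch

exp_avg_sqs = [torch.rand(4), torch.rand(4)]
eps = 1e-8

# Before: two intermediates
#   sqrts = torch._foreach_sqrt(exp_avg_sqs)
#   denom = torch._foreach_add(sqrts, eps)
# After: reuse the first intermediate in place
denom = torch._foreach_sqrt(exp_avg_sqs)
torch._foreach_add_(denom, eps)
```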

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104780
Approved by: https://github.com/albanD
2023-07-10 17:38:46 +00:00
e25f5732c8 Add meta registrations and distributed decomps: _foreach_div_.Scalar, sqrt_.default (#104779)
This PR unblocks #104780 by resolving spmd tracing test issues and by adding meta registrations for foreach inplace ops (div_ and sqrt_)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104779
Approved by: https://github.com/fegin, https://github.com/albanD
2023-07-10 17:38:46 +00:00
038cb4075a Add capturable/maximize tests to Adam(W) optim configs (#104669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104669
Approved by: https://github.com/albanD
2023-07-10 17:38:46 +00:00
af52f6b928 [DCP] Add documentation for HSDP saving using DCP (#104810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104810
Approved by: https://github.com/fduwjj
2023-07-10 17:33:05 +00:00
e695b397e1 Fix broken ROCm quick start link (#104527)
The AMD ROCm docs got a new subdomain and the naming changed a bit, so the old link went 404.

This PR just updates the link to the newest quick start guide which includes the installation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104527
Approved by: https://github.com/pruthvistony, https://github.com/hongxiayang, https://github.com/malfet
2023-07-10 16:49:17 +00:00
4911b80b8e [inductor] addmm + ReLU / GELU fusion pass (#104132)
Summary:

Add a new path in `post_grad.py` for replacing addmm + ReLU / GELU activation with the corresponding `_addmm_activation` call (with `use_gelu=False` or `True`, respectively). The replacement is done only when `max_autotune_gemm=False` and the activation is fusible.
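
Roughly, the pattern/replacement pair for the ReLU case looks like the sketch below (illustrative function names, not the actual registration):

```python
import torch

def addmm_relu_pattern(bias, mat1, mat2):
    return torch.relu(torch.addmm(bias, mat1, mat2))

def addmm_relu_replacement(bias, mat1, mat2):
    return torch.ops.aten._addmm_activation.default(bias, mat1, mat2, use_gelu=False)
```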

Test Plan:

$ python test/inductor/test_pattern_matcher.py -k test_addmm_activation -v

(__main__.TestPaternMatcher.test_addmm_activation) ... /data/users/aakhundov/pytorch/torch/_inductor/compile_fx.py:128: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
Using FallbackKernel: aten._addmm_activation.default
Using FallbackKernel: aten._addmm_activation.default
/data/users/aakhundov/pytorch/torch/_dynamo/eval_frame.py:373: UserWarning: changing options to `torch.compile()` may require calling `torch._dynamo.reset()` to take effect
  warnings.warn(
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
ok

----------------------------------------------------------------------
Ran 1 test in 13.415s

OK

Reviewers: @eellison

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104132
Approved by: https://github.com/eellison, https://github.com/jansel
2023-07-10 16:44:14 +00:00
7166df8094 Add big doc to wrap_fx_proxy_cls (#103407)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103407
Approved by: https://github.com/voznesenskym
2023-07-10 16:00:11 +00:00
4b8378967a Fix pytest test discovery for vscode (#104864)
With the latest update, this test class name started breaking pytest test discovery in vscode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104864
Approved by: https://github.com/Chillee, https://github.com/albanD, https://github.com/malfet
2023-07-10 14:56:41 +00:00
af34123caf Consolidate example_value int cases in wrap_fx_proxy_cls (#104836)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104836
Approved by: https://github.com/voznesenskym
2023-07-10 13:06:06 +00:00
e7fe2a797c Revert "[optim] use lerp whenever possible (#104796)"
This reverts commit fbe2a7e50a940ba7a12b003241a2699f7a731afb.

Reverted https://github.com/pytorch/pytorch/pull/104796 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/104796#issuecomment-1628591105))
2023-07-10 09:36:41 +00:00
46154c4c35 [FSDP][optim_state_dict] The correct way to initialize optimizer states if the corresponding param is empty (#104765)
When using KeyedOptimizer.init_state(), some optimizers initialize the states even if the param is empty (size() == 0), while others avoid initializing the states. There is no way FSDP can tell which case applies. Instead, FSDP should look up `optim.state`. Fortunately, `optim.state` does not rely on FQNs, which some internal users change.

Differential Revision: [D47285562](https://our.internmc.facebook.com/intern/diff/D47285562/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104765
Approved by: https://github.com/fduwjj
2023-07-10 08:00:55 +00:00
54f33265db inductor(re-land): support cpu fusion path for bfloat16 amp (#104399)
This PR is about the CPU fusion path for bfloat16 AMP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104399
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-10 00:58:04 +00:00
1b24a75175 Generalize sympy.Rel test to sympy.logic.boolalg.Boolean (#104833)
Constant booleans are not relational, but you typically
will still want to match against them.

I grepped the codebase and this appears to be exhaustive.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104833
Approved by: https://github.com/voznesenskym
2023-07-10 00:07:06 +00:00
26ff7a7e2a Allow for torch.sym_int to return int while tracing (#104837)
Per https://github.com/pytorch/pytorch/pull/103303 we cannot
universally allow tracing in all functions that return int,
as the graph breaks appear to be load bearing in some cases.
However, allowing torch.sym_int to be traced even if
the result is statically known is fine; this can happen in
the case of a SymBool to int conversion.

This PR is not exhaustive but e.g., I fixed size/stride/numel
handling in https://github.com/pytorch/pytorch/pull/103438
The biggest risk is that arithmetic operations on sizes end
up getting constant-ified (this appears to have happened in
practice for modulus, which is why it's in this list.)  If
we don't care about spewing useless computation into the graph,
a more aggressive version of this PR would be to greatly expand
the list of allowed to specialize to int targets and then undo
https://github.com/pytorch/pytorch/pull/103438

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104837
Approved by: https://github.com/voznesenskym
2023-07-09 23:17:59 +00:00
dfe7a1e089 [dynamo] Support wrapping + returning tensor subclasses (#104802)
as title - used for [tracing the FSDP collectives](d8cb80e382/torch/distributed/_functional_collectives.py (L425))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104802
Approved by: https://github.com/jansel
2023-07-09 22:16:10 +00:00
51e246affc Update cuDNN frontend submodule to v9.1 (#104847)
Updates cudnn_frontend from v9 to v9.1 with the latest bugfixes and cmake fixes. Most notably, the previous release forgot to increment the version constants. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104847
Approved by: https://github.com/ezyang
2023-07-09 21:53:16 +00:00
546db2e36e [fx] make fx.wrap idempotent (#104838)
Previously, if you called `torch.fx.wrap()` on the same thing twice, it would add two entries to `_wrapped_fns_to_patch`. Then, when tracing, the patcher would process them both. On the second entry, the patcher would double-wrap the fn (e.g. `wrap(wrap(orig_fn))`)

This made the wrapping observable after the trace: while normally a Patcher instance will "revert" the wrapping after tracing, the double-wrapped function only goes from `wrap(wrap(orig_fn))` back to `wrap(orig_fn)`.

This happens to work in normal fx stuff (after all, the wrapper function will behave exactly like the original function). But it upsets torch.package, which is not expecting to see a weird wrapper function in the graph.

This PR adds a dictionary to deduplicate `wrap()` calls, ensuring that the patcher only wraps each function once per frame-fn pair (see the minimal illustration below).
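
A minimal illustration of the now-idempotent behavior (hypothetical usage):

```python
import torch
import torch.fx

def my_helper(x):
    return x + 1

torch.fx.wrap("my_helper")
torch.fx.wrap("my_helper")  # now deduplicated; no double-wrapping during tracing

class M(torch.nn.Module):
    def forward(self, x):
        return my_helper(x)

print(torch.fx.symbolic_trace(M()).graph)  # my_helper appears as a single call_function node
```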

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104838
Approved by: https://github.com/Chillee
2023-07-09 20:57:46 +00:00
87e6b19ee0 [export] Make serializer more composable (#104816)
Test Plan: CI

Differential Revision: D47311044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104816
Approved by: https://github.com/zhxchen17
2023-07-09 19:02:35 +00:00
98d48709fe update cudnn==8.9.2.26 in .ci/docker (#104795)
Follow-up of https://github.com/pytorch/builder/pull/1436

CC @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104795
Approved by: https://github.com/malfet
2023-07-09 18:55:34 +00:00
395a0ba303 Training skip list should not be applied on inference bench (#104738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104738
Approved by: https://github.com/thiagocrepaldi, https://github.com/desertfire
2023-07-09 17:39:17 +00:00
a860b965f1 [inductor] Relax custom op schema checking for cpp_wrapper (#104349)
Summary: Remove fallback ops whitelist because FallbackKernel.set_cpp_kernel is doing sufficient checking

Differential Revision: [D47269612](https://our.internmc.facebook.com/intern/diff/D47269612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104349
Approved by: https://github.com/jgong5, https://github.com/chunyuan-w, https://github.com/jansel
2023-07-09 17:31:31 +00:00
dd6c38cb59 [vision hash update] update the pinned vision hash (#104834)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104834
Approved by: https://github.com/pytorchbot
2023-07-09 03:30:51 +00:00
3179c21286 remove aot_inductor_lib from deeplearning (#104730)
Summary:
AOTInductor model wrapper code has been moved to torch/_inductor so
that we can remove the duplicates from deeplearning, which were
placed there temporarily.

This PR also made the following changes to inductor codecache to make it work with AOTInductor:

* take the full input and output paths in aot_mode
* use a more suitable way to retrieve dirname from the input_path

Differential Revision: D47118805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104730
Approved by: https://github.com/jansel
2023-07-08 21:26:14 +00:00
dffcf999bd Misc changes from compiled autograd branch (#104316)
This PR pulls out some standalone changes from #103822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104316
Approved by: https://github.com/ezyang
2023-07-08 20:59:20 +00:00
e80787c8e1 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-08 20:48:29 +00:00
0ceca92f80 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.
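
For context, one standard way to get both the mean and the unnormalized variance (sum of squared deviations) in a single pass is Welford's algorithm; a minimal reference sketch, not necessarily what the generated code does:

```python
def welford(xs):
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # m2 accumulates the sum of squared deviations
    return mean, m2

assert welford([1.0, 2.0, 3.0]) == (2.0, 2.0)
```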

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-08 20:48:29 +00:00
26108d5d2b Add --check-str support to after_aot minifier (#104758)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104758
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-07-08 20:20:55 +00:00
85cbe7e6fd Add timeout for translation validation instances. (#104654)
As of now, translation validation runs to completion. However, Z3 can be time-consuming;
PR #104464, for example, disables translation validation for a few benchmarks.

Instead, this PR introduces a timeout for translation validation. In that case, Z3 will
return `unknown`, since it wasn't able to prove or disprove the assertions. Then, we log
it as a warning, but don't stop execution.

Here's a summary of the changes:

- Added an environment variable for turning translation validation on and off
- Added an environment variable for setting the translation validation timeout
- Possibly reverts the changes in #104464
- ~~Move from "QF_NRA" to "QF_NIRA" logic~~
    - ~~It makes more sense, given the nature of the problems~~
    - "QF_NRA" seems to solve more instances of _dynamo/test_dynamic_shapes.py_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104654
Approved by: https://github.com/ezyang
2023-07-08 19:19:00 +00:00
91dcc3b272 Fix activation checkpoint for mps (#104787)
Fixes https://github.com/pytorch/pytorch/issues/104478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104787
Approved by: https://github.com/albanD
2023-07-08 14:57:05 +00:00
c85468a94c [autograd Function] Add private API to not materialize grads for non-differentiable outputs (#104291)
Fixes https://github.com/pytorch/pytorch/issues/104272

This PR adds a new private API `materialize_non_diff_grads` (default True) such that when set to False, grad outputs corresponding to outputs marked non-differentiable would receive None instead of a zero-filled tensor. This is overrides the setting of `materialize_grads`, i.e. grad outputs corresponding non-differentiable outputs would still be None even if `materialize_grads=True` (the default).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104291
Approved by: https://github.com/albanD
2023-07-08 14:53:54 +00:00
e600505e32 [FSDP][5/N] Unblock ignored_states + auto wrap (for now) (#104418)
The "for now" is because we still have the issue that when using the parameter `ignored_states` path, we do not recover the ignored modules, so FSDP still wraps those as empty shells (no managed parameters), which is not ideal. This is not a blocking issue as far as I know.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104418
Approved by: https://github.com/rohan-varma
2023-07-08 12:40:14 +00:00
610f74627e [FSDP][4/N] Remove _get_fully_sharded_module_to_states (#104409)
`_get_fully_sharded_module_to_states()` was used to emulate auto wrapping without actually calling `fully_shard`. Since we committed to unifying (see previous PR), we can remove this function and its helpers/tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104409
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2023-07-08 12:40:14 +00:00
d9be0366d3 [FSDP][3/N] Unify fully_shard auto wrap (#104408)
This moves `fully_shard` to use `_auto_wrap()` just like `FullyShardedDataParallel`. This means that `fully_shard` goes through the `_init_param_handle_from_module()` path (i.e. 1 `fully_shard` per "wrap"), removing the need for `_init_param_handles_from_module()` (which was 1 `fully_shard` for all "wraps" of a given policy). `_auto_wrap()` simply calls `fully_shard` on target submodules.

This includes several important fixes:
- We should register the pre/post-forward hooks on the module regardless of whether it has managed parameters.
- We can permit `_module_handles` to return `[]` in the composable path (for when the module has no managed parameters).
- We should unify the paths for `_get_buffers_and_dtypes_for_computation()` (previously, composable path was buggy in some cases).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104408
Approved by: https://github.com/rohan-varma
2023-07-08 12:40:12 +00:00
6d71b4f9f1 [FSDP][2/N][Easy] Prepare _auto_wrap for fully_shard (#104407)
This mainly just changes the `_auto_wrap()` function signature and generalizes the `_check_nested_wrapping()` to both wrapper and composable paths (though the composable path will not hit in this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104407
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2023-07-08 12:40:09 +00:00
d58f75be8b [FSDP][1/N] Move wrapper ModuleWrapPolicy to new path (#104346)
This PR is the first in refactoring the auto wrapping, only affecting `ModuleWrapPolicy` for wrapper `FullyShardedDataParallel`. The end goal is to improve the auto wrapping infra to support:
- Checking valid frozen parameters (uniform frozenness per FSDP)
- Checking valid shared parameters (shared parameters assigned to their lowest-common-ancestor module or higher)
- Writing auto wrapping policies that may take multiple passes over the module tree
- Specifying different FSDP kwargs per FSDP instance (instead of enforcing the same for all FSDP instances constructed via an auto wrap policy)

The way I envision achieving this is that we decouple the actual "wrapping" (which is `_post_order_apply()` in this PR) from constructing the wrapping targets and kwargs (which is `target_module_to_kwargs` in this PR). In that way, a policy reduces to just constructing the latter `target_module_to_kwargs` mapping.

I do not personally recommend the size-based policy, but if we wanted to implement that under this new organization, the tracking of wrapped/nonwrapped numel should be done in the pass over the module tree prior to the actual "wrapping". This modularization keeps the actual "wrapping" part simple.

The change to how `old_dtype` is handled is mainly to avoid keeping a reference to the `_override_module_mixed_precision()` function closure in each hook and to allow the function to take in all module classes at once to return which ones actually got overridden for the downstream error message. (We can directly store the global state as a mapping.)

To-do in follow-ups (not in order):
- Add frozen parameter check before `_post_order_apply()`
- Add shared parameter check before `_post_order_apply()`
- Expose wrapping policy that allows per module / per module class kwarg customization (where any unspecified kwarg adopts the root's kwarg)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104346
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2023-07-08 12:40:07 +00:00
f334b54d7f Handle the list of skipped messages when uploading disabled test stats (#104803)
This fixes the failure when a list of skipped messages is encountered when uploading disabled test stats, for example https://github.com/pytorch/pytorch/actions/runs/5489936777/jobs/10004725533.

This happens for ONNX tests (running regularly), i.e. https://ossci-raw-job-status.s3.amazonaws.com/log/14868893973:

```
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
...
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool SUBSKIP [0.0000s] (Logic not implemented for size 0 inputs in op.Reshape) [ 47%]
onnx/test_op_consistency.py::TestOnnxModelOutputConsistency_opset13CPU::test_output_match_tile_cpu_bool PASSED [0.3136s] [ 47%]
```

The corresponding XML output is as follows https://paste.sh/b1DbSLJD#M-0WsXd9snjEVFh4ZsxPPIlv where `skipped` is a list of skipped messages instead of a dictionary.

As we only care about gathering disabled tests stats in this script, the list of skipped messages can be safely ignored.

### Testing

* Gathering disabled test stats works correctly when running under rerunning disabled tests mode https://github.com/pytorch/pytorch/actions/runs/5487829458/jobs/9999835911
* The command works locally for the above failed workflow (which is not a rerunning disabled tests workflow):

```
python3 -m tools.stats.check_disabled_tests --workflow-run-id "5488337480" --workflow-run-attempt 1 --repo "pytorch/pytorch"
...
The following 0 tests should be re-enabled:
The following 0 are still flaky:
Writing 0 documents to S3
Done!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104803
Approved by: https://github.com/clee2000
2023-07-08 07:23:46 +00:00
fbe2a7e50a [optim] use lerp whenever possible (#104796)
This is a better copy (with fixes) of #104781.
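
For context (a sketch, not taken from the PR): the optimizer EMA update `a.mul_(beta).add_(b, alpha=1 - beta)` is mathematically the same as a single `a.lerp_(b, 1 - beta)`, which is presumably the kind of rewrite applied here:

```python
import torch

beta = 0.9
exp_avg = torch.randn(4)
grad = torch.randn(4)

# Classic exponential-moving-average step ...
ref = exp_avg.mul(beta).add(grad, alpha=1 - beta)
# ... expressed as a single lerp: a.lerp(b, w) == a + w * (b - a)
out = exp_avg.lerp(grad, 1 - beta)

torch.testing.assert_close(ref, out)
```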

Test plan:
CI will pass once https://github.com/pytorch/pytorch/pull/104784 is landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104796
Approved by: https://github.com/albanD
2023-07-08 07:13:38 +00:00
5da4745c24 [ONNX] Fix exported onnx initializer name (#104741)
Restore ONNX initializer names to exactly match torch parameter and buffer names.
Fixes #104670, which could lead to potentially duplicated ONNX initializers after export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104741
Approved by: https://github.com/thiagocrepaldi
2023-07-08 04:36:33 +00:00
012561ff39 [ONNX] Restore readable names for parameters and buffers (#104493)
This PR introduces a new pass that restore parameter and buffer names from original module.

It is useful for readability of the exported ONNX graph. It restores the parameter and buffer names from the original module. For example, if the original module has a parameter named `root.linear.0.weight`, and the parameter is renamed to
  `_param_constant9` by FX, this pass will rename it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104493
Approved by: https://github.com/wschin, https://github.com/thiagocrepaldi
2023-07-08 04:36:33 +00:00
3d51c2e06d [ONNX] Refactor FX Registry and Support Custom Operator in FX Exporter (#103943)
## ONNXRegistry

### Motivation
In #100660, we used the torchscript registry to enable the dispatcher. However, it doesn't meet the needs of the FX exporter. The torchscript exporter is built on top of three points:

(1) Use `_SymbolicFunctionGroup` to dispatch opset version as we need ops to fall back when we don't have it in the current exporter opset version
(2) One aten maps to multiple supported opset versions, and each version maps to one symbolic function
(3) Custom symbolic function is considered prior to default symbolic function

Now that TorchLib will support all aten ops across all opset versions, we don't need the opset version dispatch layer. And with onnx overloads created by torchlib, we need a way to support custom operators and prioritize them among all overloads.

### Feature
Introduce a public OnnxRegistry API, initialized with a fixed opset version, which supports user-registered operators. **Dispatching on opset version is no longer needed, as TorchLib is expected to provide full aten support across all opset versions, and the Dispatcher is expected to prioritize custom operators over the defaults.**

### API:
(1) `register_custom_op(self, function: OnnxFunction, domain: str, op_name: str, overload: Optional[str] = None)`: Register a custom operator into the current OnnxRegistry. This is expected to be used when the default operators don't meet the needs of users, **for example, when a different opset version than the registry's is needed, or a different computation**.
(2) `is_registered_op(self, domain: str, op_name: str, overload: Optional[str] = None)`: Whether the aten op is registered.
(3) `get_functions(domain: str, op_name: str, overload: Optional[str] = None)`: Return the set of registered SymbolicFunctions under the given aten op.
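
A hypothetical usage sketch based on the signatures above; the import path and constructor of `OnnxRegistry` are assumptions, and the onnxscript function is just a trivial stand-in:

```python
from torch.onnx import OnnxRegistry  # assumed import path for the new registry
import onnxscript
from onnxscript import opset18 as op

@onnxscript.script()
def custom_aten_relu(x):
    # Trivial stand-in body; any OnnxFunction would do here.
    return op.Relu(x)

registry = OnnxRegistry()  # assumed constructor; created with a fixed opset version
registry.register_custom_op(custom_aten_relu, domain="aten", op_name="relu")
assert registry.is_registered_op(domain="aten", op_name="relu")
overloads = registry.get_functions(domain="aten", op_name="relu")
```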

### TODO:
(1) `remove_op(op_name: str)`: removing support for a certain op entirely allows decomposing the graph to prims.
(2) Expose OnnxRegistry to users, and disable the opset_version option in the export API. The export API should use only the ops in the registry.

---

## OnnxDispatcher

The changes in the functions `dispatch` and `_find_the_perfect_or_nearest_match_onnxfunction` are meant to allow complex-type and custom-operator support.

### Respect Custom Ops
(1) Override: check whether we can find a perfect match among the custom operator overloads before the defaults.
(2) Tie breaker: if a default and a custom overload are equally good nearest matches, we choose the custom one.

### Supplementary

[Design discussion doc](https://microsoft-my.sharepoint.com/:w:/p/thiagofc/EW-5Q3jWhFNMtQHHtPpJiAQB-P2qAcVRkYjfbmeSddnjWA?e=QUX9zG&wdOrigin=TEAMS-ELECTRON.p2p.bim&wdExp=TEAMS-TREATMENT&wdhostclicktime=1687554493295&web=1)

Please check the Registry and Dispatcher sections.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103943
Approved by: https://github.com/BowenBao, https://github.com/justinchuby
2023-07-08 04:15:58 +00:00
f45629d6ed Pin pillow (#104760)
Pin pillow to fix inductor periodic failure eb4a1a07af
https://github.com/pytorch/pytorch/actions/runs/5488678286/jobs/10002712674
```
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/detectron2/data/transforms/transform.py", line 46, in ExtentTransform
    def __init__(self, src_rect, output_size, interp=Image.LINEAR, fill=0):
AttributeError: module 'PIL.Image' has no attribute 'LINEAR'. Did you mean: 'BILINEAR'?
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104760
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-07-08 04:14:59 +00:00
dbc2216800 Add autograd modes table to docs (#104774)
Fixes #104461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104774
Approved by: https://github.com/soulitzer
2023-07-08 03:14:10 +00:00
2df939aaca [inductor] Update ops.bucketize to take offsets_size as a sympy.Expr (#104756)
Background/problem: ops.bucketize needs to take a value `offsets_size`, which is the length of the `offsets` tensor. It is used, e.g., for the bounds of the binary search over the `offsets` tensor. The previous implementation of `ops.bucketize` expected `offsets_size` to be a CSEVariable; i.e. we'd pass `offsets_size = ops.index_expr(offsets.get_size()[0])` into `ops.bucketize()`.  However, `ops.index_expr` will sometimes broadcast, turning the scalar `offsets_size` into a tensor. That caused errors, because [triton_helpers.bucketize_binary_search](a2fe6953bc/torch/_inductor/triton_helpers.py (L153-L155)) expects `offsets_size` to be a scalar. [Link - where the broadcasting happens](a2fe6953bc/torch/_inductor/codegen/triton.py (L1056))

Solution (this PR): Instead of passing `offsets_size` into `ops.bucketize` as a CSEVariable, pass in a sympy.Expr. Then, inside ops.bucketize, convert the sympy.Expr into a string that can be used in the generated triton code.

Differential Revision: [D47282413](https://our.internmc.facebook.com/intern/diff/D47282413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104756
Approved by: https://github.com/jansel
2023-07-08 01:08:55 +00:00
3d07184930 Move optimize indexing to use the class Bounds (#104558)
This PR removes plenty of duplicated code. In particular, it removes the two repeated implementations of `get_expr_range`, which are superseded by the more correct `bound_sympy`.

The two duplicated `get_expr_range`s were a result of an oversight in https://github.com/pytorch/pytorch/pull/100549.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104558
Approved by: https://github.com/eellison
2023-07-07 23:52:14 +00:00
710abc41cc Implement bound_sympy (#104559)
The analysis for SymPy expressions was incorrect: even though it said the assumption was "smoothness", the actual assumption was that the formula is monotone in every variable. In other words, it was assuming that the derivative does not change sign in any variable (!!).

We implement a function that, given bounds on the values of the free symbols of a sympy expression, computes a bound on the expression itself.
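
Not the implementation itself, but the underlying idea can be illustrated with plain sympy interval arithmetic (`AccumBounds`): substitute a range for each free symbol and let the ranges propagate through the expression:

```python
import sympy
from sympy import AccumBounds

x, y = sympy.symbols("x y")
expr = x * y + x

# Bound the expression given x in [-1, 2] and y in [0, 3].
bound = expr.subs({x: AccumBounds(-1, 2), y: AccumBounds(0, 3)})
print(bound)  # AccumBounds(-4, 8): a valid, though not necessarily tight, bound
```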

We reshuffle a few things in value_ranges.py to create a `SymPyValueRangeAnalysis` class, but we do not really change any code. The only relevant change in that file is the addition of the `bound_sympy` function. We do this because we don't want to inadvertently use any fallbacks in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104559
Approved by: https://github.com/eellison
2023-07-07 23:52:14 +00:00
ff05f81e1d Simplify and extend ValueRanges (#104557)
This PR:
- It adds a few boolean variants of some methods that were missing
- It simplifies the implementation of plenty of the operations
- Adds ModularIndexing to the SymPy interpreter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104557
Approved by: https://github.com/eellison
2023-07-07 23:52:13 +00:00
2f04aab140 [SPMD] Disable all SPMD tests (#104784)
SPMD is not actively developed and is out-of-sync with the PyTorch compiler code.  Disable the tests for now.

Differential Revision: [D47296840](https://our.internmc.facebook.com/intern/diff/D47296840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104784
Approved by: https://github.com/fduwjj
2023-07-07 23:31:54 +00:00
ae12081e70 [ONNX] Remove unnecessary deepcopy on args in 'DynamoExport' (#104736)
The comment is outdated. There should be no side-effects on `args` and `kwargs`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104736
Approved by: https://github.com/thiagocrepaldi
2023-07-07 22:26:27 +00:00
c68fac9c25 [pt2][inductor] include allow_tf32 in system information (#104129)
Summary: include `allow_tf32` in system information; previously aten results did not specify whether `allow_tf32` was true or not

Test Plan: sandcastle + CI

Differential Revision: D46568468

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104129
Approved by: https://github.com/jansel
2023-07-07 21:47:35 +00:00
ed4a8869af [ONNX] Fix third party custom operator support in torchscript exporter (#104785)
Prior to this PR, to support onnxscript function protos in the torchscript exporter, all registered custom symbolic functions were forced to call the `.function_proto` API as if they were onnxscript functions. This PR makes sure the custom function is an onnxscript function before using the API; to avoid the dependency, `hasattr` is used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104785
Approved by: https://github.com/BowenBao
2023-07-07 21:29:33 +00:00
d8cb80e382 [inductor] If a kernel contains bucketize, try using config with num_elements_per_warp=32 (#104456)
In the binary-search Triton implementation (#104007), num_elements_per_warp=32 performs a lot better than larger values.

This PR adds an autotuning config option for this purpose. But since autotuning can affect compile times and this config isn't generally useful, we only try this config if bucketize is present. This is done by adding an extra field to triton_meta which is used by the pointwise autotuning.

Performance: reused https://gist.github.com/davidberard98/066fd2115f59f5889ef61e4527d1eba5.

Before:
```
Eager 0.30088499188423157 ms
PT2   0.9296960234642029 ms
```

After:
```
Eager 0.3011910021305084 ms
PT2   0.22977299988269806 ms
```

Differential Revision: [D47237103](https://our.internmc.facebook.com/intern/diff/D47237103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104456
Approved by: https://github.com/eellison
2023-07-07 20:32:41 +00:00
1a661639f7 [quant] Support integer implementations for adaptive_avg_pool2d (#104226)
Summary:
This is needed for representing quantized models in the pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104226
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-07-07 19:36:31 +00:00
98e14ac37e [ONNX][TypePromo] Simplify API _run_node_and_set_meta (#104720)
Previously it was defined as `_run_node_and_update_meta_val`, which selectively updates only `meta["val"]`. The behavioral difference stems from two types of scenarios: node creation and node modification. `node.meta` is empty for the former, but already exists and is populated for the latter. This PR updates the API to handle both scenarios.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104720
Approved by: https://github.com/thiagocrepaldi
2023-07-07 17:33:53 +00:00
fa262eb46e [ONNX][TypePromo] aten.div (#104229)
Unlike the majority of other operators, `aten.div` requires this specialized rule since its type promotion kind depends on the value of the kwarg `rounding_mode`.

Check out the note `Update type promotion rule` in `type_promotion.py` before manually adding a type promotion rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104229
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-07-07 17:31:05 +00:00
29c30b1db8 [export] Fix serialize nn_module_stack (#104721)
Summary:
Some serialized nn_module_stacks contain nested commas, something like:
`(getitem(L['module'],0),torch.nn.modules.linear.Linear)`
Fixing the parsing so that we can deserialize the string in the format of: `(local identifier, module type)`

Test Plan: CI

Differential Revision: D47252881

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104721
Approved by: https://github.com/zhxchen17
2023-07-07 17:13:17 +00:00
6a3d5f1986 [HigherOrderOp] Remove _deprecated_global_ns from cond (#104380)
Remove _deprecated_global_ns from cond following #104105.

We change the module attribute of HigherOrderOperator instances in the constructor from torch.ops to torch.ops.higher_order when self.namespace is "higher_order". For subclasses (e.g. customized higher order operator), we leave their \_\_module\_\_ unchanged.

Will import this PR to fix internal tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104380
Approved by: https://github.com/zhxchen17, https://github.com/zou3519
2023-07-07 17:13:09 +00:00
d5a83a5f27 [export] Fix deserialization of symint (#104722)
Test Plan: CI

Differential Revision: D47269143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104722
Approved by: https://github.com/zhxchen17
2023-07-07 17:03:46 +00:00
199e93a0da [export] Serialize optional tensors (#104723)
Test Plan: Test in model inventory

Differential Revision: D47269141

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104723
Approved by: https://github.com/zhxchen17
2023-07-07 16:55:12 +00:00
78734a76ad Revert "Add libxml2 and libxslt in docker image (#104663)"
This reverts commit 315a77a02d3648caaffa0b6fd56f35606c50aaef.

Reverted https://github.com/pytorch/pytorch/pull/104663 on behalf of https://github.com/clee2000 due to broke periodic inductor testing ([comment](https://github.com/pytorch/pytorch/pull/104663#issuecomment-1625683229))
2023-07-07 16:53:38 +00:00
2fdf1175cd [ONNX][TypePromo] Explicit type promotion pass (#104064)
This PR adds the `ExplicitTypePromotionPass`, which performs an fx-graph-to-fx-graph transformation that explicitly adds cast nodes into the graph to emulate PyTorch type promotion behavior.

Full design doc and discussion at https://microsoft-my.sharepoint.com/:w:/p/bowbao/Edj2lF1oi0JIitT_3ntyuqkBo6ll7N6NJDmavM0lp_KkEA?e=OElyjR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104064
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-07-07 16:52:21 +00:00
eb4a1a07af Upgrade HuggingFace to v4.30.2 (#104657)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104657
Approved by: https://github.com/albanD
2023-07-07 16:01:00 +00:00
c500f1d13b [CMake] Fix TORCH_CUDA_ARCH_LIST warning (#104680)
The warning complains that `TORCH_CUDA_ARCH_LIST` is set on the environment
instead of being defined as a build variable, which is fixed by the change to
`tools/setup_helpers/cmake.py`.

However, I still see the warning even with this fix because
```cmake
if((NOT EXISTS ${TORCH_CUDA_ARCH_LIST}) ...
```
is actually checking whether a file called "7.5" (or whatever arch is being requested) exists. Instead, we want to check whether the variable is defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104680
Approved by: https://github.com/albanD
2023-07-07 15:12:54 +00:00
6970ffbbc7 [HigherOrderOps] Clean up side effect handling (#104685)
I think after https://github.com/pytorch/pytorch/pull/104077, we don't
need to do a diff between the SideEffects object before/after for
HigherOrderOps -- the ability is baked into speculate_subgraph.

The rationale for this PR is that diff-ing the SideEffects object didn't
work very well: it was overly conservative. If a variable gets
tracked for mutation, or a new cell variable is created, then the
SideEffects object changes.

The SideEffects object tracks two types of side effects:
- variable assignment/modification. This is covered by
[check_allowed_side_effect](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/side_effects.py#L146C9-L146C34)
- save_for_backward tracking. I don't think we even need to track this;
if the inputs require grad, then we cannot graph break in the middle of
autograd.Function, so we never need to replay calling `save_for_backward`.
If the inputs don't require grad, then `save_for_backward` doesn't do
anything, so it doesn't need to be replayed either. If we wanted to be
safe we could also call `check_allowed_side_effect` there.

Test Plan:
- #104077 introduced some heavy testing already. This PR adds some more
test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104685
Approved by: https://github.com/ydwu4
2023-07-07 14:23:35 +00:00
4ad5081794 [HigherOrderOp] Fix returning captured value (#104371)
Fixes #104298.

The bug was:
- we were only checking for freevars in SubgraphTracer.create_proxy
- freevars can also show up in SubgraphTracer.create_node

This PR adds free-variable handling for the output of the graph (which is created via `create_node`) in `speculate_subgraph`.

Because `create_proxy` calls `create_node`, you may be wondering why we
can't do the freevar lifting in `create_node`. The answer is that:
- `create_node` only gets used by Dynamo to create outputs of a graph,
so it is called rarely. All other callsites go through `create_proxy`.
- our freevar system is based off of VariableTrackers being associated with
Proxy objects which are associated with a SubgraphTracer.
- `create_proxy` accepts Proxy args while `create_node` accepts Node args
- Given a node, there isn't a way to retrieve the existing proxy that
wraps the node.

Test Plan:
- add new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104371
Approved by: https://github.com/ydwu4
2023-07-07 14:23:35 +00:00
8d65635378 Prefixing DeviceType with c10 namespace to avoid name collisions (#104364)
Fixes #91338

Follow up from https://github.com/pytorch/pytorch/pull/91342

> 🚀 The feature, motivation and pitch
> We have an existing DeviceType class all over the place in our code base, and it conflicts with the one that is used in torch.
> Thankfully the pytorch DeviceType enum class is under the c10 namespace.

```
In file included from /xxx/build/_deps/torch-src/../../aten/src/ATen/ops/view.h:5:
/xxx/_deps/torch-src/aten/src/ATen/Context.h:265:14: error: reference to 'DeviceType' is ambiguous
    if (p == DeviceType::HIP) {
             ^
/xxx/include/Common_types.h:178:8: note: candidate found by name lookup is 'DeviceType'
struct DeviceType {
       ^
/xxx/build/_deps/torch-src/c10/../c10/core/DeviceType.h:32:12: note: candidate found by name lookup is 'c10::DeviceType'
enum class DeviceType : int8_t {
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104364
Approved by: https://github.com/albanD
2023-07-07 13:23:03 +00:00
296b45f9d3 Cleanup scatter-related code (#103074)
This patch cleans up scatter-related code.

The GNN-specific implementation of the scatter operation uses `radix_sort` to sort the indices; since `radix_sort` was recently moved to FBGEMM common utils (via [pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)), we no longer need a local copy of the algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103074
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2023-07-07 12:38:46 +00:00
wgb
63dc24b4a6 Expose some APIs in FunctionsManual.h (#104684)
Fixes #ISSUE_NUMBER
Expose some APIs in FunctionsManual.h for custom devices. These can be used in codegen features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104684
Approved by: https://github.com/albanD
2023-07-07 11:22:40 +00:00
0bf39d5663 [FSDP] Option for eval in fp32/bf16 (#104682)
In https://github.com/pytorch/pytorch/pull/97645 and some follow up diffs, we made FSDP run in full precision in eval mode, even if mixed precision was specified.

However, this is probably not the best idea, and we should give users a bit more control over this. This adds an env var FSDP_FULL_PREC_IN_EVAL, defaulting to off; users who want to run eval in fp32 can toggle it before wrapping the model in FSDP:

os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"
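
Roughly (a sketch, not from the PR; the single-process gloo setup is only there to make the example self-contained):

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500", rank=0, world_size=1)

# Must be set before wrapping the model in FSDP.
os.environ["FSDP_FULL_PREC_IN_EVAL"] = "1"

model = FSDP(
    torch.nn.Linear(8, 8),
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
)
model.eval()  # with the env var set, eval runs in full precision
```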

Verified that unittests, APS workflow, TNT workloads can run eval appropriately with this change.

Differential Revision: [D47246556](https://our.internmc.facebook.com/intern/diff/D47246556/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104682
Approved by: https://github.com/awgu
2023-07-07 08:14:23 +00:00
e517b3651a [pytorch] put more to pytorch_fmha namespace (#104628)
Summary:
Without this diff we get
```
CUDA error (./fbcode/caffe2/aten/src/ATen/native/transformers/cuda/flash_attn/fmha_bwd_launch_template.h:113): an illegal memory access was encountered
```

Test Plan:
hg up e49463501
fbcode/ai_codesign/gen_ai/xlformers/scripts/run_xlformers_train_local.sh

Reviewed By: drisspg

Differential Revision: D47220255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104628
Approved by: https://github.com/drisspg
2023-07-07 06:46:01 +00:00
348dfc1cf3 Update cuDNN to 8.9.2.26 (#104757)
Companion PR for https://github.com/pytorch/builder/pull/1436
Should fix `cuDNN version incompatibility: PyTorch was compiled  against (8, 9, 2) but found runtime version (8, 8, 1). PyTorch already comes bundled with cuDNN. One option to resolving this error is to ensure PyTorch can find the bundled cuDNN.one possibility is that there is a conflicting cuDNN in LD_LIBRARY_PATH.` error in [manywheel-py3_8-cuda12_1-with-pypi-cudnn-test](https://github.com/pytorch/pytorch/actions/runs/5480628146/jobs/9986843286#step:16:2347)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104757
Approved by: https://github.com/drisspg, https://github.com/xw285cornell, https://github.com/r-barnes
2023-07-07 05:46:46 +00:00
8ca63ff9a8 Revert "[inductor] Add single pass "var_unnormalized" reduction_type (#102486)"
This reverts commit 7e098f95593240d45d28f040ff53f268ad3d9a93.

Reverted https://github.com/pytorch/pytorch/pull/102486 on behalf of https://github.com/clee2000 due to sorry but this seems to have broken inductor/test_torchinductor.py::CpuTests::test_std_cpu on mac x86 64 machines 7e098f9559 https://github.com/pytorch/pytorch/actions/runs/5479008241/jobs/9981443710 ([comment](https://github.com/pytorch/pytorch/pull/102486#issuecomment-1624739465))
2023-07-07 04:57:20 +00:00
1280b19827 Revert "[inductor] Split ops.reduction into reduction and store_reduction (#102737)"
This reverts commit 59b8d5be7405c6f8a445b504b73a7e8c7812e860.

Reverted https://github.com/pytorch/pytorch/pull/102737 on behalf of https://github.com/clee2000 due to sorry but i need to revert this to revert the other one in the stack ([comment](https://github.com/pytorch/pytorch/pull/102737#issuecomment-1624735108))
2023-07-07 04:53:14 +00:00
a2fe6953bc Generate nearbyint for Round in tensorexpr llvm codegen, match torch.round result (#104430)
Fixes #103465, which matches the behavior of `torch.round` ([doc](https://pytorch.org/docs/stable/generated/torch.round.html?highlight=round#torch.round)) - “round half to even”

Using the repro code, the output is correct:
```
Using torch version=2.1.0a0+git84fedbc and optimization enabled=True
[cpu ] Python = 2, Torch = 2, Torch traced = 2
Using torch version=2.1.0a0+git84fedbc and optimization enabled=False
[cpu ] Python = 2, Torch = 2, Torch traced = 2
```
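
For reference, "round half to even" means exact halves go to the nearest even value:

```python
import torch

t = torch.tensor([0.5, 1.5, 2.5, 3.5])
print(torch.round(t))  # tensor([0., 2., 2., 4.]) -- ties are rounded to the nearest even integer
```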

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104430
Approved by: https://github.com/jgong5, https://github.com/davidberard98
2023-07-07 01:47:46 +00:00
8ce3a18b6a inductor: reduce compile time by reducing repr calls of quantize or Opaque tensor (#104696)
For quantized or opaque tensors that are constant values, calling the tensor ```__repr__``` incurs a memory copy (https://github.com/pytorch/pytorch/blob/main/torch/_tensor_str.py#L550):
db1ac4e29b/torch/_inductor/codegen/wrapper.py (L289-L292)

For CPP codegen, ```WrapperCodeGen``` is instantiated many times: https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/cpp.py#L2023, which consumes much time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104696
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-07 01:12:34 +00:00
0ccdbbe233 Add deterministic path for Tensor.resize_ (#104300)
New elements added to a tensor by `torch.Tensor.resize_` are set to NaN/MAX_INT when deterministic mode is turned on.

When `torch.Tensor.resize_` is called on a quantized tensor and deterministic mode is turned on, a nondeterministic error is raised.

Part of #82004
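
A small sketch of the behavior described above (the fill values follow the description: NaN for floating-point dtypes, MAX_INT for integer dtypes):

```python
import torch

torch.use_deterministic_algorithms(True)

t = torch.ones(2)
t.resize_(4)   # existing elements are preserved; the newly added memory is now filled
print(t)       # e.g. tensor([1., 1., nan, nan]) for a float tensor
```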

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104300
Approved by: https://github.com/albanD
2023-07-07 00:22:13 +00:00
d64bada876 Refactor funcol for readability and dynamo tracing (#104387)
Move the eager kernel impls to a separate file, which is easier to read (since users may be confused by two versions of each kernel in the same file) and makes it easier to set a dynamo policy that currently traces only the first file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104387
Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/kumpera
2023-07-06 23:29:49 +00:00
456ecefd52 [BE] Fix warning in top-level CMakeLists.txt (#104726)
Fixes warning introduced by https://github.com/pytorch/pytorch/issues/102594:
```
CMake Warning (dev) in CMakeLists.txt:
  A logical block opening on the line
    /pytorch/CMakeLists.txt:726 (if)
  closes on the line
    /pytorch/CMakeLists.txt:735 (endif)
  with mis-matching arguments.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104726
Approved by: https://github.com/huydhn, https://github.com/atalman
2023-07-06 22:13:29 +00:00
8c13e96be2 [dynamo] add logging artifact for traced graph tensor sizes (#104672)
Log tensor size information with the `graph_sizes` logging artifact, as part of the model x-ray feature requests. Typically can be combined with `graph_code`.

Sample:
```python
import torch

def fn(a, b, c, d):
    return (a + b) @ (c + d)

opt_fn = torch.compile(fn, backend="eager", dynamic=False)
opt_fn(torch.randn(10, 20), torch.randn(1, 20), torch.randn(20, 15), torch.randn(1, 15))
opt_fn(torch.randn(5, 2), torch.randn(1, 2), torch.randn(2, 4), torch.randn(1, 4))
```

Output:
```shell
$ TORCH_LOGS="graph_sizes,graph_code" python playground8.py
[2023-07-06 01:42:39,093] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
 ===== __compiled_fn_0 =====
 <eval_with_key>.0 class GraphModule(torch.nn.Module):
    def forward(self, L_a_ : torch.Tensor, L_b_ : torch.Tensor, L_c_ : torch.Tensor, L_d_ : torch.Tensor):
        l_a_ = L_a_
        l_b_ = L_b_
        l_c_ = L_c_
        l_d_ = L_d_

        # File: playground8.py:66, code: return (a + b) @ (c + d)
        add = l_a_ + l_b_;  l_a_ = l_b_ = None
        add_1 = l_c_ + l_d_;  l_c_ = l_d_ = None
        matmul = add @ add_1;  add = add_1 = None
        return (matmul,)

[2023-07-06 01:42:39,093] torch._dynamo.output_graph.__graph_sizes: [DEBUG] TRACED GRAPH TENSOR SIZES
===== __compiled_fn_0 =====
l_a_: (10, 20)
l_b_: (1, 20)
l_c_: (20, 15)
l_d_: (1, 15)
add: (10, 20)
add_1: (20, 15)
matmul: (10, 15)

[2023-07-06 01:42:39,198] torch._dynamo.output_graph.__graph_code: [DEBUG] TRACED GRAPH
 ===== __compiled_fn_1 =====
 <eval_with_key>.1 class GraphModule(torch.nn.Module):
    def forward(self, s0 : torch.SymInt, s1 : torch.SymInt, L_a_ : torch.Tensor, L_b_ : torch.Tensor, s4 : torch.SymInt, L_c_ : torch.Tensor, L_d_ : torch.Tensor):
        l_a_ = L_a_
        l_b_ = L_b_
        l_c_ = L_c_
        l_d_ = L_d_

        # File: playground8.py:66, code: return (a + b) @ (c + d)
        add = l_a_ + l_b_;  l_a_ = l_b_ = None
        add_1 = l_c_ + l_d_;  l_c_ = l_d_ = None
        matmul = add @ add_1;  add = add_1 = None
        return (matmul,)

[2023-07-06 01:42:39,198] torch._dynamo.output_graph.__graph_sizes: [DEBUG] TRACED GRAPH TENSOR SIZES
===== __compiled_fn_1 =====
l_a_: (s0, s1)
l_a_ (concrete): (5, 2)
l_b_: (1, s1)
l_b_ (concrete): (1, 2)
l_c_: (s1, s4)
l_c_ (concrete): (2, 4)
l_d_: (1, s4)
l_d_ (concrete): (1, 4)
add: (s0, s1)
add (concrete): (5, 2)
add_1: (s1, s4)
add_1 (concrete): (2, 4)
matmul: (s0, s4)
matmul (concrete): (5, 4)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104672
Approved by: https://github.com/ezyang
2023-07-06 21:44:05 +00:00
c7c9aa797f [dynamo] New logging artifacts for source code attribution (#104013)
Prototype for the feature request:
>When working on a codebase that is unfamiliar to you, it can be helpful to single step through all of the code to see what is getting executed, what conditional branches are taken, and where indirect function jumps go. Model x-ray uses dynamo to give you a single step log of every source code line that does something relevant (i.e., a Tensor operation)

Dynamo logs to the ~`starts_line`~ `trace_source` logging artifact at the start of tracing new bytecode with a new line. It logs the line of source code associated with that bytecode.

~~Dynamo logs to the `graph_source` logging when a FX GraphModule is constructed. For each node in the graph, it logs the location of the original source code associated with that node.~~

Development notes: https://docs.google.com/document/d/1LjFeHzCgDDt535QUq5HydcQs56d7jWl5RvW8TLZN19g/edit?usp=sharing

Since the draft, we removed the `graph_source` logging artifact since printing the code of `GraphModule`s already displays the original source.

Sample:

```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

def f_cond(pred, x):
    return cond(pred, true_fn, false_fn, [x])

def f_outer(pred, x):
    y = f_cond(pred, x)
    if x.sum() > 0:
        x = x * 2
    else:
        x = x * 3
    return x, y

opt_f_cond = torch.compile(f_outer, backend="eager")
opt_f_cond(torch.tensor(True), torch.randn(3, 3))
```

Logs:
```shell
$ TORCH_LOGS="trace_source" python playground8.py
TRACE starts_line f_outer playground8.py:54
    def f_outer(pred, x):
TRACE starts_line f_outer playground8.py:55
        y = f_cond(pred, x)
TRACE starts_line f_cond playground8.py:51 (inline depth: 1)
    def f_cond(pred, x):
TRACE starts_line f_cond playground8.py:52 (inline depth: 1)
        return cond(pred, true_fn, false_fn, [x])
TRACE starts_line true_fn playground8.py:45 (inline depth: 2)
    def true_fn(x):
TRACE starts_line true_fn playground8.py:46 (inline depth: 2)
        return x * 2
TRACE starts_line false_fn playground8.py:48 (inline depth: 2)
    def false_fn(x):
TRACE starts_line false_fn playground8.py:49 (inline depth: 2)
        return x * 3
TRACE starts_line f_outer playground8.py:56
        if x.sum() > 0:
TRACE starts_line <resume in f_outer> playground8.py:56
        if x.sum() > 0:
TRACE starts_line <resume in f_outer> playground8.py:57
            x = x * 2
TRACE starts_line <resume in f_outer> playground8.py:60
        return x, y
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104013
Approved by: https://github.com/ezyang
2023-07-06 21:43:55 +00:00
8c0b9a2d69 [ONNX] Export dynamic step size for aten::slice() (#104385)
This commit improves the export of aten::slice() to ONNX in the following ways:

1. The step size can be an input tensor rather than a constant.
2. Fixes a bug where using a 1-D, 1-element torch tensor as an index created a broken ONNX model.

This commit also adds tests for the new functionality.

Fixes #104314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104385
Approved by: https://github.com/thiagocrepaldi
2023-07-06 21:38:59 +00:00
c42fd73cf9 Add functions to get and set default endianness in load() functions (#101973)
By default interpret tensor data as native endian, but add an option to interpret data as little endian or big endian.

Related to #101688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101973
Approved by: https://github.com/mikaylagawarecki
2023-07-06 20:12:56 +00:00
2efe4d809f [hotfix inductor test] disable cpp vectorization codegen in fbcode for inductor (#104560)
Summary:
After D46364355 landed, a few inductor internal tests started failing. When I ran this locally:
```
buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config
```

The test appeared to hang with this output, until it would fail with a timeout after 10 minutes passed:
```
Test caffe2/test/inductor:config -- discovering tests [local_execute]
```

Eventually, I realized that inductor has a value `HAS_CPU` (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/testing/_internal/inductor_utils.py?lines=23) that is implemented lazily. Part of that implementation involves inspecting `/proc/cpuinfo` to figure out what vectorized instructions are available, and that call appeared to hang (https://www.internalfb.com/code/fbsource/[6cc47fa5eb77a93d91a519d3eb3df67ceddb8faa]/fbcode/caffe2/torch/_inductor/codecache.py?lines=568).

Since vectorized codegen for inductor cpu internally already isn't working, I hardcoded that test to fail for now in fbcode.

Test Plan:
Confirmed that this passes:
`buck2 test fbcode//mode/dev-nosan fbcode//caffe2/test/inductor:config`

Differential Revision: D47199912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104560
Approved by: https://github.com/desertfire, https://github.com/bertmaher
2023-07-06 19:00:13 +00:00
b190f46514 Allow NumPy code in torch.compile to run on cuda (#104699)
This can be achieved by doing `torch.set_default_device("cuda")`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104699
Approved by: https://github.com/ezyang, https://github.com/larryliu0820
2023-07-06 18:43:09 +00:00
b073f6a5e8 Revert "inductor: support cpu fusion path for bfloat16 amp (#104399)"
This reverts commit c46869a9415ef152be15bac65b64e8a75503c27d.

Reverted https://github.com/pytorch/pytorch/pull/104399 on behalf of https://github.com/clee2000 due to Sorry but it seems like this PR broke slow tests (and maybe also mac periodic tests?) inductor/test_cpp_wrapper.py::TestCppWrapper::test_conv2d_unary_cpu_cpp_wrapper c46869a941 https://github.com/pytorch/pytorch/actions/runs/5477792452/jobs/9977634660 ([comment](https://github.com/pytorch/pytorch/pull/104399#issuecomment-1624131181))
2023-07-06 18:26:17 +00:00
a358a9262e [inductor] coordesc tuner bug fix with no_x_dim kernel (#104692)
We recently added an optimization to squash the x dimension for persistent reduction kernels when we are confident that XBLOCK will always be 1. We need to update the code so that the coordinate descent tuner does not tune XBLOCK in this case.

Test command. Fail before the fix and pass after.
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --accuracy --only BertForMaskedLM --inference
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104692
Approved by: https://github.com/jansel
2023-07-06 17:47:02 +00:00
c42de84708 [quant] Skip some x86 quantizer tests for now due to time out (#104666)
Summary: att

Test Plan: sandcastle ci

Reviewed By: malfet

Differential Revision: D47234616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104666
Approved by: https://github.com/DanilBaibak
2023-07-06 17:34:13 +00:00
202fb95c68 [benchmark][export] Add torch.export passrate for TB/TIMM benchmarks (#104382)
issues resolved: https://github.com/pytorch/pytorch/issues/104294

local test on TB and TIMM
* python benchmarks/dynamo/torchbench.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary
* python benchmarks/dynamo/timm_models.py -d cuda --inference --accuracy --progress --export --print-dataframe-summary

why not HF
* huggingface passes kwargs (dict) to torch.nn.Module
* we will need to support kwargs in torch._export.export, which is in progress

local test result

timm 95% pass rate (58 out of 61 passed) P781702926
* 1 x [export specific] ERROR:common:Mutating module attribute rel_indices during export
* 1 x [not relevant to export] Unknown model (SelecSls42b)
* 1 x [not relevant to export] Failed to load model: HTTP Error 409: Public access is not permitted on this storage account

torchbench 54% pass rate (41 out of 75 passed) P781690552
* 7 x ERROR:common:Dynamo input and output is a strict subset of traced input/output
* 3 x ERROR:common:call_method NNModuleVariable() / UserDefinedObjectVariable
* 3 x ERROR:common:Mutating module attribute {xx} during export.
* 2 x ERROR:common:inline in skipfiles
* 2 x ERROR:common:Consider annotating your code using constrain_as_*(). It appears that you're trying
* 1 x ERROR:common:guard on data-dependent symbolic int/float
* 1 x ERROR:common:Tensor.tolist
* 1 x ERROR:common:Tensor.numpy. Turn on config.numpy_ndarray_as_tensor and install torch_np to support tensor.numpy(). [maybe dev env?]
* 1 x ERROR:common:missing: BUILD_SET
* 1 x ERROR:common:whole graph export entails exactly one guard export
* 1 x ERROR:common:call_function BuiltinVariable(str) [GetAttrVariable(UserMethodVariable(<function
* 1 x ERROR:common:Dynamic slicing on data-dependent value is not supported
* 1 x ERROR:common:Failed running call_function <function interpolate at 0x7f60a8361ea0>(*(FakeTensor(..., device='cuda:0', size=(1, 3, 427,
* 1 x ERROR:common:Dynamo attempts to add additional input during export: value=0.6177528500556946, source=RandomValueSource(random_call_index=0)
* 1 x Found following user inputs located at [16, 17, 18, 19, 20, 21, 22] are mutated. This is currently banned in the aot_export workflow.
* 1 x RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation
* 4 x pass_due_to_skip
* 1 x eager_2nd_run_OOM
* 1 x fail_accuracy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104382
Approved by: https://github.com/zhxchen17
2023-07-06 17:16:07 +00:00
f8aedf1efe Revert "Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)"
This reverts commit da7675621efce341c80187e404ac62cb6c22bbf8.

Reverted https://github.com/pytorch/pytorch/pull/103427 on behalf of https://github.com/clee2000 due to sorry but it looks like this pr broke test_scatter_gather_ops.py::TestScatterGatherCPU::test_scatter_expanded_index_cpu_bfloat16 on periodic parallelnative testing da7675621e https://github.com/pytorch/pytorch/actions/runs/5477783108/jobs/9977608393 ([comment](https://github.com/pytorch/pytorch/pull/103427#issuecomment-1624008753))
2023-07-06 17:02:03 +00:00
c4cf90aad1 inductor: fix assert error when load a bfloat16 inf constant (#104614)
Fix ```nanogpt_generate``` bfloat16 path error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104614
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-07-06 17:01:04 +00:00
4fafe0b74c [export][serde] Hookup export upgrader with TorchScript upgrader entries (#104227)
Adding an API to get the upgraders entry map directly from:

https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/operator_upgraders/upgraders_entry.cpp#L17

Combine the information there along with the operator version map from:

https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/operator_upgraders/version_map.cpp#L18

We can get a upgrader map with: upgrader name, old schema and upgrader string.

This dict will be sent to GraphModuleOpUpgrader to populate the upgrader passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104227
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2023-07-06 16:57:36 +00:00
6c1d959889 [FSDP] Annotate modules for fully_shard (#104363)
This annotates modules managed by `fully_shard` for TorchDynamo to treat them specially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104363
Approved by: https://github.com/fegin
2023-07-06 16:56:59 +00:00
7c8dded9db [BE] QNNPACK Test - FC, use ASSERT_NEAR (#104651)
Compare fp numbers using assert_near with a tolerance of reference * 10e-4. A somewhat arbitrary threshold that makes the test pass on SSE2, given that the absolute numbers are in a somewhat wider range.

Differential Revision: [D47195286](https://our.internmc.facebook.com/intern/diff/D47195286/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104651
Approved by: https://github.com/mcr229
2023-07-06 16:32:56 +00:00
cbad55f6c4 [BE] QNNPACK Test - Sparsegemm tests, use ASSERT_NEAR (#104650)
Compare fp numbers using assert_near with a tolerance of reference * 10e-3. A somewhat arbitrary threshold that makes the test pass on SSE2, given that the absolute numbers are in a somewhat wider range.

Differential Revision: [D47195288](https://our.internmc.facebook.com/intern/diff/D47195288/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104650
Approved by: https://github.com/mcr229
2023-07-06 16:32:56 +00:00
ce1a40519f [BE] QNNPACK - Q8[g]avg, loosen threshold to allow fp compare to pass (#104649)
0.5 --> 0.5001 to tolerate fp-op reordering surfaced with LLVM15. Not the best fix.

Differential Revision: [D47195289](https://our.internmc.facebook.com/intern/diff/D47195289/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104649
Approved by: https://github.com/mcr229
2023-07-06 16:32:56 +00:00
833faccce2 [BE] QNNPACK Test - DQgemm tests, use ASSERT_NEAR (#104648)
Compare fp numbers using assert_near with a tolerance of reference * 10e-4. A somewhat arbitrary threshold that makes the test pass on SSE2, given that the absolute numbers are in a somewhat wider range.

Differential Revision: [D47195287](https://our.internmc.facebook.com/intern/diff/D47195287/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104648
Approved by: https://github.com/mcr229
2023-07-06 16:32:56 +00:00
5c2dc9b0b2 Label for mem leak check (#104643)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104643
Approved by: https://github.com/huydhn
2023-07-06 16:32:49 +00:00
315a77a02d Add libxml2 and libxslt in docker image (#104663)
lxml got updated to 4.9.3 and wants libxml2 and libxslt
https://github.com/pytorch/pytorch/actions/runs/5467965204/jobs/9956542063#step:5:5064
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104663
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-07-06 16:32:30 +00:00
59b8d5be74 [inductor] Split ops.reduction into reduction and store_reduction (#102737)
This is intended as a first step towards reductions with multiple outputs. This
also incidentally improves CSE of reductions under C++ codegen. For example,
```python
def fn(x):
    return torch.argmin(x, dim=-1), torch.argmin(x, dim=-1)
```

Currently this generates two reductions, where the common load is CSEd
```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
    if (tmp_acc1.value > tmp0) {
        tmp_acc1.index = i1; tmp_acc1.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
auto tmp2 = tmp_acc1.index;
out_ptr1[static_cast<long>(i0)] = tmp2;
```

but with this change it gets CSEd to a single accumulator

```cpp
for(long i1=static_cast<long>(0L); i1<static_cast<long>(10L); i1+=static_cast<long>(1L))
{
    auto tmp0 = in_ptr0[static_cast<long>(i1 + (10L*i0))];
    if (tmp_acc0.value > tmp0) {
        tmp_acc0.index = i1; tmp_acc0.value = tmp0;
    }
}
auto tmp1 = tmp_acc0.index;
out_ptr0[static_cast<long>(i0)] = tmp1;
out_ptr1[static_cast<long>(i0)] = tmp1;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102737
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-07-06 16:22:19 +00:00
def7b3ed60 Enable bitwise shift operations tests (#97150)
With #70904 fixed, we can remove the skips for the bitwise shift tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97150
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-07-06 15:32:57 +00:00
17ab4f85e9 [c10d] Adopt allgather_into_tensor_coalesced for NCCL. (#103086)
This is done by adding c10d::_allgather_into_tensor_coalesced wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103086
Approved by: https://github.com/rohan-varma
2023-07-06 15:05:55 +00:00
0aa6486441 inductor: reduce compile time for cpu backend by reducing weight conversion (#104402)
Before this PR, we always added ```to_mkldnn``` before doing weight packing; this is redundant, as we can directly convert a dense tensor to a blocked tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104402
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison, https://github.com/desertfire
2023-07-06 13:44:50 +00:00
adf1405909 [HigherOrderOp] Simplify design by removing reliance on name match (#104350)
Previously:
- we were keeping a list of proxies seen by the current SubgraphTracer.
It turns out, fx.Proxy has a .tracer field that we should be able to use instead.
- we were using name matching to determine if a freevar was already
lifted to being an input of the parent SubgraphTracer. Voz and I have previously expressed concerns about the robustness of name matching.

This PR introduces a simplified design with more invariants:
- When doing HigherOrderOp tracing, we may encounter Proxys
- Each Proxy object is associated with a SubgraphTracer.
- The new invariant is that SubgraphTracer should only construct Nodes
using Proxy that come from the SubgraphTracer. This helps us avoid
malformed graphs.
- If the Proxy object came from another SubgraphTracer, then this means
it is a free variable. We need to lift it to being an input of the
current SubgraphTracer, which will result in the construction of a new
Proxy in the current SubgraphTracer. This new Proxy should be used
whenever the old Proxy is seen by the current SubgraphTracer.

Test Plan:
- existing tests + some new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104350
Approved by: https://github.com/ydwu4, https://github.com/voznesenskym
2023-07-06 13:32:33 +00:00
69c4314945 Add more child links to benchmark readme (#104627)
Fixes #104625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104627
Approved by: https://github.com/drisspg
2023-07-06 12:11:00 +00:00
db1ac4e29b fix functional collective's allgather for gloo (#104681)
Summary: We should explicitly check for the gloo backend instead of relying on the shard's device, because a user might pass a GPU tensor as input and a gloo process group as the pg, and expect that to work.

Differential Revision: D47249172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104681
Approved by: https://github.com/rohan-varma, https://github.com/fduwjj
2023-07-06 09:52:48 +00:00
b1ea0d90fe [MPS] Set the default optimization level (#104661)
Set the graph optimization level to 0 (avoids dispatches to Neural Engine).
Fixes https://github.com/pytorch/pytorch/issues/104642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104661
Approved by: https://github.com/razarmehr
2023-07-06 08:46:44 +00:00
917ac30aeb Revert "inductor: reduce compile time for cpu backend by reducing weight conversion (#104402)"
This reverts commit 6bfd507c15f2d26212d3e2b9e581d9525bfd37d1.

Reverted https://github.com/pytorch/pytorch/pull/104402 on behalf of https://github.com/XiaobingSuper due to introduce compile error for fp32 linear ([comment](https://github.com/pytorch/pytorch/pull/104402#issuecomment-1623189759))
2023-07-06 08:13:02 +00:00
8e2e2d730e [Quant][PT2E]Accelerate test of conv2d_add and conv2d_add_relu by reducing test configs (#104686)
**Summary**
Reduce the test time of `test_conv2d_binary_with_quantizer_api` and `test_conv2d_binary_unary_with_quantizer_api`.
* For `test_conv2d_binary_with_quantizer_api`, reduce the number of test config from 12 to 2.
* For `test_conv2d_binary_unary_with_quantizer_api`, reduce the number of test config from 24 to 2.

**Test Plan**
```
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_with_quantizer_api
python -m pytest test_x86inductor_quantizer.py -k test_conv2d_binary_unary_with_quantizer_api
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104686
Approved by: https://github.com/jerryzh168
2023-07-06 07:34:46 +00:00
ac9c2aa6ee Use random inputs for mps extension tests (#104597)
The tested function simply adds `x` and `y`; given this, using random inputs instead of zeros makes more sense.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104597
Approved by: https://github.com/albanD
2023-07-06 07:14:56 +00:00
6bfd507c15 inductor: reduce compile time for cpu backend by reducing weight conversion (#104402)
Before this PR, we always added ```to_mkldnn``` before doing weight packing; this is redundant, as we can directly convert a dense tensor to a blocked tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104402
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/eellison
2023-07-06 06:07:05 +00:00
434fcffa21 [6/n][FSDP] Update _sharded_pre_load_state_dict_hook to use DTensor when use_dtensor=True in ShardedStateDictConfig (#104087)
This allows us to use use_dtensor=True for ShardedStateDictConfig() before calling model.load_state_dict(). It only works for offload_to_cpu=False for now.

Next PR will make use_dtensor=True work with offload_to_cpu=True for load_state_dict().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104087
Approved by: https://github.com/fegin
2023-07-06 05:36:19 +00:00
a956b1c849 optimize mimalloc build options. (#104497)
1. PyTorch only needs the static lib; disable the other libs.
2. Disable override; PyTorch only accesses mimalloc via cpu_alloc/cpu_free.

Reference: https://github.com/microsoft/mimalloc/blob/master/CMakeLists.txt#L10-L25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104497
Approved by: https://github.com/jgong5, https://github.com/albanD
2023-07-06 04:44:21 +00:00
c3f29ed16e Update cutlass submodule to stable 3.1 from RC (#104638)
CUTLASS was on a release candidate previously this updates it up to stable with a few additional fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104638
Approved by: https://github.com/ezyang
2023-07-06 04:32:23 +00:00
22520964ae inductor: convert view to reshape before doing fake_tensor_prop at freezing step (#104612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104612
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/shunting314
2023-07-06 04:27:50 +00:00
13763f58ad [vision hash update] update the pinned vision hash (#104677)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104677
Approved by: https://github.com/pytorchbot
2023-07-06 03:26:41 +00:00
df281bf788 Refactor unwrap_proxy() for proxy tensor tracing. (#104667)
Test Plan: CI

Differential Revision: D47241815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104667
Approved by: https://github.com/tugsbayasgalan
2023-07-06 03:03:13 +00:00
d0e5c681f5 [dynamo][ddp][ac] Fallback to single bucket when higher order op (#104639)
This helps unblock an internal model. The real fix requires a lot of work, which might question the alternate approach of partitioning AOT graphs instead of Dynamo graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104639
Approved by: https://github.com/wconstab
2023-07-06 02:20:15 +00:00
da7675621e Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer shared among threads to accumulate intermediate data in fp32.
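
A minimal sketch of the accumulation idea in plain PyTorch, not the actual ATen kernel; the helper name and the single shared buffer are illustrative assumptions:

```python
import torch

def scatter_add_lowp_sketch(dst, dim, index, src):
    # Accumulate in a float32 buffer, then cast back to the low-precision
    # dtype once, instead of accumulating directly in bf16/fp16.
    buf = dst.to(torch.float32)
    buf.scatter_add_(dim, index, src.to(torch.float32))
    return buf.to(dst.dtype)

dst = torch.zeros(4, dtype=torch.bfloat16)
index = torch.tensor([0, 0, 1, 3])
src = torch.full((4,), 0.1, dtype=torch.bfloat16)
print(scatter_add_lowp_sketch(dst, 0, index, src))
```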

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-06 01:23:56 +00:00
bf127d236a Update xla.txt (#104671)
As discussed with @JackCaoG

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104671
Approved by: https://github.com/JackCaoG
2023-07-06 01:02:21 +00:00
c46869a941 inductor: support cpu fusion path for bfloat16 amp (#104399)
This PR adds the CPU fusion path for the bfloat16 AMP case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104399
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-06 00:51:29 +00:00
e802900bdc inductor: move the CPU weight packing path after AOTAutograd (#103851)
Next steps:
1. support amp path for applying more fusion.
2. support dynamic shape path for applying more fusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103851
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-06 00:48:35 +00:00
8c191d8eef [dynamo][ac] Reland #104397 - Remove disable monkeypatching of utils.checkpoint (#104665)
NO CHANGE from before. The ancestor diff was reverted, so this diff got reverted as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104665
Approved by: https://github.com/wconstab
2023-07-06 00:48:02 +00:00
0444f9f85b [dynamo] Reland #104317 - Lazy disable_dynamo API out-of-dynamo (#104664)
Internal builds failed because of torch.deploy issues with disable_dynamo in fx/* and _jit/* files. Removing disable_dynamo for both; added a comment in the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104664
Approved by: https://github.com/wconstab
2023-07-06 00:48:02 +00:00
d3589c9456 reduce computation of batch_norm when weight or bias is none (#104616)
For the batch_norm decomposition, if weight or bias is None, we can skip some computations for better performance.
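A hedged sketch of the idea, simplified to an elementwise normalization (the real decomposition also handles broadcasting and training-mode statistics):

```python
import torch

def batch_norm_affine_sketch(x, mean, var, weight=None, bias=None, eps=1e-5):
    out = (x - mean) / torch.sqrt(var + eps)
    if weight is not None:  # multiply is skipped entirely when weight is None
        out = out * weight
    if bias is not None:    # add is skipped entirely when bias is None
        out = out + bias
    return out
```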
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104616
Approved by: https://github.com/lezcano, https://github.com/desertfire, https://github.com/jgong5
2023-07-06 00:47:41 +00:00
13ea3d8530 [jit] Fix inspect.get_annotations usage in python >= 3.10 (#104485)
Fixes #104484

For >= 3.10, we use `inspect.get_annotations` instead of `getattr(.., "__annotations__")`. [Docs](https://docs.python.org/3/library/inspect.html#inspect.get_annotations) say that get_annotations() "Ignores inherited annotations on classes. If a class doesn’t have its own annotations dict, returns an empty dict." In practice though, this doesn't always seem to be true; until you call inspect.getmembers it seems like you still get inherited annotations. In particular, this means that if you script a certain type twice, the first time it may pass scripting but on the second try it may not pass scripting.

This PR adds a more comprehensive handling of get_annotations by recursively reading the annotations of the base types. (TorchScript doesn't officially support this; but since it worked in <3.10, it's now breaking internal stuff as python gets upgraded to 3.10)
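
A minimal sketch (assuming Python >= 3.10) of reading annotations across the MRO so inherited annotations are handled explicitly; `collect_annotations` is an illustrative name, not the actual TorchScript helper:

```python
import inspect

def collect_annotations(cls):
    annotations = {}
    for base in reversed(cls.__mro__):  # walk base classes first, subclass last
        annotations.update(inspect.get_annotations(base))
    return annotations

class Base:
    x: int

class Child(Base):
    y: str

print(collect_annotations(Child))  # {'x': <class 'int'>, 'y': <class 'str'>}
```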

Verified in #104486 that the test does actually fail before the changes in this PR were added.

Differential Revision: [D47163891](https://our.internmc.facebook.com/intern/diff/D47163891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104485
Approved by: https://github.com/eellison
2023-07-06 00:37:47 +00:00
7e098f9559 [inductor] Add single pass "var_unnormalized" reduction_type (#102486)
This is a bit inefficient because it computes the mean and throws it
away since ir.Reduction nodes only have 1 output. However, the mean
can at least be scheduled into the same loop as the variance now since
there is no data dependency. Thus we can take fewer passes over the
data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102486
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-07-06 00:00:59 +00:00
63755efb90 Disable git fsmonitor daemon on Windows (#104662)
Looking at the issue on https://github.com/actions/checkout/issues/1018, I suspect that this is same flaky issue failing GHA checkout on Windows https://github.com/pytorch/pytorch/actions/runs/5459471366/jobs/9935736289
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104662
Approved by: https://github.com/clee2000
2023-07-06 00:00:22 +00:00
611febf6cf [quant] Support integer implementations for max_pool2d (#104225)
Summary:
This is needed for representing quantized model in pt2 export quantization flow

Test Plan:
tested by opinfo, python test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104225
Approved by: https://github.com/kimishpatel
2023-07-05 23:54:07 +00:00
a290cbf32b Enable fused foreach Adam compilation (#104121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104121
Approved by: https://github.com/janeyx99
2023-07-05 23:40:03 +00:00
01e6d64dd2 [MPS] Fix unary ops over sparse-mapped tensors (#100765)
If the input tensor is backed by a sparse view, create a dense copy before running the unary op; otherwise the op will be applied to the wrong elements.
Introduce `is_dense_in_storage`, which returns true if the tensor/view is mapped to a dense area in the tensor storage.
Add unit test to validate the fix.

Fixes https://github.com/pytorch/pytorch/issues/98074
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100765
Approved by: https://github.com/albanD
2023-07-05 23:17:43 +00:00
4005152b92 [dynamo] Organize higherorderops variable trackers (#104565)
The main change is moving the higher-order ops from torch.py to higher_order_ops.py and creating smaller subclasses of HigherOrderOp for cond, map, etc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104565
Approved by: https://github.com/zou3519
2023-07-05 22:19:26 +00:00
666aeaa313 Preserve original co_filename when FX symbolic_trace (#103885)
Previously, you'd get `<eval_with_key>.0`; now you get `<eval_with_key>.0 from /data/users/ezyang/b/pytorch/test/dynamo/test_misc.py:5683 in forward`

I used to do this with globals, but now I do it with a `co_fields` parameter that's plumbed around, because putting things in globals has implications(TM). Happy to bikeshed on the `co_fields` structure.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103885
Approved by: https://github.com/albanD
2023-07-05 22:00:05 +00:00
4baac20117 [BE] switch fprintf to fmt::print (#104640)
Testing out the new automated clang-tidy check in master. Code should be faster, more modern, and more efficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104640
Approved by: https://github.com/malfet
2023-07-05 21:11:39 +00:00
f00f1d4cfb add fused support for xpu devices (#104517)
We want to add fused support for xpu devices in the optimizers, so we add 'xpu' to the fused support list.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104517
Approved by: https://github.com/ezyang
2023-07-05 21:07:00 +00:00
wgb
b5c2404116 Expose TorchDispatchUtils Api for Extensions (#104619)
Expose some APIs in TorchDispatchUtils.h for custom devices. These can be used in a DEBUG package.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104619
Approved by: https://github.com/ezyang
2023-07-05 20:21:23 +00:00
5b600dee19 Properly preserve --tracing-mode when isolated minify (#104101)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104101
Approved by: https://github.com/voznesenskym
2023-07-05 20:19:11 +00:00
3dc4adc7a6 Don't build CUDA with debug info by default. (#102617)
Fixes https://github.com/pytorch/pytorch/issues/102594

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102617
Approved by: https://github.com/malfet
2023-07-05 20:16:19 +00:00
0cee4e3c32 Turn translation validation off on timeouts. (#104464)
Follow-up to PR: #97964

After the introduction of translation validation (TV), a few TIMM and TorchBench benchmarks
started failing due to TIMEOUT. This PR turns TV off for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104464
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
40b8d10d5e Re-land: Turn translation validation on for tests and accuracy runs by default. (#104467)
Re-landing: #103611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104467
Approved by: https://github.com/malfet
2023-07-05 19:01:50 +00:00
5ab2b27353 Revert "Re-enable low memory dropout (#103330)"
This reverts commit f32593630bceed0eb51656304841d9f5de09ec7c.

Reverted https://github.com/pytorch/pytorch/pull/103330 on behalf of https://github.com/davidberard98 due to large compilation time regression ([comment](https://github.com/pytorch/pytorch/pull/103330#issuecomment-1622304072))
2023-07-05 19:00:40 +00:00
fb1ad02833 Support bit shifting SymInts (#104318)
Fixes #104228.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104318
Approved by: https://github.com/ezyang
2023-07-05 18:35:57 +00:00
d3ba8901d8 Adding precision issue note docs for functional.interpolate (#104622)
Fixes #104157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104622
Approved by: https://github.com/ezyang
2023-07-05 16:20:57 +00:00
05eaf5ab51 optimized the backward of index_select when dim = 0 on CPU (#102961)
This one targets improving the performance of the backward of `index_select` when dim = 0; the fast path reuses the existing kernel for `scatter_add` when dim = 0.
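
A hedged sketch of the dim = 0 equivalence that makes reusing the `scatter_add` kernel possible (plain PyTorch, not the C++ fast path itself):

```python
import torch

weight = torch.randn(6, 4)
index = torch.tensor([0, 2, 2, 5])
grad_out = torch.randn(4, 4)  # gradient w.r.t. weight.index_select(0, index)

# Reference backward: index_add_ along dim 0.
grad_ref = torch.zeros_like(weight).index_add_(0, index, grad_out)

# The same result expressed via scatter_add_, the kernel the fast path reuses.
expanded = index.unsqueeze(1).expand_as(grad_out).contiguous()
grad_fast = torch.zeros_like(weight).scatter_add_(0, expanded, grad_out)

print(torch.allclose(grad_ref, grad_fast))  # True
```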

The following result is based on weight size [50000, 128], index size [50000]. CPU type: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz, dual sockets, 24 cores per socket.

This patch will bring **7.9x** speedup for the backward of `index_select`.

* before, each index_add takes `8.678ms`
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
autograd::engine::evaluate_function: IndexSelectBack...         0.05%       9.709ms        88.92%       18.095s       9.048ms          2000
                                   IndexSelectBackward0         0.09%      18.823ms        88.83%       18.077s       9.038ms          2000
                            aten::index_select_backward         0.00%     694.000us        88.78%       18.067s       9.033ms          2000
                                       aten::index_add_        85.21%       17.341s        85.29%       17.356s       8.678ms          2000
                                     aten::index_select         5.45%        1.110s         5.59%        1.138s     563.205us          2020
autograd::engine::evaluate_function: torch::autograd...         0.05%       9.443ms         5.15%        1.047s     523.517us          2000
                        torch::autograd::AccumulateGrad         0.12%      24.361ms         5.10%        1.038s     519.034us          2000
                                             aten::add_         4.98%        1.014s         4.98%        1.014s     507.092us          1999
                                        aten::new_zeros        -0.01%   -1189.000us         3.42%     696.845ms     348.423us          2000
                                            aten::zero_         0.14%      28.879ms         3.30%     671.912ms     335.956us          2000
                                            aten::fill_         3.28%     666.919ms         3.28%     666.919ms     333.459us          2000
                                            aten::randn         0.00%      49.000us         0.33%      67.018ms      33.509ms             2
                                          aten::normal_         0.33%      66.778ms         0.33%      66.778ms      33.389ms             2
                                           aten::select         0.13%      25.738ms         0.17%      33.620ms       4.182us          8040
                                            aten::empty         0.10%      19.746ms         0.10%      19.746ms       4.908us          4023
                                        aten::new_empty         0.03%       5.371ms         0.07%      14.532ms       7.266us          2000
                                       aten::as_strided         0.04%       7.945ms         0.04%       7.945ms       0.988us          8040
                                     aten::is_same_size         0.01%       2.228ms         0.01%       2.228ms       1.114us          2000
                                          aten::randint         0.00%      27.000us         0.00%     600.000us     600.000us             1
                                          aten::random_         0.00%     564.000us         0.00%     564.000us     564.000us             1
                                           aten::detach         0.00%       4.000us         0.00%      30.000us      30.000us             1
                                                 detach         0.00%      26.000us         0.00%      26.000us      26.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
```

* after, each index_add takes `1.093ms`

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
autograd::engine::evaluate_function: IndexSelectBack...         0.34%      17.992ms        55.21%        2.897s       1.449ms          2000
                                   IndexSelectBackward0         0.18%       9.661ms        54.86%        2.879s       1.440ms          2000
                            aten::index_select_backward        -0.04%   -2182.000us        54.68%        2.870s       1.435ms          2000
                                       aten::index_add_        41.54%        2.180s        41.67%        2.187s       1.093ms          2000
                                     aten::index_select        23.52%        1.234s        24.15%        1.267s     627.413us          2020
autograd::engine::evaluate_function: torch::autograd...         0.22%      11.786ms        19.22%        1.009s     504.313us          2000
                        torch::autograd::AccumulateGrad         0.43%      22.459ms        19.02%     998.351ms     499.175us          2000
                                             aten::add_        18.59%     975.864ms        18.59%     975.864ms     488.176us          1999
                                        aten::new_zeros        -0.03%   -1351.000us        12.76%     669.825ms     334.913us          2000
                                            aten::zero_         0.58%      30.644ms        12.30%     645.262ms     322.631us          2000
                                            aten::fill_        12.20%     640.196ms        12.20%     640.196ms     320.098us          2000
                                            aten::randn         0.00%      54.000us         1.33%      70.001ms      35.001ms             2
                                          aten::normal_         1.33%      69.745ms         1.33%      69.745ms      34.873ms             2
                                            aten::empty         0.43%      22.406ms         0.43%      22.406ms       5.569us          4023
                                           aten::select         0.30%      15.539ms         0.41%      21.411ms       5.300us          4040
                                        aten::new_empty         0.10%       5.016ms         0.28%      14.731ms       7.365us          2000
                                       aten::as_strided         0.24%      12.460ms         0.24%      12.460ms       2.063us          6040
                                     aten::is_same_size         0.05%       2.675ms         0.05%       2.675ms       1.337us          2000
                                          aten::randint         0.00%      26.000us         0.01%     632.000us     632.000us             1
                                          aten::random_         0.01%     598.000us         0.01%     598.000us     598.000us             1
                                           aten::detach         0.00%       6.000us         0.00%      28.000us      28.000us             1
                                                 detach         0.00%      22.000us         0.00%      22.000us      22.000us             1
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102961
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-07-05 15:32:58 +00:00
3834582327 [ONNX] Add autograd_inlining flag to torch.onnx.export (#104067)
Fixes #88286, Fixes #97160

Repro:

```python
import torch
import io
from torch.utils.checkpoint import checkpoint

class A(torch.nn.Module):
    # A supported module.
    def __init__(self):
        super(A, self).__init__()
        self.l1 = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l1(x)

class B(torch.nn.Module):
    # This module is not exportable to ONNX because it
    # uses gradient-checkpointing. However, its two sub-module's
    # are exportable, so ORTModule should be used to compute them.
    def __init__(self):
        super(B, self).__init__()
        self.l1 = torch.nn.Linear(2, 2)
        self.a = A()

    def forward(self, x):
        def custom():
            def custom_forward(x_):
                return self.a(x_)

            return custom_forward

        z = self.l1(checkpoint(custom(), x))
        return z

torch.onnx.export(
    B(),
    (torch.randn(2, 2),),
    io.BytesIO(),
    autograd_inlining=True
)
```

`torch.onnx.export(autograd_inlining=True)` should repro the user error as this is the original execution path.
```bash
Traceback (most recent call last):
  File "repro88286.py", line 36, in <module>
    torch.onnx.export(
  File "<@beartype(torch.onnx.utils.export) at 0x7f0f011faee0>", line 385, in export
  File "/opt/pytorch/torch/onnx/utils.py", line 511, in export
    _export(
  File "/opt/pytorch/torch/onnx/utils.py", line 1576, in _export
    graph, params_dict, torch_out = _model_to_graph(
  File "<@beartype(torch.onnx.utils._model_to_graph) at 0x7f0f01187dc0>", line 11, in _model_to_graph
  File "/opt/pytorch/torch/onnx/utils.py", line 1130, in _model_to_graph
    graph, params, torch_out, module = _create_jit_graph(model, args)
  File "/opt/pytorch/torch/onnx/utils.py", line 1006, in _create_jit_graph
    graph, torch_out = _trace_and_get_graph_from_model(model, args)
  File "/opt/pytorch/torch/onnx/utils.py", line 910, in _trace_and_get_graph_from_model
    trace_graph, torch_out, inputs_states = torch.jit._get_trace_graph(
  File "/opt/pytorch/torch/jit/_trace.py", line 1269, in _get_trace_graph
    outs = ONNXTracedModule(f, strict, _force_outplace, return_inputs, _return_inputs_states)(*args, **kwargs)
  File "/opt/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/pytorch/torch/jit/_trace.py", line 128, in forward
    graph, out = torch._C._create_graph_by_tracing(
  File "/opt/pytorch/torch/jit/_trace.py", line 119, in wrapper
    outs.append(self.inner(*trace_inputs))
  File "/opt/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/pytorch/torch/nn/modules/module.py", line 1492, in _slow_forward
    result = self.forward(*input, **kwargs)
  File "repro88286.py", line 32, in forward
    z = self.l1(checkpoint(custom(), x))
  File "/opt/pytorch/torch/utils/checkpoint.py", line 412, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/pytorch/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
RuntimeError: _Map_base::at
```
By using `autograd_inlining=False`, the export still fail with a different error because autograd inlining is not enabled:

```bash
Traceback (most recent call last):
  File "repro88286.py", line 36, in <module>
    torch.onnx.export(
  File "<@beartype(torch.onnx.utils.export) at 0x7f6088b32ee0>", line 385, in export
  File "/opt/pytorch/torch/onnx/utils.py", line 511, in export
    _export(
  File "/opt/pytorch/torch/onnx/utils.py", line 1615, in _export
    ) = graph._export_onnx(  # type: ignore[attr-defined]
RuntimeError: ONNX export failed: Couldn't export Python operator CheckpointFunction
```
To allow `CheckpointFunction` into the onnx graph, `operator_export_type=torch.onnx.OperatorExportTypes.ONNX_FALLTHROUGH` flag can be added to `torch.onnx.export`, which would lead to the following ONNX graph:

```bash
Exported graph: graph(%prim::PythonOp_0 : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu),
      %l1.weight : Float(2, 2, strides=[2, 1], requires_grad=1, device=cpu),
      %l1.bias : Float(2, strides=[1], requires_grad=1, device=cpu)):
  %/PythonOp_output_0 : Float(2, 2, strides=[2, 1], requires_grad=0, device=cpu) = ^CheckpointFunction[inplace=0, module="torch.utils.checkpoint", onnx_name="/PythonOp"](<function B.forward.<locals>.custom.<locals>.custom_forward at 0x7fdf9182f670>, True)(%prim::PythonOp_0), scope: __main__.B:: # /opt/pytorch/torch/autograd/function.py:506:0
  %6 : Float(2, 2, strides=[2, 1], requires_grad=1, device=cpu) = onnx::Gemm[alpha=1., beta=1., transB=1, onnx_name="/l1/Gemm"](%/PythonOp_output_0, %l1.weight, %l1.bias), scope: __main__.B::/torch.nn.modules.linear.Linear::l1 # /opt/pytorch/torch/nn/modules/linear.py:114:0
  return (%6)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104067
Approved by: https://github.com/BowenBao, https://github.com/kit1980
2023-07-05 15:27:36 +00:00
c00dd43e43 [pt2] add metas for multilabel_margin_loss ops (#104388)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104388
Approved by: https://github.com/ezyang
2023-07-05 13:42:22 +00:00
a3aa4da154 [pt2] add metas for multi_margin_loss ops (#104236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104236
Approved by: https://github.com/ezyang
2023-07-05 13:40:05 +00:00
ad58aba932 [pt2] add metas for adaptive_max_pool ops (#104167)
Fixes #103892.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104167
Approved by: https://github.com/ezyang
2023-07-05 07:02:07 +00:00
54e320d4d1 Revert "[dynamo] Lazy disable_dynamo API out-of-dynamo (#104317)"
This reverts commit 5c12a810ac2d40ee74098c8adcf9ec7dddd9476e.

Reverted https://github.com/pytorch/pytorch/pull/104317 on behalf of https://github.com/huydhn due to This has been reverted internally by D47166892, so I need to also revert it on OSS to keep them in sync ([comment](https://github.com/pytorch/pytorch/pull/104317#issuecomment-1621099151))
2023-07-05 06:21:48 +00:00
40f53912cf Revert "[dynamo][ac] Remove disable monkeypatching of utils.checkpoint (#104397)"
This reverts commit 537a6c0651edda1e1a55b90658a6c24d049ff982.

Reverted https://github.com/pytorch/pytorch/pull/104397 on behalf of https://github.com/huydhn due to This has been reverted internally by D47216591, so I need to also revert it on OSS to keep them in sync ([comment](https://github.com/pytorch/pytorch/pull/104397#issuecomment-1621086360))
2023-07-05 06:11:08 +00:00
0c8323e4a4 cmake: allow USE_SYSTEM_ZSTD (#104611)
Fixes #44255.

This is part of larger work I'm doing to allow for more `USE_SYSTEM_*` options to allow Nix to have faster re-builds of PyTorch: https://github.com/NixOS/nixpkgs/pull/239291.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104611
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-05 04:47:35 +00:00
ea4d5c4538 [Quant][PT2E] Enable vec code gen for pair of quant/dequant (#104503)
**Summary**
We already support vectorized code generation for the `dequant-relu-quant` pattern, where `to_uint8` is the last node of the quant pattern before the store into memory. However, there is another case, `dequant1-relu-quant2-dequant2-relu-quant3`, where `quant2` is in the middle of the fusion pattern; this PR enables vectorized code generation for `quant2-dequant2`.

**Test Plan**
```
python -u -m pytest -s -v test_cpu_repro.py  -k test_dequant_relu_quant_dequant_relu_quant_lowering
```

**Next Step**
* For better performance, we can add another pass to eliminate pairs of `float_to_uint8` and `uint8_to_float` nodes.
* For better performance, we should annotate `dequant1` and `quant2` as sharing an observer in the quantization recipe. Then we can lower `dequant1-relu-quant2` into a QReLU node to fully eliminate the computation of `dequant1` and `quant2`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104503
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-05 01:59:00 +00:00
12ca224662 Add hacked_twin overloads for _unsafe indexing functions (#104127)
Fixes #104037

This hacky workaround already exists for the normal overloads.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104127
Approved by: https://github.com/ezyang
2023-07-05 01:05:27 +00:00
2385dad4b3 Enable automatic_dynamic_shapes by default (#103623)
Some notes:

* I now manually turn off `_generate` jobs from running with cudagraphs, as it is unrealistic to expect to cudagraph autoregressive generation up to max sequence length, this would imply compiling the entire unrolled sequence generation. Concretely, cm3leon_generate was timing out post this change, likely due to the compile time slowdown of dynamic shapes ON TOP OF accidentally unrolling all the loops
* A few torch._dynamo.reset tactically inserted to force recompiles on tests that expected it
* expectedFailureAutomaticDynamic flip into patching automatic_dynamic_shapes=False

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103623
Approved by: https://github.com/voznesenskym
2023-07-05 00:25:02 +00:00
2abbed42ee correct the generated code and corresponding text to make them consistent (#104596)
Fixes #104500

As discussed in #104500, the [corresponding doc](https://pytorch.org/docs/stable/dynamo/get-started.html#getting-started) for dynamo is inconsistent between the code example and its explanation. I have run the code example to obtain the correct generated code.
![image](https://github.com/pytorch/pytorch/assets/6964699/d11e0f2f-2225-4ba9-8934-b06c9fc78721)
This PR fixes the problem and makes the doc more readable.

cc:
@davidberard98 @ezyang  please help me check this PR, thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104596
Approved by: https://github.com/ezyang
2023-07-04 22:56:03 +00:00
bfd995f0d6 Revert "Specialize storage_offset - Does not cover automatic dynamic (#104204)"
This reverts commit 803c14490b189f9b755ecb9f2a969876088ea243.

Reverted https://github.com/pytorch/pytorch/pull/104204 on behalf of https://github.com/ezyang due to also due to https://github.com/pytorch/pytorch/issues/104563 ([comment](https://github.com/pytorch/pytorch/pull/104204#issuecomment-1620653507))
2023-07-04 19:41:32 +00:00
e8174faa02 cmake: respect USE_SYSTEM_LIBS when USE_NCCL is set (#104511)
Even though `USE_SYSTEM_LIBS` is set to true, we still need to set `USE_SYSTEM_NCCL` for the system NCCL to be used.

This fixes that by adding a conditional `set` similar to what is done for `USE_TBB`: e9ebda29d8/CMakeLists.txt (L426-L428)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104511
Approved by: https://github.com/ezyang
2023-07-04 19:08:50 +00:00
52094a3454 Correct warning message info in fork_rng (#104525)
The warning message in fork_rng was missing the format prefix. This PR adds it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104525
Approved by: https://github.com/Skylion007
2023-07-04 19:08:16 +00:00
5c580a9846 [decomp] Add test tracking core ATen operators (#104262)
This adds an expect-test that finds the set of core ATen operators by
subtracting the operators with decomposition in core_aten_decompositions from the
set of all operators that have decompositions and could be decomposed.

This is useful because if you add a new decomposition but forget to add it to
the list of core decompositions, it will appear in the PR diff.
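
A hedged sketch of the set subtraction the expect-test performs; the variable names are illustrative, and the `torch._decomp` internals may differ slightly from this:

```python
from torch._decomp import core_aten_decompositions, decomposition_table

core = set(core_aten_decompositions().keys())
decomposable = set(decomposition_table.keys())

# Operators that have a decomposition registered but are not (yet) listed in
# the core decomposition set -- the candidates surfaced by the expect-test.
candidates = sorted(str(op) for op in decomposable - core)
print(len(candidates))
```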

Also, by going through this list I have identified some operators where the
functional variant is decomposed, but not the inplace variant which must be an
oversight.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104262
Approved by: https://github.com/lezcano
2023-07-04 16:41:44 +00:00
d62a80adc3 remove ipex backend (#104329)
Move IPEX backend from PyTorch to IPEX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104329
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-04 09:21:27 +00:00
7ae100628e Move most SymPy functions to their own file (#104556)
All of these are standalone implementations of some functions, and they don't depend on anything else, so it's better to have them under the `_sympy/` folder on their own.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104556
Approved by: https://github.com/ezyang
2023-07-04 03:53:48 +00:00
985cb5055c [vision hash update] update the pinned vision hash (#104562)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104562
Approved by: https://github.com/pytorchbot
2023-07-04 03:32:09 +00:00
2a21469a77 [Quant][PT2E] Enable conv2d unary and binary recipe for x86 inductor quantizer (#98826)
**Summary**

- Recipe to annotate `conv2d_relu` for `X86InductorQuantizer` is added.
- Recipe to annotate `conv2d_add` for `X86InductorQuantizer` is added.
- Recipe to annotate `conv2d_add_relu` for `X86InductorQuantizer` is added.

**Test Plan**
```
python -u -m pytest -s -v test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98826
Approved by: https://github.com/jerryzh168
2023-07-04 00:01:10 +00:00
8780bd6a01 [ONNX] Use load_model_from_string (#104533)
Use load_model_from_string instead of load_from_string, since the latter is just an alias: 3645b70caa/onnx/__init__.py (L320)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104533
Approved by: https://github.com/BowenBao
2023-07-03 23:00:57 +00:00
07c60d11b3 replace AT_ERROR(...) with TORCH_CHECK(false, ...) (#104534)
A merely cosmetic change for `AT_ERROR` usages I found by chance, following e9d2d74f0a/c10/util/Exception.h (L622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104534
Approved by: https://github.com/soulitzer
2023-07-03 22:43:19 +00:00
709c9b5c93 Fix tabulate import error (#104468)
### Description

This PR fixes issue #104166 by re-raising the exception.

### Context

The `tabulate` package needs to be installed with `pip install tabulate` before calling `tabulate(...)`.
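
A minimal sketch of the re-raising pattern described above; the function name and message wording are illustrative, not the exact profiler code:

```python
def _render_table(rows, headers):
    try:
        from tabulate import tabulate
    except ImportError as exc:
        # Re-raise with an actionable message instead of failing opaquely.
        raise ImportError(
            "tabulate is required to render this table; "
            "install it with `pip install tabulate`."
        ) from exc
    return tabulate(rows, headers=headers)
```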
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104468
Approved by: https://github.com/Skylion007, https://github.com/BowenBao
2023-07-03 21:55:53 +00:00
d7b5cd7d0b Fix mH() to mH in Python examples (#104532)
`mH()` results in
TypeError: 'Tensor' object is not callable.
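
For reference, `mH` is a property, so it is accessed without parentheses:

```python
import torch

A = torch.randn(2, 3, dtype=torch.complex64)
assert torch.equal(A.mH, A.conj().transpose(-2, -1))  # A.mH, not A.mH()
```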

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104532
Approved by: https://github.com/lezcano
2023-07-03 21:35:47 +00:00
e9d2d74f0a [inductor] Add prims._inductor_bucketize and add lowerings (#104007)
**TL;DR**: This PR is a first step in adding lowerings for torch.bucketize. It adds an initial lowering for this op - but because this  implementation is not currently efficient, it registers the lowering for prims._inductor_bucketize. After we make the implementation more efficient, we'll remove prims._inductor_bucketize and add the lowering directly to torch.bucketize.

**Background - torch.bucketize**: torch.bucketize(values, boundaries, right=False): for an arbitrary tensor of values and a non-decreasing 1D tensor of boundaries that define buckets, it returns the index of the bucket that each of the values will fall in. e.g. for values [0, 1, 2, 3, 4] and boundaries [1, 3], it will return [0, 0, 1, 1, 2].
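
For concreteness, the example from the paragraph above as runnable code:

```python
import torch

values = torch.tensor([0.0, 1.0, 2.0, 3.0, 4.0])
boundaries = torch.tensor([1.0, 3.0])
print(torch.bucketize(values, boundaries))              # tensor([0, 0, 1, 1, 2])
print(torch.bucketize(values, boundaries, right=True))  # tensor([0, 1, 1, 2, 2])
```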

**Implementation**: This PR adds a new inductor op called "bucketize". In this PR it only has a triton implementation - for CPU it is a fallback. The triton implementation uses a binary search in `triton_helpers.py`. This PR also adds a new prim `_inductor_bucketize()` for testing purposes and adds lowering for this op.

~~**"right"**: The current behavior of the "right" kwarg in the inductor op is the opposite of the behavior of the torch op. "right" controls how the op treats a value that is equal to one of the boundary values. In the torch op, "right=True" means "if a value is equal to a boundary value, then put it in the bucket to the right". In the inductor op, "right=True" means "the right boundary of a bucket is closed". These are opposite. **I'm open to switching the behavior of the inductor op** - but I chose to implement this way because I think it makes more sense, and I think the torch.bucketize behavior may have been a mistake (it's the opposite of numpy.digitize).~~ Switched the behavior of the inductor bucketize op to match the torch op

* places where "right" means "if a value is equal to a boundary value, then put it in the bucket to the right" (i.e. current torch.bucketize behavior)
  + current torch.bucketize behavior
  + table in [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html)
* places where "right" means "the right boundary of a bucket is closed":
  + the text description of [torch.bucketize docs](https://pytorch.org/docs/stable/generated/torch.bucketize.html) (observed in #91580)
  + [numpy.digitize](https://numpy.org/doc/stable/reference/generated/numpy.digitize.html) (which is basically the same op)

**Performance**: Benchmark script: "values" as a [16, 1024, 1024] float32 tensor and "boundaries" as a [1025] tensor (i.e. defining 1024 buckets).

As is:
```
Eager 0.30117499828338623 ms
PT2   0.9298200011253357 ms
```

But performance improves significantly if we add an additional pointwise autotuning config (WIP in #104456):
```
Eager 0.3015420138835907 ms
PT2   0.23028500378131866 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104007
Approved by: https://github.com/jansel
2023-07-03 16:52:38 +00:00
0ac2666d72 Advance docker builds to cuda 11.8 (#104528)
Advance docker builds to cuda 11.8
This should fix Docker build nightly failure: https://hud.pytorch.org/hud/pytorch/pytorch/nightly/1?per_page=50&name_filter=Docker

The following docker image no longer exists: ``nvidia/cuda:11.7.0-cudnn8-devel-ubuntu20.04``
Hence advancing the build to ``nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104528
Approved by: https://github.com/DanilBaibak
2023-07-03 16:44:26 +00:00
d6b1f12846 Add onnx to common_methods_invocations.py approvers (#104530)
Add onnx to common_methods_invocations.py approvers so that the torch.onnx group can contribute OpInfos and add skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104530
Approved by: https://github.com/kit1980
2023-07-03 16:43:22 +00:00
437bc5b1b7 sparse_mask: backward support for sparse lhs (take 2) (#104341)
This is a copy of https://github.com/pytorch/pytorch/pull/95165 with some bug fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104341
Approved by: https://github.com/albanD, https://github.com/pearu, https://github.com/amjames
2023-07-03 14:12:44 +00:00
f353d17755 Revert "[ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)"
This reverts commit ef7bc3e23d128b92e7826342e879438d844f7312.

Reverted https://github.com/pytorch/pytorch/pull/104425 on behalf of https://github.com/huydhn due to Sorry for reverting your PR.  It is failing CUDA test in trunk built in debug mode https://github.com/pytorch/pytorch/actions/runs/5429187622/jobs/9874360641 ([comment](https://github.com/pytorch/pytorch/pull/104425#issuecomment-1617247699))
2023-07-03 04:18:04 +00:00
9f7ad25c98 [PyTorch][Dispatcher] Fix destruction order fiasco crash (#104393)
Summary:
The current implementation of `Dispatcher` returns an RAII object
from it's `register*` methods which, on destruction, uses a saved
reference to the `Dispatcher` to call the associated `deregister*`
method.

However, nothing guarantees that the `Dispatcher` is destroyed
*after* all RAII objects have been destroyed and, in practice, we
see segfaults caused when a global `Dispatcher` is cleaned up
before RAII globals.

This diff fixes this by keeping the `Dispatcher` lock and "alive" marker
in a `std::shared_ptr` which the callbacks copy and then use to
verify that the `Dispatcher` is still alive before continuing.

https://fb.workplace.com/groups/1405155842844877/posts/7143161099044294/
https://fb.workplace.com/groups/python.builds/posts/3510588832595867/
S349108

Test Plan: CI

Differential Revision: D47113122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104393
Approved by: https://github.com/ezyang
2023-07-03 00:17:42 +00:00
707d265db2 [Inductor][Quant]Refactor load and store vectorization code generation with uint8 data type (#104075)
**Summary**
Refactor the vectorized code generation for the uint8 input data type. Previously, we combined the uint8 data load and the uint8-to-float conversion into one step via `load_uint8_as_float` and `store_float_as_uint8`. After the refactor, we split them into two steps (load/store plus data type conversion) to match the behavior of the BFloat16 data type.

The previous generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(i0));
    auto tmp1 = (tmp0);
    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
    auto tmp3 = tmp1 - tmp2;
    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
    auto tmp5 = tmp3 * tmp4;
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6 * tmp2;
    auto tmp8 = tmp7.round();
    auto tmp9 = tmp8 + tmp2;
    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
    auto tmp11 = at::vec::maximum(tmp9, tmp10);
    auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
    auto tmp13 = at::vec::minimum(tmp11, tmp12);
    auto tmp14 = (tmp13);
    at::vec::store_float_as_uint8(tmp14, out_ptr0 + static_cast<long>(i0));
}
```

After this PR, the generated code is:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(432L); i0+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<uint8_t>::loadu(in_ptr0 + static_cast<long>(i0), 16);
    auto tmp1 = cvt_uint8_to_fp32_with_same_elem_num(tmp0);
    auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100.0));
    auto tmp3 = tmp1 - tmp2;
    auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
    auto tmp5 = tmp3 * tmp4;
    auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
    auto tmp7 = tmp6 * tmp2;
    auto tmp8 = tmp7.round();
    auto tmp9 = tmp8 + tmp2;
    auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(0.0));
    auto tmp11 = at::vec::maximum(tmp9, tmp10);
    auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(255.0));
    auto tmp13 = at::vec::minimum(tmp11, tmp12);
    auto tmp14 = cvt_fp32_to_uint8(tmp13);
    tmp14.store(out_ptr0 + static_cast<long>(i0), 16);
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104075
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-07-01 23:12:43 +00:00
fcb53c1394 Revert "[6/n][FSDP] Update _sharded_pre_load_state_dict_hook to use DTensor when use_dtensor=True in ShardedStateDictConfig (#104087)"
This reverts commit 49af83cf442ef569c8eb4f5a29f46a65abc0e2d2.

Reverted https://github.com/pytorch/pytorch/pull/104087 on behalf of https://github.com/huydhn due to This is failing in trunk 49af83cf44, probably due to a land race ([comment](https://github.com/pytorch/pytorch/pull/104087#issuecomment-1615608189))
2023-07-01 07:50:31 +00:00
bd0f0f40a1 [PT2][Quant] Enable symbolic shape in linear quantization (#104473)
When tracing with symbolic shapes, arbitrary sym_size nodes can appear in the
graph. Earlier changes did not account for this, and the quantizer failed to annotate
the right nodes. This diff fixes that by not annotating sym_size nodes, which
should really not be relevant for quantization.
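
A hedged sketch of the skip; `annotate_node` stands in for the quantizer's real annotation helper, and the exact target check may differ:

```python
import torch

def annotate_graph(graph_module, annotate_node):
    for node in graph_module.graph.nodes:
        # Symbolic shape queries are not quantizable values; leave them alone.
        if node.op == "call_function" and node.target in (
            torch.ops.aten.sym_size,
            torch.ops.aten.sym_size.int,
        ):
            continue
        annotate_node(node)
```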

As next steps, we should a) validate in the quant workflow that sym_int nodes are not being quantized, and b) add support similar to this diff for generic annotations.

Differential Revision: [D47132050](https://our.internmc.facebook.com/intern/diff/D47132050/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104473
Approved by: https://github.com/jerryzh168
2023-07-01 05:14:30 +00:00
4e27e6c160 [vision hash update] update the pinned vision hash (#104490)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104490
Approved by: https://github.com/pytorchbot
2023-07-01 03:34:21 +00:00
004ff536e8 [ROCm] Fix circular recursion issue in hipification (#104085)
This PR fixes the circular-include issue during the hipification process by introducing current_state to track whether a file has been processed for hipification (iterative DFS).
The issue arises when two header files include each other, which leads to circular recursion or an infinite loop.
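
A hedged sketch of the iterative traversal; `includes_of` and `hipify_file` are hypothetical placeholders for the real hipify helpers:

```python
def hipify_tree(entry_file, includes_of, hipify_file):
    visited = set()
    stack = [entry_file]
    while stack:
        path = stack.pop()
        if path in visited:
            continue  # already handled; breaks A -> B -> A include cycles
        visited.add(path)
        for included in includes_of(path):
            if included not in visited:
                stack.append(included)
        hipify_file(path)
```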

Fixes the related issues such as :
https://github.com/pytorch/pytorch/issues/93827
https://github.com/ROCmSoftwarePlatform/hipify_torch/issues/39

Error log:
```
  File "/opt/conda/lib/python3.8/posixpath.py", line 471, in relpath
    start_list = [x for x in abspath(start).split(sep) if x]
  File "/opt/conda/lib/python3.8/posixpath.py", line 375, in abspath
    if not isabs(path):
  File "/opt/conda/lib/python3.8/posixpath.py", line 63, in isabs
    sep = _get_sep(s)
  File "/opt/conda/lib/python3.8/posixpath.py", line 42, in _get_sep
    if isinstance(path, bytes):
RecursionError: maximum recursion depth exceeded while calling a Python object
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104085
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-07-01 03:25:51 +00:00
e865bc7da4 add SM80OrLater checks to bfloat16 torchinductor tests (#104436)
Fixes #103993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104436
Approved by: https://github.com/kit1980
2023-07-01 03:15:46 +00:00
b3e60ee052 Fix broken torch._inductor.config import (#104477)
This fixes the bug in profiler code, exposed by https://github.com/pytorch/pytorch/pull/104368, that relied on the fact that `import torch._dynamo` also imports `torch._inductor.config`:
```
$ python -c "import torch._inductor;print(torch._inductor.config)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: module 'torch._inductor' has no attribute 'config'
(base) $ python -c "import torch._dynamo;print(torch._inductor.config)"
<module 'torch._inductor.config' from '/home/nshulga/git/pytorch/pytorch/torch/_inductor/config.py'>
```

### Testing
D47159397

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104477
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-07-01 02:23:44 +00:00
d6f1827181 [Inductor][Quant] Add UT to combine dynamo export and inductor constant folding (#104245)
**Summary**
As we reported in [Issue-103582](https://github.com/pytorch/pytorch/issues/103582), `constant folding` previously failed to work after `dynamo export`. With the latest PyTorch main branch, we can't reproduce this error. This PR adds a UT to cover this use case.

**Test Plan**
```
python -m pytest test_inductor_freezing.py -k test_functional_constant_folding_after_dynamo_export
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104245
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-07-01 01:50:50 +00:00
b1c31b1d26 [pt2] metas and SymInt support for max_pool ops (#103951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103951
Approved by: https://github.com/Chillee, https://github.com/kulinseth
2023-07-01 01:33:35 +00:00
c4a6f86062 [pt2] add metas for max_unpool2d and max_unpool3d (#103821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103821
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2023-07-01 01:33:35 +00:00
f9aa004d39 [ONNX][TypePromo] Materialize type promotion rules (#104063)
This PR adds the materialized type promotion rule set and the type promotion rule infrastructure
that will be consumed by ONNX exporter. The script that generates the rule set is included, and
a local unittest is added to check the validity of the materialized rule set.

Full design doc and discussion at https://microsoft-my.sharepoint.com/:w:/p/bowbao/Edj2lF1oi0JIitT_3ntyuqkBo6ll7N6NJDmavM0lp_KkEA?e=OElyjR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104063
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-07-01 01:20:08 +00:00
828b275740 [exportdb] Setup website (#104288)
<img width="1109" alt="image" src="https://github.com/pytorch/pytorch/assets/10901756/e67ff8a9-adb1-466f-8285-fb7d3653d139">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104288
Approved by: https://github.com/zhxchen17
2023-07-01 01:03:56 +00:00
49af83cf44 [6/n][FSDP] Update _sharded_pre_load_state_dict_hook to use DTensor when use_dtensor=True in ShardedStateDictConfig (#104087)
This allows us to use use_dtensor=True for ShardedStateDictConfig() before calling model.load_state_dict(). It only works for offload_to_cpu=False for now.

Next PR will make use_dtensor=True work with offload_to_cpu=True for load_state_dict().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104087
Approved by: https://github.com/fegin
2023-07-01 01:02:59 +00:00
1de1bea60d Back out "[Inductor][FX passes] Remove config.split_cat_fx_passes & A… (#104370)
…dd config.experimental_patterns" (#104362)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104362

revert D46752606 to unblock pyper release. This diff introduced a package incompatibility between ads_dper3 and training_platform.

Test Plan: tests pass

Reviewed By: yanboliang

Differential Revision: D47100297

fbshipit-source-id: 24a0bc149f0f9165b5ffcca80e669e917d6dd4c2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104370
Approved by: https://github.com/yanboliang
2023-07-01 00:44:46 +00:00
9626604bdd [inductor] Fix squeeze normalization pattern (#104434)
Fixes #103875

In the test sample, this pass would turn:
```
%squeeze : [num_users=1] = call_method[target=squeeze](args = (%l_x_, 1, 2), kwargs = {})
```
into
```
%squeeze_1 : [num_users=1] = call_function[target=torch.squeeze](args = (%l_x_, 1), kwargs = {})
```
which is clearly wrong as we've dropped the second squeeze dim.

Instead, this PR now normalizes to
```
%squeeze_1 : [num_users=1] = call_function[target=torch.squeeze](args = (%l_x_, (1, 2)), kwargs = {})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104434
Approved by: https://github.com/jansel
2023-07-01 00:28:02 +00:00
d982fdb5d5 [FSDP] Rework meta device init (#104189)
This addresses https://github.com/pytorch/pytorch/issues/104187.

After this PR, the contract with the user is that:
- If passing `param_init_fn=None`, each `nn.Module.reset_parameters()` should only initialize its own parameters/buffers (like `parameters(recurse=False)`/`buffers(recurse=False)`).
- If passing `param_init_fn` not equal to `None`, then similarly, one call to `param_init_fn(module)` should only initialize `module`'s own parameters/buffers.

With this contract and this PR's changes, meta device initialization through either `reset_parameters()` or `param_init_fn` should be correct. Those functions will run on the original parameter/buffer shapes allowing for correct shape-dependent computations like for fan-in/fan-out, and there will not be any re-initialization of any module.
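
A hedged sketch of a module that follows the contract above (only its own parameters are initialized; the child `nn.Linear` keeps its own `reset_parameters()`):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.empty(16))
        self.proj = nn.Linear(16, 16)  # initialized by its own reset_parameters()

    def reset_parameters(self) -> None:
        # Touch only this module's own parameters (parameters(recurse=False)).
        nn.init.ones_(self.scale)
```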

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104189
Approved by: https://github.com/rohan-varma
2023-07-01 00:25:12 +00:00
93f5a82e37 Add detailed requirement of compiler in README.md (#103819)
As discussed in this [issue](https://github.com/pytorch/pytorch/issues/102258), a compiler that fully supports C++17 is required; otherwise, some operators will have precision problems on aarch64. Therefore, it will be more user-friendly to specify the GCC version, especially for aarch64.

Fixes #102258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103819
Approved by: https://github.com/kit1980
2023-06-30 22:59:50 +00:00
3ff111a4b4 doc: fix fake_quantize_per_tensor_affine docs (#104453)
Fixes #82800

Fixes the wrong `fake_quantize_per_tensor_affine` example and the wrong `fake_quantize_per_tensor_affine` formula in the docs.
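
A hedged reference of the semantics the corrected docs describe (quantize, clamp, dequantize); this is my reading of the op, shown alongside the real op for comparison:

```python
import torch

def fake_quant_ref(x, scale, zero_point, quant_min, quant_max):
    q = torch.round(x / scale + zero_point).clamp(quant_min, quant_max)
    return (q - zero_point) * scale

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
print(fake_quant_ref(x, 0.1, 0, 0, 255))
print(torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255))
```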

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104453
Approved by: https://github.com/kit1980
2023-06-30 22:59:00 +00:00
a5ca445f79 Check for corrupted ivalues. (#104243)
Hi! We've been fuzzing the torchvision project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz).
We've found a SEGV error at address 0x0 at `vector.h:163` in the pytorch third-party project flatbuffers.

The error occurs because the `ivalues` field of the flatbuffer module can be null, so the corresponding check must be inserted.

torchvision version: 9d0a93eee90bf7c401b74ebf9c8be80346254f15

pytorch version: 0f1621df1a0a73956c7ce4e2f72f069e610e0137

OS: Ubuntu 20.04

How to reproduce

1. Build docker from [here](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/torchvision) and run the container:

        sudo docker build -t oss-sydr-fuzz-torchvision .
        sudo docker run --privileged --rm -v `pwd`:/fuzz -it oss-sydr-fuzz-torchvision /bin/bash

2. Run the target on this input:
[malformed-module.txt](https://github.com/pytorch/pytorch/files/11879653/malformed-module.txt)

        /encode_png_fuzz malformed-module.txt

3. You will see the following output:

        AddressSanitizer:DEADLYSIGNAL
        =================================================================
        ==1154==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x00000d17cc61 bp 0x7ffcbe8637f0 sp 0x7ffcbe863660 T0)
        ==1154==The signal is caused by a READ memory access.
        ==1154==Hint: address points to the zero page.
            #0 0xd17cc61 in flatbuffers::Vector<flatbuffers::Offset<torch::jit::mobile::serialization::IValue> >::size() const /pytorch/third_party/flatbuffers/include/flatbuffers/vector.h:163:48
            #1 0xd17cc61 in torch::jit::(anonymous namespace)::FlatbufferLoader::parseModule(torch::jit::mobile::serialization::Module*) /pytorch/torch/csrc/jit/mobile/flatbuffer_loader.cpp:293:32
            #2 0xd17dd23 in torch::jit::parse_and_initialize_mobile_module_for_jit(void*, unsigned long, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, std::vector<c10::IValue, std::allocator<c10::IValue> >&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >*) /pytorch/torch/csrc/jit/mobile/flatbuffer_loader.cpp:809:29
            #3 0xdd661b4 in torch::jit::parse_and_initialize_jit_module(std::shared_ptr<char>, unsigned long, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, c10::optional<c10::Device>) /pytorch/torch/csrc/jit/serialization/import.cpp:345:28
            #4 0xdd6b24a in torch::jit::_load_jit_module_from_bytes(std::shared_ptr<char>, unsigned long, std::shared_ptr<torch::jit::CompilationUnit>, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:547:14
            #5 0xdd6c6df in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:443:10
            #6 0xdd6c1c7 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:421:10
            #7 0xdd6dce4 in torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:503:10
            #8 0xf2d3f75 in torch::serialize::InputArchive::load_from(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) /pytorch/torch/csrc/api/src/serialize/input-archive.cpp:97:13
            #9 0x60509c in void torch::load<at::Tensor, char*&>(at::Tensor&, char*&) /pytorch/torch/include/torch/csrc/api/include/torch/serialize.h:107:11
            #10 0x6036be in LLVMFuzzerTestOneInput /vision/encode_png.cc:38:5
            #11 0x66b041 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
            #12 0x6544cc in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
            #13 0x65a61b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
            #14 0x654222 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
            #15 0x7f0c87b9c082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
            #16 0x542cdd in _start (/encode_png_fuzz+0x542cdd)

        AddressSanitizer can not provide additional info.
        SUMMARY: AddressSanitizer: SEGV /pytorch/third_party/flatbuffers/include/flatbuffers/vector.h:163:48 in flatbuffers::Vector<flatbuffers::Offset<torch::jit::mobile::serialization::IValue> >::size() const
        ==1154==ABORTING

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104243
Approved by: https://github.com/kit1980
2023-06-30 22:53:49 +00:00
f20fe674f9 [easy][cuda] Removed the warp size hardcode on layer norm backward CUDA kernel (#104441)
Just a nit fix: the `GammaBetaBackwardCUDAKernel_32x32` kernel used a hardcoded warp size for the reduction and laneId calculation. Changed this to use `C10_WARP_SIZE`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104441
Approved by: https://github.com/malfet
2023-06-30 22:49:13 +00:00
8958f041be Revert "Add forward mode AD to out-place foreach functions (#102409)"
This reverts commit e2ec0ba404f9fbd3c215cad4cabd7383c692cb33.

Reverted https://github.com/pytorch/pytorch/pull/102409 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but it is failing some tests in trunk e799f565eb ([comment](https://github.com/pytorch/pytorch/pull/102409#issuecomment-1615254393))
2023-06-30 22:46:57 +00:00
c178257b40 Don't limit fusions with foreach scheduler nodes (#104471)
Ignore the config fusion limit for foreach nodes, since they have their own fusion limits and will be split automatically. With the config limit applied, epilogue copies stop being fused whenever the foreach lists contain more than 64 tensors (very bad), which creates a ton of extra allocations. With this change, fusions with the subkernels still respect the limit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104471
Approved by: https://github.com/jansel
2023-06-30 21:59:18 +00:00
ef7bc3e23d [ROCm] reduce tolerance for triangular solve with well_conditioned set to True (#104425)
The current test case produces an edge-case tensor input that causes a single generated tensor to fail the tolerance assertion, on ROCm only and only for float32. We have reviewed the logic with our libraries team and discovered that the discrepancy is due to a difference in the order of operations on AMD GPUs. They came back with "working as intended" and found no perceivable bug. Interestingly, if we change the values in ks, ns, or bs, the test passes on ROCm. These particular sizes in this particular order generate a single problematic input that fails the tolerance check by ~0.07. Again, this is not a bug, just a difference in implementation. This PR loosens the tolerance for ROCm only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104425
Approved by: https://github.com/jeffdaily, https://github.com/nikitaved, https://github.com/lezcano
2023-06-30 21:43:42 +00:00
6929e9e947 Use int64_t accordingly in cunn_SoftMaxBackward to avoid int overflow (#104270)
Fixes #103501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104270
Approved by: https://github.com/malfet, https://github.com/mikaylagawarecki
2023-06-30 21:39:46 +00:00
4de1ee6ba4 Revert "Value range refinement using multi-variate expressions. (#97964)"
This reverts commit 26424122076c880694f3fe39ad21860bddb9b475.

Reverted https://github.com/pytorch/pytorch/pull/97964 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it is breaking an internal test ([comment](https://github.com/pytorch/pytorch/pull/97964#issuecomment-1615194524))
2023-06-30 21:08:05 +00:00
7acc4a2e86 add generic func to get function defined in custom device module (#99048)
Fixes #ISSUE_NUMBER
Currently, for custom devices, we use `getattr` and `setattr` to call functions defined in the custom device module in several places, such as `AMP`, `random`, `DDP`, and so on. So I want to add a generic function to fetch these functions in a friendlier way; could you take a look? @bdhirsh @albanD
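
A hypothetical sketch of the lookup idea described above (the helper name and signature here are illustrative, not the API added by this PR):

```python
import torch

def get_custom_device_func(device_type: str, func_name: str):
    # Hypothetical helper: look up the module registered for a custom
    # (privateuse1) backend and fetch a function from it, instead of
    # scattering getattr calls across AMP / random / DDP code paths.
    device_module = getattr(torch, device_type)
    return getattr(device_module, func_name)

# e.g. get_custom_device_func("cuda", "current_device")() on a CUDA build
```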

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99048
Approved by: https://github.com/bdhirsh
2023-06-30 20:02:44 +00:00
b5980c0b86 [PyTorch Vulkan] add Vulkan support for aten::masked_fill (#104444)
Summary:
Implemented `aten::masked_fill` for Vulkan backend, see https://pytorch.org/docs/stable/generated/torch.Tensor.masked_fill.html for the behavior of this operator.

Some explanation of the implementation:
- The shapes of the input tensor and mask should be broadcastable (see [broadcasting semantics](https://pytorch.org/docs/stable/notes/broadcasting.html)). For example, the input tensor is of shape [3, 1, 5] and mask of shape [2, 1]. Then the output is of shape [3, 2, 5].
- A straightforward implementation is to generate an output and a mask, both of shape [3, 2, 5], by applying `repeat` operations on the input tensor and mask respectively. Then we traverse the mask and fill elements of output with `value` where mask is `True`.
- However, the `repeat` operation on the mask is unnecessary and incurs extra time and space overhead. Instead, we keep the mask as is, traverse the original mask, and compute the corresponding broadcasted positions in the output tensor (see the shader file `masked_fill.glsl` for this computation).
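
For reference, a minimal CPU-side sketch of the broadcasting behavior described above (shapes taken from the example; this is not the Vulkan code path):

```python
import torch

x = torch.randn(3, 1, 5)                    # input tensor
mask = torch.zeros(2, 1, dtype=torch.bool)  # mask, broadcastable against x
mask[0, 0] = True

out = x.masked_fill(mask, 0.0)
print(out.shape)  # torch.Size([3, 2, 5]) -- broadcasted output shape
```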

Some explanation of the test:
- We test all possible broadcasting of the input tensor and mask. Manually setting all possible broadcastable shapes is intimidating. Instead we apply the following algorithm to automatically generate all possible cases which only requires one input of the shapes of the input tensor and mask.
  - First we set an identical shape for the `input_shape` and `mask_shape`, e.g. both are of [3, 5, 2, 3].
  - Then we truncate all possible proceeding dimensions of `input_shape` and `mask_shape` respectively. Denote the results as `curr_input_shape` and `curr_mask_shape`, e.g. `curr_input_shape = [5, 2, 3]` and `curr_mask_shape = [2, 3]`.
  - Next, for both `curr_input_shape` and `curr_mask_shape` we generate all possible subsets of the indices and set the corresponding elements to 1 for each subset. For example, for `curr_input_shape = [5, 2, 3]`, a possible `input_idx_subset = [0, 2]`. We set the 0th and 2nd elements of `curr_input_shape` to be 1, then `curr_input_shape = [1, 2, 1]`. Similarly for `curr_mask_shape = [2, 3]`, a possible `mask_idx_subset = [0]`, then the updated `curr_mask_shape = [1, 3]`.
  - In the end, we test `masked_fill` with the combinations of `curr_input_shape` and `curr_mask_shape`. In the example above, an output tensor of shape [1, 2, 3] will be generated.
  - In `vulkan_api_test.cpp`, a function `gen_all_subsets` is implemented to generate all possible subsets of a given set of indices through backtracking.
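
A small Python sketch of the subset enumeration described in the last bullet (the real `gen_all_subsets` helper lives in `vulkan_api_test.cpp` and is written in C++; this is only an illustration of the backtracking idea):

```python
def gen_all_subsets(num_indices):
    # Enumerate all subsets of {0, ..., num_indices - 1} via backtracking.
    subsets = []

    def backtrack(start, current):
        subsets.append(list(current))
        for i in range(start, num_indices):
            current.append(i)
            backtrack(i + 1, current)
            current.pop()

    backtrack(0, [])
    return subsets

print(gen_all_subsets(3))  # 2**3 = 8 subsets, including the empty set
```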

Test Plan:
Full test result is shown in P777851326. `masked_fill` related tests are shown below.

```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*mask*"
Building: finished in 0.1 sec (100%) 264/2820 jobs, 0/2820 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *mask*
[==========] Running 5 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 5 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.masked_fill_invalidinputs_exceptions
[       OK ] VulkanAPITest.masked_fill_invalidinputs_exceptions (35 ms)
[ RUN      ] VulkanAPITest.masked_fill_scalar_mult4ch
[       OK ] VulkanAPITest.masked_fill_scalar_mult4ch (582 ms)
[ RUN      ] VulkanAPITest.masked_fill_scalar_nonmult4ch
[       OK ] VulkanAPITest.masked_fill_scalar_nonmult4ch (592 ms)
[ RUN      ] VulkanAPITest.masked_fill_tensor_mult4ch
[       OK ] VulkanAPITest.masked_fill_tensor_mult4ch (0 ms)
[ RUN      ] VulkanAPITest.masked_fill_tensor_nonmult4ch
[       OK ] VulkanAPITest.masked_fill_tensor_nonmult4ch (0 ms)
[----------] 5 tests from VulkanAPITest (1212 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test suite ran. (1212 ms total)
[  PASSED  ] 5 tests.
```

Reviewed By: SS-JIA

Differential Revision: D46423811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104444
Approved by: https://github.com/SS-JIA
2023-06-30 19:57:07 +00:00
d901dd94cb [logging] add custom format option to logging artifacts (#104443)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104443
Approved by: https://github.com/mlazos
2023-06-30 19:54:14 +00:00
53919d4bf8 add named tensor support for custom device (#104401)
Fixes #ISSUE_NUMBER
1. For custom devices (privateuse1 backend), we also want to support named tensors, so I optimized the check and added a test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104401
Approved by: https://github.com/mikaylagawarecki
2023-06-30 19:40:12 +00:00
28720ad585 Fix argmax and argmin clamp value on MPS (#104374)
Replace the `LLONG_MAX` clamp value with the largest integer value that can be stored in a double. `constantWithScalar` takes a `double` value as input, and `LLONG_MAX` does not fit in a double, resulting in failures on x86.
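
A quick plain-Python illustration (not the MPS code) of why `LLONG_MAX` cannot be represented exactly as a double:

```python
LLONG_MAX = 2**63 - 1               # 9223372036854775807
as_double = float(LLONG_MAX)        # doubles only have 53 bits of mantissa, so this rounds up
print(as_double)                    # 9.223372036854776e+18 (i.e. 2**63)
print(int(as_double) > LLONG_MAX)   # True: the rounded value no longer fits in a signed 64-bit int
```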

Fixes https://github.com/pytorch/pytorch/issues/98191, https://github.com/pytorch/pytorch/issues/92311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104374
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-06-30 18:11:49 +00:00
36c4dad197 [ET][XNNPACK] Add support for quantized LeakyReLU (#104309)
Summary: Also adds support for backend_config

Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`

Reviewed By: mcr229

Differential Revision: D47043207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104309
Approved by: https://github.com/salilsdesai, https://github.com/manuelcandales
2023-06-30 17:42:22 +00:00
ddd7da7546 Enable more tests (#104437)
Remove `test_segment_reductions` from the list of blocklisted tests. Remove the `@onlyCPU` qualifier from test_segment_reductions as it has CUDA-specific parts.

Fixes https://github.com/pytorch/pytorch/issues/104410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104437
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-06-30 16:26:11 +00:00
032ea6a61e [ONNX] Create stand alone diagnostic rule on nearest match finding in dispatcher (#104267)
Change the diagnostic call in nearest match finding from UnsupportedNodeAnalysis to its own guarding rule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104267
Approved by: https://github.com/thiagocrepaldi, https://github.com/BowenBao
2023-06-30 16:21:08 +00:00
a2a8b4d415 Revert "Turn translation validation on for tests and accuracy runs by default. (#103611)"
This reverts commit e311bed2a8e014f0ccf6fdc3fce11884982ac930.

Reverted https://github.com/pytorch/pytorch/pull/103611 on behalf of https://github.com/malfet due to Broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/103611#issuecomment-1614850276))
2023-06-30 15:54:18 +00:00
b1e4378b05 Migrate jobs from windows.8xlarge.nvidia.gpu to nonephemeral (#104404)
This is yet another step to move windows instances away from ephemeral instances, more details on #101209

Queue times are very high recently for this instance type, migrating away from ephemeral instances will provide a big relief for developers. Even if flakiness is introduced, the overall time-to-signal will be smaller given 20h queue times peaks we've been experiencing.

![Screenshot 2023-06-29 at 12 57 48](https://github.com/pytorch/pytorch/assets/4520845/d2ae7912-1043-431b-a081-d7476f9fd443)

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at cde9c95</samp>

This pull request updates several GitHub Actions workflow files and a template file to use non-ephemeral CUDA GPU runners for Windows binary build jobs. This improves the performance and stability of these jobs and makes the job names more consistent.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at cde9c95</samp>

> _`runs-on` changes_
> _CUDA jobs need `nonephemeral`_
> _faster winter builds_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104404
Approved by: https://github.com/malfet
2023-06-30 15:45:38 +00:00
624d20c3de kill inductor.config.disable_cpp_codegen in internal (#104351)
Summary:
This diff adds a path in inductor to invoke gcc through Remote Execution, when run from within fbcode.

This should (hopefully) let us kill the `inductor.disable_cpp_codegen` flag, since we should now be able to invoke clang at runtime from within fbcode to compile c++ code. This was preventing https://github.com/pytorch/pytorch/pull/100115 from landing, which fixed one of the last remaining models in torchbench that was failing with `torch.compile` (hf_Longformer).

Enumeration of changes:

- updated inductor to invoke `_run_build_command()` when in fbcode, which hooks into Remote Execution
- When inductor invokes g++ normally, it includes a bunch of absolute paths, to stuff like the pytorch header paths, and the input and output path. I changed these all to relative paths when in fbcode, and copied everything we needed into a temp dir that we send to Remote Execution.
- updated `triton/fb/make_build_paths.py` to let us grab paths to openmp, sleef, and ld from within the Remote Execution environment. I'm not sure if there's a better way to do this (but this way appeared to work, thanks to Bert's suggestion from https://www.internalfb.com/diff/D46482550?dst_version_fbid=231706286239076&transaction_fbid=229345569847706)
- refactored `triton/fb/build.py` (it had a function to create a triton build command and run it all in one go; I separated out the bit that takes in an arbitrary command (our clang command) and runs it with RE)
- a few tweaks to the include paths that inductor uses: it adds those two extra paths (sleef and openmp), and it also does not manually include the `-ltorch`,`-lc10`,`-ltorch_python`,`-ltorch_cpu` libs - the linker was complaining that it couldn't find those libs, and not including those flags ends up working
- I added a few more missing headers. Maybe with D46527002 this won't be necessary?
- I had a basic manual test in `scripts/hirsheybar/tmp2.py`. We probably want to try running an actual job in MAST to make sure this works.

Test Plan: `scripts/hirsheybar/pt2/tmp2.py` has a basic test, but I'm also planning on testing by kicking off a MAST job with cmf_10x (thanks to a bunch of help from Bert)

Reviewed By: bertmaher

Differential Revision: D46364355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104351
Approved by: https://github.com/bertmaher
2023-06-30 13:32:16 +00:00
e799f565eb [DTensor][TP][Random] Introduce TensorParallelRNGTracker to integrate parallel RNG state with Tensor Parallel (#103910)
This PR enables the automatic use of `TensorParallelRNGTracker` in the Tensor Parallel API. Some unit tests will be added for coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103910
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-06-30 08:06:41 +00:00
7bc181d374 [Xcode 15][caffe2] Avoid redundant redeclaration of 'constexpr' static data member (#104049)
Summary: Handling the `out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Werror,-Wdeprecated]` warning on Xcode 15.

Test Plan: Build

Reviewed By: n0shake

Differential Revision: D46875270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104049
Approved by: https://github.com/malfet
2023-06-30 06:59:33 +00:00
da06920f47 Replace all_gather in device mesh with functional collective equivalent (#104056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104056
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-06-30 05:30:02 +00:00
77642da3b8 Fix broken meta registration for torch.full (#104451)
Fixes #104117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104451
Approved by: https://github.com/eellison
2023-06-30 05:14:52 +00:00
0b62aca726 Don't decompose aten.bucketize (#104396)
torch.bucketize takes a tensor of values, and a "boundaries" tensor, which is a sorted list of values that represent buckets. It returns the bucket that each value lies in. E.g. if values = [1, 5, 3, 6] and boundaries=[0, 2, 4, 6, 8], the output will be [1, 3, 2, 4].
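
For reference, a minimal sketch of that behavior with the eager op (for values that exactly equal a boundary, such as 6 here, which bucket they land in depends on the `right` flag; `right=True` reproduces the numbers above):

```python
import torch

values = torch.tensor([1, 5, 3, 6])
boundaries = torch.tensor([0, 2, 4, 6, 8])  # must be sorted
print(torch.bucketize(values, boundaries, right=True))  # tensor([1, 3, 2, 4])
```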

The current decomposition of this op doesn't work well with dynamic shapes. It performs a binary search, which bakes the number of binary-search iterations into the graph and requires recompiling (I don't completely understand why/where this happens). I'm not sure whether there's a good way to write a decomposition for this op that will work with dynamic shapes.

Use case: this op is very similar to some operations needed by jagged tensors. As a first step, I want to add a lowering for aten.bucketize and make use of opinfos. #104007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104396
Approved by: https://github.com/Chillee
2023-06-30 05:05:08 +00:00
958bd3a549 [fake_pg] remove init barrier env var (#104428)
We can now remove the env var since we disable the init barrier by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104428
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2023-06-30 05:04:26 +00:00
56ef8ca054 Fix recursive call error in lift_tracked_freevar_to_input (#104378)
Summary:
The test was failing in `lift_tracked_freevar_to_input`
https://www.internalfb.com/phabricator/paste/view/P776002064

Cause:
* line 1219 assumes that `lift_tracked_freevar_to_input` is never called by the root tracer
* However, when we see a bound free variable in a child tracer, line 1226 will invoke the parent tracer recursively.
* When it reaches the root tracer, the assumption will fail.

Fix:
* we relax the assumption: if `lift_tracked_freevar_to_input` is called on the root tracer, we validate the variable is bound free, to allow the case where `lift_tracked_freevar_to_input` is populated from child tracers.

Test Plan:
pytest ./generated/test_VainF_pytorch_msssim.py
  pytest caffe2/test/dynamo/test_autograd_function.py -k test_function_with_bound_free_variable

Reviewed By: yanboliang

Differential Revision: D47033011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104378
Approved by: https://github.com/Skylion007, https://github.com/yanboliang
2023-06-30 04:53:45 +00:00
e2ec0ba404 Add forward mode AD to out-place foreach functions (#102409)
The major difference from in-place support is that some out-place functions have their derivatives spelled out in derivatives.yaml, which requires some changes in `load_derivatives.py`, plus some handling in various places for the others, whose derivatives are generated by `torchgen`.
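
A minimal usage sketch of forward-mode AD through an out-place foreach op (using the public `torch.autograd.forward_ad` API; shapes and op choice are illustrative):

```python
import torch
import torch.autograd.forward_ad as fwAD

primals = [torch.randn(3) for _ in range(2)]
tangents = [torch.randn(3) for _ in range(2)]

with fwAD.dual_level():
    duals = [fwAD.make_dual(p, t) for p, t in zip(primals, tangents)]
    outs = torch._foreach_sinh(duals)
    jvps = [fwAD.unpack_dual(o).tangent for o in outs]

# Each jvp should match tangent * cosh(primal), per the formula in the generated code below.
```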

rel:
- #58833
- #100695

---

# Generated Foreach
```c++
::std::vector<at::Tensor> _foreach_sinh(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachSinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachSinhBackward0>(new ForeachSinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = (self_t.conj() * self_p.cosh().conj()).conj();
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  return result;
}

::std::vector<at::Tensor> _foreach_norm_Scalar(c10::DispatchKeySet ks, at::TensorList self, const at::Scalar & ord) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<bool> _any_has_forward_grad_result(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_result[i] = isFwGradDefined(self[i]);
  }
  std::shared_ptr<ForeachNormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<ForeachNormBackward0>(new ForeachNormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->ord = ord;
    grad_fn->self_ = make_saved_variable_list(self);
    grad_fn->self_size_ = self.size();
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::_foreach_norm(ks & c10::after_autograd_keyset, self_, ord);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  std::vector<c10::optional<at::Tensor>> result_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_result[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
        auto self_p = toNonOptPrimal(self[i]);
        result_new_fw_grad_opts[i] = norm_jvp(self_p, self_t, ord, result[i]);
    }
  }
  for (const auto& i : c10::irange(result_new_fw_grad_opts.size())) {
    auto& result_new_fw_grad_opt = result_new_fw_grad_opts[i];
    if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      result[i]._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
    }
  }
  if (grad_fn) {
    grad_fn->result = result;
  }
  return result;
}

```

# Reference
```c++
at::Tensor sinh(c10::DispatchKeySet ks, const at::Tensor & self) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<SinhBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SinhBackward0>(new SinhBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::sinh(ks & c10::after_autograd_keyset, self_);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: sinh");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: sinh");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = (self_t.conj() * self_p.cosh().conj()).conj();
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  return result;
}
at::Tensor norm_Scalar(c10::DispatchKeySet ks, const at::Tensor & self, const at::Scalar & p) {
  auto& self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(self));
  std::shared_ptr<NormBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<NormBackward0>(new NormBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self ));
    grad_fn->p = p;
    grad_fn->self_ = SavedVariable(self, false);
  }
  #ifndef NDEBUG
  c10::optional<Storage> self__storage_saved =
    self_.has_storage() ? c10::optional<Storage>(self_.storage()) : c10::nullopt;
  c10::intrusive_ptr<TensorImpl> self__impl_saved;
  if (self_.defined()) self__impl_saved = self_.getIntrusivePtr();
  #endif
  auto _tmp = ([&]() {
    at::AutoDispatchBelowADInplaceOrView guard;
    return at::redispatch::norm(ks & c10::after_autograd_keyset, self_, p);
  })();
  auto result = std::move(_tmp);
  #ifndef NDEBUG
  if (self__storage_saved.has_value() &&
      !at::impl::dispatch_mode_enabled() &&
      !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__storage_saved.value().is_alias_of(self_.storage()));
  if (self__impl_saved && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(self_))
    TORCH_INTERNAL_ASSERT(self__impl_saved == self_.getIntrusivePtr());
  if (result.has_storage() && !at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result)) {
    TORCH_INTERNAL_ASSERT(result.storage().use_count() == 1, "function: norm_Scalar");
  }
  if (!at::impl::dispatch_mode_enabled() && !at::impl::tensor_has_dispatch(result))
    TORCH_INTERNAL_ASSERT(result.use_count() <= 1, "function: norm_Scalar");
  #endif
  if (grad_fn) {
      set_history(flatten_tensor_args( result ), grad_fn);
  }
  throw_error_for_complex_autograd(result, "norm");
  c10::optional<at::Tensor> result_new_fw_grad_opt = c10::nullopt;
  if (_any_has_forward_grad_result && (result.defined())) {
      auto self_t_raw = toNonOptFwGrad(self);
      auto self_tensor = toNonOptTensor(self);
      auto self_t = (self_t_raw.defined() || !self_tensor.defined())
        ? self_t_raw : at::_efficientzerotensor(self_tensor.sizes(), self_tensor.options());
      auto self_p = toNonOptPrimal(self);
      result_new_fw_grad_opt = norm_jvp(self_p, self_t, p, result);
  }
  if (result_new_fw_grad_opt.has_value() && result_new_fw_grad_opt.value().defined() && result.defined()) {
    // The hardcoded 0 here will need to be updated once we support multiple levels.
    result._set_fw_grad(result_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ false);
  }
  if (grad_fn) {
    grad_fn->result_ = SavedVariable(result, true);
  }
  return result;
}

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102409
Approved by: https://github.com/soulitzer
2023-06-30 04:51:43 +00:00
8457703e8d lazy init device mesh in fsdp (#104447)
Since FSDP state is lazily initialized, we also need to lazily initialize the device mesh; otherwise the DeviceMesh allgather check would trigger a mismatch in allgather counts in FSDP tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104447
Approved by: https://github.com/wconstab
2023-06-30 04:40:16 +00:00
0ff9a82a4d [profiler] Fix profiling PT2 w/ dynamic shapes & record_shapes (#104320)
When using torch.profiler.profile(record_shapes=True), the profiler tries to collect `tensor.sizes()` to put this information into the profile trace.

When dynamic shapes is turned on, sometimes tensors will appear that have symbolic sizes. In that case, `tensor.sizes()` can throw an assertion. This PR checks to see if tensor has symbolic shapes, and doesn't collect shape info in that case.
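
A minimal repro-style sketch of the scenario (assuming a compiled function that ends up with symbolic sizes when input shapes vary):

```python
import torch

@torch.compile(dynamic=True)
def fn(x):
    return torch.relu(x @ x.T)

with torch.profiler.profile(record_shapes=True) as prof:
    fn(torch.randn(8, 16))
    fn(torch.randn(12, 16))  # a second shape; recording shapes must not trip on symbolic sizes

print(prof.key_averages().table(row_limit=5))
```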

Differential Revision: [D47082414](https://our.internmc.facebook.com/intern/diff/D47082414)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104320
Approved by: https://github.com/aaronenyeshi
2023-06-30 04:35:52 +00:00
ecca9591d5 [quant][pt2e] Add reference representation for quantize/dequantize operators (#104395)
Summary: Similar to quantized add, in this PR we added the reference representation for quantize/dequantize operators.

Test Plan:
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_quantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_dequantize (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'

Reviewed By: kimishpatel

Differential Revision: D46959928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104395
Approved by: https://github.com/andrewor14
2023-06-30 04:32:18 +00:00
a704251628 inductor: fix compile error of bfloat16 broadcast operation (#104319)
For the bfloat16 broadcast, there is always a compile error:
```
error: could not convert ‘tmp2’ from ‘Vectorized<float>’ to ‘Vectorized<c10::BFloat16>
```

This PR will fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104319
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-30 04:14:38 +00:00
89decc3a10 [vision hash update] update the pinned vision hash (#104449)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104449
Approved by: https://github.com/pytorchbot
2023-06-30 03:42:04 +00:00
537a6c0651 [dynamo][ac] Remove disable monkeypatching of utils.checkpoint (#104397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104397
Approved by: https://github.com/wconstab
2023-06-30 02:27:06 +00:00
2642412207 Value range refinement using multi-variate expressions. (#97964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97964
Approved by: https://github.com/ezyang
2023-06-30 01:32:22 +00:00
ffb526a2e4 Value range refinement using uni-variate expressions. (#97963)
This PR introduces value range refinement of shape symbols by symbolically evaluating the
value range of the involved guards. This should help `_maybe_evaluate_static` to eliminate
more guards.

This is a stack of PRs created from the discussion on: #96616.

In summary, this PR:
- simplifies `FloorDiv` nodes on the left-hand side of an expression so as to isolate a
symbol in the numerator
- tries to match the expression against the form: `<symbol> <relop> <expr>`
- uses the matched expression for refining the value range of `<symbol>` using the range
of `<expr>`
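
As a toy illustration of the refinement step (not the ShapeEnv code): from a guard like `s0 // 4 < 10` with `s0 >= 0`, isolating `s0` in the numerator lets us tighten its upper bound.

```python
def refine_upper_bound(divisor, rhs_upper):
    # s0 // divisor < rhs_upper  (with s0 >= 0, divisor > 0)
    # implies s0 <= divisor * (rhs_upper - 1) + (divisor - 1), i.e. s0 < divisor * rhs_upper
    return divisor * rhs_upper  # exclusive upper bound for s0

print(refine_upper_bound(4, 10))  # s0 < 40
```
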
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97963
Approved by: https://github.com/ezyang
2023-06-30 01:32:22 +00:00
e311bed2a8 Turn translation validation on for tests and accuracy runs by default. (#103611)
This PR turns translation validation on by default for tests and accuracy benchmark
runs. It also installs Z3 on CI.

The main changes are:

- Add `--no-translation-validation` as an option in _test/run_tests.py_
    - Set `PYTORCH_TEST_WITH_TV` environment variable
- Add `TEST_WITH_TV` variable in _torch/testing/_internal/common_utils.py_
- Turn translation validation on for accuracy benchmarks in _benchmarks/dynamo/common.py_
- Add Z3 installation on CI scripts

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103611
Approved by: https://github.com/ezyang
2023-06-30 01:32:21 +00:00
d0509fe32d Document how functional collectives work under eager/dynamo (#104386)
Move user facing apis to the top for best visibility
(strictly code-motion in this PR, besides adding comments)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104386
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol
2023-06-30 01:12:55 +00:00
ffb1b4c462 [inductor] Install guards on both cases of View.handle_negative_index (#103780)
This branch is not an optimization; it's a correctness issue, so there should be a guard installed on both sides of the branch. Otherwise we could have an expression like `(s0 - s1)` that is initially positive, then becomes negative with a new set of shapes and now references an invalid index.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103780
Approved by: https://github.com/ezyang
2023-06-29 23:09:53 +00:00
d455d48744 Add back in reduce_scatter_tensor_coalesced (#104345)
#104256 erroneously removed the pybind definition for `reduce_scatter_tensor_coalesced` introduced in #103561

This adds it back in and introduces a test for the API.

Test command:
```
pytest test/distributed/test_c10d_nccl.py -vsk test_reduce_scatter_tensor_coalesced
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104345
Approved by: https://github.com/kwen2501
2023-06-29 22:53:26 +00:00
a993319a4b [export] Dont run export guard hook when there is no graph (#104383)
I am not able to create a test case. I saw this on an internal model which is too big to repro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104383
Approved by: https://github.com/yanboliang
2023-06-29 22:17:04 +00:00
76a91075ea propagate pred guards in TorchHigherOrderOperatorVariable call_function for cond (#104379)
Fixes https://github.com/pytorch/pytorch/issues/104372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104379
Approved by: https://github.com/voznesenskym, https://github.com/ydwu4, https://github.com/zou3519
2023-06-29 20:47:00 +00:00
12f19b5dd9 consider CALL_FINALLY non-jumping in stacksize_analysis (#103621)
Fixes #97811.

This PR fixes a bug in `stacksize_analysis`. The pre-`python3.9` opcode `END_FINALLY` should be considered terminal. (edit: this is no longer what this PR does)

With this change, [this](https://github.com/pytorch/pytorch/issues/97811#issuecomment-1591888590) previously failing example  now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103621
Approved by: https://github.com/williamwen42
2023-06-29 20:23:20 +00:00
a78bddac01 Revert D46920584: Multisect successfully blamed D46920584 for test or build failures (#104269) (#104302)
Summary:

This diff is reverting D46920584
D46920584: Make `torch.empty*` deterministic by filling with NaN or max int value (#101849) by generatedunixname499836121 has been identified to be causing the following test or build failures:

Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)

Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104302
Approved by: https://github.com/osalpekar
2023-06-29 20:20:58 +00:00
a6b9a61a6a Added a note to torch.round doc to indicate the return type (#97227)
Added a note to torch.round doc to indicate the return type of output tensor

Fixes #89056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97227
Approved by: https://github.com/albanD
2023-06-29 20:02:59 +00:00
4ab140902b [docs] Fixed typo in grid_sample docstring (#104406)
Fixed a small typo in the grid_sample docstring:

<img width="265" alt="image" src="https://github.com/pytorch/pytorch/assets/2459423/1d2dd7a2-895a-4683-9d9f-a4d1d9d1a4a7">

- https://pytorch.org/docs/main/generated/torch.nn.functional.grid_sample.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104406
Approved by: https://github.com/mikaylagawarecki, https://github.com/svekars
2023-06-29 19:44:54 +00:00
ec85ab6157 Adding aarch64 wheel CI workflows (#104109)
Adding Workflows for building aarch64 Linux PyTorch PIP wheels

Updates:
* Created aarch64 template for generated workflows
* Updated generate_ci_workflows.py to include aarch64
* Generated the aarch64 wheel workflow
* added _binary-build-aarch64.yml for building aarch64 wheel
* added _binary-test-aarch64.yml for sanity check of aarch64 wheel
* Updated binary_linux_test.sh to use --extra-index-url for aarch64 until the needed aarch64 dependencies are available at https://download.pytorch.org/whl/nightly/cpu

NOTES:
* The build and test workflows are using arm64v8/alpine and quay.io/pypa/manylinux2014_aarch64:latest docker images at this time.
* The Conda-generated workflow is not included at this time and is being worked on.

Workflows were successfully tested at https://github.com/xncqr/pytorch/actions/runs/5351891068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104109
Approved by: https://github.com/malfet, https://github.com/atalman
2023-06-29 18:58:43 +00:00
082832b0f8 Revert "Add DSA to IndexKernel.cu (#104054)"
This reverts commit aaada2c4fcc0f977d9cd297e44a0562c2237dc8d.

Reverted https://github.com/pytorch/pytorch/pull/104054 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/104054#issuecomment-1613583961))
2023-06-29 18:14:16 +00:00
cbb9683e3b [ONNX] Speed up export of large models (#103278)
This commit speeds up the ONNX export of large models by making the following changes:

- Remove unnecessary memcpy in `GetGraphProtoSize`
- In `export.cpp`, pass around a pointer to the ModelProto instead of the ModelProto itself.

The shape inference function is still the slowest part of the export for these models with large weights taking up 50% or more of the export time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103278
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-06-29 17:34:29 +00:00
193adde5f4 Fix UnboundLocalError in test_segment_reductions.py (#104353)
Summary:
Fixes:
```
UnboundLocalError: local variable 'expected_result' referenced before assignment
```

Test Plan: Sandcastle

Differential Revision: D47092967

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104353
Approved by: https://github.com/malfet
2023-06-29 16:29:34 +00:00
f32593630b Re-enable low memory dropout (#103330)
On attention_is_all_you_need_pytorch:

Perf: 1.526x -> 1.544x
Memory: 1.00 -> 1.05x

Fix for https://github.com/pytorch/pytorch/issues/102319, although I'm not sure all the perf is recovered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103330
Approved by: https://github.com/jansel
2023-06-29 16:27:02 +00:00
60e2a4a4a0 [2D parallel] workaround for FSDP init issue (#104398)
Closes https://github.com/pytorch/pytorch/issues/96491 and does so by relaxing FSDP's assumption that the entire input module must be on the same device. Now, FSDP can accept a module partially on CPU and GPU and just emits a warning.

Differential Revision: [D47117256](https://our.internmc.facebook.com/intern/diff/D47117256/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104398
Approved by: https://github.com/fegin
2023-06-29 16:07:07 +00:00
8cad411d3d Fix UntypedStorage pin error (#104355)
Summary:
Fixes:
```
TypeError: cannot pin 'torch.storage.UntypedStorage' only CPU memory can be pinned
```

Test Plan: Sandcastle

Differential Revision: D47093797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104355
Approved by: https://github.com/malfet
2023-06-29 16:06:52 +00:00
9dda446505 Pin pytest linux dependencies in docker (#104281)
Pin the pytest dependencies listed in requirements-ci.txt, also change the mac ones to match the linux ones.

The new pytest 7.4.0 causes some weird issues with printing skip messages, so pin to a previous version until I can figure out a fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104281
Approved by: https://github.com/huydhn
2023-06-29 16:05:46 +00:00
408cb45e14 [Dynamo] Support threading.local getattr (#104292)
Fixes #104066

threading.local has a custom `__getattribute__` so `_getattr_static`
doesn't work with it. Since we know that threading.local's
`__getattribute__` is well behaved
(e.g. https://github.com/python/cpython/blob/3.11/Lib/_threading_local.py),
we can just special case it.
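
A minimal sketch of the kind of code this enables under `torch.compile` (names are illustrative):

```python
import threading
import torch

state = threading.local()
state.scale = 2.0

@torch.compile
def fn(x):
    # Reading an attribute off a threading.local goes through its custom
    # __getattribute__, which Dynamo now special-cases.
    return x * state.scale

print(fn(torch.ones(3)))  # tensor([2., 2., 2.])
```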

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104292
Approved by: https://github.com/williamwen42, https://github.com/jansel
2023-06-29 14:32:37 +00:00
875f60399e pre_dispatch tracing: support autocast and no_grad/enable_grad ctx managers, add a pre_dispatch_eager dynamo backend (#103024)
This PR adds support for `enable_grad`/`no_grad`/`autocast` context managers getting properly traced in `pre_dispatch` tracing. The stuff in this PR includes:
- I added a torch function mode that runs during make_fx pre_dispatch tracing, `ProxyTorchFunctionMode`. It directly intercepts the torch ops that run during the above context managers, and adds them to the current graph instead of executing them
- `enable_grad` and `no_grad` currently desugar into `torch._C.set_grad_enabled(bool)`, but this API isn't currently overrideable by torch function so I added the ability to interpose there
- the `torch.amp` context managers don't currently have a nice equivalent, like `set_autocast_enabled(state)`, so I ended up adding two new API's: `torch.amp._set_autocast_enabled` and `torch.amp._set_autocast_disabled`. If you look at how the context manager is implemented, it ends up calling several different state-changing functions, some of which depend on the backend - so I figured that it would be cleaner just to add a new API (that should probably only be used by tracing) - but open to feedback
- I added a new dynamo backend, `compile(backend="pre_dispatch_eager")`. When pre_dispatch tracing becomes always-on in inductor, it will be another potential surface for bugs. I also added a test file for it (`test/dynamo/test_pre_dispatch.py`).
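
A minimal usage sketch of the new backend, exercising the context managers named above (illustrative shapes):

```python
import torch

def fn(x):
    with torch.no_grad():
        y = x + 1
    with torch.autocast("cpu"):
        z = y @ y.T
    return z

compiled = torch.compile(fn, backend="pre_dispatch_eager")
print(compiled(torch.randn(4, 4)).shape)  # torch.Size([4, 4])
```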

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103024
Approved by: https://github.com/ezyang
2023-06-29 14:17:42 +00:00
ebb8aa9c0b Correct output_padding for quantized tconv (torch->onnx) (#104207)
- In #102759, the support for `quantized::conv_transposeNd` was introduced. This incorrectly set `output_padding` to all zeros. Turns out, you can specify output_padding in PyTorch, but this parameter was not being unpacked correctly and thus did not show up in the python torch->onnx code.
- This adds unpacking of output_padding in `unpack_quantized_weights.cpp` when needed. It also adds this as a parameter in the python functions and uses that (and removes the all-zero defaults)
- Another issue with #102759 is that it only added these new ops to opset10 without adding the ability to specify axis in opset13. This PR also fixes this.

Fixes #104206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104207
Approved by: https://github.com/BowenBao
2023-06-29 13:40:48 +00:00
04c0d85caf [ONNX] Add op_level_debugging rule on validate_op_between_ort_torch (#104268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104268
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-06-29 13:38:13 +00:00
5c12a810ac [dynamo] Lazy disable_dynamo API out-of-dynamo (#104317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104317
Approved by: https://github.com/jansel, https://github.com/wconstab, https://github.com/mlazos
2023-06-29 13:30:17 +00:00
2bb83cd45c [dynamo][ac] Minor refactor for better code organization and a bugfix (#104276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104276
Approved by: https://github.com/zou3519
2023-06-29 12:57:59 +00:00
9154bbc999 Fix CUDA Bazel build to optionally include gmock after #104255 (#104308)
This reverts commit 39868b0578c18ddc194deac697d0675760de5f11.  Fixes https://github.com/pytorch/pytorch/issues/104279.

The change came from an internal codemod diff that we don't want to revert.  AFAIK, this addition is not needed as gmock has already been included https://github.com/google/googletest/blob/main/BUILD.bazel

### Testing

* OSS CUDA Bazel build should be back after this revert
* Import as D47077813 to make sure that nothing breaks internally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104308
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-06-29 07:15:06 +00:00
cyy
f78b92f490 fix an ASAN error in MKLDNN (#104331)
Fixes ASAN stack-use-after-scope in MKLDNN.
The stack trace is
```
2023-06-27T16:37:20.9099950Z ==1424==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f0c5dc20980 at pc 0x7f0c61286a73 bp 0x7ffef8e76990 sp 0x7ffef8e76118
2023-06-27T16:37:20.9100054Z READ of size 24 at 0x7f0c5dc20980 thread T0
2023-06-27T16:37:20.9100327Z     #0 0x7f0c61286a72 in memcmp (/usr/lib/llvm-7/lib/clang/7.0.1/lib/linux/libclang_rt.asan-x86_64.so+0x5da72)
2023-06-27T16:37:20.9100701Z     #1 0x7f0c2f395d0b in c10::ArrayRef<long>::equals(c10::ArrayRef<long>) const (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xcb8bd0b)
2023-06-27T16:37:20.9101196Z     #2 0x7f0c314a1bb1 in at::native::mkldnn_matmul(at::Tensor const&, at::Tensor const&, at::Tensor const&, float, float) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xec97bb1)
2023-06-27T16:37:20.9101714Z     #3 0x7f0c301f49c5 in at::native::bmm_out_or_baddbmm_(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&, bool) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xd9ea9c5)
2023-06-27T16:37:20.9102153Z     #4 0x7f0c301f85ab in at::native::structured_bmm_out_cpu::impl(at::Tensor const&, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xd9ee5ab)
2023-06-27T16:37:20.9102601Z     #5 0x7f0c32cb3cb6 in at::(anonymous namespace)::wrapper_CPU_bmm(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x104a9cb6)
2023-06-27T16:37:20.9103662Z     #6 0x7f0c32ea1f43 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &(at::(anonymous namespace)::wrapper_CPU_bmm(at::Tensor const&, at::Tensor const&))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x10697f43)
2023-06-27T16:37:20.9104330Z     #7 0x7f0c3187252a in at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&)> const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) const (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xf06852a)
2023-06-27T16:37:20.9104756Z     #8 0x7f0c3257e097 in at::_ops::bmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xfd74097)
2023-06-27T16:37:20.9105237Z     #9 0x7f0c383c31c3 in torch::autograd::VariableType::(anonymous namespace)::bmm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x15bb91c3)
2023-06-27T16:37:20.9106496Z     #10 0x7f0c383c25b9 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&), &(torch::autograd::VariableType::(anonymous namespace)::bmm(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&))>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x15bb85b9)
2023-06-27T16:37:20.9106874Z     #11 0x7f0c3257da60 in at::_ops::bmm::call(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xfd73a60)
2023-06-27T16:37:20.9107275Z     #12 0x7f0c301fc0e2 in at::native::_matmul_impl(at::Tensor&, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xd9f20e2)
2023-06-27T16:37:20.9107647Z     #13 0x7f0c301f9c21 in at::native::matmul(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xd9efc21)
2023-06-27T16:37:20.9108853Z     #14 0x7f0c33dca7e3 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, at::Tensor const&), &(at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__matmul(at::Tensor const&, at::Tensor const&))>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x115c07e3)
2023-06-27T16:37:20.9109255Z     #15 0x7f0c32958ef0 in at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1014eef0)
2023-06-27T16:37:20.9110023Z     #16 0x7f0c2f596b62 in at::autocast::WrapFunction_<(at::autocast::CastPolicy)0, (c10::DeviceType)0, at::Tensor (at::Tensor const&, at::Tensor const&), &(at::_ops::matmul::call(at::Tensor const&, at::Tensor const&)), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >::call(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xcd8cb62)
2023-06-27T16:37:20.9110723Z     #17 0x7f0c2f348403 in c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >::operator()(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xcb3e403)
2023-06-27T16:37:20.9111596Z     #18 0x7f0c2f348063 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, at::Tensor const&), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, at::Tensor const&> >, at::Tensor (at::Tensor const&, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0xcb3e063)
2023-06-27T16:37:20.9111976Z     #19 0x7f0c32958ef0 in at::_ops::matmul::call(at::Tensor const&, at::Tensor const&) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so+0x1014eef0)
2023-06-27T16:37:20.9112383Z     #20 0x7f0c5803dc3e in torch::autograd::THPVariable_matmul(_object*, _object*, _object*) (/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/lib/libtorch_python.so+0x2b2cc3e)
2023-06-27T16:37:20.9112561Z warning: parsing line table prologue at 0x00000000 should have ended at 0x0000050b but it ended at 0x0000050a
2023-06-27T16:37:20.9112713Z     #21 0x5074a6 in cfunction_call (/opt/conda/envs/py_3.9/bin/python3.9+0x5074a6)
2023-06-27T16:37:20.9112857Z     #22 0x505997 in _PyObject_Call (/opt/conda/envs/py_3.9/bin/python3.9+0x505997)
2023-06-27T16:37:20.9113114Z     #23 0x505997 in PyObject_Call /croot/python-split_1684193875530/work/build-static/<invalid>:293:12
2023-06-27T16:37:20.9113258Z     #24 0x4ed302 in do_call_core (/opt/conda/envs/py_3.9/bin/python3.9+0x4ed302)
2023-06-27T16:37:20.9113633Z     #25 0x4ed302 in _PyEval_EvalFrameDefault /croot/python-split_1684193875530/work/build-static/<invalid>:3582:22
2023-06-27T16:37:20.9113780Z     #26 0x4e6729 in _PyEval_EvalFrame (/opt/conda/envs/py_3.9/bin/python3.9+0x4e6729)
2023-06-27T16:37:20.9114041Z     #27 0x4e6729 in _PyEval_EvalCode /croot/python-split_1684193875530/work/build-static/<invalid>:4329:14
2023-06-27T16:37:20.9114202Z     #28 0x4efd7d in _PyFunction_Vectorcall (/opt/conda/envs/py_3.9/bin/python3.9+0x4efd7d)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104331
Approved by: https://github.com/soulitzer
2023-06-29 07:14:37 +00:00
d4e51511a0 Inductor cpp wrapper: add -ffast-math in linking flag (#104332)
Fix cpp wrapper CPU performance gap on `swsl_resnext101_32x16d` compared with the default python wrapper.

The pre-trained weights of `swsl_resnext101_32x16d` contains denormal numbers (close to 0.0).

Linking with `-ffast-math` will make the CPU flush denormals.
For the default python wrapper, the compilation and linking are done in one command thus `-ffast-math` will take effect in both compilation and linking.
CPP wrapper leverages cpp_extension which will do the compilation and linking in two stages, thus we need to explicitly add `-ffast-math` as a linking flag.
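For context, a minimal, illustrative sketch (not part of this PR) of why denormal weights hurt CPU throughput and what flushing them to zero does; the exact numbers depend on the hardware and BLAS backend, and `torch.set_flush_denormal` only takes effect on CPUs that support it:

```python
import time
import torch

weight = torch.full((1024, 1024), 1e-42)  # subnormal float32 values, like the pre-trained weights above
x = torch.randn(1024, 1024)

def bench() -> float:
    start = time.perf_counter()
    for _ in range(10):
        x @ weight
    return time.perf_counter() - start

torch.set_flush_denormal(False)
kept = bench()
torch.set_flush_denormal(True)   # flush denormals to zero, the effect -ffast-math enables at link time
flushed = bench()
print(f"denormals kept: {kept:.3f}s, flushed: {flushed:.3f}s")
```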

Single thread single batch on ICX:
| model | time (s) default python wrapper | time (s) cpp wrapper before fix | time (s) cpp wrapper after fix |
| -- | -- | -- | -- |
| swsl_resnext101_32x16d | 0.459097836 | 13.82326214 | 0.448116195 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104332
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
2023-06-29 06:59:04 +00:00
732067e5c3 s390x SIMD: propagate NaN value in clamp functions (#102978)
This change fixes test_batch_norm test in test/test_jit_fuser_te.py while keeping TestNNDeviceTypeCPU::test_grid_sample_nan_inf_cpu_float32 test from test/test_nn.py working.
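For reference, the NaN-propagation behavior this change preserves can be checked with a tiny snippet (illustrative only, not taken from the PR):

```python
import torch

x = torch.tensor([float("nan"), -2.0, 0.5, 3.0])
# clamp is expected to propagate NaN instead of silently mapping it into [min, max]
print(torch.clamp(x, min=0.0, max=1.0))  # expected: tensor([nan, 0.0000, 0.5000, 1.0000])
```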

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102978
Approved by: https://github.com/kit1980
2023-06-29 01:28:04 +00:00
fea683491e Make torch._dynamo lazy-importable (#104368)
Use [PEP-562](https://peps.python.org/pep-0562) to import `_dynamo` and `_inductor` only when needed.

- Remove redundant imports from tests
- Add `test_lazy_imports_are_lazy` to make sure they will not get imported by accident
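For background, PEP 562 lets a package defer importing a submodule until the attribute is first accessed. A minimal sketch of the pattern (illustrative only; not the exact code in `torch/__init__.py`):

```python
# mypackage/__init__.py
import importlib

_lazy_submodules = {"_dynamo", "_inductor"}

def __getattr__(name):
    # Only called when `name` is not found as a regular attribute (PEP 562).
    if name in _lazy_submodules:
        module = importlib.import_module(f".{name}", __name__)
        globals()[name] = module  # cache it so __getattr__ is not hit again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```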

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at bae8e90</samp>

> _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_
> _framework of deep learning, that with skill and cunning wrought_
> _many wonders of dynamic compilation, using the hidden powers_
> _of `_dynamo` and `_inductor`, the secret modules of LLVM and MLIR._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104368
Approved by: https://github.com/msaroufim, https://github.com/albanD
2023-06-29 00:51:59 +00:00
d0a72ec5e4 Translation validator for dynamo guards. (#102563)
This PR introduces a translation validator for dynamo guards. In summary, it verifies
whether the guards issued as Python code are sound, w.r.t the initial guards.

The main changes in this PR are:

- Create an FX graph for dynamic shapes
- Translate "the original" guards from the FX graph to Z3
- Check if the guards produced by `produce_guards` are sound w.r.t. the ones from the FX graph
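Conceptually the soundness check is a standard SMT query: assume the issued guards, negate the original ones, and ask the solver for a counterexample. A toy sketch using the `z3-solver` package (the symbols and guards below are made up for illustration):

```python
import z3

s0 = z3.Int("s0")                         # a symbolic size, e.g. x.size(0)
original_guards = [s0 > 1, s0 % 2 == 0]   # guards recorded while tracing dynamic shapes
produced_guards = [s0 >= 2, s0 % 2 == 0]  # guards issued as Python code by produce_guards

solver = z3.Solver()
solver.add(*produced_guards)
solver.add(z3.Not(z3.And(*original_guards)))  # is there an input the issued guards accept but the originals reject?
if solver.check() == z3.sat:
    print("unsound, counterexample:", solver.model())
else:
    print("issued guards imply the original guards")
```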

gh-stack version of the PR #101146.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102563
Approved by: https://github.com/ezyang
2023-06-28 22:32:53 +00:00
3707fbf63b [RFC]: Add test for graph partition after assertion ops functionalization. (#104287)
This PR:
* Address comment at https://github.com/pytorch/pytorch/pull/103887/files#r1244128266.
* Add test for graph partition to make sure assertion ops functionalization won't break graph partition in unexpected way.

**NOTE**:
In the context of export, it's totally up to the user to do any type of graph partition based on the specific use case. It's hard to anticipate the concrete downstream use case or provide any specific functionality to facilitate handling assertion ops (functional / non-functional). So this PR limits itself to [`CapabilityBasedPartitioner`](2da6cae43c/torch/fx/passes/infra/partitioner.py (L34)) and makes sure it doesn't break graph partition unexpectedly (by adding some tests).

For the test case used in PR, a few things to highlight:
* Without assertion, the fused graph is roughly like:
```
class fused(torch.nn.Module):
    def forward(self, a, b):
        fused_1 = self.fused_1(a, b);
        relu = fused_1.relu()
        fused_0 = self.fused_0(fused_1, relu)
        return (fused_0, fused_1)

    class fused_0(torch.nn.Module):
        def forward(self, add_2, relu):
            ... # Logic after relu
            return add_4

    class fused_1(torch.nn.Module):
        def forward(self, a, b):
            ... # Logic before relu, `add_1` is only exposed within this submodule.
            return add_2
```
* With the assertion, the fused graph is roughly like:
```
class fused(torch.nn.Module):
    def forward(self, arg0_1: i64[s0], arg1_1: i64[s0]):
        dep_token0 = ...
        ...
        fused_1 = self.fused_1(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
        ...
        getitem: i64[s0] = fused_1[0] # `getitem` is actually `add_1`
        ...
        relu_default: i64[s0] = torch.ops.aten.relu.default(getitem_1)
        ...
        # For inline assertion. Note that `getitem` which is an output of `fused_1`, is consumed by it.
        select_int: i64[] = torch.ops.aten.select.int(getitem, 0, 0)
        eq_scalar: b8[] = torch.ops.aten.eq.Scalar(select_int, 5)
        dep_token2: f32[] = torch.ops.aten._functional_assert_async.msg(
            eq_scalar, 'assertion error', dep_token = dep_token1
        )
        ...
        getitem_1: i64[s0] = fused_1[1] # `getitem_1` is actually `add_2`
        fused_0: i64[s0] = self.fused_0(getitem_1, relu_default)
        ...

        return (fused_0, getitem_1, dep_token2)

    class fused_0(torch.nn.Module):
        def forward(self, add_tensor_2: i64[s0], relu_default: i64[s0]):
            ... # Logic after relu
            return add_tensor_4

    class fused_1(torch.nn.Module):
        def forward(self, arg0_1: i64[s0], arg1_1: i64[s0]):
            ... # Logic before relu
            # `add_tensor_1` (basically `add_1`) is returned to allow downstream assertion op consumes it.
            return (add_tensor_1, add_tensor_2)
```

As shown above, the extra assertion (regardless of whether it's functionalized or not) **won't** cause extra submodule breakage if the asserted node is an intermediate node within a submodule - the intermediate node is simply returned as an extra output of the submodule so the downstream assertion node can consume it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104287
Approved by: https://github.com/tugsbayasgalan
2023-06-28 22:13:27 +00:00
ede1f99904 Add gelu vulkan function (#102762)
Summary: Add gelu function in vulkan

Test Plan:
buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64

[ RUN      ] VulkanAPITest.gelu
[       OK ] VulkanAPITest.gelu (63 ms)
[ RUN      ] VulkanAPITest.gelu_
[       OK ] VulkanAPITest.gelu_ (83 ms)

Reviewed By: SS-JIA

Differential Revision: D46297340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102762
Approved by: https://github.com/SS-JIA
2023-06-28 21:26:14 +00:00
f2ea27e4a0 Replace torch.has_cuda() call with torch.backends.cuda.built() (#104338)
torch.has_cuda() has been deprecated; use torch.backends.cuda.built() instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104338
Approved by: https://github.com/soulitzer
2023-06-28 21:06:52 +00:00
e140c9cc92 Fixes ROCM_HOME detection in case no hipcc is found in path (#95634)
If ROCM_HOME is not set as an environment variable,
the detection tries to find hipcc in the path,
but on failure it ended up with an empty string instead of an exception,
and returned that empty string instead of the hardcoded '/opt/rocm' as the third case.
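A minimal sketch of the intended fallback order (illustrative; the helper name is hypothetical and the real logic lives in `torch.utils.cpp_extension`):

```python
import os
import shutil

def _find_rocm_home() -> str:
    # 1) an explicit environment variable wins
    rocm_home = os.environ.get("ROCM_HOME") or os.environ.get("ROCM_PATH")
    if rocm_home:
        return rocm_home
    # 2) otherwise infer it from the location of hipcc on PATH
    hipcc = shutil.which("hipcc")
    if hipcc:
        return os.path.dirname(os.path.dirname(os.path.realpath(hipcc)))
    # 3) final fallback: the conventional install prefix
    return "/opt/rocm"
```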

Fixes #95633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95634
Approved by: https://github.com/jithunnair-amd, https://github.com/ezyang
2023-06-28 19:39:26 +00:00
8464a6a165 [GHF] Better check for internal diffs (#104344)
During revert, use title of "Meta Internal-Only Changes Check" to determine whether or not internal diff is associated with the PR. When PR is merged/closed, "Meta Internal-Only Changes Check" status is always success, but title message can differ:
- "There is no internal Diff connected, this can be merged now" means that there are no internal change associated with PR (or it was landed via GitHub First workflow)
- "The internal Diff has landed, this can be merged now" meaning that PR has associated internal DIFF, and OSS and internal reverts must happen in sync using internal tooling. (Or a revert PR can be authored in OSS)

Add regression test for https://github.com/pytorch/pytorch/pull/100652 that was originated from the internal diff, but was merged as OSS PR.

Fixes https://github.com/pytorch/pytorch/issues/104232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104344
Approved by: https://github.com/bigfootjon, https://github.com/huydhn
2023-06-28 19:22:45 +00:00
998c07799f [dynamo] fix deep nested closure cell KeyError (#104222)
Fix https://github.com/pytorch/pytorch/issues/99639 by handling the case in `InliningInstructionTranslator`'s `LOAD_CLOSURE` definition when the requested cell is not in `self.closure_cells`.

My intuition is that the behavior of `LOAD_DEREF` and `STORE_DEREF` on a cell/freevar should not depend on whether or not we called `LOAD_CLOSURE` (that is, we shouldn't create a new cell var in `LOAD_CLOSURE` like in https://github.com/pytorch/pytorch/pull/101357). But we need a way to push cells created by the inlined function that were not present in the caller - `InlinedClosureVariable` is used to differentiate these cells from other cells.

Adding this test causes an error though (EDIT: this test is not relevant to this PR and instead just reveals that `cond` with Python side effects is still broken):
```python
    def test_closure_out_of_scope_cell_with_cond(self):
        from functorch.experimental.control_flow import cond
        cell1 = torch.rand(3, 3)
        cell2 = torch.rand(3, 3)
        orig3 = torch.rand(3, 3)
        def test(x):
            cell3 = orig3.clone()
            def then():
                nonlocal cell3
                cell3 += cell1
                return cell3
            def els():
                nonlocal cell3
                cell3 += cell2
                return cell3
            return cond(x > 0, then, els, [])
        opt_fn = torch._dynamo.optimize("eager")(test)
        result1 = opt_fn(1)
        self.assertTrue(torch.allclose(result1, orig3 + cell1))
        result2 = opt_fn(-1)
        self.assertTrue(torch.allclose(result1, orig3 + cell1 + cell2))
```
```
Traceback (most recent call last):
  File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1768, in test_closure_out_of_scope_cell_with_cond
    result1 = opt_fn(1)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/eval_frame.py", line 295, in _fn
    return fn(*args, **kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/eval_frame.py", line 448, in catch_errors
    return callback(frame, cache_size, hooks, frame_state)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 526, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 127, in _fn
    return fn(*args, **kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 360, in _convert_frame_assert
    return _compile(
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/utils.py", line 180, in time_wrapper
    r = func(*args, **kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 430, in _compile
    out_code = transform_code_object(code, transform)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
    transformations(instructions, code_options)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/convert_frame.py", line 415, in transform
    tracer.run()
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2029, in run
    super().run()
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 708, in run
    and self.step()
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 668, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 391, in wrapper
    return inner_fn(self, inst)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 1100, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 559, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 1061, in call_function
    (false_r, false_graph, false_lifted_freevars) = speculate_branch(False)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 1044, in speculate_branch
    ret_val, ret_graph, ret_lifted_freevars = speculate_subgraph(
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/torch.py", line 850, in speculate_subgraph
    output = f.call_function(tx, args, {})
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/functions.py", line 121, in call_function
    return tx.inline_user_function_return(
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 595, in inline_user_function_return
    result = InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2134, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 2231, in inline_call_
    tracer.run()
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 708, in run
    and self.step()
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 668, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/symbolic_convert.py", line 162, in impl
    self.push(fn_var.call_function(self, self.popn(nargs), {}))
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/variables/builtin.py", line 497, in call_function
    proxy = tx.output.create_proxy(
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 345, in create_proxy
    return self.current_tracer.create_proxy(*args, **kwargs)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1109, in create_proxy
    new_arg = self.lift_tracked_freevar_to_input(arg)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1226, in lift_tracked_freevar_to_input
    self.parent.lift_tracked_freevar_to_input(proxy)
  File "/scratch/williamwen/work/pytorch2/torch/_dynamo/output_graph.py", line 1219, in lift_tracked_freevar_to_input
    assert (
AssertionError: lift_tracked_freevar_to_input on root SubgraphTracer

from user code:
   File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1766, in test
    return cond(x > 0, then, els, [])
  File "/scratch/williamwen/work/pytorch2/test/dynamo/test_misc.py", line 1764, in els
    cell3 += cell2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104222
Approved by: https://github.com/jansel
2023-06-28 17:54:13 +00:00
98f00f881f [inductor] convert layout of conv weight ahead of time for inference (#103642)
This PR handles inference. Will do similar thing for training later.

Some manual testing results shows this can improve inference perf by 2-3% (absolute improvement not relative one).
- convmixer: 4.285x -> 4.309x
- resnet50: 2.170x -> 2.203x

The PR is built upon freezing. Since without freezing, the weight input for a conv node may not be a parameter directly but be the output of precision converting ops. It's so much easier to implement this PR after freezing.

Commands
```
TORCHINDUCTOR_FREEZING=1 python benchmarks/dynamo/timm_models.py --backend inductor --amp --performance --only convmixer_768_32 --inference
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103642
Approved by: https://github.com/eellison
2023-06-28 17:42:32 +00:00
044a8e3305 [skip ci] Fix the deprecating link to **our office hours** (#104339)
Fix the deprecated link to **our office hours**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104339
Approved by: https://github.com/soulitzer
2023-06-28 17:07:36 +00:00
b81f1d1bee Speed up cpp extensions re-compilation (#104280)
Fixes https://github.com/pytorch/pytorch/issues/68066 to a large extent.

This is achieved by not touching files that don't need changing, so that the ninja caching works as expected.
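The core idea can be sketched as a write-if-changed helper, so ninja sees unchanged files and skips rebuilding them (illustrative only; the function name is not the actual one in `torch.utils.cpp_extension`):

```python
import os

def write_if_changed(path: str, content: str) -> None:
    # Leave the file untouched when its content is identical, so ninja's
    # dependency tracking does not mark the target dirty and recompile it.
    if os.path.exists(path):
        with open(path, "r") as f:
            if f.read() == content:
                return
    with open(path, "w") as f:
        f.write(content)
```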
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104280
Approved by: https://github.com/fmassa
2023-06-28 17:06:07 +00:00
42b0bdd0c5 [onnx] Convert aten::flatten with 0d input to onnx Reshape and 1d to Identity (#104089)
Avoid empty tensor generated by Slice op if using _flatten_helper for aten::flatten with 0d/1d input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104089
Approved by: https://github.com/thiagocrepaldi
2023-06-28 17:01:43 +00:00
aaada2c4fc Add DSA to IndexKernel.cu (#104054)
Summary: This diff also makes many things const and rearranges `X >= lb && X <= ub` to be `lb <= X && X <= ub`.

Test Plan:
```
buck2 build mode/dev-nosan fbcode//caffe2:ATen-cu
```

Differential Revision: D46943299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104054
Approved by: https://github.com/xw285cornell
2023-06-28 16:58:02 +00:00
c866446d6c [FSDP] Check module.training for _root_cast_forward_inputs (#104223)
We might erroneously cast forward inputs for the root if it doesn't
manage any handles (FSDP parameters). As a fix, pass in the module and check
its training attribute to ensure we don't cast inputs in eval mode.

Differential Revision: [D47041673](https://our.internmc.facebook.com/intern/diff/D47041673/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104223
Approved by: https://github.com/fegin
2023-06-28 16:38:01 +00:00
ee19121931 Change nn.Module.__getattr__ return type to Any (#104321)
When working with highly dynamic Python code it's not always possible to express static types. However, if we consider the end-user experience for somebody who uses both pytorch and a static type checker (mypy, pyright), we should err on the side of being ergonomic rather than technically correct.

`nn.Module.__getattr__` is one such example: on paper the return type is correct. In practice the community would benefit from having `Any` as the return type because it would avoid littering idiomatic pytorch code with `cast`, `# type: ignore`, `assert`, `isinstance`, etc.

Some evidences:
- linked in the comment thread on pyright bug tracker https://github.com/microsoft/pyright/issues/4213
- `pyre` type checker steps outside of the normal type checking practices and special-cases `register_buffer()` in part to avoid this problem. https://pyre-check.org/docs/features/ This is not a very scalable solution since type checkers generally aim at adhering to the spec (various typing PEPs).
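To illustrate the ergonomics point, a small hypothetical model (not from the PR) where a checker can only resolve the buffer access through `nn.Module.__getattr__`:

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("scale", torch.ones(4))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # `self.scale` is only known to a static checker via nn.Module.__getattr__.
        # With a `Tensor | Module` return type the multiplication is flagged and users
        # reach for cast() or `# type: ignore`; with `Any` the idiomatic line just works.
        return self.linear(x) * self.scale
```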

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104321
Approved by: https://github.com/kit1980, https://github.com/albanD
2023-06-28 16:14:36 +00:00
eqy
2504af5ec9 [cuDNN][cuDNN V8 API] Improve safety of ParamsHash keys (#104122)
In anticipation of adding some enhancements to the cuDNN benchmark cache (e.g., LRU eviction for memory savings), this PR adds some safety improvements to the handling of cache keys.

Currently, cache keys are dangerous to use, as e.g., a single inadvertent pass-by-value will potentially instantiate a copy with uninitialized padding bytes that will unwittingly hash differently and compare as unequal. This behavior is the result of `ParamsHash` using raw-bytes for hashing and comparison. I've been bitten by this in the past and would like to hopefully eliminate this class of errors.

Additionally, I'm not sure the standard guarantees that default copy/move constructors copy structs byte-for-byte, and this could be problematic when using maps as insertion could call these default constructors in order to instantiate a `std::pair`. Someone knowledgeable in C++ can correct me on this, but it seems that we are potentially relying on the good graces of common compiler implementations rather than the actual standard here.

This PR adds a variant of `ParamsHash` that expects a wrapped POD that has custom byte-for-byte constructors. It modifies the cuDNN V8 API benchmark cache to use this variant, and replaces the `setCacheKey` style code with constructor usage. If this approach looks good to folks I will also port other `ParamsHash` usage (e.g., in cuDNN v7 and other backends) and we can remove `ParamsHash`.

CC @malfet
@ngimel (who originally wanted constructors for keys, but I didn't have this solution in mind at the time)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104122
Approved by: https://github.com/zasdfgbnm, https://github.com/colesbury
2023-06-28 16:13:29 +00:00
35a8242226 [Doc] Add sum reduction for CTCLoss (#100235)
Summary:

Fix: #99141

Reference:
39b885cbbf/aten/src/ATen/native/LossCTC.cpp (L366-L371)

Test Plan: See GitHub Tests.

Differential Revision: D45387774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100235
Approved by: https://github.com/albanD, https://github.com/mikaylagawarecki
2023-06-28 16:08:22 +00:00
0a7b6dd97d [BE] Fix test_trymerge.py (#104343)
- Add `ngimel` to the list of reviewers to make "test_revert_rules" pass
- Change PR in `test_get_classifications_unstable` from https://github.com/pytorch/pytorch/pull/102784 to https://github.com/pytorch/pytorch/pull/104312  as former do not have unstable jobs after merging.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at af26e18</samp>

> _Oh we're the crew of the `test_trymerge.py`_
> _We update the rules and the cases on the fly_
> _We heave and we haul on the count of three_
> _We add a new approver for the `super` rule, aye_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104343
Approved by: https://github.com/jeanschmidt, https://github.com/albanD
2023-06-28 15:05:50 +00:00
fde024b32d [HigherOrderOp] Fall back on all new side effects in speculate_subgraph (#104077)
Fixes #103613.

A requirement for HigherOrderOperators is that after Dynamo capture, the body
function should be functional (i.e. has no observable side effects).
If the body function mutates a variable that is not local to the body, then
that should induce a graph break.

This PR distinguishes between MutableLocals created inside/outside the body
and adds relevant checks. (Design originally proposed by voznesenskym.)

- We tag each mutable_local with an id that corresponds to where it came
from. The mutable_local may represent an existing object that gets
tracked by Dynamo or an object that is created while Dynamo is
introspecting.
- This id changes when we are introspecting the body of a HigherOrderOperator.
- If Dynamo wants to perform a side effect using a mutable_local, we
check its .scope field with the current scope id and raise Unsupported
in the desired case (non-local mutation inside HigherOrderOperator body)
- The id is a global thread_local variable. I can make this not a global
variable, but it just takes some engineering time to thread a number through
each of the various ways Dynamo can construct a mutable_local.
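A toy example of the kind of branch body this PR now falls back on instead of tracing (illustrative; it mirrors the global-mutation cases covered by the test plan below):

```python
import torch
from functorch.experimental.control_flow import cond

hits = []  # state that lives outside the branch bodies

def true_fn(x):
    hits.append(1)  # non-local mutation inside the HigherOrderOperator body
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile(backend="eager")
def f(x):
    # Dynamo cannot keep the captured branch functional here, so per this PR it
    # falls back rather than silently dropping the side effect.
    return cond(x.sum() > 0, true_fn, false_fn, [x])
```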

Test Plan:
- Add a bunch of new tests. Tests combinations of {global, nonlocal} x
{number, Tensor, list, object, nn.Module} and asserts that HigherOrderOp
falls back on those cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104077
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2023-06-28 14:20:37 +00:00
c06bb82ba1 fix specialization when you pass an unspec int into slicing on a Python list. (#104142)
Fixes #103545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104142
Approved by: https://github.com/malfet, https://github.com/jansel
2023-06-28 13:13:07 +00:00
6493519fff [Easy][FSDP] Remove misleading asserts (#104274)
Since we do not call `_FSDPState.__init__()` and only use it for typing, it is not possible for these attributes to be `None`. The purpose of these `assert`s is to make sure that these attributes are set by `_init_process_group_state_for_hybrid_shard()`. If we care to make that explicit, I would posit that we should be using `hasattr` checks, not `is not None` checks, because if indeed `_init_process_group_state_for_hybrid_shard()` did not set these attributes, then even checking that it is not `None` would lead to an `AttributeError`. I do not include these `hasattr` checks for now since `_init_process_group_state_for_hybrid_shard()` is short enough that we can quickly tell by inspection that it sets the desired attributes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104274
Approved by: https://github.com/rohan-varma
2023-06-28 11:08:47 +00:00
ba9f6e6e92 [FSDP] Validate ignored_modules, ignored_states (#104273)
This checks that `ignored_modules` and `ignored_states` have the expected type and provides a reasonable error message if not. Otherwise, if someone passes a mix of modules and parameters to `ignored_states` for example, then our code may be silently incorrect.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104273
Approved by: https://github.com/rohan-varma
2023-06-28 11:08:47 +00:00
cc27e6c0f9 [FSDP] Fix ignored_states doc (#104253)
This fixes https://github.com/pytorch/pytorch/issues/104246.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104253
Approved by: https://github.com/rohan-varma
2023-06-28 11:08:45 +00:00
9db8ad7f1d [FSDP] Support unfreezing params for reshard-only hook (#104186)
This fixes https://github.com/pytorch/pytorch/issues/104148 (unfreezing parameters after `n` steps).

- This fixes a bug where we did not delete the post-backward hook state properly for the `requires_grad=False` case.
- This makes the `already_resharded` correct for `SHARD_GRAD_OP`.
- This generalizes `_clear_grads_if_needed()` to `_reset_flat_param_grad_info_if_needed()` to additionally include propagating the original parameters' `requires_grad` to the flat parameter.
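For orientation, the unfreeze-after-`n`-steps use case from the linked issue looks roughly like this (a sketch only; it assumes a process group has already been initialized, e.g. via torchrun, a CUDA device, and `use_orig_params=True` so the original parameters' `requires_grad` flags are the ones being toggled):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(nn.Linear(8, 8).cuda(), use_orig_params=True)

for p in model.parameters():
    p.requires_grad_(False)   # train frozen for the first n steps

# ... n optimizer steps later ...
for p in model.parameters():
    p.requires_grad_(True)    # unfreeze; the reshard-only post-backward hook must now
                              # track gradient state for these parameters correctly
```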
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104186
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2023-06-28 11:04:57 +00:00
89fcfc1b8c [Doc] linalg.ldl_factor: render the Shape of tensor A (#99777)
Summary: Fix: #96864

Test Plan: Please see GitHub tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99777
Approved by: https://github.com/lezcano
2023-06-28 09:28:45 +00:00
5cf3a99013 sampled_addmm: backward performance improvements (#103544)
No need to do double `sparse_mask`, let's squash everything into one call!
This PR exercises https://github.com/pytorch/pytorch/pull/103750, so here is an autogened code for the backward pass.

```
at::Tensor sparse_sampled_addmm(c10::DispatchKeySet ks, const at::Tensor & self, const at::Tensor & mat1, const at::Tensor & mat2, const at::Scalar & beta, const at::Scalar & alpha) {
  auto& self_ = unpack(self, "self", 0);
  auto& mat1_ = unpack(mat1, "mat1", 1);
  auto& mat2_ = unpack(mat2, "mat2", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, mat1, mat2 );

  std::shared_ptr<SparseSampledAddmmBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SparseSampledAddmmBackward0>(new SparseSampledAddmmBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, mat1, mat2 ));
    grad_fn->alpha = alpha;
    grad_fn->beta = beta;
    if (grad_fn->should_compute_output(2)) {
      grad_fn->mat1_ = SavedVariable(mat1, false);
    }
    if (grad_fn->should_compute_output(1)) {
      grad_fn->mat2_ = SavedVariable(mat2, false);
    }
    grad_fn->self_ = SavedVariable(self, false);
  }

```

As you can see, we do not save tensors unless needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103544
Approved by: https://github.com/soulitzer
2023-06-28 08:49:54 +00:00
148960b8cc [BE] Modernize C++ in MetalPrepackOpContext (#104312)
Mark destructors as overrides, which fixes:
```cpp
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpRegister.cpp:3:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpContext.h:52:3: warning: '~Conv2dOpContext' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  ~Conv2dOpContext() {
  ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/core/ivalue.h:22:17: note: overridden virtual function is here
class TORCH_API CustomClassHolder : public c10::intrusive_ptr_target {};
                ^
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpRegister.cpp:3:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/metal/MetalPrepackOpContext.h:147:3: warning: '~LinearOpContext' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  ~LinearOpContext() {
  ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/core/ivalue.h:22:17: note: overridden virtual function is here
class TORCH_API CustomClassHolder : public c10::intrusive_ptr_target {};
```
Modernize constructors by passing parameters by value and moving them, rather than by reference, see [clang-tidy pass-by-value rule](https://clang.llvm.org/extra/clang-tidy/checks/modernize/pass-by-value.html).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104312
Approved by: https://github.com/kit1980, https://github.com/osalpekar
2023-06-28 07:17:08 +00:00
c2095af3f8 make funcs argument type from torch.cuda.stream as torch.Stream (#104156)
Fixes #ISSUE_NUMBER
1. We want to support FSDP for custom devices, so we change the functions' argument type from torch.cuda.Stream to torch.Stream.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104156
Approved by: https://github.com/awgu
2023-06-28 06:02:56 +00:00
f7fdaf8191 Revert "Re-enable low memory dropout (#103330)"
This reverts commit 2d14395f176b38b8416c2713d285e5ae55695a5f.

Reverted https://github.com/pytorch/pytorch/pull/103330 on behalf of https://github.com/malfet due to Lots of tests failed with 'prims' object has no attribute 'inductor_random' ([comment](https://github.com/pytorch/pytorch/pull/103330#issuecomment-1610691147))
2023-06-28 04:27:37 +00:00
2d14395f17 Re-enable low memory dropout (#103330)
On attention_is_all_you_need_pytorch:

Perf: 1.526x -> 1.544x
Memory: 1.00 -> 1.05x

Fix for https://github.com/pytorch/pytorch/issues/102319, although I'm not sure all the perf is recovered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103330
Approved by: https://github.com/jansel
2023-06-28 03:13:41 +00:00
a8b63d4d1b [dynamo] If UserDefinedObjectVariable.var_getattr() is a callable, try handling as a TorchVariable (#104231)
In some cases, a UserFunctionVariable can be constructed when the underlying function is actually a TorchVariable. One example is when an attribute on a UnspecializedNNModuleVariable is a torch function. In those cases, we should treat the UserFunctionVariable as a TorchVariable.

This adds a check in UserDefinedObjectVariable.var_getattr() to try to create a TorchVariable instead of a UserFunctionVariable.

Fixes #104172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104231
Approved by: https://github.com/williamwen42, https://github.com/jansel
2023-06-28 02:39:03 +00:00
28d42e66e4 [CI] Add DALLE2_pytorch to FORCE_AMP_FOR_FP16_BF16_MODELS (#104283)
Summary: DALLE2_pytorch inference does not support bfloat16, so fall back to using AMP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104283
Approved by: https://github.com/eellison
2023-06-28 02:37:15 +00:00
cyy
54cb61f7d9 enable ASAN on some tests (#103647)
Enable more tests on ASAN; meanwhile we disable float-divide-by-zero and float-cast-overflow, both of which are also disabled by default in the latest clang.
The following cited doc explains the reasons.
```
-fsanitize=float-cast-overflow: Conversion to, from, or between floating-point types
which would overflow the destination. Because the range of representable values
for all floating-point types supported by Clang is [-inf, +inf], the only cases detected are
conversions from floating point to integer types.
-fsanitize=float-divide-by-zero: Floating point division by zero.
This is undefined per the C and C++ standards,
 but is defined by Clang (and by ISO/IEC/IEEE 60559 / IEEE 754) as producing
either an infinity or NaN value,
so is not included in -fsanitize=undefined.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103647
Approved by: https://github.com/kit1980
2023-06-28 02:17:14 +00:00
27eecf32bd Remove redundant dummy overrides (#103992)
# Tidy the code in [overrides.py](https://github.com/pytorch/pytorch/blob/main/torch/overrides.py)

## Duplicate APIs in the [get_testing_overrides()](https://github.com/pytorch/pytorch/blob/main/torch/overrides.py#L335) function:

| APIs  | Line number|
|-------|-------|
| torch.fft.fft| L544 L564 |
| torch.logsumexp | L670 L672
| torch.narrow_copy | L733 L1126 |
| torch.native_norm | L740 L741 L742 |
| torch.nn.init.constant_ | L885 L887 |
| torch.squeeze_copy | L1134 L1135 |
| torch.view_copy | L1148 L1149 |
| Tensor.\_coalesced\_ | L1236 L1261 |

## Testing script

```Python

import torch
import inspect
import functools
from typing import Dict, Set, Callable

"""
@functools.lru_cache(None)
def get_testing_overrides() -> Dict[Callable, Callable]:
    ...
    Tensor = torch.Tensor
    ret: Dict[Callable, Callable] = {
        # ...
        torch.fft.fft: lambda input, n=None, dim=-1, norm=None: -1,                         # L544
        torch.fft.fft: lambda input, n=None, dim=-1, norm=None: -1,                         # L564
        torch.logsumexp: lambda input, names, keepdim=False, out=None: -1,                  # L670
        torch.logsumexp: lambda input, names, keepdim=False, out=None: -1,                  # L672
        torch.narrow_copy: lambda input, dim, start, length: -1,                            # L733
        torch.narrow_copy: lambda self, dim, start, length: -1,                             # L1126
        torch.native_norm: lambda input, p=2: -1,                                           # L740
        torch.native_norm: lambda input, p=2: -1,                                           # L741
        torch.native_norm: lambda input, p=2, dim=None, keepdim=False, dtype=None: -1,      # L742
        torch.squeeze_copy: lambda self: -1,                                                # L1134
        torch.squeeze_copy: lambda self, dim: -1,                                           # L1135
        torch.view_copy: lambda self, size: -1,                                             # L1148
        torch.view_copy: lambda self, dtype: -1,                                            # L1149
        Tensor._coalesced_: lambda self: -1,                                                # L1236
        Tensor._coalesced_: lambda self, coalesced: -1,                                     # L1261
        # ...
    }
    ...
"""

if __name__ == "__main__":
    ret = torch.overrides.get_testing_overrides()

    Tensor = torch.Tensor
    dups = {"torch.fft.fft": torch.fft.fft,
            "torch.logsumexp": torch.logsumexp,
            "torch.narrow_copy": torch.narrow_copy,
            "torch.native_norm": torch.native_norm,
            "torch.squeeze_copy": torch.squeeze_copy,
            "torch.view_copy": torch.view_copy,
            "Tensor._coalesced_": Tensor._coalesced_}

    for k,v in dups.items():
        print(f"{k:18} {inspect.signature(ret[v])}")

```

## Testing output

```Shell
torch.fft.fft      (input, n=None, dim=-1, norm=None)
torch.logsumexp    (input, names, keepdim=False, out=None)
torch.narrow_copy  (self, dim, start, length)
torch.native_norm  (input, p=2, dim=None, keepdim=False, dtype=None)
torch.squeeze_copy (self, dim)
torch.view_copy    (self, dtype)
Tensor._coalesced_ (self, coalesced)

```

## Explanation:
The function `get_testing_overrides()` returns a `Dict[Callable, Callable]`. Later dummy overrides overwrite earlier ones for the same key in the returned `Dict`. Therefore, removing the duplicate dummy overrides tidies the code and improves readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103992
Approved by: https://github.com/kit1980
2023-06-28 01:59:56 +00:00
361ef824ea Handle custom higher order ops (#104285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104285
Approved by: https://github.com/zhxchen17
2023-06-28 01:53:36 +00:00
05ebd538d4 Inference Horizontal Fuse Addmm (#100746)
Gives 1.5% improvement on PegasusForCausalLM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100746
Approved by: https://github.com/jansel
2023-06-28 01:08:37 +00:00
9165d46b89 DDP + C10D sparse all_reduce changes (#103916) (#104256)
Summary:

reland of https://github.com/pytorch/pytorch/pull/103916

## Changes

Prototype sparse allreduce using the sparse dispatch key. When passing sparse tensors into `dist.all_reduce()` we can execute our dispatched function.

Prior to this change, passing a sparse tensor into `all_reduce()` would error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_ncclexp=default //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D47056695

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104256
Approved by: https://github.com/rohan-varma
2023-06-28 00:37:52 +00:00
c0aa442cb5 [dynamo][higher order op] Relaxing too restrictive check for output to be a list/tuple of tensors (#104221)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104221
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2023-06-28 00:30:43 +00:00
945a257277 [Quant][PT2E] Supported customized _EQUIVALENT_TYPES in Module Partition API (#102516)
**Summary**
`Module Partition API` can simplify the pattern matching process in quantization annotation. However, the current implementation of
`Module Partition API` hardcodes `_EQUIVALENT_TYPES` 999bae0f54/torch/ao/quantization/_pt2e/graph_utils.py (L13-L20). So, PyTorch extension libraries such as [intel-extension-for-pytorch](https://github.com/intel/intel-extension-for-pytorch) can't use `Module Partition API` with a customized `_EQUIVALENT_TYPES`. In this PR, we enable a customized `_EQUIVALENT_TYPES` by passing it in as a parameter.

**Test Plan**
```
python -m pytest test_graph_utils.py -k test_customized_equivalet_types_dict
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102516
Approved by: https://github.com/jgong5, https://github.com/kimishpatel
2023-06-28 00:20:25 +00:00
298ff41a38 [inductor] fix a bug in coordinate descent tuner (#104293)
The neighbor values we try for a field can be empty in some corner cases.
```
                # E.g., if XBLOCK is 1 initially and size_hint for x is also 1.
                # We would not try either larger or smaller XBLOCK in this case.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104293
Approved by: https://github.com/jansel
2023-06-28 00:05:13 +00:00
280df5dc2e [HigherOrderOp] Remove _deprecated_global_ns from some ops (#104105)
The remaining ops after this PR are:
- cond
- map
- anything that is out of tree.

These are a bit more difficult to remove.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104105
Approved by: https://github.com/ydwu4
2023-06-28 00:03:29 +00:00
de7b6e55eb Fix bad cudagraph interaction (#104196)
Fix for https://github.com/pytorch/pytorch/issues/103126

As mentioned there,

> We need to make sure we are not removing the misaligned inputs before we are checking for misalignment in cudagraphs, so we know not to expect a static input for the misaligned tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104196
Approved by: https://github.com/desertfire
2023-06-27 21:36:09 +00:00
7bb40be143 Fix fake tensor for private use backends (#103090)
Fix for https://github.com/pytorch/pytorch/issues/101244

We need meta to be higher priority than PrivateUse1 (as it is for cpu and cuda) so that when meta is in tls we hit meta kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103090
Approved by: https://github.com/bdhirsh
2023-06-27 21:17:40 +00:00
1a8af1503f Upgrade Pybind submodule to 2.10.4 (#103989)
This is not ready for review; it is to make sure ASAN is fixed.
I'm not sure yet what the most effective way is to track down the bad dec_ref within deploy.

The asan silencing is done to match this comment:
1c79003b3c/test/test_cpp_extensions_jit.py (L749-L752)

EDIT: since the final failing function is in libtorch_python.so, we would need to skip that whole lib (not ok). So now we're skipping based on the function name which should be restrictive enough to not hide any real bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103989
Approved by: https://github.com/malfet
2023-06-27 20:22:39 +00:00
c98896b76f [quant][pt2e] Add more precise representation for quantized add (#104130)
Summary:
The planned e2e for quantization in pytorch 2.0 export is the following:

float_model -> prepare_pt2e -> calibration -> convert_pt2e -> ...

inside convert_pt2e, we will first produce a q/dq representation of the quantized model, similar to the previous output of
convert_to_reference_fx in fx graph mode quantization:

```
torch.ops.quantized_decomposed.dequantize_per_tensor -> torch.ops.aten.add -> torch.ops.quantized_decomopsed.quantize_per_tensor
torch.ops.quantized_decomposed.dequantize_per_tensor   /
```

Then we'll rewrite the above to a representation that expresses the intention more precisely, since
here we actually want to do int8 addition instead of simulating the int8 addition with fp32 operations. The representation for
quantized add is:

```
def quantized_add(x_i8, x_scale, x_zero_point, y_i8, y_scale, y_zero_point, out_scale, out_zero_point):
    x = (x_scale / out_scale) * x_i8
    y = (y_scale / out_scale) * y_i8
    out = x + y
    out -= (x_zero_point * x_scale - y_zero_point * y_scale) / out_scale
    out += out_zero_point
    return out
```

Test Plan:
```
buck2 test caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_representation_add (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Reviewed By: kimishpatel

Differential Revision: D45628032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104130
Approved by: https://github.com/kimishpatel
2023-06-27 20:11:30 +00:00
7bf27cf163 [Inductor][FX passes] Remove config.split_cat_fx_passes & Add config.experimental_patterns (#104208)
Summary:
TLDR:
* Remove config.split_cat_fx_passes, and move split cat passes behind config.pattern_matcher (True by default)
* Add config.experimental_patterns (False by default).
* In the future, general/universal patterns should behind config.pattern_matcher; customized/unmatured patterns should behind config.experimental_patterns.

More details at:
https://docs.google.com/document/d/1P8uJTpOTdQpUbw56UxHol40tt-EPFTq1Qu38072E9aM/edit

Test Plan: Existing unit tests

Reviewed By: jansel, jackiexu1992

Differential Revision: D46752606

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104208
Approved by: https://github.com/williamwen42
2023-06-27 20:08:40 +00:00
2da6cae43c [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELT bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype converage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it becuase there may be
additional zero elements in the dense tensor, so `tensor !=0` is not 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has it's
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = Model(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 19:21:06 +00:00
39868b0578 [codemod][third-party][gtest] Migrate all fbcode gtest from tp2 to fbsource/third-party (#104255)
Summary:
## What is this?
This is a giant codemod to migrate all of fbcode from the tp2 version of gtest to the `fbsource/third-party` version.

## Why?
Various parts of the monorepo use different versions of gtest which are incompatible with each other and make maintenance of C++ testing more difficult than it should be. There also doesn't seem to be much reason for this fragmentation. Shifting all `gtest` dependencies towards `fbsource/third-party` is a big step in the right direction towards cleaning this up.

Also -- tp2 is deprecated, so we want to stop using that anyway. If we're going to make improvements to `gtest`, we should get away from tp2 as a first step.

## How?

I used bash script to perform the majority of the codemod: P777150295

I followed up with `rg` to find additional dependencies, then simply iterated a ton until CI was (mostly) happy.

This diff also includes an update to autodeps to use the `third-party/fbsource` version of gtest rather than the `tp2` version.

#forcetdhashing

Test Plan: CI

Differential Revision: D46961576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104255
Approved by: https://github.com/huydhn
2023-06-27 19:10:08 +00:00
a66107a30c [DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235)
# Change
This PR adds two classes to DTensor:

1. `CudaRNGStateTracker`:  `CudaRNGStateTracker` stores Random Number Generator (RNG) state (a `ByteTensor` object) in a `dict`, mapping from a corresponding tag to each state tensor. It also provides a set of convenient utility methods to help access/modify the state tensors. The most important interface is `_distribute_region` which will be used when DTensor executes a random op (an operator that calls RNG).

2. `OffsetBasedRNGTracker`: This subclass of `CudaRNGStateTracker` defines the default policy of how RNG states should be shared and synchronized among all ranks to respect the semantics of DTensor random operators.

# Warning

- With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` will be shared among threads (ranks) and cause issues. We need to figure out a compatible solution for that.

- The RNG state may be asynchronous outside of participating ranks. It is harmless in our current use case of submesh though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235
Approved by: https://github.com/wanchaol
2023-06-27 19:00:25 +00:00
84f578dcc2 [ONNX] Cache AutoTokenizer in CI for test (#104233)
Fixes #103950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104233
Approved by: https://github.com/malfet
2023-06-27 18:55:39 +00:00
93b6b17dd0 CUDA_HOST_COMPILER spelling fix in cmake build files generate method (#104126)
Fix of CUDA_HOST_COMPILER spelling fix in generating additional build option in CMake.generate method.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104126
Approved by: https://github.com/malfet
2023-06-27 18:46:12 +00:00
Te
a73ad82c8f conditional CMAKE_CUDA_STANDARD (#104240)
Fixes #104237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104240
Approved by: https://github.com/malfet
2023-06-27 18:41:25 +00:00
bf34ecd0c8 [RFC]: Integrate assertions functionalization to export (after AOT export) (#103887)
This PR integrated the assertion functionalization logic into current export logic.

**NOTE:**
I finally decided to do the assertion functionalization after AOT export instead of before for the following reasons:
* The benefit of AOT export is that the graph is already functionalized so things like method call is already transformed to function call. However, if we do it before AOT export, the graph is still in torch level and extra logic like bab21d20eb/torch/_export/pass_base.py (L201-L204C17) will need to be implemented.
* The graph signature is kind of already incorrect after adding runtime assertions (this doesn't seem to break logic since we already depend on positions instead of FQNs of outputs). This PR also fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103887
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2023-06-27 18:14:29 +00:00
936cd4f2f5 Migrate exportdb to torch.export (#104260)
Reapply of (https://github.com/pytorch/pytorch/pull/103861). Things that needed to be fixed:

- Fix a bug with returning dict output type
- Make pass_base work with map implementation
- Fix subtle bug with dynamo not propagating "val" in node.meta
- Add export_constraints field in ExportCase in ExportDB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104260
Approved by: https://github.com/angelayi
2023-06-27 17:49:18 +00:00
ab9577087a Update accuracy for dynamo/torchbench CI - vision_maskrcnn, hf_T5_generate and dlrm (#104263)
Fixes breaking CI jobs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104263
Approved by: https://github.com/atalman, https://github.com/seemethere
2023-06-27 17:33:01 +00:00
ef285faeba [ET][XNNPACK] Add support for quantized Multiply (#104134)
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.

We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusion and not having to add all combinations.

TODO: We should really rename the backend config to et_xnnpack.py or something similar.

Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`

Differential Revision: D46985169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104134
Approved by: https://github.com/mcr229, https://github.com/salilsdesai
2023-06-27 16:59:28 +00:00
80ea3422f0 [ROCm] Enable tl.reduce usage on ROCm (#104099)
Reverts the explicit aten.prod fallback on ROCm and enables the use of tl.reduce in Triton codegen. This PR also enables an optimisation that was previously conditionalised out for ROCm https://github.com/pytorch/pytorch/pull/102444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104099
Approved by: https://github.com/peterbell10, https://github.com/malfet
2023-06-27 16:21:32 +00:00
99e87bb6a0 [MPS] Dispatch outer bin edges selection function (#101792)
Dispatch the selection function to prevent using `is_mps()` in `Histogram.cpp`.

### <samp>🤖 Generated by Copilot at b329a02</samp>

This pull request refactors and implements the logic for inferring the bin edges of histograms from the input tensor for different device types. It introduces a dispatch stub `histogram_select_outer_bin_edges_stub` and moves the device-specific code to separate files, such as `HistogramKernel.cpp` and `HistogramKernel.mm`. This improves the modularity and readability of the histogram functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101792
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
217a8b4697 [MPS] Add MPSProfiler to histogram kernel (#101692)
Apart from introducing MPSProfiler, this PR also
1. removes the synchronization call after all the commands are encoded, since the stream will be synchronized when the next graph op is encountered and run. One can take a look at this [PR](https://github.com/pytorch/pytorch/pull/99810) to get some insight.
2. initializes the offset calculation kernel's thread output to 0 to ensure the subsequent offset accumulation is correct. This change aligns the kernel with the `kernel_index_offsets` kernel.

### <samp>🤖 Generated by Copilot at 4094984</samp>

This change enables performance analysis of the `histogram` kernel on MPS devices by using the `MPSProfiler` class to collect and report relevant metrics. It modifies the file `HistogramKernel.mm` to add profiling calls around the kernel execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101692
Approved by: https://github.com/albanD
2023-06-27 16:17:10 +00:00
c40f5edf7b Change tools search order (#104214)
Prevents the following cryptic error if one attempts to use `run_test.py` on a system that also has torchaudio installed in dev mode (as `tools` from https://github.com/pytorch/audio might take precedence, but this is not how the script should behave):
```
Unable to import test_selections from tools/testing. Running without test selection stats.... Reason: No module named 'tools.stats'
Traceback (most recent call last):
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1673, in <module>
    main()
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1604, in main
    selected_tests = get_selected_tests(options)
  File "/Users/nshulga/git/pytorch/pytorch/test/run_test.py", line 1418, in get_selected_tests
    path = os.path.join(str(REPO_ROOT), TEST_TIMES_FILE)
NameError: name 'TEST_TIMES_FILE' is not defined
```

But make sure to remove it at the end, otherwise it will not work if torch is installed from a wheel but tests are running from a clean repo checkout.
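A minimal sketch of the pattern described above; the module and variable names are assumptions for illustration, not the exact code in `run_test.py`:

```python
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent.parent  # assumed layout: <repo>/test/run_test.py

# Put the repo root first so `tools` resolves to pytorch's own copy...
sys.path.insert(0, str(REPO_ROOT))
try:
    from tools.testing import test_selections  # assumed module name
except ImportError:
    test_selections = None
finally:
    # ...and remove it again so a wheel-installed torch still works
    # when tests run from a clean repo checkout.
    sys.path.remove(str(REPO_ROOT))
```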

### <samp>🤖 Generated by Copilot at dd52521</samp>

> _Sing, O Muse, of the cunning code review_
> _That fixed the tests of the `tools` module_
> _By adding and removing the root path_
> _As a shepherd guides his flock to and fro._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104214
Approved by: https://github.com/kit1980
2023-06-27 15:54:34 +00:00
4d613b9a5f [doc] Improve mps package description (#104184)
Fixes #104183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104184
Approved by: https://github.com/malfet
2023-06-27 15:50:36 +00:00
ad2905ad27 Make _test_autograd_multiple_dispatch_view a view operation (#104149)
Fixes the `test_view_copy_cuda` failure case in https://github.com/pytorch/pytorch/issues/99655

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104149
Approved by: https://github.com/soulitzer
2023-06-27 15:43:35 +00:00
567b5e5b28 Multioutput backward formula: allow conditional guards against saving (#103750)
Multi-output backward formulas break the ability of autogen to decide which variables have to be stored in a graph.
This PR introduces a macro `wrap_opt_if` which can be used to hint to autogen about variable interdependence.

For example, the following code is being generated for `_trilinear` with this modification:
```
at::Tensor _trilinear(c10::DispatchKeySet ks, const at::Tensor & i1, const at::Tensor & i2, const at::Tensor & i3, at::IntArrayRef expand1, at::IntArrayRef expand2, at::IntArrayRef expand3, at::IntArrayRef sumdim, int64_t unroll_dim) {
  auto& i1_ = unpack(i1, "i1", 0);
  auto& i2_ = unpack(i2, "i2", 1);
  auto& i3_ = unpack(i3, "i3", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( i1, i2, i3 );

  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(i1) || isFwGradDefined(i2) || isFwGradDefined(i3));
  std::shared_ptr<TrilinearBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<TrilinearBackward0>(new TrilinearBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( i1, i2, i3 ));
    grad_fn->expand1 = expand1.vec();
    grad_fn->expand2 = expand2.vec();
    grad_fn->expand3 = expand3.vec();
    if (grad_fn->should_compute_output(1) || grad_fn->should_compute_output(2)) {
      grad_fn->i1_ = SavedVariable(i1, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(2)) {
      grad_fn->i2_ = SavedVariable(i2, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(1)) {
      grad_fn->i3_ = SavedVariable(i3, false);
    }
    grad_fn->sumdim = sumdim.vec();
  }

```

with the following backward modifications:
```
 - name: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor
  - i1, i2, i3: _trilinear_backward(grad, i1, i2, i3, expand1, expand2, expand3, sumdim, grad_input_mask)
  + i1, i2, i3: "_trilinear_backward(grad,
  +             wrap_opt_if(i1, grad_input_mask[1] || grad_input_mask[2]),
  +             wrap_opt_if(i2, grad_input_mask[0] || grad_input_mask[2]),
  +             wrap_opt_if(i3, grad_input_mask[0] || grad_input_mask[1]),
  +             expand1, expand2, expand3, sumdim, grad_input_mask)"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103750
Approved by: https://github.com/soulitzer
2023-06-27 15:12:09 +00:00
18dacf7e79 [Specialized Kernel] Update yaml syntax to use kernel instead of dispatch (#104070)
Based on this [code search](https://fburl.com/code/gjcnw8ly) (*.yaml with `dispatch: CPU:`), update all files found to use

```
kernels:
    - arg_meta: None
      kernel_name:
```
instead of
```
dispatch:
    CPU:
```
---
## Code changes:

- `fbcode/executorch/codegen/tools/gen_oplist.py`
  - Strip ET specific fields prior to calling parse_native_yaml_struct
---
## Files edited that are not `*functions.yaml` or `custom_ops.yaml`

- fbcode/executorch/kernels/optimized/optimized.yaml
- fbcode/executorch/kernels/quantized/quantized.yaml
- fbcode/executorch/kernels/test/custom_kernel_example/my_functions.yaml

---
## Found Files that were not edited

**Dispatched to more than just CPU**
- fbcode/caffe2/aten/src/ATen/native/native_functions.yaml
- xplat/caffe2/aten/src/ATen/native/native_functions.yaml
- xros/third-party/caffe2/caffe2/aten/src/ATen/native/native_functions.yaml

**Grouped ops.yaml path**
- fbcode/on_device_ai/Assistant/Jarvis/min_runtime/operators/ops.yaml

---
**Design Doc:** https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#heading=h.8raqyft9y50

Differential Revision: [D46952067](https://our.internmc.facebook.com/intern/diff/D46952067/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46952067/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104070
Approved by: https://github.com/larryliu0820
2023-06-27 09:53:20 +00:00
95707ac964 [fake_pg] allow fake_pg allgather to do some simple validation (#104213)
Note that in general it's not good form to try to make FakePG work with 'real data',
but the reasoning here is that we want FakePG to work with DeviceMesh's init code
that has the data validation, which makes it worth the tradeoff.

In general users should use MTPG or a normal PG for cases where they may care about
real data from collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104213
Approved by: https://github.com/wconstab, https://github.com/voznesenskym
2023-06-27 09:39:16 +00:00
6c1ccccf21 Enable mimalloc on pytorch Windows (#102595)
This PR is an implementation of [#102534](https://github.com/pytorch/pytorch/issues/102534), option 2.
Major changes:
1. Adds mimalloc as a submodule.
2. Adds the build option "USE_MIMALLOC".
3. It is only enabled in the Windows build, and it improves PyTorch memory allocation performance.

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds & statically links mimalloc on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-27 08:53:26 +00:00
803c14490b Specialize storage_offset - Does not cover automatic dynamic (#104204)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104204
Approved by: https://github.com/wconstab
2023-06-27 05:51:42 +00:00
c3e4a67905 Refactor multigpu tests to test_cuda_multigpu (#104059)
Mostly a refactor that moves all the tests from `test_cuda` that benefit from a multi-GPU environment into their own file.

- Add `TestCudaMallocAsync` class for Async tests ( to separate them from `TestCudaComm`)
- Move individual tests from `TestCuda` to `TestCudaMultiGPU`
- Move `_create_scaling_models_optimizers` and `_create_scaling_case` to `torch.testing._internal.common_cuda`
- Add newly created `test_cuda_multigpu` to the multigpu periodic test

### <samp>🤖 Generated by Copilot at f4d46fa</samp>

This pull request fixes a flaky test and improves the testing of gradient scaling on multiple GPUs. It adds verbose output for two CUDA tests, and refactors some common code into helper functions in `torch/testing/_internal/common_cuda.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104059
Approved by: https://github.com/huydhn
2023-06-27 05:32:05 +00:00
572ff2779b [RESUBMIT] Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103925)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103925
Approved by: https://github.com/osalpekar
2023-06-27 04:22:03 +00:00
b76a040b18 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit aea771de30427998e83010459b69da1ab66f0879.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is still failing CUDA trunk jobs aea771de30 ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608744110))
2023-06-27 03:49:31 +00:00
7157dfdd4a [jit] fix duplicated module input and output values in tracing module (#102510)
remap should record the original input pointers instead of the remapped ones.

Test case:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Normalize(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = nn.GroupNorm(num_groups=32, num_channels=32)

    def forward(self, x, y):
        if y is None:
            y = x
        else:
            y = self.norm(y)

        y = y * 2

        return y

class G(nn.Module):
    def __init__(self):
        super().__init__()

        self.norm = Normalize()

    def forward(self, x):

        A = self.norm(x, None)
        B = F.relu(A)

        return A, B

class Net(nn.Module):
    def __init__(self):
        super().__init__()

        self.g = G()

        self.norm_1 = Normalize()

    def forward(self, x):
        hs = self.g(x)

        A, B = hs

        h = self.norm_1(B, A)
        return h

net = Net()
net = net.eval()

x = torch.randn(1, 32, 16, 16)

traced = torch.jit.trace(net, x)

print(traced.graph)
```

without this patch, there are duplicated lifted values, %80, %81, %82, %83, %84, %85
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %86 : (Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor) = prim::CallMethod[name="forward"](%g, %x)
  %79 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %80 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %81 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %82 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %83 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %84 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu), %85 : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu) = prim::TupleUnpack(%86)
  %87 : Tensor = prim::CallMethod[name="forward"](%norm_1, %79, %80, %81, %82, %83, %84, %85)
  return (%87)

```

with this patch
```
graph(%self.1 : __torch__.Net,
      %x : Float(1, 32, 16, 16, strides=[8192, 256, 16, 1], requires_grad=0, device=cpu)):
  %norm_1 : __torch__.___torch_mangle_1.Normalize = prim::GetAttr[name="norm_1"](%self.1)
  %g : __torch__.G = prim::GetAttr[name="g"](%self.1)
  %71 : Tensor = prim::CallMethod[name="forward"](%g, %x)
  %72 : Tensor = prim::CallMethod[name="forward"](%norm_1, %71)
  return (%72)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102510
Approved by: https://github.com/davidberard98
2023-06-27 03:43:06 +00:00
aea771de30 [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = nn.Linear(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-27 02:37:00 +00:00
968b7b5e0f Initial commit of collective_utils (#101037)
Summary:
Details in T133020932
First commit of collective utils library. Ported over from model store, removed scuba logging, error_trait and all dependencies on modelstore.

Test Plan: In the following diffs.

Differential Revision: D45545970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101037
Approved by: https://github.com/H-Huang
2023-06-27 02:15:16 +00:00
41866a2ead Fix missing mandatory device_type argument in autocast docstring (#97223)
Fixes #[92803](https://github.com/pytorch/pytorch/issues/92803)
![Screenshot from 2023-03-21 12-28-14](https://user-images.githubusercontent.com/100136654/226538769-141f3b9e-0de2-4e86-8e42-d5a4a7413c6f.png)
![Screenshot from 2023-03-21 12-28-29](https://user-images.githubusercontent.com/100136654/226538777-9e719090-75c0-46f7-8594-5efcb0a46df6.png)
![Screenshot from 2023-03-21 12-29-36](https://user-images.githubusercontent.com/100136654/226538783-399a0e60-ffc9-4d73-801c-8cfce366d142.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97223
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-27 01:54:54 +00:00
6d2da6106d Raise AttributeError in _OpsNamespace if __self__ attribute is requested (#104096)
Summary:
Trying to get the `__self__` attribute on any `_OpNamespace` object should be an invalid operation. The `__self__` attribute only exists on instance method objects, not on class objects.

In [dynamo](a152b3e3b8/torch/_dynamo/variables/torch.py (L164)) there is code that tries to access the `__self__` attribute on `TorchVariable`; this currently results in an expensive call to `torch._C._jit_get_operation` [here](a152b3e3b8/torch/_ops.py (L740)) which ultimately fails and throws an exception. For cases where it fails, the operation turns out to be quite expensive, on the order of ~0.03s.

For edge use cases, when exporting large models with quantized ops, this exception is thrown hundreds of times, resulting in a lot of wasted time. By preventing the call to `torch._C._jit_get_operation` we can quickly return from this function and significantly reduce export times. On a large ASR model, for example, export currently takes **~405** seconds. With this change we can reduce it to **~340s**.

Overall this should be a harmless change, as hardly anyone should ever try to access the `__self__` attribute on an `_OpNamespace` object.
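A minimal sketch of the early-exit idea described above; the class shown here is illustrative, not the exact code in `torch/_ops.py`:

```python
# Illustrative sketch only: short-circuit the `__self__` lookup before the
# expensive torch._C._jit_get_operation fallback is ever attempted.
class _OpNamespaceSketch:
    def __init__(self, name):
        self.name = name

    def __getattr__(self, op_name):
        if op_name == "__self__":
            # Raise early and cheaply instead of attempting operator resolution.
            raise AttributeError(
                f"Invalid attribute '{op_name}' for op namespace '{self.name}'"
            )
        # ... the usual (expensive) operator lookup would follow here ...
        raise AttributeError(op_name)
```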

Test Plan: Added test case.

Differential Revision: D46959879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104096
Approved by: https://github.com/larryliu0820, https://github.com/ezyang, https://github.com/zou3519
2023-06-27 01:42:06 +00:00
f8ac569365 [Inductor][Quant]Fix tile2d code generation issue with uint8 data type (#104074)
**Summary**
The previous vectorized code generation for tile2d did not support the uint8 input data type; it still treated the input as float and generated wrong results. This PR fixes the issue. Take the UT `test_tile2d_load_decomposed_dequant_add_relu_quant` in this PR as an example:
The previously generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::Vectorized<float>::loadu(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

After this PR, the generated code is:
```
#pragma GCC ivdep
for(long i1=static_cast<long>(0L); i1<static_cast<long>(192L); i1+=static_cast<long>(16L))
{
    unsigned char tmp0[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr0 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp0, 16);
    unsigned char tmp7[16*16] __attribute__ ((aligned (16)));
    at::vec::transpose_mxn<unsigned char,16,16>(in_ptr1 + static_cast<long>(i0 + (1024L*i1)), static_cast<long>(1024L), tmp7, 16);
    for (long i0_inner = 0; i0_inner < 16; i0_inner++)
    {
        auto tmp1 = at::vec::load_uint8_as_float(tmp0 + static_cast<long>(16L*i0_inner));
        auto tmp8 = at::vec::load_uint8_as_float(tmp7 + static_cast<long>(16L*i0_inner));
        auto tmp2 = (tmp1);
        auto tmp3 = at::vec::Vectorized<float>(static_cast<float>(1.0));
        auto tmp4 = tmp2 - tmp3;
        auto tmp5 = at::vec::Vectorized<float>(static_cast<float>(0.01));
        auto tmp6 = tmp4 * tmp5;
        auto tmp9 = (tmp8);
        auto tmp10 = at::vec::Vectorized<float>(static_cast<float>(2.0));
        auto tmp11 = tmp9 - tmp10;
        auto tmp12 = at::vec::Vectorized<float>(static_cast<float>(0.02));
        auto tmp13 = tmp11 * tmp12;
        auto tmp14 = tmp6 + tmp13;
        auto tmp15 = at::vec::clamp_min(tmp14, decltype(tmp14)(0));
        auto tmp16 = at::vec::Vectorized<float>(static_cast<float>(33.333333333333336));
        auto tmp17 = tmp15 * tmp16;
        auto tmp18 = tmp17.round();
        auto tmp19 = at::vec::Vectorized<float>(static_cast<float>(3.0));
        auto tmp20 = tmp18 + tmp19;
        auto tmp21 = at::vec::Vectorized<float>(static_cast<float>(0.0));
        auto tmp22 = at::vec::maximum(tmp20, tmp21);
        auto tmp23 = at::vec::Vectorized<float>(static_cast<float>(255.0));
        auto tmp24 = at::vec::minimum(tmp22, tmp23);
        auto tmp25 = (tmp24);
        at::vec::store_float_as_uint8(tmp25, out_ptr0 + static_cast<long>(i1 + (196L*i0) + (196L*i0_inner)));
    }
}
```

**Test Plan**
```
python -m pytest test_cpu_repro.py -k test_tile2d_load_decomposed_dequant_add_relu_quant
python -m pytest test_cpu_repro.py -k test_tile2d_store_channel_shuffle_cl_quant_output
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104074
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-27 00:59:05 +00:00
d2281e38ae Adds the initial support for AOTInductor model and interface (#104202)
This PR combines the C++ code for the AOTInductor's model and interface with Bin Bao's changes to AOTInductor codegen.

It adds a number of AOTInductor C interfaces that can be used by an inference runtime. Under the hood of the interfaces, the model code generated by AOTInductor's codegen is wrapped into a class, AOTInductorModel, which manages tensors and runs model inference.

On top of AOTInductorModel, we provide one more abstraction layer, AOTInductorModelContainer, which allows the user to have multiple inference runs concurrently for the same model.

This PR also adjusts the compilation options for AOT codegen, particularly some fbcode-related changes such as libs to be linked and header-file search paths.

Note that this is the very first version of the AOTInductor model and interface, so many features (e.g. dynamic shapes) are incomplete. We will support those missing features in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104202
Approved by: https://github.com/desertfire
2023-06-27 00:37:26 +00:00
d8a2e7461b Fix incorrect distribution of randperm with device mps (#104171)
Fixes #104170

As noted in the above issue, it seems that the code for randperm basically boils down to:
`torch.argsort(torch.rand(size, device="mps"), dim = 0)`

However, it seems that in the fused(?) PyTorch version, the tensor we were drawing (the `torch.rand(size, device="mps")` part) was int64 with an inclusive(?) upper bound of 1. This caused everything to be sorted into two groups (depending on whether you drew 0 or 1), each monotonically ascending due to sort tie-breaking.

One way to fix this is to just generate the random tensor as float64s with an upper bound of 1.0 instead of int64s. An alternative is to just set the upper bound to int64 max.

~I choose the float64 one basically on a coin flip b/c I couldn't tell the original contributor's intent (due to mixed up upper bounds and type) but would be happy to change to use int64 and max int 64 as an upper bound instead if that's better.~

Edit: on second thought, I don't like using floats from 0.0 to 1.0, as there are fewer of them in that range than there are int64s from 0 to int64 max. I also suspect integer math might be faster, but I need to benchmark this tomorrow.
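A small, hedged sanity probe of the behavior described above (requires a machine with an MPS device); it only checks the symptom, it is not the kernel fix itself:

```python
import torch

if torch.backends.mps.is_available():
    perm = torch.randperm(10_000, device="mps")
    # A correct implementation yields a uniform random permutation; with the bug,
    # the output consisted of two monotonically ascending runs, so the fraction of
    # ascending adjacent pairs was close to 1 instead of ~0.5.
    ascending = (perm[1:] > perm[:-1]).float().mean().item()
    print(f"fraction of ascending adjacent pairs: {ascending:.3f}")
```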
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104171
Approved by: https://github.com/malfet
2023-06-27 00:36:15 +00:00
994b98b78b Add language server support for vscode (#104160)
Makes it so clangd support works with VS Code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104160
Approved by: https://github.com/seemethere
2023-06-27 00:20:53 +00:00
981f24e806 Add docstring to torch.serialization.register_package (#104046)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104046
Approved by: https://github.com/albanD
2023-06-26 23:28:32 +00:00
4a008d268a REDO of dropout support for mem eff #102038 (#103704)
This is a new PR with the changes from #102038 + #103201, plus namespacing changes to fix a bug.

# Summary
This PR builds off of:
- https://github.com/pytorch/pytorch/pull/101847
- https://github.com/pytorch/pytorch/pull/100583

It specifically adds dropout support to the memory efficient attention kernel (a usage sketch follows the list below). In the process of doing so, roughly 3 changes were made:
- Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention
- Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support
- Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755
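A hedged usage sketch of what the dispatch change enables; a CUDA device is assumed, and the backend-selection context manager name reflects the public API of this release rather than anything stated in the PR:

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    q = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
    k = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
    v = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
    # Force the memory-efficient backend; with this change it can be chosen
    # even though dropout_p > 0 and the inputs require grad.
    with torch.backends.cuda.sdp_kernel(
        enable_flash=False, enable_math=False, enable_mem_efficient=True
    ):
        out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1, is_causal=True)
    out.sum().backward()
```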

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103704
Approved by: https://github.com/cpuhrsch
2023-06-26 23:05:03 +00:00
bfa08a1c67 Revert "[core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)"
This reverts commit cf5262a84f815c1e574883bc244333d0d211c7a2.

Reverted https://github.com/pytorch/pytorch/pull/102135 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but test_sparse_semi_structured.py::TestSparseSemiStructuredCUDA::test_mm_sparse_first_NT_cuda_int8 is failing CUDA trunk jobs cf5262a84f. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/102135#issuecomment-1608423849))
2023-06-26 22:54:16 +00:00
cyy
d4a98280a8 [Reland] Use missing-prototypes in torch_cpu (#104138)
This PR enables -Wmissing-prototypes in torch_cpu, except for some generated cpp files, the mps, metal, and vulkan backends, and the caffe2 sources.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104138
Approved by: https://github.com/albanD, https://github.com/malfet
2023-06-26 22:53:43 +00:00
436d035dc7 Revert "DDP + C10D sparse all_reduce changes (#103916)"
This reverts commit fed5fba6e4ee3f221bac481798c5a31f785ba75e.

Reverted https://github.com/pytorch/pytorch/pull/103916 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103916#issuecomment-1608412325))
2023-06-26 22:37:58 +00:00
a69f427f95 aten: Ensure dim is size_t (#104201)
Attempts to fix failures introduced in https://github.com/pytorch/pytorch/pull/103930 (example failures: https://github.com/pytorch/pytorch/actions/runs/5363450214/jobs/9731034104)

### <samp>🤖 Generated by Copilot at 67d5076</samp>

### Summary
🔧🚨🚦

Fix a compiler warning in `Expand.cpp` by casting a tensor dimension to `size_t`. This improves the code quality and correctness of the `expand` function for the Vulkan backend.

> _`expand` tensor_
> _cast `dim()` to `size_t`_
> _autumn leaves warning_

### Walkthrough
*  Cast `self.dim()` to `size_t` to avoid signed-unsigned comparison warning in `expand` function ([link](https://github.com/pytorch/pytorch/pull/104201/files?diff=unified&w=0#diff-c175e908cbcb8595b22696e672b526202ed3a4a11341603c1522397e499b5c2bL29-R29))

<details>
<summary> Fix done using chatgpt </summary>

![Screenshot 2023-06-26 at 11 52 14 AM](https://github.com/pytorch/pytorch/assets/1700823/95c141e5-36b6-4916-85ca-85415bcc507f)

</details>
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104201
Approved by: https://github.com/lucylq, https://github.com/huydhn, https://github.com/malfet
2023-06-26 22:01:27 +00:00
b93ed8164e Add non-recursive module.to_empty option (#104197)
Fixes https://github.com/pytorch/pytorch/issues/97049, related to https://github.com/pytorch/pytorch/issues/104187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104197
Approved by: https://github.com/albanD
2023-06-26 21:47:22 +00:00
cf5262a84f [core][pruning][sparse][feature] SparseSemiStructured tensor subclass (#102135)
This PR adds in support for semi-structured sparsity via a tensor
subclass. It currently uses the CUTLASS kernels merged in PR #100881.

In the future we plan to add in cuSPARSELt support (see the other PRs in
the stack), which will give us larger performance gains.

This PR adds in 2 things:
- a Tensor subclass, `SparseSemiStructuredTensor` to store the
  sparse tensor in compressed form and override `__torch_dispatch__`.
- a conversion function that takes in a dense tensor and a
  semi-structured sparse bool mask and creates an instance of the
  subclass.

**SparseSemiStructuredTensor**

The subclass stores the dense tensor in a contiguous flattened tensor
for future compatibility with cuSPARSELt, which expects this format.
Note that the CUTLASS kernels do not have this limitation, as the
specified values and the metadata are passed separately in
`_structured_sparse_linear`. In the future we can use the cuSPARSELt bindings
[here](https://github.com/pytorch/pytorch/pull/103700) for faster matmul, better dtype coverage, and relaxed shape
constraints.

Since we currently don't have a way to go back from the sparse
representation to the dense representation, and we store the weights in
compressed form, we don't have a great way to handle .t().

Instead, we keep track of how often we've called transpose on our
tensor, and if it's an unexpected number we throw an error. When the first
argument is sparse, we expect an even number of calls to transpose,
while when the second argument is sparse, we expect an odd number of
calls. This is because we support second argument sparse matrix
multiplications by using transpose properties.

**to_sparse_semi_structured**

This is a conversion function to convert a dense tensor and a
semi-structured sparse bool mask into a subclass. Currently, we must
pass in a bool mask, since we can't infer it because there may be
additional zero elements in the dense tensor, so `tensor != 0` is not 2:4
sparse.

Once we add either a method to derive the mask from the dense tensor or
cuSPARSELt, we no longer need to pass in the mask. cuSPARSELt has its
own helper functions to create the metadata mask.

**User Details**

We have implemented support for the following ops for `torch.float16`
and `torch.int8`:
```
torch.addmm(bias, dense, sparse.t())
torch.mm(dense, sparse)
torch.mm(sparse, dense)
aten.linear.default
aten.t.default
aten.t.detach
```

The end user interface to accelerate an nn.Linear module with the
subclass would look like this:

```
from torch.sparse import to_sparse_semi_structured

mask = torch.Tensor([0, 0, 1, 1]).tile(128, 32).cuda().bool()
linear = nn.Linear(128, 128).half().cuda()

linear.weight = nn.Parameter(to_sparse_semi_structured(linear.weight,
                                                       mask=linear.weight.bool()))

```

This also updates tests and the `torch.sparse` module docstring to
reflect these changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102135
Approved by: https://github.com/albanD
2023-06-26 21:30:43 +00:00
f7f415eb2d [inductor] add cpp randint implementation to ir.py (#103079) (#104124)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104124
Approved by: https://github.com/desertfire
2023-06-26 21:26:25 +00:00
fed5fba6e4 DDP + C10D sparse all_reduce changes (#103916)
Summary:
## Changes

Prototyping sparse allreduce using the sparse dispatch key. When passing sparse tensors into `dist.all_reduce()`, we can execute our dispatched function.

Prior to this change, passing a sparse tensor into `all_reduce()` would error out with `Tensor must be dense...`

## Example script

```python
# python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 this_script.py

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    a = torch.tensor([[0, 2.], [3, 0]]).to(rank)
    a = a.to_sparse()
    print(f"rank {rank} - a: {a}")
    dist.all_reduce(a)

if __name__ == "__main__":
    main()
```

output:
```
rank 1 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:1', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
rank 0 - a: tensor(indices=tensor([[0, 1],
                       [1, 0]]),
       values=tensor([2., 3.]),
       device='cuda:0', size=(2, 2), nnz=2, layout=torch.sparse_coo)
allreduce_sparse_cuda_
tensor.is_sparse() = 1
in ProcessGroupNCCL::allreduceSparse
```

Test Plan:
Testing commands (OSS):

```
# python
pytest test/distributed/test_c10d_nccl.py -vsk test_sparse_allreduce_ops

# c++
build/bin/ProcessGroupNCCLTest --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Testing commands (internal, ondemand GPU):
ddp tests:
```
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output

# Get the .par file from the previous command and use it below
TORCH_SHOW_CPP_STACKTRACE=1 /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_ddp_set_sparse_metadata
```

c10d tests:
```
# build tests and run with log output (python)
buck build mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d --show-full-output
NCCL_DEBUG=WARN /data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/c8344b52091f4f7f/caffe2/test/distributed/__c10d__/c10d.par -r test_sparse_allreduce_ops

# python
NCCL_DEBUG=WARN buck test mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/distributed:c10d -- --exact 'caffe2/test/distributed:c10d - test_sparse_allreduce_ops (test_c10d_nccl.ProcessGroupNCCLTest)'

# c++
NCCL_DEBUG=WARN buck run mode/opt -c hpc_comms.use_nccl=exp //caffe2/test/cpp/c10d:ProcessGroupNCCLTest -- --gtest_filter=ProcessGroupNCCLTest.testSparseAllreduce
```

Differential Revision: D46724856

Pulled By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103916
Approved by: https://github.com/rohan-varma
2023-06-26 20:42:17 +00:00
8a08733218 update test_higher_order_op: grad test (#104179)
With https://github.com/pytorch/pytorch/pull/103597, `config.dynamic_shapes` is always `True` and we never check the generated graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104179
Approved by: https://github.com/zou3519
2023-06-26 19:32:59 +00:00
adf9595c2f Update CODEOWNERS (#103934)
Remove users that no longer have write access to the repo, resolving CODEOWNERS errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103934
Approved by: https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/malfet
2023-06-26 19:29:29 +00:00
fb8aa721e2 [Pytorch Edge][BE] Delete Sparse Qnnpack test failing since 2022 jul (#104073)
Summary:
According to https://www.internalfb.com/omh/view/ai_infra_mobile_platform/tests these have been failing since jul 2022.

Just going to delete unless someone thinks they actually do matter and should be made green

https://www.internalfb.com/intern/test/562949996115570/ <- failing test

I ran locally and got errors like

  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -872.50446
  acc[mIndex * n() + nIndex]
    Which is: -872.50488
  at 0, 0: reference = -872.5048828125, optimized = -872.50445556640625, Mr x Nr = 8 x 4, M x N x K = 7 x 1 x 13
  xplat/caffe2/aten/src/ATen/native/quantized/cpu/qnnpack/test/gemm-block-sparse-microkernel-tester.h:483: Failure
  Expected equality of these values:
  c[mIndex * cStride() + nIndex]
    Which is: -67.246628
  acc[mIndex * n() + nIndex]
    Which is: -67.24707
  at 3, 0: reference = -67.2470703125, optimized = -67.246627807617188, Mr x Nr = 8 x 4, M x N x K = 4 x 1 x 15
  [  FAILED  ] Q8GEMM_8x4c1x4__SSE2.packedA_k_gt_8_subtile (148 ms)

Test Plan: ci

Reviewed By: kimishpatel

Differential Revision: D46950966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104073
Approved by: https://github.com/kimishpatel
2023-06-26 18:27:20 +00:00
100aff9d4f [export] Deserialize subgraphs. (#103991)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103991
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-06-26 18:17:44 +00:00
dd4f4bb47d [exir] Initial serialization (#103763)
Summary:
ETRecord can't use this yet because the other programs need to be migrated to using ExportedProgram (D46729844)

Note: higher order ops like call_delegate/cond are also not supported yet

Test Plan: `buck2 run @//mode/dev-nosan //executorch/exir/tests:serde`

Differential Revision: D46802454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103763
Approved by: https://github.com/tarun292
2023-06-26 18:05:27 +00:00
618cc82e77 Stop Dynamo from peeking into wrap's body (#104076)
When Dynamo sees `wrap(f, x)` and decides that `f` is unsafe, Dynamo
should fall back to eager mode and stop introspection all the way
through the call of `f`. The motivation is:
- it's easier to test `wrap` this way (it is clearer how many graph
breaks should occur)
- Other HigherOrderOperators do this because their execution of the
body involves code that is not necessarily Dynamo-able. e.g. functorch
transforms. Since `wrap` is a test for the HigherOrderOp mechanism, it
should reflect what other HigherOrderOps do.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104076
Approved by: https://github.com/ydwu4
2023-06-26 17:16:51 +00:00
5364366f8c Sparse Compressed mm avoid creating temp sparse (#104062)
When mm forwards to addmm, it creates a zeroed-out `self`; this tensor
should take its options from the result, not from one of the sparse arguments.

The bug led to an error when calling linear with an `out` kwarg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104062
Approved by: https://github.com/nikitaved, https://github.com/pearu
2023-06-26 16:45:04 +00:00
bd8841101b [ET][XNNPACK] Add support for quantized Sub (#104090)
Summary:
Also adds support for backend_config with relu fusion since XNNPACK allows it.

We should revisit the relu fusion once we gain more clarity on quantSrcPartition or some other way to do these fusions without having to add all combinations.

TODO: We should really rename the backend config to et_xnnpack.py or something similar.

Test Plan: `buck test fbcode//mode/dev-nosan fbcode//executorch/backends/xnnpack/test:`

Differential Revision: D46924209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104090
Approved by: https://github.com/mcr229
2023-06-26 16:32:15 +00:00
edc9c0df7e Fold Conv-Bn (#100653)
Adds Conv-BN folding to inductor freezing. One thing that's a little awkward now is that we'll want different decompositions to run depending on whether we are in the inference compiler. For now, I require that you run with torch.no_grad() so we can detect that no gradients are required before calling aot_autograd.
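A hedged usage sketch of the constraint described above; the freezing flag name is an assumption about the inductor config at this point in time:

```python
import torch
import torch._inductor.config as inductor_config  # assumed import path

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.BatchNorm2d(8),
).eval()

inductor_config.freezing = True   # assumed config knob enabling freezing
compiled = torch.compile(model)

with torch.no_grad():             # required so inference can be detected before aot_autograd
    out = compiled(torch.randn(1, 3, 32, 32))
```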

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100653
Approved by: https://github.com/jansel
2023-06-26 16:04:34 +00:00
c1fffdcd5b Change how AOTInductor's fx input is produced (#104123)
Test Plan: CI

Reviewed By: wushirong

Differential Revision: D46983754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104123
Approved by: https://github.com/chenyang78
2023-06-26 15:59:33 +00:00
b2277075b0 Fixed benchmark_utils.Fuzzer (#101553)
Use np.random.randint with int64 since int32 is the default on Windows.
Change the default seed to be in [0, 2**32-1], because that is the input range required by np.random.RandomState.
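A small hedged sketch of the two fixes as described; the exact call sites inside benchmark_utils are assumptions:

```python
import numpy as np

# int64 avoids the int32 default on Windows, which cannot represent the full bound
seed = int(np.random.randint(0, 2**32 - 1, dtype=np.int64))
# np.random.RandomState requires a seed in [0, 2**32 - 1]
state = np.random.RandomState(seed)
print(state.randint(0, 10))
```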

Fixes #51205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101553
Approved by: https://github.com/kit1980
2023-06-26 08:03:27 +00:00
3c34a00d1b Preserve all submodules/parameters/buffers when unpickle graph module (#104115)
Summary:
When we pickle/unpickle a graph module in multipy, we would lose modules/attributes that are not referenced in the graph. This is because when unpickling an fx graph module, we use the stored `__dict__` and the fx graph to create a new graph module. In GraphModule init, we drop any attribute that is not referenced in the graph.

This behavior is not ideal because we actually expect a graph module that's exactly the same after unpickling.
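A hedged repro sketch of the behavior being fixed; the module and attribute names are made up for illustration:

```python
import pickle
import torch
import torch.fx as fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x)

gm = fx.symbolic_trace(M())
# Attach a submodule that the traced graph never references.
gm.side_module = torch.nn.Linear(4, 4)

gm2 = pickle.loads(pickle.dumps(gm))
# Previously the unreferenced submodule could be dropped on unpickle;
# with this change it is expected to survive the round trip.
print(hasattr(gm2, "side_module"))
```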

Test Plan:
```
buck test mode/opt caffe2/test:fx -- test_preserve_unused_attr_after_unpickle

Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D46976230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104115
Approved by: https://github.com/houseroad
2023-06-26 06:59:48 +00:00
58feefa4ed add custom device support for special nn.modules (#103419)
Fixes #103818
1. For some special nn.Modules there are checks which only support cuda, so I add a `privateuse1` check.
2. When getting the device type for `privateuse1` via `torch._C._get_privateuse1_backend_name()`, it errors inside `torch.jit.script`, so I add a global variable to avoid this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103419
Approved by: https://github.com/albanD
2023-06-26 00:58:29 +00:00
7cef7195f6 [draft] Update Multiprocessing best practices with CPU device (#103229)
Fixes [#102498](https://github.com/pytorch/pytorch/issues/102498)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103229
Approved by: https://github.com/mingfeima, https://github.com/svekars, https://github.com/jgong5
2023-06-25 06:26:40 +00:00
86e0eda18d Add partial derivative unit tests (#103809)
Adds the unit tests requested in #95810

This PR also addresses a gap in unit testing of gradients, as `gradcheck` always performs total derivatives w.r.t. all arguments and module parameters. Some modules have different code paths for partial derivatives, e.g. `LayerNorm`, and those should be tested separately.
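A hedged sketch of what testing a partial derivative looks like: only one argument is marked as requiring grad, so gradcheck differentiates w.r.t. that argument alone (module choice and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
weight = torch.randn(3, dtype=torch.double)   # held fixed
bias = torch.randn(3, dtype=torch.double)     # held fixed

# Partial derivative w.r.t. the input only; weight and bias do not require grad.
torch.autograd.gradcheck(lambda inp: F.layer_norm(inp, (3,), weight, bias), (x,))
```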

The PR has the following limitations:
- it does not test partial derivatives w.r.t. every combination of arguments, which would exponentially increase CI time.
- it does not implement the same logic for Hessians, where the increase in CI time would be quadratic in the number of arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103809
Approved by: https://github.com/kit1980
2023-06-25 00:36:10 +00:00
a9efbef716 Add support for unique overload of foreach_pow (#104137)
This overload has a scalar in the first argument position, unlike every other overload.
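A hedged illustration of the overload in question; how the scalar-first call resolves is an assumption based on the description above:

```python
import torch

tensors = [torch.rand(3), torch.rand(3)]
out_list_first = torch._foreach_pow(tensors, 2.0)    # the usual layout: tensor list first
out_scalar_first = torch._foreach_pow(2.0, tensors)  # the unique overload: scalar base, tensor exponents
```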

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104137
Approved by: https://github.com/jansel
2023-06-24 21:07:33 +00:00
e4d8504ebc Unify GELU tanh approximation in _addmm_activation GPU back-end (#104061)
Summary:

Currently, the cuBLASLt-based fused GELU epilogue in the GPU back-end of the `_addmm_activation` operator uses the tanh approximation, whereas other code paths on GPU don't.

With this PR, the GELU tanh approximation is switched on in all back-end code paths of `_addmm_activation` on GPU for better consistency.
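A hedged comparison sketch; a CUDA device is assumed, and since `torch._addmm_activation` is a private op its keyword signature here is an assumption based on the tests below:

```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    m, k, n = 64, 128, 32
    x = torch.randn(m, k, device="cuda")
    w = torch.randn(k, n, device="cuda")
    b = torch.randn(n, device="cuda")

    fused = torch._addmm_activation(b, x, w, use_gelu=True)
    reference = F.gelu(torch.addmm(b, x, w), approximate="tanh")
    print(torch.allclose(fused, reference, atol=1e-4, rtol=1e-4))
```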

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 1.896s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.050s

OK
```

Reviewers: @eellison

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104061
Approved by: https://github.com/eellison
2023-06-24 18:36:45 +00:00
925f0a01c7 Do not pass stepcurrent option unless in CI (#104135)
Should allow one to run the same tests multiple times on a local machine.

### <samp>🤖 Generated by Copilot at 740a92d</samp>

> _`pytest_args` change_
> _Only add `--sc` on CI_
> _Avoid conflicts - fall_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104135
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-06-24 09:34:14 +00:00
454f4e98a2 [Pytorch] aten::expand (#103930)
Summary:
Expand using `aten::repeat` for all dims

[expand](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html#torch.Tensor.expand)
[expand_as](
https://pytorch.org/docs/stable/generated/torch.Tensor.expand_as.html)

Test Plan:
clang-format on `Expand.cpp`
expand tests:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*.expand*"
Action graph will be rebuilt because files have been added or removed.

Parsing buck files: finished in 1.1 sec
Downloaded 5/50 artifacts, 661.18 Kbytes, 37.5% cache miss (for updated rules)
Building: finished in 15.4 sec (100%) 515/515 jobs, 15/515 updated
  Total time: 16.9 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *.expand*
[==========] Running 6 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 6 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.expand_exceptions
[       OK ] VulkanAPITest.expand_exceptions (66 ms)
[ RUN      ] VulkanAPITest.expand_1d
[       OK ] VulkanAPITest.expand_1d (7 ms)
[ RUN      ] VulkanAPITest.expand_2d
[       OK ] VulkanAPITest.expand_2d (2 ms)
[ RUN      ] VulkanAPITest.expand_3d
[       OK ] VulkanAPITest.expand_3d (2 ms)
[ RUN      ] VulkanAPITest.expand_4d
[       OK ] VulkanAPITest.expand_4d (4 ms)
[ RUN      ] VulkanAPITest.expand_as
[       OK ] VulkanAPITest.expand_as (11 ms)
[----------] 6 tests from VulkanAPITest (95 ms total)

[----------] Global test environment tear-down
[==========] 6 tests from 1 test suite ran. (95 ms total)
[  PASSED  ] 6 tests.
lfq@lfq-mbp fbsource %
```

Differential Revision: D46302042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103930
Approved by: https://github.com/SS-JIA
2023-06-24 03:57:53 +00:00
466efccc8a [Pytorch] aten::zeros (#103703)
Summary: Implement [aten::zeros](https://pytorch.org/docs/stable/generated/torch.zeros.html?highlight=zeros#torch.zeros)

Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*zeros*"
Action graph will be rebuilt because files have been added or removed.
Parsing buck files: finished in 2.3 sec
Downloaded 0/4 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 6.0 sec (100%) 454/454 jobs, 3/454 updated
  Total time: 8.4 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *zeros*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.zeros
[       OK ] VulkanAPITest.zeros (99 ms)
[----------] 1 test from VulkanAPITest (99 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (99 ms total)
[  PASSED  ] 1 test.
```

Differential Revision: D46777782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103703
Approved by: https://github.com/SS-JIA
2023-06-24 03:47:45 +00:00
6f78390607 [vision hash update] update the pinned vision hash (#104133)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104133
Approved by: https://github.com/pytorchbot
2023-06-24 03:42:08 +00:00
63f66d19ea [Tests] Make run_test.py usable without boto3 (#104111)
There is a `HAVE_TEST_SELECTION_TOOLS` conditional, but it turns out it does not really work, so fix it by defining all missing prototypes and making it work as a single-shard instance.

Add a lint rule to test that it would succeed when running only test_cuda with a released version of PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104111
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-06-24 03:10:49 +00:00
cyy
483f748dd5 [BE] Enforce missing override keyword (#104032)
This PR enables `-Winconsistent-missing-destructor-override` and `-Winconsistent-missing-override`
and fixes violations.

### <samp>🤖 Generated by Copilot at 47e904e</samp>

This pull request updates the code of various classes and operators in the `caffe2` and `aten` subdirectories to use the `override` specifier instead of the `virtual` keyword for destructors and other virtual functions that override a base class function. This improves the code readability, quality, and consistency with C++ best practices. It also modifies the `./CMakeLists.txt` file to enable warnings for these specifiers, but disable errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104032
Approved by: https://github.com/malfet
2023-06-24 02:34:24 +00:00
202a9108f7 Disable core dump when rerunning disabled tests (#104131)
Fixes https://github.com/pytorch/pytorch/issues/103612

Figuring out a way to dynamically stop generating core dumps on a Linux runner is harder than I thought. The recommended solution is to set a custom script in `/proc/sys/kernel/core_pattern`, as documented in https://man7.org/linux/man-pages/man5/core.5.html, so that we could dynamically stop generating more core files when the disk space drops below a certain threshold. However, AFAICT this is not yet supported inside a Docker container (https://stackoverflow.com/questions/59986788).

In addition, when the runner runs out of space, all the subsequent steps to clean it up won't be done. The next job running will also fail because nothing can be set up, e.g. https://github.com/pytorch/pytorch/actions/runs/5357044327/jobs/9717914230

So this is only a limited fix that avoids generating core dumps while re-running disabled tests, because a crashed test is run multiple times there and would otherwise generate multiple core files.

### Testing

```
ulimit -c 0
kill -3 PID
```

Check that no core file is generated after.
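For illustration, a minimal Python sketch (not from this PR) of disabling core dumps programmatically from a test harness, equivalent to `ulimit -c 0`:

```
import resource

# Set the soft and hard core-file size limits to 0 so that a crashing
# test process does not write a core dump.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
```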
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104131
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-06-24 02:29:53 +00:00
75dab587ef [dynamo] FSDP + AC + torch.compile (#103953)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103953
Approved by: https://github.com/wanchaol
2023-06-24 01:40:56 +00:00
b3ace213f2 Heap buffer overflow at source_range_serialization.cpp:73 (#103969)
Hi! We've been fuzzing the torchvision project with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz).
We've found a heap buffer overflow error at `source_range_serialization.cpp:73` in the pytorch project.

The error occurs because there is no check in `deserialize_source` that `fnameIndex` is within the bounds of `text_table_`. To prevent the error, the corresponding bounds check must be added.

torchvision version: 9d0a93eee90bf7c401b74ebf9c8be80346254f15
pytorch version: 0f1621df1a0a73956c7ce4e2f72f069e610e0137

OS: Ubuntu 20.04

How to reproduce

1. Build docker from [here](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/torchvision) and run the container:

        sudo docker build -t oss-sydr-fuzz-torchvision .
        sudo docker run --privileged --rm -v `pwd`:/fuzz -it oss-sydr-fuzz-torchvision /bin/bash

2. Run the target on this input:  [serialization-crash.txt](https://github.com/pytorch/pytorch/files/11819901/serialization-crash.txt)

        /encode_png_fuzz serialization-crash.txt

3. You will see the following output:

        =================================================================
        ==13==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60200055a630 at pc 0x0000010197b7 bp 0x7ffd4cfb15f0 sp 0x7ffd4cfb15e8
        READ of size 8 at 0x60200055a630 thread T0
            #0 0x10197b6 in std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2>::get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1325:16
            #1 0x10197b6 in std::__shared_ptr_access<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1024:66
            #2 0x10197b6 in std::__shared_ptr_access<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2, false, false>::operator*() const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1011:10
            #3 0xde888c2 in torch::jit::SourceRangeDeserializer::deserialize_source(c10::IValue const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:73:16
            #4 0xde8802b in torch::jit::SourceRangeDeserializer::deserialize(c10::IValue const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:51:37
            #5 0xde8e9c7 in torch::jit::ConcreteSourceRangeUnpickler::unpickle() /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:224:39
            #6 0xde8fb19 in torch::jit::ConcreteSourceRangeUnpickler::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:231:3
            #7 0x10798e7 in torch::jit::Source::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/frontend/source_range.cpp:144:23
            #8 0x1079d9a in torch::jit::SourceRange::findSourceRangeThatGenerated() const /pytorch/torch/csrc/jit/frontend/source_range.h:384:26
            #9 0x1079acd in torch::jit::SourceRange::highlight(std::ostream&) const /pytorch/torch/csrc/jit/frontend/source_range.cpp:149:32
            #10 0x1026fe2 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Token const&) /pytorch/torch/csrc/jit/frontend/lexer.h:461:13
            #11 0x10417d9 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/lexer.h:465:5
            #12 0x102e52c in torch::jit::Lexer::expect(int) /pytorch/torch/csrc/jit/frontend/lexer.h:471:7
            #13 0xcee774c in torch::jit::ParserImpl::parseIdent() /pytorch/torch/csrc/jit/frontend/parser.cpp:52:16
            #14 0xcef4ea8 in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:195:22
            #15 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
            #16 0xcefac6a in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
            #17 0xcefac6a in torch::jit::ParserImpl::parseSubscriptExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:403:15
            #18 0xceff39f in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()::operator()() const /pytorch/torch/csrc/jit/frontend/parser.cpp:354:54
            #19 0xceff39f in torch::jit::Expr std::__invoke_impl<void, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&>(std::__invoke_other, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
            #20 0xceea935 in torch::jit::ParserImpl::parseSequence(int, int, int, std::function<void ()> const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:339:7
            #21 0xceefd69 in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)()) /pytorch/torch/csrc/jit/frontend/parser.cpp:353:5
            #22 0xcef895a in torch::jit::ParserImpl::parseSubscript(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:430:9
            #23 0xcef5e5c in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:206:18
            #24 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
            #25 0xceeeb9d in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
            #26 0xceeeb9d in torch::jit::ParserImpl::parseExpOrExpTuple() /pytorch/torch/csrc/jit/frontend/parser.cpp:94:19
            #27 0xcee8a36 in torch::jit::ParserImpl::parseStmt(bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:612:20
            #28 0xcee7e72 in torch::jit::ParserImpl::parseStatements(bool, bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:697:23
            #29 0xcee56f5 in torch::jit::ParserImpl::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:747:9
            #30 0xcee544a in torch::jit::Parser::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:812:17
            #31 0xdddbea9 in torch::jit::SourceImporterImpl::parseSourceIfNeeded(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:182:42
            #32 0xdddadbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:135:3
            #33 0xdde1d88 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
            #34 0xcf2ba5f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24
            #35 0xcf2bec7 in torch::jit::ScriptTypeParser::parseType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312:10
            #36 0xddf4284 in torch::jit::SourceImporter::loadType(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import_source.cpp:786:27
            #37 0xdd739f7 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const /pytorch/torch/csrc/jit/serialization/import.cpp:146:33
            #38 0xdd739f7 in c10::StrongTypePtr std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
            #39 0xdd73880 in std::enable_if<is_invocable_r_v<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>, c10::StrongTypePtr>::type std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:113:9
            #40 0xdd736d6 in std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:291:9
            #41 0xdd76349 in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/std_function.h:622:14
            #42 0xdeb9f48 in torch::jit::Unpickler::readGlobal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/unpickler.cpp:835:9
            #43 0xdeb012d in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:511:7
            #44 0xdeae437 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
            #45 0xdeae0d2 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
            #46 0xddd6de3 in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
            #47 0xdd732dd in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
            #48 0xdd69885 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
            #49 0xdd6c855 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:438:25
            #50 0xdd6c1c7 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:421:10
            #51 0xdd6dce4 in torch::jit::load(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:503:10
            #52 0xf2d3f75 in torch::serialize::InputArchive::load_from(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<c10::Device>) /pytorch/torch/csrc/api/src/serialize/input-archive.cpp:97:13
            #53 0x60509c in void torch::load<at::Tensor, char*&>(at::Tensor&, char*&) /pytorch/torch/include/torch/csrc/api/include/torch/serialize.h:107:11
            #54 0x6036be in LLVMFuzzerTestOneInput /vision/encode_png.cc:38:5
            #55 0x66b041 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
            #56 0x6544cc in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
            #57 0x65a61b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
            #58 0x654222 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
            #59 0x7f3d12cc7082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
            #60 0x542cdd in _start (/encode_png_fuzz+0x542cdd)

        0x60200055a630 is located 16 bytes to the right of 16-byte region [0x60200055a610,0x60200055a620)
        allocated by thread T0 here:
            #0 0x60057d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
            #1 0xde9185d in std::_Vector_base<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
            #2 0xde9185d in void std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::_M_realloc_insert<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(__gnu_cxx::__normal_iterator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >*, std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > >, std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33
            #3 0xde916a1 in std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >& std::vector<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >::emplace_back<std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >(std::shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4
            #4 0xde8f445 in torch::jit::SourceRangeDeserializer::SourceRangeDeserializer(c10::IValue) /pytorch/torch/csrc/jit/serialization/source_range_serialization.h:42:19
            #5 0xde8e141 in torch::jit::ConcreteSourceRangeUnpickler::unpickle() /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:215:28
            #6 0xde8fb19 in torch::jit::ConcreteSourceRangeUnpickler::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:231:3
            #7 0x10798e7 in torch::jit::Source::findSourceRangeThatGenerated(torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/frontend/source_range.cpp:144:23
            #8 0x1079d9a in torch::jit::SourceRange::findSourceRangeThatGenerated() const /pytorch/torch/csrc/jit/frontend/source_range.h:384:26
            #9 0x1079acd in torch::jit::SourceRange::highlight(std::ostream&) const /pytorch/torch/csrc/jit/frontend/source_range.cpp:149:32
            #10 0x1026fe2 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::Token const&) /pytorch/torch/csrc/jit/frontend/lexer.h:461:13
            #11 0x10417d9 in torch::jit::Lexer::expected(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/frontend/lexer.h:465:5
            #12 0xcee774c in torch::jit::ParserImpl::parseIdent() /pytorch/torch/csrc/jit/frontend/parser.cpp:52:16
            #13 0xcef4ea8 in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:195:22
            #14 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
            #15 0xcefac6a in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
            #16 0xcefac6a in torch::jit::ParserImpl::parseSubscriptExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:403:15
            #17 0xceff39f in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()::operator()() const /pytorch/torch/csrc/jit/frontend/parser.cpp:354:54
            #18 0xceff39f in torch::jit::Expr std::__invoke_impl<void, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&>(std::__invoke_other, torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)())::'lambda'()&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/invoke.h:60:14
            #19 0xceea935 in torch::jit::ParserImpl::parseSequence(int, int, int, std::function<void ()> const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:339:7
            #20 0xceefd69 in torch::jit::List<torch::jit::Expr> torch::jit::ParserImpl::parseList<torch::jit::Expr>(int, int, int, torch::jit::Expr (torch::jit::ParserImpl::*)()) /pytorch/torch/csrc/jit/frontend/parser.cpp:353:5
            #21 0xcef895a in torch::jit::ParserImpl::parseSubscript(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch/torch/csrc/jit/frontend/parser.cpp:430:9
            #22 0xcef5e5c in torch::jit::ParserImpl::parseBaseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:206:18
            #23 0xcef2c1b in torch::jit::ParserImpl::parseExp(int) /pytorch/torch/csrc/jit/frontend/parser.cpp:284:16
            #24 0xceeeb9d in torch::jit::ParserImpl::parseExp() /pytorch/torch/csrc/jit/frontend/parser.cpp:262:12
            #25 0xceeeb9d in torch::jit::ParserImpl::parseExpOrExpTuple() /pytorch/torch/csrc/jit/frontend/parser.cpp:94:19
            #26 0xcee8a36 in torch::jit::ParserImpl::parseStmt(bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:612:20
            #27 0xcee7e72 in torch::jit::ParserImpl::parseStatements(bool, bool) /pytorch/torch/csrc/jit/frontend/parser.cpp:697:23
            #28 0xcee56f5 in torch::jit::ParserImpl::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:747:9
            #29 0xcee544a in torch::jit::Parser::parseClass() /pytorch/torch/csrc/jit/frontend/parser.cpp:812:17
            #30 0xdddbea9 in torch::jit::SourceImporterImpl::parseSourceIfNeeded(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:182:42
            #31 0xdddadbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:135:3
            #32 0xdde1d88 in torch::jit::SourceImporterImpl::resolveType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::SourceRange const&) /pytorch/torch/csrc/jit/serialization/import_source.cpp:261:10
            #33 0xcf2ba5f in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238:24

        SUMMARY: AddressSanitizer: heap-buffer-overflow /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/shared_ptr_base.h:1325:16 in std::__shared_ptr<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, (__gnu_cxx::_Lock_policy)2>::get() const
        Shadow bytes around the buggy address:
          0x0c04800a3470: fa fa 00 00 fa fa 00 00 fa fa fd fa fa fa 00 00
          0x0c04800a3480: fa fa fd fa fa fa fd fd fa fa fd fd fa fa fd fa
          0x0c04800a3490: fa fa fd fd fa fa 00 00 fa fa 00 00 fa fa 00 00
          0x0c04800a34a0: fa fa fd fa fa fa fd fd fa fa fd fa fa fa 00 fa
          0x0c04800a34b0: fa fa fd fd fa fa fd fd fa fa fd fa fa fa fd fd
        =>0x0c04800a34c0: fa fa 00 00 fa fa[fa]fa fa fa fa fa fa fa fa fa
          0x0c04800a34d0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
          0x0c04800a34e0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
          0x0c04800a34f0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
          0x0c04800a3500: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
          0x0c04800a3510: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
        Shadow byte legend (one shadow byte represents 8 application bytes):
          Addressable:           00
          Partially addressable: 01 02 03 04 05 06 07
          Heap left redzone:       fa
          Freed heap region:       fd
          Stack left redzone:      f1
          Stack mid redzone:       f2
          Stack right redzone:     f3
          Stack after return:      f5
          Stack use after scope:   f8
          Global redzone:          f9
          Global init order:       f6
          Poisoned by user:        f7
          Container overflow:      fc
          Array cookie:            ac
          Intra object redzone:    bb
          ASan internal:           fe
          Left alloca redzone:     ca
          Right alloca redzone:    cb
        ==13==ABORTING
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103969
Approved by: https://github.com/davidberard98
2023-06-24 00:49:14 +00:00
344bab2669 [RFC]: Functionalize assertions (#103757)
The idea here is to do a graph mutation to:
* Create an initial dependency token at the beginning of the program.
* Replace the non-functional version of assertion statements with the functional version.
* The functional version of an assertion statement will:
  * Accept a dependency token from the output of the previous functional assertion statement (or the initial dependency token if there isn't any).
  * Generate a dependency token as the output of the assertion statement.
  * Augment the program output to include the dependency token generated by the last assertion statement.

The goal here is to:
* Form an explicit dependency chain and avoid potential reordering during other passes of compiling.
* Make the assertions part of the overall execution graph so that they affect the final output (otherwise they could potentially be DCE'd).
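For illustration, a hand-written Python analogue (not the actual ops; names are made up) of the dependency-token idea in the lists above:

```
def functional_assert(cond, msg, dep_token):
    # Functional version: consumes the previous dependency token and produces a new one.
    assert cond, msg
    return dep_token + 1

def program(x):
    dep = 0  # initial dependency token created at the beginning of the program
    dep = functional_assert(x >= 0, "x must be non-negative", dep)
    y = x * 2
    dep = functional_assert(y < 100, "y is out of range", dep)
    # The token is part of the output, so the asserts form an explicit dependency
    # chain and cannot be reordered or dead-code-eliminated.
    return y, dep
```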

**NOTE:**
* Currently this only covers `constrain_range`; support for other assertions is WIP. Sending out this PR to collect feedback first.
* This focuses on the implementation itself. It will be integrated with the current export flow in a future PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103757
Approved by: https://github.com/avikchaudhuri
2023-06-24 00:23:35 +00:00
98d513cabf [BE][Test] Remove --pytest option from run_test.py (#104125)
Because we always run tests with pytest now.

Marking it as `bc-breaking` as there could technically be some scripts depending on it somewhere...

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1760568</samp>

> _`pytest` option gone_
> _simpler test runner script_
> _autumn leaves fall fast_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104125
Approved by: https://github.com/seemethere
2023-06-24 00:20:20 +00:00
9f11ad6f86 Extend torch->onnx export for quantized convolutional ops (#102759)
- Extend support:
  - quantized::conv1d
  - quantized::conv3d
  - quantized::conv3d_relu
  - quantized::conv_transpose1d
  - quantized::conv_transpose2d
  - quantized::conv_transpose3d
  - Note: quantized::{conv1d_relu,conv2d,conv2d_relu} already supported.
- To support this, quantization unpacking added for:
  - conv1d
  - conv_transpose1d
  - conv_transpose2d
  - conv_transpose3d
  - Note: conv3d/conv3d_relu already had weights unpacking set up, even though it didn't have torch.onnx support.
- Add tests.
- The 3D tests will fail if run with the qnnpack backend (e.g., on Apple silicon Mac), so added decorator skipIfQuantizationBackendQNNPack.
- Minor fix in `aten/src/ATen/native/quantized/cpu/qconv.cpp` for 3D convolutions (triggered by added tests).

Fixes #102747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102759
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi, https://github.com/kit1980
2023-06-23 22:50:17 +00:00
75108b2096 Normal and Uniform return earlier without entering kernel<RNG> (#103507)
Fixes [#103418](https://github.com/pytorch/pytorch/issues/103418)

By the way, I'm wondering whether we should update the other distribution functions in `aten\src\ATen\native\DistributionTemplates.h`?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103507
Approved by: https://github.com/jgong5, https://github.com/mikaylagawarecki
2023-06-23 21:54:13 +00:00
bd5b1788cd Support printing inequality in ExprPrinter (#104104)
Fixes https://github.com/pytorch/pytorch/issues/103587

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104104
Approved by: https://github.com/jansel
2023-06-23 21:50:17 +00:00
3e674b75b1 Allow fusion of epilogue copies with upstream foreach ops (#104018)
Allow fusion of epilogue copies with foreach kernel scheduler nodes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104018
Approved by: https://github.com/jansel
2023-06-23 21:39:59 +00:00
47ff90fde5 [pt2][inductor] update local caching and create get_system method (#104050)
Summary: split system information construction into a separate static method, and update local caching (/temp_dir/cache is now a dir, not a file; this is relevant for upcoming changes, i.e. adding `allow_tf32`, since it would now be possible to have multiple valid local caches).

Test Plan: sandcastle + CI

Differential Revision: D46568207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104050
Approved by: https://github.com/jansel
2023-06-23 21:06:22 +00:00
eqy
ef3d1cfa16 [cuDNN][cuDNN V8 API] Thread safety fixes for cuDNN V8 API (#103939)
(these two fixes are now outdated, see EDIT below)
Fixes for two thread safety issues (one currently unobserved, and one currently observed).
1. `std::erase` can potentially invalidate a pointer to an `ExecutionPlan` in the current implementation. While failures due to this issue have not yet been reported to my knowledge, it is better to return a copy of an `ExecutionPlan` for safety.
2. #103793 surfaced that `cudnnBackendExecute` currently appears to be thread-unsafe. I've verified this with a PyTorch-free (pure C++) repro using the cuDNN frontend. This PR adds a mutex that we can hopefully remove once this issue is resolved.

EDIT:
Feedback from cuDNN is that the V8 backend API has known thread-safety limitations when `ExecutionPlan`s are shared (or even shallow copied) across threads. Given that the common PyTorch use case of eager mode is single-threaded (per GPU), this PR now opts to make the `ExecutionPlan` caches `thread_local`, as this simplifies the code and eliminates the need for a mutex. The potential tradeoff is some additional warmup cost in the multithreaded case, but this would only be worse than the current behavior if multiple threads had largely overlapping workloads.

CC @tuero @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103939
Approved by: https://github.com/xw285cornell, https://github.com/colesbury
2023-06-23 20:51:47 +00:00
933166e5c0 Fix null pointer dereference on ROCm (#95810)
The root cause of the crash in training nanoGPT was a null pointer dereference in the layer norm kernel.

While addressing the issue, I also made sure that `__syncthreads()` is simultaneously called by all threads in the block, to avoid unwanted side effects.

Moreover, I changed the kernel launch code to be clearer about the accumulation data type (`T_ACC`) and thread block dimensions, without changing functionality.

Fixes #95808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95810
Approved by: https://github.com/ngimel
2023-06-23 20:02:10 +00:00
a45132e049 Remove CUDA 11.7 Docker image build (#104116)
This option has been removed by https://github.com/pytorch/builder/pull/1408.  This is currently failing in trunk https://github.com/pytorch/pytorch/actions/runs/5358541073/jobs/9720970056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104116
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-06-23 20:01:57 +00:00
6ff4548b6e [AMP] Support XLA:TPU (#96370)
With https://github.com/pytorch/xla/pull/5148, https://github.com/pytorch/xla/pull/4740

With these changes
XLA:GPU users should use `torch.cuda.amp.autocast()` for AMP with float16
XLA:TPU users should use `torch.amp.autocast('xla')` for AMP with bfloat16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96370
Approved by: https://github.com/bdhirsh, https://github.com/malfet
2023-06-23 19:46:42 +00:00
c17bdb3247 [C10D] Add functional collective reduce_scatter_into_tensor_coalesced. (#101023)
Implementation uses a fallback that does no coalescing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101023
Approved by: https://github.com/wanchaol
2023-06-23 19:24:11 +00:00
93e63fc0f6 [Core] Drop GIL in THPVariable_item around aten op (#104103)
Holding the GIL here can cause a deadlock by starving other Python threads that are required for the kernel to make progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104103
Approved by: https://github.com/albanD
2023-06-23 19:13:49 +00:00
b5d1b42f99 [bfloat16] adaptive_{max, avg}_pool3d (#89754)
Add bfloat16 support in `adaptive_{max, avg}_pool3d` as discussed in https://github.com/pytorch/pytorch/pull/88906#discussion_r1033466164.
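For illustration, a small usage sketch (our own snippet, assuming a CPU tensor) of the newly supported dtype:

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, 8, dtype=torch.bfloat16)
avg = F.adaptive_avg_pool3d(x, (4, 4, 4))
mx, idx = F.adaptive_max_pool3d(x, (4, 4, 4), return_indices=True)
print(avg.dtype, mx.dtype)  # both stay torch.bfloat16
```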
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89754
Approved by: https://github.com/jgong5, https://github.com/kit1980
2023-06-23 19:12:47 +00:00
7274582390 Revert "sparse_mask: backward support for sparse lhs (#95165)"
This reverts commit f090fdf3b49164679fb6316e9ae15e0c4fb3c9eb.

Reverted https://github.com/pytorch/pytorch/pull/95165 on behalf of https://github.com/huydhn due to Sorry for reverting this. I think one of the tests test_sparse.py::TestSparseCUDA::test_sparse_mask_backward_cuda_complex128 is failing on slow gradcheck f090fdf3b4 ([comment](https://github.com/pytorch/pytorch/pull/95165#issuecomment-1604696109))
2023-06-23 18:40:15 +00:00
3a823e4617 [BE][CMake] Do not pass -mfpu=neon on Apple (#104078)
Followup to https://github.com/pytorch/pytorch/pull/103929 that gets rid of an annoying warning, which will become an error in newer Xcode.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 748d60d</samp>

> _`NEON_FOUND` is true_
> _But iOS may not like `-mfpu=neon`_
> _Check platform, then branch_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104078
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-06-23 17:09:30 +00:00
d1c367470b [Specialized Kernel] Remove requirement for type_alias and dim_order_alias to be present (#104006)
These fields are not required when the kernels provided do not use aliases (e.g. when only a default kernel is provided).

Differential Revision: [D46916099](https://our.internmc.facebook.com/intern/diff/D46916099/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104006
Approved by: https://github.com/larryliu0820
2023-06-23 16:49:57 +00:00
8176cd8c0f [ao] fixing quantized prelu workflow (#103455)
Summary: https://github.com/pytorch/pytorch/issues/100654 noticed that prelu
was not running its observers when the quantization flow was being run.
This was a bug, which is now fixed, and the relevant prelu tests now
check for this. Also added a corrected observer for PReLU to
qconfig_mapping.
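For illustration, a hedged sketch (our own snippet, not from this PR) of the eager-mode static quantization flow in which PReLU's observers should now run:

```
import torch
import torch.ao.quantization as tq

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.prelu = torch.nn.PReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.prelu(self.quant(x)))

m = M().eval()
m.qconfig = tq.get_default_qconfig("fbgemm")
prepared = tq.prepare(m)
prepared(torch.randn(2, 4))       # calibration: PReLU's observers should run here
quantized = tq.convert(prepared)
```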

Test Plan: python test/test_quantization.py TestStaticQuantizedModule.test_prelu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103455
Approved by: https://github.com/jerryzh168
2023-06-23 16:45:40 +00:00
8a500f0be6 Update triton commit pin for ROCm (#104035)
Updates ROCm triton pinned commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104035
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-06-23 16:36:20 +00:00
7320ef5651 [quant][pt2] Add prepare QAT test for mobilenetv2 (#104068)
Summary:
Prepare QAT for mobilenetv2 has matching numerics with
FX. There were two changes needed to achieve this, however.
First, this commit adds observer sharing for ReLU6, which is
used extensively throughout this model. Second, in the tests we
have to use the same manual seed every time we call the models
in order to get the same results between FX and PT2. This is
because there is a dropout at the end of the model.

Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2

Reviewed By: kimishpatel

Differential Revision: D46707786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104068
Approved by: https://github.com/jerryzh168
2023-06-23 16:34:25 +00:00
fd40abb706 Minor bugfix for int inputs in minifier (#104100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104100
Approved by: https://github.com/albanD
2023-06-23 16:17:12 +00:00
fb04b59fa2 [functorch] Remove test_functionalize (#103748)
After landing D46344980 I talked with @rzou and discovered that
test_functionalize no longer actually tests anything beyond what's already in
test_aotdispatch.  So, we can remove this, and save some GPU testing cycles!

Differential Revision: [D46395212](https://our.internmc.facebook.com/intern/diff/D46395212/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46395212/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103748
Approved by: https://github.com/zou3519, https://github.com/Neilblaze
2023-06-23 14:38:50 +00:00
ce845dfe49 [Reland][ET] Select used et_kernel_metadata only (#104005)
Summary: Currently we rely on root operators, but we also need to check et_kernel_metadata for the specialized kernels that are used.

Test Plan: contbuild & OSS CI

Reviewed By: Jack-Khuu

Differential Revision: D46882119

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104005
Approved by: https://github.com/Jack-Khuu
2023-06-23 14:38:45 +00:00
afc788a99c Re-land _cycleviz.py: visualize reference cycles holding cuda memory (#104051)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Reland to make windows skip the test.

This reverts commit 7b3b6dd4262337c5289d64dd3e824b0614cf68e3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104051
Approved by: https://github.com/aaronenyeshi, https://github.com/malfet
2023-06-23 13:44:58 +00:00
f090fdf3b4 sparse_mask: backward support for sparse lhs (#95165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95165
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-06-23 12:27:27 +00:00
fcb7a47f8b [Quant][PT2E]Fix the maxpool2d input observer didn't insert after QuantizationAnnotation API (#101941)
**Summary**
The previous UT had been accidentally broken, since the output of the conv2d node was annotated by mistake.
Re-enable these UTs for the following cases:

- For a single `conv2d` node, if we don't annotate the output node of `conv2d`, there should be no fake quant at conv2d's output.
- For the `conv2d-maxpool` pattern, `maxpool` should have fake quant inserted at its input and output nodes since we annotate those nodes.

**Test Plan**
```
python -m pytest test_quantize_pt2e.py -k test_wo_annotate_conv_output_quantizer
python -m pytest test_quantize_pt2e.py -k test_max_pool2d_quantizer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101941
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-06-23 11:50:31 +00:00
47894bb165 [functorch] disable C++ Function under functorch transforms (#103957)
Fixes https://github.com/pytorch/pytorch/issues/102720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103957
Approved by: https://github.com/zou3519
2023-06-23 11:01:44 +00:00
ec24f1e4cc Simulate treespec flattening/unflattening (#101896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101896
Approved by: https://github.com/jansel, https://github.com/anijain2305
2023-06-23 10:53:15 +00:00
92c0e49419 Add num_elements_per_warp as an triton_config (#103702)
# Summary
1. Add `num_elements_per_warp` as an optional triton config. Currently it's only used in Pointwise max_auto_tune.
2. Added an entry for Pointwise max_auto_tune when len(size_hints)==1. This is from the results of `CoordescTuner` for the `max_pool2d_with_indices_backward` kernel.
3. Made the channels-last `max_pool2d_with_indices_backward` use the torch inductor lowering by default when auto-tune is enabled.

(I tried to update `num_elements_per_warp` for all configs directly. However, it brings some perf regressions for the "torchbench" and "dynamic" models, so this PR still uses a guard.)

# Performance test results
Operator max_pool2d_with_indices_backward testing:
```
python3.9 benchmarks/dynamo/microbenchmarks/operatorbench.py --suite=timm --op=aten.max_pool2d_with_indices_backward.default --max-samples=5 --dtype=float16 --channels-last

Before this change:

Fallback
Inductor Speedups : [0.9997202597876758, 1.0001309108307304, 1.0002654421310901]

Default lowering:
Inductor Speedups : [0.9945062166479167, 1.0632119741391315, 1.3002933288577507]

TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=0
Inductor Speedups : [0.9941159121217165, 1.0648002410311495, 1.2999986086755966]

TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
Inductor Speedups : [0.9950528253874693, 1.0651245316963014, 1.3013674401534756]

TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1
Inductor Speedups : [1.4020247605755827, 1.5504232138088152, 1.8226088905229931]

After this change:

TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
Inductor Speedups : [1.403303265792746, 1.548831582475635, 1.822278780085024]

```

Inductor perf nightly run in progress:
https://github.com/pytorch/pytorch/actions/runs/5329044981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103702
Approved by: https://github.com/jansel, https://github.com/eellison
2023-06-23 10:17:46 +00:00
09d093b47b Update foreach realization rules (#104017)
- don't realize inputs to foreach ops
- only realize outputs if there are downstream non-foreach ops
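For illustration, a hedged example (not from this PR) of the kind of compiled `_foreach_*` chain these realization rules apply to:

```
import torch

@torch.compile
def step(params, grads):
    scaled = torch._foreach_mul(grads, 0.1)    # intermediate that only feeds another foreach op
    return torch._foreach_add(params, scaled)  # outputs of the compiled region

params = [torch.randn(4) for _ in range(3)]
grads = [torch.randn(4) for _ in range(3)]
out = step(params, grads)
```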

before merge:
need to update tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104017
Approved by: https://github.com/jansel
2023-06-23 08:26:09 +00:00
a152b3e3b8 [RFC] Create functional aten assertion ops (#103751)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #103887
* #103757
* __->__ #103751

Prep PR to create functional version of assertions. Concrete logic will be implemented in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103751
Approved by: https://github.com/tugsbayasgalan
2023-06-23 06:20:42 +00:00
3c28431a0f Feature: Dump compile_times when TORCH_LOGS=dynamo is enabled. (#104057)
Partial implementation of  https://github.com/pytorch/pytorch/issues/103173. This PR only implements the feature to dump compile_times at the end of the session using the atexit handler.
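For illustration, a minimal way to exercise it (our own snippet); run with `TORCH_LOGS=dynamo`:

```
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.randn(8))
# When the interpreter exits, the atexit handler logs the accumulated compile_times report.
```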

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104057
Approved by: https://github.com/ezyang
2023-06-23 05:25:09 +00:00
23b7035b3c [TP] Add an input resharding wrapper for TP and unit test for 2D + AC (#103334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103334
Approved by: https://github.com/kumpera
2023-06-23 04:05:01 +00:00
8c3958eddc Fix lr_scheduler serialization contains bound methods issue (#102627)
Fixes #42376
`torch.save` serializes bound methods inside the LR scheduler, resulting in a large serialized file.

Test cases include checking the file size, checking whether `anneal_func` is a bound method, and checking that the file loads correctly.
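For illustration, a hedged snippet (our own; OneCycleLR is used only as an example of a scheduler that stores an anneal function):

```
import io
import torch

model = torch.nn.Linear(1000, 1000)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=10)

buf = io.BytesIO()
torch.save(sched.state_dict(), buf)
# With the fix, the state_dict no longer embeds a bound method (which would drag
# the scheduler and optimizer along with it), so the payload stays small.
print(buf.getbuffer().nbytes)
```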
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102627
Approved by: https://github.com/albanD
2023-06-23 03:53:15 +00:00
c805b81fef [vision hash update] update the pinned vision hash (#104082)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104082
Approved by: https://github.com/pytorchbot
2023-06-23 03:47:42 +00:00
29e3fddb08 Revert "Preserve original co_filename when FX symbolic_trace (#103885)"
This reverts commit b9f81a483a7879cd3709fd26bcec5f1ee33577e6.

Reverted https://github.com/pytorch/pytorch/pull/103885 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103885#issuecomment-1603612781))
2023-06-23 02:49:04 +00:00
5a97c947c6 Fix optimizer grad mode state interaction with dynamo (#103952)
Graph break before restoring the grad mode to ensure dynamo respects `no_grad`. This isn't a bug necessarily, but this will allow us to get good perf until aot is updated.

https://github.com/pytorch/pytorch/issues/104053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103952
Approved by: https://github.com/janeyx99
2023-06-23 02:07:08 +00:00
22eaedacd3 [nccl] Do no skip send/recv 0 byte tensor (#103140)
Summary:
Since version 2.12.10, NCCL supports sending/receiving 0 bytes: https://github.com/NVIDIA/nccl/issues/696. Therefore we don't have to skip it.

One issue with skipping is that if a rank has 0 bytes to send and 0 bytes to recv, it skips the send/recv completely and proceeds to the next collective, in which it may send/recv something, which is confusing to the other ranks. Another solution would be to add a barrier, but that's very expensive.
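For illustration, a hedged sketch (our own snippet) of the case in question; it assumes two ranks launched with `torchrun --nproc_per_node=2` and NCCL >= 2.12.10:

```
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

buf = torch.empty(0, device="cuda")  # zero elements, zero bytes
if rank == 0:
    dist.send(buf, dst=1)            # no longer skipped for 0-byte tensors
else:
    dist.recv(buf, src=0)

dist.destroy_process_group()
```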

Test Plan: will add a unit test

Differential Revision: D46507785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103140
Approved by: https://github.com/malfet, https://github.com/kwen2501
2023-06-23 01:57:53 +00:00
0330f67b22 Remove ExportGraphModuleMixin. (#103786)
Summary:
We remove the ExportGraphModuleMixin. There are several implications of this change:
1. The graph_module of ExportedProgram, EdgeDialectProgram, and ExecutorchProgram won't have the same signature as the original user function. Instead, we should call the *Program directly, which does have the same calling convention.

2. All passes need to go through prog.transform(*passes). We need to make all passes return a PassResult.

3. We also need to make sure the graph_module.meta is preserved after transform.

Test Plan: Test with CI.

Differential Revision: D46729844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103786
Approved by: https://github.com/avikchaudhuri
2023-06-23 01:22:28 +00:00
4624afaa30 use reset_running_stats in swa_utils.update_bn (#103801)
The stat reset in `swa_utils.update_bn` already exists in `NormBase.reset_running_stats`, so use that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103801
Approved by: https://github.com/janeyx99
2023-06-23 01:17:13 +00:00
75716fb060 [export][serde] Add opset version check and upgrader API (#103238)
This PR adds an initial implementation of an upgrader. A test was added to show that this version works for one of the upgraders in https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/operator_upgraders/upgraders_entry.cpp.

Differential Revision: [D46651778](https://our.internmc.facebook.com/intern/diff/D46651778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103238
Approved by: https://github.com/avikchaudhuri
2023-06-23 01:06:02 +00:00
6463c55ef8 [inductor] Limit window for horizontal fusion (#104024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104024
Approved by: https://github.com/jansel
2023-06-23 01:04:15 +00:00
6bda97e2c1 Raise type error message for interpolate if size contains non-integer elements (#99243)
Raise a TypeError for interpolate when the output size is a tuple containing elements that are not `int`.

Fixes #98287

The check is only performed if `size` is an instance of `list` or `tuple`.
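For illustration, a short snippet (our own) showing the new check:

```
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
try:
    F.interpolate(x, size=(4.0, 4))  # a float slipped into the size tuple
except TypeError as e:
    print("caught:", e)
```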
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99243
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze, https://github.com/MovsisyanM, https://github.com/albanD
2023-06-23 00:48:45 +00:00
51664489ba fix upload alerts to rockset (#103995)
Testing is the CI of https://github.com/pytorch/pytorch/pull/103996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103995
Approved by: https://github.com/huydhn
2023-06-22 23:33:10 +00:00
4e204ff87b Added is_xla (#103100)
This change creates `is_xla` which is congruent with `is_cuda` and `is_cpu`. Useful in situations like: https://github.com/pytorch/pytorch/pull/102858

```
>>> x = torch.tensor([1], device=xm.xla_device())
>>> x.is_xla
True
>>> x.is_cpu
False
>>> x = torch.tensor([1])
>>> x.is_cpu
True
>>> x.is_xla
False
```

Attn: @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103100
Approved by: https://github.com/albanD
2023-06-22 23:31:04 +00:00
49dc26435f [BE]Fix @parametrize not working when using @with_comms in DTensorTestBase (#104065)
1) Fix @parametrize not working when using @with_comms in DTensorTestBase. This is because args and kwargs are currently not being passed through the @with_comms wrapper.
2) Use @parametrize in test_fsdp_dtensor_state_dict.py to make sure it is working correctly.
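For illustration, a generic sketch (hypothetical helper names, not the actual test base) of the fix in (1): the wrapper must forward `*args`/`**kwargs` so `@parametrize` parameters reach the test body.

```
import functools

def with_comms(func):
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):  # forwarding args/kwargs is the fix
        self.init_pg()                   # hypothetical setup helper
        try:
            return func(self, *args, **kwargs)
        finally:
            self.destroy_pg()            # hypothetical teardown helper
    return wrapper
```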
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104065
Approved by: https://github.com/fduwjj
2023-06-22 23:24:40 +00:00
a3ac258291 Pass in .so name via lower setting (#103968) (#104015)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/103968

Differential Revision: D46922444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104015
Approved by: https://github.com/desertfire
2023-06-22 23:05:27 +00:00
8d9581a390 Remove foreach triton workaround that is no longer needed (#104016)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104016
Approved by: https://github.com/eellison
2023-06-22 22:30:23 +00:00
1f1fb58b8a [dynamo] Fix TimmRunner typo in benchmarks (#104052)
Minor fix - removes an extra "n" from the TimmRunner class name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104052
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-06-22 22:25:25 +00:00
0d5f1cb666 [quant] Add torch.flatten to executorch backend_config (#103988)
Summary: This is needed to make the short-term and long-term
quantization numerics match for mobilenetv2.

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers: jerryzh, kimishpatel

Subscribers: jerryzh, kimishpatel

Differential Revision: [D46909962](https://our.internmc.facebook.com/intern/diff/D46909962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103988
Approved by: https://github.com/jerryzh168
2023-06-22 22:11:48 +00:00
f044613f78 Back out "Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)" (#103938)
Differential Revision: [D46883396](https://our.internmc.facebook.com/intern/diff/D46883396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103938
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-22 21:55:58 +00:00
10ad74cbec Update SavedVariable to support saving non-input leafs (#104039)
Fixes https://github.com/pytorch/pytorch/issues/103726
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104039
Approved by: https://github.com/albanD
2023-06-22 21:52:35 +00:00
d7994dfd07 [inductor] Add triton_helpers.any instead of reusing max (#103974)
I doubt there's much difference in performance, but this improves readability of
the generated code, e.g.

```python
tmp8 = triton_helpers.max2(tmp7, 1)[:, None]
```
becomes
```python
tmp8 = triton_helpers.any(tmp7, 1)[:, None]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103974
Approved by: https://github.com/lezcano
2023-06-22 20:06:21 +00:00
303ff84b04 [quant][pt2] Update special qspecs after QAT rewrite (#103970)
Summary:
Special qspecs like `SharedQuantizationSpec` and
`DerivedQuantizationSpec` refer to other nodes in the graph.
However, after subgraph rewriting in QAT, the nodes referred
to in these special qspecs may be replaced by new nodes.
This could lead to the following error when inserting
observers according to these qspecs:

```
AssertionError: please make sure only refer to edge or node
that has observer/fake_quant inserted: 'getitem' not in
dict_keys([(arg0, convolution_default_1), (mul_tensor, convolution_default_1), getitem_3])
```

This commit fixes this by keeping track of the nodes that
are replaced during subgraph rewriting in QAT, and using
this mapping to update the dangling references used in these
special qspecs.

Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_update_shared_qspec

Reviewed By: jerryzh168

Differential Revision: D46606614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103970
Approved by: https://github.com/jerryzh168
2023-06-22 20:05:57 +00:00
c16a28860f Reenable disabled tests by pr body (#103790)
Query for the list of re-enabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing re-enabled issues to the filter test config step, and remove the old code for getting the PR body and commit message.  `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be re-enabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work

More testing via `python3 ".github/scripts/filter_test_configs.py"     --workflow "pull"     --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)"     --test-matrix "{ include: [
    { config: "default", shard: 1, num_shards: 1 },
  ]}
  "     --pr-number ""     --tag ""     --event-name "push"     --schedule ""     --branch ""`
 and
 `python3 ".github/scripts/filter_test_configs.py"     --workflow "pull"     --job-name "linux-bionic-cuda12.1-py3.10-gcc9 / test (default, 4, 5, linux.4xlarge.nvidia.gpu)"     --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 2, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 3, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 4, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}, {"config": "default", "shard": 5, "num_shards": 5, "runner": "linux.g5.4xlarge.nvidia.gpu"}]}"     --pr-number "103790"     --tag ""     --event-name "pull_request"     --schedule ""     --branch ""`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 19:47:11 +00:00
7ac1c64bc4 Exclude _nvfuser from test collection (#104003)
The three files in this folder should instead be run by test_jit_cuda_fuser.py, test_nvfuser_dynamo.py, and test_nvfuser_frontend.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104003
Approved by: https://github.com/huydhn, https://github.com/jjsjann123
2023-06-22 19:46:45 +00:00
5847cb55e4 [PyPer][ET] Refactor EG to ET (#99694)
Summary:
Change execution graph to execution trace.
See post: https://fb.workplace.com/groups/873291503156329/permalink/1529496217535851/

Test Plan: Run a job.

Reviewed By: chaekit

Differential Revision: D44121392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99694
Approved by: https://github.com/chaekit
2023-06-22 19:41:54 +00:00
ec922efe3b [inductor] fix a failed test for layout optimization (#103984)
Summary:
The test fails because a fixed port is used to initialize the process group. That does not work in stress testing, when multiple instances of the test run concurrently.

Pick a random port instead and retry a few times if initialization fails.
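
A minimal sketch of that pattern, assuming a single-host test setup (names and the retry count are illustrative):

```python
import random
import torch.distributed as dist

def init_pg_with_random_port(rank, world_size, retries=3):
    # In a multi-rank test the chosen port must be shared by all ranks,
    # e.g. picked by rank 0 and communicated through an env variable.
    for _ in range(retries):
        port = random.randint(20000, 65000)
        try:
            dist.init_process_group(
                backend="gloo",
                init_method=f"tcp://127.0.0.1:{port}",
                rank=rank,
                world_size=world_size,
            )
            return
        except RuntimeError:
            continue  # port already taken by a concurrent run; try another
    raise RuntimeError("could not initialize process group after retries")
```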

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:layout_optim -- --exact 'caffe2/test/inductor:layout_optim - test_mutate_view (caffe2.test.inductor.test_layout_optim.TestLayoutOptim)' --run-disabled --jobs 18 --stress-runs 10 --record-results
```

Differential Revision: D46908114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103984
Approved by: https://github.com/williamwen42
2023-06-22 19:34:10 +00:00
b5594f7df0 Revert "Use missing-prototypes in torch_cpu (#103725)"
This reverts commit 716b3b893d2826f1e47ab5321f082b48c66c8c92.

Reverted https://github.com/pytorch/pytorch/pull/103725 on behalf of https://github.com/osalpekar due to broken caffe2 builds. More info at [D46920675](https://www.internalfb.com/diff/D46920675) ([comment](https://github.com/pytorch/pytorch/pull/103725#issuecomment-1603129273))
2023-06-22 18:30:31 +00:00
4aee0fef11 Heap buffer overflow due to wrong loop condition in torch::jit::unpickler (#103667)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found a heap buffer overflow error that occurs due to an incorrect loop condition in torch::jit::unpickler.cpp. This bug was found in several fuzzing targets: it can be triggered by the `torch::jit::load()` method when loading a .pt model and by the `torch::distributed::rpc::deserializeRequest()` method in the RPC module.

All found errors could be reproduced with provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### PoC for deserializeRequest():
[crash-0722408578cd2f26593b5a01e26d2a078d3dc5f6.zip](https://github.com/pytorch/pytorch/files/11756694/crash-0722408578cd2f26593b5a01e26d2a078d3dc5f6.zip)

```
=================================================================
==29858==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020004ed808 at pc 0x000000680084 bp 0x7ffcbd8220d0 sp 0x7ffcbd8220c8
READ of size 4 at 0x6020004ed808 thread T0
    #0 0x680083 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:224:33
    #1 0xdc4beb8 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
    #2 0xea680a7 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:452:14
    #3 0xea64e07 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
    #4 0xea64a61 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
    #5 0xe9b13ce in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:126:20
    #6 0xe9b178c in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch/torch/csrc/jit/serialization/pickle.cpp:136:10
    #7 0xfdc8aa1 in torch::distributed::rpc::(anonymous namespace)::toIValues(torch::distributed::rpc::Message const&, torch::distributed::rpc::MessageType) /pytorch/torch/csrc/distributed/rpc/rref_proto.cpp:23:16
    #8 0xfdca3ca in torch::distributed::rpc::PythonRRefFetchCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/rref_proto.cpp:105:17
    #9 0xfe7f347 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:117:14
    #10 0x5c5d13 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27
    #11 0x5c2bfd in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #12 0x5c2a08 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #13 0x5c25c8 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #14 0x7feb90908082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #15 0x50237d in _start (/message_deserialize_afl+0x50237d)

0x6020004ed808 is located 8 bytes to the right of 16-byte region [0x6020004ed7f0,0x6020004ed800)
allocated by thread T0 here:
    #0 0x5bfc1d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0x32ad8d1 in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
    #2 0x32ad8d1 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<double>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, double&&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33

SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:224:33 in c10::IValue::IValue(c10::IValue const&)
Shadow bytes around the buggy address:
  0x0c0480095ab0: fa fa fd fd fa fa fd fd fa fa fd fd fa fa 00 00
  0x0c0480095ac0: fa fa 00 00 fa fa 00 00 fa fa 04 fa fa fa 04 fa
  0x0c0480095ad0: fa fa 00 fa fa fa fd fa fa fa 04 fa fa fa 00 fa
  0x0c0480095ae0: fa fa 00 fa fa fa fd fa fa fa fd fa fa fa fd fa
  0x0c0480095af0: fa fa fd fd fa fa 00 00 fa fa 00 fa fa fa 00 00
=>0x0c0480095b00: fa[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480095b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480095b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480095b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480095b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c0480095b50: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==29858==ABORTING
```

### PoC for load():
[crash-2bd32e496811fb06de24a2bb720dc6490218009f.zip](/uploads/53d108cdd434ec4b11a2034bbca3cfd8/crash-2bd32e496811fb06de24a2bb720dc6490218009f.zip)

```
==29865==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60c00031f388 at pc 0x000000669984 bp 0x7ffd6c6de630 sp 0x7ffd6c6de628
READ of size 4 at 0x60c00031f388 thread T0
    #0 0x669983 in c10::IValue::IValue(c10::IValue const&) /pytorch/aten/src/ATen/core/ivalue.h:224:33
    #1 0xdc3de68 in std::pair<c10::impl::DictIterator<c10::IValue, c10::IValue, ska_ordered::detailv3::sherwood_v3_table<std::pair<c10::IValue, c10::IValue>, c10::IValue, c10::detail::DictKeyHash, ska_ordered::detailv3::KeyOrValueHasher<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyHash>, c10::detail::DictKeyEqualTo, ska_ordered::detailv3::KeyOrValueEquality<c10::IValue, std::pair<c10::IValue, c10::IValue>, c10::detail::DictKeyEqualTo>, std::allocator<std::pair<c10::IValue, c10::IValue> >, std::allocator<ska_ordered::detailv3::sherwood_v3_entry<std::pair<c10::IValue, c10::IValue> > > >::templated_iterator<std::pair<c10::IValue, c10::IValue> > >, bool> c10::Dict<c10::IValue, c10::IValue>::insert_or_assign<c10::IValue&, c10::IValue&>(c10::IValue&, c10::IValue&) const /pytorch/aten/src/ATen/core/Dict_inl.h:136:5
    #2 0xea5a207 in torch::jit::Unpickler::readInstruction() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:452:14
    #3 0xea56f67 in torch::jit::Unpickler::run() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251:27
    #4 0xea56bc1 in torch::jit::Unpickler::parse_ivalue() /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204:3
    #5 0xe96db4e in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) /pytorch/torch/csrc/jit/serialization/import_read.cpp:53:20
    #6 0xe8fc648 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) /pytorch/torch/csrc/jit/serialization/import.cpp:184:10
    #7 0xe8f8935 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize(c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:287:19
    #8 0xe8f6d74 in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > >&, bool, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:386:25
    #9 0xe90086e in torch::jit::import_ir_module(std::shared_ptr<torch::jit::CompilationUnit>, std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:322:10
    #10 0xe903209 in torch::jit::load(std::istream&, c10::optional<c10::Device>, bool) /pytorch/torch/csrc/jit/serialization/import.cpp:482:10
    #11 0x5c2d60 in LLVMFuzzerTestOneInput /load.cc:42:14
    #12 0x5c2a8d in ExecuteFilesOnyByOne /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255:7
    #13 0x5c2898 in LLVMFuzzerRunDriver /AFLplusplus/utils/aflpp_driver/aflpp_driver.c
    #14 0x5c2458 in main /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300:10
    #15 0x7f156ae33082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)
    #16 0x50220d in _start (/load_afl+0x50220d)

0x60c00031f388 is located 8 bytes to the right of 128-byte region [0x60c00031f300,0x60c00031f380)
allocated by thread T0 here:
    #0 0x5bfaad in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3
    #1 0xa86231 in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20
    #2 0xa86231 in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33

SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:224:33 in c10::IValue::IValue(c10::IValue const&)
Shadow bytes around the buggy address:
  0x0c188005be20: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c188005be30: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x0c188005be40: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x0c188005be50: fd fd fd fd fd fd fd fd fa fa fa fa fa fa fa fa
  0x0c188005be60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c188005be70: fa[fa]fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x0c188005be80: 00 00 00 00 00 00 00 00 fa fa fa fa fa fa fa fa
  0x0c188005be90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0c188005bea0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c188005beb0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0c188005bec0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==29865==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103667
Approved by: https://github.com/albanD
2023-06-22 18:09:19 +00:00
f27a9129e7 XFAIL test_MaxUnpool_index_errors CUDA slow tests (#103905)
This has been failing in trunk for a while.  Let's XFAIL it while continuing the investigation https://github.com/pytorch/pytorch/issues/103854.  We might not need this PR if the fix is on the way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103905
Approved by: https://github.com/mikaylagawarecki
2023-06-22 18:05:10 +00:00
abd4ee8150 Specific namespace for mha (#104001)
Avoids potential namespace collisions for downstream projects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104001
Approved by: https://github.com/cpuhrsch
2023-06-22 17:57:06 +00:00
d2d3394c21 [pytorch/tensorexpr] Create LLJIT instance with an ObjectLinkingLayer (#103824)
- Upstream LLVM switched LLJIT's default JIT linker for ELF/x86-64 to JITLink: [commit](b92839c954). This requires clients to use JITLink plugins, following the example in "llvm/examples/OrcV2Examples/LLJITWithCustomObjectLinkingLayer".

- The current change updates PytorchLLVMJITImpl to set an ObjectLinkingLayer on LLJIT creation.
- If setObjectLinkingLayerCreator is not set, an RTDyldObjectLinkingLayer will be constructed. This is currently causing a "Symbols not found: [ llvm_orc_registerEHFrameSectionWrapper ]" error for tests in test_quantization.py when PyTorch is built to use the latest LLVM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103824
Approved by: https://github.com/jeffdaily, https://github.com/davidberard98
2023-06-22 17:50:25 +00:00
f818036f85 Fix test_addmm_gelu assertion on Windows CUDA (#104031)
Summary:

This PR fixes the wrong assertion in `test_addmm_gelu` happening in the Windows CUDA CI job, caused by #103811. The addmm + GELU fusion is likely not happening (or not using the tanh approximation) on Windows. See [this comment](https://github.com/pytorch/pytorch/pull/103811#issuecomment-1601936203) in #103811 for the details of the error.

Test Plan:

```
$ python test/test_linalg.py -k test_addmm_relu -v
test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.131s

OK

$ python test/test_linalg.py -k test_addmm_gelu -v
test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

----------------------------------------------------------------------
Ran 6 tests in 2.194s

OK
```

Reviewers: @eellison @huydhn

Subscribers:

Tasks:

Tags:

Differential Revision: [D46931688](https://our.internmc.facebook.com/intern/diff/D46931688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104031
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-06-22 17:42:33 +00:00
7b3b6dd426 Revert "_cycleviz.py: visualize reference cycles holding cuda memory (#102656)"
This reverts commit dba67f71c9b5abbdca5aa64913c50f9aa5ac6f51.

Reverted https://github.com/pytorch/pytorch/pull/102656 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But I think the change is failing on Windows CUDA https://github.com/pytorch/pytorch/actions/runs/5341701630/jobs/9683293600 ([comment](https://github.com/pytorch/pytorch/pull/102656#issuecomment-1603035364))
2023-06-22 17:16:47 +00:00
ab9ea0d0f2 Bump numpy from 1.21.6 to 1.22.0 in /benchmarks/dynamo/_onnx (#104014)
Bumps [numpy](https://github.com/numpy/numpy) from 1.21.6 to 1.22.0.
- [Release notes](https://github.com/numpy/numpy/releases)
- [Changelog](https://github.com/numpy/numpy/blob/main/doc/RELEASE_WALKTHROUGH.rst)
- [Commits](https://github.com/numpy/numpy/compare/v1.21.6...v1.22.0)

---
updated-dependencies:
- dependency-name: numpy
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-06-22 09:45:15 -07:00
1c33c398c7 [FSDP][state_dict] Add a summary log when finishing state_dict (#103784)
Add a summary log when finishing state_dict

Differential Revision: [D46807103](https://our.internmc.facebook.com/intern/diff/D46807103/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103784
Approved by: https://github.com/fduwjj
2023-06-22 16:29:24 +00:00
ab8fc41e2f Support bfloat16 dtype for CUTLASS-based semi-structured sparsity (#103978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103978
Approved by: https://github.com/cpuhrsch
2023-06-22 15:53:27 +00:00
5eb7325bc7 Add autocast support for IPU (#103890)
As part of this, a new `AutocastIPU` dispatch key has been added.

There's an existing PR, #85043, to make `Autocast` a proper per-backend functionality key, but it ran into issues with layering with other functionality keys and went stale.

This has been tested in the out-of-tree IPU PyTorch backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103890
Approved by: https://github.com/albanD
2023-06-22 15:38:45 +00:00
0d653730ce Refactory bits for the codegen cache (#103452)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103452
Approved by: https://github.com/ezyang
2023-06-22 13:04:22 +00:00
wgb
b663a41b51 add onlyprivateuse1 decorator for test framework (#103664)
Fixes #ISSUE_NUMBER
The current community testing framework does not have a decorator for privateuse1; this adds one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103664
Approved by: https://github.com/albanD
2023-06-22 13:00:31 +00:00
4143b6b89b Add torch_dispatch and modes to extending.rst note (#102087)
The following subjects are not in this PR and will be done in a follow up:
- Go through torch_function section and update to the latest phrasing and link to the proper new sections
- Go through torch.library and custom device docs to add links to the new sections as appropriate
- Top level explanations on which component should be used
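
For reference, a minimal example of the kind of `__torch_dispatch__` mode the updated note covers (a sketch, not text from the note itself):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class LoggingMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"calling {func}")
        return func(*args, **kwargs)

with LoggingMode():
    torch.ones(2).add(1)  # prints the aten ops hit by this call
```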
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102087
Approved by: https://github.com/janeyx99
2023-06-22 12:56:35 +00:00
e9705c52ac [pt2] add metas for _pdist_forward and _pdist_backward (#103817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103817
Approved by: https://github.com/ezyang
2023-06-22 11:18:05 +00:00
e48851033a [pt2] add metas for pad ops (#103815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103815
Approved by: https://github.com/ezyang
2023-06-22 11:18:05 +00:00
f9c64a1156 [debugging] aot_eager backend to use the min-cut partitioner (#103555)
default_partitioner is kind of broken when it comes to memory footprint. Moving aot_eager to use the min-cut partitioner gives a better debugging experience.

One downside, though, is that we will see much lower speedup numbers, because the min-cut partitioner will try to recompute ops.
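
For reference, the aot_eager backend is selected like any other `torch.compile` backend (a usage sketch, not part of this PR):

```python
import torch

def fn(x):
    return torch.sin(x).cos().sum()

# aot_eager runs AOTAutograd (and, after this change, the min-cut partitioner)
# without generating Triton/C++ code, which makes it useful for debugging.
compiled = torch.compile(fn, backend="aot_eager")
compiled(torch.randn(8, requires_grad=True)).backward()
```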

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103555
Approved by: https://github.com/eellison, https://github.com/jansel
2023-06-22 09:31:08 +00:00
613970eb05 [5/n][FSDP] Update _sharded_post_state_dict_hook to use DTensor when use_dtensor=True in state_dict_config (#103921)
This allows us to use use_dtensor=True for ShardedStateDictConfig() before calling model.state_dict().

load_state_dict hooks updates will be in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103921
Approved by: https://github.com/fduwjj, https://github.com/fegin
2023-06-22 08:32:19 +00:00
34336bd625 [PyTorch Vulkan] fix the position computation with the consideration of channel padding (#103908)
Summary: The old shader file was created before channel padding was implemented. We recompute the positions taking into account that channels are padded to a multiple of 4.

Test Plan:
under `fbsource` run `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1`

full test result: P772641736

Reviewed By: SS-JIA

Differential Revision: D46866159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103908
Approved by: https://github.com/SS-JIA
2023-06-22 08:03:10 +00:00
2d528625d7 Make PyTorch compilable without XNNPACK (#104004)
By including `Engine.h` in `Shim.cpp` and defining `bool available()` outside of `#ifdef` guard in `Common.h`.

Modernize code a bit by using nested namespaces.

Fixes following compilation error if `USE_XNNPACK` is false:
```
Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:26:6: error: no previous prototype for function 'available' [-Werror,-Wmissing-prototypes]
bool available() {
     ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:30:6: error: no previous prototype for function 'use_convolution2d' [-Werror,-Wmissing-prototypes]
bool use_convolution2d(
     ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:42:8: error: no previous prototype for function 'convolution2d' [-Werror,-Wmissing-prototypes]
Tensor convolution2d(
       ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:53:6: error: no previous prototype for function 'use_linear' [-Werror,-Wmissing-prototypes]
bool use_linear(
     ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:60:8: error: no previous prototype for function 'linear' [-Werror,-Wmissing-prototypes]
Tensor linear(
       ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:67:6: error: no previous prototype for function 'use_max_pool2d' [-Werror,-Wmissing-prototypes]
bool use_max_pool2d(
     ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/xnnpack/Shim.cpp:79:8: error: no previous prototype for function 'max_pool2d' [-Werror,-Wmissing-prototypes]
Tensor max_pool2d(
       ^
```

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at f8ac185</samp>

> _The code for xnnpack activations_
> _Was scattered in different locations_
> _But now it's all neat_
> _In `Activation.cpp`_
> _With nested namespaces and simplifications_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104004
Approved by: https://github.com/drisspg
2023-06-22 06:41:31 +00:00
cyy
b689128db3 Fix an UBSAN error (#103900)
UBSAN reports unaligned address access.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103900
Approved by: https://github.com/kimishpatel
2023-06-22 06:17:48 +00:00
bffcfa9628 [ONNX] Separate fx _type_utils from torchscript exporter (#103942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103942
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-06-22 05:18:06 +00:00
c575f748ab [MPS] Remove unnecessary PSO checks (#103244)
The checks are unnecessary as PSO derived from `metalIndexingPSO` function is already checked, see:

c4752b1a91/aten/src/ATen/mps/MPSDevice.mm (L69-L72)

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 2d71d96</samp>

This pull request removes unnecessary and duplicated error handling code for the pipeline state object in the constructors of several MPS kernel classes in `aten/src/ATen/native/mps/operations`. This makes the code more concise and clear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103244
Approved by: https://github.com/albanD
2023-06-22 04:42:27 +00:00
dba67f71c9 _cycleviz.py: visualize reference cycles holding cuda memory (#102656)
Reference cycles are freed by the cycle collector rather than being cleaned up
when the objects in the cycle first become unreachable. If a cycle points to a tensor,
the CUDA memory for that tensor will not be freed until garbage collection runs.
Accumulation of CUDA allocations can lead to out of memory errors (OOMs), as well as
non-deterministic allocation behavior which is harder to debug.

This visualizer installs a garbage collection hook to look for cycles containing
CUDA tensors and saves a visualization of the garbage:

```
from torch.cuda._cycleviz import warn_tensor_cycles
warn_tensor_cycles()
# do some work that results in a cycle getting garbage collected
# ...
> WARNING:root:Reference cycle includes a CUDA Tensor see visualization of cycle /tmp/tmpeideu9gl.html
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102656
Approved by: https://github.com/aaronenyeshi
2023-06-22 04:00:28 +00:00
73c927f901 Improve debuggability of activation checkpoint (#103859)
This PR makes some improvements for debuggability of checkpointing:
- improved error messages that are more understandable
- errors are now `CheckpointError` which subclasses `RuntimeError` (only `CheckpointError` triggers debug message, see below)
- stricter error checking by default:
   - shapes, dtypes, and device are compared
   - we also now error when more tensors are being saved for backward during recompute
   - NOTE: checks are relaxed if it is detected that you are doing backward within forward
 - shapes, dtype, and device checking can be disabled by passing `determinism_check="none"`
 - new debug flag: more helpful error message when `debug=True`

Note:
- cpp stack trace is only included for x86 linux machines
- the error message if cpp stack trace is included can be quite long. For a function checkpointed with 8 operators, the log was around 1300 lines! (should this be hidden behind a flag?)

[Error message when debug='True' (python stack trace only)](https://gist.github.com/soulitzer/3d5e19c7cceae8e22f9bdd625ec39dd4)

[Error message when debug='True' (with python and cpp stacktrace)](https://gist.github.com/soulitzer/ff8fd8c3ccbb2c90dfe3df6d7713b167)
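
A usage sketch of the new determinism/debug knobs described above (the exact keyword names follow the PR description; values are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.sin(x).cos()

x = torch.randn(4, requires_grad=True)
# Disable the shape/dtype/device determinism checks and ask for the more
# detailed error message if a recompute mismatch is ever detected.
out = checkpoint(block, x, use_reentrant=False,
                 determinism_check="none", debug=True)
out.sum().backward()
```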
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103859
Approved by: https://github.com/albanD
2023-06-22 03:57:36 +00:00
dc15b4c838 add workflow dispatch to upload-alerts.yml (#103972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103972
Approved by: https://github.com/huydhn
2023-06-22 03:35:39 +00:00
518abe8b7e Revert "Migrate exportdb to torch.export from torchdynamo.export (#103861)"
This reverts commit fb6173a4ac60ed5a22cff2c68134633eb72e53b9.

Reverted https://github.com/pytorch/pytorch/pull/103861 on behalf of https://github.com/huydhn due to It looks like this change is failing in trunk due to a landrace fb6173a4ac ([comment](https://github.com/pytorch/pytorch/pull/103861#issuecomment-1601960600))
2023-06-22 03:24:01 +00:00
5f88dd3e47 Link new PyTorch Contributing Guidelines from CONTRIBUTING.md (#103986)
We wrote some new Contributing Guidelines that guide a contributor
through the lifecycle of a Pull Request to PyTorch.

We've gotten some positive feedback from early adopters so we are now
adding it as the go-to link in CONTRIBUTING.md and the PyTorch Wiki.

Note that there are older contributing guidelines over at
https://github.com/pytorch/pytorch/blob/main/docs/source/community/contribution_guide.rst
The new Contributing Guidelines doc is targeted towards guiding a user
through submitting and merging a Pull Request to pytorch; the existing
guidelines are more of a high-level overview. We should rationalize these
at some point, but I left the resources for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103986
Approved by: https://github.com/kit1980, https://github.com/albanD
2023-06-22 03:18:50 +00:00
c40fa8b614 [inductor] remove fft and svd ops from fake_incorrect_kernels (#103616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103616
Approved by: https://github.com/eellison
2023-06-22 03:01:43 +00:00
fb6173a4ac Migrate exportdb to torch.export from torchdynamo.export (#103861)
Things that needed to be fixed:
1. Fix a bug with returning dict output type
2. Make pass_base work with map implementation
3. Fix subtle bug with dynamo not propagating "val" in node.meta
4. Add export_constraints field in ExportCase in ExportDB

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103861
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2023-06-22 02:53:41 +00:00
430cb3e160 [PyTorch] add Vulkan support for aten::tile (#103944)
Summary: We implement `aten::tile` on Vulkan backend through `aten::repeat`. The behavior of `aten::tile` is demonstrated here https://pytorch.org/docs/stable/generated/torch.tile.html

Test Plan:
Run tests for combinations of input dims between 1 and 4 and repeats of size between 1 and 4. When a test case fails, the shape info is printed, e.g. `Tile test failed when input is of shape [13, 5] and repeat of [7, 2, 3]`.

```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*tile*"
Building: finished in 0.1 sec (100%) 263/2812 jobs, 0/2812 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *tile*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.tile_invalid_inputs_exceptions
[       OK ] VulkanAPITest.tile_invalid_inputs_exceptions (34 ms)
[ RUN      ] VulkanAPITest.tile_invalid_outpus_exceptions
[       OK ] VulkanAPITest.tile_invalid_outpus_exceptions (2 ms)
[ RUN      ] VulkanAPITest.tile
[       OK ] VulkanAPITest.tile (63 ms)
[----------] 3 tests from VulkanAPITest (100 ms total)

[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (100 ms total)
[  PASSED  ] 3 tests.
```

Reviewed By: yipjustin

Differential Revision: D46367170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103944
Approved by: https://github.com/SS-JIA
2023-06-22 01:53:58 +00:00
41cc526b19 Avoid unwanted type promotion in tensordot (#103917)
Fixes #103366 as noted [in comment](https://github.com/pytorch/pytorch/issues/103366#issuecomment-1589782866) by passing dtype to `sum` when dimension is 1.

The code block given in the original issue now succeeds with no `RuntimeError`.

```
x1 = torch.as_tensor([0], dtype=torch.int32)
x2 = torch.as_tensor([0], dtype=torch.int32)
torch.inner(x1, x2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103917
Approved by: https://github.com/mikaylagawarecki
2023-06-22 01:35:24 +00:00
3535c634d1 Eliminate c10/util/array from PyTorch (#103893)
Test Plan: Sandcastle

Differential Revision: D46772319

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103893
Approved by: https://github.com/Skylion007
2023-06-22 01:33:31 +00:00
58d11159bd Revert "Reenable disabled tests by pr body (#103790)"
This reverts commit 2237b4ad754cac060c906377800d28f7e56da8ec.

Reverted https://github.com/pytorch/pytorch/pull/103790 on behalf of https://github.com/huydhn due to I think we tested it on PR but missed the logic in trunk where there is no PR number ([comment](https://github.com/pytorch/pytorch/pull/103790#issuecomment-1601890299))
2023-06-22 01:26:46 +00:00
c1a49823cd [ONNX] Bench torch.onnx.dynamo_export and torch.onnx.export under dynamo bench (#103135)
- Extend dynamo bench interface with '--compilers onnx' and '--compilers dynamo-onnx'
- ONNX bench exports model to onnx and runs in ONNX Runtime.
- Introduce error aggregation and report.
- Scripts to build ONNX deps and running ONNX bench.
- Huggingface accuracy check workaround for ONNX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103135
Approved by: https://github.com/thiagocrepaldi, https://github.com/jansel
2023-06-22 01:21:09 +00:00
ts
ba6b1ae43a Fix group norm mixed type error (#103360)
Fixes #102922, adding a more descriptive error message when dealing with inputs that contain mixed types.

Would be happy to add a test (I believe in test_nn.py?), just want to confirm that this is the correct place to put it!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103360
Approved by: https://github.com/albanD
2023-06-22 01:15:32 +00:00
2237b4ad75 Reenable disabled tests by pr body (#103790)
Query for the list of reenabled issues in the filter test config step: switch filter test config to query for all the PR info instead of just the labels (so token usage should stay the same), move the code and tests related to parsing reenabled issues to the filter test config step, and remove the old code that fetched the PR body and commit message. `REENABLED_ISSUES` should be a comma-separated list of issue numbers to be reenabled.

For testing: Fixes #103789
Check that 103789 shows up in list of ignored disabled issues
Sanity check that test-config labels still work
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103790
Approved by: https://github.com/huydhn
2023-06-22 01:10:31 +00:00
ede1965f5d Enable additional inductor test suites on ROCm (#102270)
Enables additional inductor UTs on ROCm, following from https://github.com/pytorch/pytorch/pull/100981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102270
Approved by: https://github.com/malfet
2023-06-22 00:36:35 +00:00
cd05c3b98c [BE] Use TEST_MULTIGPU from common_cuda.py (#103982)
The comment about `TEST_CUDNN` being called over and over has long been alleviated by wrapping the check with `LazyVal`, which caches the result.
Also, delete unused `TEST_MAGMA`.

Prep change for https://github.com/pytorch/pytorch/issues/100006

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at e3a5b39</samp>

> _`common_cuda.py`_
> _Refactored for dynamo tests_
> _Winter code cleanup_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103982
Approved by: https://github.com/atalman, https://github.com/janeyx99
2023-06-22 00:07:44 +00:00
eed287ec19 [android] Only pass -mfpu to armv7 (#103929)
Summary:
The argument is unsupported on other architectures, and Clang 17 will
error out when you pass an argument that's unsupported for the arch
you're building for. Note that we need to use platform_compiler_flags
instead of selects because the latter can't distinguish between
architectures when doing a multi-arch app build in Buck1.

Differential Revision: D46825070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103929
Approved by: https://github.com/ezyang
2023-06-21 23:23:35 +00:00
626d8548df Revert "add override to Caffe2 (#103795)"
This reverts commit f5f020adb0f8aa689b4db9881b666b6b5f3722a0.

Reverted https://github.com/pytorch/pytorch/pull/103795 on behalf of https://github.com/osalpekar due to Caused some breakages due to jobs using `-Winconsistent-missing-destructor-override` detecting inconsistent usage of override. Specifically the Tensor class destructor not being marked with override ([comment](https://github.com/pytorch/pytorch/pull/103795#issuecomment-1601812803))
2023-06-21 23:21:25 +00:00
13664bb535 Revert "add oncall info individual info to failing alert job alert (#103915)"
This reverts commit 1b0d23708b74e7538242be2793d1046cdd3a3a0b.

Reverted https://github.com/pytorch/pytorch/pull/103915 on behalf of https://github.com/malfet due to Broke trunk with no module named tools, see https://github.com/pytorch/pytorch/actions/runs/5338343319/jobs/9675586967 ([comment](https://github.com/pytorch/pytorch/pull/103915#issuecomment-1601802715))
2023-06-21 23:06:45 +00:00
08a7d60a46 Revert "[Reland][ET] Select used et_kernel_metadata only (#103705)"
This reverts commit 59a01c49ee180c8d332e14bf3d5cbd1e8707bb65.

Reverted https://github.com/pytorch/pytorch/pull/103705 on behalf of https://github.com/osalpekar due to large number of internal failures in executorch contbuild. See [D46882119](https://www.internalfb.com/diff/D46882119) for more details ([comment](https://github.com/pytorch/pytorch/pull/103705#issuecomment-1601789900))
2023-06-21 22:51:38 +00:00
b7ae40f4c8 [min-cut partitioner] Disable a heuristic if graph has recomputable ops (#103635)
Removing this heuristic leads to a major memory-compression and speedup bump for activation-checkpointed models. Here is the data:
![image](https://github.com/pytorch/pytorch/assets/13822661/64a491ab-173d-435a-b858-61b847fbb08b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103635
Approved by: https://github.com/Chillee
2023-06-21 22:27:17 +00:00
3912b722f3 Upgrade LoggingTensor mode and add traceback collection (#103734)
Parts borrowed from: https://github.com/albanD/subclass_zoo/blob/main/logging_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103734
Approved by: https://github.com/albanD
2023-06-21 22:04:30 +00:00
09fdea8564 Fix autograd issue with identity conversions (#92022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92022
Approved by: https://github.com/pearu, https://github.com/mtaaooby, https://github.com/amjames, https://github.com/cpuhrsch
2023-06-21 21:23:03 +00:00
7fb2a928cf fix hpu storage serialization (#101680)
Change-Id: Ia534400a0e8972590374eceba5b62a2525b796e5

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101680
Approved by: https://github.com/mikaylagawarecki
2023-06-21 21:19:49 +00:00
9590228303 Fix device of lengths in pack_padded_sequence when the default device is GPU (#103967)
Fixes #103964

Always create the `lengths` tensor on cpu
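
A small sketch of the scenario from the issue (assumes a CUDA build; after the fix the internal `lengths` tensor is created on CPU regardless of the default device):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

torch.set_default_device("cuda")             # make the GPU the default device
x = torch.randn(5, 3, 8)                     # (seq_len, batch, features), lands on cuda
packed = pack_padded_sequence(x, [5, 3, 2])  # lengths are converted to a CPU tensor internally
```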
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103967
Approved by: https://github.com/mikaylagawarecki
2023-06-21 21:14:37 +00:00
c3c03e7cb8 Reland of https://github.com/pytorch/pytorch/pull/101818 (#103888)
Original PR broke internal

This reverts commit 5ed618132f466440ad76c884240e07796c7e2c6b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103888
Approved by: https://github.com/albanD
2023-06-21 21:00:56 +00:00
8b418f197c [decomp] Add decomposition for torch.renorm (#103858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103858
Approved by: https://github.com/ezyang, https://github.com/nkaretnikov
2023-06-21 20:57:43 +00:00
c0596ffe85 improve repr for pytrees (#103945)
The current repr indents based on the length of the previous line, which is totally unreadable if, e.g., the treespec is a dict with many keys: all the keys end up on one enormous line and everything after is deeply indented.

Fix the indentation at 2 spaces, which is much more compact.
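
For illustration, a treespec repr can be inspected like this (a small sketch using the private pytree helpers):

```python
from torch.utils._pytree import tree_flatten

_, spec = tree_flatten({"a": [1, 2], "b": {"c": 3}})
print(spec)  # nested TreeSpec; children are now indented by a fixed 2 spaces
```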

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103945
Approved by: https://github.com/zou3519
2023-06-21 20:53:03 +00:00
ec8aa6e592 [Easy][FSDP] Fix "column" -> "row" in PG example (#103975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103975
Approved by: https://github.com/fduwjj
2023-06-21 20:41:50 +00:00
a2d001d4dd [FSDP][state_dict] Use _get_module_fsdp_state_if_fully_sharded_module for state_dict (#103783)
Fix https://github.com/pytorch/pytorch/issues/90788
Use a consistent implementation as optim_state_dict

Differential Revision: [D46807090](https://our.internmc.facebook.com/intern/diff/D46807090/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103783
Approved by: https://github.com/awgu, https://github.com/fduwjj
2023-06-21 20:31:30 +00:00
591981c5e2 [inductor] Lower diagonal, diagonal_copy and diagonal_scatter (#103755)
Currently these are decomposed into `as_strided`, which forces a buffer to be
realized. Instead, this lowers them into a native inductor view node and so
doesn't require any buffers to be realized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103755
Approved by: https://github.com/jansel
2023-06-21 20:16:24 +00:00
a61096fb94 [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-21 20:16:24 +00:00
1c79003b3c Enable addmm + GELU epilogue fusion via cuBLASLt (#103811)
Summary:

Previously, addmm + GELU epilogue fusion was unconditionally disabled in `ATen/native/cuda/Blas.cpp` due to compilation and numerical issues in CUDA <= 11.4. This PR:

1. Enables addmm + GELU epilogue fusion for CUDA >= 11.8.

2. Restricts the usage of fused addmm epilogue to contiguous output (bugfix).

3. Extends unit tests with addmm epilogue fusion and GELU activation paths.

Test Plan:

$ python test/test_linalg.py -k test_addmm_relu -v

test_addmm_relu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_relu_cpu_bfloat16) ... ok
test_addmm_relu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float32) ... ok
test_addmm_relu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_relu_cpu_float64) ... ok
test_addmm_relu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_bfloat16) ... ok
test_addmm_relu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float32) ... ok
test_addmm_relu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_relu_cuda_float64) ... ok

$ python test/test_linalg.py -k test_addmm_gelu -v

test_addmm_gelu_cpu_bfloat16 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_bfloat16) ... ok
test_addmm_gelu_cpu_float32 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float32) ... ok
test_addmm_gelu_cpu_float64 (__main__.TestLinalgCPU.test_addmm_gelu_cpu_float64) ... ok
test_addmm_gelu_cuda_bfloat16 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_bfloat16) ... ok
test_addmm_gelu_cuda_float32 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float32) ... ok
test_addmm_gelu_cuda_float64 (__main__.TestLinalgCUDA.test_addmm_gelu_cuda_float64) ... ok

Reviewers: @eellison

Differential Revision: [D46829884](https://our.internmc.facebook.com/intern/diff/D46829884)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103811
Approved by: https://github.com/IvanYashchuk, https://github.com/eellison
2023-06-21 19:59:40 +00:00
1b0d23708b add oncall info individual info to failing alert job alert (#103915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103915
Approved by: https://github.com/huydhn
2023-06-21 19:25:39 +00:00
0beec88c93 Inductor support for all_gather_into_tensor_coalesced. (#98643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98643
Approved by: https://github.com/wanchaol
2023-06-21 19:25:03 +00:00
2adfd1315a [export] Serialize subgraphs. (#103901)
Differential Revision: D46865179

The deserialization part will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103901
Approved by: https://github.com/larryliu0820
2023-06-21 19:17:33 +00:00
6fd358e7f7 [ONNX] FX Dispatcher Test (#103971)
The test of https://github.com/pytorch/pytorch/pull/100660 ....
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103971
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-06-21 19:04:08 +00:00
61cd605813 [decomp] Don't call .item() in aten.fill.Tensor decomp (#103880)
Currently calling the fill.Tensor overload under `torch.compile` results in a
`DataDependentOutputException` due to the `.item()` call. This instead does a
device-device copy which can then be inlined into subsequent inductor kernels as
you would expect, e.g.

```python
def fn(a):
    result = torch.deg2rad(a).sin()
    return torch.empty((128, 128), device=a.device).fill_(result)
```

generates the single kernel
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 16384
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset  + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (0))
    tmp1 = tl.broadcast_to(tmp0, [XBLOCK])
    tmp2 = 0.017453292519943295
    tmp3 = tmp1 * tmp2
    tmp4 = tl.sin(tmp3)
    tl.store(out_ptr0 + (x0), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103880
Approved by: https://github.com/Chillee
2023-06-21 18:45:04 +00:00
785d472861 Skip Tensor-Tensor ops which have a Scalar input (#103928)
The pass was assuming aten.mul.Tensor would have two Tensor inputs, as per its schema, but because of https://github.com/pytorch/pytorch/issues/86128 a Scalar may show up.

Fix for https://github.com/pytorch/pytorch/issues/103924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103928
Approved by: https://github.com/yanboliang
2023-06-21 18:28:28 +00:00
ae1ed27756 [codemod][numpy] replace np.str with str (#103931)
Summary:
`np.str` was removed in numpy 1.20.0. It was an alias for the builtin `str`, so it is safe to do the replacement.

The whole change is mechanical, generated using the following one-liner:
```
fbgr -sl 'np\.str\b' | xargs perl -pi -e 's,\bnp\.str\b,str,g'
```

Test Plan: sandcastle

Differential Revision: D46586144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103931
Approved by: https://github.com/huydhn
2023-06-21 18:16:42 +00:00
72f09faf10 remove CUDA 11.7 builds (#103904)
CC @atalman @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103904
Approved by: https://github.com/malfet, https://github.com/atalman
2023-06-21 18:16:34 +00:00
17ef983516 skip torchinductor test_data_type_propogation if C++ compiler is not available (#103920)
This test is failing internally (https://www.internalfb.com/intern/test/844425024008760/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103920
Approved by: https://github.com/yanboliang, https://github.com/jgong5, https://github.com/jansel
2023-06-21 18:14:50 +00:00
223f232928 Fix shape function for transpose convolution (#102139)
Fixes #98129.
Fixes the shape function for JIT conv_transpose so that it includes output_padding, as defined by the documentation https://pytorch.org/docs/stable/generated/torch.nn.ConvTranspose2d.html#torch.nn.ConvTranspose2d.
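
For reference, the documented output size is `H_out = (H_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1`; a quick sanity check:

```python
import torch
import torch.nn as nn

conv_t = nn.ConvTranspose2d(3, 8, kernel_size=3, stride=2,
                            padding=1, output_padding=1, dilation=1)
x = torch.randn(1, 3, 16, 16)
# (16 - 1) * 2 - 2 * 1 + 1 * (3 - 1) + 1 + 1 = 32
print(conv_t(x).shape)  # torch.Size([1, 8, 32, 32])
```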

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102139
Approved by: https://github.com/mingfeima, https://github.com/davidberard98
2023-06-21 17:50:56 +00:00
678ce61cdb s390x simd: update abs() functions for vectors of complex numbers (#103850)
This change fixes tests
SignManipulation/2.Absolute and SignManipulation/3.Absolute in vec_test_all_types_ZVECTOR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103850
Approved by: https://github.com/kit1980
2023-06-21 16:00:32 +00:00
dbbf24decd Fix counter resetting in pad mm (#103918)
The prior counter reset interacted poorly with an internal test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103918
Approved by: https://github.com/devashishshankar
2023-06-21 15:54:46 +00:00
873f772df2 [quant][pt2] Fix QAT convert for resnet18 (#103759)
Summary:
Before this commit, only prepare QAT numerics matched
between PT2 and FX for resnet18. Convert numerics diverged,
however, for two reasons:

(1) Existing patterns did not handle inplace ReLUs. This commit
fixes this by adding extra patterns that use these ReLUs instead
of the normal ones.

(2) Subgraph rewriter could not handle skip connections in
quantized models, because the dequantize node is used in both
the conv node within the match pattern, and an inplace add node
outside of the match pattern. This led the subgraph matcher to
filter out the match, complaining that it was not self contained.
This commit fixes this problem by duplicating the dequantize
nodes, one for each user, such that subsequent matches will
be self contained.
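
A minimal sketch of the duplication step described above (the dequantize predicate and helper name are illustrative, not the actual code):

```python
import torch.fx as fx

def duplicate_dequant_per_user(graph: fx.Graph, is_dequant) -> None:
    # Give every extra user of a dequantize node its own copy, so each
    # conv/add pattern match stays self-contained.
    for node in list(graph.nodes):
        if not is_dequant(node):
            continue
        users = list(node.users)
        for user in users[1:]:
            with graph.inserting_after(node):
                dup = graph.node_copy(node)
            user.replace_input_with(node, dup)
```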

Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_resnet18

Reviewed By: jerryzh168

Differential Revision: D46564114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103759
Approved by: https://github.com/jerryzh168
2023-06-21 15:36:07 +00:00
f73ff54f9a Use torch._foreach_lerp for SWA update (#103550)
Launch fewer kernels during a SWA update thanks to `torch._foreach_lerp`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103550
Approved by: https://github.com/janeyx99
2023-06-21 15:35:20 +00:00
3cfd677b1f fix inference mode / PyDispatcher / Functionalize interaction (#103275)
Fixes https://github.com/pytorch/pytorch/issues/103132

This is kind of annoying: Functionalization (and also vmap, I think?) manually figures out which ops have C++ CompositeImplicit decomps, and directly registers them to the Functionalize key. This is a problem for the PyDispatcher: We normally want the PyDispatcher to take precedence over the regular dispatcher. But in this case, we have a python decomp registered to `CompositeImplicitAutograd`, and a C++ decomp registered *directly* to the `Functionalize` key, so the C++ decomp gets precedence over the python decomp.

The way this showed up was that a model was running `matmul()` under inference mode, so we never hit the autograd dispatch key, and go straight to the functionalize dispatch key. Matmul has both a python decomp and a c++ decomp, but we were running the C++ decomp. That C++ decomp isn't meant to be used with dynamic shapes, so we were failing with the "tried to call `.sizes()` on a tensor with dynamic shapes" error.

For now, I had the PyDispatcher mimic the behavior of functionalization codegen: when you register a python decomp to the `CompositeImplicitAutograd` key, this PR just automatically registers that decomp to the `Functionalize` key at the same time.

I'm trying to remember now why we didn't just add `Functionalize` (and all of the other functorch transform keys) directly to the `CompositeImplicitAutograd` alias keyset, but I couldn't remember (@zou3519 any chance you remember?).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103275
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-06-21 15:19:55 +00:00
106d3f0115 [AOTAutograd] make _unsafe_view() logic happen during the runtime epilogue (#103919)
Fixes https://github.com/pytorch/pytorch/issues/103153

AOTAutograd has some logic for handling the case when we have:
* a graph output that is a view of an intermediate
* None of the other aliases of that output escape the graph, so from the perspective of the user + the autograd engine, we can pretend that the output is not a view

However, that logic would inject an `_unsafe_view()` call into the graph at trace time. This isn't wrong, but inductor will just immediately decompose `_unsafe_view()` into `view()`, and so the output tensor will continue to show up as having view metadata w.r.t. autograd.

This PR changes the `unsafe_view()` call to be in the runtime epilogue, instead of being part of the graph (where the compiler might do bad things to it - the compiler also shouldn't have to concern itself with autograd metadata).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103919
Approved by: https://github.com/ezyang
2023-06-21 14:37:35 +00:00
7ce932a92c Add signpost_event to dynamic_shapes (#103882)
Added two signpost_event calls to torch.fx.experimental.symbolic_shapes, one for produce_guards (where we can give stats like how many free symbols and how many guards produced) and the other is for evaluate_expr after freeze (so we can look for cases where we're improperly discarding guards in backwards.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103882
Approved by: https://github.com/Skylion007
2023-06-21 13:26:21 +00:00
cyy
716b3b893d Use missing-prototypes in torch_cpu (#103725)
This PR enables -Wmissing-prototypes in torch_cpu, except for some generated cpp files and the mps and metal backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103725
Approved by: https://github.com/albanD
2023-06-21 13:19:55 +00:00
d552c271db [pt2] grad support (#102264)
Teach dynamo about grad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102264
Approved by: https://github.com/zou3519
2023-06-21 10:13:09 +00:00
6d2887cc06 Reland "Move tensor grouping to ATen" (#103912)
This is a reland of https://github.com/pytorch/pytorch/pull/100007 with a build fix for Windows debug builds.
`at::native::ParamsHash` only works on structs with standard layout, but `std::string` isn't one in Visual C++ debug builds, which one can easily verify by running something like:
```cpp
#define _DEBUG
#include <type_traits>
#include <string>
static_assert(std::is_standard_layout_v<std::string>, "Oh noes");
```
If the above condition is not met, instead of printing a static_assert output, VC++ raises very cryptic compilation errors, see https://github.com/pytorch/pytorch/pull/100007#discussion_r1227116292 for more detail.

Also, using `std::hash` for string should result in a faster hash function.

(cherry picked from commit 74b7a6c75e698378882d30958908073407f97fb3)

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 5914771</samp>

This pull request introduces a new function `_group_tensors_by_device_and_dtype` that can group tensors by their device and dtype, and updates the `foreach` utilities and several optimizers to use this function. The goal is to improve the performance, readability, and compatibility of the code that handles tensors with different properties. The pull request also adds a test case and type annotations for the new function, and some error checks for the `fused` argument in Adam and AdamW.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103912
Approved by: https://github.com/janeyx99
2023-06-21 09:26:33 +00:00
b9f81a483a Preserve original co_filename when FX symbolic_trace (#103885)
Previously, you'd get `<eval_with_key>.0`; now you get `<eval_with_key>.0 from /data/users/ezyang/b/pytorch/test/dynamo/test_misc.py:5683 in forward`

I used to do this with globals, but now I do it with a `co_fields` parameter that's plumbed around, because putting things in globals has implications(TM). Happy to bikeshed on the `co_fields` structure.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103885
Approved by: https://github.com/albanD
2023-06-21 08:28:50 +00:00
6b1d6750b9 [FSDP][state_dict][BE] Remove outdated and fixed TODOs (#103782)
Remove outdated and fixed TODOs

Differential Revision: [D46807071](https://our.internmc.facebook.com/intern/diff/D46807071/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103782
Approved by: https://github.com/rohan-varma
2023-06-21 05:41:19 +00:00
1192f5ac46 [FSDP][optim_state_dict] Cleanup the unused optimizer state_dict APIs (#103781)
Cleanup the unused optimizer state_dict APIs

Differential Revision: [D46803955](https://our.internmc.facebook.com/intern/diff/D46803955/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103781
Approved by: https://github.com/rohan-varma
2023-06-21 05:38:48 +00:00
e737a8486f Revert "[pt2] grad support (#102264)"
This reverts commit 85b83954c8820fc7473d8e7b68325fa8ed5753dc.

Reverted https://github.com/pytorch/pytorch/pull/102264 on behalf of https://github.com/huydhn due to This is failing in trunk 85b83954c8 and looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/102264#issuecomment-1600001309))
2023-06-21 03:02:55 +00:00
2642f31e4c Make torch.empty* deterministic by filling with NaN or max int value (#101849)
Part of #82004
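A small sketch of the behaviour being added (assuming, per the linked issue #82004, that the fill is gated on deterministic mode):

```python
import torch

torch.use_deterministic_algorithms(True)

# With deterministic filling, uninitialized memory is overwritten:
print(torch.empty(3))                     # expected: a tensor of NaN
print(torch.empty(3, dtype=torch.int64))  # expected: a tensor of int64 max values
```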

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101849
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/kulinseth
2023-06-21 02:53:22 +00:00
d8352312f9 tf32 threshold fixes for various tests (#103138)
Addresses tf32 threshold related failures from NVIDIA internal testing for following unit tests:

A100:
- test_nn.py: test_Conv2d_groups_thnn_cuda_tf32, test_Conv2d_pad_same_dilated_cuda_tf32, test_Conv2d_groups_cuda_tf32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103138
Approved by: https://github.com/kit1980
2023-06-21 02:25:42 +00:00
85b83954c8 [pt2] grad support (#102264)
Teach dynamo about grad

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102264
Approved by: https://github.com/zou3519
2023-06-21 01:37:08 +00:00
02f28de408 [dynamo x fsdp] Simplify stream logic handling (#103902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103902
Approved by: https://github.com/awgu
2023-06-21 01:34:19 +00:00
39a22e2791 softmax: Triton kernel for BSR inputs (#102095)
Implements `softmax` Triton kernel for BSR inputs. So far, only over `dim=-1`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102095
Approved by: https://github.com/cpuhrsch
2023-06-21 01:23:27 +00:00
ee83c646bb Replace _prims_common.check with torch._check* (#103240)
This relands most of the changes from #102219 which were backed out by #103128. However, instead of removing `_prims_common.check`, it adds a warning and a comment mentioning that it will be removed in the future and `torch._check*` should be used instead. As mentioned in https://github.com/pytorch/pytorch/pull/103128#pullrequestreview-1466414415, `_prims_common.check` cannot yet be removed because of some internal usage
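For context, a small sketch of the `torch._check*` style that replaces `_prims_common.check` (the helper function here is made up; the lazy message callable is the point):

```python
import torch

def narrow_checked(t: torch.Tensor, dim: int, start: int, length: int) -> torch.Tensor:
    # The message is a callable, so it is only constructed if the check fails.
    torch._check(length >= 0, lambda: f"length must be non-negative, got {length}")
    torch._check(
        start + length <= t.size(dim),
        lambda: f"start ({start}) + length ({length}) exceeds size {t.size(dim)} of dim {dim}",
    )
    return t.narrow(dim, start, length)

print(narrow_checked(torch.arange(10), 0, 2, 5))
```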

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103240
Approved by: https://github.com/albanD
2023-06-21 00:46:17 +00:00
f3c3d12efb [vision hash update] update the pinned vision hash (#103869)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103869
Approved by: https://github.com/pytorchbot
2023-06-21 00:18:49 +00:00
e5e9d563c2 Lift user defined attributes into inputs for certain cases (user defined types and tensors) (#103386)
(1) Lazy (converts to dynamo variable on access only)
(2) Uses existing side effect/reconstruct tech
(3) not tensor opinionated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103386
Approved by: https://github.com/jansel
2023-06-20 23:45:19 +00:00
8c2effcaf7 Fix bug for buffer reuse (#103720)
When `allow_buffer_reuse` is disabled in `torch/_inductor/config.py`, buffer reuse still happens. This PR adds the missing check.
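A hypothetical repro sketch (the function is made up; only the `allow_buffer_reuse` flag comes from this PR):

```python
import torch
import torch._inductor.config as inductor_config

# Disable buffer reuse; before this fix the flag was not honored everywhere.
inductor_config.allow_buffer_reuse = False

@torch.compile
def f(x):
    y = x.sin()
    return y.cos() + 1

print(f(torch.randn(8)))
```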

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103720
Approved by: https://github.com/jansel
2023-06-20 23:28:40 +00:00
c9256ac609 add branch and sha info to alerting schema (#103631)
https://github.com/pytorch/pytorch/pull/103897 is used for testing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103631
Approved by: https://github.com/huydhn
2023-06-20 22:58:37 +00:00
a4b9872187 [PyTorch] add Vulkan support for aten::repeat (#103255)
Summary: We implement `aten::repeat` on Vulkan backend through `aten::unsqueeze` and `aten::cat`. The behavior of `aten::repeat` is demonstrated here https://pytorch.org/docs/stable/generated/torch.Tensor.repeat.html
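For reference, the eager behaviour of `Tensor.repeat` that the Vulkan implementation mirrors (taken from the linked documentation):

```python
import torch

x = torch.tensor([1, 2, 3])
print(x.repeat(4, 2))
# tensor([[1, 2, 3, 1, 2, 3],
#         [1, 2, 3, 1, 2, 3],
#         [1, 2, 3, 1, 2, 3],
#         [1, 2, 3, 1, 2, 3]])
print(x.repeat(4, 2).size())  # torch.Size([4, 6])
```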

Test Plan:
`repeat_invalid_inputs_outputs_exceptions` checks the following:
- if the input tensor has dim <= 4
- if the size of `repeats` is >= input.dim
- if the output tensor has dim <= 4

In `test_repeat` we check the following combinations: input is of dim between 1 and 4 and `repeats` is of size between `input.dim()` and 4. If a testcase failed, the shape info is printed, e.g. `Repeat test failed when input is of shape [13, 5, 13] and repeat of [7, 2, 3]`.

```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*repeat*"
Building: finished in 0.1 sec (100%) 263/2811 jobs, 0/2811 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *repeat*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.repeat_invalid_inputs_outputs_exceptions
[       OK ] VulkanAPITest.repeat_invalid_inputs_outputs_exceptions (28 ms)
[ RUN      ] VulkanAPITest.repeat
[       OK ] VulkanAPITest.repeat (46 ms)
[----------] 2 tests from VulkanAPITest (75 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (75 ms total)
[  PASSED  ] 2 tests.
```

Reviewed By: yipjustin

Differential Revision: D46244750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103255
Approved by: https://github.com/SS-JIA
2023-06-20 22:46:35 +00:00
0ae4c4d417 [FSDP][optim_state_dict] Avoid calling optim.state_dict() to get the initial empty states (#103609)

Users may prefix the keys of the optimizer state_dict. Using `optim.state_dict()` to get the initial states is brittle. This PR removes the call to `optim.state_dict()` and directly infers the empty states from the input states.

Differential Revision: [D46729119](https://our.internmc.facebook.com/intern/diff/D46729119/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103609
Approved by: https://github.com/awgu
2023-06-20 22:11:58 +00:00
8ce4fee68d [BE] Use C++17 features in ParamsHash.h (#103911)
- Nested namespaces
- `std::is_standard_layout_v` vs `std::is_standard_layout<>::value`
- Remove unnecessary typecast

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103911
Approved by: https://github.com/kit1980, https://github.com/atalman
2023-06-20 21:53:28 +00:00
a475ea4542 [fx] change from #users to num_users in graph printout (#101140)
`#users` means stuff in various chat apps, which makes it annoying to copypasta graphs into them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101140
Approved by: https://github.com/ezyang
2023-06-20 21:24:32 +00:00
f83ebfe1bb [FSDP] Improve support for CPU tensors. (#103171)
Don't emit device index when using CPU devices.
Don't call Tensor::record_stream as it's a CUDA-only op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103171
Approved by: https://github.com/rohan-varma, https://github.com/wz337
2023-06-20 21:08:19 +00:00
8b37821813 make balance check in DP only for cuda (#103311)
Fixes #103825
1. If we want to use DP on a device other than "cuda", this balance check will raise an error, so this change makes the balance check effective only for `cuda`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103311
Approved by: https://github.com/kit1980
2023-06-20 21:01:57 +00:00
4bd14d97f8 s390x simd: switch clamp min and max order (#103849)
This change makes s390x behave closer to non-simd. It also fixes multiple tests in test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103849
Approved by: https://github.com/kit1980
2023-06-20 20:39:26 +00:00
f7737bb96b Revert "Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103264)"
This reverts commit 03881b0c925f191ec41d6899d589ed420ac285b5.

Reverted https://github.com/pytorch/pytorch/pull/103264 on behalf of https://github.com/osalpekar due to This commits seems to have been causing failures in test_nccl_init_abort. Those failures may have been masked by pre-existing failures in the distributed jobs on trunk when running CI on this PR. Since those breaking changes are now reverted, we should be able to rebase this and get clean signal + uncover the breakages caused by this PR. ([comment](https://github.com/pytorch/pytorch/pull/103264#issuecomment-1599451197))
2023-06-20 20:29:43 +00:00
d06fc1bfda [PyTorch] Add Vulkan support and tests for at::softmax along all dimensions for 4-dim Tensors (#102988)
Summary:
Extending support for the [Softmax function](https://pytorch.org/docs/stable/generated/torch.nn.Softmax.html) on the PyTorch Vulkan GPU backend.

# Before
1.  Softmax could only be calculated along dim=1, AKA along channel with NCHW convention
2. Softmax input Vulkan Tensor must have had size 1 along dim=0, AKA batch size of 1 with NCHW convention
3. Softmax input Vulkan Tensor must be 4-dimensional, AKA NCHW

# After
1. Softmax can be calculated along any dim={0,1,2,3}
2. Softmax input Vulkan Tensor can have any size along dim=0
3. Softmax input Vulkan Tensor must be 4-dimensional, AKA NCHW

Test Plan:
1. `buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the added tests `*softmax*` pass under `-- --gtest_filter="*softmax*"`
2a. All tests P758913494
2b. `softmax` tests P758910449
3. Overview:

```
~/fbsource » buck2 run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[...]

[ RUN      ] VulkanAPITest.softmax_4d
[       OK ] VulkanAPITest.softmax_4d (69 ms)

[...]

[----------] 275 tests from VulkanAPITest (3149 ms total)

[----------] Global test environment tear-down
[==========] 275 tests from 1 test suite ran. (3149 ms total)
[  PASSED  ] 274 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log

```

Reviewed By: SS-JIA

Differential Revision: D45880611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102988
Approved by: https://github.com/SS-JIA
2023-06-20 20:19:05 +00:00
8391618b99 [Inductor][FX passes] Pre grad pass modified graph should be topological sorted (#103794)
Error found in the 14k GitHub models suite.
Minimized repro: please check the unit test I added
Error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Argument 'getitem_2' of Node 'sin' was used before it has been defined! Please check that Nodes in the graph are topologically ordered
graph():
    %l_x_ : torch.Tensor [#users=1] = placeholder[target=L_x_]
    %sin : [#users=1] = call_function[target=torch.sin](args = (%getitem_2,), kwargs = {})
    %unbind : [#users=2] = call_function[target=torch.unbind](args = (%l_x_,), kwargs = {dim: 0})
    %getitem_2 : [#users=1] = call_function[target=operator.getitem](args = (%unbind, 0), kwargs = {})
    %getitem_3 : [#users=1] = call_function[target=operator.getitem](args = (%unbind, 1), kwargs = {})
    %sin_1 : [#users=1] = call_function[target=torch.sin](args = (%getitem_3,), kwargs = {})
    %stack : [#users=1] = call_function[target=torch.stack](args = ([%sin, %sin_1],), kwargs = {})
    return (stack,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103794
Approved by: https://github.com/devashishshankar
2023-06-20 20:05:47 +00:00
974525c053 De-register forward hooks upon exiting flop counter context (#103744)
This PR fixes https://github.com/pytorch/pytorch/issues/103684.
- Instead of registering forward hooks in `__init__()`, do it upon `__enter__()`.
- De-register those forward hooks upon `__exit__()`.
- Achieve this by saving an additional mapping `_module_to_forward_hook_handles: Dict[nn.Module, _ForwardHookHandles]`. Only the values in the mapping (i.e. not the keys) are useful for this change. (A `List[_ForwardHookHandles]` would suffice.)
- The unit test accesses private attributes `_forward_hooks` and `_forward_pre_hooks` :/

Note that this PR is technically not backward compatible since it does not register the hooks upon `__init__()`, which means that you will not get the flops counting without the context manager.
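A minimal sketch of the register-on-enter / remove-on-exit pattern this PR adopts (a simplified stand-in, not the actual FlopCounterMode code):

```python
import torch
import torch.nn as nn
from typing import List

class HookedCounter:
    def __init__(self, module: nn.Module):
        self.module = module
        self.calls = 0
        self._handles: List[torch.utils.hooks.RemovableHandle] = []

    def _hook(self, mod, inputs, output):
        self.calls += 1

    def __enter__(self):
        # Hooks are registered only when the context is entered.
        for m in self.module.modules():
            self._handles.append(m.register_forward_hook(self._hook))
        return self

    def __exit__(self, *exc):
        # Hooks are removed so the module is left untouched afterwards.
        for h in self._handles:
            h.remove()
        self._handles.clear()

model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
with HookedCounter(model) as counter:
    model(torch.randn(2, 4))
print(counter.calls)  # hooks fired only inside the context
```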
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103744
Approved by: https://github.com/Chillee
2023-06-20 19:34:02 +00:00
54ff8ffedd Add Thiago Crepaldi (ONNX) to CODEOWNERS (#103894)
Adding @thiagocrepaldi.

_Note also that the errors exist in `main` as well as the following users do not have write access anymore: @z-a-f, @robieta, @NivekT._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103894
Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi
2023-06-20 19:24:26 +00:00
3a53dbae2a Update viable/strict script to ignore unstable jobs (#103899)
As distributed jobs had been failing in the past few days, viable/strict branch hasn't been updated since June 15th.  The issue was discovered when looking into nightly https://hud.pytorch.org/hud/pytorch/pytorch/nightly which sync with viable/strict.

Despite the fact that the failing job had been marked as unstable by https://github.com/pytorch/pytorch/issues/103612, the script still counted it as a failure https://github.com/pytorch/pytorch/actions/runs/5319411414/jobs/9631875636, and we kind of forgot to monitor the viable/strict delay, so we didn't notice this earlier.  An alarm would probably need to be set up for this.

I also updated the Rockset query a bit to add a comment on what it's used for.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103899
Approved by: https://github.com/clee2000, https://github.com/seemethere, https://github.com/malfet
2023-06-20 19:24:20 +00:00
036cda415f Change HigherOrderOperator default namespace from global to 'higher_order' (#103870)
This PR changes the default namespace for higher order operators from the
global namespace (e.g. torch.ops.cond) to `higher_order` (e.g.
torch.ops.higher_order.cond). We don't actually change the namespace
for existing HigherOrderOperators.

The motivation is to stem the bleeding; exposing operators into the global
namespace is a bad idea due to name collision with other user-defined
namespaces.

We will go in and fix the `_deprecated_global_ns` as necessary after this diff.

Differential Revision: [D46809738](https://our.internmc.facebook.com/intern/diff/D46809738/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103870
Approved by: https://github.com/ydwu4
2023-06-20 19:10:55 +00:00
3ca8542dff Fix _saved_tensors argument issue in test (#103594)
Summary:
fix broken test in

https://github.com/pytorch/pytorch/issues/103460

Test Plan: pytest ./generated/test_pabloppp_pytorch_tools.py -k test_015

Differential Revision: D46723640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103594
Approved by: https://github.com/yanboliang
2023-06-20 19:03:41 +00:00
d52d1fd5ba add description for unexpected case (#103500)
Fixes #88547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103500
Approved by: https://github.com/mingfeima, https://github.com/mikaylagawarecki
2023-06-20 19:02:45 +00:00
f730e22b5b [cpp] remove redundant code (#103808)
These lines are already available in the header files. Link: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/AdaptivePooling.h#L21-L27

I think we don't need them in the individual `.cpp` files.

cc'ing @Skylion007, as I've seen you working on a lot of great C++ stuff. Could you please confirm? Thanks!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103808
Approved by: https://github.com/Skylion007
2023-06-20 19:00:31 +00:00
e031dd23b0 Revert "To add brief intro for CPU backend optimization (#103666)"
This reverts commit 013ffe457e79180d6aa3b82f20116052faee242a.

Reverted https://github.com/pytorch/pytorch/pull/103666 on behalf of https://github.com/huydhn due to Failing doc tests in trunk 013ffe457e ([comment](https://github.com/pytorch/pytorch/pull/103666#issuecomment-1599301270))
2023-06-20 18:33:01 +00:00
2722c52e52 Allow Unequality in top level IR too (#103746)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103746
Approved by: https://github.com/wconstab
2023-06-20 18:27:55 +00:00
50d8cf27e1 Fix annotations on torch function signatures (#103807)
Fixes #103806

- `reduction` related functions are now automatically generated from yaml registration.
- `Optional` or `Union` with `None` is properly added where it was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103807
Approved by: https://github.com/ezyang
2023-06-20 18:08:01 +00:00
013ffe457e To add brief intro for CPU backend optimization (#103666)
This PR adds a brief introduction to x86 CPU backend optimization. Per previous discussion, the former PR #103307 was closed and this one was created, with the contents put into a new file.
@Guobing-Chen @jgong5 @mingfeima @jingxu10 please help review, thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103666
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-20 17:35:22 +00:00
b1ddd5a293 Revert "[DDP] multiple forward support for static graph (#103487)" (#103873)
Per the discussion in https://github.com/pytorch/pytorch/pull/103629#issuecomment-1598001313, I preemptively create this revert PR to revert all commits in the stack.  This seems like a safer option than using the bot as the commit has already been in trunk since last week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103873
Approved by: https://github.com/rohan-varma
2023-06-20 16:25:00 +00:00
7b6dc72ffa Revert "[decomp] Decompose logaddexp2 (#103765)"
This reverts commit bab21d20ebf45a5dc620b48791bb526f664445a5.

Reverted https://github.com/pytorch/pytorch/pull/103765 on behalf of https://github.com/ezyang due to looks like land race ([comment](https://github.com/pytorch/pytorch/pull/103765#issuecomment-1599030496))
2023-06-20 15:35:02 +00:00
a39466c934 Modify DeviceThreadHandles.h file for device generic. (#95133)
…sed for other devices.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95133
Approved by: https://github.com/ezyang
2023-06-20 15:19:11 +00:00
bab21d20eb [decomp] Decompose logaddexp2 (#103765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103765
Approved by: https://github.com/Chillee
2023-06-20 09:24:21 +00:00
d4b85f3031 Support params/buffers inside cond and map (#102310)
With #102022, params and buffers are always treated as a special case of free variables. In this PR, I switch the cond and map implementations to this method and deprecate the old tracing mechanism.

Differential Revision: [D46746202](https://our.internmc.facebook.com/intern/diff/D46746202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102310
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2023-06-20 05:33:10 +00:00
1be1f5090e [Dynamo] Fix broken NNModule comparison (#103812)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103812
Approved by: https://github.com/msaroufim
2023-06-20 04:01:24 +00:00
1dba81f56d Abstract FX->ONNX translation through FxOnnxInterpreter (#102810)
# Summary by author

* Previous to this PR, FX-to-ONNX conversion logic was sprinkled across several functions, files and classes, such as `_export_fx_to_onnx`, `export_fx_to_onnxscript`, `_export_fx_node_to_onnxscript` and the `OnnxDispatcher` class[1]. Although each had its specific role in the lowering of FX, they all are part of the same lowering process.
* A `FxOnnxInterpreter` class, similar to but not derived from `torch.fx.Interpreter`, is introduced to drive the FX Graph -> ONNX Graph process. All functions and utilities from the previous bullet were moved under this class with minor refactoring.
  * One of the main changes is that each FX node type now has its own entry point, which lowers complexity and provides isolation between them.
  *  Why refactor as a class and not as a bunch of functions? The ONNX exporter has adopted an object-oriented paradigm since its origin, so this refactoring should not be seen as any break of paradigm. This is just a continuation of a previous design decision. Examples of other classes include `Exporter`, `ExportOptions`, `ExportOutput`, `ExportOutputSerializer`, `ProtobufExportOutputSerializer`, `FXGraphExtractor`, `ResolvedExportOptions`, `Analysis`, `Diagnostic`, `DiagnosticContext`, `Decompose`, `Functionalize`, `MovePlaceholderToFront`, `RemoveInputMutation`, `ReplaceGetAttrWithPlaceholder`, `ShapeInferenceWithFakeTensor`, `OnnxRegistry`, `OnnxDispatcher`, just to name a few.
  * `torch.fx.Interpreter` was not used because its API only passes the node name (aka `target`) instead of the actual `torch.fx.Node` object to the node implementations. This is not sufficient as the ONNX conversion process needs to inspect the node to extract type, name and other info from the node.
* This PR renames `OnnxDispatcher` (without functionality changes) to `OnnxFunctionDispatcher` for clarity; the "ONNX" prefix alone was too overloaded in this context.
* This PR also moves the `passes` and `serialization` handling out of the `_export_fx_to_onnx` util and into `Exporter.export`, where they are consumed. Passes are not the goal of this PR, so they were moved to a temporary function called `pre_export_passes` (mainly the content of `_export_fx_to_onnx` without serialization and the fx -> onnx call).
* This PR also removes a bug in which a new registry (and dispatcher, which wouldn't be a problem) was created for each pass. That would be an issue with custom operators because only the original registry would have a reference to the custom operator.

Below is a summarized structure of the export process:

```python
class Exporter
    def export(self) -> ExportOutput:
        # 1) Trace torch.nn.Module into torch.fx.GraphModule
        graph_module = self.options.fx_tracer.generate_fx()

        # 2) Adapt input and output types to match ONNX graph
        updated_model_args = self.options.fx_tracer.input_adapter.apply()

        # 3) Run pre-export passes
        graph_module = pre_export_passes()

        # 4) Dispatch each FX node to an ONNX operator implementation
        #  Model level FX -> ONNX.
        fx_interpreter = fx_onnx_interpreter.FxOnnxInterpreter()
        fx_interpreter.run()

        # 5) Serialize graph to ONNX ModelProto.
        onnx_model = onnxscript_graph.to_model_proto(self.options.opset_version)

        # 6 Create ExportOutput
        return torch.onnx.ExportOutput()

class FxOnnxInterpreter:  # NOT a torch.fx.Interpreter
    def run(self, node: torch.fx.Node):
        with torch.utils._mode_utils.no_dispatch():
            for node in self.graph_module.graph.nodes:
               run_node(node)
   def run_node(node):
        if node.op == "placeholder":
            self.placeholder(node)
        elif node.op == "get_attr":
            self.get_attr(node)
        elif node.op == "call_function":
            self.call_function(node)
        elif node.op == "call_method":
            self.call_method(node)
        elif node.op == "call_module":
            self.call_module(node)
        elif node.op == "output":
            self.output(node)
        else:
            raise RuntimeError(
                f"Found node type not defined in torch.fx: {node.op}"
            )
    def placeholder(self, node: torch.fx.Node):
        pass
    def call_function(self, node: torch.fx.Node):
        pass
    def output(self, node: torch.fx.Node):
        pass
    def call_method(self, node: torch.fx.Node):
        pass
    def call_module(self, node: torch.fx.Node):
        pass
    def get_attr(self, node: torch.fx.Node):
        pass

class OnnxFunctionDispatcher:
    def dispatch(
        self,
        node: torch.fx.Node,
        onnx_args: Sequence[Optional[Union[_TensorLike, str, int, float, bool, list]]],
        onnx_kwargs: Dict[str, _type_utils.Argument],
        diagnostic_context: diagnostics.DiagnosticContext,
    ) -> Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]:
        pass

    def get_aten_name(  # promoted to public API
        self, node: torch.fx.Node, diagnostic_context: diagnostics.DiagnosticContext
    ) -> str:
        pass

    def get_function_overloads(  # promoted to public API
        self,
        node: torch.fx.Node,
        aten_name: str,
        diagnostic_context: diagnostics.DiagnosticContext,
    ) -> Set[Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]]:
        pass
```

Before this PR, that was the structure of the code:

```python
# torch/onnx/_internal/exporter.py
class Exporter:
def export(self) -> ExportOutput:
    graph_module = self.options.fx_tracer.generate_fx(
    self.options, self.model, self.model_args, self.model_kwargs
        )

    updated_model_args = self.options.fx_tracer.input_adapter.apply(
        *self.model_args, **self.model_kwargs
    )

    return self.options.fx_tracer._export_fx_to_onnx(
        self.options, graph_module, updated_model_args
    )

# torch/onnx/_internal/exporter.py
class FXGraphExtractor
    def _export_fx_to_onnx() -> ExportOutput:  # Ignore the fact it lives inside FXGraphExtractor. It was a temporary thing
        # Run all passes
        # ...
        with torch.utils._mode_utils.no_dispatch():
            onnxscript_graph = passes.export_fx_to_onnxscript()

            # Run input adapter
            # ...

            # Run output adapter
            # ...

        # Export TorchScript graph to ONNX ModelProto.
        onnx_model = onnxscript_graph.to_model_proto(options.opset_version)

        # Create ExportOutput
        return torch.onnx.ExportOutput()

# torch/onnx/_internal/fx/passes/fx_to_onnxscript.py
def export_fx_to_onnxscript():

    # Initialize the ONNX graph
    onnxscript_graph = graph_building.TorchScriptGraph()
    tracer = graph_building.TorchScriptTracingEvaluator(onnxscript_graph)

    for node in fx_module_with_metadata.graph.nodes:
        _export_fx_node_to_onnxscript()

    return onnxscript_graph

# torch/onnx/_internal/fx/passes/fx_to_onnxscript.py
def _export_fx_node_to_onnxscript():
    if node.op == "placeholder":
        # ...
    elif node.op == "call_function":
        symbolic_fn = options.onnx_dispatcher.dispatch()

        with evaluator.default_as(tracer):
            output = symbolic_fn(*onnx_args, **onnx_kwargs)
    elif node.op == "output":
        # ...
    elif node.op == "call_method":
        # ...
    elif node.op == "call_module":
        # ...
    elif node.op == "get_attr":
        # ...
    else:
        raise RuntimeError(f"Found node type not defined in torch.fx: {node.op}")

# torch/onnx/_internal/fx/function_dispatcher.py
class OnnxDispatcher:
    @_beartype.beartype
    def dispatch() -> Union["onnxscript.OnnxFunction", "onnxscript.TracedOnnxFunction"]:
        # ONNX Script lookup only
```
[1]
Note that the main functionality of the fx -> onnx lowering is orchestrated by functions in different files (see below).

Although the main loop that drives the dispatching is executed by a well-defined function (`export_fx_to_onnxscript`), this is not the entry point of the export process. The entry point is a utility function called `_export_fx_to_onnx`, which calls `export_fx_to_onnxscript`, which in turn calls another utility called `_export_fx_node_to_onnxscript`. Lastly, `_export_fx_node_to_onnxscript` implements *all* FX nodes in a huge monolithic block. The "call_function" segment of that block consumes `OnnxDispatcher`, completing the cycle.

```bash
_export_fx_to_onnx                torch/onnx/_internal/exporter.py
_export_fx_node_to_onnxscript     torch/onnx/_internal/fx/fx_to_onnxscript.py
export_fx_to_onnxscript           torch/onnx/_internal/fx/fx_to_onnxscript.py
OnnxDispatcher                    torch/onnx/_internal/fx/onnxfunction_dispatcher.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102810
Approved by: https://github.com/wschin, https://github.com/BowenBao
2023-06-20 02:40:54 +00:00
724a1ba2de Tidy __all__ under torch._refs (#103712)
- Added ops that were missing under `__all__`.
- Some misc changes to helper functions to make them private.
- Set correct `fn.__module__` for `fn` created by `_make_alias`, when called in another module.

All modifications largely reference results from a hacked version of `test_public_bindings::test_correct_module_names`.
By default `torch._refs` is not included in the test because it is technically a private package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103712
Approved by: https://github.com/lezcano
2023-06-20 00:04:58 +00:00
5d34656fd7 Update dynamo sum dtype handling to match eager (#103037)
The current behaviour for dynamo is to set the dtype to torch.int64 for integral types if the dtype is not specified explicitly, which results in mismatched behaviour compared to eager mode (illustrated by the sketch after this list). In eager mode the semantics are:
- If both out is specified and dtype is specified then they have to match
- If dtype is not specified but out is specified then the dtype is set to match the out dtype
- If neither dtype nor out is set then the dtype is set to kLong if it is a bool or an integral type
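A quick illustration of the default-promotion rule (the last bullet's behaviour; the `out`/`dtype` rules follow analogously):

```python
import torch

x = torch.ones(4, dtype=torch.int32)

print(x.sum().dtype)                   # torch.int64 -- integral input, no dtype/out given
print(x.sum(dtype=torch.int32).dtype)  # torch.int32 -- explicit dtype is respected
print(torch.ones(4).sum().dtype)       # torch.float32 -- floating inputs keep their dtype
```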

Fixes #100698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103037
Approved by: https://github.com/ngimel
2023-06-19 22:26:37 +00:00
13ef0ec186 Add "slow" tests to list of disable conditions (#103856)
Companion PR to https://github.com/pytorch/test-infra/pull/4306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103856
Approved by: https://github.com/huydhn
2023-06-19 21:22:35 +00:00
def1b57151 Update datapipe.py (#103834)
change 'dp' to 'source_dp'

Fixes #103833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103834
Approved by: https://github.com/kit1980
2023-06-19 18:05:56 +00:00
55814bb46e [CI] Limit service jobs to Pytorch org (#103853)
Otherwise, everybody who forks the repo will try to run those

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103853
Approved by: https://github.com/huydhn
2023-06-19 17:47:57 +00:00
3e42854caa [xla hash update] update the pinned xla hash (#103827)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103827
Approved by: https://github.com/pytorchbot
2023-06-19 10:39:45 +00:00
9832cfbbfe Quantization oneDNN backend only support VNNI CPU (#103653)
**Summary**

- Update the quantization documentation to state that the default qconfig with the oneDNN backend is recommended for use on CPUs with Vector Neural Network Instruction (VNNI) support.
- Add a warning message when a user uses the default qconfig with the oneDNN backend on a CPU without VNNI support (see the usage sketch below).
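A usage sketch of the configuration this applies to (assuming a PyTorch build where the oneDNN quantized engine is available):

```python
import torch
from torch.ao.quantization import get_default_qconfig

# Select the oneDNN backend; per this change, a warning is emitted when the
# default qconfig is used on a CPU that lacks VNNI support.
torch.backends.quantized.engine = "onednn"
qconfig = get_default_qconfig("onednn")
print(qconfig)
```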

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-19 09:50:07 +00:00
7b3242d5f7 [PyTorch Vulkan] fix bug of aten::cat for concatenation of 3D tensors at channel dim with channels as multiple of 4 (#103718)
Summary: The original `cat_feature_mult4ch` assumes input tensors are 4D and uses `tensor.sizes()[1]` to obtain the channel info of the tensor. This causes bugs when the input tensors are 3D. We generalize `cat_feature_mult4ch` so that it covers both 3D and 4D.

Test Plan:
The test for 3D tensors with channels as a multiple of 4 is shown below. The full test result is in P771032677.
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*cat_3d_dim0_mult4ch_success*"
Building: finished in 0.1 sec (100%) 263/2812 jobs, 0/2812 updated
  Total time: 0.1 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *cat_3d_dim0_mult4ch_success*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.cat_3d_dim0_mult4ch_success
[       OK ] VulkanAPITest.cat_3d_dim0_mult4ch_success (129 ms)
[----------] 1 test from VulkanAPITest (129 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (129 ms total)
[  PASSED  ] 1 test.
```

Reviewed By: SS-JIA

Differential Revision: D46755034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103718
Approved by: https://github.com/SS-JIA
2023-06-19 06:30:44 +00:00
79fe3aef2f inductor: fix issue of computing index_expr range (#103147)
On the CPU inductor side, there is an optimization that converts ```int64``` index_expr to ```int32``` for better performance (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/cpp.py#L2034). However, for a ```ModularIndexing``` expr we replace it with a division (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/optimize_indexing.py#L73; ```ModularIndexing``` doesn't have a derivative) to compute the derivative and then compute the expr's value range, and we may hit an issue where the min value is greater than the max value (```ModularIndexing(513*i2 + i3 + 262400, 512, 513)```, with vars_ranges ```{i2: ValueRanges(lower=0, upper=256), i3: ValueRanges(lower=0, upper=513)}```).

One solution is to not replace ```ModularIndexing```, but then we can't get the value range.
Another solution is to return an ```inf``` range when the min value is greater than the max value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103147
Approved by: https://github.com/jgong5, https://github.com/eellison
2023-06-19 04:39:02 +00:00
01abccf63f inductor: fix CppTile2D bf16 store compiler error for cpp backend (#103659)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103659
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-19 00:46:30 +00:00
adeb63de95 [CI] Fix a bug that bfloat16 is also used for dashboard training run (#103816)
Summary: The past two training runs were on bfloat16. Let's merge this ASAP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103816
Approved by: https://github.com/Chillee
2023-06-18 23:05:00 +00:00
15eed5b73e [Oncall][MTPG] Fix flaky test multi_threaded - test_broadcast_object_list (#103568)
This test (8340762211/test/distributed/test_multi_threaded_pg.py (L133)) is failing on the internal sandbox with the following error msg:
```
  File "/data/sandcastle/boxes/eden-trunk-hg-fbcode-fbsource/buck-out/v2/gen/fbcode/8c7462494077df89/caffe2/test/distributed/__multi_threaded__/multi_threaded#link-tree/torch/testing/_internal/distributed/multi_threaded_pg.py", line 255, in _start_coll
    raise Exception(
Exception: world not ready, only 3 PG's registered but world has 4 ranks
 exiting thread 1
ERROR
```

Internal error report: https://www.internalfb.com/intern/test/562950031915334?ref_report_id=0

We believe this is because we no longer perform a barrier after init (see https://github.com/pytorch/pytorch/pull/99937).
This PR temporarily turns ```TORCH_DIST_INIT_BARRIER``` back on to avoid the flaky test for the time being, but we should look into finding a way to do this properly.

cc. @kumpera @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103568
Approved by: https://github.com/H-Huang
2023-06-18 07:05:28 +00:00
59a01c49ee [Reland][ET] Select used et_kernel_metadata only (#103705)
Currently we rely on root operators, but we also need to check et_kernel_metadata for the specialized kernels that are used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103705
Approved by: https://github.com/larryliu0820
2023-06-18 00:33:28 +00:00
cyy
f5f020adb0 add override to Caffe2 (#103795)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103795
Approved by: https://github.com/kit1980
2023-06-17 19:46:40 +00:00
0a7351e9ee [Doc] Fix torch.UntypedStorage.mps() doc (#103797)
Fix doc typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103797
Approved by: https://github.com/kit1980
2023-06-17 18:56:18 +00:00
1b16ac7481 Add A Pass to Fold Tensors With a Uniform Value, match sdpa on a few models (#103600)
Adds a Constant Folding pass to the joint graph only targeting tensors which can be replaced with a single value, and then removes no-ops from the graph. This allows us to match sdpa in BertForMaskedLM, AlbertForMaskedLM, and LayoutLMForMaskedLM.

BertForMaskedLM
Perf: 1.6853 -> 1.933, Memory: 0.9462 -> 1.41

AlbertForMaskedLM
Perf: 1.6620 -> 1.761, Memory: 1.257 -> 1.94

LayoutLMForMaskedLM
Perf: (non cudagraphs) 1.6991 -> 1.939x, Memory: 0.9624 -> 1.50

MobileBertForMaskedLM
Perf: 1.864x -> 1.941x, Memory: 0.94 -> 1.03

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103600
Approved by: https://github.com/jansel
2023-06-17 16:50:51 +00:00
dbc8eb2a8f [Quant][PT2E]Enable x86 inductor quantizer (#98730)
**Summary**

- Enable `X86InductorQuantizer` basics.
- Recipe to annotate conv2d is added.

**Test Plan**
```
python -u -m pytest -s -v test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98730
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-06-17 06:10:23 +00:00
2357498a8c s390x simd: ensure that vectorized complex constructor behaves same to x86 (#103426)
This change fixes multiple tests,
including test_noncontiguous_samples_lerp_cpu_complex64 from test/test_ops.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103426
Approved by: https://github.com/malfet
2023-06-17 02:40:51 +00:00
a2988c9e6a [CI] Switch inference accuracy and performance tests to bfloat16 (#103535)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103535
Approved by: https://github.com/eellison
2023-06-17 00:24:37 +00:00
918fe519a0 Use the new analytics ID (#103766)
Re: https://github.com/pytorch/pytorch.github.io/issues/1397
Following the migration to latest google analytics
FYI @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103766
Approved by: https://github.com/svekars
2023-06-16 23:21:08 +00:00
9541053cca [dynamo] support FakeTensor for SYM_INT/SYM_INT_LIST/INT_LIST param in python-to-cpp argument parsing (#103448)
Before the PR, when compiling a function whose signature takes symint/symintlist/intlist, we hit a runtime error like ```argument 'shifts' must be tuple of ints, not FakeTensor```. See the newly added unit test in test/dynamo/test_misc.py for a repro.

After the PR, for a FakeTensor with empty size and numel()=1, we will try to convert it into a symint/symintlist. We will likely see the expected exception ```torch._subclasses.fake_tensor.DataDependentOutputException / aten._local_scalar_dense.default``` during conversion.

Reference PRs:
* we handle FakeTensor for symintlist as the 1st vararg: https://github.com/pytorch/pytorch/pull/97508
* we handle FakeTensor for intlist in a similar way:
https://github.com/pytorch/pytorch/pull/85759/files
* call local_scalar_dense on a FakeTensor:
f7365eca90

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103448
Approved by: https://github.com/yanboliang
2023-06-16 21:33:40 +00:00
b34ac35b77 Revert "Use hipsolver for default svd case on ROCm (#103540)"
This reverts commit 0a4a7d4b26ab5c789df4dc690686e6a7d06b1ed0.

Reverted https://github.com/pytorch/pytorch/pull/103540 on behalf of https://github.com/huydhn due to Turn out that the failure discussed in https://github.com/pytorch/pytorch/issues/102629 is not a fluke and ROCm signal in trunk is red atm ([comment](https://github.com/pytorch/pytorch/pull/103540#issuecomment-1595309297))
2023-06-16 20:59:40 +00:00
750cbb299b [RPC] Check stack for emptiness in interpreter (#103327)
Hi! I found heap-buffer-overflow during PyTorch RPC-module fuzzing.

[crash-9cc26b8da3b688a9c26614481239943b357c5636.zip](https://github.com/pytorch/pytorch/files/11707706/crash-9cc26b8da3b688a9c26614481239943b357c5636.zip)

```
    "==10634==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6060001b6a98 at pc 0x000000639a2e bp 0x7fffffff9100 sp 0x7fffffff90f8",
    "READ of size 4 at 0x6060001b6a98 thread T0",
    "    #0 0x639a2d in c10::IValue::isTensor() const /pytorch/aten/src/ATen/core/ivalue.h:432:27",
    "    #1 0x639a2d in c10::IValue::toTensor() && /pytorch/aten/src/ATen/core/ivalue_inl.h:159:7",
    "    #2 0xc5eb105 in at::Tensor c10::IValue::to<at::Tensor>() && /pytorch/aten/src/ATen/core/ivalue_inl.h:1690:1",
    "    #3 0xc5eb105 in void torch::jit::pop<at::Tensor>(std::vector<c10::IValue, std::allocator<c10::IValue> >&, at::Tensor&) /pytorch/aten/src/ATen/core/stack.h:130:55",
    "    #4 0xc5eaedb in torch::jit::dtype(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/mobile/promoted_prim_ops.cpp:105:3",
    "    #5 0xcc79600 in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:682:13",
    "    #6 0xcc4158b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:1052:9",
    "    #7 0x60f378 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential.cc:66:38",
    "    #8 0x610bb9 in LLVMFuzzerTestOneInput /jit_differential.cc:107:25",
    "    #9 0x535c91 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #10 0x51fb9c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #11 0x5258eb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #12 0x54eea2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #13 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #14 0x51a4bd in _start (/jit_differential_fuzz+0x51a4bd)",
    "",
    "0x6060001b6a98 is located 8 bytes to the left of 64-byte region [0x6060001b6aa0,0x6060001b6ae0)",
    "allocated by thread T0 here:",
    "    #0 0x60c66d in operator new(unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/asan/asan_new_delete.cpp:95:3",
    "    #1 0xa5a41b in std::_Vector_base<c10::IValue, std::allocator<c10::IValue> >::_M_allocate(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:346:20",
    "    #2 0xa5a41b in void std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_insert<c10::IValue&>(__gnu_cxx::__normal_iterator<c10::IValue*, std::vector<c10::IValue, std::allocator<c10::IValue> > >, c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:440:33",
    "    #3 0xa5a241 in c10::IValue& std::vector<c10::IValue, std::allocator<c10::IValue> >::emplace_back<c10::IValue&>(c10::IValue&) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/vector.tcc:121:4",
    "    #4 0xcc8209c in torch::jit::InterpreterStateImpl::runImpl(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:345:19",
    "    #5 0xcc4158b in torch::jit::InterpreterStateImpl::run(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/jit/runtime/interpreter.cpp:1052:9",
    "    #6 0x60f378 in runGraph(std::shared_ptr<torch::jit::Graph>, std::vector<at::Tensor, std::allocator<at::Tensor> > const&) /jit_differential.cc:66:38",
    "    #7 0x610bb9 in LLVMFuzzerTestOneInput /jit_differential.cc:107:25",
    "    #8 0x535c91 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #9 0x51fb9c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #10 0x5258eb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #11 0x54eea2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #12 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "",
    "SUMMARY: AddressSanitizer: heap-buffer-overflow /pytorch/aten/src/ATen/core/ivalue.h:432:27 in c10::IValue::isTensor() const",
    "Shadow bytes around the buggy address:",
    "  0x0c0c8002ed00: 00 00 00 00 00 00 00 fa fa fa fa fa fd fd fd fd",
    "  0x0c0c8002ed10: fd fd fd fd fa fa fa fa fd fd fd fd fd fd fd fd",
    "  0x0c0c8002ed20: fa fa fa fa fd fd fd fd fd fd fd fd fa fa fa fa",
    "  0x0c0c8002ed30: fd fd fd fd fd fd fd fd fa fa fa fa 00 00 00 00",
    "  0x0c0c8002ed40: 00 00 00 00 fa fa fa fa fd fd fd fd fd fd fd fd",
    "=>0x0c0c8002ed50: fa fa fa[fa]00 00 00 00 00 00 00 00 fa fa fa fa",
    "  0x0c0c8002ed60: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c0c8002ed70: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c0c8002ed80: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c0c8002ed90: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "  0x0c0c8002eda0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa",
    "Shadow byte legend (one shadow byte represents 8 application bytes):",
    "  Addressable:           00",
    "  Partially addressable: 01 02 03 04 05 06 07",
    "  Heap left redzone:       fa",
    "  Freed heap region:       fd",
    "  Stack left redzone:      f1",
    "  Stack mid redzone:       f2",
    "  Stack right redzone:     f3",
    "  Stack after return:      f5",
    "  Stack use after scope:   f8",
    "  Global redzone:          f9",
    "  Global init order:       f6",
    "  Poisoned by user:        f7",
    "  Container overflow:      fc",
    "  Array cookie:            ac",
    "  Intra object redzone:    bb",
    "  ASan internal:           fe",
    "  Left alloca redzone:     ca",
    "  Right alloca redzone:    cb",
    "==10634==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103327
Approved by: https://github.com/Skylion007
2023-06-16 20:12:51 +00:00
f1b367c418 [BE] Nested namespace in ATen/native headers (#103753)
Use nested namespaces and `enum class` in `ATen/native` headers.
In particular, this helps avoid polluting the global namespace with `MAX`/`MIN` enum values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103753
Approved by: https://github.com/atalman, https://github.com/Skylion007
2023-06-16 19:51:45 +00:00
fd4beb7a05 Better function annotations for nn.functional (#102918)
Fixes #102768

- Provides proper function declarations in generated `torch/nn/functional.pyi`.
- Moves some functions from manually defined in `functional.pyi.in` to generated code, in order to single-source the signature.
- Includes some of the functions in `torch._C._nn` into its `.pyi.in`, but not exhaustive (only what's already there).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102918
Approved by: https://github.com/drisspg, https://github.com/malfet
2023-06-16 19:48:04 +00:00
36ff9879de update multipy pin to not use install options (#103758)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103758
Approved by: https://github.com/huydhn
2023-06-16 19:36:55 +00:00
d80174e2db Do not materialize entire randperm in RandomSampler (#103339)
In our DDP training workloads, each rank was initializing a `RandomSampler` for a dataset with a length of 3.5 billion items. We noticed that when this sampler was in scope, `gc.collect` calls were taking on the order of seconds to run, which would slow down the entire training iteration. This is because when we call `torch.randperm(n).tolist()`, we create a python list of 3.5 billion items, which massively slows down the periodic mark & sweep garbage collection.

This PR swaps out the `.tolist()` call with a `.numpy()` call and manually calls `.item()` on each element as it is being requested. This has two benefits:

1. The first call to `RandomSampler::__next__` should be about twice as fast, since `.numpy` does not copy the contents of the original tensor
2. The runtime of `gc.collect()` calls no longer scales linearly with the size of the dataset passed to `RandomSampler`

I've attached some `timeit` samples to illustrate the speedups with this PR:

```
Main (no GC):  51.72115747816861
Main (10 GC calls) 83.61965207383037
PR (no GC) 33.06403830461204
PR (10 GC calls) 33.959467427805066
```

Code
```python
from timeit import timeit

baseline_no_gc = """
import torch

n = int(1e9)
steps = n // 100

x = torch.randperm(n).tolist()
x_iter = iter(x)

for i in range(steps):
    next(x_iter)
"""

baseline_gc = """
import torch
import gc
n = int(1e9)
steps = n // 100
gc_every = steps // 10

x = torch.randperm(n).tolist()
x_iter = iter(x)

for i in range(steps):
    next(x_iter)
    if i % gc_every == 0:
        gc.collect()
"""

numpy_no_gc = """
import torch
n = int(1e9)
steps = n // 100

x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)

for i in range(steps):
    next(x_iter)
"""

numpy_gc = """
import torch
import gc
n = int(1e9)
steps = n // 100
gc_every = steps // 10

x = torch.randperm(n).numpy()
x_iter = (i.item() for i in x)

for i in range(steps):
    next(x_iter)
    if i % gc_every == 0:
        gc.collect()
"""

if __name__ == "__main__":
    print("Main (no GC): ", timeit(baseline_no_gc, number=1))
    print("Main (10 GC calls)", timeit(baseline_gc, number=1))
    print("PR (no GC)",  timeit(numpy_no_gc, number=1))
    print("PR (10 GC calls)", timeit(numpy_gc, number=1))

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103339
Approved by: https://github.com/kit1980
2023-06-16 19:25:58 +00:00
67babf7a45 Enhance decorator _use_grad_for_differentiable (#103567)
Aim: enhance the decorator _use_grad_for_differentiable so that functions (methods) decorated by it keep their docstrings and signatures unchanged.
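A minimal sketch of the idea (a simplified stand-in, not the actual optimizer internals): wrapping with `functools.wraps` keeps the decorated function's docstring and inspectable signature intact.

```python
import functools
import torch

def _use_grad_for_differentiable(func):
    @functools.wraps(func)  # preserves __doc__, __name__, and the inspectable signature
    def _use_grad(self, *args, **kwargs):
        prev = torch.is_grad_enabled()
        try:
            # Enable grad only when the optimizer was constructed as differentiable.
            torch.set_grad_enabled(self.defaults.get("differentiable", False))
            return func(self, *args, **kwargs)
        finally:
            torch.set_grad_enabled(prev)
    return _use_grad
```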

Fixes #103566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103567
Approved by: https://github.com/janeyx99
2023-06-16 18:33:31 +00:00
5875a2fb3c [Inductor][FX passes] Forward fix an internal unit test failure. (#103739)
Summary: Forward-fix a corner case; see the failure at D46689080.

Test Plan: Existing tests

Differential Revision: D46789312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103739
Approved by: https://github.com/devashishshankar
2023-06-16 17:28:29 +00:00
8fc687f7ee Add activation functions (ReLU and SiLU for now) for structured sparse linear operator (#101339)
Differential Revision: [D46453476](https://our.internmc.facebook.com/intern/diff/D46453476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101339
Approved by: https://github.com/cpuhrsch
2023-06-16 17:24:59 +00:00
0da38409a0 [gloo] Make it possible for gloo TCPStore to take over an existing socket fd (#103478)
Summary:
This diff allows the `TCPStore` server associated with a gloo process group to listen on an existing socket already bound to a port.

Without the functionality in this diff, canonical initialization of a gloo `ProcessGroup` is fundamentally racy: 1) ask the OS for a free port by creating a socket bound to port 0, 2) close the socket, 3) attempt to initialize a `TCPStore` server that listens on the previously free port. Of course, the problem is that in between steps 2 and 3, another process on the host may have claimed the port, causing `TCPStore` and overall process group initialization to fail. With this diff, it is now possible for users to completely avoid this race (see unit test for how this can be achieved).
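For illustration, the racy pattern described above, sketched in Python (host and port values are hypothetical; the PR's fix of handing an already-bound listening socket fd to the `TCPStore` server is not shown here):

```python
import socket
import torch.distributed as dist

# 1) Ask the OS for a free port by binding to port 0 ...
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
free_port = s.getsockname()[1]
# 2) ... close the socket, releasing the port ...
s.close()
# 3) ... then start a TCPStore server on that port. Another process on the host
#    can claim the port between steps 2 and 3, so this can fail.
store = dist.TCPStore("127.0.0.1", free_port, world_size=1, is_master=True)
```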

Test Plan:
Added new unit test:
  buck2 test caffe2/test/distributed:store

Differential Revision: D46622317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103478
Approved by: https://github.com/H-Huang
2023-06-16 17:15:56 +00:00
2bc56bec07 [quant][pt2] Handle literal conv args in convert QAT (#103731)
Summary:
Similar to the prepare case, we need to manually copy
over literal conv args such as padding and stride to the new,
replaced conv nodes, since these args are not captured by the
subgraph rewriter.

Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_fusion_literal_args

Reviewed By: jerryzh168

Differential Revision: D46383130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103731
Approved by: https://github.com/jerryzh168
2023-06-16 17:15:37 +00:00
08a054649c [operator_compile_check] Add FakeTensor testing (#103595)
This PR adds dedicated FakeTensor testing to operator_compile_check. We
reuse CrossRefFakeMode to do this and improve the error messages on it.

Note that this only really runs detailed tests for operators that do not
have data-dependent output shape. In the future we should add something
like a dynamic CrossRefFakeMode.

Test Plan:
- existing tests (these now have improved error messages).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103595
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2023-06-16 16:55:51 +00:00
23c143400e use mutable_data_ptr for grad_input in backward passes (#98999)
Summary:

Test Plan: Rely on CI.

Reviewers: ezyang

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98999
Approved by: https://github.com/ezyang
2023-06-16 16:30:40 +00:00
0a4a7d4b26 Use hipsolver for default svd case on ROCm (#103540)
Fixes #102678
Fixes #102629
Fixes #102558
hipSOLVER performance on ROCm 5.4.2 and later is no longer a massive bottleneck. Additionally, using MAGMA on ROCm in this case caused test_compare_cpu_linalg_pinv_singular_cuda_float32 to fail. With hipSOLVER, the test now passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103540
Approved by: https://github.com/lezcano
2023-06-16 14:57:34 +00:00
b27c3558a4 [RFC]: Create aten native op for constrain_range (#103346)
At a high level, the current implementation of the constraint functions (constrain_as_**) will raise an exception for the following code snippet:
```
def f(x):
    a = x.item()
    constrain_as_size(a, 4, 7)
    return torch.empty((a, 4))

inp = torch.tensor([5])
ep = torch._export.export(f, (inp,))
```

The reason is that the current constraint logic:
1) Is purely Python, so it won't survive AOT export (the full node is gone after AOT export since AOT export only keeps aten-level ops).
2) Relies on a side effect to add range constraints to the traced symbol's shape env ([code](9591e52880/torch/fx/experimental/symbolic_shapes.py (L370-L372))).
3) Runs only if runtime assertions are turned on (the default): [`_AddRuntimeAssertionsForConstraintsPass`](9591e52880/torch/_export/passes/add_runtime_assertions_for_constraints_pass.py (L98-L100)) will try to append assertion nodes based on the range constraints extracted from each symbol's shape env during another interpretation round.
4) However, because of 1), during AOT export the range-constraint logic won't run for symbols generated in that round, so no range-constraint information is available for the later assertion round, which causes the issue.
5) As a result of the above, it fails at `torch.empty((a, 4))` (there is no constraint that `a` must be positive).

The fix here is to implement the range-constraint logic as a native aten op (with a no-op CPU implementation) so that it can survive AOT export.

**NOTE:**
[Logic](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (L350-L365C15)) within [`constrain_range`](2d745b95d7/torch/fx/experimental/symbolic_shapes.py (LL313C74-L313C74)) is split out as `constrain_range_int` to capture the case where a non-`SymInt` is passed in, and it is reused in the new `_constrain_range`. The reason is that when a non-`SymInt` is provided:
* If it directly calls `sym_constrain_range`, the C++ version is called, which is a no-op.
* So in this case it calls `constrain_range_int` instead, to be able to catch issues such as the user providing an input whose tensor shape is out of range during export, like the following for the code example above:
```
...
inp = torch.tensor([10])
ep = torch._export.export(f, (inp,)) # immediately raise error
```

Differential Revision: [D46734204](https://our.internmc.facebook.com/intern/diff/D46734204)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103346
Approved by: https://github.com/tugsbayasgalan
2023-06-16 14:55:40 +00:00
df814484f4 remove dynamo fake param/buf check (#103574)
Fixes #103569
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103574
Approved by: https://github.com/ezyang
2023-06-16 14:19:37 +00:00
ae78e80123 [memory_viz] fix javascript url (#103741)
It turns out that jsdelivr, which is used to access the MemoryViz.js
source from generated files, doesn't work unless a version is specified.

This wasn't able to be tested until the PR actually landed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103741
Approved by: https://github.com/aaronenyeshi
2023-06-16 13:15:45 +00:00
ad4ee297ed allow cpu scalar to be moved to xpu in masked_fill (#103645)
# Motivation
Aligning with the CUDA behavior, allow a CPU scalar to be moved to the XPU device in masked_fill.

# Solution
Add "xpu" support in condition control.

# Additional
No additional UTs are needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103645
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-16 12:15:43 +00:00
d3971f2d15 [ONNX] Support aten::hstack and aten::vstack (#102872)
https://github.com/microsoft/onnxscript/pull/762 is actually not used by FX graph.
Fixes https://github.com/microsoft/onnx-converters-private/issues/168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102872
Approved by: https://github.com/justinchuby
2023-06-16 06:20:14 +00:00
f889c886d4 [export] Make pass base composable (#103701)
Moving ExportTracer so that EXIR can subclass it to do handling for delegates, and ExportPassBase can use the correct tracer. Upstreaming OSS changes in D45884895 first
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103701
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan, https://github.com/ydwu4
2023-06-16 06:07:18 +00:00
0411fc6ab6 [ONNX] Support aten::atleast_1d and aten::atleast_2d and aten::atleast_3d (#103061)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103061
Approved by: https://github.com/justinchuby
2023-06-16 06:07:00 +00:00
703875e364 [Reland][Dynamo] VariableTracker.recursively_contains should be updated correctly when mutation happens (#103564) (#103717)
Summary: Reland of https://github.com/pytorch/pytorch/pull/103564

Test Plan: contbuild & OSS CI, see 5c3556da94

Differential Revision: D46783727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103717
Approved by: https://github.com/angelayi
2023-06-16 04:25:27 +00:00
b287cb816c inductor: make the vec_transpose's tiling stride doesn't depend on out_idx and tiling_idex (#103651)
For the TIMM swin_base_patch4_window7_224 dynamic-shape path, there is an accuracy issue with horizontal reduction with vec_transpose:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
{
    #pragma GCC ivdep
    for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
    {
        {
            #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={{0}})
            float tmp_acc0 = 0;
            auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0);
            for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(16L))
            {
                float tmp1[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr1 + static_cast<long>(i2 + (128L*(static_cast<long>((static_cast<long>(i1) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer(i1, 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>(i1) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer(i1, 392L))) + (401408L*i0)), static_cast<long>(((-50176L)*(at::native::div_floor_integer(i1, 392L))) + ((-6272L)*(at::native::div_floor_integer((static_cast<long>(i1) % static_cast<long>(56L)), 7L))) + ((-896L)*(static_cast<long>(at::native::div_floor_integer(i1, 56L)) % static_cast<long>(7L))) + ((-128L)*(static_cast<long>((static_cast<long>(i1) % static_cast<long>(56L))) % static_cast<long>(7L))) + (128L*(static_cast<long>((static_cast<long>((1L + i1)) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer((1L + i1), 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>((1L + i1)) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer((1L + i1), 392L)))), tmp1, 16);
                for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (3136L*i2) + (3136L*i2_inner) + (401408L*i0)));
                    auto tmp2 = at::vec::Vectorized<float>::loadu(tmp1 + static_cast<long>(16L*i2_inner));
                    auto tmp3 = tmp0 + tmp2;
                    tmp_acc0_vec = tmp_acc0_vec + tmp3;
                }
            }
            tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (3136L*i0)));
        }
    }
}
```

The ```transpose_mxn```'s ```ld_src``` depends on ```i1```, which is not expected. This PR adds a check to make sure the tiling stride doesn't depend on the out_idx (```i2```) or the tiling_idx (```i1```).

After this PR, the generated code will be like this:
```
#pragma omp for
for(long i0=static_cast<long>(0L); i0<static_cast<long>(ks0); i0+=static_cast<long>(1L))
{
    #pragma GCC ivdep
    for(long i1=static_cast<long>(0L); i1<static_cast<long>(3136L); i1+=static_cast<long>(16L))
    {
        {
            #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out = omp_out + omp_in) initializer(omp_priv={{0}})
            float tmp_acc0 = 0;
            auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0);
            for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(16L))
            {
                for (long i2_inner = 0; i2_inner < 16; i2_inner++)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i1 + (3136L*i2) + (3136L*i2_inner) + (401408L*i0)));
                    auto tmp1 = ([&]() { __at_align__ float tmpbuf[16]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr1[static_cast<long>(i2 + i2_inner + (128L*(static_cast<long>((static_cast<long>((i1 + i1_inner)) % static_cast<long>(56L))) % static_cast<long>(7L))) + (896L*(static_cast<long>(at::native::div_floor_integer((i1 + i1_inner), 56L)) % static_cast<long>(7L))) + (6272L*(at::native::div_floor_integer((static_cast<long>((i1 + i1_inner)) % static_cast<long>(56L)), 7L))) + (50176L*(at::native::div_floor_integer((i1 + i1_inner), 392L))) + (401408L*i0))]; return at::vec::Vectorized<float>::loadu(tmpbuf); })();
                    auto tmp2 = tmp0 + tmp1;
                    tmp_acc0_vec = tmp_acc0_vec + tmp2;
                }
            }
            tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i1 + (3136L*i0)));
        }
    }
}
```

How to reproduce this issue:
```
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --accuracy --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only swin_base_patch4_window7_224
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103651
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-16 03:56:39 +00:00
19b3e07fe0 [memory_viz] Unified viewer (#103565)
This replaces the individual visualization routines in _memory_viz.py with
a single JavaScript application.

The javascript application can load pickled snapshot dumps directly using
drag/drop, requesting them via fetch, or by embedding them in a webpage.

The _memory_viz.py commands use the embedding approach.
We can also host MemoryViz.js on a webpage to use the drag/drop approach, e.g.
https://zdevito.github.io/assets/viz/
(eventually this should be hosted with the pytorch docs).

All views/multiple cuda devices are supported on one page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103565
Approved by: https://github.com/eellison, https://github.com/albanD
2023-06-16 03:49:48 +00:00
346feb6b56 [memory_viz] profile_plot generates snapshot objects (#103497)
This will make it easier to use a single html viewer for
both ways of generating the data. The next PR will change MemoryPlot.js
to simply read the snapshot information directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103497
Approved by: https://github.com/eellison
2023-06-16 03:49:48 +00:00
efc3bcceb1 Move memory viz templates into separate javascript files (#103474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103474
Approved by: https://github.com/eellison
2023-06-16 03:49:46 +00:00
69969e52c3 Cast computation_node_input_size to int (#103677)
This bandaid fixes yolov3 with automatic_dynamic_shapes.
A more proper fix would probably be to figure out why, when we
have

```
TypeError: mkldnn_reorder_conv2d_weight(): argument 'input_size' (position 6) must be tuple of ints, but found element of type SymInt at pos 1
```

where the SymInt is known to be constant, we aren't willing to
coerce it to int.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103677
Approved by: https://github.com/voznesenskym
2023-06-16 03:31:34 +00:00
bcf2becaf2 [vision hash update] update the pinned vision hash (#103721)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103721
Approved by: https://github.com/pytorchbot
2023-06-16 03:17:59 +00:00
d1effcd4a9 Don't apply automatic_dynamic_shapes if we force tensor to be static (#103673)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103673
Approved by: https://github.com/voznesenskym
2023-06-16 03:05:42 +00:00
39ba2e6226 Allow for sympy.Expr in tensor lowering in inductor (#103678)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103678
Approved by: https://github.com/voznesenskym
2023-06-16 02:41:23 +00:00
dad29f906b [quant][pt2] Fix no conv bias in convert QAT (#103298)
Summary:
Previously, the QAT pattern for conv + bn with no conv
bias was not actually replaced in convert. This commit adds an
extra pattern in the convert path for this case and the numerics
now match FX's.

Test Plan: python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_no_conv_bias

Reviewed By: jerryzh168

Differential Revision: D46382819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103298
Approved by: https://github.com/jerryzh168
2023-06-16 01:59:48 +00:00
a52b6f086d [export][serde] Add validator to compare deserializer opset version with model opset version (#103691)
This PR adds a validator to compare the model opset version and the deserializer opset version. This currently raises an exception if any of the versions don't match.

Note: the validator only prints a warning if an op namespace in the model is missing from the deserializer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103691
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-06-16 01:36:43 +00:00
1f5ee39c6c [reland][inductor] Make clone_graph copy node name as well (#103688)
Summary: This solves an inconsistency between two-pass fusion results when turning on cpp wrapper.
This is a reland of https://github.com/pytorch/pytorch/pull/103409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103688
Approved by: https://github.com/jansel
2023-06-16 00:32:07 +00:00
806a642eb1 update README.md to reflect current build from source status on master (#92729)
Signed-off-by: Mike Brown <brownwm@us.ibm.com>

To avoid issues for new contributors building master, a couple of README.md comments will help. This change:

~~- Documents the current support restriction to apt package `g++-9` #91328 ** noting here that with the commit in https://github.com/pytorch/pytorch/pull/92911 g++-11.3 appears to build and run local tests at least as well as g++9, so this restriction may be overcome with that PR merge depending on success and CI updates.~~ (fixed now)

- Documents wip status for CUDA 12 #91122 (by forwarding to support matrix per suggestion)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92729
Approved by: https://github.com/kit1980
2023-06-16 00:21:01 +00:00
38f35b4fc3 Add some missing disabled functions (#103662)
Disable Adadelta, rprop, multitensor, and the fused optimizers

Fixes https://github.com/pytorch/benchmark/actions/runs/5167132765/jobs/9307817625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103662
Approved by: https://github.com/janeyx99
2023-06-16 00:11:13 +00:00
ecf4ce7a0e Silence CUDA graph tree cuda warning (#103636)
Fixes

```
/data/home/marksaroufim/miniconda/envs/saf/lib/python3.8/site-packages/torch/_inductor/cudagraph_trees.py:85: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
  if torch.has_cuda:
Traceback (most recent call last):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103636
Approved by: https://github.com/eellison
2023-06-15 23:55:59 +00:00
03881b0c92 Ensure ncclCommAbort can abort stuck ncclCommInitRank (#103264)
https://github.com/pytorch/pytorch/pull/95715 added the functionality to abort `ncclCommInitRankConfig` by specifying `blocking=0` to enable non-blocking behavior.

However, calling `pg._abort()` didn't recover from a stuck `ncclCommInitRankConfig` since the `_abort` method only looked through the `devNCCLCommMap_` map and aborted those communicators. Since `ncclCommInitRankConfig` was stuck, the communicator itself wasn't added to the map and the host thread was stuck on this line: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L1171. As a result, `_abort` was a no-op.

To resolve this issue, I added the communicators to `inProgressCommMap_` as soon as they were created and then removed them once added to `devNCCLCommMap_`.

I also added a unit test that was failing without the changes to ProcessGroupNCCL.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103264
Approved by: https://github.com/kwen2501
2023-06-15 23:40:22 +00:00
1985c490fe [inductor] Fix tags for inductor random ops (#103648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103648
Approved by: https://github.com/eellison, https://github.com/jansel
2023-06-15 23:27:55 +00:00
8c54cd434f [inductor] Fix allow_buffer_reuse=False (#103630)
Fixes #103461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103630
Approved by: https://github.com/anijain2305
2023-06-15 22:50:01 +00:00
7c152376b7 [Easy] Dont truncate cudagraph error msg (#103693)
We're erroring anyway, we don't want to cut off important context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103693
Approved by: https://github.com/davidberard98, https://github.com/Skylion007
2023-06-15 22:21:10 +00:00
5f979d400c [inductor] let coordinate descent tuning respect max block size (#103660)
It turns out that we need to fix https://github.com/pytorch/pytorch/issues/103656 in the coordinate descent tuner.

Inductor generates Triton code with an assumption about the max block size. If Inductor is sure that numel is a multiple of the max block size, it will safely skip checking the corresponding mask for performance reasons.

The coordinate descent tuner previously did not respect this assumption and could pick a Triton config with an even larger block size. That causes an IMA (illegal memory access).

BTW, I was wondering how we pick those max block sizes. Not enforcing a max block size may allow the coordinate descent tuner to find an even better config, but it may slow down other cases a bit because of the extra mask check.

Test:

```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --amp --performance --inference --inductor --only alexnet
```
Fail before and works after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103660
Approved by: https://github.com/spectrometerHBH, https://github.com/jansel
2023-06-15 22:15:51 +00:00
155691a7d9 Implement meta functions for rshift and lshift (#103637)
Fixes #103606

I was using this script to exercise the new code, because I can never remember which test it is.
```
import torch

@torch.compile(fullgraph=True, dynamic=True)
def shift_right(tensor: torch.Tensor) -> torch.Tensor:
    return (tensor >> 2).to(torch.long)

def main():
    sample_input = torch.tensor([4, 4, 16, 32], dtype=torch.uint8)
    print(shift_right(sample_input))

if __name__ == "__main__":
    main()
```
And iterated through the error messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103637
Approved by: https://github.com/ezyang
2023-06-15 21:49:22 +00:00
6f655d4195 Add symbolic tracing support to torch._dynamo.export (fake input + weights) (#100017)
Fixes #95900
Using the following repro as a guide:

```python
import torch
import torch._dynamo
from torch._subclasses import fake_tensor
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._dynamo.output_graph import config
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)
        self.linear2 = torch.nn.Linear(2, 2)

    def forward(self, x):
        out = self.linear(x)
        out = self.linear2(out)
        return out

fake_mode = fake_tensor.FakeTensorMode(allow_non_fake_inputs=False,
                                       allow_fallback_kernels=True,
                                       shape_env=ShapeEnv(
                                            allow_scalar_outputs=config.capture_scalar_outputs,
                                            allow_dynamic_output_shape_ops=config.capture_dynamic_output_shape_ops,
                                            frame_id=0
                                        ),
)
# Fakefying input/model before calling torch._dynamo.export
with fake_mode:
    fake_x = torch.rand(5, 2, 2)
    model = Model()

# Calling torch._dynamo.export without active fake mode
graph_module, guards = torch._dynamo.export(
    model,
    fake_x,
    aten_graph=True,
    fake_mode=fake_mode
)
graph_module.print_readable()
graph_module.graph.print_tabular()
```

Summary of changes:

    * Plumb fake_mode through the torch.export API. When specified, it
    replaces the creation of a new FakeTensorMode at InstructionTranslator on behalf of OutputGraph.
    * Hack FakeTensor.__new__ to prevent a torch.tensor._make_subclass call
    for inputs that are already fakefied by the user. This probably needs
    to be fixed in a nicer way. Any ideas?
    * Removed a few asserts that didn't want faked tensors coming
    from the user script.
    * Added torch._subclasses.fake_tensor.FakeTensor to the type list in a
    few assert checks to allow fake inputs.

The changes above allowed symbolic tracing with both static and dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100017
Approved by: https://github.com/ezyang
2023-06-15 21:28:10 +00:00
f61b248d5b [BE][Functorch] Use nested namespace (#103685)
As we are a C++17 project now

Also, replace `enum` with `enum class` to limit the visibility of enum values to the namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103685
Approved by: https://github.com/zou3519
2023-06-15 20:56:19 +00:00
def01eafc5 [BE] Remove unused dim_plane from reflection_pad2d_backward_out_template (#103680)
Probably introduced by https://github.com/pytorch/pytorch/pull/102254

This fixes a `variable 'dim_plane' set but not used` warning; my clang-14.0.3 compiler complained about it:
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/ReflectionPad.cpp:272:7: error: variable 'dim_plane' set but not used [-Werror,-Wunused-but-set-variable]
  int dim_plane = 0;
      ^
1 error generated.
```

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at e254b4b</samp>

> _`dim_plane` is gone_
> _Simpler code, no more warning_
> _Autumn leaves fall fast_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103680
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2023-06-15 20:43:57 +00:00
8553f9c896 Revert "[ET] Select used et_kernel_metadata only (#103658)"
This reverts commit 480d20cac109836a44971af774184d9a2d98748e.

Reverted https://github.com/pytorch/pytorch/pull/103658 on behalf of https://github.com/malfet due to Broke Windows builds ([comment](https://github.com/pytorch/pytorch/pull/103658#issuecomment-1593696503))
2023-06-15 20:41:45 +00:00
22e8a61d9b Implement coalesced reduce_scatter_tensor (#103561)
Map of #101157.

This PR adds support for coalesced `reduce_scatter_tensor` calls in the following syntax:

Sync communication style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])
```

Async communication style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.reduce_scatter_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the reduce-scatters' results
```
Each `reduce_scatter_tensor` call can be independent in terms of its data and buffer locations, but the calls can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103561
Approved by: https://github.com/fegin
2023-06-15 20:11:12 +00:00
da7ca82121 [inductor] Store real inputs to be used for cpp wrapper codegen (#103289)
Summary: defaked args (zeros) may cause a device-side access assertion, so
record the original real tensor inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103289
Approved by: https://github.com/jansel, https://github.com/eellison
2023-06-15 20:05:50 +00:00
ed3a61afcc Add automatic_dynamic_shapes test configuration (#103598)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103598
Approved by: https://github.com/Skylion007
2023-06-15 19:55:57 +00:00
480d20cac1 [ET] Select used et_kernel_metadata only (#103658)
Currently we rely on the root operators, but we also need to check et_kernel_metadata for the specialized kernels that are used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103658
Approved by: https://github.com/larryliu0820
2023-06-15 19:05:04 +00:00
0cd6ebd704 optimize replication padding performance on CPU (#102255)
The major difference from the previous PR on ReflectionPad is the padding indexing struct, `ReplicationPad::index()`; the rest is pretty much the same.

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.265 ms;
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 52.336 ms;

(after)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.048 ms;
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.199 ms;
```

### single socket inference
```
(before)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.111 ms;
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.885 ms;

(after)
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.011 ms;
ReplicationPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.148 ms;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102255
Approved by: https://github.com/cpuhrsch
2023-06-15 18:42:36 +00:00
d1cecd9c32 Add assign kwarg to module.load_state_dict (#102212)
Fixes #64601 and #98906

Adds an `assign` argument to `load_state_dict` that loads params/buffers by assignment instead of doing `param.copy_(param_from_state_dict)`.

Primarily intended to remove the need for the `.to_empty()` in

```
with torch.device('meta'):
    m = SomeModule()
m.to_empty()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict)
```

so we can instead do

```
with torch.device('meta'):
    m = SomeModule()
state_dict = torch.load('...pth')
m.load_state_dict(state_dict, assign=True)
```

**A problem with this PR, when the model is initialized on meta: what happens to non-persistent buffers/params corresponding to keys missing from the state dict?**
What happens when `load_state_dict(state_dict, strict=False, assign=True)` is called and the state_dict is missing some keys? The corresponding params missing from the `state_dict` and the non-persistent buffers would still be on `meta` and need to be manually initialized. However, I don't think we offer an API that would initialize these.

One solution would be to make these empty tensors but it might not be semantically correct...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102212
Approved by: https://github.com/albanD
2023-06-15 18:41:00 +00:00
73be9842be Revert "[Dynamo] VariableTracker.recursively_contains should be updated correctly when mutation happens (#103564)"
This reverts commit 5c3556da9406f814e6a1286cb6762e5508d54971.

Reverted https://github.com/pytorch/pytorch/pull/103564 on behalf of https://github.com/ZainRizvi due to Broke internal builds ([comment](https://github.com/pytorch/pytorch/pull/103564#issuecomment-1593552435))
2023-06-15 18:40:51 +00:00
9f39123d18 Allow to continue when fail to configure Windows Defender (#103454)
Windows Defender will soon be removed from the AMI.  Without the service, the step fails with the following error:

```
Set-MpPreference : Invalid class
At C:\actions-runner\_work\_temp\1f029685-bb66-496d-beb8-19268ecbe44a.ps1:5 char:1
+ Set-MpPreference -DisableRealtimeMonitoring $True
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : MetadataError: (MSFT_MpPreference:root\Microsoft\...FT_MpPreference) [Set-MpPreference],
    CimException
    + FullyQualifiedErrorId : HRESULT 0x80041010,Set-MpPreference
```

For example, https://github.com/pytorch/pytorch-canary/actions/runs/5267043497/jobs/9521809176.  This is expected as the service is completely removed.

Here are all the places where `Set-MpPreference` is used according to https://github.com/search?type=code&q=org%3Apytorch+Set-MpPreference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103454
Approved by: https://github.com/atalman
2023-06-15 18:30:58 +00:00
3e9eaa1a12 [GHF] Fix regression
Introduced by https://github.com/pytorch/pytorch/pull/103679

That was not covered by tests :(

Discovered while looking at https://github.com/pytorch/pytorch/actions/runs/5281833681/jobs/9555936751#step:5:24

Locally tested by running `python -c "from trymerge import gh_get_pr_info;print(gh_get_pr_info('pytorch', 'pytorch', 103685)['author'])"`
2023-06-15 10:48:58 -07:00
e6108e8533 [caffe2] Create deterministic zip archives (#102903)
Summary: Ensure that we create deterministic zip archives for the same inputs to make builds deterministic.

Test Plan: CI

Reviewed By: StanislavGlebik

Differential Revision: D46417033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102903
Approved by: https://github.com/malfet
2023-06-15 17:45:19 +00:00
90ef8d58cf [export] Serialize metadata (#103274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103274
Approved by: https://github.com/zhxchen17
2023-06-15 17:34:12 +00:00
7b5f8988a2 [GHF] Auth when trying to fetch labels (#103679)
There were a few merge bot failures reported recently due to HTTP/403 errors:
- https://github.com/pytorch/pytorch/actions/runs/5269083146/jobs/9526693976#step:6:80
- https://github.com/pytorch/pytorch/actions/runs/5272750075/jobs/9535376256#step:6:93

This likely stems from the fact that the `_fetch_url` method did not try to pass the auth token even when it was available and, as a result, was rate-limited to 60 requests per hour, according to [Resources in the REST API](https://docs.github.com/en/rest/overview/resources-in-the-rest-api?apiVersion=2022-11-28#rate-limit-headers).

Refactor `gh_fetch_url` into `gh_fetch_url_and_headers` and use it from `request_for_labels` to utilize the auth token, if available, which bumps the rate limit to 1000 per hour, and print a more actionable message when the rate limit is exceeded.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 499b805</samp>

> _`gh_fetch_url` splits_
> _returns headers and body_
> _wrapper function_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103679
Approved by: https://github.com/jeanschmidt, https://github.com/kit1980
2023-06-15 17:03:55 +00:00
cyy
f2900420da fix missing-prototypes warnings in torch_cpu (Part 6) (#101845)
This PR fixes more missing-prototypes violations in the torch_cpu source following PRs #100053, #100147, #100245, #100849 and #101788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101845
Approved by: https://github.com/albanD
2023-06-15 16:48:28 +00:00
e75f7994e1 Fix Dirichlet.log_prob() when x=0 and alpha=1 (#103605)
`Dirichlet.log_prob()` incorrectly returns NaN in the case where $x_i=0$ and $\alpha_i=1$.  The Dirichlet PDF is given by:
$$\frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$$
So this corresponds to the case where one of the terms has the form $0^0=1$. The logarithm of such a term should be 0, but you get NaN if you try to calculate it as `0 * log(0)`.

This PR implements the same algorithm that `scipy.stats.dirichlet` uses to avoid this behavior, namely `xlogy(alpha - 1, x)` instead of `(alpha - 1) * log(x)`.  It also adds a test case comparing the pytorch and scipy implementations for this specific case.
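
A quick numerical illustration of the case being fixed (not taken from the PR itself):
```python
import torch

x = torch.tensor(0.0)       # x_i = 0
alpha = torch.tensor(1.0)   # alpha_i = 1

# Naive formulation: (alpha - 1) * log(x) == 0 * log(0) -> nan
print((alpha - 1) * torch.log(x))   # tensor(nan)

# xlogy defines 0 * log(0) as 0, matching the 0**0 == 1 term in the PDF
print(torch.xlogy(alpha - 1, x))    # tensor(0.)
```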
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103605
Approved by: https://github.com/albanD
2023-06-15 16:16:50 +00:00
2f893d04c8 Implement adding bias vector into structured sparse linear operator (#100881)
Differential Revision: [D46453477](https://our.internmc.facebook.com/intern/diff/D46453477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100881
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-06-15 16:16:09 +00:00
e56cdfd74b [MPS] Handle deserialization more permissively (#98834)
MPS deserialization should handle `mps:0`.
It is generated from code like the following:

```python
torch.rand(size=(3, 4)).to("mps")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98834
Approved by: https://github.com/kulinseth, https://github.com/kit1980, https://github.com/malfet
2023-06-15 15:51:03 +00:00
bc6ec97e02 Switch dynamic_shapes to True by default (#103597)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103597
Approved by: https://github.com/voznesenskym
2023-06-15 15:16:20 +00:00
cyy
5642b5a36f enable performance-noexcept-move-constructor in clang-tidy (#103593)
Using noexcept as much as possible can improve code performance significantly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103593
Approved by: https://github.com/albanD
2023-06-15 14:38:47 +00:00
f0360c99ca Properly account for empty lists in symbol_to_source (#103633)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103633
Approved by: https://github.com/albanD
2023-06-15 13:05:13 +00:00
96c23fe212 [dynamo][numpy] Add support for builtin functions (#103457)
In order to be able to run stuff like:
```
def f(x):
    a = x.numpy()
    return a + a
```
This PR adds a branch in `BuiltinVariable` to handle `NumpyNdarrayVariable` case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103457
Approved by: https://github.com/ezyang
2023-06-15 09:18:45 +00:00
da21273ad5 inductor: support rsqrt for dynamic shape (#103579)
Fix compiler error for HF hf_BigBird dynamic shape path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103579
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-15 07:02:18 +00:00
5efdcd5802 Handle long Docker image name when building Docker image (#103562)
After https://github.com/pytorch/pytorch/pull/102562, the `IMAGE_NAME` input to `.ci/docker/build_docker.sh` now accepts the name in the following two formats:

* Short form, like `pytorch-linux-bionic-py3.11-clang9`
* Or long form, like `308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-bionic-py3.11-clang9`

This PR updates the build script to handle both cases.

This bug was discovered when I saw the wrong image name in https://github.com/pytorch/pytorch/actions/runs/5261424181/jobs/9509633110.

### Testing

Verify that the long form is handled correctly

```
export IMAGE_NAME=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
export DOCKER_TAG=06fdf1facf0eef5e5f303dd9cfac8639fb5f9201

./build_docker.sh
+ tag=06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
+ registry=308535385114.dkr.ecr.us-east-1.amazonaws.com
+ [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201 == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h\/* ]]
++ echo pytorch-linux-focal-py3.8-gcc7:06fdf1facf0eef5e5f303dd9cfac8639fb5f9201
++ awk -F '[:,]' '{print $1}'
+ EXTRACTED_IMAGE_NAME=pytorch-linux-focal-py3.8-gcc7
+ IMAGE_NAME=pytorch-linux-focal-py3.8-gcc7
+ image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/pytorch-linux-focal-py3.8-gcc7
+ [[ -z '' ]]
+ retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com
+ login 308535385114.dkr.ecr.us-east-1.amazonaws.com
+ aws ecr get-authorization-token --region us-east-1 --output text --query 'authorizationData[].authorizationToken'
+ base64 -d
+ cut -d: -f2
+ docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103562
Approved by: https://github.com/PaliC
2023-06-15 05:21:50 +00:00
cyy
1e108d9c21 enable more ASAN tests (#101483)
Recently, we have been seeing some bugs found by ASAN, such as #101400. I think enabling ASAN for more tests is necessary to catch more hidden bugs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101483
Approved by: https://github.com/huydhn
2023-06-15 05:21:15 +00:00
17217d367f Inductor cpp wrapper: support Constant in input (#103496)
## Description
Fix cpp wrapper for models which have constants in the graph inputs.

The Python wrapper directly gets the value inside the wrapper call as a global variable passed when calling:
4081e924a8/torch/_inductor/codecache.py (L757)
The constants' values have been saved in `mod.__dict__` in
4081e924a8/torch/_inductor/graph.py (L874-L875)
For the cpp wrapper, we need to append the constants to the input args, so as to pass these Python values to the `inductor_entry_cpp` function explicitly.

### Example
Example of output code for dlrm in TorchBench with this fix:
```py
module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'cfkc6c36t7cggi6mnokrdm5jhesnunjg5xysv3o3x3vaqmzmpe6r', False)

def _wrap_func(f):
    def g(args):
        args_tensor = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        constants_tensor = [constant0, constant1]
        args_tensor.extend(constants_tensor)

        return f(args_tensor)
    return g
call = _wrap_func(module.inductor_entry_cpp)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103496
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-06-15 05:01:25 +00:00
90ee6a7354 [PT2][Quant] Update op names for decomposed quantized lib (#103251)
Summary:
A Dynamo trace, via dynamo.export with aten_graph, generates a graph whose node
targets are instances of torch._ops.OpOverload. The quantization workflow is a bit
inconsistent: it inserts quantize/dequantize ops that are sometimes instances of
torch._ops.OpOverload (quantize_per_tensor.tensor) and other times instances
of torch._ops.OpOverloadPacket (quantize_per_tensor).

It is also unclear whether a model is a valid exported model if it has nodes
whose target is of type torch._ops.OpOverloadPacket.

Without the op overload name attached to the 'target', it fails during executorch
tracing. The reason is that executorch tracing expects node targets to be
instances of torch._ops.OpOverload and not torch._ops.OpOverloadPacket.

So, for consistency and tracing reasons, fix the convert pass to insert ops that
are torch._ops.OpOverload.
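
For context, a small standalone illustration of the distinction discussed above (not from this diff):
```python
import torch

packet = torch.ops.aten.add            # groups every overload of aten::add
overload = torch.ops.aten.add.Tensor   # one specific overload

print(isinstance(packet, torch._ops.OpOverloadPacket))  # True
print(isinstance(overload, torch._ops.OpOverload))      # True
```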

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D46342822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103251
Approved by: https://github.com/andrewor14
2023-06-15 04:37:58 +00:00
5211fad738 cm3leon_generate is at edge of timeout, so bump it up (#103607)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103607
Approved by: https://github.com/malfet
2023-06-15 03:40:42 +00:00
b4056ba744 chore: Update ModelReportObserver variables to buffers (#97971)
This commit changes ModelReportObserver variables to buffers, similar to other observers. This will allow gathering data on devices other than the CPU.
Moreover, it updates InputWeightEqualizationDetector to compute weight stats that are on the GPU.

Tested by running the tests in `test/quantization/fx/test_model_report_fx.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97971
Approved by: https://github.com/vkuzo
2023-06-15 03:15:41 +00:00
00546333a5 Register more foreach op lowerings (#102654)
Adds the necessary foreach op lowerings for Adam

Adds two decomps for addcdiv and addcmul (need to verify that type promotion works correctly here)
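
For reference, the decompositions mentioned above amount to roughly the following (a sketch that ignores type-promotion details):
```python
import torch

def addcmul_decomp(input, tensor1, tensor2, *, value=1):
    # addcmul(input, t1, t2, value=v) == input + v * t1 * t2
    return input + value * tensor1 * tensor2

def addcdiv_decomp(input, tensor1, tensor2, *, value=1):
    # addcdiv(input, t1, t2, value=v) == input + v * t1 / t2
    return input + value * tensor1 / tensor2

x, a, b = torch.randn(3), torch.randn(3), torch.rand(3) + 0.1
torch.testing.assert_close(addcmul_decomp(x, a, b, value=0.5),
                           torch.addcmul(x, a, b, value=0.5))
torch.testing.assert_close(addcdiv_decomp(x, a, b, value=0.5),
                           torch.addcdiv(x, a, b, value=0.5))
```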

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102654
Approved by: https://github.com/jansel
2023-06-15 02:52:17 +00:00
6d570ccd59 tf32 context fixes for various tests (#103137)
Addresses tf32 context related failures from NVIDIA internal testing for following unit tests:

H100:

- functorch/test_vmap.py: test_op_has_batch_rule

A100:

- test_expanded_weights.py: test_cnn_model_sum
- nn/test_convolution.py: test_conv2d_same_padding_backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103137
Approved by: https://github.com/zou3519
2023-06-15 02:33:12 +00:00
2e65354880 Fix inductor-perf-compare (#103538)
The "7" seems to have been a typo in #102881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103538
Approved by: https://github.com/kit1980
2023-06-15 02:29:01 +00:00
3d6fd07c46 Revert "[inductor] Make clone_graph copy node name as well (#103409)"
This reverts commit 2d745b95d723641e575027bd4e2fff612f61cc8f.

Reverted https://github.com/pytorch/pytorch/pull/103409 on behalf of https://github.com/osalpekar due to torchbench regression starting this commit. See 2d745b95d7 for more info ([comment](https://github.com/pytorch/pytorch/pull/103409#issuecomment-1592194229))
2023-06-15 01:27:55 +00:00
d6da649a1b [benchmark] hf_T5_base - torchbench original batchsize too large (#103442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103442
Approved by: https://github.com/desertfire
2023-06-15 01:06:40 +00:00
16c2090b2d [benchmark][compile] Limit number of bounding boxes to 5 (#103413)
Depends on https://github.com/pytorch/benchmark/pull/1729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103413
Approved by: https://github.com/ezyang
2023-06-15 01:06:40 +00:00
2087d32811 Revert "Support params/buffers inside cond and map (#102310)"
This reverts commit 766f236bad2327060575780219e0d4964dc661e5.

Reverted https://github.com/pytorch/pytorch/pull/102310 on behalf of https://github.com/huydhn due to The test is failing in trunk 766f236bad ([comment](https://github.com/pytorch/pytorch/pull/102310#issuecomment-1592159710))
2023-06-15 00:29:20 +00:00
ddf4cd69ec Delete ifdyn and ifunspec combinators (#103596)
Replaced with expect tests for ease of updating.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103596
Approved by: https://github.com/voznesenskym
2023-06-15 00:14:17 +00:00
e82616d900 Add generator argument in torch.randn signature (#102075)
Fix the documentation of `torch.randn` to include the `generator` argument.
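
For example, the documented argument is used like this:
```python
import torch

g = torch.Generator().manual_seed(0)
x = torch.randn(2, 3, generator=g)  # reproducible sampling via an explicit generator
```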

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102075
Approved by: https://github.com/kit1980, https://github.com/soulitzer
2023-06-14 23:37:19 +00:00
a0885dff98 Link torch.cat in docstring of torch.stack and vice versa (#103421)
torch.cat and torch.stack are similar enough that they should point to each other.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103421
Approved by: https://github.com/malfet, https://github.com/svekars, https://github.com/kit1980
2023-06-14 23:31:22 +00:00
766f236bad Support params/buffers inside cond and map (#102310)
With #102022, params and buffers are always treated as a special case of free variables. In this PR, I switch the cond and map implementations to this method and deprecate the old tracing mechanism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102310
Approved by: https://github.com/avikchaudhuri, https://github.com/zou3519
2023-06-14 22:32:33 +00:00
600f7dc211 add instruction to compile with new C++ ABI (#95177)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95177
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/kit1980
2023-06-14 22:25:26 +00:00
55cf5c00fa Improve DDPOptimizer Logging (#103489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103489
Approved by: https://github.com/ezyang
2023-06-14 22:24:44 +00:00
9152d0e5be Silence has_cuda deprecation in optim (#103610)
```
UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103610
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2023-06-14 22:09:22 +00:00
d0ff640ec8 [Pytorch] aten::stack (#103344)
Summary:
Stack: https://pytorch.org/docs/stable/generated/torch.stack.html

This diff uses `at::unsqueeze` and `at::cat` to implement `at::stack` for all dims
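
Conceptually, that is the same equivalence as the following Python sketch (the Vulkan implementation itself is in C++):
```python
import torch

def stack_via_cat(tensors, dim=0):
    # stack == unsqueeze each input at `dim`, then concatenate along `dim`
    return torch.cat([t.unsqueeze(dim) for t in tensors], dim=dim)

ts = [torch.randn(2, 3) for _ in range(4)]
torch.testing.assert_close(stack_via_cat(ts, dim=1), torch.stack(ts, dim=1))
```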

Re-organize the tests to 1d, 2d, 3d tensors.

Test Plan:
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*stack*"
Restarting Buck daemon because Buck version has changed...
Buck daemon started.
Parsing buck files: finished in 9.1 sec
Creating action graph: finished in 0.7 sec
Downloaded 54/3888 artifacts, 27.68 Mbytes, 97.3% cache miss (for updated rules)
Building: finished in 07:36.5 min (100%) 2487/2487 jobs, 2487/2487 updated
  Total time: 07:46.3 min
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *stack*
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.stack_invalid_inputs
[       OK ] VulkanAPITest.stack_invalid_inputs (499 ms)
[ RUN      ] VulkanAPITest.stack_1d
[       OK ] VulkanAPITest.stack_1d (6 ms)
[ RUN      ] VulkanAPITest.stack_2d
[       OK ] VulkanAPITest.stack_2d (12 ms)
[ RUN      ] VulkanAPITest.stack_3d
[       OK ] VulkanAPITest.stack_3d (130 ms)
[----------] 4 tests from VulkanAPITest (649 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (649 ms total)
[  PASSED  ] 4 tests.
lfq@lfq-mbp fbsource %
```

Reviewed By: yipjustin

Differential Revision: D46178424

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103344
Approved by: https://github.com/SS-JIA
2023-06-14 21:59:36 +00:00
2eea3cb19d Fix composable checkpoint(use_reentrant=True) with multi args (#103590)
The `_ModuleHookCheckpointFunction.backward()` should take in `*output_grads` instead of `output_grads`. Otherwise, we may see an error like:
```
TypeError: backward() takes 2 positional arguments but 5 were given
```
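
A generic illustration of why `backward()` must be variadic when the checkpointed region has multiple outputs (this is not the actual `_ModuleHookCheckpointFunction` code):
```python
import torch

class MultiOut(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x * 2, x * 3            # two outputs -> backward receives two grads

    @staticmethod
    def backward(ctx, *output_grads):  # variadic: one grad per forward output
        g2, g3 = output_grads
        return g2 * 2 + g3 * 3

x = torch.randn(4, requires_grad=True)
a, b = MultiOut.apply(x)
(a.sum() + b.sum()).backward()         # works; `backward(ctx, output_grads)` would not
```
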
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103590
Approved by: https://github.com/rohan-varma, https://github.com/fduwjj, https://github.com/fegin
2023-06-14 21:53:30 +00:00
c2952e8be9 [inductor] Fix an expression printer issue during generate_return (#103557)
Summary: This fixes a symbolic expression printing issue when cpp_wrapper
is on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103557
Approved by: https://github.com/eellison
2023-06-14 21:49:53 +00:00
dc3fa9e52f Update optimizer tests to compile with fullgraph (#103559)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103559
Approved by: https://github.com/jansel
2023-06-14 20:54:33 +00:00
7dd0f525b5 [FSDP][4/n]Update use_dtensor option for _optim_utils.py (#103599)
Same as https://github.com/pytorch/pytorch/pull/103069 (this branch is corrupted so have to re-submit).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103599
Approved by: https://github.com/fegin
2023-06-14 20:18:33 +00:00
bd0ed940b7 [activation checkpoint][dynamo] Wrap AC into Tag based higher order op (#102935)
These are the numbers with this PR

![image](https://github.com/pytorch/pytorch/assets/13822661/63e991d5-80e2-4e94-8e4b-243621c3990e)

There are 3 main followups
* A naive partitioner gives better memory footprint than min-cut partitioner here. Currently, we are using min-cut partitioner. Waiting for @Chillee  to discuss this further to either modify min-cut or add a naive partitioner.
* aot_eager is < 1x memory footprint. This is true even for non AC models. This could hide some inefficiency somewhere.
* inductor is giving very different memory numbers between AOT-traced-AC (duplicate early) vs this implementation. This leads to some inefficiency in inductor that we need to resolve.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102935
Approved by: https://github.com/jansel
2023-06-14 20:15:43 +00:00
df0505743f [activation checkpointing] Tagging based min cut partitioner (#103357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103357
Approved by: https://github.com/jansel
2023-06-14 20:15:43 +00:00
aece6705d1 Move locals/globals to output graph, make it easier to access them anywhere (#103456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103456
Approved by: https://github.com/jansel
2023-06-14 20:04:33 +00:00
d27bc34f4b Simple Source traversal util (#103450)
lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103450
Approved by: https://github.com/ezyang
2023-06-14 20:04:20 +00:00
6db21a9cf8 Update clang-tidy install in CONTRIBUTING.md (#101247)
Updated clang-tidy install to reflect install command in github actions workflow ce76670c6f/.github/workflows/lint.yml (L45)

I was following steps to run clang-tidy and got into the above issue. I also think that the following line is outdated: ce76670c6f/CONTRIBUTING.md (L1077)

 but I'm not sure what the right solution is there, as there is no `clang_tidy/requirements.txt` file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101247
Approved by: https://github.com/subramen, https://github.com/kit1980
2023-06-14 19:57:12 +00:00
9946499228 Continue simplifying dynamic shapes tests (#103592)
Remove the static by default / no automatic dynamic configuration as this is about to become the default.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103592
Approved by: https://github.com/voznesenskym, https://github.com/Skylion007
2023-06-14 19:35:51 +00:00
49dcf48e66 [PT2][Quant] Change quat conv bn fusion code (#103556)
Summary:
Dynamo burns in scalars instead of keeping them on the module. This results in
quantize_per_tensor and dequantize_per_tensor nodes having burnt-in scale and
zero-point values if we trace them as scalars.

The graph rewrite ignores literals, and when the match pattern is replaced with the
replacement pattern, we lose the scale/zp and other values from the nodes in the
original graph and instead get the ones from the replacement graph.

This diff fixes that for the q/dq per-tensor nodes by manually copying these values
over.

Note that this is not robust, because it only works when there is a single
q/dq node.

Test Plan: quantization_pt2e

Reviewed By: andrewor14

Differential Revision: D46614000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103556
Approved by: https://github.com/andrewor14
2023-06-14 18:37:43 +00:00
a60f6dbe69 Revert "Add groups to dynamo benchmarking output data (#103268)"
This reverts commit 455f542ed95921a073b7859fc51a3a1e7c361239.

Reverted https://github.com/pytorch/pytorch/pull/103268 on behalf of https://github.com/drisspg due to no longer needed ([comment](https://github.com/pytorch/pytorch/pull/103268#issuecomment-1591732331))
2023-06-14 17:50:34 +00:00
69b09eca5a optimize reflection padding performance on CPU (#102254)
This patch improves reflection padding performance on CPU.

The original kernel has nested parallel loops, e.g. first on the **batch** dim and then on the **channels** dim, which is not optimal when N * C is small. This patch collapses NC and the adjacent spatial dims to maximize the parallelism scope.

The following benchmark results were gathered on an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, with 20 cores per socket.

### single core inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.281 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 55.675 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.049 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 17.252 ms;
```

### single socket inference
```
(before)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.118 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 4.023 ms;

(after)
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([1, 3, 224, 224]) , NCHW: 0.010 ms;
ReflectionPad2d((2, 2, 2, 2)) size:  torch.Size([128, 64, 56, 56]) , NCHW: 3.149 ms;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102254
Approved by: https://github.com/cpuhrsch
2023-06-14 17:18:51 +00:00
717e63b7bd [inductor] use aten.kernel.OVERLOAD_NAME instead of aten.kernel in python wrapper (#103576)
Summary:
When we call an overload packet (e.g. torch.ops.aten.ge), there's some C++ code (from TorchScript) that determines which overload to use. There's sometimes ambiguity as to which op should be used. Therefore, for python we should use the specific overload name if we know it.

Specifically, the issue was with ge. We had a test (test_lerp_cuda from test_torchinductor.py) that eventually got lowered to code like this:
```
torch.ops.aten.ge(torch.tensor(70000.), 0.5)
```

This can either match torch.ops.aten.ge.Scalar (the intended overload), which will return torch.tensor(True); or it can match torch.ops.aten.ge.float (a TorchScript overload), which will return `True`. The decision of which to use depends on the order in which the operators are registered. Internally, depending on the build config (opt vs. dev-nosan), the operator registration order could differ. In opt mode, the torchscript overload would appear first and therefore would get called first, and cause the inductor program to fail.
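
In other words, the Python wrapper now spells out the overload explicitly, roughly like this (sketch):
```python
import torch

# Ambiguous packet call: which overload wins depends on registration order.
torch.ops.aten.ge(torch.tensor(70000.), 0.5)

# Explicit overload: always dispatches to the Tensor/Scalar kernel.
torch.ops.aten.ge.Scalar(torch.tensor(70000.), 0.5)  # tensor(True)
```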

Differential Revision: D46712744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103576
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-06-14 17:14:47 +00:00
5c3556da94 [Dynamo] VariableTracker.recursively_contains should be updated correctly when mutation happens (#103564)
Fixes #103563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103564
Approved by: https://github.com/jansel
2023-06-14 17:08:00 +00:00
0ca3c6f7d7 [_memory_viz.py] Fix bug when using profile_plot (#103384)
When we updated plotting to add level of detail the Legend
code for profile_plot got broken. This patch fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103384
Approved by: https://github.com/drisspg
2023-06-14 16:54:29 +00:00
6ff6b49039 Revert "Register more foreach op lowerings (#102654)"
This reverts commit 05c01b9bfc0af1ad1bf230cac658d10a42f754d6.

Reverted https://github.com/pytorch/pytorch/pull/102654 on behalf of https://github.com/ZainRizvi due to This is breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/102654#issuecomment-1591639478))
2023-06-14 16:49:30 +00:00
b1adaa8777 [inductor] Fix no-xdim reductions (#103527)
Fixes #103481

Normally triton tensors have shape `[XBLOCK, RBLOCK]`, or some variation where
the lengths are 1 but the number of dimensions is the same. The `no_x_dim`
change in addition to removing the x dimension, also removed the r dimension
from certain values such as the results of reductions and the `xindex` variable.

This fixes those two cases to correctly produce tensors of shape `[1]`,
equivalent to the old shape `[XBLOCK, 1]` with the x-dimension dropped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103527
Approved by: https://github.com/ngimel
2023-06-14 16:32:17 +00:00
80139fc2db [DDP] multiple forward support for static graph (#103487)
Adds support for multiple forward passes before the backward call for
static_graph=True.

There are 2 changes:
1) Change the tracking of when to populate the static-graph-related maps
from relying on forward iterations to relying on backward calls.
2) In DDP Python, don't rely on the number of forward iterations being 1 to enqueue the
delay allreduce; instead use a flag.

Differential Revision: [D46673736](https://our.internmc.facebook.com/intern/diff/D46673736/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103487
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
780b24b27c [DDP] Refactor _DDPSink to take DDP weakref (#103304)
This will make future PRs that support DDP static-graph multiple forwards
cleaner.

Differential Revision: [D46584545](https://our.internmc.facebook.com/intern/diff/D46584545/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103304
Approved by: https://github.com/awgu
2023-06-14 16:14:52 +00:00
a3a32c1be0 [DDP] Rename num_iterations -> num_forward_calls (#103283)
This more accurately represents what we're counting. An iteration is a
forward + backward call, but here we're just counting forward calls. This makes
things less confusing in future diffs where we support DDP static graph
multiple forwards.

Differential Revision: [D46580601](https://our.internmc.facebook.com/intern/diff/D46580601/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103283
Approved by: https://github.com/awgu
2023-06-14 16:14:50 +00:00
2076a2ffa7 [DDP] Rename state_dict var to ddp_state (#103282)
This name is confusing, since it is just a dictionary
used to pass state to the DDP backward pass.

Differential Revision: [D46580516](https://our.internmc.facebook.com/intern/diff/D46580516/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103282
Approved by: https://github.com/awgu
2023-06-14 16:14:49 +00:00
2d745b95d7 [inductor] Make clone_graph copy node name as well (#103409)
Summary: This solves an inconsistency between two-pass fusion results
when turning on cpp wrapper. The unit test comes from yolov3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103409
Approved by: https://github.com/eellison, https://github.com/jansel
2023-06-14 15:25:18 +00:00
7a2a006c9e Remove dynamic_shapes test for inductor static weights (#103377)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103377
Approved by: https://github.com/anijain2305
2023-06-14 15:00:34 +00:00
45401ef745 Enable float16 and complex32 support for sparse CSR elementwise multiplication operation. (#100394)
As in the title. In addition, the PR adds float16 addcmul support for CPU device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100394
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-06-14 14:42:39 +00:00
a980b19be7 Revert "Remove dynamic_shapes test for inductor static weights (#103377)"
This reverts commit 53cb1a7d15804fef6eb25cbad8a0380a29f53e8b.

Reverted https://github.com/pytorch/pytorch/pull/103377 on behalf of https://github.com/malfet due to broke lint ([comment](https://github.com/pytorch/pytorch/pull/103377#issuecomment-1591356769))
2023-06-14 14:41:13 +00:00
339007fe65 operator_compile_check v0 (#103198)
This PR adds `operator_compile_check` (pls bikeshed name), a gradcheck-like
API to test if a custom operator is supported by torch.compile.
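
As a rough usage sketch only (the import location and the exact call signature below are assumptions for illustration, not taken from this PR):
```python
import torch
from torch.testing._internal.optests import operator_compile_check  # assumed location

def f(x):
    # call the custom operator under test (hypothetical op name)
    return torch.ops.mylib.my_op.default(x)

operator_compile_check(f, (torch.randn(3),), {})  # assumed (func, args, kwargs) signature
```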

The API is scoped to check only that the interaction between the
operator and torch.compile works (e.g. it is not going to include
gradcheck). Concretely, it currently checks the following things:
- schema correctness
- make_fx traceable (static shapes)
- aot_autograd correctness (static shapes)
- torch.compile correctness, with and without inductor (static shapes)
- make_fx traceable (dynamic shapes)
- aot_autograd correctness (dynamic shapes)
- torch.compile correctness, with and without inductor (dynamic shapes)

Test Plan:

We test a bunch of error cases, including many failure modes that have tripped
us up in the past, and assert that they (mostly) have nice error messages:
- incorrect schema (mutates)
- incorrect schema (has a view)
- missing abstract impl
- incorrect abstract impl
- missing functionalization kernel
- autograd registered at CPU/CUDA keys
- operator is not traceable
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103198
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-06-14 14:00:14 +00:00
149cd09221 Refactor and improve AOTAutograd tests (#103197)
This is in preparation for the new "custom_op_compile_check" utility,
which will call the refactored testing API as a subroutine.

Here are the improvements to the AOTAutograd tests that this PR makes:
- we use torch.autograd.grad instead of .backward(), so we stop
destructively modifying the inputs (see the sketch after this list)
- we get rid of the difficult-to-understand sentinel=42 logic and
replace it with something more sane
- We create some helper functions and add some code comments
- We improve error messages
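
For the first point, a small self-contained illustration of why `torch.autograd.grad` avoids mutating test inputs:
```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * x).sum()

# Returns fresh gradient tensors; x.grad is left untouched (stays None here).
(gx,) = torch.autograd.grad(y, (x,))

# By contrast, y.backward() would accumulate into x.grad, mutating the input's state.
```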

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103197
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer, https://github.com/Chillee
2023-06-14 14:00:14 +00:00
27a67d8699 Refactor and improve make_fx testing (#103196)
This is in preparation for the custom_op_compile_check utility, which
will call the newly refactored function.

This PR:
- splits off code into helper functions
- adds clearer error messages
- stops updating the inputs destructively (leading to slightly slower
tests)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103196
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-06-14 14:00:12 +00:00
53cb1a7d15 Remove dynamic_shapes test for inductor static weights (#103377)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103377
Approved by: https://github.com/anijain2305
2023-06-14 13:32:24 +00:00
ccf56eca84 [inductor] Fix is_broadcasted (#103514)
Fixes #103491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103514
Approved by: https://github.com/ngimel
2023-06-14 13:30:48 +00:00
e9674d146c [Specialized Kernel] Propagate Specialized Kernel Support through ComputeCodegenUnboxedKernels (#103113)
Updating ComputeCodegenUnboxedKernels to accept and write out kernel information to RegisterCodegenUnboxedKernels.cpp

Differential Revision: [D46486195](https://our.internmc.facebook.com/intern/diff/D46486195/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103113
Approved by: https://github.com/larryliu0820, https://github.com/kirklandsign
2023-06-14 10:18:16 +00:00
e3ee5b00be Enable test sparse allreduce basics Windows (#103317)
The test was marked as flaky in #59965. However, it is not failing anymore so it can be enabled.

This PR enables only one test, but it will only run in local tests because the test suite is disabled in CI.

#94495 is a superset of this PR which enables the full test suite. The CI run there shows this test passing.

Fixes #59965
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103317
Approved by: https://github.com/kit1980
2023-06-14 07:37:50 +00:00
8b015c166c Don't test dynamic_shapes in tensor_always_has_static_shape (#103517)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103517
Approved by: https://github.com/anijain2305
2023-06-14 07:04:17 +00:00
593642d1d8 Use CUDA DSA in caffe2/operators (#95299)
Differential Revision: D42977333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95299
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-06-14 06:58:34 +00:00
d991ce6da3 [FSDP][3/N]_shard_utils update for dtensor state_dict support (#103479)
Same as https://github.com/pytorch/pytorch/pull/102545 (this branch is corrupted so have to re-submit).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103479
Approved by: https://github.com/fegin
2023-06-14 06:45:28 +00:00
3c5ac4baa4 [CI] Enable inductor dynamic accuracy test on cpu device (#103387)
Enable inductor dynamic accuracy test on cpu in ci workflow to capture issue early.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103387
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-06-14 06:12:41 +00:00
f0832914ee [Dynamo] Fix lineinfo generation on PY3.11+ (#103525)
- Replace `for inst in instructions[0:target.offset//2]: inst.starts_line = None` with a loop that iterates over all instructions until the `inst.offset == target.offset` condition is met, making it uniform across Python bytecode dialects (Python-3.11+ bytecode size is variable, while bytecode size is fixed for older Pythons)
- Speedup target_index search by replacing `[i for i in instructions if i.offset == offset][0]` with `next(i for i in instructions if i.offset == offset)`, which aborts the evaluation after condition met for the first time, according to:
  ```python
   In [1]: lst=list(range(10000))

   In [2]: %time [i for i in lst if i == 10]
   CPU times: user 144 µs, sys: 23 µs, total: 167 µs
   Wall time: 168 µs
   Out[2]: [10]

   In [3]: %time next(i for i in lst if i == 10)
   CPU times: user 6 µs, sys: 0 ns, total: 6 µs
   Wall time: 9.06 µs
   Out[3]: 10
   ```
- Fix small typo
- use `is_py311_plus` variable rather than checking `sys.version_info`


Fixes https://github.com/pytorch/pytorch/issues/103355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103525
Approved by: https://github.com/Skylion007, https://github.com/williamwen42
2023-06-14 05:41:43 +00:00
193d8412e7 [vision hash update] update the pinned vision hash (#103560)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103560
Approved by: https://github.com/pytorchbot
2023-06-14 05:19:10 +00:00
674d18c124 inductor: using int64 as index dtype for slice_scatter (#103511)
For the given test case from HF AllenaiLongformerBase, there is an accuracy issue in the dynamic shape case. The reason is that we are using int32 as the index type, but there is a default value ```9223372036854775807``` that is out of range for int32; see the IR:
```
def masked_subblock1(self, ops):
    get_index = self.get_index('index1')
    index_expr = ops.index_expr(get_index, torch.int32)
    get_index_1 = self.get_index('index2')
    index_expr_1 = ops.index_expr(get_index_1, torch.int32)
    ge = ops.ge(index_expr, index_expr_1)
    get_index_2 = self.get_index('index1')
    index_expr_2 = ops.index_expr(get_index_2, torch.int32)
    constant = ops.constant(9223372036854775807, torch.int32)
    lt = ops.lt(index_expr_2, constant)
    and_ = ops.and_(ge, lt)
    masked_subblock2 = self.masked_subblock2(and_, 0.0)
    get_index_3 = self.get_index('index4')
    load = ops.load('arg4_1', get_index_3)
    where = ops.where(and_, masked_subblock2, load)
    return where
```
and the CPU codegen will generate the cpp code according to the node type:
```
auto tmp3 = [&]
{
    auto tmp4 = static_cast<int>(i3);
    auto tmp5 = static_cast<int>(ks2);
    auto tmp6 = tmp4 >= tmp5;
    auto tmp7 = static_cast<int>(9223372036854775807);
    auto tmp8 = tmp4 < tmp7;
    auto tmp9 = tmp6 & tmp8;
    auto tmp10 = [&]
    {
        auto tmp11 = in_ptr0[static_cast<long>(i2 + i3 + ((-1L)*ks2) + (i1*ks3) + (2L*i2*ks2) + (3L*i0*ks3) + (2L*i1*ks2*ks3) + (                       6L*i0*ks2*ks3))];
        return tmp11;
    }
    ;
    auto tmp12 = tmp9 ? tmp10() : static_cast<decltype(tmp10())>(0.0);
    auto tmp13 = in_ptr1[static_cast<long>(i2 + i3 + (i1*ks2) + (2L*i1*(static_cast<long>(ks2*ks2))) + (2L*i2*ks2) + (i0*ks1*ks2)                        + (2L*i0*ks1*(static_cast<long>(ks2*ks2))))];
    auto tmp14 = tmp9 ? tmp12 : tmp13;
    return tmp14;
}
```
For ```auto tmp7 = static_cast<int>(9223372036854775807);```, ```tmp7``` ends up as ```-1```, which is wrong.

After this PR, the HF AllenaiLongformerBase CPU dynamic shape path passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103511
Approved by: https://github.com/desertfire
2023-06-14 04:59:19 +00:00
2e1369d7ad [inductor] fix benchmark call for inplace update (#103547)
Enabling coordinate descent tuning for a few models causes illegal memory access (or triggers a device assert before that). Command:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 python benchmarks/dynamo/huggingface.py --amp --performance --training --inductor -d cuda --only CamemBert
```

It turns out that we can not benchmark this kernel: https://gist.github.com/shunting314/a78997f54b5751f2887f4576956036ce

Digging more shows that this kernel has an in-place argument that is changed by running the kernel. Our benchmark API simply calls a kernel multiple times, and since each run may have side effects, earlier calls can change the in-place argument in a way that makes the following calls fail.

This PR clones those in-place arguments before each benchmark call. This can increase the time for each benchmark call, but it should not affect autotuning since the time increases by an equal amount for each tuning config.
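
A simplified sketch of the idea, assuming we know which argument positions the kernel mutates (the helper below is illustrative, not Inductor's actual benchmarking API):
```python
import torch

def bench_kernel(kernel, args, mutated_arg_indices, n_repeat=10):
    timings = []
    for _ in range(n_repeat):
        # Clone mutated args so every run sees the original values.
        call_args = [
            a.clone() if i in mutated_arg_indices else a for i, a in enumerate(args)
        ]
        start, end = (torch.cuda.Event(enable_timing=True) for _ in range(2))
        start.record()
        kernel(*call_args)
        end.record()
        torch.cuda.synchronize()
        timings.append(start.elapsed_time(end))
    return min(timings)
```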

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103547
Approved by: https://github.com/jansel
2023-06-14 04:10:41 +00:00
876161983d default should be used as default value in boolean_dispatch (#103463)
The original code mistakenly uses `False`, but it should use `default` as passed in.

NOTE: The behavior is silently changed in an internal package. Be aware if you use it for your own purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103463
Approved by: https://github.com/davidberard98
2023-06-14 03:16:31 +00:00
cbea85b416 [Pytorch] aten::zero_ (#103042)
Summary: aten::zero_: https://pytorch.org/docs/stable/generated/torch.Tensor.zero_.html

Test Plan:
clang-format on zero_.glsl and Zero.cpp

```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*zero*"
Downloaded 0/48 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 40.5 sec (100%) 525/525 jobs, 12/525 updated
  Total time: 40.5 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *zero*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.zero_
[       OK ] VulkanAPITest.zero_ (59 ms)
[----------] 1 test from VulkanAPITest (59 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (59 ms total)
[  PASSED  ] 1 test.
```

Differential Revision: D46403983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103042
Approved by: https://github.com/SS-JIA
2023-06-14 03:15:13 +00:00
8340762211 Update lr_scheduler.py to check the type of eta_min (#97003)
Add float assertion to `eta_min` parameter in `CosineAnnealingWarmRestarts`.

Fixes #87757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97003
Approved by: https://github.com/janeyx99
2023-06-14 02:13:05 +00:00
2f5fef5912 Refactor tests for dynamic shapes (#103542)
First, infra improvements: new combinator `expectedFailureDynamic` which subsumes expectedFailure calls in test_dynamic_shapes.py. It's just nicer to have these right with the test. Implementation in torch/_dynamo/testing.py and it works by putting an attr on the test, which is then converted into a real expectedFailure when we actually generate the dynamic shapes test class
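
A rough sketch of how such a marker can work (names are simplified here; the actual implementation lives in torch/_dynamo/testing.py):
```python
import unittest

def expectedFailureDynamic(fn):
    # Only tag the test function; the tag is acted on when the
    # dynamic-shapes test class is generated.
    fn._expected_failure_dynamic = True
    return fn

def make_dynamic_shapes_cls(cls):
    members = {}
    for name, member in list(vars(cls).items()):
        if getattr(member, "_expected_failure_dynamic", False):
            member = unittest.expectedFailure(member)
        members[name] = member
    return type(cls.__name__ + "DynamicShapes", cls.__bases__, members)
```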

Next, some housekeeping:
* test/dynamo/test_unspec.py accidentally was running mostly statically due to the `assume_static_by_default` config flip. Don't assume static by default and xfail some tests which regressed in that time.
* New test file test/dynamo/test_config.py, for testing permutations of configuration options. `test_dynamic_shapes` got moved there.

Finally, grinding through tests in a way that will make them more compatible with dynamic by default:
* If the test explicitly requires dynamic_shapes=False, remove that patch (and probably xfail it)
* If the test checks dynamic_shapes internally, remove that test and patch the test so it ALWAYS runs with dynamic_shapes (this is not coverage loss because we're going to switch the default)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103542
Approved by: https://github.com/anijain2305
2023-06-14 02:04:54 +00:00
b7777c812e extend serialization for tensor metadata (#99808)
Fixes #ISSUE_NUMBER
Add the serialization logic of backend metadata to the serialization of tensor, which is implemented through custom registration functions.

In #97429, the structure backendMeta is provided in TensorImpl, and we think this part of the information may also need to be serialized for custom backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang, https://github.com/huydhn
2023-06-14 01:43:21 +00:00
ce0a511993 Using dynamic allocation buffer and dynamic threads on scan with index (#103502)
What this PR does is (continuation from #103435):
- Applying dynamic number of threads for innerdim scan with index function.
- Using dynamically allocated shared memory to get rid of `num_threads` template arguments.

@ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103502
Approved by: https://github.com/ngimel
2023-06-14 01:27:58 +00:00
fee01640df Make DDPOptimizer handle subgraphs without outputs (#103488)
Subgraphs are partitions cut out of a whole graph. Outputs of a subgraph are either global outputs of the original graph, or can be outputs of a partition that feed inputs of the subsequent partition.  Subgraphs are created using the fx utility 'passes.split_module', which requires that each partition
have at least one output node.

In cases where DDPOptimizer asked the partitioner to cut the graph around a set of nodes which only
performed inplace mutation, the partitioner could be left trying to create a subgraph with no output nodes, violating its assumptions.

To circumvent this, DDPOptimizer can expand the set of nodes marked for inclusion in a subgraph that has no outputs until it includes a node that is an output for that subgraph. It still traverses nodes of the original graph in reverse order and only considers widening a subgraph by iterating further in reverse order than it would ordinarily have done (past the cut point dictated by parameter count). It may still be possible that the subgraph reaches the input node of the graph without satisfying the subgraph-output condition, in which case the partitioner would still raise an error.

Fixes #103385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103488
Approved by: https://github.com/anijain2305
2023-06-14 01:16:04 +00:00
93b0410eef Use CUDA DSA in ATen (#95300)
Differential Revision: D42977336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95300
Approved by: https://github.com/xw285cornell, https://github.com/ezyang, https://github.com/malfet
2023-06-14 00:12:03 +00:00
6cc0f1c20c Checking for nullptr in get_model_bytecode_version (#97149)
One-liner commit to check that the pointer is not null. `test_jit` had a segfault there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97149
Approved by: https://github.com/kit1980
2023-06-13 23:54:45 +00:00
0cd155b042 [reland][quant][pt2e] Annotate GRU module (#103358) (#103526)
Summary:

att, we use module partition API to identify the GRU submodule and annotate all necessary patterns

Test Plan: buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'

Differential Revision: D46689428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103526
Approved by: https://github.com/andrewor14
2023-06-13 23:43:10 +00:00
0254880015 NCCL process group: avoid workEnqueue when capturing cuda graph (#103503)
Summary:
In torch.distributed, we make ProcessGroupNCCL not call workEnqueue when the cuda stream is capturing. I.e., when capturing a CUDA graph, we do not enqueue anything for the watchdog thread to consider. This allows capturing NCCL operations in a CUDA Graph.

This is a followup to an internal discussion [1] where the watchdog thread was observed to crash when using cuda graphs containing an all_reduce. The watchdog thread wants to query events pertaining to enqueued work items, but this can't be done for "events" created during cuda graph capture.
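
The Python-level analogue of the condition being checked (the actual change is in ProcessGroupNCCL's C++ code; this snippet is only an illustration):
```python
import torch

def maybe_track_work(work, watchdog_queue):
    # During CUDA graph capture, the recorded "events" cannot be queried,
    # so skip handing the work item to the watchdog thread.
    if torch.cuda.is_current_stream_capturing():
        return
    watchdog_queue.append(work)
```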

[1] https://fb.workplace.com/groups/1405155842844877/posts/6975201909173548/

This is another attempt at https://github.com/pytorch/pytorch/pull/102542 / D46274814, fixing the test failures.

Test Plan: The repro mentioned in https://fb.workplace.com/groups/1405155842844877/posts/7003002339726838/ runs successfully after this change.

Differential Revision: D46683554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103503
Approved by: https://github.com/kwen2501
2023-06-13 23:12:43 +00:00
25b6b95b2e Fix freezing tests (#103531)
Workaround for https://github.com/pytorch/pytorch/issues/103532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103531
Approved by: https://github.com/desertfire
2023-06-13 22:51:48 +00:00
056bf951bf Strengthen partially supported invariant of base for chained sources (#103445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103445
Approved by: https://github.com/ezyang
2023-06-13 22:44:28 +00:00
bc2caa7fdf Add type hint for retains_grad (#103528)
Fixes #103485

Type checkers don't know about the existence of `retains_grad` otherwise:

```python
torch.randn(10, 10).retains_grad  # Cannot access member "retains_grad" for type "Tensor"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103528
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2023-06-13 21:37:32 +00:00
d38b651d51 [pt2] add SymInt support for cosine_similarity (#103400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103400
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-06-13 21:23:48 +00:00
c07634436e [pt2] add SymInt support for bilinear (#103396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103396
Approved by: https://github.com/ezyang
2023-06-13 21:23:48 +00:00
4a76fb49f3 [pt2] add metas for avg_pool3d and avg_pool3d_backward (#103392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103392
Approved by: https://github.com/ezyang
2023-06-13 21:23:46 +00:00
8dc6001057 [export] Serialize symbolic values (#103273)
* Modified the SymInt schema to also store the hint of the SymInt if it is represented as a symbol, so that when we reconstruct the SymInt, the hint will also exist on the node (see the sketch below).
* GraphModuleDeserializer.deserialize now also optionally takes a map of symbol names to ranges.
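
An illustrative-only sketch of the first point (the field and class names are assumptions, not the actual export serialization schema): a serialized SymInt either carries a concrete integer or a symbol name plus the hint observed at trace time.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SerializedSymInt:
    as_int: Optional[int] = None   # concrete value, when the size is static
    symbol: Optional[str] = None   # e.g. "s0", when the size is symbolic
    hint: Optional[int] = None     # example value observed while tracing
```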

ReplaceSymSizeOpPass should not be needed after https://github.com/pytorch/pytorch/pull/103107 lands

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103273
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-06-13 20:29:47 +00:00
876695d4ec [ONNX] Add constant folding for Softmax op (#102861)
This commit adds a torch implementation for the ONNX Softmax op, which allows it to be folded during ONNX export if all its inputs are known.

Fixes #97927

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102861
Approved by: https://github.com/BowenBao
2023-06-13 20:23:37 +00:00
3804eb109a Always register SHAPE_ENV guard (#103521)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103521
Approved by: https://github.com/Skylion007
2023-06-13 20:15:20 +00:00
ea384cd377 torch.compiler public namespace (#102182)
# torch.compiler public API

## Goal

The goal of this document is to describe the public facing API for torchdynamo and torchinductor.

Today both dynamo and torchinductor live in the `torch/_dynamo` and `torch/_inductor` namespaces; the only public function is `torch.compile()`, which is placed directly in `torch/__init__.py`.

This poses a few problems for users trying to take dependencies on PyTorch 2.0
1. Unclear BC guarantees
2. No builtin discovery mechanism outside of reading the source code
3. No hard requirements for docstrings or type annotations

Most importantly, it mixes two personas, the PyTorch 2.0 developer vs. the PyTorch 2.0 customer, so this is an attempt to address that. We draw a lot of inspiration from the `functorch` migration to the `func` namespace.

## Alternate names

We did discuss some other alternative names

1. `torch.compile` -> problem is this would break BC on the existing `torch.compile` function
2. `torch.dynamo` -> `dynamo` is so far not something we've deliberately hidden from users, but the problem is that figuring out what belongs in `_dynamo` vs `dynamo` might be confusing
3. `torch.compiler` -> option 1 would be better, but to keep BC this is a good compromise

# The general approach
## Proposal 1
In https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py

We have function called `reset()`, this function is essential if users are trying to `torch.compile()` a model under different settings

```python
# in _dynamo/
def reset():
    do_reset_stuff()
```

Instead we propose

```python
# in compiler/
def reset():
    do_reset_stuff() # As in copy paste the logic from _dynamo.reset

# in _dynamo/
import warnings
import inspect

def reset():
    function_name = inspect.currentframe().f_code.co_name
    warnings.warn(f"{function_name} is deprecated, use compiler.{function_name} instead", DeprecationWarning)
    return compiler.reset()

```
## Proposal 2

```python
# in compiler/
def reset():
    """
    Docstrings here
    """
    _dynamo.reset()

# in _dynamo/
No changes
```
Consensus so far seems to be proposal 2 since fewer warnings will be less jarring and it’ll make it quite easy to merge the public API

## Docstrings

The above was an example of a function that has no inputs or outputs, but there are other functions whose docstrings could use improvement. For example, allow_in_graph actually works over lists of functions, but that's not mentioned anywhere in the example; you'd only find out by reading the source code.

```python
def allow_in_graph(fn):
    """
    Customize which functions TorchDynamo will include in the generated
    graph. Similar to `torch.fx.wrap()`.

    Parameters:
        fn (callable or list/tuple): The function(s) to be allowed in the graph.

    Returns:
        callable or list/tuple: The input function(s) included in the graph.

    Examples:
        Customize inclusion of a single function:
        ::
            torch._dynamo.allow_in_graph(my_custom_function)

        Customize inclusion of multiple functions:
        ::
            torch._dynamo.allow_in_graph([my_custom_function1, my_custom_function2])

        @torch._dynamo.optimize(...)
        def fn(a):
            x = torch.add(x, 1)
            x = my_custom_function(x)
            x = torch.add(x, 1)
            return x

        fn(...)

    Notes:
        The `allow_in_graph` function allows customization of which functions TorchDynamo
        includes in the generated graph. It can be used to include specific functions that
        are not automatically captured by TorchDynamo.

        If `fn` is a list or tuple, `allow_in_graph` will be called recursively on each
        element in the sequence.

        Once a function is allowed in the graph using `allow_in_graph`, it will be captured
        in the graph generated by TorchDynamo. This customization enables more fine-grained
        control over the functions included in the graph.

        Note that `allow_in_graph` expects the input `fn` to be a callable.

    """
    if isinstance(fn, (list, tuple)):
        return [allow_in_graph(x) for x in fn]
    assert callable(fn), "allow_in_graph expects a callable"
    allowed_functions._allowed_function_ids.add(id(fn))
    allowed_functions._disallowed_function_ids.remove(id(fn))
    return fn
```

So to make the API public, we’d have to write similar docstrings for all public functions we’d like to create.

The benefit of this approach is that
1. No BC risks, internal and external users relying on our tooling can slowly wean off the private functions.
2. We will also have to write correct docstrings which will automatically make our documentation easier to maintain and render correctly on pytorch.org
3. We already have some BC guarantees already, we don’t kill OptimizedModule, we rejected the PR to change the config system

The con of this approach is that we will be stuck with some potentially suboptimal functions/classes that we can't kill.

## Testing strategy
If the approach is to mostly make a public function call an already tested private function then all we need to do is ensure that the function signatures don't change

## Which functions should be in the public API

Our heuristic for deciding whether something should be public is: are users already relying on it for lack of other options, or have we recommended some non-public functions for users to debug their PT 2.0 programs?

The heuristic for not making something public is that it's an experimental subsystem with the goal of turning it on by default, it's very core-dev or Meta centric, it's a bunch of different configs that should be batched into a single user-facing one, or it's something that needs to be renamed because the name is confusing.

#### Top level
`torch.compile()` -> already a public API; it does require some minor improvements, like having configs be passed in to any backend and not just inductor (EDIT: this was already done https://github.com/pytorch/pytorch/pull/99645l) and renaming `mode=reduce-overhead` to `mode=cudagraph`

To make sure that PT 2.0 is supported with a given PyTorch version, users can call a new public function, which would replace the `try/except` blocks around `import torch._dynamo` that have been populating user code.

```python
def pt2_enabled():
    if hasattr(torch, 'compile'):
        return True
    else:
        return False
```

For all of the below they will be translated to `torch.compiler.function_name()`

#### From _dynamo

As a starting point we looked at https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py and we suggest redefining these functions in `pytorch/torch/compiler/__init__.py`

It might also make sense to split them over multiple files and import them in `__init__.py` but because the number of functions is small it'd probably be fine to add them all into a single compiler/__init__.py until this list becomes larger

1. `reset()`
2. `allow_in_graph()`
3. `list_backends()`
4. `compile()`:  torch.compile() would be mostly a shell function passing arguments to torch.compiler.compile()
5. `assume_constant_result()`: TODO: Double check how this is useful
6. `torch._dynamo.disable()`

Some notable omissions
1. `explain()`: We need to clean up the output for this function, make it a data class and pretty printable
2. `forbid_in_graph()`: Considered adding this but should instead consolidate on `disallow_in_graph`
3. `optimize_assert()`: Already covered by `torch.compile(fullgraph=True)`
4. `check_if_dynamo_supported()`: this would be supplanted by pt2_enabled()
5. `compilation_metrics`, `graph_breaks_reasons` ..: would all be accessed via `torch.compiler.explain()`
6. `replay`: does not seem useful to end customers
7. `graph_break()`: Mostly useful for debugging or unit tests
8. `register_backend()`: End users will just pass a string backend to torch.compile, only devs will create new backends
9. `export()`: Eventually this needs to be public, but for now it's not ready, so just highlighting that it will be in the public API eventually
10. `disallow_in_graph()`: Usage is limited
11. `mark_static()`: we can keep this private until dynamic=True is recommended in stable
12. `mark_dynamic()`: we can keep this private until dynamic=True is recommended in trunk
13. `OptimizedModule`: This is the only class that we'd expose, but it is crucial since users are running code like `if isinstance(mod, OptimizedModule): torch.save(mod._orig_mod)` EDIT: because we fixed pickling we no longer need to expose this
14. `is_compiling()`: Still not clear how this is useful to end users

There are also config variables which we need to expose https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py

Some of our configs are useful dev flags, others gate experimental functionality, and others are essential debugging tools; we separate out the essential debugging and logging tools into a public-facing config.

TODO: I still need to think of a good way of porting the config in a BC way; here are some ideas
1. Just make all passes available and controllable via `torch.compile(options={})` but only show docstrings for the ones users should care about.

The current problem with our config system is that we have 3 ways of setting things: via `options={}`, environment variables, and variables in `config.py`. It'd be worth settling on one source of truth and having that be the public API.

The configs we should make public are
1. `log_file_name`
2. `verbose`
3. `cache_size_limit`
4. `repro_level` and `repro_after`: Although we can rename these to minifier and give human readable names to the levels

Everything else should stay private in particular

1. `print_graph_breaks`, `print_specializations`: should be supplanted by `explain()` for public users
2. dynamic shape configs : Users should only have to worry about `torch.compile(dynamic=True/False)`
3. The distributed flags, hook or guard configs: If we tell a user to use FSDP and DDP then the flag should be enabled by default or be in a private namespace
4. The fbcode flags: Obviously no need to be user facing
5. Skip/Allow lists: Not something normal users should play around with

#### From _inductor
Very little of inductor should be exposed in a public-facing API. Our core audience, i.e. people writing models, mostly just needs information on what certain passes mean and how to control them at a high level, and they can do this with `torch.compile(options={})`, so the goal here should be more to make the available passes clearer and ideally consolidate them into `torch.compile()` docstrings or modes.

There are some exceptions though from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/__init__.py

1. `list_mode_options()`
2. `list_options()`: this needs an additional pass to hide internal or debug options

For both of these we’d rename them to compiler.inductor_list_mode_options and compiler.inductor_list_options() since they would be in the same init file as the one for dynamo

Notable omissions
1. `_inductor.compile()`: users coming in with their own fx graph are likely developers
2. `_inductor.aot_compile()`: again, this is about capturing and modifying fx graphs, so these APIs don't need to be public

However the configs are a slightly different story, because we can choose to either
1. Make all configs public
2. Make some configs public and keep most of the private ones. If public config is set it should override the private version
3. Make all configs controllable via `torch.compile(options={})` but make list_options() hide more things

For now, 3 seems like the most reasonable choice, with some high-level configs we'll keep, like TORCH_COMPILE_DEBUG

Regardless here's what should probably be public or advertised more
1. `disable_progress` and verbose_progress:  Combine and enable by default
2. `fallback_random`: We could make the case this shouldn't be public if a top level deterministic mode enables this
3. `profile_bandwidth`: Or could make the case that this should be in TORCH_COMPILE_DEBUG

Notable omissions
1. Any config that would generally improve performance for most users, which we should probably enable by default but might keep disabled in the short term because of stability: for example `epilogue_fusion`, `pattern_matcher`, `reordering`
2. Autotuning flags: Should just sit behind `torch.compile(mode="max-autotune")` like `max_autotune`, `max_autotune_gemm`
3. `coordinate_descent_tuning`: This one I'm a bit mixed about; maybe it should also just fall into `mode="max-autotune"`
4. `trace`: `TORCH_COMPILE_DEBUG` is the best flag for all of this
5. `triton.cudagraphs`: Default should be `torch.compile(mode="reduce-overhead")` - I'd go further and rename it to `mode=cudagraph`, and we can keep reduce-overhead for BC reasons
6. `triton_unique_kernel_names`: Mostly useful for devs debugging
7. `dce`: which doesn't really do anything
8. `shape_padding`: Elias is working on enabling this by default in which case we also remove it

## Mechanics

This PR would include the public functions with their docstrings

Another PR will take a stab at the configs

And for work where the APIs are still being cleaned up, whether it's the minifier or escape hatches, export or dynamic shapes, aot_inductor, etc., we'll keep them private until a public commitment can be made.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102182
Approved by: https://github.com/jansel, https://github.com/albanD
2023-06-13 19:52:17 +00:00
3596a853b4 Always apply new_empty special case in Dynamo (#103378)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103378
Approved by: https://github.com/anijain2305
2023-06-13 19:49:35 +00:00
51d21ffd8a [FSDP][2/n] add use_dtensor flag to both StateDictConfig and OptimStateDictConfig (#103477)
Same as #102552 (this branch is corrupted so have to re-submit).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103477
Approved by: https://github.com/fegin
2023-06-13 19:09:56 +00:00
72931759fd Unified aa_filter and aa_filter_075 for bicubic upsampling (#103510)
Follow up PR to https://github.com/pytorch/pytorch/pull/103252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103510
Approved by: https://github.com/NicolasHug
2023-06-13 18:55:25 +00:00
71b560208c [FSDP] Fix device_id when buffer-only module (#103504)
There was an issue reported internally that with `sync_module_states=True`, if the model had buffers on CPU, even with `device_id` specified, FSDP would try to broadcast CPU buffers, leading to an error like:
```
RuntimeError: No backend type associated with device type cpu
```

After some investigation, I determined that we should _not_ fix this by moving the buffers to GPU just for the broadcast and then back to CPU. Instead, we should fix our `device_id` logic.

The issue is that we always used the _parameters_ as the proxy to tell whether we should move module states to the device specified by `device_id`. However, a module (often the root) may not have any parameters but have some buffers! In that case, the buffers are left on CPU even if `device_id` is specified. This PR fixes this by considering both parameters and buffers for movement to `device_id`.
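
A minimal sketch of the failing setup (toy module; assumes an initialized process group and an available GPU):
```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Root(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("scale", torch.ones(1))  # root module has only a buffer
        self.child = nn.Linear(4, 4)                  # parameters live in a submodule

# Previously, the root's CPU buffer was not moved to `device_id`, so
# sync_module_states=True tried to broadcast a CPU tensor and failed.
model = FSDP(
    Root(),
    device_id=torch.cuda.current_device(),
    sync_module_states=True,
)
```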

Note that this PR preserves the logic that `ignored_modules` / `ignored_parameters` are not considered for this movement, meaning that ignored parameters are not moved to `device_id`.

Note also that I had to move the unit test back from using MTPG to the normal PG since otherwise, I could not repro the original error. (It seems like MTPG does not complain if we try to use `dist._broadcast_coalesced()` with CPU tensors.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103504
Approved by: https://github.com/rohan-varma
2023-06-13 18:33:26 +00:00
1628bbecb6 Use free_symbols to determine if convolutions involve dynamic shapes (#103486)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103486
Approved by: https://github.com/shunting314
2023-06-13 18:17:03 +00:00
38890e1d2b Stop disabling ShapeProp with dynamic_shapes for mkldnn (#103381)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103381
Approved by: https://github.com/anijain2305
2023-06-13 18:16:57 +00:00
1506acebaf Detect symbolic tracing_mode with free_symbols (#103515)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103515
Approved by: https://github.com/anijain2305
2023-06-13 17:57:16 +00:00
ddb682f616 Enable Python dispatcher when ShapeProp with fake mode (#103512)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103512
Approved by: https://github.com/Skylion007
2023-06-13 17:47:33 +00:00
af7bd409be Don't test dynamic_shapes in profiler (#103516)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103516
Approved by: https://github.com/anijain2305
2023-06-13 17:47:25 +00:00
05c01b9bfc Register more foreach op lowerings (#102654)
Adds the necessary foreach op lowerings for Adam

Adds two decomps for addcdiv and addcmul (need to verify that type promotion works correctly here)
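
For reference, the underlying identities are simple; an eager-mode check (this is not the exact Inductor decomp code):
```python
import torch

x, t1, t2 = torch.randn(8), torch.randn(8), torch.randn(8)

assert torch.allclose(torch.addcmul(x, t1, t2, value=0.5), x + 0.5 * t1 * t2)
assert torch.allclose(torch.addcdiv(x, t1, t2, value=0.5), x + 0.5 * t1 / t2)
```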

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102654
Approved by: https://github.com/jansel
2023-06-13 17:30:03 +00:00
5b33d39114 [FSDP] Workaround for GLOO's lack of all_gather_into_tensor. (#103170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103170
Approved by: https://github.com/rohan-varma
2023-06-13 17:21:41 +00:00
b77f1b0f27 Wrong type when exporting {zeros, ones, full, empty, rand, randn}_like ops to onnx (#103048)
Fixes #99788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103048
Approved by: https://github.com/thiagocrepaldi
2023-06-13 17:17:28 +00:00
e9f2921bff Fix rerun disabled test uploading logic (#103476)
After https://github.com/pytorch/pytorch/pull/102107, rerunning disabled tests only collects and runs disabled tests.  A side effect of this change is that the skip message `Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run` isn't in the test report anymore, as these non-disabled tests are not collected in the first place.  This breaks the logic in the uploading script that depends on this string to know whether the test report belongs to a rerun-disabled-tests workflow.

* This PR updates the logic in `is_rerun_disabled_tests` check to count the number of times a test is run instead.  In rerunning disabled tests mode, a test is run 50 times by default and 15 times for distributed tests (to avoid timeout). Both these numbers are larger than the max number of retries a test can get normally (3 x 3)
* This also removes the hacky `is_rerun_disabled_tests` check in `tools/stats/upload_test_stats.py` as rerun disabled tests reports are now very small (50 x the number of disabled tests)

### Testing

* `test_gradgrad_nn_GroupNorm_cuda_float64` now shows up correctly https://github.com/pytorch/pytorch/issues/98678
```
python3 -m tools.stats.check_disabled_tests --workflow-run-id 5229037746 --workflow-run-attempt 1 --repo "pytorch/pytorch"

Using temporary directory: /var/folders/x4/2kd9r0fn5b9bf_sbcw16fxsc0000gn/T/tmpdojg5vq5
Downloading test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022.zip
Downloading test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093.zip
Downloading test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167.zip
Downloading test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226.zip
Downloading test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295.zip
Downloading test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371.zip
Downloading test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453.zip
Downloading test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536.zip
Downloading test-reports-test-slow-1-1-linux.2xlarge_14154853469.zip
Downloading test-reports-test-slow-1-1-linux.rocm.gpu_14154932523.zip
Downloading test-reports-test-slow-1-1-linux.rocm.gpu_14154932563.zip
Downloading test-reports-test-slow-1-2-linux.4xlarge_14154873704.zip
Downloading test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154.zip
Downloading test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186.zip
Downloading test-reports-test-slow-2-2-linux.4xlarge_14154873756.zip
Downloading test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225.zip
Downloading test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267.zip
Extracting test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022.zip to unzipped-test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925022
Extracting test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093.zip to unzipped-test-reports-test-default-1-4-linux.g5.4xlarge.nvidia.gpu_14154925093
Extracting test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167.zip to unzipped-test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925167
Extracting test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226.zip to unzipped-test-reports-test-default-2-4-linux.g5.4xlarge.nvidia.gpu_14154925226
Extracting test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295.zip to unzipped-test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925295
Extracting test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371.zip to unzipped-test-reports-test-default-3-4-linux.g5.4xlarge.nvidia.gpu_14154925371
Extracting test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453.zip to unzipped-test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925453
Extracting test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536.zip to unzipped-test-reports-test-default-4-4-linux.g5.4xlarge.nvidia.gpu_14154925536
Extracting test-reports-test-slow-1-1-linux.2xlarge_14154853469.zip to unzipped-test-reports-test-slow-1-1-linux.2xlarge_14154853469
Extracting test-reports-test-slow-1-1-linux.rocm.gpu_14154932523.zip to unzipped-test-reports-test-slow-1-1-linux.rocm.gpu_14154932523
Extracting test-reports-test-slow-1-1-linux.rocm.gpu_14154932563.zip to unzipped-test-reports-test-slow-1-1-linux.rocm.gpu_14154932563
Extracting test-reports-test-slow-1-2-linux.4xlarge_14154873704.zip to unzipped-test-reports-test-slow-1-2-linux.4xlarge_14154873704
Extracting test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154.zip to unzipped-test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931154
Extracting test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186.zip to unzipped-test-reports-test-slow-1-2-linux.g5.4xlarge.nvidia.gpu_14154931186
Extracting test-reports-test-slow-2-2-linux.4xlarge_14154873756.zip to unzipped-test-reports-test-slow-2-2-linux.4xlarge_14154873756
Extracting test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225.zip to unzipped-test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931225
Extracting test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267.zip to unzipped-test-reports-test-slow-2-2-linux.g5.4xlarge.nvidia.gpu_14154931267
Downloading test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523.zip
Downloading test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563.zip
Extracting test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523.zip to unzipped-test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932523
Extracting test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563.zip to unzipped-test-reports-runattempt1-test-slow-1-1-linux.rocm.gpu_14154932563
The following 32 tests should be re-enabled:
  test_huge_index (__main__.TestCuda) from test_cuda.py
  test_conv_bn_fuse_cpu (__main__.CpuTests) from inductor/test_torchinductor.py
  test_multi_threads (__main__.TestTorchrun) from backends/xeon/test_launch.py
  test_huge_index (__main__.TestCuda) from test_cuda_expandable_segments.py
  test_memory_timeline_no_id (__main__.TestMemoryProfilerE2E) from profiler/test_memory_profiler.py
  test_inverse_errors_large_cuda_float64 (__main__.TestLinalgCUDA) from test_linalg.py
  test_trace_dependencies (__main__.TestAnalyze) from test_package.py
  test_caching_pinned_memory (__main__.TestCuda) from test_cuda_expandable_segments.py
  test_graph_concurrent_replay (__main__.TestCuda) from test_cuda_expandable_segments.py
  test_module_attribute_mutation_violation_negative_1 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
  test_module_attribute_mutation_violation_negative_2 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
  test_module_attribute_mutation_violation_negative_4 (__main__.MutationExportTests) from dynamo/test_export_mutations.py
  test_vmapjvpall_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
  test_vmapjvpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
  test_Conv2d_no_bias_cuda_tf32 (__main__.TestNN) from test_nn.py
  test_save_graph_repro (__main__.TestAfterAot) from dynamo/test_after_aot.py
  test_doc_examples (__main__.TestTypeHints) from test_type_hints.py
  test_caching_pinned_memory (__main__.TestCuda) from test_cuda.py
  test_graph_concurrent_replay (__main__.TestCuda) from test_cuda.py
  test_non_contiguous_tensors_nn_ConvTranspose1d_cuda_complex32 (__main__.TestModuleCUDA) from test_modules.py
  test_pickle_nn_RNN_eval_mode_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py
  test_op_has_batch_rule_nn_functional_conv_transpose3d_cuda_float32 (__main__.TestVmapOperatorsOpInfoCUDA) from functorch/test_vmap.py
  test_geometric_kstest_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) from test_torch.py
  test_profiler_experimental_tree_with_memory (__main__.TestProfilerTree) from profiler/test_profiler_tree.py
  test_fs_pool (__main__.TestMultiprocessing) from test_multiprocessing.py
  test_forward_mode_AD_linalg_lu_factor_ex_cuda_complex128 (__main__.TestFwdGradientsCUDA) from test_ops_fwd_gradients.py
  test_vjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
  test_inplace_grad_fmod_cuda_float64 (__main__.TestBwdGradientsCUDA) from test_ops_gradients.py
  test_inplace_gradgrad_remainder_cuda_float64 (__main__.TestBwdGradientsCUDA) from test_ops_gradients.py
  test_bottleneck_cuda (__main__.TestBottleneck) from test_utils.py
  test_comprehensive_empty_strided_cuda_int32 (__main__.TestInductorOpInfoCUDA) from inductor/test_torchinductor_opinfo.py
  test_vmapvjpvjp_linalg_lu_cuda_float32 (__main__.TestOperatorsCUDA) from functorch/test_ops.py
The following 11 are still flaky:
  test_transpose_with_norm (__main__.CPUReproTests) from inductor/test_cpu_repro.py, failing 215/215
  test_compare_cpu_linalg_pinv_singular_cuda_float32 (__main__.TestCommonCUDA) from test_ops.py, failing 100/100
  test_conv_bn_fuse_dynamic_shapes_cpu (__main__.DynamicShapesCodegenCpuTests) from inductor/test_torchinductor_codegen_dynamic_shapes.py, failing 115/115
  test_lobpcg (__main__.TestAutograd) from test_autograd.py, failing 50/50
  test_module_attribute_mutation_violation_negative_3 (__main__.MutationExportTests) from dynamo/test_export_mutations.py, failing 2/50
  test_Conv2d_dilated_cuda_tf32 (__main__.TestNN) from test_nn.py, failing 1/50
  test_grad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py, failing 50/50
  test_index_add_correctness (__main__.TestTorch) from test_torch.py, failing 22/50
  test_attn_cuda (__main__.TestMin) from functorch/test_dims.py, failing 1/50
  test_open_device_registration (__main__.TestCppExtensionOpenRgistration) from test_cpp_extensions_open_device_registration.py, failing 50/50
  test_gradgrad_nn_GroupNorm_cuda_float64 (__main__.TestModuleCUDA) from test_modules.py, failing 50/50
```

* Uploading tests stats for rerunning disabled tests takes only half a minute

```
time python3 -m tools.stats.upload_test_stats --workflow-run-id 5229037746 --workflow-run-attempt 1 --head-branch main
31.94s user 2.94s system 44% cpu 1:19.07 total
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103476
Approved by: https://github.com/clee2000
2023-06-13 17:07:40 +00:00
3ffac08271 Fix bug in SplitCatSimplifier when next_user is an output node (#103338)
Summary:
When simplifying split cat patterns, if the next user of a split node was an output node, there was a bug leading to an issue like: P765993221

Basically, the bug was in how args and kwargs of the user were getting replaced, and the code didn't handle nested arg/kwargs.

Using torch.fx.Node functions such as `all_input_nodes` and `replace_input_with` fixes this issue
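
A small sketch of that pattern (node names are illustrative): `all_input_nodes` flattens args and kwargs, including nested containers, and `replace_input_with` rewrites both uniformly.
```python
import torch

def rewire_user(user: torch.fx.Node, old: torch.fx.Node, new: torch.fx.Node) -> None:
    # all_input_nodes covers args and kwargs (including nested lists/tuples),
    # so the membership check works even when `old` is buried inside a container.
    if old in user.all_input_nodes:
        user.replace_input_with(old, new)
```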

Differential Revision: D46603618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103338
Approved by: https://github.com/jansel
2023-06-13 16:57:51 +00:00
9591e52880 Add vfdev-5 as reviewer for CPU Aten backend (#103524)
As suggested by @malfet. @vfdev-5 is the primary owner of the `interpolate()` op  and this will avoid having to ask for stamps like in https://github.com/pytorch/pytorch/pull/103252.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103524
Approved by: https://github.com/kit1980
2023-06-13 16:17:59 +00:00
b00d388ada Update test_misc.cpp (#97768)
Potential null dereference after dynamic cast was found during static analysis.

**Description:**
Dereference of `ctx` is performed in `TORCH_CHECK` on line 1176, while `ctx` pointer may equal `nullptr`.
Previous `TORCH_CHECK` on line 1175 checks the value of `ctx_ptr` pointer that may be of type that cannot be casted to `TestContext*`. In such case, `dynamic_cast` returns `nullptr` despite `ctx_ptr` is not equal to `nullptr`.

**Fix:**

- Check `ctx` instead of `ctx_ptr` for equality to zero.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97768
Approved by: https://github.com/kit1980
2023-06-13 16:14:11 +00:00
cbe270d233 Fix zeros_like for sparse tensors with batch dimensions. Add opinfo-based tests to like-functions. (#101215)
Fixes #101078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101215
Approved by: https://github.com/cpuhrsch
2023-06-13 16:02:10 +00:00
597e2a11a3 indexing_dtype_strength_reduction more aggressive free_symbols tests (#103470)
ValueRanges can't handle symbolic bounds. Be a bit more careful about detecting if you try to pass in expressions with free symbols, and fall back to "don't know" range if this occurs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103470
Approved by: https://github.com/eellison
2023-06-13 16:00:41 +00:00
63fe26809d Implement all_gather_into_tensor_coalesced. (#98642)
The implementation is suboptimal since it uses c10d's group coalescing, which
is known to be inefficient.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98642
Approved by: https://github.com/wanchaol
2023-06-13 15:06:52 +00:00
4081e924a8 Dynamically assign number of threads in innerdim scan (#103435)
This is the continuation of optimizing inner-dimension scan operations (`torch.cumsum`, `torch.cumprod`, `torch.logcumsumexp`) by dynamically setting the number of threads based on the input shape from #103314.
What I found is that just setting the number of x-threads and y-threads following the ratio of the tensor's shape works quite well (with some clamping).
Here is the speed-up of this PR, compared to `2.0.0+cu118` (not compared to #103314) using A100 with 40GB memory (up to 23x faster):
```
                2        8       32      128      512     1024     2048     4096     8096    16348    65536   262144  1048576
       2:  1.07(4)  1.02(5)  1.01(6)  1.07(7)  2.16(8)  4.94(9)  8.71(9) 11.00(9) 12.99(9) 14.77(9) 16.41(9) 16.81(9) 16.97(9)
       8:  1.20(4)  1.00(4)  1.01(5)  1.08(6)  2.85(7)  4.90(8)  6.34(8) 11.76(9) 13.86(9) 15.26(9) 16.96(9) 17.45(9) 19.75(9)
      32:  1.08(4)  1.00(4)  1.00(4)  1.23(5)  2.48(6)  4.23(7)  5.04(7)  9.16(8) 10.11(8) 18.72(9) 20.64(9) 23.13(9) 23.50(9)
     128:  1.09(4)  1.02(4)  1.03(4)  1.02(4)  1.64(5)  2.84(6)  3.08(6)  5.61(7)  5.86(7) 10.72(8) 19.22(9) 19.75(9) 19.97(9)
     512:  1.06(4)  1.14(4)  1.01(4)  1.10(4)  1.02(4)  1.78(5)  1.85(5)  3.26(6)  3.34(6)  5.56(7)  8.56(8)  9.55(9)  9.62(9)
    1024:  1.21(4)  1.22(4)  1.20(4)  1.06(4)  1.03(4)  1.05(4)  1.81(5)  1.86(5)  3.06(6)  3.12(6)  4.76(7)  5.20(8)  5.56(9)
    2048:  1.04(4)  0.88(4)  1.00(4)  1.01(4)  1.02(4)  1.03(4)  1.02(4)  1.72(5)  1.73(5)  2.62(6)  2.86(7)  3.06(8) --------
    4096:  1.02(4)  1.12(4)  0.98(4)  1.60(4)  1.16(4)  1.09(4)  1.10(4)  1.10(4)  1.74(5)  1.75(5)  1.86(6)  2.00(7) --------
    8096:  1.03(4)  1.00(4)  1.00(4)  1.16(4)  1.17(4)  1.17(4)  1.18(4)  1.18(4)  1.18(4)  1.27(5)  1.43(6) -------- --------
   16348:  1.02(4)  1.15(4)  1.11(4)  1.17(4)  1.12(4)  1.11(4)  1.13(4)  1.12(4)  1.11(4)  1.08(4)  1.32(5) -------- --------
   65536:  1.17(4)  1.17(4)  1.16(4)  1.15(4)  1.12(4)  1.12(4)  1.12(4)  1.10(4)  1.10(4)  1.07(4) -------- -------- --------
  262144:  1.20(4)  1.20(4)  1.08(4)  1.13(4)  1.10(4)  1.09(4)  1.10(4)  1.08(4) -------- -------- -------- -------- --------
 1048576:  1.21(4)  1.14(4)  1.10(4)  1.13(4)  1.09(4)  1.08(4) -------- -------- -------- -------- -------- -------- --------
```
The first row is the innermost dimension, the first column is the outermost dimension (i.e. the batch size).
The float numbers are the speed-up, while the integers within the brackets are the log2 of the number of x-threads.
The blank cells (the ones with dashes) are not compared because of my GPU's memory limitation.

There are some slowdowns that I observed (like `(2048, 8)` and `(4096, 32)`). The slowdown is because in this PR the scan loop (the one I use with Sklansky) is not optimized by the compiler due to its dynamic number of iterations (it is `log2(num_threads_x)`), while in the previous version the scan loop could be unrolled and optimized by the compiler due to its fixed number of iterations.
That's why I slightly modified the operations within the scan loop to use bit operations in order to compensate for this slowdown.

The most significant acceleration comes from the tensors with relatively small batch size (<= 4096) and with very long sequence.
As the batch size increases, the speed up is not that significant because the previous implementation is most likely to be optimized.
NOTE: I haven't optimized scan dim with indices, it could come in another PR.

As for the build time, I tried not to write more templated functions than necessary.
I will report the build time once I have the numbers.
UPDATE: I compared the build time when I changed ScanUtils.cuh only. In `main` branch, it took 4m2s, while in this PR, it took 3m39s.

What do you think, @ngimel?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103435
Approved by: https://github.com/ngimel
2023-06-13 08:29:47 +00:00
f6b4106554 [export] Automatically add label for export (#103458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103458
Approved by: https://github.com/clee2000
2023-06-13 08:24:01 +00:00
13777e3391 Revert "[quant][pt2e] Annotate GRU module (#103358)"
This reverts commit 23892d8ee44c33abafe9b96ccb788033ffbc63ad.

Reverted https://github.com/pytorch/pytorch/pull/103358 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/103358#issuecomment-1588729657))
2023-06-13 07:45:40 +00:00
b0a93c851c Fix BUCK build after #103185 (#103446)
Per title.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at c64f0c0</samp>

> _`torch_headers` grows_
> _to include profiler files_
> _autumn of code change_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103446
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
2023-06-13 05:12:07 +00:00
cyy
db07ba3a9b Use size_t in THManagedMapAllocator (#103331)
When reviewing the source code, I found that the `ptrdiff_t size` parameter of `THManagedMapAllocator::THManagedMapAllocator` can be changed to `size_t size` to avoid unnecessary casts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103331
Approved by: https://github.com/malfet
2023-06-13 04:50:30 +00:00
23892d8ee4 [quant][pt2e] Annotate GRU module (#103358)
Summary: att, we use module partition API to identify the GRU submodule and annotate all necessary patterns

Test Plan: buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'

Reviewed By: kimishpatel

Differential Revision: D46384329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103358
Approved by: https://github.com/HDCharles
2023-06-13 04:10:13 +00:00
6ed3c4499a Fix fuse_custom_config_dict arg from being None (#102154)
`fuse_custom_config_dict` in [fuse_modules.py](https://github.com/pytorch/pytorch/blob/main/torch/ao/quantization/fuse_modules.py#L164) is being passed as None even when a `fuse_custom_config_dict` is provided.

This patch fixes the `fuse_custom_config_dict` from being passed as None.
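For reference, a minimal sketch of the affected call path (the custom config contents here are illustrative):

```python
import torch.nn as nn
from torch.ao.quantization import fuse_modules

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

m = M().eval()
# The config passed here should be forwarded to the fusion logic
# instead of being silently replaced with None.
custom_config = {"additional_fuser_method_mapping": {}}
fused = fuse_modules(m, [["conv", "bn", "relu"]], fuse_custom_config_dict=custom_config)
print(fused)
```
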
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102154
Approved by: https://github.com/kit1980
2023-06-13 03:45:20 +00:00
45104cb67f Different csv headers by bench mode on infra error (#103134)
As title. The headers are different for distinct bench modes. This PR is a supplement
to https://github.com/pytorch/pytorch/pull/100372 to respect `performance` mode, where a numerical speedup is expected
instead of status text.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103134
Approved by: https://github.com/thiagocrepaldi, https://github.com/ezyang
2023-06-13 03:40:22 +00:00
5f77be8bbe Refactor OptimizeIndexing (#100549)
This PR decouples the logic necessary to compute bounds on variables
from the logic that uses this info to perform the strength analysis on
int64 variables. While doing so, it tries to minimize the number of
attributes of the class in favour of local variables.

This class is now accessible from any `LoopBody` object.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100549
Approved by: https://github.com/eellison
2023-06-13 03:31:41 +00:00
88ebb2e321 Windows FileStore skip timeout if the file path is invalid (#103247)
On Windows, if the path of the file to be created by FileStore is not valid, FileStore gets stuck trying to create the file until the timeout is reached.
This PR checks whether the path is invalid and, if it is, leaves the loop immediately.
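
A minimal sketch of the scenario (the path and world size are illustrative): with an invalid path, the store used to keep retrying file creation until the timeout expired instead of failing fast.

```python
from datetime import timedelta
import torch.distributed as dist

# A directory that cannot be created makes the path invalid on Windows.
store = dist.FileStore(r"Z:\does\not\exist\filestore", 1)
store.set_timeout(timedelta(seconds=5))
# Before the fix, operations on such a store would spin until the timeout;
# now the invalid path is detected and reported immediately.
store.set("key", "value")
```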

Fixes #48475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103247
Approved by: https://github.com/fduwjj, https://github.com/kit1980
2023-06-13 03:21:35 +00:00
4c3799447f Back out "Dropout support for memory efficient attention (#102038)" & "Two small mem_eff bug fixes (#103201)" (#103464)
Summary:
Original commit changeset: 04c4473d8510

Original Phabricator Diff: D46584152 & D46582033

Test Plan: Already explained in summary.

Reviewed By: yinghai

Differential Revision: D46633283

fbshipit-source-id: c23c2945408988f3c4339dfd5cd40ae46261716c

Co-authored-by: Shenxiu Liu <shenxiu@meta.com>
2023-06-12 18:56:48 -07:00
7360d0f904 Upgraded nightly wheels to rocm5.5 (#102242)
Upgraded nightly wheels to rocm5.5

Follow-up to https://github.com/pytorch/builder/pull/1407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102242
Approved by: https://github.com/malfet
2023-06-13 01:34:10 +00:00
9bc0b79369 [dynamo][numpy] Install numpy_pytorch_interop in ci jobs (#103447)
numpy_pytorch_interop is required to be installed for all tests annotated with the `@requires_numpy_pytorch_interop` decorator.

This PR adds a commit for it and adds a function to install it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103447
Approved by: https://github.com/ezyang
2023-06-13 01:14:19 +00:00
fa893f3f58 Fix optim state_dict casting to allow step to cast to CPU (#102619)
I'm guessing this should fix https://github.com/pytorch/pytorch/pull/88015#issuecomment-1569523106 but am waiting on @ychfan to supply more details so I could write a good test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102619
Approved by: https://github.com/albanD
2023-06-13 00:46:40 +00:00
666ec8160c Skip test suite (#103472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103472
Approved by: https://github.com/osalpekar, https://github.com/huydhn
2023-06-13 00:43:40 +00:00
4a52694b08 [torch.compile] Add explain as a backend #102053 (#103259)
Fixes #102053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103259
Approved by: https://github.com/voznesenskym
2023-06-13 00:32:17 +00:00
2abad0c184 Add dtype check baddbmm (#102659)
Fixes the part of #100838 related to disallowing non-matching dtypes between the input and batches for the `baddbmm` operator.

* [x] added dtype checks
* [x] added test case
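
For illustration, the kind of mismatch that is now rejected (exact error wording may differ):

```python
import torch

inp = torch.zeros(2, 3, 5, dtype=torch.float32)
b1 = torch.randn(2, 3, 4, dtype=torch.float64)  # dtype differs from the input
b2 = torch.randn(2, 4, 5, dtype=torch.float64)

try:
    torch.baddbmm(inp, b1, b2)
except RuntimeError as e:
    print("dtype mismatch rejected:", e)
```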

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102659
Approved by: https://github.com/ngimel
2023-06-13 00:31:06 +00:00
a18048d982 Remove redundant fallback for view_as_complex (#103261)
This enables lowering to work for it

Differential Revision: [D46585029](https://our.internmc.facebook.com/intern/diff/D46585029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103261
Approved by: https://github.com/desertfire, https://github.com/eellison
2023-06-13 00:27:28 +00:00
2c313e7b99 Revert "Record view stacks if running anomaly mode (#103185)"
This reverts commit a02c573a8996d5d47585410ceaf81c87104cfd43.

Reverted https://github.com/pytorch/pytorch/pull/103185 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629734 ([comment](https://github.com/pytorch/pytorch/pull/103185#issuecomment-1588258206))
2023-06-12 23:52:10 +00:00
c3d3165f16 Enable uploading metrics and upload Test Reordering metrics to dynamodb (#102691)
Added a feature to upload test statistics to DynamoDB and Rockset using a new function `emit_metric` in `tools/stats/upload_stats_lib.py`.

Added metrics to measure test reordering effectiveness in `tools/testing/test_selections.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102691
Approved by: https://github.com/malfet
2023-06-12 23:01:53 +00:00
72b7c4efe5 [Profiler] Fix flaky test_memory_timeline_no_id (#103441)
Summary: On CPU-only runs, the allocator seems to sometimes report out-of-context CPU allocations, but sometimes not. Let's just check that the expected list is contained in the actual list for CPU. The GPU test stays the same.

Test Plan: CI, ran locally 100 times.

Reviewers: dberard

Resolves: https://github.com/pytorch/pytorch/issues/103286
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103441
Approved by: https://github.com/davidberard98
2023-06-12 22:58:56 +00:00
58d2c66a70 [activation checkpointing] Higher order functional rng op wrappers (#102934)
Introduces two higher order operators
* run_and_save_rng_state - Saves the current rng state and then runs the op.
* run_with_rng_state - Runs the op with the rng state supplied as an input

Ideally, we would like to use torch.compile for these operators. But currently the plan is to introduce these operators at the partitioner level, obviating the need to support them fully through the torch.compile stack. To ensure that we have good enough debugging with minifiers, we have ensured that they work with make_fx. In the future, we can move them onto torch.compile.
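
A conceptual sketch of the intended semantics, using the CPU RNG state for simplicity (the real operators are higher-order ops introduced at the partitioner level, not these Python helpers):

```python
import torch

def run_and_save_rng_state(op, *args):
    # Save the current RNG state, then run the op.
    state = torch.get_rng_state()
    return state, op(*args)

def run_with_rng_state(state, op, *args):
    # Run the op under the supplied RNG state, restoring the ambient state afterwards.
    current = torch.get_rng_state()
    torch.set_rng_state(state)
    try:
        return op(*args)
    finally:
        torch.set_rng_state(current)

# The recomputed random op sees the same randomness as the original run,
# which is what activation checkpointing needs for ops like dropout.
state, out = run_and_save_rng_state(torch.rand, 4)
recomputed = run_with_rng_state(state, torch.rand, 4)
assert torch.equal(out, recomputed)
```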

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102934
Approved by: https://github.com/jansel, https://github.com/zou3519
2023-06-12 22:54:17 +00:00
31ee1512d3 [inductor] Update triton pin (#102736)
There is a bug in Triton's handling of `tl.reduce` that breaks the variance PR, but it is fixed on the latest Triton master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102736
Approved by: https://github.com/huydhn, https://github.com/desertfire
2023-06-12 22:02:13 +00:00
455f542ed9 Add groups to dynamo benchmarking output data (#103268)
# Summary
Adds the information required for this issue:
https://github.com/pytorch/test-infra/issues/4268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103268
Approved by: https://github.com/huydhn
2023-06-12 21:09:42 +00:00
4935b3e0e7 Make specialized attributes on Tensor mandatory (#103434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103434
Approved by: https://github.com/anijain2305
2023-06-12 21:01:24 +00:00
056d92e2a0 sparse.mm backward: performance improvements (#94991)
`torch.sparse.mm` - faster and without syncs in "most" cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94991
Approved by: https://github.com/Skylion007, https://github.com/pearu, https://github.com/cpuhrsch
2023-06-12 20:57:29 +00:00
d083d444ff Inductor Freezing (#100652)
Adds a freezing pass, enabled via Inductor's `config.freezing`, that will constant-fold parameters. This occurs post-functionalization in AOT Autograd, both to capture dispatching and to allow passes to run post-functionalization. A few notes:

- There is an option to discard parameters `config.freezing_discard_parameters` which will take the current eager modules and wrap parameters to a Tensor subclass which will error if used.
- I needed to expose flat_params in aot_autograd in order to discard old references when we constant fold away parameters, like with amp. I also exposed `fw_metadata` to avoid constant folding mutated parameters.
- Caching parameter transformations/constant folding across different inference runs is not yet implemented.
- Checking the version_counter of constant-folded params is not yet implemented.

I'm not really sure what the actual naming should be. In jit there was both "freezing", which was platform agnostic, and "optimize for inference", which made device specific optimizations. We're doing the latter here but maybe freezing is a better name.
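
A minimal usage sketch, assuming the flag names from the description above (defaults and exact behavior may differ):

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

inductor_config.freezing = True  # enable the constant-folding freezing pass

model = nn.Sequential(nn.Linear(16, 16), nn.ReLU()).eval()
compiled = torch.compile(model)

with torch.no_grad():
    out = compiled(torch.randn(2, 16))
print(out.shape)
```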

Differential Revision: [D46244033](https://our.internmc.facebook.com/intern/diff/D46244033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100652
Approved by: https://github.com/jansel
2023-06-12 20:56:03 +00:00
54daf870bc CUDA graphs overrides dynamic shapes and forces specialization (#103290)
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs.  This new strategy
I think is better.  The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input.  When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.
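
For example (a sketch, requires CUDA; `mode="reduce-overhead"` is how cudagraphs is enabled through `torch.compile`):

```python
import torch

@torch.compile(mode="reduce-overhead", dynamic=True)
def f(x):
    return (x * 2).sum()

if torch.cuda.is_available():
    f(torch.randn(8, device="cuda"))   # first size: specialize and capture a graph
    f(torch.randn(16, device="cuda"))  # new size: recompile and capture again
```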

This obsoletes https://github.com/pytorch/pytorch/pull/101813

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym, https://github.com/eellison
2023-06-12 20:26:55 +00:00
6c6c897d6b Add graph break logging option instead of config flag (#103202)
Make graph break logging a logging option vs a config setting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103202
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2023-06-12 19:52:31 +00:00
50c972bfd2 [c10d] Add xpu to the default device supported by user specified backend (#103410)
**Motivation:**
For collective dispatching, we want to provide a more user-friendly way to map the xpu device to a user-specified backend such as CCL.

**Solution:**
We add xpu to the default device list, so the mapping between xpu and the user-specified backend can be constructed directly.
Usage:
When using an xpu device, the user can specify only the backend name:
`dist.init_process_group(backend='ccl')`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103410
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-06-12 19:46:33 +00:00
49754f44ee Rewrite size/stride/numel TensorVariable handling (#103438)
The main concept behind this refactor is this: if we know that a size/stride/etc is constant, do NOT trace it into the graph, EXCEPT for any preexisting special cases that applied for static shapes. The refactor unfolds like this:

1. Delete the `dynamic_shapes` branches in torch/_dynamo/variables/builder.py which accept int/float/bool outputs. This is over-aggressive and we don't want to allow this (because if the operator returns a constant, we shouldn't have called wrap_fx_proxy in the first place.) This causes a bunch of failures because we are blindly feeding the result of size() call to wrap_fx_proxy when dynamic shapes is enabled.
2. Modify TensorVariable.call_method in torch/_dynamo/variables/tensor.py to avoid sending constant ints to wrap_fx_proxy. After normal specialization (which should be deleted, see https://github.com/pytorch/pytorch/pull/103434) we consult the fake tensor to see if the values in question have free variables or not. If they don't we short circuit tracing into graph. We only trace into graph if the operation in question is truly symbolic. Note that there is a near miss here: it's OK to trace x.size() call entirely into the graph, even if it doesn't have all dynamic shapes, because operator.getitem with int output is special cased in builder.py. This is a preexisting special case and I don't try to get rid of it.
3. It turns out that the change here also breaks torch_np compatibility layer. So I completely rewrite getattr handling in torch/_dynamo/variables/tensor.py to follow the same pattern (only trace into graph if truly dynamic).

There's some minor housekeeping in torch/fx/experimental/symbolic_shapes.py and some test files.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103438
Approved by: https://github.com/larryliu0820
2023-06-12 19:36:24 +00:00
141828498c [CI] Update inference accuracy test (#103361)
Summary:
1) Switch inference accuracy test from fp32 to amp (consistent with dashboard run, https://github.com/pytorch/pytorch/pull/103220)
2) GoogleFnet fails in eager with amp or fp16, so fallback to always using fp32.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103361
Approved by: https://github.com/eellison
2023-06-12 19:34:18 +00:00
f22d99c784 Update C++ frontend docs (#103451)
Specify that C++ standard is not C++17 and minimum supported CMake version is 3.18

Fixes https://github.com/pytorch/pytorch/issues/103371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103451
Approved by: https://github.com/jeanschmidt
2023-06-12 19:19:36 +00:00
d997969b8b [Reland] Add sym_size/stride/numel/storage_offset to native_function.yaml (#103107)
Differential Revision: D46459100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103107
Approved by: https://github.com/angelayi, https://github.com/soulitzer
2023-06-12 19:18:49 +00:00
0cb5bc3b04 Revert "Move tensor grouping to ATen (#100007)"
This reverts commit 74b7a6c75e698378882d30958908073407f97fb3.

Reverted https://github.com/pytorch/pytorch/pull/100007 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629727 ([comment](https://github.com/pytorch/pytorch/pull/100007#issuecomment-1587861598))
2023-06-12 18:30:33 +00:00
3766c04736 Add uint8 support for CPU images in interpolate(mode='bicubic') (#103252)
CC @vfdev-5

Proposed strategy: Be as close as possible to PIL when `antialias=True`. Be as close as possible to float path when `antialias=False`.

Ad-hoc tests:

<details>

```py
import random

import torch
import pytest
import numpy as np
from PIL import Image
from torch.nn.functional import interpolate

@pytest.mark.parametrize("C", (1, 3, 6))
@pytest.mark.parametrize("batch_size", (1, 4))
@pytest.mark.parametrize("memory_format", (torch.contiguous_format, torch.channels_last, "strided", "cropped"))
@pytest.mark.parametrize("antialias", (True, False))
# @pytest.mark.parametrize("mode", ("bilinear", "bicubic",))
@pytest.mark.parametrize("mode", ("bicubic",))
@pytest.mark.parametrize("seed", range(100))
def test_resize(C, batch_size, memory_format, antialias, mode, seed):

    torch.manual_seed(seed)
    random.seed(seed)

    Hi = 2**random.randint(3, 10) + random.randint(0, 30)
    Wi = 2**random.randint(3, 10) + random.randint(0, 30)
    Ho = 2**random.randint(3, 10) + random.randint(0, 30)
    Wo = 2**random.randint(3, 10) + random.randint(0, 30)
    # print(Hi, Wi, Ho, Wo)

    img = torch.randint(0, 256, size=(batch_size, C, Hi, Wi), dtype=torch.uint8)

    if memory_format in (torch.contiguous_format, torch.channels_last):
        img = img.to(memory_format=memory_format, copy=True)
    elif memory_format == "strided":
        img = img[:, :, ::2, ::2]
    elif memory_format == "cropped":
        a = random.randint(1, Hi // 2)
        b = random.randint(Hi // 2 + 1, Hi)
        c = random.randint(1, Wi // 2)
        d = random.randint(Wi // 2 + 1, Wi)
        img = img[:, :, a:b, c:d]
    else:
        raise ValueError("Uh?")

    margin = 0
    img = img.clip(margin, 255 - margin)
    out_uint8 = interpolate(img, size=[Ho, Wo], mode=mode, antialias=antialias)

    if antialias and C == 3:
        out_pil_tensor = resize_with_pil(img, Wo, Ho, mode=mode, antialias=antialias)
        atol = {"bicubic": 2, "bilinear": 1}[mode]  # TODO: is 2 expected when comparing with PIL bicubic? Why not 1 as for bilinear?
        torch.testing.assert_close(out_uint8, out_pil_tensor, rtol=0, atol=atol)

    out_float = interpolate(img.to(torch.float), size=[Ho, Wo], mode=mode, antialias=antialias).round().clip(0, 255).to(torch.uint8)
    if mode == "bicubic":
        diff = (out_float.float() - out_uint8.float()).abs()
        assert diff.max() < 30

        percent = .03 if antialias else .1
        assert (diff > 2).float().mean() < percent

        mae = .4 if antialias else .8
        assert diff.mean() < mae
    else:
        torch.testing.assert_close(out_uint8, out_float, rtol=0, atol=1)

def resize_with_pil(batch, Wo, Ho, mode, antialias):
    resample = {"bicubic": Image.BICUBIC, "bilinear": Image.BILINEAR}[mode]
    out_pil = [
        Image.fromarray(img.permute((1, 2, 0)).numpy()).resize((Wo, Ho), resample=resample)
        for img in batch
    ]
    out_pil_tensor = torch.cat(
        [
            torch.as_tensor(np.array(img, copy=True)).permute((2, 0, 1))[None]
            for img in out_pil
        ]
    )
    return out_pil_tensor
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103252
Approved by: https://github.com/vfdev-5, https://github.com/H-Huang, https://github.com/malfet, https://github.com/atalman
2023-06-12 18:25:33 +00:00
5ed618132f Revert "change pre_autograd to pre_dispatch tracing (#101818)"
This reverts commit b0392de2c39d132b5901fc9a366afc1ddc214f96.

Reverted https://github.com/pytorch/pytorch/pull/101818 on behalf of https://github.com/izaitsevfb due to Breaks internal builds see D46629736 TypeError: wrap_key() got an unexpected keyword argument pre_autograd ([comment](https://github.com/pytorch/pytorch/pull/101818#issuecomment-1587837667))
2023-06-12 18:16:37 +00:00
1b31665e78 Make all CI commit pin changes trigger ciflow/inductor (#103443)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103443
Approved by: https://github.com/desertfire, https://github.com/malfet
2023-06-12 18:12:45 +00:00
fc46f01b55 Revert "Cleanup scatter-related code (#103074)"
This reverts commit 88aea179e379c743764e148adb86f1a320f0a299.

Reverted https://github.com/pytorch/pytorch/pull/103074 on behalf of https://github.com/izaitsevfb due to Breaks internal builds, see D46629742, symbol not found: scatter_add_expanded_index_stub ([comment](https://github.com/pytorch/pytorch/pull/103074#issuecomment-1587823954))
2023-06-12 18:08:46 +00:00
1ef7d6790d [ONNX] Fix onnx constant folding (#101329)
Fixes #101328

Note that this is most likely a band-aid solution. We either need to actually fix one of the ONNX passes that is causing this decomposition/functionalization issue, or need to special-case all ONNX ops in `runTorchBackendForOnnx` like this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101329
Approved by: https://github.com/BowenBao
2023-06-12 18:06:41 +00:00
cyy
48e3ee29ff enable missing-prototypes in functorch (#103391)
This PR enables missing-prototypes in the functorch target and turns some functions into static ones.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103391
Approved by: https://github.com/malfet
2023-06-12 17:47:37 +00:00
de354bf53e Replace CUDA 11.7 small pip wheels with 12.1 (#103091)
CC @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103091
Approved by: https://github.com/malfet, https://github.com/atalman
2023-06-12 17:29:20 +00:00
ac3ce0a57a Remove dynamic_shapes special case in SizeVariable getitem (#103380)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103380
Approved by: https://github.com/voznesenskym
2023-06-12 17:29:03 +00:00
2eac8bd2b8 [dynamo][numpy] Support ndarray methods (#97537)
This PR adds universal support for ndarray methods. After #100839 each `NumpyNdarrayVariable` should wrap a `torch.Tensor`. This PR adds a `numpy_method_wrapper` which converts the `torch.Tensor` to `torch_np.ndarray` and then call the numpy ndarray method. Then we also try to return a `torch.Tensor` (return as-is if the value is not ndarray-like)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97537
Approved by: https://github.com/ezyang
2023-06-12 17:21:31 +00:00
18f203a567 Clean up op BC check list (#103363)
Summary: We clean up the BC op check list, and remove expired items.

Test Plan: OSS CI

Differential Revision: D46618642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103363
Approved by: https://github.com/Skylion007
2023-06-12 17:00:15 +00:00
df83fe5bf7 [dynamo] graph break on nn.Parameter construction (#103262)
Fixes #99569

nn.Parameter construction appears to run into FakeTensor / tracing issues during AOT Autograd. We could try to fix this; but nn.Parameter construction _inside_ the compiled region isn't a common scenario, so it's reasonable to just graph break on nn.Parameter construction.

For reference, see #99569 for the errors/issues that appear from tracing through nn.Parameter construction with AOT Autograd.
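
A small illustration of the case that now graph-breaks instead of erroring (sketch):

```python
import torch
import torch.nn as nn

@torch.compile
def f(x):
    # Constructing an nn.Parameter inside the compiled region triggers a
    # graph break; the function still runs correctly via eager fallback.
    p = nn.Parameter(torch.ones(3))
    return x + p

print(f(torch.randn(3)))
```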

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103262
Approved by: https://github.com/williamwen42
2023-06-12 16:41:56 +00:00
08f90b3481 Revert "Update torchbench pin - torchrec_dlrm moved to canary (#103383)"
This reverts commit 114f99bba1a10da748326aa246709456cf46c10f.

Reverted https://github.com/pytorch/pytorch/pull/103383 on behalf of https://github.com/malfet due to This broke inductor test ([comment](https://github.com/pytorch/pytorch/pull/103383#issuecomment-1587681978))
2023-06-12 16:40:05 +00:00
caecb55223 Revert "Log functional_collectives apis to distributed logger (#103288)"
This reverts commit 37359c36fdb413df3b02996eb0ea2433c147db34.

Reverted https://github.com/pytorch/pytorch/pull/103288 on behalf of https://github.com/malfet due to Broke test_inductor_collectives, see 37359c36fd ([comment](https://github.com/pytorch/pytorch/pull/103288#issuecomment-1587677705))
2023-06-12 16:37:57 +00:00
c3fdfca5da Always create ShapeEnv, always apply unspec logic (#103302)
Originally, my goal for this PR was to remove the `dynamic_shapes` tests in torch/_dynamo/variables/builder.py. However, one thing lead to another, and it turns out that it was easiest to do all of the following in one go:

* Unconditionally allocate a ShapeEnv, no matter if dynamic_shapes is enabled or not (torch/_dynamo/output_graph.py). There is a small adjustment to export torch/_dynamo/eval_frame.py to account for the fact that a ShapeEnv always exists, even if you're not doing symbolic export.
* Remove dynamic_shapes test from unspec logic (torch/_dynamo/variables/builder.py), the original goal
* Specialize strides and storage offset if all sizes are dynamic (torch/fx/experimental/symbolic_shapes.py). This is required to deal with the unconditional ShapeEnv: if a ShapeEnv exists, fake tensor-ification may choose to allocate symbols. The idea is that with `automatic_dynamic_shapes == False`, Dynamo should never request dynamic sizes, but this invariant was not upheld for nontrivial strides/offset.

The rest are just auxiliary fixups from the above:

* Workaround bug in FakeTensorProp where sometimes it doesn't return a FakeTensor (torch/fx/passes/fake_tensor_prop.py), see https://github.com/pytorch/pytorch/pull/103395 for follow up
* Make ShapeProp correctly handle int inputs (torch/fx/passes/shape_prop.py)
* Disable indexing strength reduction if `assume_static_by_default` is False (torch/_inductor/codegen/triton.py)
* Fix hf_T5_generate to NOT toggle `assume_static_by_default` if dynamic shapes is not enabled (benchmarks/dynamo/common.py); technically this is not necessary anymore but it's in for safety.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103302
Approved by: https://github.com/voznesenskym
2023-06-12 12:48:28 +00:00
f4228e7037 [xla hash update] update the pinned xla hash (#103416)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103416
Approved by: https://github.com/pytorchbot
2023-06-12 10:20:36 +00:00
37359c36fd Log functional_collectives apis to distributed logger (#103288)
This logs functional collectives API calls with debug log level only.

(the `+` in the TORCH_LOGS cmdline enables debug level, otherwise only info level)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103288
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-06-12 06:33:26 +00:00
f37be77813 [Quant][XNNPACK] Delegate add_relu fusion (#103266)
Quantized Resnet currently sees fused add-relu
```
--> dq
       \
        add --> relu --> quant
       /
--> dq
```

Let us support this fusion in the delegate as xnnpack can use the output_min and output_max of the op nodes to clamp the values and perform a fused add - relu operation

Differential Revision: [D45258028](https://our.internmc.facebook.com/intern/diff/D45258028/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103266
Approved by: https://github.com/jerryzh168
2023-06-12 04:35:29 +00:00
8a744c31d3 Up to 48% speed up using Sklansky method for innermost prefix scan algorithm (#103314)
I found that this algorithm (Sklansky) could provide a speed-up over the previously implemented Brent-Kung (BK) algorithm. In the BK algorithm, the sweeps are done twice: an up-sweep and a down-sweep. In the up-sweep, initially all threads are working, but then half of the working threads become inactive in each subsequent step. The down-sweep is the other way around: it starts with only 1 working thread and doubles the number of working threads at each step. As a result, half of the threads are idle on average, and the algorithm takes `2 * log2(num_threads_x)` sweep steps.

On the other hand, the Sklansky algorithm only uses 1 sweep, and in each step of the sweep all the threads are working. This algorithm takes `log2(num_threads_x)` sweep steps, which is half of the BK algorithm, and that provides the speed-up. I follow the schematics of the Sklansky algorithm provided in [this paper](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf). The same paper provides a much better algorithm (the one implemented in CUB), but I haven't got my head around it yet, while the Sklansky algorithm is easier to digest and implement.
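
A serial Python sketch of the Sklansky combination pattern (each outer iteration corresponds to one sweep step in which all "threads" are active; this is only an illustration of the schedule, not the CUDA kernel):

```python
def sklansky_inclusive_scan(xs, op=lambda a, b: a + b):
    x = list(xs)
    n = len(x)  # assumed to be a power of two
    d = 1
    while d < n:
        # Every element in the upper half of its 2*d block combines with the
        # last element of the lower half; all such elements update in parallel.
        for i in range(n):
            if i & d:
                partner = (i // d) * d - 1
                x[i] = op(x[partner], x[i])
        d *= 2
    return x

print(sklansky_inclusive_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]
```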

Here are the speed up from my experiment using `cumsum` in the innermost dimension using A100:
(UPDATE: the newest commit further optimize it up to 76% on `8 x 4000` matrix)
(UPDATE: added shapes with 2048 and 1M in its elements)
| Shape        | Torch cumsum              | Custom cumsum            | Speed up            |
|--------------|---------------------------|--------------------------|---------------------|
| (2, 1000)    | 4.8112869262695315e-05   | 2.849102020263672e-05    | 1.688702928870293   |
| (8, 4000)    | 0.00017731189727783204   | 0.0001005411148071289    | 1.7635760018970834  |
| (128, 10000) | 0.0005342483520507813    | 0.00035474300384521487   | 1.5060151891928222  |
| (1024, 20000)| 0.0014238595962524415    | 0.0010990619659423829    | 1.2955225823246128  |
| (1024, 100000)| 0.007089591026306153    | 0.005468320846557617     | 1.296484099093993   |
| (2048, 1000000)| 0.058730244636535645 | 0.0458010196685791 | 1.2822912035913994 |
| (1000, 2)    | 1.0919570922851562e-05   | 8.106231689453125e-06    | 1.3470588235294116  |
| (4000, 8)    | 9.512901306152343e-06    | 7.867813110351562e-06    | 1.209090909090909   |
| (10000, 128) | 2.079010009765625e-05    | 1.6164779663085937e-05   | 1.2861356932153394  |
| (20000, 1024)| 0.00024993419647216796   | 0.00017964839935302734   | 1.3912408759124086  |
| (100000, 1024)| 0.0011160612106323243   | 0.0009322404861450195    | 1.1971816577581138  |
| (1000000, 2048) | 0.017030668258666993 | 0.014445066452026367 | 1.178995494082889 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103314
Approved by: https://github.com/ngimel
2023-06-11 22:29:33 +00:00
0863e5503a Handle nonzero via its meta registration (#103379)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103379
Approved by: https://github.com/Skylion007
2023-06-11 21:41:27 +00:00
114f99bba1 Update torchbench pin - torchrec_dlrm moved to canary (#103383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103383
Approved by: https://github.com/ezyang
2023-06-11 16:47:25 +00:00
03101a227f Remove not dynamic_shapes case from wrap_listlike (#103301)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103301
Approved by: https://github.com/voznesenskym
2023-06-10 12:51:19 +00:00
900226f20a add multi swa support for custom device (#103297)
Fixes #ISSUE_NUMBER
add multi swa support for custom device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103297
Approved by: https://github.com/janeyx99
2023-06-10 10:01:08 +00:00
daf75c0759 [AOTAutograd] compare with stride hints (#103342)
We previously compared a FakeTensor's strides with the real tensor's strides. This caused dynamic dimensions of the FakeTensor to be specialized to static ints, which may cause a graph specialized for one shape to be used for another shape, which is wrong.

Use stride hints for the comparison instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103342
Approved by: https://github.com/malfet
2023-06-10 06:51:54 +00:00
4cfa06f706 [BE] Deprecate has_XYZ attributes (#103279)
Use [`__getattr__`](https://peps.python.org/pep-0562/) to raise a warning when one tries to access the `has_XYZ` attributes, and recommend the appropriate `torch.backends.XYZ` methods.

Make respective properties in `torch._C` private (by prefixing them with underscore), to exclude from `from torch._C import *`.

Added `warnings.simplefilter` to workaround Python-3.11 torch.compile lineinfo issue.
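
For example (a sketch; the exact warning category and text may differ):

```python
import warnings
import torch

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _ = torch.has_cuda  # now resolved via module-level __getattr__ and warns
    print(caught[0].message)

# Preferred, non-deprecated query:
print(torch.backends.cuda.is_built())
```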

Fixes https://github.com/pytorch/pytorch/issues/102484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103279
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2023-06-10 05:17:17 +00:00
0496d70aa0 [Profiler][Easy] Add log msg to assertEqual for flaky test_memory_timeline_no_id (#103326)
Summary: Add msg to assertEqual field in the flaky test of test_memory_timeline_no_id, so that we print the actual tuple for debugging.

Test Plan: CI

Differential Revision: D46596242

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103326
Approved by: https://github.com/davidberard98
2023-06-10 03:57:57 +00:00
919c567c38 Simplify has_unpack_var_sequence (#103324)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103324
Approved by: https://github.com/Skylion007
2023-06-10 03:57:29 +00:00
d61cd03b97 Inductor cpp wrapper: support ConvTranspose and fix Convolution ir (#103308)
The changes in this PR include:
- Support ConvTranspose in cpp wrapper
- Fix cpp wrapper support for aten convolution when bias is `not None`: bias is in `args` instead of `kwargs` when it is `not None`. The change is covered by ConvTranspose dynamic shapes UT since we'll fall back to aten convolution in dynamic shape cases.
- Fix cpp wrapper support for `inf`. This is a UT added in https://github.com/pytorch/pytorch/issues/101865. The cpp wrapper UT is covered in `test_conv2d_unary` of `test_cpp_wrapper.py`. It's in the `slowTest` category and seems not to have been captured in the CI of that PR.

I will submit another PR to remove the hard-coded schema in these `ExternKernel`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103308
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-06-10 03:53:05 +00:00
d67b676c51 Remove config.dynamic_shapes test for tracing size calls (#103325)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103325
Approved by: https://github.com/Skylion007
2023-06-10 03:42:36 +00:00
cf8af57c4a Make torch.compile(dynamic=True) not assume static by default (#99469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99469
Approved by: https://github.com/ezyang
2023-06-10 02:56:01 +00:00
f474497cd3 [Docker] Update cc/c++ to point to clang/clang++ (#103350)
Should prevent weird issues when PyTorch is compiled with clang, but
triton or torchvision are built with gcc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103350
Approved by: https://github.com/seemethere
2023-06-10 02:53:38 +00:00
347463fddf [cpp-extensions] Add clang to the list of supported Linux compilers (#103349)
Not sure why it was excluded previously (an oversight, I guess).
Also, please note that `clang++` is already considered an acceptable compiler (as it ends with `g++` ;))

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 55aa7db</samp>

> _`clang` or `gcc`, we don't care what you use_
> _We'll build our extensions with the tools we choose_
> _Don't try to stop us with your version string_
> _We'll update our logic and make our code sing_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103349
Approved by: https://github.com/seemethere
2023-06-10 02:53:38 +00:00
00e16179f0 [LibTorch] Fix append_whole_archive macro (#103348)
`-force_load` is not a compiler option but a linker option, and as such it should depend on the platform (i.e. macOS/iOS) rather than on the compiler (i.e. clang vs gcc).

Otherwise, an attempt to link libtorch statically with clang results in a cryptic `/usr/bin/ld: -f may not be used without -shared` error on Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103348
Approved by: https://github.com/seemethere
2023-06-10 02:53:37 +00:00
5c252f2c7c [Inductor/cpp] Fix reduction on pre clang-10 (#103347)
`#pragma omp declare reduction` is not supported before clang-10 and results in a misleading compiler error in the following example:
```c++

template<typename T>
T max_propagate_nan(T, T);

extern "C" void cpp_fused_argmax_max_sum_0(const float* in_ptr0,
                       float* out_ptr0,
                       float* out_ptr1,
                       long* out_ptr2)
{
    float tmp_acc0 = 0;
    float tmp_acc1 = -std::numeric_limits<float>::infinity();
    float tmp_acc2 = std::numeric_limits<float>::infinity();
    struct IndexValue_7 {size_t index; float value;};
    IndexValue_7 tmp_acc3{0, -std::numeric_limits<float>::infinity()};
    #pragma omp declare reduction(argmax : IndexValue_7 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)               initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
    for(long i0=static_cast<long>(0L); i0<static_cast<long>(3L); i0+=static_cast<long>(1L))
    {
        auto tmp0 = in_ptr0[static_cast<long>(i0)];
        tmp_acc0 = tmp_acc0 + tmp0;
        tmp_acc1 = max_propagate_nan(tmp_acc1, tmp0);
        if (tmp_acc3.value < tmp0) {
            tmp_acc3.index = i0; tmp_acc3.value = tmp0;
        }
    }
    out_ptr0[static_cast<long>(0L)] = tmp_acc0;
    out_ptr1[static_cast<long>(0L)] = tmp_acc1;
    out_ptr2[static_cast<long>(0L)] = tmp_acc3.index;
}
```

```
% clang++-10 -std=c++17 -fopenmp bar.cpp  -c -O3
% clang++-9 -std=c++17 -fopenmp bar.cpp  -c -O3
bar.cpp:17:149: error: expected ')'
    #pragma omp declare reduction(argmax : IndexValue_7 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)               initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
                                                                                                                                                    ^
bar.cpp:17:34: note: to match this '('
    #pragma omp declare reduction(argmax : IndexValue_7 :                omp_out.value = omp_in.value < omp_out.value ? omp_out.value : omp_in.value,                omp_out.index = omp_in.value < omp_out.value ? omp_out.index : omp_in.index)               initializer(omp_priv = {0, -std::numeric_limits<float>::infinity()})
                                 ^
1 error generated.
```

Also, remove the unnecessary `struct` keyword in front of the type, as the C++ compiler already assumes it (and again, it causes problems with the clang++-10 implementation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103347
Approved by: https://github.com/voznesenskym
2023-06-10 02:53:37 +00:00
414ec6ce97 Turn off automatic_dynamic_shapes in prep for dynamic-by-default (#103320)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103320
Approved by: https://github.com/Skylion007
2023-06-10 02:49:59 +00:00
a8549357d2 Add distributed category to TORCH_LOGS (#103351)
Fix use of torch distributed testing assertLogs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103351
Approved by: https://github.com/wanchaol
2023-06-10 02:21:36 +00:00
59ee6cd864 fix soundness bug with unsupported constraints (#102897)
We do not raise constraint violations for complex binary conditions, such as conditions involving `%`. Moreover, while these constraints are discovered by our solver, the solver does not inject new constraint violations. This can result in cases where export passes, appropriate assertions are not added, and we get runtime crashes.

Now, when the solver discovers constraints that are too complex, we force-specialize the involved dimensions and raise a constraint violation when such dimensions are marked dynamic. This forces the user to remove the dynamic marking, and causes the appropriate specialization assertions to be added.

Differential Revision: [D46415786](https://our.internmc.facebook.com/intern/diff/D46415786/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102897
Approved by: https://github.com/tugsbayasgalan
2023-06-10 01:59:55 +00:00
1b398297dd Rely on repeat meta reporting dynamic shapes (#103294)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103294
Approved by: https://github.com/Skylion007
2023-06-10 01:36:36 +00:00
1d40b394e6 Remove getitem dynamic shapes special case (#103296)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103296
Approved by: https://github.com/voznesenskym
2023-06-10 01:27:22 +00:00
5987c52082 Delete is_dynamic_shapes test (#103291)
Simplified version of https://github.com/pytorch/pytorch/pull/102106/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103291
Approved by: https://github.com/voznesenskym
2023-06-10 01:27:14 +00:00
7be2a6228d Delete non-dynamic shapes export special case in guard creation (#103295)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103295
Approved by: https://github.com/voznesenskym
2023-06-10 01:26:06 +00:00
1eb762c919 [Inductor][FX passes] Normalize torch.cat for pre_grad fusion (#102951)
Fixes #102950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102951
Approved by: https://github.com/devashishshankar
2023-06-10 00:56:51 +00:00
443edb9015 [DOCS][DDP] Fix the example of saving and reloading PowerSGD state and hook. (#102721)
Fix the example of saving and reloading the PowerSGD state and hook.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102721
Approved by: https://github.com/H-Huang
2023-06-10 00:15:00 +00:00
fff5daf3ee [Dynamo] Support methods of NamedTuple (#103217)
This PR adds support for calling NamedTuple methods. https://github.com/pytorch/pytorch/issues/91662
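
A small example of the kind of pattern this enables (sketch):

```python
from typing import NamedTuple
import torch

class Pair(NamedTuple):
    a: torch.Tensor
    b: torch.Tensor

@torch.compile
def f(p: Pair):
    # Call a NamedTuple method inside the compiled region.
    q = p._replace(b=p.b + 1)
    return q.a + q.b

print(f(Pair(torch.zeros(3), torch.zeros(3))))
```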

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103217
Approved by: https://github.com/williamwen42
2023-06-10 00:01:40 +00:00
d84b63c4f4 Properly respect automatic dynamic config for unspec int (#103321)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103321
Approved by: https://github.com/Skylion007
2023-06-09 23:47:56 +00:00
2b3d955ffd [pt2] add meta and SymInt support for linalg_matrix_exp (#102945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102945
Approved by: https://github.com/lezcano
2023-06-09 22:45:16 +00:00
3a0f37735c [pt2] bug fix: invert condition in checkFloatingOrComplex (#102944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102944
Approved by: https://github.com/lezcano
2023-06-09 22:45:16 +00:00
cyy
34ccd1dde6 [Reland2] fix missing-prototypes warnings in torch_cpu (Part 5) (#102931)
This PR relands the changes introduced in PR #101976 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102931
Approved by: https://github.com/Skylion007
2023-06-09 21:58:51 +00:00
90110b0e4f Revert "Add distributed category to TORCH_LOGS (#103287)"
This reverts commit 0b252aebb2e46c0dc5585ec6d296832a308f563b.

Reverted https://github.com/pytorch/pytorch/pull/103287 on behalf of https://github.com/ZainRizvi due to Breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/103287#issuecomment-1585161976))
2023-06-09 21:51:25 +00:00
cde4657284 [inductor] Support complex fallback for convert_element_type, _fft_c2c, view_as_real to support GoogleFnet with cpp wrapper (#103183)
Fixes #102752

These 3 fallback kernels appear in GoogleFnet because they take complex arguments - i.e., usually they aren't fallback kernels. To support this model, we added support for these 3 ops.

Details:
1. Add these 3 ops to the allowlist. I assume that we eventually want to support all fallback kernels, but for now we just add these 3 ops to the allowlist.
2. Support complex64 in cpp codegen
3. Support List[] arguments and ScalarType arguments in cpp codegen
4. Allow alias_info in schema arguments. In the original PR supporting fallback kernels for cpp wrapper, ops with schemas with non-null alias_info for any of the arguments were disallowed; but I don't think there's any reason we need to disallow these in cpp wrapper code.

Caveats:
* This has not added support for complex32 or complex128
* It only works with static shapes, not dynamic shapes. It seems like the dynamic shapes issue is unrelated to cpp wrapper, since it fails in the test_torchinductor_dynamic_shapes.py test. I checked these `test_fft_.*` tests, which I added in this PR, and verified that they were broken with dynamic shapes before any of the code changes from this PR.

**Test**:

```
benchmarks/dynamo/huggingface.py --inductor --amp --accuracy --inference --device cuda   --cpp-wrapper --only GoogleFnet
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103183
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/chunyuan-w
2023-06-09 21:12:41 +00:00
f49b2f114a [Pytorch] Add Vulkan support for aten::unsqueeze, 1d->2d, 3d->4d (#102987)
Summary:
Re-submitting D46057585 after revert from merge conflict

Add 1d->2d, 3d->4d unsqueeze

Unsqueeze operator: https://pytorch.org/docs/stable/generated/torch.unsqueeze.html#torch.unsqueeze

Test Plan:
Unsqueeze tests:
```
lfq@lfq-mbp xplat % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
Downloaded 0/44 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 38.6 sec (100%) 523/523 jobs, 8/523 updated
  Total time: 38.6 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 9 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 9 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (76 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (9 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 9 tests from VulkanAPITest (98 ms total)

[----------] Global test environment tear-down
[==========] 9 tests from 1 test suite ran. (98 ms total)
[  PASSED  ] 9 tests.
```

clang-format on the glsl files

Reviewed By: copyrightly

Differential Revision: D46375157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102987
Approved by: https://github.com/SS-JIA
2023-06-09 21:04:53 +00:00
89632b56ff Revert "NCCL process group: avoid workEnqueue when capturing cuda graph (#102542)" (#103341)
This reverts commit 74a5d62d7ca9204b3b24137065c73fc7c66cc02d from PR https://github.com/pytorch/pytorch/pull/102542

That PR introduces a land race (see failure [here](5aefa61d2f)), and since it was exported from phabricator it cannot be reverted normally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103341
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-06-09 20:38:13 +00:00
7550ec16a4 Add support for dictionary with torch object keys. (#103158)
Fixes: #101979

This PR adds support for dictionaries with torch objects as keys in dynamo.

The main problem was that, for example, the source built for `d[torch.float]` (`d` being a
dictionary) was `ODictGetItemSource(GlobalSource('d'), index=torch.float)`. When
`Source.name` method was called, we got `odict_getitem(G['d'], torch.float)`. Evaluating
that string raised an error, since `torch` was only available in the global dictionary `G`
as `G["torch"]`.

Instead, this PR builds the source:
`ODictGetItemSource(GlobalSource('d'), index=AttrSource(GlobalSource('torch'), 'float'))`.
The to-be-evaluated string is correctly generated as:
`odict_getitem(G['d'], G['torch'].float)`.

Here's a minimal example that reproduces the error, before this PR:

```python
import torch

d = {
    torch.float16: torch.float32,
}

@torch.compile
def f():
    return torch.randn(3, dtype=d[torch.float16])

f()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103158
Approved by: https://github.com/mlazos
2023-06-09 20:18:49 +00:00
d1f24f73da Revert "Make HigherOrderOperator stop appearing like torch.ops.* in FX (#103108)"
This reverts commit 194262ee49961acc8d84d6d4672748eae1826c30.

Reverted https://github.com/pytorch/pytorch/pull/103108 on behalf of https://github.com/izaitsevfb due to Breaks executorch internally, see D46581996 ([comment](https://github.com/pytorch/pytorch/pull/103108#issuecomment-1585041505))
2023-06-09 19:31:40 +00:00
0b252aebb2 Add distributed category to TORCH_LOGS (#103287)
This lets users run `TORCH_LOGS="+distributed" python myscript.py` and enable additional logging output for the distributed module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103287
Approved by: https://github.com/ezyang
2023-06-09 19:25:07 +00:00
d89dd05e4d Revert "CUDA graphs overrides dynamic shapes and forces specialization (#103290)"
This reverts commit c760f0e4dd5dad6146f6ab97116924786911768d.

Reverted https://github.com/pytorch/pytorch/pull/103290 on behalf of https://github.com/ezyang due to to handle the other cuda graphs case ([comment](https://github.com/pytorch/pytorch/pull/103290#issuecomment-1584977767))
2023-06-09 18:25:28 +00:00
5aefa61d2f Fix calls to unqualified format_to to not clash with C++20's std::format_to (#103130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103130
Approved by: https://github.com/Skylion007
2023-06-09 18:19:07 +00:00
74a5d62d7c NCCL process group: avoid workEnqueue when capturing cuda graph (#102542)
Summary:
In torch.distributed, we make ProcessGroupNCCL not call workEnqueue when the cuda stream is capturing. I.e., when capturing a CUDA graph, we do not enqueue anything for the watchdog thread to consider. This allows capturing NCCL operations in a CUDA Graph.

This is followup to an internal discussion [1] where the watchdog thread was observed to crash when using cuda graphs containing an all_reduce. The watchdog thread wants to query events pertaining to enqueued work items, but this can't be done for "events" created during cuda graph capture.

[1] https://fb.workplace.com/groups/1405155842844877/posts/6975201909173548/
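
A sketch of the pattern this enables (assumes a CUDA build with NCCL and an already initialized process group; warm-up and stream details are omitted):

```python
import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl", ...) has been called and each rank
# owns one GPU.
t = torch.ones(4, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # With this change, the collective is not enqueued to the watchdog while
    # the stream is capturing, so it can be captured into the graph.
    dist.all_reduce(t)

g.replay()  # replays the captured all_reduce
```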

Test Plan: Test added. Also, the repro mentioned in https://fb.workplace.com/groups/1405155842844877/posts/7003002339726838/ runs successfully after this change.

Differential Revision: D46274814

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102542
Approved by: https://github.com/kwen2501
2023-06-09 18:16:02 +00:00
88aea179e3 Cleanup scatter-related code (#103074)
This patch cleans up scatter-related code.

The GNN-specific implementation of the scatter operation uses `radix_sort` to sort the indices. As `radix_sort` was recently moved to the FBGEMM common utils (via [pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)), we no longer need a local copy of the algorithm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103074
Approved by: https://github.com/mikaylagawarecki
2023-06-09 18:11:32 +00:00
c760f0e4dd CUDA graphs overrides dynamic shapes and forces specialization (#103290)
Previously, cudagraphs and dynamic_shapes were incompatible and enabling
dynamic shapes would forcibly disable cudagraphs.  This new strategy
I think is better.  The idea is essentially that cudagraphs is an
"optimization" that happens to guard on every input.  When cudagraphs
is on, we force everything static, and this automatically does the right
thing because we will force a recompile if sizes change.

This obsoletes https://github.com/pytorch/pytorch/pull/101813

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103290
Approved by: https://github.com/voznesenskym
2023-06-09 17:43:47 +00:00
b0392de2c3 change pre_autograd to pre_dispatch tracing (#101818)
We discussed in a composability meeting a few weeks ago that `pre_autograd` should probably be renamed to `pre_dispatch`.

One question in this PR was: should I re-use a dispatch key? Or should I create a new dispatch key (that yet again corresponds to "top of the dispatcher")?

~~For now, I ended up sticking our proxy mode on the mode stack corresponding to `PythonTLSSnapshot`, because it was simple and it works. It looks like one of the functorch dispatch keys has higher priority though, so it's possible that functorch will end up running first. Open to options, but we can consider adding a new dispatch key later if that becomes a problem~~

Update: I added a dedicated dispatch key, `PreDispatch`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101818
Approved by: https://github.com/ezyang, https://github.com/Neilblaze, https://github.com/albanD, https://github.com/zou3519
2023-06-09 17:30:15 +00:00
1c3a7d9a7e Resolve TODO by deleting assert sparse cannot be meta on SymInt (#103299)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103299
Approved by: https://github.com/bdhirsh
2023-06-09 17:13:54 +00:00
a02c573a89 Record view stacks if running anomaly mode (#103185)
Now, when you do an inplace mutation and the view is naughty, you get this message:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). To find out where this view was allocated, run your entire forward region under anomaly mode (torch.autograd.detect_anomaly(check_nan=False)).
```

When you run under anomaly mode, you get:

```
RuntimeError: A view was created in no_grad mode and is being modified inplace with grad mode enabled. Given that this use case is ambiguous and error-prone, it is forbidden. You can clarify your code by moving both the view and the inplace either both inside the no_grad block (if you don't want the inplace to be tracked) or both outside (if you want the inplace to be tracked). This view was allocated at:
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4299, in arglebargle
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 4306, in test_anomaly_gives_view_stack
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 591, in run
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2266, in _run_with_retry
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 2337, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/case.py", line 650, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 122, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/suite.py", line 84, in __call__
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/runner.py", line 184, in run
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 271, in runTests
  File "/home/ezyang/local/c/pytorch-env/lib/python3.10/unittest/main.py", line 101, in __init__
  File "/data/users/ezyang/c/pytorch/torch/testing/_internal/common_utils.py", line 894, in run_tests
  File "/data/users/ezyang/c/pytorch/test/test_autograd.py", line 11209, in <module>
```
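
A hedged, minimal repro sketch of this pattern (an assumption on my part, not the PR's actual test): create a view in no_grad mode while under anomaly mode, then mutate it in-place with grad enabled.

```python
import torch

with torch.autograd.detect_anomaly(check_nan=False):
    base = torch.randn(4, requires_grad=True)
    with torch.no_grad():
        view = base[:2]          # view created in no_grad mode
    try:
        view.mul_(2)             # in-place mutation with grad mode enabled
    except RuntimeError as e:
        print(e)                 # message now includes where the view was allocated
```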

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103185
Approved by: https://github.com/zdevito
2023-06-09 16:56:28 +00:00
79e0a1eacb Revert "Make torch.compile(dynamic=True) not assume static by default (#99469)"
This reverts commit 7108c035bc0309f60fc86d32a42335b8808576f9.

Reverted https://github.com/pytorch/pytorch/pull/99469 on behalf of https://github.com/ZainRizvi due to Breaks trunk ([comment](https://github.com/pytorch/pytorch/pull/99469#issuecomment-1584868864))
2023-06-09 16:46:29 +00:00
2e21cb095a Remove capture_scalar_outputs sanity check prepping for dynamic by default (#103292)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103292
Approved by: https://github.com/voznesenskym
2023-06-09 16:13:09 +00:00
4a5d56b74c Disable dynamo'd test_optim entirely (#103323)
See issue https://github.com/pytorch/pytorch/issues/103322.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103323
Approved by: https://github.com/DanilBaibak, https://github.com/atalman, https://github.com/malfet
2023-06-09 16:06:36 +00:00
6fa2d41dc7 Add mmap option to torch.load (#102549)
Using [`nanoGPT/model.py`](https://github.com/karpathy/nanoGPT/blob/master/model.py) run

<details><summary><b>Click for script to save gpt2-xlarge (1.5B params)</b></summary>

```
# test_load_save_gpt.py
from model import GPT
import torch
import time

torch.manual_seed(5)
# gpt2-xlarge 1558M parameters
class GPTConfig:
    block_size: int = 1024
    vocab_size: int = 50304 # GPT-2 vocab_size of 50257, padded up to nearest multiple of 64 for efficiency
    n_layer: int = 48
    n_head: int = 25
    n_embd: int = 1600
    dropout: float = 0.0
    bias: bool = True # True: bias in Linears and LayerNorms, like GPT-2. False: a bit better and faster

def f():
    model = GPT(GPTConfig())
    state_dict = model.state_dict()

    start_saving = time.time()
    torch.save(state_dict, "gpt2-xlarge.pth")
    end_saving = time.time()

if __name__ == "__main__":
    f()
```
</details>

<details><summary><b>Click for script to load</b></summary>

```
# test_load_gpt.py

import torch
from model import GPT
from test_load_save_gpt import GPTConfig
import time
import argparse

def f(mmap, meta):
    device = 'meta' if meta else 'cpu'
    assign = True if meta else False
    with torch.device(device):
        model = GPT(GPTConfig())
    start_loading = time.time()
    loaded_state_dict = torch.load("gpt2-xlarge.pth", _mmap=mmap)
    end_loading = time.time()
    print(f"loading time using torch.load with mmap={mmap}: ", end_loading - start_loading)
    model.load_state_dict(loaded_state_dict, assign=assign)
    end_load_state_dict = time.time()
    print("load_state_dict time: ", end_load_state_dict - end_loading)
    model.cuda()
    end_cuda = time.time()
    print("cuda time using torch.load with mmap: ", end_cuda - end_load_state_dict)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(prog='load_gpt_xlarge')
    parser.add_argument('-m', '--mmap', action='store_true')
    parser.add_argument('-d', '--devicemeta', action='store_true')
    args = parser.parse_args()
    mmap = args.mmap
    meta = args.devicemeta
    f(mmap, meta)

```

</details>

`python test_load_gpt.py`

<img width="614" alt="Screenshot 2023-06-06 at 1 35 43 PM" src="https://github.com/pytorch/pytorch/assets/35276741/ee06e5b3-b610-463b-a867-df995d21af29">

`python test_load_gpt.py --mmap`
<img width="622" alt="Screenshot 2023-06-06 at 1 35 30 PM" src="https://github.com/pytorch/pytorch/assets/35276741/00d2fdd0-b1f5-4313-83dc-e540b654b2af">

If we further use the `with torch.device('meta')` context manager and pull the changes from https://github.com/pytorch/pytorch/pull/102212 that allow the model to reuse tensors from the state_dict, we have

`python test_load_gpt.py --mmap --devicemeta`
<img width="727" alt="Screenshot 2023-06-06 at 1 35 51 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b50257d9-092a-49c3-acae-876ee44d009f">

\
\
Running the above in a docker container containing a build of PyTorch with RAM limited to 512mb by

1) running `make -f docker.Makefile` from `pytorch/` directory
2) `docker run -m 512m -it <image> bash`
3) docker cp `gpt2-xlarge.pth` and `test_load_gpt.py` into the image

`python test_load_gpt.py`

Docker will kill the process due to OOM, whereas

`python test_load_gpt.py --mmap --devicemeta`
<img width="635" alt="Screenshot 2023-06-06 at 1 55 48 PM" src="https://github.com/pytorch/pytorch/assets/35276741/f3820d9e-f24c-43e7-885b-3bfdf24ef8ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102549
Approved by: https://github.com/albanD
2023-06-09 15:49:58 +00:00
74b7a6c75e Move tensor grouping to ATen (#100007)
rel: #94344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100007
Approved by: https://github.com/janeyx99
2023-06-09 15:44:46 +00:00
7108c035bc Make torch.compile(dynamic=True) not assume static by default (#99469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99469
Approved by: https://github.com/ezyang
2023-06-09 13:36:40 +00:00
96fd283640 Preserve CreationMeta when metafying views. (#103152)
This helps us avoid erroring / generate more accurate error messages
in Dynamo when doing mutations on views.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103152
Approved by: https://github.com/albanD
2023-06-09 12:34:54 +00:00
c24b61bc20 Enable torch._C._get_privateuse1_backend_name in Dynamo tracing (#103141)
Fixes https://github.com/pytorch/pytorch/issues/103125
torch._C._get_privateuse1_backend_name() caused a graph break, so I added it to the supported functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103141
Approved by: https://github.com/yanboliang
2023-06-09 09:19:33 +00:00
6095a22cff [inductor] add the ability to do heavier search for coordinate descent tuning (#99403)
When checking Meta's internal cmf10x model, I found this interesting kernel https://gist.github.com/shunting314/d4b1fc7352c840ef185c607392e21f31 . Doing coordinate descent tuning starting from the out of box tuner find sub-optimal config: a config worse than the best one max-autotuner can find.

This indicates that the coordinate descent tuner does not necessarily find the optimal config. Starting point matters.

I want to make the coordinate descent tuning less dependent on the starting point. Also, I think by improving that, the coordinate descent tuner may be more likely to find even better configs when starting from the max-autotune result.

There are 2 ideas.
1. Currently, coordinate descent tuning only considers changing one field/coordinate at a time. I added the ability to check all directions (i.e., tuning all tunable fields at the same time) when the normal coordinate descent search does not find better choices. I'll check how that works on cmf10x.
2. Currently, when we change a field, we only change it by 1 step (i.e., the radius is 1). I added the ability to use a larger radius. This only affects the search in all directions and does not affect the normal coordinate descent search workflow.

Both are disabled by default.

Here are the tests I've done:

- OOB (out of the box): 0.083ms    0.003GB    38.13GB/s
- MA (max autotune): 0.016ms    0.003GB    195.60GB/s
   - best config: XBLOCK: 4, RBLOCK: 128, num_warps: 4, num_stages: 1

Default coordinate descent:
- Coordesc (coordinate descent tuner) upon OOB:  0.024ms    0.003GB    131.52GB/s ( **WORSE than Max Autotune** )
   - best config: XBLOCK: 64, RBLOCK: 4, num_warps: 16, num_stages: 1
- Coordesc upon MA: 0.016ms    0.003GB    194.31GB/s (no further improvement upon MA)

Search in all directions: (radius = 1)
- Coordesc upon OOB: 0.017ms    0.003GB    184.55GB/s
   - best config: XBLOCK: 32, RBLOCK: 16, num_warps: 32, num_stages: 1
   - **IMPROVE FROM 0.024ms to 0.017ms. QUITE CLOSE TO THE ONE FIND BY MAX-AUTOTUNE**
- Coordesc upon MA: no further improvements upon MA

Search in all directions: (radius = 2)
- Coordesc upon OOB:  0.016ms    0.003GB    192.60GB/s
   - best config: XBLOCK: 8, RBLOCK: 16, num_warps: 8, num_stages: 1
   - **SLIGHTLY BETTER THAN RADIUS=1 for this kernel and on par with max-autotune**
- Coordesc upon MA: no further improvements upon MA

**Overall max-autotuner does a really good job for this kernel**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99403
Approved by: https://github.com/jansel
2023-06-09 09:04:55 +00:00
2961ea80f5 Deprecate "Type" and support more devices for save_on_cpu (#103245)
Fixes #ISSUE_NUMBER
1. The class named "Type" is not used anywhere anymore, so I added a warning message saying it will be removed in the future.
2. Added a device argument (default "cuda") to save_on_cpu so that it can support more device types (like privateuse1); a minimal usage sketch follows below.
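
A minimal usage sketch of `torch.autograd.graph.save_on_cpu` as it exists today (illustrative only; the new device argument mentioned above is not shown here):

```python
import torch

x = torch.randn(64, 64, requires_grad=True)
with torch.autograd.graph.save_on_cpu(pin_memory=False):
    y = (x * 2).sum()   # tensors saved for backward are offloaded to CPU
y.backward()
print(x.grad.shape)
```
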
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103245
Approved by: https://github.com/soulitzer
2023-06-09 05:05:01 +00:00
c037088ac4 Debug Windows locked files (#103237)
This PR temporarily installs the SysInternals handle tool (https://learn.microsoft.com/en-us/sysinternals/downloads/handle) and prints which processes are locking the `C:\action-runner\work` folder.  This is needed to debug the elusive Windows locked-file issues https://github.com/pytorch/pytorch/actions/runs/5216626202/jobs/9415560483.  This will be reverted once the investigation is done.  If the tool proves to be useful, we can add it to the AMI later on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103237
Approved by: https://github.com/clee2000
2023-06-09 04:53:07 +00:00
4cc474dec4 [dtensor] support torch.save/load with DTensor (#103106)
This PR actually enables DTensor to be pickable and add tests to test
torch.save/load works correctly for DTensor
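
A hedged sketch of the round trip this enables, assuming an already-initialized default process group and the experimental `torch.distributed._tensor` API:

```python
import torch
import torch.distributed as dist
from torch.distributed._tensor import DeviceMesh, Shard, distribute_tensor

# Assumes dist.init_process_group(...) has already been called.
mesh = DeviceMesh("cuda", list(range(dist.get_world_size())))
dt = distribute_tensor(torch.randn(8, 8), mesh, placements=[Shard(0)])

torch.save(dt, "dtensor.pt")       # DTensor is now picklable
loaded = torch.load("dtensor.pt")
print(type(loaded), loaded.placements)
```
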
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103106
Approved by: https://github.com/kumpera
2023-06-09 04:11:15 +00:00
d31707a257 Get rid of dim_groups attribute from DeviceMesh (#103105)
This PR get rids of the dim_groups attribute from DeviceMesh, the main
motivation behind this is that we should let c10d store the process
groups during its creation instead of DeviceMesh, DeviceMesh should just
handle ranks correctly.

This could enable DTensor becomes picklable! (torch.save/load could be
possible), which I will give it a try in the next PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103105
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
2023-06-09 04:11:15 +00:00
81b704eab3 numpy1.25 deprecation: np.product -> np.prod (#103263)
Deprecated according to https://github.com/numpy/numpy/releases/tag/v1.25.0rc1
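
The one-line replacement, for reference:

```python
import numpy as np

shape = (2, 3, 4)
numel = np.prod(shape)   # previously np.product(shape), deprecated in NumPy 1.25
print(numel)             # 24
```
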
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103263
Approved by: https://github.com/mikaylagawarecki
2023-06-09 02:18:53 +00:00
f3553c508c ImportLib py3.10 bug in AOTInductor (#103277)
Other projects have seen a similar issue https://github.com/quantumlib/Cirq/issues/4637

## Before

```
(nightly) ubuntu@ip-172-31-2-131:~$ python /tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py
Traceback (most recent call last):
  File "/tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py", line 47, in <module>
    module = CppWrapperCodeCache.load(cpp_wrapper_src, 'inductor_entry_cpp', 'czenwgemzbe2etzbh7hzhnwjhyamvwirgodyjlly75fayy4tp3rx', False)
  File "/opt/conda/envs/nightly/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 846, in load
    assert isinstance(spec.loader, importlib.abc.Loader)
AttributeError: module 'importlib' has no attribute 'abc'. Did you mean: '_abc'?
```

## After

```sh
(nightly) ubuntu@ip-172-31-2-131:~/test$ python /tmp/torchinductor_ubuntu/eq/ceqs7t4pesfhqllk6qf4k5spu2cm23l7quqdt2mkrp4rlcjl6kw5.py
0.000272
```
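
A hedged sketch of the underlying Python 3.10 gotcha (not the exact PR diff): `import importlib` no longer implicitly exposes the `importlib.abc` submodule, so it has to be imported explicitly before the `isinstance` check.

```python
import importlib.abc   # explicit import avoids AttributeError on Python 3.10
import importlib.util

spec = importlib.util.find_spec("json")
assert isinstance(spec.loader, importlib.abc.Loader)
print("loader ok:", spec.loader)
```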

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103277
Approved by: https://github.com/desertfire
2023-06-09 02:12:34 +00:00
4c03adc1f4 [dashboard] Allocate 4 shards for torchbench (#103280)
Summary: Enabling the nightly inference run disproportionately slows down
torchbench, so allocate more shards for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103280
Approved by: https://github.com/huydhn
2023-06-09 01:44:39 +00:00
8c584028a7 add github action to upload alerts to rockset / aws (#102995)
Successful test run found at https://github.com/pytorch/pytorch/actions/runs/5213244046/jobs/9410138550

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8d7d860</samp>

This pull request adds a new feature to create and upload alerts for failing jobs in the pytorch/pytorch repo. It introduces a new script `tools/alerts/create_alerts.py` to generate alert entries and a new workflow `.github/workflows/upload-alerts.yml` to run the script and upload the alerts periodically.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 8d7d860</samp>

> _To upload alerts to Rockset_
> _We added a workflow, you bet_
> _It runs every ten_
> _With concurrency then_
> _And `create_alerts.py` we edit_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102995
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-06-09 01:33:40 +00:00
bb8278731e [FSDP][Easy] Remove redundant var def in test (#103270)
This is an easy follow-up from the previous PR to avoid re-running CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103270
Approved by: https://github.com/rohan-varma
2023-06-09 00:56:13 +00:00
8e5b7ce5db inductor: fix bf16 legalization issue for fp32 load with to bf16 case (#103080)
Given the following IR:

```
    def body(self, ops):
        get_index = self.get_index('index0')
        index_expr = ops.index_expr(get_index, torch.int32)
        constant = ops.constant(4, torch.int32)
        lt = ops.lt(index_expr, constant)
        masked_subblock1 = self.masked_subblock1(lt, 0.0)
        get_index_1 = self.get_index('index3')
        load = ops.load('arg2_1', get_index_1)
        to_dtype = ops.to_dtype(load, torch.bfloat16)
        where = ops.where(lt, masked_subblock1, to_dtype)
        get_index_2 = self.get_index('index3')
        store = ops.store('buf0', get_index_2, where, None)
        return store
    def masked_subblock2(self, ops):
        get_index = self.get_index('index2')
        load = ops.load('arg1_1', get_index)
        return load
    def masked_subblock1(self, ops):
        get_index = self.get_index('index1')
        index_expr = ops.index_expr(get_index, torch.int32)
        constant = ops.constant(1, torch.int32)
        ge = ops.ge(index_expr, constant)
        get_index_1 = self.get_index('index1')
        index_expr_1 = ops.index_expr(get_index_1, torch.int32)
        constant_1 = ops.constant(3, torch.int32)
        lt = ops.lt(index_expr_1, constant_1)
        and_ = ops.and_(ge, lt)
        masked_subblock2 = self.masked_subblock2(and_, 0.0)
        get_index_2 = self.get_index('index3')
        load = ops.load('arg2_1', get_index_2)
        to_dtype = ops.to_dtype(load, torch.bfloat16)
        where = ops.where(and_, masked_subblock2, to_dtype)
        return where
```

Before this PR, the ```masked_subblock2``` would be legalized as ```load_bf16+to_fp32```, so ```masked_subblock2```'s output type was ```fp32```. However, for ```load = ops.load('arg2_1', get_index_2); to_dtype = ops.to_dtype(load, torch.bfloat16)``` we did not convert the ```to_bf16``` into ```to_fp32```, so ```ops.where``` ended up with a mixed-type computation and produced a compiler error: ```error: operands to ?: have different types ‘float’ and ‘c10::BFloat16’```.

This PR always converts ```to_bf16``` into ```to_fp32``` to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103080
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-06-09 00:33:10 +00:00
40dbbcab6c Update error message with torch logging instructions (#102892)
https://github.com/pytorch/pytorch/issues/100109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102892
Approved by: https://github.com/yanboliang
2023-06-09 00:07:08 +00:00
d0c0e13b69 [Specialized Kernel] Translate Kernel Assignment Logic from function.yaml to native_functions.yaml (#102576)
Updating `gen_executorch.translate_native_yaml()` to translate kernel assignments when converting `functions.yaml` to `native_functions.yaml`
---
Functions.yaml format:
```
- func: add.out
	type_alias:
		T0: [<Type>, <Type>]
		T1: [<Type>]
	dim_order_alias:
		D0: [0, 1, 2, 3]
		D1: [0, 3, 2, 1]
	kernels:
		- arg_meta: null
		  kernel_name: default_impl
		- arg_meta:
			self: [T0, D0]
			other:[T0, D0]
			out: [T0, D0]
		  kernel_name: test_impl
```

native_functions.yaml format
```
func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  kernel:
    default: default_impl
    v<Version>/<TYPE Enum>;<DIM Order>|<TYPE Enum>;<DIM Order>|<TYPE Enum>;<DIM Order>: test_impl
```
Example: **'v1/6;0,1,2,3|3;0,1,2,3|6;0,1,2,3' : 'test_impl'**

## Note:
- If a "kernels" field is not present in functions.yaml (as it currently is), the output is unaffected
---
Design Doc: https://docs.google.com/document/d/1gq4Wz2R6verKJ2EFseLyPdAF0wqomnCrVDDJpRkYsRw/edit?kh_source=GDOCS#

Differential Revision: [D45971107](https://our.internmc.facebook.com/intern/diff/D45971107/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102576
Approved by: https://github.com/larryliu0820
2023-06-08 23:42:24 +00:00
98a1e3a3e9 Put back cuda 11.8 distributed tests (#103265)
Put back cuda 11.8 distributed tests
Follow up after: https://github.com/pytorch/pytorch/pull/102178 which accidentally disabled cuda distributed tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103265
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-06-08 23:12:52 +00:00
481023fb6c add huggingface to inductor docker images (#102881)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 971a80c</samp>

This pull request adds support for building docker images that can run performance benchmarks using the inductor framework. It introduces new files and scripts to install the benchmark dependencies, and updates the docker build and test workflows to use the new images. It also fixes some minor issues with the existing inductor tests and workflows.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 971a80c</samp>

> _Oh we're the docker builders and we work all day and night_
> _We install the dependencies for the inductor benchmarks right_
> _We pin the versions and the commits and run the scripts with ease_
> _And we heave away and pull away and build the images_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102881
Approved by: https://github.com/huydhn
2023-06-08 22:15:14 +00:00
89d57f269f [quant][pt2] Fix convert in Conv + BN + ReLU QAT fusion (#102993)
Summary:
Previously, the QAT pattern for conv + bn + relu was
not actually replaced in convert. This is because the quantized
QAT pattern used in convert doesn't actually have a relu node.
This commit adds this extra pattern in the convert path and
the numerics now match FX's.

Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_relu_numerics

Reviewed By: jerryzh168

Differential Revision: D46372411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102993
Approved by: https://github.com/jerryzh168
2023-06-08 22:10:29 +00:00
606fb882c4 Dropout support for memory efficient attention (#102038)
# Summary
This PR builds off of:
- https://github.com/pytorch/pytorch/pull/101847
- https://github.com/pytorch/pytorch/pull/100583

It specifically adds dropout support to the memory efficient attention kernel. In the process of doing so, roughly 3 changes were made (a usage sketch follows the list):
- Update sdpa dispatching to allow for inputs requiring grad to be sent to efficient attention
- Update how memory efficient attention handles passing the rng state from forward to backward in order to enable cuda_graph support
- Fix a bug in the kernel that was causing incorrect gradients to be produced for num_keys > 64 with dropout and causal masking set. https://github.com/facebookresearch/xformers/pull/755
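
A hedged usage sketch (assumes a CUDA build): forcing the memory-efficient SDPA backend with dropout on inputs that require grad, which is the combination this change enables.

```python
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True) for _ in range(3))
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=False, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1)
out.sum().backward()
print(q.grad.shape)
```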

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102038
Approved by: https://github.com/cpuhrsch
2023-06-08 21:50:12 +00:00
05e91a50d9 Manually generate guards for optimizer (#103121)
Manually generate guards for the optimizer rather than using the variable builder, which can be slow with lots of params.

This was the reason for the ~10s compile slowdown.

Re-disable `_init_group`. This is important because, if for any reason a frame which calls `_init_group` is run in the Python interpreter, we will trace it, which we don't want to do. We only want to call it when it is accessed via the fast path implemented with the optimizer variable during symbolic interpretation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103121
Approved by: https://github.com/jansel
2023-06-08 21:45:19 +00:00
48056b168f [FSDP] Reshard frozen params in backward (#101982)
This PR makes a first attempt at improving FSDP's fine-tuning support by adding hooks to reshard frozen parameters in the backward pass.
- Without this, frozen parameters involved in gradient computation are kept as unsharded through the entire backward pass.
- The approach is to register a multi-grad ~~post~~-hook on the _input_ activations to the FSDP module, where the hook performs the resharding once all gradients for the FSDP module have been computed (meaning that we are safe to reshard).

~~This PR relies on adding a "multi-grad post-hook" that differs from the existing "multi-grad hook" from `register_multi_grad_hook()`. I find that with `register_multi_grad_hook()`, sometimes the unit test counting the number of times `_post_backward_reshard()` is called fails (due to it not being called).~~ This was resolved in https://github.com/pytorch/pytorch/pull/102859.
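
A generic, hedged sketch of the multi-grad-hook idea (not FSDP's internal code): fire a callback once gradients for all of a set of input tensors have been computed, which is the point where resharding becomes safe.

```python
import torch
from torch.autograd.graph import register_multi_grad_hook

a = torch.randn(4, requires_grad=True)
b = torch.randn(4, requires_grad=True)

def on_all_grads_ready(grads):
    # All gradients flowing into (a, b) are now computed; resharding would be safe here.
    print("all input grads ready")

handle = register_multi_grad_hook((a, b), on_all_grads_ready)
(a * b).sum().backward()
handle.remove()
```
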
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101982
Approved by: https://github.com/rohan-varma
2023-06-08 21:12:45 +00:00
b52ee80cdc Revert "Add print statements to debug sharding error (#102713)"
This reverts commit c7873522c2ceefbc3b747224da1d26d566115c9a.

Reverted https://github.com/pytorch/pytorch/pull/102713 on behalf of https://github.com/clee2000 due to issue should be resolved now ([comment](https://github.com/pytorch/pytorch/pull/102713#issuecomment-1583334560))
2023-06-08 21:02:17 +00:00
cea899cd57 Add early validation logic to dynamic_dim (#102982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102982
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-06-08 20:23:49 +00:00
f1f13a35b0 Fix GELU-related docstring formatting (#102845)
The docstring about GELU seems formatted incorrectly. The original docstring about GELU is rendered as below:

$$ \text{GELU}(x) = 0.5 * x * (1 + \text{Tanh}(\sqrt(2 / \pi) * (x + 0.044715 * x^3))) $$

where the square root of which part is confusing.

I double-checked the formula, which should be:

$$ \text{GELU}(x) = 0.5 * x * (1 + \text{Tanh}(\sqrt{2 / \pi} * (x + 0.044715 * x^3))) $$

where round brackets in resource code should be brace brackets.

> _formula in [original paper](https://arxiv.org/abs/1606.08415)_
> ![Snipaste_2023-06-03_00-43-49](https://github.com/pytorch/pytorch/assets/39690782/22511c4e-2f20-4a16-9bda-4c182a360160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102845
Approved by: https://github.com/mikaylagawarecki
2023-06-08 20:19:03 +00:00
1d857586f1 [ROCM] enable hipSOLVER backend for linalg.ldl_factor (#102665)
* Add complex dtype support for linalg.ldl_factor
* Fixes SWDEV-360139
* Enable the following 19 tests for ROCM
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_cuda_complex128
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_cuda_complex64
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_ex_cuda_complex128
    + test_decomp.py TestDecompCUDA.test_comprehensive_linalg_ldl_factor_ex_cuda_complex64
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_cuda_complex128
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_cuda_complex64
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_ex_cuda_complex128
    + test_meta.py TestMetaCUDA.test_dispatch_meta_linalg_ldl_factor_ex_cuda_complex64
    + test_meta.py TestMetaCUDA.test_meta_linalg_ldl_factor_cuda_complex128
    + test_ops.py TestCommonCUDA.test_noncontiguous_samples_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestCommonCUDA.test_noncontiguous_samples_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestCommonCUDA.test_variant_consistency_eager_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestCommonCUDA.test_variant_consistency_eager_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_conj_view_linalg_ldl_factor_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_conj_view_linalg_ldl_factor_ex_cuda_complex64
    + test_ops.py TestMathBitsCUDA.test_neg_conj_view_linalg_ldl_factor_cuda_complex128
    + test_ops.py TestMathBitsCUDA.test_neg_conj_view_linalg_ldl_factor_ex_cuda_complex128
    + test_ops_jit.py TestJitCUDA.test_variant_consistency_jit_linalg_ldl_factor_cuda_complex64
    + test_ops_jit.py TestJitCUDA.test_variant_consistency_jit_linalg_ldl_factor_ex_cuda_complex64

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102665
Approved by: https://github.com/lezcano
2023-06-08 20:05:01 +00:00
b4f3a6f58f [Dynamo Hackathon] Add support for hasattr on TorchVariable (#103177)
Fixes #101154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103177
Approved by: https://github.com/yanboliang
2023-06-08 19:34:44 +00:00
c62fcedc44 [cuda] Limit grid size for torch.cat kernel on aligned16 contig tensors (#103233)
When torch.cat gets called on a list of contiguous tensors that are aligned on a 16B boundary in memory, the number of thread blocks used is directly proportional to the maximum size of the tensors in the list. If one or more tensors are very large while the others are small, the high number of thread blocks results in useless redundant loads of the input metadata. This PR limits the grid size and improves the performance of cat when used on a list of tensors with large variations in size.

Used the same test program from https://github.com/pytorch/pytorch/pull/102815 but added new cases with list of tensors with varying sizes.

<img width="735" alt="Screenshot 2023-06-07 at 10 14 18 PM" src="https://github.com/pytorch/pytorch/assets/23515689/72d0e5cb-5840-400e-b53b-d1418e664f19">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103233
Approved by: https://github.com/malfet
2023-06-08 19:14:14 +00:00
39201ce025 Make dynamo bench conditionally import DDP/FSDP (#103163)
Avoids hitting an ImportError for single-node benchmarks when running on
a non-distributed build of pytorch.

Fixes #102086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103163
Approved by: https://github.com/lezcano, https://github.com/wanchaol
2023-06-08 19:10:49 +00:00
591134f2a5 [CI] Enable UCC in CI (#100395)
UCC was temporarily disabled in #98832. This PR re-enables it with the necessary fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100395
Approved by: https://github.com/atalman
2023-06-08 19:01:22 +00:00
a1c26ba77c Rename READEME.md to README.md (#103230)
Fix the typo so the file is shown for the dir.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103230
Approved by: https://github.com/ZainRizvi
2023-06-08 18:42:53 +00:00
4a72708d2b [dynamo] Fix Autograd Function Classmethod bug (#103175)
Fixes https://github.com/pytorch/pytorch/issues/103139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103175
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2023-06-08 18:15:27 +00:00
a667b2ad1d [codemod] Use C++17 [[fallthrough]] in caffe2/torch/csrc/utils/python_arg_parser.cpp (#103039)
Test Plan: Sandcastle

Differential Revision: D46402909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103039
Approved by: https://github.com/Skylion007
2023-06-08 17:41:48 +00:00
40d70ba7ed Remove a number of fixed skips (#103162)
Also adds `PYTORCH_TEST_WITH_AOT_EAGER` to distinguish errors coming from aot_autograd and not inductor (not tested in ci, but useful for local debugging)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103162
Approved by: https://github.com/desertfire
2023-06-08 17:37:59 +00:00
3c896a5adb [dynamo] fix torch.distributions lazy_attribute failure (#103208)
Fixes #93340.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103208
Approved by: https://github.com/yanboliang
2023-06-08 17:26:54 +00:00
57c63aad10 [c10d] Remove test for init barrier (#103223)
Forward fix for intermittent failures after landing of #103033 (resolves issue #103195)

After #103033 , some tests are no longer applicable.

Cc @huydhn

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103223
Approved by: https://github.com/huydhn, https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/ZainRizvi
2023-06-08 16:56:40 +00:00
2a4fa25109 [Profiler] Include more uncategorized events in memory profile (#101200)
Summary: This PR adds handling for allocations / frees which we cannot prove are for Tensors. (And thus aren't assigned an ID.) These events are still important for judging overall utilization.

Test Plan: CI and Unit tests.

Differential Revision: D45458885

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101200
Approved by: https://github.com/anupambhatnagar, https://github.com/davidberard98
2023-06-08 16:22:49 +00:00
675f2597fa [reland][DTensor][3/N] add DTensor constructor function: full (#101436) (#103165)
This is a reland attempt of reverted PR #101436 .

Differential Revision: [D46537531](https://our.internmc.facebook.com/intern/diff/D46537531)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103165
Approved by: https://github.com/wanchaol
2023-06-08 16:18:33 +00:00
fdca7f7c2f Revert "export helper funcs in foreach ops (#102928)"
This reverts commit 978a2f2b276b51f615aa860d47fadd16a284b2f6.

Reverted https://github.com/pytorch/pytorch/pull/102928 on behalf of https://github.com/janeyx99 due to Broke build on windows. ([comment](https://github.com/pytorch/pytorch/pull/102928#issuecomment-1582720414))
2023-06-08 14:44:09 +00:00
4833dc10b8 [DCP] Rewrite read slicing to use a wrapper. (#99167)
Moved SlicedBufferedReader to utils and renamed to _ReaderView.

It no longer depends on file handles and is a pure wrapper. This makes it general enough to handle non-IO stream objects like fsspec's.

Should help with #98386
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99167
Approved by: https://github.com/wz337
2023-06-08 13:52:13 +00:00
39bf86ae90 [dynamo] Support OrderedDict constructor with kwargs (#103192)
Summary: To solve an issue in https://github.com/pytorch/pytorch/issues/102878.
The solution follows the example in https://github.com/pytorch/pytorch/pull/98660.
It only solves a problem for standard OrderedDict. There is another
problem if we use a user-defined CustomDict.
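
A hedged repro sketch of the now-supported pattern:

```python
import collections
import torch

@torch.compile(backend="eager", fullgraph=True)
def fn(x):
    d = collections.OrderedDict(scaled=x * 2, shifted=x + 1)  # kwargs constructor
    return d["scaled"] + d["shifted"]

print(fn(torch.ones(3)))
```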

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103192
Approved by: https://github.com/yanboliang
2023-06-08 12:14:21 +00:00
580958a338 Revert "add github action to upload alerts to rockset / aws (#102995)"
This reverts commit 49450fe021cc5d439f56580463461ff438f9ac96.

Reverted https://github.com/pytorch/pytorch/pull/102995 on behalf of https://github.com/PaliC due to failing with no credentials error ([comment](https://github.com/pytorch/pytorch/pull/102995#issuecomment-1582466491))
2023-06-08 12:09:52 +00:00
a49aefdce2 [PT2][Quant] In linear partition include functional.linear (#103186)
Summary: as title

Test Plan: tested in subsequent diff

Reviewed By: jerryzh168

Differential Revision: D46342824

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103186
Approved by: https://github.com/jerryzh168
2023-06-08 09:48:09 +00:00
c9681613b2 [export] Unskip non supported tests. (#103168)
Test Plan: CI

Differential Revision: D46526971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103168
Approved by: https://github.com/tugsbayasgalan
2023-06-08 09:05:10 +00:00
978a2f2b27 export helper funcs in foreach ops (#102928)
Fixes #ISSUE_NUMBER
export some helper funcs in foreach ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102928
Approved by: https://github.com/janeyx99
2023-06-08 07:59:51 +00:00
91e82ba0a6 [PT2 Dynamo Hackathon] Fix simple bug in inline dict (#103187)
Fixes: https://github.com/pytorch/pytorch/issues/101980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103187
Approved by: https://github.com/yanboliang
2023-06-08 07:16:13 +00:00
49450fe021 add github action to upload alerts to rockset / aws (#102995)
Successful test run found at https://github.com/pytorch/pytorch/actions/runs/5179855118/jobs/9333292038 (uses equivalent PRs)

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8d7d860</samp>

This pull request adds a new feature to create and upload alerts for failing jobs in the pytorch/pytorch repo. It introduces a new script `tools/alerts/create_alerts.py` to generate alert entries and a new workflow `.github/workflows/upload-alerts.yml` to run the script and upload the alerts periodically.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 8d7d860</samp>

> _To upload alerts to Rockset_
> _We added a workflow, you bet_
> _It runs every ten_
> _With concurrency then_
> _And `create_alerts.py` we edit_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102995
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-06-08 06:57:05 +00:00
ts
d2d03f0f44 Make index_add_ error if input source shape is wrong (#100321)
Fixes #92576, checking the following as described in the documentation:

"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"

Would be happy to iterate on this if there are any issues, and would be happy to implement the checking for the CUDA and MPS implementations of index_add_.
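
A small sketch of the shape contract being enforced:

```python
import torch

self_t = torch.zeros(5, 3)
index = torch.tensor([0, 4, 2])
source = torch.randn(3, 3)            # shape[0] == len(index), shape[1] == self_t.shape[1]
self_t.index_add_(0, index, source)   # OK

bad_source = torch.randn(2, 3)        # shape[0] != len(index) -> rejected with a clear error
try:
    self_t.index_add_(0, index, bad_source)
except RuntimeError as e:
    print("rejected:", e)
```
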
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100321
Approved by: https://github.com/lezcano
2023-06-08 06:51:10 +00:00
52e310f7a8 Enable torch.nn.init._calculate_correct_fan in dynamo tracing (#103182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103182
Approved by: https://github.com/yanboliang
2023-06-08 06:13:49 +00:00
676210a139 GHA setup-linux should always be pair with teardown-linux (#103216)
This explains the existence of a running container `pytorch-linux-focal-py3.8-gcc7`.  For example, https://github.com/pytorch/pytorch/actions/runs/5138349666/jobs/9344366066 on `i-097fac5d9b9bef249`.  The job that ran immediately before on this runner was a doc build job:

```
{
    "name": "linux-docs / build-docs-functorch-false",
    "html_url": "https://github.com/pytorch/pytorch/actions/runs/5184821352/jobs/9344262329",
    "_event_time": "2023-06-06T05:21:23.569553Z",
    "runner_name": "i-097fac5d9b9bef249",
    "head_branch": "gh/titaiwangms/25/head",
    "head_sha": "5d48fd183875e4eea44a94df239eae356e852939",
    "workflow_name": "pull",
    "conclusion": "success",
    "line": null
}
```

This might be related to OOM issue happening on these runners.

Thoughts:

* Maybe we should have a linter for this to make sure that GHA setup-OS always pairs with teardown-OS
* Nova GHA `setup-linux` is updated by https://github.com/pytorch/test-infra/pull/4275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103216
Approved by: https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/atalman
2023-06-08 05:59:12 +00:00
2868a5d0d1 Two small mem_eff bug fixes (#103201)
# Summary
Upstream two small bug fixes:
* https://github.com/fairinternal/xformers/pull/679
* https://github.com/fairinternal/xformers/pull/681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103201
Approved by: https://github.com/cpuhrsch
2023-06-08 05:34:56 +00:00
9508e60c1e [quant][pt2] Add prepare QAT test for resnet18 (#103020)
Summary:
Prepare QAT for resnet18 has matching numerics with FX.
Adding this test requires us to refactor the way the test code
is structured, however.

Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_resnet18

Differential Revision: D46456243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103020
Approved by: https://github.com/kimishpatel
2023-06-08 05:17:20 +00:00
18e4a466db fix amp in inference in benchmarking suite (#103220)
Even if you passed in --amp we would run inference in float32.

`AlbertForMaskedLM` goes from 1.305 float32 to 1.724x amp, and then again to 1.910x with freezing. Benchmark numbers for amp are about to go way up lol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103220
Approved by: https://github.com/desertfire
2023-06-08 05:16:22 +00:00
8585784a34 [dtensor] fix allgather unpadding logic (#103219)
This PR fixes the allgather unpadding logic so that we only need to unpad
the full tensor instead of first chunking it into small tensors and unpadding them
individually, since we know how our padding algorithm works.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103219
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-06-08 03:31:24 +00:00
d5142c52d3 [FSDP]Remove dim_group from device_mesh init (#103218)
1) remove dim_group
2) don't init device_mesh if not using default_pg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103218
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-06-08 03:29:19 +00:00
6acb8d3d1c [data_loader] Extra signal handlers in DataLoader.cpp should be added on top rather than replacing defaults (#103164)
Summary: DataLoader.cpp signal handlers add some special behavior (e.g. exit(0) on SIGTERM under certain conditions). To preserve this behavior we should install additional signal handlers on top of the default ones, rather than completely replacing them.
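
A generic Python illustration of the "install on top of, not instead of" idea (the actual change is in the C++ DataLoader.cpp; this sketch is only an analogy):

```python
import signal

def install_chained_handler(signum, extra_handler):
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        extra_handler(sig, frame)      # the additional behavior
        if callable(previous):
            previous(sig, frame)       # then defer to the pre-existing handler

    signal.signal(signum, chained)
```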

Test Plan: unit tests

Reviewed By: drej82

Differential Revision: D46525348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103164
Approved by: https://github.com/drej82
2023-06-08 02:55:58 +00:00
194262ee49 Make HigherOrderOperator stop appearing like torch.ops.* in FX (#103108)
Previously, defining a HigherOrderOperator (like cond) automatically generated
a torch.ops.cond and caused it to be traced into the FX graph as e.g.
torch.ops.cond.

This is not good, because:
- Duplication. Since HigherOrderOperators are written in Python, they have an
associated Python function that users should access them from. E.g.
torch.cond (when we make it public). That is what should actually appear in the
graph.
- torch.ops.cond is a valid namespace for operator registration; having
it be a function too confuses things.

This PR:
- Moves cond/map HigherOrderOperators to be under torch (necessary for
the FX logic to not do weird things)
- Sets the `__module__` of a HigherOrderOperator correct. This is what
FX uses when tracing the operator.

Test Plan:
- updated tests

Future:
- I'll delete the ability to call cond as torch.ops.cond in a couple of
days, after this change circulates internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103108
Approved by: https://github.com/ydwu4
2023-06-08 01:55:27 +00:00
47cfcf566a Add selector.is_et_kernel_key_selected (#103184)
Summary:

This API is used by gen_executorch.py to check whether a kernel with the specified kernel key is selected.

Test Plan:
```
buck test xplat/caffe2/tools:test_torchgen_executorch
buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103184
Approved by: https://github.com/larryliu0820
2023-06-08 01:10:20 +00:00
c37f02f61c [PyTorch][HAM]: Deprecate functionalize (#103053)
TL;DR: remove this flag and use the functionalization functionality from the lower-level library directly.

Differential Revision: [D46469134](https://our.internmc.facebook.com/intern/diff/D46469134/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103053
Approved by: https://github.com/tugsbayasgalan
2023-06-08 00:46:41 +00:00
0900782f0c [inductor][easy] raise register spill threshold (#103190)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103190
Approved by: https://github.com/spectrometerHBH, https://github.com/eellison
2023-06-08 00:35:46 +00:00
8c5d97d353 [inductor] Fix correctness issues with pre_grad and context managers (#103051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103051
Approved by: https://github.com/williamwen42
2023-06-08 00:20:34 +00:00
17737f9d0e [DTensor] Allow DTensor support cuda-like device (#102468)
Allow DTensor support cuda-like device, fix https://github.com/pytorch/pytorch/issues/102442

Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example https://github.com/pytorch/pytorch/pull/101914 and https://github.com/pytorch/pytorch/issues/101911. However, that support only covers a portion of third-party devices and does not properly support third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices; after all, cuda is so popular!

1. Similar to what is done here, we need to initialize the communication backend for the device set by DeviceMesh. So `_default_backend_for_device` is added to `Backend`. It is worth noting that when we register a new backend for a device other than cpu and cuda, we also need to add a new default backend for this device.
2. Add `_device_handle` to `DeviceMesh` for cuda-like devices, similar to what is done in FSDP. When `_device_handle` is not None, the device behaves like `cuda`. In this way, calls like `torch.cuda.device_count()` need to be changed to `device_mesh._device_handle.device_count()` (a generic sketch of this pattern follows below).
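
A generic, hedged sketch of the device-handle pattern (the function and variable names here are illustrative, not DeviceMesh's actual internals):

```python
import torch

def device_count(device_type: str) -> int:
    # Resolve a torch.<device> module (e.g. torch.cuda, or a registered custom backend module).
    device_handle = getattr(torch, device_type, None)
    if device_handle is not None and hasattr(device_handle, "device_count"):
        return device_handle.device_count()
    return 0

print(device_count("cuda"))
```
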
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102468
Approved by: https://github.com/wanchaol
2023-06-07 23:13:53 +00:00
790f5732f6 Fix Graph Break on builtin comparison on NNModule (#103176)
Fixes https://github.com/pytorch/pytorch/issues/102338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103176
Approved by: https://github.com/anijain2305
2023-06-07 22:51:43 +00:00
95fced4483 Pretty dataclass dynamo explain (#102869)
Also thinking out loud: maybe we only print graph break reasons? And for the rest we have a verbose print which prints everything?

TODO: some tests are failing based on what they expect a guard string to look like, easy to fix i'll do it early next week

# After

```
(sourcetorch) ubuntu@ip-172-31-1-136:~/test$ python pretty.py
BREAK
Graph Count: 2
Graph Break Count: 1
Op Count: 2
Break Reasons:
  Break Reason 1:
    Reason: call_function BuiltinVariable(print) [ConstantVariable(str)] {}
    User Stack:
      <FrameSummary file /home/ubuntu/test/pretty.py, line 6 in fn>
Ops per Graph:
  Ops 1:
    <built-in function add>
  Ops 2:
    <built-in function add>
Out Guards:
  Guard 1:
    Name: ''
    Source: global
    Create Function: GRAD_MODE
    Guard Types: ['GRAD_MODE']
    Code List: ['___is_grad_enabled()']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 2:
    Name: ''
    Source: global
    Create Function: DEFAULT_DEVICE
    Guard Types: ['DEFAULT_DEVICE']
    Code List: ['utils_device.CURRENT_DEVICE == None']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 3:
    Name: "G['print']"
    Source: global
    Create Function: BUILTIN_MATCH
    Guard Types: None
    Code List: None
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 4:
    Name: ''
    Source: global
    Create Function: DETERMINISTIC_ALGORITHMS
    Guard Types: ['DETERMINISTIC_ALGORITHMS']
    Code List: ['not ___are_deterministic_algorithms_enabled()']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 5:
    Name: "L['x']"
    Source: local
    Create Function: TENSOR_MATCH
    Guard Types: None
    Code List: None
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 6:
    Name: ''
    Source: global
    Create Function: GRAD_MODE
    Guard Types: ['GRAD_MODE']
    Code List: ['___is_grad_enabled()']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 7:
    Name: ''
    Source: global
    Create Function: DEFAULT_DEVICE
    Guard Types: ['DEFAULT_DEVICE']
    Code List: ['utils_device.CURRENT_DEVICE == None']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 8:
    Name: ''
    Source: global
    Create Function: DETERMINISTIC_ALGORITHMS
    Guard Types: ['DETERMINISTIC_ALGORITHMS']
    Code List: ['not ___are_deterministic_algorithms_enabled()']
    Object Weakref: None
    Guarded Class Weakref: None
  Guard 9:
    Name: "L['x']"
    Source: local
    Create Function: TENSOR_MATCH
    Guard Types: None
    Code List: None
    Object Weakref: None
    Guarded Class Weakref: None
Compile Times: TorchDynamo compilation metrics:
Function                        Runtimes (s)
------------------------------  --------------
_compile                        0.0164, 0.0035
OutputGraph.call_user_compiler  0.0000, 0.0000
```

## Before

```
('Dynamo produced 2 graphs with 1 graph break and 2 ops', [{Guard(name='print', source=<GuardSource.GLOBAL: 1>, create_fn=<function GuardBuilder.BUILTIN_MATCH at 0x7f92ea5009d0>, is_volatile=False, guard_types=None, code_list=None, obj_weakref=None, guarded_class_weakref=None), Guard(name='x', source=<GuardSource.LOCAL: 0>, create_fn=<function GuardBuilder.TENSOR_MATCH at 0x7f92ea501000>, is_volatile=False, guard_types=['TENSOR_MATCH'], code_list=None, obj_weakref=<weakref at 0x7f9224d28f40; dead>, guarded_class_weakref=<weakref at 0x7f92d81734c0; to 'torch._C._TensorMeta' at 0x540b610 (Tensor)>)}, {Guard(name='x', source=<GuardSource.LOCAL: 0>, create_fn=<function GuardBuilder.TENSOR_MATCH at 0x7f92ea501000>, is_volatile=False, guard_types=['TENSOR_MATCH'], code_list=None, obj_weakref=<weakref at 0x7f9224d5e700; dead>, guarded_class_weakref=<weakref at 0x7f92d81734c0; to 'torch._C._TensorMeta' at 0x540b610 (Tensor)>)}], [GraphModule(), GraphModule()], [[<built-in function add>], [<built-in function add>]], [GraphCompileReason(reason='call_function BuiltinVariable(print) [ConstantVariable(str)] {}', user_stack=[<FrameSummary file <ipython-input-1-9e2ddb639697>, line 6 in fn>]), GraphCompileReason(reason='return_value', user_stack=[<FrameSummary file <ipython-input-1-9e2ddb639697>, line 8 in <graph break in fn>>])], 'Dynamo produced 2 graphs with 1 graph break and 2 ops\n Break reasons: \n\n1. call_function BuiltinVariable(print) [ConstantVariable(str)] {}\n  File "<ipython-input-1-9e2ddb639697>", line 6, in fn\n    print("BREAK")\n \n2. return_value\n  File "<ipython-input-1-9e2ddb639697>", line 8, in <graph break in fn>\n    return x\n \nTorchDynamo compilation metrics:\nFunction                        Runtimes (s)\n------------------------------  --------------\n_compile                        0.0418, 0.0084\nOutputGraph.call_user_compiler  0.0001, 0.0001')

```

## Program

```python
import torch
import torch._dynamo

def fn(x):
    x = x + 1
    print("BREAK")
    x = x + 1
    return x

out = torch._dynamo.explain(fn, torch.randn(10))
print(out)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102869
Approved by: https://github.com/voznesenskym
2023-06-07 22:38:57 +00:00
2baadc2ade Small operatorbench improvements (#103110)
- Don't copy inputs in cudagraphs wrapping, since the copies will distort timing and triton do_bench will clear the cache anyway
- Don't skip op if there is a fallback, since we have both fallbacks and lowerings for some ops
- Add option for channels last

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103110
Approved by: https://github.com/desertfire
2023-06-07 22:04:59 +00:00
e936277cc2 [ROCm] force HIP context initialization for inductor UTs (#103149)
Workaround for https://github.com/pytorch/pytorch/issues/102886
related to: https://github.com/pytorch/pytorch/issues/102476 https://github.com/pytorch/pytorch/issues/102475 https://github.com/pytorch/pytorch/issues/102474 https://github.com/pytorch/pytorch/issues/102473 https://github.com/pytorch/pytorch/issues/102473 https://github.com/pytorch/pytorch/issues/102472

Since 9aaa12e328 the first inductor (CPU) UT fails until the GPU context is correctly initialised, after which the subsequent UTs pass. CUDA observed the same issue, and a workaround was pushed to force initialisation of the cuda context by declaring an empty tensor (https://github.com/pytorch/pytorch/issues/92627). We have adopted the same approach but opted for `torch.zeros`, which correctly activates the HIP context after the kernel launch.
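
A one-line sketch of the workaround described above (assumes a GPU build):

```python
import torch

# Launching a trivial kernel forces lazy HIP/CUDA primary-context initialization.
torch.zeros(1, device="cuda")
```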

**Reproducer:**
```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Swap between torch.empty and torch.randn operations.')
    parser.add_argument('--empty', action='store_true', help='Use torch.empty operation')
    parser.add_argument('--rand', action='store_true', help='Use torch.randn operation')
    args = parser.parse_args()

    torch.cuda.set_device(0)
    if args.empty:
        torch.empty(1, device="cuda")
    elif args.rand:
        torch.rand(1, device="cuda")
    print(f": hasPrimaryContext: {torch._C._cuda_hasPrimaryContext(0)")
    with FakeTensorMode():
        p = torch.randn(4, 2, requires_grad=True, device='cuda')
        x = torch.randn(8, 4, device='cuda')
        y = torch.mm(x, p).square().sum()
        y.backward()
```

**ROCm python repro.py --empty**
0: hasPrimaryContext: False

**ROCm python repro.py --rand**
0: hasPrimaryContext: True

**CUDA python repro.py --empty**
0: hasPrimaryContext: True

**CUDA python repro.py --rand**
0: hasPrimaryContext: True

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103149
Approved by: https://github.com/eellison
2023-06-07 21:42:33 +00:00
376cf7965f Use gcc9 in linux-bionic-cuda12_1-py3_10-gcc9-build workflows (#103075)
Use gcc9 in linux-bionic-cuda12_1-py3_10-gcc9-build workflows
After PR, which fixed gcc9 transition : https://github.com/pytorch/multipy/pull/321

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at a076506</samp>

This pull request updates the GCC version for Python 3.10 and CUDA 11.8/12.1 test images and removes the unused CUDA 12.1 image configuration and reference from the docker build scripts and workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103075
Approved by: https://github.com/malfet
2023-06-07 21:34:29 +00:00
c454534d25 Enable torch.get_autocast_gpu_dtype in Dynamo tracing (#103166)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103166
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2023-06-07 21:31:45 +00:00
b5021ba981 Enable torch.is_complex in Dynamo tracing (#103154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103154
Approved by: https://github.com/yanboliang
2023-06-07 20:56:46 +00:00
2e8d2a2e69 [quant][pt2] Add test for inplace add (#102867)
Summary: This was broken after the recent partitioner refactors.

Test Plan: python test/test_quantization.py TestQuantizePT2E.test_qat_inplace_add_relu

Differential Revision: D46402378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102867
Approved by: https://github.com/jerryzh168
2023-06-07 19:43:28 +00:00
28f43c767c Fix outdated log settings in doc (#102285) (#102286)
Replace torch._dynamo.config.loglevel=<level> with torch._logging.set_logs(dynamo=<level>)
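
For reference, a minimal sketch of the updated call (`torch._logging` is an internal API):

```python
import logging
import torch

torch._logging.set_logs(dynamo=logging.DEBUG)
```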

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102286
Approved by: https://github.com/msaroufim, https://github.com/Neilblaze
2023-06-07 18:07:20 +00:00
471407cf78 [PT2][Quant] Use composble quantizer for embedding + static conv + dynamic (#103116)
Summary:
In this diff we test a module that does a) emedding lookup b) runs 1D
(converted to 2D) conv and c) runs linear on the output of 1d conv.

a is quantized using embedding quantizer.
c is quantized using dynamic quantization.
b is quantized using static quantization.

We compose quantizer from [a, c, b]. Tested it against similar fx config.

Test Plan: test_embedding_conv_linear_quantization

Reviewed By: jerryzh168

Differential Revision: D46267688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103116
Approved by: https://github.com/jerryzh168
2023-06-07 17:34:59 +00:00
3c0072e7c0 [MPS] Prerequisite for MPS C++ extension (#102483)
in order to add mps kernels to torchvision codebase, we need to expose mps headers and allow objc++ files used in extensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102483
Approved by: https://github.com/malfet
2023-06-07 17:28:31 +00:00
0c9117a61f [dashboard] Bring back inference perf measurement as nightly (#103151)
Summary: GCP workload has dropped since adding control options for
manual dispatch. Let's set the inference run default to nightly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103151
Approved by: https://github.com/huydhn
2023-06-07 17:19:10 +00:00
686d7e4c48 [Inductor] Fix x.view(dtype) decomp and make inductor support it (#102920)
Fixes #99804
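
A hedged sketch of the op in question: `Tensor.view(dtype)` reinterprets the underlying bits without copying, and inductor can now compile it.

```python
import torch

@torch.compile
def bits(x):
    return x.view(torch.int32)   # bit-level reinterpretation, no copy

print(bits(torch.tensor([1.0, -2.0])))
```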

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102920
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-06-07 17:10:54 +00:00
b8caa2b08f Fix regressions caused by https://github.com/pytorch/pytorch/pull/103128
By adding `torch.SymBool` back
2023-06-07 09:39:02 -07:00
e930c0fc35 [export] Initial deserialization v2 (#102716)
v2 of https://github.com/pytorch/pytorch/pull/102126. mentally stacked on top of https://github.com/pytorch/pytorch/pull/102707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102716
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-06-07 16:02:35 +00:00
adcefcb378 insert to dtype for fused mem copy scheduler node (#101042)
Fix https://github.com/pytorch/pytorch/issues/100830.

For the inplace node, a `copy_` is generated, and the `copy_` is `realized` as a `scheduler buffer` since it is a mutation. This `scheduler buffer` is a memory copy, but after being fused with the previous buffer it is no longer a memory-copy-only buffer.
This PR solves the issue by removing `load_bf16_as_fp32` and `store_bf16_from_fp32`. Instead, we enable fp32/bf16 vec conversion in `to_dtype` and always store bf16.

```python
import torch
import torch.nn as nn
torch.manual_seed(420)
from torch._inductor import config

x = torch.randn(1, 18, dtype=torch.bfloat16)

class ExampleModel(nn.Module):

    def __init__(self):
        super(ExampleModel, self).__init__()
        self.relu = nn.ReLU(inplace=True) # nn.ReLU(inplace=False)

    def forward(self, input1):
        out = self.relu(input1)
        # input1.copy_(out)
        return out

func = ExampleModel()

with torch.no_grad():
    func.train(False)
    res1 = func(x) # without jit
    print(res1)

    jit_func = torch.compile(func)
    res2 = jit_func(x)
    print(res2)
```

Generated code without this PR: (the `tmp3` store is wrong: `tmp3` is `float` while `out_ptr1` is `bf16`)
```
            auto tmp0 = load_bf16_as_float(out_ptr1 + static_cast<long>(i0));
            auto tmp1 = (tmp0);
            auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
            auto tmp3 = (tmp2);
            store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp3);
            tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```

Generated code with this PR:
```
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(out_ptr1 + static_cast<long>(i0), 16);
            auto tmp1 = cvt_bf16_to_fp32(tmp0);
            auto tmp2 = at::vec::clamp_min(tmp1, decltype(tmp1)(0));
            auto tmp3 = cvt_fp32_to_bf16(tmp2);
            tmp3.store(out_ptr0 + static_cast<long>(i0), 16);
            tmp3.store(out_ptr1 + static_cast<long>(i0), 16);
```

This PR also fixes the data type propagation for `masked_subblock`.
Previously the masked_subblock's dtype was propagated from its input, which is wrong.
```
opcode       name       target     args                        kwargs
-----------  ---------  ---------  --------------------------  --------
call_module  masked_subblock1  masked_subblock1  (and__2, -inf)
```
Now we propagate it from the subblock with the same name:

```
# graph for body.subblocks['masked_subblock1']
opcode       name       target     args                        kwargs
-----------  ---------  ---------  --------------------------  --------
placeholder  ops        ops        ()                          {}
call_module  get_index  get_index  ('index2',)                 {}
call_method  load       load       (ops, 'arg0_1', get_index)  {}
call_method  to_dtype   to_dtype   (ops, load, torch.float32)  {}
output       output     output     (to_dtype,)                 {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101042
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-07 15:55:25 +00:00
605a85249c Fix graph break on boolean mask better (#103052)
Previously I accidentally assumed setitem receives its index as a list.  But if
you write x[:, b], the index is actually passed in as a tuple.
Try harder.
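
A plain-Python illustration of the point (the `Probe` class below is purely illustrative, not dynamo code):

```python
# For x[:, b] = v, the index that reaches __setitem__ is a tuple, not a list.
import torch

class Probe:
    def __setitem__(self, key, value):
        print(type(key), key)  # <class 'tuple'> (slice(None, None, None), tensor([ True, False,  True]))

b = torch.tensor([True, False, True])
Probe()[:, b] = 0
```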

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103052
Approved by: https://github.com/desertfire
2023-06-07 14:40:56 +00:00
2dafa70d61 Add a little more error checking to minifier (#103057)
Prompted by https://github.com/pytorch/pytorch/issues/101408

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103057
Approved by: https://github.com/bdhirsh
2023-06-07 14:40:12 +00:00
e4a42bcf56 add foreach support for custom device (#102047)
Fixes #ISSUE_NUMBER
For custom devices, we want to support foreach, so this adds a function that lets us set another device type; the default value is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102047
Approved by: https://github.com/janeyx99
2023-06-07 13:59:20 +00:00
07104ca99c [c10d] Make it default that PG do not perform barrier after init (#103033)
Both internal and OSS users trying https://github.com/pytorch/pytorch/pull/99937 report that their workloads perform normally even with the barrier removed and see a scalability win. Thus in this PR, we decide to make it default that PG do not perform a barrier after init.

In the discussion of #99937, people point out that such a barrier might be needed for c10d + RPC cases. IMO, this need originates from RPC's programming model and should be RPC's or the RPC user's responsibility to deal with. That is, it can happen with other functions/libraries too, so the need for c10d to do so big a favor is not justified IMO. It is also good to remove it before users become reliant on this barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103033
Approved by: https://github.com/XilunWu
2023-06-07 06:11:14 +00:00
3e988316b5 update argument checks from padding layers (#102253)
Replacement of https://github.com/pytorch/pytorch/pull/99608, breaking the old PR into smaller ones.

This one handles the common error message shared by the CPU and CUDA devices, to simplify the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102253
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-06-07 05:01:59 +00:00
5acf7e266b [vision hash update] update the pinned vision hash (#103120)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103120
Approved by: https://github.com/pytorchbot
2023-06-07 04:40:34 +00:00
a02a58d862 [FSDP][1/N]Add device_mesh to FSDPstate (#102317) (#102551)
This PR creates a device_mesh and shares it across all FSDP states. The device_mesh will later be used to test out the dtensor state_dict (1d device_mesh).
Approved by: https://github.com/awgu

- Add device mesh to FSDP state
- Skip when dist.get_world_size(pg) != dist.get_world_size()
- Address and fix the test_fake_pg.py test failure

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102551
Approved by: https://github.com/fegin
2023-06-07 04:14:00 +00:00
0769a50a5f Disable dynamo on some opt methods and differentiable optimizer tests (#103066)
- Disables dynamo on the differentiable optimizer tests
- Disables dynamo on some test methods which expose a very rare dynamo edge case
- Disables dynamo on export/save optimizer state methods because it shouldn't trace those anyway.

I have a draft PR to fix the two tests marked skip due to unsupported mutation of step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103066
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-06-07 03:50:42 +00:00
f760899864 Teach Triton codegen to generate sqrt (#103084)
Fixes https://github.com/pytorch/pytorch/issues/100972

I know ngimel doesn't like this sort of fix because we shouldn't
actually be computing sqrt at runtime; I'm open to some sort of
perf warning saying that we're spending FLOPs weirdly.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103084
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/ngimel
2023-06-07 03:03:56 +00:00
3f6f508646 [PT-D] Update torch.distributed code owners (#103114)
Summary: As title.

Test Plan: CI

Differential Revision: D46498287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103114
Approved by: https://github.com/fegin, https://github.com/wanchaol
2023-06-07 02:28:34 +00:00
821493715c Back out "Remove check from _prims_common, replace with torch._check* (#102219)", Back out "Forwatd fix for D46427687" (#103128)
Test Plan: revertitparrot

Reviewed By: malfet

Differential Revision: D46506433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103128
Approved by: https://github.com/malfet
2023-06-07 01:41:41 +00:00
428bff842d [benchmarks] Torchbench llama is not suitable for training (#103094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103094
Approved by: https://github.com/eellison, https://github.com/desertfire
2023-06-07 01:33:07 +00:00
2800a04a17 Add device range helper and remove sm86 specific check for memory efficient attention (#102985)
# Summary
Since we have upstreamed the latest changes to memory-efficient attention, we can remove the sm86/sm89-specific check. All head_sizes (assuming correct alignment) should work for sm86 and sm89 and no longer have a maximum cap.

If head_size > 96 there will be a big drop in performance, but it should not error and still maintains memory savings by not materializing the attention weights.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102985
Approved by: https://github.com/cpuhrsch
2023-06-07 00:28:40 +00:00
6596cfa4d7 [export] Remove example custom_object_type to type_reflection_method. (#103015)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103015
Approved by: https://github.com/tugsbayasgalan
2023-06-07 00:03:57 +00:00
27f4dc6c0a [ONNX] Add FX exporter MaxPool tests (#102773)
Need https://github.com/microsoft/onnxscript/pull/757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102773
Approved by: https://github.com/BowenBao
2023-06-06 23:31:49 +00:00
5b700fc914 Disable fallback for custom kernels (#101131)
Previous failed attempt was here: https://github.com/pytorch/pytorch/pull/97715.
Basically we tried to disable fallback for all ops (aten + custom) but hit many CI failures due to missing fake tensor coverage. Let's just disable it for custom kernels for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101131
Approved by: https://github.com/zou3519
2023-06-06 23:25:29 +00:00
8e0837cf84 [PT2][Quant] Move embedding quantization to osss (#103088)
Summary:
This is in preparation for enabling embedding quantization on models with
embeddings.

Test Plan: test_embedding_quantizer

Reviewed By: jerryzh168

Differential Revision: D46267689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103088
Approved by: https://github.com/andrewor14
2023-06-06 23:07:57 +00:00
bf312f2d9d [inductor] add a few tests to verify view_to_reshape pass is safe (#103034)
This PR follows up on issue https://github.com/pytorch/pytorch/issues/102229. I added 2 unit tests and verified that autograd/functionalization already handles view properly. The view_to_reshape pass does not cause any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103034
Approved by: https://github.com/ezyang
2023-06-06 22:32:51 +00:00
61736679cd [Dynamo] No graph break for super(MyConv{1/2/3}d, self).forward and super(MyConvTranspose, self).forward (#102509)
Before the PR, running super(MyConv1d, self).forward or super(MyConvTranspose, self).forward would cause dynamo to create a graph break when executing NNModuleVariable.call_method and raise an unimplemented error for name=_conv_forward / _output_padding. See the issue for full details: https://github.com/pytorch/pytorch/issues/101155

After the PR, for torch.nn conv modules with function name _conv_forward / _output_padding, we inline the function with tx.inline_user_function_return.

Code refactor: added NNModuleVariable._inline_user_function_return_helper to consolidate tx.inline_user_function_return into one place and keep the code DRY. After the refactor there are 2 call sites of inline_user_function_return with different ```fn``` and ```source``` logic; the code is still DRY. For local testing, they are covered by test_modulelist, test_moduledict, test_conv_call_super_forward_directly and test_conv_transpose_call_super_forward_directly in test_modules.py

Differential Revision: [D46494460](https://our.internmc.facebook.com/intern/diff/D46494460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102509
Approved by: https://github.com/yanboliang
2023-06-06 22:01:17 +00:00
038955f489 torch.compile docs: "Profiling to understand torch.compile performance" (#102862)
Docs on how to use torch.profiler.profile to understand torch.compile performance.
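
A minimal sketch of the workflow the docs describe (assumes a CUDA device is available; not copied from the docs page itself):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.compile(torch.nn.Linear(128, 128).cuda())
x = torch.randn(32, 128, device="cuda")

model(x)  # warm-up so compilation time does not dominate the trace
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```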
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102862
Approved by: https://github.com/eellison
2023-06-06 22:00:36 +00:00
6261055471 dst_bin_of_end_center is defined twice (#102755)
(line 995 and line 1011)
Both definitions are the same.
Delete one of them.

Fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102755
Approved by: https://github.com/janeyx99
2023-06-06 21:17:07 +00:00
0279d0b611 [Profiler] Update Kineto Submodule (#103031)
Summary: Update Kineto Submodule to pick-up fixes to Tensorboard and CUPTI log level.

Test Plan: CI

Differential Revision: D46455120

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103031
Approved by: https://github.com/Skylion007
2023-06-06 20:30:16 +00:00
dfa64fddeb [FSDP] Fix for optim state dict (#102901)
Fix for HSDP + use_orig_params where we need to pass in the PG that
might not be the default.

Differential Revision: [D46417327](https://our.internmc.facebook.com/intern/diff/D46417327/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102901
Approved by: https://github.com/wz337
2023-06-06 20:21:23 +00:00
2405c59c75 [BE] Use value_or (#103065)
`s/!k.has_value() || *k == foo/k.value_or(foo) == foo/`

Which yields the same code see https://godbolt.org/z/6b35zYcYc

🤖 Generated by Copilot at 003c703:

Simplify the logic for registering and looking up backend fallbacks in `library.cpp` by using a constant and an optional helper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103065
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-06-06 20:07:29 +00:00
08c4a442fd Dont run test files that are already run in test_optim (#103017)
They get run twice by accident.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103017
Approved by: https://github.com/janeyx99
2023-06-06 17:31:21 +00:00
90fd90dd94 Fix rocm sharding (#102871)
ROCm queries for the number of processes it should use per machine, which might cause it to be different across shards, leading to inconsistencies when distributing tests among shards.

My solution is to separate the vars used for shard calculations and the actual number of procs that can be used and to ensure that the var used for shard calculations is consistent across all shards for a test config + job.  I believe that the only consequence is that rocm sharding might become unbalanced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102871
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-06-06 17:29:53 +00:00
a867e6db85 Add newline before minified repro path (#103083)
Minor QOL change.  This log message is pushed into my history by the
backtrace, which is a pain because if I tab up in tmux I can no longer
paste it without line breaks.  This makes it more convenient to use tmux
copy mode to get only the file (as I get the entire line this way.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103083
Approved by: https://github.com/albanD
2023-06-06 17:09:44 +00:00
fbbde8df69 [inductor] fix a numel expr codegen issue (#103005)
Summary: Correctly use pexpr or cexpr for generating symbolic expression
during wrapper codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103005
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
49577c7e47 [inductor] Turn off autotune_cublasLt for cpp_wrapper (#103004)
Summary: bias_addmm is not backed by a cpp function, so turn off
autotune_cublasLt for cpp_wrapper + max_autotune. We can add a cpp
function implementation if there is a performance need.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103004
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
44fdfd3222 [inductor] Support select_algorithm with cpp_wrapper (#103003)
Summary: This is one step towards getting cpp_wrapper work with max_autotune.
Switch to use unique kernel name to cache generated cubin file.

This is a copy of https://github.com/pytorch/pytorch/pull/102738 to solve a ghstack issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103003
Approved by: https://github.com/jansel
2023-06-06 14:08:05 +00:00
8824101fb6 [PT2][Quant] Introduce composable quantizer (#102846)
Summary:
Using the composable quantizer, we can now compose two or more quantizers. In
the test here we compose a quantizer configured for dynamic linear quantization
with a quantizer configured for static quantization.

Note that the composable quantizer applies annotations in a strict order.
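
As a rough sketch, composing two quantizers could look like the following; the import paths follow the layout of later releases and are assumptions here, not taken from this diff:

```python
from torch.ao.quantization.quantizer.composable_quantizer import ComposableQuantizer
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

dynamic_q = XNNPACKQuantizer().set_global(get_symmetric_quantization_config(is_dynamic=True))
static_q = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
# Annotations are applied in list order, so ordering matters.
quantizer = ComposableQuantizer([dynamic_q, static_q])
```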

Test Plan: test_composable_quantizer*

Reviewed By: jerryzh168

Differential Revision: D46267690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102846
Approved by: https://github.com/andrewor14
2023-06-06 14:01:55 +00:00
eeb3c62117 Add Wav2Vec2 HuggingFace support (#103009)
This is not actually enabled in the benchmark suite as you need
https://github.com/pytorch/pytorch/pull/103001 and also training
is broken per https://github.com/pytorch/pytorch/issues/101160
but might as well review this part first.

Contains https://github.com/pytorch/pytorch/pull/102979 but
I will probably rebase past that once it lands.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103009
Approved by: https://github.com/Skylion007
2023-06-06 13:25:06 +00:00
ba962fefea Add parametrization version of weight_norm (#103001)
This is done in the ordinary way, but also:

* Deprecation warning for the old API, and a migration guide
* Backwards compatibility for state_dict loading the old weight_norm
* Test for pickling and deepcopy, which was the motivating reason

weight_norm is still used by HuggingFace Wav2Vec2.
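
A small sketch of the new parametrization-based API next to the deprecated call:

```python
import copy
import torch.nn as nn
from torch.nn.utils.parametrizations import weight_norm

layer = weight_norm(nn.Linear(20, 40), name="weight")
print(layer.weight.shape)      # weight is recomputed from the magnitude/direction parameters
clone = copy.deepcopy(layer)   # deepcopy (and pickling) now work, a motivation for this change

legacy = nn.utils.weight_norm(nn.Linear(20, 40))  # old API, now deprecated
```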

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103001
Approved by: https://github.com/albanD
2023-06-06 13:14:43 +00:00
3a38acf18f Move CUDA 11.8 CI jobs to CUDA 12.1, CUDA 11.7 jobs to CUDA 11.8 (#102178)
Move CUDA 11.8 CI jobs to CUDA 12.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102178
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-06-06 11:53:35 +00:00
1fcc67fd8c [pt2] add SymInt support for linalg.tensorsolve (#102466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102466
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-06-06 08:06:55 +00:00
ec0aa965da [pt2] add meta for _linalg_solve_ex (#102454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102454
Approved by: https://github.com/lezcano
2023-06-06 08:06:55 +00:00
4bda4a7e4d [pt2] add meta for lu_unpack (#102937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102937
Approved by: https://github.com/lezcano
2023-06-06 08:06:53 +00:00
39f3514fa3 Add an env PYTORCH_TEST_SKIP_CUDAGRAPH to skip all cuda graph-related unit tests (#103032)
Skip all cuda graph-related unit tests by setting env var `PYTORCH_TEST_SKIP_CUDAGRAPH=1`

This PR refactors the `TEST_CUDA` python variable in test_cuda.py into common_utils.py. This PR also creates a new python variable `TEST_CUDA_GRAPH` in common_utils.py, which has an env var switch to turn off all cuda graph-related tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103032
Approved by: https://github.com/malfet
2023-06-06 07:51:57 +00:00
b592e67516 Use C++17 [[fallthrough]]; (#102849)
Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D46385240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102849
Approved by: https://github.com/Skylion007
2023-06-06 07:06:26 +00:00
cyy
30e2764221 remove c10::guts::{max,min} (#102952)
Because we have enabled C++17, and std::{max,min} are required to be constexpr since C++14 according to [cppreference](https://en.cppreference.com/w/cpp/algorithm/max) we can safely remove them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102952
Approved by: https://github.com/Skylion007
2023-06-06 05:40:30 +00:00
3a385656b5 [export] Initial serialization v2 (#102707)
v2 of https://github.com/pytorch/pytorch/pull/102125 because of git issues
corresponding deserialization diff: https://github.com/pytorch/pytorch/pull/102716

Implementing serialization of the exported program to a python dataclass, and then from that dataclass to json. This is split into a couple of sections:
- `serialize(ep: ep.ExportedProgram, opset_version: Dict[str, int]) -> Tuple[bytes, bytes]` -- takes an exported program object, a dictionary mapping opset namespaces to versions, and returns the serialized exported program in bytes, and separately the state dict serialized in bytes
- `GraphModuleSerializer` class that serializes torch.fx.GraphModule
to the schema.GraphModule dataclass
- `ExportedProgramSerializer` class that serializes torch._export.exported_program.ExportedProgram to the schema.ExportedProgram dataclass

Serialization TODOs:
- [x] pytree spec: https://github.com/pytorch/pytorch/pull/102577
- [ ] higher order ops
- [ ] node metadata (specifically nn_module_stack/source_fn)
- [ ] constraints
- [ ] graph module metadata

The tests are not super comprehensive, but that's because I think it'll be better tested + easier to test once deserialization is implemented.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102707
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2023-06-06 05:12:49 +00:00
d7035ffde3 Enable uint8/int8 mkldnn/dense tensor conversion (#102965)
**Summary**
Support mkldnn tensor and dense tensor conversion with uint8/int8 data type.
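
A hedged sketch of the round trip this enables for uint8 (int8 is analogous); it requires a PyTorch build with oneDNN/MKL-DNN support:

```python
import torch

x = torch.randint(0, 255, (4, 4), dtype=torch.uint8)
y = x.to_mkldnn()   # dense -> mkldnn, previously rejected for uint8/int8
z = y.to_dense()    # mkldnn -> dense
assert torch.equal(x, z)
```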

**Test Plan**
```
 python -m pytest -s -v test_mkldnn.py -k test_conversion_byte_char
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102965
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
2023-06-06 05:05:29 +00:00
cyy
7a42a03547 fix use-after-free in test (#102734)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102734
Approved by: https://github.com/Skylion007
2023-06-06 04:41:20 +00:00
5fbbae4283 [quant][pt2e][be] Cleanup prepare function in _pt2e (#103022)
Summary: att

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```

Differential Revision: D46346087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103022
Approved by: https://github.com/andrewor14
2023-06-06 04:33:05 +00:00
872fdb329b This extra message would have helped with Wav2Vec2 debugging. (#103002)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103002
Approved by: https://github.com/janeyx99, https://github.com/anijain2305, https://github.com/voznesenskym, https://github.com/malfet
2023-06-06 04:28:16 +00:00
6408b85d88 [vision hash update] update the pinned vision hash (#103038)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103038
Approved by: https://github.com/pytorchbot
2023-06-06 02:43:48 +00:00
dda59162f1 Native rearrange in functorch (#101957)
Fixes #92675

Here we implement a native version of [`einops.rearrange`](https://einops.rocks/api/rearrange/) using first class dims to perform the operations. The string parsing + validation, documentation, and relevant tests are adapted from `einops`. The API is exactly the same as the `einops` API.

The main idea is to take the string and convert it to a left and right `ParsedExpression`, and then find a mapping from the axes to first class dims. Once the mapping exists we convert the left expression `composition` list into a `Tensor.__getitem__` index and the right expression `composition` into the `Tensor.order` arguments, and then use this to dynamically create a callable that performs the `rearrange` operation as specified by the pattern.
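
Illustrative usage, einops-style; the import path below is an assumption about where the native implementation lives rather than something stated in this message:

```python
import torch
from functorch.einops import rearrange

x = torch.randn(2, 3, 4, 5)
y = rearrange(x, "b c h w -> b (c h w)")  # flatten everything after the batch dim
assert y.shape == (2, 60)
```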

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101957
Approved by: https://github.com/zdevito
2023-06-06 02:10:42 +00:00
367b0ad062 enforce dtype (reland) (#102996)
Summary: The original diff didn't break the test.

Test Plan: N/A

Differential Revision: D46448488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102996
Approved by: https://github.com/malfet, https://github.com/wanchaol
2023-06-06 00:35:04 +00:00
e26f5b2ac7 docs: Render bullet points correctly (#103021)
This wasn't rendering correctly on the website, this should make it so that the bullet points actually show correctly now.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103021
Approved by: https://github.com/albanD
2023-06-06 00:22:49 +00:00
9567aaebe5 Package torch/*.pyi type hints (#103016)
Including `torch._VF` and `torch.return_types`

These are generated by:
4003e96ca1/tools/pyi/gen_pyi.py (L1139-L1155)

Ref #99541
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103016
Approved by: https://github.com/Skylion007
2023-06-05 23:08:10 +00:00
258525093e Exclude clang-format diff from git-blame (#103000)
Add https://github.com/pytorch/pytorch/pull/102887 to `.git-blame-ignore-revs`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103000
Approved by: https://github.com/Skylion007
2023-06-05 22:59:01 +00:00
117f9bb847 [BE] Explain how to get consistent linter behavior locally (#102990)
Sometimes you'll see linter failures on CI that don't repro locally, caused by the local linter not having installed the latest config.

These instructions explain how to make both the CI and local linter consistent again
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102990
Approved by: https://github.com/huydhn
2023-06-05 22:51:17 +00:00
12cd1dbba0 Handle recursive tuple in clone_inputs (#102979)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102979
Approved by: https://github.com/wconstab
2023-06-05 22:11:48 +00:00
4479e2fa19 fix profiling ref in side panel (#103014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103014
Approved by: https://github.com/msaroufim
2023-06-05 21:19:51 +00:00
6cb1455857 [export] Change equality constraints to list of tuples (#102998)
Changed equality constraints to a list of tuples as the dictionary wasn't providing much value -- also makes creating constraints + serialization easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102998
Approved by: https://github.com/avikchaudhuri
2023-06-05 21:03:02 +00:00
3cb0ba2263 [ROCm] MIOpen supports bias with bfloat16 (#95080)
Removes the condition in ConvParams::use_miopen restricting bfloat16 with bias defined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95080
Approved by: https://github.com/ngimel
2023-06-05 20:58:28 +00:00
1943bd0d7e [Release] Add FAQ explaining release terms (#102618)
[Release] Add FAQ explaining release terms
This is an action item from the release 2.0.0 retrospective
Fixes: https://github.com/pytorch/pytorch/issues/98009

Co-authored-by: Nikita Shulga <nshulga@meta.com>
Co-authored-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102618
Approved by: https://github.com/malfet
2023-06-05 20:26:57 +00:00
1c2dfdf30c Add renorm forward-ad (#100798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100798
Approved by: https://github.com/soulitzer
2023-06-05 20:25:35 +00:00
d89c719160 Fix torch.compile side panels refs (#102407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102407
Approved by: https://github.com/msaroufim
2023-06-05 20:08:40 +00:00
76a98abcb2 Rework Inductor support for collectives. (#99765)
This is done by introducing two new base classes: InPlaceCollectiveKernel and OutOfPlaceCollectiveKernel.

They deal with the differences for when InPlaceHint needs to be used.

In addition, we introduce a `has_side_effects` method on buffers that
prevents them from being DCE'd by the scheduler. This is needed because InPlaceHint
nodes both wrap the inputs and are the outputs, which leaves the collectives
themselves with no users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99765
Approved by: https://github.com/wconstab
2023-06-05 20:06:40 +00:00
cca7b38564 Don't allow skipping deepcopy (#102973)
We might mutate it afterwards!  This could lead to hard to understand
bugs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102973
Approved by: https://github.com/albanD
2023-06-05 20:01:16 +00:00
7112880cc1 Preserve leaf-ness and requires_grad-ness in minified repros (#102899)
Also some minor refactoring

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102899
Approved by: https://github.com/albanD
2023-06-05 19:56:00 +00:00
719584600b Merge original module attributes with attributes assigned by __setattr__ (#102910)
Fixes https://github.com/pytorch/pytorch/issues/94478 @davidberard98

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102910
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze, https://github.com/davidberard98
2023-06-05 19:14:07 +00:00
515c427941 Enable clang-format on foreach / multi_tensor_apply files (#102887)
As per title.

I don't think it's good to have multiple styles of line breaks, indents, etc. in a file.
I'll submit a pull request to update https://github.com/pytorch/pytorch/blob/main/.git-blame-ignore-revs once this lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102887
Approved by: https://github.com/albanD
2023-06-05 19:00:07 +00:00
604a414bfc [quant][pt2] Fix convert in Conv + BN QAT fusion (#102224)
Summary:
Previously, the test for the convert flow in Conv + BN
QAT fusion was not enabled by mistake. However, reenabling this
test uncovered several bugs:

(1) The replaced nodes returned by subgraph rewriter were not
handled correctly. This is because a recent change in the subgraph
rewriter (#100556) fixed only the prepare case but not the convert
case. This commit brings this fix to the convert case as well and
deduplicates some code between the two cases.

(2) When folding BN into conv, we used the wrong arg index to get
the BN eps value. This resulted in an incorrect conv weight.

(3) In FX, we currently do a hack for weighted modules where we
observe the weights once in convert in order to ensure we get the
right shapes for these weight observers. This caused the numerics
to diverge between PT2 and FX. This commit fixes this by skipping
this unnecessary hack for `_convert_to_reference_decomposed_fx`.

(4) Per channel support was simply missing. This commit adds
support for this by matching the quantize_per_channel and
dequantize_per_channel ops in addition to the existing ones.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_numerics

Reviewed By: jerryzh168

Differential Revision: D46097783

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102224
Approved by: https://github.com/jerryzh168
2023-06-05 18:09:28 +00:00
4bb2b65ea4 Turn on add_runtime_assertion by default (#102671)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102671
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-06-05 16:27:44 +00:00
ecb191683e Revert "enforece dtype (#102802)"
This reverts commit 8e2a86c2a54719fd66a3e612fe8b433fbb1d4522.

Reverted https://github.com/pytorch/pytorch/pull/102802 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/102802#issuecomment-1577099676))
2023-06-05 16:21:28 +00:00
9cabdff8bd Update documentation to read FileSystemReader instead of FileSystemLoader (#102795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102795
Approved by: https://github.com/wz337
2023-06-05 15:22:49 +00:00
f1f57e1e54 trigger tracing for MTIA events (#102288)
Summary: trigger tracing for MTIA events on python side when ProfilerActivity.MTIA is specified

Test Plan:
Test diff: D45437426

```
hg graft D45437426
```
- in one terminal

```
cd ~/fbsource/fbcode
buck2 run -j 8 \
    //infra_asic_fpga/firmware/tools/mad/service:mad_service
```
- in another terminal

Pytorch profiler
```
buck run mode/dev-nosan -j 8 //caffe2/torch/fb/acc_runtime/afg/tests:test_afg  -- -m kernel_add
```

Differential Revision: D46122853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102288
Approved by: https://github.com/aaronenyeshi
2023-06-05 15:10:31 +00:00
2c2e4d5228 Populate the eviction_policy field for load/store properly (#91316)
This helps with kernels that make use of caching like mid-range softmax
which reads the data three times.

Selecting `eviction_policy=evict_first` in the last loop of the softmax
operation seems to give a 7-10% speed-up vs. selecting `evict_last` which
was the previous option. I'll put up some benchmarks soon™.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91316
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-06-05 13:54:36 +00:00
ee77d2b660 Create public interface for torch.jit (#101678)
Fixes #92240; this adds all variables in `torch/jit/__init__.py` that also have docs page to `__all__`: https://pytorch.org/docs/stable/jit.html

As stated in the tracking issue, this solves pyright errors like this:

```python
import torch

def foo(x, y):
    return 2 * x + y

traced_foo = torch.jit.trace(foo, (torch.rand(3), torch.rand(3)))  # error: "trace" is not exported from module "torch.jit" (reportPrivateImportUsage)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101678
Approved by: https://github.com/albanD
2023-06-05 13:14:32 +00:00
f79d2b45fb Revert "Replace _dynamo.config with an object instead of module (#96455)"
This reverts commit 3864207c2a71a3ba8dc13bcf9582a726a10292cd.

Reverted https://github.com/pytorch/pytorch/pull/96455 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/96455#issuecomment-1576162237))
2023-06-05 07:06:14 +00:00
258d398eec Revert "torch.compiler public namespace (#102182)"
This reverts commit b5840f99c3f2ae01b7831fd32b99758180fc22c3.

Reverted https://github.com/pytorch/pytorch/pull/102182 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/102182#issuecomment-1576144551))
2023-06-05 06:52:37 +00:00
6ac3352a37 [pt2] add meta for _linalg_slogdet (#102464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102464
Approved by: https://github.com/ezyang
2023-06-05 03:17:08 +00:00
ca18053913 inductor: add fake mode tracing for cumsum graph pattern (#102820)
When running the dynamic-shapes path of ```OPTForCausalLM```, there is an error: ```TypeError: unsupported operand type(s) for +: 'Node' and 'int'```. This PR does the following:

1. For ```pointless_cumsum_replacement```, the sizes may be a node, so we should trace the target pattern using example inputs.
2. For dynamic shapes, we should trace the pattern under fake mode, in which inputs may be symbolic.

After this PR, the dynamic-shapes run of ```OPTForCausalLM``` works (```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/huggingface.py --performance --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only OPTForCausalLM```).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102820
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-06-05 03:09:26 +00:00
8e2a86c2a5 enforece dtype (#102802)
Summary: Add a flag to enforce the gather data dtype. To preserve backward compatibility, the default is False.

Test Plan: local and mast

Reviewed By: zyan0, strisunshinewentingwang

Differential Revision: D46295190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102802
Approved by: https://github.com/mrshenli
2023-06-05 02:04:09 +00:00
881307abcf [inductor] Fix a cpp_wrapper issue when fx_passes modified fx graph (#102851)
Summary: Currently cpp_wrapper for CUDA does it in two passes, which
means we need to deepcopy the input module to isolate any fx
transformations between the two passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102851
Approved by: https://github.com/jansel
2023-06-05 00:20:38 +00:00
26bf8894b6 [export] Replicate exportdb examples and tests in oss. (#102769)
Summary: Initial work to copy source to OSS for exportdb and make sure tests can run properly.

Test Plan: test_export

Differential Revision: D46369152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102769
Approved by: https://github.com/angelayi
2023-06-04 20:01:57 +00:00
a748be93df [CheckpointWrapper] Warn on reentrant use (#102890)
We'd like to encourage users to try non-reentrant as much as possible,
and identify any gaps this way.

Differential Revision: [D46397786](https://our.internmc.facebook.com/intern/diff/D46397786/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102890
Approved by: https://github.com/awgu
2023-06-04 18:31:22 +00:00
5b623d6c6a [Composable] fully_shard load_optim test (#102692)
Closes https://github.com/pytorch/pytorch/issues/93280 and adds tests
for this.

Differential Revision: [D46343364](https://our.internmc.facebook.com/intern/diff/D46343364/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102692
Approved by: https://github.com/awgu
2023-06-04 18:31:22 +00:00
88ce6215f5 [FSDP/DDP] Unify _cast_forward_inputs (#102680)
Closes https://github.com/pytorch/pytorch/issues/96380

Differential Revision: [D46342814](https://our.internmc.facebook.com/intern/diff/D46342814/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102680
Approved by: https://github.com/awgu
2023-06-04 18:31:21 +00:00
957ea485c4 [FSDP/AC] checkpoint_wrapper acccept auto_wrap_policy (#102672)
Some feedback for this API is that folks would like to use
auto_wrap_policy similar to FSDP instead of having to adapt to the signature of
``check_fn``.
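
A hedged sketch of what policy-based wrapping could look like; the `auto_wrap_policy` parameter name is taken from this PR's title and `ModuleWrapPolicy` is the FSDP-style policy object, so treat the exact spelling as an assumption:

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

model = nn.Sequential(nn.Linear(16, 16), nn.TransformerEncoderLayer(d_model=16, nhead=2))
apply_activation_checkpointing(
    model,
    auto_wrap_policy=ModuleWrapPolicy({nn.TransformerEncoderLayer}),  # instead of a check_fn
)
```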

Differential Revision: [D46340320](https://our.internmc.facebook.com/intern/diff/D46340320/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102672
Approved by: https://github.com/awgu
2023-06-04 18:31:19 +00:00
df40ec82dc [FSDP][Docs] Document get_state_dict_type (#102658)
Per title

Differential Revision: [D46335317](https://our.internmc.facebook.com/intern/diff/D46335317/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102658
Approved by: https://github.com/fegin, https://github.com/awgu
2023-06-04 18:31:18 +00:00
c6d0fe39ec [FSDP] Document optim_state_dict_config in method (#102657)
Per title

Differential Revision: [D46335318](https://our.internmc.facebook.com/intern/diff/D46335318/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102657
Approved by: https://github.com/fegin
2023-06-04 18:31:16 +00:00
beb7131c64 [FSDP] Use INFO instead of DETAIL for warning logs (#102639)
Since these are just logs and don't introduce any big perf slowdowns,
I think we should just enable them in info mode.

Differential Revision: [D46328510](https://our.internmc.facebook.com/intern/diff/D46328510/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102639
Approved by: https://github.com/awgu
2023-06-04 18:31:15 +00:00
4d516f44a1 [FSDP][ez] Type optimizer correctly (#102637)
In shardedgradscaler, the optimizer doesn't have to be SGD.

Differential Revision: [D46327103](https://our.internmc.facebook.com/intern/diff/D46327103/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102637
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/fegin
2023-06-04 18:31:13 +00:00
e66c498d2d Log modules FSDP hooks fire for (#102508)
Under torch_distributed_debug >= INFO and use_orig_params=True, log post backward hook firing to debug things like FSDP + AC integration.

Differential Revision: [D46172916](https://our.internmc.facebook.com/intern/diff/D46172916/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102508
Approved by: https://github.com/awgu, https://github.com/fegin
2023-06-04 18:31:12 +00:00
757791d1e3 [pt2] add SymInt support for linalg.vander (#102469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102469
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-06-04 09:58:02 +00:00
cyy
87cbfe957a increase clang-tidy coverage to more c10 source files (#102902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102902
Approved by: https://github.com/Skylion007
2023-06-04 06:33:01 +00:00
992bffe5a3 [vision hash update] update the pinned vision hash (#102919)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102919
Approved by: https://github.com/pytorchbot
2023-06-04 02:47:36 +00:00
9d20b47e47 make device normalization more generic in faketensor (#102519)
Fixes #ISSUE_NUMBER
Make the device normalization in FakeTensor more generic, to support devices like "cuda", "foo", and so on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102519
Approved by: https://github.com/albanD
2023-06-04 01:44:21 +00:00
85efacee07 Add a new UNSTABLE category in trymerge (#102784)
Per title, after https://github.com/pytorch/pytorch/pull/102426 landed, it makes sense to have a new category for UNSTABLE jobs and handle them accordingly in trymerge.

* The simple approach is to check for `unstable` in the check (job) name.  I plan to roll this out first and then see if we need to cover the more complicated, but less popular case, of unstable build job.  Specifically, an unstable build job has no `unstable` in its name
* An unstable job is ignored by trymerge.  This is the same behavior we have atm when a job is moved to unstable.  It's completely ignored
* The update to Dr. CI will come later, so that unstable failures would also be hidden like broken trunk or flaky

### Testing

Leverage the broken trunk Windows CPU job atm and mark Windows CPU jobs as unstable https://github.com/pytorch/pytorch/issues/102297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102784
Approved by: https://github.com/clee2000
2023-06-04 00:40:27 +00:00
3864207c2a Replace _dynamo.config with an object instead of module (#96455)
Summary:
    Replace _dynamo.config with an object instead of module

    Current usage patterns of setting and reading fields on config will work
    unchanged.

    Only changes needed going forward:
    1. import torch._dynamo.config will not work. However, just doing
       import torch._dynamo is sufficient to access dynamo config
       as torch._dynamo.config.

    2. Files inside of _dynamo folder need to access config via
       from torch._dynamo.config_util import config instead of
       from torch._dynamo import config. Because _dynamo/__init__.py
       imports some of the files so it would be circular import.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/jansel
2023-06-03 23:18:41 +00:00
eebe0ee141 [Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102874)
Summary:
Add ETKernelIndex for aggregating all kernels by kernel key, and change codegen to take ETKernelIndex.

We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:

* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.

Note that the codegen logic is not changed yet, we need subsequent diffs to actually generate code for different kernel keys.

Test Plan: Added tests

Reviewed By: Jack-Khuu

Differential Revision: D46407096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102874
Approved by: https://github.com/Jack-Khuu, https://github.com/kirklandsign
2023-06-03 17:23:42 +00:00
0f672e8c67 Revert "[DTensor][3/N] add DTensor constructor function: full (#101436)"
This reverts commit 2ca75d49a83609b4e25b4b9becc859669e855a8d.

Reverted https://github.com/pytorch/pytorch/pull/101436 on behalf of https://github.com/malfet due to Caused internal SEV ([comment](https://github.com/pytorch/pytorch/pull/101436#issuecomment-1575076672))
2023-06-03 17:09:08 +00:00
c46af25bb3 Initialize optimizer in dynamo to avoid graph break and tracing slowness (#102640)
On calls to `_init_group` rather than tracing through it, extract python values from the arguments, and call the initialization. This avoids having to trace this function which is very slow with large parameters, and also avoids graph breaking on it. This is sound in this case because the state is only initialized once in the eager case. Guards on the state and params are generated explicitly rather than via tracing the initialization.

Caveats:
`_init_group` also gathers various state tensors into lists via mutating list arguments to pass to the functional optimizer implementation. These state tensors exist on the optimizer itself, but we don't know exactly how the gathering is done and which tensors correspond to which attributes of the optimizer module (each optimizer has different states). To rectify this, we keep weak_ptrs to all of the tensors collected in the lists in globals (similar to how parameter keys are stored for dictionaries). These pointers are guaranteed to be alive as long as the optimizer object is alive if the internal state is not interfered with and they are guarded with weakref guards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102640
Approved by: https://github.com/jansel
2023-06-03 15:49:51 +00:00
eb0971cfe9 [quant][pt2e][be] Remove _input_output_share_observers and _reuse_input_obs_or_fq from QuantizationAnnotation (#102854)
Summary:
att, after we support SharedQuantizationSpec we don't need these things anymore, this PR refactors the
uses of _input_output_share_observers to SharedQuantizationSpec

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```

Reviewed By: andrewor14

Differential Revision: D46301342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102854
Approved by: https://github.com/andrewor14
2023-06-03 07:31:09 +00:00
8215468870 Feature:To add --tolerance option to benchmark scripts (#102218)
The "tolerance" option evaluates the model on the baseline device in eager mode (default: CPU) compared to the test device (e.g., CUDA, XLA, etc.) and compares the output tensors to determine the absolute tolerance value based on the [formula](https://pytorch.org/docs/stable/generated/torch.allclose.html). It then saves the results in a CSV file. This comparison highlights the tolerance/accuracy difference between XLA and GPU/CPU devices and can also be used to evaluate newer accelerators. This feature aims to identify accuracy failures on the test device (e.g., XLA) and facilitate quick bug triaging.
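
A rough sketch of deriving a per-model absolute tolerance from the torch.allclose condition |actual - expected| <= atol + rtol * |expected|; the names below are illustrative and not taken from the benchmark scripts:

```python
import torch

def required_atol(actual: torch.Tensor, expected: torch.Tensor, rtol: float = 1.3e-6) -> float:
    # Smallest atol that would make torch.allclose pass at the given rtol.
    slack = (actual - expected).abs() - rtol * expected.abs()
    return max(slack.max().item(), 0.0)
```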

This feature enables the following capabilities:
1. Ability to monitor accuracy issues of backends
2. Provide more informative picture on accuracy beyond pass/ fail status
3. Having a dump of accuracy information will help triage models accordingly

The data generated using this feature is in the [spreadsheet](https://docs.google.com/spreadsheets/d/1A8BAzSqfAw0Q5rgzK5Gk__Uy7qhuynh8tedxKnH-t94/edit#gid=0).

The spreadsheet data can be used to compile the below summary table:

| Suite | Max Tolerance (xla) | Max Tolerance (inductor) | No. of models with high inaccuracy >=0.005 (xla) | No. of models with high inaccuracy >=0.005 (inductor) | Mean Tolerance (xla) | Mean Tolerance (inductor) |
|---|---|---|---|---|---|---|
| huggingface | 0.1169 | 0.0032 | 1 | 0 | 0.0022 | 0.0005 |
| timm_models | 0.0373 | 2.8892 | 10 | 8 | 0.0028 | 0.7044 |
| torchbench | 3.013 | 3.0381 | 6 | 2 | 0.0016 | 0.0016 |
| All models | 3.013 | 3.0381 | 17 | 10 | 0.0028 | 0.7044 |

I used PyTorch release/2.0 branch and corresponding [commit_pin](https://github.com/pytorch/pytorch/blob/release/2.0/.github/ci_commit_pins/xla.txt) for XLA to generate the above data.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102218
Approved by: https://github.com/jansel
2023-06-03 06:40:26 +00:00
1237502213 Introduce fast path for cuda_equal (#102714)
We introduce the same trick for cuda_equal, assuming that in cuda_equal the flags are already handled correctly.

Added tests for the CUDA part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102714
Approved by: https://github.com/ezyang
2023-06-03 05:49:49 +00:00
4254e052fb [BE] Fix lintrunner init on python 3.11 (#102889)
Makes the `lintrunner init` command work with python 3.11

The old version of numpy would fail to install on python 3.11, where setup would fail to build wheels with the error `AttributeError: fcompiler. Did you mean: 'compiler'?`

The latest version of numpy installs just fine however, so switching to that.

More details in https://github.com/numpy/numpy/pull/22102
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102889
Approved by: https://github.com/kit1980
2023-06-03 04:54:41 +00:00
00f1bb0963 Fix optimizer cuda health check graph break (can be done in the compiler) (#102765)
- Ignore the health check if we are compiling
- Don't disable the function anymore

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102765
Approved by: https://github.com/albanD
2023-06-03 03:42:23 +00:00
d92bb036a4 [Dynamo] Fix if condition on UnspecializedNNModuleVariable (#102583)
Fixes #102315

The root cause: for ```UnspecializedNNModuleVariable```, which extends ```UserDefinedObjectVariable```, if ```__bool__``` is missing we should use ```__len__``` to infer a truth value.
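
A plain-Python illustration of the rule being applied (nn.ModuleList defines ```__len__``` but not ```__bool__```):

```python
import torch.nn as nn

layers = nn.ModuleList()
if not layers:              # evaluated via len(layers) == 0
    layers.append(nn.Linear(2, 2))
print(bool(layers))         # True, since len(layers) == 1 now
```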

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102583
Approved by: https://github.com/jansel
2023-06-03 03:42:15 +00:00
a84bb2709a Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948
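
A hedged sketch of the torch._check style that the replaced helper calls migrate to:

```python
import torch

def my_op(x: torch.Tensor) -> torch.Tensor:
    # Raises an error with the lazily-built message when the condition is false.
    torch._check(x.dim() == 2, lambda: f"expected a 2D tensor, got {x.dim()}D")
    return x.t()
```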

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-03 02:23:21 +00:00
1035e33b38 [dynamo] test attaching attributes to an OptimizedModule (#102781)
Test that the following passes:
```python
mod = torch.compile(mod)
mod.is_compiled = True
assert "is_compiled" in dir(mod)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102781
Approved by: https://github.com/yuguo68
2023-06-03 02:20:21 +00:00
2fb182e054 speeds up will_fusion_create_cycle (#102770)
Improves #102622 from ~150s to ~15s.
The way recursive predecessors are computed means that if `nodes = node1.recursive_predecessors`, then the recursive predecessors of any `n` in `nodes` are still a subset of `nodes`, so we can shortcut computing the intersection of `node.recursive_predecessors - combined_predecessors`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102770
Approved by: https://github.com/Chillee
2023-06-03 01:45:09 +00:00
39b04370db Preserve coalesce state in sparse COO tensor serialization (#102647)
Fixes #101186

Also, resolves the "serialization to preserve coalesced-ness" part in https://github.com/pytorch/pytorch/issues/73479
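
A short sketch of the behavior being preserved: is_coalesced() now survives a save/load round trip.

```python
import io
import torch

i = torch.tensor([[0, 1, 1], [2, 0, 2]])
v = torch.tensor([3.0, 4.0, 5.0])
s = torch.sparse_coo_tensor(i, v, (2, 3)).coalesce()

buf = io.BytesIO()
torch.save(s, buf)
buf.seek(0)
t = torch.load(buf)
print(t.is_coalesced())  # True after this change; previously the flag was dropped
```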

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102647
Approved by: https://github.com/mikaylagawarecki
2023-06-03 01:37:52 +00:00
ec4a107f87 [LLVM] Make changes needed for opaque pointers (#101396)
Update llvm_codegen module to use opaque pointers feature of llvm.

* Set setOpaquePointers to true for llvm context.
* Pass Type to emit\*Load and emit\*Store functions.
* Create TypedPointer struct to keep track of Value and its Type.
* Introduce OpqTy_ to be used for opaque pointer types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101396
Approved by: https://github.com/jgong5
2023-06-03 01:00:11 +00:00
c304fddf68 [dynamo][numpy] Support graph break for numpy ndarray (#100839)
Issue: #93684

In previous PRs #95849 #99560 we redirect `numpy.*`, `<tensor>.numpy()` calls to `torch_np.*` methods and attributes, by creating `NumpyNdarrayVariable` for those calls.

We need to handle `NumpyNdarrayVariable` when graph break happens.

This PR did 2 things:
1. In `codegen.py` we made sure we can reconstruct the value wrapped by `NumpyNdarrayVariable`, to be `torch_np.ndarray` in the stack whenerver we recompiles the subgraph.
2. In `builder.py` we can wrap the value to be `NumpyNdarrayVariable` and save it as graph input.

-----

Starting from commit 6:

## A new design for supporting numpy in dynamo

In short the core concept doesn't change: we still convert `numpy` API calls to `torch_np` API calls. However, instead of wrapping a `torch_np.ndarray` in `NumpyNdarrayVariable`, the new design wraps a `torch.Tensor`.

The reason for doing this change is because we need to keep `torch.Tensor` everywhere in the captured graph, so that it works well with the backend of dynamo. See discussions in https://github.com/Quansight-Labs/numpy_pytorch_interop/issues/142 for details.

### Flow
This is an example showing how do we think about dynamo working on a simple function:
```python
import numpy as np
import torch

def f(x: torch.Tensor, y: torch.Tensor):
    a, b = x.numpy(), y.numpy()
    c = np.add(a, b)
    return torch.from_numpy(c)
```
```

              +------------+             +------------+
 torch.Tensor |            |numpy.ndarray|            |
 -------------- .numpy()   --------------|            |
              |            |             |            |             +------------------+
              +------------+             | numpy.add  |numpy.ndarray|                  |torch.Tensor
              +------------+             |            --------------| torch.from_numpy --------------
 torch.Tensor |            |numpy.ndarray|            |             |                  |
 -------------- .numpy()   --------------|            |             +------------------+
              |            |             |            |
              +------------+             +------------+

              +------------+             +----------------+
 torch.Tensor |            |torch.Tensor |                |
 -------------- .detach()  --------------|                |
              |            |             |                |                +----------------+            +------------+
              +------------+             |                |torch_np.ndarray|                |torch.Tensor|            |torch.Tensor
                                         | torch_np.add   -----------------| util.to_tensor -------------| .detach()  --------------
              +------------+             |                |                |                |            |            |
 torch.Tensor |            |torch.Tensor |                |                +----------------+            +------------+
 -------------- .detach()  --------------|                |
              |            |             |                |
              +------------+         |   +----------------+                                   |
                                     |                       wrapper on torch_np.add          |
                                     +--------------------------------------------------------+
```

### Approach

`torch_np` APIs can take both `torch_np.ndarray` as well as `torch.Tensor`. What  we need to do is to have a wrapper for these APIs to convert the return value back to `torch.Tensor`. This way only the wrapper is showing up in the captured graph, with `torch.Tensor`s as input and `torch.Tensor` as output.

If we have a graph break or we've traced to the end of the program, we need to inspect all the `NumpyNdarrayVariable` in the stack and convert them back to `numpy.ndarray`, to make sure the compiled version is still behaving the same as the eager version.
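
A minimal sketch of the conversion helpers this approach relies on; the names mirror the ones mentioned in the Changes section below (`to_numpy_helper`, a tensor-returning wrapper), but the bodies here are illustrative rather than the actual `torch/_dynamo/utils.py` code:

```python
import numpy as np
import torch


def to_tensor(value):
    # Convert a numpy.ndarray coming out of a wrapped call back to torch.Tensor.
    if isinstance(value, np.ndarray):
        return torch.from_numpy(value)
    return value


def to_numpy_helper(value):
    # Convert a torch.Tensor back to numpy.ndarray, e.g. when reconstructing
    # values on a graph break.
    if isinstance(value, torch.Tensor):
        return value.detach().cpu().numpy()
    return value


class numpy_to_tensor_wrapper:
    # Wraps a numpy/torch_np function so that only torch.Tensors appear in the
    # captured graph: call the wrapped function, then convert its output.
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, *args, **kwargs):
        out = self.fn(*args, **kwargs)
        if isinstance(out, (list, tuple)):
            return type(out)(to_tensor(o) for o in out)
        return to_tensor(out)
```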

### Examples
Here's an example of the graph generated:

```python
def fn(x: np.ndarray, y: np.ndarray):
    a = x.real
    b = y.real
    torch._dynamo.graph_break()
    return np.add(a, 1), np.add(b, 1)
```

Graph generated:

```
[2023-05-16 10:31:48,737] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
 __compiled_fn_0 <eval_with_key>.0 opcode         name            target                                                      args                    kwargs
-------------  --------------  ----------------------------------------------------------  ----------------------  --------
placeholder    l_x_            L_x_                                                        ()                      {}
placeholder    l_y_            L_y_                                                        ()                      {}
call_function  from_numpy      <built-in method from_numpy of type object at 0x12b1fdc80>  (l_x_,)                 {}
call_function  from_numpy_1    <built-in method from_numpy of type object at 0x12b1fdc80>  (l_y_,)                 {}
call_function  attr_wrapper    <function attr_wrapper at 0x12e8693a0>                      (from_numpy, 'real')    {}
call_function  attr_wrapper_1  <function attr_wrapper at 0x12e8693a0>                      (from_numpy_1, 'real')  {}
output         output          output                                                      ((),)                   {}

[2023-05-16 10:31:48,908] torch._dynamo.output_graph.__graph: [DEBUG] TRACED GRAPH
 __compiled_fn_2 <eval_with_key>.1 opcode         name           target                                                      args                             kwargs
-------------  -------------  ----------------------------------------------------------  -------------------------------  --------
placeholder    l_a_           L_a_                                                        ()                               {}
placeholder    l_b_           L_b_                                                        ()                               {}
call_function  from_numpy     <built-in method from_numpy of type object at 0x12b1fdc80>  (l_a_,)                          {}
call_function  from_numpy_1   <built-in method from_numpy of type object at 0x12b1fdc80>  (l_b_,)                          {}
call_function  wrapped_add    <Wrapped function <original add>>                           (from_numpy, 1)                  {}
call_function  wrapped_add_1  <Wrapped function <original add>>                           (from_numpy_1, 1)                {}
output         output         output                                                      ((wrapped_add, wrapped_add_1),)  {}

```
### Changes

* `codegen.py`: reconstruct `numpy.ndarray` from `NumpyNdarrayVariable` by adding bytecode to call `utils.to_numpy_helper()`.
*  `output_graph.py`: remove legacy code that does exactly what `codegen.py` now does; the old code only handled the return case but not the graph break case.
*  `utils.py`: added helpers to convert `numpy.ndarray` to `torch.Tensor` and vice versa. Also added a wrapper class that takes in a function; in `__call__` it calls the function and converts its output to `torch.Tensor` (or a list of them).
* `builder.py`: add a method to wrap `numpy.ndarray` graph inputs into `NumpyNdarrayVariable`, by calling `torch.from_numpy` in the proxy.
* `misc.py`: `numpy` API calls go into `NumpyVariable`; we find the function with the same name in the `torch_np` module, then wrap it with the wrapper defined in `utils.py`.
* `tensor.py`, `torch.py`: proxy `tensor.numpy()` as `torch.detach()` but wrap it with `NumpyNdarrayVariable`. Similarly, `torch.from_numpy()` -> `torch.detach()` but wrapped with `TensorVariable`. In `NumpyNdarrayVariable`, do the same `torch_np.ndarray` to `torch.Tensor` wrapping for attributes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100839
Approved by: https://github.com/ezyang
2023-06-03 00:54:25 +00:00
2491aa53a8 Make DataParallel generic (#102455)
Fixes #102441

Improves type hinting of the `module` attribute, since the type parameter can easily be bound in `DataParallel.__init__`.

```python
from torch.nn import DataParallel

class MyModule(Module):
    ...

my_data_parallel = DataParallel(MyModule(), device_ids=[0, 1, 2])
reveal_type(my_data_parallel)  # Type of "my_data_parallel" is "DataParallel[MyModule]"
reveal_type(my_data_parallel.module)  # Type of "my_data_parallel.module" is "MyModule"
```
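
A minimal sketch, not the actual `torch.nn` source, of the generic pattern that enables the inference above: the type variable is bound when the wrapped module is passed to `__init__`.

```python
from typing import Generic, TypeVar

from torch.nn import Module

T = TypeVar("T", bound=Module)


class DataParallelLike(Module, Generic[T]):
    def __init__(self, module: T) -> None:
        super().__init__()
        # Binding `module: T` here lets type checkers infer the concrete type,
        # e.g. DataParallelLike[MyModule].module is MyModule.
        self.module: T = module
```
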
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102455
Approved by: https://github.com/Skylion007
2023-06-03 00:33:01 +00:00
ed113332e5 [jit] Try to mitigate bad_weak_ptr error from type ptrs and print more error message. (#102822)
Test Plan: CI

Differential Revision: D46385190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102822
Approved by: https://github.com/Skylion007
2023-06-02 23:20:36 +00:00
af50efca24 add nested/sprase/quantized tensor key for privateuse1 (#102696)
Fixes #ISSUE_NUMBER
Add nested/sparse/quantized tensor keys for privateuse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102696
Approved by: https://github.com/bdhirsh
2023-06-02 22:35:52 +00:00
a1142053f0 [reland][quant][test] Fix broken PT2 import, add warnings (#102819)
Summary:
We are currently silently skipping all PT2 quantization
tests due to a recent typo. This commit fixes this and also adds
warnings so it'll be easier to debug similar issues in the future.

Test Plan: python test/test_quantization.py

Differential Revision: D46383546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102819
Approved by: https://github.com/jerryzh168
2023-06-02 22:35:30 +00:00
5d57a348cd Graph break on differentiable boolean mask setitem (#102843)
Fixes https://github.com/pytorch/pytorch/issues/102841

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102843
Approved by: https://github.com/voznesenskym
2023-06-02 22:34:52 +00:00
02dd1f38f2 [pytorch] CUDA kernel for torch.cat on contiguous tensors with wide loads (#102815)
This PR creates a CUDA kernel for `CatArrayBatchedCopy` that makes use of vectorized memory loads to maximize HBM bandwidth. It also simplifies the kernel code by removing the path that handles non-contiguous inputs. It gets called when the following conditions are met:

- tensors are contiguous
- input data types are 32-bit or 64-bit
- all the inputs are aligned to a 16-byte boundary

We tested on a larger set of problem sizes and there is a net gain for 32-bit types and a marginal gain for 64-bit types. Based on our analysis, 32-bit cats are by far the dominant kernel being called.
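
An illustrative Python-level restatement of those eligibility conditions (the real check and dispatch happen inside the C++ kernel; the helper name below is made up):

```python
import torch


def meets_wide_load_conditions(tensors):
    return all(
        t.is_contiguous()
        and t.element_size() in (4, 8)   # 32-bit or 64-bit dtypes
        and t.data_ptr() % 16 == 0       # aligned to a 16-byte boundary
        for t in tensors
    )
```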

Results:

<img width="1320" alt="Screenshot 2023-06-02 at 8 10 21 AM" src="https://github.com/pytorch/pytorch/assets/23515689/6f083f7c-2e1a-4513-a994-e0cb072d9b5d">

The SASS Code confirms using the wide loads for input tensors and the stores to global memory are unrolled to maximize oversubscription:

<img width="1648" alt="Screenshot 2023-06-02 at 8 16 29 AM" src="https://github.com/pytorch/pytorch/assets/23515689/10325ee6-d3a0-402a-af0d-29cd1a32813b">

Test Code:

```python
import sys

import torch

l_inputs = [
    ((1024,), 0, 2, 100),
    ((4096,), 0, 2, 100),
    ((16384,), 0, 4, 100),
    ((32000,), 0, 8, 100),
    ((128 * 1024,), 0, 2, 100),
    ((256 * 1024,), 0, 3, 100),
    ((1 * 1024 * 1024,), 0, 2, 100),
    ((4 * 1024 * 1024,), 0, 2, 100),
    ((16 * 1024 * 1024,), 0, 2, 100),
    ((32 * 1024 * 1024,), 0, 2, 100),
    ((128 * 1024 * 1024,), 0, 2, 50),
    ((64, 256), 0, 4, 100),
    ((400, 400), 0, 2, 100),
    ((640, 1080), 0, 2, 100),
    ((128, 4096), 1, 2, 100),
    ((512, 512), 1, 2, 100),
    ((699, 713), 1, 2, 100),
    ((1024, 1024), 1, 2, 100),
    ((2000, 1000), 1, 2, 100),
    ((4096, 4096), 1, 2, 100),
    ((16384, 16384), 1, 2, 50),
    ((384, 256, 16), 1, 2, 100),
    ((400, 200, 13), 1, 2, 100),
    ((128, 64, 256), 0, 2, 100),
    ((512, 256, 256), 1, 2, 100),
    ((512, 1024, 1024), 2, 2, 10),
    ((1024, 512, 1024), 2, 2, 10),
    ((1024, 1024, 512), 2, 2, 10),
    ((128, 64, 64, 32), 0, 2, 50),
    ((128, 64, 128, 16), 1, 2, 50),
    ((100, 45, 45, 32), 3, 2, 50),
    ((128, 32, 256, 32), 3, 2, 50),
]

prof_inputs = [
    ((1234567,), 0, 2, 5),
    ((16 * 1024 * 1024,), 0, 3, 5),
    ((1013, 1013), 0, 2, 5),
    ((1024, 1024), 1, 2, 5),
    ((69, 74, 128), 0, 2, 5),
    ((128, 128, 128), 2, 2, 5),
]

def generate_tensors(dim_tuple, cat_type, num_tensors):
    if cat_type in [torch.int8, torch.int32, torch.int64]:
        l_tensors = [
            torch.randint(
                high=torch.iinfo(cat_type).max,
                size=dim_tuple,
                dtype=cat_type,
                device="cuda",
            )
        ] * num_tensors
        return l_tensors
    else:
        l_tensors = [
            torch.randn(dim_tuple, dtype=cat_type, device="cuda")
        ] * num_tensors
        return l_tensors

def test_simple_cat(
    dim_tuple, cat_dim: int, num_tensors: int, iterations: int, cat_type
):
    torch.cuda.synchronize()

    # Allocate a tensor equal to L2 cache size on A100 GPUs
    l2_cache_flusher = torch.empty(
        int(80 * (1024**2)), dtype=torch.float, device="cuda"
    )

    # All the tensors in the list get read and written once
    total_MB = 2 * num_tensors
    for dim in dim_tuple:
        total_MB *= dim
    total_MB /= 1024 * 1024

    # Get the number of bits per element
    if cat_type in [torch.int8, torch.int32, torch.int64]:
        total_MB *= torch.iinfo(cat_type).bits / 8
    else:
        total_MB *= torch.finfo(cat_type).bits / 8

    l_tensors = generate_tensors(dim_tuple, cat_type, num_tensors)
    c = torch.cat(l_tensors, dim=cat_dim)
    torch.cuda.synchronize()

    # Measure correctness
    l_tensors_cpu = []
    for t in l_tensors:
        l_tensors_cpu.append(t.detach().to("cpu"))
    c_cpu = torch.cat(l_tensors_cpu, dim=cat_dim)
    c_cpu_dev = c.detach().to("cpu")

    if not torch.equal(c_cpu, c_cpu_dev):
        missmatches = torch.count_nonzero(torch.abs(c_cpu - c_cpu_dev))
        print("Error; num missmatches for {0} = {1}".format(dim_tuple, missmatches))
        return

    # Measure a few iterations
    l_ev_start = [torch.cuda.Event(enable_timing=True) for _ in range(iterations)]
    l_ev_stop = [torch.cuda.Event(enable_timing=True) for _ in range(iterations)]

    l_cat_times = []
    torch.cuda.synchronize()
    for i in range(iterations):
        l2_cache_flusher.zero_()
        torch.cuda._sleep(1_000_000)

        l_ev_start[i].record()
        c = torch.cat(l_tensors, dim=cat_dim)
        l_ev_stop[i].record()
    torch.cuda.synchronize()

    for i in range(iterations):
        t_cat = l_ev_start[i].elapsed_time(l_ev_stop[i]) / 1000
        l_cat_times.append(t_cat)

    min_cat_time = min(l_cat_times)

    # return bandwidth in GB/s
    estimated_bw_GBps = total_MB / min_cat_time / 1024
    return estimated_bw_GBps

def main(argv):
    if len(argv) > 0:
        if "profile" in str(argv[0]):
            for l_input in prof_inputs:
                gbps = test_simple_cat(
                    l_input[0], l_input[1], l_input[2], l_input[3], torch.float
                )
                print(
                    "Bandwidth (GB/s) for {0} fp32 | {1:.2f}".format(
                        (l_input[0], l_input[1]), gbps
                    )
                )
            return

    for l_input in l_inputs:
        gbps_int8 = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.int8
        )
        gbps_fp16 = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.float16
        )
        gbps_fp32 = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.float32
        )
        gbps_int32 = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.int32
        )
        gbps_fp64 = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.float64
        )
        gbps_long = test_simple_cat(
            l_input[0], l_input[1], l_input[2], l_input[3], torch.long
        )

        print(
            "Bandwidth (GB/s) for {0} int8;fp16;fp32;int32;fp64;long|{1:.2f}|{2:.2f}|{3:.2f}|{4:.2f}|{5:.2f}|{6:.2f}".format(
                (l_input[0], l_input[1]),
                gbps_int8,
                gbps_fp16,
                gbps_fp32,
                gbps_int32,
                gbps_fp64,
                gbps_long,
            )
        )

if __name__ == "__main__":
    main(sys.argv[1:])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102815
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-06-02 22:33:29 +00:00
896d997dd0 Remove incorrect THP{Cpp,}Function_traverse PyObject traversals (#102860)
Fixes https://github.com/pytorch/pytorch/issues/102174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102860
Approved by: https://github.com/albanD
2023-06-02 22:05:25 +00:00
9866408167 Multihooks should not keep tensor alive in closure (#102859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102859
Approved by: https://github.com/albanD
2023-06-02 22:05:25 +00:00
cyy
77f2883c41 [Reland2] fix missing-prototypes warnings in torch_cpu (Part 4) (#102228)
This PR relands the changes introduced in PR https://github.com/pytorch/pytorch/pull/100849. The old PR turned nnc_* functions into static ones. We now add declarations for them and hope that internal builds will pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102228
Approved by: https://github.com/albanD
2023-06-02 22:04:44 +00:00
86c7652503 [inductor] layout optimization for conv (#99773)
A convolution kernel with channels-last inputs runs much faster than one with contiguous inputs. This PR leverages that to optimize tensor layouts so we provide channels-last inputs to convolution. Some care needs to be taken not to convert tensor layouts between contiguous and channels last back and forth; those extra copies hurt performance quite a bit.
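
An illustrative snippet of the layout the optimization targets; this is only the eager-mode equivalent of what the pass arranges, not the inductor pass itself:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, 56, 56)
w = torch.randn(128, 64, 3, 3)

# Convert once to channels last so no contiguous <-> channels-last copies are
# inserted around the convolution.
x_cl = x.to(memory_format=torch.channels_last)
w_cl = w.to(memory_format=torch.channels_last)
y = F.conv2d(x_cl, w_cl)
print(y.is_contiguous(memory_format=torch.channels_last))  # expected: True
```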

Latest perf number [here](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2024%20May%202023%2023%3A40%3A37%20GMT&stopTime=Wed%2C%2031%20May%202023%2023%3A40%3A37%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=shunting-layout-opt-19&lCommit=baa797fc100688dfb044fbcbdebcfd2591710f78&rBranch=main&rCommit=999bae0f54108ffc5b7cf2524a02a83901554b16)
- TB: 1.64x -> 1.69x
- HF: 1.79x -> 1.78x (random noise)
- TIMM: 1.51x -> 1.65x

Right now we disable layout optimization for dynamic shapes since there is a perf loss in that combination. Here is a GH issue to follow up: https://github.com/pytorch/pytorch/issues/102670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99773
Approved by: https://github.com/jansel
2023-06-02 21:08:18 +00:00
4da88447ea Disable grouping by dtype and device if compiling (#102771)
Disable grouping if we are compiling; this happens during lowering.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102771
Approved by: https://github.com/janeyx99
2023-06-02 21:04:49 +00:00
cyy
a8c1967cee fix an asan warning of container overflow (#102735)
The last substr call in QualifiedName seems to have a container overflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102735
Approved by: https://github.com/Skylion007
2023-06-02 20:51:03 +00:00
a6a030a8eb [data_loader] Enable overriding signal handler in DataLoader.cpp (#101816)
Summary: Custom signal handlers (e.g. with more logging) can help in debugging crashes.

Test Plan: builds

Reviewed By: drej82

Differential Revision: D45934625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101816
Approved by: https://github.com/drej82
2023-06-02 20:07:53 +00:00
a7efa0ce35 Revert "Remove check from _prims_common, replace with torch._check* (#102219)"
This reverts commit fb79d43649d3755cdd8d87897fdcf12447530896.

Reverted https://github.com/pytorch/pytorch/pull/102219 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/5158949959/jobs/9293466925 ([comment](https://github.com/pytorch/pytorch/pull/102219#issuecomment-1574245414))
2023-06-02 20:00:48 +00:00
c36d235db0 Revert "implement __dir__ for dynamo (#102480)" (#102766)
This reverts commit b02f48b18152ddfcf5fcbefb68f6b66a6c44b37f.

If a user does this:

```
mod = torch.compile(mod)
mod.is_compiled = True
assert "is_compiled" in dir(mod)
```

it will fail after #102480.

Differential Revision: [D46368712](https://our.internmc.facebook.com/intern/diff/D46368712)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102766
Approved by: https://github.com/msaroufim
2023-06-02 19:40:44 +00:00
fc218a8a13 Fix typos in README of DTensor (#102813)
Fix typos in the README of DTensor. There is still a problem to be fixed: I got an error when I tried to use distribute_module with shard_params. The specific error message is in issue https://github.com/pytorch/pytorch/issues/102812.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102813
Approved by: https://github.com/wanchaol
2023-06-02 19:27:23 +00:00
659f947583 Try to use a bigger runner for android-emulator-build-test (#102855)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102855
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-06-02 19:22:28 +00:00
fb79d43649 Remove check from _prims_common, replace with torch._check* (#102219)
Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102219
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-06-02 19:13:45 +00:00
2296ee08fa [PT2][Quant][BE] Test refactor to be organize them better (#102704)
Collected most of the test modules under TestHelperModules. This allows reuse
of modules when possible. We could probably refactor a bit more, but some
QAT-related helper modules are left in their respective tests.

Differential Revision: [D46267687](https://our.internmc.facebook.com/intern/diff/D46267687/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102704
Approved by: https://github.com/andrewor14
2023-06-02 18:40:05 +00:00
9978850cc0 Update list of bots in upload_external_contrib_stats.py (#102786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102786
Approved by: https://github.com/PaliC
2023-06-02 18:34:22 +00:00
fdd6375a80 Revert "fix alert upload action (#102840)"
This reverts commit 7af47f139d06e365f0ef6bad0382c16c29a0e5bb.

Reverted https://github.com/pytorch/pytorch/pull/102840 on behalf of https://github.com/PaliC due to does not actually work e2e ([comment](https://github.com/pytorch/pytorch/pull/102840#issuecomment-1574137743))
2023-06-02 18:24:29 +00:00
624257890e Reenable hf_T5_generate (#102818)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102818
Approved by: https://github.com/albanD
2023-06-02 17:59:53 +00:00
a53acafd2b [PT2][Quant] Enable dynamic quantization (#102703)
Enable dynamic quantization of linear layers.

Differential Revision: [D46235070](https://our.internmc.facebook.com/intern/diff/D46235070/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102703
Approved by: https://github.com/andrewor14
2023-06-02 17:52:14 +00:00
7af47f139d fix alert upload action (#102840)
<!--
copilot:all
-->

Test run found at https://github.com/pytorch/pytorch/actions/runs/5156296463/jobs/9287113913

### <samp>🤖 Generated by Copilot at a08e8ec</samp>

### Summary
🛠️🚀🐛

<!--
1.  🛠️ for improving and fixing the workflow
2.  🚀 for speeding up the checkout step by fetching only the latest commit
3.  🐛 for correcting the syntax error in the run command
-->
Improve and fix the workflow for uploading alerts to the dashboard. Optimize the checkout step and fix the `run` command in `.github/workflows/upload_alerts.yml`.

> _To upload alerts to the dashboard_
> _The workflow needed a quick fix_
> _It fetched only one commit_
> _And ran the script with the right syntax_
> _Now it works as smooth as a fish_

### Walkthrough
* Fix syntax error in `run` command that invokes `create_alerts.py` script ([link](https://github.com/pytorch/pytorch/pull/102840/files?diff=unified&w=0#diff-946b3ad914f86182b35d4b6db415ddc39393c3017ef8fdaeee2b0e866ea831d6L23-R25))
* Add `with` option to `actions/checkout@v2` step to specify `fetch-depth: 1` and improve workflow performance ([link](https://github.com/pytorch/pytorch/pull/102840/files?diff=unified&w=0#diff-946b3ad914f86182b35d4b6db415ddc39393c3017ef8fdaeee2b0e866ea831d6R15-R16))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102840
Approved by: https://github.com/malfet
2023-06-02 17:49:00 +00:00
b740d3b014 Add comptime.breakpoint (#102758)
This sets a pdb breakpoint to fire whenever we *compile* this
Python code in Dynamo.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102758
Approved by: https://github.com/zou3519, https://github.com/voznesenskym
2023-06-02 17:44:16 +00:00
2301b624ae [PT2][Quant] Update quconfig to contain input/qoutput activation qspec (#102702)
As title

Differential Revision: [D46342823](https://our.internmc.facebook.com/intern/diff/D46342823/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102702
Approved by: https://github.com/andrewor14
2023-06-02 17:41:46 +00:00
6a24cfd74c Fix merge rules for XLA pin updates (#102844)
https://github.com/pytorch/pytorch/pull/102446 moved the job to a 12xlarge runner, but the merge rule still refers to it as 4xlarge, which results in merge timeouts; for example see https://github.com/pytorch/pytorch/actions/runs/5150076112/jobs/9273821855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102844
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt
2023-06-02 17:23:51 +00:00
6492b7d22e [PT2][Quant][BE] Refactor qnnpack_quantizer.py (#102701)
This diff refactors the annotate functions so as to couple them with the
corresponding quantization configs that they support. This will help with dynamic
quantization, which is only supported for linear layers.

Differential Revision: [D46235071](https://our.internmc.facebook.com/intern/diff/D46235071/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102701
Approved by: https://github.com/jerryzh168
2023-06-02 17:14:56 +00:00
c64aae4287 Move ROCm distributed jobs back to periodic (#102790)
Unstable jobs can now be handled by creating issues like https://github.com/pytorch/pytorch/issues/102789.  There is no need to manually move them to the unstable workflow anymore.

### Testing

ROCm distributed jobs show up as `unstable` https://hud.pytorch.org/pr/pytorch/pytorch/102790#5150329587

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102790
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-06-02 16:39:35 +00:00
8bbef821c3 Add some unit tests from cm3leon involving repeat_interleave (#102733)
These actually were fixed by https://github.com/pytorch/pytorch/pull/102570
but that PR doesn't test guard-freeness, so here you go.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102733
Approved by: https://github.com/zou3519
2023-06-02 15:35:35 +00:00
7c00d45312 Reenable cm3leon_generate (#102793)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102793
Approved by: https://github.com/albanD, https://github.com/awgu
2023-06-02 15:15:26 +00:00
09b5b73b90 [xla hash update] update the pinned xla hash (#101388)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101388
Approved by: https://github.com/pytorchbot, https://github.com/malfet
2023-06-02 14:55:14 +00:00
8a52b5440e Revert "upload alerts to rockset/aws through github workflow (#102646)"
This reverts commit ddd741f38520804db5559b08b31ef0742457ce0f.

Reverted https://github.com/pytorch/pytorch/pull/102646 on behalf of https://github.com/malfet due to It did not work, how was it tested, see ddd741f385 ([comment](https://github.com/pytorch/pytorch/pull/102646#issuecomment-1573862275))
2023-06-02 14:52:26 +00:00
b5840f99c3 torch.compiler public namespace (#102182)
# torch.compiler public API

## Goal

The goal of this document is to describe the public facing API for torchdynamo and torchinductor.

Today both dynamo and torchinductor live in the `torch/_dynamo` and `torch/_inductor` namespaces. The only public function is `torch.compile()`, which is placed directly in `torch/__init__.py`.

This poses a few problems for users trying to take dependencies on PyTorch 2.0
1. Unclear BC guarantees
2. No builtin discovery mechanism outside of reading the source code
3. No hard requirements for docstrings or type annotations

Most importantly, it mixes two personas, the PyTorch 2.0 developer vs. the PyTorch 2.0 customer, so this is an attempt to address that. We draw a lot of inspiration from the `functorch` migration to the `func` namespace.

## Alternate names

We did discuss some other alternative names

1. `torch.compile` -> problem is this would break BC on the existing `torch.compile` function
2. `torch.dynamo` -> `dynamo` is so far not something we've deliberately hidden from users, but the problem is that figuring out what is `_dynamo` vs `dynamo` might be confusing
3. `torch.compiler` -> option 1 would be better, but to keep BC this is a good compromise

# The general approach
## Proposal 1
In https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py

We have a function called `reset()`; this function is essential if users are trying to `torch.compile()` a model under different settings

```python
# in _dynamo/
def reset():
    do_reset_stuff()
```

Instead we propose

```python
# in compiler/
def reset():
    do_reset_stuff() # As in copy paste the logic from _dynamo.reset

# in _dynamo/
import warnings
import inspect

def reset():
    function_name = inspect.currentframe().f_code.co_name
    warnings.warn(f"{function_name} is deprecated, use compiler.{function_name} instead", DeprecationWarning)
    return compiler.reset()

```
## Proposal 2

```python
# in compiler/
def reset():
    """
    Docstrings here
    """
    _dynamo.reset()

# in _dynamo/
No changes
```
Consensus so far seems to be proposal 2, since fewer warnings will be less jarring and it'll make it quite easy to merge the public API.

## Docstrings

The above was an example of a function that has no inputs or outputs, but there are other functions that could use an improvement in their docstrings. For example, `allow_in_graph` actually works over lists of functions, but that's not mentioned anywhere except the source code.

```python
def allow_in_graph(fn):
    """
    Customize which functions TorchDynamo will include in the generated
    graph. Similar to `torch.fx.wrap()`.

    Parameters:
        fn (callable or list/tuple): The function(s) to be allowed in the graph.

    Returns:
        callable or list/tuple: The input function(s) included in the graph.

    Examples:
        Customize inclusion of a single function:
        ::
            torch._dynamo.allow_in_graph(my_custom_function)

        Customize inclusion of multiple functions:
        ::
            torch._dynamo.allow_in_graph([my_custom_function1, my_custom_function2])

        @torch._dynamo.optimize(...)
        def fn(a):
            x = torch.add(x, 1)
            x = my_custom_function(x)
            x = torch.add(x, 1)
            return x

        fn(...)

    Notes:
        The `allow_in_graph` function allows customization of which functions TorchDynamo
        includes in the generated graph. It can be used to include specific functions that
        are not automatically captured by TorchDynamo.

        If `fn` is a list or tuple, `allow_in_graph` will be called recursively on each
        element in the sequence.

        Once a function is allowed in the graph using `allow_in_graph`, it will be captured
        in the graph generated by TorchDynamo. This customization enables more fine-grained
        control over the functions included in the graph.

        Note that `allow_in_graph` expects the input `fn` to be a callable.

    """
    if isinstance(fn, (list, tuple)):
        return [allow_in_graph(x) for x in fn]
    assert callable(fn), "allow_in_graph expects a callable"
    allowed_functions._allowed_function_ids.add(id(fn))
    allowed_functions._disallowed_function_ids.remove(id(fn))
    return fn
```

So to make the API public, we’d have to write similar docstrings for all public functions we’d like to create.

The benefit of this approach is that
1. No BC risks: internal and external users relying on our tooling can slowly wean off the private functions.
2. We will also have to write correct docstrings which will automatically make our documentation easier to maintain and render correctly on pytorch.org
3. We already have some BC guarantees: we don’t kill OptimizedModule, and we rejected the PR to change the config system

The con of this approach is that we will be stuck with some potentially suboptimal functions/classes that we can’t kill.

## Testing strategy
If the approach is mostly to make a public function call an already-tested private function, then all we need to do is ensure that the function signatures don't change.
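
A minimal sketch of such a signature check, assuming the proposed `torch.compiler` namespace exists; the test name and structure are illustrative:

```python
import inspect

import torch._dynamo as _dynamo
import torch.compiler as compiler  # the proposed public namespace


def test_reset_signature_unchanged():
    # The public wrapper should expose exactly the same signature as the
    # already-tested private function it forwards to.
    assert inspect.signature(compiler.reset) == inspect.signature(_dynamo.reset)
```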

## Which functions should be in the public API

Our heuristic for deciding whether something should be public or not is: are users already relying on it for lack of other options, or have we recommended some non-public functions for users to debug their PT 2.0 programs?

The heuristic for not putting something in the public API is that it’s an experimental subsystem with the goal of turning it on by default, it’s very core-dev or Meta centric, it’s a bunch of different configs that should be batched into a single user-facing one, or it’s something that needs to be renamed because the name is confusing.

#### Top level
`torch.compile()` -> already a public API; it does require some minor improvements, like having configs be passed in to any backend and not just inductor (EDIT: this was already done, https://github.com/pytorch/pytorch/pull/99645) and renaming `mode=reduce-overhead` to `mode=cudagraph`

To make sure that PT 2.0 is supported with a given pytorch version, we can create a new public function, and this would replace the need for the `try/except` blocks around `import torch._dynamo` that have been populating user code.

```python
def pt2_enabled():
    if hasattr(torch, 'compile'):
        return True
    else:
        return False
```

For all of the below they will be translated to `torch.compiler.function_name()`

#### From _dynamo

As a starting point we looked at https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/__init__.py and we suggest redefining these functions in `pytorch/torch/compiler/__init__.py`

It might also make sense to split them over multiple files and import them in `__init__.py`, but because the number of functions is small it'd probably be fine to add them all into a single `compiler/__init__.py` until this list becomes larger.

1. `reset()`
2. `allow_in_graph()`
3. `list_backends()`
4. `compile()`: `torch.compile()` would be mostly a shell function passing arguments to `torch.compiler.compile()`
5. `assume_constant_result()`: TODO: double check how this is useful
6. `torch._dynamo.disable()`

Some notable omissions
1. `explain()`: We need to clean up the output for this function, make it a data class and pretty printable
2. `forbid_in_graph()`: Considered adding this but we should instead consolidate on `disallow_in_graph`
3. `optimize_assert()`: Already covered by `torch.compile(fullgraph=True)`
4. `check_if_dynamo_supported()`: this would be supplanted by `pt2_enabled()`
5. `compilation_metrics`, `graph_breaks_reasons`, ...: would all be accessed via `torch.compiler.explain()`
6. `replay`: does not seem useful to end customers
7. `graph_break()`: Mostly useful for debugging or unit tests
8. `register_backend()`: End users will just pass a string backend to `torch.compile()`; only devs will create new backends
9. `export()`: Eventually this needs to be public, but for now it’s not ready, so just highlighting that it will be in the public API eventually
10. `disallow_in_graph()`: Usage is limited
11. `mark_static()`: we can keep this private until dynamic=True is recommended in stable
12. `mark_dynamic()`: we can keep this private until dynamic=True is recommended in trunk
13. `OptimizedModule`: This is the only class that we'd expose, but it is crucial since users are running code like `if isinstance(mod, OptimizedModule): torch.save(mod._orig_mod)`. EDIT: because we fixed pickling we no longer need to expose this
14. `is_compiling()`: Still not clear how this is useful to end users

There are also config variables which we need to expose https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/config.py

Some of our configs are useful dev flags, others gate experimental functionality, and others are essential debugging tools; we separate out the essential debugging and logging tools into a public-facing config.

TODO: I still need to think of a good way of porting the config in a BC way; here are some ideas
1. Just make all passes available and controllable via `torch.compile(options={})` but only show docstrings for the ones users should care about.

The current problem with our config system is that we have three ways of setting options: via `options={}`, environment variables, and variables in `config.py`. It'd be worth settling on one source of truth and having that be the public API.

The configs we should make public are
1. `log_file_name`
2. `verbose`
3. `cache_size_limit`
4. `repro_level` and `repro_after`: Although we can rename these to minifier and give human readable names to the levels

Everything else should stay private in particular

1. `print_graph_breaks`, `print_specializations`: should be supplanted by `explain()` for public users
2. dynamic shape configs : Users should only have to worry about `torch.compile(dynamic=True/False)`
3. The distributed flags, hook or guard configs: If we tell a user to use FSDP and DDP then the flag should be enabled by default or be in a private namespace
4. The fbcode flags: Obviously no need to be user facing
5. Skip/Allow lists: Not something normal users should play around with

#### From _inductor
Very little of inductor should be exposed in a public-facing API. Our core audience, i.e. people writing models, mostly just needs information on what certain passes mean and how to control them at a high level, and they can do this with `torch.compile(options={})`. So the goal here should be more to make the available passes clearer and ideally consolidate them into `torch.compile()` docstrings or modes.

There are some exceptions though from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/__init__.py

1. `list_mode_options()`
2. `list_options()`: this needs an additional pass to hide internal or debug options

For both of these we’d rename them to `compiler.inductor_list_mode_options()` and `compiler.inductor_list_options()` since they would be in the same `__init__` file as the one for dynamo.

Notable omissions
1. `_inductor.compile()`: Because users are coming in with their own fx graph, they are likely developers
2. `_inductor.aot_compile()`: Again, this is about capturing and modifying fx graphs, so these APIs don't need to be public

However the configs are a slightly different story, because we can choose to either
1. Make all configs public
2. Make some configs public and keep most of the private ones. If public config is set it should override the private version
3. Make all configs controllable via `torch.compile(options={})` but make list_options() hide more things

For now, 3 seems like the most reasonable choice, with some high-level configs we’ll keep, like `TORCH_COMPILE_DEBUG`.

Regardless here's what should probably be public or advertised more
1. `disable_progress` and `verbose_progress`: Combine and enable by default
2. `fallback_random`: We could make the case this shouldn't be public if a top-level deterministic mode enables this
3. `profile_bandwidth`: Or we could make the case that this should be in `TORCH_COMPILE_DEBUG`

Notable omissions
1. Any config that would generally improve performance for most that we should probably enable by default but might be disabled in the short term because of stability: example `epilogue_fusion`, `pattern_matcher`, `reordering`
2. Autotuning flags: Should just sit behind `torch.compile(mode="max-autotune")` like `max_autotune`, `max_autotune_gemm`
3. `coordinate_descent_tuning`: This one I'm a bit mixed about; maybe it should also just fall into `mode="max-autotune"`
4. `trace`: `TORCH_COMPILE_DEBUG` is the best flag for all of this
5. `triton.cudagraphs`: Default should be `torch.compile(mode="reduce-overhead")` - I'd go further and rename the `mode=cudagraph` and we can keep reduce-overhead for BC reasons
6. `triton_unique_kernel_names`: Mostly useful for devs debugging
7. `dce`: which doesn't really do anything
8. `shape_padding`: Elias is working on enabling this by default in which case we also remove it

## Mechanics

This PR would include the public functions with their docstrings

Another PR will take a stab at the configs

And for work where the APIs are still being cleaned up whether its minifier or escape hatches, export or dynamic shapes, aot_inductor etc.. we’ll keep them private until a public commitment can be made

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102182
Approved by: https://github.com/jansel
2023-06-02 14:38:55 +00:00
b76af5f9a6 Fix broken link in Dynamo's guards doc (#102183) (#102185)
This PR fixes broken link for the code referenced in the guards doc.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102185
Approved by: https://github.com/mikaylagawarecki, https://github.com/ezyang
2023-06-02 14:36:28 +00:00
f22148f0ed aotautograd: fix mutation bug when input is noncontiguous (#102767)
Fixes https://github.com/pytorch/pytorch/issues/93363.

See the comment here for details: https://github.com/pytorch/pytorch/issues/93363#issuecomment-1572647261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102767
Approved by: https://github.com/ezyang
2023-06-02 14:31:06 +00:00
80f59cc61a Change some py_context_manager_DEPRECATED to py_context_manager (#102643)
I confirmed that there are no usages of these APIs on github code search
or internally. There may still be usages (hence the BC-breaking label),
but I expect none to very few.

There are some leftover py_context_manager_DEPRECATED that will likely
stay that way for a while because:
- they are used outside of the pytorch repo (`_AutoDispatchBelowAutograd`,
`_DisableTorchDispatch`, `_InferenceMode`)
- they are high risk (all of the torch_function / torch_dispatch related
stuff)
- PyTorch requires that the object behaves like a "Python RAII guard"
(`_DisableFuncTorch`, `_MultithreadingEnabled`)

This is probably the last PR in the context manager cleanup series.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102643
Approved by: https://github.com/bdhirsh
2023-06-02 14:29:04 +00:00
51e0f9e858 Add missing decompositons/lowerings for logical/bitwise operators (#102566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102566
Approved by: https://github.com/lezcano, https://github.com/alexsio27444, https://github.com/jgong5
2023-06-02 14:27:17 +00:00
3897c479af Add API to construct the functional variant of an op (#102293)
`register_functional_op`:
- constructs the functional variant of an op
- registers a functionalization kernel to the op
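
As a conceptual illustration of what a "functional variant" means here (the function names below are invented for illustration; `register_functional_op` constructs this automatically):

```python
import torch


def add_out_(out: torch.Tensor, x: torch.Tensor) -> None:
    # Mutating op: writes into `out` and returns nothing.
    out.add_(x)


def add_out_functional(out: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # Functional variant: leaves its inputs untouched and returns the
    # would-be-mutated value instead, which is what functionalization needs.
    result = out.clone()
    result.add_(x)
    return result
```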

To get this to work:
- `register_functional_op` makes assumptions that it checks about the
op's schema. In particular, the op is not allowed to return anything it
mutates. We can relax these constraints in the future.
- We add a "boxed" python functionalization kernel that handles this
case.

I'm not actually sure (or convinced) this should be public API or how
it should work. If we want this to be public, then it should probably be
a torch.library API, but does that also mean we should give the same
lifetime guarantees? If so, then it would be up to the user to construct
a Library object to actually register the functional variant onto.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102293
Approved by: https://github.com/bdhirsh
2023-06-02 13:36:50 +00:00
eaeea62ee4 Make TestPythonRegistration clean up after itself (#102292)
We did this for TestCustomOp, now we are applying the same thing to
TestPythonRegistration.

This PR:
- changes TestPythonRegistration to register new ops under a single
namespace (self.test_ns)
- clean up the namespace by deleting it from torch.ops after each test
is done running.

This avoids a problem where, if an op is re-defined, torch.ops.myns.op
crashes because we do some caching. The workaround in many of these
tests has been to just create an op with a different name, but this PR
makes it so that we don't need to do this.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102292
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-06-02 13:36:50 +00:00
72cdbf6a3f Fix spurious "missing return" error in irange.h (#102785)
Summary:
Fixes:
```
warning: missing return statement at end of non-void function
```
This warning is cluttering a lot of compilation logs!

Test Plan: Sandcastle

Differential Revision: D46374554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102785
Approved by: https://github.com/Skylion007
2023-06-02 09:23:29 +00:00
2e8ce910bb [Profiler][1/N] add profiler support for custom device. (#101554)
1. `torch.autograd.profiler` interface parameters changed (using `self.use_device` instead of `self.use_cuda` facilitates access by other devices; it will be integrated in subsequent PRs).
2. Modify `ProfilerEventStub` (aka `std::shared_ptr<CUevent_st>`) to `ProfilerVoidEventStub` (aka `std::shared_ptr<void>`) so that `ProfilerStubs` can be inherited by any `{device}Methods`.
In addition, `cuda_event_start_` is renamed to `device_event_start_`, so cuda and other devices can use this event pointer if needed.
3. Custom device support using legacy profiling (add `ProfilerState::KINETO_PRIVATEUSE1_FALLBACK` option).
4. Add `privateuse1Stubs` register.
(parse results and test cases are added in subsequent PRs)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101554
Approved by: https://github.com/aaronenyeshi
2023-06-02 09:19:19 +00:00
1204463bd0 inductor: fix bfloat16 reduction crash issue which store float value to bfloat16 (#102719)
For bfloat16 reduction, there was a wrong-store issue which stored a float value as bfloat16:

Before:

```

extern "C" void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L))
            {
                {
                    #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={{-std::numeric_limits<float>::infinity()}})
                    float tmp_acc0 = -std::numeric_limits<float>::infinity();
                    auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0);
                    for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L))
                    {
                        auto tmp0 = load_bf16_as_float(in_ptr0 + static_cast<long>(i0 + (16L*i1)));
                        auto tmp1 = (tmp0);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp1);
                    }
                    tmp_acc0_vec.store(out_ptr0 + static_cast<long>(i0));
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L))
                {
                    auto tmp0 = load_bf16_as_float(out_ptr0 + static_cast<long>(i0));
                    auto tmp1 = (tmp0);
                    tmp1.store(out_ptr1 + static_cast<long>(i0));
                }
            }
        }
    }
}
''')

```

After:

```
extern "C" void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr0,
                       float* out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L))
            {
                {
                    #pragma omp declare reduction(max:at::vec::Vectorized<float>:omp_out = at::vec::maximum(omp_out, omp_in)) initializer(omp_priv={{-std::numeric_limits<float>::infinity()}})
                    float tmp_acc0 = -std::numeric_limits<float>::infinity();
                    auto tmp_acc0_vec = at::vec::Vectorized<float>(tmp_acc0);
                    for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L))
                    {
                        auto tmp0 = load_bf16_as_float(in_ptr0 + static_cast<long>(i0 + (16L*i1)));
                        auto tmp1 = (tmp0);
                        tmp_acc0_vec = at::vec::maximum(tmp_acc0_vec, tmp1);
                    }
                    store_float_as_bf16(out_ptr0 + static_cast<long>(i0), tmp_acc0_vec);
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L); i0+=static_cast<long>(16L))
                {
                    auto tmp0 = load_bf16_as_float(out_ptr0 + static_cast<long>(i0));
                    auto tmp1 = (tmp0);
                    tmp1.store(out_ptr1 + static_cast<long>(i0));
                }
            }
        }
    }
}
''')

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102719
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-06-02 08:34:29 +00:00
c537acf46f Make 1D integer sorting work in parallel (#100081)
This patch reuses `radix_sort` from fbgemm and makes `torch.(arg)sort` work in parallel for tensors filled with integers.

In GNN workloads we often use `torch.(arg)sort`, for example to calculate the permutation from CSR to CSC storage format. Until now, sorting one-dimensional data was performed sequentially. Recently, the `radix_sort` implementation from FBGEMM was moved to common utilities and also enhanced to cover negative numbers ([pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)). This gives us an opportunity to reuse `radix_sort` to accelerate 1D integer sorting in PyTorch.
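
An illustrative example of the kind of call that now runs in parallel, e.g. computing a CSR-to-CSC permutation from column indices (sizes and values below are arbitrary):

```python
import torch

# Column indices of a sparse matrix stored in CSR order; sorting them yields
# the permutation needed to reorder the values into CSC order.
col_indices = torch.randint(0, 1_000, (2_048_000,), dtype=torch.int64)
perm = torch.argsort(col_indices)  # 1D integer (arg)sort now takes the parallel radix_sort path
```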

Benchmark results, measured on a single socket, 56C machine:
Before (int64):
```
size:   64000, average run time (from 100 runs):   6.592ms
size:  128000, average run time (from 100 runs):   9.798ms
size:  256000, average run time (from 100 runs):  19.199ms
size:  512000, average run time (from 100 runs):  36.394ms
size: 1024000, average run time (from 100 runs):  70.371ms
size: 2048000, average run time (from 100 runs): 137.752ms
size: 4096000, average run time (from 100 runs): 287.257ms
```

After(int64):
```
size:   64000, average run time (from 100 runs):  1.553ms
size:  128000, average run time (from 100 runs):  1.853ms
size:  256000, average run time (from 100 runs):  2.873ms
size:  512000, average run time (from 100 runs):  4.323ms
size: 1024000, average run time (from 100 runs):  7.184ms
size: 2048000, average run time (from 100 runs): 14.250ms
size: 4096000, average run time (from 100 runs): 29.374ms
```

Notes:
Average speedup from measured tensor sizes is 7.7x.
For smaller types (e.g. int32/int16), even higher speedup is observed, as fewer passes are required.

Depends on #100236.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100081
Approved by: https://github.com/mingfeima, https://github.com/ngimel
2023-06-02 07:41:28 +00:00
c75e064dd6 Disallow _foreach_utils.py, but allow it to be inlined (#102221)
This function should not be allowed, but should be inlineable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102221
Approved by: https://github.com/anijain2305
2023-06-02 05:14:09 +00:00
1ca2e993af [ONNX] Support aten::logit (#102377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102377
Approved by: https://github.com/BowenBao
2023-06-02 03:39:35 +00:00
683753fb0f upload external pr kpi for 10 days in the past (#102780)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 963044b</samp>

The pull request improves the reliability and completeness of the external contribution stats collection and upload. It adds a `time` delay to avoid API rate limit errors in `upload_external_contrib_stats.py`, and changes the order and date range of the commands in `nightly-rockset-uploads.yml`.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 963044b</samp>

> _Oh we are the coders of the open source sea_
> _And we pull and we push with the `git` command_
> _We upload the stats of the external PRs_
> _With a ten-day range and a `time` delay_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102780
Approved by: https://github.com/kit1980
2023-06-02 03:00:38 +00:00
ddd741f385 upload alerts to rockset/aws through github workflow (#102646)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 943f854</samp>

### Summary
:clock15:⬆️☁️

<!--
1.  :clock15: - This emoji represents the 15-minute interval of the cron schedule, and also suggests the idea of time-based triggers or events.
2.  ⬆️ - This emoji represents the upload action of the workflow, and also suggests the idea of moving data from one place to another.
3.  ☁️ - This emoji represents the AWS/Rockset destination of the alerts, and also suggests the idea of cloud-based services or platforms.
-->
Add a new workflow to upload alerts to a database. The workflow `.github/workflows/upload_alerts.yml` runs periodically on a cron schedule and uses AWS/Rockset as the backend.

> _`workflow` file added_
> _upload alerts to the cloud_
> _every quarter hour_

### Walkthrough
* Add a new workflow to upload alerts to AWS/Rockset every 15 minutes ([link](https://github.com/pytorch/pytorch/pull/102646/files?diff=unified&w=0#diff-946b3ad914f86182b35d4b6db415ddc39393c3017ef8fdaeee2b0e866ea831d6R1-R46))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102646
Approved by: https://github.com/huydhn
2023-06-02 02:24:19 +00:00
4d055ee5a1 RelaxUnspecConstraint some more (#102729)
One annoyance with mark_dynamic is if you use it on a user specified
tensor input (the idea being that you want to compile a function and
have it be polymorphic in size), you will get an error if the user
ever sends you a 0/1 size input, because of course we are probably
going to specialize it.  So I relax the constraint even more: even if we
find it's constant, if the value is 0/1, that's no big deal.

There's some irritating code duplication that I don't entirely know how
to resolve.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102729
Approved by: https://github.com/avikchaudhuri, https://github.com/voznesenskym
2023-06-02 02:11:01 +00:00
9fbfaaa57f [c10d] Add flag value for direct teardown without comm abort (#102599)
It was recently reported that `ncclCommAbort` itself may hang in some NCCL versions. For example, https://github.com/NVIDIA/nccl/issues/829.
In that case, it may be desirable to directly tear down the program without properly aborting the NCCL communicator, so that the user does not wait for hours before noticing a hang.
This PR adds a new value, 3, for the env var `NCCL_ASYNC_ERROR_HANDLING` that skips the comm abort and directly throws an error in case of an exception (timeout, async error, etc.).
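
A hedged usage sketch of the new value: it has to be set in the environment before the process group is initialized.

```python
import os

# Skip ncclCommAbort and throw directly on timeout/async error, per this PR.
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "3"
```
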
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102599
Approved by: https://github.com/fegin
2023-06-02 00:40:28 +00:00
5be1088ed6 [c10d] Bridge c10d and gloo stores. (#102641)
This relands #100633 with fixes for internal.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102641
Approved by: https://github.com/rohan-varma, https://github.com/fduwjj
2023-06-02 00:07:18 +00:00
4c9992d5ed Inductor cpp wrapper: cache the wrapper (#89743)
If the wrapper code has been built, directly load the .so file to avoid recompilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89743
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-02 00:02:39 +00:00
0b7320315a [CI] Move libtorch-debug CUDA build to CUDA-12.1 (#102756)
To avoid nvcc segfaults, compile without `--source-in-ptx` option on CUDA-12.1+

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 984e4b2</samp>

> _Sing, O Muse, of the daring deeds of PyTorch, the swift and fiery_
> _framework that harnesses the power of CUDA, the blazing tool of Nvidia._
> _How they faced a mighty challenge when CUDA, the ever-shifting,_
> _released a new version, twelve point one, that broke their code and caused them grief._

Fixes https://github.com/pytorch/pytorch/issues/102372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102756
Approved by: https://github.com/atalman
2023-06-01 23:11:07 +00:00
da963d793b Fix aten.copy device mismatch bug in FakeTensor (#102664)
Fixes `pytest ./generated/test_yizhou_wang_RODNet.py -k test_000` failure in https://github.com/pytorch/pytorch/issues/92670.

FakeTensor would raise an error upon trying to run `aten.copy` with inputs on different devices, although this is allowed behavior.

Also fix `aten.slice_scatter`, since it also takes args with different devices.
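
An illustrative repro of the eager behavior FakeTensor should mirror (requires a CUDA device; shapes are arbitrary):

```python
import torch

src = torch.randn(4)                 # CPU tensor
dst = torch.empty(4, device="cuda")  # tensor on a different device
dst.copy_(src)                       # allowed in eager; previously rejected under FakeTensor
```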

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102664
Approved by: https://github.com/yanboliang
2023-06-01 23:05:20 +00:00
c7873522c2 Add print statements to debug sharding error (#102713)
Sharding on ROCm is broken. I can't replicate it on dummy PRs even though it seems to happen pretty often on main, so I'm adding this to increase my sample size. Hopefully this is enough print statements...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102713
Approved by: https://github.com/huydhn
2023-06-01 22:38:28 +00:00
cf0aa38005 Allow ORT backend for DTensor (#101914)
fixes #101911

Currently, `DTensor` supports cuda and cpu. This PR makes some changes for easier integration with the ort backend.

* The `Backend.NAME` attribute now has value `name` instead of `NAME` for backends registered through `register_backend(name)`; this matches the pattern for backends with built-in support like nccl.
* remove unused `_check_for_nccl_backend` function
* add test case that moves parameters to device in the `partition_fn` - a scenario that's useful for big models
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101914
Approved by: https://github.com/wanchaol
2023-06-01 22:37:09 +00:00
72ed22e806 Revert "[Pytorch] Add Vulkan support for aten::unsqueeze, 1d->2d, 3d->4d (#102042)"
This reverts commit c9ae705a22d9b92e28d655ed3960d488aef04c0e.

Reverted https://github.com/pytorch/pytorch/pull/102042 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/102042#issuecomment-1572840752))
2023-06-01 21:58:32 +00:00
8b03a59e4d Revert "[quant][test] Fix broken PT2 import, add warnings (#102644)"
This reverts commit f18b9f86ba1343270d790d2b66e1903af1a7df5c.

Reverted https://github.com/pytorch/pytorch/pull/102644 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/102644#issuecomment-1572818537))
2023-06-01 21:36:27 +00:00
f15af19877 initialize max_stream_priorities in getStreamFromPool(bool) (#102739)
Summary:
The `getStreamFromPool(bool, signed char)` overload doesn't initialize `max_stream_priorities`, so if we call `getStreamFromPool(true)` we hit the following error:
```
terminate called after throwing an instance of 'c10::Error'
  what():  Expected cuda stream priority to be less than or equal to 0, got 1
```

Differential Revision: D46358087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102739
Approved by: https://github.com/ngimel
2023-06-01 21:05:56 +00:00
67792e175c Add -debug suffix to trunk libtorch builds (#102764)
Because that's what they are, according to
30558c2896/.ci/pytorch/build.sh (L307)

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 40cd88d</samp>

> _`libtorch` debug_
> _Build with symbols for Linux_
> _Winter of errors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102764
Approved by: https://github.com/atalman
2023-06-01 21:02:27 +00:00
401109a243 Use int64_t for indexing in multi_tensor_apply (#101760)
Fixes #101449

In the future, it would be better to either imitate the combo of `TensorIterator::can_use_32bit_indexing` and `TensorIterator::with_32bit_indexing`, or to choose the index type based on `Tensor::numel`.

---

Used `nsys nvprof` to casually see the effect of `int64_t` indexing:

```python
import torch

params = [
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
    {"params": [torch.randn(32, 32, device="cuda") for _ in range(100)]},
]
grads = [
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
    [torch.randn(32, 32, device="cuda") for _ in range(100)],
]
optimizer = torch.optim.Adam(params, fused=True)

for _ in range(100):
    for i, param_groups in enumerate(params):
        for p, g in zip(param_groups["params"], grads[i]):
            p.grad = g
        optimizer.step()
        optimizer.zero_grad()
```

Environment
```
Collecting environment information...
PyTorch version: 2.1.0a0+gitf994d0b
Is debug build: False
CUDA used to build PyTorch: 12.1

Python version: 3.10.9 (main, May 17 2023, 00:46:40) [GCC 11.3.0] (64-bit runtime)
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
```

---

- `multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensor` -> 1.02x
- `multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…` -> 1.04x

Current main branch:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     64.9          5787610        600    9646.0    9632.0      9503      9888         52.9  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
      8.1           720575        200    3602.9    3584.0      3551      4320         63.4  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```

this PR:

```
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     65.0          5876847        600    9794.7    9792.0      9632     10080         58.1  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::FusedOptimizerTensorLi…
...
      8.3           748313        200    3741.6    3744.0      3711      4479         60.0  void at::native::<unnamed>::multi_tensor_apply_kernel<at::native::<unnamed>::TensorListMetadata<(in…
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101760
Approved by: https://github.com/ngimel
2023-06-01 20:55:09 +00:00
b8e2e0e907 check users are actually received in upload to s3 (#102760)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 5927156</samp>

### Summary
🔁🧹📊

<!--
1.  🔁 - This emoji represents the retry logic that is added to the script, which loops until the command succeeds or reaches the maximum number of attempts.
2.  🧹 - This emoji represents the cleanup and simplification of the code, which removes clutter and makes it easier to understand and maintain.
3.  📊 - This emoji represents the data analysis and visualization that is enabled by uploading the external contribution stats to Rockset, which allows for exploring and sharing insights on the open source community.
-->
This pull request improves the `upload_external_contrib_stats.py` script and the `nightly-rockset-uploads.yml` workflow. It makes the script more efficient and robust, and increases the retry logic for the Rockset upload command.

> _Oh we are the coders of the open source sea_
> _And we upload stats to Rockset with glee_
> _But sometimes the network is slow or breaks down_
> _So we retry the command and we don't let it drown_

### Walkthrough
* Increase the number of retries for uploading external contribution stats to Rockset to avoid failures ([link](https://github.com/pytorch/pytorch/pull/102760/files?diff=unified&w=0#diff-a0d80a44a0694ddbddd6d8cf9484f5b850268a34117c8caf1fc071ad59895f9fL35-R35))
* Simplify the logic of uploading external contribution stats to Rockset by removing the loop and adding assertions and print statements ([link](https://github.com/pytorch/pytorch/pull/102760/files?diff=unified&w=0#diff-ac022823c08d71df6cc85aae7f2ca50a1ec71e5f9eb9371ac563c12cf52b750cL137-R146))
* Remove unused import of `read_from_s3` from `upload_external_contrib_stats.py` to clean up the code ([link](https://github.com/pytorch/pytorch/pull/102760/files?diff=unified&w=0#diff-ac022823c08d71df6cc85aae7f2ca50a1ec71e5f9eb9371ac563c12cf52b750cL11-R11))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102760
Approved by: https://github.com/kit1980
2023-06-01 20:53:03 +00:00
6340aa5d58 Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR [v2] (#102660)
Test was originally skipped in https://github.com/pytorch/pytorch/pull/98462

Not sure why it was removed in https://github.com/pytorch/pytorch/pull/94825

Now the test hits CUDA illegal memory access on H100 again after https://github.com/pytorch/pytorch/pull/101163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102660
Approved by: https://github.com/zou3519
2023-06-01 20:36:45 +00:00
ca470fc59f [BE] Make test_no_triton_on_import simple (#102674)
Do not try to parse raised exception for no good reason
Add short description
Reduce script to a single line

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ea4164e</samp>

> _`test_no_triton_on_import`_
> _Cleans up the code, adds docs_
> _No hidden errors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102674
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-06-01 20:31:18 +00:00
90b1b17c9f Fix string concatenation with non-string (#102728)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102728
Approved by: https://github.com/Skylion007
2023-06-01 20:02:03 +00:00
ca1c1fdc91 [C10D] Implement Store fallbacks for append, multi_get and multi_set. (#100768)
These fallbacks exposed some issue in quite a few spots in our bindings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100768
Approved by: https://github.com/fduwjj
2023-06-01 19:58:47 +00:00
59532bd6f1 [inductor] Fix a cpp wrapper codegen issue for _scaled_dot_product_efficient_attention (#102624)
Summary: This fixes a cpp_wrapper coverage drop on TIMM models as
shown in recent inference dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102624
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-06-01 19:52:37 +00:00
bd0a4e2d83 Serialize pytree to string v2 (#102708)
v2 of https://github.com/pytorch/pytorch/pull/102577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102708
Approved by: https://github.com/avikchaudhuri
2023-06-01 19:51:28 +00:00
fb0729054b Revert "[Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102565)"
This reverts commit 019c38624cdd079fbed04a561eebde45c6fa3b1f /
https://github.com/pytorch/pytorch/pull/102565 as it breaks
ExecutorchBuilds.
2023-06-01 12:35:23 -07:00
9d9ce19d12 [split cat fx passes] Normalize squeeze (#102294)
Summary: Sometimes, squeeze can be a "call_method" instead of a "call_function". Normalizing it will make it amenable to pattern matching by passes like "split->squeeze"
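As a hedged illustration (not code from this PR), torch.fx traces the two spellings of squeeze into different node kinds, which is why normalizing them helps downstream pattern matching:

```python
import torch
import torch.fx as fx

def method_style(x):
    return x.squeeze(1)          # traced as a call_method node

def function_style(x):
    return torch.squeeze(x, 1)   # traced as a call_function node

for f in (method_style, function_style):
    gm = fx.symbolic_trace(f)
    print([node.op for node in gm.graph.nodes])
# ['placeholder', 'call_method', 'output']
# ['placeholder', 'call_function', 'output']
```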

Test Plan: * CI tests

Differential Revision: D46031846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102294
Approved by: https://github.com/jansel
2023-06-01 19:05:20 +00:00
f18b9f86ba [quant][test] Fix broken PT2 import, add warnings (#102644)
Summary:
We are currently silently skipping all PT2 quantization
tests due to a recent typo. This commit fixes this and also adds
warnings so it'll be easier to debug similar issues in the future.

Test Plan: python test/test_quantization.py

Differential Revision: D46329480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102644
Approved by: https://github.com/jerryzh168
2023-06-01 19:02:36 +00:00
87c976b69d Remove deprecated HIP flags (#102271)
Removes the outdated HIP flags appended to HIP_CXX_FLAGS

This will help remove the following warnings in the PyTorch build log:

```
[6238/6889] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/aten/src/ATen/native/cudnn/hip/Conv_v8.cpp.o
cc1plus: warning: command line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
cc1plus: warning: unrecognized command line option ‘-Wno-unused-command-line-argument’
cc1plus: warning: unrecognized command line option ‘-Wno-exceptions’
cc1plus: warning: unrecognized command line option ‘-Wno-inconsistent-missing-override’
cc1plus: warning: unrecognized command line option ‘-Wno-macro-redefined’
```

This also updates the gloo submodule commit to include the similar change made to gloo.
597accfd79

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102271
Approved by: https://github.com/malfet
2023-06-01 18:58:48 +00:00
30558c2896 [functorch] Get test_functionalize to run on FB infra (#102695)
A few bits of weirdness needed to happen here:

- skipIfRocm doesn't work as a unittest class decorator; it returns a function,
  and the test discovery logic looks for things that inherit from TestCase.  So
  I wrapped the individual test methods instead.
- Inside fbcode, our test runner (buck + tpx) discovers and runs tests using
  two separate processes, so it's important to use @wraps on the generated
  class to make it "look like" a regular test.

Differential Revision: [D46344980](https://our.internmc.facebook.com/intern/diff/D46344980/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D46344980/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102695
Approved by: https://github.com/zou3519
2023-06-01 18:47:09 +00:00
08150ee020 Mark job as unstable dynamically (#102426)
Allow CI jobs to be marked as unstable dynamically.  This uses the same mechanism as disabling a job, but with a different issue title `UNSTABLE JOB_NAME`.

The action will output an `is-unstable` flag to let the CI know if the job it's currently running is unstable.  This is similar to the way the `keep-going` flag is exposed.  Once this is merged, I will follow up with another PR to actually use the `is-unstable` flag in CI.

### Testing

* https://github.com/pytorch/pytorch/issues/102297
  * `is-unstable` set https://github.com/pytorch/pytorch/actions/runs/5114544576/jobs/9194921978#step:9:172
  * Windows CPU jobs are named unstable https://github.com/pytorch/pytorch/actions/runs/5114544576/jobs/9195186715
* https://github.com/pytorch/pytorch/issues/102298
  * `is-unstable` set https://github.com/pytorch/pytorch/actions/runs/5114543738/jobs/9195036258#step:11:139
  * Dynamo jobs are named unstable https://github.com/pytorch/pytorch/actions/runs/5114543738/jobs/9195036258
* https://github.com/pytorch/pytorch/issues/102299
  * `is-unstable` set https://github.com/pytorch/pytorch/actions/runs/5114544576/jobs/9194922158#step:9:190
  * MacOS test jobs are named unstable https://github.com/pytorch/pytorch/actions/runs/5114544576/jobs/9195007882
* https://github.com/pytorch/pytorch/issues/102433
  * `is-unstable` set https://github.com/pytorch/pytorch/actions/runs/5114544572/jobs/9198630766#step:13:265
* https://github.com/pytorch/pytorch/issues/102425 (open temporarily during testing)
  * Disabling CI jobs still works correctly https://github.com/pytorch/pytorch/actions/runs/5114543738/jobs/9194904007 (backwards_compat)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102426
Approved by: https://github.com/ZainRizvi
2023-06-01 18:38:09 +00:00
ce8d31551b [quant][be] Change return type for zero_point to be int32 Tensor (#102234)
Summary: This is probably a typo

Test Plan: CI

Reviewed By: salilsdesai

Differential Revision: D46172706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102234
Approved by: https://github.com/salilsdesai
2023-06-01 18:30:44 +00:00
c9ae705a22 [Pytorch] Add Vulkan support for aten::unsqueeze, 1d->2d, 3d->4d (#102042)
Summary:
Add 1d->2d, 3d->4d unsqueeze

Unsqueeze operator: https://pytorch.org/docs/stable/generated/torch.unsqueeze.html#torch.unsqueeze

Test Plan:
Unsqueeze tests:
```
lfq@lfq-mbp xplat % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
Downloaded 0/44 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 38.6 sec (100%) 523/523 jobs, 8/523 updated
  Total time: 38.6 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 9 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 9 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim0
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim0 (76 ms)
[ RUN      ] VulkanAPITest.unsqueeze_1dto2d_dim1
[       OK ] VulkanAPITest.unsqueeze_1dto2d_dim1 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim0
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim0 (9 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim1
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_2dto3d_dim2
[       OK ] VulkanAPITest.unsqueeze_2dto3d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim0
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim0 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim1
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim1 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim2
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim2 (1 ms)
[ RUN      ] VulkanAPITest.unsqueeze_3dto4d_dim3
[       OK ] VulkanAPITest.unsqueeze_3dto4d_dim3 (1 ms)
[----------] 9 tests from VulkanAPITest (98 ms total)

[----------] Global test environment tear-down
[==========] 9 tests from 1 test suite ran. (98 ms total)
[  PASSED  ] 9 tests.
```

clang-format on the glsl files

Differential Revision: D46057585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102042
Approved by: https://github.com/SS-JIA
2023-06-01 18:15:04 +00:00
2f96981e5a [inductor] Reduce duplication of reduction combine functions (#99661)
Currently reduction bodies are duplicated in several different places.
This reduces duplication by taking the `combine_fn` definition used in
`_unroll_reduction_fn` and reusing it in the triton codegen. For cpp
this also makes better use of `reduction_combine{,_vec}` by using them
to generate the `omp declare reduction` line and the `vec_reduce_all`
call.

For triton the only change is that the combine step gets spread
over two lines, e.g. instead of:
```python
_tmp1 = tl.where(rmask & xmask, triton_helpers.maximum(_tmp1, tmp0), _tmp1)
```
we get
```python
tmp2 = triton_helpers.maximum(_tmp1, tmp0)
_tmp1 = tl.where(rmask & xmask, tmp2, _tmp1)
```

For cpp the only change is that inplace reduction operations are now written as
an out-of-place operation and an assignment, e.g. instead of
```cpp
omp_out += omp_in
```
we generate
```cpp
omp_out = omp_out + omp_in
```

This is a purely cosmetic change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99661
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-06-01 18:02:17 +00:00
d930bfc419 [quant][pt2e][be] Add QuantizationSpecBase (#102582)
Summary:
Make all quantization specs inherit from the same base class in order to simplify the typing
for QuantizationAnnotation

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```

Reviewed By: kimishpatel

Differential Revision: D46173954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102582
Approved by: https://github.com/andrewor14
2023-06-01 17:55:22 +00:00
685505353a Back out "Add PyObject preservation for UntypedStorage (#97470)" (#102553)
Summary:
Original commit changeset: c24708d18ccb

Original Phabricator Diff: D46159983

Test Plan: SL tests and CI

Differential Revision: D46284986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102553
Approved by: https://github.com/DanilBaibak
2023-06-01 17:23:43 +00:00
32360b48e8 [C10D] Rewrite TCPStore client send path to minimize amount of syscalls. (#100742)
Accumulate data in a local buffer prior to sending it. This reduces
the number of syscalls and network packets.

We flush every 1440 bytes to cap the amount of temporary memory.
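A rough Python sketch of the buffering idea (the actual change is in the C++ TCPStore client; the 1440-byte threshold is taken from the description above, and the class here is purely illustrative):

```python
import socket

class BufferedSender:
    """Illustration only: coalesce small writes and flush in ~1440-byte chunks."""
    FLUSH_THRESHOLD = 1440

    def __init__(self, sock: socket.socket):
        self.sock = sock
        self.buf = bytearray()

    def write(self, data: bytes) -> None:
        self.buf += data
        if len(self.buf) >= self.FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.sock.sendall(bytes(self.buf))  # one syscall instead of many small ones
            self.buf.clear()
```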
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100742
Approved by: https://github.com/fduwjj
2023-06-01 16:58:46 +00:00
9d77949b9e Revert "add foreach support for custom device (#102047)"
This reverts commit b088ff467794bc1125133fb0428749d5bcd6ae3a.

Reverted https://github.com/pytorch/pytorch/pull/102047 on behalf of https://github.com/malfet due to Broke inductor, see b088ff4677 ([comment](https://github.com/pytorch/pytorch/pull/102047#issuecomment-1572368942))
2023-06-01 16:33:03 +00:00
74f10b9ea5 Switch most Python RAII guard usages to context manager (#102642)
There are some I can't easily switch due to reasons like:
- Dynamo modelling the guard
- BC concerns (for torch.autograd.set_multithreading_enabled)

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102642
Approved by: https://github.com/albanD
2023-06-01 16:28:37 +00:00
dcf0c5fb6e Use safe_is_leaf to test leafness (#102706)
This fixes one of the problems in https://github.com/pytorch/pytorch/issues/101160#issuecomment-1570376548
but I don't have a test case because the full example is fairly
difficult to minify.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102706
Approved by: https://github.com/bdhirsh
2023-06-01 16:02:12 +00:00
d9c8f9a00d add storage dtype for custom device (#102481)
Fixes #ISSUE_NUMBER
1. add `isinstance` check with dtyped storage for custom device
2. add `storage.type()` support for custom device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102481
Approved by: https://github.com/albanD
2023-06-01 12:46:19 +00:00
e59db08699 inductor: eliminate meaningless copy (#102089)
This PR aims to eliminate meaningless load/store pairs in generated code. HF models on CPU are expected to gain a 2~4% E2E training performance improvement.

Taking the following case as an example, the generated kernel named cpp_fused_permute_1 does nothing but load and store in_out_ptr0.

Example code:
```
@torch._dynamo.optimize("inductor")
def fn(permute_6, view_10):
    permute_5 = torch.ops.aten.permute.default(view_10, [0, 2, 1, 3])
    clone_2 = torch.ops.aten.clone.default(permute_5, memory_format = torch.contiguous_format)
    view_11 = torch.ops.aten.view.default(clone_2, [1024, -1, 32])
    bmm = torch.ops.aten.bmm.default(view_11, permute_6)
    permute_339 = torch.ops.aten.permute.default(view_11, [0, 2, 1])
    return (bmm, permute_339)

permute_6 = rand_strided((1024, 32, 128), (4096, 1, 32), device='cpu', dtype=torch.float32)
view_10 = rand_strided((64, 128, 16, 32), (65536, 512, 32, 1), device='cpu', dtype=torch.float32)
out = fn(permute_6, view_10)
```

Output code (Before this pr):
```
aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

cpp_fused_clone_0 = async_compile.cpp('''
#include "/tmp/torchinductor_bzheng/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(80)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(32L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i3 + (32L*i1) + (512L*i2) + (65536L*i0)));
                            tmp0.store(out_ptr0 + static_cast<long>(i3 + (32L*i2) + (4096L*i1) + (65536L*i0)));
                        }
                    }
                }
            }
        }
    }
}
''')

cpp_fused_permute_1 = async_compile.cpp('''
#include "/tmp/torchinductor_bzheng/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(float* in_out_ptr0)
{
    #pragma omp parallel num_threads(80)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(4194304L); i0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + static_cast<long>(i0));
                tmp0.store(in_out_ptr0 + static_cast<long>(i0));
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf0 = empty_strided((64, 16, 128, 32), (65536, 4096, 32, 1), device='cpu', dtype=torch.float32)
    cpp_fused_clone_0(c_void_p(arg1_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg1_1
    buf1 = empty_strided((1024, 128, 128), (16384, 128, 1), device='cpu', dtype=torch.float32)
    extern_kernels.bmm(as_strided(buf0, (1024, 128, 32), (4096, 32, 1)), arg0_1, out=buf1)
    del arg0_1
    buf2 = as_strided(buf0, (1024, 32, 128), (4096, 1, 32)); del buf0  # reuse
    cpp_fused_permute_1(c_void_p(buf2.data_ptr()))
    return (buf1, buf2, )
```

Output code (After this pr):
```
aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

cpp_fused_clone_0 = async_compile.cpp('''
#include "/tmp/torchinductor_bzheng/gv/cgv6n5aotqjo5w4vknjibhengeycuattfto532hkxpozszcgxr3x.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(80)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0L); i0<static_cast<long>(64L); i0+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i1=static_cast<long>(0L); i1<static_cast<long>(16L); i1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long i2=static_cast<long>(0L); i2<static_cast<long>(128L); i2+=static_cast<long>(1L))
                    {
                        for(long i3=static_cast<long>(0L); i3<static_cast<long>(32L); i3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(i3 + (32L*i1) + (512L*i2) + (65536L*i0)));
                            tmp0.store(out_ptr0 + static_cast<long>(i3 + (32L*i2) + (4096L*i1) + (65536L*i0)));
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1 = args
    args.clear()
    buf0 = empty_strided((64, 16, 128, 32), (65536, 4096, 32, 1), device='cpu', dtype=torch.float32)
    cpp_fused_clone_0(c_void_p(arg1_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg1_1
    buf1 = empty_strided((1024, 128, 128), (16384, 128, 1), device='cpu', dtype=torch.float32)
    extern_kernels.bmm(as_strided(buf0, (1024, 128, 32), (4096, 32, 1)), arg0_1, out=buf1)
    del arg0_1
    return (buf1, as_strided(buf0, (1024, 32, 128), (4096, 1, 32)), )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102089
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-06-01 11:03:32 +00:00
ce9923a1cb [Quant][PT2E][Inductor] Lower quantized conv to Inductor (#101164)
**Summary**
Enable the lowering path for reference quantized conv after PT2E to Inductor.

The pattern `decomposed dequantize -> aten.convolution -> decomposed quantize` is fused to `quantized.functional.conv1d/2d/3d` and Inductor makes external calls to these ops.

This PR focuses on functionality only. The implementation is expected to have low performance.

Code example:
```Python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 6, 2, stride=2, padding=0, dilation=1)

    def forward(self, x):
        return nn.functional.gelu(self.conv(x))

m = M().eval()
example_inputs = (torch.randn(2, 3, 6, 6),)
exported_model, guards = torchdynamo.export(
    m,
    *copy.deepcopy(example_inputs),
    aten_graph=True,
    tracing_mode="real",
)

qconfig = get_default_qconfig("x86")
qconfig_mapping = QConfigMapping().set_global(qconfig)
backend_config_inductor = get_x86_inductor_pt2e_backend_config()
prepared_model = prepare_pt2e(
    exported_model,
    qconfig_mapping,
    example_inputs,
    backend_config_inductor
)
prepared_model(*example_inputs)
converted_model = convert_pt2e(prepared_model)
run = compile_fx(converted_model, example_inputs)
```
Output code by Inductor
```python
from ctypes import c_void_p, c_long
import torch
import math
import random
import os
import tempfile
from torch._inductor.hooks import run_intermediate_hooks
from torch._inductor.utils import maybe_profile

from torch import empty_strided, as_strided, device
from torch._inductor.codecache import AsyncCompile
from torch._inductor.select_algorithm import extern_kernels

aten = torch.ops.aten
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
async_compile = AsyncCompile()

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_weiwen/5d/c5dsrjrcd4jlzryilhxl5hdvcrzsoek52xzzqqy57hcoezvxxxwm.h"
extern "C" void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const long* in_ptr2,
                       unsigned char* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(2L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(3L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(36L); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i2 + (36L*i1) + (108L*i0))];
                    auto tmp1 = in_ptr1[static_cast<long>(0L)];
                    auto tmp7 = in_ptr2[static_cast<long>(0L)];
                    auto tmp2 = 1 / tmp1;
                    auto tmp3 = static_cast<float>(1.0);
                    auto tmp4 = decltype(tmp2)(tmp2 * tmp3);
                    auto tmp5 = decltype(tmp0)(tmp0 * tmp4);
                    auto tmp6 = std::nearbyint(tmp5);
                    auto tmp8 = static_cast<float>(tmp7);
                    auto tmp9 = tmp6 + tmp8;
                    auto tmp10 = static_cast<float>(0);
                    auto tmp11 = max_propagate_nan(tmp9, tmp10);
                    auto tmp12 = static_cast<float>(127);
                    auto tmp13 = min_propagate_nan(tmp11, tmp12);
                    auto tmp14 = static_cast<unsigned char>(tmp13);
                    out_ptr0[static_cast<long>(i1 + (3L*i2) + (108L*i0))] = tmp14;
                }
            }
        }
    }
}
''')

kernel_cpp_1 = async_compile.cpp('''
#include "/tmp/torchinductor_weiwen/5d/c5dsrjrcd4jlzryilhxl5hdvcrzsoek52xzzqqy57hcoezvxxxwm.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       const long* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(2L); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(6L); i1+=static_cast<long>(1L))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0L); i2<static_cast<long>(9L); i2+=static_cast<long>(1L))
                {
                    auto tmp0 = in_ptr0[static_cast<long>(i1 + (6L*i2) + (54L*i0))];
                    auto tmp2 = in_ptr1[static_cast<long>(0L)];
                    auto tmp5 = in_ptr2[static_cast<long>(0L)];
                    auto tmp1 = static_cast<float>(tmp0);
                    auto tmp3 = static_cast<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp6 = decltype(tmp4)(tmp4 * tmp5);
                    auto tmp7 = static_cast<float>(0.5);
                    auto tmp8 = decltype(tmp6)(tmp6 * tmp7);
                    auto tmp9 = static_cast<float>(0.7071067811865476);
                    auto tmp10 = decltype(tmp6)(tmp6 * tmp9);
                    auto tmp11 = std::erf(tmp10);
                    auto tmp12 = static_cast<float>(1);
                    auto tmp13 = tmp11 + tmp12;
                    auto tmp14 = decltype(tmp8)(tmp8 * tmp13);
                    out_ptr0[static_cast<long>(i2 + (9L*i1) + (54L*i0))] = tmp14;
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1 = args
    args.clear()
    buf0 = torch.ops.quantized_decomposed.quantize_per_channel.default(arg0_1, arg4_1, arg5_1, 0, -128, 127, torch.int8)
    del arg0_1
    buf1 = buf0
    assert_size_stride(buf1, (6, 3, 2, 2), (12, 4, 2, 1))
    del buf0
    buf2 = empty_strided((2, 3, 6, 6), (108, 1, 18, 3), device='cpu', dtype=torch.uint8)
    kernel_cpp_0(c_void_p(arg8_1.data_ptr()), c_void_p(arg2_1.data_ptr()), c_void_p(arg3_1.data_ptr()), c_void_p(buf2.data_ptr()))
    del arg8_1
    buf2 = torch._make_per_tensor_quantized_tensor(buf2, arg2_1, arg3_1)
    buf1 = torch._make_per_channel_quantized_tensor(buf1, arg4_1, arg5_1, 0)
    buf3 = torch.ao.nn.quantized.functional.conv2d(buf2, buf1, arg1_1, (2, 2), (0, 0), (1, 1), 1, 'zeros', arg6_1, arg7_1, torch.uint8)
    assert_size_stride(buf3, (2, 6, 3, 3), (54, 1, 18, 6))
    del arg1_1
    del arg2_1
    del arg3_1
    del arg4_1
    del arg5_1
    del buf1
    del buf2
    buf4 = empty_strided((2, 6, 3, 3), (54, 9, 3, 1), device='cpu', dtype=torch.float32)
    kernel_cpp_1(c_void_p(buf3.data_ptr()), c_void_p(arg7_1.data_ptr()), c_void_p(arg6_1.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg6_1
    del arg7_1
    return (buf4, )

def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided((6, 3, 2, 2), (12, 4, 2, 1), device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.float32)
    arg2_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
    arg3_1 = rand_strided((), (), device='cpu', dtype=torch.int64)
    arg4_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.float32)
    arg5_1 = rand_strided((6, ), (1, ), device='cpu', dtype=torch.int64)
    arg6_1 = rand_strided((), (), device='cpu', dtype=torch.float32)
    arg7_1 = rand_strided((), (), device='cpu', dtype=torch.int64)
    arg8_1 = rand_strided((2, 3, 6, 6), (108, 36, 6, 1), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call([arg0_1, arg1_1, arg2_1, arg3_1, arg4_1, arg5_1, arg6_1, arg7_1, arg8_1]), times=times, repeat=repeat)

if __name__ == "__main__":
    from torch._inductor.utils import compiled_module_main
    compiled_module_main('None', benchmark_compiled_module)
```

**Test plan**
python test/test_quantization.py TestQuantizePT2EFXX86Inductor.test_inductor_qconv_lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101164
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-06-01 10:22:02 +00:00
b088ff4677 add foreach support for custom device (#102047)
Fixes #ISSUE_NUMBER
For custom devices, we want to support foreach, so I added a function that lets us set another device type; the default value is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102047
Approved by: https://github.com/janeyx99
2023-06-01 06:22:44 +00:00
9fa82c90f7 [Dynamo] Correct UserDefinedObjectVariable.var_getattr on function/method type (#102580)
Fixes #102329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102580
Approved by: https://github.com/jansel
2023-06-01 05:04:13 +00:00
92923aca61 [TP] Use Stride inferred from local tensor in to_local bwd (#102630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102630
Approved by: https://github.com/wanchaol
2023-06-01 04:30:24 +00:00
7a569f86a0 [export] Cleanup constraints (#102666)
Redo of https://github.com/pytorch/pytorch/pull/102432 because I don't know how to push to that other branch...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102666
Approved by: https://github.com/zhxchen17
2023-06-01 04:22:31 +00:00
bebb8b7c1e [inductor] use native fetch_add function for trivial types (#101931)
Floating-point types are supported by std::atomic::fetch_add since C++20.
However, this code path is not activated yet because cpp_flags in codecache.py is hard-coded to "-std=c++17".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101931
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-06-01 03:47:56 +00:00
a548fab8a8 Add size info to collective logs (#100413)
The previous timeout log does not print size info, making it hard to debug hangs caused by message size mismatches.

(Reason is that when copying `WorkNCCL` object during work enqueue, we don't copy `outputs_` due to reference concern, hence `output.size()` is never triggered.)

This PR logs sizes using separate fields, hence not relying on `outputs_`.

New timeout log:
```
[Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=_ALLGATHER_BASE, NumelIn=209715200, NumelOut=1677721600, Timeout(ms)=10000) ran for 10957 milliseconds before timing out.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100413
Approved by: https://github.com/kumpera
2023-06-01 03:39:30 +00:00
c5d4ee2d73 [dtensor][simple] fix some comments (#102661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102661
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2023-06-01 03:23:19 +00:00
49cd184f89 inductor: improve the index range check for index_expr vec check (#102263)
Fix https://github.com/pytorch/pytorch/issues/102065.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102263
Approved by: https://github.com/lezcano, https://github.com/peterbell10, https://github.com/jgong5
2023-06-01 03:07:14 +00:00
49d0d1d79f Update XLA pin (#102446)
Updating the pin to the same hash as  https://github.com/pytorch/pytorch/pull/100922

On the XLA side, the build has switched from CMake to bazel, which requires a number of changes on the PyTorch side:
 - Copy installed headers back to the `torch/` folder before starting the build
 - Install `torch/csrc/lazy/python/python_utils.h`
 - Define `LD_LIBRARY_PATH`

TODO:
 - Enable bazel caching
 - Pass CXX11_ABI flag to  `//test/cpp:all`  to reuse build artifacts from  `//:_XLAC.so`

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at cd4768b</samp>

> _To fix the XLA tests that were failing_
> _We updated the submodule and scaling_
> _We added `python_util.h`_
> _And copied `torch` as well_
> _And set `LD_LIBRARY_PATH` for linking_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102446
Approved by: https://github.com/huydhn
2023-06-01 02:04:07 +00:00
b9294c7ca2 Allow more inserts before reIndexTopology (#102312)
Summary:
Currently if you are inserting into JIT IR at the same point in the middle of the graph,
it only allows for 40 inserts before it has to reindex. Reindexing is N**2 behavior, which can
lead to slow load times. This changes it so that it keeps track of how many insertions happen
at single point (like when a function is being inlined) to predict how many future insertions will happen
there. It then adjusts how it assigns topology to make sure there is enough room for those predicted insertions.
In practice this will allow around 2M inserts at a single point before it reindexes.
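A toy model of why a fixed gap runs out quickly (illustration only, not the actual C++ logic):

```python
def inserts_before_reindex(gap: int) -> int:
    """Repeatedly insert right after the same node; each insert bisects the
    remaining topological gap, so roughly log2(gap) inserts fit before a
    full reindex is needed."""
    lo, hi, count = 0, gap, 0
    while True:
        mid = (lo + hi) // 2
        if mid == lo or mid == hi:
            return count  # no room left -> reindex
        count += 1
        lo = mid  # the next insert goes right after the node we just inserted

print(inserts_before_reindex(2 ** 40))  # ~40, matching the figure quoted above
```

Reserving a larger gap at a hot insertion point, sized by how many inserts have already happened there, is what pushes this limit into the millions.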

Test Plan: test_jit.py

Differential Revision: [D46206617](https://our.internmc.facebook.com/intern/diff/D46206617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102312
Approved by: https://github.com/eellison
2023-06-01 01:17:55 +00:00
6b8e68ce7e [pytorch-vulkan] aten::uniform (#102431)
Summary:
aten::uniform implementation.

The randomization function doesn't use Perlin noise, as its output distribution is not uniform.

PCG (https://www.reedbeta.com/blog/hash-functions-for-gpu-rendering/) was chosen instead.
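For reference, the widely cited PCG hash from that blog post, transcribed to Python with explicit 32-bit masking (illustration only; the shader code in this PR may differ in detail):

```python
MASK32 = 0xFFFFFFFF

def pcg_hash(x: int) -> int:
    """32-bit PCG-style hash (see the linked blog post)."""
    state = (x * 747796405 + 2891336453) & MASK32
    word = (((state >> ((state >> 28) + 4)) ^ state) * 277803737) & MASK32
    return ((word >> 22) ^ word) & MASK32

def uniform01(x: int) -> float:
    """Map the hash to a float in [0, 1), as a shader typically would."""
    return pcg_hash(x) / 2.0 ** 32

print([round(uniform01(i), 4) for i in range(4)])
```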

Test Plan:
```
yipjustin@yipjustin-mac fbsource % buck run  -c pt.vulkan_full_precision=1  --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*uniform*"
Downloaded 0/47 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 40.0 sec (100%) 524/524 jobs, 10/524 updated
  Total time: 40.0 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *uniform*
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from VulkanAPITest
[ RUN      ] VulkanAPITest.uniform
[       OK ] VulkanAPITest.uniform (54 ms)
[----------] 1 test from VulkanAPITest (54 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (54 ms total)
[  PASSED  ] 1 test.
```

Differential Revision: D46170098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102431
Approved by: https://github.com/SS-JIA
2023-06-01 01:10:50 +00:00
20ca994a3e Use size in python list (#102538)
Resubmission of #101922

Description copied verbatim
Potentially fixes the second issue described in https://github.com/pytorch/pytorch/issues/87159.

In python_list.h, int64_t is used when diff_type is better suited. On 32 bit systems, int64_t isn't a proper signed size type, which may cause the compilation error described in https://github.com/pytorch/pytorch/issues/87159.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102538
Approved by: https://github.com/albanD
2023-06-01 00:46:29 +00:00
0d2e7a1888 support ConvBinaryInplace in Inductor cpp wrapper (#101394)
This PR has changed the OP schema since `at::Tensor&` should be the FirstArg:
87f9160b67/aten/src/ATen/core/boxing/impl/boxing.h (L305-L341)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101394
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-06-01 00:22:29 +00:00
cdfba6fca7 Add ngimel to Core Reviewers (#102668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102668
Approved by: https://github.com/ezyang
2023-06-01 00:21:10 +00:00
c84f246c83 Improve time savings calculation math for test reordering (#102411)
Use a more accurate method that accounts for tests being run in parallel

Right now we still log results to the console, but later it'll get logged to Rockset for better tracking
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102411
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-31 23:51:27 +00:00
693114c0a2 Adds script to generate alerts for failing jobs (#102002)
Copies over bits of the script from test-infra to grab the relevant parts of an alert and turn them into JSON. Generally copied over from check_alerts in pytorch/test-infra.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 1789c36</samp>

> _`Python 3` shebang_
> _added for compatibility_
> _a good practice / spring_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102002
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2023-05-31 23:20:31 +00:00
398a5f4d4a Clean up mypy (#102555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102555
Approved by: https://github.com/Skylion007
2023-05-31 23:16:49 +00:00
8d7e082300 [c10d] Add is_backend_available for c10d backend. (#101945)
Add is_backend_available for c10d backends, covering both the built-in backends and third-party backends registered through the function ``Backend.register_backend``.

There is a related discussion in https://github.com/pytorch/pytorch/pull/101775#discussion_r1199253553
> For example in python constructor for their backend they should explicitly add the is_X_available. Or if defining in C++ they should modify pybind like this https://github.com/H-Huang/torch_collective_extension/blob/main/custom_backend/include/dummy.hpp#L98-L101
to also add their own is_available property

It is a natural choice for users to add their own `is_available` when they create a backend. We think it might be possible for the user to use `is_X_available` in the same way as the native ones, for example by dynamically adding a `torch.distributed.is_dummy_available()` function.  This is why we want to dynamically add `is_X_available` to `torch.distributed` in `register_backend`.

> Or we could add an Is_available(backend) function, that checks for the backend.

Providing a public function is indeed another good approach. We have implemented an `is_backend_available` in https://github.com/pytorch/pytorch/pull/101945  that supports both built-in backends and third-party backends.
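A minimal usage sketch, assuming the function is exposed as `torch.distributed.is_backend_available` as this PR describes ("dummy" below is a hypothetical backend name):

```python
import torch.distributed as dist

# Built-in backends
print(dist.is_backend_available("gloo"))   # True on a typical build
print(dist.is_backend_available("nccl"))   # depends on how PyTorch was built

# A third-party backend name only becomes available after it has been
# registered via Backend.register_backend(...)
print(dist.is_backend_available("dummy"))  # False unless such a backend was registered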

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101945
Approved by: https://github.com/H-Huang
2023-05-31 22:51:51 +00:00
e03800a93a Add torch._utils.render_call, improve printoptions (#102623)
- Add get_printoptions and printoptions context manager
- Improve edgeitems handling when it is zero
- Add render_call which can be used to conveniently print command
  line arguments of a function call, while suppressing actual
  tensor data
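
A hedged sketch of how the helpers listed above might be used; the exact namespaces and signatures are assumptions for illustration, not confirmed API:

```python
import torch

torch.set_printoptions(precision=2)

# Assumed getter mirroring set_printoptions (name taken from this PR's title).
opts = torch.get_printoptions()

# Assumed context manager for temporarily overriding print options.
with torch.printoptions(edgeitems=0):
    print(torch.arange(10000.0))  # large tensor printed with its data summarized away
```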

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102623
Approved by: https://github.com/albanD
2023-05-31 22:08:04 +00:00
cba4004983 Run libtorch in 2 shards (manual sharding) (#102554)
This is a quick way to mitigate the libtorch timeout issue on the 2nd shard when running with memory leak check, for example https://github.com/pytorch/pytorch/actions/runs/5119293905/jobs/9204880456

### Testing

* Slow gradcheck https://github.com/pytorch/pytorch/actions/runs/5128253177
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu)`: `3h40` → `3h20`?
  * `slow / linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu)`: `4h30` → `3h50`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 1, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h35` → `3h20`
  * `linux-bionic-cuda12.1-py3-gcc7-slow-gradcheck / test (default, 2, 4, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `4h20` → `4h`
* Linux GPU https://github.com/pytorch/pytorch/actions/runs/5128252752
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu)`: `1h40` → `1h40`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)`: `2h10` → `1h35`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 1, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `2h30` → `2h50`
  * `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu, mem_leak_check)`: `3h20` → `2h50`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102554
Approved by: https://github.com/clee2000
2023-05-31 22:03:17 +00:00
d9f75dded1 [export] Add aot_export 1/N (#101490)
This PR adds aot_export_module as the lowering path from a torch-level graph to an aten graph. Some known limitations that need to be addressed in follow-up PRs:
1. Store param/buffer data in ExportedProgram
2. Fully support torch.cond with params/buffers
3. Making sure no duplicated ExportMetaData entry
4. This API will break Executorch if used on PyE; we will figure out a plan internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101490
Approved by: https://github.com/avikchaudhuri
2023-05-31 20:56:21 +00:00
04c1c2b791 Try to build the Docker image if it doesn't exist (#102562)
There is a bug in the test workflow where it could fail to find the new Docker image when the image hasn't yet become available on ECR, for example e71ab21422.  This is basically a race condition where the test job starts before the docker-build workflow can finish successfully.  The fix here is to make sure that the test job has the opportunity to build the image if it doesn't exist, same as what the build workflow does atm.  Once the docker-build workflow finishes pushing the new image to ECR, that can then be used instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102562
Approved by: https://github.com/PaliC
2023-05-31 20:50:27 +00:00
9a2df0a5af [RFC] Add method to DDP to check for backward finalization. (#100773)
Summary: In cases where DDP backward is not finalized, the error is raised only in the next forward iteration of DDP. However, if there are other collective calls between those two points, training scripts could potentially get stuck.

As a result, there should be a way to check if DDP finalized after calling `.backward()`. To address this, I've added a `_check_reducer_finalized` method to validate that DDP indeed did successfully finish reduction.
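A hedged usage sketch (the method name is taken from the description above; process-group initialization and device placement are assumptions for illustration):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = DDP(nn.Linear(8, 8).cuda())
out = model(torch.randn(4, 8).cuda())
out.sum().backward()

# New private check from this PR: raises if the reducer did not finalize,
# instead of silently hanging on a later collective.
model._check_reducer_finalized()
```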

Test Plan: Added unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100773
Approved by: https://github.com/rohan-varma
2023-05-31 20:43:06 +00:00
fc31b3a106 Allow existing "Python RAII guards" to be used as context managers (#102579)
This PR adds a `py_context_manager_DEPRECATED` that converts a C++ RAII
guard to an object that may be used either as a Python context manager or
as a "Python RAII guard".

We don't convert all of them to Python context managers purely for BC
reasons; people in OSS and internally actually rely on these APIs and I
don't want to break them. We would be justified in breaking BC if we wanted
to, but it seemed like too much work for not a lot of gain.

The API is postfixed with "DEPRECATED" to indicate that people should
really use `py_context_manager` (converts C++ RAII guard to Python
context manager) instead.
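
A toy Python illustration (not the actual binding code) of the dual-use object this helper produces:

```python
class _GuardOrContextManager:
    """Toy stand-in for a C++ RAII guard exposed to Python (illustration only)."""

    def __init__(self):
        self._active = True
        print("guard enabled")

    def _release(self):
        if self._active:
            self._active = False
            print("guard disabled")

    def __del__(self):
        self._release()          # RAII style: released when the object dies

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._release()          # context-manager style: released at block exit

# Old "Python RAII guard" usage
guard = _GuardOrContextManager()
del guard

# Preferred context-manager usage
with _GuardOrContextManager():
    pass
```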

Test Plan:
- this PR converts all PyTorch usages of _AutoDispatchBelowAutograd to
context manager. I can do the rest in follow-ups.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102579
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-05-31 19:55:38 +00:00
65631d4515 [benchmarks] Use train mode for accuracy checks for HF models (#102578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102578
Approved by: https://github.com/desertfire
2023-05-31 19:47:18 +00:00
213e10dc3d fix bug in trace model when out-operator has more than one output (#101563)
Fixes https://github.com/pytorch/pytorch/issues/101960
When I trace a function that runs an out-operator with more than one output, I get the error below. This is because the case where the out operator has more than one output is not handled.
```
def test_trace_out_operator_with_two_output():
    example_input = torch.rand(2, 8)
    out_1, out_2 = torch.cummax(example_input, 1)

    def run_cummax(example_input, out_1, out_2):
        output_1, output_2 = torch.cummax(example_input, 1, out=(out_1, out_2))
        return output_1, output_2

    trace_model = torch.jit.trace(run_cummax, (example_input, out_1, out_2))
```

and the error info:

```
    raise TracingCheckError(
torch.jit._trace.TracingCheckError: Tracing failed sanity checks!
encountered an exception while running the trace with test inputs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101563
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/davidberard98
2023-05-31 19:39:52 +00:00
17166c2511 python_arg_parser to allow fake tensor element in symint_list when in dynamo mode #95424 (#97508)
Failing mechanism on #95424:
In dynamo mode, a numpy.int_ passed to a 'shape'-like param (Sequence[Union[int, symint]]) is wrapped as a list containing a FakeTensor.  However, in python_arg_parser, the parser expects an int in the symint_list but gets a FakeTensor.

Following #85759, this PR allows a tensor element in symint_list when in dynamo mode.

This PR also fixes the below tests, which have a similar failing mechanism:
pytest ./generated/test_huggingface_diffusers.py -k test_016
pytest ./generated/test_ustcml_RecStudio.py -k test_036

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97508
Approved by: https://github.com/yanboliang
2023-05-31 19:19:17 +00:00
cyy
3ae42cb7db adjust header inclusions in C10 as suggested by IWYU (#102467)
This PR aims to reduce unused header inclusions in C10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102467
Approved by: https://github.com/albanD
2023-05-31 19:19:10 +00:00
0ecca122e7 [Replicate] Add unit test with replicate param names (#102401)
This attribute wasn't actually used in tests; this adds a test ensuring that
if replicate is used on top of FSDP, the replicated parameter names are as
expected.

TODO: there are a few ways to check if module is managed by composable API,
such as replicated param names for replicate, _get_module_state API,
_get_registry_api, etc. We should unify all composable APIs to check in a
unified way (filed an issue)

Differential Revision: [D46236377](https://our.internmc.facebook.com/intern/diff/D46236377/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102401
Approved by: https://github.com/awgu
2023-05-31 18:41:03 +00:00
9331b7fa05 Run slow gradcheck on the newer G5 runner (#102496)
As slow gradcheck is slow (Thank you, Captain Obvious!), let's run it on the newer G5 runner to improve its TTS and avoid flaky timeout errors such as https://github.com/pytorch/pytorch/actions/runs/5112059782/jobs/9190167924.  AFAIK, there is no reason to keep running slow gradcheck on `linux.4xlarge.nvidia.gpu`

### Testing
* `1st` shard: `3h30m` → `4h`, The increase is probably due to https://github.com/pytorch/pytorch/pull/102380 in which the job's name switch from `gcc7` to `gcc9`.  Does this invalidate the test time used to balance these shards?
* `2nd` shard: `4h35m` → `4h15m`
* `3rd` shard: `3h20m` → `1h20m`
* `4th` shard: `3h20m` → `2h10m`
* `14h45m` → `11h45m`, a total saving of `3h`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102496
Approved by: https://github.com/malfet
2023-05-31 17:28:03 +00:00
c27cefccd3 Faketensor hpu device normalization (#102512)
FakeTensor doesn't normalize the device index and fails with the test case below:

```python
import torch
import habana_frameworks.torch.hpu
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode.push():
    a = torch.empty(1, device="hpu")
    b = torch.empty(1, device="hpu:0")
    result = a + b
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102512
Approved by: https://github.com/albanD
2023-05-31 17:06:44 +00:00
eaffd98880 Enable hipSOLVER in ROCm builds (#97370)
Enables the hipSolver backend for ROCm builds
--------------------------------------------------------------------------

- Minimum ROCm version requirement - 5.3
- Introduces new macro USE_LINALG_SOLVER the controls enablement of both cuSOLVER and hipSOLVER
- Adds hipSOLVER API to hipification process
- combines hipSOLVER and hipSPARSE mappings into single SPECIAL map that takes priority among normal mappings
- Torch APIs to be moved to the hipSOLVER backend (as opposed to magma) include: torch.svd(), torch.geqrf(), torch.orgqr(), torch.ormqr()
- Will enable 100+ linalg unit tests for ROCm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97370
Approved by: https://github.com/malfet
2023-05-31 16:53:23 +00:00
46a925795e S390x clang fixes for SIMD (#100874)
S390x clang fixes for SIMD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100874
Approved by: https://github.com/jgong5
2023-05-31 16:38:19 +00:00
cyy
850b37cc3b merge identical branches in cpu_index_kernel (#102601)
A simple simplification when reviewing code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102601
Approved by: https://github.com/jgong5
2023-05-31 16:22:49 +00:00
f47ee87765 Fix ignored_states when they are passed as generators (#102575)
This PR fixes the case where ignored_states is passed as a generator rather than a List/Set.
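A sketch of the pattern this fixes (assumes an initialized process group and a GPU, which are not part of the original description; `ignored_states` is passed as a generator instead of a list or set):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8)).cuda()
ignored = (p for p in model[0].parameters())       # a generator, not a List/Set
fsdp_model = FSDP(model, ignored_states=ignored)   # previously mishandled generators
```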

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102575
Approved by: https://github.com/awgu
2023-05-31 15:58:55 +00:00
9f97b7c43b Add integer overflow checks for large compressed tensor dimensions and nnz (#102530)
With the previous PR allowing large compressed tensors (dimensions larger than `2 ** 31 - 1`), sparse compressed tensor invariants checks may give false-positive results:
```python
>>> nnz=2**31
>>> torch.sparse.check_sparse_tensor_invariants.enable()
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.zeros(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
tensor(crow_indices=tensor([          0,           1,           2,  ...,
                             2147483646,  2147483647, -2147483648]),
       col_indices=tensor([0, 0, 0,  ..., 0, 0, 0]),
       values=tensor([1., 1., 1.,  ..., 1., 1., 1.]), size=(2147483648, 1),
       nnz=2147483648, layout=torch.sparse_csr)
```
(notice that the last entry in `crow_indices` is invalid) or raise a bogus exception as in
```python
>>> torch.sparse_csr_tensor(torch.arange(nnz+1, dtype=torch.int32), torch.arange(nnz, dtype=torch.int32), torch.ones(nnz), (nnz, 1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: `0 <= col_indices < ncols` is not satisfied.
```
(notice that `col_indices` is actually valid).

This PR fixes the above-reported bugs by introducing integer overflow checks for sparse compressed tensors dimensions as well as nnz.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102530
Approved by: https://github.com/nikitaved
2023-05-31 15:34:08 +00:00
9fd14fcd09 Improve repeat_interleave with scalar repeat value (#102570)
`repeat_interleave_symint` is currently implemented by guarding on the `SymInt`
and converting it to a tensor to pass to the Tensor overload. This instead
implements it as a copy of an expanded tensor, which can be done without guards
and is also much more efficient in eager mode to boot.

For example, these are timings for `x.repeat_interleave(100, dim=-1)` with `x.shape == (1000, 100)`

| Device | Time (Master) | Time (This PR)  | Speedup |
|--------|---------------|-----------------|---------|
| cpu    | 18.8 ms       | 3.5 ms          | 5.4     |
| cuda   | 271 us        | 134 us          | 2.0     |
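
The "copy of an expanded tensor" idea can be sketched in eager Python (an illustration of the equivalence, not the actual implementation):

```python
import torch

x = torch.randn(1000, 100)
r = 100

expected = x.repeat_interleave(r, dim=-1)
# Same result via expand + reshape: no data-dependent guard on the repeat count.
via_expand = x.unsqueeze(-1).expand(*x.shape, r).reshape(*x.shape[:-1], x.shape[-1] * r)
assert torch.equal(expected, via_expand)
```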

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102570
Approved by: https://github.com/lezcano
2023-05-31 14:14:32 +00:00
47b884a74c [inductor] Revert a CI remedy for Triton compilation error (#102541)
Summary: revert https://github.com/pytorch/pytorch/pull/91634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102541
Approved by: https://github.com/ngimel
2023-05-31 13:13:51 +00:00
d80d3b18d0 nn.Linear with BSR inputs: spare the user from explicit Triton kernel registrations (#98403)
### <samp>🤖 Generated by Copilot at 08f7a6a</samp>

This pull request adds support for triton kernels in `torch` and `torch/cuda`, and refactors and tests the existing triton kernel for BSR matrix multiplication. It also adds a test case to ensure that importing `torch` does not implicitly import `triton`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98403
Approved by: https://github.com/malfet, https://github.com/cpuhrsch
2023-05-31 13:09:45 +00:00
019c38624c [Executorch][codegen] Add ETKernelIndex for aggregating all kernels for kernel (#102565)
keys and change codegen to take ETKernelIndex

We are adding support for dtype and dim order specialized kernel registration. This requires us to reorganize `BackendIndex` (which is a `Dict[DispatchKey, Dict[OperatorName, BackendMetadata]]`) to be `Dict[OperatorName, Dict[ETKernelKey, BackendMetadata]]`. This PR adds new data structures in order to support this change:

* `ETKernelKey` to retrieve a certain kernel from the registry.
* `ETKernelIndex`, the dictionary from operator name to kernel key to kernel mapping.

Note that the codegen logic is not changed yet; we need subsequent diffs to actually generate code for different kernel keys.
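
A minimal sketch of the reorganized index shape described above (hypothetical simplified types; the real ETKernelKey also carries the dtype/dim-order specialization):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass(frozen=True)
class ETKernelKey:
    # Placeholder for the dtype / dim-order specialization info.
    arg_meta: str = "default"

@dataclass
class ETKernelIndex:
    # operator name -> kernel key -> backend metadata (here just a kernel name).
    index: Dict[str, Dict[ETKernelKey, str]] = field(default_factory=dict)

    def lookup(self, op_name: str, key: ETKernelKey) -> str:
        return self.index[op_name][key]
```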

Differential Revision: [D46206339](https://our.internmc.facebook.com/intern/diff/D46206339/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102565
Approved by: https://github.com/Jack-Khuu
2023-05-31 09:41:36 +00:00
cyy
7c2641d5f1 apply constexpr and if constexpr when possible (#102471)
Now that we have full C++17 support, we can use if constexpr in some identified cases.

### <samp>🤖 Generated by Copilot at df4c16d</samp>

The pull request improves the performance, readability, and consistency of various function templates in the `ATen` and `torch` modules by using `constexpr` keywords and C++17 features. It also fixes some type conversion and overflow issues for different input and output types. The changes affect the code for distributions, BLAS, batch normalization, embedding bag, random number generation, vectorized operations, cuBLAS, XNNPACK, CUTLASS, and shape inference. The affected files include `DistributionsHelper.h`, `vec256_int.h`, `vec512_int.h`, `BlasKernel.cpp`, `IndexKernel.cpp`, `EmbeddingBag.cpp`, `Normalization.cpp`, `rng_test.h`, `vec_test_all_types.h`, `TransformationHelper.h`, `CUDABlas.cpp`, `DistributionKernels.cpp`, `DistributionTemplates.h`, `RangeFactories.cu`, `RangeFactories.cpp`, `qconv.cpp`, `StructuredSparseLinearCUTLASS.cu`, `vec_test_all_types.cpp`, and `shape_inference.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102471
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-05-31 06:17:07 +00:00
a5ddb72aec Quick fix for keep-going + reruns (#102569)
Currently file level reruns + stepcurrent are incompatible and it's making PRs green when they are actually red, so turn off stepcurrent + file level reruns when keep-going is used until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-31 04:46:25 +00:00
3c0251a100 [inductor] Fix issue with 0D reductions (#102568)
Fixes https://github.com/pytorch/pytorch/issues/102546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102568
Approved by: https://github.com/ngimel
2023-05-31 04:38:53 +00:00
46691d4369 [inductor][pattern matcher] Retain meta tags (#102462)
This will be used later on while propagating the `recompute` flag all the way to min-cut partitioner.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102462
Approved by: https://github.com/jansel
2023-05-31 03:59:13 +00:00
e7cc41772d Add dynamo collections.deque support (#102412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102412
Approved by: https://github.com/jansel, https://github.com/voznesenskym
2023-05-31 03:54:20 +00:00
cdca25cdc7 Fix warning couldn't find split args (#102561)
Summary:
Fixes #102416  [WARNING] couldn't find split args

In case the `dim=` kwarg is absent, we can default it to 0. Even after this, it is probably okay to make this an INFO rather than a WARNING

Test Plan: run torchbench

Differential Revision: D46292754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102561
Approved by: https://github.com/jansel
2023-05-31 03:27:51 +00:00
b4a49124c8 [ONNX] Reduce exporter memory usage by removing intermediate values (#101148)
This commit reduces the exporter memory usage by as much as 50%. During the shape inference step, the exporter caches the values of intermediate tensors in a `ConstantValueMap`. This can use as much memory as the model itself, or even more. For example, model weight tensors are often fed to a Transpose layer, and the output of that is the same size as the weights. This commit fixes the issue by removing the intermediate tensor values after they are used by all consumers.

The cached values are only used for shape inference, so removing them after use should be safe. `ConstantValueMap` is cleared anyways once shape inference is complete for the entire graph.

As an example, here is the model from issue #61263:
```python
import torch
import math

# Size in GB
tensor_size = 1
model_size = 8

layers_num = model_size // tensor_size
kB = 1024
MB = kB * kB
GB = MB * kB
precision_size = 4 # bytes per float
activation_size = math.floor(math.sqrt(tensor_size * GB / precision_size))

class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        for i in range(layers_num):
            name = "fc_%d" % i
            linear = torch.nn.Linear(activation_size, activation_size)
            setattr(self, name, linear)
    def forward(self, x):
        for i in range(layers_num):
            name = "fc_%d" % i
            linear = getattr(self, name)
            x = linear(x)
        return x

model = Net().cuda()
input = torch.zeros(activation_size, requires_grad=True).cuda()
with torch.no_grad():
    torch.onnx.export(model, (input, ), './model_large.onnx', do_constant_folding=False, opset_version=13)
```
It is just some large linear layers stacked together. Before this commit, my max GPU usage during export was about 16.7 GB, twice the model size. With this commit in combination with #101134, it was only about 9.5 GB.

Together with #101134, fixes issue #61263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101148
Approved by: https://github.com/BowenBao
2023-05-31 02:55:57 +00:00
5324124eac [profiler] Reintroduce forward-backward links (#102424)
**TL;DR:** This re-introduces links between backward kernels and their corresponding forward kernels.

<img width="1020" alt="Screenshot 2023-05-26 at 7 25 22 PM" src="https://github.com/pytorch/pytorch/assets/5067123/02571b59-859c-4c9e-b3ef-121ef3159812">

In the example above, you can see there are two such flows - one for aten::add, and one for aten::binary_cross_entropy

### Details

Forward/backward links were added in https://github.com/pytorch/pytorch/pull/62553, but then disabled in https://github.com/pytorch/pytorch/pull/72904 due to segfaults (e.g. https://github.com/pytorch/pytorch/issues/69443).

Between now and when the fwd-bwd links were disabled, there's been a lot of refactoring; so this PR updates the implementation:
* Use a raw profiler::impl::Result instead of a KinetoEvent
* Move the implementation to collection.cpp, where the TraceWrapper is currently handled.
* Sort the events before processing, because they aren't always in chronological order
* There can now be more than one event in the backward pass that matches the sequenceNr-threadID pair. The implementation needed to be updated to avoid showing multiple endpoints for a given sequenceNr-threadID pair ([ptr to where the bwd sequenceNr-threadID pair is duplicated](6e3e3dd477/torch/csrc/profiler/collection.cpp (L398-L399))).

Next, we need to verify that https://github.com/pytorch/pytorch/issues/69443 is fixed. Running the repro no longer errors. Looking further into the details of the issue it seems like the handling of the [raw linkedActivity pointer (old code from 2021)](6089dcac48/libkineto/src/output_json.cpp (L283)) resulted in the segfault. Now, it doesn't look like the linked activity is used anywhere in output_json.cpp so the issue should be fixed.

### Testing

#### 1. unit test
`test_profiler_fwd_bwd_link` was un-skipped. It was modified to match the new implementation.

#### 2. https://github.com/pytorch/pytorch/issues/69443

I ran the repro in https://github.com/pytorch/pytorch/issues/69443 and verified there were no segfaults.

#### 3. Duplicate flow IDs

When forward-backward connections were first introduced, gpu-cpu async links had not been introduced. There's a possibility that gpu-cpu links and fwd-bwd links could interfere if their IDs overlap.

I manually tested this in chrome://tracing; I edited a file so that a gpu-cpu link had the same ID as one of the fwd-bwd connections. The chrome tracing UI continued showing both types of links.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102424
Approved by: https://github.com/aaronenyeshi
2023-05-31 02:50:38 +00:00
73fd7235ad add function specializations for the case of parameters in BFloat16 data type (#100233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100233
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-05-31 02:01:07 +00:00
9edf65a821 [build] fix compilation error on s390x (#101923)
This PR fixes the following compilation error due to unexpected conflicts between #99057 and #101000

```
In file included from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/vec256.h:21,
                 from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec.h:6,
                 from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/native/cpu/Loops.h:37,
                 from /home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/native/cpu/batch_norm_kernel.cpp:9,
                 from /home1/ishizaki/PyTorch/main-lastest/build/aten/src/ATen/native/cpu/batch_norm_kernel.cpp.ZVECTOR.cpp:1:
/home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:2332:17: error: ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’ cannot be overloaded with ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’
 2332 |   Vectorized<T> expm1() const {
      |                 ^~~~~
/home1/ishizaki/PyTorch/main-lastest/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:2328:17: note: previous declaration ‘at::vec::ZVECTOR::Vectorized<T> at::vec::ZVECTOR::Vectorized<T, typename std::enable_if<is_zarch_implemented_complex<T>(), void>::type>::expm1() const’
 2328 |   Vectorized<T> expm1() const {
      |                 ^~~~~
cc1plus: note: unrecognized command-line option ‘-Wno-aligned-allocation-unavailable’ may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option ‘-Wno-unused-private-field’ may have been intended to silence earlier diagnostics
cc1plus: note: unrecognized command-line option ‘-Wno-invalid-partial-specialization’ may have been intended to silence earlier diagnostics
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101923
Approved by: https://github.com/malfet
2023-05-31 01:19:40 +00:00
cce58a43c9 [MPS] Fix softplus with f16 input (#101948)
Fixes #101946
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101948
Approved by: https://github.com/malfet
2023-05-31 00:40:10 +00:00
c3c1496143 [dynamo][higher order op] Bugfixes to pass graph.lint (#102448)
This PR ensures that the subgraphs use the newly created placeholders for the primary inputs and free variables. Earlier, this was not happening, and graph.lint() was failing. I need `graph.lint()` in the follow-up PRs where I run an `Interpreter` on the subgraph to preserve the metadata information for AOT Autograd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102448
Approved by: https://github.com/zou3519
2023-05-31 00:29:29 +00:00
8d5b14e907 [ONNX] Don't duplicate model weights in ONNX export (#101134)
This commit partially fixes an issue where the ONNX exporter always requires about twice as much memory as the model size. The `ONNXTracedModule` class uses a copy of the original weights only when `return_inputs=True`, so this commit makes sure the weights are cloned only in that case.

As a side note, I don't think the exporter is ever called with `return_inputs=True`, so maybe this is just some old code that can be removed.

Partially fixes #61263. There are still other places in the exporter which use more memory than they need to. For example, during the shape inference step many intermediate tensors are computed and saved until shape inference on the model is complete. I am working on a fix for that, but that optimization is independent of this one and can be done in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101134
Approved by: https://github.com/BowenBao, https://github.com/osalpekar
2023-05-30 23:47:04 +00:00
33a49eeae7 [benchmark] Flag to switch on activation checkpointing for HF models (#102557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102557
Approved by: https://github.com/ngimel, https://github.com/Chillee
2023-05-30 23:46:14 +00:00
6ac8a11746 Switch cuda 12.1 docker images to gcc9 (#102380)
Update CUDA-12.1 CI docker images to gcc-9, which should tentatively fix the internal compiler error in [libtorch-linux-bionic-cuda12.1-py3.7-gcc7 / build](https://github.com/pytorch/pytorch/actions/runs/5071681366/jobs/9135310361)

Co-authored by: Nikita Shulga <nshulga@meta.com>

Fixes: https://github.com/pytorch/pytorch/issues/102372
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102380
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-05-30 23:03:55 +00:00
9ff1932d2b [Dynamo] Save global autocast state to restore on graph break (#102415)
Fixes #102414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102415
Approved by: https://github.com/yf225
2023-05-30 23:03:21 +00:00
1a6ab8a5dc Revert "Quick fix for keep-going + reruns (#102569)"
This reverts commit 7f6edcf422d133b6fd747ec0775d1c840a91ee46.

Reverted https://github.com/pytorch/pytorch/pull/102569 on behalf of https://github.com/clee2000 due to broke a ton of stuff ([comment](https://github.com/pytorch/pytorch/pull/102569#issuecomment-1569167673))
2023-05-30 22:04:27 +00:00
4f468646d9 [PT2][Quant][BE] refactor test code to reduce duplication and standardize (#102497)
Summary:
This refactor introduces an internal function which selectively tests against fx
quant as well. Notably this does increase test times, so we need to figure out
how to resolve that.

Test Plan: test_quantization_pt2e

Reviewed By: jerryzh168

Differential Revision: D46154323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102497
Approved by: https://github.com/jerryzh168
2023-05-30 21:37:59 +00:00
7f6edcf422 Quick fix for keep-going + reruns (#102569)
Currently file level reruns + stepcurrent are incompatible and it's making PRs green when they are actually red, so turn off stepcurrent + file level reruns when keep-going is used until I figure out a better way to do this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102569
Approved by: https://github.com/huydhn
2023-05-30 21:29:56 +00:00
f14ac74fce [quant][pt2e] Add support for FixedQParamsQuantizationSpec (#102439)
Summary:
This PR adds support for FixedQParamsQuantizationSpec:

```
@dataclass(eq=True, frozen=True)
class FixedQParamsQuantizationSpec(QuantizationSpecBase):
    dtype: torch.dtype
    scale: float
    zero_point: int
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None
    qscheme: Optional[torch.qscheme] = None
```

This is useful to define the quantization spec for operators like sigmoid, which have predefined and fixed scale/zero_point values.
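
A minimal usage sketch, assuming sigmoid's fixed qparams are scale=1/256 and zero_point=0 for an 8-bit unsigned dtype (the exact values and the availability of `FixedQParamsQuantizationSpec` in scope are assumptions here; the fields are as defined above):

```python
import torch

# Hypothetical instantiation for a sigmoid output; values are illustrative.
sigmoid_output_spec = FixedQParamsQuantizationSpec(
    dtype=torch.uint8,
    scale=1.0 / 256.0,
    zero_point=0,
    quant_min=0,
    quant_max=255,
    qscheme=torch.per_tensor_affine,
)
```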

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_fixed_qparams_qspec (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'
```

Reviewed By: kimishpatel

Differential Revision: D46153082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102439
Approved by: https://github.com/kimishpatel
2023-05-30 21:28:13 +00:00
168ae806d0 [fx] Fix repr when arg is an OpOverload (#102547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102547
Approved by: https://github.com/Skylion007, https://github.com/jansel
2023-05-30 21:11:05 +00:00
68e55bff62 [minifier] add missing import (#102521)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102521
Approved by: https://github.com/jansel
2023-05-30 20:57:16 +00:00
95cdd58c8f Revert "[pt2] add SymInt support for linalg.tensorsolve (#102466)"
This reverts commit b1b76f614d5f1899f8f49518836a504ff05bf847.

Reverted https://github.com/pytorch/pytorch/pull/102466 on behalf of https://github.com/clee2000 due to reverting b/c stack https://github.com/pytorch/pytorch/pull/102469#issuecomment-1569041604, i think this is the one that actually causes the test to fail ([comment](https://github.com/pytorch/pytorch/pull/102466#issuecomment-1569045123))
2023-05-30 20:26:46 +00:00
463df86ce8 Revert "[pt2] add SymInt support for linalg.vander (#102469)"
This reverts commit 05717895aaab826bfd0e59567729e0d979e27897.

Reverted https://github.com/pytorch/pytorch/pull/102469 on behalf of https://github.com/clee2000 due to broke test_aotdispatch on linux ex 05717895aa https://github.com/pytorch/pytorch/actions/runs/5125654882/jobs/9219389448, shows up as green on pr due to bug with keep-going flag and reruns ([comment](https://github.com/pytorch/pytorch/pull/102469#issuecomment-1569041604))
2023-05-30 20:24:26 +00:00
c28f8e314d Add type hints in torch/distributed/utils.py (#102262)
Fixes #77190

Pretty similar to the typing in `torch/nn/parallel`, which was also improved recently: #102194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102262
Approved by: https://github.com/Skylion007, https://github.com/Neilblaze
2023-05-30 19:57:45 +00:00
05717895aa [pt2] add SymInt support for linalg.vander (#102469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102469
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-05-30 19:50:16 +00:00
b1b76f614d [pt2] add SymInt support for linalg.tensorsolve (#102466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102466
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-05-30 19:50:15 +00:00
0ba81ce8fe [pt2] add SymInt support for linalg.tensorinv (#102465)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102465
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-05-30 19:50:14 +00:00
7378b6b9e3 Add devcontainer support to PyTorch Project (#98252)
# Summary
### <samp>🤖 Generated by Copilot at 293ded1</samp>

This pull request adds support for using Visual Studio Code Remote - Containers extension with the pytorch project. It adds a `.devcontainer` folder with a `devcontainer.json` file, a `Dockerfile`, and a `noop.txt` file that configure and create a dev container with Anaconda and Python 3.

### <samp>🤖 Generated by Copilot at d6b9cd7</samp>

> _`devcontainer.json`_
> _Configures PyTorch containers_
> _For CPU or GPU_

## Related to:
https://github.com/pytorch/pytorch/issues/92838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98252
Approved by: https://github.com/ZainRizvi
2023-05-30 19:44:18 +00:00
4d89489df5 Move static checks of layers[0] (e.g., isinstance check) to model build time (#102045)
Summary: Move static checks of layers[0] (e.g., the isinstance check) to model build time, because isinstance() does not work for torchscripted code.  Since the validation is now performed while constructing the object, the isinstance() call runs in eager mode at build time, and we avoid needing to call isinstance() at runtime to determine whether the layers in a model are instances of the TransformerEncoderLayer class or its derived classes.
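
A hypothetical, simplified illustration of the pattern (not the actual TransformerEncoder code): run isinstance() once in eager mode while the module is being constructed, and only consult the precomputed flag afterwards.

```python
import copy

import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, layer: nn.Module, num_layers: int):
        super().__init__()
        # The static check runs at model build time, in eager mode.
        self.uses_builtin_layer = isinstance(layer, nn.TransformerEncoderLayer)
        self.layers = nn.ModuleList(
            [copy.deepcopy(layer) for _ in range(num_layers)]
        )

    def forward(self, x):
        # No isinstance() here, so the scripted forward stays simple.
        for mod in self.layers:
            x = mod(x)
        return x
```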

Test Plan: sandcastle, github

Differential Revision: D46096222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102045
Approved by: https://github.com/mikaylagawarecki
2023-05-30 19:42:01 +00:00
ff58d19c89 DeviceMesh use dispatchable PG to support custom backend (#102336)
This PR switches DeviceMesh to use a dispatchable process group instead.
This could enable easier backend integration: users only need to
integrate with the c10d process group custom backend, without needing to
change DeviceMesh to plug in the backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102336
Approved by: https://github.com/fduwjj
2023-05-30 19:22:37 +00:00
3ef4d697df [c10d] default backend needs to check for nccl availability (#102470)
As titled, we can only initialize the nccl backend when NCCL is available
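
A minimal sketch of the guard (illustrative only; the real check lives in the c10d initialization path, and the helper name here is hypothetical):

```python
import torch.distributed as dist

def default_backend_for(device_type: str) -> str:
    # Only pick nccl when the binary was actually built with NCCL support.
    if device_type == "cuda" and dist.is_nccl_available():
        return "nccl"
    return "gloo"
```
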
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102470
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2023-05-30 19:22:37 +00:00
ALi
b02f48b181 implement __dir__ for dynamo (#102480)
Fixes #94478: modules' attributes are not included when `__dir__` is called on the optimized module.
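
A hypothetical, simplified sketch of the idea (not dynamo's actual OptimizedModule): merge the wrapper's own attributes with the wrapped module's so `dir()` reports both.

```python
import torch.nn as nn

class OptimizedModuleSketch(nn.Module):
    def __init__(self, orig_mod: nn.Module):
        super().__init__()
        self._orig_mod = orig_mod

    def __dir__(self):
        # Expose the wrapped module's attributes alongside the wrapper's own.
        return sorted(set(super().__dir__()) | set(dir(self._orig_mod)))
```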

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102480
Approved by: https://github.com/msaroufim
2023-05-30 18:46:10 +00:00
704283d61f Improve clip_grad_norm to use torch.linalg.vector_norm (#102429)
Done in this PR:
 - Use `torch.linalg.vector_norm` instead of `torch.norm` (a minimal sketch of the resulting norm-of-norms pattern follows below)
 - Reduce the bandwidth cost of clip_grad_norm when used with `inf`, i.e. no need to get the returned tensor after `abs`

What I'm slightly unsure about:
 - I don't know if `inf` is supported by the `torch._foreach` API
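
A minimal sketch of the norm-of-norms pattern with `torch.linalg.vector_norm` (an illustrative helper, not the actual `clip_grad_norm_` implementation):

```python
import torch

def total_grad_norm(parameters, norm_type: float = 2.0) -> torch.Tensor:
    grads = [p.grad for p in parameters if p.grad is not None]
    if not grads:
        return torch.tensor(0.0)
    # The norm of the per-tensor norms equals the norm over all elements,
    # and the identity also holds for norm_type=float('inf') (max of maxes).
    per_tensor = torch.stack(
        [torch.linalg.vector_norm(g.detach(), norm_type) for g in grads]
    )
    return torch.linalg.vector_norm(per_tensor, norm_type)
```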

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102429
Approved by: https://github.com/lezcano
2023-05-30 18:35:18 +00:00
e71ab21422 update triton pin (#101919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101919
Approved by: https://github.com/ngimel
2023-05-30 17:16:05 +00:00
5fa273c870 ASAN: fix heap-buffer-overflow (#101970)
Pass size argument.

<details>
<summary>ASAN report</summary>

```
==1640574==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x609000022160 at pc 0x03ff31a04b42 bp 0x03ff69885dc0 sp 0x03ff69885db0
READ of size 16 at 0x609000022160 thread T1
    #0 0x3ff31a04b41 in at::vec::ZVECTOR::Vectorized<unsigned char, void>::loadu(void const*, int) /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:397
    #1 0x3ff31a04b41 in at::vec::ZVECTOR::Vectorized<c10::quint8, void>::loadu(void const*, int) /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:1574
    #2 0x3ff31a04b41 in operator() /home/user/pytorch/aten/src/ATen/native/quantized/cpu/kernels/QuantizedOpKernels.cpp:2668
    #3 0x3ff31cefa5d in void at::internal::invoke_parallel<at::native::(anonymous namespace)::quantized_normalize_kernel(at::Tensor const&, at::Tensor const&, at::Tensor const&, bool, int, int, long, long
, double, at::Tensor*)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(long, long)#1}>(long, long, long, at::native::(anonymous namespace)::quantized_normalize_kernel(at::Tens
or const&, at::Tensor const&, at::Tensor const&, bool, int, int, long, long, double, at::Tensor*)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const::{lambda(long, long)#1} const&) [clone
 ._omp_fn.0] /home/user/pytorch/aten/src/ATen/ParallelOpenMP.h:42
    #4 0x3ff6f31f52d in gomp_thread_start /var/tmp/portage/sys-devel/gcc-12.2.1_p20230304/work/gcc-12-20230304/libgomp/team.c:129
    #5 0x3ff82218381 in start_thread /usr/src/debug/sys-libs/glibc-2.37-r1/glibc-2.37/nptl/pthread_create.c:444
    #6 0x3ff822943f1  (/lib64/libc.so.6+0x1143f1)

0x609000022160 is located 0 bytes to the right of 32-byte region [0x609000022140,0x609000022160)
allocated by thread T0 here:
    #0 0x3ff82a3663f in __interceptor_posix_memalign /usr/src/debug/sys-devel/gcc-11.3.1_p20230303/gcc-11-20230303/libsanitizer/asan/asan_malloc_linux.cpp:226
    #1 0x3ff6f53ad95 in c10::alloc_cpu(unsigned long) /home/user/pytorch/c10/core/impl/alloc_cpu.cpp:74

Thread T1 created by T0 here:
    #0 0x3ff829dc263 in __interceptor_pthread_create /usr/src/debug/sys-devel/gcc-11.3.1_p20230303/gcc-11-20230303/libsanitizer/asan/asan_interceptors.cpp:216
    #1 0x3ff6f31fad5 in gomp_team_start /var/tmp/portage/sys-devel/gcc-12.2.1_p20230304/work/gcc-12-20230304/libgomp/team.c:858

SUMMARY: AddressSanitizer: heap-buffer-overflow /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:397 in at::vec::ZVECTOR::Vectorized<unsigned char, void>::loadu(void const*, int)
Shadow bytes around the buggy address:
  0x100c12000043d0: 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c12000043e0: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c12000043f0: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1200004400: fd fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1200004410: fa fa fa fa fa fa fa fa fd fa fa fa fa fa fa fa
=>0x100c1200004420: fa fa fa fa fa fa fa fa 00 00 00 00[fa]fa fa fa
  0x100c1200004430: fa fa fa fa fa fa fa fa fd fd fa fa fa fa fa fa
  0x100c1200004440: fa fa fa fa fa fa fa fa fd fd fa fa fa fa fa fa
  0x100c1200004450: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1200004460: 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1200004470: 00 00 fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1640574==ABORTING
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101970
Approved by: https://github.com/Skylion007, https://github.com/jgong5
2023-05-30 17:09:52 +00:00
fcbdbd6682 Fix silent nnz overflow for large sparse compressed tensors. (#102523)
Fixes https://github.com/pytorch/pytorch/issues/102520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102523
Approved by: https://github.com/nikitaved, https://github.com/cpuhrsch
2023-05-30 16:58:01 +00:00
77f97019b7 Dynamo remaps legacy allgather to traceable one (#102232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102232
Approved by: https://github.com/voznesenskym
2023-05-30 16:45:25 +00:00
c58264c3e9 [inductor] Support multiple symbolic numel expr in CudaWrapperCodeGen (#102093)
Summary: Add a set to avoid generating extra `auto` when seeing the
symbolic numel expression for the second time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102093
Approved by: https://github.com/jansel
2023-05-30 16:08:00 +00:00
7042e10215 Fixed issue with bicubic interpolation on uint8 input and antialising (#102296)
Description:

- Fixed issue with bicubic interpolation on uint8 input and antialiasing, discovered by @NicolasHug
- Unified `_separable_upsample_generic_Nd_kernel_impl_single_dim` on the `antialias` arg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102296
Approved by: https://github.com/NicolasHug
2023-05-30 14:57:19 +00:00
0f1621df1a [pt2] fix typos in checkFloatingOrComplex errors (#102456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102456
Approved by: https://github.com/lezcano
2023-05-30 11:18:50 +00:00
e380d692dc [pt2] skip linalg.householder_product tests on x86 macOS (#102460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102460
Approved by: https://github.com/lezcano
2023-05-30 08:44:13 +00:00
076f84c46f [pt2] update tolerance for linalg.pinv singular tests (#102458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102458
Approved by: https://github.com/lezcano
2023-05-30 08:44:13 +00:00
999bae0f54 Add padding check for use_nnpack (#92238)
Fixes #90142
nnp_convolution_output doesn't support the case of input padding >= kernel_size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92238
Approved by: https://github.com/jgong5, https://github.com/ganler
2023-05-30 05:07:59 +00:00
00992ffa2f [profiler] Global function for controlling fwd-bwd connection behavior (#102492)
Summary: In https://github.com/pytorch/pytorch/pull/102424, we'll re-introduce forward-backward links. We assume that most users will want to see them, but in case there are issues, we'll provide these APIs for turning them on and off.

Differential Revision: D46266365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102492
Approved by: https://github.com/aaronenyeshi
2023-05-30 04:50:34 +00:00
0e72ada9bb [vision hash update] update the pinned vision hash (#102495)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102495
Approved by: https://github.com/pytorchbot
2023-05-30 03:04:56 +00:00
2cc6ae1926 squash xblock for persistent inner reduction (#102444)
Currently layer norm kernel performance is pretty bad due to a Triton perf bug (https://gist.github.com/ngimel/c1e7f70f8268f038e710e835b0065f63), but since XBLOCK for persistent reductions is `1`, we can just drop this dimension and operate on 1d tensors (and then the perf of layer norm kernels improves a lot).
Perf results: http://hud.pytorch.org/benchmark/compilers?startTime=Mon%2C%2022%20May%202023%2001%3A27%3A25%20GMT&stopTime=Mon%2C%2029%20May%202023%2001%3A27%3A25%20GMT&suite=torchbench&mode=training&dtype=amp&lBranch=ngimel/persistent_1d&lCommit=1d5175f5e682f37aae15fd217bc3767e1788bacf&rBranch=main&rCommit=c9f4f01981fd73fcc7c27676cc50230cd1b5bc22, approx 4% on HF

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102444
Approved by: https://github.com/jansel
2023-05-30 02:51:10 +00:00
3c2519ab5e Revert "apply constexpr and if constexpr when possible (#102471)"
This reverts commit 461c03a93c0ac85837c1ef11afc0ec1dc8900d0c.

Reverted https://github.com/pytorch/pytorch/pull/102471 on behalf of https://github.com/huydhn due to Sorry for reverting your PR.  I think it breaks Windows CUDA build with a landrace 461c03a93c ([comment](https://github.com/pytorch/pytorch/pull/102471#issuecomment-1567653793))
2023-05-30 01:41:20 +00:00
cyy
461c03a93c apply constexpr and if constexpr when possible (#102471)
Now that we have full C++17 support, we can use if constexpr in some identified cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102471
Approved by: https://github.com/Skylion007
2023-05-30 00:47:07 +00:00
319a1cb4e5 [inductor] Replaced refs.op by torch.op in _refs/* (#102176)
Description:
- Replaced `refs.op` by `torch.op` in `_refs/*`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102176
Approved by: https://github.com/lezcano
2023-05-29 22:36:14 +00:00
fc0fed36d9 [inductor] fix issue with ops.lookup_seed (#102485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102485
Approved by: https://github.com/anijain2305
2023-05-29 22:25:47 +00:00
c6d9a0b9dd [inductor] Handle floordiv and remainder in IndexPropagation (#102277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102277
Approved by: https://github.com/lezcano
2023-05-29 17:21:13 +00:00
e4e151d669 [inductor] Inline ComputedBuffer computation when there are no reads (#102000)
When inductor compiles the following example,
```python
def flip(x):
    idx = torch.arange(x.shape[0] - 1, -1, -1, device=x.device)
    return x[idx], idx
```

The return of `idx` forces it to be realized into a `ComputedBuffer`
and the downstream index call inserts a corresponding load and
indirect_indexing:
```python
    tmp0 = tl.load(in_ptr0 + (x1), None)
    tmp1 = triton_helpers.promote_to_tensor(tmp0)
    tl.device_assert((0 <= tmp1) & (tmp1 < 128), "index out of bounds: 0 <= tmp1 < 128")
    tmp2 = tl.load(in_ptr1 + (x0 + (128*tmp0)), None)
```

However, if we can inline the index expression from the buffer's
computation we instead get direct indexing (and half the loads):
```python
    tmp0 = tl.load(in_ptr0 + (127 + ((-1)*x0)), None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102000
Approved by: https://github.com/lezcano
2023-05-29 17:21:13 +00:00
b1bc8aecf5 [inductor] erfinv: CPU/CUDA lowering (#101863)
Add `erfinv` lowering for CUDA. On CPU, we just fall back to the aten operator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101863
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-05-29 15:31:54 +00:00
0803b91867 Revert "Replace int64_t with a size type in python_list.h when applicable (#101922)"
This reverts commit 44e7f07ed4ef3ea2f9dc8deb66a779aeb4450b21.

Reverted https://github.com/pytorch/pytorch/pull/101922 on behalf of https://github.com/atalman due to breaks windows nightlies ([comment](https://github.com/pytorch/pytorch/pull/101922#issuecomment-1567240450))
2023-05-29 14:58:31 +00:00
af1d437654 Improve precision and performance for BFloat16 upsampling (#91169)
### Description
- Fix precision issue for BFloat16 upsampling: https://github.com/pytorch/pytorch/issues/89212
- Improve performance for BFloat16 upsampling.
### Testing
data type: BFloat16

- Single core

contiguous:
mode | scale_factor | shape  | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 14.47 | 8.34
linear | 2 | [3, 200, 200] | 3.69 | 2.74
bilinear | 2 | [3, 5, 200, 200] | 87.99 | 49.05
trilinear | 2 | [3, 3, 3, 100, 100]  | 171.02 | 72.53
bicubic | 2 | [3, 3, 200, 200 ] | 176.29 | 78

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 17.70 | 10.30
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] | 50.90 | 18.83
trilinear | 2 | [3, 3, 3, 100, 100] | 121.56 | 42.60
bicubic | 2 | [3, 3, 200, 200 ] | 179.40 | 80

- 20 cores

contiguous:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] | 1.17 | 1.01
linear | 2 | [3, 200, 200] | 0.41 | 0.26
bilinear | 2 | [3, 5, 200, 200] | 7.19 | 4.07
trilinear | 2 | [3, 3, 3, 100, 100]  | 21.32 | 9.33
bicubic | 2 | [3, 3, 200, 200 ] | 178.67 | 10

channels last:
mode | scale_factor | shape | before backward / ms |  after backward / ms
-- | -- | -- | -- | --
nearest | 2 | [10, 3, 200, 200] |  2.25 | 1.55
linear | 2 | [3, 200, 200] | \ | \
bilinear | 2 | [3, 5, 200, 200] |  20.17 | 7.20
trilinear | 2 | [3, 3, 3, 100, 100] | 43.33 | 15.66
bicubic | 2 | [3, 3, 200, 200 ] | 176.76 | 10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91169
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/Skylion007
2023-05-29 01:35:57 +00:00
040d2cc969 [dynamo] Some torchrec_dlrm related fixes (#101953)
Issue 1 of https://github.com/pytorch/pytorch/issues/101918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101953
Approved by: https://github.com/jansel
2023-05-28 17:56:08 +00:00
53d1d301c6 Enable CuDNN v8 frontend in RL (#102284)
Summary:
This enables use of CUDNN v8 in all Meta internal workflows. Also, fixes two minor issues:
- Skip LogCumSumExp compilation for complex dtypes for fbcode and RL
- Move `MakeConvOutputShape` template definition/specialization to anonymous namespace inside `at::native::quantized` as it is referenced from both `torch_cpu` and `torch_cuda`. This is necessary to avoid `duplicate symbol` linker error if say `libtorch_cpu` and `libtorch_cuda` are statically linked together.
- Lower CuDNN v8 version guard from 8.3 to 8.2 (as there is no good reason why it should be 8.3; the first version of the library that properly supports all the features is actually 8.5)

Test Plan: CI

Differential Revision: D46161651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102284
Approved by: https://github.com/atalman
2023-05-28 13:21:47 +00:00
81ac076bce Revert "[FSDP]Add device_mesh to FSDPstate (#102317)"
This reverts commit 4c584acc5d87ece9b236424cef6474c453e8d4b3.

Reverted https://github.com/pytorch/pytorch/pull/102317 on behalf of https://github.com/malfet due to Broke test_fake_pg, see https://github.com/pytorch/pytorch/actions/runs/5100633726/jobs/9173277369  ([comment](https://github.com/pytorch/pytorch/pull/102317#issuecomment-1566129496))
2023-05-28 12:53:28 +00:00
af70fe9f3e [PT2][Quant] Enable test_qnnpack_quantizer_conv_linear test (#102399)
Earlier this test was disabled due to pattern matching not working correctly.
Enabling this test now since we moved to module-partitioner-based matching.

Differential Revision: [D46130722](https://our.internmc.facebook.com/intern/diff/D46130722/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102399
Approved by: https://github.com/jerryzh168
2023-05-28 06:44:16 +00:00
0d876f7d43 [PT2][Quant] Move observer sharing ops to use module partitions (#102398)
As title

Differential Revision: [D46095331](https://our.internmc.facebook.com/intern/diff/D46095331/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102398
Approved by: https://github.com/jerryzh168
2023-05-28 05:50:15 +00:00
9fac5afbcc [PT2][Quant] Move add/add relu pattern via module partitioner (#102397)
This diff uses module partitioners to find add and add + relu patterns.

Differential Revision: [D46095330](https://our.internmc.facebook.com/intern/diff/D46095330/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102397
Approved by: https://github.com/jerryzh168
2023-05-28 05:47:43 +00:00
3d8f405022 [PT2][Quant] Move maxpool_2d quant to use module partitioners (#102396)
As summary

Differential Revision: [D46095332](https://our.internmc.facebook.com/intern/diff/D46095332/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102396
Approved by: https://github.com/jerryzh168
2023-05-28 05:44:54 +00:00
d997e3aac6 [PT2][Quant] Use module partitions for conv2d and conv2d + relu (#102395)
In this diff we continue to use source partitions for identifying node patterns
to annotate. Here we expand the use case to conv2d+relu and conv2d.

Differential Revision: [D46095329](https://our.internmc.facebook.com/intern/diff/D46095329/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102395
Approved by: https://github.com/jerryzh168
2023-05-28 05:40:45 +00:00
4cb6add471 [PT2][Quant] Use module partition for fused patterns (#102394)
This diff introduces the utility `find_sequential_partitions`.
This utility allows one to specify a sequential pattern of
nn.Module/nn.functional callables and returns a list. Each item in the list is a
List[SourcePartition] that represents sequentially connected partitions
of the requested pattern.
For example `find_sequential_partitions(model, [nn.Conv2d, nn.ReLU])` will find
all nn.Conv2d and nn.ReLU partitions that are sequentially connected.
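
A minimal usage sketch; the import path is an assumption (in this revision the helper lives under the PT2E quantization utilities and may be exposed from a different module), and `gm` is assumed to be a captured `torch.fx.GraphModule` of the model:

```python
import torch.nn as nn

# Import path is an assumption; adjust to wherever this revision exposes it.
from torch.ao.quantization.pt2e.utils import find_sequential_partitions

# Each entry pairs a Conv2d SourcePartition with the ReLU SourcePartition
# that directly consumes its output.
fused_partitions = find_sequential_partitions(gm, [nn.Conv2d, nn.ReLU])
for conv_partition, relu_partition in fused_partitions:
    print(conv_partition.output_nodes, relu_partition.output_nodes)
```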

Furthermore, move to using `find_sequential_partitions` for conv_bn/conv_bn_relu
for QAT.

Differential Revision: [D45948057](https://our.internmc.facebook.com/intern/diff/D45948057/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45948057/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102394
Approved by: https://github.com/jerryzh168
2023-05-28 05:29:16 +00:00
4c584acc5d [FSDP]Add device_mesh to FSDPstate (#102317)
This PR creates a device_mesh and shares it across all FSDP states. The device_mesh will later be used to test out dtensor state_dict (1d device_mesh).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102317
Approved by: https://github.com/awgu
2023-05-27 20:25:30 +00:00
c3ea8cc58b [pt2] convert out params in register_meta (#101344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101344
Approved by: https://github.com/lezcano
2023-05-27 18:38:52 +00:00
44e7f07ed4 Replace int64_t with a size type in python_list.h when applicable (#101922)
Potentially fixes the second issue described in #87159.

In python_list.h, `int64_t` is used when `diff_type` is better suited. On 32 bit systems, int64_t isn't a proper signed size type, which may cause the compilation error described in #87159.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101922
Approved by: https://github.com/Skylion007
2023-05-27 17:55:53 +00:00
3f4fee735a add Half support for logsigmoid, threshold, elu, gelu, hardtanh, hardsigmoid, hardswish, hardshrink, softshrink, leakyrelu, softplus, glu, silu, mish, and prelu on CPU (#98745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98745
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-27 16:20:21 +00:00
eda5abf5e0 [quant][pt2e] Fix propagate_annotation after recent refactors (#102422)
Summary:
Recently we changed the annotation from "target_dtype_info" to "quantization_annotation" and introduced the QuantizationAnnotation
and SharedQuantizationSpec APIs for users to convey sharing between inputs/outputs. This PR updates the _propagate_annotation
pass to accommodate the recent changes.

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
```

Reviewed By: kimishpatel

Differential Revision: D46153084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102422
Approved by: https://github.com/kimishpatel
2023-05-27 16:01:47 +00:00
6e3e3dd477 Do not collect and skip non-disabled tests when rerunning disabled tests (#102107)
The console log blows up too much when running in rerun disabled tests mode (x50) e132f09e88.  Each log is around 1GB and the whole uncompressed log set is ~50GB.  After compression, it will be around 1GB, still too big.  The increase comes mainly from the multiple SKIPPED messages for non-disabled tests, which is expected due to how SkipTest and pytest-flakyfinder currently work.

I updated `test/conftest.py` to completely ignore skipped tests when rerunning disabled tests, instead of collecting and then skipping 50 copies of each.  The benefit of doing this is much greater than I originally expected:
  * Rerun disabled tests jobs now finish in less than half an hour, as they should
  * Fix OOM runner crash because of too many collected tests
  * Fix verbosity issue as now only disabled tests are run x50 times.  There are only a few hundred of them atm
  * Fix timed out issue when rerunning disabled distributed and ASAN tests.  They are just too slow when running at x50

### Testing

When rerunning disabled tests https://github.com/pytorch/pytorch/actions/runs/5084508614, only disabled tests on the platform are run, for example `test_ops_jit` on https://ossci-raw-job-status.s3.amazonaws.com/log/13770164954 only ran 100 tests (`test_variant_consistency_jit_linalg_lu_cuda_float32` + `test_variant_consistency_jit_linalg_lu_factor_cuda_complex64`) x50.

```
Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ops_jit.py', '--shard-id=1', '--num-shards=2', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--sc=test_ops_jit_1', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2023-05-25 21:32:49.763856]

Expand the folded group to see the log file of test_ops_jit 2/2
##[group]PRINTING LOG FILE of test_ops_jit 2/2 (/var/lib/jenkins/workspace/test/test-reports/test_ops_jit_h2wr_t2c.log)
Test results will be stored in test-reports/python-pytest/test_ops_jit/test_ops_jit-51a83bd44549074e.xml
============================= test session starts ==============================
platform linux -- Python 3.10.11, pytest-7.3.1, pluggy-1.0.0 -- /opt/conda/envs/py_3.10/bin/python
cachedir: .pytest_cache
hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow]
rootdir: /var/lib/jenkins/workspace
configfile: pytest.ini
plugins: hypothesis-5.35.1, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-11.1.2, shard-0.1.2, xdist-3.3.0, xdoctest-1.1.0
collecting ... collected 1084 items
Running 100 items in this shard: test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 (x50), test/test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 (x50)
stepcurrent: Cannot find last run test, not skipping

test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_cuda_float32 PASSED [2.1876s] [  1%]
test_ops_jit.py::TestJitCUDA::test_variant_consistency_jit_linalg_lu_factor_cuda_complex64 PASSED [4.5615s] [  2%]
```

* [pull](https://github.com/pytorch/pytorch/actions/runs/5093566864)
* [trunk](https://github.com/pytorch/pytorch/actions/runs/5095364311)
* [periodic](https://github.com/pytorch/pytorch/actions/runs/5095378850)
* [slow](https://github.com/pytorch/pytorch/actions/runs/5095390285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102107
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-05-27 12:10:36 +00:00
995ac703cd [pt2] add SymInt support for linalg.pinv (#102367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102367
Approved by: https://github.com/lezcano
2023-05-27 11:10:47 +00:00
c9f4f01981 Add security guards to avoid crashes in torch::jit module (#102156)
Hi!

I've been fuzzing different pytorch modules with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch), and found multiple crashes in the torch::jit::load() function.

All found errors can be reproduced with the provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### Crash in torch/csrc/jit/unpickler.cpp:1075

[crash-1f59083b8396c5b62b4705c7556e68f129e833b1.zip](https://github.com/pytorch/pytorch/files/11552947/crash-1f59083b8396c5b62b4705c7556e68f129e833b1.zip)

```asan
    "#0  0x00007ffff7a5600b in raise () from /lib/x86_64-linux-gnu/libc.so.6",
    "#1  0x00007ffff7a35859 in abort () from /lib/x86_64-linux-gnu/libc.so.6",
    "#2  0x00007ffff7ce3911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#3  0x00007ffff7cef38c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#4  0x00007ffff7cef3f7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#5  0x00007ffff7cef6a9 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#6  0x00007ffff7ce6326 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#7  0x00007ffff7d87edc in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_create(unsigned long&, unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#8  0x00007ffff7d88880 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::reserve(unsigned long) () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#9  0x000000000ea52931 in torch::jit::Unpickler::readBytes[abi:cxx11](unsigned long) (this=this@entry=0x7fffffffac10, length=length@entry=8358680908539635837) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:1075",
    "#10 0x000000000ea4c3a0 in torch::jit::Unpickler::readInstruction (this=0x7fffffff90d0) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:355",
    "#11 0x000000000ea49eb8 in torch::jit::Unpickler::run (this=0x7fffffffac10) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251",
    "#12 0x000000000ea49b12 in torch::jit::Unpickler::parse_ivalue (this=0x7fffffffac10) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204",
    "#13 0x000000000e960a9f in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) (archive_name=..., pickle_prefix=..., tensor_prefix=..., type_resolver=..., obj_loader=..., device=..., stream_reader=..., type_parser=<optimized out>, storage_context=...) at /pytorch/torch/csrc/jit/serialization/import_read.cpp:53",
    "#14 0x000000000e8ef599 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive (this=0x7fffffffbc60, archive_name=...) at /pytorch/torch/csrc/jit/serialization/import.cpp:184",
    "#15 0x000000000e8eb886 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize (this=<optimized out>, device=..., extra_files=..., restore_shapes=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:287",
    "#16 0x000000000e8e9cc5 in torch::jit::import_ir_module (cu=..., in=..., device=..., extra_files=..., load_debug_files=<optimized out>, restore_shapes=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:386",
    "#17 0x000000000e8f37bf in torch::jit::import_ir_module (cu=..., in=..., device=..., load_debug_files=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:322",
    "#18 0x000000000e8f615a in torch::jit::load (in=..., device=..., load_debug_files=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:482",
    "#19 0x00000000005c2d61 in LLVMFuzzerTestOneInput (data=<optimized out>, size=1663) at /load.cc:42",
    "#20 0x00000000005c2a8e in ExecuteFilesOnyByOne (argc=2, argv=0x7fffffffc6b8, callback=callback@entry=0x5c2ae0 <LLVMFuzzerTestOneInput(uint8_t const*, size_t)>) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255",
    "#21 0x00000000005c2899 in LLVMFuzzerRunDriver (argcp=argcp@entry=0x7fffffffc5b4, argvp=argvp@entry=0x7fffffffc5b8, callback=0x5c2ae0 <LLVMFuzzerTestOneInput(uint8_t const*, size_t)>) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:364",
    "#22 0x00000000005c2459 in main (argc=2, argv=0x7fffffffc6b8) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300"

```

### Crash in torch/csrc/jit/unpickler.cpp:386

[crash-2e9923de375c393e700e8c0441f0ebe8252ca364.zip](https://github.com/pytorch/pytorch/files/11552950/crash-2e9923de375c393e700e8c0441f0ebe8252ca364.zip)

```asan
    "#0  0x00007ffff7a5600b in raise () from /lib/x86_64-linux-gnu/libc.so.6",
    "#1  0x00007ffff7a35859 in abort () from /lib/x86_64-linux-gnu/libc.so.6",
    "#2  0x00007ffff7ce3911 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#3  0x00007ffff7cef38c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#4  0x00007ffff7cef3f7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#5  0x00007ffff7cef6a9 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#6  0x00007ffff7ce6326 in std::__throw_length_error(char const*) () from /lib/x86_64-linux-gnu/libstdc++.so.6",
    "#7  0x0000000000670aff in std::vector<c10::IValue, std::allocator<c10::IValue> >::reserve (this=this@entry=0x7fffffff9750, __n=__n@entry=18446744073709551614) at /usr/include/c++/10/bits/vector.tcc:70",
    "#8  0x000000000ea4d5cd in torch::jit::Unpickler::readInstruction (this=0x7fffffffac10) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:386",
    "#9  0x000000000ea49eb8 in torch::jit::Unpickler::run (this=0x7fffffffac10) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251",
    "#10 0x000000000ea49b12 in torch::jit::Unpickler::parse_ivalue (this=0x7fffffffac10) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204",
    "#11 0x000000000e960a9f in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) (archive_name=..., pickle_prefix=..., tensor_prefix=..., type_resolver=..., obj_loader=..., device=..., stream_reader=..., type_parser=<optimized out>, storage_context=...) at /pytorch/torch/csrc/jit/serialization/import_read.cpp:53",
    "#12 0x000000000e8ef599 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive (this=0x7fffffffbc60, archive_name=...) at /pytorch/torch/csrc/jit/serialization/import.cpp:184",
    "#13 0x000000000e8eb886 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize (this=<optimized out>, device=..., extra_files=..., restore_shapes=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:287",
    "#14 0x000000000e8e9cc5 in torch::jit::import_ir_module (cu=..., in=..., device=..., extra_files=..., load_debug_files=<optimized out>, restore_shapes=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:386",
    "#15 0x000000000e8f37bf in torch::jit::import_ir_module (cu=..., in=..., device=..., load_debug_files=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:322",
    "#16 0x000000000e8f615a in torch::jit::load (in=..., device=..., load_debug_files=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:482",
    "#17 0x00000000005c2d61 in LLVMFuzzerTestOneInput (data=<optimized out>, size=5498) at /load.cc:42",
    "#18 0x00000000005c2a8e in ExecuteFilesOnyByOne (argc=2, argv=0x7fffffffc6b8, callback=callback@entry=0x5c2ae0 <LLVMFuzzerTestOneInput(uint8_t const*, size_t)>) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:255",
    "#19 0x00000000005c2899 in LLVMFuzzerRunDriver (argcp=argcp@entry=0x7fffffffc5b4, argvp=argvp@entry=0x7fffffffc5b8, callback=0x5c2ae0 <LLVMFuzzerTestOneInput(uint8_t const*, size_t)>) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:364",
    "#20 0x00000000005c2459 in main (argc=2, argv=0x7fffffffc6b8) at /AFLplusplus/utils/aflpp_driver/aflpp_driver.c:300"
```

### Crash in torch/csrc/jit/serialization/source_range_serialization.cpp:211

[crash-5598d386057152f606bfa69d85605499e8852625.zip](https://github.com/pytorch/pytorch/files/11552952/crash-5598d386057152f606bfa69d85605499e8852625.zip)

```asan
    "#0  torch::jit::ConcreteSourceRangeUnpickler::unpickle (this=0x99b8d80) at /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:211",
    "#1  0x0000000004042566 in torch::jit::ConcreteSourceRangeUnpickler::findSourceRangeThatGenerated (this=0x99aa1c0, range=...) at /pytorch/torch/csrc/jit/serialization/source_range_serialization.cpp:229",
    "#2  0x00000000007b5cc8 in torch::jit::Source::findSourceRangeThatGenerated (this=<optimized out>, range=...) at /pytorch/torch/csrc/jit/frontend/source_range.cpp:144",
    "#3  torch::jit::SourceRange::findSourceRangeThatGenerated (this=0x7fffffffa650) at /pytorch/torch/csrc/jit/frontend/source_range.h:384",
    "#4  torch::jit::SourceRange::highlight (this=0x7fffffffa650, out=...) at /pytorch/torch/csrc/jit/frontend/source_range.cpp:149",
    "#5  0x00000000007a0e74 in torch::jit::Lexer::expected (this=this@entry=0x99979a0, what=..., t=...) at /pytorch/torch/csrc/jit/frontend/lexer.h:461",
    "#6  0x000000000079fcaa in torch::jit::Lexer::lexRaw (this=this@entry=0x99979a0, whitespace_token=false) at /pytorch/torch/csrc/jit/frontend/lexer.h:552",
    "#7  0x000000000079fd23 in torch::jit::Lexer::lex (this=this@entry=0x99979a0) at /pytorch/torch/csrc/jit/frontend/lexer.h:487",
    "#8  0x00000000007a1da1 in torch::jit::Lexer::next (this=this@entry=0x99979a0) at /pytorch/torch/csrc/jit/frontend/lexer.h:436",
    "#9  0x0000000003bff6a8 in torch::jit::Lexer::nextIf (this=0x99979a0, kind=330) at /pytorch/torch/csrc/jit/frontend/lexer.h:444",
    "#10 torch::jit::ParserImpl::parseReturnAnnotation (this=this@entry=0x99979a0) at /pytorch/torch/csrc/jit/frontend/parser.cpp:703",
    "#11 0x0000000003bfd500 in torch::jit::ParserImpl::parseDecl (this=this@entry=0x99979a0) at /pytorch/torch/csrc/jit/frontend/parser.cpp:729",
    "#12 0x0000000003bfb725 in torch::jit::ParserImpl::parseFunction (this=this@entry=0x99979a0, is_method=true) at /pytorch/torch/csrc/jit/frontend/parser.cpp:755",
    "#13 0x0000000003bfdc28 in torch::jit::ParserImpl::parseStmt (this=this@entry=0x99979a0, in_class=<optimized out>) at /pytorch/torch/csrc/jit/frontend/parser.cpp:599",
    "#14 0x0000000003bfd8dd in torch::jit::ParserImpl::parseStatements (this=this@entry=0x99979a0, expect_indent=<optimized out>, in_class=<optimized out>) at /pytorch/torch/csrc/jit/frontend/parser.cpp:697",
    "#15 0x0000000003bfc4ba in torch::jit::ParserImpl::parseClass (this=0x99979a0) at /pytorch/torch/csrc/jit/frontend/parser.cpp:747",
    "#16 0x0000000003bfaddc in torch::jit::Parser::parseClass (this=<optimized out>) at /pytorch/torch/csrc/jit/frontend/parser.cpp:812",
    "#17 0x0000000004008e2d in torch::jit::SourceImporterImpl::parseSourceIfNeeded (this=this@entry=0x95d41f0, qualifier=...) at /pytorch/torch/csrc/jit/serialization/import_source.cpp:182",
    "#18 0x0000000004008ab7 in torch::jit::SourceImporterImpl::findNamedType (this=this@entry=0x95d41f0, name=...) at /pytorch/torch/csrc/jit/serialization/import_source.cpp:135",
    "#19 0x000000000400d010 in torch::jit::SourceImporterImpl::resolveType (this=0x95d41f0, name=..., loc=...) at /pytorch/torch/csrc/jit/serialization/import_source.cpp:261",
    "#20 0x0000000003c20821 in torch::jit::ScriptTypeParser::parseTypeFromExpr (this=this@entry=0x7fffffffb658, expr=...) at /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:238",
    "#21 0x0000000003c20acc in torch::jit::ScriptTypeParser::parseType (this=0x7fffffffb658, str=...) at /pytorch/torch/csrc/jit/frontend/script_type_parser.cpp:312",
    "#22 0x0000000004019416 in torch::jit::SourceImporter::loadType (this=<optimized out>, name=...) at /pytorch/torch/csrc/jit/serialization/import_source.cpp:786",
    "#23 0x0000000003ff365e in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0::operator()(c10::QualifiedName const&) const (this=<optimized out>, qn=...) at /pytorch/torch/csrc/jit/serialization/import.cpp:146",
    "#24 std::__invoke_impl<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(std::__invoke_other, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) (__f=..., __args=...) at /usr/include/c++/10/bits/invoke.h:60",
    "#25 std::__invoke_r<c10::StrongTypePtr, torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&>(torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0&, c10::QualifiedName const&) (__fn=..., __args=...) at /usr/include/c++/10/bits/invoke.h:113",
    "#26 std::_Function_handler<c10::StrongTypePtr (c10::QualifiedName const&), torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)::$_0>::_M_invoke(std::_Any_data const&, c10::QualifiedName const&) (__functor=..., __args=...) at /usr/include/c++/10/bits/std_function.h:291",
    "#27 0x000000000404e5c4 in std::function<c10::StrongTypePtr (c10::QualifiedName const&)>::operator()(c10::QualifiedName const&) const (this=0x7fffffffbf28, __args=...) at /usr/include/c++/10/bits/std_function.h:622",
    "#28 torch::jit::Unpickler::readGlobal (this=this@entry=0x7fffffffbd50, module_name=..., class_name=...) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:820",
    "#29 0x0000000004049ce5 in torch::jit::Unpickler::readInstruction (this=this@entry=0x7fffffffbd50) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:496",
    "#30 0x00000000040497a8 in torch::jit::Unpickler::run (this=0x7fffffffbd50) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:251",
    "#31 0x00000000040494f9 in torch::jit::Unpickler::parse_ivalue (this=0x99aa1c0) at /pytorch/torch/csrc/jit/serialization/unpickler.cpp:204",
    "#32 0x00000000040075f8 in torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_type<c10::ivalue::Object> > (c10::StrongTypePtr, c10::IValue)> >, c10::optional<c10::Device>, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&), std::shared_ptr<torch::jit::DeserializationStorageContext>) (archive_name=..., pickle_prefix=..., tensor_prefix=..., type_resolver=..., obj_loader=..., device=..., stream_reader=..., type_parser=0x0, storage_context=...) at /pytorch/torch/csrc/jit/serialization/import_read.cpp:53",
    "#33 0x0000000003ff3545 in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::readArchive (this=this@entry=0x7fffffffc2b8, archive_name=...) at /pytorch/torch/csrc/jit/serialization/import.cpp:184",
    "#34 0x0000000003fed8bf in torch::jit::(anonymous namespace)::ScriptModuleDeserializer::deserialize (this=this@entry=0x7fffffffc2b8, device=device@entry=..., extra_files=..., restore_shapes=220) at /pytorch/torch/csrc/jit/serialization/import.cpp:287",
    "#35 0x0000000003febb0f in torch::jit::import_ir_module (cu=..., in=..., device=..., device@entry=..., extra_files=..., load_debug_files=true, restore_shapes=<optimized out>) at /pytorch/torch/csrc/jit/serialization/import.cpp:386",
    "#36 0x0000000003feb7a1 in torch::jit::import_ir_module (cu=..., in=..., device=..., device@entry=..., load_debug_files=false) at /pytorch/torch/csrc/jit/serialization/import.cpp:322",
    "#37 0x0000000003ff015a in torch::jit::load (in=..., device=device@entry=..., load_debug_files=true) at /pytorch/torch/csrc/jit/serialization/import.cpp:482",
    "#38 0x00000000004a1655 in LLVMFuzzerTestOneInput (data=0x981a680 \"PK\\003\\004\", size=1609) at /load.cc:42",
    "#39 0x00000000004a1dbf in main ()"
```

### Segmentation fault in /pytorch/aten/src/ATen/core/ivalue.h:526

[crash-9bd059c1ae85ab9cdb41d786932214d942baa189.zip](https://github.com/pytorch/pytorch/files/11552956/crash-9bd059c1ae85ab9cdb41d786932214d942baa189.zip)

```asan
    "==8528==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x00000e55d97e bp 0x7fffffffb4d0 sp 0x7fffffffb360 T0)",
    "==8528==The signal is caused by a READ memory access.",
    "==8528==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.",
    "    #0 0xe55d97e in c10::IValue::isTuple() const /pytorch/aten/src/ATen/core/ivalue.h:526:26",
    "    #1 0xe55d97e in torch::distributed::rpc::GloballyUniqueId::fromIValue(c10::IValue const&) /pytorch/torch/csrc/distributed/rpc/types.cpp:60:3",
    "    #2 0xe4b04fb in torch::distributed::rpc::ScriptRemoteCall::fromIValues(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch/torch/csrc/distributed/rpc/script_remote_call.cpp:33:20",
    "    #3 0xe4b1ed5 in torch::distributed::rpc::ScriptRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/script_remote_call.cpp:80:10",
    "    #4 0xe55f8a0 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch/torch/csrc/distributed/rpc/utils.cpp:108:14",
    "    #5 0x6120a8 in LLVMFuzzerTestOneInput /message_deserialize.cc:192:27",
    "    #6 0x535de1 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15",
    "    #7 0x51fcec in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6",
    "    #8 0x525a3b in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9",
    "    #9 0x54eff2 in main /llvm-project-llvmorg-14.0.6/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10",
    "    #10 0x7ffff7a37082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082) (BuildId: 1878e6b475720c7c51969e69ab2d276fae6d1dee)",
    "    #11 0x51a60d in _start (/message_deserialize_fuzz+0x51a60d)",
    "",
    "AddressSanitizer can not provide additional info.",
    "SUMMARY: AddressSanitizer: SEGV /pytorch/aten/src/ATen/core/ivalue.h:526:26 in c10::IValue::isTuple() const",
    "==8528==ABORTING"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102156
Approved by: https://github.com/ezyang
2023-05-27 04:23:01 +00:00
cyy
d7eec5628d Fix some move warnings by gcc13 (#102353)
GCC 13 has improved warnings about std::move. This PR fixes some of the issues it detects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102353
Approved by: https://github.com/ezyang
2023-05-27 04:19:00 +00:00
26f53bb8b0 Deallocate workspace on thread exit (#102276)
LeakSanitizer picks up this allocation as a leak, so turn the buffer and size into a single object that deallocates when the thread_local is destroyed.

Note that in our use case the call that hits this code runs on one or more separate threads, which can, under the right circumstances, be torn down and rebuilt, hence leaking multiple instances of this allocation.

Testing was performed locally on an Apple M2 with this patch applied, and the ~100MB of leaks previously shown by LeakSanitizer and Instruments are no longer there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102276
Approved by: https://github.com/ezyang
2023-05-27 03:57:30 +00:00
5ee46afc05 perf hint logging in inductor (#102250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102250
Approved by: https://github.com/Skylion007, https://github.com/shunting314, https://github.com/jansel
2023-05-27 03:43:30 +00:00
25058d5f66 Modified logging threshold for memory profiling (#102243)
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. We observe differently-sized allocations on different AMD GPUs, so we chose a threshold higher than 512 to account for those differences while still satisfying the test requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102243
Approved by: https://github.com/ezyang
2023-05-27 03:36:25 +00:00
e344ff4113 Support dynamo tracing collectives with processgroup arg (#102222)
Previously, other types of rank descriptors worked, but passing a process group (pg) caused dynamo to break down when tracing the internal function that converts the pg to a rank list.
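
A minimal sketch of the kind of call this change is meant to support, assuming an initialized default process group; `dist.all_reduce` and `ProcessGroup.size()` are standard torch.distributed APIs, but whether a particular collective traces cleanly still depends on the op used:

```python
import torch
import torch.distributed as dist

def allreduce_mean(x, pg):
    # Passing the ProcessGroup object itself is what previously broke tracing.
    dist.all_reduce(x, group=pg)
    return x / pg.size()

compiled_fn = torch.compile(allreduce_mean)
# out = compiled_fn(torch.ones(4, device="cuda"), dist.group.WORLD)
```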

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102222
Approved by: https://github.com/wanchaol, https://github.com/voznesenskym
2023-05-27 03:01:49 +00:00
ecd79b1fef add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT, or user-created streams possibly with high priority), and 5 bits for index. This allows more stream priorities to be exposed to the user (I'm currently setting 4, but that's easy to change now). Note that we pre-create all 32 streams in the pool for each allowed priority; I don't know whether that is a problem in practice. Currently CUDA 11.8 on A100 GPUs allows 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous callsites that explicitly requested a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.
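
A rough illustration of the bit layout described above; the helper names and exact bit positions here are assumptions for exposition, not the actual c10 implementation:

```python
# Layout (low to high bits): [ext flag: 1][IdType: 4][index: 5]
def pack_stream_id(index: int, id_type: int, is_ext: bool) -> int:
    assert 0 <= index < 32 and 0 <= id_type < 16  # 5 bits and 4 bits respectively
    return (index << 5) | (id_type << 1) | int(is_ext)

def unpack_stream_id(sid: int):
    return sid >> 5, (sid >> 1) & 0xF, bool(sid & 1)

assert unpack_stream_id(pack_stream_id(7, 3, False)) == (7, 3, False)
```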

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-27 02:36:16 +00:00
88961e6d30 Revert "[inductor] Inline ComputedBuffer computation when there are no reads (#102000)"
This reverts commit f2dfcb8778f109d32c8fb141ac7492b07ad8547b.

Reverted https://github.com/pytorch/pytorch/pull/102000 on behalf of https://github.com/kit1980 due to Broke inductor tests https://github.com/pytorch/pytorch/actions/runs/5095190248/jobs/9160028124 ([comment](https://github.com/pytorch/pytorch/pull/102000#issuecomment-1565131080))
2023-05-27 01:11:40 +00:00
20e6ff375a support ConvBinary in Inductor cpp wrapper (#101393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101393
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
2023-05-27 01:03:51 +00:00
f162ab0423 Revert "[inductor] Handle floordiv and remainder in IndexPropagation (#102277)"
This reverts commit 267a181beb3e7d39dce3c5dfa527080684969691.

Reverted https://github.com/pytorch/pytorch/pull/102277 on behalf of https://github.com/kit1980 due to Broke inductor tests https://github.com/pytorch/pytorch/actions/runs/5095190248/jobs/9160028124 ([comment](https://github.com/pytorch/pytorch/pull/102277#issuecomment-1565108864))
2023-05-27 00:40:43 +00:00
da3aba1e46 Revert "[pt2] add SymInt support for linalg.pinv (#102367)"
This reverts commit 0d5b74da0cab798fbfdb9caa53fad816999c8386.

Reverted https://github.com/pytorch/pytorch/pull/102367 on behalf of https://github.com/kit1980 due to Broke slow tests https://github.com/pytorch/pytorch/actions/runs/5095190248/jobs/9160028124 ([comment](https://github.com/pytorch/pytorch/pull/102367#issuecomment-1565104562))
2023-05-27 00:33:42 +00:00
23223402eb [quant][pt2e] Add Support for DerivedQuantizationSpec (#102282)
Summary:
```
"""
4. DerivedQuantizationSpec
this is the quantization spec for the Tensors whose quantization parameters are derived from other Tensors
"""

class DerivedQuantizationSpec(QuantizationSpecBase):
    # specifies which Tensors the quantization parameters are derived from
    # this can either be an edge from argument to node, or a node
    derived_from: List[EdgeOrNode]
    derive_qparams_fn: Callable[[List[ObserverOrFakeQuantize]], Tuple[Tensor, Tensor]]
     ...
```
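
A hypothetical usage sketch of `derive_qparams_fn`, deriving bias qparams from a conv's input and weight observers; the node names in the commented construction (`input_act`, `weight`, `conv_node`) and the surrounding quantizer plumbing are assumptions for illustration:

```python
import torch

def derive_bias_qparams(obs_or_fqs):
    # obs_or_fqs holds the observers/fake-quants for the edges listed in derived_from
    input_obs, weight_obs = obs_or_fqs
    input_scale, _ = input_obs.calculate_qparams()
    weight_scale, _ = weight_obs.calculate_qparams()
    bias_scale = (input_scale * weight_scale).reshape(-1)
    bias_zero_point = torch.zeros_like(bias_scale, dtype=torch.int32)
    return bias_scale, bias_zero_point

# bias_spec = DerivedQuantizationSpec(
#     derived_from=[(input_act, conv_node), (weight, conv_node)],
#     derive_qparams_fn=derive_bias_qparams,
#     dtype=torch.int32, quant_min=-(2**31), quant_max=2**31 - 1,
# )
```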

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Reviewed By: kimishpatel

Differential Revision: D46097855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102282
Approved by: https://github.com/andrewor14
2023-05-27 00:24:39 +00:00
267a181beb [inductor] Handle floordiv and remainder in IndexPropagation (#102277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102277
Approved by: https://github.com/lezcano
2023-05-26 23:47:53 +00:00
f2dfcb8778 [inductor] Inline ComputedBuffer computation when there are no reads (#102000)
When inductor compiles the following example,
```python
def flip(x):
    idx = torch.arange(x.shape[0] - 1, -1, -1, device=x.device)
    return x[idx], idx
```

The return of `idx` forces it to be realized into a `ComputedBuffer`
and the downstream index call inserts a corresponding load and
indirect_indexing:
```python
    tmp0 = tl.load(in_ptr0 + (x1), None)
    tmp1 = triton_helpers.promote_to_tensor(tmp0)
    tl.device_assert((0 <= tmp1) & (tmp1 < 128), "index out of bounds: 0 <= tmp1 < 128")
    tmp2 = tl.load(in_ptr1 + (x0 + (128*tmp0)), None)
```

However, if we can inline the index expression from the buffer's
computation we instead get direct indexing (and half the loads):
```python
    tmp0 = tl.load(in_ptr0 + (127 + ((-1)*x0)), None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102000
Approved by: https://github.com/lezcano
2023-05-26 23:47:53 +00:00
1e4292a1e8 [export] Rename graph_module.py to exported_program.py (#102260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102260
Approved by: https://github.com/ydwu4, https://github.com/tugsbayasgalan
2023-05-26 23:36:38 +00:00
c4028de462 [export] ExportedProgram (#102259)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102259
Approved by: https://github.com/ydwu4, https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2023-05-26 23:36:38 +00:00
80b916a586 fix sm86 cuda 21.1 conv threshold issues (#102361)
Fixes #102287, helps unblock https://github.com/pytorch/pytorch/pull/102178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102361
Approved by: https://github.com/atalman
2023-05-26 22:48:33 +00:00
c06d33ce43 Add dynamo itertools.combinations support (#102379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102379
Approved by: https://github.com/jansel
2023-05-26 22:48:24 +00:00
76a36159f7 Replace full_like lowerings with decomps (#101963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101963
Approved by: https://github.com/jansel
2023-05-26 21:51:22 +00:00
9c4fd72b53 [aot_autograd][functional_rng] Change calling convention (#102344)
Key change - seed and offset are the last 2 args in both the fwd and bwd graphs.
Reason - The cudagraphs implementation in inductor currently relies on very simple ordering guarantees, i.e. the first n inputs are static for both fwd and bwd graphs. In the current implementation of functionalization of rng ops, this assumption is broken because the first 2 inputs are seed and offset.
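
Purely illustrative stubs (not the actual generated graphs) showing the calling-convention change; the argument names are assumptions:

```python
def fwd_old(fwd_seed, fwd_base_offset, primals_1, primals_2):
    ...  # seed/offset first: breaks the "first n inputs are static" assumption

def fwd_new(primals_1, primals_2, fwd_seed, fwd_base_offset):
    ...  # seed/offset last: the static inputs stay at the front
```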

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102344
Approved by: https://github.com/eellison
2023-05-26 21:27:20 +00:00
bcaa93e80c s390x simd: disable functions with out-of-bounds reads (#102266)
The 3 disabled functions attempt out-of-bounds reads. Disable them until the sleef library is fixed.

<details>
<summary>ASAN report</summary>

```
=================================================================
==2030580==ERROR: AddressSanitizer: global-buffer-overflow on address 0x03ff70f54570 at pc 0x03ff6704e960 bp 0x03ffce128940 sp 0x03ffce128930
READ of size 4 at 0x03ff70f54570 thread T0
    #0 0x3ff6704e95f in vgather_vf_p_vi2 /home/user/pytorch/third_party/sleef/src/arch/helpers390x_128.h:129
    #1 0x3ff6704e95f in rempif /home/user/pytorch/third_party/sleef/src/libm/sleefsimdsp.c:550
    #2 0x3ff6704e95f in Sleef_cosf4_u10vxe2 /home/user/pytorch/third_party/sleef/src/libm/sleefsimdsp.c:1021
    #3 0x3ff67029cfb in Sleef_cosf4_u10 /home/user/pytorch/build/sleef/src/libm/disps390x_128.c:182
    #4 0x3ff55d21941 in at::vec::ZVECTOR::Vectorized<float, void> at::vec::ZVECTOR::Vectorized<float, void>::mapSleef<float __vector(4) const (*)(float __vector(4)), double __vector(2) const (*)(double __
vector(2)), float, 0>(float __vector(4) const (*)(float __vector(4)), double __vector(2) const (*)(double __vector(2))) const /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:991
    #5 0x3ff5689ad01 in at::vec::ZVECTOR::Vectorized<float, void>::cos() const /home/user/pytorch/aten/src/ATen/cpu/vec/vec256/zarch/vec256_zarch.h:1074
    #6 0x3ff5685df97 in at::vml::ZVECTOR::vcos<float>(float*, float const*, long)::{lambda(at::vec::ZVECTOR::Vectorized<float, void>)#1}::operator()(at::vec::ZVECTOR::Vectorized<float, void>) const /home/
user/pytorch/aten/src/ATen/cpu/vml.h:71
    #7 0x3ff5689b691 in void at::vec::map<float, at::vml::ZVECTOR::vcos<float>(float*, float const*, long)::{lambda(at::vec::ZVECTOR::Vectorized<float, void>)#1}, 0>(at::vml::ZVECTOR::vcos<float>(float*,
float const*, long)::{lambda(at::vec::ZVECTOR::Vectorized<float, void>)#1} const&, float*, float const*, long) /home/user/pytorch/aten/src/ATen/cpu/vec/functional_base.h:239
    #8 0x3ff5685e0df in void at::vml::ZVECTOR::vcos<float>(float*, float const*, long) /home/user/pytorch/aten/src/ATen/cpu/vml.h:71
    #9 0x3ff563fdde3 in operator() /home/user/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:770
    #10 0x3ff5648e4a3 in operator() /home/user/pytorch/aten/src/ATen/TensorIterator.h:406
    #11 0x3ff5663cae1 in callback_fn<at::TensorIteratorBase::loop_2d_from_1d<at::native::ZVECTOR::cos_kernel(at::TensorIteratorBase&)::<lambda()>::<lambda()>::<lambda(char**, const int64_t*, int64_t)> >(c
onst at::native::ZVECTOR::cos_kernel(at::TensorIteratorBase&)::<lambda()>::<lambda()>::<lambda(char**, const int64_t*, int64_t)>&)::<lambda(char**, const int64_t*, int64_t, int64_t)> > /home/user/pytorch/
c10/util/FunctionRef.h:43
    #12 0x3ff4d45a933 in c10::function_ref<void (char**, long const*, long, long)>::operator()(char**, long const*, long, long) const /home/user/pytorch/c10/util/FunctionRef.h:64
    #13 0x3ff4d455133 in at::internal::serial_for_each(c10::ArrayRef<long>, c10::ArrayRef<long>, char**, unsigned long, c10::function_ref<void (char**, long const*, long, long)>, at::Range) /home/user/pyt
orch/aten/src/ATen/TensorIteratorInternal.h:52
    #14 0x3ff4d43b703 in at::TensorIteratorBase::serial_for_each(c10::function_ref<void (char**, long const*, long, long)>, at::Range) const /home/user/pytorch/aten/src/ATen/TensorIterator.cpp:777
    #15 0x3ff4d43ab59 in at::TensorIteratorBase::for_each(c10::function_ref<void (char**, long const*, long, long)>, long) /home/user/pytorch/aten/src/ATen/TensorIterator.cpp:749
    #16 0x3ff5648e851 in for_each<at::native::ZVECTOR::cos_kernel(at::TensorIteratorBase&)::<lambda()>::<lambda()>::<lambda(char**, const int64_t*, int64_t)> > /home/user/pytorch/aten/src/ATen/TensorItera
tor.h:421
    #17 0x3ff563fe5f9 in operator() /home/user/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:770
    #18 0x3ff56400915 in operator() /home/user/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:770
    #19 0x3ff56400f1d in at::native::ZVECTOR::cos_kernel(at::TensorIteratorBase&) /home/user/pytorch/aten/src/ATen/native/cpu/UnaryOpsKernel.cpp:770
    #20 0x3ff4f303007 in void at::native::DispatchStub<void (*)(at::TensorIteratorBase&), at::native::cos_stub>::operator()<at::native::structured_cos_out&>(c10::DeviceType, at::native::structured_cos_out
&) /home/user/pytorch/aten/src/ATen/native/DispatchStub.h:158
    #21 0x3ff4f2edb3f in at::native::structured_cos_out::impl(at::Tensor const&, at::Tensor const&) /home/user/pytorch/aten/src/ATen/native/UnaryOps.cpp:330
    #22 0x3ff526ef739 in wrapper_CPU_cos /home/user/pytorch/build/aten/src/ATen/RegisterCPU.cpp:4307
    #23 0x3ff52c651d9 in operator() /home/user/pytorch/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13
    #24 0x3ff52c651d9 in call /home/user/pytorch/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:463
    #25 0x3ff5076df2f in at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&>(void*, c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) /home/user/pytorch/aten/src/ATen/core
/boxing/KernelFunction_impl.h:50
    #26 0x3ff5009a93f in at::Tensor c10::KernelFunction::call<at::Tensor, at::Tensor const&>(c10::OperatorHandle const&, c10::DispatchKeySet, at::Tensor const&) const /home/user/pytorch/aten/src/ATen/core
/boxing/KernelFunction_impl.h:103
    #27 0x3ff5009a93f in at::Tensor c10::Dispatcher::call<at::Tensor, at::Tensor const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)> const&, at::Tensor const&) const /home/user/pytorch/aten/s
rc/ATen/core/dispatch/Dispatcher.h:639
    #28 0x3ff5009a93f in c10::TypedOperatorHandle<at::Tensor (at::Tensor const&)>::call(at::Tensor const&) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:487
    #29 0x3ff5009a93f in at::_ops::cos::call(at::Tensor const&) /home/user/pytorch/build/aten/src/ATen/Operators_0.cpp:2215
    #30 0x3ff7d813741 in at::Tensor::cos() const /home/user/pytorch/build/aten/src/ATen/core/TensorBody.h:2107
    #31 0x3ff7dc0f2b7 in operator() /home/user/pytorch/torch/csrc/autograd/generated/python_torch_functions_2.cpp:2953
    #32 0x3ff7dc0faf7 in THPVariable_cos /home/user/pytorch/torch/csrc/autograd/generated/python_torch_functions_2.cpp:2955
    #33 0x3ffa5ef5ae1 in cfunction_call Objects/methodobject.c:543
    #34 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #35 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #36 0x3ffa5feb50d in do_call_core Python/ceval.c:5915
    #37 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #38 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #39 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #40 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #41 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #42 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #43 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #44 0x3ff7f87a393 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) /home/user/pytorch/
torch/csrc/utils/python_dispatch.cpp:175
    #45 0x3ff7f8871a7 in c10::BoxedKernel::makeFromFunctor<torch::impl::dispatch::PythonKernelHolder>(std::unique_ptr<torch::impl::dispatch::PythonKernelHolder, std::default_delete<torch::impl::dispatch::
PythonKernelHolder> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::operator()(c10::OperatorKernel*, c10::Op
eratorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/boxing/BoxedKernel_impl.h:87
    #46 0x3ff7f887261 in c10::BoxedKernel::makeFromFunctor<torch::impl::dispatch::PythonKernelHolder>(std::unique_ptr<torch::impl::dispatch::PythonKernelHolder, std::default_delete<torch::impl::dispatch::
PythonKernelHolder> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::Operator
Handle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) /home/user/pytorch/aten/src/ATen/core/boxing/BoxedKernel_impl.h:86
    #47 0x3ff7e0d10ab in c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/b
oxing/BoxedKernel_impl.h:41
    #48 0x3ff7e0d1459 in c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/cor
e/boxing/KernelFunction_impl.h:43
    #49 0x3ff7f876421 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:6
91
    #50 0x3ff4d22bcdd in c10::OperatorHandle::callBoxed(std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:417
    #51 0x3ff65a092d5 in c10::OperatorHandle::callBoxed(std::vector<c10::IValue, std::allocator<c10::IValue> >&) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:421
    #52 0x3ff65a05641 in operator() /home/user/pytorch/torch/csrc/jit/runtime/register_c10_ops.cpp:15
    #53 0x3ff65a08cb5 in __invoke_impl<void, torch::jit::(anonymous namespace)::createOperatorFromC10(const c10::OperatorHandle&)::<lambda(torch::jit::Stack&)>&, std::vector<c10::IValue, std::allocator<c1
0::IValue> >&> /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/invoke.h:61
    #54 0x3ff65a0897b in __invoke_r<void, torch::jit::(anonymous namespace)::createOperatorFromC10(const c10::OperatorHandle&)::<lambda(torch::jit::Stack&)>&, std::vector<c10::IValue, std::allocator<c10::
IValue> >&> /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/invoke.h:111
    #55 0x3ff65a084e1 in _M_invoke /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/std_function.h:290
    #56 0x3ff7eb2cb21 in std::function<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&)>::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&) const /usr/lib/gcc/s390x-ibm-lin
ux-gnu/11/include/g++-v11/bits/std_function.h:590
    #57 0x3ff7eb1b659 in torch::jit::Operation::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /home/user/pytorch/aten/src/ATen/core/stack.h:41
    #58 0x3ff7eb08449 in torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11::
kwargs const&, c10::optional<c10::DispatchKey>) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:764
    #59 0x3ff7eb09d85 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol,
pybind11::args, pybind11::kwargs const&, bool, c10::optional<c10::DispatchKey>) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:829
    #60 0x3ff7e573eb9 in operator() /home/user/pytorch/torch/csrc/jit/python/init.cpp:1549
    #61 0x3ff7e6728dd in call_impl<pybind11::object, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&, 0, 1, pybind11::detail::vo
id_type> /home/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1439
    #62 0x3ff7e64312f in call<pybind11::object, pybind11::detail::void_type, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&> /h
ome/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1408
    #63 0x3ff7e5da259 in operator() /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:249
    #64 0x3ff7e5da441 in _FUN /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:224
    #65 0x3ff7d317a1f in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:929
    #66 0x3ffa5ef5ae1 in cfunction_call Objects/methodobject.c:543
    #67 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #68 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #69 0x3ffa5feb50d in do_call_core Python/ceval.c:5915
    #70 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #71 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #72 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #73 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #74 0x3ffa5e83d1f in _PyObject_FastCallDictTstate Objects/call.c:142
    #75 0x3ffa5e84937 in _PyObject_Call_Prepend Objects/call.c:431
    #76 0x3ffa5f2f577 in slot_tp_call Objects/typeobject.c:7494
    #77 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #78 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #79 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #80 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #81 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #82 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #83 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #84 0x3ffa5fd76a3 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #85 0x3ffa5fd772f in PyObject_Vectorcall Include/cpython/abstract.h:123
    #86 0x3ffa5feb289 in call_function Python/ceval.c:5891
    #87 0x3ffa5fe5c3b in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #88 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #89 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #90 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #91 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #92 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #93 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #94 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #95 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #96 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #97 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #98 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #99 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #100 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #101 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #102 0x3ff7f87a393 in torch::impl::dispatch::PythonKernelHolder::operator()(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) /home/user/pytorch
/torch/csrc/utils/python_dispatch.cpp:175
    #103 0x3ff7f8871a7 in c10::BoxedKernel::makeFromFunctor<torch::impl::dispatch::PythonKernelHolder>(std::unique_ptr<torch::impl::dispatch::PythonKernelHolder, std::default_delete<torch::impl::dispatch:
:PythonKernelHolder> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::operator()(c10::OperatorKernel*, c10::O
peratorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/boxing/BoxedKernel_impl.h:87
    #104 0x3ff7f887261 in c10::BoxedKernel::makeFromFunctor<torch::impl::dispatch::PythonKernelHolder>(std::unique_ptr<torch::impl::dispatch::PythonKernelHolder, std::default_delete<torch::impl::dispatch:
:PythonKernelHolder> >)::{lambda(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)#1}::_FUN(c10::OperatorKernel*, c10::Operato
rHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) /home/user/pytorch/aten/src/ATen/core/boxing/BoxedKernel_impl.h:86
    #105 0x3ff7e0d10ab in c10::BoxedKernel::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/
boxing/BoxedKernel_impl.h:41
    #106 0x3ff7e0d1459 in c10::KernelFunction::callBoxed(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/co
re/boxing/KernelFunction_impl.h:43
    #107 0x3ff7f876421 in c10::Dispatcher::callBoxed(c10::OperatorHandle const&, std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:
691
    #108 0x3ff4d22bcdd in c10::OperatorHandle::callBoxed(std::vector<c10::IValue, std::allocator<c10::IValue> >*) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:417
    #109 0x3ff65a092d5 in c10::OperatorHandle::callBoxed(std::vector<c10::IValue, std::allocator<c10::IValue> >&) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:421
    #110 0x3ff65a05641 in operator() /home/user/pytorch/torch/csrc/jit/runtime/register_c10_ops.cpp:15
    #111 0x3ff65a08cb5 in __invoke_impl<void, torch::jit::(anonymous namespace)::createOperatorFromC10(const c10::OperatorHandle&)::<lambda(torch::jit::Stack&)>&, std::vector<c10::IValue, std::allocator<c
10::IValue> >&> /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/invoke.h:61
    #112 0x3ff65a0897b in __invoke_r<void, torch::jit::(anonymous namespace)::createOperatorFromC10(const c10::OperatorHandle&)::<lambda(torch::jit::Stack&)>&, std::vector<c10::IValue, std::allocator<c10:
:IValue> >&> /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/invoke.h:111
    #113 0x3ff65a084e1 in _M_invoke /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/std_function.h:290
    #114 0x3ff7eb2cb21 in std::function<void (std::vector<c10::IValue, std::allocator<c10::IValue> >&)>::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&) const /usr/lib/gcc/s390x-ibm-li
nux-gnu/11/include/g++-v11/bits/std_function.h:590
    #115 0x3ff7eb1b659 in torch::jit::Operation::operator()(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /home/user/pytorch/aten/src/ATen/core/stack.h:41
    #116 0x3ff7eb08449 in torch::jit::invokeOperatorFromPython(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, pybind11::args, pybind11:
:kwargs const&, c10::optional<c10::DispatchKey>) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:764
    #117 0x3ff7eb09d85 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol,
 pybind11::args, pybind11::kwargs const&, bool, c10::optional<c10::DispatchKey>) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:829
    #118 0x3ff7e573eb9 in operator() /home/user/pytorch/torch/csrc/jit/python/init.cpp:1549
    #119 0x3ff7e6728dd in call_impl<pybind11::object, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&, 0, 1, pybind11::detail::v
oid_type> /home/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1439
    #120 0x3ff7e64312f in call<pybind11::object, pybind11::detail::void_type, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&> /
home/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1408
    #121 0x3ff7e5da259 in operator() /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:249
    #122 0x3ff7e5da441 in _FUN /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:224
    #123 0x3ff7d317a1f in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:929
    #124 0x3ffa5ef5ae1 in cfunction_call Objects/methodobject.c:543
    #125 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #126 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #127 0x3ffa5feb50d in do_call_core Python/ceval.c:5915
    #128 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #129 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #130 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #131 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #132 0x3ffa5e83d1f in _PyObject_FastCallDictTstate Objects/call.c:142
    #133 0x3ffa5e84937 in _PyObject_Call_Prepend Objects/call.c:431
    #134 0x3ffa5f2f577 in slot_tp_call Objects/typeobject.c:7494
    #135 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #136 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #137 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #138 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #139 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #140 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #141 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #142 0x3ffa5e87d2b in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #143 0x3ffa5e882dd in method_vectorcall Objects/classobject.c:83
    #144 0x3ffa5e836d3 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #145 0x3ffa5e84b6f in _PyObject_CallFunctionVa Objects/call.c:485
    #146 0x3ffa5e84f2d in callmethod Objects/call.c:557
    #147 0x3ffa5e85039 in PyObject_CallMethod Objects/call.c:577
    #148 0x3ff7f7efa05 in torch::handle_torch_function_no_python_arg_parser(c10::ArrayRef<pybind11::handle>, _object*, _object*, char const*, _object*, char const*, torch::TorchFunctionName) /home/user/py
torch/torch/csrc/utils/python_arg_parser.cpp:338
    #149 0x3ff7eb09b67 in torch::jit::_get_operation_for_overload_or_packet(std::vector<std::shared_ptr<torch::jit::Operator>, std::allocator<std::shared_ptr<torch::jit::Operator> > > const&, c10::Symbol,
 pybind11::args, pybind11::kwargs const&, bool, c10::optional<c10::DispatchKey>) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:827
    #150 0x3ff7e573eb9 in operator() /home/user/pytorch/torch/csrc/jit/python/init.cpp:1549
    #151 0x3ff7e6728dd in call_impl<pybind11::object, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&, 0, 1, pybind11::detail::v
oid_type> /home/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1439
    #152 0x3ff7e64312f in call<pybind11::object, pybind11::detail::void_type, torch::jit::initJITBindings(PyObject*)::<lambda(const string&, const string&)>::<lambda(pybind11::args, pybind11::kwargs)>&> /
home/user/pytorch/third_party/pybind11/include/pybind11/cast.h:1408
    #153 0x3ff7e5da259 in operator() /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:249
    #154 0x3ff7e5da441 in _FUN /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:224
    #155 0x3ff7d317a1f in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/third_party/pybind11/include/pybind11/pybind11.h:929
    #156 0x3ffa5ef5ae1 in cfunction_call Objects/methodobject.c:543
    #157 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #158 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #159 0x3ffa5feb50d in do_call_core Python/ceval.c:5915
    #160 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #161 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #162 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #163 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #164 0x3ffa5e83d1f in _PyObject_FastCallDictTstate Objects/call.c:142
    #165 0x3ffa5e84937 in _PyObject_Call_Prepend Objects/call.c:431
    #166 0x3ffa5f2f577 in slot_tp_call Objects/typeobject.c:7494
    #167 0x3ffa5e84027 in _PyObject_MakeTpCall Objects/call.c:215
    #168 0x3ffa5fd767b in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #169 0x3ffa5fd772f in PyObject_Vectorcall Include/cpython/abstract.h:123
    #170 0x3ffa5feb289 in call_function Python/ceval.c:5891
    #171 0x3ffa5fe5ad1 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #172 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #173 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #174 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #175 0x3ffa5fd76a3 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #176 0x3ffa5fd772f in PyObject_Vectorcall Include/cpython/abstract.h:123
    #177 0x3ffa5feb289 in call_function Python/ceval.c:5891
    #178 0x3ffa5fe5c3b in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #179 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #180 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #181 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #182 0x3ffa5e8427f in PyVectorcall_Call Objects/call.c:267
    #183 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #184 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #185 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #186 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #187 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #188 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #189 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #190 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #191 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #192 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #193 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #194 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #195 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #196 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #197 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #198 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #199 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #200 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #201 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #202 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #203 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #204 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #205 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #206 0x3ffa5e841fb in PyVectorcall_Call Objects/call.c:255
    #207 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #208 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #209 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #210 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #211 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #212 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #213 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #214 0x3ffa5e83d1f in _PyObject_FastCallDictTstate Objects/call.c:142
    #215 0x3ffa5e84937 in _PyObject_Call_Prepend Objects/call.c:431
    #216 0x3ffa5f2f577 in slot_tp_call Objects/typeobject.c:7494
    #217 0x3ffa5e843f3 in _PyObject_Call Objects/call.c:305
    #218 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #219 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #220 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #221 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #222 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #223 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #224 0x3ffa5fd76a3 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #225 0x3ffa5fd772f in PyObject_Vectorcall Include/cpython/abstract.h:123
    #226 0x3ffa5feb289 in call_function Python/ceval.c:5891
    #227 0x3ffa5fe5b21 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #228 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #229 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #230 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #231 0x3ffa5e8427f in PyVectorcall_Call Objects/call.c:267
    #232 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #233 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #234 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #235 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #236 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #237 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #238 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #239 0x3ffa5e8427f in PyVectorcall_Call Objects/call.c:267
    #240 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #241 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #242 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #243 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #244 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #245 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #246 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #247 0x3ffa5e8427f in PyVectorcall_Call Objects/call.c:267
    #248 0x3ffa5e84347 in _PyObject_Call Objects/call.c:290
    #249 0x3ffa5e84483 in PyObject_Call Objects/call.c:317
    #250 0x3ffa5feb7cf in do_call_core Python/ceval.c:5943
    #251 0x3ffa5fe6019 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #252 0x3ffa5fd7aed in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #253 0x3ffa5fe8ba9 in _PyEval_Vector Python/ceval.c:5065
    #254 0x3ffa5e8459b in _PyFunction_Vectorcall Objects/call.c:342
    #255 0x3ffa5e8427f in PyVectorcall_Call Objects/call.c:267

0x03ff70f54570 is located 0 bytes to the right of global variable 'Sleef_rempitabsp' defined in '/home/user/pytorch/third_party/sleef/src/libm/rempitab.c:986:34' (0x3ff70f53f00) of size 1648
SUMMARY: AddressSanitizer: global-buffer-overflow /home/user/pytorch/third_party/sleef/src/arch/helpers390x_128.h:129 in vgather_vf_p_vi2
Shadow bytes around the buggy address:
  0x10007fee1ea850: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea860: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea870: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea890: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x10007fee1ea8a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00[f9]f9
  0x10007fee1ea8b0: f9 f9 f9 f9 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea8c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea8d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea8e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10007fee1ea8f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==2030580==ABORTING
```
</details>

It reproduces when running `pytest -v test/test_ops.py -k test_python_ref__refs_cos_cpu_bfloat16` under address sanitizer on s390x.
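
A tiny direct-repro sketch for the affected path (the pytest command above is the authoritative reproducer; this just exercises cos on a bfloat16 CPU tensor, which routes through the vectorized kernel in question):

```python
import torch

x = torch.randn(1024, dtype=torch.bfloat16)
y = torch.cos(x)
```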

See also: https://github.com/shibatch/sleef/issues/464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102266
Approved by: https://github.com/malfet
2023-05-26 20:59:42 +00:00
0ed22fce97 Merge type stubs torch nn parallel (#102194)
Fixes merge issue for #101528

In the above PR, `torch.nn.parallel.parallel_apply.get_a_var` was marked private to appease the [public interface linter](https://github.com/pytorch/pytorch/actions/runs/4999216467/jobs/8955582204#step:14:21666): ceeb242bc7

This broke CI pipelines running external dependencies that expected `get_a_var`'s name to not change. In this PR, we change the name back to `get_a_var` and include it in the `__all__` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102194
Approved by: https://github.com/ezyang
2023-05-26 20:10:47 +00:00
7b6438da9e [Dynamo] Fix if condition on NNModuleVariable (#102335)
Fixes #102315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102335
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-05-26 17:00:43 +00:00
3469f100f3 support ConvUnary in Inductor cpp wrapper (#101392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101392
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/EikanWang
2023-05-26 15:52:06 +00:00
0d5b74da0c [pt2] add SymInt support for linalg.pinv (#102367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102367
Approved by: https://github.com/lezcano
2023-05-26 15:20:34 +00:00
8751002215 equality assertions (#102256)
Previously we had runtime asserts for range constraints. This diff adds runtime asserts for equality constraints.

This requires a bit of refactoring that is worth calling out.
1. [Minor] Some of the data structures produced by export and consumed by the runtime assertion pass need to be broadened. This is a WIP. There are some associated code improvements that are included in this diff, but by and large the structures are similar to what exists now. Meanwhile @angelayi and I are chatting about how to make it qualitatively better: briefly, we want to index everything by symbols, which are 1-1 with (name, dim) pairs.
2. [Major] The order in which runtime asserts are emitted is changed. Previously we used to do the work in `placeholder`; now this diff adds a hook for "post-processing" after all placeholders have been processed. This is needed because equality constraints can mention different placeholders. This change also opens the way to optimizing codegen: e.g., each (name, dim) pair should correspond to a single intermediate variable that is reused across runtime asserts. This is future work.
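
For illustration only, this is the flavor of runtime check such a pass emits when two input dimensions are constrained to be equal; the actual generated code and error message differ:

```python
def _assert_equality_constraint(x, y):
    # hypothetical emitted check for a constraint x.shape[0] == y.shape[0]
    if x.shape[0] != y.shape[0]:
        raise RuntimeError(
            f"Equality constraint violated: expected x.shape[0] == y.shape[0], "
            f"got {x.shape[0]} and {y.shape[0]}"
        )
```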

Differential Revision: [D46177642](https://our.internmc.facebook.com/intern/diff/D46177642/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102256
Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi
2023-05-26 14:57:31 +00:00
9b5e4c308c [PT2][Quant][BE] Apply formatting to test_quantize_pt2e (#102275)
Summary: Just a formatting diff

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D45948056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102275
Approved by: https://github.com/andrewor14
2023-05-26 14:24:34 +00:00
efd774a295 Document faster builds for C++ changes (#102316)
Update `CONTRIBUTING.md` with a tip on how to avoid rebuilding/copying libs every time one makes a small change to the native code.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at f5e8394</samp>

> _`setup.py` docs_
> _Link to source and build dirs_
> _Winter of testing_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102316
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-05-26 14:11:08 +00:00
c05a317371 Bump requests from 2.30.0 to 2.31.0 in /tools/build/bazel (#102059)
* Bump requests from 2.30.0 to 2.31.0 in /tools/build/bazel

Bumps [requests](https://github.com/psf/requests) from 2.30.0 to 2.31.0.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.30.0...v2.31.0)

---
updated-dependencies:
- dependency-name: requests
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Apply suggestions from code review

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikita Shulga <nshulga@meta.com>
2023-05-26 07:01:22 -07:00
6c9b94dcda Revert "add additional stream priority for cuda streams (#101956)"
This reverts commit 5da497cabbbef96061a7840ea7e5f10730ccc2a0.

Reverted https://github.com/pytorch/pytorch/pull/101956 on behalf of https://github.com/osalpekar due to Broke internal builds that used -Wunused-function since this PR removed the call to StreamIdType::<< ([comment](https://github.com/pytorch/pytorch/pull/101956#issuecomment-1563875493))
2023-05-26 06:35:23 +00:00
3dfa755a1f [MTPG] Enable for some tests in test_fsdp_misc (#102043)
Enables MTPG for some FSDP tests in this file. Tests that need the
backward pass and warning logging are left as follow-up work.

Backward pass issue: It seems that there is a hang with all_gather. Will sync with @kumpera on this.

Warning issue: We have a couple of tests that regex-check warnings, but in the
multithreaded scenario these warnings are somehow not logged.

Differential Revision: [D43209769](https://our.internmc.facebook.com/intern/diff/D43209769/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102043
Approved by: https://github.com/awgu
2023-05-26 06:21:25 +00:00
ce41faa2ae Add cpp.max_horizontal_fusion_size to control the granularity of horizontal fusion (#99828)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99828
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-05-26 05:20:49 +00:00
e1dc793ef0 [vision hash update] update the pinned vision hash (#102318)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102318
Approved by: https://github.com/pytorchbot
2023-05-26 04:22:10 +00:00
fb468b6792 [ONNX] Support aten::scatter_reduce (#102048)
Fixes #84260

`reduce='mean'` is not supported, as it's not in ONNX spec (https://github.com/onnx/onnx/issues/5100)
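
As a rough illustration (not taken from this PR's tests), exporting a module that calls `scatter_reduce` with a supported reduction could look like the sketch below; the opset version and shapes here are assumptions.

```python
import torch

class ScatterReduce(torch.nn.Module):
    def forward(self, x, index, src):
        # 'mean' is the one reduction that cannot be exported (no ONNX equivalent)
        return x.scatter_reduce(0, index, src, reduce="sum")

x = torch.zeros(5)
index = torch.tensor([0, 1, 2])
src = torch.ones(3)
torch.onnx.export(ScatterReduce(), (x, index, src), "scatter_reduce.onnx", opset_version=18)
```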
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102048
Approved by: https://github.com/abock
2023-05-26 02:51:41 +00:00
ef13fde290 Increase mem eff backward performance (#101847)
# Summary
This is another upstream which is much smaller than the previous.
This bumps the kernel versions from  xformers
Current: [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](6425fd0cac)
With this PR: [1d635e193e169fc677b2e7fa42dad7ebe88eec9e](1d635e193e)

### Notable Changes:
- Drastically improve the BW pass in multiple cases (especially when B*numHeads < 100)
- H100 Support: *Warning*: while these kernels have been added, we don't have the CI/CD machines to test them.
- Enables a deterministic mode.

## Specific Changes
- Updates to the backward kernel.
- Added num_splits_key, which we hard-code to -1. (This is another performance knob that we leave to the heuristic.)
- Update gen_code and kernels to produce h100 instantiations.

### Due Diligence Checks:
* CUDA_lib size: No changes in size

#### Performance
* Micro Benchmark: (batch_size: 1, num_heads=25, seq_len=4096, embed_dim = 64 | grid:[1,25,1]block: [128,1,1])
    * MemEfficientAttention Backward Kernel: 27.972 ms
    * After the updated Xformers code(https://github.com/pytorch/pytorch/pull/100583): 23.958 ms
    * With this PR: 4.085 ms
* Ran micro benchmarks on sdpa_forw().sum().backward() over a range of dtypes, and input shapes
   * Geo_mean increase -> 1.17x
   * Max increase -> 2.95x
   * min_increase -> 0.8x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101847
Approved by: https://github.com/cpuhrsch
2023-05-26 02:25:31 +00:00
6f464e0cf8 Invoke the bf16 load w/o #elements to bypass the temporary buffer allocation from the performance perspective. (#99822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99822
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-26 02:10:41 +00:00
c3550d8376 Add fast path for BF16 kernel if all the operations within the kernel support bf16 (#99814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99814
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-26 02:08:53 +00:00
68816e4fa9 Remove inplace buffers when original and mutation are both removed (#102289)
Currently if we have an inplaced buffer that's completely internal to a fused kernel and thus doesn't need to be allocated, we are still allocating it and sending unused argument to a kernel, because our analysis for removing buffers treats it separately (assuming that either original or mutated value are still needed).
This PR extends buffer removal to inplaced buffers that can be removed.

Generated kernel for e.g. ln changes from
```
def triton_(in_out_ptr0, in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
where in_out_ptr0 is unused in the kernel to
```
def triton_(in_out_ptr1, in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, xnumel, rnumel, XBLOCK : tl.constexpr):
```
and corresponding allocation/reuse lines in the wrapper are removed.
The `in_out_ptr1` is also mislabeled - it's not `in_out`, it's only written to, but this PR doesn't fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102289
Approved by: https://github.com/jansel
2023-05-26 02:06:36 +00:00
0db704d240 [OpInfo] Add multi_head_attention_forward (#100153)
### <samp>🤖 Generated by Copilot at 8f8d620</samp>

This pull request improves the testing of the `nn.functional.multi_head_attention_forward` function by adding it to the `OpInfo` framework, adjusting the tolerance and skipping criteria for some test cases, and restricting the dtype for the `MetaProgrammingSystem` tests. These changes aim to address the randomness and numerical precision issues of the function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100153
Approved by: https://github.com/drisspg
2023-05-26 01:58:17 +00:00
8aa48315de Revert "Disallow _foreach_utils.py, but allow it to be inlined (#102221)"
This reverts commit 552299c42c45dda93e2a473639e092dae4d548b9.

Reverted https://github.com/pytorch/pytorch/pull/102221 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. It starts to break dynamo jobs in trunk 552299c42c and it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/102221#issuecomment-1563694599))
2023-05-26 01:27:19 +00:00
eqy
54f38381a0 [CUDA][DLPack] Try ~~bumping sleep interval~~ running on explicit side-stream for Windows dlpack test (#102283)
(attempted fix for Windows failure in #101318)
CC @huydhn

If this doesn't work, will try adding an explicit side stream in case that is causing the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102283
Approved by: https://github.com/huydhn
2023-05-26 00:57:55 +00:00
b469ed72d0 Integrating new API usage metadata logger (#101762)
Summary: The new logger allows passing metadata into the api usage logger. The immediate use case is to pass the serialization_id to the save and load events to enable tracking serialized models in API events. It could be extended to add more metadata in the future.

Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```

Reviewed By: davidberard98

Differential Revision: D45683697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101762
Approved by: https://github.com/davidberard98
2023-05-26 00:24:26 +00:00
ae5606bb2f Make test_inductor_collectives use self.assert* (#102274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102274
Approved by: https://github.com/wanchaol, https://github.com/voznesenskym
2023-05-26 00:02:02 +00:00
b628eb524b simplify BinaryDivFloorKernel.cu code (#102168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102168
Approved by: https://github.com/ngimel
2023-05-25 23:52:35 +00:00
552299c42c Disallow _foreach_utils.py, but allow it to be inlined (#102221)
This function should not be allowed, but should be inlineable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102221
Approved by: https://github.com/anijain2305
2023-05-25 23:48:36 +00:00
de7ec2ddd7 [MPS] Allow saved models to be loaded directly to MPS through torch.jit.load (#102204)
### <samp>🤖 Generated by Copilot at 94eed69</samp>

This pull request adds support for serializing and deserializing tensors on the `mps` device using JIT. It includes a test case in `test/test_mps.py` and a device handling logic in `torch/csrc/jit/serialization/unpickler.cpp`.

Fixes https://github.com/pytorch/pytorch/issues/88820, https://github.com/pytorch/pytorch/issues/87504
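
A minimal usage sketch (the file name is a placeholder): with this change a TorchScript archive saved from an `mps` device can be loaded straight back onto MPS.

```python
import torch

# Hedged sketch: deserialize a scripted model directly onto the MPS device.
loaded = torch.jit.load("model.pt", map_location="mps")  # "model.pt" is a placeholder path
out = loaded(torch.randn(2, 3, device="mps"))
```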
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102204
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-25 23:32:29 +00:00
836798e0f3 [inductor] Support precomputed_sizes in CppWrapperCodeGen (#102083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102083
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-05-25 23:14:28 +00:00
053dff1111 [ONNX] Bump ORT version to 1.15.0 (#102248)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102248
Approved by: https://github.com/abock
2023-05-25 23:11:52 +00:00
3c77310752 fix benchmarks/dynamo/runner.py (#102311)
Benchmark performance CSVs can now contain `infra_error` strings, leading to failed parses. Fix by converting such strings in the data to 0.
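
A minimal sketch of the kind of coercion described above (the column name and file path are assumptions, not the actual runner.py code):

```python
import pandas as pd

df = pd.read_csv("performance.csv")  # hypothetical benchmark output
# Non-numeric entries such as "infra_error" become NaN, then 0.
df["speedup"] = pd.to_numeric(df["speedup"], errors="coerce").fillna(0.0)
```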

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102311
Approved by: https://github.com/yanboliang
2023-05-25 22:42:03 +00:00
0d17bd5fa4 DOC Fixes unpacking issue in dynamo explain docs (#101761)
This PR updates the docs to be consistent with `torch.explain` which currently returns 6 items:

bfb3941ad8/torch/_dynamo/eval_frame.py (L622-L629)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101761
Approved by: https://github.com/desertfire
2023-05-25 22:32:15 +00:00
5b01c8dc6a fix functorch/test_ops.py test_vjp flash attention unexpected success (#102131)
add isSm90 check for expected failure in nn.functional.scaled_dot_product_attention in functorch/test_ops.py

Fixes #102029

Uses solution https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560052965 which was verified by
https://github.com/pytorch/pytorch/issues/102029#issuecomment-1560071148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102131
Approved by: https://github.com/zou3519
2023-05-25 22:17:25 +00:00
184d4f1ba3 [ez] add docs/source/compile/generated/ to .gitignore (#101094)
as titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101094
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-05-25 21:52:26 +00:00
80f7264804 Foreach kernel codegen in inductor (#99975)
[design doc](https://docs.google.com/document/d/1JLr5yMAR8TuKW78ixKeqzfDHhcazwxKo_JXQnP_-wyY/edit?kh_source=GDOCS#heading=h.8x4z4mmet3im)

Add foreach kernel codegen for a single overload of foreach add in Inductor. Coverage will expand to more ops in subsequent PRs.

[example](https://gist.github.com/mlazos/9606fe64100ea2a5ec8265df1739fbe2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99975
Approved by: https://github.com/jansel
2023-05-25 21:48:41 +00:00
1f80b972a6 [CUDAGraph Trees] Fix empty storages handling (#102273)
We don't need to handle managing their memory since they dont have any. Previously you would get error `RuntimeError: These storage data ptrs are not allocated in pool (0, 2) but should be {0}`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102273
Approved by: https://github.com/ngimel
2023-05-25 21:45:12 +00:00
c1db235040 [dynamo] fix module buffers call (#102251)
This PR fixes the module buffers call and extracts module.buffers in the same way as module.parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102251
Approved by: https://github.com/wconstab
2023-05-25 21:26:09 +00:00
d40f4f12f6 [dynamo] add itertools.chain support (#102247)
This PR adds itertools chain support to dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102247
Approved by: https://github.com/jansel
2023-05-25 21:26:09 +00:00
c2498d3deb Fixed indentation error in test_binary_ufuncs.py (#102244)
Fixes #102147

Move the code that calls _scalar_helper out of its definition scope. Otherwise test_div_and_floordiv_vs_python will test nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102244
Approved by: https://github.com/kit1980
2023-05-25 21:21:30 +00:00
080d86acfb [DCP] Add API logging for checkpoint high level API (#102278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102278
Approved by: https://github.com/fduwjj
2023-05-25 21:13:29 +00:00
bd39767408 Bump requests from 2.26 to 2.31.0 in /.github (#102057)
Bumps [requests](https://github.com/psf/requests) from 2.26 to 2.31.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.31.0</h2>
<h2>2.31.0 (2023-05-22)</h2>
<p><strong>Security</strong></p>
<ul>
<li>
<p>Versions of Requests between v2.3.0 and v2.30.0 are vulnerable to potential
forwarding of <code>Proxy-Authorization</code> headers to destination servers when
following HTTPS redirects.</p>
<p>When proxies are defined with user info (<a href="https://user:pass@proxy:8080">https://user:pass@proxy:8080</a>), Requests
will construct a <code>Proxy-Authorization</code> header that is attached to the request to
authenticate with the proxy.</p>
<p>In cases where Requests receives a redirect response, it previously reattached
the <code>Proxy-Authorization</code> header incorrectly, resulting in the value being
sent through the tunneled connection to the destination server. Users who rely on
defining their proxy credentials in the URL are <em>strongly</em> encouraged to upgrade
to Requests 2.31.0+ to prevent unintentional leakage and rotate their proxy
credentials once the change has been fully deployed.</p>
<p>Users who do not use a proxy or do not supply their proxy credentials through
the user information portion of their proxy URL are not subject to this
vulnerability.</p>
<p>Full details can be read in our <a href="https://github.com/psf/requests/security/advisories/GHSA-j8r2-6x86-q33q">Github Security Advisory</a>
and <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-32681">CVE-2023-32681</a>.</p>
</li>
</ul>
<h2>v2.30.0</h2>
<h2>2.30.0 (2023-05-03)</h2>
<p><strong>Dependencies</strong></p>
<ul>
<li>
<p>⚠️ Added support for urllib3 2.0. ⚠️</p>
<p>This may contain minor breaking changes so we advise careful testing and
reviewing <a href="https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html">https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html</a>
prior to upgrading.</p>
<p>Users who wish to stay on urllib3 1.x can pin to <code>urllib3&lt;2</code>.</p>
</li>
</ul>
<h2>v2.29.0</h2>
<h2>2.29.0 (2023-04-26)</h2>
<p><strong>Improvements</strong></p>
<ul>
<li>Requests now defers chunked requests to the urllib3 implementation to improve
standardization. (<a href="https://redirect.github.com/psf/requests/issues/6226">#6226</a>)</li>
<li>Requests relaxes header component requirements to support bytes/str subclasses. (<a href="https://redirect.github.com/psf/requests/issues/6356">#6356</a>)</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="147c8511dd"><code>147c851</code></a> v2.31.0</li>
<li><a href="74ea7cf7a6"><code>74ea7cf</code></a> Merge pull request from GHSA-j8r2-6x86-q33q</li>
<li><a href="3022253346"><code>3022253</code></a> test on pypy 3.8 and pypy 3.9 on windows and macos (<a href="https://redirect.github.com/psf/requests/issues/6424">#6424</a>)</li>
<li><a href="b639e66c81"><code>b639e66</code></a> test on py3.12 (<a href="https://redirect.github.com/psf/requests/issues/6448">#6448</a>)</li>
<li><a href="d3d504436e"><code>d3d5044</code></a> Fixed a small typo (<a href="https://redirect.github.com/psf/requests/issues/6452">#6452</a>)</li>
<li><a href="2ad18e0e10"><code>2ad18e0</code></a> v2.30.0</li>
<li><a href="f2629e9e3c"><code>f2629e9</code></a> Remove strict parameter (<a href="https://redirect.github.com/psf/requests/issues/6434">#6434</a>)</li>
<li><a href="87d63de873"><code>87d63de</code></a> v2.29.0</li>
<li><a href="51716c4ef3"><code>51716c4</code></a> enable the warnings plugin (<a href="https://redirect.github.com/psf/requests/issues/6416">#6416</a>)</li>
<li><a href="a7da1ab349"><code>a7da1ab</code></a> try on ubuntu 22.04 (<a href="https://redirect.github.com/psf/requests/issues/6418">#6418</a>)</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.26.0...v2.31.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=requests&package-manager=pip&previous-version=2.26&new-version=2.31.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102057
Approved by: https://github.com/huydhn
2023-05-25 21:06:44 +00:00
870880236b Enables configuration of NCCL communicators (#97394)
NCCL 2.17+ introduces some user configurable parameters for NCCL communicators using [ncclConfig_t](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/types.html#c.ncclConfig_t) datatype and [ncclCommInitRankConfig](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcomminitrankconfig). This PR enables that feature.

A user can tune the parameters as follows:
```
import torch.distributed as dist
nccl_options = dist.ProcessGroupNCCL.Options()
nccl_options.config.max_ctas = 32
nccl_options.config.min_ctas = 8
nccl_options.config.cga_cluster_size = 2
dist.init_process_group(backend='nccl', init_method='env://', pg_options=nccl_options)
my_group = dist.new_group(pg_options=nccl_options)
```

The default values of these parameters are what is initialized by `NCCL_CONFIG_INITIALIZER`. Only for DistributedDataParallel, this PR sets the default value of cga_cluster_size to 2 (a heuristic that works well especially for DDP workloads).

Tuning these parameters can lead to improvement in end-to-end performance, since it affects the communication-computation overlap for NCCL kernels.

CC: @ptrblck @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97394
Approved by: https://github.com/kwen2501
2023-05-25 20:46:19 +00:00
3cae6d2493 Make exir passes work with map_impl HigherOrderOperator. (#102009)
Summary: Forward fix for t53725825. The new map implementation breaks multiple internal tests; this forward-fixes some of them. To unblock the others, mark the unfixed ones as expectedFailure first.

Test Plan: Test with CI.

Reviewed By: angelayi

Differential Revision: D46084287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102009
Approved by: https://github.com/angelayi
2023-05-25 20:00:51 +00:00
ee33bae5c7 Fix an issue where checking sameness throw an exception (#102279)
Summary: currently the exception is caught outside and marked as an infra_error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102279
Approved by: https://github.com/anijain2305
2023-05-25 19:49:23 +00:00
d64ec82d15 Turn on padding (#101915)
🚀 🚀 🚀

Turns on torchinductor mm padding. Gives a 4% HF training win at a 5s compilation time increase. Results for mm tuning are cached.
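
For readers who want to toggle this themselves, a hedged sketch follows; the flag name `shape_padding` is an assumption (check `torch/_inductor/config.py` for the actual knob).

```python
import torch
import torch._inductor.config as inductor_config

# Assumed flag name; this PR flips the default to enabled.
inductor_config.shape_padding = True

@torch.compile
def f(a, b):
    return a @ b
```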

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101915
Approved by: https://github.com/jansel
2023-05-25 18:54:28 +00:00
0833f475ce Cache mm padding decision (#102200)
Rebase of https://github.com/pytorch/pytorch/pull/100982 which was already accepted

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102200
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-05-25 17:57:28 +00:00
375446a0ea [fix opinfo] empty_strided (#102088)
Fixes #102024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102088
Approved by: https://github.com/ngimel
2023-05-25 17:39:06 +00:00
ed87508b32 [quant][pt2e] Add support for SharedQuantizationSpec (#102184)
Summary:
This PR adds support for SharedQuantizationSpec. It is used to express sharing between
two Tensors in the prepared graph; each shared Tensor is either the input of some node (expressed as a tuple of fx nodes) or
the output of some node (expressed as an fx Node).

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- 'caffe2/test:quantization_pt2e'
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Differential Revision: D46043026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102184
Approved by: https://github.com/kimishpatel, https://github.com/leslie-fang-intel
2023-05-25 17:31:59 +00:00
fab49823a5 Skip bandwidth bound mms (#102199)
Speeds up compilation time, and was particularly needed for cm3leon_generate which has a ton of small matmuls of different sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102199
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-05-25 17:29:49 +00:00
9aaa12e328 Move mm padding to pattern matcher (#101913)
There are a few reasons for this:

1. When I tried to enable padding via decompositions, I ran into weird errors with a number of models. I believe this was because we were making the type of a regular tensor a fake tensor.
2. This gives us flexibility to go before or after other graph passes
3. We can now also reason about the cost of the padding, and whether or not it can be fused since we have access to the graph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101913
Approved by: https://github.com/ngimel
2023-05-25 17:21:18 +00:00
0bb2b01541 Add forward mode AD to in-place foreach functions (#100695)
Awkwardly implement fwd AD by
- adding a few `CodeTemplate`s
- allowing for the cases where a variable is initialized with i-th element of TensorList

<!--
### TODOs:
- [x] ~~remove the first `_any_has_forward_grad_self`~~ make it a vector of bool
- [ ] clean up mapping of names from reference impl to foreach impl
- [x] add tests
-->

### Rel:
- #58833
- #96405

---

`_foreach_addcmul_.ScalarList` from `VariableType`

```c++
void _foreach_addcmul__ScalarList(c10::DispatchKeySet ks, at::TensorList self, at::TensorList tensor1, at::TensorList tensor2, at::ArrayRef<at::Scalar> scalars) {
  auto self_ = unpack(self, "self", 0);
  auto tensor1_ = unpack(tensor1, "tensor1", 1);
  auto tensor2_ = unpack(tensor2, "tensor2", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, tensor1, tensor2 );

  std::vector<bool> _any_has_forward_grad_self(self.size());
  for (const auto& i : c10::irange(self.size())) {
    _any_has_forward_grad_self[i] = isFwGradDefined(self[i]) || isFwGradDefined(tensor1[i]) || isFwGradDefined(tensor2[i]);
  }
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<AddcmulBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i], tensor1[i], tensor2[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<AddcmulBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<AddcmulBackward0>(new AddcmulBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i], tensor1[i], tensor2[i] ));
              return grad_fn;
          }
      }());
    }
    if (!grad_fns.empty()) {

        for (const auto& i : c10::irange(grad_fns.size())) {
            auto grad_fn = grad_fns[i];
            if (grad_fn != nullptr) {
                grad_fn->self_scalar_type = self[i].scalar_type();
                grad_fn->tensor1_scalar_type = tensor1[i].scalar_type();
                if (grad_fn->should_compute_output(1)) {
                  grad_fn->tensor2_ = SavedVariable(tensor2[i], false);
                }
                grad_fn->value = scalars[i];
                if (grad_fn->should_compute_output(2)) {
                  grad_fn->tensor1_ = SavedVariable(tensor1[i], false);
                }
                grad_fn->tensor2_scalar_type = tensor2[i].scalar_type();
            }
        }
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor1__storage_saved(tensor1_.size());
  for (const Tensor& tensor : tensor1_)
    tensor1__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor1__impl_saved(tensor1_.size());
  for (size_t i=0; i<tensor1_.size(); i++)
    if (tensor1_[i].defined()) tensor1__impl_saved[i] = tensor1_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor2__storage_saved(tensor2_.size());
  for (const Tensor& tensor : tensor2_)
    tensor2__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor2__impl_saved(tensor2_.size());
  for (size_t i=0; i<tensor2_.size(); i++)
    if (tensor2_[i].defined()) tensor2__impl_saved[i] = tensor2_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_addcmul_(ks & c10::after_autograd_keyset, self_, tensor1_, tensor2_, scalars);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__storage_saved[i].value().is_alias_of(tensor1_[i].storage()));
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__impl_saved[i] == tensor1_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__storage_saved[i].value().is_alias_of(tensor2_[i].storage()));
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__impl_saved[i] == tensor2_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
  std::vector<c10::optional<at::Tensor>> self_new_fw_grad_opts(self.size(), c10::nullopt);
  for (const auto& i : c10::irange(self_new_fw_grad_opts.size())) {
    if (_any_has_forward_grad_self[i]) {
        auto self_t_raw = toNonOptFwGrad(self[i]);
        auto self_tensor = toNonOptTensor(self[i]);
        auto self_t = (self_t_raw.defined() || !self_tensor.defined())
          ? self_t_raw : at::zeros(self_tensor.sizes(), self_tensor.options());
        auto tensor1_t_raw = toNonOptFwGrad(tensor1[i]);
        auto tensor1_tensor = toNonOptTensor(tensor1[i]);
        auto tensor1_t = (tensor1_t_raw.defined() || !tensor1_tensor.defined())
          ? tensor1_t_raw : at::_efficientzerotensor(tensor1_tensor.sizes(), tensor1_tensor.options());
        auto tensor1_p = toNonOptPrimal(tensor1[i]);
        auto tensor2_t_raw = toNonOptFwGrad(tensor2[i]);
        auto tensor2_tensor = toNonOptTensor(tensor2[i]);
        auto tensor2_t = (tensor2_t_raw.defined() || !tensor2_tensor.defined())
          ? tensor2_t_raw : at::_efficientzerotensor(tensor2_tensor.sizes(), tensor2_tensor.options());
        auto tensor2_p = toNonOptPrimal(tensor2[i]);
        self_t = GradMode::is_enabled() ? self_t.clone() : self_t;
        self_new_fw_grad_opts[i] = self_t_raw.defined() ? self_t_raw.copy_(self_t + maybe_multiply(tensor1_t * tensor2_p, scalars[i]) + maybe_multiply(tensor2_t * tensor1_p, scalars[i])) : self_t + maybe_multiply(tensor1_t * tensor2_p, scalars[i]) + maybe_multiply(tensor2_t * tensor1_p, scalars[i]);
    }
  }
  for (const auto& i : c10::irange(self_new_fw_grad_opts.size())) {
    auto& self_new_fw_grad_opt = self_new_fw_grad_opts[i];
    if (self_new_fw_grad_opt.has_value() && self_new_fw_grad_opt.value().defined() && self[i].defined()) {
      // The hardcoded 0 here will need to be updated once we support multiple levels.
      self[i]._set_fw_grad(self_new_fw_grad_opt.value(), /* level */ 0, /* is_inplace_op */ true);
    }
  }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100695
Approved by: https://github.com/soulitzer
2023-05-25 15:39:48 +00:00
6c7410ddc3 sampled_addmm: BSR support (#101163)
This PR implements a `sampled_addmm` kernel that works with a BSR mask.
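
For readers unfamiliar with the op: `sampled_addmm` evaluates `beta * input + alpha * (mat1 @ mat2)` only at the positions present in the sparse `input` mask (here a BSR tensor). A dense reference emulation, for illustration only:

```python
import torch

def sampled_addmm_reference(mask, mat1, mat2, beta=1.0, alpha=1.0):
    # Dense emulation for illustration: compute only where the mask is non-zero.
    dense = mask.to_dense()
    keep = dense != 0
    return torch.where(keep, beta * dense + alpha * (mat1 @ mat2), torch.zeros_like(dense))
```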

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101163
Approved by: https://github.com/cpuhrsch
2023-05-25 12:33:50 +00:00
4882cd0801 inductor: align cpp floordiv with python floordiv for dynamic shape path (#102068)
This PR does the following things:

- Align the C++ behavior with Python for FloorDiv (see the short illustration below).
- Always return the expr dtype for ops which do not use the expr's dtype to do the computation.

After this PR, the TIMM ```levit_128``` and ```volo_d1_224``` accuracy tests pass for the dynamic shape path.
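
To make the first point concrete: Python's `//` floors toward negative infinity, while naive C++ integer division truncates toward zero, so the generated C++ must floor explicitly for negative operands. A quick illustration:

```python
# Python floor division vs. C-style truncation on negative operands.
assert -7 // 2 == -4        # Python // floors toward -inf
assert int(-7 / 2) == -3    # truncation toward zero (what plain C++ "/" gives)
```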

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102068
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-05-25 10:18:45 +00:00
a896962f0a [fx][2/n] Add metadata to placeholders (#102195)
Summary:
# Context
In TorchRec's train pipeline, we need to fx trace a module to analyze the arguments on the forward call. In order to do this, we need to preserve some sort of meaning with each argument (a key or name of sorts that lets us identify the argument).

The issue is that when you use concrete args, fx will internally unflatten the arg into its constituents (to locate PHs).

Given a function that looks like this:
```
def process(batch: Dict[str, torch.Tensor]):
   ....

symbolic_trace(process, concrete_args: {"batch": {"f1": PH, "f2": PH}})

# function will be rewritten to look like:
def process(batch_1, batch_2):  # batch_1 -> "f1", batch_2->"f2"
  ...
```

When you traverse through the nodes of the graph, the names of the argument nodes to the function are batch_1 and batch_2. **This doesn't mean anything to the user who is fx tracing.** There isn't anything indicating that batch_1 corresponds to key "f1" in the batch input.

# Solution

When fx sees a "PH", it creates a proxy node.

The user does not have direct access to proxy creation, but only through the PH structure.

Attach a piece of metadata, `ph_key`, to the PH when you set it in the concrete args; it will get passed into proxy + node creation. So when you traverse the graph, this metadata sticks onto the node as an attribute. This way you can tag "batch_1" as "f1".

Test Plan: added a unit test

Reviewed By: dstaay-fb

Differential Revision: D44947653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102195
Approved by: https://github.com/PaliC
2023-05-25 07:04:20 +00:00
7b47cd0a6c [c10d] add fake pg necessary collectives (#102238)
This PR adds the collectives necessary for the fake process group, enabling an e2e FSDP run without multiprocessing or multithreading.
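
A hedged usage sketch (the import path and `FakeStore` helper are assumptions about where the fake backend lives; not taken from this PR's tests):

```python
import torch
import torch.distributed as dist
from torch.testing._internal.distributed.fake_pg import FakeStore  # assumed location

# Single process pretending to be rank 0 of a 2-rank job; collectives are no-ops.
dist.init_process_group(backend="fake", rank=0, world_size=2, store=FakeStore())
t = torch.ones(4)
dist.all_reduce(t)
```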
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102238
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
9a19262556 [c10d] consolidate barrier after init logic (#102237)
This PR consolidates the barrier-after-init logic to allow a custom
backend to set the env var when creating the pg, so that
`init_process_group` can skip the barrier
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102237
Approved by: https://github.com/ezyang
2023-05-25 05:01:16 +00:00
aa83a52742 Profiling doc (#101895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101895
Approved by: https://github.com/msaroufim, https://github.com/shunting314
2023-05-25 04:57:38 +00:00
818d92f58c Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-25 04:41:45 +00:00
3ca068bc44 Location-shift MKL Exponential Distribution (#101720)
Fixes #48841 , https://github.com/pytorch/pytorch/issues/101620
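
Context for the fix, as I understand the linked issues: `Tensor.exponential_` on the CPU (MKL) path could occasionally return exactly 0, which lies outside the distribution's support of (0, inf); location-shifting the MKL samples avoids that. A quick sanity check:

```python
import torch

x = torch.empty(10_000_000).exponential_(1.0)
assert (x > 0).all()  # support of Exponential(lambda) is (0, inf)
```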

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101720
Approved by: https://github.com/lezcano, https://github.com/ngimel, https://github.com/mingfeima, https://github.com/jgong5
2023-05-25 04:15:44 +00:00
d4380edb9b [TP] Add API logging for TP high level API (#102209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102209
Approved by: https://github.com/wz337, https://github.com/wanchaol
2023-05-25 03:33:00 +00:00
d4f711b0b5 do not raise when constraint locals are not in signature (#102198)
Summary: Fix forward for D46151668

Test Plan: none

Differential Revision: D46161799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102198
Approved by: https://github.com/angelayi
2023-05-25 03:16:00 +00:00
69c7f710ba Add meta registrations for some foreach ops (#102225)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102225
Approved by: https://github.com/ngimel
2023-05-25 02:59:11 +00:00
2f08f9a66f [vision hash update] update the pinned vision hash (#102230)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102230
Approved by: https://github.com/pytorchbot
2023-05-25 02:48:13 +00:00
2434a205de Support unary not on lists (#102210)
Fixes #ISSUE_NUMBER
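
A minimal sketch of the kind of code this unlocks (illustrative, not from the PR):

```python
import torch

@torch.compile
def f(xs, y):
    # `not` on a Python list (a truthiness test) is now understood by dynamo.
    if not xs:
        return y + 1
    return y + len(xs)
```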

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102210
Approved by: https://github.com/anijain2305
2023-05-25 02:45:36 +00:00
a0e44284de [pytorch] add Vulkan support for the aten::cat operator for 1d, 2d, 3d and 4d (#102128)
Summary: Implement `torch.cat(tensors, dim=0)`, which concatenates a given sequence of tensors in the given dimension, for Vulkan backend. See the behavior of the operator here: https://pytorch.org/docs/stable/generated/torch.cat.html
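
A hedged eager-mode sketch (requires a Vulkan-enabled build; whether `.to("vulkan")` is available depends on how PyTorch was compiled):

```python
import torch

a = torch.randn(2, 3, 4).to("vulkan")
b = torch.randn(2, 5, 4).to("vulkan")
out = torch.cat([a, b], dim=1).cpu()  # concatenation runs on the Vulkan backend
```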

Test Plan:
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*cat_*"
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 12.2 sec (100%) 471/471 jobs, 2/471 updated
  Total time: 12.2 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *cat_*
[==========] Running 40 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 40 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim0_invalidinputs_exceptions (73 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_samebatch_success
[       OK ] VulkanAPITest.cat_4d_dim0_samebatch_success (36 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_diffbatch_success
[       OK ] VulkanAPITest.cat_4d_dim0_diffbatch_success (20 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim0_singledepth_success (2 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_singletensor_success
[       OK ] VulkanAPITest.cat_4d_dim0_singletensor_success (4 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_twotensors_success
[       OK ] VulkanAPITest.cat_4d_dim0_twotensors_success (13 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim0_negdim_success (38 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim1_negdim_success (26 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim2_negdim_success (31 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_negdim_success
[       OK ] VulkanAPITest.cat_4d_dim3_negdim_success (30 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim1_singledepth_success (2 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_singletensor_success
[       OK ] VulkanAPITest.cat_4d_dim1_singletensor_success (4 ms)
[ DISABLED ] VulkanAPITest.DISABLED_cat_4d_dim1_twotensors_success
[ RUN      ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success
[       OK ] VulkanAPITest.cat_4d_dim1_bat1_mult4ch_success (4 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success
[       OK ] VulkanAPITest.cat_4d_dim1_bat2_mult4ch_success (7 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success
[       OK ] VulkanAPITest.cat_4d_dim1_mult4ch_mixed_success (19 ms)
[ DISABLED ] VulkanAPITest.DISABLED_cat_4d_dim1_mult4ch_nonmult4ch_success
[ RUN      ] VulkanAPITest.cat_4d_dim2_sameheight_success
[       OK ] VulkanAPITest.cat_4d_dim2_sameheight_success (23 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_diffheight_success
[       OK ] VulkanAPITest.cat_4d_dim2_diffheight_success (23 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_singledepth_success
[       OK ] VulkanAPITest.cat_4d_dim2_singledepth_success (2 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim2_invalidinputs_exceptions (23 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions
[       OK ] VulkanAPITest.cat_4d_dim3_invalidinputs_exceptions (23 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_samewidth_success
[       OK ] VulkanAPITest.cat_4d_dim3_samewidth_success (30 ms)
[ RUN      ] VulkanAPITest.cat_4d_dim3_diffwidth_success
[       OK ] VulkanAPITest.cat_4d_dim3_diffwidth_success (22 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_diff_channel_success
[       OK ] VulkanAPITest.cat_3d_dim0_diff_channel_success (8 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_same_channel_success
[       OK ] VulkanAPITest.cat_3d_dim0_same_channel_success (5 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_diffheight_success
[       OK ] VulkanAPITest.cat_3d_dim1_diffheight_success (7 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_same_height_success
[       OK ] VulkanAPITest.cat_3d_dim1_same_height_success (6 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_diffwidth_success
[       OK ] VulkanAPITest.cat_3d_dim2_diffwidth_success (9 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_samewidth_success
[       OK ] VulkanAPITest.cat_3d_dim2_samewidth_success (4 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim0_negdim_success (8 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim1_negdim_success (8 ms)
[ RUN      ] VulkanAPITest.cat_3d_dim2_negdim_success
[       OK ] VulkanAPITest.cat_3d_dim2_negdim_success (5 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_same_height_success
[       OK ] VulkanAPITest.cat_2d_dim0_same_height_success (2 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_diff_height_success
[       OK ] VulkanAPITest.cat_2d_dim0_diff_height_success (1 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_same_width_success
[       OK ] VulkanAPITest.cat_2d_dim1_same_width_success (1 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_diff_width_success
[       OK ] VulkanAPITest.cat_2d_dim1_diff_width_success (1 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_2d_dim0_negdim_success (1 ms)
[ RUN      ] VulkanAPITest.cat_2d_dim1_negdim_success
[       OK ] VulkanAPITest.cat_2d_dim1_negdim_success (2 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_same_width_success
[       OK ] VulkanAPITest.cat_1d_dim0_same_width_success (0 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_diff_width_success
[       OK ] VulkanAPITest.cat_1d_dim0_diff_width_success (0 ms)
[ RUN      ] VulkanAPITest.cat_1d_dim0_negdim_success
[       OK ] VulkanAPITest.cat_1d_dim0_negdim_success (0 ms)
[----------] 40 tests from VulkanAPITest (543 ms total)

[----------] Global test environment tear-down
[==========] 40 tests from 1 test suite ran. (543 ms total)
[  PASSED  ] 40 tests.

  YOU HAVE 2 DISABLED TESTS
```

Reviewed By: SS-JIA

Differential Revision: D46059444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102128
Approved by: https://github.com/SS-JIA
2023-05-25 02:29:15 +00:00
23dbdd900f Full default dict support in dynamo (#102202)
Allows arbitrary default dict factories and construction of a default dict in a compiled function - needed for [this function](2e2a74670d/torch/utils/_foreach_utils.py (LL21C5-L21C395)) used to group params in the foreach optimizer.
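
A small sketch of what this enables (illustrative; it only loosely mirrors the grouping pattern in `_foreach_utils.py`):

```python
import torch
from collections import defaultdict

@torch.compile
def group_by_dtype(tensors):
    groups = defaultdict(list)  # arbitrary default factory, built inside the compiled fn
    for t in tensors:
        groups[t.dtype].append(t)
    return groups
```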

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102202
Approved by: https://github.com/yanboliang
2023-05-25 01:41:38 +00:00
f3e42f15e9 [FSDP] Start to generalize modules to ignore for mixed precision (#102010)
The main use case here is that folks would like to ignore layer norm for mixed precision. This can now be enabled with:

```
mp_config = MixedPrecision(
            param_dtype=torch.float16,
            reduce_dtype=torch.float16,
            buffer_dtype=torch.float16,
            _mixed_precision_module_classes_to_ignore=[_BatchNorm, nn.LayerNorm],
        )
```

This is done by wrapping modules whose types appear in `_mixed_precision_module_classes_to_ignore` in their own FSDP unit with mixed precision disabled. This is only enabled for auto wrapping.

We also add module pre and post hooks to cast / downcast inputs to the appropriate full precision.

Differential Revision: [D46079957](https://our.internmc.facebook.com/intern/diff/D46079957/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102010
Approved by: https://github.com/awgu
2023-05-25 00:45:54 +00:00
c2093de5d9 [partitioner] fix for rng ops (#102123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102123
Approved by: https://github.com/Chillee
2023-05-25 00:35:07 +00:00
2763b50803 update thresholds for various ops in functorch/test_ops.py (#102016)
update thresholds for following ops:

linalg.multi_dot, pca_lowrank
matrix_exp
matmul, __rmatmul__
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102016
Approved by: https://github.com/ngimel, https://github.com/zou3519
2023-05-25 00:30:57 +00:00
e274c2e4fd [MPS] Restride output strides to contiguous format for inverse op (#102122)
Remove unnecessary output allocation and reuse the current allocated output Tensor.
This change restrides the output strides to contiguous format for inverse op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102122
Approved by: https://github.com/kulinseth
2023-05-25 00:21:43 +00:00
11d1cd899a Replace require_backend with require_backend_is_available (#101891)
[BE] `require_backend_is_available` offers the a more thorough check as `require_backend` but both are often used together. This remove `require_backend` and centralizes on the `require_backend_is_available` decorator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101891
Approved by: https://github.com/awgu
2023-05-25 00:00:06 +00:00
3e08988cd3 Fix redundant kernel generation (#102104)
## Issue description

The PR https://github.com/pytorch/pytorch/pull/100064 introduces a new RNG operation process. However, it causes every `randint` to load a separate random seed by default. TorchInductor generates a buffer to store all necessary random seeds and places the offsets as constant values in the subsequent compute buffers. In the ir_pre_fusion output generated by TorchInductor, some buffers differ only by one line, namely the load of the random seed with the corresponding offset. Subsequently, the codegen generates Triton kernels following the same rule. Finally, in output_code.py, some Triton kernels differ only by one line, meaning that redundant kernels are being generated.

## Solution

This PR captures the seed offset and adds it to the existing `self.sizevars` structure. It generates variable names as placeholders, allowing the code wrapper to pass the offset as an argument to the kernels. I've also modified the divisible_by_16 check to exclude this argument.

This PR reduces the number of generated kernels from 50 to 17 for BertForMaskedLM forward.

According to tests on my own environment, the compilation time of attention_is_all_you_need_pytorch has been reduced from 94s to 66s. The speedup remains largely unchanged, at 1.37X.

The following is a comparison for a simple example.
Before:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + 0)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)

triton_poi_fused_1 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + 1)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
...''')

def call(args):
        triton_poi_fused_0.run(buf0, buf1, 1024, grid=grid(1024), stream=stream0)
        triton_poi_fused_1.run(buf0, buf2, 1024, grid=grid(1024), stream=stream0)

```
After:
```
triton_poi_fused_0 = async_compile.triton('triton_', '''
...
def triton_(in_ptr0, out_ptr0, load_seed_offset, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + load_seed_offset)
    tmp1 = x0
    tmp2 = triton_helpers.randint64(tmp0, (tmp1).to(tl.uint32), 0, 10)
    ....

def call(args):
        triton_poi_fused_0.run(buf0, buf1, 0, 1024, grid=grid(1024), stream=stream0)
        triton_poi_fused_0.run(buf0, buf2, 1, 1024, grid=grid(1024), stream=stream0)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102104
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-05-24 23:56:53 +00:00
9ce95ce157 [pytorch] add Vulkan support for the t and transpose operators for 2d, 3d and 4d tensors (#101808)
Summary:
Use the existing permute shader to implement the following two operators for the Vulkan backend
- `aten::transpose` The behavior of the operator is shown in https://pytorch.org/docs/stable/generated/torch.transpose.html.
- `aten::t` The behavior of the operator is shown in https://pytorch.org/docs/stable/generated/torch.t.html#torch.t. 1d tensors are returned as is. When input is a 2d tensor this is equivalent to `aten::transpose(input, 0, 1)`.

Test Plan:
At local repo of fbsource on MacBook, run `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1`
- Full test results P739033174.
- `aten::t` and `aten::tranpose` related results shown below
```
(base) luwei@luwei-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1

[... other tests ...]

[ RUN      ] VulkanAPITest.transpose_t_1d
[       OK ] VulkanAPITest.transpose_t_1d (0 ms)
[ RUN      ] VulkanAPITest.transpose_t_2d_small
[       OK ] VulkanAPITest.transpose_t_2d_small (1 ms)
[ RUN      ] VulkanAPITest.transpose_t_2d_medium
[       OK ] VulkanAPITest.transpose_t_2d_medium (0 ms)
[ RUN      ] VulkanAPITest.transpose_t_2d_large
[       OK ] VulkanAPITest.transpose_t_2d_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_2d_height_and_width_small
[       OK ] VulkanAPITest.transpose_2d_height_and_width_small (0 ms)
[ RUN      ] VulkanAPITest.transpose_2d_height_and_width_medium
[       OK ] VulkanAPITest.transpose_2d_height_and_width_medium (0 ms)
[ RUN      ] VulkanAPITest.transpose_2d_height_and_width_large
[       OK ] VulkanAPITest.transpose_2d_height_and_width_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_2d_height_and_height_large
[       OK ] VulkanAPITest.transpose_2d_height_and_height_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_2d_width_and_width_large
[       OK ] VulkanAPITest.transpose_2d_width_and_width_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_height_and_width_small
[       OK ] VulkanAPITest.transpose_3d_height_and_width_small (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_height_and_width_medium
[       OK ] VulkanAPITest.transpose_3d_height_and_width_medium (1 ms)
[ RUN      ] VulkanAPITest.transpose_3d_height_and_width_large
[       OK ] VulkanAPITest.transpose_3d_height_and_width_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_3d_width_and_width_large
[       OK ] VulkanAPITest.transpose_3d_width_and_width_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_width_small
[       OK ] VulkanAPITest.transpose_3d_depth_and_width_small (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_width_medium
[       OK ] VulkanAPITest.transpose_3d_depth_and_width_medium (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_width_large
[       OK ] VulkanAPITest.transpose_3d_depth_and_width_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_depth_large
[       OK ] VulkanAPITest.transpose_3d_depth_and_depth_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_height_small
[       OK ] VulkanAPITest.transpose_3d_depth_and_height_small (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_height_medium
[       OK ] VulkanAPITest.transpose_3d_depth_and_height_medium (0 ms)
[ RUN      ] VulkanAPITest.transpose_3d_depth_and_height_large
[       OK ] VulkanAPITest.transpose_3d_depth_and_height_large (2 ms)
[ RUN      ] VulkanAPITest.transpose_3d_height_and_height_large
[       OK ] VulkanAPITest.transpose_3d_height_and_height_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_4d_batch_and_batch_large
[       OK ] VulkanAPITest.transpose_4d_batch_and_batch_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_4d_depth_and_depth_large
[       OK ] VulkanAPITest.transpose_4d_depth_and_depth_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_4d_height_and_height_large
[       OK ] VulkanAPITest.transpose_4d_height_and_height_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_4d_width_and_width_large
[       OK ] VulkanAPITest.transpose_4d_width_and_width_large (0 ms)
[ RUN      ] VulkanAPITest.transpose_4d_batch_and_depth_large
[       OK ] VulkanAPITest.transpose_4d_batch_and_depth_large (1 ms)
[ RUN      ] VulkanAPITest.transpose_4d_batch_and_height_large
[       OK ] VulkanAPITest.transpose_4d_batch_and_height_large (2 ms)
[ RUN      ] VulkanAPITest.transpose_4d_batch_and_width_large
[       OK ] VulkanAPITest.transpose_4d_batch_and_width_large (2 ms)
[ RUN      ] VulkanAPITest.transpose_4d_depth_and_height_large
[       OK ] VulkanAPITest.transpose_4d_depth_and_height_large (2 ms)
[ RUN      ] VulkanAPITest.transpose_4d_depth_and_width_large
[       OK ] VulkanAPITest.transpose_4d_depth_and_width_large (2 ms)
[ RUN      ] VulkanAPITest.transpose_4d_height_and_width_large
[       OK ] VulkanAPITest.transpose_4d_height_and_width_large (1 ms)

[... other tests ...]
```

Reviewed By: SS-JIA

Differential Revision: D45878333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101808
Approved by: https://github.com/SS-JIA
2023-05-24 23:50:07 +00:00
c903b12cb8 Add fake process group (#102180)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102180
Approved by: https://github.com/wanchaol
2023-05-24 23:27:40 +00:00
5da497cabb add additional stream priority for cuda streams (#101956)
Changes the StreamID encoding to use the last bit to distinguish between external and internal streams, 4 bits for IdType (DEFAULT, EXT or user-created streams possibly with high priority), and 5 bits for index. This allows more stream priorities to be exposed to the user (currently 4, but that's easy to change now). Note that we pre-create all 32 streams in the pool for each allowed priority; I don't know if that's a problem in practice. Currently CUDA 11.8 on A100 GPUs allows 6 different stream priorities; the number may differ across cards and CUDA versions.

Previous callsites explicitly requesting a high-priority stream (`isHighPriority=true`) now get the highest-priority stream.
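
A minimal sketch of how these priorities surface from Python via the existing `torch.cuda.Stream(priority=...)` argument, assuming a CUDA device is available:

```python
import torch

# Lower numbers mean higher priority; how many distinct levels are actually
# honored depends on the device and CUDA version, as noted above.
high_prio = torch.cuda.Stream(priority=-1)
default_prio = torch.cuda.Stream(priority=0)

with torch.cuda.stream(high_prio):
    y = torch.randn(1024, device="cuda").sum()
```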

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101956
Approved by: https://github.com/ezyang
2023-05-24 23:26:47 +00:00
f8896b7b0e update tf32 thresholds in nn/test_convolution.py (#102015)
updated tf32 thresholds for test_cudnn_convolution_relu, test_cudnn_convolution_add_relu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102015
Approved by: https://github.com/ngimel
2023-05-24 22:42:25 +00:00
dedcf8f70f No need to run non-CUDA jobs in memory leak check mode (#102188)
Memory leak check mode is only meant for runners with GPUs such as CUDA and ROCm https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_utils.py#L1093. So it's a waste of time and resources to run it for CPU-only jobs.

### Testing

CUDA jobs have both `mem_leak_check` and `rerun_disabled_tests`:
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109448858#step:9:131
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109449417#step:9:123
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109452338#step:9:111

Same goes for Bazel CUDA job:
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109451535#step:3:132

And ROCM job:
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109451353#step:9:117

Non CUDA or ROCM jobs have only `rerun_disabled_tests` mode, for example:
* https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109449894#step:9:127
* ASAN https://github.com/pytorch/pytorch/actions/runs/5072143038/jobs/9109449157#step:9:126
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102188
Approved by: https://github.com/clee2000
2023-05-24 22:36:31 +00:00
ce42010722 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-24 22:17:32 +00:00
dbf6912be6 Populate all args with fake tensor value (#102129)
Summary: We don't need to leak matched input positions from dynamo anymore if we can just populate all args with corresponding fake tensors.

Test Plan: CI

Differential Revision: D46131556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102129
Approved by: https://github.com/angelayi
2023-05-24 22:01:47 +00:00
210fc28d5e Revert "Support resize on meta storage (#101988)"
This reverts commit 7d1ba0a92adededec1ce3488e39c1d399ecf6b6c.

Reverted https://github.com/pytorch/pytorch/pull/101988 on behalf of https://github.com/osalpekar due to Need to revert and rebase this in order to unblock train import ([comment](https://github.com/pytorch/pytorch/pull/101988#issuecomment-1561970230))
2023-05-24 21:51:33 +00:00
06f656c5d1 [distributed] implemented find_all_descendants (#102138)
Fixes #100397

Implemented the find_all_descendants function, which identifies the list of nodes that need to be moved. Added a unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102138
Approved by: https://github.com/fegin
2023-05-24 21:47:59 +00:00
5d6810a4ee [dynamo][higher order op] Support nn.Module calls (#102022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102022
Approved by: https://github.com/zou3519
2023-05-24 21:39:58 +00:00
e6af31a5a2 [dynamo] Add astunparse dependency (#102120)
Summary:
https://github.com/pytorch/pytorch/pull/98488 implements CSE for dynamo guards, and it relies on astunparse to perform the optimization.
`test_guards_cse_pass_single` was broken and was later fixed by introducing a check_and_skip_if_needed. This actually fixes the root cause on fbcode and should bring some perf gains internally.

Test Plan: `buck2 test @//mode/opt //caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::DynamicShapesMiscTests::test_guards_cse_pass_single' --run-disabled`

Reviewed By: malfet

Differential Revision: D46126742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102120
Approved by: https://github.com/malfet
2023-05-24 21:24:24 +00:00
e6fc7d814d Segmentation fault in flatbuffers when parsing malformed modules (#95221)
Fixes #95061, #95062

Add Flatbuffer verification before parsing to avoid crashing on malformed modules. Flatbuffers doesn't perform boundary checks at runtime for the sake of performance, so when parsing untrusted modules it is highly recommended to verify overall buffer integrity.

This bug can be triggered both by the C++ API (`torch::jit::load`, `torch::jit::load_jit_module_from_file`) and the Python API (`torch.jit.load`, `torch.jit.jit_module_from_flatbuffer`).

Crash files to reproduce:
[crash-1feb368861083e3d242e5c3fcb1090869f4819c4.txt](https://github.com/pytorch/pytorch/files/10795267/crash-1feb368861083e3d242e5c3fcb1090869f4819c4.txt)
[crash-7e8ffd314223be96b43ca246d3d3481702869455.txt](https://github.com/pytorch/pytorch/files/10795268/crash-7e8ffd314223be96b43ca246d3d3481702869455.txt)
[crash-ad4d7c6183af8f34fe1cb5c8133315c6389c409f.txt](https://github.com/pytorch/pytorch/files/10795279/crash-ad4d7c6183af8f34fe1cb5c8133315c6389c409f.txt)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95221
Approved by: https://github.com/qihqi, https://github.com/davidberard98
2023-05-24 21:16:19 +00:00
2e2a74670d torch.sparse.softmax: allow negative dim (#102172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102172
Approved by: https://github.com/cpuhrsch
2023-05-24 19:43:47 +00:00
424c930f76 Add quantization lowering for nn.PixelShuffle and nn.PixelUnshuffle (#101926)
Similar to https://github.com/pytorch/pytorch/pull/96160 but for the modules
nn.PixelShuffle and nn.PixelUnshuffle.

torch.nn.PixelUnshuffle accepts both float and quantized inputs.
However, previously we would unnecessarily dequantize quantized inputs into floats
before passing them to the function. This commit fixes this by lowering the patterns
[dequant - PixelShuffle - quant] and [dequant - PixelUnshuffle - quant].

Test Plan:

python test/test_quantization.py TestQuantizeFxOps.test_pixel_shuffle_module
python test/test_quantization.py TestQuantizeFxOps.test_pixel_unshuffle_module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101926
Approved by: https://github.com/jerryzh168
2023-05-24 19:33:26 +00:00
956bd03808 add ignored_states to FSDP/fully_shard (#102056)
Add 'ignored_states', which accepts either a list of ignored parameters or a list of nn modules, to the FSDP model wrapper and the fully_shard composable API. It is recommended to use 'ignored_states' over 'ignored_modules' moving forward.
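
A minimal sketch, assuming a process group has already been initialized and that the second submodule should stay unsharded:

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))

# Pass modules to ignore...
fsdp_model = FSDP(model, ignored_states=[model[1]])
# ...or, equivalently, the parameters themselves:
# fsdp_model = FSDP(model, ignored_states=list(model[1].parameters()))
```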

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102056
Approved by: https://github.com/awgu
2023-05-24 18:36:48 +00:00
023bc30b17 Revert "Merge type stubs for torch.nn.parallel (#101528)"
This reverts commit 6cabc105bb7c9cce4e23bdcc4a921613caae0f9a.

Reverted https://github.com/pytorch/pytorch/pull/101528 on behalf of https://github.com/kit1980 due to Broke inductor tests https://github.com/pytorch/pytorch/actions/runs/5071348299/jobs/9107880424 ImportError: cannot import name 'get_a_var' from 'torch.nn.parallel.parallel_apply' ([comment](https://github.com/pytorch/pytorch/pull/101528#issuecomment-1561732862))
2023-05-24 18:23:52 +00:00
d316a2dd5c [spmd] Enable data parallel to work with non 0 batch dim (#100073)
This PR enables data parallel to work with a non-0 batch dim. The only
thing we need to do is expose the input_batch_dim to DataParallelMode;
the data parallel expansion then works automatically, since the batch
dim analysis already handles things correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100073
Approved by: https://github.com/mrshenli
2023-05-24 17:55:10 +00:00
d378837039 [spmd] add more decomp and fix a sharding bug (#100938)
This PR adds the native_layernorm_backward op to the decomp table and fixes
a sharding bug so that padding is no longer applied automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100938
Approved by: https://github.com/mrshenli
2023-05-24 17:55:10 +00:00
dd1f295201 [spmd] Improve activation handling, factory ops and batch dim reduction (#100853)
This PR improves the activation handling logic of data parallel to
support cases where there are tensor factory ops that do not depend
on any input node; they still produce an activation, either a sharded
act (i.e. if the output shape contains the batch size) or a replicated act.

It also significantly simplifies the full reduction logic: we no longer
need separate full reduction detection, we only need to ensure that when
computing the batch dim, we detect full reduction and mark it as sharded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100853
Approved by: https://github.com/mrshenli
2023-05-24 17:55:09 +00:00
4d55ea8548 [spmd] enhance batch dim analysis of data parallel (#100852)
This PR enhances the batch dim analysis of data parallel to understand
more of the cases where the batch dim gets flattened or split. Using
dtensor's view ops, we are able to track a batch dim that got
transformed in non-trivial ways.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100852
Approved by: https://github.com/mrshenli
2023-05-24 17:55:07 +00:00
b2eaba6b62 [spmd] by default average gradients for nccl backend (#99964)
This PR averages gradients by default for the NCCL backend, which allows
SPMD's data parallel to match DDP/FSDP results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99964
Approved by: https://github.com/mrshenli
2023-05-24 17:55:06 +00:00
942cd12d55 [spmd] add option to preserve node types (#100072)
This PR adds an option to preserve node types for the entire graph.
This could allow some exploration of using those node types to do
things like activation checkpointing, etc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100072
Approved by: https://github.com/mrshenli
2023-05-24 17:55:05 +00:00
2232cce69c No cpp + step current (#102001)
stepcurrent cannot handle xdist
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102001
Approved by: https://github.com/huydhn
2023-05-24 17:39:32 +00:00
fcf812c35a Unbind Cat pattern (#101767)
Continuing from previous diffs, this diff merges the unbind->cat / unbind->stack pattern.

In combination with previous diffs, this can handle split->squeeze->[cat/stack].

Since many of the cases are similar to split->cat, we reuse SplitCatSimplifier.

Differential Revision: [D45955486](https://our.internmc.facebook.com/intern/diff/D45955486/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101767
Approved by: https://github.com/jansel
2023-05-24 17:04:14 +00:00
6cabc105bb Merge type stubs for torch.nn.parallel (#101528)
Fixes #91648

As explained in the tracking issue, the incomplete type stubs in `torch/nn/parallel` mask `DataParallel` methods relevant for subclassing and also mask type issues present in the code itself.

One notable change here is the addition of [`allow_redefinition = True`](https://mypy.readthedocs.io/en/stable/config_file.html#confval-allow_redefinition) in `mypy.ini`, which allows for a common pattern:

> Allows variables to be redefined with an arbitrary type, as long as the redefinition is in the same block and nesting level as the original definition.

This is added specifically to allow for the type narrowing of `device_ids` in `torch.nn.parallel.data_parallel.data_parallel` from `Sequence[Union[int, torch.device]]` to `Sequence[int]`.

Other than this, there are various renamings and `type: ignore` comments added to bypass errors that arose from the merging.

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101528
Approved by: https://github.com/ezyang
2023-05-24 16:52:13 +00:00
fdd28399dc Replace unsqueeze transform with stack (#101766)
As part of split-cat transforms, we needed to unsqueeze additional inputs (not coming from split) but going to the cat/stack nodes.

However, this leads to patterns like:

```
split -> unsqueeze -> cat
```

when there are multiple splits going into cat.

An alternative is to use stack rather than unsqueeze, leading to patterns like:

```
split -> stack -> cat
```

This is much better, as repeated applications of the same pattern will further simplify "split->stack", which is not trivial in case of "split->unsqueeze->cat".

Another nice side effect is a smaller number of nodes in the graph overall.

Differential Revision: [D45952452](https://our.internmc.facebook.com/intern/diff/D45952452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101766
Approved by: https://github.com/jansel
2023-05-24 16:43:32 +00:00
c0d0a9f7a0 Replace split-squeeze pattern (#101765)
Replaces split-squeeze (same dimension) with an unbind. This will be used in combination with later patterns to remove the unbind if it follows a cat/stack

Differential Revision: [D45758181](https://our.internmc.facebook.com/intern/diff/D45758181/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101765
Approved by: https://github.com/jansel
2023-05-24 16:40:03 +00:00
0eb4f07282 [ONNX] Introduce FX-ONNX dispatcher (#100660)
Needs https://github.com/microsoft/onnxscript/pull/721

The current FX exporter uses a manually maintained dictionary to map an ATen op to its OnnxFunction. However, an issue arises when the ATen op or the OnnxFunction has overloads, which is not resolvable by a one-to-one mapping. For example, `aten::arange` has overloads `aten::arange.start` and `aten::arange.start_step`, and for `aten::argmax`, torchlib provides two functions: aten_argmax and aten_argmax_dim.

This PR utilizes newly introduced [ONNX OpSchema](https://github.com/microsoft/onnxscript/pull/626) to match the input arguments of an ATen operator to find the correct overload.

### OnnxRegistry

Heavily references the [TorchScript Registry](https://github.com/pytorch/pytorch/pull/84382). The only difference is that in the FX registry, an ATen operator with a specific opset version is mapped to a list of overloaded functions.

* No longer uses a global registry. The registry is initialized in `ResolvedExportOptions` with torchlib, and will be exposed to users in the future.
* The multiple-opset-version layer is kept through `_SymbolicFunctionGroup`, but torchlib currently only supports 18.
* The basic custom-operator support API (`register`, `unregister`, and `is_register_op`) is kept for future development. To complete it, the follow-up PRs should address:
    - How to allow users to remove/override a specific overload? Using OpSchema to differentiate?
    - What happens when a user registers a new overload with the same OpSchema as an already registered overload.

### OnnxDispatcher

Dispatch ATen operators to the matched overload by comparing OpSchema with input arguments.

* `OpSchemaWrapper` wraps the ONNX schema and records the matching score.
* `dispatch` uses `OpSchemaWrapper` to compare data types and find the best-matched overload. If the match isn't perfect, a warning is recorded in diagnostics.
* `dispatch_opset_version` is referenced from #84382 and kept, but torchlib doesn't support opset versions other than 18.
* Because right now (1) OnnxFunction arguments are manually typed, and (2) ORT may not follow the ONNX type spec, we relax the schema match with a `matching score system`.
* To extend support, the follow-up PRs should address:
    - How to add op.Cast with autocast? In torchlib or the converter?
    - The need for type promotion can be captured by the dispatcher, but this requires OpSchema to expose the T1/T2 information.

### OpSchemaWrapper - Matching Score Mechanism

#### The matching score system:
This is a temporary solution to how we target the correct ONNX overloads, given that we only have manually annotated arguments (a potentially inaccurate schema) and limited support for AttributeProto.

1. Perfect match check: if all arguments/kwargs are matched, return the function without any warnings.
2. Best match check: the system counts each correctly matching input in order and subtracts the symmetric difference between the attributes to calculate the matching score, then selects the overload with the highest score (see the sketch below). If the selection is not a perfect match, a warning message is sent to SARIF.
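
A toy sketch of that scoring idea, illustrative only and not the actual `OpSchemaWrapper` implementation; the helper name and argument shapes are assumptions:

```python
def match_score(expected_dtypes, schema_attrs, arg_dtypes, kwargs):
    # Count inputs whose dtype is allowed by the schema, in order.
    score = sum(1 for allowed, got in zip(expected_dtypes, arg_dtypes) if got in allowed)
    # Penalize attributes that appear on only one side (symmetric difference).
    score -= len(set(schema_attrs) ^ set(kwargs))
    return score
```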

#### Example of overloads

1. Different types: Caused by the difference between the ONNX spec and PyTorch.

The matching system finds the correct one.

```python
@torch_op("aten::mul")
def aten_mul(self: TReal, other: TReal) -> TReal:
    ...

@torch_op("aten::mul")
def aten_mul_bool(self: BOOL, other: BOOL) -> BOOL:
    ...
```

2. Optional dim: caused by unsupported op.OptionalHasElement (will support on opset version == 20). dim could be "None"

```python
@torch_op("aten::argmax", trace_only=True)
def aten_argmax(
    self: TrealOrUInt8, dim: Optional[int] = None, keepdim: bool = False
) -> TrealOrUInt8:
    ...

@torch_op("aten::argmax", private=True)
def _aten_argmax_dim(self: TrealOrUInt8, dim: int, keepdim: bool = False) -> TrealOrUInt8:
    ...
```

This case is impossible to differentiate, as both might have dim in kwargs, so please make sure you turn the one with `dim: int` into a private function.

3. Optional dtype: dtype could be "unprovided". The difference from 2 is that dtype would not be None.

```python
@torch_op("aten::new_full")
def aten_new_full(self: TTensor, size: INT64, fill_value: TTensor) -> TTensor:
    ...

@torch_op("aten::new_full")
def aten_new_full_dtype(self: TTensor, size: INT64, fill_value: TTensor, dtype: int) -> TTensor:
    ...
```

Depending on whether dtype is provided, the matching system will dispatch the ATen op to the correct one.

4. `None`, `[]`, and `NoneType` are considered to fail the match.

5. Cases where two functions have the same score are recorded in SARIF.

### TODOs

1. Type promotion can be captured by the dispatcher only if OpSchema can provide it. However, the implementation of a "graph-level" pass vs. "in-op" promotion can be further discussed in https://github.com/microsoft/onnxscript/issues/563.
2. torchlib should provide the "opset version" to OnnxRegistry.
3. How to expose OnnxRegistry with custom add/remove-op APIs needs to be further discussed.

Co-authored-by: Justin Chu <justinchuby@microsoft.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100660
Approved by: https://github.com/thiagocrepaldi
2023-05-24 16:39:22 +00:00
47b4136439 Refactor normalize passes to use @register_graph_pattern (#101764)
Cleans up normalize passes by using register_graph_pattern decorator

Differential Revision: [D45973543](https://our.internmc.facebook.com/intern/diff/D45973543/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45973543/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101764
Approved by: https://github.com/jansel
2023-05-24 16:27:21 +00:00
eqy
66f6e0e605 [CUDA][DLPack] Handle legacy default streams for DLPack conversion (#101318)
It seems that some legacy default stream logic (e.g., present in a8ff647e42/torch/utils/dlpack.py (L114) ) is not handled on the potential receiving end in `torch/_tensor.py`.

Open to suggestions on how to make the test case less clunky, as this was the combination we arrived at after discovering flakiness in alternate versions.

Thanks to Olga Andreeva for surfacing this issue and providing a repro.

CC @Aidyn-A @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101318
Approved by: https://github.com/ngimel
2023-05-24 16:14:50 +00:00
3baa67caee [quant][pt2e][be] Move annotate helper function to quantizer/utils.py (#102127)
Summary: att

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Reviewed By: kimishpatel

Differential Revision: D46001285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102127
Approved by: https://github.com/kimishpatel
2023-05-24 16:13:28 +00:00
5f0463a6d7 [inductor] Move two cpu tests to test_cpu_repro.py (#101887)
Summary: These two are CPU-only tests.

Test Plan:
```
buck2 test @//mode/dev-nosan //caffe2/test/inductor:test_inductor -- --exact 'caffe2/test/inductor:test_inductor - test_in_out_buffer_cuda (caffe2.test.inductor.test_torchinductor.CudaTests)' --run-disabled
```

Reviewed By: bertmaher

Differential Revision: D46011571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101887
Approved by: https://github.com/bertmaher
2023-05-24 15:41:06 +00:00
e3d97b6213 [inductor] Added smooth_l1_loss refs (#102077)
Added `smooth_l1_loss` to refs + tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102077
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-24 15:07:08 +00:00
ddf4f7bc89 fix inference_mode with torch.compile (#101219)
It looks like inference_mode wasn't playing well with functionalization.

If you run torch.compile on a function, and the inputs to the function are tensors created outside of inference mode, then we need to make sure that when we create functional tensor wrappers for those inputs during compilation, those wrappers properly mirror whether or not the original tensor is an inference tensor.
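
A minimal repro sketch of the situation described above (the shapes and function body are illustrative, not taken from the linked issue):

```python
import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(4)              # created outside inference mode
with torch.inference_mode():
    out = f(x)                  # functional wrappers must mirror x's inference-ness
```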

Hopefully fixes https://github.com/pytorch/pytorch/issues/101151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101219
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-05-24 14:58:40 +00:00
98ab11a2c3 separate out dynamo .requires_grad and .is_grad_enabled guards (#100570)
Fixes https://github.com/pytorch/pytorch/issues/100977

This will hopefully fix the error from this [issue](https://github.com/pytorch/pytorch/issues/99616).

This PR fixes an internal model: we were running an inductor inference graph, but `torch.is_grad_enabled()` was True, causing us to error inside of the inference graph when we encountered an out= operator.

I haven't been able to create a smaller repro - before landing this, I want to create a smaller repro to convince myself of why we need to separate out these guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100570
Approved by: https://github.com/ezyang
2023-05-24 14:58:40 +00:00
32643bc926 Remove vsx suffix in sleef calls (#100149)
Sleef has automatic architecture selection for Power, so there is no need to call architecture-specific interfaces. If we call the generic interface, Sleef will correctly choose the architecture-specific code based on the architecture (vsx for Power8, vsx3 for Power9 and Power10). The vsx suffix in Sleef calls in PyTorch is therefore removed, so that architecture-specific code selection is handled by Sleef internally.

Fixes the issue wherein older (and slower) vsx code in Sleef was getting executed on newer Power9 and Power10 processors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100149
Approved by: https://github.com/jgong5
2023-05-24 14:24:38 +00:00
d08066a438 [Reland][functorch] test for compiling functorch transforms (#100718)
Original PR over at #100151. Was reverted due to internal test failures.
I have fixed the internal build system.

Differential Revision: [D45608453](https://our.internmc.facebook.com/intern/diff/D45608453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100718
Approved by: https://github.com/kshitij12345, https://github.com/atalman
2023-05-24 14:21:38 +00:00
08fb648fe1 Add mechanism to turn any RAII guard into a Python Context Manager (#102037)
This PR:
- adds a mechanism to turn any RAII guard into a Python Context Manager
- turns ExcludeDispatchKeyGuard into a context manager, and purges usages
of the older torch._C.ExcludeDispatchKeyGuard from the codebase.

The mechanism is that given a RAII guard, we construct a context
manager object that holds an optional guard. When we enter the context
manager we populate the guard, when we exit we reset it.
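
A minimal Python-side sketch of that pattern, assuming `guard_ctor` is any RAII-style guard class exposed from C++ (the class and names here are illustrative, not the actual binding):

```python
class GuardContextManager:
    def __init__(self, guard_ctor, *args):
        self._guard_ctor = guard_ctor
        self._args = args
        self._guard = None          # the "optional guard": empty until __enter__

    def __enter__(self):
        self._guard = self._guard_ctor(*self._args)   # populate the guard
        return self

    def __exit__(self, exc_type, exc, tb):
        self._guard = None          # reset the guard, running its cleanup
        return False
```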

We don't delete torch._C.ExcludeDispatchKeyGuard for BC reasons (people
are using it in fbcode). If this code actually sticks
(it is using C++17 and that worries me a bit), then I'll apply the
change to other RAII guards we have, otherwise, we can write our own
std::apply.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102037
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-05-24 14:20:52 +00:00
8b7bd81902 determined collective device by _get_pg_default_device rather than explicit cuda (#101533)
There are many communication operations for ShardedTensor in the state dict of FSDP. They use the externally passed-in pg (or the default pg), which currently supports CUDA devices. Before communication, the memory is implicitly moved to CUDA (because it is essentially moving data to the memory type required by the pg, not to the computing device type). Similarly, when users use FSDP with a custom backend, they pass in a custom pg (which does not support CUDA devices), which may cause FSDP to not work properly in some cases. This PR obtains the memory type supported by the pg through _get_pg_default_device during communication, and moves the data to it when needed.
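
A minimal sketch of the pattern described above, keeping collectives device-agnostic; the helper function and shapes are illustrative:

```python
import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import _get_pg_default_device

def gather_state(tensor, pg=None):
    # Move data to whatever device type the process group expects
    # (e.g. cuda for NCCL, cpu for Gloo) instead of hard-coding cuda.
    device = _get_pg_default_device(pg)
    tensor = tensor.to(device)
    out = [torch.empty_like(tensor) for _ in range(dist.get_world_size(pg))]
    dist.all_gather(out, tensor, group=pg)
    return out
```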
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101533
Approved by: https://github.com/awgu
2023-05-24 13:48:43 +00:00
fd1d442185 [inductor] Add more dynamic shapes support for CudaWrapperCodeGen (#102019)
Summary: Use size hints for autotuning; fix some symbol arg codegen
problems. More PRs are coming to fix unit test failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102019
Approved by: https://github.com/jansel
2023-05-24 13:29:47 +00:00
ee95e37a69 [c10d] Record time spent for init_process_group, new_group, _store_based_barrier (#101912)
1. Record time spent for init_process_group, new_group, _store_based_barrier
2. Rename c10d_error_logger to c10d_logger for generalization.
3. Refactor to move the logger wrappers in distributed_c10d.py to c10d_logger.py.
4. Rename the logger wrappers (BC-breaking). exception_handler is renamed to exception_logger to avoid confusion with the logging handler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101912
Approved by: https://github.com/fduwjj
2023-05-24 09:36:34 +00:00
d6afa7d003 add Half support for sinh, cosh, polygamma, entr and i0e on CPU (#99002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99002
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/ngimel
2023-05-24 08:42:14 +00:00
8aea9dad8f Bump mpmath from 1.2.1 to 1.3.0 in /.github/requirements (#102058)
Bumps [mpmath](https://github.com/fredrik-johansson/mpmath) from 1.2.1 to 1.3.0.
Release notes, sourced from [mpmath's releases](https://github.com/fredrik-johansson/mpmath/releases):

1.3.0

Security issues:
- Fixed ReDOS vulnerability in mpmathify() (CVE-2021-29063) (Vinzent Steinberg)

Features:
- Added quadsubdiv() for numerical integration with adaptive path splitting (Fredrik Johansson)
- Added the Cohen algorithm for inverse Laplace transforms (Guillermo Navas-Palencia)
- Some speedup of matrix multiplication (Fredrik Johansson)
- Optimizations to Carlson elliptic integrals (Paul Masson)
- Added signal functions (squarew(), trianglew(), sawtoothw(), unit_triangle(), sigmoidw()) (Nike Dattani, Deyan Mihaylov, Tina Yu)

Bug fixes:
- Correct mpf initialization from tuple for finf and fninf (Sergey B Kirpichev)
- Support QR decomposition for matrices of width 0 and 1 (Clemens Hofreither)
- Fixed some cases where elliprj() gave inaccurate results (Fredrik Johansson)
- Fixed cases where digamma() hangs for complex input (Fredrik Johansson)
- Fixed cases of polylog() with integer-valued parameter with complex type (Fredrik Johansson)
- Fixed fp.nsum() with Euler-Maclaurin algorithm (Fredrik Johansson)

Maintenance:
- Dropped support for Python 3.4 (Sergey B Kirpichev)
- Documentation cleanup (Sergey B Kirpichev)
- Removed obsolete files (Sergey B Kirpichev)
- Added options to runtests.py to skip tests and exit on failure (Jonathan Warner)

The 1.3.0 entry (released March 7, 2023) in [mpmath's changelog](https://github.com/mpmath/mpmath/blob/master/CHANGES) lists the same items. The preceding 1.2.0 release (February 1, 2021) included these features and optimizations:
- Support @ operator for matrix multiplication (Max Gaukler)
- Add eta() implementing the Dedekind eta function
- Optimized the python_trailing function (adhoc-king)
- Implement unary plus for matrices (Max Gaukler)
- Improved calculation of gram_index (p15-git-acc)

Commits:
- `b5c04506ef` version 1.3.0
- `a27581ca77` Merge pull request [#656](https://redirect.github.com/fredrik-johansson/mpmath/issues/656) from cclauss/patch-2
- `9d7884bf96` don't use .ae method in library code
- `967de83d51` Downgrade to ubuntu-20.04 for Py35 and Py36
- `6425c6aa41` build: strategy: fail-fast: false
- `e2341c762e` GitHub Actions: Test on Python 3.11 production release
- `1258e33e16` fix failing doctests
- `b7c15d668c` include signals documentation; remove duplicate docstrings
- `1b476ea230` update doc building instructions
- `5f57beb1e3` Merge pull request [#646](https://redirect.github.com/fredrik-johansson/mpmath/issues/646) from cclauss/patch-1
- Additional commits viewable in the [compare view](https://github.com/fredrik-johansson/mpmath/compare/1.2.1...1.3.0)

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=mpmath&package-manager=pip&previous-version=1.2.1&new-version=1.3.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

---

Dependabot commands and options:

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102058
Approved by: https://github.com/huydhn
2023-05-24 08:40:35 +00:00
faa7eb81c6 change error_message for XPU Autocast data type check (#102073)
XPU autocast supports the bf16 and fp16 data types, so we are changing the error_message accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102073
Approved by: https://github.com/jgong5
2023-05-24 08:36:43 +00:00
d06802778e No need to run C++ tests under rerun disabled tests mode (#102132)
Per title. I extracted this part out of the draft PR that I'm working on (https://github.com/pytorch/pytorch/pull/102107) because the remaining issues with rerun disabled tests (log size and unexpected runner failures) require further investigation, while this one is clearly breaking trunk at the moment.

Until we support disabling C++ tests, there is no need to run them in rerun disabled tests mode.

### Testing

Coming from https://github.com/pytorch/pytorch/pull/102107, for example https://github.com/pytorch/pytorch/actions/runs/5062224659/jobs/9087747981

```
2023-05-23T22:46:50.1953318Z Running cpp/basic 1/1 ... [2023-05-23 22:46:50.195077]
2023-05-23T22:46:50.1953847Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:50.2066032Z Running cpp/atest 1/1 ... [2023-05-23 22:46:50.206348]
2023-05-23T22:46:50.2066435Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode
2023-05-23T22:46:52.2666743Z No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
2023-05-23T22:46:52.2691817Z Ignoring disabled issues:  []
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102132
Approved by: https://github.com/clee2000
2023-05-24 07:45:48 +00:00
29da75cc55 Enable mypy allow redefinition (#102046)
Related #101528

I tried to enable this in another PR but it uncovered a bunch of type errors: https://github.com/pytorch/pytorch/actions/runs/4999748262/jobs/8956555243?pr=101528#step:10:1305

The goal of this PR is to fix these errors.

---

This PR enables [allow_redefinition = True](https://mypy.readthedocs.io/en/stable/config_file.html#confval-allow_redefinition) in `mypy.ini`, which allows for a common pattern:

> Allows variables to be redefined with an arbitrary type, as long as the redefinition is in the same block and nesting level as the original definition.

`allow_redefinition` allows mypy to be more flexible by allowing reassignment to an existing variable with a different type... for instance (from the linked PR):

4a1e9230ba/torch/nn/parallel/data_parallel.py (L213)

A `Sequence[Union[int, torch.device]]` is narrowed to `Sequence[int]` through reassignment to the same variable.
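
A generic illustration of the same pattern (the function name and types are made up for the example):

```python
from typing import Sequence, Union

def to_lengths(items: Sequence[Union[str, int]]) -> Sequence[int]:
    # Under allow_redefinition, reassigning `items` with a narrower type is
    # accepted as long as it happens in the same block and nesting level.
    items = [len(x) if isinstance(x, str) else x for x in items]
    return items
```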

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102046
Approved by: https://github.com/ezyang
2023-05-24 07:05:30 +00:00
bf059e3925 [Typing] Export torch.backends as subpackage (#102099)
So that `pyright` is happy.

Do a little refactor in `mps/__init__.py` to avoid cyclical dependency on `torch.fx` by calling `mps._init()` implicitly.

Fixes https://github.com/pytorch/pytorch/issues/101686
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102099
Approved by: https://github.com/Skylion007
2023-05-24 07:03:17 +00:00
d26c8f26d1 Lower xdist processes from auto to NUM_PROCS (#102124)
This is to avoid CUDA OOM issues when running C++ tests both regularly and in memory leak check mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102124
Approved by: https://github.com/clee2000
2023-05-24 06:50:55 +00:00
3318a832b3 Tighten FakeTensor reentrancy asserts, add debugging (#102091)
When investigating failures in https://github.com/pytorch/pytorch/pull/100017 I realized that we were reentering FakeTensorMode even though there was already one on the stack. Although we have attempted assert for these cases in the past, e.g., as in https://github.com/pytorch/pytorch/pull/97186 it seems that the existing protections were insufficient.

In this particular case, the reapplication of FakeTensorMode was due to an interaction with NotImplemented multiple dispatch handling. If proxy tensor mode detects an unrecognized tensor type (this includes FakeTensor, if it is not tracked with a proxy), it will return NotImplemented to give this tensor a chance to unpack itself into a proxyable operation. However, this is never the right thing for FakeTensor, where no unpacking is possible. Yet today, FakeTensor attempts to reapply the FakeTensorMode, resulting in FakeTensorMode being on the stack twice.

This PR does a number of things:

* It adds an assert in `FakeTensorMode.__torch_dispatch__` that you must not already have this mode on the stack, this is ALWAYS an error
* It modifies `FakeTensor.__torch_dispatch__` to return `NotImplemented` if the mode is already active. This prevents us from readding the mode on the stack
* It adds a new logging artifact `not_implemented` which you can use to get debug logs about all of the times a `__torch_dispatch__` handler returned NotImplemented and why it did so. Your subclass has to manually opt into this logging, but I inserted the necessary logs for ProxyTensorMode and FakeTensor(Mode)
* `with fake_mode` now no-ops if the fake mode is already on the stack, which is what users want anyway
* I am BREAKING pre-autograd tracing, because it is currently doing something weird with the original C++ mode stack. Brian is going to follow up with a fix next week.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102091
Approved by: https://github.com/thiagocrepaldi, https://github.com/eellison, https://github.com/wanchaol, https://github.com/bdhirsh
2023-05-24 05:37:51 +00:00
38f8f756bf group constraints by arg (#102096)
Differential Revision: [D46110979](https://our.internmc.facebook.com/intern/diff/D46110979/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102096
Approved by: https://github.com/ydwu4
2023-05-24 05:27:54 +00:00
907cc6c11c [vision hash update] update the pinned vision hash (#102136)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102136
Approved by: https://github.com/pytorchbot
2023-05-24 03:52:05 +00:00
2e18dd2bdc Improve bf16 neg by bypassing the conversion between BF16 and FP32 (#99711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99711
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/desertfire
2023-05-24 03:25:23 +00:00
45843c7f41 test_memory_format fix for test_modules.py (#102006)
add with_tf32_off, add sm80 check for thresholds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102006
Approved by: https://github.com/ngimel
2023-05-24 02:32:45 +00:00
47e9dba765 move tf32_on_and_off fix for test_convolution.py (#102007)
Move tf32_on_and_off after @torch.backends.cudnn.flags(enabled=True, benchmark=False), because @torch.backends.cudnn.flags(enabled=True, benchmark=False) overwrites tf32_on_and_off if it comes after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102007
Approved by: https://github.com/ngimel
2023-05-24 02:23:06 +00:00
d805a53f1f disable tf32 for rnn tests and norm tests (#102005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102005
Approved by: https://github.com/ngimel
2023-05-24 02:22:58 +00:00
ea5eaa8692 Remove config check in specialize (#102098)
Fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102098
Approved by: https://github.com/ezyang
2023-05-24 01:26:22 +00:00
d55aad1f3e Disable (Broken) CUDAStreamVariable in dynamo (#100766)
While attempting to explore XLTransformers w/ PT2, I found that we leak tracing time objects (VariableTrackers) into the runtime:

```
Traceback (most recent call last):
  File "/scratch/voz/work/xlformers/train.py", line 686, in <module>
    main(cfg)
  File "/scratch/voz/work/xlformers/train.py", line 357, in main
    pred, _ = model(x)
  File "/scratch/voz/work/pytorch/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/voz/work/pytorch/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/voz/work/pytorch/torch/_dynamo/eval_frame.py", line 282, in _fn
    return fn(*args, **kwargs)
  File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1416, in forward
    self._lazy_init()
  File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1424, in <resume in forward>
    args, kwargs = cast_floats_to_right_precision(True, True, *args, **kwargs)
  File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1434, in <resume in forward>
    self._rebuild_full_params()
  File "/scratch/voz/work/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1932, in _rebuild_full_params
    def update_p_data(custom_output_tensor: Optional[torch.Tensor] = None) -> None:
  File "/data/home/voz/miniconda3/envs/torch5/lib/python3.10/site-packages/fairscale/nn/data_parallel/fully_sharded_data_parallel.py", line 1932, in <resume in _rebuild_full_params>
    def update_p_data(custom_output_tensor: Optional[torch.Tensor] = None) -> None:
  File "/scratch/voz/work/pytorch/torch/cuda/__init__.py", line 464, in __enter__
    if self.src_prev_stream.device != cur_stream.device:
AttributeError: 'CUDAStreamVariable' object has no attribute 'device'
```

This indicates a serious bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100766
Approved by: https://github.com/ezyang
2023-05-24 01:22:21 +00:00
cc233f4e23 integrate the new event with pytorch (#101025)
Test Plan: This is a no-op diff

Differential Revision: D45698169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101025
Approved by: https://github.com/aaronenyeshi
2023-05-24 00:38:26 +00:00
e79d9b9938 [pt2] add SymInt support for linalg.matrix_power (#101940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101940
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-05-24 00:21:52 +00:00
69f7b40949 [pt2] add SymInt support for eye (#101955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101955
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-05-24 00:21:52 +00:00
42b974e8f7 [pt2] add meta for linalg_lu_solve (#101836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101836
Approved by: https://github.com/lezcano
2023-05-24 00:21:50 +00:00
6c68116643 [MPS] Calculate nonzero count first before running nonzero op (#102052)
Summary of changes:
- Calculate nonzero count first before running nonzero op
- allocate only 1 element when calling .item(), and blit only the size of destination
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102052
Approved by: https://github.com/kulinseth
2023-05-24 00:19:42 +00:00
be5e77ca4c Make _StorageBase.byteswap faster ( > 10000x) (#101925)
This PR addresses #101690. It implements faster swapping of data elements in `_StorageBase` using C++ rather than Python.

This helps the situation where a large model saved on a little-endian machine is loaded on a big-endian machine.

TODO:
- [x] Add test cases
- [x] Add performance comparison before and after the PR
- [ ] (Optional) Investigate further opportunities for performance improvements by [SIMDization](https://dev.to/wunk/fast-array-reversal-with-simd-j3p)

Fixes #101690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101925
Approved by: https://github.com/mikaylagawarecki
2023-05-24 00:13:41 +00:00
94ed26d177 [quant][pt2e] prepare_pt2e use quantization spec directly (#102054)
Summary:
In this PR we align with the design of the annotation API and use the quantization spec directly for annotation.
The main change is in prepare: we consume the quantization_spec object directly instead of the observer or fake quant constructor, and create the constructor
inside prepare. After this PR, annotation API users only need to interact with the quantization spec object.

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Reviewed By: kimishpatel

Differential Revision: D45934088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102054
Approved by: https://github.com/kimishpatel
2023-05-23 23:25:56 +00:00
99f68d56ee [PyTorch] Delete c10::guts::if_constexpr (#101991)
Now that we have C++17, we should not need this any more.

Differential Revision: [D46078335](https://our.internmc.facebook.com/intern/diff/D46078335/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101991
Approved by: https://github.com/r-barnes, https://github.com/Skylion007
2023-05-23 23:19:35 +00:00
f65732552e Support FakeTensor with FlatParameter (#101987)
In this PR we turn FlatParameter into a virtual tensor subclass
which doesn't actually ever get instantiated: __new__ will create
a Parameter instead (or a FakeTensor, if necessary).

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101987
Approved by: https://github.com/awgu, https://github.com/eellison
2023-05-23 23:12:08 +00:00
5147fe4969 Revert "[inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)"
This reverts commit b9721bd70531df641fbd484ab05085a4c52657be.

Reverted https://github.com/pytorch/pytorch/pull/101812 on behalf of https://github.com/osalpekar due to Causing test_nn_cuda tests to crash during runtime. More details at [D46093942](https://www.internalfb.com/diff/D46093942) ([comment](https://github.com/pytorch/pytorch/pull/101812#issuecomment-1560238085))
2023-05-23 23:06:21 +00:00
2e5e53b718 Do not upload MacOS conda environment to GitHub when job fails (#102108)
Fixes https://github.com/pytorch/pytorch/issues/101800. This is not needed anymore as the dependency issues on MacOS have been addressed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102108
Approved by: https://github.com/malfet
2023-05-23 23:01:33 +00:00
32ce06a5ab Revert "[Reland] fix missing-prototypes warnings in torch_cpu (Part 4) (#101949)"
This reverts commit 4f2c007a1b5170c2aa0d47e388ff9e07c7a7d354.

Reverted https://github.com/pytorch/pytorch/pull/101949 on behalf of https://github.com/osalpekar due to As noted in @izaitsevfb's comment, we are still seeing linker errors, this time due to `nnc_prepacked_linear_clamp_run` being made a static function. ([comment](https://github.com/pytorch/pytorch/pull/101949#issuecomment-1560226880))
2023-05-23 22:53:47 +00:00
45a8f691ec Revert "[Reland] fix missing-prototypes warnings in torch_cpu (Part 5) (#101976)"
This reverts commit 4db2dade258961fbbf44bc0015235da89a26cb46.

Reverted https://github.com/pytorch/pytorch/pull/101976 on behalf of https://github.com/osalpekar due to reverting to allow https://github.com/pytorch/pytorch/pull/101949 to be cleanly reverted ([comment](https://github.com/pytorch/pytorch/pull/101976#issuecomment-1560224839))
2023-05-23 22:50:28 +00:00
0759e1d132 Revert "add Half support for sinh, cosh, polygamma, entr and i0e on CPU (#99002)"
This reverts commit 5c3cf76eb2c8c8699ae0341c2753903a87bbfda2.

Reverted https://github.com/pytorch/pytorch/pull/99002 on behalf of https://github.com/osalpekar due to Need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/101976 ([comment](https://github.com/pytorch/pytorch/pull/99002#issuecomment-1560221288))
2023-05-23 22:44:59 +00:00
7e58891ca0 Support list output for HigherOrderOperators (#101986)
Fixes the issue in #100278: support list output for HigherOrderOperator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101986
Approved by: https://github.com/zou3519
2023-05-23 21:36:04 +00:00
e7a6818e97 Register top level logger for torch (#102090)
This enables use of artifact logging in modules that aren't under
the modules that were specified here.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102090
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2023-05-23 21:24:21 +00:00
149237415f Using deterministic hashing instead of GUID for pytorch serialization id generation (#101964)
Summary:
serialization_id was added in a previous change as a random GUID written each time a module is saved, for the purpose of tracking saved artifacts. In order not to disturb existing systems that rely on the serialized bytes being deterministic when serializing the same module, this change uses a combined hash of the uncompressed content and file names instead of a GUID for the serialization id.
The use of this hashing reuses the same CRC32 that is already calculated for zip writing, so it doesn't incur additional computational overhead.

Data descriptor is one of the file headers inside the zip format https://en.wikipedia.org/wiki/ZIP_(file_format)#Data_descriptor. It contains the CRC32 of the uncompressed data. By inspecting the written data in PyTorchStreamWriter, the CRC32 is found for each written record.
In order to make serialization_id a unique and deterministic id for the
serialized files without computation overhead, the updated `serialization_id` is computed based on all files written, and is composed of:
1) a combined hash of record name hashes
2) a combined crc32 of the record uncompressed data

Example value: "15656915541136177431866432772"

Test Plan: buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test

Differential Revision: D46038973

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101964
Approved by: https://github.com/davidberard98
2023-05-23 20:47:30 +00:00
76af22103b Fixed type hints for CosineAnnealingWarmRestarts (#102067)
Fixed type hints for CosineAnnealingWarmRestarts:
- `T_mult` is not `Optional[int]` but just `int`
- `eta_min` is not `Optional[float]` but just `float`
- removed `step` method specific annotation as it is compatible with the base class

e132f09e88/torch/optim/lr_scheduler.py (L1365-L1375)

Otherwise, a computation like `self.T_i * self.T_mult` in `self.step` is not possible:
```
error: Unsupported operand types for * ("int" and "None")
```
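
A small usage sketch with the corrected hints; the optimizer and values are arbitrary:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

opt = torch.optim.SGD([torch.nn.Parameter(torch.zeros(1))], lr=0.1)
# T_mult is a plain int and eta_min a plain float (not Optional), per the fix above.
sched = CosineAnnealingWarmRestarts(opt, T_0=10, T_mult=2, eta_min=1e-5)

for _ in range(5):
    opt.step()
    sched.step()
```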
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102067
Approved by: https://github.com/janeyx99
2023-05-23 19:06:07 +00:00
4692ea76a0 Fine grained apis docs (#101897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101897
Approved by: https://github.com/msaroufim
2023-05-23 19:03:44 +00:00
723f111545 [custom_op] explicit autograd API (#101824)
This PR adds an explicit API for registering a backward formula for a
CustomOp. In the end state, we will likely have this explicit API and a
magic API (which is sugar on top of an explicit API), since different
parties of users prefer different ones.

Concretely, to define a backward formula for a CustomOp:
- a user must provide us a "save for backward" function that accepts
(inputs, output) and returns exactly what they want saved for backward
- a user must provide us a "backward" function that accepts
(ctx, saved, *grads) and returns us the grad_inputs. The grad_inputs
are returned as a dict mapping str to a gradient.
Please see the changes in custom_op_db.py for examples of the API.
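
A hedged sketch of the shape of those two user-provided functions for a hypothetical `mysin(x)` op; the registration decorators (`impl_save_for_backward` / `impl_backward`) are named below, but how `inputs` is structured here is an assumption for illustration:

```python
import torch

def save_for_backward_fn(inputs, output):
    # Return exactly what should be stashed for the backward pass
    # (assuming `inputs` exposes the op's arguments by name).
    return {"x": inputs.x}

def backward_fn(ctx, saved, grad_output):
    # Return grad_inputs as a dict mapping input name -> gradient.
    return {"x": grad_output * saved["x"].cos()}
```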

There are a number of pieces to this PR and I'm happy to split it if it
helps. They are:
- The actual APIs for specifying the two functions
(impl_save_for_backward, impl_backward)
- The autograd kernel: we take the functions the user give us and
construct an autograd.Function object that we then register to
the Autograd dispatch key
- Indirection for the autograd kernel. We add a layer of indirection so
that one can swap out the autograd kernel. This is necessary because by
default, we register an "autograd not implemented" kernel as the
Autograd implementation but then swap it for the actual kernel when the
user provides it.

Test Plan:
- We apply this API to give backward formulas for things in
custom_op_db. We then hook up custom_op_db to the Autograd OpInfo tests.
- Various tests in test_python_dispatch.py to check error cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101824
Approved by: https://github.com/ezyang
2023-05-23 18:31:29 +00:00
8487105fae [custom_op] Create a new torch._custom_op namespace (#101823)
torch/custom_op.py is getting long, and the autograd pieces are going to
make it even longer. I'm planning on just organizing the files under
a torch/_custom_op folder.

Note that the imports now look a bit crazy (from torch._custom_op.impl
import...) but they will look more OK when we figure out the plan to
make custom_op public (coming later).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101823
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/bdhirsh
2023-05-23 18:31:29 +00:00
73d1be8e99 [custom_op] Add a test for symints (#101822)
Tests that a custom op annotated with Sequence[int] actually accepts
Sequence[SymInt].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101822
Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/bdhirsh
2023-05-23 18:31:27 +00:00
6e0c741105 [dtensor] hide mesh validation check under init_process_group flag (#101996)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101996
Approved by: https://github.com/wz337
2023-05-23 18:17:54 +00:00
70eccdbf92 [dtensor] add necessary logging to APIs and components (#101994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101994
Approved by: https://github.com/wz337
2023-05-23 18:17:54 +00:00
eda7efe662 Fix ProfilerTree Test (#101983)
Summary:
T152692570

Created from CodeHub with https://fburl.com/edit-in-codehub

Test Plan:
CI

Sandcastle run

Reviewed By: aaronenyeshi

Differential Revision: D45656571

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101983
Approved by: https://github.com/aaronenyeshi
2023-05-23 18:10:20 +00:00
02a7318a5b [MPS] Add aminmax op (#101691)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101691
Approved by: https://github.com/malfet
2023-05-23 18:01:34 +00:00
80dd847b62 Fix fragile code in torch.__init__.py related to torch._inductor import (#102021)
Fixes #102020

For motivation of this change see the above issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102021
Approved by: https://github.com/msaroufim, https://github.com/jansel
2023-05-23 16:59:17 +00:00
7d1ba0a92a Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-23 16:49:17 +00:00
51ff408f77 Add retry when cleaning up Windows workspace (#102051)
Windows flakiness strikes again.  There is a new flaky issue that started appearing on HUD in which tearing down the Windows workspace fails with a `Device or resource busy` error when trying to `rm -rf ./*` the workspace, for example https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717.  It happens on both build and test jobs.  I have looked into all commits since last weekend but there is nothing standing out or Windows-related.

The error means that a process still holds the directory, but it's unclear which one, as all CI processes should have been stopped by then (https://github.com/pytorch/pytorch/pull/101460), with the only exception of the runner daemon itself.  On the other hand, the issue is flaky, as the next job running on the same failed runner can clean up the workspace fine when checking out PyTorch (https://github.com/pytorch/pytorch/blob/main/.github/actions/checkout-pytorch/action.yml#L21-L35).

For example, `i-0ec1767a38ec93b4e` failed at https://github.com/pytorch/pytorch/actions/runs/5051845102/jobs/9064107717 and its immediate next job succeeded https://github.com/pytorch/pytorch/actions/runs/5052147504/jobs/9064717085.  So, I think that adding retrying should help mitigate this.

Related to https://github.com/pytorch/test-infra/pull/4206 (not the same root cause, I figured out https://github.com/pytorch/test-infra/pull/4206 while working on this PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102051
Approved by: https://github.com/kit1980
2023-05-23 16:41:58 +00:00
431344f2d0 [inductor] Refactor generate_kernel_call (#102018)
Summary: Refactor generate_kernel_call to support codegen call to Triton
kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102018
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-05-23 15:54:49 +00:00
e132f09e88 [Dynamo] Fix test_cuda_set_device to restore device (#102049)
Fixes #102025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102049
Approved by: https://github.com/ngimel
2023-05-23 07:37:12 +00:00
b91eb97d34 [transformer benchmark] relax tolerance in sdp.py (#101965)
Summary:
Otherwise we get
```
Traceback (most recent call last):
  File "<string>", line 49, in <module>
  File "<string>", line 47, in __run
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 346, in <module>
    main(save_path)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 328, in main
    experiment = run_single_experiment(experiment_config)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 229, in run_single_experiment
    assert_close_tensors(nn_mha_output, composite_mha_output)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp__/sdp#link-tree/caffe2/benchmarks/transformer/sdp.py", line 196, in assert_close_tensors
    assert torch.allclose(a, b, atol=1e-3, rtol=1e-3)
AssertionError
```

Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp

Differential Revision: D45843836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101965
Approved by: https://github.com/drisspg
2023-05-23 06:54:08 +00:00
e9246b290f Initialize cuda tensor in fake tensor (#102027)
Fix for https://github.com/pytorch/pytorch/issues/92627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102027
Approved by: https://github.com/ngimel
2023-05-23 06:24:50 +00:00
9bbee245fe update rules_python and let bazel install its own pip dependencies (#101405)
update rules_python and let bazel install its own pip dependencies

Summary:
This is the official way of doing Python in Bazel.

Test Plan: Rely on CI.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101405).
* #101406
* __->__ #101405
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101405
Approved by: https://github.com/vors, https://github.com/huydhn
2023-05-23 06:20:33 +00:00
2ca75d49a8 [DTensor][3/N] add DTensor constructor function: full (#101436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101436
Approved by: https://github.com/wanchaol
2023-05-23 06:05:40 +00:00
5c3cf76eb2 add Half support for sinh, cosh, polygamma, entr and i0e on CPU (#99002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99002
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/ngimel
2023-05-23 06:02:15 +00:00
f7c736e1e7 [quant][pt2e] Add observer_or_fake_quant_ctr to QuantizationSpec (#101920)
Summary:
This is the second refactor to align the annotation API with design,
next step is to change prepare_pt2e to consume QuantizationSpec object directly

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Reviewed By: kimishpatel

Differential Revision: D45927416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101920
Approved by: https://github.com/andrewor14
2023-05-23 05:48:23 +00:00
8cab7994a6 [inductor] Move cpp wrapper dynamic shapes test to test_cpp_wrapper (#102017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102017
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-05-23 03:59:55 +00:00
2bce7c8f46 CUDAGraph trees doc (#101902)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101902
Approved by: https://github.com/msaroufim
2023-05-23 03:35:43 +00:00
4a1e9230ba [vision hash update] update the pinned vision hash (#102028)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102028
Approved by: https://github.com/pytorchbot
2023-05-23 03:04:07 +00:00
9121f5ca84 Use the symint version of computeStorageNbytes within get_nbytes. (#101634)
Fixes [#ISSUE_NUMBER](https://github.com/pytorch/xla/pull/4998) according to [comment](https://github.com/pytorch/xla/pull/4998#issuecomment-1550232063).

This change is needed to make sure calling tensor.sizes() will error if the tensor has a dynamic dimension in pytorch/xla.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101634
Approved by: https://github.com/ezyang
2023-05-23 02:42:58 +00:00
f216fea44f Remove commented out pdb (#101993)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101993
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2023-05-23 02:20:10 +00:00
6d0079b12b [BE] Do not expose torch.functional.opt_einsum (#102004)
It's not mentioned in `__all__`, so move `import torch.backends.opt_einsum as opt_einsum` into the `einsum` function to delay the `torch.backends` import and hide it completely from the top-level module scope.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102004
Approved by: https://github.com/janeyx99
2023-05-23 01:52:40 +00:00
a2fd2c2b83 [Pytorch] Add Vulkan support for aten::unsqueeze for 2d to 3d (#101719)
Summary: Unsqueeze operator: https://pytorch.org/docs/stable/generated/torch.unsqueeze.html#torch.unsqueeze

Test Plan:
Unsqueeze tests:
https://www.internalfb.com/phabricator/paste/view/P738187802
```
lfq@lfq-mbp fbsource % buck run --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1 -- --gtest_filter="*unsqueeze*"
Downloaded 0/2 artifacts, 0.00 bytes, 100.0% cache miss (for updated rules)
Building: finished in 15.0 sec (100%) 455/455 jobs, 2/455 updated
  Total time: 15.0 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *unsqueeze*
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.unsqueeze_dim0
[       OK ] VulkanAPITest.unsqueeze_dim0 (96 ms)
[ RUN      ] VulkanAPITest.unsqueeze_dim1
[       OK ] VulkanAPITest.unsqueeze_dim1 (2 ms)
[ RUN      ] VulkanAPITest.unsqueeze_dim2
[       OK ] VulkanAPITest.unsqueeze_dim2 (3 ms)
[----------] 3 tests from VulkanAPITest (101 ms total)
[----------] Global test environment tear-down
[==========] 3 tests from 1 test suite ran. (101 ms total)
[  PASSED  ] 3 tests.
```
All tests:
buck run //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64

https://www.internalfb.com/phabricator/paste/view/P738255852

Reviewed By: SS-JIA

Differential Revision: D45893511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101719
Approved by: https://github.com/SS-JIA
2023-05-23 01:29:40 +00:00
5fe629e314 Add PyObject preservation for UntypedStorage (#97470)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97470
Approved by: https://github.com/ezyang
2023-05-23 01:27:30 +00:00
488a4303a5 Enable quantized_max_pool3d (#101654)
**Summary**
Enable `quantized_max_pool3d` kernel to fix the issue https://github.com/pytorch/pytorch/issues/101386.

**Test Plan**
```
clear && python -u -m pytest -s -v test_quantized_op.py -k test_max_pool3d
clear && python -u -m pytest -s -v test_quantized_op.py -k test_max_pool3d_nhwc
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101654
Approved by: https://github.com/albanD, https://github.com/jgong5, https://github.com/mingfeima
2023-05-23 00:45:38 +00:00
8243abc84a [1/n] instanceof instead of singleton for ph check (#102008)
Summary: Change placeholder check from singleton to instanceof PHBase so you can create your own PH class with metadata

Test Plan: added unit test

Reviewed By: joshuadeng

Differential Revision: D46085128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102008
Approved by: https://github.com/PaliC
2023-05-23 00:07:45 +00:00
81b0f72e16 Fix xnnpack link errors (#101630)
Summary: Setting srcs for all arvr platforms in xnnpack.buck.bzl allows build with arvr/platform010 to pass.

Test Plan: CI

Reviewed By: blchxfm

Differential Revision: D45738660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101630
Approved by: https://github.com/digantdesai
2023-05-22 22:36:15 +00:00
2ae87a1f87 missed StackDataset documentation (#101927)
The new dataset class added by #101338 was missed in the documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101927
Approved by: https://github.com/kit1980
2023-05-22 21:12:16 +00:00
b9721bd705 [inductor][decomp] Add aten._unsafe_index_put for unchecked indexing (#101812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101812
Approved by: https://github.com/lezcano
2023-05-22 20:39:18 +00:00
e07c04f48a [inductor] Update qualname and module for wrapped testcases (#101975)
D45936056 was hitting bizarre failures running unit tests under FB's
test runner, where we'd see things like:
```
9 TESTS FAILED
  ✗ caffe2/test/inductor:fused_attention - <locals> (unittest.loader._FailedTest)
```

The reason for this is, it turns out, that the test runner uses a two-step process
where it first lists the tests in one process and then runs them, using the
names from the listing step, in separate processes.

But, since we're decorating the class, it ends up getting listed with a weird name
like `torch._dynamo.config_utils.ContextDecorator.__call__.<locals>._TestCase`,
and when the runner tries to load that module, it fails.

So one solution (other than, you know, using pytest) is to update the
__qualname__ and __module__ of the _TestCase wrapper so that the runner will
actually load the right module.
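
As a rough illustration of the fix (toy code, not the actual decorator in this PR), copying the identity metadata onto the wrapper class makes the listed test name resolvable again:

```python
import unittest


def with_config(cls):
    # Toy sketch: without the three assignments below, the wrapper is listed
    # as `<module>.with_config.<locals>._TestCase`, which a two-step test
    # runner cannot re-import.
    class _TestCase(cls):
        pass

    _TestCase.__name__ = cls.__name__
    _TestCase.__qualname__ = cls.__qualname__
    _TestCase.__module__ = cls.__module__
    return _TestCase


@with_config
class MyTest(unittest.TestCase):
    def test_something(self):
        self.assertTrue(True)
```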

@build[pytorch_dynamo_inductor]

Differential Revision: [D46044467](https://our.internmc.facebook.com/intern/diff/D46044467/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101975
Approved by: https://github.com/xuzhao9, https://github.com/jansel
2023-05-22 20:35:29 +00:00
c618093681 [vulkan] Fix concat op in feature dimension (#101721)
Summary: Fix a small bug in the `cat_feature` shader where an early exit path was not being taken correctly.

Test Plan:
Referring to [Pytorch Vulkan Testing Procedures](https://www.internalfb.com/intern/wiki/Pytorch_Vulkan_Backend/Development/Vulkan_Testing_Procedures/):

* Run operator unit tests on Mac and Android
* Run model inference and correctness benchmarks

Differential Revision: D45962806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101721
Approved by: https://github.com/salilsdesai
2023-05-22 20:24:40 +00:00
be94ff976d Have irange use if constexpr (#94050)
Test Plan: Sandcastle

Differential Revision: D42997779

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94050
Approved by: https://github.com/Skylion007, https://github.com/soumith, https://github.com/malfet
2023-05-22 20:12:37 +00:00
38e73b30b7 bring quantized_backward.cpp in sync with intern (#101990)
The version of [D45965552](https://www.internalfb.com/diff/D45965552) exported as #101739 was not the latest. This PR brings GH in sync with intern.

For Meta employees, see:
[D46056765](https://www.internalfb.com/diff/D46056765)
[D45965552](https://www.internalfb.com/diff/D45965552)

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101990
Approved by: https://github.com/kit1980
2023-05-22 19:53:42 +00:00
4de5ee43bf [torch.library] Change Library.__del__ into weakref.finalize (#101829)
`__del__` is a bit difficult to use, because when it is called, it is
not guaranteed that anything it uses has not been cleaned up.

Ed tells me he got the following exception one day, which is what
prompted this PR.
```
Exception ignored in: <function Library.__del__ at 0x7fa36d211e50>
Traceback (most recent call last):
  File "/data/users/ezyang/a/pytorch/torch/library.py", line 139, in
  __del__
  AttributeError: 'NoneType' object has no attribute 'remove'
```

One solution is to use weakref.finalize, which lets one define a
function to be run when the object is deleted that can hold references
to specific things it needs.

Another solution is to just check if the object is None, but I like the
weakref solution better.
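
For illustration, a toy (non-torch) version of the pattern: the finalizer captures exactly the objects it needs, so cleanup does not depend on module globals still being alive at interpreter shutdown:

```python
import weakref


class ToyLibrary:
    # Stand-in for torch.library.Library, only to show the finalize pattern.
    def __init__(self, registry, name):
        self.name = name
        registry.append(name)
        # Holds (registry, name) strongly; holds `self` only weakly.
        self._finalizer = weakref.finalize(self, registry.remove, name)


registry = []
lib = ToyLibrary(registry, "my_ops")
del lib                    # finalizer runs here (or at interpreter exit at the latest)
assert registry == []
```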

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101829
Approved by: https://github.com/ezyang
2023-05-22 19:51:08 +00:00
5e635e17da Add documentation for a catching invalid index type (#96451)
Summary:
The assert in compute_q8gemm_prepacked_sparse_dq is currently unreachable.

Added inline comments to explain what is happening.

Test Plan: Ran  qnnpack q8gemm-sparse-test to verify.

Differential Revision: D43930667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96451
Approved by: https://github.com/salilsdesai, https://github.com/jianyuh
2023-05-22 19:37:09 +00:00
c9f8f4cf2d Fix device normalization of automatically generate methods for custom backends. (#101796)
Fixes #ISSUE_NUMBER
Fix error handling when the device input parameter is passed as a string, and align capabilities accordingly.

`foo_storage = torch.DoubleStorage(4).foo(device="foo:0", non_blocking=False)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101796
Approved by: https://github.com/bdhirsh
2023-05-22 19:02:16 +00:00
cyy
4db2dade25 [Reland] fix missing-prototypes warnings in torch_cpu (Part 5) (#101976)
PR #101788 depended on https://github.com/pytorch/pytorch/pull/100849 which was  reverted. Now that https://github.com/pytorch/pytorch/pull/100849 has been relanded,  we can reland #101788 too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101976
Approved by: https://github.com/Skylion007
2023-05-22 19:00:13 +00:00
bdb3fb49bc [c10d] Fix the check message of unsupported collectives ops. (#101775)
1. Fix the check message of unsupported collectives ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101775
Approved by: https://github.com/H-Huang
2023-05-22 18:37:05 +00:00
5ba16011d7 Suppress profiler spam in dynamo benchmarks (#101942)
Makes this stuff go away:
```
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-20 20:49:34 63580:63580 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101942
Approved by: https://github.com/shunting314, https://github.com/desertfire
2023-05-22 18:32:31 +00:00
38a29324b0 [dtensor][2/N] more tensor ops to use strategy propagation (#101203)
As titled, this PR adapts a few more tensor ops to use strategy-based
sharding propagation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101203
Approved by: https://github.com/XilunWu
2023-05-22 17:16:14 +00:00
496212f408 Revert "group constraints by arg (#101815)"
This reverts commit 03de15806e5d27ee4ef6d82dbcc66dac78f6e3bf.
Reverted https://github.com/pytorch/pytorch/pull/101815 on behalf of https://github.com/malfet because it broke ExecuTorch and the author was well aware of it
2023-05-22 09:28:43 -07:00
a630328695 Fix Backend docs search items (#101214)
Fixes #100944

## New

<img width="1142" alt="image" src="https://github.com/pytorch/pytorch/assets/13214530/79102f2e-8a8f-4169-be53-9248397e653c">

<img width="765" alt="image" src="https://github.com/pytorch/pytorch/assets/13214530/4e5f17e7-a445-4822-ac8a-0d73c9ed71ee">

## Old

<img width="1341" alt="image" src="https://github.com/pytorch/pytorch/assets/13214530/985b4ec9-6d11-4962-8619-3c14ec09c3d9">

<img width="1112" alt="image" src="https://github.com/pytorch/pytorch/assets/13214530/e8dcf1a9-73e7-4fd6-8adc-eb036b1bb87b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101214
Approved by: https://github.com/albanD
2023-05-22 14:58:38 +00:00
a6f4088c21 Hint Tensor._make_subclass as a staticmethod (#101961)
Fixes #101862

No more type errors and improved return type value:
```python
import torch
from torch import nn

t = torch.tensor([1, 2, 3], dtype=torch.float32)

t2 = torch.Tensor._make_subclass(  # OK
    nn.Parameter,
    t.data,
)
reveal_type(t2)  # Type of "t2" is "Parameter"

t3 = t._make_subclass(  # OK
    nn.Parameter,
    t.data,
)
reveal_type(t3)  # Type of "t3" is "Parameter"

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101961
Approved by: https://github.com/albanD
2023-05-22 12:42:50 +00:00
19af5c0b69 Explain how fastAtomicAdd works (#101951)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101951
Approved by: https://github.com/albanD
2023-05-22 12:41:08 +00:00
cyy
4f2c007a1b [Reland] fix missing-prototypes warnings in torch_cpu (Part 4) (#101949)
This PR relands the changes introduced in PR #100849. The old PR turned nnc_aten_embedding into a static function; however, it is actually used in torch/csrc/jit/tensorexpr/operators/misc.cpp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101949
Approved by: https://github.com/albanD
2023-05-22 10:53:07 +00:00
d0bb8fdc64 Revert "[dynamo] Minor refactor to use is_allowed to decide inlining of NNModule methods (#101910)"
This reverts commit 8b2a9f81cc7cab9cb49cd2c96b9304a3f9313fca.

Reverted https://github.com/pytorch/pytorch/pull/101910 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/101910#issuecomment-1556782524))
2023-05-22 08:37:12 +00:00
e9a7115605 Update Kineto submodule (#101952)
The Kineto submodule has stayed at commit 21beef3787b4134c43584f6c2443341921c41f69, which is from Apr 19th.

This commit just brings it up to date.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101952
Approved by: https://github.com/kit1980
2023-05-22 01:56:14 +00:00
0a694dba2b [inductor] fix avg_pool2d accuracy problem in lowering (#101789)
Fixes #100987

In the current `avg_pool2d` lowering in inductor, when `count_include_pad` is set, the mean of each window is calculated by dividing by a fixed value, i.e. `kernel_size[0] * kernel_size[1]`. However, in ceil mode the number of elements in a window on the border can be less than `kernel_size[0] * kernel_size[1]`. This PR fixes the issue.
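
A hedged repro sketch of the scenario (assuming these shapes produce a clipped border window in ceil mode; the exact repro in the linked issue may differ):

```python
import torch
import torch.nn.functional as F

def pool(x):
    # ceil_mode=True adds a final window on the border that covers fewer
    # than kernel_size * kernel_size input elements.
    return F.avg_pool2d(x, kernel_size=3, stride=2, padding=1,
                        ceil_mode=True, count_include_pad=True)

x = torch.randn(1, 1, 6, 6)
eager = pool(x)
compiled = torch.compile(pool)(x)
print(torch.allclose(eager, compiled))  # expected False before this fix, True after
```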

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101789
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/EikanWang
2023-05-22 01:32:40 +00:00
3004d40439 torch.unique with dim: NumPy compatible sorting (#101693)
Fixes https://github.com/pytorch/pytorch/issues/101681. The change `transpose -> moveaxis` was sufficient.
Not only does it make the output similar to NumPy, it also preserves lexicographical sorting order along selected dimensions.
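
For example (assuming the behavior described above), unique rows now come back in the same lexicographic order as NumPy:

```python
import numpy as np
import torch

x = torch.tensor([[2, 1],
                  [0, 3],
                  [2, 1]])

# After this change the unique rows are returned lexicographically sorted,
# matching np.unique(..., axis=0).
print(torch.unique(x, dim=0))
print(np.unique(x.numpy(), axis=0))
```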

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101693
Approved by: https://github.com/ngimel
2023-05-21 21:51:14 +00:00
dcffd5c646 show errors on bazel test failure (#101928)
show errors on bazel test failure

Summary:
Without this it's impossible to know what went wrong in CI.

Test Plan: Should be a no-op correctness wise.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101928).
* #101406
* #101405
* __->__ #101928
* #101445
* #101744
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101928
Approved by: https://github.com/huydhn
2023-05-21 20:04:21 +00:00
2a62b59e04 improve diagnostics from bazel_linter.py (#101445)
improve diagnostics from bazel_linter.py

Summary:
This was swallowing stderr on errors and trying to just parse an empty
string from stdout.

Test Plan: Verify with subsequent broken diff.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101445).
* #101406
* #101405
* #101928
* __->__ #101445
* #101744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101445
Approved by: https://github.com/huydhn
2023-05-21 19:02:35 +00:00
b54cdaf9fb use bazelisk as the bazel binary for lintrunner (#101744)
use bazelisk as the bazel binary for lintrunner

Summary: This lets us rely on .bazelversion to pick the right version.

Test Plan: Rely on CI.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101744).
* #101406
* #101405
* #101928
* #101445
* __->__ #101744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101744
Approved by: https://github.com/huydhn
2023-05-21 18:58:34 +00:00
807d81155f [CUDA][CUBLAS] Fix BF16 reduced precision reduction note in Numerical accuracy docs (#101884)
Fixes #100966

Ref #101044

Align implementation and documentation. (This is what's previously missed from the above issue and PR)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101884
Approved by: https://github.com/eqy, https://github.com/ezyang
2023-05-21 17:38:00 +00:00
351c2ea2fb [export] Prototype on serialization schema. (#101899)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101899
Approved by: https://github.com/angelayi
2023-05-21 06:31:53 +00:00
330c907301 [MPS] Fix embedding cache key (#101857)
Fixes #101198

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101857
Approved by: https://github.com/kulinseth
2023-05-21 06:11:25 +00:00
22ca1a1124 Partially fix shape mismatch in vision_maskrcnn (#101477)
The bulk of the heavy lifting is happening in
https://github.com/pytorch/vision/pull/7592

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101477
Approved by: https://github.com/voznesenskym
2023-05-21 05:20:08 +00:00
9e8da7fb44 [vision hash update] update the pinned vision hash (#101938)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101938
Approved by: https://github.com/pytorchbot
2023-05-21 02:29:32 +00:00
66a2600b6a [T153220354] Fix header inclusions in c10 (#1541) (#101846)
Summary:
This is a re-attempt to land the iwyu header changes, by taking the diff from [PR 100304](https://github.com/pytorch/pytorch/pull/100304) and adding the minimal changes needed to make the diff build correctly in the internal builds.

X-link: https://github.com/facebookresearch/pytorch3d/pull/1541

X-link: https://github.com/fairinternal/pytorch3d/pull/44

- Re-work D45769819 to fix header inclusions in c10

Test Plan:
```
buck2 build --no-remote-cache mode/dev-nosan //caffe2/c10/...

buck2 build --no-remote-cache mode/dev-nosan //deeplearning/fbgemm/fbgemm_gpu/...

buck2 build mode/dev-nosan //vision/fair/pytorch3d/pytorch3d:_C
```

Reviewed By: malfet

Differential Revision: D45920611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101846
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-05-20 19:35:14 +00:00
dde6d56101 Prevent pattern matches across mutation ops in inductor pre-grad FX passes (#101144)
Per https://github.com/pytorch/pytorch/issues/101124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101144
Approved by: https://github.com/jansel
2023-05-20 19:27:56 +00:00
13640bf925 disabling quantizing gradient in 8bw (#101739)
Summary:
Quantizing a *gradient* is not applicable to a complex ASR model.

Gradient in INT8
f438266519
Gradient in FP32
f438109197
The two WERs clearly show the limitation of quantizing the gradient.

As of now, we are okay with simply enabling quantized backpropagation while computing the gradient in FP32.
It already saves memory due to the reduced model size.

Test Plan: Signals

Differential Revision: D45965552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101739
Approved by: https://github.com/izaitsevfb
2023-05-20 18:39:12 +00:00
f0dc41a768 [ONNX] Bump onnx submodule to release 1.14.0 (#101809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101809
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-05-20 15:09:43 +00:00
03de15806e group constraints by arg (#101815)
Before, we would emit a soup of specializations / constraints without any obvious order to guide readability.

With this diff, we group such results by arg, and add comments preceding each group. Empirically, the results read much better.

Differential Revision: [D45995199](https://our.internmc.facebook.com/intern/diff/D45995199/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101815
Approved by: https://github.com/tugsbayasgalan
2023-05-20 06:01:14 +00:00
b5ee34e5f2 Disallow module forward input mutation in aot_export (#101834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101834
Approved by: https://github.com/bdhirsh
2023-05-20 05:41:01 +00:00
0c6f409cda [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-20 03:43:33 +00:00
8b2a9f81cc [dynamo] Minor refactor to use is_allowed to decide inlining of NNModule methods (#101910)
Fixes #101609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101910
Approved by: https://github.com/yanboliang
2023-05-20 03:34:20 +00:00
2886b3e692 [vision hash update] update the pinned vision hash (#101917)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101917
Approved by: https://github.com/pytorchbot
2023-05-20 02:57:54 +00:00
bb62a3734e inductor: fix name 'inf' is not defined issue when calling external_call function (#101865)
This PR will fix https://github.com/pytorch/pytorch/issues/101695.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101865
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/Skylion007
2023-05-20 01:44:21 +00:00
350f0cd78c inductor: fix bfloat16 store compiler issue (#101856)
Fix the bfloat16 compiler error:
```
/tmp/torchinductor_xiaobing/ez/cezrraw7rtu5vkxcfd544i53crqaobycprf5twyvf7b62jrgi75p.cpp: In function ‘void kernel(const bfloat16*, bfloat16*)’:
/tmp/torchinductor_xiaobing/ez/cezrraw7rtu5vkxcfd544i53crqaobycprf5twyvf7b62jrgi75p.cpp:20:79: error: expected ‘;’ before ‘}’ token
   20 |                         tmp0.store(tmp1 + static_cast<long>(16L*i1_inner), 16)
      |                                                                               ^
      |                                                                               ;
   21 |                     }

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101856
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-05-20 01:41:41 +00:00
029c6a9934 [accuracy minifier] cast copied model rather than update the original model (#101901)
This is the fix Ed found during the break of the summit :)

I thought I'd better split it out of https://github.com/pytorch/pytorch/pull/99773 so people don't need to patch that PR to run repro.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101901
Approved by: https://github.com/ezyang
2023-05-20 00:50:32 +00:00
73e887b5c7 [easy] refactor signature flattening transform (#101886)
Move `ChangeInputOutputSignature` out of export function to avoid closed over variables that make dependencies hard to understand. Also rename it while we're at it.

Differential Revision: [D46029076](https://our.internmc.facebook.com/intern/diff/D46029076/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101886
Approved by: https://github.com/tugsbayasgalan
2023-05-20 00:47:04 +00:00
7a17e9d0b6 [dynamo] Bugfix for unspecialized nn module variable (#101859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101859
Approved by: https://github.com/yanboliang, https://github.com/shingjan
2023-05-20 00:46:56 +00:00
48346a4648 [inductor] Test indirect indexing asserts with dimension of size 1 (#101811)
Closes #101354, where the test came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101811
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-19 23:11:56 +00:00
89bd5d3dab [inductor] Implement magic methods on IR values (#101076)
This wraps `ops` into an `OpsWrapper` object which wraps any returned
IR values into an `OpsValue` instance. This allows magic methods to
be implemented and means lowerings can write mathematical expressions much more
fluently. So instead of
```python
ops.add(ops.mul(ops.mul(ops.sub(ops.mul(_Ap2, x), _Ap3), x), x), _1)
```
we can write
```python
(_Ap2 * x - _Ap3) * x * x + _1
```

And it will translate to the equivalent `ops` calls.
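
A toy sketch of the mechanism (not the actual inductor code): the wrapper forwards attribute access to ops-building functions and re-wraps results, so Python's operator protocol produces the same calls:

```python
class OpsValue:
    """Toy IR value wrapper: arithmetic desugars to ops.* calls."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        return ops.add(self, other)

    def __sub__(self, other):
        return ops.sub(self, other)

    def __mul__(self, other):
        return ops.mul(self, other)


class OpsWrapper:
    """Forwards ops.<name>(...) and wraps the result in OpsValue."""

    def __getattr__(self, name):
        def inner(*args):
            unwrapped = [a.value if isinstance(a, OpsValue) else a for a in args]
            return OpsValue(f"{name}({', '.join(map(str, unwrapped))})")
        return inner


ops = OpsWrapper()

x, one = OpsValue("x"), OpsValue("1")
print((x * x + one).value)   # -> add(mul(x, x), 1)
```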

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101076
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-19 23:09:37 +00:00
15495f2d96 [quant][pt2e] Introduce QuantizationAnnotation API (#101708)
Summary:
This diff adds QuantizationAnnotation and also refactors the existing annotation to use this object

```
@dataclass
class QuantizationAnnotation:
  # How some input nodes should be quantized, expressed as QuantizationSpec
  # a map from torch.fx.Node to QuantizationSpec
  input_qspec_map: Dict[Node, QuantizationSpec]

  # How the output of this node is quantized, expressed as QuantizationSPec
  output_qspec: QuantizationSpec

class QuantizationSpec:
    dtype: torch.dtype
    is_dynamic: bool = False
    quant_min: Optional[int] = None
    quant_max: Optional[int] = None
    qscheme: Optional[torch.qscheme] = None
    ch_axis: Optional[int] = None
    # TODO: follow up PR will add this
    # Kind of observer such as MinMaxObserver, PerChannelHistogramObserver etc.
    # observer_or_fake_quant_type: Union[ObserverBase, FakeQuantizeBase]
```

Example after full refactor:

```
int8_qspec = QuantizationSpec(dtype=torch.int8, ...)
weight_qspec = QuantizationSpec(dtype=torch.int8, ...)
conv_node["quantization_annotation"] = QuantizationAnnotation(
    input_qspec_map={input_node: int8_qspec, weight_node: weight_qspec}
    output_qspec=int8_qspec,
)
```

Note: right now input_qspec_map and output_qspec map are still using observer and fake quant constructors.
Follow up PR: change the input_qspec_map and output_qspec to use QuantizationSpec directly

Test Plan:
```
buck2 test mode/opt caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'
```

Differential Revision: D45895027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101708
Approved by: https://github.com/andrewor14
2023-05-19 22:54:27 +00:00
03f50fcc02 [codemod][3.10][NamedTuple] Use typing_extensions to get NamedTuple Generics (#101830)
Summary:
3.10 doesn't have support for Generic NamedTuples, but it exists in future versions so typing_extensions supports it

(Note: this ignores all push blocking failures!)

Test Plan: sandcastle

Reviewed By: itamaro

Differential Revision: D45923201

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101830
Approved by: https://github.com/izaitsevfb
2023-05-19 22:50:18 +00:00
b07e97c084 Fix finding existing Needs label comments (#101889)
After https://github.com/pytorch/pytorch/pull/101747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101889
Approved by: https://github.com/izaitsevfb, https://github.com/malfet, https://github.com/zou3519
2023-05-19 22:43:53 +00:00
c8fd1cfad1 [pt2] Turn off lazy reinit when cuda graph is on (#101848)
Summary: cuda graph doesn't work with cuda 11's cupti lazy reinit, so we'll turn it off if any module turns on cudagraph

Test Plan: test with cuda graph on

Reviewed By: aaronenyeshi

Differential Revision: D45967197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101848
Approved by: https://github.com/aaronenyeshi
2023-05-19 21:50:38 +00:00
fa7ad77ac9 [Profiler] Workaround CUPTI Lazy Reinit and CUDA Graphs crash in CUDA 11 (#101879)
Summary: Since CUPTI lazy re-init crashes with CUDA Graphs in CUDA 11, we should disable this. Remove this item once the majority of workloads move to CUDA 12.

Test Plan: CI Tests

Reviewed By: xw285cornell

Differential Revision: D45921028

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101879
Approved by: https://github.com/xw285cornell
2023-05-19 21:47:07 +00:00
3666ca9d97 Dynamic Shape Doc (#101885)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 2f25c1e</samp>

> _Dynamic shapes guide_
> _`TorchDynamo` and `TorchInductor`_
> _Learn from data flow_

Thanks @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101885
Approved by: https://github.com/eellison, https://github.com/ezyang
2023-05-19 21:43:22 +00:00
ff5b9428aa Fake Tensor Docs (#101882)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 75f33ae</samp>

> _Fake tensors help_
> _compile and optimize code_
> _`PT2` in autumn_

Thanks @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101882
Approved by: https://github.com/eellison, https://github.com/ezyang
2023-05-19 21:39:34 +00:00
581d13a069 Add Logging Doc to compile index (#101888)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ba85a41</samp>

> _`logging` module_
> _documents PyTorch events_
> _cutting through the fog_

Thanks @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101888
Approved by: https://github.com/eellison
2023-05-19 21:29:25 +00:00
7f3fed125e Revert "separate out dynamo .requires_grad and .is_grad_enabled guards (#100570)"
This reverts commit 1fabee399d74ee5e1b519673a15619e9fded6562.

Reverted https://github.com/pytorch/pytorch/pull/100570 on behalf of https://github.com/PaliC due to breaking inductor tests along with #101219 ([comment](https://github.com/pytorch/pytorch/pull/100570#issuecomment-1555271267))
2023-05-19 21:29:09 +00:00
2dd33c71c1 Docs for torchcompile and functorch (#101881)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at b5f48b6</samp>

> _`torch.compile` docs_
> _Add a new section for `func`_
> _Winter of features_

Thanks @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101881
Approved by: https://github.com/eellison, https://github.com/zou3519
2023-05-19 21:23:43 +00:00
81c181dc01 Update BCEWithLogitsLoss pos_weight description in documentation (#101567)
Fixes #82496 and #65702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101567
Approved by: https://github.com/mikaylagawarecki
2023-05-19 21:23:21 +00:00
5ea7096ebc match sdpa patterns from HF (#100609)
Adds sdpa patterns seen in HF models.

To actually make the patterns match, we need constant folding to remove the addition of an all-zeros mask, and we need to figure out what to do with low-memory dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
2023-05-19 21:02:46 +00:00
96ee23e198 Print restarting analysis at INFO level with a exception breadcrumb (#101573)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101573
Approved by: https://github.com/albanD
2023-05-19 20:29:18 +00:00
e5e451a9db Update batch size for a couple models (#101837)
The memory compression for these models is at parity, but because we interleave timings between torch.compile and eager runs, memory is duplicated between the eager and cudagraphs pools and causes OOM.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101837
Approved by: https://github.com/anijain2305
2023-05-19 19:09:59 +00:00
498c34e8e8 Revert " fix missing-prototypes warnings in torch_cpu (Part 4) (#100849)"
This reverts commit c2f28d1c1df0db78f2951e4df5dde264f80f07eb.

Reverted https://github.com/pytorch/pytorch/pull/100849 on behalf of https://github.com/izaitsevfb due to fails internal Meta builds, including fbcode and android, see D46009888: ld.lld: error: undefined symbol: nnc_aten_embedding ([comment](https://github.com/pytorch/pytorch/pull/100849#issuecomment-1555105800))
2023-05-19 19:05:15 +00:00
083f304d27 Revert "fix inference_mode with torch.compile (#101219)"
This reverts commit 11f7ae19cd068e7b01b7a296db869b4933951a57.

Reverted https://github.com/pytorch/pytorch/pull/101219 on behalf of https://github.com/PaliC due to breaking inductor tests ([comment](https://github.com/pytorch/pytorch/pull/101219#issuecomment-1555104220))
2023-05-19 19:03:00 +00:00
e760a968c8 Revert " fix missing-prototypes warnings in torch_cpu (Part 5) (#101788)"
This reverts commit ac1cf00085b30eadd164ccf02e5208a59ec1b38b.

Reverted https://github.com/pytorch/pytorch/pull/101788 on behalf of https://github.com/izaitsevfb due to depends on #100849 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/101788#issuecomment-1555097961))
2023-05-19 18:58:11 +00:00
4f9aa7cb0f [export] Error when constraining on static values (#101655)
Fixes https://github.com/pytorch/pytorch/issues/100415

Results in the following error:
```
Traceback (most recent call last):
  File "/scratch/angelayi/work/pytorch/test/export/test_export.py", line 572, in test_export_constrain_static
    export(f, example_inputs, constraints)
  File "/scratch/angelayi/work/pytorch/torch/_export/__init__.py", line 348, in export
    method_name_to_graph_module[compile_spec.method_name] = _export(
  File "/scratch/angelayi/work/pytorch/torch/_export/__init__.py", line 119, in _export
    raise UserError(UserErrorType.CONSTRAIN_VIOLATION, str(e))
torch._dynamo.exc.UserError:   File "/scratch/angelayi/work/pytorch/test/export/test_export.py", line 561, in f
    constrain_as_value(c, min=1, max=3)

It appears that you're trying to set a constraint on a value which we evaluated to have a static value of 3. Scroll up to see where this constraint was set.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101655
Approved by: https://github.com/avikchaudhuri
2023-05-19 18:27:36 +00:00
3e2ea32dab [BE]: Enable ruff rule TRY302 and apply fixes (#101874)
Removes useless try statements and unreachable code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101874
Approved by: https://github.com/malfet
2023-05-19 17:30:52 +00:00
1ac663d9f1 collect_env: parse HIP version exception free (#101844)
Should prevent broken collect_env reporting as shown in https://github.com/pytorch/vision/issues/7561#issue-1698000841

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 5204e0f</samp>

> _`get_version_or_na`_
> _Helper function refactors_
> _Code like autumn leaves_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101844
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2023-05-19 17:24:35 +00:00
0df691df4e [ONNX] Support aten::broadcast_to (#101833)
Support aten::broadcast_to the same way we support aten::expand.

Fix #92678 #101768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101833
Approved by: https://github.com/thiagocrepaldi
2023-05-19 16:59:54 +00:00
113b67059f Fix specify_constraints signature for exporting module (#101831)
Currently, when exporting a module, specify_constraint's signature is wrong. We make it consistent with the calling convention of the module being exported.

@build[pytorch_dynamo_inductor]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101831
Approved by: https://github.com/avikchaudhuri
2023-05-19 16:58:17 +00:00
11f7ae19cd fix inference_mode with torch.compile (#101219)
It looks like inference_mode wasn't playing well with functionalization.

If you run torch.compile on a function, and the inputs to the function are tensors created outside of inference mode, then we need to make sure that when we create functional tensor wrappers for those inputs during compilation, those functional wrappers properly mirror whether or not the original tensor is an inference tensor.
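
One plausible shape of that scenario (a hedged sketch; the actual repro in the linked issue may differ) is a tensor created outside inference mode that is fed to a compiled function executed under `torch.inference_mode()`:

```python
import torch

def f(x):
    return x.mul(2).add(1)

compiled = torch.compile(f)

# Input created outside inference mode (a regular, non-inference tensor)...
x = torch.randn(4, requires_grad=True)

# ...but the compiled function runs under inference_mode. The functional
# wrappers built during compilation need to mirror the input's
# "inference-ness" for this to work.
with torch.inference_mode():
    out = compiled(x)
print(out)
```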

Hopefully fixes https://github.com/pytorch/pytorch/issues/101151

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101219
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-05-19 16:14:56 +00:00
1fabee399d separate out dynamo .requires_grad and .is_grad_enabled guards (#100570)
Fixes https://github.com/pytorch/pytorch/issues/100977

This will hopefully fix this error (from [issue](https://github.com/pytorch/pytorch/issues/99616))

This PR fixes an internal model: we were running an inductor inference graph, but `torch.is_grad_enabled()` was True, causing us to error inside of the inference graph when we encountered an out= operator.

I haven't been able to create a smaller repro - before landing this, I want to create a smaller repro to convince myself of why we need to separate out these guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100570
Approved by: https://github.com/ezyang
2023-05-19 16:14:56 +00:00
f99eeb5bdf Check devices on meta functions that return inputs (#101807)
FakeTensor has a default device logic that wraps meta tensors to the right device after running meta kernels and throws on multiple devices. This logic was only running on the wrapping from meta kernels -> fake. For out variants, where the output of the meta kernel was already a fake tensor because it was an input, the device logic wasn't running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101807
Approved by: https://github.com/ngimel
2023-05-19 16:13:39 +00:00
61b6b038b0 inductor: fix FloorDiv issue for dynamic shape path (#101793)
For the TIMM ```tf_mixnet_l``` CPU dynamic-shape path, we always get a wrong result compared with eager mode; the root cause is that we compute a wrong index when doing vectorization:

```
for(long i2=static_cast<long>(0L); i2<static_cast<long>(16L*(((std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*ks1))))))))*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*(std::ceil((1.0/2.0)*ks1))))))))) / 16L)); i2+=static_cast<long>(16L))
```
the main loop's index uses ```/``` rather than ```//```. After this PR, the ```tf_mixnet_l``` accuracy test passes.

How to reproduce this issue?

```
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --accuracy --float32 -dcpu --inference -n5 --inductor --dynamic-shapes --only tf_mixnet_l
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101793
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/ezyang
2023-05-19 12:39:27 +00:00
e06bd8f3b1 fsdp support create hybrid-sharded process group for custom backend (#100622)
FSDP creates communication groups for intra-node communication through dist.new_subgroups. Previously, dist.new_subgroups only supported creation based on the number of CUDA devices. However, issue #99706 removed the availability check for CUDA devices, allowing a custom backend to create groups based on the number of custom devices per node.

This PR allows FSDP to explicitly pass the number of devices within the node when creating communication groups for intra-node communication, instead of defaulting to the number of CUDA devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100622
Approved by: https://github.com/awgu
2023-05-19 06:08:55 +00:00
4441ce21dc Add missing conversion functions between half and float for ppc64le (#100168)
Fixes compilation error on ppc64-le resulting from missing conversion functions 'convert_half_float' and 'convert_float_half'.
These functions are implemented by this commit.

Compilation started failing from the following commit onwards: ced5c89b6fbe827a538b7ada96b2f9a5989871c7.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100168
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-05-19 05:22:26 +00:00
4486a1d09a Improve the functionality of untyped storage for privateuse1. (#100868)
Complete the implementation of the is_pinned() interface of the untyped storage class for privateuse1, and refactor the typed storage implementation to use untyped_storage.is_pinned().

Hi,  @ezyang
This is another improvement of untyped storage for privateuse1, can you  take a moment to review it?  Thanks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100868
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
2023-05-19 04:33:59 +00:00
f66d5dd788 SymIntify functorch vmap (#101409)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101409
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-05-19 03:07:41 +00:00
1aaf0396eb [reland][opinfo] empty_strided (#101782)
Follows #100223

Previous PR: #100890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101782
Approved by: https://github.com/ezyang
2023-05-19 03:06:29 +00:00
e5b7c7a04f Fix torchinductor uint8 bug (#101468)
Fixes #96604

## Issue description

When we use a constant tensor with uint8 type, the kernel generated by torchinductor outputs wrong results. For example, the negation of `5` in uint8 wraps around to `251`, and it is `True` that `251` is larger than `5`. However, the output result is `False` when we compare `torch.neg(5)` with `5`. This is because torchinductor bypasses the data type for constant tensors, so the `5` here is taken as an int32 and the comparison is between `-5` and `5`.

## Solution
This PR generates an extra conversion for the uint8 constant value when it is used; the conversion happens not on the first assignment but on each access to the constant value.
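
For reference, the eager-mode behavior the compiled kernel should reproduce (a small sanity check, not taken from the PR):

```python
import torch

x = torch.tensor([5], dtype=torch.uint8)
neg = torch.neg(x)          # uint8 wraps around: 256 - 5 = 251
print(neg)                  # tensor([251], dtype=torch.uint8)
print((neg > x).item())     # True in eager mode; the buggy kernel returned False
```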

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101468
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-05-19 02:01:50 +00:00
4c1bc91f42 Support autograd.Function w/ grad (#99483)
This PR adds support for tracing autograd.Function with grad.

A few important bullet points outlining our approach:

1) Our goal is to verify soundness in order to add a call_function to the autograd.Function's `apply` to the graph.
2) We achieve (1) by either verifying soundness or rejecting soundness, by ensuring that both forward and backward of the autograd.Function are sound.
3) For the forward, if we verify soundness, we install its guards into the graph.
4) For the backward, if we verify soundness, we throw it out. However, backwards soundness verification is more onerous, and has a config driven set of banned attrs and methods for tensors.

1-4 above are achieved by turning the forward and backward into UserDefinedFunctionVariables, and inlining through them, relying on dynamo's soundness detection. If we graph break in these, we raise and treat them as unsound. As noted above, backwards is stricter yet.

For the tracing, the safety comes from dynamo's HigherOrderOperator system. That system ensures that not only do we trace soundly, but that no new variables are lifted into inputs during the tracing, and that the forward and backwards are entirely self contained.

Whenever we reject a function as unsound, we restore back, as usual.

Due to some limitations in the lifting logic, we have an escape hatch we implemented for tensors that are known in forward but cross into backwards through save_tensors (save) / saved_tensors (load). We use this escape hatch to avoid having the known saved tensors coming from forward end up accidentally treated as lifted variables (and rejected). This is sound, but feels a little hacky.

Additionally, due to some limitations in fx node removal, combined with how we produce subgraphs for the traces installed from HigherOrderOperators, we had to improve our node removal logic. In the event of a restore, we remove the old nodes from the graph, as usual in dynamo. However, because the references to these nodes may exist in subgraphs, we traverse any nodes users and remove them first if and only if they are in another graph. This is always sound, because removal should only be downstream of restoration at this point.
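
For context, a minimal example of the kind of code this PR lets dynamo trace (an illustrative sketch, not taken from the PR's tests):

```python
import torch


class ScaledReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return torch.relu(x) * scale

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Gradient for `x`; `scale` is a non-tensor input, so it gets None.
        return grad_out * ctx.scale * (x > 0), None


@torch.compile
def fn(x):
    return ScaledReLU.apply(x, 2.0)


x = torch.randn(8, requires_grad=True)
fn(x).sum().backward()
print(x.grad)
```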

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99483
Approved by: https://github.com/zou3519
2023-05-19 01:26:21 +00:00
7776a41bd6 [ONNX] Detect None constant during jit scalar type analysis (#101608)
Fixes #97987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101608
Approved by: https://github.com/titaiwangms
2023-05-19 01:20:01 +00:00
eb470ab2fb Add cooperative_groups header to cuda_to_hip_mappings.py (#100721)
This PR is to add hip/hip_cooperative_groups.h to avoid the hipify errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100721
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-05-19 01:18:12 +00:00
bcb4444cec PyTorch -> C++17 (#98209) (#100557)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4f0b524</samp>

This pull request updates the codebase and the documentation to use C++17 instead of C++14 as the minimum required C++ standard. This affects the `ATen`, `c10`, and `torch` libraries and their dependencies, as well as the CI system and the `conda` package metadata.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100557
Approved by: https://github.com/malfet
2023-05-19 00:49:08 +00:00
6f13d6892a Add meta support for multinomial (#101324)
# Summary
Found this when trying to compile the text gen loop of nanogpt here: b33289942b/torchbenchmark/models/nanogpt_generate/model.py (L322)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101324
Approved by: https://github.com/ngimel
2023-05-19 00:04:26 +00:00
6f46716ee2 Fix/skip CSE tests on Python-3.8 without astunparse (#101805)
If `astunparse` is not installed, the following guard will be generated in `test_guard_function_builder_with_cse`:
```python
def ___make_guard_fn():
    def guard(L):
        if not (x[0].a < x[1].a * (3 - x[2].a)):
            return False
        if not (a.b.c[0].d.e + a.b.c[1].d.e * a.b.c[2].d.e > 0):
            return False
        if not (f(m.n[0], '0').x.y.z * f(m.n[0], '1').x.y.z * f(m.n[0], '2').x.y.z < 512):
            return False
        if not (self.g(a, b).k + (1 - self.g(a, b).k) <= m[0].a + self.g(a, b).k):
            return False
        return True
    return guard
```

Though, I have to say, hardcoding string comparison is pretty weird.

Also, skip `test_guards_cse_pass_[single|multiple]` if AST unparsing is missing.

Fixes failure in a test introduced by https://github.com/pytorch/pytorch/pull/98488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101805
Approved by: https://github.com/atalman, https://github.com/ysiraichi
2023-05-18 23:14:35 +00:00
0a0acce515 [vision hash update] update the pinned vision hash (#101821)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101821
Approved by: https://github.com/pytorchbot
2023-05-18 22:28:29 +00:00
60547fcbee Autoformat torch/utils/checkpoint (#101649)
Per title

Differential Revision: [D45933467](https://our.internmc.facebook.com/intern/diff/D45933467/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101649
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2023-05-18 21:55:05 +00:00
d7f6bfe651 Fix require_backends_available to reenable distributed tests (#101704)
## TLDR
Fix decorator to re-enable 26+ distributed tests that were previously being skipped in CI

## Explanation

As part of the UCC upstream, we updated the backend tests cases to also include "ucc".

3ed1569e86/torch/testing/_internal/common_distributed.py (L90-L92)

In distributed tests we use a decorator which reads from this config and makes sure all backends are available on the system.

3ed1569e86/torch/testing/_internal/distributed/distributed_test.py (L7131)

 **However**, UCC is not configured on by default for a certain subset of CI tests, which causes the entire test to be skipped (even if the test is meant for nccl and the backend being tested is nccl).

As the fix, we should just check that the `BACKEND` being tested is available

## Changes
- Change logic to only check if the current backend being used is available
- Rename `require_backends_available` -> `require_backend_is_available`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101704
Approved by: https://github.com/rohan-varma
2023-05-18 21:33:15 +00:00
b5217d0898 Revert "match sdpa patterns from HF (#100609)"
This reverts commit c73923473d4ed0ab08143cb8fe3e8c3f86f2cf73.

Reverted https://github.com/pytorch/pytorch/pull/100609 on behalf of https://github.com/izaitsevfb due to breaks inductor tests, please see D45973031 ([comment](https://github.com/pytorch/pytorch/pull/100609#issuecomment-1553650280))
2023-05-18 21:12:38 +00:00
a76c1af351 Revert "Implement adding bias vector into structured sparse linear operator (#100881)"
This reverts commit c3a893c659bebf0e5b62452a751c4e6ab3dc5b2d.

Reverted https://github.com/pytorch/pytorch/pull/100881 on behalf of https://github.com/izaitsevfb due to breaks internal builds, see D45972633 ([comment](https://github.com/pytorch/pytorch/pull/100881#issuecomment-1553621418))
2023-05-18 20:47:02 +00:00
eb9ac9c156 Revert "Add activation functions (ReLU and SiLU for now) for structured sparse linear operator (#101339)"
This reverts commit bfb3941ad8aaf0af159c2bec3cf1cbec1488f335.

Reverted https://github.com/pytorch/pytorch/pull/101339 on behalf of https://github.com/izaitsevfb due to Depends on #100881, which has to be reverted due to internal build breakage. ([comment](https://github.com/pytorch/pytorch/pull/101339#issuecomment-1553618216))
2023-05-18 20:42:44 +00:00
2c0d607882 [bazel] add build for functorch (#101475)
Fixes #101469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101475
Approved by: https://github.com/ezyang
2023-05-18 20:29:08 +00:00
7ffdd4fedc Update release related information (#101819)
Update release-related information. Features have become more complex, and the number of commits per release has increased a lot.
We had on average:
2.5k commits for releases 1.1.0-1.7.0,
3-3.5k commits for releases 1.8.0-1.12.0,
4.5k-5k commits for releases 1.13.0 and 2.0.0.

Hence the current target is 3 releases a year.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101819
Approved by: https://github.com/svekars, https://github.com/malfet
2023-05-18 20:27:16 +00:00
686b12c93d Reduce log output when no tests are prioritized (#101803)
### <samp>🤖 Generated by Copilot at 733b991</samp>

Improve test reordering output in `tools/testing/test_selections.py`. Add a check to only print reordering information when there are tests to prioritize.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101803
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-05-18 20:21:41 +00:00
f95d42b1b7 [DataPipe] Update docstring for functional form of DataPipes (#100446)
Copy the docstring from IterDataPipe and MapDataPipe classes to their functional form. Done using [`functools.update_wrapper`](https://docs.python.org/3/library/functools.html#functools.update_wrapper), xref https://stackoverflow.com/questions/6394511/python-functools-wraps-equivalent-for-classes.
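
For illustration, a tiny standalone example of the mechanism (the `Mapper` and `map_functional` names here are made up, not the actual DataPipe internals): `functools.update_wrapper` copies `__doc__` and related metadata from a source object onto a wrapper.

```python
import functools

class Mapper:
    """Apply a function to every element of the source iterable."""
    def __init__(self, source, fn):
        self.source, self.fn = source, fn
    def __iter__(self):
        return (self.fn(x) for x in self.source)

def map_functional(source, fn):
    return Mapper(source, fn)

# Copy the class docstring onto the functional form; `assigned`/`updated` are
# narrowed because a plain function lacks some class-only attributes.
functools.update_wrapper(map_functional, Mapper, assigned=("__doc__",), updated=())

print(map_functional.__doc__)  # "Apply a function to every element of the source iterable."
```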

See also parallel change to `.pyi` stub files at https://github.com/pytorch/pytorch/pull/100503

Fixes https://github.com/pytorch/data/issues/792 and https://github.com/weiji14/zen3geo/issues/69.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100446
Approved by: https://github.com/NivekT
2023-05-18 19:59:00 +00:00
556bb691fd [AO]Fix observed LSTM layer setup individually observed LSTM (#101299)
Summary: We have found that `_get_lstm_with_individually_observed_parts()` is missing a setup step that initializes the LSTM layer state (its weights and biases). This diff fixes the numerical discrepancy observed by the CTRL team when using the above API.

Test Plan: N3358643

Differential Revision: D45821681

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101299
Approved by: https://github.com/andrewor14
2023-05-18 19:15:01 +00:00
2fa1b563da [dynamo] Activation checkpoint higher order ops - Reland 101028 (#101790)
https://github.com/pytorch/pytorch/pull/101028 was reverted due to internal breakage. Relanding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101790
Approved by: https://github.com/zou3519
2023-05-18 19:09:14 +00:00
a33ac44540 Better needs label error message (#101747)
As suggested in https://github.com/pytorch/pytorch/issues/101694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101747
Approved by: https://github.com/malfet, https://github.com/zou3519
2023-05-18 18:27:13 +00:00
8b751b41c0 Do not trigger lint and pull workflows when sync nightly #26921 (#101746)
These workflows run when the nightly PR #26921 is synced, and they fail due to the large commit message saved in the env variable `COMMIT_MESSAGES`, for example https://github.com/pytorch/pytorch/actions/runs/4977477882. We don't need to run CI jobs on nightly.

The list of failing workflows is from f3e13d9567

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101746
Approved by: https://github.com/atalman, https://github.com/malfet
2023-05-18 16:57:03 +00:00
1930428d89 Minor improvement on the decomposition of upsample_bilinear (#101682)
This is how it's done in core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101682
Approved by: https://github.com/ngimel
2023-05-18 16:51:51 +00:00
cyy
ac1cf00085 fix missing-prototypes warnings in torch_cpu (Part 5) (#101788)
This PR fixes more missing-prototypes violations in the torch_cpu source following PRs #100053, #100147 and #100245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101788
Approved by: https://github.com/Skylion007
2023-05-18 16:38:14 +00:00
c9ba967c21 Upstream xformers code (#100583)
# Summary
Since the initial upstream of memory-efficient attention from xformers in #86157, significant updates have been made to the kernel, including increased performance, bug fixes, and added functionality. This PR upstreams the latest version of this kernel as of version 0.0.20, or commit [6425fd0cacb1a6579aa2f0c4a570b737cb10e9c3](6425fd0cac)

## Future
Although this version of the kernel has support for dropout and arbitrary attention bias, I did not add this support to SDPA yet and left the guards in sdp_utils. Those will come in follow-up PRs, in order to reduce the scope creep of these substantial changes and ensure that nothing is broken.

## Specific Changes
### Minor Changes
* The build system work was done in the previous PR and so no changes were needed to CMAKE 🤞
* Adding the new files and re-arranging/creating folder structure
* Updating include paths
* Switching from xformer specific functions: `XFORMERS_CHECK -> TORCH_CHECK`
* Changes to xformer specific macros
* Updates to the `generate_kernels.py` to use account for Pytorch file structure, also added an arg parse that I could run on a test dir before creating the files in place.

### Bigger Changes
* Previous Kernel changes "Removed the chunk optimization: see discussion here: https://github.com/pytorch/pytorch/pull/96880"
* Increased the number of cuda kernels -> potentially affecting the cuda_lib size.
* Preemptively made changes to the dtypes of seed and offset in order to allow for cuda_graphs: #100196 this is not finished.
* Made VERY BC-breaking changes to the at::_efficient_attention_forward and at::_efficient_attention_backward function signatures.
    * I made these changes in part to enable this PR to land: https://github.com/pytorch/pytorch/pull/100196

### Due Diligence Checks:
* CUDA_lib size:
    * Before: 496 MiB
    * After: 496 MiB
* Performance Sweep:
    * I swept over 576 configs for forward-only inference; the geomean speedup was 0.98x, with a min speedup of 0.84x and a max speedup of 1.2x.
    * For forward+backward, running on 270 configs (to reduce memory), the geomean speedup was 1.02x, with a min speedup of 1.02x and a max speedup of 1.35x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100583
Approved by: https://github.com/cpuhrsch
2023-05-18 16:15:34 +00:00
794cc3952e adding moco to CI (#101098)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101098
Approved by: https://github.com/desertfire
2023-05-18 10:01:49 +00:00
b315c9b5ab [CI] Enlarge memory for OOM models in inductor cpu HF accuracy test (#101395)
Change the Inductor CPU HF accuracy test node from `linux.4xlarge` (32GB) to `linux.24xlarge` (192GB) to enlarge the node memory. Also add 3 HF models back to CI test.

Fixes #101390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101395
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/huydhn
2023-05-18 09:23:30 +00:00
72a73ef67b Add aten.searchsorted.Tensor meta kernel (#101637)
Test Plan: CI

Differential Revision: D45933187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101637
Approved by: https://github.com/ezyang
2023-05-18 06:55:11 +00:00
cyy
c2f28d1c1d fix missing-prototypes warnings in torch_cpu (Part 4) (#100849)
This PR fixes more missing-prototypes violations in the torch_cpu source following PRs #100053, #100147 and #100245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100849
Approved by: https://github.com/albanD
2023-05-18 03:49:45 +00:00
900ca4df59 inductor: skip weight packing when has zero shape (#101355)
This PR will skip weight packing when the weight has a zero-sized shape, to fix https://github.com/pytorch/pytorch/issues/101211.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101355
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-18 03:32:45 +00:00
e48a052e7b Fix link error on s390x (#101000)
When I checked out the main branch and picked up #99872, I got the following link error. The root cause is that method definitions in the header file generate multiple instantiations of the same method signature.

This PR fixes the link error by avoiding generating multiple instantiations.

```
% python setup.py develop
...
[1080/1456] Linking CXX shared library lib/libtorch_cpu.so
FAILED: lib/libtorch_cpu.so
: && /usr/bin/c++ -fPIC -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor ...
...
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0xa520): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0xa5c0): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp:(.text+0x5970): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AdaptiveMaxPoolKernel.cpp.ZVECTOR.cpp:(.text+0x5a10): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0x7d90): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
AdaptiveAvgPoolKernel.cpp.ZVECTOR.cpp:(.text+0x7e30): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/Activation.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_int_flt(int __vector(4))':
Activation.cpp.ZVECTOR.cpp:(.text+0x65840): multiple definition of `at::vec::ZVECTOR::vec_int_flt(int __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x16920): first defined here
/usr/bin/ld: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/Activation.cpp.ZVECTOR.cpp.o: in function `at::vec::ZVECTOR::vec_flt_int(float __vector(4))':
Activation.cpp.ZVECTOR.cpp:(.text+0x658e0): multiple definition of `at::vec::ZVECTOR::vec_flt_int(float __vector(4))'; caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/UfuncCPUKernel_add.cpp.ZVECTOR.cpp.o:UfuncCPUKernel_add.cpp.ZVECTOR.cpp:(.text+0x169c0): first defined here
collect2: error: ld returned 1 exit status
[67/316] Building CXX object test_api/CMakeFiles/test_api.dir/modules.cpp.o
ninja: build stopped: subcommand failed.
```
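
For readers unfamiliar with the failure mode, here is a generic C++ illustration (not the actual PyTorch code): a non-`inline` function defined in a header is instantiated once per translation unit that includes it, and the linker then rejects the duplicate symbols.

```c++
// vec_helpers.h -- simplified illustration of the failure mode, not the real ATen header.
#pragma once

// Without `inline`, every .cpp file that includes this header emits its own
// out-of-line definition of vec_int_flt, and the linker reports
// "multiple definition of ..." when the objects are linked together.
// Marking the definition `inline` (or moving it into a single .cpp file)
// lets identical definitions from different translation units be merged.
inline float vec_int_flt(int x) {
    return static_cast<float>(x);
}
```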

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101000
Approved by: https://github.com/malfet
2023-05-18 02:13:22 +00:00
bfb3941ad8 Add activation functions (ReLU and SiLU for now) for structured sparse linear operator (#101339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101339
Approved by: https://github.com/cpuhrsch
2023-05-18 01:53:18 +00:00
a0e6f82087 [inductor] send max_pool2d_with_indices and its backwand to fallback if dilation is not 1 (#100531)
Fixes #93384

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100531
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-18 01:07:06 +00:00
dafa009c3c [dynamo][moco] Save global torch state to restore on graph break (#101201)
This is relevant to  https://github.com/pytorch/pytorch/pull/100570 as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101201
Approved by: https://github.com/voznesenskym
2023-05-18 01:03:15 +00:00
28098cae6b [DataLoader] Adding StackDataset (#101338)
Torch wrapping datasets list has:
`TensorDataset`
`ConcatDataset`
`ChainDataset`

`TensorDataset` is useful for stacking sets of tensors but can't work with objects that lack a `.size()` method.

This PR proposes `StackDataset`: similar to `TensorDataset`, but working with general datasets the way `ConcatDataset` does.

A possible use of `StackDataset` is multimodal networks with different inputs, like image+text, or stacking a non-tensor input together with the property to predict.
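
A hypothetical usage sketch, assuming the proposed class lands under `torch.utils.data` alongside the other wrapping datasets and pairs the i-th sample of each constituent dataset:

```python
import torch
from torch.utils.data import Dataset, StackDataset

class ListDataset(Dataset):
    """Minimal dataset over a plain Python list (no .size(), unlike tensors)."""
    def __init__(self, items):
        self.items = items
    def __len__(self):
        return len(self.items)
    def __getitem__(self, i):
        return self.items[i]

images = ListDataset([torch.randn(3, 8, 8) for _ in range(4)])
captions = ListDataset(["a cat", "a dog", "a car", "a tree"])

# One multimodal example per index: the i-th sample from each constituent dataset.
stacked = StackDataset(images, captions)
img, caption = stacked[0]
```
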
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101338
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-18 00:57:12 +00:00
f0f0f70904 Fix check-labels workflow commenting on forked PRs (#101467)
Using `pull_request_target` allows securely passing the secrets needed to make comments on forked PRs.
See more about `pull_request_target` in https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/

The change was verified in https://github.com/malfet/deleteme/pull/53: with `on pull_request` there was no "This PR needs a label" comment, while with `on pull_request_target` the comment can be posted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101467
Approved by: https://github.com/malfet
2023-05-18 00:47:53 +00:00
c73923473d match sdpa patterns from HF (#100609)
Adds sdpa patterns seen in HF models.

To actually make the patterns match, we need constant folding to remove the addition of the all-zeros mask, and we need to figure out what to do with low-mem dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
2023-05-18 00:33:52 +00:00
18f6f30d7c Make HUD link https (#101461)
It will now send you to the HUD site instead of staying on GitHub and adding the HUD link after the GitHub URL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101461
Approved by: https://github.com/drisspg
2023-05-18 00:11:48 +00:00
124d812f38 [BE] Fix rule not found error message (#101745)
Prevent the error message from becoming a single column of characters

Thanks @clee200 for explaining how it worked before

### <samp>🤖 Generated by Copilot at fef1e25</samp>

> _`reject_reason` fixed_
> _Syntax error caused trouble_
> _Autumn of bugs ends_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101745
Approved by: https://github.com/kit1980, https://github.com/osalpekar
2023-05-17 23:57:36 +00:00
66e398951a [inductor/decomp] Add aten._unsafe_index to disable range checks (#101602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101602
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-17 23:36:24 +00:00
b256091c7b [inductor] Generate indirect_indexing checks even if optimized out (#100895)
Fixes #100831, fixes #100878

Previously `gen_assert_indirect_indexing` was only called on the index
expressions passed to `ops.load` and `ops.store` which means if the
variable is optimized out during lowering, we never generate the
assert. This instead makes `ops.indirect_indexing` eagerly generate
the assert statement, whether or not it will be used.
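
For context, a small example of the kind of code that produces indirect indexing (illustrative; not taken from the PR's tests):

```python
import torch

def indirect_gather(x, idx):
    # idx supplies data-dependent ("indirect") indices into x, so the compiled
    # kernel must assert 0 <= idx < x.size(0) even if this load is later optimized out.
    return x[idx]

compiled = torch.compile(indirect_gather)
out = compiled(torch.randn(8), torch.tensor([0, 3, 7]))
```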

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100895
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-17 23:36:24 +00:00
ef512db0f8 [inductor] Constant and index_expr propagation pass (#101077)
This pass does a limited form of constant propagation, as well as propagation of
sympy indexing expressions. For example, say you have the function:
```python
def flip(x):
    i = torch.arange(x.size(0) - 1, -1, -1, device=x.device)
    return x[i]
```

On current main this results in indirect indexing:
```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = 3 - z0
    index1 = 3*indirect0 + z1
    index2 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        index_expr = ops.index_expr(get_index, torch.int64)
        set_indirect0 = self.set_indirect0(index_expr)
        get_index_1 = self.get_index('index1')
        load = ops.load('arg0_1', get_index_1)
        get_index_2 = self.get_index('index2')
        store = ops.store('buf0', get_index_2, load, None)
        return store
```

With this PR the indexing is propagated through the computation and into direct
indexing:

```python
class buf0_loop_body:
    var_ranges = {z0: 4, z1: 3}
    index0 = -3*z0 + z1 + 9
    index1 = 3*z0 + z1
    def body(self, ops):
        get_index = self.get_index('index0')
        load = ops.load('arg0_1', get_index)
        get_index_1 = self.get_index('index1')
        store = ops.store('buf0', get_index_1, load, None)
        return store
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101077
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-17 23:36:24 +00:00
df6acf27fc update gloo submodule (#101472)
Ran command: `git submodule update --remote -- third_party/gloo`

This is to pull in changes up to 31b1f0204b to enhance gloo scalability for large distributed training jobs running into ephemeral port exhaustion issues.

The test failure in https://github.com/pytorch/pytorch/pull/101438 shows the need for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101472
Approved by: https://github.com/H-Huang
2023-05-17 23:06:19 +00:00
8c0b148926 [CI] Distribute bot workload (#101723)
Pinned hash updates are to be done by @pytorchupdatebot,
as mergebot token access is restricted to its environment

### <samp>🤖 Generated by Copilot at d57c0f4</samp>

> _`UPDATEBOT_TOKEN`_
> _A new name for the night_
> _Autumn leaves falling_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101723
Approved by: https://github.com/huydhn
2023-05-17 21:46:55 +00:00
29de581764 [Dynamo] Graph break on torch.cuda.set_device() (#101668)
Fixes #97280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101668
Approved by: https://github.com/jansel
2023-05-17 21:35:08 +00:00
5f07c589b0 Revert "[inductor] Refactor RNG operators (#100064)"
This reverts commit 3bbf0683a1d56d8edc03822ccf3e38445322b4f8.

Reverted https://github.com/pytorch/pytorch/pull/100064 on behalf of https://github.com/izaitsevfb due to breaks inductor tests, see D45936056 ([comment](https://github.com/pytorch/pytorch/pull/100064#issuecomment-1552093728))
2023-05-17 21:16:41 +00:00
3135bec4a0 [docs] Clarify when to use SparseAdam (#101465)
![image](https://github.com/pytorch/pytorch/assets/31798555/ff19a522-2630-4578-bc0e-6a704aa94d4e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101465
Approved by: https://github.com/albanD
2023-05-17 21:16:20 +00:00
5d3cfda1ed Revert "match sdpa patterns from HF (#100609)"
This reverts commit f33725b82b83703df1d1135cc34a64a1b2b856a9.

Reverted https://github.com/pytorch/pytorch/pull/100609 on behalf of https://github.com/izaitsevfb due to Based on #100064, which needs to be reverted due to diff-train issues. ([comment](https://github.com/pytorch/pytorch/pull/100609#issuecomment-1552089472))
2023-05-17 21:13:11 +00:00
b429a4de13 Update public_api to remove duplicated randn_like (#101302)
Remove duplicated `randn_like` in `functorch/op_analysis`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101302
Approved by: https://github.com/drisspg
2023-05-17 21:03:13 +00:00
2236d5ef83 [Security] Move mergebot workflows in its own env (#101718)
Pin update workflows will be updated to use @pytorchupdatebot in separate PR
2023-05-17 13:33:06 -07:00
f3fc531eee Check for pytest extensions in run_test (#100916)
not very elegant

checked on a separate conda env that doesn't have the usual CI dependencies

the two pytest extensions at fault are pytest-rerunfailures and pytest-shard; also included pytest-flakefinder just in case

no idea if this is a good way to do this

could also check individually and add flags based on that, but was told that requiring all the CI dependencies to be downloaded was also ok
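
One way to do such a check is to probe for the plugin modules before passing their flags to pytest; a rough sketch, not necessarily what run_test.py ended up doing, and the flag values are placeholders:

```python
import importlib.util

def pytest_plugin_available(module_name: str) -> bool:
    """Return True if the given pytest plugin can be imported in this env."""
    return importlib.util.find_spec(module_name) is not None

extra_args = []
if pytest_plugin_available("pytest_rerunfailures"):
    extra_args.append("--reruns=2")
if pytest_plugin_available("pytest_shard"):
    extra_args.extend(["--shard-id=0", "--num-shards=1"])
```
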
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100916
Approved by: https://github.com/huydhn
2023-05-17 20:27:55 +00:00
e3c9a1e5c4 Run dynamo tests in parallel (#101432)
cuts off ~30 min per shard
(2 shards and 2 python versions so 2 hours total)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101432
Approved by: https://github.com/huydhn, https://github.com/desertfire, https://github.com/ZainRizvi
2023-05-17 20:26:24 +00:00
e3c66ded86 remove default lower bound in dynamic_dim suggestions (#101636)
So instead of `2 <= dynamic_dim(x, 0)` simply suggest `dynamic_dim(x, 0)`. This has exactly the same effect.
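
For context, a hedged sketch of how such a constraint is passed to export; the import path shown is the later public `torch.export` location, and the `constraints=` parameter has since been superseded by `Dim`-based dynamic shapes, so treat the exact API as version-dependent:

```python
import torch
from torch.export import export, dynamic_dim  # location in later public releases

class M(torch.nn.Module):
    def forward(self, x):
        return x.sin()

x = torch.randn(8, 4)

# The bare constraint already implies the default lower bound of 2, so suggesting
# `dynamic_dim(x, 0)` is equivalent to suggesting `2 <= dynamic_dim(x, 0)`.
ep = export(M(), (x,), constraints=[dynamic_dim(x, 0)])
```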

Differential Revision: [D45933273](https://our.internmc.facebook.com/intern/diff/D45933273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101636
Approved by: https://github.com/tugsbayasgalan, https://github.com/ydwu4
2023-05-17 19:55:04 +00:00
c8579b7374 Run test_cpp_memory_snapshot_pickle only when linux and x86_64 (#101366)
On Arm, I got

```
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5260, in test_cpp_memory_snapshot_pickle
    mem = run()
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5257, in run
    t = the_script_fn()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 496, in prof_func_call
    return prof_callable(func_call, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 493, in prof_callable
    return callable(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 5254, in the_script_fn
                @torch.jit.script
                def the_script_fn():
                    return torch.rand(311, 411, device='cuda')
                           ~~~~~~~~~~ <--- HERE
RuntimeError: record_context_cpp is not support on non-linux non-x86_64 platforms
```

dfe484a3b3/torch/csrc/profiler/unwind/unwind.cpp (L4-L24) seems related
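
A minimal sketch of the kind of guard this adds (written against the standard library; the actual test uses PyTorch's own test decorators and class names):

```python
import platform
import sys
import unittest

IS_LINUX_X86_64 = sys.platform.startswith("linux") and platform.machine() == "x86_64"

class TestCudaMemorySnapshot(unittest.TestCase):
    @unittest.skipIf(not IS_LINUX_X86_64,
                     "record_context_cpp is only supported on linux x86_64")
    def test_cpp_memory_snapshot_pickle(self):
        ...
```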

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101366
Approved by: https://github.com/zdevito
2023-05-17 19:44:21 +00:00
dfac4364c4 Revert "[opinfo] empty_strided (#100890)"
This reverts commit 01c7106580667720de80fac12a95cab0fed78ad1.

Reverted https://github.com/pytorch/pytorch/pull/100890 on behalf of https://github.com/PaliC due to broke test_ops.py slow test ([comment](https://github.com/pytorch/pytorch/pull/100890#issuecomment-1551903975))
2023-05-17 19:00:15 +00:00
f33725b82b match sdpa patterns from HF (#100609)
Adds sdpa patterns seen in HF models.

To actually make the patterns match, we need constant folding to remove the addition of the all-zeros mask, and we need to figure out what to do with low-mem dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
2023-05-17 17:44:40 +00:00
8e51521cee [quant][pt2] Handle maxpool + conv + bn case in prepare QAT (#100941)
Summary: This commit fixes a bug where we copy the metadata from
the wrong node after replace_pattern. This happened in the case
of [maxpool -> getitem1 -> conv -> bn -> getitem2], where
`getitem1` is the placeholder node fed into the fused conv + bn
pattern, and we incorrectly copied the metadata from `getitem1`
instead of from `getitem2`. We fix this bug by filtering out
the placeholder nodes before doing the metadata copying.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_getitem_placeholder

Reviewers: jerryzh168, kimishpatel

Differential Revision: [D45916751](https://our.internmc.facebook.com/intern/diff/D45916751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100941
Approved by: https://github.com/jerryzh168
2023-05-17 17:36:32 +00:00
3ed1569e86 Adding serialization ID to inline container (#100994)
Summary:
In order to better track models after serialization, this change writes a serialization_id as a UUID to the inline container. Having this ID enables traceability of the model in saving and loading events.
serialization_id is generated as a new UUID every time serialization takes place. It can be thought of as a model snapshot identifier at the time of serialization.

Test Plan:
```
buck2 test @//mode/dev //caffe2/caffe2/serialize:inline_container_test
```

Local tests:
```
buck2 run @//mode/opt //scripts/atannous:example_pytorch_package
buck2 run @//mode/opt //scripts/atannous:example_pytorch
buck2 run @//mode/opt //scripts/atannous:example_pytorch_script
```

```
$ unzip -l output.pt
Archive:  output.pt
  Length      Date    Time    Name
---------  ---------- -----   ----
       36  00-00-1980 00:00   output/.data/serialization_id
      358  00-00-1980 00:00   output/extra/producer_info.json
       58  00-00-1980 00:00   output/data.pkl
      261  00-00-1980 00:00   output/code/__torch__.py
      326  00-00-1980 00:00   output/code/__torch__.py.debug_pkl
        4  00-00-1980 00:00   output/constants.pkl
        2  00-00-1980 00:00   output/version
---------                     -------
     1045                     7 files
```

```
unzip -p output.pt "output/.data/serialization_id"
a9f903df-cbf6-40e3-8068-68086167ec60
```

Differential Revision: D45683657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100994
Approved by: https://github.com/davidberard98
2023-05-17 17:08:48 +00:00
326a4cc815 Support map autograd and pytree in/out. (#101633)
Rebased https://github.com/pytorch/pytorch/pull/100494 and added dummy AOTConfig.

This PR adds autograd and pytree support for map operator.

Implementation-wise:

1. We temporarily make two HigherOrderOperators, "map" and "map_impl":
- "map" is user-facing. Currently, it unwraps the pytrees in inputs and create a flat_fn for it. Dynamo currently cannot deal with pytree.tree_flatten and pytree.tree_unflatten, we therefore make it a HigherOrderOperator to trigger dynamo logic of handling HigherOrderOperators.
- "map_impl" is the actual operator that works with the rest of torch subsystems such as functionalization, make_fx. It accepts flattend arguments, and a num_mapped_args integer denoting how many of the flattend arguments need to mapped i.e. their first dimension will be unstacked.

2. We create the forward and backward graphs in the autograd key and call torch.autograd.Function. Currently, the backward graph is recomputation-based; we need to partition the joint graph in the future to be more efficient.

Example traced graphs for map operators:
### Case 1: simple f and autograd
```python
def f(x, y):
    return x + y

def g(xs, y):
    out = control_flow.map(f, xs, y)
    return torch.autograd.grad(out, (xs, y), torch.ones_like(out))

gm = make_fx(g, tracing_mode="symbolic")(torch.ones(3, 4, 5, requires_grad=True), torch.ones(5, requires_grad=True))
# gm.print_readable() produces following:
class g(torch.nn.Module):
    def forward(self, xs_1: f32[3, s1, s2], y_1: f32[s2]):
        # No stacktrace found for following nodes
        body_graph_0 = self.body_graph_0
        map_impl = torch.ops.map_impl(body_graph_0, 1, xs_1, y_1);  body_graph_0 = None
        getitem: f32[3, s1, s2] = map_impl[0];  map_impl = None
        ones_like: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem, pin_memory = False)
        is_same_size = torch.ops.aten.is_same_size.default(getitem, ones_like);  getitem = None
        body_graph_1 = self.body_graph_1
        map_impl_1 = torch.ops.map_impl(body_graph_1, 2, xs_1, ones_like, y_1);  body_graph_1 = xs_1 = ones_like = None
        getitem_1 = map_impl_1[0]
        getitem_2: f32[3, s1, s2] = map_impl_1[1]
        getitem_3: f32[3, s2] = map_impl_1[2];  map_impl_1 = None
        sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(getitem_3, [0], True);  getitem_3 = None
        sym_size: Sym(s2) = torch.ops.aten.sym_size(y_1, 0);  y_1 = None
        view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
        return (getitem_2, view)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s2]):
            # No stacktrace found for following nodes
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(arg1_1, arg2_1);  arg1_1 = arg2_1 = None
            return [add]

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s2]):
            # No stacktrace found for following nodes
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(arg1_1, arg3_1);  arg1_1 = None
            is_same_size = torch.ops.aten.is_same_size.default(add, arg2_1);  add = None
            sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(arg2_1, [0], True)
            sym_size: Sym(s2) = torch.ops.aten.sym_size(arg3_1, 0);  arg3_1 = None
            view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
            return [None, arg2_1, view]
```
### Case 2: list input/output f and autograd
```python
def f(x, y):
    return [x[0].cos() + y.sin(), x[1].sin() * y.cos()]

def g(xs, y):
    out = control_flow.map(f, xs, y)
    flat_out, _ = pytree.tree_flatten(out)
    flat_inp, _ = pytree.tree_flatten((xs, y))
    requires_grad_inp = [inp for inp in flat_inp if inp.requires_grad]
    return torch.autograd.grad(flat_out, requires_grad_inp, [torch.ones_like(out) for out in flat_out])

gm = make_fx(g, tracing_mode="symbolic")(
    [torch.ones(3, 4, 5), torch.ones(3, 4, 5, requires_grad=True)],
    torch.ones(5, requires_grad=True))

# gm.print_readable() produces following:
class g(torch.nn.Module):
    def forward(self, xs, y):
        xs_1: f32[3, s1, s2], xs_2: f32[3, s1, s2], y_1: f32[s2], = fx_pytree.tree_flatten_spec([xs, y], self._in_spec)
        # No stacktrace found for following nodes
        body_graph_0 = self.body_graph_0
        map_impl = torch.ops.map_impl(body_graph_0, 2, xs_1, xs_2, y_1);  body_graph_0 = None
        getitem: f32[3, s1, s2] = map_impl[0]
        getitem_1: f32[3, s1, s2] = map_impl[1];  map_impl = None
        ones_like: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem, pin_memory = False)
        ones_like_1: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem_1, pin_memory = False)
        is_same_size = torch.ops.aten.is_same_size.default(getitem, ones_like);  getitem = None
        is_same_size_1 = torch.ops.aten.is_same_size.default(getitem_1, ones_like_1);  getitem_1 = None
        body_graph_1 = self.body_graph_1
        map_impl_1 = torch.ops.map_impl(body_graph_1, 4, xs_1, xs_2, ones_like, ones_like_1, y_1);  body_graph_1 = xs_1 = xs_2 = ones_like = ones_like_1 = None
        getitem_2 = map_impl_1[0]
        getitem_3 = map_impl_1[1]
        getitem_4: f32[3, s1, s2] = map_impl_1[2]
        getitem_5: f32[3, s2] = map_impl_1[3];  map_impl_1 = None
        sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(getitem_5, [0], True);  getitem_5 = None
        sym_size: Sym(s2) = torch.ops.aten.sym_size(y_1, 0);  y_1 = None
        view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
        return pytree.tree_unflatten([getitem_4, view], self._out_spec)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s2]):
            # No stacktrace found for following nodes
            cos: f32[s1, s2] = torch.ops.aten.cos.default(arg1_1);  arg1_1 = None
            sin: f32[s2] = torch.ops.aten.sin.default(arg3_1)
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            sin_1: f32[s1, s2] = torch.ops.aten.sin.default(arg2_1);  arg2_1 = None
            cos_1: f32[s2] = torch.ops.aten.cos.default(arg3_1);  arg3_1 = None
            mul: f32[s1, s2] = torch.ops.aten.mul.Tensor(sin_1, cos_1);  sin_1 = cos_1 = None
            return [add, mul]

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s1, s2], arg4_1: f32[s1, s2], arg5_1: f32[s2]):
            # No stacktrace found for following nodes
            cos: f32[s1, s2] = torch.ops.aten.cos.default(arg1_1);  arg1_1 = None
            sin: f32[s2] = torch.ops.aten.sin.default(arg5_1)
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            sin_1: f32[s1, s2] = torch.ops.aten.sin.default(arg2_1)
            cos_1: f32[s2] = torch.ops.aten.cos.default(arg5_1)
            mul: f32[s1, s2] = torch.ops.aten.mul.Tensor(sin_1, cos_1)
            is_same_size = torch.ops.aten.is_same_size.default(add, arg3_1);  add = None
            is_same_size_1 = torch.ops.aten.is_same_size.default(mul, arg4_1);  mul = None
            mul_1: f32[s1, s2] = torch.ops.aten.mul.Tensor(arg4_1, sin_1);  sin_1 = None
            mul_2: f32[s1, s2] = torch.ops.aten.mul.Tensor(arg4_1, cos_1);  arg4_1 = cos_1 = None
            sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(mul_1, [0], True);  mul_1 = None
            sym_size: Sym(s2) = torch.ops.aten.sym_size(arg5_1, 0)
            view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = None

            #
            sin_2: f32[s2] = torch.ops.aten.sin.default(arg5_1)
            neg: f32[s2] = torch.ops.aten.neg.default(sin_2);  sin_2 = None
            mul_3: f32[s2] = torch.ops.aten.mul.Tensor(view, neg);  view = neg = None
            cos_2: f32[s1, s2] = torch.ops.aten.cos.default(arg2_1);  arg2_1 = None
            mul_4: f32[s1, s2] = torch.ops.aten.mul.Tensor(mul_2, cos_2);  mul_2 = cos_2 = None
            sum_2: f32[1, s2] = torch.ops.aten.sum.dim_IntList(arg3_1, [0], True);  arg3_1 = None
            view_1: f32[s2] = torch.ops.aten.view.default(sum_2, [sym_size]);  sum_2 = sym_size = None
            cos_3: f32[s2] = torch.ops.aten.cos.default(arg5_1);  arg5_1 = None
            mul_5: f32[s2] = torch.ops.aten.mul.Tensor(view_1, cos_3);  view_1 = cos_3 = None
            add_1: f32[s2] = torch.ops.aten.add.Tensor(mul_3, mul_5);  mul_3 = mul_5 = None
            return [None, None, mul_4, add_1]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101633
Approved by: https://github.com/zou3519
2023-05-17 16:52:26 +00:00
38e537db55 Handle multi-user case in split-cat simplification (#101473)
Summary: Post refactoring, the previous diff had a drop in the QPS gained on a prod model because of multi-user getitems. Multi-user getitems can be handled by the replacer.

Differential Revision: D45893988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101473
Approved by: https://github.com/jansel
2023-05-17 16:36:17 +00:00
e17d9f2c64 Fix determenistic typos (#101631)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101631
Approved by: https://github.com/lezcano, https://github.com/ZainRizvi
2023-05-17 16:12:28 +00:00
07e759eca2 [PT2][Quant] Move to module partitioner for linear pattern quantization (#101122)
The subgraph matcher is somewhat unreliable, as the pattern can vary depending on
the dimensionality of the input tensor used to trace _and_ what appears before
the linear op.

Differential Revision: [D45713915](https://our.internmc.facebook.com/intern/diff/D45713915/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101122
Approved by: https://github.com/jerryzh168
2023-05-17 15:47:08 +00:00
ebae77e891 [transformer benchmark] sort by cuda time (#101349)
Summary: The benchmark is running on CUDA

Test Plan: buck run mode/opt //caffe2/benchmarks/transformer:sdp_backwards

Differential Revision: D45843837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101349
Approved by: https://github.com/drisspg
2023-05-17 15:38:56 +00:00
403ce1a1c9 Fix benchmark model names printouts with tqdm (#101627)
With the TQDM changes in #100969, the model names ended up getting hidden from the benchmark printouts. We would print the model name with no newline, then tqdm would print a `\r` and overwrite the name of the running model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101627
Approved by: https://github.com/ezyang
2023-05-17 15:31:11 +00:00
bec655f826 [PT] Update module partitioner to return parameter node (#101121)
Instead of returning the param name, return the parameter's get_attr node.

Differential Revision: [D45713916](https://our.internmc.facebook.com/intern/diff/D45713916/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101121
Approved by: https://github.com/angelayi
2023-05-17 14:56:51 +00:00
75375b410d inductor(CPU): fix issue when padding/stride/dilation size is one for cpu weight packing pass(reland) (#101353)
Differential Revision: [D45874469](https://our.internmc.facebook.com/intern/diff/D45874469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101353
Approved by: https://github.com/desertfire
2023-05-17 14:49:14 +00:00
2c807a4acf [PT2][Quant] Remove None annotations (#101120)
None annotations are not needed anymore. Remove them.

Differential Revision: [D45713917](https://our.internmc.facebook.com/intern/diff/D45713917/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101120
Approved by: https://github.com/jerryzh168
2023-05-17 14:38:34 +00:00
783a46adee [functorch] fix UB in interpreter stack (#101568)
The UB was:
- We grab a reference to the last element in the interpreter stack
(DynamicLayerStack)
- Then, we pop the last element in the interpreter stack
- Finally, we continue to use the reference to the last element.

The fix is to stop using that reference and instead use the popped
element.

Test Plan:
- It's difficult to write a test for this PR so I didn't
- Patched in https://github.com/pytorch/pytorch/pull/101409 and verified
that this PR fixes the bad_variant_access it was experiencing under
clang compilers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101568
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-05-17 13:36:47 +00:00
cb6fa890d4 s390x SIMD: Propagate NaN in minimum and maximum operations (#99716)
This change fixes the NNUtilsTest.ClipGradNormErrorIfNonfinite test in the test_api C++ unit tests when ZVECTOR is enabled.
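
The reference semantics being restored on the vectorized s390x path can be shown in a couple of lines of eager PyTorch (this is just the expected behavior, not the SIMD fix itself); detecting a non-finite gradient norm depends on NaNs surviving reductions like these.

```python
import torch

a = torch.tensor([float("nan"), 1.0])
b = torch.tensor([0.5, 2.0])

# torch.minimum / torch.maximum propagate NaN: if either element is NaN, the result is NaN.
print(torch.minimum(a, b))  # tensor([nan, 1.])
print(torch.maximum(a, b))  # tensor([nan, 2.])
```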

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99716
Approved by: https://github.com/jgong5
2023-05-17 11:47:46 +00:00
a85f6aa4ca s390x zvector: implement expm1 for complex vectorized types (#99872)
This change fixes the build with zvector on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99872
Approved by: https://github.com/jgong5
2023-05-17 11:12:24 +00:00
f72f0119ec Implement CSE for dynamo guards. (#98488)
This PR extracted the CSE part of the code in #89707.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98488
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-05-17 10:47:24 +00:00
f994d0b619 [dynamo] Change dimension constraint summary to log.info (#101584)
Running with `TORCH_LOGS=dynamo` will have the dimension constraint summary pop up again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101584
Approved by: https://github.com/avikchaudhuri
2023-05-17 06:47:14 +00:00
39f52c0218 Switch AOT Inductor test to export, add dynamic, fix invocation bug (#101585)
Fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101585
Approved by: https://github.com/ngimel, https://github.com/desertfire
2023-05-17 05:52:08 +00:00
c3a893c659 Implement adding bias vector into structured sparse linear operator (#100881)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100881
Approved by: https://github.com/cpuhrsch
2023-05-17 05:46:22 +00:00
97180aca5e Enables barrier to support the specified device (#99589)
Enables barrier to support a specified device, e.g. cuda or a custom device. There is some discussion here: https://github.com/pytorch/pytorch/issues/97938#issue-1646833919

Today, there are two limitations of barrier:
One is that barrier does not support a custom device:
fbdb86c174/torch/csrc/distributed/c10d/ProcessGroup.hpp (L512-L522)

The second is that there is a special validation for nccl when device_id is not None, which assumes cuda and nccl bindings and also hinders custom devices.
789070986c/torch/distributed/distributed_c10d.py (L3504-L3508)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99589
Approved by: https://github.com/kwen2501
2023-05-17 05:26:04 +00:00
6261aa5c8d [inductor][cpp] support non contiguous vectorization codegen (#99966)
Currently, cpp vectorization is supported only when the node has at least one contiguous index. The PR enables cpp vectorization when all indices in the node are non-contiguous. Specifically, the innermost index is selected as the tiling index.

### Validation
For E2E performance and functionality, both inference and training model suites for data types float32 and bfloat16 are validated. All the results show that there is no performance regression and no new failures compared with the baseline.

### Code
The modification could help certain kernels in GPT-J do vectorization. Here is a snippet of output code change.

**Before**
```
{
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(1L + (2L*i1) + (256L*i0))];
                auto tmp1 = static_cast<float>(tmp0);
                auto tmp2 = decltype(tmp1)(-tmp1);
                auto tmp3 = static_cast<bfloat16>(tmp2);
                out_ptr0[static_cast<long>((2L*i1) + (64L*i0))] = tmp3;
            }
        }
    }
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>((2L*i1) + (256L*i0))];
                out_ptr1[static_cast<long>((2L*i1) + (64L*i0))] = tmp0;
            }
        }
    }
```
**After**
```
{
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(16L))
            {
                auto tmp0 = ([&]() { __at_align__ bfloat16 tmpbuf[16 * 2]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>(1L + (2L*i1_inner) + (2L*i1) + (256L*i0))]; return load_bf16_as_float(tmpbuf); })();
                auto tmp1 = (tmp0);
                auto tmp2 = tmp1.neg();
                auto tmp3 = (tmp2);
                { __at_align__ bfloat16 tmpbuf[16*sizeof(float)/sizeof(bfloat16)]; store_float_as_bf16(tmpbuf, tmp3); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr0[static_cast<long>((2L*i1_inner) + (2L*i1) + (64L*i0))] = tmpbuf[i1_inner]; }
            }
        }
    }
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(16L*ks0); i0+=static_cast<long>(1L))
        {
            for(long i1=static_cast<long>(0L); i1<static_cast<long>(32L); i1+=static_cast<long>(16L))
            {
                auto tmp0 = ([&]() { __at_align__ bfloat16 tmpbuf[16 * 2]; for (long i1_inner = 0; i1_inner < 16; i1_inner++) tmpbuf[i1_inner] = in_ptr0[static_cast<long>((2L*i1_inner) + (2L*i1) + (256L*i0))]; return at::vec::Vectorized<bfloat16>::loadu(tmpbuf, 16); })();
                { __at_align__ bfloat16 tmpbuf[16*sizeof(float)/sizeof(bfloat16)]; tmp0.store(tmpbuf, 16); for (long i1_inner = 0; i1_inner < 16; i1_inner++) out_ptr1[static_cast<long>((2L*i1_inner) + (2L*i1) + (64L*i0))] = tmpbuf[i1_inner]; }
            }
        }
    }
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99966
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-17 05:19:22 +00:00
47f43ed84a Actually functionalize torch.export (#101433)
I thought I enabled this, but apparently not. This PR makes the export fully functional for real this time :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101433
Approved by: https://github.com/angelayi
2023-05-17 05:09:24 +00:00
0c470b17e3 Extend storage create for custom storageImpl (#100237)
Fixes #ISSUE_NUMBER

For the scenario where users inherit from StorageImpl to implement their own subclasses, the current storage creation method cannot correctly create storage objects.

Following the registration approach used for Allocator, this extends the StorageImpl creation method so that users can register their own custom StorageImpl creation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100237
Approved by: https://github.com/albanD
2023-05-17 04:30:13 +00:00
d1a472a366 Fix Buck OSS build after flatbuffers update in #100716 (#101626)
Broken in trunk after #100716, for example https://github.com/pytorch/pytorch/actions/runs/4996560867/jobs/8949908952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101626
Approved by: https://github.com/PaliC
2023-05-17 04:19:17 +00:00
41d668c9dc work around precision error in constraint solver (#101607)
In https://github.com/pytorch/pytorch/pull/101307 we tried to fix https://github.com/pytorch/pytorch/issues/101093 using `nsimplify` to convert floats into rationals, but the fix is not reliable: it is possible for `nsimplify` to pick constants that don't work.

Currently, constraint solving is only used by `export`, but constraints are added in all modes. This means that we can hit this issue even in non-`export` modes. This diff works around this issue for such modes by delaying raising such failures until constraint solving.
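
For background, `nsimplify` is sympy's float-to-rational conversion; a tiny illustration of the mechanism (the actual failing constants are not shown in this PR description):

```python
from sympy import Rational, nsimplify

# nsimplify searches for a "nice" exact value close to the float:
assert nsimplify(0.25) == Rational(1, 4)
print(nsimplify(0.3333333333333333))   # 1/3

# The unreliability described above: the rational it picks is only close to the
# original float, so a guard that held for the float may not hold exactly for it.
```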

Differential Revision: [D45922797](https://our.internmc.facebook.com/intern/diff/D45922797/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101607
Approved by: https://github.com/ezyang
2023-05-17 03:25:04 +00:00
3c4f97c213 [vision hash update] update the pinned vision hash (#101635)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101635
Approved by: https://github.com/pytorchbot
2023-05-17 03:18:47 +00:00
ba2bc7df8f Enable backward on _foreach_zero_ (#101149)
Currently torchgen cannot find an appropriate `DifferentiabilityInfo` for `_foreach_zero_` because `gen_foreach_derivativeinfo` doesn't correctly make use of `functional_info_by_signature` and `differentiability_infos`, and `is_reference_for_foreach` is a bit too strict for `_foreach_zero_`.

Generated code in `VariableType`
```c++
void _foreach_zero_(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self );

  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<ZeroBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<ZeroBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<ZeroBackward0>(new ZeroBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i] ));
              return grad_fn;
          }
      }());
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_zero_(ks & c10::after_autograd_keyset, self_);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
}
```

Rel:
- #58833
- #96405
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101149
Approved by: https://github.com/soulitzer
2023-05-17 03:10:13 +00:00
3bbf0683a1 [inductor] Refactor RNG operators (#100064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100064
Approved by: https://github.com/ngimel
2023-05-17 01:29:31 +00:00
bb3558961f [MPS] Add histogram ops (#96652)
Adds `torch.histc`, `torch.histogram`, `torch.histogramdd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96652
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-17 01:25:43 +00:00
20cf42de2c Revert "[Reland] Add sym_size/stride/numel/storage_offset to native_function.… (#100749)"
This reverts commit bb454891ed5ce97f580ae52e20f8e9ff2d0f3bf5.
2023-05-16 18:17:02 -07:00
9a17989b63 Prioritize modified tests when running on main (#101618)
If a PR modifies a test, prioritize running that test on the default branch so that we get the test signal faster

Fixes https://github.com/pytorch/pytorch/issues/101617
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101618
Approved by: https://github.com/huydhn
2023-05-17 00:49:45 +00:00
7ca5e68c00 Reorganize foreach ops more logically in native_functions.yaml (#101583)
This is a purely cosmetic change where I organized the foreach ops in native_functions.yaml such that
1. all variants of each op are grouped together
2. add, sub, mul, div are first
3. every op after is alphabetical

This way, it's easier to see all the variants of an op, say add, on one screen. Items 2 and 3 are not strictly necessary but are simply a more organized scheme than not caring at all.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101583
Approved by: https://github.com/mlazos
2023-05-17 00:18:25 +00:00
cde597efa1 [docs] Warn that GradScaler can scale under 1 (#101569)
Completes action item 1 in https://github.com/pytorch/pytorch/issues/99640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101569
Approved by: https://github.com/ngimel
2023-05-16 23:56:07 +00:00
e69198b043 Revert "Support map autograd and pytree in/out (#100494)"
This reverts commit b8fa41be9d396d97cfcd53964a228e2f987e104a.

Reverted https://github.com/pytorch/pytorch/pull/100494 on behalf of https://github.com/PaliC due to breaking tests on trunk, please check hud.pytorch.org for the broken tests ([comment](https://github.com/pytorch/pytorch/pull/100494#issuecomment-1550454835))
2023-05-16 22:50:18 +00:00
b8fa41be9d Support map autograd and pytree in/out (#100494)
This PR adds autograd and pytree support for map operator.

Implementation-wise:

1. We temporarily make two HigherOrderOperators, "map" and "map_impl":
- "map" is user-facing. Currently, it unwraps the pytrees in inputs and create a flat_fn for it. Dynamo currently cannot deal with pytree.tree_flatten and pytree.tree_unflatten, we therefore make it a HigherOrderOperator to trigger dynamo logic of handling HigherOrderOperators.
- "map_impl" is the actual operator that works with the rest of torch subsystems such as functionalization, make_fx. It accepts flattend arguments, and a num_mapped_args integer denoting how many of the flattend arguments need to mapped i.e. their first dimension will be unstacked.

2. We create the forward and backward graphs in the autograd key and call torch.autograd.Function. Currently, the backward graph is recomputation-based; we need to partition the joint graph in the future to be more efficient.

Example traced graphs for map operators:
### Case 1: simple f and autograd
```python
def f(x, y):
    return x + y

def g(xs, y):
    out = control_flow.map(f, xs, y)
    return torch.autograd.grad(out, (xs, y), torch.ones_like(out))

gm = make_fx(g, tracing_mode="symbolic")(torch.ones(3, 4, 5, requires_grad=True), torch.ones(5, requires_grad=True))
# gm.print_readable() produces following:
class g(torch.nn.Module):
    def forward(self, xs_1: f32[3, s1, s2], y_1: f32[s2]):
        # No stacktrace found for following nodes
        body_graph_0 = self.body_graph_0
        map_impl = torch.ops.map_impl(body_graph_0, 1, xs_1, y_1);  body_graph_0 = None
        getitem: f32[3, s1, s2] = map_impl[0];  map_impl = None
        ones_like: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem, pin_memory = False)
        is_same_size = torch.ops.aten.is_same_size.default(getitem, ones_like);  getitem = None
        body_graph_1 = self.body_graph_1
        map_impl_1 = torch.ops.map_impl(body_graph_1, 2, xs_1, ones_like, y_1);  body_graph_1 = xs_1 = ones_like = None
        getitem_1 = map_impl_1[0]
        getitem_2: f32[3, s1, s2] = map_impl_1[1]
        getitem_3: f32[3, s2] = map_impl_1[2];  map_impl_1 = None
        sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(getitem_3, [0], True);  getitem_3 = None
        sym_size: Sym(s2) = torch.ops.aten.sym_size(y_1, 0);  y_1 = None
        view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
        return (getitem_2, view)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s2]):
            # No stacktrace found for following nodes
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(arg1_1, arg2_1);  arg1_1 = arg2_1 = None
            return [add]

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s2]):
            # No stacktrace found for following nodes
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(arg1_1, arg3_1);  arg1_1 = None
            is_same_size = torch.ops.aten.is_same_size.default(add, arg2_1);  add = None
            sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(arg2_1, [0], True)
            sym_size: Sym(s2) = torch.ops.aten.sym_size(arg3_1, 0);  arg3_1 = None
            view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
            return [None, arg2_1, view]
```
### Case 2: list input/output f and autograd
```python
def f(x, y):
    return [x[0].cos() + y.sin(), x[1].sin() * y.cos()]

def g(xs, y):
    out = control_flow.map(f, xs, y)
    flat_out, _ = pytree.tree_flatten(out)
    flat_inp, _ = pytree.tree_flatten((xs, y))
    requires_grad_inp = [inp for inp in flat_inp if inp.requires_grad]
    return torch.autograd.grad(flat_out, requires_grad_inp, [torch.ones_like(out) for out in flat_out])

gm = make_fx(g, tracing_mode="symbolic")(
    [torch.ones(3, 4, 5), torch.ones(3, 4, 5, requires_grad=True)],
    torch.ones(5, requires_grad=True))

# gm.print_readable() produces following:
class g(torch.nn.Module):
    def forward(self, xs, y):
        xs_1: f32[3, s1, s2], xs_2: f32[3, s1, s2], y_1: f32[s2], = fx_pytree.tree_flatten_spec([xs, y], self._in_spec)
        # No stacktrace found for following nodes
        body_graph_0 = self.body_graph_0
        map_impl = torch.ops.map_impl(body_graph_0, 2, xs_1, xs_2, y_1);  body_graph_0 = None
        getitem: f32[3, s1, s2] = map_impl[0]
        getitem_1: f32[3, s1, s2] = map_impl[1];  map_impl = None
        ones_like: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem, pin_memory = False)
        ones_like_1: f32[3, s1, s2] = torch.ops.aten.ones_like.default(getitem_1, pin_memory = False)
        is_same_size = torch.ops.aten.is_same_size.default(getitem, ones_like);  getitem = None
        is_same_size_1 = torch.ops.aten.is_same_size.default(getitem_1, ones_like_1);  getitem_1 = None
        body_graph_1 = self.body_graph_1
        map_impl_1 = torch.ops.map_impl(body_graph_1, 4, xs_1, xs_2, ones_like, ones_like_1, y_1);  body_graph_1 = xs_1 = xs_2 = ones_like = ones_like_1 = None
        getitem_2 = map_impl_1[0]
        getitem_3 = map_impl_1[1]
        getitem_4: f32[3, s1, s2] = map_impl_1[2]
        getitem_5: f32[3, s2] = map_impl_1[3];  map_impl_1 = None
        sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(getitem_5, [0], True);  getitem_5 = None
        sym_size: Sym(s2) = torch.ops.aten.sym_size(y_1, 0);  y_1 = None
        view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = sym_size = None
        return pytree.tree_unflatten([getitem_4, view], self._out_spec)

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s2]):
            # No stacktrace found for following nodes
            cos: f32[s1, s2] = torch.ops.aten.cos.default(arg1_1);  arg1_1 = None
            sin: f32[s2] = torch.ops.aten.sin.default(arg3_1)
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            sin_1: f32[s1, s2] = torch.ops.aten.sin.default(arg2_1);  arg2_1 = None
            cos_1: f32[s2] = torch.ops.aten.cos.default(arg3_1);  arg3_1 = None
            mul: f32[s1, s2] = torch.ops.aten.mul.Tensor(sin_1, cos_1);  sin_1 = cos_1 = None
            return [add, mul]

    class <lambda>(torch.nn.Module):
        def forward(self, arg0_1, arg1_1: f32[s1, s2], arg2_1: f32[s1, s2], arg3_1: f32[s1, s2], arg4_1: f32[s1, s2], arg5_1: f32[s2]):
            # No stacktrace found for following nodes
            cos: f32[s1, s2] = torch.ops.aten.cos.default(arg1_1);  arg1_1 = None
            sin: f32[s2] = torch.ops.aten.sin.default(arg5_1)
            add: f32[s1, s2] = torch.ops.aten.add.Tensor(cos, sin);  cos = sin = None
            sin_1: f32[s1, s2] = torch.ops.aten.sin.default(arg2_1)
            cos_1: f32[s2] = torch.ops.aten.cos.default(arg5_1)
            mul: f32[s1, s2] = torch.ops.aten.mul.Tensor(sin_1, cos_1)
            is_same_size = torch.ops.aten.is_same_size.default(add, arg3_1);  add = None
            is_same_size_1 = torch.ops.aten.is_same_size.default(mul, arg4_1);  mul = None
            mul_1: f32[s1, s2] = torch.ops.aten.mul.Tensor(arg4_1, sin_1);  sin_1 = None
            mul_2: f32[s1, s2] = torch.ops.aten.mul.Tensor(arg4_1, cos_1);  arg4_1 = cos_1 = None
            sum_1: f32[1, s2] = torch.ops.aten.sum.dim_IntList(mul_1, [0], True);  mul_1 = None
            sym_size: Sym(s2) = torch.ops.aten.sym_size(arg5_1, 0)
            view: f32[s2] = torch.ops.aten.view.default(sum_1, [sym_size]);  sum_1 = None

            #
            sin_2: f32[s2] = torch.ops.aten.sin.default(arg5_1)
            neg: f32[s2] = torch.ops.aten.neg.default(sin_2);  sin_2 = None
            mul_3: f32[s2] = torch.ops.aten.mul.Tensor(view, neg);  view = neg = None
            cos_2: f32[s1, s2] = torch.ops.aten.cos.default(arg2_1);  arg2_1 = None
            mul_4: f32[s1, s2] = torch.ops.aten.mul.Tensor(mul_2, cos_2);  mul_2 = cos_2 = None
            sum_2: f32[1, s2] = torch.ops.aten.sum.dim_IntList(arg3_1, [0], True);  arg3_1 = None
            view_1: f32[s2] = torch.ops.aten.view.default(sum_2, [sym_size]);  sum_2 = sym_size = None
            cos_3: f32[s2] = torch.ops.aten.cos.default(arg5_1);  arg5_1 = None
            mul_5: f32[s2] = torch.ops.aten.mul.Tensor(view_1, cos_3);  view_1 = cos_3 = None
            add_1: f32[s2] = torch.ops.aten.add.Tensor(mul_3, mul_5);  mul_3 = mul_5 = None
            return [None, None, mul_4, add_1]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100494
Approved by: https://github.com/zou3519
2023-05-16 22:05:11 +00:00
552b712f80 Run C++ testcases in parallel with pytest-xdist (#101440)
After an investigation, running C++ tests with https://github.com/pytest-dev/pytest-cpp is just slower than running them directly, plain and simple. I'm curious about the exact root cause, but that's a story for another day.

`time build/bin/test_lazy` takes half a minute to run 610 tests on `linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 2, 5, linux.4xlarge.nvidia.gpu)` while `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v` takes 20+ minutes on the same runner.  This is a very costly price to pay.

The saving grace here is that https://github.com/pytest-dev/pytest-cpp supports pytest-xdist to run tests in parallel with `-n auto`, so `time pytest /var/lib/jenkins/workspace/build/bin/test_lazy -v -n auto` takes only 3 minutes.  This is still not as fast as running C++ tests directly, but it's an order of magnitude faster than running them sequentially.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101440
Approved by: https://github.com/clee2000
2023-05-16 21:52:36 +00:00
b998ec96ac Don't run libtorch tests on slow test shard (#101429)
They should run on the default shard instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101429
Approved by: https://github.com/huydhn
2023-05-16 21:50:14 +00:00
a26516b78b Add inductor as a test disable group (#101448)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101448
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-16 21:48:49 +00:00
e0fc24cdc5 add retries to inductor benchmark suite (#101019)
This pr accomplishes
1) Enables retries for downloading torchbenchmark and huggingface models, in a similar manner to how we do it for timm models right now (as sketched below).
2) Creates a `_download_model` function for the hugging face and TIMM runners whose output I plan to use to preload the models somewhere if possible (please double check I'll be saving the right thing). Instead of retries, we plan to just add torchbench to a docker image as it is relatively small.
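
A rough sketch of the retry approach described in item 1 (the helper name, retry count, and delay here are hypothetical, not the code added by this PR):

```python
import time

def download_with_retries(download_fn, retries=3, delay=5):
    # retry a flaky model download a few times before giving up
    for attempt in range(retries):
        try:
            return download_fn()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```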

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 3361a4c</samp>

> _We're the brave and bold coders of the `common.py` module_
> _We've made a handy function for downloading models_
> _We've shared it with our mates in the other runners_
> _So pull and push and try again, we'll get them all in time_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101019
Approved by: https://github.com/huydhn, https://github.com/desertfire
2023-05-16 21:41:50 +00:00
01da732691 Fix type annotation of torch.split (#100655)
The type annotation indicates `list` but the returned type is `tuple`
```python
>>> import torch
>>> type(torch.arange(10).split(4))
<class 'tuple'>
```
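
A minimal sketch of the distinction being fixed (the exact stub signature is an assumption based on the description above):

```python
from typing import List, Tuple
from torch import Tensor

# before: the annotation claimed List[Tensor]; after: a variable-length tuple, matching the runtime
def split_old(tensor: Tensor, split_size_or_sections, dim: int = 0) -> List[Tensor]: ...
def split_new(tensor: Tensor, split_size_or_sections, dim: int = 0) -> Tuple[Tensor, ...]: ...
```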
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100655
Approved by: https://github.com/kit1980
2023-05-16 21:35:41 +00:00
41468833fb vision_maskrcnn is now deterministic (#101116)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101116
Approved by: https://github.com/ngimel
2023-05-16 21:32:17 +00:00
2e08c68564 Avoid cond prefix when naming subgraph of HigherOrderOperators (#101439)
Fixes the issue in  #100278: HigherOrderOperator body functions should not all be named "cond_body".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101439
Approved by: https://github.com/zou3519
2023-05-16 21:28:40 +00:00
0585944eac Revert "match sdpa patterns from HF (#100609)"
This reverts commit 0a7ea9627f087afef5c59b2500e018abb6c3c1b5.

Reverted https://github.com/pytorch/pytorch/pull/100609 on behalf of https://github.com/izaitsevfb due to Breaks internal tests, see D45899223 ([comment](https://github.com/pytorch/pytorch/pull/100609#issuecomment-1550349249))
2023-05-16 20:57:33 +00:00
42e65a2587 [pt2] add meta for linalg_lu_factor_ex (#101375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101375
Approved by: https://github.com/lezcano
2023-05-16 20:56:54 +00:00
cb734123e2 [GHF] Ignore flaky classification for pin updates (#101587)
Also add regression test to validate it

Fixes https://github.com/pytorch/test-infra/issues/4126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101587
Approved by: https://github.com/atalman
2023-05-16 20:48:30 +00:00
b1474019a4 Test Reordering: Run previously failing tests first (#101123)
Makes the CI prioritize running any test files that had a failing test in a previous iteration of the given PR.

A follow up to https://github.com/pytorch/pytorch/pull/100522 which makes the `.pytest_cache` available to use here

A concrete example:
1. Person A pushes a new commit and creates a PR.
2. 2 hours later, test_im_now_broken.py fails
3. Person A attempts to fix the test, but the test is actually still broken
4. The CI, seeing that test_im_now_broken.py had failed on a previous run, will now prioritize running that test first. Instead of waiting another 2 hours to get a signal, Person A only needs to wait ~15 minutes (which is how long it takes for tests to start running)

# Testing
I modified a file to make the tests invoking it fail and triggered CI twice with this failure.

First run: https://github.com/pytorch/pytorch/actions/runs/4963943209/jobs/8883800811
Test step took 1h 9m to run

Second run: https://github.com/pytorch/pytorch/actions/runs/4965016776/jobs/8885657992
Test step failed within 2m 27s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101123
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-05-16 19:57:54 +00:00
b5ed606a8b use Bazelisk to fetch Bazel in CI (#101424)
use Bazelisk to fetch Bazel in CI

Summary:

Advantages:
1. this is cross-platform, no MacOS specific code
2. this lets us define the version of Bazel succinctly in
   .bazelversion (already provided)

Note that this change will upgrade our version of Bazel to 6.1.1.

Test Plan: Rely on CI.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101424).
* #101406
* #101405
* #101445
* __->__ #101424
* #101411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101424
Approved by: https://github.com/huydhn, https://github.com/vors
2023-05-16 19:52:15 +00:00
e7681b53e3 Fix typing for setup_context in autograd (#101464)
The original only matches a tuple of length 1, but it's intended to match any length.

Also, it now aligns with the docstring at L320
d5cba0618a/torch/autograd/function.py (L320)
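
A minimal sketch of the typing distinction being fixed here (parameter names are assumptions based on the description, not the exact diff):

```python
from typing import Any, Tuple

# Tuple[Any] only matches 1-element tuples; Tuple[Any, ...] matches tuples of any length
def setup_context_before(ctx: Any, inputs: Tuple[Any], output: Any) -> Any: ...
def setup_context_after(ctx: Any, inputs: Tuple[Any, ...], output: Any) -> Any: ...
```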
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101464
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-05-16 18:41:35 +00:00
eac5f2a8e4 Revert "Actually functionalize torch.export (#101433)"
This reverts commit eec752ed056160ea848facfd19a19235e5f16e55.

Reverted https://github.com/pytorch/pytorch/pull/101433 on behalf of https://github.com/PaliC due to causing failures on functorch macOS tests ([comment](https://github.com/pytorch/pytorch/pull/101433#issuecomment-1550111671))
2023-05-16 17:51:45 +00:00
b94f143ace SymIntify convNd and conv_transposeNd, fix inductor symint handling (#101488)
Fixes https://github.com/pytorch/pytorch/issues/101014

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101488
Approved by: https://github.com/ngimel
2023-05-16 17:46:52 +00:00
411ba1c8bf [pt2] skip flaky linalg_householder_product tests (#101551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101551
Approved by: https://github.com/lezcano
2023-05-16 17:44:15 +00:00
afea1a9fe9 [meta] error checking for inplace ops (#101532)
Fixes #100753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101532
Approved by: https://github.com/lezcano
2023-05-16 17:26:59 +00:00
54fe828cd0 Improve rebase message when PR is uptodate (#101504)
Also, preserve target branch commit revision at time of merge

Fixes https://github.com/pytorch/test-infra/issues/4148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101504
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-05-16 17:26:08 +00:00
20deccf8a1 BE changes for tryrebase.py (#101503)
- Use a context manager rather than an explicit `try: finally:`

- Add `ref/remotes` prefix to `onto_branch` in `main` rather than in
`rebase_onto` functions

- Define `MAIN_BRANCH` and `VIABLE_STRICT_BRANCH` constants in tests.
- Replace `self.assertTrue(x in y)` with `self.assertIn(x, y)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101503
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-05-16 17:26:08 +00:00
1272cd73da Revert "extend serialization for tensor metadata (#99808)"
This reverts commit 4b9bc6f2a6d33fc9ca8065789fc287ad411b27ac.

Reverted https://github.com/pytorch/pytorch/pull/99808 on behalf of https://github.com/izaitsevfb due to Breaks internal builds: ld.lld: error: undefined symbol: torch::jit::GetBackendMetaSerialization() ([comment](https://github.com/pytorch/pytorch/pull/99808#issuecomment-1550071656))
2023-05-16 17:22:25 +00:00
wgb
3f87c04cf8 fix a typo in common_device_type.py (#101485)
Fixes #ISSUE_NUMBER
In common_device_type.py the PrivateUse1TestBase class has a typo. Reference: https://github.com/pytorch/pytorch/pull/99960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101485
Approved by: https://github.com/albanD
2023-05-16 17:16:09 +00:00
2a3e45a2a8 Docs: update default device description (#101283)
Closes #101274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101283
Approved by: https://github.com/albanD
2023-05-16 17:07:31 +00:00
010763be9a [DTensor][2/N] add DTensor constructor function: empty (#101022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101022
Approved by: https://github.com/wanchaol
2023-05-16 16:50:54 +00:00
5cc361c736 [DTensor][1/N] add DTensor constructor function: ones (#100933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100933
Approved by: https://github.com/wanchaol
2023-05-16 16:50:54 +00:00
2af7df62a5 log inductor compilation time to scuba (#101317)
Summary: Set up timer around `compile_fx_inner` and log to scuba

Differential Revision: D45822137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101317
Approved by: https://github.com/nmacchioni
2023-05-16 16:32:17 +00:00
eec752ed05 Actually functionalize torch.export (#101433)
I thought I had enabled this, but apparently not. This PR makes the export fully functional for real this time :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101433
Approved by: https://github.com/angelayi
2023-05-16 16:22:13 +00:00
59dff01319 Add top level function to check if running with deploy (#101420)
Also not sure if this should be a public function or not. Leaving it private for now but let me know if you prefer for it to be public.

FYI @nikitaved this will logically conflict with your triton kernel PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101420
Approved by: https://github.com/malfet
2023-05-16 16:05:49 +00:00
05f6250815 Add missing torch.distributed.ReduceOp.AVG in type stubs (#101534)
Add missing `AVG` to `torch.distributed.ReduceOp` enum for type annotation.

Ref:

88b6a4577b/torch/csrc/distributed/c10d/Types.hpp (L35-L47)
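
A hedged usage example of the member being typed (AVG itself requires a NCCL process group at runtime; only the annotation is the subject of this PR):

```python
import torch.distributed as dist

# with the stub fix, type checkers accept ReduceOp.AVG just like SUM/MIN/MAX
op = dist.ReduceOp.AVG
# dist.all_reduce(tensor, op=op)  # would average across ranks on a NCCL process group
```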
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101534
Approved by: https://github.com/Skylion007
2023-05-16 15:51:21 +00:00
47d31364d7 run buildifier on WORKSPACE (#101411)
run buildifier on WORKSPACE

Summary: Make it easier to keep the file clean with subsequent changes.

Test Plan: Should be a no-op.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101411).
* #101406
* #101405
* #101445
* #101424
* __->__ #101411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101411
Approved by: https://github.com/huydhn
2023-05-16 14:53:28 +00:00
ff3f19615f Type conversion between float/complex dtypes (#97935)
This PR implements the feature request https://github.com/pytorch/pytorch/issues/97888, adding `torch.dtype.to_complex()` and `torch.dtype.to_float()` methods that convert between float and complex dtypes of the same precision.
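
A hedged usage sketch; the method names are taken from the description above and the exact dtype pairings are my assumption:

```python
import torch

# convert between float and complex dtypes of the same precision
print(torch.float32.to_complex())   # expected: torch.complex64
print(torch.complex128.to_float())  # expected: torch.float64
```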

Disclaimer: it's the first time I code in C++, so hopefully the code is correct, but I'm not super confident about the PR. Any advice/comment is welcome. It's also my first contribution to a large library, so hopefully I'm not doing anything wrong!

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97935
Approved by: https://github.com/ezyang
2023-05-16 14:39:44 +00:00
2b2a717f19 [inductor] erfc: lowering (#101416)
Codegen support was already present. This PR just removes the fallback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101416
Approved by: https://github.com/lezcano
2023-05-16 14:31:13 +00:00
23d1cc3811 Update llama to failing (#101565)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101565
Approved by: https://github.com/janeyx99
2023-05-16 14:12:26 +00:00
9e023e1818 [fx] Better replacements finder in subgraph rewriter (#100556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100556
Approved by: https://github.com/mcr229
2023-05-16 14:08:44 +00:00
6bc0f4a4ee [reland][CustomOp] Add Dispatcher error callback (#101452)
Reland of #101015, original stack reverted due to internal test
flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101452
Approved by: https://github.com/soulitzer
2023-05-16 13:33:31 +00:00
c8be493dac [reland][custom_op] Change the python type that maps to ListType in schema (#101451)
Reland of #101190. Original stack was reverted due to internal test
flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101451
Approved by: https://github.com/soulitzer
2023-05-16 13:33:31 +00:00
4f8cbaa10a [reland] Cleanup custom op library after each custom_op test (#101450)
Reland of #100980. Original PR was reverted due to internal test
flakiness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101450
Approved by: https://github.com/soulitzer
2023-05-16 13:33:29 +00:00
c2e16d8b2c buck1 can't properly handle '/' on rule names, so fixing 'impl/cow/context' and 'core/impl/cow/context_test' build rules (#101552)
This is because PR #101510 can't be landed due to the author not having linked a GitHub account to their internal Meta account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101552
Approved by: https://github.com/DanilBaibak
2023-05-16 11:28:58 +00:00
c51dfbf5b4 triu/tril: complete dtype support for CPU/CUDA. (#101414)
As per title, we can support full dtype table for these ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101414
Approved by: https://github.com/ngimel
2023-05-16 10:42:36 +00:00
88b6a4577b inductor: fix sign gets wrong result dtype issue (#101377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101377
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-05-16 08:01:06 +00:00
935100cbde [profiler] When record_inputs=True, record scalar lists of length <= 30 (#100593)
Many ops take as inputs scalars or scalar lists which are important for understanding the properties of the op. For example, convolution ops' behavior and output shapes often depend on padding and strides, which are provided as scalars or lists of scalars. This will record scalar lists when record_inputs=True.

Details:
During collection (and this was true before this PR as well), we serialize values and tensor metadata into an InputOutputEncoder. After collection occurs, we deserialize these values to attach the information to each of the events.

This PR does this:
- Adds support for serializing scalar lists during collection / serialization
- Adds an extra field called "Concrete Args"
- Splits up the deserialization process into two steps - one for generating "input shapes" and one for generating "concrete args". We split up input shapes and concrete args to avoid interrupting any previous workflows that relied on the specific data in the input shapes category; additionally, it's just a better description. Note that single scalars will remain in the "input shapes" category as they were already in that category in the past.

Differential Revision: [D45798431](https://our.internmc.facebook.com/intern/diff/D45798431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100593
Approved by: https://github.com/aaronenyeshi
2023-05-16 07:58:46 +00:00
e389bfa01a inductor: add dtype check before doing cpu binary fusion (#101376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101376
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-05-16 07:27:26 +00:00
6f7ebcdcd8 [inductor] enable descriptive name for cpp kernels (#101330)
This PR enables the descriptive name for cpp kernels similar to the triton kernel name. A new configuration `config.cpp.descriptive_names` is added similar to that of triton. The kernel name follows the format: `cpp_<fused_name>_<id>`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101330
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-05-16 06:48:11 +00:00
86869475ff [inductor] move dtype propagation log to schedule artifact (#101351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101351
Approved by: https://github.com/jansel
2023-05-16 06:43:38 +00:00
dfc46153a7 [inductor] add graph id prefix to inductor_wrapper_call profile info (#101350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101350
Approved by: https://github.com/jansel
2023-05-16 06:32:00 +00:00
d9d34b3e18 [vision hash update] update the pinned vision hash (#101471)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101471
Approved by: https://github.com/pytorchbot
2023-05-16 05:33:08 +00:00
c03555a303 add retries to external contribution data upload (#100889)
Adds retries to external contribution upload as it is shown to be flaky
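
A rough sketch of the read-back verification idea (bucket, key, and helper names here are hypothetical, not the ones added to `upload_stats_lib.py`):

```python
import json

def upload_and_verify(s3_client, bucket, key, payload, retries=3):
    # upload JSON stats, then read the object back to confirm the write landed
    body = json.dumps(payload).encode()
    for _ in range(retries):
        s3_client.put_object(Bucket=bucket, Key=key, Body=body)
        read_back = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        if read_back == body:
            return
    raise RuntimeError(f"failed to verify upload of {key} after {retries} attempts")
```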

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 43c2602</samp>

Added a function to read data from S3 objects and used it to implement a retry mechanism and verification for uploading external contribution stats. Modified `tools/stats/upload_external_contrib_stats.py` and `tools/stats/upload_stats_lib.py`.
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 43c2602</samp>

> _We'll upload the stats to the cloud, me hearties_
> _We'll use `read_from_s3` to check them all_
> _We'll retry if the connection fails, me hearties_
> _We'll log the results and have a ball_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100889
Approved by: https://github.com/huydhn
2023-05-16 05:00:48 +00:00
773f6b626d [ONNX] Diagnostic to show all unsupported call_functions (#100451)
Introduce `Analysis` to analyze fx graphmodule and emit diagnostics. This class
can be extended to interact with `Transform` (passes) to decide if a pass should
trigger based on the graph analysis result, e.g., whether decomp needs to run, by checking
the operator namespace of nodes. For now this is left out of scope, but we can revisit it
if maintaining multiple fx extractors becomes a reality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100451
Approved by: https://github.com/titaiwangms
2023-05-16 04:59:23 +00:00
45d080e0ac [ONNX] Diagnostic 'log' and 'log_and_raise_if_error' (#100407)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100407
Approved by: https://github.com/thiagocrepaldi
2023-05-16 04:55:50 +00:00
e4eaf33346 Re-enable detectron2_maskrcnn on CI (#100791)
#99665 has been fixed, we can re-enable these models on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100791
Approved by: https://github.com/huydhn
2023-05-16 04:25:58 +00:00
af4248b9ad Update the torchbench pin to include timm upgrade (#101466)
Upgrade the timm  (huggingface/pytorch-image-models) repo from 45af496 to 6635bc3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101466
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-05-16 04:21:44 +00:00
964e61ee95 [quant][pt2] Handle no conv bias in prepare QAT fusion (#100610)
Summary: This commit adds support for conv + BN fusion for the
case where conv has no bias. Since the replacement patterns with
and without conv bias are substantially different, we perform the
replacement for each of these two cases separately.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_no_conv_bias

Reviewers: jerryzh168, kimishpatel

Differential Revision: [D45743510](https://our.internmc.facebook.com/intern/diff/D45743510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100610
Approved by: https://github.com/jerryzh168
2023-05-16 04:05:53 +00:00
7052fb37bd [Dynamo] Improve handling UnspecializedNNModuleVariable side effect (#101141)
Fixes #101102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101141
Approved by: https://github.com/jansel
2023-05-16 03:57:13 +00:00
6f8a71aa3d [c10d][Fix] Start gloo sequence numbers at 0. (#101422)
Gloo PG used to create a random sequence number and broadcast it to
the rest of the group. But when we started enforcing sequence number checks in
ProcessGroupWrapper, we observed this was occasionally flaky. For example, this
error in a job was wrong, as all ranks were running the first broadcast
collective. Somehow the sequence number wasn't communicated across the store
correctly:

```
RuntimeError: Detected mismatch between collectives on ranks. Rank 16 is running collective: CollectiveFingerPrint(SequenceNumber=1977865401, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=54090078, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).Collectives differ in the following aspects: Sequence number: 1977865401vs 54090078
```

The issue reproduces rarely in tests, but is more common in large world size
jobs.

Differential Revision: [D45870688](https://our.internmc.facebook.com/intern/diff/D45870688/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101422
Approved by: https://github.com/H-Huang
2023-05-16 03:55:36 +00:00
4b849744d1 [IValue] Only coalesce once (#101447)
Differential Revision: [D45880966](https://our.internmc.facebook.com/intern/diff/D45880966/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101447
Approved by: https://github.com/H-Huang
2023-05-16 03:52:53 +00:00
13056ca229 Revert "[fx] Better replacements finder in subgraph rewriter (#100556)"
This reverts commit 9842d1ef94e84088735e143bffb238cb5eda7446.

Reverted https://github.com/pytorch/pytorch/pull/100556 on behalf of https://github.com/izaitsevfb due to Reverting temporarily to unblock diff train, see D45743510 and #100610 ([comment](https://github.com/pytorch/pytorch/pull/100556#issuecomment-1548934932))
2023-05-16 03:50:06 +00:00
194d360329 Add more canonical way of adding runtime pass (#100956)
* #100955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100956
Approved by: https://github.com/ydwu4, https://github.com/guangy10
2023-05-16 03:23:04 +00:00
f0786ad776 Use %zu instead of %ld when formatting size_t (#101412)
This fixes compiling on systems where `size_t` is an `unsigned int` instead of an `unsigned long int` (32 bit Raspberry Pi OS is one example).
`%ld` expects a `long int`, while `%zu` is the correct specifier for a `size_t`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101412
Approved by: https://github.com/albanD
2023-05-16 02:45:55 +00:00
52363de2ec Clean up grad check in sdp_utils.h (#101435)
# Summary
The priorty order was not being run correctly because of confusing function name
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101435
Approved by: https://github.com/jbschlosser
2023-05-16 02:22:45 +00:00
c3f7db3f52 Use python3 instead of /usr/bin/env python3 on Windows (#101437)
I'm seeing some curious flaky failures on Windows where python3 couldn't be found in the env, for example https://github.com/pytorch/pytorch/actions/runs/4983028765/jobs/8920011406 or https://github.com/pytorch/pytorch/actions/runs/4967229128/jobs/8889106289.  On the other hand, other scripts invoked directly with python3 work fine in the same workflow.

So let's use `python3 .github/scripts/parse_ref.py` instead.  The binary python3 is in GITHUB_PATH, populated by the `setup-windows` step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101437
Approved by: https://github.com/PaliC
2023-05-16 02:10:24 +00:00
24cc7fe020 Fix Wishart distribution documentation (#95816)
This PR fixes the `torch.distributions.wishart.Wishart` example.

Running the current example
```python
m = Wishart(torch.eye(2), torch.Tensor([2]))
m.sample()  # Wishart distributed with mean=`df * I` and
            # variance(x_ij)=`df` for i != j and variance(x_ij)=`2 * df` for i == j
```
fails with
```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Untitled-1 in
      [321](untitled:Untitled-1?line=320) # %%
----> [322](untitled:Untitled-1?line=321) m = Wishart(torch.eye(2), torch.Tensor([2]))
      [323](untitled:Untitled-1?line=322) m.sample()  # Wishart distributed with mean=`df * I` and
      [324](untitled:Untitled-1?line=323)             # variance(x_ij)=`df` for i != j and variance(x_ij)=`2 * df` for i == j

Untitled-1 in __init__(self, df, covariance_matrix, precision_matrix, scale_tril, validate_args)
     [83](untitled:Untitled-1?line=82)
     [84](untitled:Untitled-1?line=83)         if param.dim() < 2:
---> [85](untitled:Untitled-1?line=84)             raise ValueError("scale_tril must be at least two-dimensional, with optional leading batch dimensions")
     [86](untitled:Untitled-1?line=85)
     [87](untitled:Untitled-1?line=86)         if isinstance(df, Number):

ValueError: scale_tril must be at least two-dimensional, with optional leading batch dimensions
```

It seems that the parameters of `Wishart.__init__()` were re-ordered, but the documentation was not updated.
This PR fixes it. Here is the updated behaviour:

```python
m = Wishart(torch.Tensor([2]), covariance_matrix=torch.eye(2))
m.sample()
```

```
Untitled-1:255: UserWarning: Singular sample detected.
tensor([[[6.6366, 0.7796],
         [0.7796, 0.2136]]])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95816
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-05-16 02:07:30 +00:00
7f3b00bfe0 [Inductor] Improve view/reshape on tensors with shape 0 (#101051)
Fixes failures in the 14k GitHub models.
This is a follow-up to #99671. There is another case we don't handle well, which inspired me to switch to `fake_reindex` to handle tensors with shape 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101051
Approved by: https://github.com/ngimel
2023-05-16 01:38:14 +00:00
d198033661 Revert torch.fx.interpreter error printing change (#101462)
Apparently this is breaking internal peeps and I don't care enough
to keep it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101462
Approved by: https://github.com/wushirong, https://github.com/ngimel, https://github.com/voznesenskym
2023-05-16 01:25:49 +00:00
799ef7e501 [caffe2/tools/autograd] Fix non-determinism in code gen (#101425)
Fix several cases of leaking set-iteration-order to generated sources, causing non-determinism in generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101425
Approved by: https://github.com/albanD
2023-05-16 00:54:03 +00:00
a8376099f9 fix print tensor in cpp for privateuse1 (#100797)
Fixes #ISSUE_NUMBER
1. Fix lintrunner issues in `test/inductor/test_cuda_repro.py`
2. In Libtorch, if we rename the `privateuseone` backend to `foo` and then print a tensor with `std::cout << tensor`, we get output like this:
```
1.0, 2.0 ...
[PrivateUse1FloatType{2,3}]
```
and it should be like this
```
1.0, 2.0 ...
[fooFloatType{2,3}]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100797
Approved by: https://github.com/ezyang
2023-05-16 00:25:35 +00:00
788ff0623b [decomp] fix decomp of batch_norm when weight/bias is not flattened (#101059)
Fix https://github.com/pytorch/pytorch/issues/100970
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101059
Approved by: https://github.com/ezyang
2023-05-16 00:00:34 +00:00
1faef895ca Inductor cpp wrapper: support sympy.Expr as input (#101257)
Leverage the logic in https://github.com/pytorch/pytorch/pull/95533 to get the `dtype` of `sympy.Expr` and support it as graph input in the cpp wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101257
Approved by: https://github.com/jgong5, https://github.com/Skylion007, https://github.com/EikanWang, https://github.com/jansel
2023-05-15 23:57:28 +00:00
187eb7ca88 Enable default workflow PyT 2.0 UTs on ROCm stack (#100981)
PR to enable default workflow PyTorch 2.0 unit tests for the ROCm stack.

- Enables all the dynamo unit test suites
- Enables some of the inductor unit test suites
       - `test_config`
       - `test_cpp_wrapper` (cpu only)
       - `test_minifier`
       - `test_standalone_compile`
       - `test_torchinductor_dynamic_shapes`
       - `test_torchinductor_opinfo`
       - `test_torchinductor`
       - `test_triton_wrapper`
- Introduces TEST_WITH_ROCM conditions for unit test skip/fail dictionaries in test_torchinductor_dynamic_shapes.py and test_torchinductor_opinfo.py

Note this PR follows on from the discussions for the previous UT enablement PR https://github.com/pytorch/pytorch/pull/97988, we have opted to only enable a few inductor suites at the moment to ease the upstreaming effort as these files are changing very quickly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100981
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-05-15 23:45:04 +00:00
01c7106580 [opinfo] empty_strided (#100890)
Follows: #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100890
Approved by: https://github.com/ezyang
2023-05-15 23:39:39 +00:00
3920ec1442 Apply the same fix to cleanup process on Windows CPU build job (#101460)
This goes together with https://github.com/pytorch/test-infra/pull/4169.  To be replace by the main branch once https://github.com/pytorch/test-infra/pull/4169 merges
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101460
Approved by: https://github.com/clee2000, https://github.com/PaliC
2023-05-15 23:19:03 +00:00
0577043d94 Rename minpybind namespace from py to mpy (#101410)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101410
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-05-15 23:15:01 +00:00
a206e8b027 [small BE] update NcclTest dim size (#101127)
Previously, input dimensions were fixed to 3x3; this is a small change to make that configurable. It will be used in future additions to the NCCL tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101127
Approved by: https://github.com/rohan-varma
2023-05-15 23:05:10 +00:00
59a3759d97 Update cpp_extension.py (#101285)
When we need to link extra libs, we should note that 64-bit CUDA may be installed in "lib", not in "lib64".
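
A minimal sketch of the check described above (the helper name is hypothetical; `cpp_extension.py` may structure this differently):

```python
import os

def cuda_lib_dir(cuda_home: str) -> str:
    # prefer lib64 when present, otherwise fall back to lib (some 64-bit installs use lib)
    lib64 = os.path.join(cuda_home, "lib64")
    return lib64 if os.path.isdir(lib64) else os.path.join(cuda_home, "lib")
```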

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 05c1ca6</samp>

Improve CUDA compatibility in `torch.utils.cpp_extension` by checking for `lib64` or `lib` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101285
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-05-15 22:47:41 +00:00
1732077758 Bump up flatbuffer submodule version to the latest release (v23.3.3) (#100716)
The current flatbuffer version uses `--std=c++0x`, which is too old. On my system, one of flatbuffer's dependencies has stopped supporting C++0x, causing a build issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100716
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-05-15 22:28:01 +00:00
9d858642af [PTD] Make input contiguous for _ReduceScatter (#101373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101373
Approved by: https://github.com/wz337
2023-05-15 22:08:21 +00:00
0e811044bd [dynamo 3.11] enable other torch 3.11 dynamo-related tests (#99180)
Notes:
- No segfaults observed in any CI tests: dynamo unittests, inductor unittests, dynamo-wrapped pytorch tests. So we remove the warning that using dynamo 3.11 may result in segfaults.
- Fixed a weakreflist copying bug that caused a few dynamo-wrapped tests to hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99180
Approved by: https://github.com/malfet, https://github.com/TamirFriedman-RecoLabs
2023-05-15 22:06:28 +00:00
3b7c6b21d7 Disable locality reordering in training (#101423)
Differential Revision: [D45874682](https://our.internmc.facebook.com/intern/diff/D45874682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101423
Approved by: https://github.com/ngimel
2023-05-15 21:34:49 +00:00
70ef0bb45a Fix checkpoint doc small formatting issue (#101419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101419
Approved by: https://github.com/albanD
2023-05-15 21:33:56 +00:00
af841f38bd [SPMD] Allow Override.replacement to have a global view (#101427)
It's easier for users to implement one Override that takes care of
all target submodules of different types, instead of specifying one
mapping pair for each FQN/type. For example, when calculating
sharding for sparse layers, the decision needs to be made globally.
In this case, it's helpful to allow a user Override to get access to
all submodules and make replacement decisions accordingly.

Differential Revision: [D45879732](https://our.internmc.facebook.com/intern/diff/D45879732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101427
Approved by: https://github.com/fegin
2023-05-15 21:27:41 +00:00
9ffad5b62b Remove input tracker from runtime assertion pass (#100955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100955
Approved by: https://github.com/ydwu4
2023-05-15 21:26:47 +00:00
ts
563d8058f4 Fix inconsistent torch.nn.MaxPool1d output on cpu and gpu (#99843)
Fixes #99412, correctly raising an error when an output of invalid size is calculated.

Would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99843
Approved by: https://github.com/mikaylagawarecki
2023-05-15 20:27:43 +00:00
9eb1748b2b [pt2] add meta and SymInt support for linalg_lu (#101372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101372
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-05-15 20:25:00 +00:00
ac4cc63ae2 [pt2] add meta for linalg_ldl_solve (#101367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101367
Approved by: https://github.com/lezcano
2023-05-15 20:25:00 +00:00
0a7ea9627f match sdpa patterns from HF (#100609)
Adds sdpa patterns seen in HF models.

To actually make the patterns match, we need constant folding to remove addition of all-zeros mask, and figure out what to do with low mem dropout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100609
Approved by: https://github.com/jansel
2023-05-15 20:01:58 +00:00
9842d1ef94 [fx] Better replacements finder in subgraph rewriter (#100556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100556
Approved by: https://github.com/mcr229
2023-05-15 20:00:59 +00:00
7912b34789 Revert "[CustomOp] Add Dispatcher error callback (#101015)"
This reverts commit c0e5d7e7fee31c332f1cf3d3e4d2305cc1d07bba.

Reverted https://github.com/pytorch/pytorch/pull/101015 on behalf of https://github.com/huydhn due to Revert this as the earlier commits in the stack have been reverted ([comment](https://github.com/pytorch/pytorch/pull/101015#issuecomment-1548476583))
2023-05-15 19:49:53 +00:00
4b9bc6f2a6 extend serialization for tensor metadata (#99808)
Fixes #ISSUE_NUMBER
Adds the serialization logic for backend metadata to tensor serialization, implemented through custom registration functions.

In #97429, the backendMeta structure was added to TensorImpl, and we think this information may also need to be serialized for custom backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang
2023-05-15 19:45:34 +00:00
3b82298265 [caffe2/torchgen] Fix codegen non-determinism (#101286)
Summary:
Fix several cases of leaking set-iteration-order to generated sources, causing non-determinism in generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101286
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-15 18:45:19 +00:00
349a2b3871 Revert "Cleanup custom op library after each custom_op test (#100980)"
This reverts commit d0d81652306bcf88f804a11a0061bcee847c6e5d.

Reverted https://github.com/pytorch/pytorch/pull/100980 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100980#issuecomment-1548336634))
2023-05-15 18:17:42 +00:00
wgb
22b9bef3d0 Add device extensions to the test framework for supporting custom device (#99960)
Fixes #ISSUE_NUMBER
Adds a PrivateUse1TestBase in torch/testing/_internal/common_device_type.py to support the custom device extension "privateuse1", and adds a "device_type" parameter to the instantiate_device_type_tests function for plugging in a custom device test base; its default value is None. A hedged usage sketch follows.
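
This sketch is based only on the description above; the exact keyword handling of `device_type` is an assumption, not verified against the PR:

```python
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestMyBackend(TestCase):
    def test_add(self, device):
        pass

# default device_type=None keeps the existing behavior;
# passing the custom device type would pull in the PrivateUse1TestBase
instantiate_device_type_tests(TestMyBackend, globals())
# instantiate_device_type_tests(TestMyBackend, globals(), device_type="privateuse1")

if __name__ == "__main__":
    run_tests()
```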

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99960
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-15 18:16:00 +00:00
b50595702b Revert "[custom_op] Change the python type that maps to ListType in schema (#101190)"
This reverts commit de6470e28e31c24862950ca381d32f910a168dd0.

Reverted https://github.com/pytorch/pytorch/pull/101190 on behalf of https://github.com/jeanschmidt due to preventing the revert of #100980 ([comment](https://github.com/pytorch/pytorch/pull/101190#issuecomment-1548332644))
2023-05-15 18:15:08 +00:00
ee40cce475 [AOTAutograd] add export entrypoints (#100587)
The main addition in this PR is two new API's in AOTAutograd.

**APIs**

`aot_export_module`: Given a module, exports it into a functionalized FX graph. Returns an `fx.GraphModule`, `GraphSignature` pair. The `GraphSignature` tells you various information about the graph, such as which graph inputs correspond to module params/buffers (and their fqn's), how to pytree-ify the inputs and the outputs of the graph. If you specify `trace_joint=True`, then you'll get back a joint forward-backward graph, that also returns parameter gradients in addition to the user outputs.

There are several restrictions on this API, detailed in the comments. The most notable one is probably that this API does not handle partial graphs: If you want a backward graph, then your module's forward function is **required** to return a scalar loss that we can backprop through. It also does not support capturing the optimizer step.

I (gratefully) used @SherlockNoMad and @suo's internal version of the `GraphSignature` object for this API, with a few minor changes in order to integrate it into AOTAutograd.
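
A hedged usage sketch of `aot_export_module` as described above; the import path and argument spelling are assumptions, not guaranteed by this PR:

```python
import torch
from torch._functorch.aot_autograd import aot_export_module  # import path is an assumption

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x).sum()  # a scalar loss, as required when trace_joint=True

# inference export: a functionalized forward graph plus a GraphSignature describing
# which inputs are params/buffers and how inputs/outputs are pytree-ified
gm, sig = aot_export_module(M(), (torch.randn(2, 4),), trace_joint=False)
print(gm.graph)
# passing trace_joint=True instead would return a joint forward-backward graph
# that also outputs parameter gradients
```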

`aot_export_joint_simple`: Given a function, we'll trace it into a joint forward-backward graph and return it. Unlike the above API, the function is **not** required to return a scalar loss. However, this API makes the guarantee that you **do not** need to make any calling convention changes between the original function and the exported one, provided that you do the following:
* If you pass `trace_joint=False`, no work is needed: we'll export a functionalized forward graph with the same set of inputs as the original function
* If you pass `trace_joint=True`, then you will need to manually use the `default_partitioner` or `min_cut_partitioner` from functorch. If you do, and get back a fw and bw graph, then the forward graph will be runnable identically to the original user function.

The main use case for this API is higher order ops: a higher order op like `torch.cond()` can implement its derivative formula by using this API to export a joint graph (for both the true subgraph and the false subgraph), partition it into a fw/bw graph, and run cond on the `true_bw`, `false_bw` subgraphs. cc @zou3519 @Chillee

**Implementation Strategy**

A lot of the work in this PR went in to trying to find a reasonable way to re-use existing AOTAutograd components to expose these API's. Concretely:

* The two new API's are both thin wrappers around `_aot_export_function`: this is a general purpose export API, that just re-uses `create_aot_dispatcher_function`. If we want to add e.g. an export API that includes the optimizer step in the future, we could probably implement it using `_aot_export_function`.
* `aot_export_module` works extra hard to re-use as much of AOTAutograd as possible. For example, when tracing an inference graph, I perform the export under `torch.no_grad()` to make sure we don't accidentally trace out a backwards graph. When exporting a joint graph, I manually `.detach()` all user outputs except the loss, to make sure that we don't accidentally compute gradients for any other user outputs (even if the user forgot to manually detach them).
* A large portion of `aot_export_module` comes from parsing out and creating a `GraphSignature` object. We discussed a few weeks ago that there's potentially a lot more information that we could stuff into this object (see [doc](https://docs.google.com/document/d/1_qzdKew5D1J2Q2GkZ1v5jsczSsIU-Sr0AJiPW7DdGjE/edit?usp=sharing)). For now, I ended up deciding to support the more limited use case of exporting a fwd-bwd full graph, without some of the extra annotations in that doc (for example, if we were to export partial graphs, we would need annotations for saved activations). My thought is that once a more concrete use case comes up that the existing API doesn't satisfy, we can revisit the annotations then.
* I factored out `create_functional_call()` and `create_tree_flattened_fn()` for pytree-flattening and lifting-params-and-buffers, since I also need them in the export code
* I added an `AOTConfig.is_export` flag. The export API re-uses all of the same code paths as the rest of AOTAutograd, but there are a few points where we need to either exit early (and avoid making a runtime epilogue), or add extra error checking, that is only valuable for export.
* `aot_dispatch_autograd()` now exits early if it's being called in an export context, so it returns the full graph instead of also trying to create an `autograd.Function`. I think we probably want to factor this out, although I figured it would be safer to wait a bit for clarity on how functional RNG works with export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100587
Approved by: https://github.com/ezyang, https://github.com/SherlockNoMad
2023-05-15 18:08:11 +00:00
bba12a4668 aot_autograd: factor out runtime epilogue from aot_dispatch_base (#100586)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100586
Approved by: https://github.com/ezyang
2023-05-15 18:08:11 +00:00
a4830bd86b fix sign return type (#101346)
Fixes #101216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101346
Approved by: https://github.com/eellison, https://github.com/jansel
2023-05-15 17:50:36 +00:00
d0db7d624d Revert "[dynamo] Activation checkpointing as higher order op (#101028)"
This reverts commit de15e740a1f1cf0f267bb77ef851522ce2ab4674.

Reverted https://github.com/pytorch/pytorch/pull/101028 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/101028#issuecomment-1548280970))
2023-05-15 17:47:08 +00:00
13383f45c5 Revert "[c10d] Bridge c10d and gloo stores. (#100384)"
This reverts commit 74b2c04aa1a127fdaf06282bf6534017b619be66.

Reverted https://github.com/pytorch/pytorch/pull/100384 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100384#issuecomment-1548279946))
2023-05-15 17:44:54 +00:00
2341bd69e9 Revert "[caffe2/tools/autograd] Fix non-determinism in code gen (#101287)"
This reverts commit 52f526cfc0092978ebe6d7be8ae2e71a6d989254.

Reverted https://github.com/pytorch/pytorch/pull/101287 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/101287#issuecomment-1548273201))
2023-05-15 17:39:14 +00:00
3e1c8168f8 Add pattern to merge/simplify split-cat (#100713)
Summary:
In simple cases, both the split and cat nodes can be removed in a "split->cat" pattern (a minimal illustration of that base case follows below). However, there are various cases where they can't simply be removed and we need to simplify the split / add transforms before the cat. Some such cases are:
* Split-dim != cat-dim (but equal split)
* Final node: cat vs stack
* Final node has additional args
* Shuffling of args between split/cat
* Some final nodes are non-(cat/stack)

For more details, please refer to https://docs.google.com/presentation/d/1SxBuY_FZfljSlX6i8slRNgP2CsUCICP0o4qe8cNNX8U/edit#slide=id.g232e9a90f64_0_273 (slides 8-15)
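
To make the base case concrete, here is the simplest shape of the pattern, where a split followed by a cat along the same dim with the pieces in their original order round-trips to the input (illustrative example, not code from the pass):

```python
import torch

x = torch.randn(4, 6)
pieces = torch.split(x, 2, dim=1)   # equal split along dim 1
y = torch.cat(pieces, dim=1)        # cat along the same dim, same order
assert torch.equal(x, y)            # split -> cat is a no-op, so both nodes can be removed
```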

Differential Revision: D45452404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100713
Approved by: https://github.com/jansel
2023-05-15 17:37:21 +00:00
721b144f0f [MPS] Add support for Custom Kernels (#100661)
- This change introduces these APIs to enable developing custom kernels on the MPS Stream:
`torch::mps::get_command_buffer()`
`torch::mps::get_dispatch_queue()`
`torch::mps::commit()`
- Add ObjC test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100661
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-15 17:02:33 +00:00
f48718f749 Update torchbench pin (#101365)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101365
Approved by: https://github.com/albanD, https://github.com/awgu
2023-05-15 16:52:31 +00:00
e35323d6a7 [Profiler] Fix HTML plot output for profiler export_memory_timeline (#101316)
Summary: Wrap the PNG image of the memory plot inside of an HTML body, so that the file can be easily opened or embedded in other frontends.
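
A hedged usage sketch (the method name comes from the title; the profiling flags actually required may differ by version):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], profile_memory=True,
             record_shapes=True, with_stack=True) as prof:
    torch.randn(1024, 1024) @ torch.randn(1024, 1024)

# writes the memory plot as a PNG wrapped in an HTML body, per this change
prof.export_memory_timeline("memory_timeline.html")
```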

Test Plan:
CI Tests

# Ran locally on Resnet50:
{F988498243}
{F988498789}
https://www.internalfb.com/manifold/explorer/trace_stats/tree/749163530321413/tmpj3ifzs7r.pt.memorytl.html

Differential Revision: D45827509

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101316
Approved by: https://github.com/xuzhao9
2023-05-15 16:31:06 +00:00
a8ea4178ab Fixed bug in interpolate when interpolation size is larger than max (#101403)
## Description

This is a bug fix for rare cases that can happen with a specific scale and antialias=False, where the output for a particular line can be wrong. For example:
```
line 14
output uint8: [76, 78, 80, 81, 83, 85, 87, 88, 90]
expected float: [149, 152, 155, 158, 161, 164, 167, 170, 173]
diff: [-73, -74, -75, -77, -78, -79, -80, -82, -83]
opencv ref: [149 152 155 158 161 164 167 170 173]
```

It appears that for this line we have 3 weight coefficients instead of 2:
```
line 13 | 351, 2
k: 1130 15254
line 14 | 378, 3
k: 0 16384 -6780            <-------  We should have 2 weights and not 3
line 15 | 432, 2
k: 15254 1130
```
which comes from our `_compute_weights_aa` function that is specifically used for AA=False and uint8.
```
    xmin = std::max(
        static_cast<int64_t>(center - support + 0.5 + align_corners_delta), static_cast<int64_t>(0));
    xsize = std::min(
        static_cast<int64_t>(center + support + 0.5 + align_corners_delta), input_size) - xmin;
```
```
center - support + 0.5 + align_corners_delta: 14.999999999999998
static_cast<int64_t>(center - support + 0.5 + align_corners_delta): 14
xmin -> 14

center + support + 0.5 + align_corners_delta: 17.0
static_cast<int64_t>(center + support + 0.5 + align_corners_delta): 17.0
xsize -> 17 - 14 = 3  <------ 3 instead of 2
```

For the float dtype, AA=False weights and indices are computed differently because that code path was historically implemented first.

In any case, `xsize` should not be larger than `max_interp_size`, so we decided to clip `xsize`.

Once fixed computed indices and weights are same as for float dtype code path:
```
# Option: xsize = min(xsize, max_interp_size)
Line Num | xmin, xsize

14 | 378, 2                 xmin=378 <---> xmin = i * stride = i * 3 * 9 => i = 14
k: 0 16384                  16384 = w * (1 << 14) => w = 1.0

=> i=14, w=0 and i=15, w=1
```
vs
```
Line Num | index0, index1
F32: 14 | 15, 16
F32: lambda0, lambda1: 0.999999, 9.53674e-07
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101403
Approved by: https://github.com/NicolasHug
2023-05-15 15:55:42 +00:00
cyy
a94135641c Fix some NVCC warnings (Part 2) (#101383)
PR #95568 enables more NVCC warnings. However, some .cu files need to be modified to make the build process more warning-free.  PR #100823 already contains some fixes. This PR aims to fix the remaining ones without breaking the codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101383
Approved by: https://github.com/zou3519
2023-05-15 15:48:45 +00:00
effe1425dd ASAN: fix use-after-free (#101400)
arguments() returns the vector member of the object returned by the schema() call.
When the object returned by schema() is destroyed, the vector is deallocated as well;
its lifetime isn't extended.

This issue detected while running `pytest -v test/mobile/test_lite_script_type.py -k test_nest_typing_namedtuple_custom_classtype` with ASAN.

<details>
<summary>ASAN output</summary>

```
==1134126==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0005a5790 at pc 0x03ff844488d8 bp 0x03fff584afe8 sp 0x03fff584afd8
READ of size 8 at 0x60d0005a5790 thread T0
    #0 0x3ff844488d7 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&) /usr/lib/gcc/s390x-i
bm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028
    #1 0x3ff8444293f in std::vector<c10::Argument, std::allocator<c10::Argument> >::begin() const /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_vector.h:821
    #2 0x3ff84d807d1 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:617
    #3 0x3ff84d80305 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
    #4 0x3ff84856871 in pybind11::detail::type_caster<c10::IValue, void>::cast(c10::IValue, pybind11::return_value_policy, pybind11::handle) /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
    #5 0x3ff85318191 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const /home/user/pytorch/cmake/../third_party/pybin
d11/include/pybind11/pybind11.h:249
    #6 0x3ff85317cfd in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) /home/user/pytorch/cmake/../third_party/pybind11/incl
ude/pybind11/pybind11.h:224
    #7 0x3ff82ee52e9 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
    #8 0x3ffab002903 in cfunction_call Objects/methodobject.c:543
    #9 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #10 0x3ffaaf8e919 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #11 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #12 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #13 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #14 0x3ffab105447 in call_function Python/ceval.c:5891
    #15 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #16 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #17 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #18 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #19 0x3ffaaf8a615 in _PyObject_FastCallDictTstate Objects/call.c:142
    #20 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #21 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #22 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #23 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #24 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #25 0x3ffab105447 in call_function Python/ceval.c:5891
    #26 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #27 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #28 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #29 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #30 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #31 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #32 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #33 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #34 0x3ffab105447 in call_function Python/ceval.c:5891
    #35 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #36 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #37 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #38 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #39 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #40 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #41 0x3ffab105447 in call_function Python/ceval.c:5891
    #42 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #43 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #44 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #45 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #46 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #47 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #48 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #49 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #50 0x3ffab105447 in call_function Python/ceval.c:5891
    #51 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #52 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #53 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #54 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #55 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #56 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #57 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #58 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #59 0x3ffab105447 in call_function Python/ceval.c:5891
    #60 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #61 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #62 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #63 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #64 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #65 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #66 0x3ffaaf8ab9b in PyVectorcall_Call Objects/call.c:267
    #67 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #68 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #69 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #70 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #71 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #72 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #73 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #74 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #75 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #76 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #77 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #78 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #79 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #80 0x3ffab105447 in call_function Python/ceval.c:5891
    #81 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #82 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #83 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #84 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #85 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #86 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #87 0x3ffab105447 in call_function Python/ceval.c:5891
    #88 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #89 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #90 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #91 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #92 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #93 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #94 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #95 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #96 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #97 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #98 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #99 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #100 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #101 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #102 0x3ffab105447 in call_function Python/ceval.c:5891
    #103 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #104 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #105 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #106 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #107 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #108 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #109 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #110 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #111 0x3ffab105447 in call_function Python/ceval.c:5891
    #112 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #113 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #114 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #115 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #116 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #117 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #118 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #119 0x3ffaaf8ad17 in _PyObject_Call Objects/call.c:305
    #120 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #121 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #122 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #123 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #124 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #125 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #126 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #127 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #128 0x3ffab105447 in call_function Python/ceval.c:5891
    #129 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #130 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #131 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #132 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #133 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #134 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #135 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #136 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #137 0x3ffab105447 in call_function Python/ceval.c:5891
    #138 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #139 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #140 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #141 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #142 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #143 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #144 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #145 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #146 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #147 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #148 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #149 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #150 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #151 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #152 0x3ffab105447 in call_function Python/ceval.c:5891
    #153 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #154 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #155 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #156 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #157 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #158 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #159 0x3ffab105447 in call_function Python/ceval.c:5891
    #160 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #161 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #162 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #163 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #164 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #165 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #166 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #167 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #168 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #169 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #170 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #171 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #172 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #173 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #174 0x3ffab105447 in call_function Python/ceval.c:5891
    #175 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #176 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #177 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #178 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #179 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #180 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #181 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #182 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #183 0x3ffab105447 in call_function Python/ceval.c:5891
    #184 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #185 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #186 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #187 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #188 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #189 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #190 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #191 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #192 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #193 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #194 0x3ffab105447 in call_function Python/ceval.c:5891
    #195 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #196 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #197 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #198 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #199 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #200 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #201 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #202 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #203 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #204 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #205 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #206 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #207 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #208 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #209 0x3ffab105447 in call_function Python/ceval.c:5891
    #210 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #211 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #212 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #213 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #214 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #215 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #216 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #217 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #218 0x3ffab105447 in call_function Python/ceval.c:5891
    #219 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #220 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #221 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #222 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #223 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #224 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #225 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #226 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #227 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #228 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #229 0x3ffab105447 in call_function Python/ceval.c:5891
    #230 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #231 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #232 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #233 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #234 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #235 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #236 0x3ffab105447 in call_function Python/ceval.c:5891
    #237 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #238 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #239 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #240 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #241 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #242 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #243 0x3ffab105447 in call_function Python/ceval.c:5891
    #244 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #245 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #246 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #247 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #248 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #249 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290

0x60d0005a5790 is located 80 bytes inside of 136-byte region [0x60d0005a5740,0x60d0005a57c8)
freed by thread T0 here:
    #0 0x3ffab537de5 in operator delete(void*) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
    #1 0x3ff55984fdb in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate(std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>*, unsigned long) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145

previously allocated by thread T0 here:
    #0 0x3ffab53734f in operator new(unsigned long) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x3ff5598443f in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
    #2 0x3fff5849ecf  ([stack]+0xb2ecf)

SUMMARY: AddressSanitizer: heap-use-after-free /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&)
Shadow bytes around the buggy address:
  0x100c1a000b4aa0: fd fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa
  0x100c1a000b4ab0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x100c1a000b4ac0: fd fd fd fd fd fa fa fa fa fa fa fa fa fa fd fd
  0x100c1a000b4ad0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa
  0x100c1a000b4ae0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x100c1a000b4af0: fd fd[fd]fd fd fd fd fd fd fa fa fa fa fa fa fa
  0x100c1a000b4b00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1134126==ABORTING
```

Additional backtraces (not full):
Allocation:
```
#0  __memset_z196 () at ../sysdeps/s390/memset-z900.S:144
#1  0x000003ff96f3072a in __asan::Allocator::Allocate (this=this@entry=0x3ff97041eb8 <__asan::instance>, size=size@entry=136, alignment=8, alignment@entry=0, stack=<optimized out>,
    stack@entry=0x3ffdbb45d78, alloc_type=<optimized out>, can_fill=true) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:599
#2  0x000003ff96f2c088 in __asan::asan_memalign (alignment=alignment@entry=0, size=size@entry=136, stack=stack@entry=0x3ffdbb45d78, alloc_type=alloc_type@entry=__asan::FROM_NEW)
    at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:1039
#3  0x000003ff96fb73b0 in operator new (size=136) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
#4  0x000003ff41404440 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate (this=0x3ffdbb468c0,
    __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
#5  0x000003ff414042a0 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::allocate (__a=...,
    __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:464
#6  0x000003ff41403b66 in std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > > (__a=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:98
#7  0x000003ff4140372a in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47888, __p=@0x3ffdbb47880: 0x0, __a=..., __args=..., __args=..., __args=..., __args=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:648
#8  0x000003ff41403328 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1342
#9  0x000003ff41402f06 in std::shared_ptr<c10::FunctionSchema>::shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (
    this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:409
#10 0x000003ff41402b6e in std::allocate_shared<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__a=...,
    __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:862
#11 0x000003ff4140215c in std::make_shared<c10::FunctionSchema, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__args=..., __args=..., __args=..., __args=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:878
#12 0x000003ff413d180c in c10::TupleType::createWithSpec<c10::basic_string_view<char> > (qualName=..., field_names=std::vector of length 1, capacity 1 = {...},
    field_types=std::vector of length 1, capacity 1 = {...}, field_defaults=std::vector of length 0, capacity 0) at /home/user/pytorch/aten/src/ATen/core/type.cpp:769
#13 0x000003ff413b9ca6 in c10::TupleType::createNamed (qualName=..., field_names=std::vector of length 1, capacity 1 = {...}, field_types=std::vector of length 1, capacity 1 = {...})
    at /home/user/pytorch/aten/src/ATen/core/type.cpp:725
#14 0x000003ff4115fbac in c10::ivalue::TupleTypeFactory<c10::TupleType>::fallback (type=...) at /home/user/pytorch/aten/src/ATen/core/dynamic_type.cpp:383
#15 0x000003ff708217fe in c10::ivalue::Tuple::type<c10::TupleType> (this=0x6080004b8520) at /home/user/pytorch/aten/src/ATen/core/ivalue_inl.h:781
#16 0x000003ff70800740 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff70800306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
#18 0x000003ff702d6872 in pybind11::detail::type_caster<c10::IValue, void>::cast (src=...) at /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
#19 0x000003ff70d98192 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const (this=0x3ffdbb4ca20, call=...)
    at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:249
#20 0x000003ff70d97cfe in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) (call=...)
    at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:224
#21 0x000003ff6e9652ea in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x3ff83e27720>,
    args_in=(<torch._C.LiteScriptModule at remote 0x3ff811844b0>, (<Tensor at remote 0x3ff814efb00>,)), kwargs_in=0x0) at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
```

Deallocation:
```
#0  operator delete (ptr=0x60d0005a5740) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
#1  0x000003ff44904fdc in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=0x3ffc5dc8020,
    __p=0x60d0005a5740, __t=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145
#2  0x000003ff44904fa8 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::deallocate (
    __a=..., __p=0x60d0005a5740, __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:496
#3  0x000003ff449041f2 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::~__allocated_ptr (
    this=0x3ffc5dc8030) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:74
#4  0x000003ff44904888 in std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>::_M_destroy (this=0x60d0005a5740)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:538
#5  0x000003ff43895a62 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x60d0005a5740) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:184
#6  0x000003ff43895420 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x611000c40648) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#7  0x000003ff4466e7f4 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x611000c40640)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#8  0x000003ff4466d820 in std::shared_ptr<c10::FunctionSchema>::~shared_ptr (this=0x611000c40640) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#9  0x000003ff448d82f6 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#10 0x000003ff448d8346 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#11 0x000003ff731296a4 in std::_Sp_counted_ptr<c10::TupleType*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x603000c43ae0)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:348
#12 0x000003ff71eaf666 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x603000c43ae0) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:168
#13 0x000003ff71eaf330 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x3ffc5dc9368) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#14 0x000003ff73129ee4 in std::__shared_ptr<c10::TupleType, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x3ffc5dc9360)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#15 0x000003ff73122390 in std::shared_ptr<c10::TupleType>::~shared_ptr (this=0x3ffc5dc9360) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#16 0x000003ff73d00788 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff73d00306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101400
Approved by: https://github.com/zou3519
2023-05-15 15:32:10 +00:00
66eef31444 Revert "[fx] change from #users to num_users in graph printout (#101140)"
This reverts commit e568c5a18d0fb390437912e13c96e29697af27ec.

Reverted https://github.com/pytorch/pytorch/pull/101140 on behalf of https://github.com/jeanschmidt due to There are internal changes to this commit that are preventing landing, so I am reverting to unblock the diff train ([comment](https://github.com/pytorch/pytorch/pull/101140#issuecomment-1547989487))
2023-05-15 14:35:22 +00:00
616208b4fe [BE]: Cleanup deprecated stdlib imports (UP006,UP035) (#101361)
Automated fix to clean up some deprecated/useless Python imports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101361
Approved by: https://github.com/zou3519
2023-05-15 14:32:41 +00:00
1b7d875083 put third_party/ittapi/ in .bazelignore (#101364)
put third_party/ittapi/ in .bazelignore

Summary:
Bazel fails when recursing into this directory because it has a
symlink that infinitely recurses. We don't use this library in Bazel,
so it's safe to just ignore its existence.

Test Plan: Verified with `bazel query //...`

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/101364).
* #101406
* #101405
* __->__ #101364

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101364
Approved by: https://github.com/zou3519, https://github.com/malfet
2023-05-15 14:12:40 +00:00
cca31f1797 Revert "implement a function to convert a storage to copy-on-write (#100819)"
This reverts commit aec11b8c802617d87f54e7c2c0ffe96e33657b2c.

Reverted https://github.com/pytorch/pytorch/pull/100819 on behalf of https://github.com/jeanschmidt due to added tests are breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100819#issuecomment-1547929531))
2023-05-15 14:10:23 +00:00
bfb2888b51 Re enable AutogradNotImplementedFallback on Windows (#101062)
Fixes #48763
Due to #48763, the AutogradNotImplementedFallback tests were disabled. I re-enabled them and CI passes successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101062
Approved by: https://github.com/zou3519
2023-05-15 13:41:06 +00:00
9b6ccde0e6 fix precision error in constraint solver (#101307)
When adding guards to the constraint solver, we check that they are consistent, i.e., they do not simplify to false when their free symbols are substituted with the corresponding concrete values.

However this check may "spuriously" fail because it doesn't take into account precision errors when comparing floats. Since the symbols involved are all positive integers, we try to approximate floats in the guards with rationals, providing concrete values as hints: `sympy.nsimplify` does the job.
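
A toy illustration of the failure and the rational approximation (not the solver's actual code; the guard below is made up):

```python
import sympy

s = sympy.Symbol("s", positive=True, integer=True)

# A guard that should be exactly s/3 - 1 may arrive with a float coefficient.
guard = 0.3333333333333333 * s - 1

# Substituting the concrete value s = 3 does not give exactly 0, so a naive
# consistency check on the guard fails "spuriously".
print(guard.subs(s, 3))          # a tiny non-zero float, not 0

# Approximating the float with a nearby rational repairs the check.
exact = sympy.nsimplify(guard, rational=True)
print(exact)                     # expected: s/3 - 1
print(exact.subs(s, 3))          # 0
```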

As an alternative approach, we considered using `sympy.evalf` to compare with reduced precision. But we did not pursue it because
* the choice of what is a good reduced precision feels arbitrary (`sympy` uses `1e15` by default);
* more importantly, there is no guarantee that we will not encounter the same problem when solving downstream.

Differential Revision: [D45826951](https://our.internmc.facebook.com/intern/diff/D45826951/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101307
Approved by: https://github.com/ezyang
2023-05-15 11:03:24 +00:00
87f9160b67 Revert "[inductor] fix incorrect strides in copy() decomp, fix hf_LongFormer + hf_BigBird errors (#100115)"
This reverts commit 4c8ee583c3af7ee6bf21ac7908a5f81455dc96e5.

Reverted https://github.com/pytorch/pytorch/pull/100115 on behalf of https://github.com/jeanschmidt due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/100115#issuecomment-1547417287))
2023-05-15 08:31:58 +00:00
7dd8e08817 [pt2] add meta for linalg_ldl_factor_ex (#101362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101362
Approved by: https://github.com/lezcano
2023-05-15 02:56:49 +00:00
a8964d6377 [pt2] add meta and SymInt support for linalg_householder_product (#101315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101315
Approved by: https://github.com/lezcano
2023-05-15 02:56:49 +00:00
cc54da4877 Inductor cpp wrapper: fix FallbackKernel support (#100788)
Fixes cpp wrapper support for kernels that are not exposed in `torch.ops.aten`. The current PR limits the support scope to `repeat_interleave.Tensor`; follow-up PRs will be submitted for more ops.

The PR maps the python schema of the kernel to the cpp schema and uses `c10::Dispatcher::singleton().findSchemaOrThrow` to find the corresponding cpp OP.

The current support is limited and will raise `AssertionError` for unsupported cases.
The limitations include:
- only supports kernels that are not aliases
- only supports kernels whose args and returns don't carry `alias_info`
- only supports output args that are a `Tensor`
- only supports input args that are `Tensor`, `Optional[int]`, `Optional[float]` or `Optional[bool]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100788
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-05-15 00:45:44 +00:00
72908e768e Fix Math Typesetting for torch.linalg.matrix_exp (#101363)
Fixes the current matrix_exp documentation typesetting, which has an unescaped underscore.

It currently looks like this

<img width="540" alt="image" src="https://github.com/pytorch/pytorch/assets/3844846/cbff79c3-8c1a-4003-bee3-c4c97ae0e3a0">

With the fix, it looks like this
<img width="555" alt="image" src="https://github.com/pytorch/pytorch/assets/3844846/a24d9a3f-bbbd-4685-9244-2bc06872b966">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101363
Approved by: https://github.com/lezcano
2023-05-15 00:31:12 +00:00
fcf2fb273c Make missing model import error marginally better (#101221)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101221
Approved by: https://github.com/albanD, https://github.com/anijain2305
2023-05-14 19:57:01 +00:00
96487d0d1f Refactor after_dynamo to have a CLI interface too. (#101220)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101220
Approved by: https://github.com/anijain2305
2023-05-14 19:03:16 +00:00
9ba64cba55 Fix torch.utils._traceback on Python 3.11 (#101277)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101277
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-05-14 19:03:16 +00:00
dfe484a3b3 [BE]: Bugfix functorch and some generic typing improvements (#101337)
Fixes some typing bugs found with newer versions of mypy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101337
Approved by: https://github.com/ezyang
2023-05-14 14:20:56 +00:00
65412f95f0 [dynamo] Graph break on ops having inplace_view tag (#100787)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100787
Approved by: https://github.com/jgong5, https://github.com/eellison, https://github.com/jansel
2023-05-14 11:42:35 +00:00
568db1b464 [dtensor] Relax condition for _split_tensor() (#101218)
When tensor.size(self.dim) < num_chunks, we will fill the missing chunks with empty tensors (https://github.com/pytorch/pytorch/pull/98722). Therefore, we no longer need this assert.

For example, when sharding a tensor with 1 element on 2 ranks along dim 0, results would be as follows:
```
rank:0, dtensor:DTensor(local_tensor=tensor([0.4963], device='cuda:0'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
rank:1, dtensor:DTensor(local_tensor=tensor([], device='cuda:1'), device_mesh=DeviceMesh:([0, 1]), placements=[Shard(dim=0)])
```
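
A simplified 1-D sketch of the relaxed splitting behaviour (the helper below is hypothetical; the actual logic lives in DTensor's `_split_tensor()`):

```python
import torch

def split_with_padding(tensor, num_chunks, dim=0):
    # When the dimension has fewer elements than num_chunks, pad the result
    # with 0-size tensors instead of asserting.
    chunks = list(torch.chunk(tensor, num_chunks, dim=dim))
    chunks += [tensor.new_empty(0)] * (num_chunks - len(chunks))
    return chunks

print(split_with_padding(torch.rand(1), num_chunks=2))
# e.g. [tensor([0.4963]), tensor([])]
```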
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101218
Approved by: https://github.com/wanchaol
2023-05-14 07:39:27 +00:00
674e52b0b9 [vision hash update] update the pinned vision hash (#101347)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101347
Approved by: https://github.com/pytorchbot
2023-05-14 03:08:55 +00:00
8876c0b282 [transformer benchmark] fix in sdp_bwd for scaled_dot_product_attention return type (#101341)
Summary:
Otherwise we get
```
Traceback (most recent call last):
  File "<string>", line 49, in <module>
  File "<string>", line 47, in __run
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/fbcode/platform010/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 188, in <module>
    main()
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 184, in main
    run_timing(min_run_time, batch_size, embed_dim, num_heads, max_seq_len, dtype)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 105, in run_timing
    rand_fused_upward = cpt(x, x, x, mask).clone().detach()
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1502, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/torch/nn/modules/module.py", line 1511, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/users/jongsoo/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/caffe2/benchmarks/transformer/__sdp_backwards__/sdp_backwards#link-tree/caffe2/benchmarks/transformer/sdp_backwards.py", line 39, in forward
    attn, _ = torch.nn.functional.scaled_dot_product_attention(
ValueError: too many values to unpack (expected 2)
```
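
The underlying API behaviour: `scaled_dot_product_attention` returns a single tensor, not an `(output, weights)` pair, so the benchmark must stop unpacking two values. A minimal sketch:

```python
import torch
import torch.nn.functional as F

q = k = v = torch.rand(2, 8, 16, 64)

# attn, _ = F.scaled_dot_product_attention(q, k, v)  # ValueError: too many values to unpack
attn = F.scaled_dot_product_attention(q, k, v)        # returns just the attention output
print(attn.shape)                                     # torch.Size([2, 8, 16, 64])
```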

Test Plan: buck run mode/dev-nosan //caffe2/benchmarks/transformer:sdp_backwards

Differential Revision: D45843838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101341
Approved by: https://github.com/drisspg
2023-05-14 01:34:51 +00:00
2361f7f0ce Update doc strings to make description of is_causal consistent for nn.Transformer and nn.MHA (#101089)
Summary: Update doc strings to make description of is_causal consistent for nn.Transformer and nn.MHA

Test Plan: sandcastle & github CI/CD

Differential Revision: D45737197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101089
Approved by: https://github.com/mikaylagawarecki
2023-05-13 18:14:38 +00:00
f6c2859ee3 Print the path to the code with TORCH_LOGS=output_code (#99038)
This is quite useful to play around with the code once it's been generated.
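
A hedged usage sketch (the script below is made up, and exactly what the log contains depends on the Inductor version):

```python
# Run with:  TORCH_LOGS=output_code python this_script.py
# The output_code log should now also include the path of the generated file,
# so the kernel source can be opened and edited directly.
import torch

@torch.compile
def f(x):
    return torch.sin(x) + torch.cos(x)

f(torch.rand(8))
```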

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99038
Approved by: https://github.com/mlazos
2023-05-13 16:20:57 +00:00
07d3772eff fix typo in comments under torch/distributions/mixture_same_family.py (#101290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101290
Approved by: https://github.com/Skylion007
2023-05-13 14:25:52 +00:00
ab74744522 add inplace_view tag to resize_as_() (#100786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100786
Approved by: https://github.com/jgong5, https://github.com/bdhirsh, https://github.com/eellison
2023-05-13 13:49:14 +00:00
76b72bd80d Rewrite frame state to use a struct for shapes, splitting scalar and size, prep for stride (#101250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101250
Approved by: https://github.com/ezyang
2023-05-13 09:28:32 +00:00
e406125b6b [profiler] replace record_concrete_inputs_enabled interface with callback instead of boolean (#101292)
Summary: This allows an internal use case to register a callback that can vary over time instead of being a static value over the lifetime of the program.
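
A generic illustration of the design change, using hypothetical names rather than the actual profiler API: a registered callback is re-evaluated on every query, while a boolean is fixed once set.

```python
from typing import Callable

_record_concrete_inputs_cb: Callable[[], bool] = lambda: True

def set_record_concrete_inputs_cb(cb: Callable[[], bool]) -> None:
    # Instead of storing a static boolean, store a callback...
    global _record_concrete_inputs_cb
    _record_concrete_inputs_cb = cb

def record_concrete_inputs_enabled() -> bool:
    # ...and consult it every time, so the answer can vary over the
    # lifetime of the program.
    return _record_concrete_inputs_cb()
```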

Test Plan: ran the test listed above ^^.

Differential Revision: D45805139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101292
Approved by: https://github.com/aaronenyeshi
2023-05-13 05:06:48 +00:00
44fb7fcb83 [vision hash update] update the pinned vision hash (#101323)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101323
Approved by: https://github.com/pytorchbot
2023-05-13 04:56:58 +00:00
387b369ee4 [CI] Fix a dashboard command line string formatting bug (#101325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101325
Approved by: https://github.com/ngimel
2023-05-13 03:00:23 +00:00
4414160453 Factor automatic dynamic into a private helper function (#101114)
Simple cleanup; makes it easier to disable automatic dynamic for nested tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101114
Approved by: https://github.com/ezyang
2023-05-13 02:34:46 +00:00
9e089db32e [MPS] Enable arange for int8 and uint8 dtypes (#101303)
Not sure why it was not enabled previously.
Sort types in `AT_DISPATCH_MPS_TYPES` by group (floats first, then integers) and size.
Tested implicitly in `test_bernoulli`.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 80c7ed7</samp>

> _`Char` and `Byte` types_
> _MPS can dispatch them now_
> _Winter of tensors_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101303
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/atalman, https://github.com/kulinseth
2023-05-13 01:19:08 +00:00
ceecccc09e Bugfix: Correctly detect test changes in PRs (#101304)
Fixes a bug where the logic for deciding what tests have been edited by a PR would include all files that had been edited since the merge base, including files that were in main!

Now it will only consider the files that are part of the PR itself
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101304
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-05-13 00:59:41 +00:00
d75f93603a Flatten exceptions in dynamo (#100779)
Fixes https://github.com/pytorch/pytorch/issues/93571

[before and after](https://gist.github.com/mlazos/256b0e8f0f98495752a22b960e9f4fcb)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100779
Approved by: https://github.com/ezyang
2023-05-13 00:58:57 +00:00
cc0a271935 [GHF] Use baseRefOid to get PRs merge base (#101232)
In general, the `GitHubRepo` class must not contain any methods that incur local IO, nor modify the git repo checkout it's invoked from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101232
Approved by: https://github.com/kit1980
2023-05-13 00:27:10 +00:00
c498b1ad95 [C10D] Implement extended store api in HashStore. (#100633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100633
Approved by: https://github.com/fduwjj
2023-05-12 23:46:49 +00:00
2a14652879 [CI] Introduce dashboard-tag to pass dashboard run configs (#101320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101320
Approved by: https://github.com/huydhn
2023-05-12 23:26:16 +00:00
bb454891ed [Reland] Add sym_size/stride/numel/storage_offset to native_function.… (#100749)
…yaml (#91… (#91919)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91919 Approved by: https://github.com/ezyang

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92402

Reviewed By: ezyang

Differential Revision: D42565586

Pulled By: SherlockNoMad

fbshipit-source-id: 1c2986e45307e076d239836a1b45441a9fa3c9d9
ghstack-source-id: 969f4928486e04c57aaf98e20e3c3ca946c51613

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100749
Approved by: https://github.com/zhxchen17, https://github.com/albanD
2023-05-12 22:57:42 +00:00
4dbab17edb [c10d] Use macro to deduplicate codes (#101243)
Ops.cpp copies code for each of the three device keys (CPU, CUDA, PrivateUse1).
Use a macro to deduplicate it.
No logic change.
Cc @kumpera @H-Huang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101243
Approved by: https://github.com/H-Huang
2023-05-12 22:12:28 +00:00
0be53d83fc [MPS] Add support for MPSProfiler Python bindings (#101002)
- Added torch.mps.profiler.[start() and stop()] APIs with RST documentation
- Added test case in test_mps
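
A hedged usage sketch of the new bindings (requires a macOS build with MPS available; defaults for the `start()` arguments are assumed):

```python
import torch

if torch.backends.mps.is_available():
    torch.mps.profiler.start()          # begin the MPSProfiler capture
    x = torch.rand(1024, 1024, device="mps")
    y = (x @ x).sum()
    torch.mps.profiler.stop()           # stop the capture
```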
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101002
Approved by: https://github.com/malfet
2023-05-12 21:55:34 +00:00
816400a294 Add branch change name for composite action (#101309)
Mention in the release runbook that the composite workflow should reference the release branch rather than trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101309
Approved by: https://github.com/atalman
2023-05-12 21:50:12 +00:00
47c99e3a1c Update PyTorch docker base image to Ubuntu-20.04 (take 2) (#101310)
Followup after https://github.com/pytorch/pytorch/pull/101301

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 219c58a</samp>

> _`BASE_RUNTIME` changed_
> _Ubuntu twenty oh four_
> _Spring of new features_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101310
Approved by: https://github.com/atalman
2023-05-12 21:46:00 +00:00
5fe834afc1 [inductor] Insert triton barrier before storing to inplace buffers (#100769)
The linked issue demonstrates a triton bug where a load broadcasted
over multiple warps may see the result of a store that happens later
in the triton program. The workaround is to add a barrier before
storing, which enforces that all warps have already read the data.

e.g. in `test_embedding_var_mean` we now generate:
```python
    tl.debug_barrier()
    tl.store(in_out_ptr1 + (tl.broadcast_to(x0, [XBLOCK, 1])), tmp17, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100769
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-05-12 21:37:34 +00:00
05077f2ac3 [PyTorch] Avoid extra refcounting in vector variant of VariableType::unpack (#95835)
Looks like this line was a historical relic of Variable and Tensor not being the same. I spot checked assembly and it's not the same, which already implies this way is better; specifically there are fewer locked refcounting instructions (I believe the type of the expression is `Tensor` not `const Tensor&` and both forks must have the same type). Spotted this with at::cat in an internal workload; the actual fix is to enable InferenceMode but this should reduce the penalty for failing to do that.

Differential Revision: [D43714744](https://our.internmc.facebook.com/intern/diff/D43714744/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95835
Approved by: https://github.com/albanD
2023-05-12 21:21:08 +00:00
6afa9a4a69 [CI] Change dashboard workflow inputs type to boolean (#101308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101308
Approved by: https://github.com/shunting314
2023-05-12 21:14:31 +00:00
a12b640dc9 Fix typos in troubleshooting.rst (#101305)
There are several typos in the troubleshooting documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101305
Approved by: https://github.com/desertfire
2023-05-12 21:05:13 +00:00
6ac0542747 Cpp Reduce LR on plateau scheduler (#100311)
Hello!

Recently I was playing with the LibTorch libraries and noticed that currently there is only one LR scheduler implementation available. I needed a 'reduce on plateau' scheduler, so I implemented it myself. I have used it many times and it seems to work as it should, so I decided to share my implementation here.

If you decide that this is something worth merging, or that it needs tweaking/tests, let me know!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100311
Approved by: https://github.com/albanD
2023-05-12 20:50:48 +00:00
c772d56966 Use 20.04 as base image (#101301)
As 18.04 just EOLed 2 weeks ago.

Fixes https://github.com/pytorch/pytorch/issues/81120

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101301
Approved by: https://github.com/seemethere, https://github.com/atalman
2023-05-12 20:49:04 +00:00
066175d69c [CI] Add workflow_dispatch.inputs to control dashboard runs (#101279)
Summary: This gives developers finer control to specify which set
of configs to measure for their one-off dashboard run. Right now the
queuing for those runs looks pretty bad.

Another change here is reducing the inference measurement frequency to
2 times a week.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101279
Approved by: https://github.com/huydhn
2023-05-12 20:34:50 +00:00
a8c32eb78e [PyTorch] add test for numel slow path affecting data_ptr (#100993)
This test would have stopped #98090 -- data_ptr needs to call custom Python numel if it exists, since it could be arbitrary Python.

Differential Revision: [D45701566](https://our.internmc.facebook.com/intern/diff/D45701566/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100993
Approved by: https://github.com/ezyang
2023-05-12 20:33:39 +00:00
568bac7961 [BE][GHF] Add retries_decorator (#101227)
I've noticed that 3-4 functions in trymerge are trying to implement similar tail recursion for flaky network retries.

Unify them using a single wrapper in `gitutils.py`.
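
An illustrative sketch of such a wrapper (the real helper lives in the GitHub scripts' `gitutils.py`; its exact name and signature may differ):

```python
from functools import wraps
from typing import Any, Callable

def retries_decorator(rc: Any = None, num_retries: int = 3) -> Callable:
    def decorator(f: Callable) -> Callable:
        @wraps(f)
        def wrapper(*args, **kwargs):
            for idx in range(num_retries):
                try:
                    return f(*args, **kwargs)
                except Exception as e:
                    print(f'Attempt {idx} of {num_retries} to call {f.__name__} failed with "{e}"')
            return rc  # give up and return the fallback value
        return wrapper
    return decorator

@retries_decorator(rc=[])
def fetch_flaky_labels() -> list:
    ...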

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 8d40631</samp>

> _`retries_decorator`_
> _adds resilience to GitHub scripts_
> _autumn of errors_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101227
Approved by: https://github.com/kit1980
2023-05-12 20:29:06 +00:00
2fcc2002fa Handle tail 0-size tensor appropriately in MultiTensorApply (#100811)
Fixes #100701

It seems like we don't call `multi_tensor_apply_kernel` at all if the input tensor lists are small and their last tensors are zero-size as per e.g. ca9f55f79d/aten/src/ATen/native/cuda/MultiTensorApply.cuh (L100-L102)
which was introduced in 05943712a4.

This PR special-cases the last zero-size tensors so that the remaining tensors are not silently skipped.
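
A hedged repro sketch on CUDA (the foreach op below is just one way to reach `MultiTensorApply`; the exact set of affected ops may vary):

```python
import torch

# A short tensor list whose last tensor is 0-size: before the fix, the whole
# multi_tensor_apply launch could be skipped, leaving xs[0] unmodified.
xs = [torch.ones(4, device="cuda"), torch.empty(0, device="cuda")]
torch._foreach_add_(xs, 1.0)
print(xs[0])  # expected after the fix: tensor([2., 2., 2., 2.], device='cuda:0')
```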

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100811
Approved by: https://github.com/ngimel
2023-05-12 20:26:45 +00:00
52f526cfc0 [caffe2/tools/autograd] Fix non-determinism in code gen (#101287)
Fix several cases of leaking set-iteration-order to generated sources, causing non-determinism in generated code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101287
Approved by: https://github.com/albanD
2023-05-12 20:23:50 +00:00
630593d3cc [bazel] add python targets (#101003)
This PR adds bazel python targets, so that the bazel build can be used from Python, e.g. `import torch`.

Notable changes:
- Add the python targets.
- Add the version.py.tpl generation.
- In order to achieve `USE_GLOBAL_DEPS = False` just for the bazel build, employ a monkey-patch hack in the mentioned `version.py.tpl`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101003
Approved by: https://github.com/huydhn
2023-05-12 19:44:01 +00:00
4434b9af6a [quant][pt2] Handle constant conv args in prepare QAT fusion (#100525)
Summary: Previously, we would only match and replace conv + BN
patterns with default constant args for conv (stride, padding,
dilation etc.). If the user sets one of these args to values
that are different from the default, we would simply not fuse
the pattern. This is due to a limitation in the subgraph
rewriter: see https://github.com/pytorch/pytorch/issues/100419.

This commit works around the above limitation by first
configuring the subgraph rewriter to ignore literals when
matching, and then manually copy over the constant args to the
new subgraph after `replace_pattern`.
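
For illustration, a hypothetical module of the kind this change now handles (not the actual test model): conv + BN where the conv uses non-default constant args:

```python
import torch

class ConvBN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Non-default stride/padding previously prevented the conv + BN QAT fusion.
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.bn = torch.nn.BatchNorm2d(8)

    def forward(self, x):
        return self.bn(self.conv(x))
```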

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_fusion_constant_args

Reviewers: jerryzh168, kimishpatel

Differential Revision: [D45515437](https://our.internmc.facebook.com/intern/diff/D45515437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100525
Approved by: https://github.com/jerryzh168
2023-05-12 19:15:47 +00:00
3f734c584e Revert "Mark Windows CPU jobs as unstable (#100581)" (#100676)
This reverts commit 478a5ddd8ad51bf54beb9cdfd353187cf8b63d93, putting the Windows jobs back where they came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100676
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-05-12 19:15:11 +00:00
7e333fe502 Fix cuda graphs & sdpa for dropout==0 (#101280)
Fixes cuda graph failures from https://github.com/pytorch/pytorch/pull/100931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101280
Approved by: https://github.com/ngimel
2023-05-12 19:06:45 +00:00
a8ff647e42 Disable conv cache emptying (#101038)
We warm up cudagraph trees in the cudagraph memory pool so that if we are partway through a run and a large majority of memory is already allocated to cudagraphs, we don't try to allocate again in eager mode, which would split the memory pool in half. However, this causes us to fail the following assert due to the `emptyCache` call in cuDNN benchmarking: https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2959.

Disable the empty-cache call during cudagraph warmup to fix the error. Disabling did not have a significant effect on memory:

![image](https://github.com/pytorch/pytorch/assets/11477974/90513a1e-aa77-410c-a32e-2f80b99e673f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101038
Approved by: https://github.com/shunting314, https://github.com/ngimel
2023-05-12 18:49:46 +00:00
5ac48eb353 [FSDP]Skip unshard call during checkpointing for NO_SHARD sharding strategy (#101095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101095
Approved by: https://github.com/fegin
2023-05-12 18:19:18 +00:00
aec11b8c80 implement a function to convert a storage to copy-on-write (#100819)

Summary:
This will be used in the _lazy_clone() operator as well as reshape().

Test Plan: 100% coverage of reachable lines.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100819).
* #100821
* #100820
* __->__ #100819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100819
Approved by: https://github.com/ezyang
2023-05-12 17:45:04 +00:00
f0f700e8d2 ASAN: fix use-after-free (#101064)
When a tensor is resized, the reference array to its sizes may become invalid. Make a copy in advance.

<details>
<summary>ASAN report</summary>

```
=================================================================
==1115867==ERROR: AddressSanitizer: heap-use-after-free on address 0x61000013d790 at pc 0x03ff8e7da360 bp 0x03fff53c83a0 sp 0x03fff53c8390
READ of size 8 at 0x61000013d790 thread T0
    #0 0x3ff8e7da35f in c10::SymInt::is_heap_allocated() const /home/user/pytorch/c10/core/SymInt.h:154
    #1 0x3ff8e7da35f in c10::SymInt::maybe_as_int() const /home/user/pytorch/c10/core/SymInt.h:215
    #2 0x3ff8e7d0a6d in c10::SymInt::sym_eq(c10::SymInt const&) const /home/user/pytorch/c10/core/SymInt.cpp:69
    #3 0x3ff7a9ab0bd in c10::SymInt::operator==(c10::SymInt const&) const /home/user/pytorch/c10/core/SymInt.h:177
    #4 0x3ff7a9aaedd in bool std::__equal<false>::equal<c10::SymInt const*, c10::SymInt const*>(c10::SymInt const*, c10::SymInt const*, c10::SymInt const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-
v11/bits/stl_algobase.h:1162
    #5 0x3ff7a9aae4b in bool std::__equal_aux1<c10::SymInt const*, c10::SymInt const*>(c10::SymInt const*, c10::SymInt const*, c10::SymInt const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/
stl_algobase.h:1211
    #6 0x3ff7a9aae05 in bool std::__equal_aux<c10::SymInt const*, c10::SymInt const*>(c10::SymInt const*, c10::SymInt const*, c10::SymInt const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/s
tl_algobase.h:1219
    #7 0x3ff7a9aad97 in bool std::equal<c10::SymInt const*, c10::SymInt const*>(c10::SymInt const*, c10::SymInt const*, c10::SymInt const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_alg
obase.h:1556
    #8 0x3ff4b23c771 in c10::ArrayRef<c10::SymInt>::equals(c10::ArrayRef<c10::SymInt>) const /home/user/pytorch/c10/util/ArrayRef.h:188
    #9 0x3ff4cb91bc1 in bool c10::operator!=<c10::SymInt>(c10::ArrayRef<c10::SymInt>, c10::ArrayRef<c10::SymInt>) /home/user/pytorch/c10/util/ArrayRef.h:341
    #10 0x3ff6d1b57ff in torch::ADInplaceOrView::resize_(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pytorch/torch/csrc/autograd/Variab
leTypeManual.cpp:408
    #11 0x3ff6d1e59c7 in c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c1
0::MemoryFormat>), &torch::ADInplaceOrView::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>
> >::operator()(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pytorch/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13
    #12 0x3ff6d1e59c7 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10:
:ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>), &torch::ADInplaceOrView::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::Sy
mInt>, c10::optional<c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::Disp
atchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pytorch/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:480
    #13 0x3ff51ca5129 in at::Tensor const& c10::callUnboxedKernelFunction<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(void*, c10::OperatorKernel*,
c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>&&, c10::optional<c10::MemoryFormat>&&) /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:50
    #14 0x3ff51ca6e8f in at::Tensor const& c10::KernelFunction::call<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::OperatorHandle const&, c10::D
ispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:90
    #15 0x3ff51ca6e8f in at::Tensor const& c10::Dispatcher::redispatch<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::TypedOperatorHandle<at::Ten
sor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)> const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)
const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:656
    #16 0x3ff5182006b in c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::redispatch(c10::DispatchKeySet, at::Tensor const&, c
10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:492
    #17 0x3ff5182006b in at::_ops::resize_::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) aten/src/ATen/Operators_4.cpp:2144
    #18 0x3ff6d1d5e07 in at::redispatch::resize__symint(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) aten/src/ATen/RedispatchFunctions.h:2847
    #19 0x3ff6d1bbb67 in torch::autograd::VariableType::(anonymous namespace)::resize_(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pyto
rch/torch/csrc/autograd/VariableTypeManual.cpp:243
    #20 0x3ff6d1bd197 in c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c1
0::MemoryFormat>), &torch::autograd::VariableType::(anonymous namespace)::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10
::optional<c10::MemoryFormat> > >::operator()(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pytorch/aten/src/ATen/core/boxing/impl/WrapFu
nctionIntoFunctor.h:13
    #21 0x3ff6d1bd197 in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10:
:ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>), &torch::autograd::VariableType::(anonymous namespace)::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor
 const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::call(c
10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) /home/user/pytorch/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor
.h:480
    #22 0x3ff51ca5129 in at::Tensor const& c10::callUnboxedKernelFunction<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(void*, c10::OperatorKernel*,
c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>&&, c10::optional<c10::MemoryFormat>&&) /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:50
    #23 0x3ff5181ead1 in at::Tensor const& c10::KernelFunction::call<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::OperatorHandle const&, c10::D
ispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:90
    #24 0x3ff5181ead1 in at::Tensor const& c10::Dispatcher::call<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::TypedOperatorHandle<at::Tensor co
nst& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)> const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const /home/user/pytorch/at
en/src/ATen/core/dispatch/Dispatcher.h:639
    #25 0x3ff5181ead1 in c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>,
c10::optional<c10::MemoryFormat>) const /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:487
    #26 0x3ff5181ead1 in at::_ops::resize_::call(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) aten/src/ATen/Operators_4.cpp:2137
    #27 0x3ff79b44fcf in at::Tensor::resize__symint(c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const aten/src/ATen/core/TensorBody.h:2452
    #28 0x3ff79a802db in torch::autograd::THPVariable_resize_(_object*, _object*, _object*)::$_0::operator()(at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const /home/us
er/pytorch/torch/csrc/autograd/generated/python_variable_methods.cpp:13417
    #29 0x3ff7999f1eb in torch::autograd::THPVariable_resize_(_object*, _object*, _object*) /home/user/pytorch/torch/csrc/autograd/generated/python_variable_methods.cpp:13419
    #30 0x3ffa2c9b009 in method_vectorcall_VARARGS_KEYWORDS Objects/descrobject.c:344
    #31 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #32 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #33 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #34 0x3ffa2dff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #35 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #36 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #37 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #38 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #39 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #40 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #41 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #42 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #43 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #44 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #45 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #46 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #47 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #48 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #49 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #50 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #51 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #52 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #53 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #54 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #55 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #56 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #57 0x3ffa2dff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #58 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #59 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #60 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #61 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #62 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #63 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #64 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #65 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #66 0x3ffa2dff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #67 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #68 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #69 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #70 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #71 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #72 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #73 0x3ffa2dff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #74 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #75 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #76 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #77 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #78 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #79 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #80 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #81 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #82 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #83 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #84 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #85 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #86 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #87 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #88 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #89 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #90 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #91 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #92 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #93 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #94 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #95 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #96 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #97 0x3ffa2c8ab9b in PyVectorcall_Call Objects/call.c:267
    #98 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #99 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #100 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #101 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #102 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #103 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #104 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #105 0x3ffa2c8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #106 0x3ffa2c8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #107 0x3ffa2d3f307 in slot_tp_call Objects/typeobject.c:7494
    #108 0x3ffa2c8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #109 0x3ffa2df0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #110 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #111 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #112 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #113 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #114 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #115 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #116 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #117 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #118 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #119 0x3ffa2dff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #120 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #121 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #122 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #123 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #124 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #125 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #126 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #127 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #128 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #129 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #130 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #131 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #132 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #133 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #134 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #135 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #136 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #137 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #138 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #139 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #140 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #141 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #142 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #143 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #144 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #145 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #146 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #147 0x3ffa2c8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #148 0x3ffa2c8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #149 0x3ffa2d3f307 in slot_tp_call Objects/typeobject.c:7494
    #150 0x3ffa2c8ad17 in _PyObject_Call Objects/call.c:305
    #151 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #152 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #153 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #154 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #155 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #156 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #157 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #158 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #159 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #160 0x3ffa2dff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #161 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #162 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #163 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #164 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #165 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #166 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #167 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #168 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #169 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #170 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #171 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #172 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #173 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #174 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #175 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #176 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #177 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #178 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #179 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #180 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #181 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #182 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #183 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #184 0x3ffa2dff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #185 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #186 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #187 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #188 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #189 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #190 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #191 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #192 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #193 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #194 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #195 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #196 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #197 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #198 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #199 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #200 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #201 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #202 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #203 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #204 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #205 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #206 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #207 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #208 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #209 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #210 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #211 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #212 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #213 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #214 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #215 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #216 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #217 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #218 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #219 0x3ffa2c8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #220 0x3ffa2c8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #221 0x3ffa2d3f307 in slot_tp_call Objects/typeobject.c:7494
    #222 0x3ffa2c8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #223 0x3ffa2df0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #224 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #225 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #226 0x3ffa2dffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #227 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #228 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #229 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #230 0x3ffa2c8ab15 in PyVectorcall_Call Objects/call.c:255
    #231 0x3ffa2c8ac65 in _PyObject_Call Objects/call.c:290
    #232 0x3ffa2c8ada9 in PyObject_Call Objects/call.c:317
    #233 0x3ffa2e059c7 in do_call_core Python/ceval.c:5943
    #234 0x3ffa2dffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #235 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #236 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #237 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #238 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #239 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #240 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #241 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #242 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #243 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #244 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #245 0x3ffa2c8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #246 0x3ffa2c8eddd in method_vectorcall Objects/classobject.c:53
    #247 0x3ffa2df00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #248 0x3ffa2df013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #249 0x3ffa2e05447 in call_function Python/ceval.c:5891
    #250 0x3ffa2dff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #251 0x3ffa2df052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #252 0x3ffa2e02b67 in _PyEval_Vector Python/ceval.c:5065
    #253 0x3ffa2c8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #254 0x3ffa2c8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #255 0x3ffa2c8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #256 0x3ffa2d3f307 in slot_tp_call Objects/typeobject.c:7494
    #257 0x3ffa2c8a933 in _PyObject_MakeTpCall Objects/call.c:215

0x61000013d790 is located 80 bytes inside of 192-byte region [0x61000013d740,0x61000013d800)
freed by thread T0 here:
    #0 0x3ffa3237de5 in operator delete(void*) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
    #1 0x3ff8e7e3221 in c10::TensorImpl::~TensorImpl() /home/user/pytorch/c10/core/TensorImpl.cpp:75

previously allocated by thread T0 here:
    #0 0x3ffa323734f in operator new(unsigned long) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x3ff4aeeb3d1 in c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_null_type<c10::TensorImpl> > c10::intrusive_ptr<c10::TensorImpl, c10::detail::intrusive_target_default_nul
l_type<c10::TensorImpl> >::make<c10::intrusive_ptr<c10::StorageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >, c10::DispatchKeySet&, caffe2::TypeMeta&>(c10::intrusive_ptr<c10::S
torageImpl, c10::detail::intrusive_target_default_null_type<c10::StorageImpl> >&&, c10::DispatchKeySet&, caffe2::TypeMeta&) /home/user/pytorch/c10/util/intrusive_ptr.h:498
    #2 0x3ff76f79e17  (/home/user/pytorch/build/lib.linux-s390x-cpython-310/torch/lib/libtorch_cpu.so+0x2fb79e17)

SUMMARY: AddressSanitizer: heap-use-after-free /home/user/pytorch/c10/core/SymInt.h:154 in c10::SymInt::is_heap_allocated() const
Shadow bytes around the buggy address:
  0x100c2000027aa0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x100c2000027ab0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x100c2000027ac0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
  0x100c2000027ad0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x100c2000027ae0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x100c2000027af0: fd fd[fd]fd fd fd fd fd fd fd fd fd fd fd fd fd
  0x100c2000027b00: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x100c2000027b10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x100c2000027b20: fa fa fa fa fa fa fa fa 00 00 00 00 00 00 00 00
  0x100c2000027b30: 00 00 00 00 04 fa fa fa fa fa fa fa fa fa fa fa
  0x100c2000027b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1115867==ABORTING
```
</details>

<details>
<summary>Additional backtraces (not full)</summary>

Memory deallocation:
```
#0  operator delete (ptr=0x61000013d740) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
#1  0x000003ffa77e3222 in c10::TensorImpl::~TensorImpl (this=0x61000013d740) at /home/user/pytorch/c10/core/TensorImpl.cpp:75
#2  0x000003ff63e76e8c in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::reset_ (this=0x3ffd7ec8230) at /home/user/pytorch/c10/util/intrusive_ptr.h:291
#3  0x000003ff63e76910 in c10::intrusive_ptr<c10::TensorImpl, c10::UndefinedTensorImpl>::~intrusive_ptr (this=0x3ffd7ec8230) at /home/user/pytorch/c10/util/intrusive_ptr.h:370
#4  0x000003ff63e67240 in at::TensorBase::~TensorBase (this=0x3ffd7ec8230) at /home/user/pytorch/aten/src/ATen/core/TensorBase.h:80
#5  0x000003ff63e85ee0 in at::Tensor::~Tensor (this=0x3ffd7ec8230) at aten/src/ATen/core/TensorBody.h:90
#6  0x000003ff63f67304 in resize__functionalization (dispatchKeySet=..., self=..., size=..., memory_format=...) at /home/user/pytorch/aten/src/ATen/FunctionalizeFallbackKernel.cpp:173
#7  0x000003ff63f89258 in c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>), &(resize__functionalization(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>))>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat> > >::operator()(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>) (
    this=0x6030000390a0, args=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13
#8  c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>), &(resize__functionalization(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>))>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>) (functor=0x6030000390a0, dispatchKeySet=..., args=..., args=...,
    args=...) at /home/user/pytorch/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:480
#9  0x000003ff6aca560a in c10::callUnboxedKernelFunction<at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat> > (
    unboxed_kernel_func=0x3ff63f88a80 <c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tenso
r const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>), &(resize__functionalization(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>))>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<long>, c10::optional<c10::MemoryFormat>)>, functor=0x6030000390a0,
    dispatchKeySet=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:50
#10 0x000003ff6aca715c in c10::KernelFunction::call<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> > (this=0x6210005e1b28, opHandle=...,
    dispatchKeySet=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:96
#11 c10::Dispatcher::redispatch<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)> const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const (
    this=0x3ff919400e0 <c10::Dispatcher::realSingleton()::_singleton>, op=..., currentDispatchKeySet=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:656
#12 0x000003ff6a82006c in c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const (
    this=0x3ff919a07e0 <at::_ops::resize_::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)::op>, currentDispatchKeySet=..., args=...,
    args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:492
#13 at::_ops::resize_::redispatch (dispatchKeySet=..., self=..., size=..., memory_format=...) at /home/user/pytorch/build/aten/src/ATen/Operators_4.cpp:2144
#14 0x000003ff861d5e08 in at::redispatch::resize__symint (dispatchKeySet=..., self=..., size=..., memory_format=...) at aten/src/ATen/RedispatchFunctions.h:2847
#15 0x000003ff861b579e in torch::ADInplaceOrView::resize_ (ks=..., self=..., size=..., optional_memory_format=...) at /home/user/pytorch/torch/csrc/autograd/VariableTypeManual.cpp:401
```

Memory access:
```
#0  c10::SymInt::maybe_as_int (this=0x61000013d790) at /home/user/pytorch/c10/core/SymInt.h:215
#1  0x000003ff734d0a6e in c10::SymInt::sym_eq (this=0x61000013d790, sci=...) at /home/user/pytorch/c10/core/SymInt.cpp:69
#2  0x000003ff5f6ab0be in c10::SymInt::operator== (this=0x61000013d790, o=...) at /home/user/pytorch/c10/core/SymInt.h:177
#3  0x000003ff5f6aaede in std::__equal<false>::equal<c10::SymInt const*, c10::SymInt const*> (__first1=0x61000013d790, __last1=0x61000013d7a0, __first2=0x602000015c30)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_algobase.h:1162
#4  0x000003ff5f6aae4c in std::__equal_aux1<c10::SymInt const*, c10::SymInt const*> (__first1=0x61000013d790, __last1=0x61000013d7a0, __first2=0x602000015c30)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_algobase.h:1211
#5  0x000003ff5f6aae06 in std::__equal_aux<c10::SymInt const*, c10::SymInt const*> (__first1=0x61000013d790, __last1=0x61000013d7a0, __first2=0x602000015c30)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_algobase.h:1219
#6  0x000003ff5f6aad98 in std::equal<c10::SymInt const*, c10::SymInt const*> (__first1=0x61000013d790, __last1=0x61000013d7a0, __first2=0x602000015c30)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_algobase.h:1556
#7  0x000003ff2ff3c772 in c10::ArrayRef<c10::SymInt>::equals (this=0x3ffed7c9900, RHS=...) at /home/user/pytorch/c10/util/ArrayRef.h:188
#8  0x000003ff31891bc2 in c10::operator!=<c10::SymInt> (a1=..., a2=...) at /home/user/pytorch/c10/util/ArrayRef.h:341
#9  0x000003ff51eb5800 in torch::ADInplaceOrView::resize_ (ks=..., self=..., size=..., optional_memory_format=...) at /home/user/pytorch/torch/csrc/autograd/VariableTypeManual.cpp:408
#10 0x000003ff51ee59c8 in c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c
10::MemoryFormat>), &torch::ADInplaceOrView::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>
 > >::operator()(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) (this=0x6030007dca40, args=..., args=..., args=..., args=...)
    at /home/user/pytorch/aten/src/ATen/core/boxing/impl/WrapFunctionIntoFunctor.h:13
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt
>, c10::optional<c10::MemoryFormat>), &torch::ADInplaceOrView::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<
c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tenso
r const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) (functor=0x6030007dca40, dispatchKeySet=..., args=..., args=..., args=...)
    at /home/user/pytorch/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h:480
#12 0x000003ff369a512a in c10::callUnboxedKernelFunction<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> > (
    unboxed_kernel_func=0x3ff51ee51f0 <c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor const& (c10::DispatchKeySet, at::Tenso
r const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>), &torch::ADInplaceOrView::resize_>, at::Tensor const&, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&, c10::Ar
rayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> > >, at::Tensor const& (c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::call(c10::OperatorKern
el*, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>, functor=0x6030007dca40, dispatchKeySet=..., args=..., args=..., args=...)
    at /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:50
#13 0x000003ff369a6e90 in c10::KernelFunction::call<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> > (this=0x6210005e1bc8, opHandle=...,
    dispatchKeySet=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/boxing/KernelFunction_impl.h:90
#14 c10::Dispatcher::redispatch<at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat> >(c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::Arr
ayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)> const&, c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const (
    this=0x3ff5d6400e0 <c10::Dispatcher::realSingleton()::_singleton>, op=..., currentDispatchKeySet=..., args=..., args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:656
#15 0x000003ff3652006c in c10::TypedOperatorHandle<at::Tensor const& (at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)>::redispatch(c10::DispatchKeySet, at::Tensor const&,
c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>) const (
    this=0x3ff5d6a07e0 <at::_ops::resize_::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::ArrayRef<c10::SymInt>, c10::optional<c10::MemoryFormat>)::op>, currentDispatchKeySet=..., args=...,
    args=..., args=...) at /home/user/pytorch/aten/src/ATen/core/dispatch/Dispatcher.h:492
#16 at::_ops::resize_::redispatch (dispatchKeySet=..., self=..., size=..., memory_format=...) at /home/user/pytorch/build/aten/src/ATen/Operators_4.cpp:2144
#17 0x000003ff51ed5e08 in at::redispatch::resize__symint (dispatchKeySet=..., self=..., size=..., memory_format=...) at aten/src/ATen/RedispatchFunctions.h:2847
#18 0x000003ff51ebbb68 in torch::autograd::VariableType::(anonymous namespace)::resize_ (ks=..., self=..., size=..., optional_memory_format=...)
    at /home/user/pytorch/torch/csrc/autograd/VariableTypeManual.cpp:243
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101064
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-12 17:41:30 +00:00
a3700571e1 Fixed a bug in interpolate uint8 AVX2 on non-contig input (#101136)
Description:
- Fixed a bug in interpolate uint8 AVX2 on non-contig input
- Added tests
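
A hypothetical repro sketch for the non-contiguous uint8 path described above (the exact shapes and strides that triggered the bug may differ):

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 256, (1, 3, 32, 32), dtype=torch.uint8)
x_nc = x[..., ::2]  # non-contiguous uint8 input
out = F.interpolate(x_nc, size=(20, 20), mode="bilinear", antialias=True)
```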

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101136
Approved by: https://github.com/NicolasHug
2023-05-12 17:17:10 +00:00
4a7ee79bf9 [BE] super small comment update to gradcheck.py (#101103)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101103
Approved by: https://github.com/soulitzer
2023-05-12 16:41:44 +00:00
a53cda1ddc [optim][BE] split test file into logical parts: SWA, LR, optim (#101100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101100
Approved by: https://github.com/albanD
2023-05-12 16:41:44 +00:00
a64e97b62c Revert "[dynamo 3.11] enable other torch 3.11 dynamo-related tests (#99180)"
This reverts commit aa8dcab1ce3fb96a7ccdee5861f4d281086ca3e1.

Reverted https://github.com/pytorch/pytorch/pull/99180 on behalf of https://github.com/huydhn due to Sorry for reverting this, but linux-bionic-py3.11-clang9 test starts to timeout after this taking more than 3h30m. This is probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/99180#issuecomment-1545982256))
2023-05-12 16:18:22 +00:00
8e54218024 [ROCM] Add build ROCM support to build-triton-wheel.yml (#95142)
To match upstream and build Triton wheels locally so that nightly PyTorch wheels can access them without needing to use pypi.org.

We may have a better approach to build both wheels at once, but for now, to save code duplication, another matrix dimension is added for device (cuda/rocm), with ROCm invoking a different commit and repo. The goal is to eventually have a single wheel support both backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95142
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd, https://github.com/atalman
2023-05-12 15:54:57 +00:00
ts
dfd822d756 Fix deserialization for UpsamplingBilinear2d (#101248)
Fixes #100935 by adding handling for the recompute_scale_factor field. I would be happy to write a test for this, but might need some advice on where it should go and how to reliably reproduce the given issue. I'd also be happy to iterate on the proposed changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101248
Approved by: https://github.com/albanD
2023-05-12 15:40:17 +00:00
fa40195fac Don't set_current_node in DDP. (#101046)
Fixes https://github.com/pytorch/pytorch/issues/101045

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101046
Approved by: https://github.com/wconstab, https://github.com/malfet
2023-05-12 14:37:22 +00:00
d54fcd571a [dynamo] Skip tests that are broken in fbcode (#101217)
Some tests don't work in fbcode, for some reason.  Skip these until we
can figure them out.

Differential Revision: [D45791340](https://our.internmc.facebook.com/intern/diff/D45791340/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101217
Approved by: https://github.com/davidberard98
2023-05-12 14:13:14 +00:00
74b2c04aa1 [c10d] Bridge c10d and gloo stores. (#100384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100384
Approved by: https://github.com/fduwjj
2023-05-12 13:55:31 +00:00
c0e5d7e7fe [CustomOp] Add Dispatcher error callback (#101015)
The PyTorch Dispatcher's "no kernel found for DispatchKey" error message
is a bit long-winded. This PR adds a way to register a custom error
callback and changes the CustomOp API to use that callback
to deliver better error messages.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101015
Approved by: https://github.com/ezyang
2023-05-12 13:49:20 +00:00
de6470e28e [custom_op] Change the python type that maps to ListType in schema (#101190)
Previously, to specify e.g. int[], a user needed to do Tuple[int, ...].
This PR changes it to Sequence[int].

Bikeshedding: we could totally just use List[int] instead. The types
that the user gives us, which we use to infer a schema, are not entirely
faithful: for example, we convert `int` to SymInt.

I didn't feel strongly between Sequence[int] and List[int] so I went
with the more faithful one, plus Python recommends that you use Sequence
for input arguments (over list or tuple), though we don't subscribe to
that philosophy in general.
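
Illustrative only (hypothetical op signatures, not the actual custom-op API calls): how the Python annotation that maps to the schema type `int[]` changes with this PR:

```python
from typing import Sequence, Tuple
import torch

# Before this change, int[] had to be spelled as a variadic tuple:
def my_op_old(x: torch.Tensor, dims: Tuple[int, ...]) -> torch.Tensor: ...

# After this change, it is spelled as a Sequence:
def my_op_new(x: torch.Tensor, dims: Sequence[int]) -> torch.Tensor: ...
```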

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101190
Approved by: https://github.com/bdhirsh
2023-05-12 13:49:20 +00:00
d0d8165230 Cleanup custom op library after each custom_op test (#100980)
This PR tells the custom op tests to destroy all custom ops in the
specified namespace after each test.

The general problem is that if a test fails, the custom op isn't cleaned
up. We could fix this via try-finally, but using a tearDown method
seemed like a nice O(1) solution.
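
A hypothetical sketch of the tearDown approach (the namespace and lookup helper below are illustrative, not the real test utilities):

```python
import unittest

def ops_registered_in_namespace(ns):
    # Hypothetical lookup helper; stands in for however the suite tracks the
    # custom ops each test registered under `ns`.
    return []

class TestCustomOp(unittest.TestCase):
    test_ns = "_torch_testing"  # illustrative namespace

    def tearDown(self):
        # Runs even when the test body raised, so no op leaks into the next test.
        for op in ops_registered_in_namespace(self.test_ns):
            op._destroy()
```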

Test Plan:
- deleted some foo._destroy, verified that the test suite passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100980
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
2023-05-12 13:49:18 +00:00
3ffeab7f80 [custom_op] Make repeated registrations error gracefully (#100979)
Previously the error message went through torch.library. This PR changes
it so that on each custom_op.impl_* call:
- we store a (function, location) tuple
- if a (function, location) tuple exists already, then we raise an
error.

This logic already existed for the abstract impl (the impl for meta and
fake tensors), so this PR just extends it to the others.
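
A hypothetical sketch of the bookkeeping described above (names are illustrative, not the actual CustomOp internals):

```python
class ImplRegistry:
    def __init__(self):
        # kind (e.g. "cpu", "cuda", "abstract") -> (function, registration location)
        self._impls = {}

    def register(self, kind, func, location):
        if kind in self._impls:
            _, prev_location = self._impls[kind]
            raise RuntimeError(
                f"impl for '{kind}' was already registered at {prev_location}; "
                f"refusing to overwrite it")
        self._impls[kind] = (func, location)
```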

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100979
Approved by: https://github.com/bdhirsh, https://github.com/soulitzer
2023-05-12 13:49:15 +00:00
b3b333205f Fix asarray doc examples. (#100971)
Fixes issue raised on [PyTorch discuss](https://discuss.pytorch.org/t/confused-on-an-example-on-pytorch-official-documentation/178785).

**Summary:** the examples in the `asarray` docs have a few mistakes that make them not work. This PR fixes those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100971
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2023-05-12 11:52:10 +00:00
b5c8d0359c Update autograd.rst (#101007)
Fixes #ISSUE_NUMBER

typo fix and small change to improve clarity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101007
Approved by: https://github.com/lezcano, https://github.com/anjali411
2023-05-12 11:47:51 +00:00
aa8dcab1ce [dynamo 3.11] enable other torch 3.11 dynamo-related tests (#99180)
Notes:
- No segfaults observed in any CI tests: dynamo unittests, inductor unittests, dynamo-wrapped pytorch tests. So we remove the warning that using dynamo 3.11 may result in segfaults.
- Some dynamo-wrapped pytorch tests hang. They will be skipped in the dynamo-wrapped test suite and will be addressed in a future PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99180
Approved by: https://github.com/malfet
2023-05-12 07:03:09 +00:00
d56e1b2f67 add Half support for unary ops on CPU (#98493)
Add Half support for log_sigmoid and some unary ops on CPU, including sinc, acosh, asinh, atanh, digamma, trigamma, rsqrt, acos, asin, atan, ceil, cos, erf, erfc, erfinv, exp, expm1, floor, log, log10, log1p, log2, i0, round, sin, sqrt, tan, tanh, trunc, lgamma.
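
Illustrative usage sketch: with this change, these ops accept float16 CPU tensors directly, e.g.:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, dtype=torch.half)  # CPU float16
y = torch.sinc(x)
z = F.logsigmoid(x)
w = torch.erfinv(x.clamp(-0.9, 0.9))
```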

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98493
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/ngimel
2023-05-12 04:52:34 +00:00
98f6b815b7 [BE] Make some simplifications to torch.utils.checkpoint logic (#101193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101193
Approved by: https://github.com/albanD
2023-05-12 04:35:22 +00:00
e568c5a18d [fx] change from #users to num_users in graph printout (#101140)
`#users` is interpreted as special markup in various chat apps, which makes it annoying to copy-paste graphs into them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101140
Approved by: https://github.com/ezyang
2023-05-12 04:34:01 +00:00
2c29149109 Enhance Composable FSDP cast forward input tests (#100349)
The fix for https://github.com/pytorch/pytorch/pull/99545 (https://github.com/pytorch/pytorch/pull/99546) explicitly required users to set `cast_forward_inputs=False` if they wanted to avoid hitting #99545 while using an FSDP root module with no direct parameters.

After further consideration, [the team believes](https://github.com/pytorch/pytorch/pull/99546#discussion_r1180898687) it is sufficiently common for the default `cast_forward_inputs=False` to be used with an FSDP root module possessing no direct parameters that a solution to #99545 that accommodates this use case is desired.

This PR builds on @zhaojuanmao's https://github.com/pytorch/pytorch/pull/100290 (nice!) to enhance the FSDP cast forward inputs testing to include a broader range of scenarios and to include `model.eval()` testing as well as training mode validation. (I unfortunately don't have permissions that would allow me to use ghstack directly but I can rebase this PR however the team desires, once #100290 lands etc.)

Currently, the evaluation mode testing is commented out while the team decides on the best approach to implementing the broader solution to https://github.com/pytorch/pytorch/pull/99545. Once an implementation is decided, the evaluation mode validation function in the new tests added in this PR can be uncommented and should continue to pass. I also include one potential evaluation mode solution suggestion in this PR but leave the existing code unchanged since I know the team is intending to consider a range of solutions this week.

Test notes:
1. The 8 tests added here are a superset of the current `test_float16_on_one_submodule` tests, including validation of the following configurations: (`cast_root_forward_inputs_submodule` = True/False, `cast_forward_inputs_submodule` = True/False, `use_root_no_params` = True/False) across both training and evaluation modes.
2. The `float16_on_one_submodule` model configuration is currently only tested in the FSDP root module with parameters scenarios (as was the existing case) but this test can be easily extended to test it in the FSDP root module with no parameters scenarios as well if the team thinks the additional test resource usage is justified.
3. Since this test amortizes the cost of test setup across the aforementioned range of scenarios, the loop-based implementation of `dtype` validation (below) would have been undesirably complex IMHO[^1] :
```python
        ############### Logical equivalent of current test result matrix ############
        if self.cast_root_forward_inputs_submodule or self.cast_forward_inputs_submodule:
            self.assertEqual(self.forward_inputs[self.c2].dtype, torch.float16)
            if use_root_no_params:
                if self.cast_root_forward_inputs_submodule:
                    self.assertEqual(self.forward_inputs[self.model].dtype, torch.float16)
                else:
                    self.assertEqual(self.forward_inputs[self.model].dtype, torch.float32)
                self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float16)
            else:
                self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float32)
        else:
            self.assertEqual(self.forward_inputs[self.model].dtype, torch.float32)
            self.assertEqual(self.forward_inputs[self.c1].dtype, torch.float32)
            if not use_root_no_params: # this input will only exist in the root with params case until eval fix is applied
                self.assertEqual(self.forward_inputs[self.c2].dtype, torch.float32)
```
so I implemented the validation function as an expected result lookup that provides the added benefit of explicitly specifying the failed subtest upon failed `dtype` assertions, e.g.:
```python
AssertionError: None mismatch: torch.float32 is not None
Subtest `no_cast_root_no_cast_child_no_root_params` failed.
```
The potential solution to https://github.com/pytorch/pytorch/pull/99545 that I added as a suggestion in the file conversation passes this test set but I know there are a lot of different ways that it could be resolved so I'll assume that change will be tackled in a separate PR unless the team wants to include it in this one.

As mentioned, I've currently based this PR off of https://github.com/pytorch/pytorch/pull/100290 so am happy to either wait for that to land first or rebase this PR however the team wants.

[^1]: Batching the scenarios into different tests is also possible, of course, but would involve unnecessary test setup overhead; I'm happy to switch to that approach if the team prefers it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100349
Approved by: https://github.com/awgu
2023-05-12 04:23:18 +00:00
49578913fb update timm commit (#100931)
Fixes #100903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100931
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-05-12 04:22:08 +00:00
3ae612ba7f [dtensor] remove assertions about submesh checks (#101229)
This PR removes the assertions from the submesh checks and directly returns the local
tensor, so that all the other APIs can work with submeshes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101229
Approved by: https://github.com/fduwjj
2023-05-12 04:20:35 +00:00
bf50180b4a enable dispatch stub for backend PrivateUse1 (#99611)
When extending PyTorch with a new out-of-tree backend, PrivateUse1 will be reused, so we also need to support PrivateUse1 in the dispatch stub module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99611
Approved by: https://github.com/ezyang
2023-05-12 04:02:12 +00:00
e98d762f21 update requirements.txt in /docs (#101092)
Fixes #101090

Local `make html` works under /docs. I'm not entirely sure how to verify that the doc build actually has no issues due to this update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101092
Approved by: https://github.com/kiersten-stokes, https://github.com/ezyang
2023-05-12 03:19:36 +00:00
de15e740a1 [dynamo] Activation checkpointing as higher order op (#101028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101028
Approved by: https://github.com/voznesenskym, https://github.com/zou3519
2023-05-12 03:17:41 +00:00
c5c75aa06d [vision hash update] update the pinned vision hash (#101230)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101230
Approved by: https://github.com/pytorchbot
2023-05-12 03:16:39 +00:00
ce76670c6f [GHF][BE] Add __repr__ to FlakyRule (#101234)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 9d9f7a4</samp>

> _`FlakyRule` class_
> _Defines rules for flaky tests_
> _Autumn leaves falling_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101234
Approved by: https://github.com/kit1980
2023-05-12 02:08:20 +00:00
f0cc535c28 [GHF][BE] Memoize `read_flaky_rules` (#101239)
Added caching to `read_flaky_rules`, as it's called several times during the merge process and every call incurs network access. Also, one should not expect the flaky rules to change while a PR is being landed.
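For illustration, memoization of this kind can be as simple as the sketch below (the URL is a placeholder and the helper is not the actual merge-script code):
```python
import json
from functools import lru_cache
from urllib.request import urlopen

FLAKY_RULES_URL = "https://example.com/flaky-rules.json"  # placeholder, not the real location

@lru_cache(maxsize=1)
def read_flaky_rules():
    # The first call hits the network; every later call during the same merge
    # returns the cached result instead of re-fetching.
    with urlopen(FLAKY_RULES_URL) as resp:
        return json.load(resp)
```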
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101239
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-05-12 02:07:37 +00:00
47ec9cc26d Improve error messages in THPVariable_set_grad (#100683)
Fixes #100174

I'm not sure if there's another direction that we had in mind for this issue, but if so I'm happy to make the changes 🙂

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100683
Approved by: https://github.com/soulitzer
2023-05-12 01:54:20 +00:00
02f152626c Fix typos in error message (#101231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101231
Approved by: https://github.com/huydhn
2023-05-12 01:34:55 +00:00
d9cfa0461a use const_data_ptr in get_device_pointers (#100997)
use const_data_ptr in get_device_pointers

Summary: These are just inputs to arange.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100997
Approved by: https://github.com/lezcano, https://github.com/ezyang
2023-05-12 01:24:19 +00:00
b9bfc2b2d9 Warn on failure to end warmup, add explicit api for start of model invocation (#101129)
CUDAGraph trees needs to know when you are doing a new invocation of your model. We have two heuristics for that:
- you invoke torch.compile again (e.g., as a top-level module you compiled)
- you have run a forward with a corresponding backward that hasn't been invoked yet, in which case we ignore the above

This doesn't always get it right, especially if you forget to use torch.no_grad() in inference. This adds a warning for that case, and adds an explicit `cudagraph_mark_step_begin` API.
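A rough usage sketch of the new API (the import location is an assumption; in recent releases it is exposed as `torch.compiler.cudagraph_mark_step_begin`):
```python
import torch
from torch.compiler import cudagraph_mark_step_begin  # assumed location in recent releases

model = torch.nn.Linear(16, 16).cuda()
compiled_model = torch.compile(model, mode="reduce-overhead")  # uses CUDAGraph trees

with torch.no_grad():  # forgetting no_grad in inference is the case the new warning targets
    for _ in range(3):
        cudagraph_mark_step_begin()  # explicitly mark the start of a new model invocation
        out = compiled_model(torch.randn(8, 16, device="cuda"))
```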

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101129
Approved by: https://github.com/ezyang
2023-05-12 01:15:01 +00:00
ts
74dc2a53f6 Thread generator through trunc_normal_ (#100810)
This will solve @albertz's issue as described in #98200, threading the generator argument through the trunc_normal_ function. I'm still working on #99796 (and won't let it stall out), but this fix doesn't trigger any JIT issues, so I think it might be helpful to get it merged now.

Would be happy to iterate on this if there are any issues.
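With this change, usage would look roughly like the sketch below, assuming the new `generator` keyword is threaded all the way through:
```python
import torch
from torch.nn.init import trunc_normal_

g = torch.Generator().manual_seed(0)
w = torch.empty(3, 5)
trunc_normal_(w, mean=0.0, std=0.02, generator=g)  # reproducible init via an explicit generator
```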
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100810
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-12 01:04:08 +00:00
4c8ee583c3 [inductor] fix incorrect strides in copy() decomp, fix hf_LongFormer + hf_BigBird errors (#100115)
Fixes https://github.com/pytorch/pytorch/issues/100067 and https://github.com/pytorch/pytorch/issues/93428.

See the comment [here](https://github.com/pytorch/pytorch/issues/100067#issuecomment-1523856970) for details. The bug was that the decomposition that inductor uses for `aten.copy` doesn't respect the strides of the input in all cases. The fixes that I added should work, but will be pretty slow - we allocate a tensor (potentially larger than `self` if `self` is a slice), and perform an `as_strided_scatter` + `as_strided`. Longer term, stride-agnostic IR should let us remove this decomp?  cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100115
Approved by: https://github.com/albanD
2023-05-12 00:50:35 +00:00
a6b8e69d36 [aot autograd] fix de-dupping metadata computation bug (#100431)
Fixes https://github.com/pytorch/pytorch/issues/100224

There was a bug in the way that metadata was computed when going from "metadata before-removing-dupes" to "metadata after-removing-dupes". In fact, when I ran the repro with `functorch.config.debug_assert = True`, that immediately signaled to me that the metadata was incorrect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100431
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-05-12 00:50:35 +00:00
5651006b9d [aot_autograd] proper handling for when outputs are aliased but have identical size/stride/offset metadata (#100430)
Fixes https://github.com/pytorch/pytorch/issues/100348, see the discussion in the issue for details. The problem was that for code like this:
```
def f(x):
    out = ...
    return out, out.detach()
```

The `.detach()` would turn into a `.alias()`, and inductor turns `.alias()` calls into no-ops. Inductor would effectively see that the two graph outputs have the same metadata, and return `out, out`. cc @ngimel alternatively we could have inductor try to detect when it's not ok to make `.alias()` a no-op, but that would probably require some custom logic instead of making `.alias()` a decomposition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100430
Approved by: https://github.com/ngimel
2023-05-12 00:50:35 +00:00
2c786961b7 Towards making torch._inductor.ir typed (#100712)
This PR just contains some mild gyrations necessary to appease mypy.
However, it is not complete; there are a number of legitimate bugs
and mistyping that I need to work out before I can actually turn this
on.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100712
Approved by: https://github.com/ngimel
2023-05-12 00:07:33 +00:00
380054ebb2 Add IRNode.realize stub with docs (#100710)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100710
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-05-12 00:07:33 +00:00
738ba13b35 [BE]: enable PLE error codes in ruff and fix bugs (#101079)
Enables the PyLint error (PLE) codes implemented in ruff. These are un-opinionated static analysis checks on Python code that find common bugs. After running all the PLE error codes implemented in ruff, I fixed the bugs, added a few ignores for malformed Python code that is part of our JIT test scripts, and finally added a few ignores for a false positive on PLE0605 and submitted an issue upstream to fix it in ruff: https://github.com/charliermarsh/ruff/issues/4345.

Common bugs found here include malformed logging format calls, bad string format calls, invalid escape sequences, and more.
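For example, one class of bug these checks flag is a logging call whose format string and arguments disagree (hypothetical snippet, not taken from this PR):
```python
import logging

log = logging.getLogger(__name__)
path, err = "config.yaml", "not found"

# flagged: one %s placeholder but two arguments (pylint's logging-too-many-args family)
log.error("failed to load %s", path, err)

# fixed
log.error("failed to load %s: %s", path, err)
```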

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101079
Approved by: https://github.com/malfet
2023-05-11 23:57:25 +00:00
b7bf953bbc [MPS] Fix bernoulli for int types (#100946)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 069fd23</samp>

This pull request enhances the MPS implementation of random operations in `Distributions.mm` and adds more dtype tests for the bernoulli distribution in `test_mps.py`. This improves the performance, correctness, and usability of the MPS backend for PyTorch.

Fixes https://github.com/pytorch/pytorch/issues/100717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100946
Approved by: https://github.com/kulinseth
2023-05-11 23:52:38 +00:00
599ae95d1a [dtensor] use stack to manage mesh resources (#101202)
This PR changes the context manager behavior of device mesh: we now use
a mesh env to track the current mesh and save the mesh to a stack so
that we can allow nested context managers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101202
Approved by: https://github.com/wz337
2023-05-11 23:48:36 +00:00
6d6abba0d8 [IValue] Better handle sparseTensors in extractStorages (#100783)
Sparse tensors don't seem to be handled when we have tensors instead
of pyobjects.

Differential Revision: [D45632427](https://our.internmc.facebook.com/intern/diff/D45632427/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100783
Approved by: https://github.com/H-Huang
2023-05-11 23:44:51 +00:00
cb94ea6044 [BE] Simplify tests, elaborate testnames in test_optim.py (#101004)
- Deletes unused kwargs
- Makes test names more descriptive to remove the need for comments. Overall it's better to codify than to comment
- Adds a test for duplicate params across groups
- Greatly simplifies test_empty_grad, discovering that the crux of the bug was NOT its emptiness, but rather its multi-dim emptiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101004
Approved by: https://github.com/albanD
2023-05-11 23:27:24 +00:00
49c8a0cad0 [SPMD][BE] Remove the legacy tracing code (#100858)
Remove the legacy tracing code as it causes several test and benchmark issues.

Differential Revision: [D45649123](https://our.internmc.facebook.com/intern/diff/D45649123/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100858
Approved by: https://github.com/wanchaol
2023-05-11 23:08:27 +00:00
c567748e16 Make interpolate_bilinear deterministic using decomposition (#101115)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101115
Approved by: https://github.com/ngimel
2023-05-11 22:48:01 +00:00
daed3bf8f9 Implement coalesced all_gather_into_tensor (#101157)
This PR adds support for the following use cases:
- Sync style:
```
with dist._coalescing_manager():
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])
```
- Async style:
```
with dist._coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_gather_into_tensor(output_tensors[i], input_tensors[i])

# do a bunch of other things
cm.wait()
# do things that depend on the all-gather's
```
Each `all_gather_into_tensor` would be independent in terms of data and buffer location, but they can be executed in parallel by supported backends (like NCCL).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101157
Approved by: https://github.com/kumpera, https://github.com/wanchaol
2023-05-11 20:58:47 +00:00
e47cdd0ca4 [BE] Testing docs: clarify test instantiation function usage (#100905)
Beefing up docs with discussion about when to use `instantiate_device_type_tests()` vs. `instantiate_parametrized_tests()` + description on what each does.

Spoiler: use only one - the former for device-specific and the latter for device-agnostic tests. Both support `@parametrize`.
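A condensed sketch of the two styles (import paths assumed from the usual internal test utilities):
```python
from torch.testing._internal.common_utils import (
    TestCase, run_tests, parametrize, instantiate_parametrized_tests,
)
from torch.testing._internal.common_device_type import instantiate_device_type_tests

class TestDeviceAgnostic(TestCase):          # device-agnostic: parametrized only
    @parametrize("bias", [True, False])
    def test_linear(self, bias):
        ...

class TestDeviceSpecific(TestCase):          # device-specific: each test gets a `device` argument
    def test_add(self, device):
        ...

instantiate_parametrized_tests(TestDeviceAgnostic)
instantiate_device_type_tests(TestDeviceSpecific, globals())

if __name__ == "__main__":
    run_tests()
```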
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100905
Approved by: https://github.com/janeyx99
2023-05-11 20:48:03 +00:00
ae23328625 Remove obsolete upsample_bilinear2d lowerings (#101111)
We pre-autograd decompose, so we never need to lower this!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101111
Approved by: https://github.com/ngimel
2023-05-11 20:41:57 +00:00
346e1f512f sparse compressed validation: allow empty-batched inputs (#101180)
Fixes https://github.com/pytorch/pytorch/issues/101179.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101180
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-05-11 20:30:20 +00:00
65b15be04c Fix incorrect sparse_dim in COO.zero_() and in binary operations with zero-sized COO operands (#98292)
Fixes https://github.com/pytorch/pytorch/issues/97627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98292
Approved by: https://github.com/nikitaved, https://github.com/cpuhrsch, https://github.com/amjames
2023-05-11 19:05:34 +00:00
41a4e22015 Update torchbench pin (#101071)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101071
Approved by: https://github.com/malfet
2023-05-11 18:09:40 +00:00
f7571507e0 Add global boolean for controlling whether to record concrete shapes or not (#101043)
Summary: We don't think the performance impact of recording concrete shapes is significant; but it's good to have a knob for turning it off quickly in case it has a large performance impact.

Test Plan:
Ran D45681838. It prints the state of that "concrete inputs" boolean. I ran it before and after canarying a change to `pytorch/kineto:pytorch_record_concrete_inputs`; before, it returns true; after, it returns false.

Note that D45681838 had to add `service` on the main function. That's because we need to `initFacebook` in order to use jks.

Differential Revision: D45680162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101043
Approved by: https://github.com/aaronenyeshi
2023-05-11 18:07:35 +00:00
14964b3aa5 Add is_xpu to torch type (#101072)
# Motivate
Without this PR:
```python
>>>import torch
>>>torch.IntTensor.is_cuda
False
>>>torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```

With this PR:
```python
>>>import torch
>>>torch.IntTensor.is_xpu
False
```
To align with CUDA: some customer code uses is_xpu to check the backend. Without this PR, the check is always truthy, which results in unexpected behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072
Approved by: https://github.com/mikaylagawarecki
2023-05-11 17:50:59 +00:00
1e6002bef6 [pt2] Skip if curr_size is None (#101170)
Summary: There is a chance curr_size is None (got an error None cannot set item), so skip if it's None

Test Plan: unit test in D44736488

Differential Revision: D45767829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101170
Approved by: https://github.com/ezyang
2023-05-11 17:20:38 +00:00
0ec4646588 CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-11 17:17:14 +00:00
369a256381 [Dynamo] Remove cross import in dynamo unit tests (#100851)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100851
Approved by: https://github.com/jansel
2023-05-11 17:07:25 +00:00
502e791241 Update cpuinfo submodule to include AVX512-FP16 detection (#100865)
Update submodule cpuinfo to include AVX512-FP16 detection.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100865
Approved by: https://github.com/Blackhex, https://github.com/jgong5, https://github.com/malfet
2023-05-11 15:13:19 +00:00
4a4854f6b2 [inductor] Test for shape padding (#100493)
Summary:
This wasn't tested anywhere as far as I can tell, so it was breaking
on Triton updates that fiddled with the signature of `do_bench`

Test Plan: `test_shape_padding`

Differential Revision: D45499978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100493
Approved by: https://github.com/Chillee, https://github.com/jansel
2023-05-11 15:10:54 +00:00
d283075282 Reduce fake_tensor create_mode logging (#101074)
A lot of Meta-internal logging is at INFO level, so this produces a lot of spam

Differential Revision: [D45732720](https://our.internmc.facebook.com/intern/diff/D45732720/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101074
Approved by: https://github.com/eellison, https://github.com/ezyang
2023-05-11 13:26:38 +00:00
ad070b6dfa Check canary_models for models too in torchbench.py (#101081)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101081
Approved by: https://github.com/desertfire
2023-05-11 13:23:17 +00:00
4eaaa08623 Revert "Fix header inclusions in c10 by iwyu (#100304)"
This reverts commit 6037ee8cc914d64a27965a35b20472044416a2a5.

Reverted https://github.com/pytorch/pytorch/pull/100304 on behalf of https://github.com/jeanschmidt due to Breaking meta internal builds and fbgemm builds ([comment](https://github.com/pytorch/pytorch/pull/100304#issuecomment-1543919257))
2023-05-11 12:37:35 +00:00
683adb2091 Add dummy CUDA kernel for assert_async.msg (#101130)
This PR doesn't actually propagate the error messages down to internal kernel implementations; that will be done in a follow-up PR. This op is only really meant to be used in the export context, and export doesn't support CUDA right now, so it is safe to just ignore the assert message. This PR is for quickly unblocking https://github.com/pytorch/pytorch/pull/100791 and https://github.com/pytorch/pytorch/issues/100918
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101130
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2023-05-11 12:27:10 +00:00
cbfed470bd Revert "CUDA Graph Trees - error on deallocated access (#100927)"
This reverts commit 3941bbc5ba10acdf103cd91bab7de67bfef95957.

Reverted https://github.com/pytorch/pytorch/pull/100927 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100927#issuecomment-1543874258))
2023-05-11 12:07:20 +00:00
dd2c22f4bb bsr_dense_bmm(): enable more precise float32 support with float64 accumulators (#100882)
Float64 is there in Triton! This PR increases precision for float32 inputs with float64 accumulation dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100882
Approved by: https://github.com/cpuhrsch
2023-05-11 11:22:55 +00:00
979f55d3bc implementation of DataPtr context for copy-on-write tensors (#100818)
implementation of DataPtr context for copy-on-write tensors

Summary:
Copy-on-write storage
=====================
This library adds support for copy-on-write storage, i.e. lazy copies,
to tensors. The design maintains the PyTorch invariant that tensors
alias if and only if they share a storage. Thus, tensors that are lazy
copies of one another will have distinct storages that share a data
allocation.

Thread-safety
-------------
The correctness of this design hinges on the pre-existing PyTorch user
requirement (and general default programming assumption) that users
are responsible for guaranteeing that writes do not take place
concurrently with reads and other writes.

Lazily copied tensors add a complication to this programming model
because users are not required to know if lazy copies exist and are
not required to serialize writes across lazy copies. For example: two
tensors with distinct storages that share a copy-on-write data context
may be given to different threads that may do whatever they wish to
them, and the runtime is required to guarantee its safety.

It turns out that this is not that difficult to protect because, due
to the copy-on-write requirement, we just need to materialize a tensor
upon writing. This could be done entirely without synchronization if
we materialized each copy; however, we have a common-sense
optimization to elide the copy for the last remaining reference. This
requires waiting for any pending copies.

### Thread-safety detailed design
There are two operations that affect the copy-on-write details of a
tensor:

1) lazy-clone (e.g. an explicit call or a hidden implementation detail
   added through an operator like reshape)
2) materialization (i.e. any write to the tensor)

The key insight that we exploit is that lazy-clone is logically a read
operation and materialization is logically a write operation. This
means that, for a given set of tensors that share a storage, if
materialization is taking place, no other read operation, including
lazy-clone, can be concurrent with it.

However, this insight only applies within a set of tensors that share
a storage. We also have to be concerned with tensors with different
storages that share a copy-on-write context. In this world,
materialization can race with lazy-clone or even other
materializations. _However_, in order for this to be the case, there
must be _at least_ two references to the context. This means that the
context _can not_ vanish out from under you if you are performing a
lazy-clone, and hence, it only requires an atomic refcount bump.

The most complicated case is that all lazy-copies are concurrently
materializing. In this case, because a write is occurring, there are
no in-flight lazy-copies taking place. We must simply ensure that all
lazy-copies are able to materialize (read the data) concurrently. If
we didn't have the aforementioned optimization where the last copy
steals the data, we could get away with no locking whatsoever: each
makes a copy and decrements the refcount. However, because of the
optimization, we require the loser of the materializing race to wait
for the pending copies to finish, and then steal the data without
copying it.

We implement this by taking a shared lock when copying the data and
taking an exclusive lock when stealing the data. The exclusive lock
acquisition ensures that all pending shared locks are finished before
we steal the data.
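To make the refcount/materialization interplay concrete, here is a deliberately simplified single-lock Python toy model of the idea (the real implementation is C++ and uses the shared/exclusive lock pair described above):

```python
import copy
import threading

class ToyCowContext:
    """Toy model of a copy-on-write data context; illustrative only."""

    def __init__(self, data):
        self._data = data
        self._refcount = 1
        self._lock = threading.Lock()  # stands in for the shared/exclusive lock pair

    def lazy_clone(self):
        # Logically a read: only a refcount bump is required.
        with self._lock:
            self._refcount += 1
        return self

    def materialize(self):
        # Logically a write: copy the data, unless we hold the last reference,
        # in which case we may steal it without copying.
        with self._lock:
            if self._refcount == 1:
                return self._data             # last reference: steal
            self._refcount -= 1
            return copy.deepcopy(self._data)  # otherwise: copy
```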

Test Plan: 100% code coverage.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100818).
* #100821
* #100820
* #100819
* __->__ #100818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100818
Approved by: https://github.com/ezyang
2023-05-11 11:13:51 +00:00
87084643e5 [CI][MPS] Actually make grid_sampler_2d available (#101108)
In CI, an older macOS SDK can be used to compile the binary, so add a guard for the availability of the `MPSGraphResizeNearestRoundingModeRoundToEven` enum value.
MPS feature availability checks are deliberately done at runtime (by using `is_macos_13_or_newer` and forward-declaring methods in `MPSGraphVenturaOps.h`) rather than at compile time (by using `#ifdef`s).

Modify error message and XFAIL condition in `test_mps.py` to fail test due to missing conditional on macOS-13.2 or newer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101108
Approved by: https://github.com/kulinseth
2023-05-11 10:35:09 +00:00
c4752b1a91 [MPS] Rename metalIndexingFunction to metalIndexingPSO (#101156)
Rename to reflect its return type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101156
Approved by: https://github.com/DenisVieriu97
2023-05-11 09:27:29 +00:00
8b4e28d65d Fix microbenchmarks (#101065)
As per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101065
Approved by: https://github.com/jansel
2023-05-11 09:14:22 +00:00
036a8d6b4a Remove NullContext() from benchmark runners (#100309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100309
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2023-05-11 06:42:27 +00:00
c25fdc20c2 [cuBLAS][cuBLASLt] Allow user-specified cuBLASLt workspace size via CUBLASLT_WORKSPACE_SIZE (#101145)
Provide an option to configure the workspace size used by cuBLASLt rather than fixing it as a compile-constant of 1MiB due to observed performance differences on H100 and recommendations from cuBLAS e.g., https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html#title-cublas-library.

Some quick profiling shows that in some cases up to 32MiB of workspace is needed on H100:
```
import torch
import time

m = 1024
n = 2048
warmup = 20
iters = 200
dtype = torch.bfloat16

for k in (1024, 2048, 4096, 8192, 9376, 16384, 32768):
  a = torch.randn(m, k, device='cuda', dtype=dtype)
  b = torch.randn(n, k, device='cuda', dtype=dtype).transpose(1, 0)
  i = torch.randn(n, device='cuda', dtype=dtype)
  for _ in range(warmup):
    torch.addmm(i, a, b)
  torch.cuda.synchronize()
  t1 = time.perf_counter()
  for _ in range(iters):
    torch.addmm(i, a, b)
  torch.cuda.synchronize()
  t2 = time.perf_counter()
  print(f"m:{m}, n:{n}, k:{k} TFLOP/s: {( 2*m*n*k)*iters/(t2 - t1)/1e12}")
```
1MiB:
```
m:1024, n:2048, k:1024 TFLOP/s: 62.40964655242158
m:1024, n:2048, k:2048 TFLOP/s: 79.33321703070685
m:1024, n:2048, k:4096 TFLOP/s: 96.69701590181765
m:1024, n:2048, k:8192 TFLOP/s: 83.2892371366678
m:1024, n:2048, k:9376 TFLOP/s: 83.91872373271516
m:1024, n:2048, k:16384 TFLOP/s: 86.57820235279185
m:1024, n:2048, k:32768 TFLOP/s: 88.37227761178467
```

32 MiB:
```
m:1024, n:2048, k:1024 TFLOP/s: 73.50633352382425
m:1024, n:2048, k:2048 TFLOP/s: 104.32016319633199
m:1024, n:2048, k:4096 TFLOP/s: 131.37290416527784
m:1024, n:2048, k:8192 TFLOP/s: 152.08780769805506
m:1024, n:2048, k:9376 TFLOP/s: 154.93898780286096
m:1024, n:2048, k:16384 TFLOP/s: 165.13973167154688
m:1024, n:2048, k:32768 TFLOP/s: 160.62065020500813
```

CC @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101145
Approved by: https://github.com/ngimel
2023-05-11 06:38:32 +00:00
b46553f652 [inductor] simplify test_cpu_repro with self.common (#101050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101050
Approved by: https://github.com/jansel
2023-05-11 06:29:39 +00:00
ea86eb3197 inductor: fallback ConvTranspose when output_padding is big (#100846)
Fixes https://github.com/pytorch/pytorch/issues/100225.

Do not use mkldnn when `output_padding` is big to align the behavior with eager mode:
7d0e4e2aa8/aten/src/ATen/native/Convolution.cpp (L500-L507)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100846
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-05-11 06:18:37 +00:00
a66de845de [Quant][PT2E]Fix pt2e quantization maxpool input observer issue (#100961)
**Summary**
Fix the issue https://github.com/pytorch/pytorch/issues/100959. The root cause is that a `torch.ops.aten.max_pool2d_with_indices.default` node has 2 outputs: the output tensor and the max indices. So its `node.meta["val"]` is a tuple of `FakeTensor`s (for example: `'val': (FakeTensor(..., size=(1, 2, s1, s1)), FakeTensor(..., size=(1, 2, s1, s1), dtype=torch.int64))`). This fails the observer-insertion check, which only accepts a single `FakeTensor`.

**Test Plan**
```
python -m pytest test_quantize_pt2e.py -k test_max_pool2d_quantizer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100961
Approved by: https://github.com/jerryzh168, https://github.com/jgong5
2023-05-11 06:14:34 +00:00
cyy
6037ee8cc9 Fix header inclusions in c10 by iwyu (#100304)
This work introduces include-what-you-use support for c10 via a CMake option defaulting to off. We also remove some unused header inclusions and fix a trivial inclusion error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100304
Approved by: https://github.com/ezyang
2023-05-11 05:19:42 +00:00
da02ccc60e Revert "PyTorch -> C++17 (#98209) (#100557)"
This reverts commit 083f88e12632059e7e710634fc8708c8205678d5.

Reverted https://github.com/pytorch/pytorch/pull/100557 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100557#issuecomment-1543285863))
2023-05-11 03:43:11 +00:00
2621fbda7d Turn on anomaly detection for AOTAutograd backward tracing (#101047)
Previously, anomaly detection was only enabled on the inner forward function, and not on the overall joint function that calls backward. I believe this impeded us from printing "this is the forward that triggered the backward" because that printing only happens if anomaly mode is enabled when you run backward(). This PR fixes it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101047
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-11 03:38:20 +00:00
15a51e2012 simplify sdpa backward meta registration (#101128)
Per title.

There's an off chance that query_reshaped etc. was actually discontiguous after the reshape, but even in that case I'm pretty sure the computed gradients would still be contiguous, and we are properly transposing output gradients to produce correct strides.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101128
Approved by: https://github.com/drisspg
2023-05-11 03:30:07 +00:00
5f89d89ada [vision hash update] update the pinned vision hash (#101142)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101142
Approved by: https://github.com/pytorchbot
2023-05-11 03:16:23 +00:00
075d36d37f [Dynamo] Fix nested function resume execution (#100426)
Fixes #99665

Let me explain the root cause using the unit test I added:
* This bug is triggered when:
  * ```wrapped``` is a nested function.
  * ```wrapped``` is in another module which is different from the main function ```fn```.
  * There is a graph break inside of ```wrapped```.
* The root cause: when resuming a nested function, we were actually using the outermost function's (```fn``` in my example) global variables, but ```wrapped``` calls ```inner_func```, which is not part of ```fn```'s globals, so we have to set the correct globals when the nested function resumes execution.
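A hypothetical repro in the spirit of the described unit test (module layout and names are illustrative only):

```python
# imagine this part lives in a separate helper module, e.g. helper_mod.py
import torch
import torch._dynamo

def inner_func(x):
    return torch.cos(x)

def make_wrapped():
    def wrapped(x):                  # nested function defined in the helper module
        x = torch.sin(x)
        torch._dynamo.graph_break()  # graph break inside the nested function
        return inner_func(x)         # resolved via the helper module's globals
    return wrapped

# main module
wrapped = make_wrapped()

@torch._dynamo.optimize("eager")
def fn(x):
    return wrapped(x) + 1

fn(torch.randn(4))
```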

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100426
Approved by: https://github.com/jansel
2023-05-11 03:10:23 +00:00
c84627c2ee benchmarks: make --amp works for cpu path (#101057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101057
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-05-11 02:51:38 +00:00
a1aa32e204 [dtensor] tensor ops to use strategy based sharding prop (#100607)
This is the first in a series of PRs that adapt operator impls to use a
strategy-based approach: each op utilizes OpStrategy and PlacementStrategy
to generate its own strategy. By utilizing the strategy-based
approach along with the op graph, we can enable more advanced op
implementations (decomp is possible) and turn the sharding prop into
more of a constraint satisfaction problem.

This PR alone only adds some basic tensor op strategies, and it directly
works on the op graph that was used for metadata propagation. The tensor ops
added in this PR mainly follow one of the arg strategies. The next set of
PRs will add more op strategies for other ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100607
Approved by: https://github.com/XilunWu
2023-05-11 02:47:20 +00:00
d1f0c8e2d0 Run C++ test_api binary directly in CI slow jobs (#101088)
Similar to ASAN, the test starts to time out on slow jobs such as slow gradcheck, for example 30cecc0e11. This needs to be investigated later, but it's of low priority as we can run the test_api binary directly in these jobs in the meantime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101088
Approved by: https://github.com/clee2000
2023-05-11 02:08:20 +00:00
0848ed21b8 [c10d] Figure out device to use for object collectives (#100954)
Fixes https://github.com/pytorch/pytorch/issues/97938

This PR is cloned from https://github.com/pytorch/pytorch/pull/100238, which is important to me, but
@kwen2501 has not resolved the conflict. So this PR is submitted to resolve the conflict;
the only conflict is at `distributed_c10d.py:2653`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100954
Approved by: https://github.com/kwen2501
2023-05-11 01:49:09 +00:00
a0e6ae2c01 Restore Vulkan tests to periodic (#101026)
The flaky issue has been fixed by https://github.com/pytorch/pytorch/pull/100909.  In addition, retry support for C++ is now available after https://github.com/pytorch/pytorch/pull/99956.  So it's safe to move this back from unstable now.

Per the discussion with @clee2000, it makes sense to have this as a periodic job given that this one rarely fails.  To prevent any gap here, I have created https://github.com/pytorch/test-infra/pull/4144 to add the necessary testing for Vulkan change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101026
Approved by: https://github.com/kit1980
2023-05-11 01:42:51 +00:00
13d445c2c2 Move periodic dynamo benchmarks to inductor workflow (#100915)
Will this mess up the dashboard?

Add a new workflow for the dynamo benchmarks; it triggers on ciflow/inductor and runs periodically on main.

The dynamo benchmarks take about 7-8 hours total.

In the past week, the inductor workflow has been triggered ~580 times on PRs (~850 total) and periodic has been triggered ~100 times total. This is an estimate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100915
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/malfet
2023-05-11 01:11:24 +00:00
b1a8a10a73 inductor(CPU): fix masked_fill issue when filled value is nan (#101058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101058
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-05-11 00:57:04 +00:00
3271413e74 Revert "Fix header inclusions in c10 by iwyu (#100304)"
This reverts commit 39ec5fa722730f6c25490c2c33933b014767f297.

Reverted https://github.com/pytorch/pytorch/pull/100304 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, it is almost there but fails on Windows 39ec5fa722, which is in unstable mode after https://github.com/pytorch/pytorch/pull/100548 ([comment](https://github.com/pytorch/pytorch/pull/100304#issuecomment-1542975714))
2023-05-11 00:37:32 +00:00
bb7d9886fb [efficiency_camp] Vector Realloc Optimize caffe2::BinaryElementwiseWithArgsOp::DoRunWithType (#100631)
Summary: Reserve the vector capacity to avoid resizing (realloc)

Test Plan:
**Internal:**
```
$ buck test mode/dev-nosan caffe2/caffe2/python/operator_test:elementwise_ops_test
```

Differential Revision: D45529614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100631
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-05-10 23:34:54 +00:00
7110060cff Enable reordering pass (#100747)
Restricts the reordering pass to only reorder nodes before the `copy_` epilogue that functionalization generates.

Brings `python benchmarks/dynamo/torchbench.py  --performance  --backend inductor --amp --inference --only hf_Bert` from 1.46 -> 1.49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100747
Approved by: https://github.com/jansel
2023-05-10 23:23:55 +00:00
91ca9a276f Revert "Enable reordering pass (#100747)"
This reverts commit 6308563a39aab6c261e4adf81391c53707f785e7.

Reverted https://github.com/pytorch/pytorch/pull/100747 on behalf of https://github.com/jeanschmidt due to braking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100747#issuecomment-1542906461))
2023-05-10 22:54:40 +00:00
6c3af6a966 Revert "inductor(CPU): fix issue when padding/stride/dilation size is one for cpu weight packing pass (#100951)"
This reverts commit 2b250e19210b256edcd8f94652d33c3bbbe382fb.

Reverted https://github.com/pytorch/pytorch/pull/100951 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, Jasson A should follow up to make sure we have this sorted out ASAP ([comment](https://github.com/pytorch/pytorch/pull/100951#issuecomment-1542878888))
2023-05-10 22:17:24 +00:00
176dabf88c [MPS] Export check for 13.3 to is_macos13_or_newer (#101119)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at b420550</samp>

Updated `isOnMacOS13orNewer` function in `MPSHooks.cpp` to handle macOS 13.3. This fixes a bug with MPS on this version of macOS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101119
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-05-10 22:14:03 +00:00
c650b12e0b [pt2] Add some helper function for SymIntVector (#101056)
Summary: To SymInt-ify some fbgemm kernels f4c83b4fb3/fbgemm_gpu/src/jagged_tensor_ops_autograd.cpp (L109), we need to provide a toSymIntVector helper function in pytorch

Test Plan: Run fbgemm op + torch.compile to make sure the symbolic shape is produced.

Differential Revision: D45724091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101056
Approved by: https://github.com/suo
2023-05-10 21:27:55 +00:00
8a20ea0a1f [Dynamo] Fix torch.{cuda/cpu}.amp.autocast arguments binding bug (#101052)
Fixes Meta internal user case.

Repro:
```
import torch
import torch._dynamo

def fn(x):
    with torch.cuda.amp.autocast(False):
        x = torch.sin(x + 1)
    return x

x = torch.randn([2, 3])
ref = fn(x)
print(ref)
opt_fn = torch._dynamo.optimize(backend="inductor")(fn)
print(opt_fn(x))
```

Error:
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 425, in _compile
    out_code = transform_code_object(code, transform)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
    transformations(instructions, code_options)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/convert_frame.py", line 410, in transform
    tracer.run()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 2010, in run
    super().run()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 703, in run
    and self.step()
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 663, in step
    getattr(self, inst.opname)(inst)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 385, in wrapper
    return inner_fn(self, inst)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 1095, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/symbolic_convert.py", line 554, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/torch.py", line 381, in call_function
    return AutocastModeVariable.create(target_values=args, kwargs=kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/variables/ctx_manager.py", line 198, in create
    bound_args = inspect.signature(torch.autocast).bind(*target_values, **kwargs)
  File "/scratch/ybliang/work/env/lib/python3.9/inspect.py", line 3045, in bind
    return self._bind(args, kwargs)
  File "/scratch/ybliang/work/env/lib/python3.9/inspect.py", line 2984, in _bind
    raise TypeError(
TypeError: multiple values for argument 'device_type'

from user code:
   File "/scratch/ybliang/work/repos/debug/debug6.py", line 10, in fn
    with torch.cuda.amp.autocast(False):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101052
Approved by: https://github.com/anijain2305
2023-05-10 21:19:18 +00:00
08ef92e711 Delete Python-2 checks from setup.py (#101112)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 557960b</samp>

> _`Python 2` is gone_
> _PyTorch cleans up its code_
> _Winter of legacy_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101112
Approved by: https://github.com/kit1980, https://github.com/albanD
2023-05-10 20:17:31 +00:00
96f46316c9 Preserve PyTest Cache across job runs (#100522)
Preserves the PyTest cache from one job run to the next. In a later PR, this will be used to change the order in which we actually run those tests.

The process is:
1. Before running tests, check S3 to see if there is an uploaded cache from any shard of the current job
2. If there are, download them all and merge their contents. Put the merged cache in the default .pytest_cache folder
3. After running the tests, merge the now-current .pytest_cache folder with the cache previously downloaded for the current shard. This will make the merged cache contain all tests that have ever failed for the given PR in the current shard
4. Upload the resulting cache file back to S3

The S3 folder has a retention policy of 30 days, after which the uploaded cache files will get auto-deleted.
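A rough sketch of the merge in steps 2-3, assuming the caches are combined at the `lastfailed` level; the actual CI scripts may structure this differently:
```python
import json
from pathlib import Path

def merge_lastfailed(cache_dirs, merged_dir):
    """Union the `lastfailed` entries from several downloaded pytest caches."""
    merged = {}
    for cache_dir in cache_dirs:
        lastfailed = Path(cache_dir) / "v" / "cache" / "lastfailed"
        if lastfailed.exists():
            merged.update(json.loads(lastfailed.read_text()))
    out = Path(merged_dir) / "v" / "cache" / "lastfailed"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged, indent=2, sort_keys=True))
```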
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100522
Approved by: https://github.com/huydhn
2023-05-10 18:37:28 +00:00
2dc93c20ac [ROCm]Fixed ut test_memory_timeline (#96752)
Fixed test_memory_profiler::TestMemoryProfilerE2E::test_memory_timeline by changing the (arbitrary) threshold for logging. We observe differently-sized allocations on different AMD GPUs, so we chose a higher threshold of 512 to account for those differences and yet satisfy the test requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96752
Approved by: https://github.com/jithunnair-amd, https://github.com/kit1980
2023-05-10 17:49:35 +00:00
e762cce61f Allow cmake vars in docker build (#100867)
Fixes #100866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100867
Approved by: https://github.com/malfet
2023-05-10 17:44:30 +00:00
058d740f59 [reland][quant][pt2e] Change input act annotation to a map and allow dynamic quantization for non zeroth argument (#101005) (#101041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101005

Previously the node annotation looks like the following:
```
node.meta["..."] = {
    "input_act_obs_or_fq_ctr": ...,
    "weight_obs_or_fq_ctr": ...,
    "weight_index": 1,
}
```
Basically we need to specify the index for the weight and also have a separate key for the weight config. In this PR we changed that to:
```
node.meta["..."] = {
    "input_act_obs_or_fq_ctr_map": {input_node: ..., weight_node: ...},
}
```
This can support specifying the observer/fake quant constructor for any argument of the node

Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'

Differential Revision: D45719781

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101041
Approved by: https://github.com/andrewor14
2023-05-10 17:43:21 +00:00
3941bbc5ba CUDA Graph Trees - error on deallocated access (#100927)
Turn warning to error if we detect tensor is accessed after its memory is overwritten/released by a new invocation of cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100927
Approved by: https://github.com/zou3519
2023-05-10 17:15:33 +00:00
3b2a93a3b5 [inductor] Make codecache file permissions less restrictive (#100870)
`tempfile.mkstemp` always creates the file with `0o600` permissions, so
only the current user can access it. Instead, this salts the original
filename with the pid and thread id to avoid conflicts between
temporary files.
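A small sketch of the salting approach (the helper name is hypothetical):
```python
import os
import threading

def write_atomic(path: str, content: str) -> None:
    # Salt the temporary name with the pid and thread id so concurrent writers
    # don't collide, then atomically rename into place; unlike tempfile.mkstemp,
    # the final file keeps the default umask-derived permissions.
    tmp_path = f"{path}.{os.getpid()}.{threading.get_ident()}.tmp"
    with open(tmp_path, "w") as f:
        f.write(content)
    os.replace(tmp_path, path)
```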

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100870
Approved by: https://github.com/jansel
2023-05-10 16:32:45 +00:00
32c9e7d377 [CI] Run test_multi_gpu in test_inductor_distributed (#100135)
Summary: The guard reason string change is needed after https://github.com/pytorch/pytorch/pull/98107/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100135
Approved by: https://github.com/anijain2305
2023-05-10 16:16:55 +00:00
000368b092 Allow C++ custom class to define __repr__ and use it from Python (#100724)
When handling custom classes from Python, it is nice to be able to specify how they are displayed to the user.

Out of the two standard functions to do this, only `__str__` could be implemented in C++. This PR adds `__repr__` to the allowlist of magic methods.

The second commit tweaks the default output of `__str__` to make it more informative, but I can remove the change if you want.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100724
Approved by: https://github.com/ezyang
2023-05-10 15:46:45 +00:00
c0d33f66c9 [pt2] remove unused meta_linalg_eigh (#100965)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100965
Approved by: https://github.com/ezyang
2023-05-10 15:45:36 +00:00
6abde61f8e [pt2] add meta function for _linalg_eigh (#100964)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100964
Approved by: https://github.com/ezyang
2023-05-10 15:45:15 +00:00
cyy
39ec5fa722 Fix header inclusions in c10 by iwyu (#100304)
This work introduces include-what-you-use support for c10 via a CMake option defaulting to off. We also remove some unused header inclusions and fix a trivial inclusion error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100304
Approved by: https://github.com/ezyang
2023-05-10 15:42:43 +00:00
c658732950 [RFC] Add tqdm to benchmarking script (#100969)
Here's what it looks like, on a slower running benchmark:

https://github.com/pytorch/pytorch/assets/13564/47c4a5bd-e963-45de-a15c-2fd943de0fa4

There's actually quite a bit of dead time; it's possible there are more spots we should add tqdm to. Looking for opinions on the utility of this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100969
Approved by: https://github.com/Skylion007
2023-05-10 15:39:24 +00:00
0fbe55ea8f [FSDP][state_dict] Make sharded_state_dict work with composable fully_shard (#100856)
The current implementation of sharded_state_dict only works with wrapper based FSDP (both use_orig_params and not use_orig_params work) but not with fully_shard. This PR changes the implementation of sharded_state_dict when loading to fix the incompatibility.

Differential Revision: [D45626856](https://our.internmc.facebook.com/intern/diff/D45626856/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45626856/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100856
Approved by: https://github.com/awgu, https://github.com/zhaojuanmao
2023-05-10 15:32:45 +00:00
9ba2bfea9c [PG Wrapper] Add diff capability (#100214)
Currently we print out the mismatched collectives, but it is hard to
tell what exactly the mismatch is. This diff adds functionality to detect the exact mismatch
and report it.

New error is as follows:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1151423632, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))). Collectives differ in the following aspects: Op type: ALLREDUCEvs REDUCE
```

i.e. the "Collectives differ in the following..." messaging is added.

Differential Revision: [D45375737](https://our.internmc.facebook.com/intern/diff/D45375737/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100214
Approved by: https://github.com/H-Huang
2023-05-10 15:32:30 +00:00
9ff547a57f Revert "Fix ordered dict loading with LibTorch (#100743)"
This reverts commit d371a890a21ab4b39905ed1797da6a15c0c43f53.

Reverted https://github.com/pytorch/pytorch/pull/100743 on behalf of https://github.com/jeanschmidt due to New test introduced SerializationTest.SaveStateDict is adding regressions ([comment](https://github.com/pytorch/pytorch/pull/100743#issuecomment-1542400538))
2023-05-10 15:29:14 +00:00
cb668b1291 Optimize split-split pass (#100983)
Summary:
Previously, we were replacing all getitems of a split - even the ones not affected by the pattern. For large split nodes, this was inefficient.

For instance, on an internal ads model the split-split pass took ~1100s. This is down to ~18s after this optimization.

Test Plan:
* Compiled and tested on internal model (compilation time down by ~1100s)
* CI tests

Differential Revision: D45698034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100983
Approved by: https://github.com/jansel
2023-05-10 14:43:03 +00:00
f542b31c9d [export] More robust view->view_copy pass (#100908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100908
Approved by: https://github.com/ydwu4
2023-05-10 14:25:17 +00:00
8a193c6dc5 [DataPipe] Add generated docstring to functional form DataPipe (#100503)
This PR modified the generation process of `datapipe.pyi` to include the doc strings for each DataPipe in functional form.

The new generated `.pyi` file will look like [this](https://gist.github.com/NivekT/95095f14da85a837a0727a19a5ba367c). I have confirmed the doc string will be visible in PyCharm.

You can copy this [file](https://gist.github.com/NivekT/95095f14da85a837a0727a19a5ba367c) and overwrite your local `datapipe.pyi` to validate this change as well.

Note: We need to create a similar change in TorchData to allow DataPipes in that library to show the doc strings as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100503
Approved by: https://github.com/ejguan
2023-05-10 14:06:46 +00:00
51fe53e619 [opinfo] item (#100313)
Follows #100223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100313
Approved by: https://github.com/ezyang
2023-05-10 11:32:45 +00:00
55844dfdbc [FSDP][state_dict] Restore the state_dict_config for NO_SHARD (#100855)
Any change to the user configurations should be temporary. This PR fixes the issue that when NO_SHARD state_dict/load_state_dict is called, the state_dict_config and state_dict_type are changed permanently.

Differential Revision: [D45593313](https://our.internmc.facebook.com/intern/diff/D45593313/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45593313/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100855
Approved by: https://github.com/awgu, https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-05-10 10:01:21 +00:00
a723f1f2b9 fix _privateuse1_tag problem (#100632)
Fix _privateuse1_tag bug in torch/serialization.py
Add device_index after device_type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100632
Approved by: https://github.com/ezyang
2023-05-10 09:53:19 +00:00
5a933d044f [opinfo prims] equal (#100663)
Follows: #100223
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100663
Approved by: https://github.com/ezyang
2023-05-10 08:16:00 +00:00
eqy
33f3dca6b5 [CUDA][CUBLAS] Fix BF16 reduced precision reduction note in docs (#101044)
#100966

CC @ngimel @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101044
Approved by: https://github.com/ngimel
2023-05-10 06:50:58 +00:00
eqy
6e2efd16d8 [CUDA][CUBLAS] Add cuBLAS workspace allocation behavior to docs (#100919)
Adding to the docs for now; hopefully we can move to `cudaMallocAsync`-backed cuBLAS workspaces soon, which should alleviate the recent confusion around `cuBLAS` "leaking" memory through workspaces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100919
Approved by: https://github.com/ngimel
2023-05-10 06:40:26 +00:00
1e89a56a5b Apply static policy correctly to unspec (#98983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98983
Approved by: https://github.com/ezyang
2023-05-10 05:59:12 +00:00
dfa951171a Fix typo in RELEASE.md and README.md (#100536)
Some minor spelling, grammar and typographical mistakes have been fixed in RELEASE.md & README.md files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100536
Approved by: https://github.com/ezyang
2023-05-10 05:06:45 +00:00
2b250e1921 inductor(CPU): fix issue when padding/stride/dilation size is one for cpu weight packing pass (#100951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100951
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-05-10 04:53:37 +00:00
083f88e126 PyTorch -> C++17 (#98209) (#100557)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4f0b524</samp>

This pull request updates the codebase and the documentation to use C++17 instead of C++14 as the minimum required C++ standard. This affects the `ATen`, `c10`, and `torch` libraries and their dependencies, as well as the CI system and the `conda` package metadata.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100557
Approved by: https://github.com/malfet
2023-05-10 04:47:35 +00:00
30cecc0e11 [MPS] Fix build regressions introduced by #92868 (#101036)
https://github.com/pytorch/pytorch/pull/92868 introduced `OBJC` and `OBJCXX` language dialects, but failed to propagate some important flags, like the OpenMP include path (if found), `-fno-objc-arc`, and the `-Wno-unguarded-availability-new` suppression.

This PR remedies that and fixes https://github.com/pytorch/pytorch/issues/100925

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 62677d4</samp>

This pull request improves the support for MPSGraph on Apple platforms by fixing some CMake flags for parallelism and memory management. It modifies `cmake/Dependencies.cmake` and `CMakeLists.txt` accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101036
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-05-10 04:15:41 +00:00
b004c0b3c6 [inductor] default cache_dir in torch._inductor.codecache should be lazily evaluated (#100824)
`getpass.getuser` may raise exceptions in some circumstances, in which case users cannot even override the default cache dir with the env var `TORCHINDUCTOR_CACHE_DIR`. Hence the assembly of the default cache dir should be lazily evaluated.
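A minimal sketch of the lazy evaluation (the default path format here is an assumption, not necessarily inductor's exact layout):
```python
import functools
import getpass
import os
import tempfile

@functools.lru_cache(maxsize=None)
def default_cache_dir() -> str:
    # Only call getpass.getuser() when the cache dir is actually needed, so an
    # environment override works even if getuser() would raise.
    env_dir = os.environ.get("TORCHINDUCTOR_CACHE_DIR")
    if env_dir:
        return env_dir
    return os.path.join(tempfile.gettempdir(), f"torchinductor_{getpass.getuser()}")
```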

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100824
Approved by: https://github.com/ezyang
2023-05-10 03:36:39 +00:00
b06c180a32 CUBLAS Flag (CUBLAS_GEMM_DFALT_TENSOR_OP -> CUBLAS_GEMM_DEFAULT_TENSOR_OP) (#100976)
Looking at the docs https://docs.nvidia.com/cuda/cublas/index.html?highlight=cublasGemmEx#cublasgemmex, it seems like the flag should be `CUBLAS_GEMM_DEFAULT_TENSOR_OP`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100976
Approved by: https://github.com/ezyang
2023-05-10 03:26:31 +00:00
535368f00e [vision hash update] update the pinned vision hash (#101032)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101032
Approved by: https://github.com/pytorchbot
2023-05-10 03:24:19 +00:00
649e609667 [c10d] make ProcessGroupNCCL work.wait() respect timeout (#100162)
Fixes #83486

TestDistBackendWithSpawn.test_monitored_barrier_allreduce_hang and NcclErrorHandlingTest.test_nccl_timeout passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100162
Approved by: https://github.com/ezyang
2023-05-10 03:07:47 +00:00
b33c9c7c9f [inductor] support vec type conversion between float and bool (#100950)
Fix https://github.com/pytorch/pytorch/issues/100466
Fix https://github.com/pytorch/pytorch/issues/100800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100950
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-05-10 02:16:06 +00:00
44e73da444 Extend assert statement to include ListVariable (#100841)
Fixes #100697
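A hedged repro sketch of the kind of pattern this change concerns (assumption: an `assert` over a list argument inside a compiled function, along the lines of issue #100697; the actual unit test may differ):

```python
import torch

@torch.compile
def f(xs):
    # The assert below is evaluated on a list, which Dynamo models as a ListVariable.
    assert xs, "expected a non-empty list of tensors"
    return xs[0] + 1

print(f([torch.tensor(1.0), torch.tensor(2.0)]))
```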

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100841
Approved by: https://github.com/lezcano, https://github.com/anijain2305
2023-05-10 01:57:10 +00:00
27d5019e39 STFT: correct stft definition and better document tensor shapes (#100427)
Fixes #100177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100427
Approved by: https://github.com/lezcano
2023-05-10 01:42:01 +00:00
2241aaa60c Revert "[quant][pt2e] Change input act annotation to a map and allow dynamic quantization for non zeroth argument (#101005)"
This reverts commit f08ddae8885bc703408b949642d4a5bee30efce8.

Reverted https://github.com/pytorch/pytorch/pull/101005 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/101005#issuecomment-1541143426))
2023-05-10 01:27:47 +00:00
76cc3ab4f3 [CI] Delete skips from https://github.com/pytorch/pytorch/issues/93847 (#96049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96049
Approved by: https://github.com/jansel
2023-05-10 01:27:27 +00:00
bf214f40d4 explicitly check or discard cudaGetLastError return value (#100488)
cudaGetLastError and hipGetLastError will clear any error value within CUDA and HIP, respectively. This is often done on purpose to clear benign errors. Discarding the return value should be indicated by casting to void and a nearby comment. This silences warnings from HIP:

warning: ignoring return value of function declared with 'nodiscard' attribute [-Wunused-result]

Performing an audit of pytorch sources found one use of cudaGetLastError that was incorrectly ignored in IndexKernel.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100488
Approved by: https://github.com/ezyang
2023-05-10 01:24:07 +00:00
f08ddae888 [quant][pt2e] Change input act annotation to a map and allow dynamic quantization for non zeroth argument (#101005)
Summary:
Previously the node annotation looks like the following:
```
node.meta["..."] = {
    "input_act_obs_or_fq_ctr": ...,
    "weight_obs_or_fq_ctr": ...,
    "weight_index": 1,
}
```
Basically we needed to specify the index for the weight and also have a separate key for the weight config; in this PR we changed that to:
```
node.meta["..."] = {
    "input_act_obs_or_fq_ctr_map": {input_node: ..., weight_node: ...},
}
```
This can support specifying the observer/fake quant constructor for any argument of the node

Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_resnet18_with_quantizer_api (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2EModels)'

Reviewed By: kimishpatel

Differential Revision: D45553195

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101005
Approved by: https://github.com/kimishpatel
2023-05-10 00:42:25 +00:00
20a231b55b [BE] Prevent pytest from thinking this class defines any tests (#100949)
Fixes the error:

```
/var/lib/jenkins/pytorch/test/inductor/test_torchinductor.py:6021: PytestCollectionWarning: cannot collect test class 'TestFailure' because it has a __init__ constructor (from: test/inductor/test_torchinductor.py)
  class TestFailure:
```

It does so by marking the class as not actually being a test class, despite its name starting with `Test`.
For more details see: https://stackoverflow.com/a/72465142/21539
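A sketch of one common way to do this (the PR may use a different mechanism; the `__init__` signature below is hypothetical):

```python
class TestFailure:
    # Tell pytest this is a helper, not a test case, even though the name starts with "Test".
    __test__ = False

    def __init__(self, suffixes, is_skip=False):
        self.suffixes = suffixes
        self.is_skip = is_skip
```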

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100949
Approved by: https://github.com/huydhn
2023-05-10 00:20:11 +00:00
6308563a39 Enable reordering pass (#100747)
Restricts the reordering pass to only reorder nodes before the `copy_` epilogue that functionalization generates.

Brings `python benchmarks/dynamo/torchbench.py  --performance  --backend inductor --amp --inference --only hf_Bert` from 1.46 -> 1.49

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100747
Approved by: https://github.com/jansel
2023-05-09 22:36:57 +00:00
7da8705f18 [dynamo 3.11] fix segfault when printing stack trace (#99934)
Dynamo will frequently segfault when attempting to print stack traces. We fix this by:
- Fixing stack size calculations, as we did not account for exception tables
- Creating shadow execution frames in a way that more closely resembles what CPython does to create its execution frames

Dynamo/inductor-wrapped pytorch tests are enabled up the stack - those need to be green before this PR can be merged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99934
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/jansel
2023-05-09 22:12:45 +00:00
d57544d39a Revert "fix specify_constraints's signature when exporting model (#100739)"
This reverts commit b0a372e1faef97adf46bab08510c7e0abdff2611.

Reverted https://github.com/pytorch/pytorch/pull/100739 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100739#issuecomment-1540920698))
2023-05-09 21:42:35 +00:00
4b8127b90e Revert "[Dynamo] Fix nested function resume execution (#100426)"
This reverts commit d719f0276d69a8315b65f4c4500cfc1cdaddb025.

Reverted https://github.com/pytorch/pytorch/pull/100426 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100426#issuecomment-1540915913))
2023-05-09 21:32:13 +00:00
fa6df34d30 [ET selective build] add kernel metadata section to selective_build.yaml (#100665)
Summary:
For each op, we have a List[List[dtype;dim-order]]:
  - the inner list contains the `dtype;dim-order` info for each arg if we have a Tensor/TensorList/OptionalTensorList
  - the outer list contains different occurrences of dtype/dim-order combinations for that op in the program

Example:
```
et_kernel_metadata:
  aten::add.out:
    # A list of different dtype/dim-order combinations used in model
    - # Each contains the list of args of Tensor dtype and dim order if applicable
      - FLOAT;0,1
      - FLOAT;0,1
      - NON_TENSOR_ARG
      - FLOAT;0,1
      - FLOAT;0,1
    -
      - INT;0,1
      - INT;0,1
      - NON_TENSOR_ARG
      - INT;0,1
      - INT;0,1
  aten::mul.out:
    - - FLOAT;0,1
      - FLOAT;0,1
      - FLOAT;0,1
      - FLOAT;0,1
```

We don't have the arg name so far; we need to parse the schema (functions.yaml) to get that info.  We depend on the order of args from that file.

Test Plan: `buck run fbcode//executorch/codegen/tools:test_gen_oplist_real_model`

Differential Revision: D45551409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100665
Approved by: https://github.com/larryliu0820
2023-05-09 21:30:01 +00:00
35834a405c Run C++ tests on CI with run_test.py (#99956)
After https://github.com/pytorch/pytorch/pull/99559, we can now run C++ tests with `run_test.py`. Although advanced features such as `--import-slow-tests` and `--import-disabled-tests` won't work for now, there will still be a gain in reliability and performance as C++ tests can now be retried and run in parallel.

This covers all C++ tests in the CI including aten, libtorch, and Vulkan C++ tests across all platforms Linux, Windows, MacOS.

Notes:
* To support C++ test discovery, the env variable `CPP_TESTS_DIR` can be set to where the C++ test binaries are located
* Support pytest -k argument via run_test as this is used by pytest-cpp to replace `--gtest-filter`
* The XML output is in pytest format, but it's ok now because we don't have slow test or flaky test support for C++ test yet
* ~~I need to figure out why conftest.py doesn't work when I invoke pytest directly for C++ test, so `--sc` is not available for C++ tests at the moment.  Proper pytest plugin like stepwise works fine though.  I'll investigate and fix it in a separate PR~~ Found the cause, `conftest.py` is per directory and needs to be in any arbitrary directory that holds C++ test
* Two tests `test_api` and `test_tensorexpr` timed out on ASAN, I suspect that ASAN is now used on top of the python executable, which is slower than running native C++ code.  IMO, it's ok to run these tests as before on ASAN for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99956
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-05-09 21:24:12 +00:00
a8c2cd1039 Add CUTLASS-based MM for structured sparse linear operator (#100485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100485
Approved by: https://github.com/cpuhrsch
2023-05-09 21:05:15 +00:00
d63e0b1578 [optim] More cleanup and reorg of test_optim.py (#100917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100917
Approved by: https://github.com/albanD
2023-05-09 21:03:15 +00:00
d0dab772df [BE][optim] Remove objects from being globals and comment to clarify (#100899)
What happened in this PR?

1. Added comments to clarify rosenbrock
2. Moved global objects to be within classes for better readability/grouping
3. Renamed dnn to cnn for consistency

This is the very first of the cleanup of test_optim.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100899
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-05-09 21:03:15 +00:00
e353013aa4 [Vulkan] Ensure non-zero divisors in Vulkan API Tests (#100909)
Summary:
This fixes flakiness of div_to_scalar_wrapped
See [here](b89f74aa35) for flakiness of div_to_scalar_wrapped

Test Plan:
On Devserver:
```
LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run //xplat/caffe2:pt_vulkan_api_test_bin
```

On Mac:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```

To test that these changes fixed flakiness of div_to_scalar_wrapped, I ran the test 1000 times on devserver before the changes, and observed failures. Then ran it 1000 times after the changes and didn't observe any failures.

Reviewed By: SS-JIA

Differential Revision: D45670642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100909
Approved by: https://github.com/SS-JIA
2023-05-09 20:55:33 +00:00
17fec516fe [Vulkan] Test conv2d after division (#100910)
Summary: This tests running a conv2d with clamp after dividing the input tensor by another tensor. Both tensors have a number of channels = 3 (i.e. not a multiple of 4), and therefore the channel dimension was padded. Hence, we are testing our divide-by-zero fix (D44392406)

Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="VulkanAPITest.conv2d_clamp_after_div"
```

Reviewed By: SS-JIA

Differential Revision: D44550026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100910
Approved by: https://github.com/SS-JIA
2023-05-09 20:30:59 +00:00
9035b6a651 Allow disable binary build jobs on CI (#100754)
Given the recent outage w.r.t. binary workflows running on CI, I want to close the gap between them and regular CI jobs. The first part is to add the same filter step used by regular CI jobs so that oncalls can disable the job if needed.

* Nightly runs are excluded as they include the step to publish nightly binaries. Allowing oncalls to disable this part requires more thought, so this covers only CI binary build and test jobs
* As binary jobs don't have a concept of a test matrix config, which is a required parameter to the filter script, I use a pseudo input of test config default there

### Testing

* https://github.com/pytorch/pytorch/issues/100758.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911034089/jobs/8768782689
* https://github.com/pytorch/pytorch/issues/100759.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911033966/jobs/8768713669

Note that Windows binary jobs are not run in PR anymore after https://github.com/pytorch/pytorch/pull/100638, and MacOS binary jobs only run nightly.  So there are only Linux jobs left.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100754
Approved by: https://github.com/ZainRizvi
2023-05-09 20:01:00 +00:00
c3f3cb5b0f [quant][pt2e] Support conv bn fusion in convert step for QAT flow (#100442)
Summary:
This PR adds support for folding bn weights into conv for QAT flow, this is equivalent
to the QAT branch of `from_float` in eager mode quantized conv module: https://github.com/pytorch/pytorch/blob/main/torch/ao/nn/quantized/modules/conv.py#L223

Items that needs followup:
* there are some workarounds I did because quantize_per_tensor is using float/int args and dynamo does not support these args; this needs to be fixed after we change the quantized model representation and also change these args to Tensor

Test Plan: buck2 test @//mode/opt //caffe2/test:quantization_pt2e -- --exact 'caffe2/test:quantization_pt2e - test_convert_qat_conv_bn_fusion (quantization.pt2e.test_quantize_pt2e.TestQuantizePT2E)'

Reviewed By: andrewor14

Differential Revision: D45344281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100442
Approved by: https://github.com/kimishpatel
2023-05-09 19:43:51 +00:00
f92b3e1477 [MPS][BE] std::is_same::value -> std::is_same_v (#100975)
PyTorch is C++17 project, so let's use some C++17 features.

I.e. `s/std::is_same<X, Y>::value/std::is_same_v<X, Y>`
And use `if constexpr` in few places when this construct is used.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 7b7683f</samp>

> _We're sailing on the sea of code, we're making it more neat_
> _We're using `is_same_v` and `if constexpr` to keep it sweet_
> _We're refactoring the range tensor logic, we're avoiding duplication_
> _We're heaving on the ropes of `Distributions.mm`, on the count of three, with elation_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100975
Approved by: https://github.com/jeanschmidt, https://github.com/albanD, https://github.com/kulinseth, https://github.com/Skylion007
2023-05-09 19:27:28 +00:00
858657090b Make sure we get full file path for filtering in pr-sanity-check (#100978)
The filename gets a `.../` path, which makes the next diff step not work for files that are too deeply nested.
This can be seen in https://github.com/pytorch/pytorch/actions/runs/4920598180/jobs/8789553971?pr=100583, for example, where most files are ignored.

The detection properly works after this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100978
Approved by: https://github.com/seemethere
2023-05-09 18:42:42 +00:00
5970fb402e C++ CustomClass in Python: indicate which methods are not implemented (#100171)
Without these changes, it can be hard to know which magic methods are not implemented on a given ScriptObject.

before:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError
```

after:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError: '__len__' is not implemented for __torch__.torch.classes.somelib.SomeClass
```

------

I could not find a linked issue, if you want me to open one as well I can do this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100171
Approved by: https://github.com/ezyang
2023-05-09 18:41:40 +00:00
0073d4cd27 Update FBGEMM submodule (#100236)
To [pytorch/FBGEMM@d0ee798](d0ee798b1f) that among other things includes [pytorch/FBGEMM#1672](https://github.com/pytorch/FBGEMM/pull/1672)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100236
Approved by: https://github.com/ezyang
2023-05-09 18:39:28 +00:00
d98d95fb9f Revert "[Dynamo] Remove cross import in dynamo unit tests (#100851)"
This reverts commit c4bbeb5b8a27259fc2b644e3f185b4ba859a2d39.

Reverted https://github.com/pytorch/pytorch/pull/100851 on behalf of https://github.com/jeanschmidt due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100851#issuecomment-1540646623))
2023-05-09 18:30:01 +00:00
6aa80beca1 [c10d] Implement new Store methods in TCPStore. (#100383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100383
Approved by: https://github.com/fduwjj
2023-05-09 17:43:16 +00:00
8769fb854d [BE] Fix flake8 B027 errors - missing abstractmethod decorator (#100715)
Enables B027 and applies fixes by adding abstract method decorators. Autofix generated by ruff master.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100715
Approved by: https://github.com/ezyang
2023-05-09 17:28:48 +00:00
bd18225c04 [functorch] Remove internal assert in index_put batch rule (#100516)
Fixes #94630

Context: https://github.com/pytorch/pytorch/issues/94630#issuecomment-1518946929

I'm not sure how to correctly test it using pure `vmap` without joining `jacrev` here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100516
Approved by: https://github.com/zou3519
2023-05-09 17:23:26 +00:00
19be2bb875 Revert "[MPS] Add support for Custom Kernels (#100661)"
This reverts commit f39cda83d133cdd92e06e3d8cdd91340b43eb2c2.

Reverted https://github.com/pytorch/pytorch/pull/100661 on behalf of https://github.com/malfet due to Break internal builds, but also guarding dispatch_t define behind __OBJC__ guard is not a good practices ([comment](https://github.com/pytorch/pytorch/pull/100661#issuecomment-1540540002))
2023-05-09 17:02:04 +00:00
f558af2a55 [adam] Use the right params in weight_decay, rename for clarity, fixes #100707 (#100973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100973
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-09 17:00:27 +00:00
ba47a2b227 [export] Pickle of ExportGraphModule (#100924)
Try 2 of the reland of https://github.com/pytorch/pytorch/pull/100620 because of a merge conflict 😭...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100924
Approved by: https://github.com/tugsbayasgalan
2023-05-09 16:58:24 +00:00
b71ec6bdf3 Revert "Forward fix lint failure from #100661 (#100907)"
This reverts commit fb69aa15921895a21fe8bd9b1b7807f3d4c1cbe3.

Reverted https://github.com/pytorch/pytorch/pull/100907 on behalf of https://github.com/jeanschmidt due to Required in order to revert #100661 ([comment](https://github.com/pytorch/pytorch/pull/100907#issuecomment-1540504748))
2023-05-09 16:55:20 +00:00
0e08a9b057 Wrap more constraint violation cases to UserError (#100897)
Cases covered in this PR:
 - Example inputs conflict with input constraints
 - Example inputs conflict with inline constraints
 - Suggest users to use `constrain_as_*()` when trying to export with data-dependent operations

Differential Revision: [D45666627](https://www.internalfb.com/diff/D45666627)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100897
Approved by: https://github.com/avikchaudhuri
2023-05-09 16:44:57 +00:00
b179d34a19 Handle negative padding in reflect_pad_backward (#100923)
Fixes #100793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100923
Approved by: https://github.com/jansel
2023-05-09 16:30:48 +00:00
92a7640b76 Add mul tests with sparse sample inputs (#100393)
This PR implements sparse sample inputs and error inputs for mul OpInfo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100393
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-05-09 16:13:14 +00:00
0141a242fd bsr_dense_bmm(): remove sparse_rowspace kernel and some dead code (#100876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100876
Approved by: https://github.com/cpuhrsch, https://github.com/Skylion007
2023-05-09 16:12:11 +00:00
793bd6993a Work around torchdynamo import error with functional collectives (#100901)
Summary:
Currently there are build configs where the torchdynamo import trips over a
strange SystemError related to some module's __dict__.items() returning NULL,
while torchdynamo tries to iterate all torch modules and process them for
its allowed functions list.

While this is hard to repro, we should be able to work around it and then fix
it properly.

Test Plan: Rely on others to test this, assuming CI passes.

Reviewed By: anijain2305

Differential Revision: D45663313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100901
Approved by: https://github.com/yanboliang, https://github.com/malfet
2023-05-09 16:09:42 +00:00
93aac15d82 make torch/csrc/jit/backends/coreml/objc/PTMCoreMLFeatureProvider.mm data_ptr-correct (#100886)
make torch/csrc/jit/backends/coreml/objc/PTMCoreMLFeatureProvider.mm data_ptr-correct

Summary:
https://developer.apple.com/documentation/coreml/mlmultiarray shows
that this is looking for a mutable input and is permitted to mutate
the data in subsequent operations.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100886
Approved by: https://github.com/Skylion007
2023-05-09 15:35:48 +00:00
c5d7226ab9 Upgrade torchbench pin (#100937)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100937
Approved by: https://github.com/albanD
2023-05-09 15:34:27 +00:00
1ea224c2a4 make torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp data_ptr-correct (#100888)
make torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100888
Approved by: https://github.com/ezyang
2023-05-09 15:29:08 +00:00
bc3108c2e2 make torch/csrc/jit/runtime/register_prim_ops.cpp data_ptr-correct (#100832)
make torch/csrc/jit/runtime/register_prim_ops.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100832
Approved by: https://github.com/ezyang
2023-05-09 15:07:54 +00:00
de02c8bed4 Revert "Rename DispatchKey.PrivateUse1 to custom device in torchgen. (#99406)"
This reverts commit c0ecd9895831f9329bd189de9b1e28ad68c93b5b.

Reverted https://github.com/pytorch/pytorch/pull/99406 on behalf of https://github.com/ezyang due to we're doing it another way ([comment](https://github.com/pytorch/pytorch/pull/99406#issuecomment-1540295309))
2023-05-09 15:04:16 +00:00
01476465dd Revert "add a cast function that suppresses -Wcast-function-type-strict (#100170)"
This reverts commit 642f4ed606b3d66bab21d44019ae5762637eeca9.

Reverted https://github.com/pytorch/pytorch/pull/100170 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100170#issuecomment-1540140636))
2023-05-09 13:56:48 +00:00
d371a890a2 Fix ordered dict loading with LibTorch (#100743)
Fixes #100741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100743
Approved by: https://github.com/Skylion007
2023-05-09 13:52:45 +00:00
a3f656cc6c use const_data_ptr as source of std::copy (#100885)
use const_data_ptr as source of std::copy

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100885
Approved by: https://github.com/Skylion007
2023-05-09 13:47:34 +00:00
36d91b5513 Add differentiable mkldnn_rnn_layer_backward to support double backward of LSTM (#100627)
### Description

This PR is to fix #99413, which shows the limitation of double backward using oneDNN in LSTM.

This PR does not implement the double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.

During the backward process, we need the gates and hidden states between cells within one layer. However, these intermediate variables are stored in the `workspace`, and it is hard to retrieve them from there. Therefore, in backward, we need to re-calculate them first.

A corresponding UT has been added based on the failing case in #99413. The UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because that UT only supports the `double` datatype, while oneDNN does not support it.
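A minimal sketch (not from the PR) of the double-backward pattern this change enables; whether the oneDNN path is taken depends on the build, so this is illustrative only:

```python
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=4)
x = torch.randn(2, 3, 4, requires_grad=True)  # (seq_len, batch, input_size)
out, _ = lstm(x)

# First-order gradient, kept differentiable so it can be differentiated again.
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)

# Second-order backward: the kind of call that issue #99413 exercises.
grad_x.sum().backward()
print(x.grad.shape)
```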

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5, https://github.com/soulitzer
2023-05-09 12:58:57 +00:00
d261e43c37 [fix] cat_slice_cat : slice with negative size (#100828)
Fixes https://github.com/pytorch/pytorch/issues/100807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100828
Approved by: https://github.com/ngimel
2023-05-09 11:53:13 +00:00
622e582a2b Register get_cpu_capability for jit (#100723)
Description:

Context: In torchvision we ensure that functional ops are torchscriptable. The recently exposed `torch.backends.cpu.get_cpu_capability()` from https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
  File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
    - "AVX512"
    """
    return torch._C._get_cpu_capability()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557

In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.
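A minimal sketch of what this registration enables (assuming the wrapper in `torch.backends.cpu` becomes scriptable once the builtin is registered; the actual test may differ):

```python
import torch

@torch.jit.script
def cpu_capability() -> str:
    # Scripting this call used to fail with "Python builtin ... is currently not supported".
    return torch.backends.cpu.get_cpu_capability()

print(cpu_capability())
```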

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723
Approved by: https://github.com/albanD
2023-05-09 09:52:29 +00:00
c4bc259f00 bsr_dense_mm(): better test coverage (#100543)
This PR improves test coverage for `bsr_dense_mm` by:
- ~~enabling correctness tests for `float32`~~.
- extending and testing input correctness checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100543
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-05-09 09:26:02 +00:00
43127f19f1 Revert "Allow disable binary build jobs on CI (#100754)"
This reverts commit 4c3b52a5a99e8aed2576773829e6a353a65aa2b4.

Reverted https://github.com/pytorch/pytorch/pull/100754 on behalf of https://github.com/huydhn due to The subset of Windows binary jobs running only in trunk fails because the runners do not have Python setup ([comment](https://github.com/pytorch/pytorch/pull/100754#issuecomment-1539586399))
2023-05-09 07:15:32 +00:00
4c3b52a5a9 Allow disable binary build jobs on CI (#100754)
Given the recent outage w.r.t. binary workflows running on CI, I want to close the gap between them and regular CI jobs. The first part is to add the same filter step used by regular CI jobs so that oncalls can disable the job if needed.

* Nightly runs are excluded as they include the step to publish nightly binaries. Allowing oncalls to disable this part requires more thought, so this covers only CI binary build and test jobs
* As binary jobs don't have a concept of a test matrix config, which is a required parameter to the filter script, I use a pseudo input of test config default there

### Testing

* https://github.com/pytorch/pytorch/issues/100758.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911034089/jobs/8768782689
* https://github.com/pytorch/pytorch/issues/100759.  The job is skipped in https://github.com/pytorch/pytorch/actions/runs/4911033966/jobs/8768713669

Note that Windows binary jobs are not run in PR anymore after https://github.com/pytorch/pytorch/pull/100638, and MacOS binary jobs only run nightly.  So there are only Linux jobs left.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100754
Approved by: https://github.com/ZainRizvi
2023-05-09 06:53:34 +00:00
e72385af20 [Reducer] Move require_finalize_ (#100782)
This doesn't need to be set in the loop, just once.

Differential Revision: [D45632426](https://our.internmc.facebook.com/intern/diff/D45632426/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100782
Approved by: https://github.com/Skylion007, https://github.com/fegin
2023-05-09 05:30:55 +00:00
d90f71ea0b [PG NCCL] Provide work obj in postProcess (#100781)
This will be needed for allreduce_sparse so we can set the outputs on
the work object.

Differential Revision: [D45632428](https://our.internmc.facebook.com/intern/diff/D45632428/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100781
Approved by: https://github.com/fegin
2023-05-09 05:30:55 +00:00
a0752b68e7 [BE] Remove empty pre and post proc functions (#100780)
These functions are noops, so just use the collective() override.

Differential Revision: [D45632429](https://our.internmc.facebook.com/intern/diff/D45632429/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100780
Approved by: https://github.com/Skylion007, https://github.com/fegin
2023-05-09 05:30:52 +00:00
7012600abe fix cpu autocast check in rnn (#100621)
https://github.com/pytorch/pytorch/pull/100100 added type checking, but `torch.is_autocast_enabled()` always returns `False` on CPU. This PR fixes the autocast check for CPU.
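A minimal sketch of the distinction described above, as of this change (the no-argument `torch.is_autocast_enabled()` tracks CUDA autocast, so CPU autocast has to be queried with the CPU-specific helper):

```python
import torch

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    print(torch.is_autocast_enabled())      # False: this flag does not reflect CPU autocast
    print(torch.is_autocast_cpu_enabled())  # True
```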

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100621
Approved by: https://github.com/albanD
2023-05-09 04:34:18 +00:00
26cd958718 Support runtime assertion for inline constraints (#100763)
This PR does the following:
1. Previously, inline constraints were not properly set for tensor-output data-dependent ops such as a.nonzero, because their return value is not a SymInt. This PR just uses all the unbacked symbols, i.e. those starting with "i"/"f", in the create_unbacked_sym* functions. Note that these symbols are guaranteed to be a superset of the inline user constraints.

2. Adds support for inline assertions by inserting runtime checks into the graph.

Currently, it only deals with tensor, SymInt, SymFloat, and SymBool output data-dependent ops and ignores the rest. That's good enough for now as we only have a limited number of data-dependent ops (.item and .nonzero are explicitly tested).

Examples of graphs with the added assertions are shown below:

```
class ExportGraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: i64[s0], = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        nonzero_default: i64[i0, 1] = torch.ops.aten.nonzero.default(arg0);  arg0 = None
        return pytree.tree_unflatten([nonzero_default], self._out_spec)

class GraphModule(torch.nn.Module):
    def forward(self, x):
        arg0: i64[s0], = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
        sym_size: Sym(s0) = torch.ops.aten.sym_size(arg0, 0)
        nonzero_default: i64[i1, 1] = torch.ops.aten.nonzero.default(arg0);  arg0 = None
        sym_size_1: Sym(i1) = torch.ops.aten.sym_size(nonzero_default, 0)
        ge: Sym(i1 >= 3) = sym_size_1 >= 3
        scalar_tensor_default: f32[] = torch.ops.aten.scalar_tensor.default(ge);  ge = None
        _assert_async_msg = torch.ops.aten._assert_async.msg(scalar_tensor_default, 'nonzero_default.shape[0] is outside of inline constraint [3, 5].');  scalar_tensor_default = None
        le: Sym(i1 <= 5) = sym_size_1 <= 5;  sym_size_1 = None
        scalar_tensor_default_1: f32[] = torch.ops.aten.scalar_tensor.default(le);  le = None
        _assert_async_msg_1 = torch.ops.aten._assert_async.msg(scalar_tensor_default_1, 'nonzero_default.shape[0] is outside of inline constraint [3, 5].');  scalar_tensor_default_1 = None
        return pytree.tree_unflatten([nonzero_default], self._out_spec)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100763
Approved by: https://github.com/tugsbayasgalan
2023-05-09 04:19:57 +00:00
75e4214f92 Fix recursive_store for smaller elementSize (#100902)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 8cbc54f</samp>

Add support for symbolic integers of different sizes in `tensor_new.cpp`. Use a switch statement to cast them to the appropriate fixed-width integer type.

Fixes crash reported in https://github.com/pytorch/pytorch/issues/100455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100902
Approved by: https://github.com/ngimel
2023-05-09 04:10:29 +00:00
cecfcf1e17 [MPS] Handle MPS failures of test_modules.py in common_modules.py (#95334)
- Also cleaned up `test_modules.py` from skipMPS code.
- Added `skipMPS` for unsupported or failing tests on MPS backend in common_modules.py.
   (We'll remove `skipMPS` from those tests once a fix is available for them.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95334
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-05-09 03:55:16 +00:00
97bb4c2538 [vision hash update] update the pinned vision hash (#100926)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100926
Approved by: https://github.com/pytorchbot
2023-05-09 02:51:29 +00:00
9eab13fc90 Reenable llama benchmark (#100877)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100877
Approved by: https://github.com/albanD
2023-05-09 01:12:54 +00:00
5ef50ef2d8 [caffe2] Remove inline keyword of function CUDACachingAllocator::format_size (#100734)
Summary: `CUDACachingAllocator::format_size` is used not only in CUDACachingAllocator.cpp but also in CUDAMallocAsyncAllocator.cpp. This caused a breakage when the compiler inlined the function and the linker couldn't find it when resolving symbols for CUDAMallocAsyncAllocator.cpp.

Differential Revision: D45612790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100734
Approved by: https://github.com/interwq, https://github.com/kit1980
2023-05-09 01:03:39 +00:00
4447dfa673 Remove MacOS workflow step to disable XProtect (#100692)
I added this a few weeks back trying to fix the flaky missing-dependencies issue on MacOS in https://github.com/pytorch/pytorch/pull/99506. But I think this step is not really needed. More importantly, it has started to hang on MacOS 13, for example https://github.com/pytorch/pytorch/actions/runs/4889081518/jobs/8727397905. The reason is unclear, but this should be removed nonetheless.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100692
Approved by: https://github.com/ZainRizvi
2023-05-09 00:34:24 +00:00
660a0d8622 [Functorch] Skip docs setup if called in optimize mode (#100750)
Test plan: `python3 -OO -c "import torch._functorch.deprecated"`

Fixes https://github.com/pytorch/pytorch/issues/100680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100750
Approved by: https://github.com/albanD
2023-05-08 23:36:57 +00:00
16a4075327 Throw if 'dropout' argument name but func does not have nondeterministic_seeded (#100771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100771
Approved by: https://github.com/ezyang
2023-05-08 23:34:28 +00:00
9a811d1df2 [BE] Update notes linkage in common_device_type, fix very minor grammar (#100898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100898
Approved by: https://github.com/jbschlosser, https://github.com/soulitzer
2023-05-08 22:21:57 +00:00
812cadf90a [3/n] loading meta to device (#100495)
Summary: Make it possible to `torch.jit.load(model, device)` to a target device when `model` contains weights that are on the `meta` device. Just leave the `meta` weights on `meta`, and load the weights that can be loaded to the target device.

Reviewed By: singlaiiit, RoshanPAN, sayitmemory

Differential Revision: D45099145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100495
Approved by: https://github.com/houseroad
2023-05-08 22:14:38 +00:00
bde7b81f34 [S337714] Back out "[PyTorch] Don't do extra numel() check in TensorImpl::data() (#98090)" (#100893)
Summary:
Original commit changeset: 9875964c3b32

Original Phabricator Diff: D44586464

Reviewed By: drdarshan

Differential Revision: D45664329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100893
Approved by: https://github.com/xush6528
2023-05-08 21:56:44 +00:00
2d2f716ddc [export] Fix cond for pass_base (#100836)
I ported over the code for the inline interpreter incorrectly in the pass base 😅

Originally the function `make_inline_interpreter` was supposed to take in an fx.Interpreter type, but I accidentally passed in an fx.Interpreter object. I also realized while modifying this diff (and from Tugsuu's comments) that we don't really need this InlineInterpreter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100836
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2023-05-08 21:51:03 +00:00
b0a372e1fa fix specify_constraints's signature when exporting model (#100739)
Currently, when f is a Module, the signature should be the `forward` method's signature. For example,

```python
class Module(torch.nn.Module):
    def forward(self, x):
        return x.sin()
mod = Module()
x = torch.ones([3, 3])
torch._dynamo.export(mod, x, constraints=[dynamic_dim(x, 0)])
```
Previously, it prints following:
```python

def specify_constraints(*args, **kwargs):
    return [
        2 <= dynamic_dim(x, 0),
        2 <= dynamic_dim(x, 1),
    ]
```
After the pr, it prints:
```python
def specify_constraints(x):
    return [
        2 <= dynamic_dim(x, 0),
        2 <= dynamic_dim(x, 1),
    ]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100739
Approved by: https://github.com/avikchaudhuri
2023-05-08 21:26:30 +00:00
fb69aa1592 Forward fix lint failure from #100661 (#100907)
https://github.com/pytorch/pytorch/pull/100661 breaks lint check, but it's just some empty spaces, so ... forward fixing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100907
Approved by: https://github.com/ZainRizvi
2023-05-08 21:19:00 +00:00
95f191a248 Always run prioritized tests first, even if they're expected to run serially (#100748)
Today, we prioritize running test files that were edited in the user's PR, with the idea being to run them before we run any other test.

Except, if the modified test is supposed to run serially, then we still end up running it after all the parallelized tests have finished running.

This PR fixes that to _always_ run the prioritized tests before the regular tests, regardless of whether the test is supposed to run serially or in parallel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100748
Approved by: https://github.com/huydhn
2023-05-08 20:23:46 +00:00
c4bbeb5b8a [Dynamo] Remove cross import in dynamo unit tests (#100851)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100851
Approved by: https://github.com/jansel
2023-05-08 20:16:57 +00:00
5079bf3df6 [inductor] Add variable names to MemoryDep (#100308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100308
Approved by: https://github.com/eellison
2023-05-08 20:08:58 +00:00
651b5b0f5f Fix nightly build of C++ docs (#100845)
The fix is to upgrade breathe version (and sphinx accordingly), for example https://github.com/pytorch/pytorch/actions/runs/4898593997/jobs/8749163278.

```
Exception occurred:
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/breathe/renderer/sphinxrenderer.py", line 104, in DomainDirectiveFactory
    'function': (python.PyModulelevel, 'function'),
AttributeError: module 'sphinx.domains.python' has no attribute 'PyModulelevel'
```

This was missed in https://github.com/pytorch/pytorch/pull/100601 because `RUN_DOXYGEN` is only set to true in the nightly job.  Specifically, the 2 plugins `breathe` and `exhale` are only used when `RUN_DOXYGEN` is set to true https://github.com/pytorch/pytorch/blob/main/docs/cpp/source/conf.py#L37-L42

### Testing

https://github.com/pytorch/pytorch/actions/runs/4910813882/jobs/8771541636 passes with RUN_DOXYGEN set to true and C++ docs looks ok https://docs-preview.pytorch.org/100845/cppdocs/index.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100845
Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/ZainRizvi, https://github.com/malfet
2023-05-08 20:06:51 +00:00
f39cda83d1 [MPS] Add support for Custom Kernels (#100661)
- This change introduces these APIs to enable developing custom kernels on the MPS Stream:
`torch::mps::get_command_buffer()`
`torch::mps::get_dispatch_queue()`
`torch::mps::commit()`
- Add ObjC test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100661
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-08 20:05:46 +00:00
d9d98b4d54 Skip DNS host fix on ROCm runners (#100861)
The quick fix in https://github.com/pytorch/pytorch/pull/100507 is only needed on Linux runners, not ROCm as the latter doesn't have the corresponding step in [setup-linux](https://github.com/pytorch/pytorch/pull/100436).  ROCm runners use `setup-rocm` action instead, and it doesn't seem to have any issue with DNS, so there is no need to add the quick fix to ROCm.

This is currently breaking ROCm binary test job in trunk, for example https://github.com/pytorch/pytorch/actions/runs/4905878029/jobs/8761271482.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100861
Approved by: https://github.com/ZainRizvi, https://github.com/jithunnair-amd
2023-05-08 19:16:41 +00:00
aaa1323c97 remove double doc upload after CloudFront fix (#99032)
Grace period is over. See pytorch/test-infra#3894 for details. This needs to be merged simultaneously with the CloudFront update to avoid disruption.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99032
Approved by: https://github.com/kit1980, https://github.com/atalman
2023-05-08 19:09:48 +00:00
116e04be29 Initialize view_replay_enabled_ in the AutogradState ctor (#100822)
Cruise uses [clang static analyzer](https://clang-analyzer.llvm.org/) internally.
In the v2.0.0 release of PyTorch it found this problem

```
In file included from external/pytorch/aten/src/ATen/ATen.h:7:
In file included from external/pytorch/aten/src/ATen/Context.h:3:
In file included from external/pytorch/aten/src/ATen/CPUGeneratorImpl.h:3:
In file included from external/pytorch/aten/src/ATen/core/Generator.h:22:
In file included from external/pytorch/c10/core/GeneratorImpl.h:8:
In file included from external/pytorch/c10/core/TensorImpl.h:6:
external/pytorch/c10/core/InferenceMode.h:58:5: warning: Passed-by-value struct argument contains uninitialized data (e.g., field: 'view_replay_enabled_')
    AutogradState::set_tls_state(AutogradState(
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
```

In other words, the value of `view_replay_enabled_` could be uninitialized, which may lead to subtle bugs later on.

This PR addresses the warning by explicitly initializing it to `false`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100822
Approved by: https://github.com/Skylion007
2023-05-08 18:57:14 +00:00
ec144b9412 handle new param from torch.compile (Inductor pattern matcher), enable_log (#100814)
This PR adds a placeholder handler for a new param being passed in from Inductor, `enable_log`.
It fixes the error below, which has prevented me from running torch.compile on NanoGPT:

~~~
File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/fx_passes/fuse_attention.py", line 219, in _sfdp_init
    register_replacement(
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/pattern_matcher.py", line 658, in register_replacement
    search_gm = trace_fn(search_fn, example_inputs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_inductor/pattern_matcher.py", line 828, in training_graph
    aot_function(
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fn' raised:
TypeError: patched_aot_function() got an unexpected keyword argument 'enable_log'
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100814
Approved by: https://github.com/fegin
2023-05-08 18:34:45 +00:00
ccd060abd8 [stronghold][bc-linter] Switch to reusable action, enable for everyone (#100737)
* Switches BC-linter to reusable action (see https://github.com/pytorch/test-infra/pull/4109)
* Removes dogfooding check / enables it for everybody
* Adds the link to the docs/user guide: https://github.com/pytorch/test-infra/wiki/BC-Linter in case of failure

To be merged on Monday, May 8 (BC linter launch date).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100737
Approved by: https://github.com/osalpekar
2023-05-08 18:28:29 +00:00
176ef97fc1 [inductor] Fix bug where a node gets erased twice (#100848)
Fixes #100806

The underlying bug is that if you erase an FX node twice, everything runs without error, but `len(graph.nodes)` reports an incorrect value.
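A minimal repro sketch of the symptom described above, based on the FX behavior at the time of this commit (later versions may warn or guard against the double erase):

```python
import operator
import torch
import torch.fx

def f(x):
    unused = x + 1  # dead value: its node has no users in the traced graph
    return x * 2

gm = torch.fx.symbolic_trace(f)
add_node = next(n for n in gm.graph.nodes if n.target is operator.add)
gm.graph.erase_node(add_node)
gm.graph.erase_node(add_node)  # runs without error...
print(len(gm.graph.nodes))     # ...but the reported length is now off by one
```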

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100848
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2023-05-08 18:24:19 +00:00
0731420645 [PyTorch/Distributed]Only sync buffers when broadcast_buffers is True (#100729)
Summary: Disable the buffer sync in _sync_module_states(...) when broadcast_buffers is False. This change will reduce memory usage when a model has huge buffers and does not need to broadcast them.

Test Plan: .

Differential Revision: D45610709

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100729
Approved by: https://github.com/mrshenli
2023-05-08 16:34:29 +00:00
bfe5f5bbe1 [WIP] enable cuda graphs support for flash attention with dropout (#100196)
Fixes #99905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100196
Approved by: https://github.com/drisspg
2023-05-08 16:19:18 +00:00
bb28f3f519 USE_PRECOMPILED_HEADERS is not supported on Apple M1 (#92868)
Fixes #80018

```bash
MACOSX_DEPLOYMENT_TARGET=12.6 CC=gcc CXX=g++ DEBUG=1 USE_DISTRIBUTED=0 USE_MKLDNN=0 USE_CUDA=0 BUILD_TEST=0 USE_FBGEMM=0 USE_NNPACK=0 USE_QNNPACK=0 USE_XNNPACK=0 USE_PRECOMPILED_HEADERS=1 USE_MPS=1 python setup.py develop
```

`error: Objective-C was disabled in PCH file but is currently enabled`

This PR(https://github.com/pytorch/pytorch/pull/80432) has been reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92868
Approved by: https://github.com/kulinseth, https://github.com/malfet
2023-05-08 16:03:34 +00:00
86ddfc7f68 [inductor] Move cpp wrapper trigger logic to inner_compile (#100611)
Summary: This enables cpp wrapper for backward as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100611
Approved by: https://github.com/jansel
2023-05-08 15:24:02 +00:00
62c53aabdb Revert "[xla hash update] update the pinned xla hash (#100369)"
This reverts commit 41bafb0b7bd75cb5f3f86015d871e731e0ae8488.

Reverted https://github.com/pytorch/pytorch/pull/100369 on behalf of https://github.com/ezyang due to bot ignored signal? ([comment](https://github.com/pytorch/pytorch/pull/100369#issuecomment-1538550434))
2023-05-08 15:21:16 +00:00
6eb0d7541d [pt2] add SymInt support for linalg_qr_backward (#100833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100833
Approved by: https://github.com/ezyang
2023-05-08 13:48:25 +00:00
1e591a8b64 [pt2] add meta function for solve_triangular (#100829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100829
Approved by: https://github.com/ezyang
2023-05-08 13:48:15 +00:00
cd8b82e5c6 bsr_dense_mm(): code refactoring (#100634)
Code unification/refactoring for better re-use. Intended for easier `sampled_addmm` implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100634
Approved by: https://github.com/cpuhrsch
2023-05-08 13:27:39 +00:00
41bafb0b7b [xla hash update] update the pinned xla hash (#100369)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100369
Approved by: https://github.com/pytorchbot
2023-05-08 10:28:08 +00:00
cyy
333de1fdb0 Fix some NVCC warnings (#100823)
PR #95568 enables more NVCC warnings. However, some cu files need to be modified to make building process more warning free. Therefore, this work aims to fix them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100823
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2023-05-08 10:23:01 +00:00
46affcb004 inductor(CPU): skip weight packing when autocast is enabled (#100844)
Currently, the packed op doesn't support autocast, and the packing path happens before AOTAutograd, which changes the default autocast behavior. For now, we disable the packing path; the bfloat16 packing path can work once we move this path to after AOTAutograd (I will do it after https://github.com/pytorch/pytorch/pull/100652 is done).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100844
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-08 07:15:09 +00:00
92cecb8e3c inductor(CPU): don't do binary fusion if binary's inputs are same tensor (#100843)
This PR will fix https://github.com/pytorch/pytorch/issues/100802.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100843
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-08 07:12:32 +00:00
970c60b336 inductor: disable lowmem_dropout on CPU (#100702)
In https://github.com/pytorch/pytorch/pull/97002, we made bernoulli fall back and disabled lowmem_dropout on CPU, which brought significant performance improvements for both bernoulli and dropout.
PR https://github.com/pytorch/pytorch/pull/97931 disabled lowmem_dropout by default and thus removed the code that disabled lowmem_dropout on CPU, but unfortunately it introduced a performance regression on CUDA (https://github.com/pytorch/pytorch/issues/98614). Then https://github.com/pytorch/pytorch/pull/98631 re-enabled lowmem_dropout by default.
As a result, the performance of dropout on CPU has regressed since https://github.com/pytorch/pytorch/pull/98631. This PR re-adds the code to disable lowmem_dropout on CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100702
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-05-08 04:51:16 +00:00
7d0e4e2aa8 Fix AT_USE_JITERATOR checks (#100464)
`AT_USE_JITERATOR` evaluates to false when the definition isn't
included, so these files were not using jiterator at all.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100464
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-05-08 01:55:41 +00:00
3b6a7f4d51 [MPS] Fix index_put with deterministic algorithm enabled (#97660)
Prevent using parallel computing when deterministic algorithm is set.

Fixes #97574
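A usage sketch of the deterministic path (requires an MPS-capable machine; shapes and values are illustrative):

```python
import torch

torch.use_deterministic_algorithms(True)  # opt into the serial, deterministic kernel

x = torch.zeros(1024, device="mps")
idx = torch.randint(0, 1024, (512,), device="mps")
src = torch.ones(512, device="mps")
x.index_put_((idx,), src, accumulate=True)
```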

Benchmark:
```
[--------------- index_put_ Deterministic Algorithm Enabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      Dtype: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      Dtype: torch.float32 Features: 1024; Num Indices: 1024  |   54  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   50
      Dtype: torch.float32 Features: 1024; Num Indices: 4096  |  150  |   49

Times are in microseconds (us).

[-------------- index_put_ Deterministic Algorithm Disabled ---------------]
                                                              |  cpu  |  mps
1 threads: -----------------------------------------------------------------
      DType: torch.float32 Features: 1024; Num Indices: 512   |   37  |   49
      DType: torch.float32 Features: 1024; Num Indices: 1024  |   53  |   49
      DType: torch.float32 Features: 1024; Num Indices: 2048  |   86  |   49
      DType: torch.float32 Features: 1024; Num Indices: 4096  |  147  |   50

Times are in microseconds (us).
```

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at ebf2ff3</samp>

Added a deterministic version of `index_put` for MPS tensors that runs on a single thread and can be enabled by a global context flag. Refactored the existing `index_put` function and the kernel selection logic to support both parallel and serial modes. Added a test function to verify the deterministic behavior of `index_put` under different conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97660
Approved by: https://github.com/kulinseth
2023-05-08 00:57:29 +00:00
4154c8ea15 [quant][pt2] Add Conv + BN + ReLU fusion for prepare QAT (#100283)
Summary: This follows https://github.com/pytorch/pytorch/pull/98568,
which lays all the groundwork for Conv + BN fusion in prepare QAT.
Conv + BN + ReLU fusion can reuse the same match and replace
patterns and is handled similarly.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_relu_fusion
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_relu_numerics

Reviewers: kimishpatel, jerryzh168

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [D45515494](https://our.internmc.facebook.com/intern/diff/D45515494)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100283
Approved by: https://github.com/jerryzh168
2023-05-07 20:35:16 +00:00
05e355022f [inductor] track provenance of masks from indices (#100816)
Fixes #100530

When indices for an indirect read are computed rather than read from another tensor, they should be masked according to the index used in the computation. Currently, though, we don't associate masks with index variables, so the computed indices don't have associated masks either. This PR associates masks with index variables when they are created.

With this PR, both the device assert and the masked load are generated; the device assert should hopefully be removed later once the value-analysis PR lands.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100816
Approved by: https://github.com/Chillee, https://github.com/lezcano
2023-05-07 18:51:43 +00:00
953aa6d90e [TP] Enable more generic attn in Tensor Parallelism (#100508)
To make TP more generic for the Attention module, we came up with this new col/rowwise parallel style.

Basically, the idea behind it is:
We only do DTensor ops for the col/rowwise sharded part. For the rest of the ATen ops, we leave them to plain Tensor ops.

We set this behavior as the default for the Colwise and Rowwise parallel styles. If people want to customize it, they can always pass in a different prepare_input or prepare_output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100508
Approved by: https://github.com/wanchaol
2023-05-07 18:15:49 +00:00
03433080e6 [inductor] Support FallbackKernel in cpp wrapper codegen (#100553)
Summary: This works well for ops without kwargs. For ops with kwargs, we
need to register ordered_kwargs_for_cpp_kernel for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100553
Approved by: https://github.com/jansel
2023-05-07 14:33:53 +00:00
cyy
5293dee920 fix missing-prototypes warnings in torch_cpu (Part 3) (#100245)
This PR fixes more missing-prototypes violations in the torch_cpu source following PRs  #100053 and #100147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100245
Approved by: https://github.com/Skylion007
2023-05-07 07:54:44 +00:00
358fe95088 [fix] check for histogramdd when bins is int[] (#100624)
Fixes https://github.com/pytorch/pytorch/issues/93274
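A usage sketch of the input form the added check concerns, i.e. `bins` passed as a list of ints, one count per dimension (values are illustrative):

```python
import torch

x = torch.rand(100, 3)
hist, edges = torch.histogramdd(x, bins=[4, 5, 6])
print(hist.shape)  # torch.Size([4, 5, 6])
```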

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100624
Approved by: https://github.com/lezcano
2023-05-07 05:54:54 +00:00
ca9f55f79d misc. fixes to constraints warnings and errors (#100745)
1. Move constraint violation error after constraint discovery warning, and attach them when we have both.
2. Remove verbose internal traceback for relevant guard in constraint violation error.
3. Remove mention of `assume_static_by_default` in specialization warning.
4. Fix indenting of `specializations` body and make it assert individually instead of returning a conjunction.
5. Remove return annotation on signature used in generated `specializations` and `specify_constraints` functions.
6. Split `&` ranges because we don't support them yet.

Differential Revision: [D45619852](https://our.internmc.facebook.com/intern/diff/D45619852/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100745
Approved by: https://github.com/tugsbayasgalan
2023-05-06 18:22:31 +00:00
0bf9722a3a modify ipex backend (#99499)
The ipex backend of torch.compile calls ipex.compile instead of torch.jit.trace and torch.jit.freeze. ipex.compile will handle the compilation process for IPEX internally.
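
A minimal usage sketch, assuming intel_extension_for_pytorch is installed and registers the `ipex` backend on import:

```python
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the "ipex" backend)

model = torch.nn.Linear(16, 16).eval()
compiled = torch.compile(model, backend="ipex")
out = compiled(torch.randn(4, 16))
```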

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99499
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-06 16:51:13 +00:00
82091d666c [ONNX] Refactor Input/Output Adapter (#100490)
This PR refactors how InputAdapter and OutputAdapter is used throughout the exporter.

During refactoring, API issues with passes (torch.onnx._internal.fx._pass.Transform) were identified and should be tackled separately. In short, some passes can modify the input/output of the model, and the input/output adapters must be kept in sync with such changes; otherwise the adapters will not reflect the actual model input/output. The first instance of this issue was the `ReplaceGetAttrWithPlaceholder` pass, which adds new inputs to the model. To work around this, a new input-adapt step that appends the new inputs (generated by the pass) was introduced. That resulted in the number of inputs of the ONNX model mismatching the number of inputs of the PyTorch model, though.

Follow up on https://github.com/pytorch/pytorch/pull/98421
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100490
Approved by: https://github.com/BowenBao
2023-05-06 16:01:49 +00:00
aa081d8f27 [CI] Update torchtext commit (#100767)
Also, install torchdata as torchtext depends on it

### <samp>🤖 Generated by Copilot at d3ec9e4</samp>

> _`pytorch/data` joins_
> _CI scripts get updated_
> _Winter of changes_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100767
Approved by: https://github.com/ngimel, https://github.com/r-barnes
2023-05-06 15:24:48 +00:00
00d4890218 [c10d] Apply EFA workaround to Store tests. (#100382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100382
Approved by: https://github.com/fduwjj
2023-05-06 15:16:13 +00:00
266c84e3ab [pt2] add meta function for linalg_qr (#100714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100714
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-05-06 15:04:02 +00:00
8d56b0a5b5 remove unused tuple_cat utility (#100731)
remove unused tuple_cat utility

Test Plan: Verified unused with `git grep`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100731
Approved by: https://github.com/ezyang
2023-05-06 12:48:44 +00:00
44caa395cb inductor: fix mm_plus_mm fusion pattern issue when has broadcast add (#100679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100679
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/jgong5
2023-05-06 07:13:53 +00:00
d719f0276d [Dynamo] Fix nested function resume execution (#100426)
Fixes #99665

Let me explain the root cause using the unit test I added:
* This bug is triggered when:
  * ```wrapped``` is a nested function.
  * ```wrapped``` is in another module which is different from the main function ```fn```.
  * There is a graph break inside of ```wrapped```.
* The root cause is that when resuming the nested function, we were actually using the outermost function's globals (```fn```'s in my example), but ```wrapped``` calls ```inner_func```, which is not part of ```fn```'s globals, so we have to set the correct globals when the nested function resumes execution.
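
A minimal sketch of the setup described above (the module layout and names are made up, not the actual unit test):

```python
# helper.py -- a different module from the one defining the compiled function
import torch
import torch._dynamo

def inner_func():
    return 1

def make_wrapped():
    def wrapped(x):                    # nested function
        x = x + inner_func()           # `inner_func` lives in helper.py's globals
        torch._dynamo.graph_break()    # graph break inside the nested function
        return x + inner_func()        # resuming here must still see helper.py's globals
    return wrapped

# main.py
# import torch
# from helper import make_wrapped
#
# wrapped = make_wrapped()
#
# @torch.compile(backend="eager")
# def fn(x):
#     return wrapped(x)
#
# fn(torch.ones(3))
```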

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100426
Approved by: https://github.com/jansel
2023-05-06 05:04:50 +00:00
f73973d789 Expose function to retrieve list of registered loggers (#100776)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100776
Approved by: https://github.com/ezyang
2023-05-06 04:22:28 +00:00
6af509860e Add logcumsumexp forward-ad (#100629)
### <samp>🤖 Generated by Copilot at 8bb6158</samp>

This pull request adds forward and backward AD support for the `logcumsumexp` operator in functorch, a library for composable function transformations. It implements a forward-mode formula and a decomposition in `derivatives.yaml`, a C++ function for computing directional derivatives in `FunctionsManual.cpp`, and updates the tests and metadata in `test_ops.py` and `common_methods_invocations.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100629
Approved by: https://github.com/soulitzer
2023-05-06 04:08:55 +00:00
71c4becda7 [inductor] Track operator counts (#100329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100329
Approved by: https://github.com/ngimel
2023-05-06 03:00:56 +00:00
8360b6c2a8 [c10d] Expose new Store methods. (#100381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100381
Approved by: https://github.com/fduwjj
2023-05-06 02:50:51 +00:00
19d8d31c94 [fbcode/caffe2] Make fmt formatter methods const (#100616)
Summary:
Staging an update to the latest fmt version triggered lots of build errors due to non-`const` methods on custom formatters. This fixes the `format()` methods to be `const` as they don't mutate any state anyway, as well as `parse()` methods that don't need to mutate internal state. This mitigates many future build errors.

Updates were identified and executed by using regular expression search/replacements such as:
`(constexpr auto parse\(ParseContext& [^)]*\)) \{` -> `$1 const {`
`(constexpr auto parse\(ParseContext& [^)]*\)) ->` -> `$1 const ->`
`(auto format\(.*, FormatContext& [^)]*\)) \{` -> `$1 const {`
`(auto format\(.*, FormatContext& [^)]*\)) ->` -> `$1 const ->`

Any changes to third-party code were then reverted. Some small changes detected from subsequent build errors were then applied.

Test Plan: CI

Differential Revision: D45463620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100616
Approved by: https://github.com/davidberard98
2023-05-06 01:38:25 +00:00
a5cb888013 [inductor] Do not try to shape-pad symbolic-sized tensors (#100738)
Summary: We use benchmarking to decide whether to pad tensors for mm alignment, but if the sizes are symbolic, we can't really do that.

Test Plan:
```
pytest test_torchinductor_dynamic_shapes.py -k padding
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100738
Approved by: https://github.com/jiawenliu64, https://github.com/ngimel
2023-05-06 01:35:10 +00:00
e55f02f4d0 lint test/inductor/test_cuda_repro.py (#100777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100777
Approved by: https://github.com/yanboliang, https://github.com/bertmaher
2023-05-06 01:31:58 +00:00
850556ed6e Add "all" option to logging (#100664)
Adds the long-promised "all" option to logging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100664
Approved by: https://github.com/lezcano
2023-05-06 01:11:18 +00:00
f1b2e00700 graph break when calling resize_as_() on graph input (#100148)
Fix #94831

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100148
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-06 01:03:48 +00:00
ce50674f85 [inductor] TARGETS for all inductor tests (#100744)
We had many test scripts for inductor that aren't covered by TARGETS
files.  This diff adds all the ones that work.

Differential Revision: [D45606775](https://our.internmc.facebook.com/intern/diff/D45606775/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D45606775/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100744
Approved by: https://github.com/Chillee
2023-05-06 00:55:15 +00:00
3362c1d240 [ONNX] add cast operator after reduce to match desired dtype (#100700)
This PR conditionally inserts a cast operator after a reduction operation  to match the specified dtype in the exported ONNX model.  The code changes affect **opset9**, and **opset13**.

I understand there's an [automatic upcast to int64](c91a41fd68/torch/onnx/symbolic_opset9.py (L783)) before reduction, most likely to prevent overflow, so I left that alone and only conditionally add a cast back to the desired dtype.

## Test int32
```
import torch
import onnx
a = torch.tensor([10, 20, 30, 80], dtype=torch.int32)
def test():
    class SumInt32(torch.nn.Module):
        def forward(self, a):
            return torch.sum(a, dtype=torch.int32)

    sumi = SumInt32().eval()
    assert sumi(a).dtype == torch.int32
    print("Torch model output type matches input type")

    torch.onnx.export(sumi, (a), "/tmp/sumi_int32.onnx", opset_version=12)
    model = onnx.load("/tmp/sumi_int32.onnx")

    assert model.graph.output[0].type.tensor_type.elem_type == onnx.TensorProto.INT32
    print("ONNX model output type matches input type")
test()
```
![sumi_int32 onnx](https://user-images.githubusercontent.com/10516699/236499220-59b64821-5807-4f69-b0e2-90ae34280e03.png)

## Test int64

```
import onnx
import torch

a = torch.tensor([10, 20, 30, 80], dtype=torch.int64)

def test():
    class SumInt64(torch.nn.Module):
        def forward(self, a):
            return torch.sum(a, dtype=torch.int64)

    sumi = SumInt64().eval()
    assert sumi(a).dtype == torch.int64
    print("Torch model output type matches input type")
    torch.onnx.export(sumi, (a), "/tmp/sumi_int64.onnx", opset_version=12)
    model = onnx.load("/tmp/sumi_int64.onnx")
    assert model.graph.output[0].type.tensor_type.elem_type == onnx.TensorProto.INT64
    print("ONNX model output type matches input type")

test()

```
![sum_int64 onnx](https://user-images.githubusercontent.com/10516699/236422133-15f9cda3-242f-46da-9b23-c2e920f27078.png)

Fixes https://github.com/pytorch/pytorch/issues/100097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100700
Approved by: https://github.com/thiagocrepaldi
2023-05-06 00:05:57 +00:00
fee6d46940 Revert "Bump up flatbuffer submodule version to the latest release (v23.3.3) (#100716)"
This reverts commit 8d31b81edce016652f9c4e8df4bdaf45db0758df.

Reverted https://github.com/pytorch/pytorch/pull/100716 on behalf of https://github.com/malfet due to This will break internal builds, please wait for co-dev land ([comment](https://github.com/pytorch/pytorch/pull/100716#issuecomment-1536909954))
2023-05-05 23:45:11 +00:00
e5b065525b Add unit test for nested_tensor input to nn.TransformerEncoder (#100650)
Summary: Add unit test for nested_tensor input & fix

Test Plan: sandcastle

Differential Revision: D45580393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100650
Approved by: https://github.com/jbschlosser
2023-05-05 23:34:14 +00:00
aad017183d Introduce aggressive merge to CapabilityPartitioner (#100195)
With the old partitioner, suppose `add` is supported, the following code
```python
def fn(a, b, c, d):
    x = a + b # add
    y = c + d # add_1
    return (x, y)

traced = symbolic_trace(fn)
partitioner = CapabilityBasedPartitioner(traced, supported_ops, allows_single_node_partition=True)
partitions = partitioner.propose_partitions()
```
results in the partitions `[[add], [add_1]]`. However, since these two partitions do not depend on each other, they can be aggressively merged into a single partition `[[add, add_1]]` without causing any issues. This PR introduces a new feature that allows such aggressive merging by introducing an option `aggressive_merge` to the Partitioner class.
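
Continuing the snippet above, a sketch of opting into the new behavior (the keyword name and its placement are taken from this description, so treat the exact API as an assumption):

```python
partitioner = CapabilityBasedPartitioner(
    traced,
    supported_ops,
    allows_single_node_partition=True,
    aggressive_merge=True,  # per this PR's description; exact kwarg name is an assumption
)
partitions = partitioner.propose_partitions()  # expected: a single [[add, add_1]] partition
```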

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100195
Approved by: https://github.com/SherlockNoMad
2023-05-05 23:20:17 +00:00
9790f9174a skip lcnet (#100726)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100726
Approved by: https://github.com/voznesenskym
2023-05-05 23:19:42 +00:00
db5430fd25 fix bash math for pr-sanity-check? (#100735)
The current version fails with `.github/scripts/pr-sanity-check.sh: line 44: "0" + "5": syntax error: operand expected (error token is ""0" + "5"")`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100735
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-05-05 23:14:39 +00:00
e3d783c013 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-05 23:10:47 +00:00
bd9d50a3fc Remove future deprecation warning from kl_div docs (#96541)
Fixes #95687
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96541
Approved by: https://github.com/albanD
2023-05-05 23:01:21 +00:00
e20c94bda9 [MPS] Add the test for 5D in test_mps which is skipped. (#99271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99271
Approved by: https://github.com/DenisVieriu97
2023-05-05 22:57:06 +00:00
a1f318daba Fix get_reordered_tests in run_test.py (#100752)
I think get_reordered_tests has been broken since the master -> main switch.

Also adds typing for some functions.

Checked for `prioritized` in the logs.

Testing is limited because I only care about one very small part of the log that's near the beginning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100752
Approved by: https://github.com/huydhn
2023-05-05 22:46:56 +00:00
c9593bc0e1 [ONNX] Refactor diagnose_call message_formatter signature (#100299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100299
Approved by: https://github.com/justinchuby, https://github.com/thiagocrepaldi
2023-05-05 22:31:12 +00:00
3f025c607c summarize graph breaks (#100696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100696
Approved by: https://github.com/yanboliang
2023-05-05 22:27:47 +00:00
f76d0e1b82 remove unused extract_arg_by_filtered_index utility (#100649)
remove unused extract_arg_by_filtered_index utility

Test Plan: Verified unused with `git grep`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100649
Approved by: https://github.com/ezyang
2023-05-05 22:25:43 +00:00
4a90deb137 [Doc] Add GRU new gate calculation difference (#100646)
Summary: Add a note about the difference in how the GRU new gate `n_t` is calculated in PyTorch versus the original paper.
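
For reference, the difference being documented is roughly the following (a paraphrase, not the exact wording added to the docs): PyTorch applies the reset gate after the hidden-state matmul, while the original paper applies it before:

```
PyTorch:          n_t = tanh(W_{in} x_t + b_{in} + r_t ⊙ (W_{hn} h_{t-1} + b_{hn}))
Original paper:   n_t = tanh(W_{in} x_t + W_{hn} (r_t ⊙ h_{t-1}))
```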

Fix: #99531

Test Plan: Please see GitHub pipelines.

Differential Revision: D45579790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100646
Approved by: https://github.com/mikaylagawarecki
2023-05-05 22:18:54 +00:00
e0a3d014e9 [CI] Do not auto-label nightly builds PR (#100740)
As I'm tired of removing `ciflow/trunk`/`ciflow/inductor` on those, which would fail anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100740
Approved by: https://github.com/huydhn
2023-05-05 22:09:13 +00:00
8d31b81edc Bump up flatbuffer submodule version to the latest release (v23.3.3) (#100716)
The current flatbuffer version uses `--std=c++0x`, which is too old. One of flatbuffer's dependencies has stopped supporting C++0x, causing a build issue on my system.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100716
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-05-05 21:58:36 +00:00
5c14eea1de Revert "extend serialization for tensor metadata (#99808)"
This reverts commit 73dd6f04c97f647470dbc55e03f666fa88f634c3.

Reverted https://github.com/pytorch/pytorch/pull/99808 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/99808#issuecomment-1536823538))
2023-05-05 21:55:52 +00:00
7961812c4d Rename ForceInPlace to InPlaceHint. (#99764)
The name makes more sense since it's a hint to scheduler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99764
Approved by: https://github.com/wanchaol
2023-05-05 21:29:05 +00:00
036d2f6593 Add unstable-periodic to upload test stats (#100751)
add unstable-periodic workflow to upload test status
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100751
Approved by: https://github.com/huydhn
2023-05-05 21:13:40 +00:00
d41134e2f2 dynamic equality constraint (#99993)
This diff adds support for dynamic equality constraints of the form `dynamic_dim(x, 0) == dynamic_dim(y, 1)`. The process of constraint discovery can already understand equality guards between dimensions and suggests such equality constraints, so this closes the loop on that. Correspondingly we now raise `ConstraintViolation` when we find that such a guard is added on a dynamic dimension and the user did not specify such a constraint. (NOTE: This is distinct from a dynamic dimension being guarded equal to a constant, which is already an error.)
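
A rough sketch of the kind of constraint this enables (the exact import locations of `export` and `dynamic_dim` at this point in the codebase are assumptions):

```python
import torch
import torch._dynamo as dynamo
from torch._export import dynamic_dim  # assumed location; may differ

def f(x, y):
    return x @ y  # guards x.shape[1] == y.shape[0]

x = torch.randn(4, 6)
y = torch.randn(6, 3)

# declare that the two dynamic dimensions must stay equal
constraints = [dynamic_dim(x, 1) == dynamic_dim(y, 0)]
gm, guards = dynamo.export(f, x, y, constraints=constraints)
```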

Differential Revision: [D45279437](https://our.internmc.facebook.com/intern/diff/D45279437/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99993
Approved by: https://github.com/tugsbayasgalan
2023-05-05 21:09:18 +00:00
2f5e9b60f9 [ROCm] Limiting the NUM_PROCS to 8 while UT testing (#100133)
- A few AMD machines have >8 GPUs, so we limit NUM_PARALLEL_PROCS to 8, which also caps the number of test shards at 8
- Parallelizing beyond 8 is limited by memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100133
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
2023-05-05 20:05:59 +00:00
67e3dd86b5 Update Multipy CI pin (#100640)
picks up a change that adds testing for multipy importing functional collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100640
Approved by: https://github.com/PaliC, https://github.com/malfet
2023-05-05 20:01:45 +00:00
59cb02db54 Symintify TensorFactories.empty_like (#100668)
Fixes [#ISSUE_NUMBER](https://github.com/pytorch/xla/pull/4876)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100668
Approved by: https://github.com/ezyang
2023-05-05 19:38:31 +00:00
447a20fdb1 [profiler] provide torch.profiler._utils._init_for_cuda_graphs() as a workaround (#100441)
There are known issues with profiling cuda graphs - particularly, if you create a cuda graph before the first use of the profiler, and then run that cuda graph during profiling.

One workaround is to add `with profile(): pass` before creating the cuda graph that you want to profile later.

For convenience, we provide this function to apply the workaround. This also adds a test for the workaround, to ensure that it continues working.
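
A sketch of the workaround on a CUDA machine (the helper name comes from this PR's title; everything else is illustrative):

```python
import torch
from torch.profiler import profile
from torch.profiler._utils import _init_for_cuda_graphs  # name per this PR's title

# warm up the profiler before the first CUDA graph capture, per the workaround above
_init_for_cuda_graphs()

x = torch.zeros(4, device="cuda")
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = x + 1

with profile() as prof:
    g.replay()
torch.cuda.synchronize()
```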
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100441
Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi
2023-05-05 19:25:37 +00:00
41dc25d5fc [inductor] Pattern to replace cumsum with arange (#100673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100673
Approved by: https://github.com/eellison
2023-05-05 19:08:34 +00:00
f42eae4755 Revert "[export] Pickle of ExportGraphModule (#100620)"
This reverts commit d4975a5fe0b263087c8f060409a9331a1dbdde76.

Reverted https://github.com/pytorch/pytorch/pull/100620 on behalf of https://github.com/clee2000 due to broke export/test_serialize.py::TestSerialize::test_pickle_dynamic across various jobs d4975a5fe0, i think you hit another landrace? ([comment](https://github.com/pytorch/pytorch/pull/100620#issuecomment-1536643519))
2023-05-05 18:52:48 +00:00
bf2258f582 Fix frequent "warning C4141: 'dllimport': used more than once" (#100708)
This was introduced recently for MSVC builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100708
Approved by: https://github.com/ezyang
2023-05-05 18:45:58 +00:00
d4975a5fe0 [export] Pickle of ExportGraphModule (#100620)
reland of https://github.com/pytorch/pytorch/pull/100423 bc merge conflict...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100620
Approved by: https://github.com/mergennachin
2023-05-05 18:21:39 +00:00
4ca26d183a [CI] update hf version for ci (#100666)
per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100666
Approved by: https://github.com/malfet
2023-05-05 18:12:53 +00:00
b89b5716a9 ROCm fixes for PyT2.0 (#100089)
This PR brings some updates and fixes in regards to PyT2.0 functionality

1 - ROCm's version of triton does not yet support tl.reduce
Until supported we are opting to revert the removal of the aten.prod make_fallback for ROCm brought in with 7a6c650b81

This issue was found locally with the latest aten.prod UTs on ROCm
```
FAILED [0.0916s] inductor/test_torchinductor.py::CudaTests::test_prod_cuda - torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
AttributeError: module 'triton.language' has no attribute 'reduce'
```

2 - Adds aten.miopen_batch_norm as an explicit fallback, since perf issues are observed when it is registered as a decomposition; warning=False is set because the fallback is expected

3 - Fixes a typo and redundant assignment in _inductor/triton_heuristics.py brought in with dd778a7610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100089
Approved by: https://github.com/kit1980, https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/jansel
2023-05-05 18:07:57 +00:00
3f725db4a6 [inductor] Run dead_node_elimination to a fixed point (#100672)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100672
Approved by: https://github.com/lezcano
2023-05-05 17:55:30 +00:00
d66add688f Revert "Add logcumsumexp forward-ad (#100629)"
This reverts commit d658c62677b7c096b0fda3ce7a4f0accc727430e.

Reverted https://github.com/pytorch/pytorch/pull/100629 on behalf of https://github.com/clee2000 due to broke slow test, see above comment for details ([comment](https://github.com/pytorch/pytorch/pull/100629#issuecomment-1536575442))
2023-05-05 17:42:35 +00:00
40df6e1647 [ONNX] Simplify repeat_intereleave export for scalar-valued 'repeat' (#100575)
This PR simplifies the ONNX export of torch.repeat_interleave when 'repeat' is a scalar value (so each index in the input is repeated the same number of times). (Issue #100438)

Here is a before/after of a simple model export:
```python
# Model + export code
import torch

class RepeatInterleaveModel(torch.nn.Module):
    def forward(self, x):
        return x.repeat_interleave(2, dim=-1)

args = (torch.rand((2, 2, 16)),)
model = RepeatInterleaveModel()
torch.onnx.export(model, args, "repeat_interleave.onnx", opset_version=17)
```

**Before (static shapes)**
![repeat_interleave onnx(1)](https://user-images.githubusercontent.com/46343317/236014996-00726832-1e76-4fb4-950d-4b54cc5cc20c.png)

-----
**Before (dynamic shapes, second graph is Loop body)**
<p float="left">
  <img src="https://user-images.githubusercontent.com/46343317/236029895-20b0ae0a-240f-466d-bb01-e619ec5967ad.png" width="45%" />
  <img src="https://user-images.githubusercontent.com/46343317/236029915-e67b808a-029b-4997-bc05-1ce59eec409a.png" width="47%" />
</p>

-----
**After (for both static and dynamic shapes)**
<img src="https://user-images.githubusercontent.com/46343317/236015235-633811cb-09a2-435d-a293-1b2bcb7dea50.png" width="66%" />

-----

This PR also fixes a bug where the exporter throws an exception when the input has dynamic shapes and the 'dim' parameter is not specified to torch.repeat_interleave. Also adds a new test case to cover this. (Issue #100429)

Fixes #100438 and #100429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100575
Approved by: https://github.com/BowenBao
2023-05-05 17:00:42 +00:00
3f2336d3fe Revert "[EZ] move test decorator up in the class def (#100719)"
This reverts commit daf5100656c65cb6f1777f7e4173fd494624b565.

Reverted https://github.com/pytorch/pytorch/pull/100719 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks lint in trunk ([comment](https://github.com/pytorch/pytorch/pull/100719#issuecomment-1536514589))
2023-05-05 16:47:27 +00:00
6b20ac3bc4 make torch/csrc/jit/passes/onnx/unpack_quantized_weights.cpp data_ptr-correct (#100681)
make torch/csrc/jit/passes/onnx/unpack_quantized_weights.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100681
Approved by: https://github.com/ezyang
2023-05-05 16:30:55 +00:00
1fe91f5922 make torch/csrc/distributed/c10d/quantization/quantization.cpp data_ptr-correct (#100688)
make torch/csrc/distributed/c10d/quantization/quantization.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100688
Approved by: https://github.com/ezyang
2023-05-05 16:27:14 +00:00
c676aa8bee make torch/csrc/distributed/c10d/ProcessGroupGloo.cpp data_ptr-correct (#100689)
make torch/csrc/distributed/c10d/ProcessGroupGloo.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100689
Approved by: https://github.com/ezyang
2023-05-05 16:26:43 +00:00
543e9c6517 use const_ and mutable_ data_ptr for much of torch/csrc/jit/runtime/static/ops.cpp (#100678)
use const_ and mutable_ data_ptr for much of torch/csrc/jit/runtime/static/ops.cpp

Summary:

We can't address the TEWrapper cases yet because it erases all
arguments to mutable void*.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100678
Approved by: https://github.com/ezyang
2023-05-05 16:20:24 +00:00
4101de342b Type torch._inductor.codegen.wrapper (#100657)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100657
Approved by: https://github.com/voznesenskym
2023-05-05 16:19:23 +00:00
642f4ed606 add a cast function that suppresses -Wcast-function-type-strict (#100170)
add a cast function that suppresses -Wcast-function-type-strict

Summary:
These casts are a necessary evil due to the design of Python. Python
ultimately casts it back to the original type based on the flags
specified in the PyMethodDef.

Nevertheless, the new Clang flag -Wcast-function-type-strict breaks
with this.

Test Plan: Passes builds with Clang 16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100170
Approved by: https://github.com/ezyang
2023-05-05 16:16:06 +00:00
e53b288679 remove unused filter_map utility (#100647)
remove unused filter_map utility

Test Plan: Verified unused with `git grep`.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/100647).
* #100649
* __->__ #100647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100647
Approved by: https://github.com/ezyang
2023-05-05 16:15:56 +00:00
bf08b072a7 Add functionalization pass in TorchDynamo (#99461)
Fixes: https://github.com/pytorch/pytorch/issues/99000

Differential Revision: [D45106409](https://our.internmc.facebook.com/intern/diff/D45106409)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99461
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305, https://github.com/zou3519
2023-05-05 16:08:14 +00:00
31fdd19b5b Add support for list copy in dynamo export (#100669)
Summary:
Issue:
`torch._dynamo.exc.Unsupported: call_method ListVariable() copy [] {}`

Fix:
Add `copy()` to the "call_method" handling in _dynamo/variables/lists.py.

Taken over from #98184, to unblock a Meta-internal model onboarding to ExecuTorch.
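
A minimal sketch of the kind of code that used to hit the error (a made-up example, not the internal model):

```python
import torch

def f(tensors, extra):
    copied = tensors.copy()   # previously: Unsupported: call_method ListVariable() copy
    copied.append(extra)
    return torch.stack(copied)

compiled = torch.compile(f, backend="eager", fullgraph=True)
out = compiled([torch.ones(3), torch.zeros(3)], torch.full((3,), 2.0))
```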

Differential Revision: D45592416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100669
Approved by: https://github.com/jansel
2023-05-05 16:04:19 +00:00
0e017af35b make torch/csrc/jit/python/pybind_utils.cpp data_ptr-correct (#100682)
make torch/csrc/jit/python/pybind_utils.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100682
Approved by: https://github.com/Skylion007
2023-05-05 15:53:06 +00:00
a2e81a8004 [DataLoader] __getitems__ added to description of Dataset API and better supported within Subset (#100375)
DataLoader supports batched loading from Mapped Datasets.

This is the fetcher's implementation of auto-detection of batch loading support.

torch.utils.data._utils.fetch._MapDatasetFetcher
```
class _MapDatasetFetcher(_BaseDatasetFetcher):
    def fetch(self, possibly_batched_index):
        if self.auto_collation:
            if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:
                data = self.dataset.__getitems__(possibly_batched_index)
            else:
                data = [self.dataset[idx] for idx in possibly_batched_index]
```

Description of Dataset API now shows this feature.

Additionally, the Subset dataset now supports `__getitems__` if the parent dataset supports it.
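
A small illustrative Dataset that opts into batched fetching via `__getitems__` (names are made up):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class RangeDataset(Dataset):
    def __init__(self, n):
        self.data = torch.arange(n, dtype=torch.float32)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

    def __getitems__(self, indices):
        # fetch a whole batch of samples in one call; the default collate_fn
        # still receives a list of samples, same as the per-index path
        return [self.data[i] for i in indices]

for batch in DataLoader(RangeDataset(10), batch_size=4):
    print(batch)
```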
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100375
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-05 15:52:28 +00:00
6064c4c64c Disable core dumping on ROCm UT workflows (#100532)
Recently an issue was observed on PyTorch CI in which the ROCm nodes were running out of space due to out of control core dumping.
https://github.com/pytorch/pytorch/issues/99578

To mitigate this issue we have proposed to disable core dumping on the ROCm workers with the `--ulimit core=0` flag in the docker run command in `_rocm-test.yml`
https://stackoverflow.com/questions/58704192/how-to-disable-core-file-dumps-in-docker-container/59611557#59611557

Before this change
```
ulimit -a
core file size          (blocks, -c) unlimited
```

After this change
```
ulimit -a
core file size          (blocks, -c) 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100532
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
2023-05-05 15:48:31 +00:00
daf5100656 [EZ] move test decorator up in the class def (#100719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100719
Approved by: https://github.com/angelayi
2023-05-05 15:35:56 +00:00
7a15e82388 Fix tensor registration to work with coalescing collectives. (#99763)
We do it by making it possible to register multiple tensors for the same
worker and coordinate waiting/cleanup among them.

This ensures that waiting on any number of the output tensors results in a
single stream sync. This simplifies codegen by inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99763
Approved by: https://github.com/wanchaol
2023-05-05 14:25:35 +00:00
54f27c7d5c make torch/csrc/distributed/c10d/quantization/quantization_gpu.cu data_ptr-correct (#100685)
make torch/csrc/distributed/c10d/quantization/quantization_gpu.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100685
Approved by: https://github.com/Skylion007
2023-05-05 14:14:36 +00:00
57e19ad86d Add pattern to merge consecutive splits (#100107)
Summary:
Fx pass for "split->split => split" pattern

{F959486105}
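
Roughly, the graph-level rewrite targets code shaped like this (illustrative only):

```python
import torch

def before(x):          # x: [N, 8]
    a, b = torch.split(x, [4, 4], dim=1)   # split
    c, d = torch.split(a, [2, 2], dim=1)   # split_1 feeds off the first split
    return b, c, d

def after(x):
    # the merged form the pass aims to produce: a single split with combined sizes
    c, d, b = torch.split(x, [2, 2, 4], dim=1)
    return b, c, d
```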

Test Plan:
* CI tests
* Run in a notebook on CMF MIMO

{F959938752}

Differential Revision: D45204109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100107
Approved by: https://github.com/jansel
2023-05-05 09:52:35 +00:00
c91a41fd68 [Inductor][Quant]Enable the decomposed dequant maxpooling2d loop fusion (#99132)
**Summary**
The lowering of [`max_pool2d`](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/lowering.py#L2732) checks the `num_reads` of the input `StorageBox.data`. When the number of reads is larger than 1, the input `StorageBox` invokes `realize`, which breaks loop fusion with the previous node. In the quantization use case, the previous node can be `decomposed.dequant_per_tensor.tensor`, which has 3 reads, but 2 of them are reads of scalar tensors (`zero point` and `scale`). In this PR, we relax the criterion for `StorageBox.realize`: when the input is an instance of `Pointwise`, we count only the reads of non-scalar tensors and invoke `StorageBox.realize` only when that count is larger than 1. This enables loop fusion and vectorized code generation for the pattern `decomposed.dequant_per_tensor.tensor - max_pool2d`.

**Test Plan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_maxpool2d_lowering
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99132
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-05 08:20:16 +00:00
675029aabf inductor: add params check before doing sfdp fusion (#100619)
We need to add a params check before doing sfdp fusion, mirroring https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/attention.cpp#L544.

This PR will fix #100315 and #100318.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100619
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-05-05 06:47:21 +00:00
37f1be041a [pt2] enable svd in fake_tensor (#100130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100130
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-05-05 06:27:59 +00:00
bb6b24c622 [BE] Dockerize PyTorch docs jobs (#100601)
Saw some connection error to pip in docs jobs today, so let's dockerize it:

* https://github.com/pytorch/pytorch/actions/runs/4877612277/jobs/8702572072
* https://github.com/pytorch/pytorch/actions/runs/4877612277/jobs/8702572072

Some additional fixes:
* Moving the docs script from under `.circleci` to under `.ci` as they should be
* Linter (as scripts under .ci are subjected to shellcheck)
* Fix some minor Sphinx warnings in functorch docs

### Testing
Docs previews look fine:

* https://docs-preview.pytorch.org/100601/index.html
* https://docs-preview.pytorch.org/100601/cppdocs/index.html
* https://docs-preview.pytorch.org/100601/functorchdocs/index.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100601
Approved by: https://github.com/clee2000
2023-05-05 06:24:46 +00:00
05adf4d49d inductor(cpu): skip ConvTranspose2d packing if has output_size input (#100612)
Fix https://github.com/pytorch/pytorch/issues/100344.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100612
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-05-05 06:19:20 +00:00
63f2f9fb0d [BE] Remove unused CircleCI checks (#100630)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100630
Approved by: https://github.com/atalman
2023-05-05 05:45:48 +00:00
ee4cb4b1e7 Add --offload-to-disk support to minifier (#100546)
When minifying extremely large repros, the minifier can run out of memory. This is because, for delta debugging, the minifier keeps a copy of every intermediate output in the network. This can easily put you over the memory limit for your GPU. To make matters worse, we cannot easily delta debug in such a situation, as delta debugging involves replacing intermediates with inputs, but doing so can cause an intermediate to become live longer than its actual extent in the original model (since inputs all have to be allocated up front).

The strategy in this PR is to use `load_tensor` from the previous PR to offer a low memory mode for delta debugging. Instead of putting intermediates as inputs, we instead load them in the middle of the graph in question.  If, through DCE, the load_tensor ends up floating to the top of the graph, we can input-ify it. We now no longer save all intermediates in memory, but instead save them to disk. I used this to successfully minify the repro that helped us solve https://github.com/pytorch/pytorch/pull/100332

The testing is not very good. I can try to add more robust testing but it will involve a more involved refactor to FX minifier. Let me know if that's what you want.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100546
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
2023-05-05 05:25:03 +00:00
ce1ad1c143 Add load_storage (#100519)
This adds a new operator debugprims::load_storage which does the unusual thing of loading a tensor from disk (via ContentStoreReader). This will be used in a later PR to implement delta debugging in the minifier, even when the repro is too big to fit into memory. The way it works is that you specify a name of the tensor you want to load, as well as enough metadata to reconstruct the tensor, if the store isn't available. If there is an active content store, we read and return the tensor from that store; otherwise we use `rand_strided` to create it.

I needed some infra improvements to do this:

* `custom_op` now supports factory functions. Factory functions have to be registered specially via `impl_factory`
* I modified `clone_input` to also support dtype conversion, which I use to change the dtype of a loaded tensor if necessary.
* ContentStore needs to work with a device argument, so we torch.load directly to the correct device. This is for fake tensor support.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100519
Approved by: https://github.com/zou3519, https://github.com/anijain2305
2023-05-05 05:25:03 +00:00
d4dad36cf1 [quant][pt2] Improve prepare_qat Conv + BN numerics test (#100271)
Summary: This commit makes two improvements to the existing
test for Conv + BN fusion in `prepare_qat_pt2e`:

(1) Test `per_tensor_symmetric` in addition to `per_channel_symmetric`
(2) Initialize BN stats the same way in both flows. This is
    necessary to get the `per_tensor_symmetric` case to pass.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_numerics

Reviewers: jerryzh168, kimishpatel

Differential Revision: [D45512851](https://our.internmc.facebook.com/intern/diff/D45512851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100271
Approved by: https://github.com/jerryzh168
2023-05-05 04:46:13 +00:00
bf52d570d9 torch.save/load torch.compiled models (#97565)
Opening this so I can discuss with @albanD

I built a proof of concept of an in place API for an nn.Module that allows us to save and load a torch.compiled model with no issues https://github.com/msaroufim/mlsys-experiments/blob/main/save-compiled-model.py

So users can run `model.compile()` and then run `torch.save(model, "model.pt")` and `torch.load("model.pt")` with no issues, unlike the rather strange current suggestion we give to users, which is `opt_mod = torch.compile(mod); torch.save(mod, "model.pt")`

Right now I'm trying to extend this to work for nn.modules more generally
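
A sketch of the intended flow per the description above (the in-place `model.compile()` method is the API this PR prototypes, so treat it as such rather than an existing API):

```python
import torch

model = torch.nn.Linear(8, 8)
model.compile()                      # in-place compile prototyped by this PR
torch.save(model, "model.pt")        # save the module itself, not an OptimizedModule wrapper

loaded = torch.load("model.pt")
out = loaded(torch.randn(2, 8))
```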

TODO: Failing tests
* [x] torch.jit.load -> the issue was caused by aliasing `__call__` to `_call_impl`; _call_impl used to be skipped but now it no longer is, so I expanded the skip check. I added an explicit `torch.jit.load()` test now, which @davidberard98 suggested
* [x] functorch seems to be a flake - ran locally and it worked `pytest functorch/test_eager_transforms.py`
* [x] a test infra flake - `test_testing.py::TestImports::test_no_mutate_global_logging_on_import_path_functorch`
* [x] It seems like I broke inlining in dynamo though `python -m pytest test/dynamo/test_dynamic_shapes.py -k test_issue175` chatting with Voz about it but still not entirely sure how to fix - found a workaround after chatting with @yanboliang
* [x] `pytest test/dynamo/test_modules.py` and `test/dynamo/test_dynamic_shapes` `test/dynamo/test_misc.py` seem to be failing in CI but trying it out locally they all pass tests passed with 0 failures
* [x] `pytest test/profiler/test_profiler_tree.py ` these tests have ProfilerTrees explicitly printed and will now break if __call__ is not in tree - ran with `EXPECT_ACCEPT=1`
* [x] `pytest test/test_torch.py::TestTorch::test_typed_storage_deprecation_warning` a flake, ran this locally and it works fine
* [x] I reverted my changes to `_dynamo/nn_module.py` since it looks like @wconstab is now directly handling `_call_impl` there but this is triggering an infinite inlining which is crashing
* [x] Tried out to instead override `__call__`, python doesnt like this though https://github.com/pytorch/pytorch/pull/97565#issuecomment-1524570439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97565
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD, https://github.com/voznesenskym
2023-05-05 03:57:49 +00:00
2f9538006e [vision hash update] update the pinned vision hash (#100671)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100671
Approved by: https://github.com/pytorchbot
2023-05-05 02:35:49 +00:00
d658c62677 Add logcumsumexp forward-ad (#100629)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100629
Approved by: https://github.com/soulitzer
2023-05-05 02:21:27 +00:00
2f41bc5465 [DataLoader] Add context to NotImplementedErrors in dataset.py (#100667)
Add helpful context message to `NotImplementedError`'s thrown by Dataset and IterableDataset, reminding users that they must implement `__getitem__`/`__iter__` in subclasses. Currently, users are presented with a bare `NotImplementedError` without describing the remedy.
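
For example (illustrative; the exact message wording is whatever the PR adds):

```python
from torch.utils.data import Dataset

class MyDataset(Dataset):
    pass  # forgot to implement __getitem__

ds = MyDataset()
ds[0]  # NotImplementedError now hints that Dataset subclasses must implement __getitem__
```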
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100667
Approved by: https://github.com/NivekT
2023-05-05 02:16:42 +00:00
a3989b2802 remove unused concat_iseq (#100648)
remove unused concat_iseq

Test Plan: Verified with `git grep`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100648
Approved by: https://github.com/Skylion007
2023-05-05 02:02:57 +00:00
35a6b04419 Set assume_static_by_default to True in Dynamo config (#100458)
We expect fine-grained dynamic shapes to be enabled at all times, which means that a dimension is assumed to be static unless the user explicitly says otherwise.
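
In practice this makes dynamic shapes opt-in per dimension; a rough sketch using `torch._dynamo.mark_dynamic`, which is one way to opt a dimension in:

```python
import torch
import torch._dynamo

@torch.compile
def f(x):
    return x * 2

x = torch.randn(4, 8)
torch._dynamo.mark_dynamic(x, 0)  # dim 0 is treated as dynamic; everything else stays static
f(x)
```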

Differential Revision: D45473365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100458
Approved by: https://github.com/avikchaudhuri
2023-05-05 00:50:41 +00:00
73eab18ac8 set lowmem_dropout and fallback_random configs for all tests in test_… (#100506)
…fused_attention

This allows all the tests in test_fused_attention to succeed when run together; otherwise replacements are registered without the proper config set, so some tests fail and only succeed on rerun. This is also confusing when running the full file locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100506
Approved by: https://github.com/drisspg
2023-05-05 00:37:35 +00:00
b6d318291b [FSDP] Do not sys.exit(0) explicitly at end of unit test (#100645)
We are going to see if this closes https://github.com/pytorch/pytorch/issues/100641. The guess is that this might allow NCCL to be destroyed before Python finalizes, avoiding any issues with calling `pybind11::gil_scoped_release` like in [`destroy_nccl_comm`](8994d9e610/torch/csrc/cuda/python_nccl.cpp (L46)).

Test plan:
```
CUDA_VISIBLE_DEVICES=0,7 numactl -C 2 python test/distributed/fsdp/test_fsdp_unshard_params.py -v -k test_with_grads_core --repeat 200 2>&1 | tee out
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100645
Approved by: https://github.com/rohan-varma, https://github.com/fegin
2023-05-05 00:09:17 +00:00
6d2f8114be Revert "[BE] Dockerize PyTorch docs jobs (#100601)"
This reverts commit 2703684acf5643ca69ecfac4bdb861abe2a8aa41.

Reverted https://github.com/pytorch/pytorch/pull/100601 on behalf of https://github.com/huydhn due to Curiously, this breaks inductor jobs ([comment](https://github.com/pytorch/pytorch/pull/100601#issuecomment-1535515587))
2023-05-04 23:13:15 +00:00
da0993280d use const_data_ptr in torch/csrc/lazy/core/hash.h (#100644)
use const_data_ptr in torch/csrc/lazy/core/hash.h

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100644
Approved by: https://github.com/ezyang
2023-05-04 22:56:19 +00:00
fe3c83d349 Have testing overhead dashboard only use successful workflows (#100580)
Once this pr is approved I'll update the data to do this.

Relevant query: https://console.rockset.com/query?query_text_id=8b05b37e-747b-421c-9347-151586b0ac80
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100580
Approved by: https://github.com/huydhn
2023-05-04 22:49:34 +00:00
1d5577b601 No need to run Windows binary build for every PR (#100638)
Per the discussion with @malfet , there is no need to run Windows binary build for every PR. We will keep it running in trunk (on push) though just in case.

This also moves the workflow back from unstable after the symlink copy fix in 860d444515

Another data point to back this up is the high correlation between the Windows debug and release binary builds vs. the Windows CPU CI job. The numbers are:

* `libtorch-cpu-shared-with-deps-debug` and `win-vs2019-cpu-py3` have a 0.95 correlation
* `libtorch-cpu-shared-with-deps-release` and `win-vs2019-cpu-py3` have the same 0.95 correlation

The rest is noise, eh?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100638
Approved by: https://github.com/atalman
2023-05-04 21:57:39 +00:00
c525440ba3 Logging documentation updates (#100595)
Updated the logging.rst with info about the env var.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100595
Approved by: https://github.com/msaroufim, https://github.com/lezcano
2023-05-04 21:54:02 +00:00
24e9b8f5f4 [PT2E][Quant] Use subgraph matcher annotate linear pattern (#100566)
This diff adds subgraph matcher for pattern matching. Furthermore, we also move
annotations for the matched subgraph in a way that only input and output nodes
of the matched subgraph have quantization related valid annotations.

Differential Revision: [D45535539](https://our.internmc.facebook.com/intern/diff/D45535539/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100566
Approved by: https://github.com/jerryzh168
2023-05-04 21:31:59 +00:00
8869897ebe [replicate] support simpler device_id (#100217)
Allow passing in `device_id=[device]` regardless of CPU or GPU. We
modify the kwarg as needed to pass to DDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100217
Approved by: https://github.com/awgu, https://github.com/zhaojuanmao
2023-05-04 21:06:04 +00:00
2703684acf [BE] Dockerize PyTorch docs jobs (#100601)
Saw some connection error to pip in docs jobs today, so let's dockerize it:

* https://github.com/pytorch/pytorch/actions/runs/4877612277/jobs/8702572072
* https://github.com/pytorch/pytorch/actions/runs/4877612277/jobs/8702572072

Some additional fixes:
* Moving the docs script from under `.circleci` to under `.ci` as they should be
* Linter (as scripts under .ci are subject to shellcheck)
* Fix some minor Sphinx warnings in functorch docs

### Testing
Docs previews look fine:

* https://docs-preview.pytorch.org/100601/index.html
* https://docs-preview.pytorch.org/100601/cppdocs/index.html
* https://docs-preview.pytorch.org/100601/functorchdocs/index.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100601
Approved by: https://github.com/clee2000
2023-05-04 20:33:51 +00:00
73dd6f04c9 extend serialization for tensor metadata (#99808)
Fixes #ISSUE_NUMBER
Add serialization logic for backend metadata to tensor serialization; it is implemented through custom registration functions.

In #97429, the backendMeta structure was added to TensorImpl, and we think this information may also need to be serialized for custom backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99808
Approved by: https://github.com/ezyang
2023-05-04 20:32:11 +00:00
25b42aef67 [Inductor] Using PythonPrinter for SymInt arguments codegen for FallbackKernal (#100606)
Fixes Meta internal user case.

Repro:
```
import torch
import torch._inductor

torch._inductor.config.disable_cpp_codegen = True

@torch.compile(backend="inductor", dynamic=True)
def func(input: torch.Tensor) -> torch.Tensor:
    n = input.size(-1)
    output = input + int(n * 0.2) + 1
    return output, input + 1

print(func(torch.rand(5, device="cpu")))
print(func(torch.rand(10, device="cpu")))
```

Error:
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/debug/debug7.py", line 20, in <module>
    print(func(torch.rand(10, device="cpu")))
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 280, in _fn
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/debug/debug7.py", line 12, in func
    @torch.compile(backend="inductor", dynamic=True)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/eval_frame.py", line 280, in _fn
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_functorch/aot_autograd.py", line 3346, in forward
    return compiled_fn(full_args)
  File "/scratch/ybliang/work/repos/pytorch/torch/_functorch/aot_autograd.py", line 1260, in g
    return f(*args)
  File "/scratch/ybliang/work/repos/pytorch/torch/_functorch/aot_autograd.py", line 2210, in runtime_wrapper
    all_outs = call_func_with_args(
  File "/scratch/ybliang/work/repos/pytorch/torch/_functorch/aot_autograd.py", line 1285, in call_func_with_args
    out = normalize_as_list(f(args))
  File "/scratch/ybliang/work/repos/pytorch/torch/_functorch/aot_autograd.py", line 1372, in rng_functionalization_wrapper
    return compiled_fw(args)
  File "/tmp/torchinductor_ybliang/od/codk4bo4oqmjiec35zlz2rsildcix33lsxpdcy7pi6p4nvdrofpu.py", line 27, in call
    buf0 = torch.ops.aten.add.Tensor(arg1_1, floor(0.2*s0))
NameError: name 'floor' is not defined
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100606
Approved by: https://github.com/xw285cornell
2023-05-04 20:10:23 +00:00
4bad3f62f7 [MPS] Add support for MPSProfiler (#100635)
- Enable event and interval-based os signpost tracing via env-var 'PYTORCH_MPS_TRACE_SIGNPOSTS' (python bindings sent in separate PR).
- Enable logging of MPS graphs, native kernels, and copies and their GPU times via env-var `PYTORCH_MPS_LOG_PROFILE_INFO`.
- Enable dumping the table of kernel profiling results sorted based on Mean GPU time when the process ends (SIGINT also handled).
- Fix a bug in MPSAllocator where the Allocator completionHandlers were called after the MPSAllocator instance was destroyed.
- Added option to use Schedule Handlers to begin signpost intervals.
- Refer to comments in `MPSProfiler.h` to learn how to set env-vars for logging and signpost tracing. Proper documentation will be sent in a separate PR later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100635
Approved by: https://github.com/kulinseth
2023-05-04 20:02:33 +00:00
8994d9e610 [dynamo] Hide guard_fail_hook behind a flag to improve cache lookup time (+10% DebertaV2) (#100590)
For TorchDynamo eager backend, DebertaV2 speedup improves from 0.77x to 0.87x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100590
Approved by: https://github.com/voznesenskym, https://github.com/wconstab
2023-05-04 18:52:21 +00:00
edebad81a9 Add a rst doc for the performance dashboard (#100592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100592
Approved by: https://github.com/msaroufim, https://github.com/huydhn
2023-05-04 18:28:09 +00:00
23a095ca5f Chunked inplace weight loading API (#100615)
Chunking inplace memory writing to save memory further

Reviewed By: zyan0

Differential Revision: D45506186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100615
Approved by: https://github.com/davidberard98
2023-05-04 17:41:18 +00:00
04d67e20a7 Revert "torch.save/load torch.compiled models (#97565)"
This reverts commit 87f08d717e022b8dd8de03c82ab77a9b3d52d5f6.

Reverted https://github.com/pytorch/pytorch/pull/97565 on behalf of https://github.com/clee2000 due to sorry but I think this breaks dynamo tests 87f08d717e ([comment](https://github.com/pytorch/pytorch/pull/97565#issuecomment-1535103171))
2023-05-04 17:07:33 +00:00
67fc9bbb9b Rename percentiles to quantiles in triton.testing.do_bench (#100477)
Summary: To reflect upstream changes.

Test Plan: CI

Differential Revision: D45488904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100477
Approved by: https://github.com/ngimel
2023-05-04 16:53:17 +00:00
6370ac0251 [codemod] Replace hasattr with getattr in caffe2/torch/ao/quantization/stubs.py (#100597)
Summary:
The pattern
```
X.Y if hasattr(X, "Y") else Z
```
can be replaced with
```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate.

**This diff is very low risk. Green tests indicate that you can safely Accept & Ship.**

Test Plan: Sandcastle

Reviewed By: vkuzo

Differential Revision: D44886422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100597
Approved by: https://github.com/Skylion007
2023-05-04 16:36:23 +00:00
9c185b6b46 [codemod] Replace hasattr with getattr in caffe2/docs/source/notes/extending.rst (#100598)
Summary:
The pattern
```
X.Y if hasattr(X, "Y") else Z
```
can be replaced with
```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate.

**This diff is very low risk. Green tests indicate that you can safely Accept & Ship.**

Test Plan: Sandcastle

Differential Revision: D44886464

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100598
Approved by: https://github.com/Skylion007
2023-05-04 16:36:15 +00:00
87f08d717e torch.save/load torch.compiled models (#97565)
Opening this so I can discuss with @albanD

I built a proof of concept of an in place API for an nn.Module that allows us to save and load a torch.compiled model with no issues https://github.com/msaroufim/mlsys-experiments/blob/main/save-compiled-model.py

So users can run `model.compile()` and then run `torch.save(model, "model.pt")` and `torch.load("model.pt")` with no issues, unlike the rather strange current suggestion we give to users, which is `opt_mod = torch.compile(mod); torch.save(mod, "model.pt")`

Right now I'm trying to extend this to work for nn.modules more generally

TODO: Failing tests
* [x] torch.jit.load -> the issue was caused by aliasing `__call__` to `_call_impl`; _call_impl used to be skipped but now it no longer is, so I expanded the skip check. I added an explicit `torch.jit.load()` test now, which @davidberard98 suggested
* [x] functorch seems to be a flake - ran locally and it worked `pytest functorch/test_eager_transforms.py`
* [x] a test infra flake - `test_testing.py::TestImports::test_no_mutate_global_logging_on_import_path_functorch`
* [x] It seems like I broke inlining in dynamo though `python -m pytest test/dynamo/test_dynamic_shapes.py -k test_issue175` chatting with Voz about it but still not entirely sure how to fix - found a workaround after chatting with @yanboliang
* [x] `pytest test/dynamo/test_modules.py` and `test/dynamo/test_dynamic_shapes` `test/dynamo/test_misc.py` seem to be failing in CI but trying it out locally they all pass tests passed with 0 failures
* [x] `pytest test/profiler/test_profiler_tree.py ` these tests have ProfilerTrees explicitly printed and will now break if __call__ is not in tree - ran with `EXPECT_ACCEPT=1`
* [x] `pytest test/test_torch.py::TestTorch::test_typed_storage_deprecation_warning` a flake, ran this locally and it works fine
* [x] I reverted my changes to `_dynamo/nn_module.py` since it looks like @wconstab is now directly handling `_call_impl` there but this is triggering an infinite inlining which is crashing
* [x] Tried out to instead override `__call__`, python doesnt like this though https://github.com/pytorch/pytorch/pull/97565#issuecomment-1524570439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97565
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2023-05-04 16:23:12 +00:00
4b2f496eab [c10d] Implement new Store methods in PrefixStore. (#100380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100380
Approved by: https://github.com/fduwjj
2023-05-04 15:18:58 +00:00
97245a06e1 Turn on TORCH_CHECK for NT wrap_buffer (#100596)
TORCH_INTERNAL_ASSERT_DEBUG_ONLY isn't enabled in non-debug builds, but for 1-dimensional Tensors the check is cheap enough, and not catching this can slow down development a lot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100596
Approved by: https://github.com/drisspg
2023-05-04 15:06:27 +00:00
26533349a7 [codemod] Replace hasattr with getattr in caffe2/torch/jit/_trace.py (#100362)
Summary:
The pattern
```
X.Y if hasattr(X, "Y") else Z
```
can be replaced with
```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate.
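
A concrete (hypothetical) example of the codemod; the object and attribute names here are illustrative, not from the changed file:

```python
from types import SimpleNamespace

config = SimpleNamespace(retries=3)  # illustrative object without a "timeout" attribute

# Both lines read config.timeout if it exists, falling back to 30;
# the getattr form avoids repeating the attribute name.
timeout = config.timeout if hasattr(config, "timeout") else 30
timeout = getattr(config, "timeout", 30)
```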

**This diff is very low risk. Green tests indicate that you can safely Accept & Ship.**

Test Plan: Sandcastle

Differential Revision: D44886479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100362
Approved by: https://github.com/Skylion007
2023-05-04 14:49:04 +00:00
6120c5842c [codemod] Replace hasattr with getattr in caffe2/torch/ao/quantization/utils.py (#100361)
Summary:
The pattern
```
X.Y if hasattr(X, "Y") else Z
```
can be replaced with
```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate.

**This diff is very low risk. Green tests indicate that you can safely Accept & Ship.**

Test Plan: Sandcastle

Reviewed By: jerryzh168

Differential Revision: D44886493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100361
Approved by: https://github.com/Skylion007
2023-05-04 14:46:38 +00:00
ff974cd962 Fixing interpolate on uint8 unsqueezed 3D CL tensor (#100258)
Description:

- Fixed a bug with memory format issue:

When the input is a channels-last 4D tensor produced as follows:
```
t = torch.ones(1, 3, 32, 32).contiguous(memory_format=torch.channels_last)
t = t[0]
t = t[None, ...]
```
upsampling will produce output with channels-first memory format, but our AVX code does not take that into account.

Here is repro code showing that the nightly is broken for this particular case:
```python
import torch

torch.manual_seed(0)

input = torch.randint(0, 256, size=(1, 3, 256, 256), dtype=torch.uint8).contiguous(memory_format=torch.channels_last)
input = input[0]
input = input[None, ...]

assert input.is_contiguous(memory_format=torch.channels_last)

output = torch.nn.functional.interpolate(input, (224, 224), mode="bilinear", antialias=True)
expected = torch.nn.functional.interpolate(input.float(), (224, 224), mode="bilinear", antialias=True)

assert output.is_contiguous()
assert expected.is_contiguous()

torch.testing.assert_close(expected, output.float(), atol=1, rtol=1)
# >
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/pytorch/torch/testing/_comparison.py", line 1511, in assert_close
#     raise error_metas[0].to_error(msg)
# AssertionError: Tensor-likes are not close!
#
# Mismatched elements: 14120 / 150528 (9.4%)
# Greatest absolute difference: 214.6112518310547 at index (0, 1, 152, 13) (up to 1 allowed)
# Greatest relative difference: 17.005144119262695 at index (0, 2, 26, 2) (up to 1 allowed)
```

- Also renamed `needs_unpacking` to `skip_unpacking`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100258
Approved by: https://github.com/NicolasHug
2023-05-04 13:28:33 +00:00
9b3552eb2c Add runtime assertions for input shape constraints (#100247)
This PR adds runtime assertions as an extra pass in the exported graph. Several high level information:
1. We specialize all dimensions that were not added to the user input constraints
2. We haven't added relational constraints as runtime assertions (e.g. x[1] == x[0]); we will do so in a follow-up diff

Differential Revision: [D45408971](https://our.internmc.facebook.com/intern/diff/D45408971)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100247
Approved by: https://github.com/guangy10, https://github.com/avikchaudhuri
2023-05-04 13:26:58 +00:00
8f1122ce7b [inductor] Enable conditional use of tl.reduce (#100569)
Splitting this into a separate PR so we can test both branches in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100569
Approved by: https://github.com/ngimel
2023-05-04 13:07:34 +00:00
94d306fd45 [inductor] Stop using x + tl.zeros(...) in generated triton (#100163)
For reductions, this changes the accumulator
```python
_tmp2 = tl.zeros([XBLOCK, RBLOCK], tl.int8) + -128
```
to
```python
_tmp2 = tl.full([XBLOCK, RBLOCK], -128, tl.int32)
```
which is equivalent since addition does type promotion from `int8` to `int32`

For constant indexing, this changes
```python
tl.store(in_out_ptr0 + (0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp4, None)
```
to
```python
tl.store(in_out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

For variable indexing, this changes
```python
tl.store(out_ptr0 + (0 + tl.zeros([XBLOCK], tl.int32)), tmp1, None)
```
to
```python
tl.store(out_ptr0 + (tl.broadcast_to(x0, [XBLOCK])), tmp1, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100163
Approved by: https://github.com/ngimel
2023-05-04 13:07:34 +00:00
06fbd5dc9c [inductor] Fix argmin/max with duplicate values (#100573)
Fixes #99879

This adds `minimum_with_index` helper functions to compute the minimum
value and index simultaneously, with a preference for the smaller
index which is required to match eager in case of duplicates.

I also replace the mask-and-sum hack with a `tl.reduce` that uses
the previously mentioned helper. This additionally fixes the indices
being added together in the case of duplicates.
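
For reference, the eager behavior being matched (a small illustrative check):

```python
# With duplicate extremal values, eager argmin/argmax return the smallest index,
# which is the behavior the minimum_with_index helper is described as preferring.
import torch

t = torch.tensor([3.0, 1.0, 1.0, 5.0])
print(torch.argmin(t))  # tensor(1): the first occurrence of the minimum, not index 2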

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100573
Approved by: https://github.com/ngimel
2023-05-04 13:07:31 +00:00
4918940184 [inductor] Fix nan-handling of max and min reductions (#100572)
This adds helpers that replace Triton's `minimum`, `maximum`, `min` and
`max` with correct NaN propagation. I also removed
`ops.int_minimum` in favor of `ops.minimum` because we can just omit
the NaN checks by checking the dtype.
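
A small sketch of the eager semantics the helpers are meant to match:

```python
import torch

t = torch.tensor([1.0, float("nan"), 3.0])
print(torch.max(t))                     # tensor(nan): reductions propagate NaN
print(torch.maximum(t, torch.ones(3)))  # elementwise maximum propagates NaN as well
```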

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100572
Approved by: https://github.com/ngimel
2023-05-04 13:07:27 +00:00
19a57870a3 Fix a number of issues with divs in ValueRangeAnalysis (#100547)
This PR:
- Adds `floordiv` and `truncdiv` as they were missing
- Maps `div` to its correct definition (it was being mapped to `floordiv`)
- Simplifies the bounds of `floordiv`
- Fixes some issues with the returned types of `floor` and `ceil`
- Adds tests for the previous point

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100547
Approved by: https://github.com/ezyang
2023-05-04 12:31:55 +00:00
a204f7f518 [c10d] Fix subprocess group handling in scatter_object_list. (#100552)
`scatter_object_list` assumed `src` was a group rank, while all other collectives use global ranks.
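
A usage sketch under the fixed semantics (assumes an initialized 2-rank process group, e.g. launched with torchrun; after this fix `src` refers to the global rank, consistent with other collectives):

```python
import torch.distributed as dist

# init_process_group must already have been called on every rank.
output = [None]
to_scatter = [{"payload": 0}, {"payload": 1}] if dist.get_rank() == 0 else None
dist.scatter_object_list(output, to_scatter, src=0)  # src=0 is the global rank of the source
print(dist.get_rank(), output[0])
```
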
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100552
Approved by: https://github.com/fduwjj
2023-05-04 10:04:21 +00:00
aecbaa5d45 [vmap] bucketize (#95783)
Ref: https://github.com/pytorch/pytorch/issues/96740
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95783
Approved by: https://github.com/zou3519
2023-05-04 07:23:35 +00:00
c4fd76e7b4 Revert "[export] Pickle result of export (#100423)"
This reverts commit 7226dbcbce87464fb170019a6ffeb80f82c37804.

Reverted https://github.com/pytorch/pytorch/pull/100423 on behalf of https://github.com/angelayi due to merge conflict ([comment](https://github.com/pytorch/pytorch/pull/100423#issuecomment-1534163373))
2023-05-04 06:41:06 +00:00
7226dbcbce [export] Pickle result of export (#100423)
Pickles the metadata["val"] into a TensorMetadata struct so that it'll be retained when we unpickle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100423
Approved by: https://github.com/mergennachin
2023-05-04 06:37:16 +00:00
8d598f2f25 [exportdb] Change case ids to case names for UserErrors. (#100600)
Associate UserErrors with the unique case name instead of the
case ids, because in practice they work similarly but names are more
meaningful to use and remember.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100600
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2023-05-04 06:14:50 +00:00
c58d9642d0 Don't build Triton from source in benchmarks/dynamo/Makefile (#100613)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100613
Approved by: https://github.com/voznesenskym
2023-05-04 06:13:42 +00:00
8eb82135d1 [docs] Docs for writing ATen IR passes + FX Pattern matching (#100577)
I'm not really sure where to put this...maybe just link it somewhere in torch.compile docs?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100577
Approved by: https://github.com/msaroufim
2023-05-04 05:17:10 +00:00
fe3ecfe0cf Add AotAutogradFallbackTests to dynamic suite (#100454)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100454
Approved by: https://github.com/ezyang
2023-05-04 04:28:45 +00:00
2dca418112 Reland basic dynamo support for traceable collectives (#100476)
Relative to the original land, this also contains:
- Fix torchdeploy import of functional collectives
- Can't import torchdynamo utils due to torch._refs being missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100476
Approved by: https://github.com/kumpera
2023-05-04 04:25:35 +00:00
9f3c6b1b63 Fix graph break in a common func(self, *args) pattern (Faster stable diffusion) (#100444)
Stable Diffusion has a pattern like this:

```
    def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None, **cross_attention_kwargs):
        # The `Attention` class can call different attention processors / attention functions
        # here we simply pass along all tensors to the selected processor class
        # For standard processors that are defined here, `**cross_attention_kwargs` is empty
        return self.processor(
            self,
            hidden_states,
            encoder_hidden_states=encoder_hidden_states,
            attention_mask=attention_mask,
            **cross_attention_kwargs,
        )
```

Wherein processor is something like `AttnProcessor2_0`, which is callable but not an nn.Module.

This allows for a significant speedup in stable diffusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100444
Approved by: https://github.com/anijain2305
2023-05-04 03:38:52 +00:00
c2556c034d Improve minifier printing to be more chatty when it makes sense (#100486)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100486
Approved by: https://github.com/voznesenskym
2023-05-04 02:51:26 +00:00
c7e9f40653 Misc accuracy improvements on minifier (#100447)
The changes:

* Add config knob `same_two_models_use_fp64` for toggling whether or not to use fp64
* Add a test showing that RMSE is superior to atol/rtol
* Add a `--strict-accuracy` option, which allows for testing against integral/boolean accuracy; by default only regular (floating-point) accuracy is now checked. There's a test which exercises this; it's a little delicate, but I had trouble thinking of a good test otherwise.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100447
Approved by: https://github.com/voznesenskym
2023-05-04 02:51:26 +00:00
7f997aa393 [codemod] Replace hasattr with getattr in caffe2/test/distributed/fsdp/test_fsdp_optim_state.py (#100360)
Summary:
The pattern
```
X.Y if hasattr(X, "Y") else Z
```
can be replaced with
```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate.

**This diff is very low risk. Green tests indicate that you can safely Accept & Ship.**

Test Plan: Sandcastle

Differential Revision: D44886500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100360
Approved by: https://github.com/rohan-varma, https://github.com/Skylion007, https://github.com/awgu
2023-05-04 02:44:22 +00:00
8df748f3be [vision hash update] update the pinned vision hash (#100510)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100510
Approved by: https://github.com/pytorchbot
2023-05-04 02:32:43 +00:00
c29ab84115 Fix bug in process_group_name when there is duplicate pgs (#100518)
Summary: with the new c10d API, we don't need all ranks to call new_group. Integrate with the new API, so that every rank just calls new_group 3 times, with a local barrier among the members within the group.

Reviewed By: xunnanxu, eeggl

Differential Revision: D45315615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100518
Approved by: https://github.com/kumpera
2023-05-04 02:12:28 +00:00
253b9d3247 [replicate] input casting support (#100216)
Supports input casting by doing this in the pre hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100216
Approved by: https://github.com/awgu
2023-05-04 01:46:15 +00:00
e87ed2a88d [primTorch] add ref for polar (#100345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100345
Approved by: https://github.com/ezyang
2023-05-04 01:37:02 +00:00
2892c06e82 Ensure device arg is passed to test_transformers (#100260)
# Summary
Follow up to #100121 to actually make sure that test functions are accepting a device arg as input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100260
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-05-04 01:36:06 +00:00
f558bb6f76 inplace PyTorchStreamReader getRecord() (#100418)
Summary: Sometimes we want getRecord() to write into pre-allocated memory to save CPU memory. Adds a new API to support in-place memory writing.

Test Plan: caffe2/serialize/inline_container_test

Reviewed By: zyan0

Differential Revision: D45439517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100418
Approved by: https://github.com/davidberard98, https://github.com/houseroad
2023-05-04 01:30:59 +00:00
e6c0164f1c Use Boxed Calling Convention for AOT Eager (#100417)
The boxed format is more memory efficient, especially with backwards & activations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100417
Approved by: https://github.com/ezyang
2023-05-04 01:22:36 +00:00
d67e4db8ff Require contiguous for view_as_complex (#100428)
Fix for https://github.com/pytorch/pytorch/issues/100086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100428
Approved by: https://github.com/ngimel
2023-05-04 01:18:07 +00:00
d25c93f919 Remove speech_transformer workaround, torchbench handles it correctly now (#100558)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100558
Approved by: https://github.com/albanD
2023-05-04 01:14:24 +00:00
fd841763e1 [dynamo] Minor fixes and clean-up in eval_frame.c (#100496)
This fixes a few reference counting bugs in eval_frame.c, simplifies a few functions a bit, and adds a few missing error handling code paths.  Probably the only important reference counting bug is that `call_callback` previously leaked `THPPyInterpreterFrame` in Python 3.11+.

Summary below:

- eval_frame_callback_get shouldn't incref Py_None
- Don't leak THPPyInterpreterFrame in Python 3.11+
- set_profiler_hooks would decref profiler_start_hook and profiler_end_hook too many times if called with None as an argument (but we never actually used that code path).
- Simplify some argument parsing
- Only create guard_profiler_name_str once
- Add a few missing error checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100496
Approved by: https://github.com/albanD
2023-05-04 00:45:15 +00:00
6aeb85add8 add checkpoint support for custom device (#99626)
Fixes #ISSUE_NUMBER
1. Add checkpoint support for custom devices.
2. Add a device argument: I wanted to add a device="cuda" parameter to the `forward` function of `CheckpointFunction` so the device type can be specified when using it, but the `apply` function of `torch.autograd.Function` does not support `kwargs`, so I added a variable named `_device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99626
Approved by: https://github.com/soulitzer
2023-05-04 00:23:42 +00:00
eqy
3c1dd0a4b1 [cuDNN][CUDA] Fix for install_cudnn.sh following 12.1 CI update (#100501)
Trying out a fix for the path

CC @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100501
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-05-04 00:17:10 +00:00
fa2bfab93e [C10D] Drop the GIL when creating a TCPStore to avoid deadlocks. (#100555)
TCPStore creation is a blocking operation, so it can lead to a deadlock
if multiple threads are trying to instantiate it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100555
Approved by: https://github.com/H-Huang
2023-05-04 00:15:55 +00:00
c3bcf5f628 Support multiple separator characters, '/' and '\\', on Windows. (#98146)
On Windows, both '/' and '\\' can be used as a path separator, so `StripBasename` should handle them as path separators.

`StripBasename` is used in the `is_enabled` function in `torch\csrc\jit\jit_log.cpp`
Therefore, without this pull request, is_enabled does not work properly on Windows.

For more details, please refer to the issue #98145.

Fixes #98145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98146
Approved by: https://github.com/ezyang
2023-05-04 00:15:28 +00:00
f82756335d [ONNX] Update 'Functionalize' pass to support pre-decomp graph; Drop 'aten_graph' arg for 'DynamoExporter' (#99667)
Summary
- Previously this was required by and entangled with `tracing_mode=symbolic` for `dynamic` tracing.
  That is resolved by #99555 and its follow-ups.
- The later decomposition pass will do graph lowering, so this step is duplicated.
- Updated `Functionalization` to work around https://github.com/pytorch/pytorch/issues/99774#issuecomment-1527949391

Todo
- Training vs eval in dynamo_export
  So we are effectively exporting all models in training mode by
  default. But for the sake of this export we are only interested in eval mode.
  The question is, should we call `model.eval()` in `dynamo_export`?
  Tests with models containing batch norm fail 'functionalization' in training mode.
  We are explicitly calling `model.eval()` for these models for now.
- Merge decomp and functionalize pass. Both call into `make_fx`.
  Merging potentially increases performance. However it is unclear
  if it will result in different behavior.

Fixes #99662. (For the functionalization issue. Still need missing op support.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99667
Approved by: https://github.com/titaiwangms
2023-05-04 00:01:22 +00:00
9bc68fcd25 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

- If stride == 1, we can use the whole warp to reduce the gradients.
- If stride < warp_size, we don't need the internal `while (start_feature < stride)` loop as blockDim.x is always 32.

These changes improve the performance of the kernel when duplicates are present and do not affect the performance with a low number of duplicates. The implementation is deterministic.

The proposed implementation uses opmath_t to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in scalar_t and does not overflow. In addition, when the stride is 1, we use warp shuffles to sum the gradient, so the order of the additions is slightly different from a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    tensor_size = tuple(test_input[:5])
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505
Approved by: https://github.com/ngimel
2023-05-03 23:52:58 +00:00
61813a8e62 [reland][CI] Start to collect inference perf with cpp_wrapper ON (#100187) (#100502)
Summary: Previous failures were caused by GCP outage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100502
Approved by: https://github.com/huydhn
2023-05-03 23:51:18 +00:00
1a6f613b8f Check uppercase when checking for merge blocking SEVs (#100559)
Otherwise it's triggered by phrases like "removing merge blocking" in the details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100559
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-05-03 23:47:06 +00:00
0a6a0ac49b [MPS] Add dot input check (#100099)
Fixes #99564

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at c21d056</samp>

This pull request adds input validation and error handling tests for the `dot` and `vdot` operations in the `mps` namespace, using a new helper function and a new test function. This enhances the MPS backend and the testing framework for these operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100099
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 23:35:10 +00:00
3c5ec6af14 Partition modules (#98628)
Added helper functions to match nodes in the graph that are decomposed from their source (leaf modules, or functional ops), as a result of dynamo tracing.

`get_source_partitions(graph: torch.fx.Graph, wanted_sources: List[Any]) -> Dict[Any, SourcePartition]`

Args:
* graph: The graph we want to partition
* wanted_sources: List of sources of nodes that were decomposed from this source. This can be a function (ex. torch.nn.functional.linear) or a leaf module type (ex. torch.nn.Linear)

Returns:
* Dictionary mapping sources (ex. torch.nn.modules.linear.Linear) to a list of SourcePartitions that correspond to the list of nodes that were flattened from a module of that type.

```
@dataclass
class SourcePartition():
    # Nodes in a particular partition
    nodes: List[Node]
    # Module type
    module_type: Type
    # Nodes in the graph that are needed as inputs to the partition
    input_nodes: List[Node] = field(default_factory=list)
    # Nodes in the partition that are being used by nodes outside of the partition
    output_nodes: List[Node] = field(default_factory=list)
    # Parameters that are being used
    params: List[str] = field(default_factory=list)
```

Example:

Original:
```
x -> linear -> linear -> relu -> linear
```
Traced graph:
```
.graph():
    %arg0 : [#users=1] = placeholder[target=arg0]
    %_param_constant0 : [#users=1] = get_attr[target=_param_constant0]
    %t_default : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0,), kwargs = {})
    %_param_constant1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1, %arg0, %t_default), kwargs = {})
    %_param_constant0_1 : [#users=1] = get_attr[target=_param_constant0]
    %t_default_1 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant0_1,), kwargs = {})
    %_param_constant1_1 : [#users=1] = get_attr[target=_param_constant1]
    %addmm_default_1 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant1_1, %addmm_default, %t_default_1), kwargs = {})
    %relu_default : [#users=1] = call_function[target=torch.ops.aten.relu.default](args = (%addmm_default_1,), kwargs = {})
    %_param_constant2 : [#users=1] = get_attr[target=_param_constant2]
    %t_default_2 : [#users=1] = call_function[target=torch.ops.aten.t.default](args = (%_param_constant2,), kwargs = {})
    %_param_constant3 : [#users=1] = get_attr[target=_param_constant3]
    %addmm_default_2 : [#users=1] = call_function[target=torch.ops.aten.addmm.default](args = (%_param_constant3, %relu_default, %t_default_2), kwargs = {})
    return [addmm_default_2]
```
Result of `get_module_partitions`:
```
{<class 'torch.nn.modules.linear.Linear'>: [
    ModulePartition(nodes=[_param_constant0, t_default, _param_constant1, addmm_default], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[arg0], output_nodes=[addmm_default], params=["_param_constant0", "_param_constant1"]),
    ModulePartition(nodes=[_param_constant0_1, t_default_1, _param_constant1_1, addmm_default_1], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[addmm_default], output_nodes=[addmm_default_1], params=["_param_constant0_1", "_param_constant1_1"]),
    ModulePartition(nodes=[_param_constant2, t_default_2, _param_constant3, addmm_default_2], module_type=<class 'torch.nn.modules.linear.Linear'>, input_nodes=[relu_default], output_nodes=[addmm_default_2], params=["_param_constant2", "_param_constant3"])],

 <class 'torch.nn.modules.activation.ReLU'>: [
    ModulePartition(nodes=[relu_default], module_type=<class 'torch.nn.modules.activation.ReLU'>, input_nodes=[addmm_default_1], output_nodes=[relu_default], params=[])]}
```

Also added helper function to check if two module partitions are connected:
`check_subgraphs_connected(subgraph1: SourcePartition, subgraph2: SourcePartition) -> bool`
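
A usage sketch of the API described above. The import path and the exact `torch._dynamo.export` call signature are assumptions (they match recent PyTorch releases but may differ at the time of this PR); the partitioner relies on metadata recorded during dynamo tracing, so a dynamo-traced graph is used:

```python
import torch
from torch.fx.passes.utils.source_matcher_utils import get_source_partitions  # path is an assumption

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l1, self.l2 = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.l2(torch.relu(self.l1(x)))

# Trace with dynamo so that source metadata is attached to the decomposed nodes.
gm, _ = torch._dynamo.export(M(), torch.randn(2, 4))
partitions = get_source_partitions(gm.graph, [torch.nn.Linear])
for p in partitions.get(torch.nn.Linear, []):
    print([n.name for n in p.nodes], p.input_nodes, p.output_nodes)
```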

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98628
Approved by: https://github.com/cccclai
2023-05-03 23:31:56 +00:00
75945d54f7 Properly propagates checkpoint wrapper args and kwargs (#99791)
It looks like passing `*args` and `**kwargs` to `checkpoint_wrapper()` does not work because someone forgot some `*`s. This adds them back in.
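
An illustration of the kind of bug being fixed (simplified, hypothetical code, not the actual `checkpoint_wrapper` implementation):

```python
def forward_buggy(fn, *args, **kwargs):
    return fn(args, kwargs)        # forgotten stars: fn receives one tuple and one dict

def forward_fixed(fn, *args, **kwargs):
    return fn(*args, **kwargs)     # arguments are forwarded as intended

forward_fixed(print, "hello", "world", sep=", ")  # the buggy version would print the raw tuple/dict
```
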
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99791
Approved by: https://github.com/awgu
2023-05-03 23:19:21 +00:00
8f6951cf55 [cuDNN][cuDNN V8 frontend API] Clean up time_sorted_plan workaround for cuDNN v8 API (#100287)
`cudnn-frontend` (bumped in #99674) has added support for limiting the number of kernels to benchmark, so we can remove the workaround introduced in #91032.

CC @ngimel @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100287
Approved by: https://github.com/ngimel
2023-05-03 22:37:16 +00:00
478a5ddd8a Mark Windows CPU jobs as unstable (#100581)
Caused by https://github.com/pytorch/pytorch/pull/100377: something removes the VS2019 installation on the non-ephemeral runner. I think moving this to unstable is nicer for gathering signal in trunk without completely disabling the job or reverting https://github.com/pytorch/pytorch/pull/100377 (for the Nth time)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100581
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-05-03 21:43:43 +00:00
f04bb519f5 [DataPipe] Change DataPipe display name in profiler (#100042)
Script:
```python
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

ls = range(16)
dp = IterableWrapper(ls).map(fn_2).map(fn_3).map(fn_4)

rs = MultiProcessingReadingService(num_workers=0, main_prefetch_cnt=0, worker_prefetch_cnt=0)
dl2 = DataLoader2(dp, reading_service=rs)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    for x in dl2:
        pass
```

Output before:
```
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
             enumerate(DataPipe)#MapperIterDataPipe        76.37%       1.419ms       213.08%       3.959ms      80.796us            49
    enumerate(DataPipe)#IterableWrapperIterDataPipe        12.70%     236.000us        12.70%     236.000us      13.882us            17
...
```

Output after:
```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Mapper(datapipe=Mapper, fn=fn_4, input_col=None, out...        29.79%     645.000us        99.17%       2.147ms     126.294us            17
Mapper(datapipe=IterableWrapper, fn=fn_2, input_col=...        29.24%     633.000us        42.96%     930.000us      54.706us            17
Mapper(datapipe=Mapper, fn=fn_3, input_col=None, out...        24.76%     536.000us        68.59%       1.485ms      87.353us            17
IterableWrapper(deepcopy=True, iterable=range(0, 16)...        10.58%     229.000us        10.58%     229.000us      13.471us            17
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100042
Approved by: https://github.com/ejguan
2023-05-03 21:36:13 +00:00
72c68704d7 Revert "Temporarily move ROCm to unstable (#99579)" (#100564)
This reverts commit c412056921f1e251bd955bb5fd9bd117d5a97ee5.  Need to revert this manually due to a merge conflict.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100564
Approved by: https://github.com/kit1980
2023-05-03 21:04:23 +00:00
a304b2a45f Activate TracingContext earlier (#100043)
Ensure any calls into VariableTracker have a valid TC

Previously, calls into VariableBuilder from symbolic locals construction were not done under an active TC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100043
Approved by: https://github.com/anijain2305
2023-05-03 20:55:30 +00:00
3d10e748e7 [Reland] Initial version of Dynamo capture for HigherOrderOperator (#100544)
Original PR #99988

The problem was that we added `wrap` to torch._ops, which actually puts
it on `torch.ops.wrap`, a namespace that can be open-registered
to. The fix is that we now shove `wrap` into a new file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100544
Approved by: https://github.com/voznesenskym
2023-05-03 20:49:05 +00:00
e552b91286 torch.utils.checkpoint warns if user does not pass use_reentrant explicitly (#100551)
Now that we have updated all internal callsites, per https://fb.workplace.com/groups/pytorch.oss.dev/permalink/1635183750239493/ we should raise a warning when use_reentrant is not explicitly passed for 2.1

Deprecation note:
- Not passing in use_reentrant explicitly is now deprecated and will raise a warning. In the future the default value of use_reentrant will be False. To preserve the existing behavior you can pass in use_reentrant=True. It is recommended that you use use_reentrant=False.
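
A call that follows the recommendation (the checkpointed function and input below are placeholders; a minimal sketch):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):                      # placeholder for an expensive sub-module / function
    return torch.relu(x) * 2

x = torch.randn(4, 4, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # explicit, as the warning now requests
y.sum().backward()
```
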
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100551
Approved by: https://github.com/Skylion007
2023-05-03 20:48:07 +00:00
0595ecf00c [ONNX] Add symbolic for _convolution_mode (#89107)
As per #68880
implement the operator _convolution_mode in the ONNX exporter. This will allow users to leverage the string padding mode, where padding can be set to 'valid' or 'same'.
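
For context, the eager-side feature the new symbolic covers: string padding on convolutions, which is lowered to `_convolution_mode` internally. The export call below is a sketch; exact opset support is an assumption:

```python
import torch

conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding="same")  # 'same'/'valid' string padding
x = torch.randn(1, 3, 32, 32)
torch.onnx.export(conv, x, "conv_same.onnx", opset_version=13)
```
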
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89107
Approved by: https://github.com/titaiwangms, https://github.com/BowenBao
2023-05-03 20:42:30 +00:00
d419ad17b2 [dynamo] Disable pytest AST rewriting in test_export (#100484)
pytest rewrites Python assert statements in unit tests to provide more detailed error messages. Unfortunately, this breaks some dynamo tests. Disable AST rewriting in test_export.py so that "pytest test/dynamo/test_export.py" passes.

Fixes #93449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100484
Approved by: https://github.com/tugsbayasgalan
2023-05-03 20:40:46 +00:00
2f13a7a7a7 Prevent GraphArg from keeping real tensors live (#100515)
This may potentially fix an OOM on IG Cover model (Meta only: T152238176)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100515
Approved by: https://github.com/voznesenskym, https://github.com/yf225
2023-05-03 20:14:19 +00:00
16d268e280 Fix comment error in TensorIterator.cpp (#100227)
Fixes comment error in TensorIterator.cpp

I believe there is an error in the comment, based on the following code snippet
```c++
if (shape0 * stride[dim0] != stride[dim1]) {
        return false;
}
```
I have corrected the comment accordingly. Please let me know if any further action is required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100227
Approved by: https://github.com/kit1980
2023-05-03 19:59:14 +00:00
6a12f10b08 Publicly exposing torch.backends.cpu.get_cpu_capability() (#100164)
Description:

- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.

- In torchvision's Resize method we want to know the current CPU capability in order to pick an appropriate codepath depending on CPU capabilities

The newly coded vectorized resize of uint8 images on AVX2-supported CPUs is now faster than the older way (uint8->float->resize->uint8). However, on non-AVX hardware (e.g. Mac M1) certain configs are slower using native uint8.
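
Querying the newly exposed function (the return values shown in the comment are examples, not an exhaustive list):

```python
import torch

cap = torch.backends.cpu.get_cpu_capability()
print(cap)   # e.g. "AVX2", "AVX512", or "DEFAULT" depending on the hardware
```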

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 19:02:07 +00:00
3e18d3958b [DataLoader] Follow-up Fix: TypeVars of Sampler (#100409)
API backward compatibility fixed:
https://github.com/pytorch/pytorch/pull/97338#discussion_r1169164163

A map-style Dataset can accept non-integer indices from custom Samplers.
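
A sketch of that use case: a map-style Dataset keyed by strings, paired with a custom Sampler that yields those keys (class names here are illustrative, not from the PR):

```python
from torch.utils.data import DataLoader, Dataset, Sampler

class KeyedDataset(Dataset):
    def __init__(self, data):
        self.data = data                      # e.g. {"a": 0, "b": 1}
    def __getitem__(self, key):
        return self.data[key]                 # non-integer (string) index
    def __len__(self):
        return len(self.data)

class KeySampler(Sampler):
    def __init__(self, keys):
        self.keys = list(keys)
    def __iter__(self):
        return iter(self.keys)
    def __len__(self):
        return len(self.keys)

ds = KeyedDataset({"a": 0, "b": 1})
loader = DataLoader(ds, sampler=KeySampler(ds.data), batch_size=1)
print(list(loader))
```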

Fixes #97338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100409
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-05-03 17:38:31 +00:00
db4572dbf1 Revert tl.reduce usage (#100521)
Test Plan: sandcastle

Reviewed By: bertmaher

Differential Revision: D45513572

fbshipit-source-id: a03df851503f72313dfb50238e7d6db9239bf42e
2023-05-03 12:20:33 -04:00
287f74c4fc Revert D45387167: Multisect successfully blamed D45387167 for test or build failures (#100424)
Summary:
This diff is reverting D45387167
D45387167: Basic dynamo support for traceable collectives (#94440) by wconstab has been identified to be causing the following test or build failures (internal)

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: s4ayub

Differential Revision: D45448312

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100424
Approved by: https://github.com/rohan-varma, https://github.com/kumpera
2023-05-03 16:10:54 +00:00
2ac6ee7f12 Migrate jobs: windows.4xlarge->windows.4xlarge.nonephemeral (#100548)
This is reopening of the PR https://github.com/pytorch/pytorch/pull/100377

# About this PR

Due to increased pressure over our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts are related to breakages in, or misbehavior of, CI jobs that put the runners in a bad state. Other possible impacts are related to exhaustion of resources, especially disk space, though memory might also be a contender, as CI trash piles up on those instances.

As a somewhat middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority to be terminated when demand is lower.

Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072

This is the first step in a multi-step approach where we will migrate away from all ephemeral Windows instances and follow the lead of `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible contenders to investigate in case of misbehavior of particular CI jobs.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman

(cherry picked from commit 7caac545b1d8e5de797c9593981c9578685dba81)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100548
Approved by: https://github.com/jeanschmidt, https://github.com/janeyx99
2023-05-03 15:47:18 +00:00
843ead134c [ONNX] Add supported ops into test_fx_op_consistency - 1st batch (#100265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100265
Approved by: https://github.com/justinchuby
2023-05-03 14:42:25 +00:00
2ebb48ff28 [SPMD] add FQN argument to Override.replacement (#100473)
Differential Revision: [D45486089](https://our.internmc.facebook.com/intern/diff/D45486089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100473
Approved by: https://github.com/wanchaol
2023-05-03 14:20:01 +00:00
6cc0158311 Use maybe_unused attr in VariableType (#100498)
simple cosmetic change, a fallout of #100250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100498
Approved by: https://github.com/albanD
2023-05-03 14:14:29 +00:00
58f796ff5d Revert "Initial version of Dynamo capture for HigherOrderOperator (#99988)"
This reverts commit 4c99f9cdf236756efcdb365679ddec788b756eeb.

Reverted https://github.com/pytorch/pytorch/pull/99988 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/99988#issuecomment-1533081452))
2023-05-03 14:02:40 +00:00
b2d703e2d7 Stop loading functorch._C unless torchdim is needed (#100491)
Just a small optimization. This PR changes it so that only the import of
functorch.dim ends up loading functorch._C (which is entirely composed
of torchdim APIs)

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100491
Approved by: https://github.com/Chillee, https://github.com/kshitij12345
2023-05-03 13:47:49 +00:00
8b64dee5d2 [fix] torch_compile_debug don't log with 0 (#100462)
Fixes https://github.com/pytorch/pytorch/issues/99906

Tested locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100462
Approved by: https://github.com/mlazos
2023-05-03 08:23:09 +00:00
896eb1db26 [Dynamo] Skip TB Background_Matting model eager accuracy check because of non-determinism (#100513)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100513
Approved by: https://github.com/anijain2305
2023-05-03 07:06:50 +00:00
9e2808aa47 Retry resolving download.pytorch.org with Google DNS (#100509)
We want to retry resolving `download.pytorch.org` one more time with Google DNS as it seems to work on the runner.  This is to avoid the intermittent NXDOMAIN error when using AWS local DNS 10.0.0.2, for example https://github.com/pytorch/pytorch/actions/runs/4864714757/jobs/8674570552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100509
Approved by: https://github.com/malfet, https://github.com/atalman
2023-05-03 04:51:51 +00:00
771a9debbe [PT2E][Quant] Refactor quantizer and qnnpack qantizer code to support dqlinear config (#99399)
This diff introduces a few refactors:

- Move observer creation to utils.py.
- Use quantization spec to supply args to observers.
- Use annotation function registration corresponding to QuantizationConfig. This
  will later be used in dynamic quantized linear.

Differential Revision: [D45073790](https://our.internmc.facebook.com/intern/diff/D45073790/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99399
Approved by: https://github.com/jerryzh168
2023-05-03 03:23:32 +00:00
1bbca4fbc0 Relax after_aot restriction on no buffers, serialize small constants (#100472)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100472
Approved by: https://github.com/bdhirsh, https://github.com/voznesenskym
2023-05-03 03:10:22 +00:00
2089a9bd48 Refactor minifier tests to be more compact (#100471)
Mostly burning in more assumptions based on commonality across the tests,
so writing new tests takes less code.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100471
Approved by: https://github.com/voznesenskym
2023-05-03 03:10:22 +00:00
409fc7a4c7 Make hash_storage work with size 0/1 storage (#100467)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100467
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-05-03 03:10:19 +00:00
4b9ba3fad5 Allow discontiguous NestedTensors to empty_like (#98383)
# Summary
Previously we disallowed discontiguous NTs from being passed into empty_like. This was done out of an abundance of caution. However, it should be safe to create an empty NT for discontiguous NTs: empty_like does account for offsets, strides, and sizes in constructing the result, and therefore this should be safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98383
Approved by: https://github.com/cpuhrsch
2023-05-03 02:27:08 +00:00
419387f66f Run periodic jobs only twice a day on weekends (#100489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100489
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2023-05-03 02:07:28 +00:00
6b2ecb12b6 OpInfo: specifying sparse sample input function implies the corresponding layout support (#100392)
As in the title.

The PR fixes an issue of silently skipping tests as described in #100391.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100392
Approved by: https://github.com/pmeier, https://github.com/cpuhrsch
2023-05-03 02:04:39 +00:00
3ae0e23b90 Fix sum OpInfo for sparse sample inputs and assert coverage for sparse-enabled operators (#100391)
This PR enables sum tests for sparse sample inputs. Previously, the tests existed but were never run because the sum OpInfo instance was created without specifying `supports_sparse_*=True`. To avoid such mistakes in the future, the following PR https://github.com/pytorch/pytorch/pull/100392 enables the `supports_sparse_*` flags automatically when OpInfo creation specifies `sample_inputs_sparse_*_func`.

In addition, the PR applies several fixes to sum tests for sparse sample inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100391
Approved by: https://github.com/cpuhrsch
2023-05-03 02:04:39 +00:00
ffcbd1c2de Move tracked nn_modules from OutputGraph to TracingContext (#100457)
Lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100457
Approved by: https://github.com/anijain2305
2023-05-03 02:00:11 +00:00
2439090bef Remove special casing for stride/size setup for guards (#100456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100456
Approved by: https://github.com/ezyang
2023-05-03 01:59:52 +00:00
9439cb0e11 Avoid using einsum for torch.cat DTensor propagation (#100251)
DTensor was reusing `einop_rule` to propagate sharding for torch.cat.
However, einsum only supports up to 52 subscripts (i.e., input tensors).
We have encountered use cases where one cat operator has more than 60
input tensors. Therefore, this commit reimplements sharding prop
rule for cat without using einsum.

Differential Revision: [D45435232](https://our.internmc.facebook.com/intern/diff/D45435232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100251
Approved by: https://github.com/wanchaol
2023-05-03 01:56:18 +00:00
d23dbfff60 [ONNX] Add RemoveConstantInputStep to adapt torch inputs to ONNX inputs (#100252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100252
Approved by: https://github.com/BowenBao
2023-05-03 01:50:47 +00:00
6b5f50004d [inductor] Change the default value of layout (#100254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100254
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-05-03 01:48:05 +00:00
c3aa59c8f5 Revert "[WIP] enable cuda graphs support for flash attention with dropout (#100196)"
This reverts commit 32615618e439ce84d9365bd0d8892e34fcbe8add.

Reverted https://github.com/pytorch/pytorch/pull/100196 on behalf of https://github.com/clee2000 due to broke no ops build 32615618e4 https://github.com/pytorch/pytorch/actions/runs/4866578063/jobs/8678258318 ([comment](https://github.com/pytorch/pytorch/pull/100196#issuecomment-1532352810))
2023-05-03 01:41:56 +00:00
dc4a25312f Fix hosts update for binary build (#100507)
Forward fix for regression caused by https://github.com/pytorch/pytorch/pull/100475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100507
Approved by: https://github.com/huydhn
2023-05-03 00:58:04 +00:00
e4ad67f9c2 Remove ci: sev label and details from ci-sev.md template (#100504)
We don't want to add the label automatically: this way we can limit CI SEV creation to people with write permissions only.

Also remove the `details` section, as it's rarely filled in by people.

Fixes https://github.com/pytorch/pytorch/issues/100143
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100504
Approved by: https://github.com/clee2000, https://github.com/huydhn
2023-05-03 00:51:06 +00:00
34e90b8df1 Revert "[inductor] Cleanup strip_last_size logic (#100305)"
This reverts commit de7793d577fec7af286ba63b309ccd3795a8c038.

Reverted https://github.com/pytorch/pytorch/pull/100305 on behalf of https://github.com/jansel due to causes IMA errors on huggingface ([comment](https://github.com/pytorch/pytorch/pull/100305#issuecomment-1532317310))
2023-05-03 00:42:48 +00:00
8ec0a939a2 [PT2E][Quant] Fix bug in quant spec of symmetric static quant (#99398)
Activation quant spec should have qscheme = per_tensor_affine
Weights quant spec should have ch_axis=0 for per_channel_symmetric

Differential Revision: [D45073789](https://our.internmc.facebook.com/intern/diff/D45073789/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99398
Approved by: https://github.com/jerryzh168
2023-05-03 00:36:03 +00:00
8430430e94 Handle trailing masked column behavior for nested tensor (#100113)
Summary:
Handle trailing masked column behavior for nested tensor by padding during to_padded back to the original tensor size

https://github.com/pytorch/pytorch/issues/97111
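
A small sketch (not from the PR) of the behavior being exercised, assuming the public `torch.nested.to_padded_tensor` API: an explicit `output_size` pads trailing columns back out to the original tensor size.

```
import torch

# Jagged rows, as you might get after trailing columns are fully masked out.
nt = torch.nested.nested_tensor([torch.ones(2), torch.ones(3)])

# Pad back to a fixed original size (2 rows x 5 columns), filling with 0.0.
padded = torch.nested.to_padded_tensor(nt, 0.0, output_size=(2, 5))
print(padded.shape)  # torch.Size([2, 5])
```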

Test Plan: sandcastle & github

Differential Revision: D45167874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100113
Approved by: https://github.com/bertmaher, https://github.com/cpuhrsch, https://github.com/drisspg
2023-05-03 00:30:17 +00:00
0acfe2ce09 [dashboard] higher tolerance for AlbertForQuestionAnswering (#100277)
@desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100277
Approved by: https://github.com/desertfire
2023-05-02 23:51:08 +00:00
de7793d577 [inductor] Cleanup strip_last_size logic (#100305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100305
Approved by: https://github.com/ngimel
2023-05-02 23:46:26 +00:00
32615618e4 [WIP] enable cuda graphs support for flash attention with dropout (#100196)
Fixes #99905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100196
Approved by: https://github.com/drisspg
2023-05-02 23:05:31 +00:00
a587f1ff0a [CI] Change the dashboard run to once a day (#100499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100499
Approved by: https://github.com/huydhn
2023-05-02 22:35:49 +00:00
7ff71a3a48 Populate download.pytorch.org IP to container (#100475)
Follow-up to https://github.com/pytorch/pytorch/pull/100436, which disabled download.pytorch.org access over IPv6 due to access problems.

Why not copy `/etc/hosts` from host to the container? Because it would break container ip resolution in distributed tests, that relies on `socket.gethostbyname(socket.gethostname())` to work.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 756d0b1</samp>

Propagate `download.pytorch.org` IP address to docker containers in `test-pytorch-binary` action and workflow. This fixes DNS issues when downloading PyTorch binaries inside the containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 22:08:06 +00:00
2ec6eb3d09 Revert "PyTorch -> C++17 (#98209)" (#100497)
This reverts commit 8f0c825d36d6737000dd93bc86aa18761166a7b6.

https://github.com/pytorch/pytorch/pull/98209#issuecomment-1532099965, cannot revert normally due to unmerged linked diff

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100497
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-05-02 21:22:31 +00:00
543b7ebb50 Revert "Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100377)"
This reverts commit 7caac545b1d8e5de797c9593981c9578685dba81.

Reverted https://github.com/pytorch/pytorch/pull/100377 on behalf of https://github.com/malfet due to This is not the PR I've reviewed ([comment](https://github.com/pytorch/pytorch/pull/100377#issuecomment-1532148086))
2023-05-02 21:05:53 +00:00
7caac545b1 Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100377)
This is reopening of the PR [100091](https://github.com/pytorch/pytorch/pull/100091)

# About this PR

Due to increased pressure on our Windows runners, and the elevated cost of instantiating and bringing down those instances, we want to migrate instances from ephemeral to non-ephemeral.

Possible impacts are breakages or misbehavior in CI jobs that put the runners in a bad state. Other possible impacts relate to exhaustion of resources, especially disk space, though memory might also be a contender, as CI leftovers pile up on those instances.

As a somewhat middle-of-the-road approach, non-ephemeral instances are currently rotated stochastically: older instances get higher priority to be terminated when demand is lower.

Instances definition can be found here: https://github.com/pytorch/test-infra/pull/4072

This is a first in a multi-step approach where we will migrate away from all ephemeral windows instances and follow the lead of the `windows.g5.4xlarge.nvidia.gpu` in order to help reduce queue times for those instances. The phased approach follows:

* migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral` instances under `pytorch/pytorch`
* migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral` instances under `pytorch/pytorch`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.4xlarge` to `windows.4xlarge.nonephemeral`
* submit PRs to all repositories under `pytorch/` organization to migrate `windows.8xlarge.nvidia.gpu` to `windows.8xlarge.nvidia.gpu.nonephemeral`
* terminate the existence of `windows.4xlarge` and `windows.8xlarge.nvidia.gpu`
* evaluate and start the work related to the adoption of `windows.g5.4xlarge.nvidia.gpu` to replace `windows.8xlarge.nvidia.gpu.nonephemeral` in other repositories and use cases (proposed by @huydhn)

The reasoning for this phased approach is to reduce the scope of possible culprits to investigate in case particular CI jobs misbehave.

# Copilot Summary

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

This pull request migrates some windows workflows to use `nonephemeral` runners for better performance and reliability. It also adds support for new Python and CUDA versions for some binary builds. It affects the following files: `.github/templates/windows_binary_build_workflow.yml.j2`, `.github/workflows/generated-windows-binary-*.yml`, `.github/workflows/pull.yml`, `.github/actionlint.yaml`, `.github/workflows/_win-build.yml`, `.github/workflows/periodic.yml`, and `.github/workflows/trunk.yml`.

# Copilot Poem

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 579d87a</samp>

> _We're breaking free from the ephemeral chains_
> _We're running on the nonephemeral lanes_
> _We're building faster, testing stronger, supporting newer_
> _We're the non-ephemeral runners of fire_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100377
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/atalman
2023-05-02 20:41:12 +00:00
311c2bb7ec Move pattern match for foreach before bulky if-else in save_variables (#100445)
One caveat could be that the first if branch doesn't seem to use `arg.expr` at all.

fixes https://github.com/pytorch/pytorch/pull/96405#discussion_r1175669480.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100445
Approved by: https://github.com/soulitzer
2023-05-02 20:38:51 +00:00
e8a1d0be3e Revert "Mount /etc/hosts into container (#100475)"
This reverts commit 99ded8bbcea896b02f1c0babb055329c503ca95e.

Reverted https://github.com/pytorch/pytorch/pull/100475 on behalf of https://github.com/malfet due to Breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/100475#issuecomment-1532097309))
2023-05-02 20:23:32 +00:00
5fbb40669f [dynamo][moco] Disallow_in_graph distributed APIs (#100071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100071
Approved by: https://github.com/jansel, https://github.com/H-Huang
2023-05-02 20:09:25 +00:00
0dc671c247 [c10d] Add new Store methods: append, multi_get, multi_set. (#100379)
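A hedged sketch of how the new methods might be used; the exact signatures below are assumed from the method names rather than taken from the PR.

```
from datetime import timedelta
import torch.distributed as dist

# Single-process TCPStore acting as its own master, for illustration only.
store = dist.TCPStore("127.0.0.1", 29500, 1, True, timedelta(seconds=30))

store.set("log", "a")
store.append("log", "b")                     # assumed: appends to the existing value
store.multi_set(["k1", "k2"], ["v1", "v2"])  # assumed: batched set
print(store.multi_get(["log", "k1", "k2"]))  # assumed: batched get, values in key order
```
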
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100379
Approved by: https://github.com/fduwjj
2023-05-02 19:46:09 +00:00
8f0c825d36 PyTorch -> C++17 (#98209)
This diff locks in C++17 as the minimum standard with which PyTorch can be compiled.

This makes it possible to use all C++17 features in PyTorch.

This breaks backward compatibility in the sense that users with older compilers may find their compilers no longer are sufficient for the job.

Summary: #buildmore

Differential Revision: D44356879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98209
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/PaliC
2023-05-02 19:41:50 +00:00
50b0fff060 ci: win cpu test -> trunk, cuda test -> periodic (#100478)
Bumps windows CPU tests to trunk.yml (retaining build in pull.yml), this
also bumps the cuda tests to periodic.yml (retaining build in
trunk.yml).

Hopefully this change will rein in windows spending on AWS since it is
currently our costliest platform (in terms of dollar amount / hours used)

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100478
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-05-02 19:29:33 +00:00
0efab60401 [BE] Update cutlass with NVIDIA upstream changes to 3.1 (#100333)
Updates cutlass with some more upstream changes that went into the 3.1 rc. We already merged in 3.1 so best to get these performance and other fixes into master as well. Follow up to #94188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100333
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-05-02 19:12:29 +00:00
06bf5d4de7 enable headdims > 64 for flash attention on sm90 (#99776)
Follow-up to #99105, which disabled FlashAttention when using autograd and mem-efficient attention for the following cases:

* head_dim > 64
* sm86 or newer

We have tested enabling FlashAttention on sm90 and it works, so this PR will enable it back for sm90 and add in a test
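
A rough sketch (not from the PR) of exercising this path, assuming an sm90 GPU such as H100 and the `sdp_kernel` context manager available at the time:

```
import torch
import torch.nn.functional as F

# batch=2, heads=8, seq=1024, head_dim=128 (> 64), half precision on CUDA.
q, k, v = (torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.float16,
                       requires_grad=True) for _ in range(3))

# Force the FlashAttention backend; with this change it can also be used
# with autograd for head dims above 64 on sm90.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)
out.sum().backward()
```
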
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99776
Approved by: https://github.com/malfet, https://github.com/drisspg
2023-05-02 19:11:48 +00:00
279f3cd0a6 [pt2] add SymInt support for dsplit, hsplit, vsplit (#100352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100352
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-05-02 18:51:03 +00:00
794e3971ab Add size check before calling stack_.at(dict_pos) in unpickler.cpp (#94300)
Hi!

I've been fuzzing different pytorch modules, and found a crash inside one of them.

Specifically, I'm talking about a module for unpickling and a function called `Unpickler::readInstruction()`. Running this function with provided crash file results in a crash, which occurs while calling `auto dict = stack_.at(dict_pos).toGenericDict();` [unpickler.cpp:561](0e94fbc0c8/torch/csrc/jit/serialization/unpickler.cpp (L561)). The crash occurs, because the index `dict_pos` is out of bounds (which itself happens because the stack size is 0).

Besides this pull-request, there is another one related to unpickler hardening: https://github.com/pytorch/pytorch/pull/84343

All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)

### How to reproduce

1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy crash file to the current directory:

    - [crash-042dff5e121580425d9d34d0f293918f3c9fbf1e.zip](https://github.com/pytorch/pytorch/files/10674361/crash-042dff5e121580425d9d34d0f293918f3c9fbf1e.zip)

4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``

5. And execute the binary: `/message_deserialize_sydr /homedir/crash-042dff5e121580425d9d34d0f293918f3c9fbf1e`

After execution completes you will see this error message:

```txt
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551613) >= this->size() (which is 0)
```

And this stacktrace:

```asan
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551613) >= this->size() (which is 0)
==39== ERROR: libFuzzer: deadly signal
    #0 0x5d0df1 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3
    #1 0x545727 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5
    #2 0x52b933 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3
    #3 0x7f9118e0341f  (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)
    #4 0x7f9118c2300a in raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300a)
    #5 0x7f9118c02858 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x22858)
    #6 0x7f9119040910  (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)
    #7 0x7f911904c38b  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)
    #8 0x7f911904c3f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)
    #9 0x7f911904c6a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)
    #10 0x7f91190433aa  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13aa)
    #11 0x63acdf in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_range_check(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1073:4
    #12 0xce8f93e in std::vector<c10::IValue, std::allocator<c10::IValue> >::at(unsigned long) /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1094:2
    #13 0xce8f93e in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:546:26
    #14 0xce8d527 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:235:27
    #15 0xce8d1c2 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:192:3
    #16 0xcdf0792 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:127:20
    #17 0xcdf104d in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:137:10
    #18 0xe0532db in torch::distributed::rpc::ScriptRemoteCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_remote_call.cpp:74:16
    #19 0xe0ffa10 in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/utils.cpp:108:14
    #20 0x602a41 in LLVMFuzzerTestOneInput /message_deserialize_fuzz.cc:192:27
    #21 0x52ce61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #22 0x516d7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #23 0x51cacb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #24 0x546062 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #25 0x7f9118c04082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #26 0x51169d in _start (/message_deserialize_fuzz+0x51169d)

NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94300
Approved by: https://github.com/malfet, https://github.com/apach301
2023-05-02 18:50:31 +00:00
ab65bac3ce Use yaml.SafeLoader instead of legacy yaml.Loader (#100443)
See 957ae4d495/lib/yaml/loader.py (L51)
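
For context, a tiny sketch of the difference (standard PyYAML behavior, not code from the PR): `yaml.SafeLoader` parses plain data just like the legacy loader but refuses arbitrary Python object construction.

```
import yaml

text = "kernel:\n  name: add\n  num_inputs: 2\n"

# Legacy, permissive loader (what the code used before).
data_legacy = yaml.load(text, Loader=yaml.Loader)

# Safe loader: same result for plain mappings/lists/scalars, but it rejects
# tags like !!python/object that can execute arbitrary constructors.
data_safe = yaml.load(text, Loader=yaml.SafeLoader)
assert data_legacy == data_safe
```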

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100443
Approved by: https://github.com/Skylion007
2023-05-02 18:32:36 +00:00
02a0fb8df4 Add error_inputs_sparse method to OpInfo (#100389)
Per https://github.com/pytorch/pytorch/pull/98288#discussion_r1170553576, this PR introduces `OpInfo.error_inputs_sparse_func` attribute for sparse inputs in parallel to the `OpInfo.error_inputs_func` attribute which is used for strided inputs.

These attributes are kept separate as the existing testing framework that calls `error_inputs_func` may apply operations to inputs that are unsupported for sparse tensors (e.g. as in test/functorch/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100389
Approved by: https://github.com/cpuhrsch, https://github.com/pmeier
2023-05-02 18:30:12 +00:00
d425da8bf3 Replace master with main in links and docs/conf.py (#100176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100176
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-02 18:20:32 +00:00
0aac244680 Support expandable_segments:True in fbcode for caching allocator
Now that expandable_segments has been merged from OSS, we can enable it in the internal build. It still defaults to off, so this should not cause any behavior changes in the allocator unless the flag is explicitly set.

Differential Revision: D45249535

Pull request resolved: https://github.com/pytorch/pytorch/pull/100184
2023-05-02 11:12:39 -07:00
99ded8bbce Mount /etc/hosts into container (#100475)
Follow-up to https://github.com/pytorch/pytorch/pull/100436, which disabled download.pytorch.org access over IPv6 due to access problems.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 55c9443</samp>

This pull request improves the network configuration of the test-pytorch-binary GitHub action and workflow by mounting the host's `/etc/hosts` file into the container. This enables the container to resolve hostname aliases consistently with the host machine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100475
Approved by: https://github.com/huydhn
2023-05-02 17:34:07 +00:00
af92fc1cd7 Revert "[functorch] test for compiling functorch transforms (#100151)"
This reverts commit ea5f6d73124c799d402a5e749b923c21af84e4a5.

Reverted https://github.com/pytorch/pytorch/pull/100151 on behalf of https://github.com/atalman due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/100151#issuecomment-1531871900))
2023-05-02 17:33:29 +00:00
4c99f9cdf2 Initial version of Dynamo capture for HigherOrderOperator (#99988)
This PR introduces a `wrap(body_fn, *args)` higher order operator
The semantics of `wrap(body_fn, *args)` is to just run `body_fn(*args)`

Underneath Dynamo, this PR makes it so that we rewrite calls to
`wrap(body_fn, *args)` with `wrap(new_fn, *new_args)` where `new_fn` has
no free variables. This PR does not update cond/map to use the new
mechanism yet (we do not support nn.Modules yet; that will come in the future).
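
A purely illustrative sketch of that rewrite in plain Python (not the actual Dynamo machinery): a body function with a free variable is replaced by an equivalent function that receives the captured value as an explicit argument.

```
def wrap(body_fn, *args):
    # Semantics of the higher-order op: just run the body on its args.
    return body_fn(*args)

scale = 2.0  # free variable captured by the body function

def body_fn(x):
    return x * scale            # reads `scale` from the enclosing scope

# Conceptually what the rewrite produces: a function with no free variables,
# whose former captures are now explicit arguments.
def new_fn(x, scale):
    return x * scale

assert wrap(body_fn, 3.0) == wrap(new_fn, 3.0, scale)
```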

The design we take is:
- OutputGraph represents the graph being built by Dynamo that may be
compiled and executed.
- OutputGraph owns a root SubgraphTracer, where it builds the FX graph.
- OutputGraph may own multiple nested SubgraphTracers.
- When we need to trace the body function of a HigherOrderOperator, we
construct a new SubgraphTracer to build the graph of the body function.

Mechanically, when Dynamo sees a new `wrap` HigherOrderOperator with a
body function, it:
- Creates a new SubgraphTracer via OutputGraph.new_subtracer
- Executes the body function
This captures the body function into the graph on the new
SubgraphTracer while modifying the state of the OutputGraph. For
example, the OutputGraph may receive new GraphArgs, new guards, and new
side effects.

If capture of the body function fails, then Dynamo graph breaks on the
HigherOrderOperator.

Test Plan:
- added test/dynamo/test_higher_order_ops.py

Future:
- We're not actually able to tell Dynamo to completely graph break on the
HigherOrderOperator. Instead, when we do graph break, Dynamo begins
introspecting `HigherOrderOperator.__call__`. It should probably not do
this.
- Ideally we would error out on new SideEffects. I don't know how to do
this yet.
- We don't support dealing with nn.Modules yet (e.g. calling nn.Modules
or accessing attributes of tracked nn.Modules from a body_fn). There's
an open question on what should actually happen here
- Ideally we would rewrite map/cond to use the new mechanism but we need
to fix the previous bullet point before we can get there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99988
Approved by: https://github.com/voznesenskym, https://github.com/anijain2305
2023-05-02 17:11:02 +00:00
984a2397ba Refactor OutputGraph (#99987)
This PR splits OutputGraph into two classes:
- SubgraphTracer (handles FX-tracing)
- OutputGraph (handles Dynamo-specific output graph logic, like
tracking graph inputs, compiling the graph, and executing it).

The motivation behind this is in the next PR up in the stack.
TL;DR is: in order to do higher-order operators, we need nested
SubgraphTracer, one for each level of nesting of the higher-order
operators.

I'm happy to flatten the stack into a single PR, but this separate made
it easier for me to test. Lmk if you want the stack flattened.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99987
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
2023-05-02 17:11:02 +00:00
1114673c90 Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)"
This reverts commit 97afbcbc8007857a51c85e9c61fe6d80564ef1f9.

Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))
2023-05-02 16:46:04 +00:00
ec3c8abb54 [inductor] Remove redundant model copy when running with cpp_wrapper (#100275)
Summary: to reduce the peak memory consumption

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100275
Approved by: https://github.com/jansel
2023-05-02 16:43:18 +00:00
af62d098fe [export] Migrate internal verifier to subclass export/verifier
Differential Revision: D45416983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100388
2023-05-02 08:50:48 -07:00
41361538a9 [pt2] add SymInt support for tensordot and inner (#100356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100356
Approved by: https://github.com/ezyang
2023-05-02 14:42:50 +00:00
4582ceb2c4 [distributed][sharded_tensor] Move local_shards check from ShardedTensorBase to ShardedTensor (#100197)
Differential Revision: [D45369211](https://our.internmc.facebook.com/intern/diff/D45369211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100197
Approved by: https://github.com/fduwjj
2023-05-02 12:42:24 +00:00
8556cf208a Make backend_accuracy_fails suppress errors in same_two_models (#100324)
The basic idea is that if we're trying to match for an accuracy
error, we don't want to switch to a compile/runtime error, because
that's probably us breaking things in a different way.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100324
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:16 +00:00
054a254b06 Run minifier tests same process when possible (#100416)
The fast minifier tests now take only 10s to run.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100416
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:16 +00:00
f093ee1722 Prevent Triton from getting eagerly imported when importing torch._inductor (#100374)
This makes 'import torch._inductor.utils' go from 3.5s to 2.1s

See also https://github.com/openai/triton/issues/1599
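
Sketched independently of the actual Inductor code, the general pattern is to keep the heavy dependency out of module scope and defer it to the first call that needs it:

```
# Before (slow at import time): a module-level `import triton` loads the
# heavy dependency even when it is never used.

def has_working_triton() -> bool:
    """Hypothetical helper: the import cost is paid only on first call."""
    try:
        import triton  # deferred import, keeps `import <module>` cheap
    except ImportError:
        return False
    return triton is not None
```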

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100374
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:12 +00:00
74cc377162 Speed up minifier tests by applying some configs that speed things up. (#100387)
Previously, test_after_aot_cpu_compile_error took 101s.  After this
patch, it only takes 46s, a more than 2x speedup.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100387
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:09 +00:00
0a479d9b9c Simplify minifier testing by incorporating fault injection in prod code (#100357)
Previously, minifier testing injected faults by injecting extra code
into the repro scripts, and then ensuring this code got propagated to
all subsequent subprocess calls.  This was not only quite complicated,
but also induced a big slowdown on the minifier, because to inject the
faults, you had to import torch._inductor, which would cause the
compilation threads to immediately get initialized before you even got
to do anything else in the repro script.

This new approach fixes this problem by incorporating the fault
injection into "prod" code.  Essentially, for inductor fault injection
we introduce some new config flags that let you "configure" Inductor to
be buggy; for Dynamo fault injection we just permanently keep the buggy
testing backends registered.  This is MUCH simpler: we only have to
propagate the buggy config (which is something we're already doing),
and it saves the minifier scripts from having to immediately initialize
inductor on entry.

Also, I enable the test for Triton runtime errors, now that tl.assert_device is here.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100357
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:06 +00:00
17be65381d Do not use pickle to output config entries in repro scripts (#100354)
New output looks like:

```
torch._dynamo.config.dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = False
torch._inductor.config.fallback_random = True
torch._inductor.config.triton.cudagraphs = True
```

instead of an unreadable pickle.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100354
Approved by: https://github.com/voznesenskym
2023-05-02 11:44:01 +00:00
0093df78df Manually resolve download.pytorch.org to IPv4 (#100436)
This is an attempt to address https://github.com/pytorch/pytorch/issues/100400 by using only IPV4 when accessing the domain.

I kind of want to ignore AAAA records from DNS instead, but couldn't find an easy way to do so.  https://www.cloudflare.com/learning/dns/dns-records/dns-aaaa-record doc mentions this

```
Like A records, AAAA records enable client devices to learn the IP address for a domain name. The client device can then connect with and load the website.

AAAA records are only used when a domain has an IPv6 address in addition to an IPv4 address, and when the client device in question is configured to use IPv6
```
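
For illustration only (not how the CI change is implemented): from Python, restricting `getaddrinfo` to `AF_INET` returns just the A-record (IPv4) addresses.

```
import socket

# Ask only for IPv4 results, ignoring any AAAA/IPv6 records for the host.
infos = socket.getaddrinfo("download.pytorch.org", 443,
                           family=socket.AF_INET,
                           type=socket.SOCK_STREAM)
ipv4_addresses = sorted({info[4][0] for info in infos})
print(ipv4_addresses)
```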

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100436
Approved by: https://github.com/malfet
2023-05-02 08:35:48 +00:00
52a36a98d9 [dynamo] Graph break on a list referencing self (#100296)
Fixes https://github.com/pytorch/pytorch/issues/100150

I did not try hard to support this without a graph break. As this pattern is not common, the current PR graph breaks and avoids infinite recursion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100296
Approved by: https://github.com/jansel
2023-05-02 06:38:28 +00:00
090ec55f8d Only skip in torch inductor test
Differential Revision: D45464303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100435
2023-05-01 22:21:37 -07:00
d5169e7141 Use a stable ordering for saved values in functorch.default_partition (#100111)
Previously, due to the use of the Python set data structure, the ordering of saved values (and how they would appear in the graph) was unstable and changed across runs, making it hard to debug downstream applications. Here we use a dict (with insertion-ordering semantics) to deduplicate values in a way that preserves ordering
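
The underlying trick, shown standalone as a sketch: dicts preserve insertion order (Python 3.7+), so `dict.fromkeys` deduplicates while keeping a stable, run-to-run reproducible ordering, unlike `set`.

```
saved_values = ["x", "cos_x", "x", "mul_1", "cos_x"]

# set(): deduplicates, but iteration order is not tied to insertion order.
unstable = list(set(saved_values))

# dict.fromkeys(): same deduplication, but order follows first appearance.
stable = list(dict.fromkeys(saved_values))
print(stable)  # ['x', 'cos_x', 'mul_1']
```
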
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100111
Approved by: https://github.com/Skylion007
2023-05-02 05:14:31 +00:00
ea5f6d7312 [functorch] test for compiling functorch transforms (#100151)
Add a basic test to make sure functorch-torch.compile works as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100151
Approved by: https://github.com/zou3519
2023-05-02 04:56:07 +00:00
ff29722364 [inductor] Prevent reusing aliased buffers if aliases still have uses (#100332)
Fixes #100314
In dependencies, we should track not only immediately used buffer, but also aliased buffers that point to it, otherwise we can reuse and overwrite the buffer while there are still pending uses.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100332
Approved by: https://github.com/jansel
2023-05-02 04:05:16 +00:00
3fd46e1f0d [vision hash update] update the pinned vision hash (#100437)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100437
Approved by: https://github.com/pytorchbot
2023-05-02 02:41:06 +00:00
fdc853b14c Add --baseline option to benchmark runners (#100266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100266
Approved by: https://github.com/ngimel
2023-05-02 02:35:11 +00:00
c6c9258357 Delete @property support at module level, it is unused (#100353)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100353
Approved by: https://github.com/voznesenskym
2023-05-02 01:50:20 +00:00
e918fd18e7 Disable densenet121 as it is flaky (#100371)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100371
Approved by: https://github.com/voznesenskym
2023-05-02 01:49:11 +00:00
123be4b694 [dtensor] add debug tool to track op coverage (#100124)
This PR adds a debug tool to track the op coverage needed in DTensor.

Note that we specifically target ops after decomp table in inductor
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100124
Approved by: https://github.com/XilunWu
2023-05-02 01:45:55 +00:00
13da6585b6 [MPS] Skip all empty ops tests (#100368)
Fixes #100175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100368
Approved by: https://github.com/kulinseth
2023-05-02 00:43:58 +00:00
a50fb50c51 [MPS] Fix exception regex not compared (#100367)
Previously when using `self.assertRaisesRegex` to test raised exception and its regex, the regex wasn't actually compared because mps was not in the `NATIVE_DEVICES`. This PR fixes that by enabling exception regex comparisons for mps device.
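
A minimal illustration (hypothetical test, not from the PR) of the kind of check that now actually compares the message instead of only the exception type:

```
import unittest
import torch

class TestErrorMessage(unittest.TestCase):
    def test_matmul_shape_error(self):
        a = torch.ones(2, 3)
        b = torch.ones(2, 3)
        # With the fix, the regex below is also checked against the raised
        # message when the tensors live on the mps device.
        with self.assertRaisesRegex(RuntimeError, "cannot be multiplied"):
            torch.mm(a, b)

if __name__ == "__main__":
    unittest.main()
```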

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100367
Approved by: https://github.com/albanD
2023-05-02 00:43:58 +00:00
5daef13883 Fix merging label removal (#100433)
During the regular merge process, when the `GitHubPR` object is created it does not have the `merging` label, and adding the label later does not update the existing `GitHubPR` object either.

To fix the problem, call the REST API wrapper `gh_remove_label` directly. The worst case, if the label has already been removed at that point, is that an error is printed to stderr, which is not rendered on HUD anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100433
Approved by: https://github.com/PaliC, https://github.com/kit1980
2023-05-02 00:30:13 +00:00
f143c92739 [docs] Fix typo in get-started.rst (#100355)
This PR changes `""nvprims_nvfuser"` which should be a typo to `"nvprims_nvfuser"`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100355
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-05-02 00:29:53 +00:00
7b684310c8 [BE][GHF] Do not call GraphQL twice (#100434)
During the regular merge process, `GitHubPR` and `GitHubRepo` objects are first created in main() and then re-created in `merge()` instead of being passed by reference, which results in making the same GraphQL requests to the repo twice.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ee4e23e</samp>

> _Sing, O Muse, of the skillful coder who refactored_
> _The `merge` function, to accept a `GitHubPR` object,_
> _And thus reduced the calls to the divine API_
> _And the duplication of code, that source of errors._
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100434
Approved by: https://github.com/kit1980, https://github.com/PaliC, https://github.com/huydhn, https://github.com/ZainRizvi
2023-05-02 00:26:49 +00:00
dc9c79d3cf Allow each fully_shard unit to cast forward inputs for mixed precision config (#100290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100290
Approved by: https://github.com/rohan-varma
2023-05-02 00:03:48 +00:00
429155b3c8 Disable some checks to get the test to pass
Differential Revision: D45437730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100364
2023-05-01 16:28:12 -07:00
66fde107e2 [codemod] Replace hasattr with getattr in caffe2/torch/testing/_internal/common_device_type.py
Differential Revision: D44886473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100363
2023-05-01 16:28:07 -07:00
3fb0bf4d96 Automatic pulling ExtraFileMaps without explicit mapping.
Differential Revision: D45170126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99747
2023-05-01 16:27:56 -07:00
a1d041728b Back out "[aarch64][tools/build_defs/third_party/fbcode_defs.bzl] Fix dep handling in cross-builds"
Differential Revision: D45415678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100294
2023-05-01 16:27:51 -07:00
c3ccdc0125 Add store.wait() tests (#99577)
Fixes #53863

`pytest test/distributed/test_store.py -vsk test_wait`
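
Roughly what such a test exercises, sketched here with assumed store setup details:

```
from datetime import timedelta
import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29501, 1, True, timedelta(seconds=30))

store.set("ready", "1")
# Returns once every listed key exists in the store.
store.wait(["ready"], timedelta(seconds=5))

try:
    store.wait(["never_set"], timedelta(milliseconds=100))
except Exception as exc:  # a timeout error is expected here
    print("timed out as expected:", type(exc).__name__)
```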

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99577
Approved by: https://github.com/H-Huang
2023-05-01 22:59:52 +00:00
97afbcbc80 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

* If `stride == 1` we can use the whole warp to reduce the gradients
* If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop as `blockDim.x` is always 32

These changes improve the performance of the kernel when duplicates are present and do not affect performance when there are few duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we use warp shuffles to sum the gradient, so the order of addition is slightly different from a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
import torch
from time import time

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441
Approved by: https://github.com/ngimel
2023-05-01 22:41:00 +00:00
dc27b842ba Ensure optimizer state references are cleared (#100282)
Fixes https://github.com/pytorch/pytorch/issues/100264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100282
Approved by: https://github.com/janeyx99, https://github.com/yanboliang
2023-05-01 22:25:07 +00:00
e88e92e7a2 Update to reruns + timeouts in run_test.py (#100412)
https://github.com/pytorch/pytorch/pull/100200/files made unknown tests more likely to fail because they lack recorded test times but still have timeouts, so fix that
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100412
Approved by: https://github.com/huydhn
2023-05-01 21:51:53 +00:00
29b2745285 Add message about need_weights=False performance profile. (#100396)
Summary: Add message about need_weights=False/True performance profile.

Test Plan: sandcastle

Differential Revision: D45446417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100396
Approved by: https://github.com/albanD
2023-05-01 21:45:41 +00:00
940662c4dc Remove some dyn shape flags (#100317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100317
Approved by: https://github.com/ezyang, https://github.com/Neilblaze
2023-05-01 21:36:49 +00:00
aafc6ce8cc Produce constant variables in cases where a SymNode is created with a constant (#100144)
` AOT_DYNAMIC_SHAPES=1 TORCHDYNAMO_DYNAMIC_SHAPES=1  benchmarks/dynamo/huggingface.py --performance  --training --amp --backend eager --disable-cudagraphs --device cuda --only AllenaiLongformerBase --explain`

Looks promising!

Goes from:

Dynamo produced 173 graphs covering 2760 ops with 160 graph breaks (14 unique)

To:

Dynamo produced 6 graphs covering 2298 ops with 15 graph breaks (7 unique)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100144
Approved by: https://github.com/ezyang
2023-05-01 21:32:11 +00:00
0cf6e74fa9 add users to external contrib stat upload (#100403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100403
Approved by: https://github.com/kit1980
2023-05-01 20:35:51 +00:00
0bcb9dac4f [ONNX] Non-global diagnostic context (#100219)
Summary:

* `dynamo_export`, and everything within now access diagnostic context through a maintained local
  variable, instead of global.
* Refactored `diagnose_call` decorator to require local diagnostic context, instead of accessing global.
* Modified `test_fx_to_onnx_*.py` tests to only log '*.sarif' logs when `verbose=True`.
* Temporarily removed diagnostics for `OnnxFunction`, as they don't have access to diagnostic context
  anymore. These diagnostics will be the responsibility of `onnxscript`, and they will return once
  diagnostics system is integrated there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100219
Approved by: https://github.com/justinchuby
2023-05-01 19:58:53 +00:00
8e084cbfaa [ONNX] Remove 'diagnose_step' (#99944)
`ThreadFlowLocation`, a.k.a 'step', cannot fully be visualized by `SARIF vscode extension` today.
Discarding `diagnose_step` such that we don't end up creating diagnostics that record things there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99944
Approved by: https://github.com/justinchuby
2023-05-01 19:58:53 +00:00
c94b6a6712 [ONNX] Introduce 'diagnostics' to 'dynamo_export' api (#99668)
Summary
* Introduce `DiagnosticContext` to `torch.onnx.dynamo_export`.
* Remove `DiagnosticEngine` in preparations to update 'diagnostics' in `dynamo_export` to drop dependencies on global diagnostic context. No plans to update `torch.onnx.export` diagnostics.

Next steps
* Separate `torch.onnx.export` diagnostics and `torch.onnx.dynamo_export` diagnostics.
* Drop dependencies on global diagnostic context. https://github.com/pytorch/pytorch/pull/100219
* Replace 'print's with 'logger.log'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99668
Approved by: https://github.com/justinchuby, https://github.com/abock
2023-05-01 19:58:49 +00:00
85bd6bc010 Cache pretrained mobilenet_v2 and mobilenet_v3_large models in Docker (#100302)
Following the example I did for ONNX in https://github.com/pytorch/pytorch/pull/96793, this caches the pretrained `mobilenet_v2` and `mobilenet_v3_large` models used by CI jobs.  I think there might be an issue either with AWS or with the domain download.pytorch.org, as connections to the latter have been failing a lot in the past few days.

Related flaky jobs:
* https://github.com/pytorch/pytorch/actions/runs/4835873487/jobs/8618836446
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639
* https://github.com/pytorch/pytorch/actions/runs/4835783539/jobs/8618404639

```
Downloading: "https://download.pytorch.org/models/mobilenet_v2-b0353104.pth" to /var/lib/jenkins/.cache/torch/hub/checkpoints/mobilenet_v2-b0353104.pth
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.8/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1302, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1251, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1011, in _send_output
    self.send(msg)
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 951, in send
    self.connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 1418, in connect
    super().connect()
  File "/opt/conda/envs/py_3.8/lib/python3.8/http/client.py", line 922, in connect
    self.sock = self._create_connection(
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 808, in create_connection
    raise err
  File "/opt/conda/envs/py_3.8/lib/python3.8/socket.py", line 796, in create_connection
    sock.connect(sa)
OSError: [Errno 99] Cannot assign requested address
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100302
Approved by: https://github.com/ZainRizvi
2023-05-01 19:31:37 +00:00
fd82f11882 [lite interpreter][hack] Add batch_norm_update_stats if batchnorm and training are present (#100134)
Summary: not sure how the train bool passed to batch_norm gets set, but it's not the is_training module-level flag. We get weird behavior for teams trying to do on-device training because of this.

Test Plan: ci

Differential Revision: D45335791

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100134
Approved by: https://github.com/larryliu0820
2023-05-01 19:27:39 +00:00
d5bd23684d Pin scikit-image and tb-nightly CI requirements (#100399)
Docker builds started to fail recently, for example https://github.com/pytorch/pytorch/actions/runs/4853022561/jobs/8648730115.  I noticed some packages in requirements-ci haven't been pinned yet.

## Testing

`pip install --verbose -r .ci/docker/requirements-ci.txt` locally.  I can confirm that the issue can be reproduced with Python 3.11 locally, and is fixed by this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100399
Approved by: https://github.com/clee2000
2023-05-01 19:10:08 +00:00
2a6a159c0c Modify repeat_interleave docs to highlight potential overloading (#99650)
Fixes #99259, drawing attention to the fact that `input` is optional by putting a variation of the method signature at the top of the file and by modifying the input arguments.
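
For reference, the two call forms being highlighted (standard torch behavior, shown as a quick sketch):

```
import torch

# Usual form: repeat each element of `input` according to `repeats`.
print(torch.repeat_interleave(torch.tensor([1, 2, 3]), 2))
# tensor([1, 1, 2, 2, 3, 3])

# Overload without `input`: `repeats` alone gives, for each output slot,
# the index of the element it came from.
print(torch.repeat_interleave(torch.tensor([1, 2, 3])))
# tensor([0, 1, 1, 2, 2, 2])
```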

Note that I'm not certain how to get the additional signature at the same level of indentation as the first one, but I think this change does a good job of highlighting the change is optional.

Would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99650
Approved by: https://github.com/mikaylagawarecki
2023-05-01 17:53:03 +00:00
73dac48464 Add bertmaher to triton pin CODEOWNERS (#100390)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100390
Approved by: https://github.com/bertmaher, https://github.com/albanD
2023-05-01 16:54:15 +00:00
5f92909faf Use correct standard when compiling NVCC on Windows (#100031)
Test Plan: Sandcastle

Differential Revision: D45129001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100031
Approved by: https://github.com/ngimel
2023-05-01 16:28:23 +00:00
73645a8412 Add CUDA 12.1 CI workflows (#98832)
Adds CUDA 12.1 CI workflows, removes CUDA 11.7.
CC @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98832
Approved by: https://github.com/atalman
2023-05-01 16:25:53 +00:00
3edff6b6ec Improve detection of workspace/non-output allocations in cudagraphs (#99985)
When we run cudagraph trees we are not allowed to have permanent workspace allocations like in cublas because we might need to reclaim that memory for a previous cudagraph recording, and it is memory that is not accounted for in output weakrefs so it does not work with checkpointing. Previously, I would check that we didn't have any additional allocations through snapshotting. This was extremely slow so I had to turn it off.

This PR first does a quick check to see whether we are in an error state, and only if we are does it run the slow logic of creating a snapshot. It also turns on history recording so we get a stacktrace of where the bad allocation came from.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99985
Approved by: https://github.com/zdevito
2023-05-01 15:58:45 +00:00
5d93265cce Report timeout/infra_error instead of 0.0000 on infra error (#100372)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100372
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-05-01 14:56:01 +00:00
a014d1b18c [Easy][FSDP] Clarify _use_unsharded_grad_views comment (#100359)
This is an easy follow-up to the previous PR to (1) clarify that `view` is the original parameter's gradient and (2) that after `reshard()` the gradient is on CPU only if offloading parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100359
Approved by: https://github.com/rohan-varma
2023-05-01 12:58:43 +00:00
2d8deffc1e Refactor repro/minifier into CLI; add analyze (#100226)
This is a two part PR; I can split it if you really want me to.

The first part is a refactor of the after aot repro/minifier scripts to come with a command line interface. I maintain exact BC with the previous interface (so, e.g., you still get a repro.py and a run_minifier.py that do the same thing as before), but each of these scripts now also takes command line arguments which you can use to customize what actually happens. Check `run_repro` for full documentation on the arguments.

The second part of this is an implementation of `analyze` subcommand on the new CLI for any repro.

This facility is oriented towards accuracy debugging. It does several things:

1. It will run your model twice and check for nondeterminism in inductor/float64, *even* on intermediate inputs (our benchmarking nondeterminism test only checks for nondeterminism on the final output). This makes localizing which operator is nondeterministic easy.
2. It will run your compiled model side-by-side with eager and float64 variants, and then report when things diverge too far, based on the RMSE delta from float64 (see the sketch after the next paragraph).

Importantly, it does all this without requiring every intermediate to be held in memory (which will cause an OOM on large repros, such as the one I tested this on.)
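For intuition, here is a minimal sketch of the kind of float64-referenced RMSE comparison described above; this is not the actual implementation, and the op, names, and threshold are illustrative:

```python
import torch

def rmse(candidate: torch.Tensor, ref: torch.Tensor) -> float:
    # Root-mean-square error of a candidate against a float64 reference
    return ((candidate.double() - ref) ** 2).mean().sqrt().item()

x = torch.randn(1024)
ref64 = torch.nn.functional.gelu(x.double())        # float64 "ground truth"
eager = torch.nn.functional.gelu(x)                 # eager float32
compiled = torch.compile(torch.nn.functional.gelu)(x)

# Flag the compiled result if it drifts much further from float64 than eager does
if rmse(compiled, ref64) > 2 * rmse(eager, ref64) + 1e-7:
    print("possible accuracy divergence at this op")
```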

Some other minor improvements:

* MinifierTestBase now has an easy to comment out spot that you can use to retain the temporary directory; good for debugging
* We print "running minifier" and "running repro" in MinifierTestBase to make it easier to orient where logs are coming from
* same takes a `log_error` optional argument which you can use to reroute the error logs when things mismatch
* counters["inductor"]["intermediate_hooks"] tracks the number of intermediate hooks we've codegen'ed; good for populating the tqdm interface
* torch.fx.interpreter gets an official `boxed_run` interface which uses the boxed arguments calling convention and doesn't retain inputs unnecessarily long
* torch.utils._content_store gets compute_tensor_metadata/read_tensor_metadata helper functions for computing tensor information without serializing it

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100226
Approved by: https://github.com/bertmaher, https://github.com/bdhirsh, https://github.com/anijain2305
2023-05-01 11:12:38 +00:00
89c43f4108 Revert "Produce constant variables in cases where a SymNode is created with a constant (#100144)"
This reverts commit d7bdfd345402615eccbcc8bda24addb5cd3fa696.

Reverted https://github.com/pytorch/pytorch/pull/100144 on behalf of https://github.com/ezyang due to ci failure is real ([comment](https://github.com/pytorch/pytorch/pull/100144#issuecomment-1529587039))
2023-05-01 11:10:48 +00:00
83b803c2b5 [FSDP] Fix use_orig_params=True, CPU offload, no_sync() (#100180)
This should fix https://github.com/pytorch/pytorch/issues/98494. We follow a similar approach to past PRs for handling mismatched dtype or size resulting from running in `no_sync()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100180
Approved by: https://github.com/rohan-varma
2023-05-01 05:15:51 +00:00
e779a30d50 [BE] Fix SIM109 compare-with-tuple (#100337)
Use {replacement} instead of multiple equality comparisons
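For reference, this is the pattern the SIM109 rule targets (a generic Python sketch, not code from this PR):

```python
mode = "train"

# Before (SIM109): multiple equality comparisons against the same variable
if mode == "train" or mode == "eval":
    print("ok")

# After: a single membership test against a tuple
if mode in ("train", "eval"):
    print("ok")
```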

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100337
Approved by: https://github.com/Skylion007
2023-04-30 19:51:32 +00:00
01abbfbaae [BE] Fix all B022 useless-contextlib-suppress (#100335)
No arguments passed to contextlib.suppress. No exceptions will be suppressed and therefore this context manager is redundant
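For reference, the pattern flagged by B022 (a generic sketch, not code from this PR):

```python
import contextlib

# Flagged: no exception types passed, so nothing is ever suppressed
with contextlib.suppress():
    value = int("3")

# Intended usage: name the exceptions to swallow
with contextlib.suppress(ValueError):
    value = int("not-a-number")
```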

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100335
Approved by: https://github.com/Skylion007
2023-04-30 18:47:40 +00:00
d7bdfd3454 Produce constant variables in cases where a SymNode is created with a constant (#100144)
` AOT_DYNAMIC_SHAPES=1 TORCHDYNAMO_DYNAMIC_SHAPES=1  benchmarks/dynamo/huggingface.py --performance  --training --amp --backend eager --disable-cudagraphs --device cuda --only AllenaiLongformerBase --explain`

Looks promising!

Goes from:

Dynamo produced 173 graphs covering 2760 ops with 160 graph breaks (14 unique)

To:

Dynamo produced 6 graphs covering 2298 ops with 15 graph breaks (7 unique)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100144
Approved by: https://github.com/ezyang
2023-04-30 17:13:57 +00:00
cc3ed8ae53 [inductor] avoid zero division error for dropout (#100222)
Fixes #100025
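A hedged reproducer sketch, assuming (per the linked issue) that the failure is the `1/(1-p)` dropout scale computed during Inductor lowering when `p == 1.0`:

```python
import torch
import torch.nn.functional as F

@torch.compile  # Inductor backend
def f(x):
    # p=1.0 makes the 1/(1-p) scale degenerate; this previously hit a zero-division error
    return F.dropout(x, p=1.0, training=True)

print(f(torch.randn(8)))  # expected: all zeros
```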

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100222
Approved by: https://github.com/ngimel, https://github.com/jgong5
2023-04-30 16:17:43 +00:00
beb7f79517 Fix intermediate hooks on inplace buffers, enable it in testing (#100322)
Fixes https://github.com/pytorch/pytorch/issues/100312

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100322
Approved by: https://github.com/ngimel
2023-04-30 13:34:44 +00:00
155fa4e714 Use sympy.And instead of bitwise operator, for better promotion (#100328)
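For reference, a small generic SymPy illustration of the two spellings (not code from this change; the promotion benefit is as stated in the title):

```python
import sympy

x, y = sympy.symbols("x y")

# Bitwise spelling: relies on operator overloading of `&` on relational expressions
e1 = (x > 1) & (y > 1)

# Explicit spelling: sympy.And, which handles mixed operands more gracefully
e2 = sympy.And(x > 1, y > 1)

print(e1, e2)  # both print: (x > 1) & (y > 1)
```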
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100328
Approved by: https://github.com/voznesenskym
2023-04-30 13:01:36 +00:00
6c934a89a7 Skip invalid grads in outplace foreachs' backward (#100256)
Fixes #100248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100256
Approved by: https://github.com/soulitzer, https://github.com/albanD
2023-04-29 22:45:26 +00:00
76bcc87277 fix TIMM mobilevit_s complier issue for dynamic CPU path (#100230)
For the TIMM ```mobilevit``` dynamic path, there is a compiler issue when running:
```
python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --performance --float32 -dcpu -n2 --inductor --no-skip --dashboard --only mobilevit_s --inference --dynamic-shapes
```

```
/tmp/torchinductor_xiaobing/xy/cxyslqzcsxkco4ieph7t63kn5q74ka35ak75lwfon32nlalxmru5.cpp:29:130: error: invalid operands of types ‘long int’ and ‘double’ to binary ‘operator%’
   29 |                             auto tmp0 = in_ptr0[static_cast<long>((((((-1L) + ks1) / 8L)*(((-1L) + ks1) / 8L))*((((2L*((i2 / 1L) % (std::ceil((1.0/2.0) + ((1.0/2.0)*(((-1L) + ks1)
```

There is a modulo operation of ```long % double```; this PR converts the inputs to long before performing the operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100230
Approved by: https://github.com/jansel
2023-04-29 12:05:47 +00:00
e1021ec535 [decomp] Bad accuracy for elu_backward (#100284)
Accuracy is tested by the full model at https://github.com/pytorch/pytorch/issues/100061
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100284
Approved by: https://github.com/ngimel
2023-04-29 04:21:20 +00:00
3d55bce3bf Revert "Move win-vs2019 build and test to unstable (#100281)"
This reverts commit 999e17d80a8107e88c92a1019e3d7aff1d740e8c.

Reverted https://github.com/pytorch/pytorch/pull/100281 on behalf of https://github.com/malfet due to All runners has been updated ([comment](https://github.com/pytorch/pytorch/pull/100281#issuecomment-1528622556))
2023-04-29 03:47:12 +00:00
2442858f52 [MPS] Fix layer_norm_backward_mps key (#100295)
Followup after https://github.com/pytorch/pytorch/pull/98794
See report in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1527312211 and reproducer in https://github.com/pytorch/pytorch/issues/98602#issuecomment-1528214175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100295
Approved by: https://github.com/kit1980, https://github.com/izaitsevfb
2023-04-29 03:37:35 +00:00
03806eddbf [dynamo] Compile torchvision augmentations (#100292)
Resolves https://github.com/pytorch/pytorch/issues/100112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100292
Approved by: https://github.com/jansel
2023-04-29 02:59:41 +00:00
6647e61a59 Update docstring for dynamo.export tracing_mode (#100205)
As follow up to #99877.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100205
Approved by: https://github.com/ezyang
2023-04-29 02:12:08 +00:00
9075e3c2c6 Revert "Run test_fx_to_onnx_with_onnxruntime serially (#100298)"
This reverts commit 3a3f781f6cd90abbceb63a9cb59546d892ef899e.

Reverted https://github.com/pytorch/pytorch/pull/100298 on behalf of https://github.com/huydhn due to No need as https://github.com/pytorch/pytorch/pull/100297 has been landed ([comment](https://github.com/pytorch/pytorch/pull/100298#issuecomment-1528476786))
2023-04-29 02:07:39 +00:00
1267897d67 [ONNX] Skip flaky dynamic test in CI (#100297)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100297
Approved by: https://github.com/titaiwangms, https://github.com/kit1980, https://github.com/huydhn
2023-04-29 01:56:45 +00:00
3a3f781f6c Run test_fx_to_onnx_with_onnxruntime serially (#100298)
This test starts to fail out of nowhere in trunk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100298
Approved by: https://github.com/kit1980
2023-04-29 00:51:25 +00:00
7684044b71 Add size check before calling .back() in rpc/script_call.cpp (#94297)
Hi!

I've been fuzzing different pytorch modules, and found a crash inside one of them.

Specifically, I'm talking about a module that processes `script_call` rpc requests and a function `ScriptCall::fromIValues(std::vector<at::IValue>& ivalues)`.

Running this test case causes a crash that occurs when `ivalues.back()` is called [script_call.cpp:90](abc54f9314/torch/csrc/distributed/rpc/script_call.cpp (L90)). The crash occurs because the vector `ivalues` is empty.

All tests were performed on this pytorch version: [abc54f93145830b502400faa92bec86e05422fbd](abc54f9314)

The provided patch checks if there are enough elements in the ivalues vector.

### How to reproduce

1. To reproduce the crash, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy crash file to the current directory:

    - [crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4.zip](https://github.com/pytorch/pytorch/files/10674059/crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4.zip)

4. Run the container: ``docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash``

5. And execute the binary: `/message_deserialize_fuzz /homedir/crash-9f76d4e37a2391136a4ce07d47269db1e063e4b4`

After execution completes you will see this stacktrace:

```asan
AddressSanitizer:DEADLYSIGNAL
=================================================================
==57==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x0000008e7b19 bp 0x7ffd2fdded70 sp 0x7ffd2fddec40 T0)
==57==The signal is caused by a READ memory access.
==57==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x8e7b19 in c10::IValue::isString() const /pytorch_fuzz/aten/src/ATen/core/ivalue.h:639:27
    #1 0x8e7b19 in c10::IValue::toStringRef[abi:cxx11]() const /pytorch_fuzz/aten/src/ATen/core/ivalue_inl.h:2179:3
    #2 0xe04fb58 in torch::distributed::rpc::ScriptCall::fromIValues(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_call.cpp:90:53
    #3 0xe0511f0 in torch::distributed::rpc::ScriptCall::fromMessage(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/script_call.cpp:133:10
    #4 0xe0ff71e in torch::distributed::rpc::deserializeRequest(torch::distributed::rpc::Message const&) /pytorch_fuzz/torch/csrc/distributed/rpc/utils.cpp:102:14
    #5 0x602a41 in LLVMFuzzerTestOneInput /message_deserialize_fuzz.cc:192:27
    #6 0x52ce61 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:611:15
    #7 0x516d7c in fuzzer::RunOneTest(fuzzer::Fuzzer*, char const*, unsigned long) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:324:6
    #8 0x51cacb in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /llvm-project/compiler-rt/lib/fuzzer/FuzzerDriver.cpp:860:9
    #9 0x546062 in main /llvm-project/compiler-rt/lib/fuzzer/FuzzerMain.cpp:20:10
    #10 0x7f41e42a8082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #11 0x51169d in _start (/message_deserialize_fuzz+0x51169d)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV /pytorch_fuzz/aten/src/ATen/core/ivalue.h:639:27 in c10::IValue::isString() const
==57==ABORTING
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94297
Approved by: https://github.com/ezyang
2023-04-29 00:26:35 +00:00
35991df5d6 fix(docs): torch.autograd.graph.Node.register_hook can override grad_inputs, not grad_outputs (#100272)
Fixes #99165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100272
Approved by: https://github.com/soulitzer
2023-04-29 00:10:12 +00:00
2b79d6c425 Update testing aggregate data (#100070)
Updates the testing aggregates data to also show workflows, which is useful for actually seeing how long workflows take.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100070
Approved by: https://github.com/seemethere
2023-04-29 00:09:52 +00:00
6a02342131 Check inputs have same dtype in addmm_impl_cpu_ even if input has zero numel (#100274)
Fixes #99226

When an input has zero numel, addmm_impl_cpu_'s check that the inputs have the same dtype is bypassed. This PR adds the check before the early return.
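A minimal sketch of the failure mode (the shapes and dtypes are assumptions based on the description; the point is that a zero-numel operand used to skip the dtype check):

```python
import torch

bias = torch.zeros(2, 3)                       # float32
a = torch.zeros(2, 0)                          # float32, zero-numel operand
b = torch.zeros(0, 3, dtype=torch.float64)     # mismatched dtype

try:
    torch.addmm(bias, a, b)
except RuntimeError as e:
    # With the added check, the dtype mismatch is reported even though `b` has zero elements
    print(e)
```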
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100274
Approved by: https://github.com/ngimel
2023-04-29 00:07:54 +00:00
d7fa7fa8cf Introduce fast path in the CPU equal op
Differential Revision: D45282119
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100024
2023-04-28 16:00:17 -07:00
331ed5bee7 Add comment link to revert message (#100276)
* add comment url/link to revert message for easier tracking
* update gql mocks accordingly, not sure why databaseid in checkruns got updated as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100276
Approved by: https://github.com/huydhn
2023-04-28 22:41:29 +00:00
999e17d80a Move win-vs2019 build and test to unstable (#100281)
As described in https://github.com/pytorch/pytorch/issues/100273, the vs2019 test jobs are failing due to numpy incompatibility

They've been disabled for a quick response, and now moving them to unstable so that we can keep getting signal on those jobs

Fixes https://github.com/pytorch/pytorch/issues/100273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100281
Approved by: https://github.com/huydhn
2023-04-28 21:41:57 +00:00
ccce7a2de0 follow up PR for test_c10d_ucc.py in response to Xiang's review of #88110 (#99654)
* Adds an extra test_allgather_base in UccProcessGroupWithDispatchedCollectivesTests; the rest of the nccl and gloo tests there don't work on ucc
* Adds cpu tests for [op]_work_wait_gpu tests
* Adds a single tensor input test for allgather_basics; multi tensor input still doesn't seem to be supported by ucc
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99654
Approved by: https://github.com/kwen2501
2023-04-28 21:38:16 +00:00
8714fc7a2b [ONNX] Set tracing_mode through options.dynamic_shapes and enable dynamic tests in test_fx_to_onnx_runtime.py (#100212)
After #99876 and #99877, the dynamic tests are unblocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100212
Approved by: https://github.com/BowenBao
2023-04-28 21:30:52 +00:00
0a5c930499 Re-enable CUDA 12.1 builds for Windows (#100268)
Related: https://github.com/pytorch/pytorch/pull/98492
This PR enables Windows builds after the needed AMIs are ready.

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100268
Approved by: https://github.com/atalman, https://github.com/malfet
2023-04-28 21:11:27 +00:00
5b98910139 [inductor] Stop using x + tl.zeros(...) in generated triton (#100163)
For reductions, this changes the accumulator
```python
_tmp2 = tl.zeros([XBLOCK, RBLOCK], tl.int8) + -128
```
to
```python
_tmp2 = tl.full([XBLOCK, RBLOCK], -128, tl.int32)
```
which is equivalent since addition does type promotion from `int8` to `int32`

For constant indexing, this changes
```python
tl.store(in_out_ptr0 + (0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp4, None)
```
to
```python
tl.store(in_out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```

For variable indexing, this changes
```python
tl.store(out_ptr0 + (0 + tl.zeros([XBLOCK], tl.int32)), tmp1, None)
```
to
```python
tl.store(out_ptr0 + (tl.broadcast_to(x0, [XBLOCK])), tmp1, None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100163
Approved by: https://github.com/ngimel
2023-04-28 21:01:24 +00:00
270a33165b [inductor] Move reduction_type special cases out of make_reduction (#99660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99660
Approved by: https://github.com/ngimel
2023-04-28 21:01:24 +00:00
6ab9453ea9 File level rerun changes (#100200)
Fixes #ISSUE_NUMBER
* change hook so that the test still gets saved in --sc when it fails in test setup (this previously caused an off-by-one error due to setup being called before the logreport hook)
* allow reruns for all tests now that --sc is used
* increase number of reruns now that --sc is used
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100200
Approved by: https://github.com/huydhn
2023-04-28 20:57:49 +00:00
eqy
43dea76305 [CUDA] Switch to at::empty_like in adaptive_avg_pool3d_backward_cuda (#100202)
Same as #100138, `gradInput` is already zero'd out. Also clean up includes after #100138.

CC @ngimel @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100202
Approved by: https://github.com/ngimel
2023-04-28 20:47:24 +00:00
380ccfd442 Revert "Added round_with_scale_factor arg to ATen (#97868)"
This reverts commit aa99c5b4eda345f792687c490e72c8575110977a.

Reverted https://github.com/pytorch/pytorch/pull/97868 on behalf of https://github.com/osalpekar due to Caused breakages in the glow compiler - see [D45374622](https://www.internalfb.com/diff/D45374622) for more details
2023-04-28 20:47:00 +00:00
5022143f88 Bump cuDNN frontend submodule to 0.9 (#99674)
Testing via CI for now

CC @ngimel @ptrblck @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99674
Approved by: https://github.com/ngimel
2023-04-28 20:46:54 +00:00
eqy
3f656ad7bb [CUDA] Do accumulation for Adaptive Average Pooling in opmath_t (#99378)
Fix for an issue surfaced from the discuss forum: https://discuss.pytorch.org/t/adaptiveavgpool2d-causes-some-data-to-contain-inf/177420
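A hedged illustration of the overflow pattern (values and shapes are arbitrary; requires a CUDA device):

```python
import torch

# Many large fp16 values: accumulating them in fp16 can overflow to inf,
# while accumulating in float (opmath_t) keeps the mean finite.
x = torch.full((1, 1, 128, 128), 60000.0, dtype=torch.half, device="cuda")
y = torch.nn.functional.adaptive_avg_pool2d(x, (1, 1))
print(y)  # expected: ~60000, not inf
```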

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99378
Approved by: https://github.com/ngimel
2023-04-28 20:43:12 +00:00
b66d7007d8 Add aten.smooth_l1_loss_backward to core_aten_decompositions (#100267)
Summary: https://github.com/pytorch/pytorch/pull/100242 didn't cover all
test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100267
Approved by: https://github.com/jansel
2023-04-28 19:32:17 +00:00
9e1f46d55b Use [[maybe_unused]] in VariableType_[0-4].cpp (#100250)
This is kind of trivial, as per title.

Removing `(void)_any_requires_grad` and giving `[[maybe_unused]]` attribute to that variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100250
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2023-04-28 19:00:19 +00:00
c4bed869d1 [PG Wrapper] Enhance error msg (#100213)
Previously, the mismatch report would not give the full details of the
collective running on the mismatched rank; it would look something like:

```
Detected mismatch between collectives on ranks. Rank 26 is running collective: CollectiveFingerPrint(SequenceNumber=683057617, OpType=BROADCAST, TensorShape=[1], TensorDtypes=Long, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=513876813OpType=BROADCAST).
```

i.e. Rank 1 is missing more details such as the shape, type etc.

This was due to the `num_tensors` field not being populated, which `operator<<`
checks to determine whether to print additional information such as the tensor
shape.

Adding this field gives a better error:

```
Detected mismatch between collectives on ranks. Rank 0 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=ALLREDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))), but Rank 1 is running collective: CollectiveFingerPrint(SequenceNumber=1564312518, OpType=REDUCE, TensorShape=[20, 10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cpu, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt))).
```

Differential Revision: [D45372325](https://our.internmc.facebook.com/intern/diff/D45372325/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100213
Approved by: https://github.com/H-Huang
2023-04-28 18:49:18 +00:00
e0a2b49f0b [SPMD] Introduce prerequisites to graph_optimization_pass (#99970)
Some optimization passes require prerequisite passes. It is hard to debug why an optimization pass fails when its prerequisite conditions are not met. Adding this check makes it easier to discover the error.

Differential Revision: [D45255377](https://our.internmc.facebook.com/intern/diff/D45255377/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99970
Approved by: https://github.com/lessw2020
2023-04-28 18:38:01 +00:00
61dffa61c3 [fix] masked_scatter_: non-contiguous self (#100232)
Fixes https://github.com/pytorch/pytorch/issues/99638
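A minimal sketch of the non-contiguous case (the shapes are assumptions based on the linked issue):

```python
import torch

dst = torch.zeros(4, 4).t()                  # non-contiguous view of the destination
mask = torch.ones(4, 4, dtype=torch.bool)
src = torch.arange(16, dtype=torch.float32)

dst.masked_scatter_(mask, src)               # previously gave wrong results when `self` was non-contiguous
print(dst)
```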

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100232
Approved by: https://github.com/ngimel
2023-04-28 18:12:23 +00:00
9cd48b0575 Add warning information for dtypetensor. (#99521)
Fixes #ISSUE_NUMBER

Without affecting the existing cpu/cuda logic, a separate interface is provided for custom backends, and users can choose whether to use the interface function, which provides 10 tensor types with custom-backend variations.

Therefore, users can use torch.set_default_tensor_type to set the default device tensor type, or use torch.xxx.dtypetensor to create a tensor. For example, torch.set_default_tensor_type(torch.foo.DoubleTensor) or torch.foo.DoubleTensor([]).

@albanD, please review my changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99521
Approved by: https://github.com/albanD
2023-04-28 18:01:45 +00:00
56e235ad8c Pin functorch docs requirements (#100257)
The job https://github.com/pytorch/pytorch/actions/runs/4830815291/jobs/8607848573 starts to fail with the new IPython https://pypi.org/project/ipython/#history

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100257
Approved by: https://github.com/clee2000
2023-04-28 17:58:58 +00:00
628a8df1c9 [c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)
This is a mirror PR of D45339293

Summary:
These tests cause the following errors internally, for an unknown reason:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-04-28 17:38:12 +00:00
efed5e1c47 Fix triton auto update pin workflow (#100211)
* allow pytorchbot to approve pin changes
* add ciflow/inductor label when pin changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100211
Approved by: https://github.com/huydhn
2023-04-28 17:06:31 +00:00
1b84be551a Improved CustomOp API with schema inference (#100127)
This PR changes the CustomOp API. There are now two ways to create a
CustomOp object.

Method 1: with no schema string. We will infer what the schema string is
from your type annotations

```py
@custom_op("customlib::foo")
def foo(x: Tensor) -> Tensor:
    ...
```

Method 2: with a schema string, if the inference doesn't work well.

```py
@custom_op("customlib::foo", "(Tensor x) -> Tensor")
def foo(x):
    ...
```

Some details:
- We support most combinations of {Tensor, Number, int, float, bool} and
{Optional[typ], Tuple[typ, ...]} as inputs. The combinations we support are mostly
from me reading native_functions.yaml.
- We support only Tensor or Tuple of Tensor of fixed size returns.
- A lot of this PR is input validation for both of the above two
methods. For example, when a user provides a manual schema string, then
their function must not have any type annotations and the number of args
and arg names must match the schema.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100127
Approved by: https://github.com/ezyang
2023-04-28 16:53:07 +00:00
7ebb60c9f4 [CustomOp] Fix lifetime semantics (#100114)
This PR makes a CustomOp live forever. The motivation for it living
forever is that:
1. It doesn't matter to a user if it lives forever or not
2. it is a higher-level abstraction over OpOverload, and OpOverload
assumes that OpOverload lives forever.

The only place where it matters that CustomOp lives forever is testing:
I don't want to generate random names for my CustomOp objects. To
resolve the testing problem, This PR adds a CustomOp._destroy() that
clears all the C++ state, including the OpOverloadPacket, that is
associated with the CustomOp object.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100114
Approved by: https://github.com/ezyang
2023-04-28 16:53:07 +00:00
d176e3ff69 [quant][pt2] Add test for prepare_qat Conv + BN numerics (#99846)
Summary: This adds the test to compare the numerics of PT2 vs
FX after the Conv + BN fusion in prepare_qat.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_prepare_qat_conv_bn_numerics

Reviewers: kimishpatel, jerryzh168

Differential Revision: [D45360706](https://our.internmc.facebook.com/intern/diff/D45360706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99846
Approved by: https://github.com/jerryzh168
2023-04-28 16:43:10 +00:00
23de2e0620 [Dynamo] Fix staticmethods for FSDP (#100117)
This PR fixes capturing static methods for FSDP-managed modules. Previously, if a static method was invoked using `self.<staticmethod>`, then Dynamo would pass `self` twice to the method, causing a graph break due to the method being "unsupported". This PR achieves this by checking for `staticmethod` and using `UserFunctionVariable` instead of `UserMethodVariable`, which handles the correct calling convention.
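A minimal sketch of the pattern that used to break (the module and method names are made up for illustration):

```python
import torch

class Bucketer(torch.nn.Module):
    @staticmethod
    def _bucket(x):
        return torch.clamp(x, 0, 1)

    def forward(self, x):
        # Calling a staticmethod via `self` used to make Dynamo pass `self` twice,
        # causing a graph break on the "unsupported" method.
        return self._bucket(x)

compiled = torch.compile(Bucketer())
print(compiled(torch.randn(4)))
```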

This fixes FSDP + PT2 on HuggingFace's `T5ForConditionalGeneration`, which otherwise reports an error like the following based on the most recent trunk:
```
Output 0 of AsStridedBackward0 is a view of a view which was created in no_grad mode and is being modified inplace with grad mode enabled.
```
This is in reference to the `scores` tensor in `scores += position_bias_masked` ([code](a0ae2310ec/src/transformers/models/t5/modeling_t5.py (L559))).

I am not clear if this PR's fix is actually masking a different problem though. I wonder if there are edge cases with respect to Dynamo resuming execution and input mutations. Possibly, this PR only side steps the problem because there is no more recompilation at the static method `_relative_position_bucket()` ([code](a0ae2310ec/src/transformers/models/t5/modeling_t5.py (L443))).

In `UserDefinedObjectVariable.var_getattr()`, there is an existing branch:
e5291e633f/torch/_dynamo/variables/user_defined.py (L395-L398)
I am not clear on when this branch can be triggered since if `subobj` is a static method, it still takes the `FunctionTypes` branch:
e5291e633f/torch/_dynamo/variables/user_defined.py (L403-L404)
To preserve backward compatibility, the current version of this PR only modifies this `FunctionTypes` branch to differentiate between `staticmethod` and not `staticmethod`.

The PR that added this `FunctionTypes` branch is https://github.com/pytorch/pytorch/pull/92050/, and I checked that the added test `test_torch_distributions_functions()` only exercises the non-`staticmethod` case (since `Independent.log_prob` is not a `staticmethod`).

The last commit in `pytorch` that touched the `staticmethod` branch before https://github.com/pytorch/pytorch/pull/92050/ was the move from the `torchdynamo` repo into `pytorch`, so I cannot easily tell which test cases it corresponds to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100117
Approved by: https://github.com/anijain2305
2023-04-28 14:31:20 +00:00
e6f9bc500b CustomOp simple abstract implementation registration (#99439)
This PR:
- adds an abstract registration API for CustomOp (CustomOp.impl_abstract)
that is used for both FakeTensor and meta tensors
- deletes CustomOp.impl_meta

The user story behind this API is that it is the one-stop shop for
registering implementations for data-less Tensors, i.e. FakeTensor and
Meta tensor.

The abstract implementation provided by the user:
- gets registered as the FakeTensor implementation AND the meta formula
- can be written like a regular meta formula. If the user decides that
they need something more special (i.e. data-dependent output shape),
then they are able to query a current context object (FakeTensorImplCtx)
that has methods to construct new unbacked symints.

Caveats:
- we really need to make FakeTensor/FakeTensorMode public. Otherwise,
there isn't a way for the user to interactively test that their abstract
implementation is correct without running through large pieces of the
PT2 stack (make_fx or torch.compile).
- We do not memoize the symints produced by
ctx.create_unbacked_symint(). It is possible to do this in the
future, but it is difficult to do soundly and I am not convinced of
the utility outside of the nonzero() usecase mentioned in #95399

Public API:
- More docs will come when we actually expose this API to users by
putting it in a public namespace, unless you folks want it now.
- The APIs mentioned in `__all__` are the ones that are intended to be
public.

Test Plan:
- Updated existing custom_op_db operators
- Added new numpy_nonzero and numpy_nms operations that test operations
that have data-dependendent output shape.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99439
Approved by: https://github.com/ezyang
2023-04-28 13:45:39 +00:00
4135295a76 Excise yaml dependency in torchgen.model (#100203)
The problem:
- The new CustomOp API depends on torchgen.model
- torchgen.model imports `yaml`
- `yaml` is not a PyTorch runtime dependency

To unblock myself, because I'm not sure how long it'll take to
convince people yaml should be a PyTorch runtime dependency
(unless one of you wants to approve #100166), this PR removes the
yaml dependency from torchgen.model.

It does so by splitting torchgen.utils (the offender) into
torchgen.utils (no yaml) and torchgen.yaml (which uses yaml).

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100203
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-04-28 13:45:39 +00:00
55b661137f [inductor] Use decomposition for smooth_l1_loss_backward (#100242)
Summary: This forward-fixes a CI check failure introduced by
https://github.com/pytorch/pytorch/pull/99429. I also updated the
auto-labeler rule to trigger the inductor test for any decomposition
changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100242
Approved by: https://github.com/bertmaher
2023-04-28 13:23:20 +00:00
2504089329 Enable test_linalg_solve_triangular_large (#96182)
PR to see if test fails after removing skip line

Fixes #70111
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96182
Approved by: https://github.com/lezcano
2023-04-28 12:54:27 +00:00
90c44b134a Revert "[CI] Start to collect inference perf with cpp_wrapper ON (#100187)"
This reverts commit 3e87fc521ba0fc89b0980e018b4c625d5577d339.

Reverted https://github.com/pytorch/pytorch/pull/100187 on behalf of https://github.com/desertfire due to scheduled dashboard run failed
2023-04-28 11:55:29 +00:00
07c02b9e92 Add vmap support for smooth_l1_loss_backward (#99429)
Follow-up of #98357
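A hedged sketch of the kind of call this enables (per-sample gradients through smooth_l1_loss via `torch.func`; the shapes are arbitrary):

```python
import torch
from torch.func import grad, vmap

def loss(pred, target):
    return torch.nn.functional.smooth_l1_loss(pred, target)

preds = torch.randn(8, 3)
targets = torch.randn(8, 3)

# vmap over grad exercises smooth_l1_loss_backward under vmap
per_sample_grads = vmap(grad(loss))(preds, targets)
print(per_sample_grads.shape)  # torch.Size([8, 3])
```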
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99429
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-04-28 10:58:07 +00:00
d4bf76c2a4 Persist torch.assert in aten graph (#100101)
This PR introduces a new operator called aten._assert_async.msg, which allows passing a tensor value and an assertion message as inputs. As part of TorchDynamo, we're replacing the use of torch._assert with this new operator so that make_fx also knows how to handle assertions. This is a subset of https://github.com/pytorch/pytorch/pull/98878; refer there for the review history.
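A hedged sketch of what the new op looks like at the ATen level (the overload name comes from the description above; calling it directly like this in user code is an assumption for illustration):

```python
import torch

cond = torch.tensor(True)
# Asynchronously asserts that `cond` is truthy, attaching a message;
# torch._assert calls seen during tracing are rewritten to this op.
torch.ops.aten._assert_async.msg(cond, "expected condition to hold")
```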

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100101
Approved by: https://github.com/jansel
2023-04-28 07:31:43 +00:00
cef15ecc2e [ROCm] Also look for 'Cijk' (rocblas kernel) to verify gemm in test_kineto (#92889)
PR #88207 enabled ActivityType::CUDA for ROCm. TestProfiler.test_kineto needs an update in the test code to look for the correct pattern for gemm kernels for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92889
Approved by: https://github.com/malfet
2023-04-28 06:23:30 +00:00
751c54b546 Add experimental export() API (#100034)
PT2 Export API Prototype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100034
Approved by: https://github.com/angelayi
2023-04-28 06:12:59 +00:00
a23365885f [FSDP] Make set_state_type to SHARDED_STATE_DICT compatible with NO_SHARD sharding_strategy (#100208)
Currently, if we use the NO_SHARD strategy for fully_shard and set state_dict_type to SHARDED_STATE_DICT, a runtime error is raised ("``sharded_state_dict`` can only be used when parameters are flatten and sharded.").

This PR updates pre_state_dict_hook, post_state_dict_hook, pre_load_state_dict_hook, and post_load_state_dict_hook to set state_dict_type and state_dict_config to full state when using NO_SHARD, even if the state_dict_type and state_dict_config of the root module are set to sharded state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100208
Approved by: https://github.com/rohan-varma
2023-04-28 04:37:58 +00:00
cyy
7220201a2c fix missing-prototypes warnings in torch_cpu (Part 2) (#100147)
This PR fixes more missing-prototypes violations in the torch_cpu source following PR #100053.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100147
Approved by: https://github.com/zou3519
2023-04-28 04:15:57 +00:00
54c0edf6da Track exact origin_node on best effort basis (#100110)
Currently, we track 'origins' on IR nodes so that we have some idea about what FX IR nodes contributed to any given fused kernel. However, the origins are dumped into an undifferentiated set, so if you have, e.g., multiple outputs, you cannot easily tell which output corresponds to which FX node.

This PR introduces a more precise notion of tracking "origin_node", which says that the contents of this Buffer/Loop node correspond EXACTLY to the output of a particular FX node; e.g., if you serialized each intermediate when running the generated inductor code, you could compare them with the corresponding intermediates from the original FX graph.

Tracking origin_node in all cases requires quite a bit of effort, so this PR introduces the tracking on a strictly best effort basis. The logic in torch/_inductor/graph.py sets up the associations, but only when it is "obvious" which IR node should get the assignment, and there is work in torch/_inductor/ir.py for propagating this information around as necessary. Like origins, origin_node is not a true dataclass field (as this would break all existing positional arg call sites), instead, it is added post facto via `__post_init__`. At the moment, it is only valid for Buffer/Loop to have an origin_node, but we could imagine relaxing this in the future.

The payoff is in torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/triton.py where we currently just print the FX node name and the tensor (but a more useful integration will be coming later.)

I also introduce a debugging tool `debug_ir_traceback` which tracks tracebacks of where IRNodes were allocated, to help you understand why a node doesn't have an `origin_node`.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100110
Approved by: https://github.com/voznesenskym
2023-04-28 04:15:27 +00:00
89b1e67d0a [Tensor Parallel] Add a new Colwise Parallel style when Pairwise cannot directly used (#100137)
In some use cases, users cannot directly use `PairwiseParallelStyle` and might need to specify colwise and rowwise parallelism separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100137
Approved by: https://github.com/wz337
2023-04-28 03:27:51 +00:00
56a93ed56d Store constraints and example inputs in the graph module as metadata in export (#99961)
Metadata to store in the GraphModule:
 - input shape constraints
 - example inputs
 - other inline constraints

The saved constraints (in memory) will be used directly after export to convert constraints to runtime assertions, which is a separate pass after export.
The requirements on the saved constraints:
  1. Be able to locate where each constraint comes from
  2. Should not break the exported graph module serialization.

Examples of saved constraints
```
input_shape_constraints:
  {'t_id': 140266058179792, 'dim': 0, 'min': 6, 'max': oo}
  {'t_id': 140266058179792, 'dim': 0, 'min': 2, 'max': 10}

inline_constraints:
  i1: ValueRanges(lower=2, upper=5)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99961
Approved by: https://github.com/tugsbayasgalan
2023-04-28 03:14:33 +00:00
cyy
c8877e6080 enable some cuda warnings (#95568)
Currently some CUDA warnings are disabled due to old code-quality issues that are now fixed, so it is time to remove the suppression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95568
Approved by: https://github.com/albanD
2023-04-28 02:39:17 +00:00
cba07ffe0c [ONNX] Add xfail into subtests of op consistency and retire fixme (#100173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100173
Approved by: https://github.com/justinchuby
2023-04-28 02:25:59 +00:00
6168bed663 Remove unecessary <execinfo.h> include (#99800)
Fixes a compilation error on systems that have __cxa_demangle but no backtrace function, such as systems using GCC and the musl C library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99800
Approved by: https://github.com/kit1980
2023-04-28 02:20:24 +00:00
a8ad0dc333 [philox_rand] Add decomps (#100206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100206
Approved by: https://github.com/ngimel
2023-04-28 02:20:13 +00:00
9cda7b9e47 [hotfix] Do not import torch.ao.quantization._pt2e from dynamo (#100194)
Summary: Importing torch.ao.quantization._pt2e from dynamo led to
internal test failures related to memory profiling. For now,
let's express the path using a simple string instead.

Reviewers: jerryzh168, kimishpatel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100194
Approved by: https://github.com/jerryzh168
2023-04-28 01:32:23 +00:00
9609aeefbb Revert "[c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)"
This reverts commit ae40a6c7356190ef86b14b10a94a58ca41ca496b.

Reverted https://github.com/pytorch/pytorch/pull/100215 on behalf of https://github.com/huydhn due to Sorry for revert your change, but it breaks lint, please run lintrunner -a torch/testing/_internal/distributed/distributed_test.py to fix the issue then reland it
2023-04-28 01:21:06 +00:00
0692bdd95f Improved message to suppress errors in _dynamo/exc.py (#97345)
If user adds simply to their code:
```python
import torch

torch._dynamo.config.suppress_errors = True
```
they will get:
```
AttributeError: module 'torch' has no attribute '_dynamo'
```
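A working form, for comparison (explicitly importing the submodule avoids the AttributeError):

```python
import torch
import torch._dynamo  # make the submodule attribute available on `torch`

torch._dynamo.config.suppress_errors = True
```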

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97345
Approved by: https://github.com/zou3519, https://github.com/kit1980
2023-04-28 01:12:08 +00:00
b51f92ebda [Docs] Fix docstring format (#99396)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99396
Approved by: https://github.com/awgu
2023-04-28 01:10:07 +00:00
64efd88845 Add directly referenced header files for "ceil_div.h" (#99607)
std::enable_if_t is defined in <type_traits>. Directly including the headers that are used is good programming style.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99607
Approved by: https://github.com/albanD, https://github.com/kit1980
2023-04-28 01:05:05 +00:00
3e87fc521b [CI] Start to collect inference perf with cpp_wrapper ON (#100187)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100187
Approved by: https://github.com/huydhn
2023-04-28 01:03:09 +00:00
674018903d per-Tensor grad_fn for in-place foreach functions (#96405)
Generate a `grad_fn` for each (tuple of) `Tensor`(s) of the same index for `_foreach_foo_` and each `grad_fn` is `FooBackward`.

The current status of foreach functions' backward support for the record:
- out-place: Implemented, but no optimized implementations like their forward path
- in-place: not implemented. I think this check 7eaaefafb3/torchgen/api/autograd.py (L309-L311) is partly responsible but the difference of signature between out-place and in-place (see https://github.com/pytorch/pytorch/pull/96405#discussion_r1154690940) would prevent in-place from using out-place versions (the logic is around 7eaaefafb3/torchgen/api/autograd.py (L495-L500))

```c++
void _foreach_abs_(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_abs_(ks & c10::after_autograd_keyset, self_);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      AT_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      AT_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
}
```

Related:
- #95431
- #95765 for multiple `grad_fn`s logic

---

Examples: outputs of `_foreach_add_.List`, `_foreach_addcmul_.ScalarList`, and `_foreach_exp`

```c++
void _foreach_addcmul__ScalarList(c10::DispatchKeySet ks, at::TensorList self, at::TensorList tensor1, at::TensorList tensor2, at::ArrayRef<at::Scalar> scalars) {
  auto self_ = unpack(self, "self", 0);
  auto tensor1_ = unpack(tensor1, "tensor1", 1);
  auto tensor2_ = unpack(tensor2, "tensor2", 2);
  auto _any_requires_grad = compute_requires_grad( self, tensor1, tensor2 );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<AddcmulBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i], tensor1[i], tensor2[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<AddcmulBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<AddcmulBackward0>(new AddcmulBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i], tensor1[i], tensor2[i] ));
              return grad_fn;
          }
      }());
    }
    if (!grad_fns.empty()) {

        for (const auto& i : c10::irange(grad_fns.size())) {
            auto grad_fn = grad_fns[i];
            if (grad_fn != nullptr) {
                grad_fn->self_scalar_type = self[i].scalar_type();
                grad_fn->tensor1_scalar_type = tensor1[i].scalar_type();
                if (grad_fn->should_compute_output(1)) {
                  grad_fn->tensor2_ = SavedVariable(tensor2[i], false);
                }
                grad_fn->value = scalars[i];
                if (grad_fn->should_compute_output(2)) {
                  grad_fn->tensor1_ = SavedVariable(tensor1[i], false);
                }
                grad_fn->tensor2_scalar_type = tensor2[i].scalar_type();
            }
        }
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor1__storage_saved(tensor1_.size());
  for (const Tensor& tensor : tensor1_)
    tensor1__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor1__impl_saved(tensor1_.size());
  for (size_t i=0; i<tensor1_.size(); i++)
    if (tensor1_[i].defined()) tensor1__impl_saved[i] = tensor1_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> tensor2__storage_saved(tensor2_.size());
  for (const Tensor& tensor : tensor2_)
    tensor2__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> tensor2__impl_saved(tensor2_.size());
  for (size_t i=0; i<tensor2_.size(); i++)
    if (tensor2_[i].defined()) tensor2__impl_saved[i] = tensor2_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_addcmul_(ks & c10::after_autograd_keyset, self_, tensor1_, tensor2_, scalars);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__storage_saved[i].value().is_alias_of(tensor1_[i].storage()));
  }
  for (size_t i=0; i<tensor1_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor1__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor1_))
      TORCH_INTERNAL_ASSERT(tensor1__impl_saved[i] == tensor1_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__storage_saved[i].value().is_alias_of(tensor2_[i].storage()));
  }
  for (size_t i=0; i<tensor2_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (tensor2__impl_saved[i] && !at::impl::tensorlist_has_dispatch(tensor2_))
      TORCH_INTERNAL_ASSERT(tensor2__impl_saved[i] == tensor2_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
}

```

```c++
void _foreach_add__List(c10::DispatchKeySet ks, at::TensorList self, at::TensorList other, const at::Scalar & alpha) {
  auto self_ = unpack(self, "self", 0);
  auto other_ = unpack(other, "other", 1);
  auto _any_requires_grad = compute_requires_grad( self, other );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<AddBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i], other[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<AddBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<AddBackward0>(new AddBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i], other[i] ));
              return grad_fn;
          }
      }());
    }
    if (!grad_fns.empty()) {

        for (const auto& i : c10::irange(grad_fns.size())) {
            auto grad_fn = grad_fns[i];
            if (grad_fn != nullptr) {
                grad_fn->other_scalar_type = other[i].scalar_type();
                grad_fn->alpha = alpha;
                grad_fn->self_scalar_type = self[i].scalar_type();
            }
        }
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  std::vector<c10::optional<Storage>> other__storage_saved(other_.size());
  for (const Tensor& tensor : other_)
    other__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> other__impl_saved(other_.size());
  for (size_t i=0; i<other_.size(); i++)
    if (other_[i].defined()) other__impl_saved[i] = other_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_add_(ks & c10::after_autograd_keyset, self_, other_, alpha);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (other__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(other_))
      TORCH_INTERNAL_ASSERT(other__storage_saved[i].value().is_alias_of(other_[i].storage()));
  }
  for (size_t i=0; i<other_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (other__impl_saved[i] && !at::impl::tensorlist_has_dispatch(other_))
      TORCH_INTERNAL_ASSERT(other__impl_saved[i] == other_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
}

...

void _foreach_exp_(c10::DispatchKeySet ks, at::TensorList self) {
  auto self_ = unpack(self, "self", 0);
  auto _any_requires_grad = compute_requires_grad( self );

  (void)_any_requires_grad;
  std::vector<c10::optional<at::Tensor>> original_selfs(self.size());
  std::vector<std::shared_ptr<ExpBackward0>> grad_fns;
  if (_any_requires_grad) {
    for (const auto& i : c10::irange( self.size() )) {
      const auto ith_requires_grad = compute_requires_grad(self[i]);
      check_inplace(self[i], ith_requires_grad);
      grad_fns.push_back([&]() -> std::shared_ptr<ExpBackward0> {
          if (!ith_requires_grad) {
              return nullptr;
          } else {
              auto grad_fn = std::shared_ptr<ExpBackward0>(new ExpBackward0(), deleteNode);
              grad_fn->set_next_edges(collect_next_edges( self[i] ));
              return grad_fn;
          }
      }());
    }
  }
  #ifndef NDEBUG
  std::vector<c10::optional<Storage>> self__storage_saved(self_.size());
  for (const Tensor& tensor : self_)
    self__storage_saved.push_back(
      tensor.has_storage() ? c10::optional<Storage>(tensor.storage()) : c10::nullopt);
  std::vector<c10::intrusive_ptr<TensorImpl>> self__impl_saved(self_.size());
  for (size_t i=0; i<self_.size(); i++)
    if (self_[i].defined()) self__impl_saved[i] = self_[i].getIntrusivePtr();
  #endif
  {
    at::AutoDispatchBelowAutograd guard;
    at::redispatch::_foreach_exp_(ks & c10::after_autograd_keyset, self_);
  }
  #ifndef NDEBUG
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__storage_saved[i].has_value() && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__storage_saved[i].value().is_alias_of(self_[i].storage()));
  }
  for (size_t i=0; i<self_.size() && !at::impl::dispatch_mode_enabled(); i++) {
    if (self__impl_saved[i] && !at::impl::tensorlist_has_dispatch(self_))
      TORCH_INTERNAL_ASSERT(self__impl_saved[i] == self_[i].getIntrusivePtr());
  }
  #endif
  if (!grad_fns.empty()) {
      auto differentiable_outputs = flatten_tensor_args( self );
      TORCH_INTERNAL_ASSERT(differentiable_outputs.size() == grad_fns.size());
      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              rebase_history(differentiable_outputs[i], grad_fns[i]);
          }
      }
  }
  if (!grad_fns.empty()) {

      for (const auto& i : c10::irange(grad_fns.size())) {
          auto grad_fn = grad_fns[i];
          if (grad_fn != nullptr) {
              grad_fn->result_ = SavedVariable(self[i], true, self[i].is_view());
          }
      }
  }
}

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96405
Approved by: https://github.com/soulitzer
2023-04-28 00:55:04 +00:00
9a3e411a41 More rigorous mixed overloads on SymInt (#100008)
Previously the change to aten/src/ATen/native/LossNLL.cpp eventually resulted in a double / SymInt division, which ended up calling the int64_t / SymInt overload, truncating the double (bad!). By adding overloads for all the int/float types, we prevent this situation from happening in the future.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100008
Approved by: https://github.com/albanD
2023-04-28 00:54:44 +00:00
9dcabe293a Delete pytorch/caffe2/contrib/docker-ubuntu-14.04 (#100155)
It's not used anywhere AFAIK and only triggers security-issue scanners.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100155
Approved by: https://github.com/huydhn
2023-04-28 00:41:37 +00:00
d1fbd33c70 [FSDP] Remove unneeded disable of tf32 (#100179)
I recall needing to disable tf32, but I cannot repro the issue anymore. Nowhere else in our unit tests do we disable tf32, so we can try to get rid of this disabling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100179
Approved by: https://github.com/rohan-varma
2023-04-28 00:14:40 +00:00
1f4183e275 [FSDP] Subtest sharding strategy in test_fsdp_grad_acc.py (#100178)
Let us make the unit test faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100178
Approved by: https://github.com/rohan-varma
2023-04-28 00:14:40 +00:00
ae40a6c735 [c10d] Comment out ddp_hook_with_optimizer_parity tests (#100215)
This is a mirror PR of D45339293

Summary:
These tests cause the following errors internally for an unknown reason:
```
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adam'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_adamw'
AttributeError: type object 'TestDistBackendWithSpawn' has no attribute 'test_ddp_hook_with_optimizer_parity_sgd'
```
Commenting these tests out to unblock other PRs.

Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100215
Approved by: https://github.com/wz337, https://github.com/fduwjj
2023-04-28 00:05:46 +00:00
a145a3332c Add tensor to fake clone snapshot for immutable source of truth (#100128)
There's a longstanding, well known mutability bug in dynamo, https://github.com/pytorch/pytorch/issues/93610 (and more issues, but this is the one I had at hand).

Ops that do in place mutation of tensors will mutate their corresponding FakeTensors.

So, for example, if you call `t_` on a tensor, you will reverse its strides. This, in turn, means that the FakeTensor's strides are now also reversed, say, if you are trying to torch.compile:

```
class F(torch.nn.Module):
            def forward(self, x, y):
                x = x.t_()
                y = y.t_()
                return (x + y,)
```

However, we recently introduced accessing the fake_tensor memo/cache to get the symbolic shape values for sizes and strides during guard installation time.

This means that tensors captured with a given size and stride, say, for x above, size (3, 3), stride (3, 1), will get their memo updated to size (3, 3), stride (1, 3). Now, whenever you access this value for anything, it reflects the tensor's current state in the tracing, as opposed to the state it had when we started tracing.

This causes us to produce guards that are never valid, for the example above, that `x.stride()[0] == 3`.

The solution is to not allow mutation to affect the fake tensors we use as source of truth here. We can do this by forcing a clone of the fake tensor at builder time, and storing that as the source of truth for our dynamic sizes and strides during guard installation.
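
A quick standalone eager-mode illustration of the stride reversal described above (just showing what `t_` does to tensor metadata; not the dynamo/guard code itself):

```python
import torch

x = torch.empty(3, 3)
print(x.size(), x.stride())   # torch.Size([3, 3]) (3, 1)
x.t_()                        # in-place transpose mutates the tensor's metadata
print(x.size(), x.stride())   # torch.Size([3, 3]) (1, 3) -- strides are now reversed
```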

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100128
Approved by: https://github.com/ezyang
2023-04-27 23:58:15 +00:00
ca1cf434e7 Not flatten states when use_orig_param is True and sharding is NO_SHARD (#100189)
When use_orig_param is True and sharding is NO_SHARD, parameters and states are not flattened, so optimizer states should not be flattened either. The unit test will fail without the fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100189
Approved by: https://github.com/awgu
2023-04-27 23:47:01 +00:00
3241fbd627 Inductor cpp wrapper: support LinearBinary (#99957)
Support `mkldnn::_linear_pointwise.binary` in cpp wrapper for GPT-J.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99957
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2023-04-27 23:46:42 +00:00
0221198790 Added Typechecking to input tensor in RNN (#100100)
The input tensor of the RNN forward must have the same type as the weights.
When passing a tensor of type long, the error was:
    `RuntimeError: expected scalar type Long but found Float`
This is misleading because it suggests converting something to Long, while the correct solution is to convert the input to Float (the type of the weights).

The new error:
    `RuntimeError: input must have the type torch.float32, got type torch.int64`

is correct and more informative.

Fixes #99998
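
A minimal repro sketch of the behavior described above (the error strings follow the PR description):

```python
import torch

rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randint(0, 10, (2, 5, 4))   # int64 input, while the weights are float32

# rnn(x) now raises:
# RuntimeError: input must have the type torch.float32, got type torch.int64
out, h = rnn(x.float())               # converting the input to the weight dtype works
```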

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100100
Approved by: https://github.com/drisspg
2023-04-27 23:36:57 +00:00
b8d7a28e1a refactor test_sdpa into two test classes to account for failure modes (#100121)
### Summary
This PR creates a new TestSDPAFailureModes test class in order to better separate what each test is trying to do.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100121
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-04-27 21:42:40 +00:00
477ca1789c Avoid elementwise dispatch of gradient unscaling/validation ops in _foreach_non_finite_check_and_unscale_cpu_ (#100108)
Fixes [#82206](https://github.com/pytorch/pytorch/issues/82206)

When executing a `ShardedGradScaler` step in the context of `cpu_offload`, [the function](ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L151-L152)) `_foreach_non_finite_check_and_unscale_cpu_` is grindingly slow. This issue is due to the elementwise op dispatching/redispatching/execution that is engendered by the current approach to gradient tensor validation:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L159-L163)

The subsequent `isinf` and `isnan` checks with associated `any` checks result in unscalable elementwise op dispatches:
ecd2c71871/torch/distributed/fsdp/sharded_grad_scaler.py (L173-L181)

This inefficiency is of course hidden in the current FSDP tests given their (appropriately) trivial parameter dimensionality. In the perf analysis below, the example test configures only the final `Linear(4, 8)` module parameters to require grad, so there are 40 elements to iterate through. However, if one increases the dimensionality to a still-modest 320008 elements (changing the final module to `Linear(40000,8)`), the execution time/cpu cost of the test is dominated by the elementwise op dispatching/redispatching/execution of the `any` validation ops in this function.

To characterize the current behavior, I use a slightly modified version of an existing `ShardedGradScaler` test [^1]. The following modifications to the test are made to allow the analysis:

1. Run just `CUDAInitMode.CUDA_BEFORE` for clarity instead of additional scenarios
2. Increase the final module to `Linear(40000, 8)` (along with modifying the preceding module to make the dimensions work),
3. For the cProfile run (but not valgrind or perf) the test runs just a single [`_train_for_several_steps`](ecd2c71871/torch/testing/_internal/common_fsdp.py (L926-L934)) step per rank (instead of 2 steps)
4. I temporarily reduce `init_scale` further to ensure we don't hit any `infs`, short-circuiting our analysis

### Current behavior

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234656744-b7ca81b2-ce5b-4035-9918-0ad57d3689d3.png)

Note that:

1. Instead of dispatching to the relevant autograd op and then redispatching to the relevant CPU op implementation 8 times per test (2 train steps x 2 `any` calls per parameter per step x 2 orig parameters), we (I believe unnecessarily) run the dispatch flow elementwise, so 640016 times! (only 1 node in this trace, so 320008 elements/2 x 2 train steps x 2 calls per element per step).
2. Nearly 50% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (320008 execs), `isinf` (320008 execs) and `any` (640016 execs) calls.
3. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) are below to give one a sense of the relative dispatch and op execution cost in an elementwise context[^3].
![THPVariable_any_op_elementwise_dispatch_absolute_IR](https://user-images.githubusercontent.com/7462936/234656886-3c017ee3-8a04-4a7d-bdf8-6c690de42c92.png)
![structured_any_all_out_impl_absolute_IR](https://user-images.githubusercontent.com/7462936/234656915-0b203bb7-bd05-4ceb-a38b-67b0d4862aa7.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_8wa9uw39.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  filename:lineno(function)
1   	20.159   	20.159   	66.805   	66.805 	 torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
160004  18.427    	0.000   	18.427    	0.000 	 {built-in method torch.isinf}
160004  6.026    	0.000    	6.026    	0.000 	 {built-in method torch.isnan}
```
We see that a single step of the scaler runs for more than a minute. Though there is non-trivial cprofile overhead, we can infer from this that per-element op dispatches/executions are on the order of a 100ns.

On the order of 100 nanoseconds per dispatch is acceptable if we're using typical tensor access patterns, but if we're dispatching each element for each op, obviously everything is going to come to a grinding halt for many practical use cases.

(Given the cost of this function is currently O(n) in the number of gradient elements, feel free to set `TORCH_SHOW_DISPATCH_TRACE=1` if you want to make this function cry 🤣)

I've attached a flamegraph at the bottom of the PR[^2] that more intuitively demonstrates the manner and extent of resource consumption attributable to this function with just a modest number of gradient elements.

### After the loop refactor in this PR:

The most relevant call subgraph:
![callgrind_subgraph_elementwise_dispatch_fix](https://user-images.githubusercontent.com/7462936/234657001-0a448756-b4ce-468e-9f91-1d21597df057.png)

Note that:

1. Less than 0.4% of the relative (inclusive) instruction reads for the entire test in `callgrind` are executed by the `isnan` (4 execs), `isinf` (4 execs) and `any` (8 execs) calls (versus ~50% and 320008, 320008, 640016 respectively above)
2. The `any` pre-dispatch entry point IRs (`torch::autograd::THPVariable_any`) vs actual op implementation IRs (`at::native::structured_any_all_out::impl`) reflect far less overhead (of secondary importance to item number 1)
![THPVariable_any_op_elementwise_dispatch_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234659454-b1e262cf-d291-4d44-aff2-e27efe284e9c.png)
![structured_any_all_out_impl_absolute_IR_fix](https://user-images.githubusercontent.com/7462936/234657154-91fa7cb8-e39e-48c7-abf0-cc58f06c0ae1.png)

Using cprofile stats:

```bash
python -c "import pstats; stats=pstats.Stats('/tmp/fsdp_cprofile_pfap7nwk.stats'); stats.print_stats()"
...
ncalls  tottime  	percall  	cumtime  	percall  	filename:lineno(function)
1    	0.013    	0.013    	0.109    	0.109 		torch/distributed/fsdp/sharded_grad_scaler.py:151(_foreach_non_finite_check_and_unscale_cpu_)
2    	0.022    	0.011    	0.022    	0.011 		{built-in method torch.isinf}
2    	0.018    	0.009    	0.018    	0.009 		{built-in method torch.isnan}
```
We can see our function runtime has dropped from more than a minute to ~100ms.
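
For intuition, a minimal standalone sketch (not the actual FSDP code; the function names are made up) contrasting per-element dispatch with a single whole-tensor check:

```python
import torch

def has_non_finite_per_element(grad: torch.Tensor) -> bool:
    # One isinf/isnan (plus any) dispatch per element: O(n) op dispatches.
    return any(bool(torch.isinf(g).any() or torch.isnan(g).any()) for g in grad.view(-1))

def has_non_finite_tensorwise(grad: torch.Tensor) -> bool:
    # One isinf/isnan dispatch per gradient tensor, regardless of element count.
    return bool(torch.isinf(grad).any() or torch.isnan(grad).any())

grad = torch.randn(1_000)  # kept small; at 320008 elements the per-element version crawls
assert has_non_finite_per_element(grad) == has_non_finite_tensorwise(grad)
```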

### Assumptions associated with this loop refactor:

The key assumptions here are:

1. The grads are always on CPU in this function so any MTA-safe constraints ([`can_use_fast_route`](efc3887ea5/aten/src/ATen/native/cuda/AmpKernels.cu (L110-L111)) relating to the relevant CUDA kernel path selection, i.e. slower `TensorIterator` gpu kernel vs `multi_tensor_apply_kernel`) do not apply in this context

2. We've already filtered by dtype and device and can assume the presence of a single CPU device. Unless manually creating separate CPU devices with manually set non-default indexes (which I don't think FSDP supports and should be validated prior to this function), device equality should always be `True` for `cpu` type devices so we should just need to check that the current device is of `cpu` type. [^4].

![elementwise_dispatch](https://user-images.githubusercontent.com/7462936/234660413-8c96ef90-7a23-4307-b8ed-c1fbf932f1e9.svg)

[^1]: `TestShardedGradScalerParityWithDDP.test_fsdp_ddp_parity_with_grad_scaler_offload_true_none_mixed_precision_use_orig_params` test in `test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py`
[^2]: Note the native frame stacks for `torch::autograd::THPVariable_isinf`, `torch::autograd::THPVariable_isnan`, `torch::autograd::THPVariable_any` in particular.
[^3]: There's more `TensorIterator` etc. setup overhead further up the stack beyond `structured_any_all_out`, but roughly speaking
[^4]: Device equality is based on [type and index combination](efc3887ea5/c10/core/Device.h (L47-L51)), CPU device type is -1 by default (`None` on the python side) and is intended to [always be 0](cf21240f67/c10/core/Device.h (L29)) if set explicitly. Though technically, unless in debug mode, this constraint isn't [actually validated](bb4e9e9124/c10/core/Device.h (L171-L184)), so one can actually manually create separate `cpu` devices with invalid indices. I suspect it's safe to ignore that potential incorrect/unusual configuration in this context but let me know if you'd like to add another `cpu` device equality check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100108
Approved by: https://github.com/awgu
2023-04-27 21:33:27 +00:00
1dba53cbab [ONNX] Refactor test_op_consistency.py and test_fx_op_consistency.py (#100172)
## Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 9255aa3</samp>

This pull request refactors the ONNX operator testing code to use a common module `./test/onnx/onnx_test_common.py` that defines constants, types, classes, and functions for testing ONNX operators. This improves the code quality, readability, and maintainability.

## Walkthrough
<!--
copilot:walkthrough
-->
### <samp>🤖 Generated by Copilot at 9255aa3</samp>

*  Refactor the common code for testing ONNX operators from different files into `./test/onnx/onnx_test_common.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eL10-R24), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eR33), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eR367-R623))
* Remove the unused and duplicated imports, constants, types, and classes for testing ONNX operators from `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L28-R29), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L43-R42), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L28-R29), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L43-R44))
* Import the `unittest`, `opinfo_core`, and `onnx_test_common` modules and the `fixme`, `skip`, and `xfail` functions in `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` ( [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4R36), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L37-R37))
* Update the references to the constants, types, functions, and classes for testing ONNX operators in `./test/onnx/test_fx_op_consistency.py` and `./test/onnx/test_op_consistency.py` to use the definitions from `./test/onnx/onnx_test_common.py` ([link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L324-R80), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L389-R135), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L405-R151), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-db2f78a51511bb172cbfde1b2f68272b8b33049abe2571cded27bcd0f3ae5fa4L455-R204), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L333-R107), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L434-R183), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L448-R197), [link](https://github.com/pytorch/pytorch/pull/100172/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L494-R246))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100172
Approved by: https://github.com/justinchuby
2023-04-27 21:32:04 +00:00
61917a006d Make DimConstraints create actionable message (#100103)
This PR makes the summary of dimension constraints actionable. Before the PR, it would print:
```
torch.fx.experimental.symbolic_shapes: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
        L['c'].size()[1] == 3
        L['a'].size()[2] == 3
        L['a'].size()[1] == 3
        L['b'].size()[2] == 2
        L['b'].size()[1] == 2
        L['c'].size()[2] == 3

The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
constraints=[
        dynamic_dim(L['c'], 0) == dynamic_dim(L['a'], 0),
        2 <= dynamic_dim(L['b'], 0),
        2 <= dynamic_dim(L['a'], 0),
]
'''
```
Users need to initialize the L environment manually and copy the constraints over. After the PR, we have:
```
[2023-04-26 05:43:12,849] torch._dynamo.eval_frame: [WARNING] Summary of dimension constraints:
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
'''
def specializations(a, b, c):
    return (a.size()[2] == 3 and
    c.size()[1] == 3 and
    a.size()[1] == 3 and
    c.size()[2] == 3 and
    b.size()[2] == 2 and
    b.size()[1] == 2)

'''

The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
'''
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]
'''
```

Here, specify_constraints has the same input signature as the user's code. This allows users to copy-paste and run the code to generate the constraints before exporting, as shown below:
```
def specify_constraints(a, b, c):
    return [
        2 <= dynamic_dim(b, 0),
        dynamic_dim(c, 0) == dynamic_dim(a, 0),
        2 <= dynamic_dim(a, 0),
    ]
torch._dynamo.export(my_dyn_fn, x, y, z, constraints=specify_constraints(x, y, z))
```

Implementation-wise, this PR also
1. changes shape_env.produce_guards to produce_guards_and_constraints, and
2. adds contraints_export_fn hooks.
The purpose is to surface the DimConstraints to dynamo.export, where we can reliably get the original function's signature.

The alternative to the above is to get the function signature before creating SHAPE_ENV guard (https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/output_graph.py#L227) and pass it to DimConstraints, but I couldn't recover the signature before creating SHAPE_ENV because the frame's f_globals/locals don't contain the original function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100103
Approved by: https://github.com/guangy10, https://github.com/tugsbayasgalan
2023-04-27 21:24:18 +00:00
d5f15d3515 Check for debug mode (#92707)
It works by validating that debug builds actually trigger debug-level asserts.

Turns out, most of our debug jobs today don't actually build in debug mode (causing the test to fail). The PR also fixes that.

Contributes to https://github.com/pytorch/pytorch/issues/88842
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92707
Approved by: https://github.com/malfet, https://github.com/albanD
2023-04-27 20:57:18 +00:00
b02aa5e71d [Feature] storage resize_ support custom device. (#99882)
Fixes #99326

Support storage resize_ for custom device, by calling dispatched tensor operations.

@ezyang this PR is another case that was brought up in issue #99326; please take a moment to review this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99882
Approved by: https://github.com/ezyang
2023-04-27 20:18:35 +00:00
9834358e0f Get SchemaCheckMode to error on ops that return inputs directly. Expose as a dynamo backend, eager_debug (#99744)
Talked to @zou3519 and @ezyang about what the right UX is: tentatively, adding a new dynamo backend is cheap and simple, so it seems worth doing. And longer term, we agreed (?) that it's worth seeing if we can get custom-op sanity asserts to run more automatically, instead of needing a separate backend.

Side comment: that actually seems tough: the mode detects secret mutations by cloning every input to every op, running the op, and checking that the data matches between the real input and the cloned input. So I doubt we'll be able to make that behavior always-on? It would need some config at least.
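
A minimal usage sketch (the backend name is taken from the title; the function body here is arbitrary):

```python
import torch

def fn(x):
    return torch.relu(x) + 1

# Run the function through the schema-checking eager backend instead of a real compiler.
opt_fn = torch.compile(fn, backend="eager_debug")
opt_fn(torch.randn(8))
```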

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99744
Approved by: https://github.com/albanD, https://github.com/ezyang, https://github.com/zou3519
2023-04-27 20:12:42 +00:00
1f2d00e537 move SchemaCheckMode to torch/_subclasses (#99743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99743
Approved by: https://github.com/albanD
2023-04-27 20:12:41 +00:00
884c5c86f1 Pass torch.compile mode/options to all backends (#99645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99645
Approved by: https://github.com/anijain2305
2023-04-27 19:41:26 +00:00
7295ab6746 [ONNX] Add test_fx_op_consistency.py (#99465)
Add op consistency test for fx exporter. There will be another PR to work around the limitations of https://github.com/pytorch/pytorch/issues/99534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99465
Approved by: https://github.com/justinchuby
2023-04-27 19:39:32 +00:00
d06b93b0c7 Decompose arange.default to arange.start_step (#99739)
The aten op arange.default is not in the core aten IR, and should decompose into the arange.start_step op.
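
A quick eager-mode sanity check of the equivalence the decomposition relies on:

```python
import torch

assert torch.equal(torch.arange(5), torch.arange(0, 5, 1))
```
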
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99739
Approved by: https://github.com/SherlockNoMad
2023-04-27 19:06:36 +00:00
a67fa845bd [vmap] Fix searchsorted batch rule (#99698)
Hopefully there are no other missed cases.

Fixes #99603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99698
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-04-27 19:03:41 +00:00
991b1c0286 Do not use --extra-index-url in testing wheels (#100183)
Should prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/100104 from sneaking in undetected.

Same for `install_triton_wheel.sh` - always use packages from https://download.pytorch.org/whl/

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at deda821</samp>

> _`pip install` changed_
> _Only use PyTorch nightly_
> _Snowflake packages_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100183
Approved by: https://github.com/kit1980, https://github.com/pmeier
2023-04-27 18:48:02 +00:00
151d76cc23 [quant][pt2e] remove dropout from fx quant
Differential Revision: D45250152
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99935
2023-04-27 11:22:41 -07:00
089b085c32 Optimize periodic jobs (#100182)
Split the existing 4-hour schedule into two 8-hour ones.
Schedule the x86 macOS tests every 8 hours and exclude them from leak checks.
Schedule the iOS tests every 8 hours and exclude them from leak checks as well.

Remove the iOS Metal job, as it is already covered by the ARM64 MPS job as well as the x86 and arm64 vanilla jobs, and it never caught any regressions in the last 60 days, based on data from running the following query on RockSet:
```sql
SELECT started_at,
      DATE_DIFF(
            'MINUTE',
            PARSE_TIMESTAMP_ISO8601(started_at),
            PARSE_TIMESTAMP_ISO8601(completed_at)
        ) as duration,
    conclusion, name, html_url, torchci_classification
  FROM commons.workflow_job
  WHERE
  workflow_name = 'periodic' and
  name like 'ios-12% % build (default, 1, 1, macos-12)' and
  url like 'https://api.github.com/repos/pytorch/pytorch/%'
  and conclusion = 'failure'
  order by started_at desc, run_id;
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100182
Approved by: https://github.com/PaliC, https://github.com/huydhn
2023-04-27 18:07:28 +00:00
01de8ee845 [SPMD][Easy] Add time counter in graph_optimization_pass (#99969)
This gives an idea of how expensive the pass is.

Differential Revision: [D45255366](https://our.internmc.facebook.com/intern/diff/D45255366/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99969
Approved by: https://github.com/lessw2020
2023-04-27 17:56:07 +00:00
87db02ea38 [DDP] Perform input casting in pre forward (#100131)
This is so that replicate can also cast its inputs, which it currently does
not. The next diff will change the replicate pre-hook to support this.

Differential Revision: [D45335179](https://our.internmc.facebook.com/intern/diff/D45335179/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100131
Approved by: https://github.com/zhaojuanmao
2023-04-27 17:34:46 +00:00
ae0eb2342d [Experimental] Remove store barrier after PG init (#99937)
Store based barrier is not scalable.
Experimenting to see if removing it breaks any CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99937
Approved by: https://github.com/kumpera, https://github.com/H-Huang
2023-04-27 17:23:10 +00:00
7bece142a9 [export] Port over const prop pass (#100102)
Stacked on top of https://github.com/pytorch/pytorch/pull/100000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100102
Approved by: https://github.com/gmagogsfm
2023-04-27 17:06:47 +00:00
fad2f6edab [PTD][Checkpoint] Upstream fsspec storage read/write to PT (#98387)
Remove sync_files.
Remove single_file_per_rank and will add it back once we resolve the issue. https://github.com/pytorch/pytorch/issues/98386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98387
Approved by: https://github.com/fegin
2023-04-27 16:47:28 +00:00
b94a0ba5bb [SPMD] Add embedding dense backward prop rule for postional embedding (#100038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100038
Approved by: https://github.com/mrshenli
2023-04-27 16:31:51 +00:00
8fe91d16b0 Remove CUDA 11.6 note from complex docs (#100118)
Removes note in the complex docs pointing to the CUDA 11.6 wheels introduced in https://github.com/pytorch/pytorch/pull/80363.
Background: this warning was added via https://github.com/pytorch/pytorch/issues/79876 which pointed out a slow compilation time in 11.3. The 11.6 pip wheels were thus recommended but are not built anymore, as our current support is 11.7, 11.8 (and 12.1 experimental in nightlies).

The note is confusing users as it doesn't explain why 11.6 is needed.
Reference: https://discuss.pytorch.org/t/complex-numbers-cuda-11-6-documentation-warning/178588/1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100118
Approved by: https://github.com/msaroufim
2023-04-27 16:26:27 +00:00
02f059c2b7 Add private _export API (#99992)
Differential Revision: D45279206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99992
Approved by: https://github.com/angelayi, https://github.com/gmagogsfm
2023-04-27 16:24:14 +00:00
f5853342ea [dynamo][numpy] Handle return value being numpy ndarray (#99560)
On top of #95849 this PR is trying to handle the special case when dealing with numpy.

Consider the following example:

```
def f(x: torch.Tensor) -> np.ndarray:
	a = x.numpy()
	return a.T
```
In the previous PR this would error out because we translate `a.T` into `torch_np.ndarray.T`, which is also a `torch_np.ndarray`.

This PR handles this case, by conditionally converting a `torch_np.ndarray` to `np.ndarray` before returning, to match the original behavior.

The compiled version will be:

```
def f(x):
    ___tmp_0 = __compiled_fn_0(x)
    if isinstance(___tmp_0, torch_np.ndarray):
        return ___tmp_0.tensor.numpy()
    else:
        return ___tmp_0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99560
Approved by: https://github.com/jansel, https://github.com/yanboliang
2023-04-27 16:18:35 +00:00
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray (a wrapper of a tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
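
A hypothetical example of the kind of mixed torch/numpy function this work targets (illustrative only; which numpy calls trace cleanly depends on the follow-up PRs):

```python
import numpy as np
import torch

def f(x: torch.Tensor) -> torch.Tensor:
    a = x.numpy()             # during tracing, backed by NumpyTensorVariable
    b = np.multiply(a, 2.0)   # numpy call translated to its tensor equivalent
    return torch.from_numpy(b)

compiled = torch.compile(f)   # goal: fewer graph breaks on functions like this
```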
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
d855b6aed6 [Dynamo] Add unit test for explicitly calling __call__ (#100146)
@wconstab As we discussed last Friday, I added the unit test for explicitly calling __call__ and added a comment to explain why we redirect ```UserMethodVariable.call_function``` to ```NNModuleVariable.call_method``` for a certain case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100146
Approved by: https://github.com/wconstab
2023-04-27 15:47:11 +00:00
cb569dbccd Fix cat forward-AD tests (#99596)
Fixes #94115

Not sure where to add the test. There is an existing sample input, but it apparently doesn't fail any test.

6580b160d3/torch/testing/_internal/common_methods_invocations.py (L2043)

Edited: Found the skipper and xfailed some failures, which are pre-existing and unrelated to the fix in question. (Those failures are from the gradgrad check, while the fix is for forward-AD.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99596
Approved by: https://github.com/soulitzer
2023-04-27 15:21:26 +00:00
659dcc5e71 [inductor] Fix argmin/max with duplicate values (#99920)
Fixes #99879

This adds `minimum_with_index` helper functions to compute the minimum
value and index simultaneously, with a preference for the smaller
index which is required to match eager in case of duplicates.

I also replace the mask-and-sum hack with a `tl.reduce` that uses
the previously mentioned helper. This additionally fixes the indices
being added together in the case of duplicates.

As an example, this is the kernel generated for `torch.argmin(x, 1)`:
```python
def triton_(in_ptr0, out_ptr0, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 1028 # dynamic_shapes=False
    rnumel = 1028 # dynamic_shapes=False
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rbase = tl.arange(0, RBLOCK)[None, :]
    x0 = xindex
    _tmp1 = tl.full([XBLOCK, RBLOCK], float("inf"), tl.float32)
    _tmp1_index = tl.full([XBLOCK, RBLOCK], 9223372036854775807, tl.int64)
    for roffset in range(0, rnumel, RBLOCK):
        rindex = roffset + rbase
        rmask = rindex < rnumel
        r1 = rindex
        tmp0 = tl.load(in_ptr0 + (r1 + (1028*x0)), rmask & xmask, eviction_policy='evict_last', other=0)
        _tmp1_next, _tmp1_index_next = triton_helpers.minimum_with_index(
            _tmp1, _tmp1_index, tmp0, rindex
        )
        _tmp1 = tl.where(rmask & xmask, _tmp1_next, _tmp1)
        _tmp1_index = tl.where(rmask & xmask, _tmp1_index_next, _tmp1_index)
    _, tmp1_tmp = triton_helpers.min_with_index(_tmp1, _tmp1_index, 1)
    tmp1 = tmp1_tmp[:, None]
    tl.store(out_ptr0 + x0, tmp1, xmask)
```

Or for a persistent reduction, it generates:
```python
    tmp0 = tl.load(in_ptr0 + (r1 + (1024*x0)), rmask & xmask, other=0)
    tmp2 = tl.where(rmask & xmask, tmp0, float("inf"))
    tmp3 = tl.broadcast_to(rindex, tmp2.shape)
    _, tmp4_tmp = triton_helpers.min_with_index(tmp2, tmp3, 1)
    tmp4 = tmp4_tmp[:, None]
```
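
For reference, a small eager-mode check of the tie-breaking behavior the fix matches (the smaller index wins among duplicate minima):

```python
import torch

x = torch.tensor([[3.0, 1.0, 1.0, 2.0]])
print(torch.argmin(x, 1))  # tensor([1]): the first index of the duplicated minimum
```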

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99920
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
f9c3fcd1df [inductor] Fix nan-handling of max and min reductions (#99881)
This adds helpers that replace Triton's `minimum`, `maximum`, `min` and
`max` with the correct NaN propagation. I also removed
`ops.int_minimum` in favor of `ops.minimum` because we can just omit
the nan-checks by checking the dtype.
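
For context, a small eager-mode illustration of the NaN propagation these helpers are meant to match:

```python
import torch

a = torch.tensor([1.0, float("nan"), 3.0])
print(torch.max(a))  # tensor(nan): the reduction propagates NaN
print(torch.maximum(torch.tensor(1.0), torch.tensor(float("nan"))))  # tensor(nan)
```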

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99881
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
ed2eb13d76 [inductor] Create triton_helpers module for helper functions (#99880)
This changes codegen of `torch.prod` from:
```python
   tl.reduce(tmp2, 1, _prod_accumulate)[:, None]
```
where `_prod_accumulate` is defined elsewhere, to

```python
   triton_helpers.prod(tmp2, 1)[:, None]
```

A quirk I uncovered though is that `TritonCodeCache` breaks if you
define any new symbol beginning with `triton_`, since it assumes that
must be the kernel name. Instead, I've made the kernel name an
explicit argument to `async_compile.triton` so it doesn't have to guess.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99880
Approved by: https://github.com/ngimel
2023-04-27 15:10:50 +00:00
ad21890f8f [c10d] Scalable PG initiation. (#99931)
Add use_local_synchronization argument to new_group.

When this argument is True, new_group does a store_barrier only on the ranks that are part of the group, not the whole cluster.

This addresses both scalability and composability problems associated with new_group.

Fixes #81291.

This is relanding #84224
As part of the original PR I did a quick benchmark of creating 3 PGs per rank using both functions and perf is the following:

new_group use_local_synchronization=False:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.12 |
| 8 | 0.25 |
| 16 | 0.51 |
| 32 | 0.87 |
| 64 | 1.50 |
| 128 | 2.87 |

new_group use_local_synchronization=True:
| World Size | Time (in secs) |
| --- | ----------- |
| 4 | 0.05 |
| 8 | 0.04 |
| 16 | 0.03 |
| 32 | 0.03 |
| 64 | 0.04 |
| 128 | 0.04 |

Scaling for `use_local_synchronization=False` is sublinear because the number of process groups created as a multiple of world_size decreases as we go up. It's 6 with world_size 4 and 192 with world_size 128.

Scaling for `use_local_synchronization=True` is constant as the number of store barriers executed per rank remains constant at 3.

Setup:

1 AWS host, backend gloo.
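
A minimal usage sketch (assuming the default process group is already initialized; the subgroup ranks are arbitrary):

```python
import torch.distributed as dist

# dist.init_process_group(...) is assumed to have run on every rank already.
subgroup_ranks = [0, 1]  # hypothetical subgroup
pg = dist.new_group(ranks=subgroup_ranks, use_local_synchronization=True)
# Only the ranks in subgroup_ranks take part in the store barrier during creation.
```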

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99931
Approved by: https://github.com/xw285cornell
2023-04-27 13:44:02 +00:00
2eab5abb50 sparse.sum backward: short circuit on zero/empty grad (#98838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98838
Approved by: https://github.com/pearu
2023-04-27 12:06:11 +00:00
67e0913de9 Add support for serializing real tensor data in after aot minifier (#99834)
The new minifier script looks like this:

```
import torch._dynamo.repro.after_aot
reader = torch._dynamo.repro.after_aot.InputReader(save_dir='/tmp/tmpcsngx39e')
buf0 = reader.storage('e2b39c716c0d4efb9fa57375a3902b9dab666893', 16)
t0 = reader.tensor(buf0, (4,))
args = [t0]
mod = make_fx(Repro(), tracing_mode='real')(*args)
```

The real tensor data is stored in the storages folder of the checkpoint dump directory. If you delete this folder or it is otherwise missing, we will transparently fall back to generating random data like before. The tensors are serialized using the content store from #99809, which means each storage is content-addressed and we will automatically deduplicate equivalent data (which is useful if you keep dumping out, e.g., your parameters). We don't use the tensor serialization capability from the content store; instead, all of the tensor metadata is stored inline inside the repro script (so that everything is in one file if you lose the checkpointed tensors).

We also add a stable_hash option to content store, where we use a slow SHA-1 sum on the data in CPU side to compute a hash that is stable across systems with the same endianness.

Out of rage, I also added support for Dtype.itemsize property access.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99834
Approved by: https://github.com/voznesenskym
2023-04-27 11:52:13 +00:00
5cfaea15c4 relu/threshold backward for sparse: enable 0-nnz grads (#98935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98935
Approved by: https://github.com/pearu
2023-04-27 10:57:05 +00:00
c2402a9257 Change caffe2 branch links to main (#100129)
Just a change

pytorch/tree/master -> pytorch/tree/main
pytorch/blob/master -> pytorch/blob/main
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100129
Approved by: https://github.com/huydhn
2023-04-27 10:31:50 +00:00
77a37a54ce Include all mkl/mkldnn related test files to CPU ATen backend (#99592)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99592
Approved by: https://github.com/kit1980
2023-04-27 10:26:01 +00:00
100a25d021 Basic dynamo support for traceable collectives (#94440)
Make traceable collectives work with torchdynamo,
bypassing problems with tracing the AsyncTensor subclass.

Accept a suboptimal solution for now, and optimize it later.
For now, wait happens immediately, which generally forces an early sync.

Later, find a way either in dynamo or AOT stack to handle
AsyncCollectiveTensor to get the wait in the optimal place.

Note on implementation:
- Dynamo traces 'user-level' fc apis that are designed to behave differently
  in eager vs compiled.  In eager, there will be work-obj registration and
  a wrapper subclass will insert a 'wait' call at the appropriate time.
  In compile/trace mode, wait will be immediately called, and work obj
  registration is required to be handled by the compile backend at runtime.
- Dynamo needs to trace into some of the helper functions in the 'user-level'
  api, such as '_expand_group' which is essentially a constant transformation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94440
Approved by: https://github.com/kumpera
2023-04-27 05:38:36 +00:00
925a3788ec [CUDA] Switch to at::empty in max_pool3d_with_indices_backward_cuda (#100138)
Looks like there's an extraneous call to `at::zero` as `gradInput` will always be zero'd by `max_pool3d_with_indices_backward_out_cuda_template`.

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100138
Approved by: https://github.com/ngimel
2023-04-27 04:52:29 +00:00
859e82a7a9 Making fsdp device-agnostic for custom-backend which implement cuda-semantics (#99024)
Consider a custom backend implemented on top of PrivateUse1 with semantics identical to CUDA (CUDA is so popular), named for example 'my_device' and registered under the module name torch.my_device.

This PR aims to satisfy the constraints of such a backend, which can be directly integrated into the current FSDP implementation.

The main issues addressed are:

#### 1. Device decision for FSDP wrapping of Modules without Parameters

Users typically organize FSDP code as follows:
```python
m = Module().to('my_device:0')
fsdp_m = FSDP(m)
```
or like this:
```python
m = Module()
fsdp_m = FSDP(m, device_id=torch.device('my_device', 0))
```
If the model has Parameters, everything works fine because FSDP will prioritize the device where the Parameters are located. However, for Modules without Parameters, the to() call has no side effects, and FSDP will assume the current CUDA device, which prevents the use of devices other than the current CUDA device for Modules without Parameters. Therefore, when FSDP is called with a device_id argument, this configuration takes top priority.

#### 2. Abstraction of a cuda-like device

Now, in addition to compute_device, _FSDPState includes a device_handler member. In fact, this device_handler is just a reference to either torch.cuda or torch.my_device. From now on, code that works on _FSDPState should use state.device_handler to create, wait on, and synchronize streams, just as it previously used torch.cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99024
Approved by: https://github.com/awgu
2023-04-27 04:13:28 +00:00
4456e932f8 [inductor] fix _print_Pow given reciprocal of dynamic dim with float exponent (#100090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100090
Approved by: https://github.com/XiaobingSuper, https://github.com/jansel
2023-04-27 04:10:15 +00:00
569eff85a0 inductor: enhance conv+binary fusion path test for cpu path (#100058)
The motivations for this PR are:

1. Add negative/positive testing for conv+binary fusion path.
2. Add an alias check for the in-place fusion path: if the write buffer is an alias tensor, we lower to the out-of-place path (one test is also added).
3. Fix https://github.com/pytorch/pytorch/issues/99842.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100058
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-27 04:09:22 +00:00
cyy
e248016472 fix missing-prototypes warnings in torch_cpu (Part 1) (#100053)
This PR fixes some missing-prototypes violations in the torch_cpu source.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100053
Approved by: https://github.com/albanD
2023-04-27 04:05:51 +00:00
e0bf51d3bf [dynamo] Add ddp_graphs artifact (#100021)
I want to be able to decouple DDP graph printing from the rest of
dynamo DEBUG-level logging, since frequently these logs are particularly
enlightening.

Differential Revision: [D45290919](https://our.internmc.facebook.com/intern/diff/D45290919/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100021
Approved by: https://github.com/wconstab, https://github.com/mlazos
2023-04-27 03:53:23 +00:00
1504bdf9e7 [inductor] logger message fix in split_cat (#100088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100088
Approved by: https://github.com/Skylion007, https://github.com/jansel
2023-04-27 03:33:08 +00:00
c0ecd98958 Rename DispatchKey.PrivateUse1 to custom device in torchgen. (#99406)
I want to use torchgen to generate code, and my yaml file format is the same as `native_functions.yaml`.
I will use PrivateUse1, but in my yaml file, I don't want to show PrivateUse1 to the user.
So I want to achieve the following result (e.g. my device is `YPU`):
```
>>>from torchgen.model import DispatchKey
>>>str(DispatchKey.PrivateUse1)
"YPU"
>>>DispatchKey.parse("YPU")
DispatchKey.PrivateUse1
```
I also thought that not everyone would need this feature, so I added a new function to handle this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99406
Approved by: https://github.com/ezyang
2023-04-27 03:30:48 +00:00
3588688ade inductor: simplify the test_mkldnn_pattern_matcher.py code (#100057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100057
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-04-27 03:16:03 +00:00
13259fe8f0 [ONNX] Fix type annotation for 'fx_to_onnxscript' (#100050)
Curious why it wasn't caught by linter and beartype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100050
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-04-27 03:05:28 +00:00
3f5d768b56 Refactors/improvements in _inductor/fx_passes (#100063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100063
Approved by: https://github.com/devashishshankar
2023-04-27 01:18:09 +00:00
be8c7c06b6 [Tensor Parallel] Simplify distribute for MHA (#100046)
This function is only called for nn.MHA or the custom MHA we use, and
if it is the former it is converted to the latter. So this check can actually
be an assert.

Differential Revision: [D45300396](https://our.internmc.facebook.com/intern/diff/D45300396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100046
Approved by: https://github.com/wanchaol
2023-04-27 00:54:21 +00:00
97f4af3f4f add sm80orlater check for bfloat test in test_torchinductor (#98034)
Small fix to add an sm80orlater check for the bfloat16 tests in test_torchinductor, since bfloat16 is not supported on sm < 80.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98034
Approved by: https://github.com/ngimel
2023-04-27 00:45:23 +00:00
bc0c74bcd5 Don't apply _Py_OPCODE twice (#97986)
It's already applied in PyInstDecoder::opcode.
Applying it twice returns an incorrect result on big-endian systems.

This change fixes 14 tests in test/functorch/test_dims.py on big-endian systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97986
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-04-27 00:27:32 +00:00
32a67e42c4 Introduce FXGraphExtractor into torch.onnx.dynamo_export (#99940)
The current API architecture can be seen as 3 independent exporters as shown below. The public API `dynamo_export()` defaults to one of the 3 variants and the other 2 must be used by instantiating private classes: ![image](https://user-images.githubusercontent.com/5469809/231567368-ec899718-b7c1-4e59-b6a8-383142df245a.png)

This PR refactors the API in a way that `dynamo_export` is the only way to use the ONNX exporter. It defaults to a FX tracer based on ``torch.export``, but an internal-only idiom allows switching the FX tracer (aka `FXGraphExtractor` interface), as shown below:

![image](https://user-images.githubusercontent.com/5469809/231567495-3936362d-06de-4cfc-b752-6c2060701c08.png)

Summary of changes:

* Unifies all exporter variants under a single `dynamo_export` API
  * `ResolvedExportOptions` was expanded to allow `fx_tracer: FXGraphExtractor` to be specified, selecting which FX graph extractor to use, according to the design proposal
  * As a consequence, `torch.onnx._internal.exporter.Exporter` does not have to *internally* specialize for each type of FX API that the exporter might be used with. This leads to a single `Exporter` with many `FX graph extractors`
  * Before in red, after in green: ![image](https://user-images.githubusercontent.com/5469809/232633531-4c67449b-4863-474d-9e18-78fc1d31b1bd.png)
* Input processing was moved from `Exporter` subclasses to `FXGraphExtractor` subclasses, where it is actually consumed
  * `Exporter` is a [data]class that holds export options, model and input data in a single cohesive object. Specializing it means creating different exporters instead of having one exporter capable of exporting models through different options.
  * `Exporter` doesn't consume the `model_args` that caused it to specialize
* Improved the circular dependency story.
  * https://github.com/pytorch/pytorch/pull/99070 moves `import torch.onnx` to after all dynamo subcomponents, preventing `torch.onnx` from having circular dependencies when `torch.XXXX` is imported during initialization
  * There are other points we need to improve in subsequent PRs. APIs are organized in a way that it is easy to "import too much"
* Refactored `decomposition_table` as an internal-only `ResolvedExportOptions` property.
  * Similar to input processing, this helper is not actually consumed at the `Exporter` layer. This PR moves it to the layer in which it is used
* Demoted `Exporter.model_signature` to a simple standalone helper
  * There is no need to have this as an exporter method; this is a standard `inspect.signature` usage without any state

Possible next steps are:
* Decouple `passes` and `dispatching` from the cluttered `export_fx_to_onnx`
* Further integration with http://github.com/pytorch/pytorch/pull/98421/ into `FXGraphExtractor` public API + helper for unit testing
  * Some passes are changing input processing, which are not captured by the proposed input adapter

** COPILOT SUMMARY**
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at bdaba31</samp>

### Summary
📝🚀🔧

<!--
1.  📝 - This emoji represents the formatting and documentation changes, such as adding an empty line, updating the `__all__` list, and improving the type annotations and docstrings.
2.  🚀 - This emoji represents the new features and enhancements, such as adding the `DynamoExport` class, supporting custom export options, and flattening HuggingFace model outputs.
3.  🔧 - This emoji represents the refactoring and restructuring changes, such as using the FX graph representation, the `io_adapter` module, and the simplified FX symbolic tracer, and renaming and reorganizing some modules and classes.
-->
This pull request refactors the ONNX exporter code to use the FX graph representation and the new `io_adapter` module for input and output adaptation. It also adds support for custom export options and flattening HuggingFace model outputs in the ONNX test framework. It updates the ONNX dynamo exporter API tests and adds a new module `torch/onnx/_internal/fx/dynamo_graph_extractor.py` for exporting FX models to ONNX with dynamo support. It fixes some type annotations, imports, and formatting issues in the ONNX exporter code.

> _The ONNX exporter got a new look_
> _With FX graph and dynamo hook_
> _It uses `io_adapter`_
> _And custom options matter_
> _For HuggingFace models and `model_signature` book_

### Walkthrough
*  Move the `fx` submodule from `torch/onnx/_internal` to `torch/onnx/_internal/fx`, and rename some of its modules ( [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL21-R26), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L25-R26), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L5-R15), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-4da17ba9e1a187bfacb65a70d6ff15f6c2a60480be8e20fc452d8984a279cd0aL3-R30))
*  Add a new module `torch/onnx/_internal/fx/dynamo_graph_extractor.py` that defines a `DynamoExport` class for generating FX graphs using the `torch._dynamo.export` API ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-078d7b8d0e4050e650fc3c15dc97a0564852191ac7b7bdc069d0b3959c5ee39aR1-R77))
*  Add a new module `torch/onnx/_internal/fx/io_adapter.py` that defines the input and output adapter classes and steps for the ONNX exporter, and a helper function to wrap models with output adapters ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L159-R192), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-4da17ba9e1a187bfacb65a70d6ff15f6c2a60480be8e20fc452d8984a279cd0aL3-R30), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-4da17ba9e1a187bfacb65a70d6ff15f6c2a60480be8e20fc452d8984a279cd0aR72-R176), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-4da17ba9e1a187bfacb65a70d6ff15f6c2a60480be8e20fc452d8984a279cd0aL237-R478))
*  Update the `ResolvedExportOptions` class in `torch/onnx/_internal/exporter.py` to inherit from the `ExportOptions` class, and to set the `fx_tracer` and `decomposition_table` attributes based on the `dynamo_graph_extractor` and `function_dispatcher` modules ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L81-R99), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862R117-R126))
*  Update the `Exporter` class in `torch/onnx/_internal/exporter.py` to remove the `export` method and add a new abstract `generate_fx` method, and to use the `fx_tracer` attribute to generate and export the FX graph ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L413-R475), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L422-R486))
*  Update the `FXSymbolicTraceExporter` class in `torch/onnx/_internal/fx/fx_symbolic_graph_extractor.py` to be renamed to `FXSymbolicTracer`, and to inherit from `exporter.FXGraphExtractor` and implement the `generate_fx` method ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L128-R175), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L157-R219))
*  Update the `export_fx_to_onnx` method of the `FXSymbolicTracer` class to be renamed to `_export_fx_to_onnx`, and to be moved to the `exporter.FXGraphExtractor` class ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L193-R234))
*  Update the `dynamo_export` function in `torch/onnx/_internal/exporter.py` to accept and return `ResolvedExportOptions` and `Exporter` objects, respectively ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L536-R606))
*  Update the `run_test_with_fx_to_onnx_exporter_and_onnx_runtime` function in `test/onnx/onnx_test_common.py` to add a new parameter `export_options` for passing custom export options to the `torch.onnx.dynamo_export` function ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eR176), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-1b38383dc1a0228a835d83bb7c4ba2d0c1bcd41297be5c6336572c525846166eL216-R222))
*  Update the `test_log_sigmoid` and `_test_large_scale_exporter` tests in `test/onnx/test_fx_to_onnx_with_onnxruntime.py` to use the updated `run_test_with_fx_to_onnx_exporter_and_onnx_runtime` function and the `torch.onnx.dynamo_export` function ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL297-R301), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL682-R686), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL696-R716), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL721-R730))
*  Update the `test_raise_on_invalid_save_argument_type` test in `test/onnx/dynamo/test_exporter_api.py` to use the `io_adapter.InputAdapter` and `io_adapter.OutputAdapter` classes instead of the `exporter.InputAdapter` and `exporter.OutputAdapter` classes ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L139-R139))
*  Move the `model_signature` property from the `Exporter` class in `torch/onnx/_internal/exporter.py` to a standalone function in `torch/onnx/utils.py`, and update the references to it ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L432-R505), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L157-R219), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-849a5778e2dcf7f36587967273cee0bf20642e35bf4c79405111ea3417c3fb3cL54-R75))
*  Move the `UnsatisfiedDependencyError` class from the `Exporter` class in `torch/onnx/_internal/exporter.py` to the top level of the module, and update the references to it ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L442-R512))
*  Rename the `_create_onnx_friendly_decomposition_table` function and the `_ONNX_FRIENDLY_DECOMPOSITION_TABLE` dictionary in `torch/onnx/_internal/fx/function_dispatcher.py` to `_create_default_onnx_decomposition_table` and `_DEFAULT_ONNX_EXPORTER_DECOMPOSITION_TABLE`, respectively, and update the references to them ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL213-R219), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL231-R239))
*  Update the imports in `torch/onnx/_internal/fx/function_dispatcher.py` to use the `torch._ops` and `torch._decomp` modules instead of the `torch.ops` and `torch.decomp` modules, and to use aliases for accessing the `onnxscript.function_libs.torch_aten.ops` and `torch._ops` modules ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL11-R16), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL35-R156), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL160-R166), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL173-R182), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL189-R194), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL201-R204), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL231-R239))
*  Update the `ExportOutput` class in `torch/onnx/_internal/exporter.py` to use the `InputAdapter` and `OutputAdapter` classes from `io_adapter` instead of the ones defined in the same module ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L275-R199))
*  Update the type annotations in `torch/onnx/_internal/fx/serialization.py` and `torch/onnx/_internal/exporter.py` to fix some inconsistencies ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0c7a4333620a22a5c3e5315e30272b59fb7a11b393cb42f8255070bedeb02738L15-R15), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0c7a4333620a22a5c3e5315e30272b59fb7a11b393cb42f8255070bedeb02738L83-R83), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L11-R11), [link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862R18))
*  Remove an unused import of `inspect` from `torch/onnx/_internal/exporter.py` ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L5))
*  Remove an unused import of `torch._dynamo` from `torch/onnx/_internal/fx/passes/shape_inference.py` ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-d38827b1f79525963c39e5c480240cd81f4edcaf8b3bd374a1c6ee2fdb28b334L7))
*  Add a comment to `torch/onnx/_internal/fx/passes/shape_inference.py` to explain why the import of `torch._dynamo` is done inside the `_run` method of the `ShapeInferenceWithFakeTensor` class ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-d38827b1f79525963c39e5c480240cd81f4edcaf8b3bd374a1c6ee2fdb28b334R32-R35))
*  Fix a typo in the docstring of the `_module_expansion_symbolic_trace` function in `torch/onnx/_internal/fx/fx_symbolic_graph_extractor.py` ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-3eef404cb9d85216c050be153c33255ebce1170a77d8b9b17be79bcfb238c9c4L96-R98))
*  Add an empty line to `torch/onnx/__init__.py` for formatting purposes ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-c3c8c09b65c1235ca4494633c6a0aab2761a11a7653ddaf9f874bbcd91e15553R12))
*  Delete the `torch/onnx/_internal/fx/__init__.py` file ([link](https://github.com/pytorch/pytorch/pull/99940/files?diff=unified&w=0#diff-a39fa3741f027bb9717388fc922d1e846fbd43d44f2c5fbee4e8d2188a7edb85))

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99940
Approved by: https://github.com/BowenBao, https://github.com/jansel
2023-04-27 00:25:28 +00:00
763e0a9027 [inductor] fix inconsistent behaviours when padding size is zero (#100082)
Fixes #97117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100082
Approved by: https://github.com/jansel
2023-04-26 23:58:04 +00:00
19e81b7b19 [BE][DTensor] add DeviceMesh test to periodic testing list (#100029)
## Why
This PR adds `test_device_mesh.py` to the periodic tests because it requires more GPUs (4/8) than the CI machine has.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100029
Approved by: https://github.com/wanchaol, https://github.com/fduwjj
2023-04-26 23:45:10 +00:00
4c6f7cbc86 Fix prims unbind if given dimension size is 0 (#100122)
Fixes #99832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100122
Approved by: https://github.com/ngimel
2023-04-26 23:40:21 +00:00
2989d6c93d [Dynamo] Fix constructing lazy submodule inside of lazy module's initialize_parameters (#100047)
This PR fixed two issues:
* Constructing a lazy submodule inside of a lazy module's ```initialize_parameters``` - don't unspecialize the module if it's lazy.
* Fixes #100001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100047
Approved by: https://github.com/jansel
2023-04-26 23:36:31 +00:00
fab2e3971f enable -Werror=sign-compare in our Bazel build (#98671)
enable -Werror=sign-compare in our Bazel build

Summary:
This is already turned on for CMake, let's see what breaks.

Test Plan: Rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98671
Approved by: https://github.com/kit1980
2023-04-26 23:23:24 +00:00
6789342a56 [dynamo] Make bytecode logging off-by-default (#100093)
A big model (like Meta's production models) can dump 100s of MBs of
bytecode, making the logs virtually unusable.  Let's only turn these on if
they're explicitly requested.

Differential Revision: [D45314055](https://our.internmc.facebook.com/intern/diff/D45314055/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100093
Approved by: https://github.com/mlazos
2023-04-26 23:06:22 +00:00
c523d7d899 Add a new hook (#99854)
Differential Revision: D45220984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99854
Approved by: https://github.com/albanD
2023-04-26 23:00:38 +00:00
eaa00017c8 S390x tests (#99871)
Disable tests using quantized operators if QNNPACK is not available

Two disabled tests use Int8FC operators
which are not available if QNNPACK is not available,
and fail only due to that.

Disable cpuid_test on s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99871
Approved by: https://github.com/albanD
2023-04-26 21:48:03 +00:00
45337e20bb Fix byteswapping (#99869)
On big endian systems byteswapping should be done the other way around.

This change fixes TestE2ETensorPipe.TestTrainingLoop test from
test_cpp_rpc testsuite on big endian systems.

Use uint64_t when decoding double values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99869
Approved by: https://github.com/ezyang
2023-04-26 21:44:07 +00:00
5f138a6b65 [minifier][after dynamo] clone inputs while retaining gradness (#100066)
Helps with minifying one failure in https://github.com/pytorch/pytorch/issues/98561
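
For context, "clone inputs while retaining gradness" boils down to something like the following sketch (illustrative only, not the PR's exact helper):

```python
import torch

def clone_input(x: torch.Tensor) -> torch.Tensor:
    # Detach from the original graph, but keep the requires_grad flag so the
    # minified repro still exercises the same autograd paths as the original failure.
    cloned = x.clone().detach()
    cloned.requires_grad_(x.requires_grad)
    return cloned
```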

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100066
Approved by: https://github.com/ezyang
2023-04-26 21:31:18 +00:00
3d39bd5976 [dynamo] Remove redundant recompile call (#100084)
A single call to the `GraphModule.recompile` function occurs after the `GraphModule` has been constructed.
62f9189d9d/torch/_dynamo/output_graph.py (L754-L755)

However, the recompile function has already been called once during construction, so this call should be redundant.
```
call stack:
  recompile, graph_module.py:644
  graph, graph_module.py:411
  __setattr__, module.py:1674
  __init__, graph_module.py:370
  compile_and_call_fx_graph, output_graph.py:754
  ...
```

So maybe it can be deleted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100084
Approved by: https://github.com/ezyang
2023-04-26 21:23:21 +00:00
b9146d8b0b Remove inclusion of non-existent header on s390x (#99870)
This change fixes the build on s390x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99870
Approved by: https://github.com/albanD
2023-04-26 21:13:17 +00:00
dc10004553 Add asan slow test shard (#99925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99925
Approved by: https://github.com/huydhn
2023-04-26 21:10:55 +00:00
9bbd3d6489 [export] ExportPassBase + view_copy pass (#100000)
* Added ExportPassBase, an interpreter based helper pass writing class
* It can also help maintain the dialect based on the operator namespace through having users override the `get_valid_dialects` function (returning an empty list implies the pass works for any dialect).
* Added a `ReplaceBrokenOpsWithFunctionalOpsPass` to replace all ops that have not been converted by functionalization with their functional equivalents.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100000
Approved by: https://github.com/gmagogsfm
2023-04-26 21:01:25 +00:00
9bf2dfbbb0 migrate memcpy src to const_data_ptr (#98781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98781
Approved by: https://github.com/Skylion007
2023-04-26 20:43:59 +00:00
e5291e633f Revert "Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091)"
This reverts commit 1183eecbf19f77e2b1d9f3cee56dd8039653a5f5.

Reverted https://github.com/pytorch/pytorch/pull/100091 on behalf of https://github.com/huydhn due to CPU jobs start failing in trunk due to some error in MSVC setup
2023-04-26 19:17:58 +00:00
006785cd46 [dynamo][hf_bigbird] Actually graph break on tensor.unsqueeze_/resize_ (#99986)
Currently, we return `unimplemented` without a graph break on seeing an x.unsqueeze_ when x is an input. This essentially means we fall back to the original frame.

This PR actually graph breaks so that we can generate the continuation frame for the rest of the function. Instead of graph breaking at LOAD_ATTR, we delay the graph break to the actual CALL_FUNCTION, where it's cleaner to graph break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99986
Approved by: https://github.com/jansel
2023-04-26 18:50:06 +00:00
aa99c5b4ed Added round_with_scale_factor arg to ATen (#97868)
Addresses #62396 following the strategy described in https://github.com/pytorch/pytorch/pull/64983#issuecomment-1026177629.

Fixing the output size to match OpenCV, scikit-image, and SciPy when a scale factor is specified; this is done on the ATen side only due to JIT FC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97868
Approved by: https://github.com/lezcano, https://github.com/mikaylagawarecki
2023-04-26 18:48:37 +00:00
cc628293bf simplify method_def generation (#100059)
simplify method_def generation

Summary:
This removes some duplication. This was originally done to streamline
a subsequent change, but that change turned out to be
misguided. Nevertheless, this is a nice simplification.

Test Plan:
This should change the code gen by removing some redundant
parentheses. Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100059
Approved by: https://github.com/ezyang
2023-04-26 18:46:57 +00:00
c778980fb8 remove casts to getter in python_cpp_function.h (#100065)
remove casts to `getter` in python_cpp_function.h

Summary:
These were triggering the warning `-Wcast-function-type-strict` and
breaking the build on my machine.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100065
Approved by: https://github.com/ezyang
2023-04-26 18:46:41 +00:00
a337c42dfc make ATen/native/cuda/AdaptiveAveragePooling.cu data_ptr-correct (#100030)
make ATen/native/cuda/AdaptiveAveragePooling.cu data_ptr-correct

Summary:
Traced through each input and output to ensure correctness.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100030
Approved by: https://github.com/ezyang
2023-04-26 18:42:20 +00:00
6170be9012 make ATen/native/cuda/EmbeddingBag.cu data_ptr-correct (#99083)
make ATen/native/cuda/EmbeddingBag.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99083
Approved by: https://github.com/ezyang
2023-04-26 18:41:55 +00:00
9b3862cd02 make im2col calls data_ptr-correct (#99111)
make im2col calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99111
Approved by: https://github.com/ezyang
2023-04-26 18:41:25 +00:00
06ca9bb915 make col2im calls data_ptr-correct (#99112)
make col2im calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99112
Approved by: https://github.com/ezyang
2023-04-26 18:41:14 +00:00
0bc02d3805 [pt2] remove unnecessary if expr (#99865)
`LocalSource(k)` is equivalent to `LocalSource((k))`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99865
Approved by: https://github.com/ezyang
2023-04-26 18:35:20 +00:00
004f3d71aa [export] Move verifier over to export from torch/fx (#100019)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100019
Approved by: https://github.com/tugsbayasgalan
2023-04-26 18:26:46 +00:00
6c550bb4d5 [quant][be] Easier way to override default in QConfigMapping (#99888)
Summary: This commit adds a private helper function to override
the default QConfig in the default QConfigMapping. Previously we
needed to override all the object_types manually while skipping
the fixed qparams ops. This led to duplicate code every time
someone wanted a new default QConfig. After this commit, we can
just call the same helper function instead.

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers: jerryzh168, vkuzo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99888
Approved by: https://github.com/vkuzo, https://github.com/jerryzh168
2023-04-26 18:14:01 +00:00
9f0092c4b7 [CI] Replace timm_efficientdet with timm_vision_transformer in smoketest (#100106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100106
Approved by: https://github.com/yanboliang
2023-04-26 18:03:59 +00:00
3a5427baf4 Add torch.utils._content_store (#99809)
Implements a simple content-addressable store for storages (with tensors implemented as cheap references on top), enabling incremental serialization of tensors to disk, which I intend to use in the accuracy repro extractor.  Check the comment at the top of torch/utils/_content_store.py for more details on the intended use case.

One major piece of this PR is implementing the content hash for tensors. For our prospective use case, we may need to repeatedly hash up to 80 GB of tensor data every time we snapshot (and we may snapshot multiple times). Using a conventional cryptographic hash and hashing each snapshot would likely take on the order of minutes, which seemed too slow to me. So instead, I implemented a crappy hash function that can be run on GPU. It is at least somewhat theoretically grounded: using random parameters generated by Philox, we use the standard shift-multiply and xor sum universal hash family. The hash function is a bit dorky though; instead of properly doing 160-bit math, it just runs a 32-bit hash five times and cats them together. By the way, this sets the first precedent for a kernel in the PyTorch library which MUST be torch.compile'd to be run (in fact, this kernel does not run in eager mode because of the use of xor_sum, which doesn't actually exist in ATen).
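
To make the flavor of the hash concrete, here is a tiny CPU-side sketch of a shift-multiply / xor-sum style hash over a tensor's raw bits. It is illustrative only: the real kernel uses Philox-generated parameters, five 32-bit lanes, and the new inductor prims, and none of the names below come from this PR.

```python
import functools
import operator

import torch

def sketch_hash(t: torch.Tensor, seed: int = 0) -> int:
    """Multiply-shift universal hash per 32-bit word, combined with an xor-sum."""
    assert t.dtype == torch.float32, "sketch only handles float32 storages"
    # Reinterpret the float bits as unsigned 32-bit words.
    bits = t.detach().contiguous().view(torch.int32).to(torch.int64) & 0xFFFFFFFF
    gen = torch.Generator().manual_seed(seed)
    # Random odd multipliers stand in for the Philox-generated parameters.
    a = torch.randint(1, 2**31, bits.shape, generator=gen, dtype=torch.int64) | 1
    h = ((a * bits) & 0xFFFFFFFF) >> 16            # shift-multiply per element
    return functools.reduce(operator.xor, h.flatten().tolist(), 0)  # xor-sum combine
```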

I had to add a few more primitives to inductor, namely randint (over the entire int range) and xor_sum.  Fortunately, these primitives are natively supported by Triton/C++, and so they were very easy to plumb through.  xor_sum is exposed as a prim, while randint special cases on when low/high span the entire 32-bit signed integer range.

Thanks to Jeff Johnson for letting me bounce ideas off him on a Saturday morning lol.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99809
Approved by: https://github.com/voznesenskym
2023-04-26 18:02:59 +00:00
45bf3f6216 Optimized EMA implementation (#94820)
This PR proposes an optimized way to do Exponential Moving Average (EMA), which is faster than the current way using `swa_utils.AveragedModel` described in https://pytorch.org/docs/stable/optim.html#custom-averaging-strategies.

This implementation is asynchronous, and is built as an optimizer wrapper so that the EMA weight update happens without any additional CPU/GPU sync, just after optimizer steps, and with limited code changes.

Example usage:
```
model = Model().to(device)
opt = torch.optim.Adam(model.parameters())

opt = EMAOptimizer(opt, device, 0.9999)

for epoch in range(epochs):
    training_loop(model, opt)

    regular_eval_accuracy = evaluate(model)

    with opt.swap_ema_weights():
        ema_eval_accuracy = evaluate(model)
```
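
For readers who just want the gist, here is a very reduced, synchronous sketch of what such a wrapper does (this is not the proposed `EMAOptimizer`: it omits the async/device handling, and every name except the update rule is made up for illustration):

```python
import contextlib
import torch

class TinyEMA:
    """ema <- decay * ema + (1 - decay) * param after every optimizer step."""

    def __init__(self, optimizer: torch.optim.Optimizer, decay: float = 0.9999):
        self.optimizer = optimizer
        self.decay = decay
        self.params = [p for g in optimizer.param_groups for p in g["params"]]
        self.ema = [p.detach().clone() for p in self.params]

    def step(self, closure=None):
        loss = self.optimizer.step(closure)
        with torch.no_grad():
            torch._foreach_mul_(self.ema, self.decay)
            torch._foreach_add_(self.ema, self.params, alpha=1.0 - self.decay)
        return loss

    @contextlib.contextmanager
    def swap_ema_weights(self):
        # Temporarily evaluate with the averaged weights, then restore.
        backup = [p.detach().clone() for p in self.params]
        for p, e in zip(self.params, self.ema):
            p.data.copy_(e)
        try:
            yield
        finally:
            for p, b in zip(self.params, backup):
                p.data.copy_(b)
```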

Here are some benchmarks (time per iteration) on various torchvision models:

|model|this PR iteration time                      |swa_utils.AveragedModel iteration time| iteration speedup                                      |
|-----|-----------------------------|-----------------------|---------------------------------------------|
|     |                             |                       |                                             |
|regnet_x_1_6gf|62.73                        |67.998                 |1.08                                         |
|regnet_x_3_2gf|101.75                       |109.422                |1.08                                         |
|regnet_x_400mf|25.13                        |32.005                 |1.27                                         |
|regnet_x_800mf|33.01                        |37.466                 |1.13                                         |
|regnet_x_8gf|128.13                       |134.868                |1.05                                         |
|regnet_y_16gf|252.91                       |261.292                |1.03                                         |
|regnet_y_1_6gf|72.14                        |84.22                  |1.17                                         |
|regnet_y_3_2gf|99.99                        |109.296                |1.09                                         |
|regnet_y_400mf|29.53                        |36.506                 |1.24                                         |
|regnet_y_800mf|37.82                        |43.634                 |1.15                                         |
|regnet_y_8gf|196.63                       |203.317                |1.03                                         |
|resnet101|128.80                       |137.434                |1.07                                         |
|resnet152|182.85                       |196.498                |1.07                                         |
|resnet18|29.06                        |29.975                 |1.03                                         |
|resnet34|50.73                        |53.443                 |1.05                                         |
|resnet50|76.88                        |80.602                 |1.05                                         |
|resnext101_32x8d|277.29                       |280.759                |1.01                                         |
|resnext101_64x4d|269.56                       |281.052                |1.04                                         |
|resnext50_32x4d|100.73                       |101.102                |1.00                                         |
|shufflenet_v2_x0_5|10.56                        |15.419                 |1.46                                         |
|shufflenet_v2_x1_0|13.11                        |18.525                 |1.41                                         |
|shufflenet_v2_x1_5|18.05                        |23.132                 |1.28                                         |
|shufflenet_v2_x2_0|25.04                        |30.008                 |1.20                                         |
|squeezenet1_1|14.26                        |14.325                 |1.00                                         |
|swin_b|264.52                       |274.613                |1.04                                         |
|swin_s|180.66                       |188.914                |1.05                                         |
|swin_t|108.62                       |112.632                |1.04                                         |
|swin_v2_s|220.29                       |231.153                |1.05                                         |
|swin_v2_t|127.27                       |133.586                |1.05                                         |
|vgg11|95.52                        |103.714                |1.09                                         |
|vgg11_bn|106.49                       |120.711                |1.13                                         |
|vgg13|132.94                       |147.063                |1.11                                         |
|vgg13_bn|149.73                       |165.256                |1.10                                         |
|vgg16|158.19                       |172.865                |1.09                                         |
|vgg16_bn|177.04                       |192.888                |1.09                                         |
|vgg19|184.76                       |194.194                |1.05                                         |
|vgg19_bn|203.30                       |213.334                |1.05                                         |
|vit_b_16|217.31                       |219.748                |1.01                                         |
|vit_b_32|69.47                        |75.692                 |1.09                                         |
|vit_l_32|223.20                       |258.487                |1.16                                         |
|wide_resnet101_2|267.38                       |279.836                |1.05                                         |
|wide_resnet50_2|145.06                       |154.918                |1.07                                         |

You can see that in all cases it is faster than using `AveragedModel`. In fact in many cases, adding EMA does not add any overhead since the computation is hidden behind the usual iteration flow.

This is a similar implementation to the one currently in [NVIDIA NeMo](https://github.com/NVIDIA/NeMo).

If the team is interested in merging this, let me know and I'll add some documentation similar to `swa_utils` and tests.

Credits to @szmigacz for the implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94820
Approved by: https://github.com/janeyx99
2023-04-26 18:02:11 +00:00
c6ab4ff35c convert to mutable_data_ptr data_ptr calls immediately after at::empty() (#98734)
The tensor is uninitialized in this case, so it is highly likely to be
written before it will be read. Furthermore, because it is a new
tensor, there's no harm in getting mutable access to it: there's no
lazy copy that would be materialized.

This was automatically generated with a regular expression.

Differential Revision: [D44831830](https://our.internmc.facebook.com/intern/diff/D44831830/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98734
Approved by: https://github.com/ezyang
2023-04-26 17:40:24 +00:00
65823619c0 convert trivial data reads to const_data_ptr (#98751)
Differential Revision: [D44834421](https://our.internmc.facebook.com/intern/diff/D44834421/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98751
Approved by: https://github.com/ezyang
2023-04-26 17:37:12 +00:00
5b4a523583 Add all_reduce_coalesced to functional collectives (#98640)
This adds all_reduce_coalesced to MTPG to ease testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98640
Approved by: https://github.com/wanchaol
2023-04-26 17:05:54 +00:00
9bc03db670 Move nn.module state dict pre hook (#98964)
Some modules, like lazy modules, may override '_save_to_state_dict()'; in that case the pre_state_dict hook will not be called. So move the pre_state_dict hook out of '_save_to_state_dict()' to make sure the pre hook is always called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98964
Approved by: https://github.com/albanD
2023-04-26 16:51:13 +00:00
bb4e9e9124 functionalization: error during mutations on mem overlap (#99919)
Fixes https://github.com/pytorch/pytorch/issues/98143.

If a user mutates a tensor that has overlapping memory, this can cause silent correctness issues with torch.compile. This PR adds a few checks to detect that situation and error.

Unfortunately `at::has_internal_overlap()` wasn't smart enough to detect the case linked in the issue, so I added a (simple) check that only runs during functionalization and can catch the overlapping memory. We might need to revisit and add more complex checks later though (luckily, functionalization runs at compilation time, so we can afford more expensive checks).
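
A hypothetical repro of the class of program this guards against (the shapes and op are made up, and the call is left commented out since the exact error message is not part of this sketch):

```python
import torch

@torch.compile
def f(x):
    x.mul_(2)       # in-place mutation under torch.compile
    return x

base = torch.ones(10)
overlapping = base.as_strided((2, 5), (1, 1))  # several elements alias the same memory
# With this check, functionalization should detect the internal overlap and raise
# during compilation instead of silently producing an incorrect result.
# f(overlapping)
```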

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99919
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-04-26 16:39:40 +00:00
1183eecbf1 Migrate jobs from windows.4xlarge to windows.4xlarge.nonephemeral instances (#100091) 2023-04-26 18:32:50 +02:00
33fba6ef07 [SPMD] Add arange and zeros to default factory ops (#100037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100037
Approved by: https://github.com/mrshenli, https://github.com/wanchaol
2023-04-26 16:32:10 +00:00
afa9d10ed6 [inductor] Support mixed device in cpp wrapper (#99950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99950
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-26 16:26:56 +00:00
e789de952f Make sizevar addition work properly (#100015)
Rm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100015
Approved by: https://github.com/ezyang
2023-04-26 15:59:26 +00:00
7ec4392068 Remove in-place operations in NegativeBinomial (#96748)
This is a suggestion for a minor modification.

The line `log_normalization[self.total_count + value == 0.] = 0.` prevents JIT compilation when the condition occurs, with the error message

`RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation.`

I propose an alternative that does not involve in-place operations. It uses the function `nan_to_num()` to replace infinite values with 0 where `self.total_count + value == 0.`, while leaving `nan` and `-inf` as they are. Readability is suboptimal because the code does not actually replace nan with numbers, but I could not find a function that only replaces infinite values.
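
As a toy illustration of the replacement (the exact keyword arguments are my assumption; only the shape of the idea matters here):

```python
import torch

log_norm = torch.tensor([0.5, float("inf"), float("-inf"), float("nan")])
# Old approach: in-place masked assignment, which breaks JIT on leaf variables
# that require grad:
#   log_norm[mask] = 0.
# New approach: replace only +inf with 0, leaving nan and -inf untouched.
log_norm = torch.nan_to_num(log_norm, nan=float("nan"), posinf=0.0, neginf=float("-inf"))
```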

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96748
Approved by: https://github.com/fritzo, https://github.com/soulitzer
2023-04-26 14:45:08 +00:00
81978120ec [MPS] Fix trace exceptions not raised for error inputs (#99239)
Also rename `trace_mps_out` to `trace_mps` as it is not an out version.

Remove `index_add` from XFAILLIST as it seems to work as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99239
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
f4a37c9a5d [MPS] Fix max_pool2d exceptions not raised for error inputs (#99238)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99238
Approved by: https://github.com/kulinseth
2023-04-26 14:41:50 +00:00
f4cf744380 [MPS] Fix gelu exceptions not raised for error inputs (#99237)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99237
Approved by: https://github.com/kulinseth
2023-04-26 14:41:46 +00:00
aaa3eb059a add some missing includes (#100049)
add some missing includes

Summary:
These were failing in my build environment. Clang16, Fedora38, no
extra build config.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100049
Approved by: https://github.com/Skylion007
2023-04-26 14:27:06 +00:00
4b1310bfa4 suppress -Wcast-function-type-strict when casting to PyCFunction (#100068)
suppress `-Wcast-function-type-strict` when casting to PyCFunction

Summary:
These casts are a necessary evil due to the design of Python. Python
ultimately casts it back to the original type based on the flags
specified in the `PyMethodDef`.

Nevertheless, the new Clang flag `-Wcast-function-type-strict` breaks
with this.

While here, convert the cast to a `reinterpret_cast`.

Test Plan: Should be a no-op. Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100068
Approved by: https://github.com/Skylion007
2023-04-26 14:24:26 +00:00
69bf0241b1 Allow calling functorch transforms when their DispatchKeys are disabled (#100011)
This was always the intended behavior: e.g. you should be able to call
functorch.functionalize even when DispatchKey::Functionalize is
disabled.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100011
Approved by: https://github.com/tugsbayasgalan
2023-04-26 13:04:13 +00:00
62f9189d9d make ATen/native/cuda/AveragePool2d.cu data_ptr-correct (#99336)
make ATen/native/cuda/AveragePool2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99336
Approved by: https://github.com/ezyang
2023-04-26 06:01:30 +00:00
a0934f8bad Replace maybe_guard with statically_known (#99383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99383
Approved by: https://github.com/ngimel
2023-04-26 05:53:48 +00:00
400dbde8a0 make ATen/native/cuda/ScanUtils.cuh data_ptr-correct (#99080)
make ATen/native/cuda/ScanUtils.cuh data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99080
Approved by: https://github.com/Skylion007
2023-04-26 05:48:47 +00:00
1fcf40da63 [MPS] Add linear inputs check (#99228)
Fixes #98211

https://github.com/pytorch/pytorch/issues/98211#issuecomment-1496005668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99228
Approved by: https://github.com/kit1980
2023-04-26 04:44:23 +00:00
c11441fda3 Update torch.arange doc. (#99963)
To always exclude `end` without being affected by rounding error, `epsilon` should be subtracted, instead of being added.

Fixes #99853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99963
Approved by: https://github.com/kit1980
2023-04-26 04:18:56 +00:00
08c49eee5e [ONNX] Support aten::atan2 in torchscript exporter (#100040)
Fixes #51334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100040
Approved by: https://github.com/BowenBao
2023-04-26 04:00:47 +00:00
9d99d8879c add missing include on <stdexcept> from Registry.h (#100036)
add missing include on <stdexcept> from Registry.h

Summary: This throws std::runtime_error in the header.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100036
Approved by: https://github.com/Skylion007
2023-04-26 03:59:20 +00:00
0b1b063158 [buckbuild.bzl] Fix dep handling in cross-builds
Differential Revision: D44960349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99826
2023-04-25 20:53:28 -07:00
8fd866c666 Add frame summary to for/while loop backedge log message (#100045)
This PR adds the frame summary to the log message, e.g.:
```
[2023-04-26 00:11:21,035] torch._dynamo.symbolic_convert: [INFO] Skipping frame because there is a graph break in a for/while loop
<FrameSummary file /fsx/users/andgu/work/transformers/src/transformers/models/t5/modeling_t5.py, line 1086 in <resume in forward>>
```
Note that the line cited by the frame summary may not be the for/while loop itself but rather a line inside the for/while loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100045
Approved by: https://github.com/anijain2305
2023-04-26 03:24:24 +00:00
1ded73f909 Remove little endian asserts (#99713)
They block tests test_embedding_bag_2bit_unpack,
test_embedding_bag_4bit_unpack and test_embedding_bag_byte_unpack in test/quantization/core/test_quantized_op.py.

Without these asserts tests start passing on big endian systems.

Fixes #97803

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99713
Approved by: https://github.com/kit1980
2023-04-26 02:08:28 +00:00
c680f2b8ea relax restriction on cond branches calling closed functions (#100013)
As of https://github.com/pytorch/pytorch/pull/99367 we error when cond branches look up closed vars. The suggested fix is to add these closed vars as args to the branches.

However, while this works for tensor vars (and also primitive vars by explicit wrapping), this is impossible to do for function vars. Moreover, function vars are OK because we trace through them. So relaxing this restriction for function vars is a strict win.
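
A small sketch of the now-allowed pattern (the import path reflects where `cond` lived at the time and is an assumption; the function names are illustrative):

```python
import torch
from functorch.experimental.control_flow import cond  # assumed location at the time

def helper(x):
    # A closed-over *function*: traced through, no longer an error.
    return x.cos()

def true_fn(x):
    return helper(x) + 1

def false_fn(x):
    return helper(x) - 1

def branchy(pred, x):
    # Closed-over *tensors* still have to be passed explicitly as operands;
    # closed-over functions like `helper` are fine because we trace into them.
    return cond(pred, true_fn, false_fn, [x])
```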

Differential Revision: [D45287893](https://our.internmc.facebook.com/intern/diff/D45287893/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100013
Approved by: https://github.com/tugsbayasgalan
2023-04-26 01:57:24 +00:00
488e0effe3 Fix test_multiple_devices_randint_cuda (#99775)
The test fails with device mismatch error:
```
Traceback (most recent call last):
  File "/pytorch/torch/testing/_internal/common_utils.py", line 2137, in wrapper
    method(*args, **kwargs)
  File "/pytorch/torch/testing/_internal/common_device_type.py", line 401, in instantiated_test
    result = test(self, **param_kwargs)
  File "/pytorch/torch/testing/_internal/common_device_type.py", line 846, in test_wrapper
    return test(*args, **kwargs)
  File "/pytorch/torch/testing/_internal/common_device_type.py", line 1005, in only_fn
    return fn(slf, *args, **kwargs)
  File "/pytorch/torch/testing/_internal/common_device_type.py", line 1029, in multi_fn
    return fn(slf, devices, *args, **kwargs)
  File "/pytorch/test/test_ops.py", line 148, in test_multiple_devices
    self.assertTrue(result.device == cuda_device)
AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99775
Approved by: https://github.com/ngimel
2023-04-26 01:49:41 +00:00
89baa1a74c [MPS] Add support for linalg.vector_norm (#99811)
Summary of changes:

- Add support for linalg.vector_norm
- Fix zero norm, correct formula is: sum(x != 0)
- Add additional tests in test_mps
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99811
Approved by: https://github.com/kulinseth
2023-04-26 01:34:29 +00:00
539363a873 [inductor] Lowering of rngprims philox_rand (#99289)
An example graph with Dynamic shapes on

`arg0_1` is seed, `arg1_1` is base offset.
~~~
  ===== Forward graph 0 =====
 <eval_with_key>.5 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: f32[s0]):
        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4605, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0] = torch.ops.aten.mul.Tensor(getitem, arg3_1);  getitem = arg3_1 = None

        # File: /scratch/anijain/work/pytorch/test/inductor/test_torchinductor.py:4606, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        add_5: i64[] = torch.ops.aten.add.Tensor(add_4, 3);  add_4 = None
        div: i64[] = torch.ops.aten.div.Tensor_mode(add_5, 4, rounding_mode = 'floor');  add_5 = None
        mul_2: i64[] = torch.ops.aten.mul.Tensor(div, 4);  div = None
        return (mul_1, mul_2)

~~~

Note that the output `mul_2` is basically the total `numel` of the random ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99289
Approved by: https://github.com/jansel
2023-04-26 01:22:41 +00:00
111358de19 Support non-ASCII characters in model file paths (#99453)
Fixes #98918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99453
Approved by: https://github.com/albanD, https://github.com/malfet
2023-04-26 01:15:49 +00:00
efded3f3e9 [inductor] Add cpp_wrapper support for FallbackKernel (#99887)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99887
Approved by: https://github.com/ngimel
2023-04-26 01:03:53 +00:00
d3143d0be6 Skip timm_vision_transformer in Inductor torchbench smoketest (#99766)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99766
Approved by: https://github.com/desertfire
2023-04-26 00:49:36 +00:00
79f8ac14d5 Add pass to normalize torch.ops.fb.equally_split
Differential Revision: D45198465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99941
2023-04-25 17:35:53 -07:00
785676ccb0 [dynamo 3.11] refactor cpython function defs out of eval_frame.c (#99947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99947
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2023-04-26 00:18:12 +00:00
bafa2c4724 Change 'w.r.t.' to 'wrt' in function docstrings to fix doc rendering (#100028)
Fixes #72428 according to decision reached in comments.

I've left other instances of `w.r.t.` intact (e.g. in parameter/return descriptions, in comments, etc.) because there were many, and I didn't want to go out of scope. That being said, I'm happy to change those as well if we'd prefer the consistency!

I've also fixed a typo that I came across while grepping for instances.

Will update with screenshots once docs are built.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100028
Approved by: https://github.com/albanD
2023-04-25 23:53:26 +00:00
676a23f452 [RFC] Allow elastic agent to fail fast (#99051)
Summary: Today, on a segfault in a single trainer, we end up keeping the GPU on all ranks blocked for 5 minutes due to the elastic agent's barrier timeouts

Test Plan: Rely on existing tests to validate. Looking to get some feedback on adding UTs

Differential Revision: D44929488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99051
Approved by: https://github.com/kurman, https://github.com/kiukchung
2023-04-25 23:51:20 +00:00
eddb3a060e Rename master -> main in docs workflow (#100022)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100022
Approved by: https://github.com/janeyx99
2023-04-25 23:33:48 +00:00
1c110652a8 [ONNX] Support aten::tile in torchscript exporter (#99927)
Fixes #99692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99927
Approved by: https://github.com/justinchuby
2023-04-25 22:58:18 +00:00
6bc4651193 [philox_rand] Dynamic shape support (#99290)
Extends the functionalization of rng work to Dynamic shapes. An example of the generated graph looks like this

~~~

[2023-04-24 21:41:37,446] torch._functorch.aot_autograd.__aot_graphs: [INFO] TRACED GRAPH
 ===== Forward graph 1 =====
 <eval_with_key>.7 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: i64[], arg1_1: i64[], arg2_1: Sym(s0), arg3_1: Sym(s1), arg4_1: f32[s0, s1]):
        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:46, code: a = torch.rand_like(x) * x
        add: i64[] = torch.ops.aten.add.Tensor(arg1_1, 0)
        philox_rand = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add, None, device(type='cuda', index=0), torch.float32);  add = None
        getitem: f32[s0, s1] = philox_rand[0]
        getitem_1: i64[] = philox_rand[1];  philox_rand = None
        add_1: i64[] = torch.ops.aten.add.Tensor(getitem_1, 0);  getitem_1 = None
        mul: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem, arg4_1);  getitem = arg4_1 = None

        # File: /scratch/anijain/work/pytorch/test/test_functionalization_of_rng_ops.py:47, code: a = torch.rand_like(x) * a
        add_2: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_1)
        philox_rand_1 = torch.ops.rngprims.philox_rand.default([arg2_1, arg3_1], arg0_1, add_2, None, device(type='cuda', index=0), torch.float32);  arg2_1 = arg3_1 = arg0_1 = add_2 = None
        getitem_2: f32[s0, s1] = philox_rand_1[0]
        getitem_3: i64[] = philox_rand_1[1];  philox_rand_1 = None
        add_3: i64[] = torch.ops.aten.add.Tensor(add_1, getitem_3);  add_1 = getitem_3 = None
        mul_1: f32[s0, s1] = torch.ops.aten.mul.Tensor(getitem_2, mul);  getitem_2 = mul = None

        # No stacktrace found for following nodes
        add_4: i64[] = torch.ops.aten.add.Tensor(arg1_1, add_3);  arg1_1 = add_3 = None
        return (mul_1, add_4)

 ~~~

Each rand op is accompanied by its offset calculation op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99290
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-25 22:40:28 +00:00
dfba65be8b Update Cutlass to v3.1 (#94188)
Now that we are on CUDA 11+ exclusively, we can update Nvidia's Cutlass to the next version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94188
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/malfet
2023-04-25 22:02:42 +00:00
15e1bee269 change torch._dynamo.export(aten_graph=...) to allow pre_autograd tracing (#98031)
pre_autograd tracing is still early, but it should work for basic cases. This PR changes the API a bit for export to expose pre_autograd tracing. Name bikeshedding is welcome, but it looks like:
```
torch._dynamo.export(..., aten_graph="aten_pre_autograd")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98031
Approved by: https://github.com/ezyang
2023-04-25 21:58:14 +00:00
62fad315c1 fix per-dispatchkey-mode caching bug (#98030)
The bug was: if you want to move a mode to the autograd key, we need to use the "functionality" key for it (AutogradFunctionality). But when we do that, we need to clear any PythonDispatcher caches for every op for **every** autograd key (since you could run autograd ops with both CPU and CUDA tensors underneath the mode, both of which may have been cached).

I didn't add a test, since this ends up getting indirectly tested by export in this PR. If someone would prefer a direct test I can add one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98030
Approved by: https://github.com/ezyang
2023-04-25 21:58:14 +00:00
d976df49c5 [dynamo] don't use LazyModuleMixin.cls_to_become if it is None (#99943)
**TL;DR**: This PR fixes handling for lazy modules where `cls_to_become is None`. In those cases, we should leave the type of the lazy module as the old value.

**Details**:
Lazy modules are intended to be initialized at execution; some of them are also supposed to switch to a different type after they have been initialized. However, not all are supposed to switch; see this logic from `nn/modules/lazy.py`

```python
    def _infer_parameters(self, ...):
        ...
        if module.cls_to_become is not None:
            module.__class__ = module.cls_to_become
```

i.e., we should leave the module type as the old value if `module.cls_to_become is None`. This PR updates dynamo's handling to match this behavior.

Test `test_lazy_module_no_cls_to_become` added to `test/dynamo/test_module.py`.

Differential Revision: [D45253698](https://our.internmc.facebook.com/intern/diff/D45253698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99943
Approved by: https://github.com/jansel
2023-04-25 21:34:11 +00:00
9e012fd401 [export] Associate one cond() error case with exportdb. (#99844)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99844
Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri
2023-04-25 21:33:24 +00:00
5c16dfd708 Add half to real param description in torch.complex docs (#99938)
Fixes #89733 according to the issue description
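
For reference, the behavior the parameter description now covers (a minimal example; the exact dtype name printed may vary by version):

```python
import torch

real = torch.tensor([1.0, 2.0], dtype=torch.half)
imag = torch.tensor([3.0, 4.0], dtype=torch.half)
z = torch.complex(real, imag)   # half inputs yield a complex32 tensor
```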

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99938
Approved by: https://github.com/Skylion007
2023-04-25 21:23:16 +00:00
cf21240f67 [MPS] Squeeze last dimensions if possible for 5D (or bigger) reductions (#99856)
Summary of changes:
- Reduction ops optimization - squeeze all dimensions after 4th dim if they are all 1
- Disable type inference only for 1D unary ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99856
Approved by: https://github.com/kulinseth
2023-04-25 21:07:28 +00:00
87a2af6d4a Fix loading data on different encoding (#94503)
Add endianness marker when saving,
and if it doesn't match host endianness when loading data, do a byteswap.

Older data will load correctly only on systems
with the same endianness it was saved on.
New data should load correctly on systems
with any endianness.

Fixes #65300
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94503
Approved by: https://github.com/kurtamohler, https://github.com/ezyang
2023-04-25 21:05:20 +00:00
8cc57593b9 remove redundant trailing semicolons in StorageImpl.h (#97658)
remove redundant trailing semicolons in StorageImpl.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97658
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-04-25 21:04:22 +00:00
808267767c Prevent grad scale from overflowing (#98876)
Fixes #98828 by capping the growth in the kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98876
Approved by: https://github.com/ngimel
2023-04-25 20:59:44 +00:00
ae5e1819a5 stepcurrent (#98035)
* add stepcurrent flag (--sc) based off the stepwise flag that saves the currently running test so that test running can resume from the last successful test after segfaults; it takes in an argument for a key so that different test runs don't overwrite each other
* send SIGINT to the process on timeout so that the XML report can still be made

* add currently unused stepcurrent skip flag (--scs) based off the stepwise skip flag that skips the failing test; was going to use it for the keep-going label but having trouble with CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98035
Approved by: https://github.com/huydhn
2023-04-25 20:56:04 +00:00
3e57d49e5b Unblock fbcode build issues with torch/testing IS_CI issue (#99997)
Prefer to land a better fix, but in the worst case this will unblock us and buy time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99997
Approved by: https://github.com/jansel, https://github.com/bertmaher
2023-04-25 20:45:29 +00:00
3a630d9e3a [stronghold][bc-linter] Use merge to find the changes in the PR (#99958)
Fixes the issue with the PR base detection for bc-lint. See also #98538

The context: to lint a PR for BC-breaking changes we need two commits with the history between them that accurately represents the changes introduced in the PR (and **only** these changes).

---

Previous attempts to achieve this failed due to the following reasons:

1. Use `github.event.pull_request.base.sha` and  `github.event.pull_request.head.sha`.
If the PR's base branch advances, the new commits will be included in the `github.event.pull_request.base.sha`, mixing with the changes introduced by the PR.

2. Find a common ancestor between `github.event.pull_request.base.sha` and  `github.event.pull_request.head.sha`, use it as a base commit.
This approach fails as well if the PR includes merge commits from the newer history of its base branch. Such commits will appear as changes introduced in the PR and thus interfere with BC-linting.

---

Current approach (this PR):

Perform a merge of the  `github.event.pull_request.head.sha` onto the `github.event.pull_request.base.sha`, and use the new commit SHA as the new head.

This approach should always accurately find the changes introduced in the linted PR. The only shortcoming is when the PR cannot be merged onto the HEAD of its base branch. In this case BC-linting is skipped (the linting will be performed once the PR is rebased and conflicts are resolved, which is required anyway before the PR is accepted).

---
### Testing

* [in a separate repo for experiments](https://github.com/izaitsevfb/pr-head-test/pull/2/checks)
* [BC-linter failure (this PR)](https://github.com/pytorch/pytorch/actions/runs/4793434350/jobs/8525891436?pr=99958)
* gh-stack test: [failure](https://github.com/pytorch/pytorch/pull/100004), [success ](https://github.com/pytorch/pytorch/pull/100003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99958
Approved by: https://github.com/osalpekar
2023-04-25 20:35:46 +00:00
0ebd3a78f4 Make sdp_utils more c++ style guide compliant (#99948)
Previously all of sdp_utils was implemented using inline functions in a header. This is not very conformant to the style guide, so it was split into a .cpp and a header file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99948
Approved by: https://github.com/Skylion007
2023-04-25 20:07:57 +00:00
865d30a3dd [ONNX] Add new ops and enable the MNIST test (#99284)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at d490881</samp>

### Summary
📦🚀🗺️

<!--
1.  📦 for the package name change and update.
2.  🚀 for the test case enablement and feature support.
3.  🗺️ for the operator mappings addition.
-->
This pull request adds support for converting more PyTorch FX operators to ONNX using the `onnxscript` package. It updates the package installation script and the operator mappings in `function_dispatcher.py`. It also enables a test case for the `max_pool2d` operator.

> _We unleash the fury of the ONNX script_
> _We break the chains of the skipped test_
> _We map the functions of the ATen torch_
> _We forge the metal of the ONNX forge_

### Walkthrough
*  Fix and update the installation of the `onnxscript` package ([link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-ddc2d714e91795d6bfc059caca91ab22d0740909f744d4c22f6642a71b09e8ecL29-R30))
*  Add more mappings from PyTorch ATen operators to ONNX operators in the `_ATENLIB_FUNCTIONS` dictionary in the `function_dispatcher.py` module ([link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL44-R91), [link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL78-R130), [link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL103-R193), [link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL131-R210), [link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-549890bc593f917c4e62c4c43077340e4774c0abdf31657ced8450fdfbed3b3eL144-R223))
*  Enable the `test_max_pool2d` method in the `TestFxToOnnx` class in the `test_fx_to_onnx.py` file, since the `max_pool2d` operator is now supported by the `onnxscript` package ([link](https://github.com/pytorch/pytorch/pull/99284/files?diff=unified&w=0#diff-9dde99e3ce414c7ca10dd319e22f74bcaaddaccd3ab6560c24777baf20616d27L71-L73))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99284
Approved by: https://github.com/BowenBao
2023-04-25 19:45:18 +00:00
fc6f2f6e4e [spmd] simplify data parallel tests (#99901)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99901
Approved by: https://github.com/awgu, https://github.com/mrshenli
2023-04-25 19:31:00 +00:00
0901b41a5e [spmd] Add a few more loss ops to the reduction op list (#99900)
This PR adds a few more loss ops to the reduction op list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99900
Approved by: https://github.com/mrshenli
2023-04-25 19:31:00 +00:00
932ed333f7 [spmd] expose input_batch_dim to DataParallelMode (#99899)
This PR exposes the input batch dim to the DataParallelMode so that
we can have explicit control over which input dim is the batch dim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99899
Approved by: https://github.com/awgu, https://github.com/mrshenli
2023-04-25 19:30:58 +00:00
c6949db481 [spmd] enable fully_shard fused_adam test (#99898)
This PR enables fully_shard fused_adam tests with some additional tweaks
about how to handle scalar tensors. Now we treat scalar tensors as if
they were just scalar values; we don't distribute them, as there's no need to
shard a scalar tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99898
Approved by: https://github.com/mrshenli
2023-04-25 19:30:55 +00:00
ad882c5210 [spmd] Use TupleStrategy and enable replicate fused_adam (#99374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99374
Approved by: https://github.com/mrshenli
2023-04-25 19:30:53 +00:00
f274c4ecf6 Don't change filelock log level by default (#99991)
Fixes https://github.com/pytorch/pytorch/issues/99730

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99991
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-04-25 19:02:35 +00:00
650ba57236 Remove Anjali from CODEOWNERS (#99955)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99955
Approved by: https://github.com/albanD
2023-04-25 18:57:46 +00:00
cyy
dbc7e919b8 add Wmissing-prototypes to clang-tidy (#96805)
This PR introduces the **-Wmissing-prototypes** warning in clang-tidy to prevent further coding errors such as the one fixed by PR #96714.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at fd2cf2a</samp>

This pull request makes several internal functions static to improve performance and avoid name clashes. It also fixes some typos, formatting, and missing includes in various files. It adds a new .clang-tidy check to warn about missing prototypes for non-static functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96805
Approved by: https://github.com/malfet, https://github.com/albanD
2023-04-25 18:20:36 +00:00
39ff87c6a4 [ROCM] Extend try-catch mechanism for ROCM detection (#99980)
ROCM path detection currently relies on `hipconfig`. On some systems, when calling `hipconfig` through `subprocess`, Python raises a `NotADirectoryError` that isn't caught at the moment. This commit adds `NotADirectoryError` to the exceptions caught when calling `hipconfig`.
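
Roughly, the guarded call looks like the sketch below (simplified; the real detection code and the exact `hipconfig` invocation may differ):

```python
import subprocess

def _rocm_home_from_hipconfig():
    try:
        return subprocess.check_output(["hipconfig", "--rocmpath"], text=True).strip()
    except (FileNotFoundError, PermissionError, NotADirectoryError,
            subprocess.CalledProcessError):
        # NotADirectoryError is the newly handled case from this commit.
        return None
```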

Fixes #98629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99980
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-04-25 18:07:29 +00:00
df3455b716 [reland][quant][pt2e][refactor] Cleanup the logic for deciding whether to insert observer/fq or not (#99220) (#99767)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99220

Previously we had two places where we needed to decide whether to insert an observer or fake quantizer or not:
(1) the input arguments of a node and (2) the output of a node, and right now we have separate code for each.
In this PR, the logic is unified in the `_needs_obs_or_fq` helper function, which takes the target_dtype and is_dynamic from the previous output
and the target_dtype and is_dynamic for the current Tensor we are looking at.

let's use an example for conv node:
```
conv = convolution(input, weight, bias, ...)
```

let's say we have `input_node` object for argument `input`, and `conv_node` for `conv` node in the graph

(1) input arguments, e.g. `input`
the target_dtype/is_dynamic from the previous output comes from the node that produces `input`; we get this from
input_node.meta["target_dtype_info"]["output_act_obs_or_fq"]

the target_dtype/is_dynamic for the current argument `input` comes from conv_node.meta["target_dtype_info"]["input_act_obs_or_fq"];
similarly, for weight it comes from conv_node.meta["target_dtype_info"]["weight_obs_or_fq"], etc.

(2) output of the conv node
the target_dtype/is_dynamic from the previous output will be the floating point output from the fp32 convolution operator, so it
is hardcoded to be (torch.float, False). Technically we should get this from node.meta["val"], but since the
current code base is shared by fx graph mode quantization and PyTorch 2.0 export quantization, we cannot do that; we can revisit
after we decide to deprecate fx graph mode quantization.

the target_dtype/is_dynamic for the current output comes from conv_node.meta["target_dtype_info"]["output_act_obs_or_fq"]

There is one caveat here about dynamic quantization; it is explained in the comment, so I won't repeat it here.
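To make the comparison concrete, a simplified sketch of the decision (hypothetical signature; the real helper works with observer/fake-quant constructors and also handles the dynamic-quantization caveat, which is omitted here):

```python
import torch

def _needs_obs_or_fq_sketch(prev_dtype, prev_is_dynamic, cur_dtype, cur_is_dynamic) -> bool:
    # No observer/fake-quant is needed if the current tensor is supposed to
    # stay plain fp32 and static.
    if cur_dtype == torch.float and not cur_is_dynamic:
        return False
    # Otherwise one is needed whenever the dtype or dynamic-ness changes
    # between the previous output and the current tensor.
    return (prev_dtype, prev_is_dynamic) != (cur_dtype, cur_is_dynamic)
```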

Note: also fixed some places in `_get_arg_target_dtype_as_input_to_node` and `_get_arg_target_is_dynamic_as_input_to_node` to make sure that "not specified" is treated the same as specifying an fp32 placeholder observer.

Next: we can merge the two functions that get the target dtype and is_dynamic to reduce code duplication.

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels

Imported from OSS

Differential Revision: D45198323

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99767
Approved by: https://github.com/kimishpatel
2023-04-25 16:53:02 +00:00
14c3eb7fb6 [Testing] flip switch, remove slow path assertions (#99101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99101
Approved by: https://github.com/bdhirsh
2023-04-25 15:38:28 +00:00
e2a3817dfd [BE] Enable C419 rule for any all shortcircuiting (#99890)
Apparently https://github.com/pytorch/pytorch/pull/78142 made torch.JIT allow for simple generator expressions which allows us to enable rules that replace unnecessary list comprehensions with generators in any/all. This was originally part of #99280 but I split it off into this PR so that it can be easily reverted should anything break.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99890
Approved by: https://github.com/justinchuby, https://github.com/kit1980, https://github.com/malfet
2023-04-25 15:02:13 +00:00
e43918b93a [inductor] Fix AOTInductor (#99203)
Summary: Fix the broken AOTInductor flow and add a smoketest on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99203
Approved by: https://github.com/jansel
2023-04-25 14:42:12 +00:00
3afa60bf0f Get OutOfMemoryError to inherit from RuntimeError (#99786)
Get OutOfMemoryError to inherit from RuntimeError so that type checkers do not complain.
Also added a "defined in" comment to match the other definitions.

Fixes #95936
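For illustration, a quick sketch of the kind of handler this makes consistent (the allocation below is just a placeholder):

```python
import torch

# OutOfMemoryError now derives from RuntimeError:
assert issubclass(torch.cuda.OutOfMemoryError, RuntimeError)

def try_allocate(shape):
    try:
        return torch.empty(shape, device="cuda")
    except RuntimeError as e:  # also catches torch.cuda.OutOfMemoryError
        print(f"allocation failed: {e}")
        return None
```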

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99786
Approved by: https://github.com/Skylion007
2023-04-25 14:31:49 +00:00
e7157bd048 [inductor] Fix shape padding (#99917)
Summary:
We were using the "percentiles" form of triton.testing.do_bench, which
returns a list of (20th, 50th, 80th) percentile timings; I don't think we
care about that much detail, so let's just use the mean.  I also took the
opportunity to clean up the redundant setting of rep, warmup, and fast_flush.

Test Plan:
```
TORCHBENCH_ATOL=1e-3 TORCHBENCH_RTOL=1e-3 TORCHINDUCTOR_PERMUTE_FUSION=1 TORCHINDUCTOR_SHAPE_PADDING=1 buck2 run mode/opt mode/inplace pytorch/benchmark:run -- ads_dhen_5x --part over --bs 1024 -d cuda -t train --torchdynamo inductor
```

Reviewed By: jiawenliu64

Differential Revision: D45241751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99917
Approved by: https://github.com/jiawenliu64
2023-04-25 13:40:29 +00:00
cc01568efd [pt2] Register meta func to randperm.default (#99593)
Summary:
Looks like we're missing the meta func for randperm.default. I get complaints like this when I compile randperm with dynamic shapes, which I think is because it gets into the real implementation rather than the meta func.

```
RuntimeError: expected int but got s0
Exception raised from expect_int at fbcode/caffe2/c10/core/SymInt.h:128 (most recent call first):
# 0  c10::get_backtrace[abi:cxx11](unsigned long, unsigned long, bool)
# 1  std::_Function_handler<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > (), c10::(anonymous namespace)::GetFetchStackTrace()::$_1>::_M_invoke(std::_Any_data const&)
# 2  c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
# 3  c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
# 4  c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__randperm>, at::Tensor, c10::guts::typelist::typelist<c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)
# 5  at::Tensor c10::Dispatcher::redispatch<at::Tensor, c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> >(c10::TypedOperatorHandle<at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)> const&, c10::DispatchKeySet, c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>) const
# 6  at::_ops::randperm::redispatch(c10::DispatchKeySet, c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)
# 7  c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::randperm>, at::Tensor, c10::guts::typelist::typelist<c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>)
# 8  c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>), &at::(anonymous namespace)::randperm>, at::Tensor, c10::guts::typelist::typelist<c10::SymInt, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool> > >, false>::call(c10::OperatorKernel*, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >*)

```
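For reference, a meta kernel only needs to produce output metadata, so the symbolic size never has to be turned into a concrete int. A minimal sketch of the shape of such a function (not the actual registration added in this PR):

```python
import torch

def meta_randperm(n, *, dtype=torch.int64, layout=None, device=None, pin_memory=None):
    # Only shape/dtype matter on the meta device; n can stay symbolic (SymInt)
    # because nothing here needs a concrete int.
    return torch.empty(n, dtype=dtype if dtype is not None else torch.int64, device="meta")
```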

Differential Revision: D45137851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99593
Approved by: https://github.com/ezyang
2023-04-25 08:55:43 +00:00
8a0badfff1 [ROCM] Do not send "debug"=True down to triton.compile (#99756)
ROCm's version of triton does not currently support tl.device_assert.

This operator, among others, is effectively a no-op unless "debug"=True is passed to the triton.compile function.

Until we have full support for tl.device_assert, avoid enabling the debug flag in triton.compile, so we do not have to find every possible location where tl.device_assert may be used.

Fixes #99725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99756
Approved by: https://github.com/lezcano
2023-04-25 08:08:05 +00:00
9a69634b28 Skip some failing dynamic shape models on periodic (#99895)
After some recent changes, these tests are failing in periodic trunk.  So let's move them to unstable while waiting for the team to root cause the issue https://github.com/pytorch/pytorch/issues/99893.  Note that a forward fix can use `ciflow/unstable` to run those unstable jobs to confirm that they are fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99895
Approved by: https://github.com/malfet
2023-04-25 07:05:08 +00:00
df1ff0925e inductor: fix issue for conv+binary issue for binary scalar path (#99860)
Fix https://github.com/pytorch/pytorch/issues/99838.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99860
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-25 06:25:14 +00:00
ed3957795c inductor: add fallback test case for hardtanh and leakyrelu fusion pattern (#99859)
Fix https://github.com/pytorch/pytorch/issues/99841.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99859
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-04-25 06:17:11 +00:00
e2cb6bcc91 Fix typos and clarify text in tags.yaml (#99954)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99954
Approved by: https://github.com/eellison
2023-04-25 05:19:12 +00:00
4cf654625c [ONNX] Bump onnx-script version with imported module renaming (#99926)
Avoid potential blocking after https://github.com/microsoft/onnxscript/pull/659
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99926
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-04-25 05:17:18 +00:00
04e8df4dd7 Return full accuracy status for printing, not abbreviated version (#99894)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99894
Approved by: https://github.com/jansel
2023-04-25 05:17:10 +00:00
bfa63bf45f div16 changes for dyn shapes (#99930)
Lint, fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99930
Approved by: https://github.com/ngimel
2023-04-25 04:56:39 +00:00
e5c9a0fcf5 [dynamo] avoid graph break on repeat_interleave.self_int (#99528)
Address convit_base failure: https://github.com/pytorch/torchdynamo/issues/1886 mentioned in https://github.com/pytorch/pytorch/issues/93777
Also for models like EleutherAI/gpt-j-6B.
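A minimal illustration of the pattern that previously caused a graph break (a toy repro, not the actual benchmark models; whether it failed in exactly this form is an assumption):

```python
import torch

@torch.compile(fullgraph=True)
def repeat_rows(x):
    return x.repeat_interleave(2, dim=0)  # hits the repeat_interleave.self_int overload

print(repeat_rows(torch.arange(4)).shape)  # torch.Size([8])
```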

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99528
Approved by: https://github.com/ezyang
2023-04-25 04:47:39 +00:00
ecd2c71871 Implement the get_device method in the storage base class. (#99818)
Fixes #ISSUE_NUMBER
Like #99817, I found that a method is missing;
I'm not sure if it was intentionally removed. But the function is still called on the Python side, and it seems very simple to implement.
So I made the change on the Python side.
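Roughly, the method just reports the storage's device index; a sketch of that behaviour (illustrative only, the actual signature and CPU semantics may differ):

```python
import torch

def get_device(storage):
    # Mirror Tensor.get_device() by reporting the storage's device index.
    return storage.device.index

t = torch.zeros(4)
print(get_device(t.storage()))  # None on CPU (no index); a GPU ordinal on CUDA
```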

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99818
Approved by: https://github.com/ezyang
2023-04-25 03:45:39 +00:00
e51453298b [ONNX] Improve diagnostics performance (#99936)
Summary
- Do not call `fx_graph_module.print_readable` when recording `fx.GraphModule` function argument diagnostics.
- Cache `inspect.getsourcelines` results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99936
Approved by: https://github.com/justinchuby, https://github.com/abock
2023-04-25 03:31:55 +00:00
466adab7c4 Add fsspec to PT setup.py (#99768)
Follow up for https://github.com/pytorch/pytorch/pull/96532. Including this in setup.py so the package will be available for CI.

Fsspec package size:
```
du  -h /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
264K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/__pycache__
58K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations/__pycache__
377K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations
1017K   /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec
96K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/EGG-INFO
1.2M    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99768
Approved by: https://github.com/kit1980
2023-04-25 01:34:08 +00:00
4be0aa1382 Allow persistent reductions in dynamic shapes if last numel is static (#99789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99789
Approved by: https://github.com/ezyang
2023-04-25 01:15:35 +00:00
cd61707167 yolov3 dynamic training accuracy is fixed (#99896)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99896
Approved by: https://github.com/albanD
2023-04-25 01:15:24 +00:00
41069f2faa inductor: align inductor behavior with eager mode for split_with_sizes (#99702)
Fix https://github.com/pytorch/pytorch/issues/99686. In eager mode, if the given sizes do not meet the requirements, an error is reported, but inductor can still run. I think we need to align inductor's behavior with eager mode; after this PR the behavior will be:

```
Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1267, in run_node
    return node.target(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/functional.py", line 189, in split
    return tensor.split(split_size_or_sections, dim)
  File "/home/xiaobing/pytorch-offical/torch/_tensor.py", line 804, in split
    return torch._VF.split_with_sizes(self, split_size, dim)
  File "/home/xiaobing/pytorch-offical/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1095, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_subclasses/fake_tensor.py", line 1259, in dispatch
    return decomposition_table[func](*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_decomp/decompositions.py", line 1102, in split_with_sizes
    raise ValueError(
ValueError: Split sizes don't add up to the tensor's size in the given dimension

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1215, in get_fake_value
    return wrap_fake_exception(
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
    return fn()
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1216, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/xiaobing/pytorch-offical/torch/_dynamo/utils.py", line 1279, in run_node
    raise RuntimeError(
RuntimeError: Failed running call_function <function split at 0x7f45b8402ee0>(*(FakeTensor(..., size=(1, 5)), [2, 1, 1]), **{'dim': 1}):
Split sizes don't add up to the tensor's size in the given dimension
(scroll up for backtrace)

The above exception was the direct cause of the following exception:
```
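For reference, the mismatched split that produces this error looks like the following (split sizes summing to 4 against a dimension of length 5; a minimal sketch, not the repro from the issue):

```python
import torch

def f(x):
    return torch.split(x, [2, 1, 1], dim=1)  # 2 + 1 + 1 != x.size(1)

x = torch.ones(1, 5)
try:
    torch.compile(f)(x)
except Exception as e:  # eager and (after this PR) inductor both reject it
    print(type(e).__name__, e)
```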

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99702
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/jansel
2023-04-25 01:13:52 +00:00
96ceae3a7f Use memoized only mode for guard size/stride extraction (#99742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99742
Approved by: https://github.com/ezyang
2023-04-25 01:05:42 +00:00
0b545bc667 Stop marking sequence length as dynamic (#99889)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99889
Approved by: https://github.com/voznesenskym, https://github.com/huydhn
2023-04-25 01:04:16 +00:00
42921fc801 [torchgen] accept scalars for unary SymInt arrays (#99921)
Fixes https://github.com/pytorch/pytorch/issues/99907
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99921
Approved by: https://github.com/malfet
2023-04-25 00:49:15 +00:00
1dbecbf913 make ATen/native/cuda/NaiveConvolutionTranspose3d.cu data_ptr-correct (#99347)
make ATen/native/cuda/NaiveConvolutionTranspose3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99347
Approved by: https://github.com/ezyang
2023-04-25 00:45:01 +00:00
4ca44d32d3 make ATen/native/cuda/SortStable.cu (#99340)
make ATen/native/cuda/SortStable.cu

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99340
Approved by: https://github.com/ezyang
2023-04-25 00:44:55 +00:00
1b30f588e6 make ATen/native/cuda/RreluWithNoise.cu (#99341)
make ATen/native/cuda/RreluWithNoise.cu

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99341
Approved by: https://github.com/ezyang
2023-04-25 00:44:45 +00:00
fbb0ff10a4 [pt2] add SymInt support for trapezoid ops (#99281)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99281
Approved by: https://github.com/ezyang
2023-04-25 00:44:25 +00:00
36e1ae6778 De-select odd numbered heads from nn.MHA fastpath (#99672)
Summary:
https://github.com/pytorch/pytorch/issues/97128

* Add test for mha num_heads %2 != 0
* Fix test
* Add test for bias false
* show test passes

Test Plan: sandcastle

Differential Revision: D45161767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99672
Approved by: https://github.com/ngimel
2023-04-25 00:27:18 +00:00
3de7fd461a [FSDP][Reland] Include duplicate parameters and modules when calling named_parameters and named_modules (#99448)
The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules).

The previous PR is reverted due to some modules overwriting the signature of `named_parameters()`. This new PR adds a workaround for the case.
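For illustration, the difference the flag makes on a module with tied weights (a toy example, not FSDP code):

```python
import torch.nn as nn

class Tied(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4)
        self.b = self.a  # shared submodule -> shared parameters

m = Tied()
print(len(list(m.named_parameters())))                        # 2: duplicates removed
print(len(list(m.named_parameters(remove_duplicate=False))))  # 4: sharing is visible
```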

Differential Revision: [D45065973](https://our.internmc.facebook.com/intern/diff/D45065973/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99448
Approved by: https://github.com/zhaojuanmao
2023-04-25 00:27:07 +00:00
0eb59ad093 Change export tracing_mode default to symbolic (#99877)
Differential Revision: [D45231039](https://our.internmc.facebook.com/intern/diff/D45231039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99877
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2023-04-25 00:12:12 +00:00
73f7459a90 Do not assume static by default when exporting (#99554) (#99876)
Fixes https://github.com/pytorch/pytorch/issues/99360

Differential Revision: [D45230857](https://our.internmc.facebook.com/intern/diff/D45230857/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99876
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2023-04-24 23:48:39 +00:00
08a8a37ffe [FSDP] Set NCCL_DESYNC_DEBUG=0 for FSDP unit tests (#99916)
This should fix https://github.com/pytorch/pytorch/issues/99011.

With `NCCL_DESYNC_DEBUG=0`, we can run 100 iterations of
```
CUDA_LAUNCH_BLOCKING=1 NCCL_DESYNC_DEBUG=1 CUDA_VISIBLE_DEVICES=0,7 numactl -C 2 python test/distributed/fsdp/test_fsdp_core.py -v -k test_transformer_no_grad --repeat 100 2>&1 | tee out
```
without erroring, whereas with `NCCL_DESYNC_DEBUG=1`, we can repro the error with high failure rate (usually <10 iterations).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99916
Approved by: https://github.com/zhaojuanmao
2023-04-24 23:20:45 +00:00
855f611baf [spmd] skip gradient copying for fused adam (#99489)
Gradients do not need to be copied back as they are not useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99489
Approved by: https://github.com/mrshenli
2023-04-24 22:50:02 +00:00
ca24a96216 minor fix to fused adam meta registration (#99436)
This PR fixes the registration by adding `max_exp_avg_sqs` to the
output shape list too, and fixes some type check issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99436
Approved by: https://github.com/mrshenli
2023-04-24 22:50:02 +00:00
ff7d5b62d4 Improve ProxyTensor tensor_tree list/tuple handling (#99897)
This PR improves the list/tuple handling by merging the logic into
`wrap_with_proxy` directly, and calling set_meta when we find the current
proxy is an fx.Proxy. This also solves the problem that, even though `fused_adam`
has `val`, some corresponding `getitem` calls that follow `fused_adam` don't have `val`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99897
Approved by: https://github.com/ezyang
2023-04-24 22:50:02 +00:00
78c2e3374d [fx] Remove replace_literals_with_placeholders (#99728)
Summary:
SubraphMatcher contains an ignore_literals flag which we can turn on
instead.

Test Plan: CI

Differential Revision: D45168383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99728
Approved by: https://github.com/cccclai
2023-04-24 22:33:36 +00:00
862d658059 [inductor][non determinism] Disable autotuning when deterministic mode is ON (#99851)
This removes a source of non-determinism seen in `sebotnet33ts_256`. However, we should see if we can reduce non-determinism in autotuning in general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99851
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/desertfire
2023-04-24 22:31:22 +00:00
7398b5650d [Lint] Fix wrong docstring for dcp save_state_dict() (#99778)
``no_dist=True`` means not saving in SPMD style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99778
Approved by: https://github.com/H-Huang
2023-04-24 22:18:42 +00:00
33fe2dbb23 Fix a minor bug about method generation. (#99704)
Fixes #ISSUE_NUMBER

When creating a torch.device object, like `x = torch.device("foo")`, the device index is None.

So in this scenario, we need to get the current device index again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99704
Approved by: https://github.com/albanD
2023-04-24 22:18:18 +00:00
baf092b82d Update pt2-bug-report.yml (#99928)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 9691a66</samp>

Update the `pt2-bug-report.yml` template to use `curl` instead of `wget`, `main` instead of `master`, and `python3` instead of `python`. These changes improve the portability and reliability of the bug report process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99928
Approved by: https://github.com/kit1980, https://github.com/msaroufim
2023-04-24 21:57:28 +00:00
3a09aa5977 [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-24 21:27:26 +00:00
3dcc7b396c [easy] iterate dict with sorted keys for accuracy checking (#99793)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99793
Approved by: https://github.com/jansel
2023-04-24 21:26:35 +00:00
2cea2edc27 [easy] Fix upload test stats after master -> main switch (#99924)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99924
Approved by: https://github.com/huydhn
2023-04-24 21:19:09 +00:00
367d3afd7c Update MacOS deployment target to 11.0 (#99857)
macOS 10.9 (Mavericks) was released a decade ago; update the deployment target to Big Sur, which was released in 2020. But keep the platform name as 10_9, since `pip` treats the platform as the one CPython was built on, not the one it runs on. Delete the duplicate `compile_x86_64` function from `macos_build.sh` and specify the platform name there.

Should fix MacOS x86 periodic build failures.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ee4d5a8</samp>

> _`macosx_10_9` wheel_
> _Aligns with PyTorch support_
> _Winter of updates_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99857
Approved by: https://github.com/huydhn, https://github.com/atalman
2023-04-24 21:15:00 +00:00
4c9d660733 fix gather issue when index is shape of n by 1 (#99709)
Fix https://github.com/pytorch/pytorch/issues/99595

When the index has shape {N, 1}, it can also have strides of {1, 0}, which is the same as an expanded tensor (e.g. shape {5, 5} with strides {1, 0}), leading to wrong output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99709
Approved by: https://github.com/XiaobingSuper, https://github.com/ezyang
2023-04-24 20:55:46 +00:00
e9e5ffe83e Re-enable dynamic shapes test in dynamo benchmark (#99816)
Set `torch._dynamo.config.assume_static_by_default = False` for dynamic shapes flag enabled

Fixes #99815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99816
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-24 20:34:52 +00:00
d881b2978c Make autocast cache and buffer stealing aware of cudagraph static output tensors (#99368)
In this stack of PRs we add caching of output tensors for cudagraph trees after we've done the initial recording. On initial recording we do not cache tensor outputs because this would prevent memory from being reclaimed. On subsequent executions we do cache them to avoid overhead. However, because there is an extra reference around, this caused divergent recording & execution behavior in both autocast caching and autograd gradient stealing. Divergent recording & execution would keep on re-recording and eventually stabilize, but it's not what you want to see happen.

This pr makes the autocast cache and buffer stealing aware of the cudagraph static output tensors.

I will add this to the other cudagraph impl in another pr.

Not sure if this should be in autograd or in autocast since it affects both.. Or somewhere else

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99368
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-04-24 20:23:12 +00:00
3009c42e7d [CI Testing] Re-enable timm_efficientdet training (#99787)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99787
Approved by: https://github.com/desertfire
2023-04-24 20:05:15 +00:00
a1633b1776 Add support for call_method patterns (#99782)
Summary: This adds support for CallMethod patterns in pattern_matcher. It also extends split_cat transforms to normalize tensor.split()-style nodes.

Test Plan: Unit tests (fb + OSS)

Differential Revision: D45195548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99782
Approved by: https://github.com/jansel
2023-04-24 19:26:26 +00:00
41280a0791 Don't detach to create parameters in MetaConverter (#99618)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99618
Approved by: https://github.com/albanD
2023-04-24 19:01:26 +00:00
39590d06c5 Make new_subgroups avaliable for non-cuda depend backend (#99706)
`new_subgroups` allows for the easy creation of sub-communication groups, but it currently requires CUDA availability. For communication that does not rely on CUDA, such as the CPU-based gloo or custom communication backends, I would still like to be able to use it, for example with CPU-based gloo (which is also the case when using a custom backend):
```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def gloo_process(rank_id, world_size, group_size, mp_lock):
    assert not torch.cuda.is_available()
    def lock_print(*args, **kwargs):
        with mp_lock:
            print(*args, **kwargs, flush=True)

    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('gloo', rank=rank_id, world_size=world_size)

    subgroup, _ = dist.new_subgroups(group_size)
    subgroup_ranks = list(range(subgroup.rank() * group_size, (subgroup.rank() + 1) * group_size))
    lock_print(f"Rank {rank_id} initialized in subgroup_{subgroup.rank()}: {subgroup_ranks}")

    tensor = torch.Tensor([rank_id + 1])
    subgroup.broadcast(tensor, root=0)

    lock_print(f"After broadcast, rank {rank_id} in subgroup_{subgroup.rank()}:{subgroup_ranks} got {tensor}")

if __name__ == "__main__":
    world_size = 4
    group_size = 2
    processes = []
    mp.set_start_method("spawn")
    mp_lock = mp.Lock()
    for rank in range(world_size):
        p = mp.Process(target=gloo_process, args=(rank, world_size, group_size, mp_lock))
        p.start()
        processes.append(p)

    for p in processes:
        p.join()
```

```bash
Rank 0 assigned to subgroup_0: [0, 1]
Rank 1 assigned to subgroup_1: [2, 3]
Rank 2 assigned to subgroup_0: [0, 1]
Rank 3 assigned to subgroup_1: [2, 3]
After broadcast, rank 2 in subgroup_0:[0, 1] got tensor([3.])
After broadcast, rank 3 in subgroup_1:[2, 3] got tensor([3.])
After broadcast, rank 1 in subgroup_1:[2, 3] got tensor([1.])
After broadcast, rank 0 in subgroup_0:[0, 1] got tensor([1.])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99706
Approved by: https://github.com/kumpera
2023-04-24 18:22:59 +00:00
f0e28b1cb9 Adding the maintainers approved in 2023Q1 Core Maintainers meeting (#98520)
Added Nikita to Core Maintainers
Merged MKLDNN with CPU Performance
Renamed CUDA to GPU Performance
Added Jiong to Compiler and CPU Performance
Added Xiaobing to CPU Performance
Marking Vitaly and Jian Hui as Emeritus
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98520
Approved by: https://github.com/ezyang, https://github.com/soumith, https://github.com/dzhulgakov
2023-04-24 17:58:18 +00:00
7d2a18da0b Enable ruff in lintrunner (#99785)
### This change

- Implements the ruff linter in pytorch lintrunner. It is adapted from https://github.com/justinchuby/lintrunner-adapters/blob/main/lintrunner_adapters/adapters/ruff_linter.py. It does **both linting and fixing**. 🔧
- Migrated all flake8 configs to the ruff config and enabled it for the repo. 
- **`ruff` lints the whole repo in under 2s** 🤯

Fixes https://github.com/pytorch/pytorch/issues/94737 Replaces #99280

@huydhn @Skylion007

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 6b982dd</samp>

### Summary
🧹🛠️🎨

<!--
1.  🧹 This emoji represents cleaning or tidying up, which is what `ruff` does by formatting and linting the code. It also suggests improving the code quality and removing unnecessary or redundant code.
2.  🛠️ This emoji represents tools or fixing, which is what `ruff` is as a code formatter and linter. It also suggests enhancing the code functionality and performance, and resolving potential issues or bugs.
3.  🎨 This emoji represents art or creativity, which is what `ruff` allows by providing a consistent and configurable style for the code. It also suggests adding some flair or personality to the code, and making it more readable and enjoyable.
-->
Add `[tool.ruff]` section to `pyproject.toml` to configure `ruff` code formatter and linter. This change aims to improve code quality and consistency with a single tool.

> _`ruff` cleans the code_
> _like a spring breeze in the fields_
> _`pyproject.toml`_

### Walkthrough
*  Configure `ruff` code formatter and linter for the whole project ([link](https://github.com/pytorch/pytorch/pull/99785/files?diff=unified&w=0#diff-50c86b7ed8ac2cf95bd48334961bf0530cdc77b5a56f852c5c61b89d735fd711R22-R79))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99785
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-04-24 16:18:44 +00:00
dcd686f478 [MPS] Add PSO caching for advanced indexing kernels (#99855)
Use bindless Argument Buffers (unbounded arrays) for advanced indexing kernels - this allows caching of the PSOs, since we no longer have to query the main metal function for the AB size (it is now filled directly on the CPU).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99855
Approved by: https://github.com/kulinseth
2023-04-24 15:41:47 +00:00
09b189edc3 Improve the precision of abs() and sign() for large values (#99550)
@ev-br found in
https://github.com/Quansight-Labs/numpy_pytorch_interop/pull/117#issuecomment-1514959633
that the precision of `abs()` for large values in the vectorised case is less-than-good.
This PR fixes this issue. While doing that, we are able to comment out a
few tests on extremal values.

Fixes https://github.com/pytorch/pytorch/issues/53958 https://github.com/pytorch/pytorch/issues/48486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99550
Approved by: https://github.com/ngimel, https://github.com/peterbell10
2023-04-24 14:32:56 +00:00
5ee5afb82c Update channel shuffle to return alias instead of self as-is (#99745)
Partially addresses https://github.com/pytorch/pytorch/issues/99655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99745
Approved by: https://github.com/albanD
2023-04-24 14:02:14 +00:00
ab0a8215bb [xla hash update] update the pinned xla hash (#99863)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99863
Approved by: https://github.com/pytorchbot
2023-04-24 10:18:27 +00:00
466877b692 Nicer logs for dynamic shapes (#99277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99277
Approved by: https://github.com/ezyang
2023-04-24 10:08:05 +00:00
d0886f686e Revert "Do not assume static by default when exporting (#99554)"
This reverts commit d3bb762f1edc770879b7ba51019b02455109349b.

Reverted https://github.com/pytorch/pytorch/pull/99554 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-24 08:27:56 +00:00
c83e1f517d Revert "Delete tracing_mode argument to export (#99555)"
This reverts commit e9786149ab71874fad478109de173af6996f7eec.

Reverted https://github.com/pytorch/pytorch/pull/99555 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-24 08:21:41 +00:00
1e8cf6ad7f Add documentation for torch._logging.set_logs (#99219)
Part of #98871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99219
Approved by: https://github.com/mlazos, https://github.com/lezcano
2023-04-24 08:06:57 +00:00
6e3cdcad08 Fix flake8 lint errors - part 2 - manual fixes (#99799)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 8aef78f</samp>

### Summary
📝🚀🛠️

<!--
1.  📝 for modifying the logging format and style
2.  🚀 for improving performance and avoiding unnecessary string creation
3.  🛠️ for fixing flake8 issues
-->
This pull request updates some logging calls to use old-style string formatting with `%s` placeholders instead of f-strings in `torch/_dynamo/logging.py`, `torch/_functorch/compilers.py`, and `torch/fx/passes/pass_manager.py` as part of a logging standardization effort. It also adds a `# noqa: F404` comment to the `import __future__` statement in `torch/overrides.py` to fix a flake8 warning.

> _`log` uses old style_
> _formatting strings with `%s`_
> _logging is faster_

### Walkthrough
*  Standardize logging format and style to use old-style string formatting with `%s` placeholders instead of f-string syntax for performance and consistency ([link](https://github.com/pytorch/pytorch/pull/99799/files?diff=unified&w=0#diff-18807f7fd187b8bc8e69e93722566195b36d5bf269099b415a6f90b552228d6bL55-R55), [link](https://github.com/pytorch/pytorch/pull/99799/files?diff=unified&w=0#diff-fae8a66564055743ec031edb87eb22edeebf7fdebef9d21660d5e6a6252e5222L370-R373), [link](https://github.com/pytorch/pytorch/pull/99799/files?diff=unified&w=0#diff-5f3e37ded032f24e247dcf4a3be4b73ea0cf21382e342631742e5a04550202e1L72-R72))
*  Suppress flake8 warning for `import __future__` statement in `torch/overrides.py` with `# noqa: F404` comment ([link](https://github.com/pytorch/pytorch/pull/99799/files?diff=unified&w=0#diff-4f601fe7f31e875ee4354882c0bb490bc35e51d3d413d058cc5fda3be8ca9f15L23-R23))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99799
Approved by: https://github.com/Skylion007
2023-04-24 06:03:26 +00:00
48d112c431 Fix fake tracing of cross entropy with label smoothing and weight (#99830)
Fixes #99726
Adds a special path in the cross entropy implementation for tensor subclasses; we don't always use it, as it requires slightly more memory and is a bit slower.
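A minimal example of the argument combination involved (illustrative; not the repro from the issue):

```python
import torch
import torch.nn.functional as F

def loss_fn(logits, target, weight):
    # label smoothing + per-class weight is the combination that needed the
    # special path under fake-tensor tracing
    return F.cross_entropy(logits, target, weight=weight, label_smoothing=0.1)

logits = torch.randn(8, 5)
target = torch.randint(0, 5, (8,))
weight = torch.rand(5)
print(torch.compile(loss_fn)(logits, target, weight))
```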

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99830
Approved by: https://github.com/ezyang
2023-04-24 04:07:23 +00:00
7a6c650b81 [inductor] Lower aten.prod (#99484)
This lowers `aten.prod` using the new `tl.reduce` functionality in triton. I
also introduce `TritonKernel.helper_functions` which allows code to be defined
outside of the kernel body so that we can define the `_prod_accumulate` helper
function.
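A trivial way to exercise the new lowering (a usage sketch, not part of this PR):

```python
import torch

@torch.compile
def row_products(x):
    return x.prod(dim=-1)

x = torch.randn(4, 8, device="cuda" if torch.cuda.is_available() else "cpu")
print(row_products(x))
```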

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99484
Approved by: https://github.com/ngimel
2023-04-24 02:48:27 +00:00
79c9e82e27 Fix flake8 lint errors reported by ruff - take 2 (#99798)
Replaces #99784. This PR is pure autofix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99798
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-04-23 23:09:51 +00:00
dc1c0924ec Properly parenthesize dynamo_dynamic_indices test (#99823)
I've got the E2E test case which triggered this in https://github.com/pytorch/pytorch/pull/99809

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99823
Approved by: https://github.com/voznesenskym
2023-04-23 22:41:34 +00:00
6d5040a1ac [BE] Update python versions for black formatter config (#99827)
Update black config to currently supported python versions in PyTorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99827
Approved by: https://github.com/ezyang
2023-04-23 20:38:18 +00:00
f8c6861120 [MPS][BE] Introduce LookUpOrCreateCachedGraph (#99422)
A template that replaces following common pattern:
```cpp
  MPSGraphCache* cache_ = MPSGraphCache::getInstance();
  CachedGraph* cachedGraph = cache_->LookUpAs<CachedGraph>(key);

  if (!cachedGraph) {
    cachedGraph = cache_->CreateCachedGraphAs<CachedGraph>(key, ^MPSCachedGraph*() {
      CachedGraph* newCachedGraph = nil;

      @autoreleasepool {
        MPSGraph* mpsGraph = make_mps_graph();
        newCachedGraph = new PoolingCachedGraph(mpsGraph);
        ...
      }
      return newCachedGraph;
    });
 }
```
with
```cpp
  auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
    ...
  });
```

Fixes a memory leak in addmv_out_mps_impl, where new entries were added to the cache without doing the lookup first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99422
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-04-23 19:34:38 +00:00
d29cf18442 [CI] Pause perf data collection for max-autotune (#99829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99829
Approved by: https://github.com/ezyang
2023-04-23 18:41:31 +00:00
a89d6b0a82 [MPS] Add encoder coalescing support for native kernels (#99810)
Add support for kernel coalescing to native kernels.
This change reuses the same compute command encoder across successive metal kernel dispatches. The coalescing will stop when a graph op is encountered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99810
Approved by: https://github.com/kulinseth
2023-04-23 18:33:07 +00:00
wgb
2d3456167d [Typo]Summary:Fix a typo in comments (#99824)
Fixes a typo in a comment in torch/testing/_internal/common_device_type.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99824
Approved by: https://github.com/Skylion007
2023-04-23 18:10:26 +00:00
716ef2f5ad Improve code to make it more pythonic. (#99720)
No need to use `keys()` method to assert a key is in or not in a dict.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99720
Approved by: https://github.com/kit1980
2023-04-23 16:42:13 +00:00
72daadef2c [dynamo] Explicitly fall back to eager with GraphModule with no output for onnx&tvm backends (#99805)
Fixes #99437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99805
Approved by: https://github.com/jansel
2023-04-23 06:59:00 +00:00
9b0b31a5e3 fix conv+bn folding issue for mixed dtype (#99696)
Align the conv+bn folding behavior with the JIT path for the mixed-dtype case: always keep conv's weight and bias dtype after folding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99696
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-23 05:13:40 +00:00
1fc4d58f43 inductor: fix split+cat issue when cat order is not align the split output's order (#99700)
We should make sure the cat order aligns with the split outputs' order before removing the cat operation. Fix https://github.com/pytorch/pytorch/issues/99686.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99700
Approved by: https://github.com/EikanWang, https://github.com/devashishshankar, https://github.com/jansel
2023-04-23 04:10:20 +00:00
ebd47b0eec Propagate mark_dynamic in Dynamo compiled outputs. (#99634)
If you run a user operation you'll lose it, but this will at least
get the easy stuff.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99634
Approved by: https://github.com/voznesenskym
2023-04-23 03:24:28 +00:00
efed5a4969 Allow data size equal to 0 for SegmentReduce (#99733)
Summary:
Support the special case where the data size can be 0 for SegmentReduce.

Example code below:
```
x = torch.ones((0, 6)).cuda()
lengths = torch.tensor([0, 0]).cuda()
torch.segment_reduce(x, "sum", lengths=lengths, unsafe=False, initial=0)
```
Previously, this raised the error: "Expected data.numel() > 0 to be true, but got false."
Now it is expected to return 0.

Test Plan: contbuild & OSS CI

Differential Revision: D45133827

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99733
Approved by: https://github.com/ngimel
2023-04-23 01:59:45 +00:00
7a8d0ccddf Correct LBFGS tolerance_grad doc string (#99792)
LBFGS' `tolerance_grad` parameter has had a default value of `1e-7` since #25240. The doc string wasn't updated in that PR to match the change https://github.com/pytorch/pytorch/blob/main/torch/optim/lbfgs.py#L207.

No open issue for it; I just happened to set it to 1e-7 and was surprised my results didn't change :-) I eventually noticed the inconsistency in the doc, and it seemed like an easy opportunity to figure out how to contribute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99792
Approved by: https://github.com/janeyx99
2023-04-22 20:19:01 +00:00
f602b3a6ae Preserve mark_dynamic when cloning inputs (#99617)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99617
Approved by: https://github.com/ngimel, https://github.com/voznesenskym, https://github.com/anijain2305
2023-04-22 19:46:31 +00:00
5e73569ab4 Add memoized_only mode to tensor conversion (#99741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99741
Approved by: https://github.com/ezyang
2023-04-22 19:19:39 +00:00
4c2892944f Guard static shapes alongside tensors, instead of from shape_env, in dynamic_shapes=True (#99566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99566
Approved by: https://github.com/ezyang
2023-04-22 16:46:52 +00:00
220712f4de Fix torch.compile() on a skipped module (#98894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98894
Approved by: https://github.com/xw285cornell
2023-04-22 16:10:55 +00:00
d192729cfd inductor: fix AllenaiLongformerBase dynamic shape error on CPU (#98842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98842
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-22 14:43:10 +00:00
31eb9949e4 [dynamo] disallow_in_graph bugfix (#99600)
Testing if the minor change breaks other test cases.

For the added test case, TorchDynamo causes a graph break on `torch.ops.foo.custom` but then starts running again on the recursively invoked frame - `foo_cpu` on L48 in the test file. This raises an assertion like this:

~~~
Traceback (most recent call last):
  File "/scratch/anijain/work/pytorch/test/dynamo/test_decorators.py", line 65, in test_disallow_in_graph_for_custom_op
    res = opt_fn(x)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/eval_frame.py", line 252, in _fn
    return fn(*args, **kwargs)
  File "/scratch/anijain/work/pytorch/test/dynamo/test_decorators.py", line 56, in fn
    b = torch.ops.foo.custom(a)
  File "/scratch/anijain/work/pytorch/torch/_ops.py", line 646, in __call__
    return self._op(*args, **kwargs or {})
  File "/scratch/anijain/work/pytorch/torch/_dynamo/eval_frame.py", line 401, in catch_errors
    return callback(frame, cache_size, hooks, frame_state)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 495, in _convert_frame
    result = inner_convert(frame, cache_size, hooks, frame_state)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 122, in _fn
    return fn(*args, **kwargs)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 331, in _convert_frame_assert
    return _compile(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/utils.py", line 169, in time_wrapper
    r = func(*args, **kwargs)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 401, in _compile
    out_code = transform_code_object(code, transform)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/bytecode_transformation.py", line 1000, in transform_code_object
    transformations(instructions, code_options)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/convert_frame.py", line 371, in transform
    tracer = InstructionTranslator(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/symbolic_convert.py", line 1890, in __init__
    self.symbolic_locals = collections.OrderedDict(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/symbolic_convert.py", line 1893, in <genexpr>
    VariableBuilder(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 165, in __call__
    return self._wrap(value).clone(**self.options())
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 290, in _wrap
    return type_dispatch(self, value)
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 776, in wrap_tensor
    tensor_variable = wrap_fx_proxy(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 923, in wrap_fx_proxy
    return wrap_fx_proxy_cls(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 983, in wrap_fx_proxy_cls
    example_value = wrap_to_fake_tensor_and_record(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 1213, in wrap_to_fake_tensor_and_record
    fake_e = wrap_fake_exception(
  File "/scratch/anijain/work/pytorch/torch/_dynamo/utils.py", line 835, in wrap_fake_exception
    return fn()
  File "/scratch/anijain/work/pytorch/torch/_dynamo/variables/builder.py", line 1214, in <lambda>
    lambda: tx.fake_mode.from_tensor(
  File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 1434, in from_tensor
    return self.fake_tensor_converter(
  File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 329, in __call__
    return self.from_real_tensor(
  File "/scratch/anijain/work/pytorch/torch/_subclasses/fake_tensor.py", line 283, in from_real_tensor
    out = self.meta_converter(
  File "/scratch/anijain/work/pytorch/torch/_subclasses/meta_utils.py", line 531, in __call__
    r = self.meta_tensor(
  File "/scratch/anijain/work/pytorch/torch/_subclasses/meta_utils.py", line 184, in meta_tensor
    assert not torch._C._dispatch_tls_local_exclude_set().has(
AssertionError:

~~~

It seems `_dynamo.disable` is the right option for custom ops added by `torch.library`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99600
Approved by: https://github.com/jansel
2023-04-22 12:40:33 +00:00
e63c502baa [Executorch][XNNPACK] Quantized Max Pool 2d (#99587)
Adding support for Quantized Max Pool 2d

Additions:
- Add quantized max pool 2d to executorch backend config
- modify max pool node visitors to grab quant params from input/output
- Add qmaxpool 2d patterns for partitioners

Differential Revision: [D44977783](https://our.internmc.facebook.com/intern/diff/D44977783/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99587
Approved by: https://github.com/jerryzh168
2023-04-22 07:17:13 +00:00
7749eec8df Remove deprecated declaration suppression (#99749)
As https://github.com/pytorch/pytorch/pull/55889 landed a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99749
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-04-22 06:49:28 +00:00
a964a3dbed [quant][pt2e] add all convs-relu fusion qat configs (#99586)
Currently, when running prepare_qat_fx with the executorch backend config, we do not properly quantize conv or conv-relu.

To fix this, we add all the necessary QAT configs for conv and conv-relu.

Differential Revision: [D45135947](https://our.internmc.facebook.com/intern/diff/D45135947/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99586
Approved by: https://github.com/jerryzh168
2023-04-22 06:44:23 +00:00
c139dfd71e [quant][pt2e] add dropout to executorch backend config (#99585)
The OD model has a dropout layer in training. In order to match eager mode QAT, we also fake quantize the dropout layer in prepare_qat_fx.

To do this, we add the dropout layer to the default_op_configs, with an observation type that uses a different observer from its input.
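A hedged sketch of what such an entry looks like with the torch.ao.quantization BackendConfig API (illustrative; not the exact executorch config from this diff):

```python
import torch
from torch.ao.quantization.backend_config import (
    BackendPatternConfig,
    DTypeConfig,
    ObservationType,
)

# Dropout gets its own output observer rather than sharing the input's.
dropout_config = (
    BackendPatternConfig(torch.nn.Dropout)
    .set_observation_type(ObservationType.OUTPUT_USE_DIFFERENT_OBSERVER_AS_INPUT)
    .add_dtype_config(
        DTypeConfig(input_dtype=torch.quint8, output_dtype=torch.quint8)
    )
)
```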

Differential Revision: [D45095936](https://our.internmc.facebook.com/intern/diff/D45095936/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99585
Approved by: https://github.com/jerryzh168
2023-04-22 06:41:44 +00:00
9db6920635 [spmd] Add list handling to data parallel and add foreach tests (#99373)
This PR adds list handling logic to the new DataParallel expansion and
adds foreach optimizer tests, currently testing SGD optimizers
in foreach mode, for both replicate and fully shard.

Next step:

Add fused optim tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99373
Approved by: https://github.com/mrshenli
2023-04-22 05:39:20 +00:00
c1e2fa8189 [dtensor] add StrategyType and TupleStrategy (#99435)
This PR refactors the current StrategyList. It introduces
StrategyType, which is the base class for strategies, and it has
two sub-strategies (see the sketch below):

1. Refactor the previous StrategyList to OpStrategy.
2. Add TupleStrategy, a new strategy added to deal with tuple cases where
an op can return multiple different OpStrategy instances.

This helps support more complicated ops and unblocks compile-mode
FSDP.
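A rough sketch of the hierarchy described above (simplified field names; the real classes carry more information than shown here):

```python
from dataclasses import dataclass, field
from typing import Any, List

class StrategyType:
    """Base class for op sharding strategies."""

@dataclass
class OpStrategy(StrategyType):
    # one entry per possible placement strategy for the op
    strategies: List[Any] = field(default_factory=list)

@dataclass
class TupleStrategy(StrategyType):
    # one child strategy per element of the tuple the op returns
    children: List[StrategyType] = field(default_factory=list)
```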
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99435
Approved by: https://github.com/mrshenli
2023-04-22 05:39:20 +00:00
82a54513ac [fx] Add a function to allow adding more functions to the side effect function set (#97288)
Summary: There are some customized functions that we would also like to keep during the eliminate-dead-code pass. Add a function that allows us to do that.

Test Plan: Added a unit test

Differential Revision: D44273630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97288
Approved by: https://github.com/houseroad
2023-04-22 04:42:24 +00:00
87b71e570e [Profiler] Support HTML plot output for profiler export_memory_timeline API (#99751)
Summary:
Support the file extension .html, which will include a PNG image of the plot embedded into an HTML file.

This allows users to avoid processing the timeline manually in their own frontend UI.
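A usage sketch (the tiny model is a placeholder, and rendering the embedded PNG is assumed to require matplotlib):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(64, 64)
inputs = torch.randn(32, 64)

with profile(
    activities=[ProfilerActivity.CPU],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    model(inputs)

# With this change, an .html path writes a standalone page with the plot embedded as a PNG.
prof.export_memory_timeline("memory_timeline.html", device="cpu")
```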

Test Plan:
CI Tests

Ran on resnet50 model and generated this html file w/ plot:
See attached html file: {F954232276}
Screenshot: {F954232469}

Differential Revision: D45152735

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99751
Approved by: https://github.com/davidberard98
2023-04-22 04:21:58 +00:00
ca8625f456 [BE][1/N]Add sharding spec logger for ShardedTensor (#99748)
Set up a nullHandler() on the OSS side.
Next step is to set up the counterpart in internal.

This is part of the effort for ShardedTensor deprecation. We want to log internal use cases for different sharding spec.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99748
Approved by: https://github.com/H-Huang, https://github.com/fegin
2023-04-22 04:05:21 +00:00
bd7191111f [ONNX] Add additional_test_kwargs into test_fx_to_onnx_with_onnxruntime.py (#99434)
1. Expand additional_test_inputs to include kwargs
2. Revisit and update test status by adding ops
3. Disabling dtype -1 assignment avoids potential bugs
4. Expand input/output to accept built-in types, but they are not dynamically captured by dynamo.export right now, and they would be added as constant inputs to op.targets.
5. Move run_test_with_fx_to_onnx_exporter_and_onnx_runtime to onnx_test_common.py

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 3c03579</samp>

### Summary
🛠️🧪🚀

<!--
1.  🛠️ for updating the `filter_incompatible_and_dtype_convert_kwargs` function
2.  🧪 for updating the test function and test cases
3.  🚀 for adding support for new operators and scalar types
-->
This pull request improves the ONNX export support for scalar types and some ATen operators in PyTorch. It updates the test framework, the input and output adapters, the function dispatcher and the ONNX script generator to handle these cases. It also fixes or removes some failing or outdated tests.

> _We defy the limits of the ONNX script_
> _We export the models with scalar and copy_
> _We filter and convert the kwargs of dtype_
> _We run the tests with FX and docstring_

### Walkthrough
*  Update the `_InputArgsType` type annotation and the `_run_test_with_fx_to_onnx_exporter_and_onnx_runtime` function signature and docstring to handle int, float and bool inputs for some ONNX operators ([link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL44-R46), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL144-R157), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL155-R164), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL162-R172), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL201-R224), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L197-R199), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-0795f54fd1f38cfbf2c4a863a4efc9f40f2ea020a2b1612605c361b8d8d35862L291-R293))
* Update the `filter_incompatible_and_dtype_convert_kwargs` function to omit the `dtype` argument if it is None ([link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-cabc3e58713d6fe7ab764ade4f2692f6753402322a7b542397cad16fcc72cf4bL203-R205))
* Update the test cases in `test_fx_to_onnx_with_onnxruntime.py` to use the `input_kwargs` parameter as a mapping, to fix the format of the `additional_test_inputs` parameter, and to add or remove `xfail`, `skip_dynamic_fx_test` and `skip_min_ort_version` decorators as needed ([link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL320-R336), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL330-R353), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL357-R380), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL452-L470), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL488-R486), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL509-R510), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbR543), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL559-R565), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL578-R580), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL597-R599), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL611-R620), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL636-R636), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL656-R659), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL672-R675), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL691-R698), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL709-R714), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL732-R730), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL752-R750), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL773-R771), [link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbR797-R803), 
[link](https://github.com/pytorch/pytorch/pull/99434/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL807-R816))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99434
Approved by: https://github.com/justinchuby
2023-04-22 04:03:50 +00:00
e9bf94149e [spmd] Introduce Compile Mode FSDP with DTensor (#99062)
This PR introduces compile mode Data Parallel (FSDP/DDP) using DTensor sharding.

Along with the algorithm, it also introduces a new DataParallelMode so that the `compile` API can take it
and apply data parallelism. This PR tries to preserve the DTensorExpand
approach first to avoid breaking BC; we shall discuss steps to remove
DTensorExpand.

The data parallel mode uses heuristics to determine node types in the
graphs and assigns the corresponding sharding. The detailed algorithm is
described in the design doc.

The benefits of this approach:
- Model parameters and optimizer states are all DTensors after `spmd.compile`, which is necessary for FSDP and also makes checkpointing much easier
- As model parameters/optim states are sharded in a per-parameter fashion, it can compose with sophisticated second-order optimizers (e.g., Shampoo) more easily.
- We leverage the model parameter/grad information to derive the data parallel pattern. This way we don't need to worry about DTensor op coverage anymore, as data parallel is just a special case of DTensor operation.
- Using dtensor_expand might work for DDP but isn't going to work for FSDP, as DTensor might choose to all-gather activations, which could violate the native FSDP algorithm.
- The approach is general enough to support both DDP/FSDP and a mixed mode

Follow ups:
- Add the "default" data parallel mode which supports mixing of
replicate/fully shard
- Test more e2e models with more different types of optimizers, etc
- migrate the existing stack from the DTensorExpand mode
- build optimizations on top of this prototype

Differential Revision: [D45174400](https://our.internmc.facebook.com/intern/diff/D45174400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99062
Approved by: https://github.com/mrshenli
2023-04-22 03:13:05 +00:00
be62a80787 [vision hash update] update the pinned vision hash (#99486)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99486
Approved by: https://github.com/pytorchbot
2023-04-22 03:06:44 +00:00
e5664c652a [ONNX] Support aten::scaled_dot_product_attention in torchscript exporter (#99658)
Fixes #97262

<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at d06d195</samp>

### Summary
🆕🚀📝

<!--
1.  🆕 for adding tests and annotations for a new operator.
2.  🚀 for adding support for exporting a new operator to ONNX.
3.  📝 for fixing a minor formatting issue.
-->
This pull request adds ONNX opset 14 support for the `nn.functional.scaled_dot_product_attention` operator, which is used for self-attention in transformer models. It does so by adding tests and annotations in `test/onnx/test_op_consistency.py`, and by adding a symbolic function in `torch/onnx/symbolic_opset14.py` that reuses an existing implementation.

> _To export `scaled_dot_product_attention`_
> _To ONNX opset 14, we need some extension_
> _We import some modules and types_
> _And add a symbolic that pipes_
> _The existing code with some annotation_

### Walkthrough
*  Implement the `nn.functional.scaled_dot_product_attention` operator for ONNX opset 14 ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-244955d820ec138d5ddffb20ee6f517cc4c5d281f19ccb53d8db47043b5ac46fR122-R292))
*  Add imports for modules and types needed for the operator implementation ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-244955d820ec138d5ddffb20ee6f517cc4c5d281f19ccb53d8db47043b5ac46fL17-R23))
*  Add a command to run the pytest module for testing the operator consistency ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753R13))
*  Add the operator to the list of operators tested for consistency ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753R311))
*  Add annotations to indicate the operator's limitations and issues ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L333-R339), [link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753R354-R358))
*  Remove an empty line at the end of `test/onnx/test_op_consistency.py` ([link](https://github.com/pytorch/pytorch/pull/99658/files?diff=unified&w=0#diff-e968c9cb6fc6631cab526cb3a9fe66358c4c6e757e2a223a224b976471bcb753L441))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99658
Approved by: https://github.com/justinchuby
2023-04-22 02:36:39 +00:00
6585d76f0f [docs] nn.functional.embedding: Note expected discrepancy between numerical and analytical gradients (#99181)

Fixes https://github.com/pytorch/pytorch/issues/93950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99181
Approved by: https://github.com/albanD
2023-04-22 02:30:53 +00:00
cyy
2b7161e2bf lower cmake version requirement in FindSanitizer.cmake (#97073)
As indicated by the last comment from PR #93147, we should replace CheckSourceRuns in **cmake/Modules/FindSanitizer.cmake**  with older versions to avoid dependency on CMake 3.19+
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97073
Approved by: https://github.com/vfdev-5, https://github.com/Skylion007
2023-04-22 02:02:14 +00:00
93d0a9c1b5 Add pattern to normalize split (#99588)
Summary: We normalize split with sections to split with sizes, so that it is easier to do subsequent transforms
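
A hedged sketch of the idea (the helper below is illustrative, not the pass added in this PR): turning an integer split size into the explicit size list that later transforms can consume uniformly.

```python
import torch

def sections_to_sizes(dim_size: int, split_size: int) -> list:
    """Expand an integer split size into the explicit chunk sizes torch.split would produce."""
    full, rem = divmod(dim_size, split_size)
    return [split_size] * full + ([rem] if rem else [])

t = torch.arange(10)
sizes = sections_to_sizes(t.numel(), 3)                    # [3, 3, 3, 1]
assert [c.numel() for c in torch.split(t, 3)] == sizes     # int form
assert [c.numel() for c in torch.split(t, sizes)] == sizes # explicit-sizes form
```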

Differential Revision: D45136185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99588
Approved by: https://github.com/jansel
2023-04-22 01:08:11 +00:00
79ec91a943 Add pass to remove redundant conversions (#99697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99697
Approved by: https://github.com/ngimel
2023-04-22 00:37:11 +00:00
4637c5ae5b Revert "Simplify _use_grad_for_differentiable (#98706)"
This reverts commit b9da79d2800c2ca00b57bc3ac86b43e01be174b6.

Reverted https://github.com/pytorch/pytorch/pull/98706 on behalf of https://github.com/huydhn due to Sorry for reverting your PR but a bunch of inductor tests are failing after this commit, so reverting the PR just to be sure
2023-04-22 00:35:56 +00:00
872319d393 [ONNX] Cover 'undiscoverable' ops 'torch.ops.aten' (#99682)
Due to https://github.com/pytorch/pytorch/issues/99681, ops supported by torchlib in `function_dispatcher` may not get
registered. This PR works around it by doing a reverse lookup.
Also fixes aten signature for `log_sigmoid`, which appears to be an outlier with unique naming style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99682
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-04-22 00:31:59 +00:00
96d3f3dee3 Discover and run C++ tests with run_test.py (#99559)
This depends on [pytest-cpp](https://github.com/pytest-dev/pytest-cpp) to discover and run C++ tests with pytest. C++ tests are built under `${WORKSPACE}/build/bin` directory and copied to the test job under the same path.

* To expose them to `run_test`, I chose to use the mock path prefix `cpp`, for example `build/bin/c10_Array_test` would be named `cpp/c10_Array_test`, and `python test/run_test.py --cpp -i cpp/c10_Array_test` would run the test in the same way as other Python tests.  I could copy them from `build/bin` to `test/cpp`, but they would be mixed with the source code and CMake files, so this looks easier.
* Some executables under `build/bin` are not C++ tests, and they are excluded, for example `build/bin/torch_shm_manager`
* C++ tests need to run with pytest directly as the python command doesn't understand them
* The change is gated by the new `--cpp` argument to `run_test.py`, for example `python test/run_test.py --cpp` will run all available C++ tests
* The tests can be run in parallel
* Failing tests can be retried with `--reruns=2` and `--sw`

```
============================= test session starts ==============================
platform darwin -- Python 3.9.15, pytest-7.2.0, pluggy-1.0.0 -- /Users/huydo/miniconda3/envs/py3.9/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/Users/huydo/Storage/mine/pytorch/test/.hypothesis/examples')
rootdir: /Users/huydo/Storage/mine/pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.0, cpp-2.3.0, rerunfailures-10.3, shard-0.1.2, flakefinder-1.1.0, hypothesis-6.56.4, xdist-3.0.2, repeat-0.9.1
collecting ... collected 3 items / 2 deselected / 1 selected
Running 1 items in this shard: build/bin/scalar_tensor_test::TestScalarTensor.TestScalarTensorMPS
stepwise: skipping 2 already passed items.

../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS RERUN [100%]
../build/bin/scalar_tensor_test::TestScalarTensor::TestScalarTensorMPS FAILED [100%]
```

* `--import-slow-tests` and `--import-disabled-tests` won't work for now, and that's ok to leave as a future task.

I also add `pytest-cpp==2.3.0` to Linux Docker, MacOS, and Windows.

### Testing

Build PyTorch and run `python test/run_test.py --cpp` on my laptop.  CI change would come later in a separate PR.  Also running `python test/run_test.py --help` now shows all C++ test discovered under `build/bin`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99559
Approved by: https://github.com/clee2000
2023-04-22 00:23:31 +00:00
bfbc4e74ab adjust batch sizes for hf suite (#99691)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99691
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2023-04-21 23:57:53 +00:00
ce60997376 [BE][DTensor] validate the mesh argument in DeviceMesh construction (#99094)
## What's in this PR
DeviceMesh's __init__ function now requires all calling ranks to pass the same `mesh` argument.

## Why
We want to enforce SPMD-style programs using DTensor. Before this PR, the 2-D parallel API (e.g. _create_1d_device_mesh) defined different DeviceMeshes on different ranks. After this PR, it defines each sub-mesh and simply performs communications on the one it is associated with.
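
An illustrative sketch of the new contract (a 4-GPU world is assumed here, and the import path may vary by version; not taken from the PR):

```python
from torch.distributed._tensor import DeviceMesh

# After this change, every rank must pass the same `mesh` argument;
# sub-mesh communication then happens on the slice the rank belongs to.
mesh_2d = DeviceMesh("cuda", [[0, 1], [2, 3]])  # identical on all 4 ranks
```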

Differential Revision: [D45165511](https://our.internmc.facebook.com/intern/diff/D45165511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99094
Approved by: https://github.com/wanchaol
2023-04-21 23:47:51 +00:00
cf357adc7e Allow torch.fx to take Modules that return dataclass (#99576)
Summary:
Currently torch.fx supports Modules with namedtuple/dataclass inputs and namedtuple returns, but does not allow Module.forward to return a dataclass; running `test_trace_return_dataclass` without this change produces the following error:

  NotImplementedError: argument of type: <class 'test_fx.TestFX.test_trace_return_dataclass.<locals>.MyOutput'>
  File "test_trace_return_dataclass
    traced_graph = symbolic_trace(module).graph
  File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 1114, in symbolic_trace
    graph = tracer.trace(root, concrete_args)
  File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 783, in trace
    (self.create_arg(fn(*args)),),
  File "test/__fx__/fx#link-tree/torch/fx/_symbolic_trace.py", line 378, in create_arg
    return super().create_arg(a)
  File "test/__fx__/fx#link-tree/torch/fx/proxy.py", line 269, in create_arg
    raise NotImplementedError(f"argument of type: {type(a)}")

This diff handles the dataclass type.

Test Plan:
buck test @//mode/opt @//mode/inplace //caffe2/test:fx -- test_trace_

  graph():
    %d : torch.Tensor [#users=1] = placeholder[target=d]
    %my_output : [#users=1] = call_function[target=test_fx.MyOutput](args = (), kwargs = {foo: %d, bar: %d})
    return my_output

Differential Revision: D44916519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99576
Approved by: https://github.com/suo
2023-04-21 23:46:49 +00:00
547bef11ee tweak heuristic for sdpa selection based off of *data* (and a decision tree) (#99644)
High level approach:
1. I generated a bunch of data comparing FlashAttention and Cutlass implementations (https://pastebin.com/pe0j3YeK)
2. I trained a decision tree using standard train/val split methodology and hyperparameter sweeps (https://pastebin.com/fjYX1HjR).
2a. I did a bunch of feature augmentation to capture interactions between features.

The heuristic I ended up with is:
```
use_flash = seq_len / (num_heads * batch_size) > 6
```
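
For instance, with batch size 8, 16 heads, and sequence length 1024 (illustrative numbers, not from the PR), 1024 / (16 * 8) = 8 > 6, so FlashAttention would be chosen.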

TL;DR: On my dataset, where FlashAttention and Cutlass differ by more than 10%, the existing heuristic achieves 69% accuracy.  My new heuristic achieves 94% accuracy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99644
Approved by: https://github.com/ngimel, https://github.com/drisspg
2023-04-21 23:28:44 +00:00
bb830224e3 Remove extra space (#99750)
Fixes https://github.com/pytorch/pytorch/issues/99714

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99750
Approved by: https://github.com/lezcano, https://github.com/albanD
2023-04-21 23:18:52 +00:00
4f62e7cb10 [FSDP][BE] Remove unused code (#99731)
Remove the unused code. https://github.com/pytorch/pytorch/pull/99675 is a duplicate and we should land this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99731
Approved by: https://github.com/wz337
2023-04-21 23:11:37 +00:00
363d530035 Fix decision logic for should_cast_forward_inputs in _root_pre_forward() and _pre_forward() (#99546)
Fixes #99545

There is currently no topological constraint dictating that FSDP instances own ``FlatParamHandle``s directly. If all parameters are managed by descendant FSDP instances, leaving an FSDP instance with no direct ``state._handles``, the ``should_cast_forward_inputs`` checks below in both ``_root_pre_forward()`` and ``_pre_forward()`` can return incorrect decisions [^1].

For [``_root_pre_forward()``](436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L514)):

436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L602-L604)

For [``_pre_forward``](436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L384)):

436edc5ac3/torch/distributed/fsdp/_runtime_utils.py (L420-L422)

See the [related issue](https://github.com/pytorch/pytorch/issues/99545) for reproduction.

### Remediation

In this PR, I amend the two decision statements referenced above (in both `_root_pre_forward()` and `_pre_forward()`) to account for FSDP instances without direct handles:
```python
should_cast_forward_inputs = len(state._handles) > 0 and all(
    not handle._force_full_precision for handle in state._handles
)
```

If one configures ``MixedPrecision`` in the example above with ``cast_forward_inputs=True`` and the ``should_cast_forward_inputs`` adjustment above, FSDP returns to the expected behavior and produces no error.

Though the check is the same in both ``_root_pre_forward()`` and ``_pre_forward()`` and hence could be refactored into a separate function, I figured it may make sense to retain separate statements to preserve the ability for root-specific behavior in the future. Whichever approach the team prefers I can update this PR with.

### Implementation considerations and questions:

1. Rather than write a test that would arguably have a poor utility/resource usage profile, I have not added any tests associated with this PR. The new decision logic is exercised by all existing tests (which continue to pass after this PR of course) so I think the utility of new tests is fairly modest. Let me know if you think new tests should be added and I'm happy to do so.
2. As discussed above, the decision statement shared among ``_pre_forward()`` and ``_root_pre_forward()`` could be factored out into a separate function. Given the simplicity of the statement and to retain current flexibility for root-specific decisions it might not be worth the refactor so I haven't done it yet. Let me know if you'd like me to do so.
3. The note below could be updated to indicate the utility of setting ``cast_forward_inputs=True`` for the situations addressed with this PR but I haven't done so since I'm not sure it's worth complicating the current usage guidance. I'd be happy to add verbiage describing the use case if the team wants it.
cde35b4069/torch/distributed/fsdp/api.py (L175-L181)

Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!

[^1]: Though one could keep the existing decision logic and impose a new topological constraint requiring all FSDP instances have direct `_handles`, I think retaining the current wrapping flexibility is both convenient and useful enough (e.g. programmatic wrapping of modules that may or may not already have all parameters handled by descendant FSDP instances) to update the decision logic as discussed here instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99546
Approved by: https://github.com/awgu
2023-04-21 22:49:50 +00:00
10c938abef Handle meta['val'] for tuple of lists. (#99724)
Fixes https://github.com/pytorch/pytorch/issues/99356

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99724
Approved by: https://github.com/wanchaol
2023-04-21 22:33:21 +00:00
6c899999f4 Fix wrong path when reinstalling MacOS pip requirements (#99758)
I force merged https://github.com/pytorch/pytorch/pull/99506 too soon to fix MacOS flakiness in trunk and forgot to set the path to the pip requirements file correctly.  `popd` needs to be run before the reinstallation, so that the working directory is back to pytorch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99758
Approved by: https://github.com/kit1980
2023-04-21 22:32:59 +00:00
db46d9dc49 [CI] Change max-autotune's output file name (#99754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99754
Approved by: https://github.com/huydhn
2023-04-21 21:53:45 +00:00
8548cb3dd5 Improve OOM error message (#99699)
This PR adds calls to nvml during an OOM to find out the total memory
in use by the process and any other CUDA processes on the device.

This makes it easier to identify cases where non-PyTorch libraries have
allocated memory or another process (such as a data loader) has also
allocated memory on the device.

This also rewords the other parts of the error message to make the meaning
of the memory statistics more clear with this new information:

"""
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 138.00 MiB.
GPU 0 has a total capacty of 15.90 GiB of which 8.44 MiB is free.
Process 1246069 has 577.00 MiB memory in use. Including non-PyTorch memory,
this process has 15.32 GiB memory in use. Of the allocated memory
14.12 GiB is allocated by PyTorch, and 410.41 MiB is reserved
by PyTorch but unallocated. If reserved but unallocated memory is large
try setting max_split_size_mb to avoid fragmentation.  See documentation
 for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99699
Approved by: https://github.com/ngimel
2023-04-21 21:36:48 +00:00
c39aff1084 Disable XProtect on MacOS runner (#99506)
A theory is that something else on the runner removes the file like Windows Defender.  The number one suspect is `com.apple.XProtect.daemon.scan` https://support.apple.com/guide/security/protecting-against-malware-sec469d47bd8/web

Spot checking on some runners:

* On 13.x (13.3.1 and 13.2.1), the daemon is now called `com.apple.XProtect.daemon.scan`
```
sh-3.2$ sudo launchctl list | grep -i protect
8048	-9	com.apple.XprotectFramework.PluginService
8047	-9	com.apple.XProtect.daemon.scan
```

* On 12.4, the daemon is called `com.apple.XprotectFramework`
```
sudo launchctl list | grep -i protect
-	-9	com.apple.XprotectFramework.PluginService
-	-9	com.apple.XprotectFramework.scan
```

Looking at the list of failures in https://hud.pytorch.org/failure/ModuleNotFoundError%3A%20No%20module%20named%20'sympy', I can confirm that the issue happens with both MacOS 12 and 13 as I can find examples on both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99506
Approved by: https://github.com/malfet
2023-04-21 21:26:49 +00:00
18fd6394dc Give distinct names to __unknown_tensor (#99729)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99729
Approved by: https://github.com/albanD
2023-04-21 21:03:43 +00:00
b9da79d280 Simplify _use_grad_for_differentiable (#98706)
This makes it so dynamo can trace through it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98706
Approved by: https://github.com/janeyx99
2023-04-21 20:47:19 +00:00
08376cc546 [Inductor] Fix rand_like with kwargs device of str type (#99673)
Fixes #99632

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99673
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-21 20:33:14 +00:00
7876c503b7 [FSDP][optim_state_dict] Consolidate rank0_only load logic (#99647)
Following up https://github.com/pytorch/pytorch/pull/99624, this PR consolidates the `use_orig_params=False` path with `use_orig_params=True` so that both use the same logic to load the optimizer checkpoint when rank0_only is True.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99647
Approved by: https://github.com/wz337
2023-04-21 20:29:54 +00:00
dd07dab1c7 [FSDP][optim_state_dict] Support rank0_only when use_orig_params is on (#99624)
This PR makes the `use_orig_params=True` case support rank0_only loading for the optim state_dict. The implementation is different from `use_orig_params=False`: the `use_orig_params=False` implementation first flattens the parameters on rank0 and then broadcasts the states, while this implementation broadcasts the states while doing the flattening. This implementation is slower as it broadcasts the original parameters instead of the flattened ones; however, it is simpler. As loading usually happens once per training run, the performance difference can be ignored. In the next PR, we will consolidate the implementations in favor of the simpler one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99624
Approved by: https://github.com/wz337
2023-04-21 20:09:19 +00:00
fc63d710fe Revert "Disable XProtect on MacOS runner (#99506)"
This reverts commit 9bece55a7e620525f76177b2402b178acab66ee8.

Reverted https://github.com/pytorch/pytorch/pull/99506 on behalf of https://github.com/huydhn due to Found a clue on the uploaded archive, reverting this so I can update the PR with the correct mitigation
2023-04-21 19:31:56 +00:00
ee5f09ab80 [Feature] storage pin memory support custom device. (#99712)
Fixes #99326

Support storage pin_memory and is_pinned for custom devices by calling dispatched tensor operations.

@ezyang this PR is what we discussed in issue #99326; would you please take a moment to review it? Thanks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99712
Approved by: https://github.com/ezyang
2023-04-21 18:31:01 +00:00
400075f733 [stacktraces] Keep addr2line procs around (#99670)
This PR caches the addr -> Frame information across calls to symbolize,
and also keeps the addr2line symbolizing processes around once requested.

This makes calls to symbolize frames that have been seen before nearly instant,
and makes lookup of address in libraries that have already been loaded by
addr2line faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99670
Approved by: https://github.com/ezyang
2023-04-21 18:16:04 +00:00
e09f785a72 [CI] Remove inductor skip list for Huggingface (#99375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99375
Approved by: https://github.com/anijain2305
2023-04-21 18:13:22 +00:00
75e754800f Revert "[quant][pt2e][refactor] Cleanup the logic for deciding whether to insert observer/fq or not (#99220)"
This reverts commit d56adb1b54c8dba3d13ecae93f81c945325bc1c7.

Reverted https://github.com/pytorch/pytorch/pull/99220 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-04-21 18:04:21 +00:00
b96bb2f1a6 [spmd] Introduce ParallelMode and add DTensorExpandMode (#98452)
This PR introduces a ParallelMode interface to define how to do
SPMD expansion and optimize the captured graph. This is
beneficial for letting different parallelisms expand differently
and apply different optimization passes.

DTensorExpandMode is added as the first parallel mode, implementing the
existing dtensor_expand functionality.

Differential Revision: [D45174399](https://our.internmc.facebook.com/intern/diff/D45174399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98452
Approved by: https://github.com/mrshenli
2023-04-21 17:24:54 +00:00
9244264f46 [Inductor] Fix view/reshape on tensors with shape 0 in any dimension (#99671)
Found via the 14k GitHub models suite; not sure if this is the right way to fix this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99671
Approved by: https://github.com/ngimel
2023-04-21 17:14:50 +00:00
d56adb1b54 [quant][pt2e][refactor] Cleanup the logic for deciding whether to insert observer/fq or not (#99220)
Summary:
Previously we have two places we need to decide whether to insert observer or fake quantizer or not:
(1) input arguments of a node (2) output of a node, and right now we have separate code to do this
in this PR, the logic is unified in `_needs_obs_or_fq` helper function that takes the target_dtype and is_dynamic from previous output
and target_dtype and is_dynamic for the current Tensor we are looking at

let's use an example for conv node:
```
conv = convolution(input, weight, bias, ...)
```

let's say we have `input_node` object for argument `input`, and `conv_node` for `conv` node in the graph

(1) input arguments, e.g. `input`
the target_dtype/is_dynamic from previous output is the node that produces `input`, we get this from
input_node.meta["target_dtype_info"]["output_act_obs_or_fq"]

the target_dtype/is_dynamic for the current argument `input` comes from conv_node.meta["target_dtype_info"]["input_act_obs_or_fq"];
similarly for weight it comes from conv_node.meta["target"]["weightobs_or_fq"] etc.

(2) output for conv node
the target_dtype/is_dynamic from the previous output will be the floating point output from the fp32 convolution operator, so it
is hardcoded to be (torch.float, False). Technically we should get this from node.meta["val"], but since the
current code base is shared by fx graph mode quantization and pytorch 2.0 export quantization, we cannot do that; we can revisit
after we decide to deprecate fx graph mode quantization

the target_dtype/is_dynamic for the current output comes from conv_node.meta["target_dtype_info"]["output_act_obs_or_fq"]

there is one caveat here about dynamic quantization, that is explained in the comment, so I won't repeat here

Note: also fixed some places in `_get_arg_target_dtype_as_input_to_node` and `_get_arg_target_is_dynamic_as_input_to_node` to make sure "not specified" == specifying a fp32 placeholder observer as well

Next: we can merge the two get target dtype and get is_dynamic function to reduce code duplication

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D45167585](https://our.internmc.facebook.com/intern/diff/D45167585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99220
Approved by: https://github.com/kimishpatel
2023-04-21 16:58:35 +00:00
06081ac8f3 Update docstring of torch.nn.functional.normalize() (#99512)
Fixes #99125

torch.nn.functional.normalize() already supports dim=tuple(int), but the docstring says int only.
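
A small usage sketch of the tuple form (the shapes below are arbitrary):

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 3, 4)
y = F.normalize(x, p=2, dim=(1, 2))  # L2-normalize jointly over dims 1 and 2
```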

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99512
Approved by: https://github.com/albanD
2023-04-21 16:45:24 +00:00
e9786149ab Delete tracing_mode argument to export (#99555)
You can have any color you want, as long as it's tracing_mode="symbolic"

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99555
Approved by: https://github.com/voznesenskym
2023-04-21 16:20:51 +00:00
881c57230d Move more stuff to after_aot (#99557)
Not sure why this didn't work first time around. Second time's a charm.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99557
Approved by: https://github.com/anijain2305
2023-04-21 16:20:40 +00:00
d3bb762f1e Do not assume static by default when exporting (#99554)
Fixes https://github.com/pytorch/pytorch/issues/99360

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99554
Approved by: https://github.com/voznesenskym
2023-04-21 15:19:47 +00:00
6b8ef8ea4c [BE] Build PyTorch with -Wnewline-eof (#99687)
This would avoid further regressions like the ones reported in https://github.com/pytorch/pytorch/pull/96668#issuecomment-1468029259

Surround some ONNX/flatbuffer includes with `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wnewline-eof")` cone of shame

Fixes https://github.com/pytorch/pytorch/issues/96747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99687
Approved by: https://github.com/kit1980
2023-04-21 14:46:47 +00:00
ts
dbf0db958f Fix torch.nn.FractionalMaxPool2d output_size error (#99507)
Fixes #99148 , raising an error if output_ratio's size > 2.

Justification for changes:

If an output size is not specified but an output ratio is, we call fractional_max_pool2d_with_indices. We then generate the value of output_size based on the first two integers of the output_ratio (line ~480 of torch.nn.functional.py).

Thus, we should raise a value error in the case that the user passes an output_ratio (instead of an output_size) and the number of elements in output_ratio exceeds two. We must raise an error before calling torch._C._nn.fractional_max_pool2d as the value of output_size passed into torch._C._nn.fractional_max_pool2d is guaranteed to be of size 2 (as the existing code generates it from the first two indices of the passed-in ratio).
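
A hedged illustration of the case the new check guards against (the exact error type and raise site may differ slightly):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 16, 16)
out = F.fractional_max_pool2d(x, kernel_size=2, output_ratio=(0.5, 0.5))  # valid: two ratios
# F.fractional_max_pool2d(x, kernel_size=2, output_ratio=(0.5, 0.5, 0.5))
# ^ now raises instead of silently using only the first two values
```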

I would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99507
Approved by: https://github.com/mikaylagawarecki
2023-04-21 14:38:25 +00:00
9861ec9785 Revert "[c10d] Faster coalescing (#98793)"
This reverts commit db456ab83da6a505dcebc128903d5ee4fc2d5712.

Reverted https://github.com/pytorch/pytorch/pull/98793 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-21 09:15:04 +00:00
da57d597e1 Revert "fix onednn ConvTranspose2d channels last issue when ic=1 (#99539)"
This reverts commit 233cc34d3b8a1b92eeeea78661463f8ec660fbcd.

Reverted https://github.com/pytorch/pytorch/pull/99539 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-21 08:44:28 +00:00
9df8b1b594 Init comm_nonblocking_ when creating AutoNcclGroup (#99679)
A quick, trial fix for #99677.

My guess is that when the code instantiates an `AutoNcclGroup` object, it comes with an uninitialized random value for member `comm_nonblocking_`. Then `if (comm_nonblocking_)` evaluates to true, and `NCCL_CHECK_TIMEOUT` triggered.

This change is safe (and needed) anyway, whether or not it indeed fixes #99677.

Cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99679
Approved by: https://github.com/eqy, https://github.com/awgu
2023-04-21 07:56:39 +00:00
24bf15fe8d Support record_stream in dispatch mode (#99529)
Summary:
Issuing a `t.record_stream(s)` call while a `TorchDispatchMode` is active was throwing because PyTorch was unable to convert a c10::Stream back to a Python object. It's now fixed.
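
A minimal repro-style sketch of the scenario (the mode here is a bare pass-through, purely illustrative):

```python
import torch
from torch.utils._python_dispatch import TorchDispatchMode

class PassthroughMode(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

s = torch.cuda.Stream()
t = torch.randn(4, device="cuda")
with PassthroughMode():
    t.record_stream(s)  # previously raised while converting the c10::Stream back to Python
```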

Fixes https://github.com/pytorch/pytorch/issues/94403

Test Plan: Added a unit test

Differential Revision: D45117566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99529
Approved by: https://github.com/albanD
2023-04-21 07:17:19 +00:00
0ac0d9d224 Pass locals to enum_repr to correctly make the guard str for enums (#99680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99680
Approved by: https://github.com/jansel
2023-04-21 07:14:49 +00:00
8ee59280d7 Bug - check config for dynamic (#99676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99676
Approved by: https://github.com/ezyang
2023-04-21 06:40:09 +00:00
a421c54753 [exir][delegate] torch.ops.call_delegate (#92562)
Summary:
Followup diffs to integrate this op into the other parts of the delegate workflow.
The unittest results in the following graph:

```
graph():
    %x_1 : [#users=1] = placeholder[target=x_1]
    %y_1 : [#users=1] = placeholder[target=y_1]
    %lowered_module_0 : [#users=1] = get_attr[target=lowered_module_0]
    %call_delegate : [#users=1] = call_function[target=torch.ops.call_delegate](args = (%lowered_module_0, forward, %x_1, %y_1), kwargs = {})
    return call_delegate
```

Test Plan: buck2 run //executorch/exir/tests:delegate -- -r "test_call_delegate"

Differential Revision: D42329287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92562
Approved by: https://github.com/voznesenskym
2023-04-21 06:31:23 +00:00
9def799097 [combined tracebacks] missing gil acquire (#99685)
When this code was refactored, the GIL for appendSymbolized
was dropped accidentally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99685
Approved by: https://github.com/davidberard98
2023-04-21 06:24:36 +00:00
daff040886 [inductor] skip triton.Config that spills (#99385)
TL;DR: I did a quick study of register spills in max-autotune and coordinate descent tuning. The conclusion is that for pointwise/reduction kernels, register spilling is rare in inductor (which means the configs we consider are relatively reasonable), but it does happen sometimes. TBH, this PR is not going to help reduce compilation time for max-autotune/coordinate descent tuning much because register spilling is very rare. But this PR only contains 2 lines of significant code change, so I guess it's still good to merge considering ROI and code complexity.

# Register Spill in Max-Autotuner
I ran command
```
rm -rf /tmp/torchinductor_shunting_tmp && time TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_tmp python -u benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only ${MODEL} --disable-cudagraphs --training 2>&1 | tee /tmp/mylog
```
and then analyze the log.
$ cat /tmp/mylog | grep 'nspill' | wc -l

will show the total number of triton.Config's we benchmark;

$ cat /tmp/mylog  | grep 'nspill' | grep -v 'nspill 0'

will show the number of triton.Config's that spill registers.

Checked 5 models
- hf_Bert 0 spills
- resnet50: 2 out of 199 triton.Config's spill. For the 2 configs that spill, they are suboptimal according to the log: https://gist.github.com/shunting314/7ea30a9dafad7156919a99df5feba0ee
- timm_vision_transformer: 2/77 spills. The spilled configs are again sub-optimal: https://gist.github.com/shunting314/a48cbcfb14a07c0b84555e2cf7154852
- BERT_pytorch: 0/123 spills
- timm_resnest 0/255 spills

# Register Spill in Coordinate Descent Tuner
I ran command
```
rm -rf /tmp/torchinductor_shunting_tmp && time TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_tmp TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0  python -u benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only ${MODEL} --disable-cudagraphs --training 2>&1 | tee /tmp/mylog

```

and then analyze the log.

$ cat /tmp/mylog | grep COORDESC | wc -l
shows the total number of configs considered by the coordinate descent tuner

$ cat /tmp/mylog | grep COORDESC | grep -v 'nspill 0'
shows the ones that spill.

Checked 3 models
- hf_Bert (log https://gist.github.com/shunting314/bd943887e77609c7c8b323fe3f554c85 )
0/525 spills
- resnet50: 0/783 spills
- timm_vision_transformer: 2/380 (log https://gist.github.com/shunting314/6231f06c1398e0cddb2a96bf52389c78 )
the 2 spilled one are sub-optimal

# Ignore Spilled Config

With this PR, I run test tests for timm_vision_transformer and  can see all 4 spilled configs (2 for max-autotune and 2 for coordinate descent tuner according to the study above) are skipped for benchmarking:
```
[2023-04-18 00:03:37,291] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 16, YBLOCK: 512, num_warps: 8, num_stages: 1 because of register spilling: 6
[2023-04-18 00:04:50,523] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 64, RBLOCK: 64, num_warps: 8, num_stages: 1 because of register spilling: 626
[2023-04-18 00:04:50,523] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 8, RBLOCK: 512, num_warps: 8, num_stages: 1 because of register spilling: 778
[2023-04-18 00:05:47,170] torch._inductor.triton_heuristics: [DEBUG] Skip config XBLOCK: 1, num_warps: 2, num_stages: 1 because of register spilling: 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99385
Approved by: https://github.com/jansel
2023-04-21 06:21:09 +00:00
9bece55a7e Disable XProtect on MacOS runner (#99506)
A theory is that something else on the runner removes the file like Windows Defender.  The number one suspect is `com.apple.XProtect.daemon.scan` https://support.apple.com/guide/security/protecting-against-malware-sec469d47bd8/web

Spot checking on some runners:

* On 13.x (13.3.1 and 13.2.1), the daemon is now called `com.apple.XProtect.daemon.scan`
```
sh-3.2$ sudo launchctl list | grep -i protect
8048	-9	com.apple.XprotectFramework.PluginService
8047	-9	com.apple.XProtect.daemon.scan
```

* On 12.4, the daemon is called `com.apple.XprotectFramework`
```
sudo launchctl list | grep -i protect
-	-9	com.apple.XprotectFramework.PluginService
-	-9	com.apple.XprotectFramework.scan
```

Looking at the list of failures in https://hud.pytorch.org/failure/ModuleNotFoundError%3A%20No%20module%20named%20'sympy', I can confirm that the issue happens with both MacOS 12 and 13 as I can find examples on both.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99506
Approved by: https://github.com/malfet
2023-04-21 06:10:11 +00:00
63690afc6c Make CI error on inductor fallback when decomp is available (#99473)
Fixes #99446

Remove the warning, as that annoyed end-users who don't know what to do about it.

Instead, try to hold the line by preventing any decomp from being added without making
the corresponding change to inductor's fallbacks.

Note: we probably still need to better document how to update inductor's decomps,
for now it's pretty much "go ask the inductor team for advice"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99473
Approved by: https://github.com/ezyang, https://github.com/ngimel, https://github.com/jansel
2023-04-21 05:47:28 +00:00
deaf983bdb [Inductor][quant]Enable decomposed.quant/dequant lowering and vec code gen (#99131)
**Summary**
Since the current quantization flow has not decomposed quant/dequant into prim ops, in this PR:

- We enable the quant/dequant decomposition as lowering inside inductor.
- For the `decomposed.quant/dequant.tensor` overloads, there are loads of the scalar tensors for `zero point` and `scale`, so we need to enable vectorized code generation for these op overloads.
- Minor change: add `is_load_uint8_as_float` and `is_store_float_as_uint8` with default value `False` to `OptimizationContext`.

**TestPlan**
```
cd test/inductor && python -m pytest test_cpu_repro.py -k test_dequant_quant_lowering
```
co-author with @Xia-Weiwen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99131
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-21 04:33:02 +00:00
2a47f68586 inductor: fix onednn conv2d(transpose) packed issue when input size is three (#99601)
Fixes https://github.com/pytorch/pytorch/issues/99568; this PR adds an input size check before doing packed conv2d(transpose).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99601
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-21 04:32:35 +00:00
51742a467d [ONNX] Fix missing import numpy for docs example (#99663)
Fixes https://github.com/pytorch/pytorch/issues/99408
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99663
Approved by: https://github.com/justinchuby
2023-04-21 04:06:45 +00:00
16a4dc0f93 Correct typo for NCCL_MAJOR (#99482)
Correct Typo from NCCL_MACJOR to NCCL_MAJOR

Fixes Typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99482
Approved by: https://github.com/eqy, https://github.com/kwen2501
2023-04-21 03:27:08 +00:00
6427b849a3 Allow in graph einops operators (#99631)
Coordinating with arogozhnikov from the einops team, allowing specific operators into the dynamo graph avoids dynamo tracing problems, provided the operators are screened for safety - they must not bake in unintended constants or take data-dependent control-flow paths.
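
The general mechanism, as a hedged sketch (whether einops registers its ops in exactly this way is an assumption on my part):

```python
import torch
import torch._dynamo as dynamo

def screened_op(x):
    # Safe to keep as a single graph node: no baked-in constants,
    # no data-dependent control flow.
    return x * 2 + 1

dynamo.allow_in_graph(screened_op)

@torch.compile
def fn(x):
    return screened_op(x).relu()
```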

Fixes #99031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99631
Approved by: https://github.com/jansel
2023-04-21 03:14:38 +00:00
716ba6851e Make testing._internal.common_utils safe to import (#99659)
In edge cases in CI, SLOW_TESTS_FILE is defined but does not point to an existing file.

Guessing this is due to a test case that opens a subprocess and changes the working directory but doesn't clean its env.

We shouldn't make importing common_utils fail, so issue a warning and proceed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99659
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-04-21 02:59:10 +00:00
d168161cd3 [dynamo] Fix example_inputs with unsqueeze_ (#98696)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98696
Approved by: https://github.com/yanboliang
2023-04-21 02:54:14 +00:00
0d2b55c459 [DTensor] Change Sharding algorithm to be in line with `torch.chunk()` (#98722)
As functional collectives are being updated, using tensor_split() as the underlying sharding algorithm would require padding and unpadding on multiple ranks. Therefore, we are changing the sharding algorithm to be in line with ``torch.chunk()`` to allow padding on the last two ranks in most scenarios.
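
A small illustration of the difference (5 elements over 4 shards; outputs follow the standard semantics of these ops):

```python
import torch

t = torch.arange(5)
[c.tolist() for c in torch.tensor_split(t, 4)]  # [[0, 1], [2], [3], [4]]
[c.tolist() for c in torch.chunk(t, 4)]         # [[0, 1], [2, 3], [4]] - only trailing shards shrink
```
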
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98722
Approved by: https://github.com/wanchaol
2023-04-21 02:05:22 +00:00
27f8eb8c2b add storage serialization methods for privateuse1 (#98920)
Add an entry for privateuse1 storage serialization (register_package) in _register_device_module.
1. Users only need to implement `privateuse1_tag` and `privateuse1_deserialize` in the device module of the custom device. When the device module is registered, these methods are registered with _package_registry for storage serialization (see the sketch below).
2. Provide a fixed sequence number 30 for privateuse1 in the storage serialization _package_registry list.
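
A loose sketch of the shape such a device module might take (the class name and method bodies are made up for illustration; a real backend also needs the usual privateuse1 backend setup):

```python
import torch

class MyDeviceModule:
    @staticmethod
    def privateuse1_tag(storage):
        # Return the location tag used when serializing storages on the custom device.
        return "privateuseone"

    @staticmethod
    def privateuse1_deserialize(storage, location):
        # Move a deserialized storage back onto the custom device named by `location`.
        return storage  # placeholder: a real backend would copy to its device here

torch._register_device_module("privateuse1", MyDeviceModule)
```
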
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98920
Approved by: https://github.com/ezyang
2023-04-21 01:51:08 +00:00
907f2dad7d [inductor] Update triton pin (#99209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99209
Approved by: https://github.com/ngimel
2023-04-21 01:12:48 +00:00
fdeee43650 Disable SDPA FlashAttention backward and mem eff attention on sm86+ for head_dim above 64 (#99105)
Expand sdpa_utils.h check to disable FlashAttention when using autograd and mem eff attention for the following cases
- head_dim > 64
- sm86 or newer

Previously we only disabled these kernels on sm86 and for head_dim equal to 128.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99105
Approved by: https://github.com/malfet
2023-04-21 01:00:15 +00:00
fc8fa6c356 Require at least one tensor to be marked dynamic with --dynamic-batch-only (#99620)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99620
Approved by: https://github.com/voznesenskym
2023-04-21 00:17:08 +00:00
abdd1f4a38 Reuse tracing context and fake tensors from backwards in forwards (#99619)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99619
Approved by: https://github.com/wanchaol
2023-04-20 22:39:48 +00:00
bbfd577b7c bug-report.yml fix broken link (#99425)
fix link in ISSUE_TEMPLATE
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99425
Approved by: https://github.com/kit1980
2023-04-20 22:30:31 +00:00
9f95032101 Fix broken links in contribution_guide.rst (#99295)
mainly from `master` to `main`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99295
Approved by: https://github.com/kit1980
2023-04-20 22:20:56 +00:00
c9b08a087d [Dynamo] Merge symbolic_converter SETUP_WITH & BEFORE_WITH (#99651)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99651
Approved by: https://github.com/williamwen42
2023-04-20 22:12:38 +00:00
c412056921 Temporarily move ROCm to unstable (#99579)
CI SEV https://github.com/pytorch/pytorch/issues/99578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99579
Approved by: https://github.com/orionr
2023-04-20 20:34:54 +00:00
37bcdb98f6 Fix buck parsing in OSS build (#99648)
By removing `@fbsource` cell prefix from `pt_ops.bzl`

Fixes https://github.com/pytorch/pytorch/issues/99642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99648
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-04-20 20:17:38 +00:00
22af604e1b [quant][pt2] Add Conv + BN fusion for prepare QAT (#98568)
**Summary:** This commit adds the `prepare_qat_pt2e` API and the
fusion logic for Conv + BN. We use the subgraph rewriter to
match and replace the pattern with the existing logic in
`nniqat.ConvBn2d`. Note this is not the end-to-end flow yet.
In particular, the convert flow needs to swap the new subgraph
with another one that merges the batchnorm stats back into conv.

The Conv + BN fusion is implemented in the following steps:

1. Annotate all nodes in the pattern `[conv - bn - getitem]`

2. Match and replace this pattern with the fused QAT pattern
   (note that this is a larger subgraph than the original one)

3. Copy over metadata from the original nodes to the
   corresponding nodes in the new subgraph, to ensure the
   stack traces and dtype annotations are preserved

4. Prepare will insert fake quantizes in the right places
   based on the annotations

**Test Plan:**
python test/test_quantization.py TestQuantizePT2E.test_qat_conv_bn_fusion

**Reviewers:** jerryzh168, kimishpatel, yanboliang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98568
Approved by: https://github.com/kimishpatel
2023-04-20 20:15:28 +00:00
418a9fb9d8 [reland][inductor] coordinate descent tuning upon max-autotune (#99594)
Reland https://github.com/pytorch/pytorch/pull/97203 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99594
Approved by: https://github.com/jansel
2023-04-20 19:55:52 +00:00
b87c7ab6d6 Remove redundant found_inf recompute from _step_supports_amp_unscaling path (#98620)
following https://github.com/pytorch/pytorch/pull/97415#issuecomment-1499787115.

Rel: https://github.com/pytorch/pytorch/pull/98613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98620
Approved by: https://github.com/janeyx99
2023-04-20 19:24:09 +00:00
ts
a8e1893b7c Clarify error message of torch.nn.functional.embedding_bag (#99471)
Fixes #99221  , clarifying the error message to highlight the index from inputs which is responsible for the out-of-bounds error,  while maintaining the reference to the relevant index of offsets as a secondary consideration.

Also takes care of some spelling/grammatical issues with another message (primarily "yout" changed to "your").

Would be happy to iterate on this if there are any issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99471
Approved by: https://github.com/albanD
2023-04-20 19:18:46 +00:00
e68e84ef8a [dynamo] Support BUILD_MAP_UNPACK (#98664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98664
Approved by: https://github.com/yanboliang, https://github.com/voznesenskym
2023-04-20 18:41:50 +00:00
c19d19f6ff [profiler] support cuLaunchKernel (for triton kernel launches) & update kineto submodule (#99571)
**Background**: Prior to this PR, traces for PT2 w/ inductor don't contain connections between CUDA kernels and the CPU launch site. This PR adds those connections.

**Details**: Triton kernels launched by inductor use cuLaunchKernel instead of cudaLaunchKernel. cuLaunchKernel is part of the driver API, while cudaLaunchKernel is part of the runtime API. In order to support cuLaunchKernel, we added support in kineto (pytorch/kineto#752) to also start listening to driver events; hence why we need to update the kineto submodule.

After the change in kineto, we just need to turn this on in the PyTorch repo by adding the CUDA_DRIVER activity type to the CPU and CUDA activity type lists.

**Testing**: Added test/inductor/test_profiler.py to check for `cuLaunchKernel` in json trace files.

Also, I ran this test:

```python
import torch

x = torch.rand((2, 2), device='cuda')

def fn(x):
    return x.relu()

fn_c = torch.compile(fn)
fn_c(x)

with torch.profiler.profile(with_stack=True) as prof:
    fn_c(x)

prof.export_chrome_trace("relu_profile.json")
```

which generated this chrometrace:
<img width="930" alt="Screenshot 2023-04-18 at 2 58 25 PM" src="https://user-images.githubusercontent.com/5067123/232966895-b65f9daf-7645-44f8-9e2b-f8c11c86ef0a.png">

in which you can see flows between a `cuLaunchKernel` on the CPU side, and the triton kernel on the GPU.

**Kineto Updates**: To get the kineto-side changes required for cupti driver events, this PR updates the kineto pin. In that updated kineto submodule, we also have:
* JSON string sanitizing for event names (likely fix for #99572)
* cuda initialization fixes for multiprocessing
* cuKernelLaunch events (i.e. for this PR)
* DISABLE_CUPTI_LAZY_REINIT (from @aaronenyeshi)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99571
Approved by: https://github.com/ngimel, https://github.com/aaronenyeshi
2023-04-20 18:34:41 +00:00
5315317b7b Skip some detectron2_maskrcnn models with KeyError _ignore_torch_cuda_oom (#99599)
These tests are failing in trunk 233cc34d3b with `KeyError: '_ignore_torch_cuda_oom'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99599
Approved by: https://github.com/malfet
2023-04-20 18:11:35 +00:00
7c3fa5c70d Revert "Build Windows binaries with Visual Studio 2022 Build Tools (#90855) (#99591)
This reverts commit a88c15a849152291b1ebdab13860726dd8be1d81.  Once we have the AMI ready, we can revert this and use VS2022 again.  This is to mitigate flaky Windows build in trunk https://github.com/pytorch/builder/issues/1387.

Note that as VS2019 is already available in the current AMI, it won't be installed again per logic in https://github.com/pytorch/builder/blob/main/windows/internal/vs2019_install.ps1#L25-L29. Thus, this helps avoid the flaky installation issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99591
Approved by: https://github.com/kit1980, https://github.com/Blackhex, https://github.com/malfet
2023-04-20 17:57:10 +00:00
0a98289af3 Stop testing if CUDA is initialized on teardown (#99627)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99627
Approved by: https://github.com/jansel, https://github.com/huydhn
2023-04-20 17:54:48 +00:00
aa4ed332c3 Improve torch.cond useability: Return UserError with actionable error messages (#98909)
It's part of the effort to improve PT2 Export UX. This PR is to improve the usability of `torch.cond()` by separating user errors from the dynamo internal errors. By definition, user error means the usage of `torch.cond()` violates the restrictions of this API therefore needs users to take action and fix the error.

In this notebook N3363227 we discovered a bunch of limitations of using `torch.cond(pred, true_fn, false_fn, operands)`. In summary, the limitations can be categorized as:
 - predicate restriction (`pred`)
 - operands restriction (`operands`)
 - branch restriction (`true_fn` & `false_fn`)

The error message will be more accurate about where the (user) error is from and more actionable for users to fix it.

For example, `operands` must be a list of tensors and the signature of `true_fn` and `false_fn` must match with the `operands`.
If the operands contain non-tensor types, the user will see an error message like:
```
torch._dynamo.exc.UserError: Expected a list of tensors but got ["<class 'torch.Tensor'>", "<class 'float'>"]

from user code:
   File "~/pytorch/test/dynamo/test_export.py", line 2504, in f_non_tensor_operands
    return cond(True, lambda x, a: x.sin(), lambda x, a: x.cos(), [x, a])
```
If the signature of the branch function doesn't match `operands`, the user will see an error message like:
```
torch._dynamo.exc.UserError: too many positional arguments.
  func = 'false_fn' ~/pytorch/test/dynamo/test_export.py:2514, args = [<class 'torch.Tensor'>, <class 'torch.Tensor'>], kwargs = {}
```
Or if the tensors returned from the user-defined branches have different metadata, e.g. shapes, dtypes, etc., the user will see an error message like:
```
TypeError: Expected each tensor to have same metadata but got:
  cond_true_0 returns TensorMetadata(shape=torch.Size([2, 1]), dtype=torch.int64, requires_grad=False, stride=(1, 1), memory_format=torch.contiguous_format, is_quantized=False, qparams={})
  cond_false_0 returns TensorMetadata(shape=torch.Size([1]), dtype=torch.float32, requires_grad=False, stride=(1,), memory_format=torch.contiguous_format, is_quantized=False, qparams={})
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98909
Approved by: https://github.com/jansel
2023-04-20 17:20:41 +00:00
e47e8c9d98 Guard on default device (#99551)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99551
Approved by: https://github.com/voznesenskym, https://github.com/mlazos
2023-04-20 17:02:59 +00:00
88c45a1954 [SPMD] Allow users to dynamically pass the last_iter to IterGraphModule (#99575)
The current design of IterGraphModule requires users to specify the concrete iteration count, which is not always possible and not very precise. This PR introduces `last_iter` to IterGraphModule.forward(), which allows users to dynamically specify the last iteration.
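
As a toy illustration of the calling pattern (this is not the actual IterGraphModule API; the class and argument names below are assumptions for illustration only):

```python
class ToyIterModule:
    def forward(self, batch, last_iter=False):
        if last_iter:
            # e.g. flush any work that was delayed to the final iteration
            pass
        return batch

mod = ToyIterModule()
data = [1, 2, 3]
for i, batch in enumerate(data):
    # the caller decides per call whether this is the last iteration,
    # instead of committing to a fixed iteration count up front
    mod.forward(batch, last_iter=(i == len(data) - 1))
```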

Differential Revision: [D45129585](https://our.internmc.facebook.com/intern/diff/D45129585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99575
Approved by: https://github.com/lessw2020
2023-04-20 16:49:34 +00:00
7acb0bdd22 Fallback for Complex Dtypes in Inductor (#97198)
Differential Revision: [D44257054](https://our.internmc.facebook.com/intern/diff/D44257054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97198
Approved by: https://github.com/ngimel
2023-04-20 16:45:19 +00:00
638feec4e3 Turn on meta converter for complex (#98869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98869
Approved by: https://github.com/ngimel
2023-04-20 16:42:38 +00:00
df84d74058 Allow getting type of ScriptObject (#99542)
Summary:
A very old refactor (https://github.com/pytorch/pytorch/pull/29500) split ScriptModule into ScriptObject (base class) and ScriptModule (subclass). When moving methods around, the `_type` method was moved from ScriptModule to ScriptObject, but the type of its argument wasn't changed. Therefore, it is now impossible to invoke `_type` on a ScriptObject.

The reason I need this fix is that I am using PyTorch's dispatch mode to intercept some operators that accept/return custom classes, which end up being encoded as ScriptObject, and in order to properly handle them I need to be able to verify their type.

Test Plan: N/A

Differential Revision: D45118675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99542
Approved by: https://github.com/albanD
2023-04-20 16:10:19 +00:00
971df458db Reland of "Python binding to set/get CUDA rng state offset" (#99565)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Reland of https://github.com/pytorch/pytorch/pull/98965

(cherry picked from commit 8214fe07e8a200e0fe9ca4264bb6fca985c4911e)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99565
Approved by: https://github.com/anijain2305
2023-04-20 15:42:25 +00:00
f4354b2a5e [dynamo] Support dict kwargs constructor (#98660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98660
Approved by: https://github.com/yanboliang
2023-04-20 15:40:00 +00:00
c17ff0ed36 Print AOT Autograd graph name when accuracy failed (#99366)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99366
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-04-20 15:35:47 +00:00
4721553431 [vmap] Fix searchsorted batch rule for self_logical_rank == 0 (#99526)
Fixes #95888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99526
Approved by: https://github.com/zou3519
2023-04-20 14:59:12 +00:00
2ad02d00b9 [BE] stdint.h->cstdint (#99570)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at cf8834b</samp>

Updated `Generator.h` to use C++ fixed-width integers from `std` instead of C ones. This avoids potential conflicts with other libraries or platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99570
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-04-20 14:23:45 +00:00
35ad5122d2 Revert "[vmap] Fix searchsorted batch rule for self_logical_rank == 0 (#99526)"
This reverts commit 6580b160d35a75d5ceebf376d55422376d0c0d2c.

Reverted https://github.com/pytorch/pytorch/pull/99526 on behalf of https://github.com/zou3519 due to Regressed behavior
2023-04-20 13:19:49 +00:00
ccd5ad816e inductor(CPU): add ISA check before do cpu fx packed weight (#99502)
1. This PR is to fix https://github.com/pytorch/pytorch/issues/99423, which will add an ISA check before doing the bf16 weight pack.
2. Move CPU-related tests from ```test_torchinductor.py``` to ```test_cpu_repo.py``` to reduce the CI time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99502
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-04-20 08:37:15 +00:00
4d8906885e Revert "Temporarily move ROCm to unstable (#99579)"
This reverts commit d06624d3c4c2ffe0c8e1587dd9fab62a7c7a5be6.

Reverted https://github.com/pytorch/pytorch/pull/99579 on behalf of https://github.com/kit1980 due to No longer needed
2023-04-20 07:10:21 +00:00
21e88a543b Revert "Fix trailing spaces lint (#99581)"
This reverts commit fbdb86c1747737c744ad79b5da6bcbd064dc982e.

Reverted https://github.com/pytorch/pytorch/pull/99581 on behalf of https://github.com/kit1980 due to Reverting the previous PR
2023-04-20 07:07:51 +00:00
96a262d666 Revert "Allow in graph einops operators (#99478)"
This reverts commit 309b7edfe1342197ee4f520ceebf0e15127c0f57.

Reverted https://github.com/pytorch/pytorch/pull/99478 on behalf of https://github.com/kit1980 due to dynamo/test_after_aot.py::TestAfterAot::test_save_graph_repro - AssertionError, see https://github.com/pytorch/pytorch/actions/runs/4750274195/jobs/8438535867
2023-04-20 06:42:35 +00:00
edd2507c73 [functorch] Prevent using for-loop for out-of-place index_fill batch rule (#99229)
A follow-up PR for https://github.com/pytorch/pytorch/pull/91364#discussion_r1060723192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99229
Approved by: https://github.com/kshitij12345
2023-04-20 06:40:32 +00:00
a2a4144256 [FSDP]Make param_groups optional for FSDP optim state dict (#99117)
Make param_groups optional for FSDP optim state dict and add corresponding test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99117
Approved by: https://github.com/fegin, https://github.com/zhaojuanmao
2023-04-20 06:34:40 +00:00
68bc0fc012 [inductor] a script to benchmark the perf impact from tensor layout (#99583)
Follow-up on Jason's idea of tensor layout tuning. Add a script to show the perf impact of layout on convolution (will add more cases like batch/layer norm and reduction to the script).

For convolution, a quick test shows that using the channels-last layout, we get a 1.4x speedup:
```
baseline 4.509183883666992 test 3.178528070449829 speedup 1.419x
```

The speedup also depends on the input/weight shapes. E.g., changing the input channels in the test from 3 to 8, we see the speedup grow to 2.1x.
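
For reference, a rough sketch of this kind of comparison (an assumed, simplified stand-in for the added script, not the script itself; requires a CUDA device):

```python
import torch

conv = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1).cuda()
x = torch.randn(32, 3, 224, 224, device="cuda")

def bench(mod, inp, iters=10):
    # time `iters` forward passes with CUDA events
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        mod(inp)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

baseline = bench(conv, x)
cl = bench(conv.to(memory_format=torch.channels_last),
           x.to(memory_format=torch.channels_last))
print(f"baseline {baseline:.3f} ms, channels_last {cl:.3f} ms, speedup {baseline / cl:.3f}x")
```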

The trace shows cudnn calls different kernels when input layout changes to channels last.

<img width="997" alt="Screenshot 2023-04-19 at 5 27 54 PM" src="https://user-images.githubusercontent.com/52589240/233228656-4bdcac0a-7633-416a-82e1-17d8dc8ea9a6.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99583
Approved by: https://github.com/jansel
2023-04-20 06:26:10 +00:00
da322ea874 Enable torch.jit.load for custom device (#99535)
Fixes #ISSUE_NUMBER
1. torch.jit.load for custom device
```
# custom device named `foo`
ts_model = torch.jit.script(model.to(device="foo"))
ts_model.save("./ts.pt") # it is a script model on device `foo`

# and then we want to load it and run it
torch.jit.load("./ts.pt")
```
2. Add some extra keys for custom devices with `privateuse1`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99535
Approved by: https://github.com/albanD
2023-04-20 05:37:57 +00:00
6580b160d3 [vmap] Fix searchsorted batch rule for self_logical_rank == 0 (#99526)
Fixes #95888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99526
Approved by: https://github.com/zou3519
2023-04-20 05:08:20 +00:00
c0674c439c [vmap] Add max_pool3d batch rule (#99522)
Also add a helper to integrate `max_pool2d_with_indices` and `max_pool3d_with_indices`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99522
Approved by: https://github.com/zou3519
2023-04-20 05:08:19 +00:00
d31a00e713 [vmap] Add max_pool1d batch_rule (#99517)
Fixes #97558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99517
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-04-20 05:08:17 +00:00
233cc34d3b fix onednn ConvTranspose2d channels last issue when ic=1 (#99539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99539
Approved by: https://github.com/mingfeima
2023-04-20 04:41:33 +00:00
3af467eff4 inductor: support sqrt for dynamic shape (#99514)
When running the TIMM ```convit_base``` dynamic shape case, there is always an AssertionError; see https://github.com/pytorch/pytorch/issues/97877.

A simple reproduce code is:
```
import torch
import torch._dynamo
import torch._dynamo.config as config

config.dynamic_shapes=True
torch._dynamo.config.assume_static_by_default=False
class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        B, N, C = x.shape
        return self.get_rel_indices(N)

    def get_rel_indices(self, num_patches: int) -> torch.Tensor:
        img_size = int(num_patches ** .5)
        #rel_indices = torch.zeros(1, num_patches, num_patches, 3)
        ind = torch.arange(img_size)
        return ind

model = Model().eval()
opt_model = torch._dynamo.optimize('inductor')(model)

x = torch.randn(8, 8, 8)
ref = model(x)
with torch.no_grad():
    for i in range(3):
        out = opt_model(x)

```

For this code, the generated kernel will look like this:
```

kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/x5/cx5442c6dcuxsrrlnqi476yzjlgc6g53ukppuaettiyp6dszhmr4.h"
extern "C" void kernel(long* out_ptr0,
                       const long ks0)
{
    {
        #pragma GCC ivdep
        for(long i0=static_cast<long>(0L); i0<static_cast<long>(std::floor(std::sqrt(ks0))); i0+=static_cast<long>(1L))
        {
            auto tmp0 = static_cast<long>(i0);
            out_ptr0[static_cast<long>(i0)] = tmp0;
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99514
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-04-20 04:22:49 +00:00
27a43c0242 inductor: add input type check for fuse_attention (#99296)
For TIMM ```xcit_large_24_p8_224```, the scale factor is a tensor (https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/xcit.py#L205), and ```scaled_dot_product_attention``` doesn't support that. This PR adds a check so that the fusion is only done when the scale factor is a float/int value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99296
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-04-20 04:05:32 +00:00
309b7edfe1 Allow in graph einops operators (#99478)
Coordinating with @arogozhnikov from einops team, allowing specific operators in the dynamo graph avoids dynamo tracing problems provided the operators are screened for safety - they must not bake in unintended constants or take data-dependent control flow paths.
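
For context, a small sketch of the underlying mechanism (this uses the public `torch._dynamo.allow_in_graph` hook on a made-up function, not the einops operators themselves):

```python
import torch
import torch._dynamo as dynamo

def my_rearrange(x):
    # stand-in for an einops-style operator
    return x.transpose(0, 1).contiguous()

# keep the call as a single node in the dynamo graph instead of tracing into it
dynamo.allow_in_graph(my_rearrange)

@torch.compile
def fn(x):
    return my_rearrange(x) + 1

print(fn(torch.randn(2, 3)).shape)
```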

Fixes #99031

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99478
Approved by: https://github.com/jansel
2023-04-20 03:40:50 +00:00
95ca8e589d [ONNX] Update install_onnx.sh: onnx-script -> onnxscript (#99572)
The repository was renamed. The package is not yet renamed. We should update the package name when we bump the commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99572
Approved by: https://github.com/BowenBao
2023-04-20 01:59:56 +00:00
789070986c [Dynamo] Implementing generic context manager by inlining __enter__ and __exit__ (#98725)
This is a draft version of the generic context manager support; I believe there are some scenarios that I didn't anticipate. I posted this draft for discussion and to check if this is the right direction.
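
For illustration, the kind of user-defined context manager this change is meant to let dynamo trace (a minimal assumed example, not a test from the PR):

```python
import torch

class Scale:
    def __init__(self, factor):
        self.factor = factor
    def __enter__(self):
        return self.factor
    def __exit__(self, exc_type, exc, tb):
        return False

def fn(x):
    with Scale(2.0) as s:  # __enter__/__exit__ get inlined instead of graph-breaking
        return x * s

print(torch.compile(fn)(torch.ones(3)))
```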

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98725
Approved by: https://github.com/jansel
2023-04-20 01:16:15 +00:00
fbdb86c174 Fix trailing spaces lint (#99581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99581
Approved by: https://github.com/huydhn
2023-04-20 00:38:04 +00:00
d06624d3c4 Temporarily move ROCm to unstable (#99579)
CI SEV https://github.com/pytorch/pytorch/issues/99578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99579
Approved by: https://github.com/orionr
2023-04-20 00:11:10 +00:00
805a6dc8d2 Add an expect test for test_save_graph_repro (#99538)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99538
Approved by: https://github.com/anijain2305
2023-04-20 00:00:40 +00:00
36acad58b6 [quant][pt2e][refactor] Move the annotation for observer sharing ops into separate util (#99384)
Summary:
In order to keep the quantizer simple, we want to move the annotation code for operators like flatten, hardtanh, etc. to
a separate utility function that is called after the quantizer annotation is done. This makes these ops (an operator list) not
configurable by the user, and also makes prepare_pt2e operator-aware instead of operator-agnostic. This design is not final;
we may change it in the future if we find there are use cases that need these to be configurable, or if we feel it is important for prepare_pt2e
to stay agnostic to operators/operator patterns.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_obs_sharing_ops

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D45071006](https://our.internmc.facebook.com/intern/diff/D45071006)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99384
Approved by: https://github.com/kimishpatel
2023-04-19 23:49:33 +00:00
1b9edb680f increment generation in run only context (#99099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99099
Approved by: https://github.com/anijain2305
2023-04-19 23:41:08 +00:00
c660db2074 Adding vmap support for special bessel functions (#99543)
Fixes #91402
## Description
This PR adds vmap support for the following bessel functions under torch.special.
*  special.bessel_j0
*  special.bessel_y0
*  special.bessel_j1
*  special.modified_bessel_i0
*  special.bessel_y1
*   special.scaled_modified_bessel_k0
*   special.scaled_modified_bessel_k1
*   special.modified_bessel_i1
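
A minimal usage check of the added support (an assumed example, not from the PR):

```python
import torch
from torch.func import vmap

x = torch.rand(4, 16)
out = vmap(torch.special.bessel_j0)(x)  # batched over the leading dimension
print(out.shape)  # torch.Size([4, 16])
```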

## Files changed:
1. [aten/src/ATen/functorch/BatchRulesUnaryOps.cpp](https://github.com/pytorch/pytorch/pull/99543/files#diff-a629acd680b2c8639049755617fe89f803cd1001d9936e95d7bf4e388f41c6b8)
2. [test/functorch/test_vmap.py](https://github.com/pytorch/pytorch/compare/main...SiddharthIVEX:pytorch:sid/vmap_special_bessel?expand=1#diff-17b0cd027c7b1ca042fcfe21cc86284d6e58fa46039f7e4297b22b8e02b68fea)

## How was the PR tested?
1. The unit tests under `test_vmap.py` were run and all of them passed. The output is shown below.
```
configfile: pytest.ini
plugins: hypothesis-6.71.0, anyio-2.2.0
collected 2003 items / 1981 deselected / 22 selected

test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_bessel_j0_cpu_float32 PASSED                                           [  4%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_bessel_j1_cpu_float32 PASSED                                           [  9%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_bessel_y0_cpu_float32 PASSED                                           [ 13%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_bessel_y1_cpu_float32 PASSED                                           [ 18%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_modified_bessel_i0_cpu_float32 PASSED                                  [ 22%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_modified_bessel_i1_cpu_float32 PASSED                                  [ 27%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_modified_bessel_k0_cpu_float32 PASSED                                  [ 31%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_modified_bessel_k1_cpu_float32 PASSED                                  [ 36%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_scaled_modified_bessel_k0_cpu_float32 PASSED                           [ 40%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_scaled_modified_bessel_k1_cpu_float32 PASSED                           [ 45%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_op_has_batch_rule_special_spherical_bessel_j0_cpu_float32 PASSED                                 [ 50%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_bessel_j0_cpu_float32 PASSED                                             [ 54%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_bessel_j1_cpu_float32 PASSED                                             [ 59%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_bessel_y0_cpu_float32 PASSED                                             [ 63%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_bessel_y1_cpu_float32 PASSED                                             [ 68%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_modified_bessel_i0_cpu_float32 PASSED                                    [ 72%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_modified_bessel_i1_cpu_float32 PASSED                                    [ 77%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_modified_bessel_k0_cpu_float32 PASSED                                    [ 81%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_modified_bessel_k1_cpu_float32 PASSED                                    [ 86%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_scaled_modified_bessel_k0_cpu_float32 PASSED                             [ 90%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_scaled_modified_bessel_k1_cpu_float32 PASSED                             [ 95%]
test/functorch/test_vmap.py::TestVmapOperatorsOpInfoCPU::test_vmap_exhaustive_special_spherical_bessel_j0_cpu_float32 PASSED                                   [100%]

================================================================ 22 passed, 1981 deselected in 18.42s ================================================================

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99543
Approved by: https://github.com/zou3519
2023-04-19 23:39:56 +00:00
19788002e7 Remove a couple of additional places where we would construct tensors - aliases of params, inputs (#98950)
Removes two additional places where we would construct tensors
- Non-static inputs. These are only constructed to invoke the `copy_` kernel and do not own memory so we can construct them only once
- Aliases of persistent static inputs (parameters). the memory will be permanently live and is not managed by the cudagraph tapes.

(also sneaking in a bug fix around unaligned static idx)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98950
Approved by: https://github.com/jansel
2023-04-19 23:35:53 +00:00
3233450d07 Add TorchXLA option to benchmark runner (#99505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99505
Approved by: https://github.com/voznesenskym
2023-04-19 22:44:52 +00:00
6026caed1e Make HAS_CPU boolean lazy, speed up import time (#99537)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99537
Approved by: https://github.com/bertmaher, https://github.com/albanD
2023-04-19 22:12:10 +00:00
cf354a0491 Don't eagerly initialize CUDA when importing common_cuda (#99536)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99536
Approved by: https://github.com/Chillee, https://github.com/bertmaher, https://github.com/albanD
2023-04-19 22:12:10 +00:00
32cd05ae60 Package torch.fx type hints (#99541)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at ca3aab4</samp>

> _`fx` module traced_
> _Symbolic graphs transformed_
> _Type stubs for winter_

Fixes https://github.com/pytorch/pytorch/issues/99530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99541
Approved by: https://github.com/kit1980, https://github.com/Chillee
2023-04-19 22:00:07 +00:00
f6f35135a4 suggest constraints to specify for export based on generated shape guards (#98463)
The design of the export API expects constraints to be specified on dynamic dimensions, while assuming all other dimensions are static by default. However, a user who wishes to export a model may not be familiar enough with the code to plan what to specify.

This diff provides support for discovering constraints to specify. The basic idea is to take the set of generated shape guards and convert them into appropriate constraints. However, we usually generate a LOT of shape guards, and there is often a LOT of redundancy in them. Thus, we also need to simplify the guards so that our suggested constraints are concise yet capture the information content in the guards.

The algorithm for simplification uses `sympy` under the hood, but very surgically to avoid any risk of blowing up. See comments inline for a full description. Briefly,
1. We consider only univariate inequalities, and among them, solve for equalities first.
2. We substitute these exact solutions to convert multivariate inequalities progressively into univariate.
3. Remaining univariate inequalities are solved using `sympy.solvers.inequalities.reduce_inequalities`.
4. As pre-processing, we also eliminate all `//` and `%` operations to generate a set of linear congruence guards, and solve these using `sympy.ntheory.modular.solve_congruence`.

The results are quite dramatic. For example, an internal model produced several hundreds of guards with `dynamic_shapes=True`, which were pretty much inscrutable for humans. The summary contains around 30 dimensions that were specialized and 3 constraints on dynamic dimensions. The output format looks like this:
```
The following dimensions have been specialized and CANNOT be dynamic.
NOTE: Specializations will happen by default with `assume_static_by_default=True`.
	L['foo']['bar'].size()[0] == 4
        ...
	L['baz']['qux'].size()[3] == 96

The following dimensions CAN be dynamic.
You can use the following code to specify the constraints they must satisfy:
constraints=[
	dynamic_dim(L['blah']['bleh'], 1) == dynamic_dim(L['blah']['bloh'], 1),
        ...,
	2 <= dynamic_dim(L['blah']['bloh'], 1),
]
```

Differential Revision: [D44731747](https://our.internmc.facebook.com/intern/diff/D44731747/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98463
Approved by: https://github.com/voznesenskym, https://github.com/ezyang
2023-04-19 21:56:36 +00:00
04f7a2a5e1 Support module dict iter (#99503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99503
Approved by: https://github.com/Chillee, https://github.com/jansel
2023-04-19 21:54:35 +00:00
c75ac11fb5 [cond] error on closed over variables (#99367)
As reported in https://github.com/pytorch/pytorch/issues/90469, the implementation of inlining nested function branches for `cond` doesn't properly handle variables captured from outer scopes. This leads to some examples accidentally working, some others generating incorrect code that doesn't crash but does the wrong thing, and still others that outright crash because of references to non-existent variables.

Properly supporting closed variables is tricky (see https://github.com/pytorch/pytorch/pull/91981 for an abandoned attempt). While this is definitely something we should be able to support longer term, for now it is better to explicitly error and suggest the fix to the user (amounting to rewriting branches to take closed variables explicitly).
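
To make the suggested fix concrete, a sketch of the rewrite (the `functorch.experimental.control_flow` import path reflects this era of the codebase and is an assumption; it has since moved):

```python
import torch
from functorch.experimental.control_flow import cond

def true_fn(x, scale):
    return x * scale

def false_fn(x, scale):
    return x / scale

@torch.compile
def f(x, scale):
    # pass `scale` explicitly as an operand instead of closing over it in the branches
    return cond(x.sum() > 0, true_fn, false_fn, [x, scale])

print(f(torch.randn(3), torch.tensor(2.0)))
```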

Differential Revision: [D45058621](https://our.internmc.facebook.com/intern/diff/D45058621/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99367
Approved by: https://github.com/ezyang, https://github.com/tugsbayasgalan
2023-04-19 21:54:20 +00:00
237f917f5b [Profiler][Easy] Fix typo in Profiler report input shapes (#99430)
Summary:
There are two variables for profiler input shapes:
- In C++ interface: report_input_shapes
- In Python interface: record_shapes

Therefore record_input_shapes is a typo. We should also look to reducing redundant naming between the two.
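
For reference, the Python-side flag in question (a small assumed usage example):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    torch.mm(torch.randn(8, 8), torch.randn(8, 8))
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_time_total"))
```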

Test Plan: CI

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99430
Approved by: https://github.com/davidberard98
2023-04-19 21:50:52 +00:00
af7fed1d92 fix osd rank0_only in fsdp (#99136)
Fixes #99135

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99136
Approved by: https://github.com/fegin
2023-04-19 21:50:38 +00:00
2402fe5210 [memory allocator] fix ifdef typo (#99553)
The first PR went in with the expandable allocator accidentally disabled; this happened while trying to fix the build on unusual architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99553
Approved by: https://github.com/ezyang, https://github.com/eellison
2023-04-19 21:45:51 +00:00
495e1b4d0e Add device_asserts before indirect loads and stores (#98590)
This PR also adds a way to CSE statements (not only assignments).

The tests follow the pattern from https://github.com/openai/triton/pull/1143
They take a fair amount of time to run (90s on my box). If we wanted to
improve this, we could avoid testing the `ndim == 3` case.

Changes like this one make me hope that we get to clean up the number of lowerings we have at some point...

Generated code for `x[y]` with `x.shape == (3, 2, 4),  y.ndim == 1`:

With `dynamic=False`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < 3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < 3")
tmp1 = tl.load(in_ptr1 + (x0 + (8*tmp0)), xmask)
```
With `dynamic=True`:
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tmp1 = tl.load(in_ptr1 + (x0 + (ks1*ks2*tmp0)), xmask)
```

Generated code for `x[y+1, y+1]` with `x.shape == (3, 2, 4),  y.shape == (3, 3)`:
With `dynamic=False` (note how it folds the two upper bounds to `min(3, 2) == 2`):
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = 1
tmp2 = tmp0 + tmp1
tl.device_assert(((0 <= tmp2) & (tmp2 < 2)) | (~xmask), f"index out of bounds: 0 <= tmp2 < 2")
tmp3 = tl.load(in_ptr1 + (x0 + (12*tmp2)), xmask)
```

With `dynamic=True`:
```python
tl.device_assert(((0 <= tmp2) & (tmp2 < min(ks2, k1))) | (~xmask), f"index out of bounds: 0 <= tmp2 < min(ks2, ks1)")
```

The same works when the CSE'd variable appears 3 or more times, but then it generates `min(ks0, min(ks1, ks2))`

Generated code for `x[y] = z` with `x.ndim = 3`, `y.ndim = 1` and dynamic shapes
```python
tmp0 = tl.load(in_ptr0 + (x1), xmask)
tmp1 = tl.load(in_ptr1 + (x2), xmask)
tl.device_assert(((0 <= tmp0) & (tmp0 < ks3)) | (~xmask), f"index out of bounds: 0 <= tmp0 < ks3")
tl.store(out_ptr0 + (x0 + (ks1*ks2*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
```

Fixes https://github.com/pytorch/pytorch/issues/93538

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98590
Approved by: https://github.com/ngimel
2023-04-19 21:26:57 +00:00
9ac2b041c9 Make opacus xfail instead of skip (#99380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99380
Approved by: https://github.com/desertfire, https://github.com/anijain2305
2023-04-19 21:09:06 +00:00
cfacb5eaaa Revert "Use correct standard when compiling NVCC on Windows (#99492)"
This reverts commit db6944562efad201c7c1dc2fc0539b1f34012666.

Reverted https://github.com/pytorch/pytorch/pull/99492 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-04-19 20:51:26 +00:00
ca89e7942a [SPMD][Easy] switch to tree_map_only to simplify code (#99547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99547
Approved by: https://github.com/fegin
2023-04-19 20:40:09 +00:00
db6944562e Use correct standard when compiling NVCC on Windows (#99492)
Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D45108690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99492
Approved by: https://github.com/ezyang
2023-04-19 20:36:05 +00:00
db456ab83d [c10d] Faster coalescing (#98793)
### Description
The PR aims at reducing CPU overhead of context manager style coalescing.

By "context manager style coalescing", we mean:
Sync style:
```
with _coalescing_manager():
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
```
Async style:
```
with _coalescing_manager(async_ops=True) as cm:
     for i in range(num_coll):
         dist.all_reduce(tensors[i])
cm.wait()
```
In the previous implementation, each collective in the `num_coll` loop actually calls into the C++ backend, accumulating pybind overhead.

In the new implementation, we capture the collectives at Python level, and only fire towards C++ at the exit of the coalescing manager.

### Tests
In current PR, the "fast path" only applies to all-reduce.
- Flattened 512M: 16.38 ms, including CPU time 131.21 us
- Old _coalescing_manager 64 x 8M: 22.19 ms, including CPU time 2865 us
- New _coalescing_manager 64 x 8M: 16.93 ms, including CPU time 635 us

Hence a 4x reduction in CPU overhead (dependent on `num_coll`).

Cc @mrshenli @kumpera @wanchaol @fegin
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98793
Approved by: https://github.com/kumpera
2023-04-19 20:17:58 +00:00
bc9eaa7abf Run post-aot compiler at compilation time, not at runtime. (#99457)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99457
Approved by: https://github.com/anijain2305
2023-04-19 19:36:09 +00:00
7546972565 [BE] Refactoring test execution and improving comments (#99467)
Sharing code between the code that handles test results in parallel vs serial mode.

Note that the original version of this code had an inconsistency between the two versions where it would execute `print_to_stderr(err_message)` on every test that ran in parallel, but for serial tests it would only invoke `print_to_stderr(err_message)` if `continue_on_error` was also specified.  By sharing code, this PR changes that behavior to be consistent between the two modes.

Also adding some comments.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 029342c</samp>

> _Sing, O Muse, of the skillful coder who refined_
> _The PyTorch testing script, `run_test.py`, and shined_
> _A light on its obscure logic, with docstrings and comments_
> _And made it run more smoothly, with better error contents_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99467
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-19 19:29:07 +00:00
6ca991cacf [Composable API] Add fully_shard debug function to print sharded tree structure, module names and managed param fqns (#99133)
Adding a fully_shard debug function to print the sharded tree structure in the following format, and return module names and their managed parameter FQNs as well.

![Screenshot 2023-04-18 at 5 14 54 PM](https://user-images.githubusercontent.com/48731194/232931628-169a63a9-b4d5-4902-9cfd-f40113f3ec98.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99133
Approved by: https://github.com/rohan-varma
2023-04-19 19:27:43 +00:00
6b6dc4418d Warn if guards are added to ShapeEnv after we produced guards (#97820)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97820
Approved by: https://github.com/voznesenskym
2023-04-19 19:23:52 +00:00
ts
2aa35e6cc1 Fix Tensor.uniform_ documentation to mention generator argument (#99510)
Fixes #98202, mentioning support for generator arguments while providing an example of the function's use.

Would be happy to iterate on this if there are any issues.
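
For reference, a small example of the documented usage (values are illustrative):

```python
import torch

g = torch.Generator().manual_seed(0)
t = torch.empty(3).uniform_(0, 1, generator=g)  # reproducible sampling via the generator
print(t)
```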

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99510
Approved by: https://github.com/soulitzer
2023-04-19 19:23:12 +00:00
d6d55f8590 [fx] Variatic arg matching (#99431)
For cases where the pattern graph matches on some number of arguments but the matching graph omits some of them (by relying on the default values instead), SubgraphMatcher currently fails because the graphs have a different number of arguments. So, when the pattern/replacement nodes have a different number of arguments, we now add the default values onto whichever argument set is missing them.

Note this support is only for when we are matching targets that are instances of OpOverload, which have a schema and default values tied to them.
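
As a rough illustration of the padding idea (an assumed helper, not the actual SubgraphMatcher code):

```python
def pad_with_defaults(args, defaults):
    # `defaults` holds one entry per positional argument of the op schema;
    # append the defaults for any trailing arguments the call site omitted,
    # so pattern and target nodes compare the same number of arguments.
    return list(args) + list(defaults[len(args):])

print(pad_with_defaults(["x"], [None, 0, 6]))        # -> ['x', 0, 6]
print(pad_with_defaults(["x", 0, 6], [None, 0, 6]))  # -> ['x', 0, 6]
```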
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99431
Approved by: https://github.com/jerryzh168
2023-04-19 18:23:40 +00:00
e21f648cde improve mkldnn matmul performance when one input is contiguous tensor but the strides is not default contiguous strides (#99511)
Given the following case:
```
import torch

a= torch.empty_strided([64, 1, 33], [33, 3, 1], dtype=torch.bfloat16).fill_(1)
b = torch.randn(64, 33, 256).to(dtype = torch.bfloat16)
y = torch.ops.aten.bmm(a, b)
```
```a``` is a contiguous tensor, but its strides are not the default contiguous strides ([33, 33, 1]), so the oneDNN matmul always runs a non-optimized path:
```
onednn_verbose,exec,cpu,matmul,gemm:jit,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,7.28711
```
This PR converts the inputs' strides to the default contiguous strides before calling oneDNN, so that it runs an optimized path:
```
onednn_verbose,exec,cpu,matmul,brg:avx512_core_amx_bf16,undef,src_bf16::blocked:abc:f0 wei_bf16::blocked:abc:f0 dst_bf16::blocked:abc:f0,attr-scratchpad:user ,,64x1x33:64x33x256:64x1x256,3.06396
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99511
Approved by: https://github.com/mingfeima, https://github.com/jgong5
2023-04-19 18:13:00 +00:00
efa16c20c3 make ATen/native/cuda/UpSampleNearest3d.cu data_ptr-correct (#99328)
make ATen/native/cuda/UpSampleNearest3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99328
Approved by: https://github.com/ezyang
2023-04-19 17:48:54 +00:00
5cb788a9a5 Revert "[cuda rng] Making offset calculation independent of device properties (#98988)"
This reverts commit 26f318574fc771bb200b99bcd87c645934c1e706.

Reverted https://github.com/pytorch/pytorch/pull/98988 on behalf of https://github.com/anijain2305 due to Diagnosing if sebotnet has flakiness
2023-04-19 17:23:40 +00:00
5d395769a6 Skip vision_maskrcnn after #98923 (#99394)
This is failing in trunk as documented in https://github.com/pytorch/pytorch/issues/99438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99394
Approved by: https://github.com/desertfire
2023-04-19 17:07:07 +00:00
6b8bab8e39 Fix (4 device) multi-gpu ShardedGradScaler Tests in ciflow/periodic (#99485)
Fixes #99427

Given the provided CI logs, I ~~suspect~~[^1] `inf` is being hit with the initial (FSDP model) step of the [test in question](https://github.com/pytorch/pytorch/actions/runs/4707887920/jobs/8350225813#step:13:36189). The DDP loss is correct and indicative of two steps being taken but the FSDP loss is approximately half of the loss expected with the first step (suggesting a step was skipped and the scale was halved). I'm further reducing `init_scale` in this PR in order to ~~test the hypothesis~~[^2] (error occurs with 4 device multi-gpu tests only, not the 2 device tests I can verify locally).

I'll ensure I add the label `ciflow/periodic`[^3] to future PRs I suspect could potentially exhibit divergent behavior with >2 devices. Ideally all tests would be insensitive to device scaling but I recognize for some tests imposing that design constraint might be more trouble than it's worth.

@awgu @huydhn

[^1]: Suspicion confirmed
[^2]: The relevant periodic tests are [now passing](https://github.com/pytorch/pytorch/actions/runs/4738073998/jobs/8411862508)
[^3]: Didn't know that existed, great to know!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99485
Approved by: https://github.com/huydhn
2023-04-19 16:52:29 +00:00
b0df0cd7cc [reland][quant][fix] Compare resnet with quantizer api with the prepare_fx and decomposed convert flow (#99355)
Summary:
Using a decomposed convert to make sure we get exact match, this means the nodes in resnet are
annotated correctly, reland for https://github.com/pytorch/pytorch/pull/98905

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D45071168](https://our.internmc.facebook.com/intern/diff/D45071168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99355
Approved by: https://github.com/kimishpatel
2023-04-19 16:47:15 +00:00
391a3add54 make ATen/native/cuda/TensorModeKernel.cu data_ptr-correct (#99330)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/99330).
* __->__ #99330

make ATen/native/cuda/TensorModeKernel.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99330
Approved by: https://github.com/ezyang
2023-04-19 16:15:22 +00:00
8eb7743401 make ATen/native/cuda/ReflectionPad.cu data_ptr-correct (#99325)
make ATen/native/cuda/ReflectionPad.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99325
Approved by: https://github.com/ezyang
2023-04-19 16:12:36 +00:00
4d3011b600 make ATen/native/cuda/Col2Im.cu data_ptr-correct (#99348)
make ATen/native/cuda/Col2Im.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99348
Approved by: https://github.com/Skylion007
2023-04-19 16:06:39 +00:00
121edd2161 make ATen/native/cuda/Shape.cu data_ptr-correct (#99343)
make ATen/native/cuda/Shape.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99343
Approved by: https://github.com/Skylion007
2023-04-19 16:06:27 +00:00
b01edf45f8 Add typing to debug_utils and repro (#99452)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99452
Approved by: https://github.com/anijain2305
2023-04-19 16:00:19 +00:00
2e25fb5d55 Refactor debug_utils into after_aot and after_dynamo modules (#99450)
There are no code changes but I did take the opportunity to
reorder and group the functions once they were placed in their
respective modules.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99450
Approved by: https://github.com/anijain2305
2023-04-19 16:00:19 +00:00
a3ee9800ba Codegen fixed size for static sympy values (#99362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99362
Approved by: https://github.com/ezyang
2023-04-19 15:58:18 +00:00
e605b5df74 [SPMD] Add sym_stride to DSymInt (#99504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99504
Approved by: https://github.com/fegin
2023-04-19 14:55:40 +00:00
2cb8a8d4cc [SPMD] Support DSymInt for slice_backward in SPMD expansion (#99501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99501
Approved by: https://github.com/fegin
2023-04-19 14:55:40 +00:00
292296141a [SPMD] Support SymInt with non-op call_function nodes (#99420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99420
Approved by: https://github.com/fegin
2023-04-19 14:55:37 +00:00
7c0c663a4c [SPMD] Add aten.stack and aten.select to DTensor prop (#99417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99417
Approved by: https://github.com/fegin
2023-04-19 14:55:34 +00:00
301be37091 Avoid import * from experimental_ops (#99363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99363
Approved by: https://github.com/fegin
2023-04-19 14:55:30 +00:00
8d3dc2131d make ATen/native/cuda/TensorTransformations.cu data_ptr-correct (#99350)
make ATen/native/cuda/TensorTransformations.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99350
Approved by: https://github.com/ezyang
2023-04-19 14:03:48 +00:00
98907589ee Make GetItemSource(*, slice) hashable (#99379)
All Sources must be hashable, since we are using set equality to check for
duplicate sources in AOTAutograd.  We should have a more systematic way
of asserting this.  For this PR just fix the local issue.
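
Background for why the slice case needed special handling (an illustration of plain Python behavior on the versions in use at the time, not PyTorch code):

```python
s = slice(0, 4)
try:
    hash(s)
except TypeError as e:
    print(e)  # unhashable type: 'slice' (on Python < 3.12)
```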

Fixes #99145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99379
Approved by: https://github.com/ezyang
2023-04-19 13:50:49 +00:00
9b909cbe9a Update README.md to explain installing triton (#99464)
Users building from source should know how to install triton the recommended way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99464
Approved by: https://github.com/msaroufim, https://github.com/alanwaketan
2023-04-19 13:48:56 +00:00
0ae9d15543 make ATen/native/cuda/Bucketization.cu data_ptr-correct (#99334)
make ATen/native/cuda/Bucketization.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99334
Approved by: https://github.com/ezyang
2023-04-19 13:42:58 +00:00
367d3657a4 make ATen/native/cuda/TensorFactories.cu data_ptr-correct (#99342)
make ATen/native/cuda/TensorFactories.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99342
Approved by: https://github.com/ezyang
2023-04-19 13:42:27 +00:00
3ace394d43 make ATen/native/cuda/RangeFactories.cu data_ptr-correct (#99344)
make ATen/native/cuda/RangeFactories.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99344
Approved by: https://github.com/ezyang
2023-04-19 13:41:33 +00:00
67db44694a make ATen/native/cuda/Nonzero.cu data_ptr-correct (#99333)
make ATen/native/cuda/Nonzero.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99333
Approved by: https://github.com/ezyang
2023-04-19 13:40:46 +00:00
441ac80988 make ATen/native/cuda/UpSampleNearest1d.cu data_ptr-correct (#99329)
make ATen/native/cuda/UpSampleNearest1d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99329
Approved by: https://github.com/ezyang
2023-04-19 13:39:39 +00:00
c67c16bcd2 Switch calling convention back to real tensors (#99320)
Months ago, in order to get dynamic shapes working through to Dynamo backends, we changed the calling convention to pass fake tensors rather than real tensors as example inputs to backends. The motivation at the time was, well, backends shouldn't really be peeking at the real tensors when they are doing compilation, and so it would make more sense to hide the real tensors from backends. But there were a bunch of problems:

* This interacted poorly with our accuracy minifier design: accuracy minifier needs access to the real inputs in order to run the model and figure out what happens!
* The TensorRT backend required real inputs and we never figured out how to fix it.
* In practice, all the backends needed to detect if they were passed real tensors, and fakeify them anyway (certainly AOTAutograd does this)
* Parameters and inputs are treated non-uniformly: parameters had to be passed as real tensors, because CUDA graphs requires knowing what the actual tensors are

Furthermore, there were some more problems discovered after the fact:

* Backends may want to optimize on aspects of tensors which you cannot tell without having real tensors; e.g., alignment of the data pointer

So, this PR decides that changing the calling convention was a bad idea, and switches back to passing real tensors. There is a problem though: AOTAutograd will perform fakeification, which means that in practice backends are still going to end up with fake tensors in the end anyway. I want to change this, but this will require some work with bdhirsh's upcoming AOTAutograd export refactor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99320
Approved by: https://github.com/voznesenskym
2023-04-19 12:15:52 +00:00
1eb1911012 migrate cuda files to const_data_ptr (#99357)
migrate cuda files to const_data_ptr

Summary:
These are all going to const_data_ptr, so they ought to all be safe.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99357
Approved by: https://github.com/ezyang
2023-04-19 12:06:25 +00:00
1aa52fc041 make ATen/native/cuda/NaiveDilatedConvolution.cu data_ptr-correct (#99346)
make ATen/native/cuda/NaiveDilatedConvolution.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99346
Approved by: https://github.com/ezyang
2023-04-19 12:06:17 +00:00
f17119d10c make ATen/native/cuda/AdaptiveAveragePooling3d.cu data_ptr-correct (#99324)
make ATen/native/cuda/AdaptiveAveragePooling3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99324
Approved by: https://github.com/ezyang
2023-04-19 12:05:51 +00:00
bb2cd4a107 Revert "Python binding to set/get CUDA rng state offset (#98965)"
This reverts commit 8214fe07e8a200e0fe9ca4264bb6fca985c4911e.

Reverted https://github.com/pytorch/pytorch/pull/98965 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-19 11:23:32 +00:00
33483b0be4 make ATen/native/cuda/UpSampleTrilinear3d.cu data_ptr-correct (#99349)
make ATen/native/cuda/UpSampleTrilinear3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99349
Approved by: https://github.com/ezyang
2023-04-19 09:55:18 +00:00
ea50d4f146 Revert "Switch calling convention back to real tensors (#99320)"
This reverts commit 780922c24ec931000cb6a67eeebd2b2288eeb7df.

Reverted https://github.com/pytorch/pytorch/pull/99320 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-19 09:44:06 +00:00
6467495900 Allow split_reduction if all dyn values are static (#99475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99475
Approved by: https://github.com/ngimel
2023-04-19 07:53:25 +00:00
113bd11cf4 Skip levit (#99491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99491
Approved by: https://github.com/ezyang
2023-04-19 07:41:42 +00:00
41d7969590 [SPMD] Upstream iter_move_grads_and_optimizers (#98785)
This PR upstreams `iter_move_grads_and_optimizer`, which delays some of the gradients and the corresponding optimizer to the next iteration. D44512863 (credit to @lessw2020) is the internal implementation, which is only good for the old _SPMD expansion. This PR changes the implementation to use the new APIs.

Differential Revision: [D44836486](https://our.internmc.facebook.com/intern/diff/D44836486/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98785
Approved by: https://github.com/mrshenli
2023-04-19 06:40:33 +00:00
fcd2e8cbf4 Support bf16 searchsort op (#99426)
Per title, needed to unblock bf16 for an ads tformer workload

Differential Revision: [D45088972](https://our.internmc.facebook.com/intern/diff/D45088972/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99426
Approved by: https://github.com/XilunWu
2023-04-19 06:25:01 +00:00
b3b0fbca11 [ONNX] Export Relu6 without using Relu (#99022)
The clamp operator used in the export of Relu6 already clamps the input value between 0 and 6. There's no need to first perform a Relu on the input tensor.
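
For illustration, a minimal export that exercises this path (an assumed example, not a test from the PR):

```python
import torch

model = torch.nn.ReLU6()
# the exported graph should now contain a single Clip(0, 6) rather than Relu + Clip
torch.onnx.export(model, torch.randn(1, 8), "relu6.onnx")
```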

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99022
Approved by: https://github.com/BowenBao
2023-04-19 06:18:14 +00:00
d41aa448b8 [ONNX] Run ONNX tests as part of standard run_test script (#99215)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at dcbf7e2</samp>

### Summary
📝🧹🚩

<!--
1.  📝 for simplifying the `./scripts/onnx/test.sh` script
2.  🧹 for refactoring the `test/onnx/dynamo/test_exporter_api.py` file
3.  🚩 for adding the `--onnx` flag to `test/run_test.py` and updating the `TESTS` list
-->
This pull request improves the ONNX testing infrastructure in PyTorch by refactoring the test code, normalizing the scope names, adding a flag to run only the ONNX tests, and simplifying the test script.

> _To export PyTorch models to ONNX_
> _We refactored some scripts and contexts_
> _We used `common_utils`_
> _And normalized the scopes_
> _And added a flag to run the tests_

### Walkthrough
*  Simplify `./scripts/onnx/test.sh` to use `run_test.py` with `--onnx` flag instead of `pytest` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-0017f5b22ae1329acb0f54af8d9811c9b6180a72dac70d7a5b89d7c23c958198L44-R46))
*  Remove `onnx` test from `TESTS` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7L127-R127)). Replace with `onnx_caffe2`.
*  Add `onnx/test_pytorch_onnx_onnxruntime_cuda` and `onnx/test_models` tests to `blocklisted_tests` list in `test/run_test.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R154-R155))
*  Add `ONNX_SERIAL_LIST` list to `test/run_test.py` to specify ONNX tests that must run serially ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R296-R301))
*  Add `ONNX_TESTS` list to `test/run_test.py` to store all ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R370))
*  Add `--onnx` flag to `parse_args` function in `test/run_test.py` to run only ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R920-R928))
*  Include `ONNX_SERIAL_LIST` in `must_serial` function in `test/run_test.py` to run ONNX tests serially or parallelly based on memory usage ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1120))
*  Filter selected tests based on `--onnx` flag in `get_selected_tests` function in `test/run_test.py` to exclude non-ONNX tests ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-e72503c9e3e8766e2d1bacf3fad7b88aa166e0e90a7e103e7df99357a35df8d7R1158-R1165))

### Other minor changes to accommodate this change
*  Replace `unittest` module with `common_utils.TestCase` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L4), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L29-R28), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L71-R70), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L147-R146))
*  Import `TemporaryFileName` class from `common_utils` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L19-R18))
*  Use `common_utils.TemporaryFileName` instead of `TemporaryFileName` in `TestDynamoExportAPI` class in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L92-R91), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L110-R109), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L129-R128))
*  Use `common_utils.run_tests` instead of `unittest.main` in `test/onnx/dynamo/test_exporter_api.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-4545f0c15c73ebe90a875e9bee6c5ca4b6b92fb1ed0ec5560d1568e0f6339d02L155-R154))
*  Add `re` module to `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R6))
*  Add `_remove_test_environment_prefix_from_scope_name` function to `test/onnx/test_utility_funs.py` to normalize scope names of ONNX nodes ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7R32-R58))
*  Use `_remove_test_environment_prefix_from_scope_name` function to compare scope names of ONNX nodes in `TestUtilityFuns` class in `test/onnx/test_utility_funs.py` ([link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1099-R1133), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1119-R1152), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1170-R1188), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1181-R1199), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1220-R1239), [link](https://github.com/pytorch/pytorch/pull/99215/files?diff=unified&w=0#diff-da71d2c81c9dc7ac0c47ff086fded82e4edcb67ba0cd3d8b5c983d7467343bc7L1235-R1258))

Fixes #98626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99215
Approved by: https://github.com/huydhn, https://github.com/titaiwangms
2023-04-19 06:17:47 +00:00
8e69879209 [inductor] adjust sliceView limits (#99447)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99447
Approved by: https://github.com/voznesenskym
2023-04-19 03:13:41 +00:00
4aedb8e116 Revert "[inductor] coordinate descent tuning upon max-autotune (#97203)"
This reverts commit 52ecc3274b1c16fcca3a3d89bd261dbc6513d6ed.

Reverted https://github.com/pytorch/pytorch/pull/97203 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks MacOS test in trunk
2023-04-19 02:33:02 +00:00
e60557793f Make hash update script more robust and run it (#99370)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99370
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-04-19 02:26:03 +00:00
96cad5cf95 Revert "[inductor] adjust sliceView limits (#99447)"
This reverts commit 8009891be65d87ec54e3f777dbc9ccd3b5c20d6e.

Reverted https://github.com/pytorch/pytorch/pull/99447 on behalf of https://github.com/ngimel due to breaks inductor
2023-04-19 01:39:26 +00:00
26f318574f [cuda rng] Making offset calculation independent of device properties (#98988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98988
Approved by: https://github.com/ngimel
2023-04-19 01:35:44 +00:00
bb017d7671 Revert "Codegen fixed size for static sympy values (#99362)"
This reverts commit 6c5fdde881c24329b4356e085a7b171cfd68bf72.

Reverted https://github.com/pytorch/pytorch/pull/99362 on behalf of https://github.com/voznesenskym due to CI Fail
2023-04-19 01:00:52 +00:00
48463f687a refactor macro with AMP (#99285)
Fixes #ISSUE_NUMBER
As the title says, this optimizes the AMP macros and puts them in a `.hpp` file so that they can be used for custom devices.  @bdhirsh  @albanD
As we discussed in the thread below, the macros are reworked so that a new macro can be added for other devices, and they are moved to `.hpp` so that custom-device backends can include them to configure their ops.
https://dev-discuss.pytorch.org/t/improve-the-extension-with-privateuse1-for-custom-device/1196/7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99285
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-04-19 01:00:00 +00:00
52ecc3274b [inductor] coordinate descent tuning upon max-autotune (#97203)
Command to run max autotune baseline:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Command to do coordinate descent autotuning:
```
TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only ${MODEL_NAME} --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt)
```

Explanation of the environment variables used in the commands above:
```
- TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 : enable coordinate descent tuning
- TORCHINDUCTOR_PERSISTENT_REDUCTIONS=0 : disable persistent reductions. We need to do this so we can tune RBLOCK for reductions
- TORCHINDUCTOR_MAX_AUTOTUNE=1: enable max autotune
- TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting_coordesc : use a separate cache dir for coordinate descent tuning. Optional.
```

Here are my experiments results for around 40 torchbench models: https://docs.google.com/spreadsheets/d/1G7i2whIf8Yu-HhN_WovNxwcE-iFDSAw4x3NK4uL4XhI/edit#gid=0

Some highlights
- We improve 2.2% further upon max-autotune on average (geomean)
- timm_resnest benefits most from coordinate descent tuning. There is 1.07x speedup
- We see decent speedups on transformer models
  - BERT_pytorch:  1.056x
  - timm_vision_transformer: 1.04x
  - hf_Bert: 1.030x
- For resnet models, it looks like we get less gain as the model gets larger. My guess is that larger models spend more time on mm/conv, so our tuning for pointwise/reduction helps less
  - resnet18: 1.021x
  - resnet50: 1.014x
  - resnet152: 1.005x

This kind of coordinate descent autotuning gives us an 'upper bound' on the gain we can get from tuning configs for pointwise/reduction kernels. On the other hand, by spot checking, we roughly double the compilation time compared to max-autotune. Next steps could be
- we disable persistent reduction in coordinate descent autotune (it's still enabled in baseline) so we can tune RBLOCK for reduction. We can also try to use autotune to pick persistent reduction or not.
- pick good config without benchmarking (e.g. Natalia mentioned checking register spill)
- try the idea on matmul so we know what's the potential there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97203
Approved by: https://github.com/ngimel
2023-04-19 00:17:10 +00:00
3fef633333 Add CUDA-12.1 manywheel build to trunk (#99458)
As CUDA-11.7 is getting deprecated anyway.

Also fixes a problem where the script generated the same workflow twice, overriding the 11.8 ones with 11.7 + 11.7-with-pypi

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 0c6c182</samp>

> _Oh we are the PyTorch crew and we have a job to do_
> _We build and test the manywheel package with CUDA 11.8_
> _So heave away, me hearties, heave away with all your might_
> _We'll smoke the Linux binary and make sure it runs all right_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99458
Approved by: https://github.com/dagitses, https://github.com/atalman
2023-04-19 00:13:32 +00:00
8009891be6 [inductor] adjust sliceView limits (#99447)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99447
Approved by: https://github.com/voznesenskym
2023-04-19 00:08:27 +00:00
44b09bf673 Reland "Simple Custom Operator API, V0 (#98440)" (#99416)
See the original PR (#98440) for the description. It broke internal
builds due to proxy_tensor.py not importing torch._dynamo, which is
being fixed in the previous PR in the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99416
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
2023-04-18 23:48:33 +00:00
840431fa59 Fix test/test_proxy_tensor (#99415)
test_proxy_tensor fails when run by itself (python test/test_proxy_tensor.py -v),
but not when all of the tests are run together.

The cause is that torch._dynamo isn't imported in
torch/fx/experimenta/proxy_tensor.py but it is using functions from the
torch._dynamo package.

The fix in this PR is to add the import statements. In the future we can
consider always importing torch._dynamo on `import torch` or moving the
import to the top of the file, but there are some serious circular
dependencies to be worked out.

NB: an import in the middle of the file makes the function a bit slow
the first time the import happens but all subsequent calls are fast.

Test Plan:
- python test/test_proxy_tensor.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99415
Approved by: https://github.com/soulitzer
2023-04-18 23:48:33 +00:00
8a89eec2f8 [BE] Do not use unicode quotes (#99446)
They are mostly used in commented code examples, but even Python-3.12
does not recognize `“foobar”` as valid string literal

I.e. just `s/[“”]/"/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99446
Approved by: https://github.com/huydhn, https://github.com/ezyang
2023-04-18 22:59:56 +00:00
2b49a7330b Make lintrunner work with new main branch (#99466)
We've renamed the `master` branch to `main`. Lintrunner should check for a merge base from this new branch now

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 9743d70</samp>

Updated the linter configuration to reflect the new default branch name. Changed `merge_base_with` from `origin/master` to `origin/main` in `.lintrunner.toml`.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 9743d70</samp>

> _`merge_base_with` changed_
> _`origin/main` is the new_
> _branch name for pytorch_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99466
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-04-18 22:58:50 +00:00
5ff2ad6fc1 torch._int_mm: fix triton kernel caching (#99283)
Summary:

A fix to ensure that kernels generated for `torch._int_mm` can be cached. We can remove this hack once eager-mode `torch._int_mm` is better supported.

Let me know if something more proper is needed instead of the hack.

Test plan:

```
# Running the script below led to two compilations of the triton
# int8,int8->int32 kernel before this PR; after this PR there is only
# one compilation, which is reused.

import torch
import torch.nn as nn

x = torch.randint(-128, 127, (32, 32), device='cuda', dtype=torch.int8)
y = torch.randint(-128, 127, (32, 32), device='cuda', dtype=torch.int8)

class M(nn.Module):
    def forward(self, x):
        x = torch._int_mm(x, y)
        x = x.to(torch.int8)
        x = torch._int_mm(x, y)
        return x

m = M().cuda().half()
m = torch.compile(m, options={"max-autotune": True})

z = m(x)
z = m(x)
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99283
Approved by: https://github.com/nmacchioni, https://github.com/janeyx99
2023-04-18 22:07:02 +00:00
6c5fdde881 Codegen fixed size for static sympy values (#99362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99362
Approved by: https://github.com/ezyang
2023-04-18 20:34:26 +00:00
60c8a75a7e [EASY] Turn on slow path assertions but only on first run (#98945)
We should at least run the assertions on the first run, because it helps find workspace allocations like in cublas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98945
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-04-18 19:57:44 +00:00
93347dde22 make ATen/native/cuda/DistanceKernel.cu data_ptr-correct (#99327)
make ATen/native/cuda/DistanceKernel.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99327
Approved by: https://github.com/ezyang
2023-04-18 19:55:29 +00:00
cf97b820c1 make magmaLdlHermitian data_ptr-correct (#99361)
make magmaLdlHermitian data_ptr-correct

Summary:
See
https://icl.utk.edu/projectsfiles/magma/doxygen/group__magma__hetrf.html#ga470ab07a6d12b662e260eceecf552d26
for a description of which pointer parameters are outputs (all of
them).

Test Plan: Rely on CI.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/99361).
* __->__ #99361
* #99358
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99361
Approved by: https://github.com/lezcano
2023-04-18 19:45:04 +00:00
472f46635e Cache output tensors on execution (#98944)
Caches output tensors for the common case when the output Tensor storage is unaliased for all graph outputs in all paths. For these persisted tensors we adjust the liveness tracking by also checking that the output tensor does not have an additional python reference.

I limit cached output tensors to be unaliased. If a descendant node discovers it has an alias of a prior output, then the aliased output will no longer be persisted in the ancestor.

The large majority of tensors are unaliased, and preserving aliased output tensors would add significant additional complexity with marginal gains. For instance, when doing checkpointing and re-recording, we need to remove the persisted tensors, otherwise they would prevent memory from being reclaimed. If a single persisted tensor were present in multiple paths, that would create an inter-path dependence which adds complexity. Additionally, each further caching of the output would affect the reference count of the other caches, and that reference count would also need to be adjusted depending on whether a node was checkpointed.

Still need to do a complete run, but for the models I tried this makes performance extremely close between the trees and non-trees implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98944
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 19:44:47 +00:00
93b64f0ad3 [Easy] Remove C++ call now that it wont be on hot path (#98943)
Since we will be caching output tensors, it is no longer necessary for this logic to be in C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98943
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-04-18 19:28:37 +00:00
bce21ee06a Revert "Fix bug in check required output size in _as_strided_scatter_meta (#98483)"
This reverts commit 5b692fd819f1428fc070c3ec3a0cde5d4b83dd03.

Reverted https://github.com/pytorch/pytorch/pull/98483 on behalf of https://github.com/malfet due to Broke inductor, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor%2C%201%2C%201
2023-04-18 18:59:47 +00:00
19501b254f Revert "Codegen fixed size for static sympy values (#99362)"
This reverts commit d8d479c854c622a6ef21190b7e62c9a76f0ea2a7.

Reverted https://github.com/pytorch/pytorch/pull/99362 on behalf of https://github.com/malfet due to Reverting, as it breaks lint
2023-04-18 18:55:48 +00:00
d8d479c854 Codegen fixed size for static sympy values (#99362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99362
Approved by: https://github.com/ezyang
2023-04-18 17:38:09 +00:00
deec8bd522 [fix] Disable EMA if ema_alpha is set to None (#98992)
Summary: changing the condition to enable ema updates in training, and disable it if the "ema_alpha" value is None

Test Plan: f427638974

Differential Revision: D44937126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98992
Approved by: https://github.com/msaroufim
2023-04-18 17:28:07 +00:00
24f882369a [EdgeML] Remove dependency on all_mobile_model_configs.yaml from pt_operator_library BUCK rule (#99122)
Summary: Removes the dependency on the unified YAML file

Test Plan:
Smoke test via some caffe2 tests.

```
buck2 run xplat/caffe2:supported_mobile_models_test
```

Build a major FoA app that uses model tracing  and confirm it still works.

```
buck2 build fb4a
```

CI/CD for the rest.  If operator tracing / bundling was broken, I'd hope in the 1000+ tests spawned by this change should catch it.

Differential Revision: D44946368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99122
Approved by: https://github.com/dhruvbird
2023-04-18 17:19:55 +00:00
c0be06667f [PT2E][Quant] Support for embedding op quantization via
ExecuTorchNativeQuantizer (#99106)

ExecuTorchNativeQuantizer

ExecuTorchNativeQuantizer is a terrible name, I admit; however, let's fix it once
we align on what the quantized kernel lib within the executorch runtime should be called.

Differential Revision: [D44986258](https://our.internmc.facebook.com/intern/diff/D44986258/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44986258/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99106
Approved by: https://github.com/jerryzh168
2023-04-18 16:59:37 +00:00
06f19fdbe5 Turn off Windows Defender in temp folder on binary build workflow (#99389)
This issue started to show up recently (https://github.com/pytorch/pytorch/actions/runs/4724983231/jobs/8385139626), and I'm pretty sure the root cause is Windows Defender, as I did a similar fix on Windows CI a while ago (https://github.com/pytorch/pytorch/pull/96931).  Without this, the Windows binary build can fail flakily when Windows Defender chooses to delete/quarantine a file in the temp folder.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99389
Approved by: https://github.com/weiwangmeta
2023-04-18 16:45:38 +00:00
00f76dbaaf add comment indicating that maskPrefixSum is mutated (#99309)
add comment indicating that `maskPrefixSum` is mutated

Summary: From review of #99158.

Test Plan: No op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99309
Approved by: https://github.com/Skylion007
2023-04-18 15:57:24 +00:00
e51c6c19c0 make ATen/native/cuda/DilatedMaxPool3d.cu data_ptr-correct (#99319)
make ATen/native/cuda/DilatedMaxPool3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99319
Approved by: https://github.com/ezyang
2023-04-18 15:39:44 +00:00
a387ac41fb make ATen/native/cuda/DilatedMaxPool2d.cu data_ptr-correct (#99321)
make ATen/native/cuda/DilatedMaxPool2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99321
Approved by: https://github.com/ezyang
2023-04-18 15:36:23 +00:00
d69a1a4491 In detect_fake_mode, assert that all detected fake modes are consistent (#99392)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99392
Approved by: https://github.com/eellison
2023-04-18 15:35:05 +00:00
a8a1c57664 Reset joint graph fake mode earlier, and more comprehensively (#99391)
This bug was discovered by a stronger assert (which I will be posting
in a follow up PR.)

The explanation for this change is a bit long and windy, and I am not
sure I entirely understand the situation myself.  But here's what I
think is going on.

jansel's joint graph pattern matcher does something fairly unusual:
in order to initialize the pattern in question, it (lazily) runs
an aot_function invocation in order to trace out what the joint graph
of a given pattern looks like (we ought not use aot_function, but we
can't really do this until bdhirsh lands AOT Autograd export properly).
However, this lazy initialization occurs within the context of a
separate compilation, which has its own tracing context, and
importantly, fake tensor mode.

What we would like, is the pattern matcher lazy initialization fake
tensor mode to be unrelated to whatever the ambient fake tensor mode of
the graph we were actually compiling.  We want these to be independent,
because we don't really care what the current compiled graph is; this is
a lazy init function, it could have gotten initialized during any
compilation, it just happens to be initialized on this one.

To prevent us from picking up the ambient fake mode, we have to do two
things: we have to remove the tracing context (which stores a fake
mode), and we have to also disable the ambiently active fake mode.

In https://github.com/pytorch/pytorch/pull/99377 eellison proposed an
alternative approach, where we reuse the fake mode.  While this probably
won't cause any errors, it's morally not the right thing to do, because
you'll end up polluting the enclosing fake tensor mode with tensors that
have nothing to do with the mode itself.

This might fix https://github.com/pytorch/pytorch/issues/99286
but it's also possible that https://github.com/pytorch/pytorch/pull/99320
fixed it already.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99391
Approved by: https://github.com/bdhirsh
2023-04-18 15:35:05 +00:00
38e964056b Reland python ops (#99170)
Waiting for the revert to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170
Approved by: https://github.com/albanD
2023-04-18 15:15:46 +00:00
5327dad617 make ATen/native/cuda/ForeachReduceOp.cu data_ptr-correct (#99318)
make ATen/native/cuda/ForeachReduceOp.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99318
Approved by: https://github.com/ezyang
2023-04-18 15:11:27 +00:00
e7a5cb99e2 [CI] Fix test failures at TestTensorCreationCPU.test_float_to_int_conversion_finite_cpu_uint8 (#98916)
This PR fixes divergent-value issues when converting float32 to uint8. The failures of `TestTensorCreationCPU.test_float_to_int_conversion_finite_cpu_uint8` came from values that diverge between PyTorch and numpy across platforms. This PR adds two items:

- Enhance `_float_to_int_conversion_helper()` to take given reference values so it provides a stable reference
- Omit the test for `float.max`, since the results in PyTorch are platform-dependent (e.g. `float.max` -> `uint8` is 0 on x86_64 but 255 on s390x; see the illustrative snippet below)

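For concreteness, here is a minimal illustrative snippet (not taken from the PR) showing the kind of out-of-range conversion whose result is platform-dependent:

```
import torch

# Converting a float32 value outside the uint8 range is implementation-defined,
# so the printed value can differ across platforms (the description above
# observed 0 on x86_64 but 255 on s390x).
x = torch.tensor([torch.finfo(torch.float32).max], dtype=torch.float32)
print(x.to(torch.uint8))
```
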
Fixes #97794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98916
Approved by: https://github.com/dagitses
2023-04-18 15:05:12 +00:00
24d20ea194 make ATen/native/cuda/ConvolutionMM2d.cu data_ptr-correct (#99323)
make ATen/native/cuda/ConvolutionMM2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99323
Approved by: https://github.com/ezyang
2023-04-18 14:01:02 +00:00
7880f9e7e3 Fix isinstance on SymInt in dynamo (#99393)
Fixes https://github.com/pytorch/pytorch/issues/99291

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99393
Approved by: https://github.com/albanD
2023-04-18 14:00:27 +00:00
57e1a50da3 Fix FakeTensor printing (#99205)
I got too confused by the FakeTensor printing, so this PR fixes it to
print normally.

Before:
```
with FakeTensorMode():
    x = torch.empty(2, 2, device="cpu")
    print(x)
    # FakeTensor(FakeTensor(..., device='meta', shape=(2, 2)), cpu)
```
After (Tensor printing doesn't print the default device):
```
FakeTensor(..., shape=(2, 2))
```

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99205
Approved by: https://github.com/eellison
2023-04-18 13:26:27 +00:00
20a90a1f80 make ATen/native/cuda/UpSampleBilinear2d.cu data_ptr-correct (#99313)
make ATen/native/cuda/UpSampleBilinear2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99313
Approved by: https://github.com/ezyang
2023-04-18 13:24:54 +00:00
cee9d86d20 make ATen/native/cuda/AmpKernels.cu data_ptr-correct (#99312)
make ATen/native/cuda/AmpKernels.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99312
Approved by: https://github.com/ezyang
2023-04-18 13:24:28 +00:00
46b9377190 [CI] Collect inductor max-autotune performance every Sunday (#99387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99387
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-04-18 13:20:13 +00:00
ce7c4ba11d Revert "Mark doctr_det_predictor as broken on master (#99370)"
This reverts commit b290381e09e59aadca73be91c601e049c029aa03.

Reverted https://github.com/pytorch/pytorch/pull/99370 on behalf of https://github.com/ezyang due to malfet already directly fixed it
2023-04-18 13:18:10 +00:00
f497031df9 Revert "Simple Custom Operator API, V0 (#98440)"
This reverts commit 0157b2d722b3411721814cf92467dc32f16f5a56.

Reverted https://github.com/pytorch/pytorch/pull/98440 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-18 13:04:27 +00:00
1c042a2137 Revert "Reland python ops (#99170)"
This reverts commit d4de64ae8d5587ed4a4a9d6ce9555a9a7976866d.

Reverted https://github.com/pytorch/pytorch/pull/99170 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-04-18 11:37:43 +00:00
c97dd8e134 Fix the pt2e UT path after refactor (#99402)
**Summary**
After https://github.com/pytorch/pytorch/pull/99064 and https://github.com/pytorch/pytorch/pull/99065 merged, the pt2e UT path has changed, also need to change the module path in `test/test_quantization.py`. Then we can run these tests in top level's test directory.

**Test Plan**
```
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2E
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EModels
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFX
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFXX86Inductor
cd test && python -u -m pytest test_quantization.py -k TestQuantizePT2EFXModels
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99402
Approved by: https://github.com/jerryzh168
2023-04-18 10:48:52 +00:00
88c8c2b71b [dynamo 3.11] implement 3.11 exceptiontable (#96511)
Summary of changes:
- Add CPython exceptiontable parsing/assembling functions in torch/_dynamo/bytecode_transformation.py, based on https://github.com/python/cpython/blob/3.11/Objects/exception_handling_notes.txt.
- Add optional `exn_tab_entry` field to dynamo `Instruction`s in torch/_dynamo/bytecode_transformation.py in order to virtualize exception table entries (start, end, target instructions).
- Add checks guarding against duplicate instructions in dynamo, so that jump/exceptiontable targets are unambiguous. See `get_indexof` in torch/_dynamo/bytecode_analysis.py. Ensure that bytecode generation throughout dynamo does not generate duplicate instructions.
- Allow dynamo bytecode generation logic to generate nested exception table entries for developer convenience. CPython expects entries to not overlap, so we flatten nested entries during assembly in torch/_dynamo/bytecode_transformation.py:compute_exception_table.
- Simulate the block stack in torch/_dynamo/symbolic_convert.py. CPython removed the block stack in 3.11, but dynamo needs it in order to keep track of active contexts. So we simulate the block stack as before by looking at exceptiontable entries in order to determine the current blocks.
- Update context codegen in torch/_dynamo/resume_execution.py. The `SETUP_FINALLY` bytecode, which conveniently had a jump target to the finally block, was removed in 3.11, so we need to keep track of the jump target of the finally block using exceptiontables. Generating resume functions is more difficult since the original exceptiontable entries pointing to old cleanup code need to be modified to point to new cleanup code.
- Fix a push_null bug in torch/_dynamo/variables/functions.py introduced by https://github.com/pytorch/pytorch/pull/98699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96511
Approved by: https://github.com/jansel, https://github.com/yanboliang, https://github.com/albanD
2023-04-18 07:53:24 +00:00
8214fe07e8 Python binding to set/get CUDA rng state offset (#98965)
Why?
* To reduce the latency of hot path in https://github.com/pytorch/pytorch/pull/97377

Concern - I had to add `set_offset` in all instances of `GeneratorImpl`. I don't know if there is a better way.

~~~~
import torch
torch.cuda.manual_seed(123)
print(torch.cuda.get_rng_state())
torch.cuda.set_rng_state_offset(40)
print(torch.cuda.get_rng_state())

tensor([123,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
tensor([123,   0,   0,   0,   0,   0,   0,   0,  40,   0,   0,   0,   0,   0,
          0,   0], dtype=torch.uint8)
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98965
Approved by: https://github.com/kulinseth, https://github.com/ezyang
2023-04-18 07:52:21 +00:00
b290381e09 Mark doctr_det_predictor as broken on master (#99370)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99370
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-04-18 06:58:47 +00:00
5c5ad53517 [CUBLAS] Specify alignment for cuBlasLt addmm (#98975)
Fixes the underlying issue previously addressed in #92201 by specifying minimum alignments explicitly to `cuBLAS` rather than relying on a handcrafted rule. ~~We're still investigating some potential failure modes on `sm80` and `sm90` but those would be real `cuBlasLt` heuristics bugs rather than being caused by underspecifying constraints to the heuristics.~~

According to the `cuBLAS` docs the default alignment is 256 bytes so that is the current maximum that is currently being checked: https://docs.nvidia.com/cuda/cublas/

CC @ptrblck @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98975
Approved by: https://github.com/ngimel
2023-04-18 06:19:30 +00:00
5b692fd819 Fix bug in check required output size in _as_strided_scatter_meta (#98483)
Original Issue from #92670

pytest ./generated/test_XuyangBai_PointDSC.py -k test_004

==> RuntimeError: as_strided_scatter: sizes [4], strides [85], storage offset 256 and itemsize 4 requiring a storage size of 2048 are out of bounds for storage of size 1024

Repro:

```
import torch
import torch._dynamo
import torch.nn as nn


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()

    def forward(self, x):
        x[1].fill_diagonal_(0)   # this size check failed
        return x


device = torch.device("cpu")
model = Model()
model.to(device)

torch._dynamo.reset()
compiled_model = torch._dynamo.optimize("inductor")(model)

arg = [torch.rand([4, 1, 1])]
compiled_model(*arg)

```
The error was raised when checking the required size in as_strided_scatter.

https://github.com/pytorch/pytorch/blob/master/torch/_prims/__init__.py#L1818

When the input is a tensor with a storage offset (a view), computing the input's required storage length should also take the base tensor's size/stride/offset into account, instead of comparing against the input's number of elements.

This diff fixes the bug and adds a test.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98483
Approved by: https://github.com/ngimel
2023-04-18 05:07:57 +00:00
2611fccfed [Inductor] Unify Inductor CUDA & CPU unit tests input clone function (#99118)
Inductor CUDA unit tests don't preserve `storage_offset` when cloning inputs; this PR fixes that by making both CUDA and CPU tests use the same helper function `clone_preserve_strides`.

This was found by @lantiankaikai while working on #98483, but he couldn't test it due to the lack of a CUDA env.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99118
Approved by: https://github.com/ngimel
2023-04-18 03:42:44 +00:00
964c7e3e85 [BE][DTensor] fix DTensor equal op (#99014)
## What problem this PR solves?
#97170 fixed `equal` operator return type (old: Tensor, now: bool) by giving it the correct sharding propagation. This is consistent with the `aten::equal` op. However, the correctness only stays at the local result level:
* `equal` op returns True if the local copy of dtensor A equals the local copy of dtensor B

This is not the correct semantic of `equal` which should return True if all local copies of A are equal to the corresponding local copies of B.

## What is this PR?

1. For non-participating ranks, if the return type is scalar, `local_results` is set to `None` which means the default value is a reduced result of participating ranks only.
2. For all ranks, if the return type is scalar and the `op_call` is `aten::equal` (because `aten::equal` is the only function that returns a scalar value and needs communication), all-gather the `local_results` within the `default pg` and reduce them with `operator.and_`. The result becomes the new `local_result` (a schematic sketch of this reduction follows below).

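A schematic sketch of the reduction in item 2 above, using only the standard library (the per-rank values are hypothetical):

```
import operator
from functools import reduce

# Hypothetical per-rank local `equal` results after the all_gather described above.
local_results = [True, True, False, True]

# AND-reduce them so the final answer is True only if every shard matched.
global_equal = reduce(operator.and_, local_results)
print(global_equal)  # False
```
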
## Result/Impact
For non-participating ranks and the return type is scalar:

1. op is `aten::equal`, the return value is same with all other ranks
2. op is not `aten::equal`, the return value is None. Before this PR, this will raise "NotImplementedError" but has not been tested.

For participating ranks and the return type is scalar:

1. op is `aten::equal`, the return value is the equality of two dtensor operands - True if all copies are equal, False otherwise.
2. op is not `aten::equal`, simply the local computation result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99014
Approved by: https://github.com/wanchaol
2023-04-18 03:22:44 +00:00
e6aa8e0729 Test and document dynamo backward hooks support (#99382)
No new support is added, but backward hooks are working, and there is now a test and some documentation about the limitations (hooks fire only after the whole graph has run).

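As a hedged illustration (an assumed example, not the test added by this PR), a tensor hook registered on a graph input still fires, but only once backward for the whole compiled graph has run:

```
import torch

x = torch.randn(3, requires_grad=True)
x.register_hook(lambda g: g * 2)  # fires after the whole graph's backward

@torch.compile
def f(t):
    return (t * t).sum()

f(x).backward()
print(x.grad)  # 4 * x, i.e. the hook doubled the usual gradient of 2 * x
```
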
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99382
Approved by: https://github.com/yanboliang
2023-04-18 03:03:29 +00:00
cde35b4069 [JIT] clarify errors due to non-literal indexing into ModuleList, ModuleDict (#98606)
TorchScript only supports indexing into ModuleLists with integer literals. The error message already warns about this; but this PR adds clarifications around what a "literal" is. I'm adding this PR because, in my opinion, it's not obvious what a "literal" is and how strict its definition is. The clarification provided in this PR should make it easier for users to understand the issue and how to fix it.
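As a hedged illustration (not part of this PR), the difference between a literal and a non-literal index looks roughly like this:

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(3)])

    def forward(self, x):
        x = self.layers[0](x)       # OK: the index 0 is an integer literal
        # x = self.layers[idx](x)   # not scriptable: 'idx' would not be a literal
        for layer in self.layers:   # OK: direct iteration is supported
            x = layer(x)
        return x

scripted = torch.jit.script(Net())  # compiles; the commented-out line would not
```
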
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98606
Approved by: https://github.com/eellison, https://github.com/gmagogsfm
2023-04-18 02:53:53 +00:00
a763d948d7 [CI] Move last iOS job to periodic (#99388)
There weren't any failures on trunk, so it's not clear why it needs to run all the time. Also, M1 builds/tests are a pretty good proxy for iOS builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99388
Approved by: https://github.com/huydhn
2023-04-18 02:10:59 +00:00
4ffd407d12 [CI] Update torchbench pin (#99386)
To fix `doctr` installation for torchbench tests, see https://github.com/pytorch/benchmark/pull/1555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99386
Approved by: https://github.com/kit1980, https://github.com/weiwangmeta
2023-04-18 02:10:40 +00:00
780922c24e Switch calling convention back to real tensors (#99320)
Months ago, in order to get dynamic shapes working through to Dynamo backends, we changed the calling convention to pass fake tensors rather than real tensors as example inputs to backends. The motivation at the time was, well, backends shouldn't really be peeking at the real tensors when they are doing compilation, and so it would make more sense to hide the real tensors from backends. But there were a bunch of problems:

* This interacted poorly with our accuracy minifier design: accuracy minifier needs access to the real inputs in order to run the model and figure out what happens!
* The TensorRT backend required real inputs and we never figured out how to fix it.
* In practice, all the backends needed to detect if they were passed real tensors, and fakeify them anyway (certainly AOTAutograd does this)
* Parameters and inputs are treated non-uniformly: parameters had to be passed as real tensors, because CUDA graphs requires knowing what the actual tensors are

Furthermore, there were some more problems discovered after the fact:

* Backends may want to optimize on aspects of tensors which you cannot tell without having real tensors; e.g., alignment of the data pointer

So, this PR decides that changing the calling convention was a bad idea, and switches back to passing real tensors. There is a problem though: AOTAutograd will perform fakeification, which means that in practice backends are still going to end up with fake tensors in the end anyway. I want to change this, but this will require some work with bdhirsh's upcoming AOTAutograd export refactor.

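For illustration only (a hedged sketch, not code from this PR; the backend name is an assumption), a backend that wants fake inputs can fakeify the real example inputs itself:

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

def my_backend(gm: torch.fx.GraphModule, example_inputs):
    # Backends receive real tensors again; convert them to fake tensors locally
    # if compilation should not touch real data, while the real inputs remain
    # available for things like accuracy minification or alignment checks.
    fake_mode = FakeTensorMode()
    fake_inputs = [fake_mode.from_tensor(t) for t in example_inputs]
    _ = fake_inputs  # use these for shape reasoning during compilation
    return gm.forward  # fall back to eager execution of the captured graph

compiled = torch.compile(lambda x: x + 1, backend=my_backend)
print(compiled(torch.randn(3)))
```
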
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99320
Approved by: https://github.com/voznesenskym
2023-04-18 02:09:57 +00:00
a109453df4 Delete use_functionalize feature flag (#99317)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99317
Approved by: https://github.com/voznesenskym
2023-04-18 02:09:57 +00:00
17d7be68ee Delete functorch use_fake_tensor and debug_fake_cross_ref (#99314)
Using fake tensor with AOTAutograd is now mandatory, simplifying our
logic.  Unfortunately, this means debug_fake_cross_ref must go,
but I don't think anyone has used it recently.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99314
Approved by: https://github.com/eellison, https://github.com/zou3519
2023-04-18 02:09:54 +00:00
2471eac618 Make run_fwd_maybe_bwd work with int inputs (#99365)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99365
Approved by: https://github.com/voznesenskym
2023-04-18 02:05:26 +00:00
3d8498f926 [DataLoader] Add missing documentation for arg in DataLoader (#99371)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99371
Approved by: https://github.com/janeyx99
2023-04-18 02:03:47 +00:00
436edc5ac3 [ONNX] Retire 'DynamoOptimizeExporter' (#99202)
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at f2ccd03</samp>

### Summary
🗑️📝🛠️

<!--
1.  🗑️ - This emoji represents the removal of unused or unnecessary code, such as the class `DynamoOptimizeExporter` and some imports and decorators.
2.  📝 - This emoji represents the improvement of code readability and consistency, such as replacing `skip_fx_exporters` with `xfail` and using more descriptive names for the FX exporters.
3.  🛠️ - This emoji represents the simplification and refactoring of the code, such as removing some FX exporters and reducing the number of arguments and conditions in the tests.
-->
Removed unused code and simplified test logic for FX to ONNX conversion. This involved removing `skip_fx_exporters` and `DynamoOptimizeExporter`, and using `xfail` instead of `skip_fx_exporters` in `pytorch_test_common.py` and `test_fx_to_onnx_with_onnxruntime.py`.

> _Some FX exporters were not in use_
> _So they were removed without excuse_
> _The tests were updated_
> _With `xfail` annotated_
> _To make the ONNX logic more smooth_

### Walkthrough
*  Remove unused imports of `Mapping`, `Type`, and `exporter` from `test/onnx/pytorch_test_common.py` ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-26ce853445bf331686abb33393ee166726923ce36aa2a8de98ac7a2e3bc5a6d8L9-R9), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-26ce853445bf331686abb33393ee166726923ce36aa2a8de98ac7a2e3bc5a6d8L16-R16))
*  Replace custom `skip_fx_exporters` function with standard `xfail` decorator in `test/onnx/pytorch_test_common.py` and `test/onnx/test_fx_to_onnx_with_onnxruntime.py` to simplify test skipping logic and mark tests as expected to fail ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-26ce853445bf331686abb33393ee166726923ce36aa2a8de98ac7a2e3bc5a6d8L209-R220), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL319-R288), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL375-R343), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL619-R563), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL721-R656), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL788-R718))
*  Remove unused `DynamoOptimizeExporter` class from `torch/onnx/_internal/fx/dynamo_exporter.py` and remove references to it in `test/onnx/test_fx_to_onnx_with_onnxruntime.py` to simplify FX exporter logic and remove unused code ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-3ecf10bc5f6eb95a19441118cb947bd44766dc5eb9b26346f922759bb0f8c9f2L16-L85), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL37-R37), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL411-L415), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL452-L454))
*  Remove unused variables and parameters related to different FX exporters in `test/onnx/test_fx_to_onnx_with_onnxruntime.py` and use `torch.onnx.dynamo_export` directly to simplify code ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL50-R47), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL191-R188), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL245-R224), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL265-R237), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL279), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL296))
*  Replace `skip` decorators with `xfail` decorators in `test/onnx/test_fx_to_onnx_with_onnxruntime.py` to mark tests as expected to fail instead of skipping them unconditionally ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL524-R471), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL665-R600), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL748-R675), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL767-R696))
*  Replace `skip_fx_exporters` decorator with `skip_dynamic_fx_test` decorator in `test/onnx/test_fx_to_onnx_with_onnxruntime.py` to skip tests only for dynamic shapes instead of a specific FX exporter ([link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL591-R541), [link](https://github.com/pytorch/pytorch/pull/99202/files?diff=unified&w=0#diff-c8fa56eefd7f98fb4f9739d57df57f02ede77e28528133736010a6d06651ebcbL831-R761))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99202
Approved by: https://github.com/abock
2023-04-18 01:40:47 +00:00
694ed70e01 [inductor][easy] create a wrap for triton do_bench function (#99216)
triton PR https://github.com/openai/triton/pull/1513 changed the interface of the do_bench function. The quantile field's name changed from 'percentiles' to 'quantiles', and its default value changed from (0.5, 0.2, 0.8) to None. This breaks some inductor code, since a caller expecting a tuple may get a single item.

Add a wrapper to maintain the same behavior for inductor.

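A minimal sketch of such a wrapper (the function name and fallback logic are assumptions, not the actual inductor code):

```
import triton.testing

def do_bench(fn, quantiles=(0.5, 0.2, 0.8), **kwargs):
    # Newer triton expects `quantiles` (default None); older versions expect
    # `percentiles` with a (0.5, 0.2, 0.8) default. Try the new keyword first
    # and fall back, so inductor callers always get the tuple they expect.
    try:
        return triton.testing.do_bench(fn, quantiles=quantiles, **kwargs)
    except TypeError:
        return triton.testing.do_bench(fn, percentiles=quantiles, **kwargs)
```
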
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99216
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-18 00:52:00 +00:00
063173cb46 Skip sccache initialization on MacOS (#99121)
Now that the cache is used on MacOS, I'm seeing some flaky timeout when starting the server https://github.com/pytorch/pytorch/actions/runs/4703387666/jobs/8341817701.  This step is optional, so we could just skip it (like what Linux workflow does).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99121
Approved by: https://github.com/clee2000
2023-04-18 00:46:45 +00:00
6df87b2e9b Rename sym_shapes logger to dynamic (#99335)
This matches the logging with the user-facing UX dynamic=True,
rather than a new abbreviation that shows up nowhere else
in the codebase.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99335
Approved by: https://github.com/Skylion007, https://github.com/mlazos, https://github.com/voznesenskym
2023-04-18 00:45:39 +00:00
6212a85af8 Revert "Skip sccache initialization on MacOS (#99121)"
This reverts commit c2fd198cafaac3e5de6d72a80cd4ad19da042ce5.

Reverted https://github.com/pytorch/pytorch/pull/99121 on behalf of https://github.com/huydhn due to Revert to reland this as this miss one change in mac build workflow. My mistake from rebasing from master into main
2023-04-18 00:14:34 +00:00
59e343b12c enable data type propagation (#98065)
Enable data type propagation at the scheduler node level.
Propagation policy:
(1) ops with dtype args [constant, load, rand, randn] -> directly use that dtype as the node dtype
(2) ops whose semantics decide the output dtype -> use the output dtype
(these are all the `override_return_dtype` cases in https://github.com/pytorch/pytorch/blob/master/torch/_inductor/lowering.py)
(3) other ops: promote the input nodes' dtypes, e.g. ADD(BF16, FP32) -> FP32 output (see the example below).

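A small, illustrative example of rule (3), which relies on PyTorch's standard type-promotion rules:

```
import torch

print(torch.promote_types(torch.bfloat16, torch.float32))  # torch.float32

a = torch.randn(4, dtype=torch.bfloat16)
b = torch.randn(4, dtype=torch.float32)
print((a + b).dtype)  # torch.float32, matching ADD(BF16, FP32) -> FP32 above
```
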
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98065
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5
2023-04-18 00:07:35 +00:00
01fdcbdcc9 [FSDP][optim_state_dict][Easy] Temporarily disable rank0_only=True for use_orig_params case (#99354)
Summary: We have not made use_orig_params=True support rank0_only optimizer_state_dict.

Test Plan: CI

Reviewed By: wz337

Differential Revision: D45054041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99354
Approved by: https://github.com/wz337
2023-04-18 00:02:09 +00:00
7ff1f3f3f6 Revert "Revert "Expandable blocks in allocator (#96995)"" (#99275)
This reverts commit 851e89c8e817f28270e0fc21d74ced9446bea747.

Differential Revision: [D45034526](https://our.internmc.facebook.com/intern/diff/D45034526)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99275
Approved by: https://github.com/eellison
2023-04-17 23:46:08 +00:00
99c6d46cf7 fix typo in gen_functionalization_type.py (#99303)
propogate -> propagate

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99303
Approved by: https://github.com/kit1980
2023-04-17 22:59:40 +00:00
b003000f41 Updates NCCL to 2.17.1 (#97843)
Re-open of #97407. NCCL 2.17.1 sometimes fails to send a FIN packet and causes hangs. This PR updates NCCL to 2.17.1 that includes a patch for socket shutdown. NCCL 2.18 will also include this patch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97843
Approved by: https://github.com/kwen2501
2023-04-17 22:53:54 +00:00
c2fd198caf Skip sccache initialization on MacOS (#99121)
Now that the cache is used on MacOS, I'm seeing some flaky timeout when starting the server https://github.com/pytorch/pytorch/actions/runs/4703387666/jobs/8341817701.  This step is optional, so we could just skip it (like what Linux workflow does).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99121
Approved by: https://github.com/clee2000
2023-04-17 22:18:46 +00:00
bdaf32261f [FSDP] Ensure that customized non tensor optimizer state can be saved (#99214)
The current logic does not actually handle all different non-tensor optimizer states correctly. This PR fixes the issue and adds a test.

This PR will solve https://github.com/pytorch/pytorch/issues/99079

Differential Revision: [D45021331](https://our.internmc.facebook.com/intern/diff/D45021331/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99214
Approved by: https://github.com/awgu, https://github.com/awaelchli
2023-04-17 21:54:16 +00:00
d4de64ae8d Reland python ops (#99170)
Waiting for the revert to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99170
Approved by: https://github.com/albanD
2023-04-17 21:53:41 +00:00
106ccf4a2a [pt2] add meta function for linalg.cross (#99279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99279
Approved by: https://github.com/ezyang
2023-04-17 21:21:45 +00:00
6f7b434f7b [pt2] add SymInt support for column_stack (#99276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99276
Approved by: https://github.com/ezyang
2023-04-17 21:21:45 +00:00
ccc5d1daec Revert D44897935: Multisect successfully blamed D44897935 for test or build failures (#99353)
Summary:
This diff is reverting D44897935
D44897935: [FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912) by fegin has been identified to be causing the following test or build failures:

Tests affected:
- [caffe2/torch/fb/module_factory/sync_sgd/tests:test_pyper_data_parallel_wrapper - caffe2.torch.fb.module_factory.sync_sgd.tests.test_pyper_data_parallel_wrapper.PyPerDataParallelWrapperTest: test_fsdp_submodules_pyper](https://www.internalfb.com/intern/test/562950025957458/)

Here's the Multisect link:
https://www.internalfb.com/multisect/1893714
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: fegin

Differential Revision: D45027286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99353
Approved by: https://github.com/izaitsevfb, https://github.com/fegin
2023-04-17 20:53:10 +00:00
a6a90eaf28 Remove unnecessary check when logging artifacts (#99260)
Removes a check which would sometimes allow `off_by_default` artifacts to be logged if logged at a higher level.

This change will only allow artifact messages to be displayed if the artifact is enabled, regardless of level.

closes #99144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99260
Approved by: https://github.com/lezcano
2023-04-17 20:47:25 +00:00
ab08284225 Revert "Disable dynamo tracing torchrec.distributed (#97824)"
This reverts commit df216b5736624e611cbc2cb048ba29c66edb3aed.

Reverted https://github.com/pytorch/pytorch/pull/97824 on behalf of https://github.com/izaitsevfb due to back out diff that doubles memory consumption for multitask FAIM flows. See D44978317
2023-04-17 20:34:01 +00:00
08dd4ad0b9 Revert "[pt2] add SymInt support for column_stack (#99276)"
This reverts commit 775dd869d0188dde4e5da27142960760e4f9a1c2.

Reverted https://github.com/pytorch/pytorch/pull/99276 on behalf of https://github.com/ezyang due to reverting this one too for safety
2023-04-17 19:37:58 +00:00
62a6d81143 [SPMD][Easy] Making typing consistent by replacing object with Any (#99332)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99332
Approved by: https://github.com/dracifer
2023-04-17 19:33:45 +00:00
19c2804614 [SPMD][EASY] Remove unnecessary torch.ops prefix (#99331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99331
Approved by: https://github.com/dracifer
2023-04-17 19:33:45 +00:00
f957334c2b Revert "[pt2] add meta function for linalg.cross (#99279)"
This reverts commit efc3887ea508b3cfd94603fd8afe4e8cf6dce7b7.

Reverted https://github.com/pytorch/pytorch/pull/99279 on behalf of https://github.com/ezyang due to Apparently this is breaking inductor on master? So weird
2023-04-17 19:33:16 +00:00
2b54d673fc Add custom backend case for storage and automatically generate storage attributes. (#98478)
Currently, storage only supports a subset of backends. We want storage to be creatable on a custom backend via the PrivateUse1 key.
It also provides easy automatic generation of storage-related attributes.
When the user registers a new backend, the corresponding methods and attributes can be generated automatically.
Run the following code:
`torch.utils.rename_privateuse1_backend('foo')`
`torch.utils.generate_storage_for_privateuse1_backend()`
Then, get the following methods and attributes.
`torch.TypedStorage.is_foo`
`torch.TypedStorage.foo()`
`torch.UntypedStorage.is_foo`
`torch.UntypedStorage.foo()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98478
Approved by: https://github.com/albanD
2023-04-17 19:18:39 +00:00
8efc965e05 Update FBGEMM submodule (#99315)
To e07dda2d50 that among other things includes https://github.com/pytorch/FBGEMM/pull/1648

Fixes https://github.com/pytorch/pytorch/issues/99223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99315
Approved by: https://github.com/albanD
2023-04-17 18:44:56 +00:00
80eab63587 [Quant][pt2e] torch.mean and ReLU6 (#98984)
Add nn.Module ReLU6 in addition to functional relu6.

Also add torch.mean to the quantization config

Differential Revision: [D44901038](https://our.internmc.facebook.com/intern/diff/D44901038/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98984
Approved by: https://github.com/jerryzh168
2023-04-17 18:33:04 +00:00
444a9769ae [quant][pt2e] QAT Linear (#98897)
Differential Revision: [D44901039](https://our.internmc.facebook.com/intern/diff/D44901039/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98897
Approved by: https://github.com/tiandiao123, https://github.com/manuelcandales
2023-04-17 18:27:39 +00:00
568935caca [quant][pt2e] QAT conv + bn + relu (#98896)
Differential Revision: [D44901040](https://our.internmc.facebook.com/intern/diff/D44901040/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98896
Approved by: https://github.com/manuelcandales
2023-04-17 18:24:08 +00:00
7401f0f8ce Add unbacked symbool support (#98877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98877
Approved by: https://github.com/ezyang
2023-04-17 17:45:10 +00:00
08ffe34621 Revert "Skip sccache initialization on MacOS (#99121)"
This reverts commit 70ec347f0628c3ae795591b87076aefa1fe9c2ec.

Reverted https://github.com/pytorch/pytorch/pull/99121 on behalf of https://github.com/huydhn due to The cache still timeout and there is no way to increase the timeout value to be more than 10s looking at sccache code 6bbef54b88/src/commands.rs (L48).  This needs reworks
2023-04-17 17:36:05 +00:00
cdab6c8df9 [PT2E][Quant] Support specifying None for obs_or_fq_ctr in target_dtype_info (#99071)
It is cleaner for a quantizer to say what does not need observation instead of
inserting fp32 observers. This diff adds support for that by checking whether
target_dtype_info contains None for specific observers and, if so, skipping
observer insertion for those.

Differential Revision: [D44971357](https://our.internmc.facebook.com/intern/diff/D44971357/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99071
Approved by: https://github.com/jerryzh168
2023-04-17 16:37:16 +00:00
36a95625da [PT2E][Quant][BE] Refactor observer code (#99066)
Combine per channel and per tensor observer code

Differential Revision: [D44918494](https://our.internmc.facebook.com/intern/diff/D44918494/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99066
Approved by: https://github.com/jerryzh168
2023-04-17 16:17:36 +00:00
4f4e0db5bd [PT2E][Quant][BE] Split short term and long term tests in different files (#99065)
Just for better organization

Differential Revision: [D44918492](https://our.internmc.facebook.com/intern/diff/D44918492/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44918492/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99065
Approved by: https://github.com/jerryzh168
2023-04-17 16:12:47 +00:00
bcf6393024 [PT2E][Quant][BE] Move pt2e quantization test to separate folder (#99064)
Move it out of fx for better code organization

Differential Revision: [D44918496](https://our.internmc.facebook.com/intern/diff/D44918496/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44918496/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99064
Approved by: https://github.com/jerryzh168
2023-04-17 16:07:03 +00:00
0711bff9aa [ROCm] add skipCUDAIfVersionLessThan to unskip test_jiterator for ROCm (#99197)
This unskips 121 tests that the decorator `@skipCUDAIf(_get_torch_cuda_version() < (11, 6))` was unintentionally skipping for ROCm.  Other decorators such as `skipCUDAVersionIn` will only activate for the CUDA device, not the CPU or ROCm-as-CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99197
Approved by: https://github.com/ngimel
2023-04-17 16:05:16 +00:00
e549ad0046 Add log_sigmoid_backward forward-AD (#99288)
Fixes #95057
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99288
Approved by: https://github.com/kshitij12345, https://github.com/albanD
2023-04-17 15:45:20 +00:00
dede0bb065 [NCCL] Use OptionalCUDAGuard in ProcessGroupNCCL::WorkNCCL::synchronizeInternal (#98895)
Using `CUDAGuard` performs redundant `set_device` calls in the following loop:
```C++
{
    for (auto& device : devices_) {
      at::cuda::CUDAGuard gpuGuard(device); // set device
      // ...
      // ~gpuGuard() sets original device
    }
    // ...
}
```
It would be more efficient to use `OptionalCUDAGuard` as follows:
```C++
{
    at::cuda::OptionalCUDAGuard gpuGuard;
    for (auto& device : devices_) {
      gpuGuard.set_index(device.index()); // set device
      // ...
    }
    // ...
    // ~gpuGuard() sets original device
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98895
Approved by: https://github.com/mrshenli
2023-04-17 14:16:38 +00:00
ed5395dbef make aten/src/ATen/native/cuda/Indexing.cu data_ptr-correct (#99154)
make aten/src/ATen/native/cuda/Indexing.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99154
Approved by: https://github.com/ezyang
2023-04-17 13:37:12 +00:00
d44c5e3639 make ATen/native/cuda/IndexKernel.cu data_ptr-safe (#99158)
make ATen/native/cuda/IndexKernel.cu data_ptr-safe

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99158
Approved by: https://github.com/ezyang
2023-04-17 13:30:10 +00:00
63a09a588d make ATen/native/cuda/UniqueCub.cu data_ptr-correct (#99150)
make ATen/native/cuda/UniqueCub.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99150
Approved by: https://github.com/ezyang
2023-04-17 13:15:51 +00:00
55ed2b8a32 inductor: add device and dtype check before doing cpu fx packed weight (#99028)
fix https://github.com/pytorch/pytorch/issues/96406 and https://github.com/pytorch/pytorch/issues/98979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99028
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-17 13:10:59 +00:00
0157b2d722 Simple Custom Operator API, V0 (#98440)
This PR introduces CustomOp, a wrapper around a dispatcher operator that allows
users to define custom operators. It adds the skeleton for CustomOp and
some very simple behavior. As of this PR, one can:
- create a CustomOp for an operator that has no in-place or aliasing semantics
- give it CPU/CUDA and Meta implementations
- trace it into a graph via make_fx.

The design follows
https://docs.google.com/document/d/19Uc5OUCA187q9BZggJb70RT2ZoSTDoG5QQkJkZwd25M/edit
Concretely, we implement the following things mentioned in the doc in this PR:
- Entrypoint 1 (CustomOp.define, creating a new custom operator)
- impl (to define device-specific code) and impl_meta (to define meta
formulas)
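
Since the CustomOp entrypoints are still private and their exact signatures are not shown here, the flow above (schema, CPU/CUDA kernels, and a Meta kernel so the op can be traced) can be sketched with the public torch.library API that registers dispatcher operators; the "mylib::numpy_sin" operator below is made up for illustration and is not the CustomOp API itself:
```python
import torch
from torch.library import Library

lib = Library("mylib", "DEF")
lib.define("numpy_sin(Tensor x) -> Tensor")

def sin_impl(x):
    return torch.sin(x)

# device-specific implementations (the role of CustomOp.impl)
lib.impl("numpy_sin", sin_impl, "CPU")
lib.impl("numpy_sin", sin_impl, "CUDA")

# meta formula (the role of impl_meta): shape/dtype only, no real data,
# which is what allows tracing the op via make_fx
lib.impl("numpy_sin", lambda x: torch.empty_like(x), "Meta")
```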

The goal for the short term is to get the code to a state where it can be trialed
by the export folks. On top of this PR, the blockers are:
- adding Entrypoint 3 (CustomOp.from_existing)
- adding a way to do data-dependent shape formulas
These will come in future PRs since this one is getting long.

Things that will come in the longer-near-term (before 2.1):
- adding the other entrypoints mentioned in the doc (2 & 3)
- more safety checks and better error messages
- support for views and mutation
- support for defining autograd formulas
- support for functionalization
- making this API public (it's private right now).

Test Plan:
- added a new test case, TestCustomOp. It mostly tests a bunch of error
cases.
- added OpInfos for custom operators and hooked these up to
test_proxy_tensor to test that they work with make_fx. These custom
operators were based off of the ones in the autograd_function_db.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98440
Approved by: https://github.com/ezyang
2023-04-17 12:17:32 +00:00
503104ce31 make ATen/native/cuda/LegacyThrustHelpers.cu data_ptr-correct (#99273)
make ATen/native/cuda/LegacyThrustHelpers.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99273
Approved by: https://github.com/ezyang
2023-04-17 11:45:51 +00:00
e9fef4a70c make gemv calls data_ptr_correct (#99161)
make gemv calls data_ptr_correct

Summary:
The following link for dgemv, for example, shows that all arguments
except for `y` are exclusively input arguments.
https://netlib.org/lapack/explore-html/d7/d15/group__double__blas__level2_gadd421a107a488d524859b4a64c1901a9.html

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99161
Approved by: https://github.com/ezyang
2023-04-17 11:45:42 +00:00
46c0912ca7 make ATen/native/cuda/Blas.cpp data_ptr-correct (#99274)
make ATen/native/cuda/Blas.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99274
Approved by: https://github.com/ezyang
2023-04-17 11:02:31 +00:00
148d49260a [SPMD] Implement split_fused_optimizer to split one fused_optimizer node to two (#98784)
Several optimization passes requires the ability to split the fused_optimizer.  This PR adds the API to support the use cases.

Differential Revision: [D44806450](https://our.internmc.facebook.com/intern/diff/D44806450/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98784
Approved by: https://github.com/mrshenli
2023-04-17 10:02:07 +00:00
2fc7f984e5 make vol2col data_ptr-correct (#99152)
make vol2col data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99152
Approved by: https://github.com/ezyang
2023-04-17 08:53:01 +00:00
ecf4400b3a make radix_sort_pairs data_ptr-correct (#99153)
make radix_sort_pairs data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99153
Approved by: https://github.com/ezyang
2023-04-17 08:52:04 +00:00
98b62f7c12 make remaining gemm calls data_ptr-correct (#99156)
make remaining gemm calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99156
Approved by: https://github.com/ezyang
2023-04-17 08:51:58 +00:00
306594b2b0 make ATen/native/cuda/AdaptiveMaxPooling2d.cu data_ptr-correct (#99164)
make ATen/native/cuda/AdaptiveMaxPooling2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99164
Approved by: https://github.com/ezyang
2023-04-17 08:51:48 +00:00
314cba9641 make ATen/native/cuda/SegmentReduce.cu data_ptr-correct (#99163)
make ATen/native/cuda/SegmentReduce.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99163
Approved by: https://github.com/ezyang
2023-04-17 08:51:42 +00:00
777a666a60 make ATen/native/cuda/Unique.cu data_ptr-correct (#99240)
make ATen/native/cuda/Unique.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99240
Approved by: https://github.com/ezyang
2023-04-17 08:51:18 +00:00
31393ea457 make ATen/native/cuda/MultinomialKernel.cu data_ptr-correct (#99241)
make ATen/native/cuda/MultinomialKernel.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99241
Approved by: https://github.com/ezyang
2023-04-17 08:51:11 +00:00
a8429342df fix mul/div overflow issue on CPU float16 (#98820)
Fix https://github.com/pytorch/pytorch/issues/63482 and https://github.com/pytorch/pytorch/issues/98691

The above two issues have the same root cause:

**binary_ops** create a TensorIterator with the `promote_inputs_to_common_dtype` flag on, which converts both input tensors to `common_dtype_` (this logic is bypassed on CUDA) and can overflow on Half. If one of the inputs is a scalar with an absolute value larger than ~65000, it overflows.

This patch fetches the scalar value from `original_tensor_base`, which records the original scalar input value; then, in `cpu_kernel_vec`, the TensorIterator is treated as a unary op.

Previously, CPU and CUDA had different behaviors in this scenario. This patch aligns them; test cases are added for both CPU and CUDA devices.

The following is the results:

#### before:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([inf], dtype=torch.float16)
```

#### after:
```
>>> torch.tensor([3388.], dtype=torch.half).div(524288.0)
tensor([0.0065], dtype=torch.float16)
>>> torch.tensor([0.01], dtype=torch.float16) * torch.tensor(65536, dtype=torch.float32)
tensor([655.5000], dtype=torch.float16)
```

We also need to update the `RRelu` implementation to use float for the intermediate results; otherwise the following test case would fail:
```
. build/bin/test_api --gtest_filter=ModulesTest.RReLU
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98820
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-04-17 07:12:53 +00:00
efc3887ea5 [pt2] add meta function for linalg.cross (#99279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99279
Approved by: https://github.com/ezyang
2023-04-17 03:05:20 +00:00
775dd869d0 [pt2] add SymInt support for column_stack (#99276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99276
Approved by: https://github.com/ezyang
2023-04-17 03:05:20 +00:00
2418b94576 Rename default branch to main (#99210)
Mostly `s/@master/@main` in numerous `.yml` files.

Keep `master` in `weekly.yml` as it refers to the `xla` repo, and in `test_trymerge.py` as it refers to the branch a PR originates from.
2023-04-16 18:48:14 -07:00
31f311a816 [PT2E][Quantization] Refactor Quantizer and QNNPACKQuantizer (#99063)
This diff renames quantization spec/config and operator config. It moves these
data structures to the base quantizer.
The base quantizer API now has get_supported_operators, which returns the list of
patterns that a quantizer quantizes.
There are two choices being debated for how to convey to the user what a particular
quantizer will quantize.

1. Modules. We just convey which nn.Modules will be quantized. Of course that
does not mean that equivalent functional variants won't be quantized, however
for simplicity we just use nn.Module. If certain ops are quantized in a fused
manner then that is considered an internal detail. Pros and cons of this
approach:
Pros:
  - Simple. Only nn.Modules are listed.
  - User does not have to see fusion patterns.
Cons:
  - Perhaps confusing, because it is not clear whether supported = nn.Conv2d also
    means that the quantizer supports functional.conv2d.
  - Hiding fusion patterns means the user has no say in not fusing. Meaning if
    conv2d + relu is fused and the user configures to quantize only conv, the quantizer
    will also quantize the following relu as if conv2d + relu were fused.

2. Patterns. Be explicit about what is supported and enumerate all possible
combinations.
Pros:
  - It is very clear what the quantizer will do; no surprises.
Cons:
  - It is not simple to parse.
  - It can be argued that fusion is an internal detail of the quantizer. So some
    quantizer implementations may choose to expose fusion patterns, while others
    may not, and may not even provide any configurability.

One option is to move set_supported_operators/modules out of the base quantizer and
let each quantizer define its own way of communicating what is supported. The issue
with this is that when we want to "compose" multiple quantizers, there is no way
for the user to define the order of composition if the user does not know what a
quantizer supports. For example, quantizer A may quantize conv + relu while B quantizes
only conv, but B's implementation is fast. In that case you may compose (B, A)
such that B quantizes conv and A quantizes relu. Not knowing what A
and B support makes such composition harder.

Differential Revision: [D44895547](https://our.internmc.facebook.com/intern/diff/D44895547/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44895547/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99063
Approved by: https://github.com/jerryzh168
2023-04-17 00:34:18 +00:00
888c65b6a4 fix fake tensor propagation for cross_entropy with smoothing (#99255)
Fixes #99250; unfortunately I haven't figured out how to handle cross-entropy with smoothing and weights.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99255
Approved by: https://github.com/jansel, https://github.com/malfet
2023-04-17 00:31:26 +00:00
fa502ab034 simplify codegen for integer min/max since they don't need to propagate nan (#99249)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99249
Approved by: https://github.com/jansel, https://github.com/malfet
2023-04-17 00:28:23 +00:00
9ab5fdff81 Remove obsolete HAS_PRIMS_REFS (#99252)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99252
Approved by: https://github.com/ngimel
2023-04-17 00:27:37 +00:00
20a1788136 Revert "[quant][fix] Compare resnet with quantizer api with the prepare_fx and decomposed convert flow (#98905)"
This reverts commit 9e0df2379b9e13a36c59bb0d55f4922de8305bd6.

Reverted https://github.com/pytorch/pytorch/pull/98905 on behalf of https://github.com/izaitsevfb due to Conflicts with D44918496 landed internally, blocks diff train import
2023-04-17 00:17:10 +00:00
be0b12ece5 make untemplated gemm calls data_ptr-correct (#99184)
make untemplated gemm calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99184
Approved by: https://github.com/ezyang
2023-04-16 20:11:13 +00:00
29ff5a0c91 make ATen/native/cuda/Embedding.cu data_ptr-correct (#99183)
make ATen/native/cuda/Embedding.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99183
Approved by: https://github.com/ezyang
2023-04-16 20:11:07 +00:00
b08c384106 Add parameter for pin memory of storage to support other devices. (#98692)

Add a parameter for pin memory of storage to support other devices.
In the future, other backends will provide their own allocators to create pinned memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98692
Approved by: https://github.com/ezyang
2023-04-16 20:06:27 +00:00
851e89c8e8 Revert "Expandable blocks in allocator (#96995)"
This reverts commit 6a50b83b739c2d37d0f518f98b8e624eca0ea153.

Reverted https://github.com/pytorch/pytorch/pull/96995 on behalf of https://github.com/izaitsevfb due to Breaks internal tests
2023-04-16 19:23:37 +00:00
6f181aae7c [vmap] Register decomposition for huber_loss_backward (#99236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99236
Approved by: https://github.com/kshitij12345
2023-04-16 18:50:45 +00:00
e2923b521b Further improve symbolic shapes logging (#99159)
* Introduce a frame counter which lets us uniquely identify frames.
  This makes it easier to tell if you are recompiling the same frame
* Shorten evaluate_expr to eval for more visual distinctiveness

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99159
Approved by: https://github.com/Skylion007
2023-04-16 12:06:38 +00:00
fdbc8625a1 Functionalization of torch.rand/rand_like ops (#97377)
This PR introduces the functionalization of RNG ops. Key points are

* Introduces a new `philox_rand` prim operator that accepts seed, offset.
* Adds decompositions for random operators that use these philox_rand prims
* Adds a PhiloxStateTracker to track the offset for each occurrence of rand ops
* Changes calling convention of AOT Autograd and adds <fwd_seed, fwd_base_offset> and <bwd_seed, bwd_base_offset>
* Monkeypatches set_rng_state and get_rng_state while AOT Autograd tracing to record the rng state behavior
* Raises an assertion for CPU because CPU does not use Philox RNG.

Not dealt with in this PR
* dropout op - offset calculation is different
* other distributions like normal, poisson etc
* Inductor support
* Cudagraph support
* Dynamic shape support

An example
~~~

class Custom(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        a = torch.rand_like(x) * x
        a = torch.rand_like(x) * a
        return a

    @staticmethod
    def backward(ctx, grad_out):
        x, = ctx.saved_tensors
        return grad_out * torch.rand_like(grad_out) * torch.cos(x)

====== Forward graph 0 ======
def forward(self, fwd_seed_1: i64[], fwd_base_offset_1: i64[], primals_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 0)
    philox_rand: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add, [16, 1], device(type='cuda', index=0), torch.float32);  add = None
    mul: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand, primals_1);  philox_rand = None
    add_1: i64[] = torch.ops.aten.add.Tensor(fwd_base_offset_1, 4);  fwd_base_offset_1 = None
    philox_rand_1: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], fwd_seed_1, add_1, [16, 1], device(type='cuda', index=0), torch.float32);  fwd_seed_1 = add_1 = None
    mul_1: f32[16, 16] = torch.ops.aten.mul.Tensor(philox_rand_1, mul);  philox_rand_1 = mul = None
    return [mul_1, primals_1]

====== Backward graph 0 ======
def forward(self, bwd_seed_1: i64[], bwd_base_offset_1: i64[], primals_1: f32[16, 16], tangents_1: f32[16, 16]):
    # No stacktrace found for following nodes
    add_2: i64[] = torch.ops.aten.add.Tensor(bwd_base_offset_1, 0);  bwd_base_offset_1 = None
    philox_rand_2: f32[16, 16] = torch.ops.prims.philox_rand.default([16, 16], bwd_seed_1, add_2, [16, 1], device(type='cuda', index=0), torch.float32);  bwd_seed_1 = add_2 = None
    mul_2: f32[16, 16] = torch.ops.aten.mul.Tensor(tangents_1, philox_rand_2);  tangents_1 = philox_rand_2 = None
    cos: f32[16, 16] = torch.ops.aten.cos.default(primals_1);  primals_1 = None
    mul_3: f32[16, 16] = torch.ops.aten.mul.Tensor(mul_2, cos);  mul_2 = cos = None
    return [mul_3]

~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97377
Approved by: https://github.com/ezyang
2023-04-16 09:55:56 +00:00
6e1e27fc4e [inductor] Refactor pre-grad passes into inductor.fx_passes (#99130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99130
Approved by: https://github.com/ngimel
2023-04-16 04:05:56 +00:00
91279f9471 [inductor][quant]Enable inductor vec codegen for quantization (#98489)
**Summary**
Enable the `decomposed dequant - pointwise ops - decomposed quant` vectorization code gen inside inductor.
Here is the example in the UT and the generated code:

Example:
* `decomposed dequant - relu - decomposed quant` pattern.
* Using `uint8` as the quantized tensor data type.

Generated Code:
```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_root/hw/chwr6vy6e6sd25sfh42qtywkuf2emodexm2aomp3lbrcxwznfwyi.h"
extern "C" void kernel(const unsigned char* in_ptr0,
                       unsigned char* out_ptr0)
{
    #pragma omp parallel num_threads(56)
    {
        {
            #pragma omp for
            for(long i0=static_cast<long>(0); i0<static_cast<long>(27); i0+=static_cast<long>(1))
            {
                auto tmp0 = at::vec::load_uint8_as_float(in_ptr0 + static_cast<long>(16*i0));
                auto tmp1 = (tmp0);
                auto tmp2 = at::vec::Vectorized<float>(static_cast<float>(100));
                auto tmp3 = tmp1 - tmp2;
                auto tmp4 = at::vec::Vectorized<float>(static_cast<float>(0.01));
                auto tmp5 = tmp3 * tmp4;
                auto tmp6 = at::vec::clamp_min(tmp5, decltype(tmp5)(0));
                auto tmp7 = at::vec::Vectorized<float>(static_cast<float>(100.0));
                auto tmp8 = tmp6 * tmp7;
                auto tmp9 = tmp8.round();
                auto tmp10 = tmp9 + tmp2;
                auto tmp11 = at::vec::Vectorized<float>(static_cast<float>(0));
                auto tmp12 = at::vec::maximum(tmp10, tmp11);
                auto tmp13 = at::vec::Vectorized<float>(static_cast<float>(255));
                auto tmp14 = at::vec::minimum(tmp12, tmp13);
                auto tmp15 = (tmp14);
                tmp15.store_float_as_uint8(out_ptr0 + static_cast<long>(16*i0));
            }
            #pragma omp for simd simdlen(8)
            for(long i0=static_cast<long>(432); i0<static_cast<long>(441); i0+=static_cast<long>(1))
            {
                auto tmp0 = in_ptr0[static_cast<long>(i0)];
                auto tmp1 = static_cast<float>(tmp0);
                auto tmp2 = static_cast<float>(100);
                auto tmp3 = tmp1 - tmp2;
                auto tmp4 = static_cast<float>(0.01);
                auto tmp5 = tmp3 * tmp4;
                auto tmp6 = tmp5 * (tmp5>0);
                auto tmp7 = static_cast<float>(100.0);
                auto tmp8 = tmp6 * tmp7;
                auto tmp9 = std::nearbyint(tmp8);
                auto tmp10 = tmp9 + tmp2;
                auto tmp11 = static_cast<float>(0);
                auto tmp12 = (tmp11 != tmp11) ? tmp11 : std::max(tmp10, tmp11);
                auto tmp13 = static_cast<float>(255);
                auto tmp14 = (tmp13 != tmp13) ? tmp13 : std::min(tmp12, tmp13);
                auto tmp15 = static_cast<unsigned char>(tmp14);
                out_ptr0[static_cast<long>(i0)] = tmp15;
            }
        }
    }
}
''')
```

**Test Plan**
```
cd test/inductor &&  python -m pytest test_cpu_repro.py -k test_decomposed_dequant_relu_quant
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98489
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-16 03:48:52 +00:00
039faf0dbf Add invariant that all symbolic shapes must be bound in graph (#99089)
Previously, we had a problem when partitioning forward-backward dynamic graphs, which is that we could end up with a backward graph that mentions a symbol in an input tensor (e.g., `f32[s0 + s1]`), but without this symbol being otherwise bound elsewhere. When this happens, we have no way of actually deriving the values of `s0` and `s1`. Our fix for this in https://github.com/pytorch/pytorch/pull/93059 was to just retrace the graph, so that s0 + s1 got allocated a new symbol s2 and everything was happy. However, this strategy had other problems, namely (1) we lost all information from the previous ShapeEnv, including guards and (2) we end up allocating a LOT of fresh new symbols in backwards.

With this change, we preserve the same ShapeEnv between forward and backwards. How do we do this? We simply require that every symbol which may be present inside tensors, ALSO be a plain SymInt input to the graph. This invariant is enforced by Dynamo. Once we have done this, we can straightforwardly modify the partitioner to preserve these SymInt as saved for backwards, if they are needed in the backwards graph to preserve the invariant as well.
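To make the invariant concrete, here is a hypothetical sketch (the argument names and annotations are made up for illustration and are not the actual AOT Autograd graph format):
```python
# Before: the backward graph's only mention of s0/s1 is inside a tensor type,
# so their values cannot be derived from the inputs alone.
def backward_before(grad: "f32[s0 + s1]"):
    ...

# After: every symbol that appears in an input tensor's symbolic shape is also
# passed as a plain SymInt input, and the partitioner saves these for backward.
def backward_after(s0: "Sym(s0)", s1: "Sym(s1)", grad: "f32[s0 + s1]"):
    ...
```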

This apparently breaks yolov3, but since everything else is OK I'm merging this as obviously good and investigating later.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99089
Approved by: https://github.com/voznesenskym
2023-04-16 01:48:19 +00:00
c69d54885a [SPMD][BE] Generalize factory ops support in SPMD expansion (#99233)
Differential Revision: [D45028740](https://our.internmc.facebook.com/intern/diff/D45028740)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99233
Approved by: https://github.com/yifuwang
2023-04-16 00:07:27 +00:00
6bb20822f5 [SPMD][BE] Remove deprecated aten.sym_numel branch (#99232)
Differential Revision: [D45028732](https://our.internmc.facebook.com/intern/diff/D45028732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99232
Approved by: https://github.com/yifuwang
2023-04-16 00:07:27 +00:00
39be994913 [SPMD][BE] Consolidate DSymInt Branches (#99231)
Differential Revision: [D45028726](https://our.internmc.facebook.com/intern/diff/D45028726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99231
Approved by: https://github.com/yifuwang
2023-04-16 00:07:24 +00:00
544cd8e134 [SPMD] Refactor DSize to DSymInt to enable sym_numel (#99206)
This commit uses `aten.arange.default` and `aten.arange.start` to
test `aten.sym_numel`.

Differential Revision: [D45028715](https://our.internmc.facebook.com/intern/diff/D45028715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99206
Approved by: https://github.com/yifuwang
2023-04-16 00:07:21 +00:00
bafb984022 [SPMD] Enable aten.full.default with SymInt on sharded dims (#99190)
Differential Revision: [D45028686](https://our.internmc.facebook.com/intern/diff/D45028686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99190
Approved by: https://github.com/yifuwang
2023-04-16 00:07:18 +00:00
d350646ff6 SymIntify randint and randperm (#98968)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98968
Approved by: https://github.com/xw285cornell
2023-04-15 22:43:51 +00:00
756a86d52c Support large negative SymInt (#99157)
The strategy is that we will heap allocate a LargeNegativeIntSymNodeImpl whenever we have a large negative int, so that we can keep the old `is_symbolic` test (now called `is_heap_allocated`) on SymInt. Whenever we need to do something with these ints, though, we convert them back into a plain `int64_t` (and then, e.g., wrap it in whatever user-specified SymNodeImpl they need). We cannot wrap directly in the user-specified SymNodeImpl as we generally do not know what the "tracing context" is from C++. We expect large negative ints to be rare, so we don't apply optimizations like singleton-ifying INT_MIN.  Here's the order to review:

* c10/core/SymInt.h and cpp
  * `is_symbolic` renamed to `is_heap_allocated` as I needed to audit all use sites: the old `is_symbolic` test would return true for a large negative int, but it would be wrong to then try to dispatch on the LargeNegativeIntSymNodeImpl, which supports very few operations. In this file, I also had to update expect_int.
  * If you pass in a large negative integer, we instead heap allocate it in `promote_to_negative`. The function is written in a funny way to keep compact constructor code for SymInt (the heap allocation happens out of line)
  * clone is now moved out-of-line
  * New method maybe_as_int which will give you a constant int if it is possible, either because it's stored inline or in LargeNegativeIntSymNodeImpl. This is the preferred replacement for previous use of is_symbolic() and then as_int_unchecked().
  * Rename toSymNodeImpl to toSymNode, which is more correct (since it returns a SymNode)
  * Complete rewrite of `normalize_symints.cpp` to use the new `maybe_as_int`. Cannot easily use the old code structure, so it's now done using a macro and typing out each case manually (it's actually not that bad).
  * Reimplementations of all the unary operators by hand to use `maybe_as_int`, relatively simple.
* c10/core/LargeNegativeIntSymNodeImpl.h - Just stores a int64_t value, but it has to be big and negative. Most methods are not implemented, since we will rewrap the large negative int in the real SymNodeImpl subclass before doing operations with it
* The rest of the files are just rewriting code to use `maybe_as_int`. There is a nontrivial comment in c10/core/SymIntArrayRef.h

Very minor test adjustment in c10/test/core/SymInt_test.cpp . Plan to exercise this properly in next PR.

Companion XLA PR: https://github.com/pytorch/xla/pull/4882

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99157
Approved by: https://github.com/albanD
2023-04-15 22:43:51 +00:00
5c062e8bb4 [vmap] Add vmap support for nn.functional.huber_loss (#99235)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99235
Approved by: https://github.com/kshitij12345
2023-04-15 22:19:35 +00:00
c9403f128b fix breakage from #99027 (#99245)
fix breakage from #99027

Summary:
There is a conditional branch that is not tested by OSS CI.

Test Plan: Run nightly binary builds that include CUDA-12

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99245
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/jansel, https://github.com/mihaimaruseac
2023-04-15 21:26:26 +00:00
3cde50e3fa Update NT error message (#99166)
https://github.com/pytorch/nestedtensor has been archived.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99166
Approved by: https://github.com/jbschlosser
2023-04-15 21:18:03 +00:00
47c685def3 [dynamo] Support DELETE_ATTR (#98698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98698
Approved by: https://github.com/yanboliang
2023-04-15 20:31:40 +00:00
15fe5a0798 [Dynamo] Fix benchmark --verbose error (#99224)
Dynamo benchmark --verbose is broken:
```
Traceback (most recent call last):
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 400, in <module>
    torchbench_main()
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/torchbench.py", line 396, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 1967, in main
    return maybe_fresh_cache(
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 993, in inner
    return fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/benchmarks/dynamo/common.py", line 2135, in run
    torch._dynamo.config.log_level = logging.DEBUG
  File "/scratch/ybliang/work/repos/pytorch/torch/_dynamo/config_utils.py", line 67, in __setattr__
    raise AttributeError(f"{self.__name__}.{name} does not exist")
AttributeError: torch._dynamo.config.log_level does not exist
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99224
Approved by: https://github.com/voznesenskym
2023-04-15 20:18:50 +00:00
21681f36f4 [pt2] add SymInt support for fft ops (#99115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99115
Approved by: https://github.com/ezyang
2023-04-15 18:01:39 +00:00
f89b7c2bec [pt2] add SymInt support for roll (#99114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99114
Approved by: https://github.com/ezyang
2023-04-15 18:01:39 +00:00
d5f7ec8a31 Apply dynamic shapes policy correctly to _base tensor (#99211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99211
Approved by: https://github.com/ezyang
2023-04-15 17:18:45 +00:00
85f38b8a33 [BE] Update flake8-comprehensions and adapt to rule C418 (#99178)
Applies rule C418 and fixes all instances of it. Also updates flake8-comprehensions.
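
For reference, C418 flags a dict or dict comprehension that is needlessly wrapped in an outer dict() call; a small illustration (not taken from the codebase):
```python
pairs = [("a", 1), ("b", 2)]

# flagged by C418: the outer dict() call is unnecessary
d1 = dict({"a": 1})
d2 = dict({k: v for k, v in pairs})

# fixed
d1 = {"a": 1}
d2 = {k: v for k, v in pairs}
```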

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99178
Approved by: https://github.com/ezyang
2023-04-15 15:33:42 +00:00
506bd05752 make ATen/native/cuda/NLLLoss2d.cu data_ptr-correct (#99179)
make ATen/native/cuda/NLLLoss2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99179
Approved by: https://github.com/ezyang
2023-04-15 14:02:24 +00:00
e9201ab690 make ATen/native/cuda/AdaptiveMaxPooling3d.cu data_ptr-correct (#99185)
make ATen/native/cuda/AdaptiveMaxPooling3d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99185
Approved by: https://github.com/ezyang
2023-04-15 14:02:11 +00:00
34f681c13b [CI] Remove inductor skip list for timm_models (#98840)
Summary: check against the expected csv file instead of skipping tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98840
Approved by: https://github.com/ezyang
2023-04-15 13:54:41 +00:00
a595a50653 [CI] Use expected accuracy csv files to check benchmark test status (#98839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98839
Approved by: https://github.com/ezyang
2023-04-15 13:54:41 +00:00
1adb6fa922 nn.Linear: dispatch to bsr_dense_mm for half and bfloat16 (#94825)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94825
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2023-04-15 13:38:42 +00:00
b69f0480a5 make ATen/native/cuda/MaxUnpooling.cu data_ptr-correct (#99189)
make ATen/native/cuda/MaxUnpooling.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99189
Approved by: https://github.com/ezyang
2023-04-15 13:37:07 +00:00
60f914e77e make ATen/native/cuda/UpSampleNearest2d.cu data_ptr-correct (#99186)
make ATen/native/cuda/UpSampleNearest2d.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99186
Approved by: https://github.com/ezyang
2023-04-15 13:25:10 +00:00
05809c7d3b [Dynamo] No graph break for explicit calling Conv{1/2/3}d.forward & ConvTranspose{1/2/3}d.forward (#99015)
Before this PR, if users call ```Conv2d(x)```, dynamo handles it well (no graph break) and puts a ```call_module``` op in the FX graph. However, if users explicitly call ```Conv2d.forward(x)``` in another ```forward``` function, the inlining would fail (causing a graph break). This PR fixes this issue by translating the explicit ```Conv2d.forward(x)``` into ```Conv2d(x)```.
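
A small repro of the now-supported pattern (a minimal sketch, not the test added in the PR):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x):
        # explicit .forward(...) call; previously this broke inlining,
        # now dynamo treats it like self.conv(x)
        return self.conv.forward(x)

opt = torch.compile(M())
out = opt(torch.randn(1, 3, 16, 16))
```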

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99015
Approved by: https://github.com/jansel, https://github.com/wconstab
2023-04-15 08:04:13 +00:00
157c869026 Enable FSDP `use_orig_params=True` mixed precision training when some ranks have no (non-zero sized) parameter shards (#99175)
Fixes #99174

## Enable FSDP ``use_orig_params=True`` mixed precision training when some ranks have no (non-zero sized) parameter shards

### The issue

Now that ``use_orig_params=True`` allows non-uniform ``requires_grad`` (🎉 🚀 thanks @awgu!!!) with [#98221](https://github.com/pytorch/pytorch/pull/98221), there will be circumstances wherein some ranks have no (non-zero sized) local shards of the original parameters (and hence no associated gradients).

### Use Cases
For a simple Transformer case, imagine a user wraps all encoder layers in separate FSDP instances but allows the classifier head to be wrapped in the same FSDP instance as the relatively large embeddings layers. While this is a sub-optimal wrapping strategy for most use-cases, I believe it is expected to be supported (full precision training works in that context).

I originally encountered this issue while extending a package I maintain, leveraging the relaxed ``requires_grad`` constraint to simplify multi-phase scheduled fine-tuning FSDP configuration, so a [concrete example is there](https://finetuning-scheduler.readthedocs.io/en/latest/advanced/fsdp_scheduled_fine_tuning.html#basic-scheduled-fine-tuning-with-fsdp).

### Reproduction and Remediation
Currently, ``ShardedGradScaler`` does not accommodate these situations, failing to initialize ``optimizer_state["found_inf_per_device"]`` when ``unscale_`` is called.

In this PR, I extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization and test scenarios wherein one rank possesses no (non-zero sized) parameter shards.

The relevant issue can be reproduced with the tests I'm adding in this PR. The current (pre-PR) execution of these tests fail in ``use_orig_params=True`` mode with this error:

```python
./test_fsdp_sharded_grad_scaler.py::TestShardedGradScalerParityWithDDP::test_fsdp_ddp_parity_with_grad_scaler_offload_false_none_mixed_precision_use_orig_params Failed with Error: Process 0 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 657, in run_test
    getattr(self, test_name)()
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 543, in wrapper
    fn()
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_utils.py", line 259, in instantiated_test
    test(self, **param_kwargs)
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_distributed.py", line 174, in wrapper
    return func(*args, **kwargs)
  File "/home/speediedan/repos/pytorch/test/distributed/fsdp/test_fsdp_sharded_grad_scaler.py", line 187, in test_fsdp_ddp_parity_with_grad_scaler
    self._test_fsdp_parity(
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1152, in _test_fsdp_parity
    fsdp_loss = self._train_for_several_steps(
  File "/home/speediedan/repos/pytorch/torch/testing/_internal/common_fsdp.py", line 1016, in _train_for_several_steps
    sharded_grad_scaler.step(optim)
  File "/home/speediedan/repos/pytorch/torch/distributed/fsdp/sharded_grad_scaler.py", line 291, in step
    return super().step(optimizer, *args, **kwargs)
  File "/home/speediedan/repos/pytorch/torch/cuda/amp/grad_scaler.py", line 368, in step
    assert len(optimizer_state["found_inf_per_device"]) > 0, "No inf checks were recorded for this optimizer."
AssertionError: No inf checks were recorded for this optimizer.
```

A few implementation notes/considerations and questions:

1. Rather than just initialize ``per_device_found_inf``, one could disable the grad scaler altogether for relevant ranks, altering ``unscale_`` to reduce with a subgroup or some rank mask construct to keep the ``all_reduce``s in ``distributed/fsdp/sharded_grad_scaler.py:unscale_()`` from hanging. Given that users may subsequently add parameter groups to an optimizer that would require re-enabling the scaler and the complexity associated with maintaining a separate mask construct or process subgroup, I thought this implementation was cleaner.
2. I extended ``_train_for_several_steps`` and ``_test_fsdp_parity`` in ``/torch/testing/_internal/common_fsdp.py`` with the ability to configure ``sharded_grad_scaler_kwargs`` for future testing flexibility.
3. Should the user be warned that no parameter shards were associated with a given rank? My initial thought is that this should be considered an implementation detail, part of supporting ``use_orig_params`` with heterogeneous ``requires_grad``, and therefore should be transparently handled by PyTorch. Should a DEBUG level message be added? If so, likely further upstream rather than at the scaler step level.
4. Rather than extend the existing ``ShardedGradScaler`` tests with an ``use_orig_params=True`` dimension added to the parameterization, let me know if you prefer that I instead narrow the scope of the new testing to a single additional test, e.g.:
	```python
	# from typing import Optional
	from typing import Optional, List
	# ...
	# use_orig_params = ["enable_use_orig_params", None]
	use_orig_params: List[Optional[str]] = [None]
	# ...
	configs = list(itertools.product(cpu_offload_config, sharding_strategy_config, mixed_precision, use_orig_params))
	configs.append((CPUOffload(offload_params=False), None, "enable_mixed_precision", "enable_use_orig_params"))
	```
Thanks as always to the PyTorch distributed team for your astonishingly impressive and valuable contributions to the open-source ML engineering community!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99175
Approved by: https://github.com/awgu
2023-04-15 05:13:23 +00:00
e9be0b0fb9 [dynamo] Support functools.wraps (#98699)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98699
Approved by: https://github.com/yanboliang, https://github.com/voznesenskym
2023-04-15 03:24:06 +00:00
b9426ded8d [vision hash update] update the pinned vision hash (#99212)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99212
Approved by: https://github.com/pytorchbot
2023-04-15 02:48:35 +00:00
3c4622c0ec Patch failing slow-test logic for inductor-dynamic (#99182)
Fixes #98954

But.. I'm not sure what the right fix is
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99182
Approved by: https://github.com/huydhn
2023-04-15 02:09:10 +00:00
6eab5e88c8 Graph-break on allowed modules if they have hooks (#97184)
Allowed modules are stuck into dynamo's fx graph as call_module
nodes, without dynamo doing any tracing of the module.  This means
during AOT trace time, hooks will fire during tracing when the
call_module is executed, but the hooks themselves will disappear
after that and not be present in the compiled program.
  (worse, if they performed any tensor operations, those would get
   traced so you could end up with part of the hook's functionality).

To circumvent this, there are two options for 'allowed modules' with hooks.
1) don't treat them as 'allowed' - trace into them
2) graph-break, so the module is no longer part of the dynamo trace at all

(1) will fail for users that opted into allowed modules because they know
    their module has problems being traced by dynamo.
(2) causes graph breaks on common modules such as nn.Linear, just because they
    are marked as 'allowed'.

It would help matters if we could differentiate between types of allowed modules
  (A) allowed to avoid overheads - used for common ops like nn.Linear
  (B) allowed to avoid dynamo graphbreaks caused by unsupported code

Ideally, we'd use method (1) for group (A) and (2) for (B).

For now, graph-break on all cases of allowed modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97184
Approved by: https://github.com/jansel
2023-04-15 01:46:15 +00:00
55c71cf91f [ONNX] Support aten.stack for dynamo_export (#99191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99191
Approved by: https://github.com/justinchuby
2023-04-15 01:13:00 +00:00
606ce5b653 [ONNX] Introduce Input/Ouptut adapter; Switch to 'DynamoExporter' (#98421)
Summary
* Introduce input/output adapter. Due to design differences, the input/output formats
of the PyTorch model and the exported ONNX model are often not the same. E.g., `None`
inputs are allowed for a PyTorch model but are not supported by ONNX. Nested constructs
of tensors are allowed for a PyTorch model, but only flattened tensors are supported by ONNX,
etc. The new input/output adapter is exported with the model, providing an interface to
automatically convert and validate the input/output format.
* As suggested by #98251,
provide extension for unwrapping user defined python classes for `dynamo.export` based
exporter. Unblock huggingface models.
* Re-wire tests to run through `DynamoExporter` w/ `dynamo_export` api. Kept
`DynamoOptimizeExporter` in the tests for now for coverage of this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98421
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi
2023-04-15 01:13:00 +00:00
ef11966aff [composable] Enable replicate + trec_shard overall (#98890)
replicate + trec_shard works if we shard / replicate individually, such as follows:

```
m = TestSparseNN()
shard(m.sparse)
replicate(m.dense)
```

but does not work if users do the following:
```
m = TestSparseNN()
shard(m, sharders=[...])
replicate(m)
```

Many upstream trainers use the latter use case, as sharding is not done at the individual-module level but rather on the overall module, by specifying planners that contain the logic for how to shard different embedding table types.

This diff enables the latter approach (while keeping the former intact), but users need to specify `ignored_modules` to ignore embedding tables in replicate(). This is similar to FSDP (class based and composable) and DDP today.
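
Continuing the snippets above, the overall-wrap flow enabled here would look roughly like this (a sketch; the ``ignored_modules`` argument follows the DDP/FSDP convention described in the text):
```
m = TestSparseNN()
shard(m, sharders=[...])
# ignore the embedding tables owned by shard() when replicating the rest
replicate(m, ignored_modules=[m.sparse])
```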

Differential Revision: [D44899155](https://our.internmc.facebook.com/intern/diff/D44899155/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98890
Approved by: https://github.com/mrshenli, https://github.com/yhcharles
2023-04-15 01:09:00 +00:00
e45fa1a581 Back out "[core][pruning][be] rename BaseSparsifier to BasePruner (#98747)" (#99171)
Summary: Back out D44856390 since renaming the type breaks backwards compatibility of existing models used in integration tests and likely in prod as well.

Test Plan:
buck2 run //aiplatform/modelstore/model_generation/integration_tests:cogwheel_igr_tab_offline_and_recurring_model_generation_v1_api_test-launcher -- --build-fbpkg --run-disabled --run-harness-in-tupperware

Now fails with an OOM: https://www.internalfb.com/servicelab/experiment/100000000259121/trial/100000000331723/run

It was failing with an import error without this revert.

Differential Revision: D44991351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99171
Approved by: https://github.com/izaitsevfb, https://github.com/osalpekar
2023-04-15 00:37:45 +00:00
c130b8a716 Reintroduce s390x SIMD support (#99057)
Reintroduce s390x SIMD support

Use vectorized FMA to fix test precision failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99057
Approved by: https://github.com/malfet
2023-04-15 00:24:44 +00:00
7cb581d42f aot_autograd: more logging on metadata asserts (#99177)
Summary: add better logging to aot autograd asserts to debug internal model issues

Test Plan: let CI run

Differential Revision: D45006044

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99177
Approved by: https://github.com/bertmaher
2023-04-15 00:22:00 +00:00
06ad8d6d5f Remove filter step (#98969)
remove filter steps from linux, rocm, and mac tests

there are still some filter jobs in other places, like bazel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98969
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-15 00:08:20 +00:00
cb23191523 [Vulkan] rewrite quantized add, mul, conv2d and conv2d_relu ops (#97468)
Summary:
This diffs registers the vulkan quantized binary ops (add/sub/mul/div), and adds graph rewrites for quantized add, mul, conv2d and conv2d_relu.
The rewrites for conv2d and conv2d_relu make use of the convert_qconv2d_context introduced in D41595032

Test Plan: export quantized mcs model to vulkan

Reviewed By: SS-JIA

Differential Revision: D44189363

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97468
Approved by: https://github.com/SS-JIA
2023-04-15 00:08:11 +00:00
a910045add [PATCH] Back out "Move functional collectives implementation to python. (#98595) (#99168)
Summary:
Original commit changeset: ba36f8751adc

Original Phabricator Diff: D44788697

Test Plan: model loading is fine after reverting the diff

Reviewed By: zyan0, sayitmemory

Differential Revision: D44921259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99168
Approved by: https://github.com/izaitsevfb
2023-04-14 23:48:19 +00:00
20019f7c56 Fix bug in symbolic shape messages (#99169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99169
Approved by: https://github.com/anijain2305
2023-04-14 23:18:29 +00:00
70ec347f06 Skip sccache initialization on MacOS (#99121)
Now that the cache is used on MacOS, I'm seeing some flaky timeouts when starting the server https://github.com/pytorch/pytorch/actions/runs/4703387666/jobs/8341817701.  This step is optional, so we could just skip it (like the Linux workflow does).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99121
Approved by: https://github.com/clee2000
2023-04-14 23:10:55 +00:00
c0d9a0268d [inductor] Use FakeTensorMode() when creating patterns (#99128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99128
Approved by: https://github.com/ngimel
2023-04-14 22:53:28 +00:00
bd07f8d2e0 DDP forward support custom stream accelerated copy. (#98723)
At present, DDP forward uses `_get_stream` to get a stream, which is a CUDA stream.
If a custom module is already registered with torch, we can use `getattr` to get it and its stream. Then, the custom stream is used to copy the tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98723
Approved by: https://github.com/ezyang
2023-04-14 20:19:56 +00:00
a1074ddf51 Enable cadd_sparse for BFloat16 on CPU (#96767)
Enable the **cadd_sparse** operation for BFloat16 on CPU to support BFloat16 operations in GNN libraries.
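
A minimal sketch of the newly supported case (adding two sparse COO BFloat16 tensors on CPU):
```python
import torch

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([1.0, 2.0], dtype=torch.bfloat16)
a = torch.sparse_coo_tensor(indices, values, (2, 2))
b = torch.sparse_coo_tensor(indices, values, (2, 2))

# sparse add on CPU (cadd_sparse) now works for BFloat16
c = a + b
print(c.to_dense())
```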

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96767
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-04-14 19:50:49 +00:00
2d542d36a8 [dynamo] FakeTensor comparison with "is" instead of "==" (#99134)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99134
Approved by: https://github.com/yanboliang
2023-04-14 19:36:00 +00:00
b9d691040a Update InternalMatch in subgraph_rewriter after repeated replacements (#99039)
Fixes #98974

When `torch.fx.subgraph_rewriter._replace_pattern` is used to remove nodes from a graph, if there are two adjacent matches then after the first removal, the nodes in `InternalMatch.nodes_map` and `placeholder_nodes` become outdated because they contain nodes that were just removed from the graph.

This fix is to update the `match.nodes_map` and `match.placeholder_nodes` using the node changes stored in `match_changed_node`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99039
Approved by: https://github.com/angelayi
2023-04-14 19:35:38 +00:00
651c1be885 Recompute flat_arg_fake_tensors after fakification (#98769)
Summary: This fixes the case when some of the input tensors were
real tensors and fakified in `validate_and_convert_non_fake_tensors`,
but `flat_arg_fake_tensors` would not contain all the inputs
because it was computed before the fakification. We fix this by
recomputing `flat_arg_fake_tensors` after fakification as well.

Test Plan:
python test/dynamo/test_export.py ExportTests.test_mixed_real_and_fake_inputs

Reviewers: Chillee, voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98769
Approved by: https://github.com/voznesenskym
2023-04-14 19:14:29 +00:00
df43fef87f Support >4GB strings in the TorchScript model (#99104)
Summary: Support for BINUNICODE8 was missing, so add it; this lets us support attributes > 4GB. For example, for a very large model, we save the lowered model in the EngineHolder using a string attribute.
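
A small sketch of the round trip this enables (the string is kept tiny here; the fix matters when the attribute exceeds 4GB):
```python
import io
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # in the real use case this is a multi-GB serialized engine blob
        self.blob = "x" * 1024

    def forward(self, x):
        return x

scripted = torch.jit.script(M())
buf = io.BytesIO()
torch.jit.save(scripted, buf)
buf.seek(0)
loaded = torch.jit.load(buf)
assert loaded.blob == scripted.blob
```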

Test Plan: buck2 test mode/opt //caffe2/test:jit -- --exact 'caffe2/test:jit - test_save_load_large_string_attribute (jit.test_save_load.TestSaveLoad)'

Differential Revision: D44905770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99104
Approved by: https://github.com/qihqi
2023-04-14 18:46:19 +00:00
6b9a52d1a4 [inductor] Refactor post_grad.py (#99127)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99127
Approved by: https://github.com/ngimel
2023-04-14 18:24:24 +00:00
ae55619a2b Add check for same dtype in tensordot implementation (#98938)
Fixes #77517

I believe[ the first bullet point in this comment](https://github.com/pytorch/pytorch/issues/77517#issuecomment-1129233539) from the linked issue is no longer a concern, but please let me know if I'm incorrect here.
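
For illustration, the kind of call the new check guards against (a sketch; the exact error type and message may differ):
```python
import torch

a = torch.ones(2, 3)                        # float32
b = torch.ones(3, 4, dtype=torch.float64)   # float64

# cast explicitly if mixing dtypes was intended
torch.tensordot(a, b.to(a.dtype), dims=1)   # ok
# with the added check, the mixed-dtype call is rejected up front
torch.tensordot(a, b, dims=1)               # expected to raise a dtype-mismatch error
```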

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98938
Approved by: https://github.com/lezcano
2023-04-14 16:57:35 +00:00
9e0df2379b [quant][fix] Compare resnet with quantizer api with the prepare_fx and decomposed convert flow (#98905)
Summary:
Using a decomposed convert to make sure we get an exact match; this means the nodes in resnet are
annotated correctly.

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98905
Approved by: https://github.com/andrewor14
2023-04-14 16:25:15 +00:00
baa06790f8 Unbreak torch.compile on macos (#99119)
It seems like #96980 made torch.compile() completely ignore the `backend=` arg on macOS, rendering the entire API useless even if the user wasn't using mps tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99119
Approved by: https://github.com/msaroufim
2023-04-14 15:30:27 +00:00
70072c926e Fix MHA doc string (#99146)
This was missed in #97046
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99146
Approved by: https://github.com/albanD
2023-04-14 15:19:13 +00:00
286b618b7d [SPMD] Move some functions to IterGraphModule.setup() (#99076)
Since users will have to call these steps before calling `setup()`, moving these steps to `setup()` can reduce the API usage complexity.

Differential Revision: [D44973726](https://our.internmc.facebook.com/intern/diff/D44973726/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99076
Approved by: https://github.com/lessw2020
2023-04-14 14:41:10 +00:00
d863876545 [SPMD] Remove the unused code (#99075)
Remove the unused code

Differential Revision: [D44973692](https://our.internmc.facebook.com/intern/diff/D44973692/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99075
Approved by: https://github.com/lessw2020
2023-04-14 14:35:55 +00:00
9642eb59ad make ATen/native/cuda/Normalization.cuh data_ptr-correct (#99044)
make ATen/native/cuda/Normalization.cuh data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99044
Approved by: https://github.com/ezyang
2023-04-14 14:24:13 +00:00
46cfde4645 make ATen/native/cuda/MultiLabelMarginCriterion.cu data_ptr-correct (#99077)
make ATen/native/cuda/MultiLabelMarginCriterion.cu data_ptr-correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99077
Approved by: https://github.com/ezyang
2023-04-14 14:17:11 +00:00
4d1297cae8 trivially convert memcpy sources to use const_data_ptr (#98754)
Differential Revision: [D44834491](https://our.internmc.facebook.com/intern/diff/D44834491/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98754
Approved by: https://github.com/Skylion007
2023-04-14 14:03:16 +00:00
3a510e3911 trivially convert all memcpy destinations to mutable_data_ptr (#98753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98753
Approved by: https://github.com/Skylion007
2023-04-14 14:00:00 +00:00
40aaacd4fa Respect sharded dimensions when aten expand/view consumes SymInt values (#99058)
Currently, aten.expand always expands to the global dimension. Then, it
introduces additional slice and clone ops before running compute on
the expanded tensor with a local tensor.

In this commit, if we detect that the op consumes a SymInt size, it respects
both the local size and the dimension placements from which the SymInt was
extracted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99058
Approved by: https://github.com/wanchaol
2023-04-14 13:54:05 +00:00
d365d9ed29 [torch package][easy] Make all the save/load tests use buffers (#98798)
Summary:
Make it a bit easier to run the tests anywhere/avoid skipping the tests by using buffers instead of temporary files.
[Er, still figuring out how the sync tooling works, I'll send this against github once the first diff is landed]
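
For illustration only (a hedged sketch of the buffer-based pattern, not the actual test code), saving and loading a package can go through an in-memory buffer instead of a temporary file:

```
# Hedged sketch: save and load a torch.package through an in-memory buffer
# instead of a temporary file on disk.
import io
from torch.package import PackageExporter, PackageImporter

buffer = io.BytesIO()
with PackageExporter(buffer) as exporter:
    exporter.save_pickle("my_pkg", "config.pkl", {"lr": 0.1, "steps": 100})

buffer.seek(0)
importer = PackageImporter(buffer)
print(importer.load_pickle("my_pkg", "config.pkl"))
```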

Test Plan: buck2 test

Reviewed By: fluckydog232

Differential Revision: D44818261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98798
Approved by: https://github.com/ezyang
2023-04-14 13:52:17 +00:00
210354620c [MPS] Fix macOS build (#99139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99139
Approved by: https://github.com/albanD
2023-04-14 13:26:31 +00:00
05a55b96d2 make ATen/native/cuda/group_norm_kernel.cu data_ptr-correct (#99041)
make ATen/native/cuda/group_norm_kernel.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99041
Approved by: https://github.com/ezyang
2023-04-14 13:20:21 +00:00
298cc5c611 Add vmap support for torch.nn.functional.smooth_l1_loss (#98357)
Partially fixes #97246 and #97558.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98357
Approved by: https://github.com/kshitij12345
2023-04-14 12:29:53 +00:00
1e78a2edcc Make summarize_perf.py work with perf-compare (#99095)
[perf-compare](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-compare.yml) has a different structure than that of the nightlies.
For these files, the script now generates:

```
# cuda float32 training performance results
## Geometric mean speedup
            huggingface    timm_models    torchbench
--------  -------------  -------------  ------------
inductor           1.46            1.4          1.17

## Mean compilation time
            huggingface    timm_models    torchbench
--------  -------------  -------------  ------------
inductor          57.85          97.63         60.18

## Peak memory compression ratio
            huggingface    timm_models    torchbench
--------  -------------  -------------  ------------
inductor           1.06           1.01          0.83
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99095
Approved by: https://github.com/ezyang
2023-04-14 12:10:54 +00:00
ca735ac856 Don't specialize when indexing by SymInt (#99123)
Fixes https://github.com/pytorch/pytorch/issues/99091

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99123
Approved by: https://github.com/msaroufim
2023-04-14 11:39:43 +00:00
0963e1187a Put GraphArg on the node meta. (#99068)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99068
Approved by: https://github.com/voznesenskym
2023-04-14 11:34:28 +00:00
6a50b83b73 Expandable blocks in allocator (#96995)
Common advice we give for handling memory fragmentation issues is to
allocate a big block upfront to reserve memory which will get split up later.
For programs with changing tensor sizes this can be especially helpful to
avoid OOMs that happen the first time we see a new largest input and would
otherwise have to allocate new segments.

However, the issue with allocating a block upfront is that it is nearly impossible
to correctly estimate the size of that block. If too small, space in the block
will run out and the allocator will allocate separate blocks anyway. Too large,
and other non-PyTorch libraries might stop working because they cannot allocate
any memory.

This patch provides the same benefits as using a pre-allocating block but
without having to choose its size upfront. Using the cuMemMap-style APIs,
it adds the ability to expand the last block in a segment when more memory is
needed.

Compared to universally using cudaMallocAsync to avoid fragmentation,
this patch can fix this common fragmentation issue while preserving most
of the existing allocator behavior. This behavior can be enabled and disabled dynamically.
 This should allow users to, for instance, allocate long-lived parameters and state in individual buffers,
and put temporary state into the large expandable blocks, further reducing
fragmentation.

See inline comments for information about the implementation and its limitations.
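
A minimal usage sketch (not part of this commit's text): assuming the behavior is toggled through the `PYTORCH_CUDA_ALLOC_CONF` allocator setting with an `expandable_segments:True` knob, a run could opt in like this.

```
# Hedged sketch: the env-var knob below is an assumption about how the
# expandable-blocks behavior is switched on; it must be set before the
# CUDA caching allocator is first used.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# Temporaries of growing size would normally force new segments at each
# new maximum; with expandable blocks the last segment can grow instead.
for n in (512, 1024, 2048, 4096):
    tmp = torch.randn(n, n, device="cuda")
    del tmp
torch.cuda.synchronize()
```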

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96995
Approved by: https://github.com/eellison
2023-04-14 09:49:11 +00:00
2494e62599 [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Finish (#99021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99021
Approved by: https://github.com/albanD
2023-04-14 09:09:18 +00:00
7ddeb8d320 [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Part 5 (#99020)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99020
Approved by: https://github.com/albanD
2023-04-14 09:09:17 +00:00
70120f2f92 [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Part 4 (#99019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99019
Approved by: https://github.com/albanD
2023-04-14 09:09:15 +00:00
f043ff2cec [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Part 3 (#99018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99018
Approved by: https://github.com/albanD
2023-04-14 09:09:11 +00:00
be50c1c48e [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Part 2 (#99017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99017
Approved by: https://github.com/albanD
2023-04-14 09:09:06 +00:00
4ddaab845c [MPS] Add ASSERT_ONLY_METHOD_OPERATORS Part 1 (#99016)
Summary:

1. Part 1~4 add `TORCH_ASSERT_ONLY_METHOD_OPERATORS` to files in the MPS codebase and replace `empty_mps` with `empty`. Also exclude `OperationUtils.h` from the assert, as at this stage we still need `<ATen/ATen.h>` to get CI to pass.
2. Part 5 removes `<ATen/ATen.h>` include in `OperationUtils.h` and adds method operator headers to all mps files.
3. The last one moves `TORCH_ASSERT_ONLY_METHOD_OPERATORS` to the top of files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99016
Approved by: https://github.com/albanD
2023-04-14 09:09:02 +00:00
27049f3ff2 make native/cuda/EmbeddingBackwardKernel.cu data_ptr-correct (#99027)
Test Plan: Rely on CI.

Reviewers: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99027
Approved by: https://github.com/ezyang
2023-04-14 08:47:04 +00:00
bce7308881 [SPMD] Upstream partial_lower (#99069)
Several ops cannot be lowered to the Inductor. This PR copies the internal implementation of partial_lower (credit to @yifuwang ) to torch.distributed._spmd to unblock the OSS usage. The internal version will be kept until it is mature and will replace this version.

Differential Revision: [D44970278](https://our.internmc.facebook.com/intern/diff/D44970278/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99069
Approved by: https://github.com/mrshenli, https://github.com/lessw2020
2023-04-14 08:32:05 +00:00
10fbdcf72c Re-PR of 90269 - Force all nn_module associated tensors to be static (#99108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99108
Approved by: https://github.com/ezyang
2023-04-14 05:53:48 +00:00
b3e63baf58 [spmd][easy] delete unused optim states during compile (#99061)
Delete unused optimizer states during compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99061
Approved by: https://github.com/mrshenli
2023-04-14 05:14:28 +00:00
55a1dc7f88 [dtensor] redistributed by default take self mesh instead (#99060)
This PR switches redistribute to use the DTensor's own mesh by default instead of
the global mesh, which is more user-friendly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99060
Approved by: https://github.com/mrshenli
2023-04-14 05:14:28 +00:00
cdef4f073c inductor: fix timeout in ModularIndexing (#98841)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98841
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-14 03:58:58 +00:00
0a7baabafb [torch.compile] Add sympy.core.relational.Relational to inductor.ir (#98971)
Fixes #98879, encountered when running gpt-neo-125m on inductor with `dynamic=True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98971
Approved by: https://github.com/ezyang
2023-04-14 03:53:47 +00:00
9d62f771eb [ONNX] Remove duplicated code from previous rebase (#99072)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99072
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-04-14 03:19:43 +00:00
cd078d376e GraphArg is always length one, adjust APIs accordingly (#99059)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99059
Approved by: https://github.com/voznesenskym
2023-04-14 03:11:25 +00:00
e613a419ed Remove dead wrap_sym (#99049)
I'm pretty sure this isn't used by anything

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99049
Approved by: https://github.com/Skylion007, https://github.com/voznesenskym
2023-04-14 03:11:25 +00:00
21ed07ceb9 Delete dead orig_graphargs (#99047)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99047
Approved by: https://github.com/voznesenskym
2023-04-14 03:11:25 +00:00
cc345d181a Change unspec ints to not be duck-sized (#99010)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99010
Approved by: https://github.com/janeyx99
2023-04-14 03:09:05 +00:00
a6a3df08e6 make ATen/native/cuda/Loss.cu data_ptr-correct (#99073)
make ATen/native/cuda/Loss.cu data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99073
Approved by: https://github.com/ezyang
2023-04-14 03:04:13 +00:00
8382e91b9c make ATen/native/cuda/MultiTensorApply.cuh data_ptr-correct (#99081)
make ATen/native/cuda/MultiTensorApply.cuh data_ptr-correct


Pull Request resolved: https://github.com/pytorch/pytorch/pull/99081
Approved by: https://github.com/ezyang
2023-04-14 03:03:41 +00:00
13e4cc962a [vision hash update] update the pinned vision hash (#99109)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99109
Approved by: https://github.com/pytorchbot
2023-04-14 03:00:22 +00:00
d5abc7bfee [Vulkan] Move convert_qconv2d_context to custom ops (#98548)
Summary: Move convert_qconv2d_context to its own custom op library

Test Plan: ```buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2/fb/custom_ops/vulkan_quantized:pt_vulkan_quantized_test_binAppleMac\#macosx-arm64```

Differential Revision: D44688797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98548
Approved by: https://github.com/kirklandsign
2023-04-14 03:00:09 +00:00
ece497b681 make cublasSgemmStridedBatched data_ptr-const (#99085)
make cublasSgemmStridedBatched data_ptr-const

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99085
Approved by: https://github.com/ezyang
2023-04-14 02:58:02 +00:00
64a61fc7c3 make at::cuda::blas::gemm calls data_ptr-correct (#99087)
make at::cuda::blas::gemm calls data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99087
Approved by: https://github.com/ezyang
2023-04-14 02:54:30 +00:00
979c5b4cf8 Move torchdynamo start tracing message earlier (#98990)
Currently, it lives inside run(), but this is too late;
we do a lot of work initializing OutputGraph and those log
messages will show up before "start tracing".  This is bad.
Now the start of tracing is InstructionTranslator construction,
which ensures we catch these sites.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98990
Approved by: https://github.com/yanboliang
2023-04-14 02:15:53 +00:00
ee1f28cd15 Fix the bug of comm headers. (#98658)
`comm.hpp` is exposed to developers under `torch/include/torch/csrc/distributed/c10d/`,
but using it currently results in an `undefined symbol: xxx` error, so its symbols should be properly exposed as well.

![image](https://user-images.githubusercontent.com/37650440/230697500-ec095103-3566-4415-88df-17491f01846e.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98658
Approved by: https://github.com/ezyang
2023-04-14 02:06:51 +00:00
c4f81cb6f4 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-14 02:03:33 +00:00
006e6f1a05 Fix CPU vectorized eq and ne operations for complex types (#97374)
The comparisons of the real and imag parts need to be combined, and the result's real part must be 1 for true and 0 for false, while its imag part must always be 0.

Fixes https://github.com/pytorch/pytorch/issues/75950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97374
Approved by: https://github.com/jgong5, https://github.com/kit1980
2023-04-14 02:02:16 +00:00
5e1ac1bb83 Fix visual studio generator (#98605)
If `CMAKE_GENERATOR=Visual Studio 16 2019` is set, the build will fail unless `USE_NINJA=False` is also set.
This PR changes the logic so that if CMAKE_GENERATOR is set and not equal to Ninja, Ninja won't be used.
This just makes it easier to select another generator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98605
Approved by: https://github.com/kit1980
2023-04-14 01:46:46 +00:00
06d8e231d5 Make sure that while caching values we don't invoke any Aten operator (#99050)
Summary:
See title.
Also change the catch clause to a catch-all so that it won't fail.

Test Plan: existing tests

Reviewed By: harishs88ss

Differential Revision: D44945942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99050
Approved by: https://github.com/angelayi
2023-04-14 01:36:18 +00:00
0a98d94357 [FSDP] Auto-pad for no pad() in post-bwd hook (use_orig_params=True) (#99054)
This avoids the post-backward `F.pad()` call before reduce-scatter for `use_orig_params=True`. It is pretty cool that we built up all of the necessary infra in past PRs so that this change is simple.

We simply append one more padding tensor to pad out the `FlatParameter` numel to be divisible by the world size. This causes the flat gradient to be computed directly with the padded size, removing the need for the explicit `F.pad()` call.
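
For illustration only (a hedged sketch, not the FSDP source), the padding arithmetic amounts to rounding the flat parameter's numel up to the next multiple of the world size:

```
# Hedged sketch of the padding math: round numel up to a multiple of world_size.
def padded_numel(numel: int, world_size: int) -> int:
    remainder = numel % world_size
    return numel if remainder == 0 else numel + (world_size - remainder)

assert padded_numel(10, 4) == 12  # append a 2-element padding tensor
assert padded_numel(12, 4) == 12  # already divisible, nothing to append
```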

Because the infra is built out right now for `use_orig_params=True`, we only add this auto-pad logic for that path. We can add it for `use_orig_params=False` if needed in follow-up work.

I confirmed in local tests that this removes the pad call.

Before (yes `aten::pad`):
![Screenshot 2023-04-13 at 1 38 21 PM](https://user-images.githubusercontent.com/31054793/231840432-e0875972-6546-4cf1-aaaa-bc3949050519.png)

After (no `aten::pad`):
![Screenshot 2023-04-13 at 1 38 29 PM](https://user-images.githubusercontent.com/31054793/231840422-8dd6f5ab-0a7a-4393-a835-42009948eb62.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99054
Approved by: https://github.com/fegin, https://github.com/zhaojuanmao
2023-04-14 01:30:23 +00:00
49cd650e2b [BE][DTensor] merge random init test to test_random_ops.py (#98874)
Random ops behave differently on CUDA devices than on other devices after PR https://github.com/pytorch/pytorch/pull/98199 . As a result, the DTensor random-initialization test in test_init.py was skipped. This PR fixes it by using different test assertions for different device types and also merges an existing duplicated test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98874
Approved by: https://github.com/wanchaol
2023-04-14 01:20:39 +00:00
36f52cc099 [BuildSpeed] Limit Logcumsumexp complex to OSS builds only (#98957)
As it takes a ridiculous amount of time to build with complex types on CUDA 11.4.

Build speeds for a single GPU architecture (`sm_80`) on a 3 GHz 8275CL Intel Xeon:
- 143 sec to compile for all dtypes using CUDA-11.6
- 351 sec to compile for all dtypes using CUDA-11.4
- 24 sec to compile for only floating dtypes using CUDA-11.6
- 52 sec to compile for only floating dtypes using CUDA-11.4

Tweak code a bit to make it compilable with MSVC, which is having trouble with nested preprocessor directives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98957
Approved by: https://github.com/r-barnes, https://github.com/ngimel
2023-04-14 00:47:00 +00:00
e778bcec05 Revert "fix allgather func collective to use maybe_wrap_tensor (#98866)"
This reverts commit ada7dfff717ab588ed46347093181bb2eccdc854.

Reverted https://github.com/pytorch/pytorch/pull/98866 on behalf of https://github.com/izaitsevfb due to Conflicts with the co-dev diff D44921259, reverting to unblock the diff train
2023-04-14 00:30:16 +00:00
f84078b40b [dynamo] Remove pointless graphs from with no_grad() (#98956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98956
Approved by: https://github.com/voznesenskym
2023-04-14 00:25:40 +00:00
02d1cf51b6 [Easy] Clean up args remap for DTensor expansion (#99040)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99040
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-04-14 00:23:00 +00:00
fd7eaf79de cmake will only run properly named c10 tests (#98710)
For the purpose of our Bazel and Meta-internal macros tests, we want
to create a single binary that can verify the different
configurations. CMake would see this file, try to run it, and fail
on Windows, which uses different values.

But we don't care about verifying this in CMake since it's not part of
the build unification effort.

In order to do this, we have to rename the SmallVectorTest to match
the naming convention of the rest of the c10 tests.

Differential Revision: [D44823440](https://our.internmc.facebook.com/intern/diff/D44823440/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98710
Approved by: https://github.com/PaliC
2023-04-13 23:32:42 +00:00
93d75568c7 [ONNX] Refactor ShapeInferenceWithFakeTensor to fill metavalue into the original gm (#98760)
From https://github.com/pytorch/pytorch/pull/97494#discussion_r1160068456, the passes should modify gm in place, but before this PR, `ShapeInferenceWithFakeTensor` utilizes Transform.transform() to make a copy of the gm and relies on the assumption that the topological order of the two graphs is the same. This PR addresses the issue by saving another metavalue `static_shape` into gm for op_level_debug, instead of overwriting `val`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98760
Approved by: https://github.com/BowenBao
2023-04-13 22:47:03 +00:00
d5aa4cec57 Delay torch.onnx import to after all dynamo [sub]components (#99070)
ONNX is taking a lot of dependencies on dynamo, which is causing frequent circular dependency issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99070
Approved by: https://github.com/BowenBao, https://github.com/malfet
2023-04-13 22:22:34 +00:00
8062735f78 [ONNX] Support aten::unflatten in torchscript exporter (#99056)
Fixes #98857
Fixes #98190
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99056
Approved by: https://github.com/BowenBao
2023-04-13 22:19:02 +00:00
7b91bd2a7b [primTorch] Add count_nonzero (#98995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98995
Approved by: https://github.com/lezcano
2023-04-13 22:08:19 +00:00
7d74dca780 [primTorch] Add rad2deg and deg2rad (#98994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98994
Approved by: https://github.com/lezcano
2023-04-13 22:08:19 +00:00
668c578083 Automatically generate attributes and methods for custom backends. (#98066)
Fixes #97593.
A new extension mechanism has been added.
When the user registers a new backend, the corresponding methods and attributes can be automatically generated.
Run the following code:
`torch.utils.rename_privateuse1_backend('foo')`
`torch.utils.generate_for_privateuse1_backend()`
Then the following methods and attributes become available:
`torch.Tensor.is_foo`
`torch.Tensor.foo()`
`torch.nn.Module.foo()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98066
Approved by: https://github.com/albanD
2023-04-13 22:04:05 +00:00
09ebdf44fa [quant][pt2e] Fix a bug in reference quantized module (decomposed mode) (#98903)
Summary:
Fixed quant_min/quant_max for per-channel quantized weight for the reference quantized module in decomposed mode;
this bug was triggered while onboarding an internal model.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx_per_channel_quant_module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98903
Approved by: https://github.com/andrewor14
2023-04-13 21:55:45 +00:00
6f07ad6cbf more trivial mutable_data_ptr from at::empty (#98750)
Differential Revision: [D44834054](https://our.internmc.facebook.com/intern/diff/D44834054/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98750
Approved by: https://github.com/ezyang
2023-04-13 21:32:24 +00:00
b8580b0897 Fix lazy_modules while enabling Unspecialized '__call__' tracing (#98516)
This fixes a regression added in the following PR to graph-break on allowed modules with hooks, but that approach has its own problems.
- Following PR #97184, 'allowed modules' with hooks graph-break, and lazy modules
  are allowed. (Should we just make lazy modules not allowed?)
- Graph breaks at lazy modules fail the lazy-module unit tests, which assert no graph breaks.
- This PR attempts to always 'initialize' lazy modules before tracing/calling into their __call__;
  initializing a lazy module deletes all its hooks after firing them once, making
  the above issue go away.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98516
Approved by: https://github.com/yanboliang, https://github.com/jansel
2023-04-13 21:23:56 +00:00
1d077f28ed [export] Constraints API (#98433)
Wrapper for users to insert constraints into model code.

The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported after torch._assert support is implemented. Then we can convert the constrain_range calls to torch._asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98433
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2023-04-13 21:20:10 +00:00
4d3d3317eb make ATen/native/cuda/LossCTC.cu data_ptr-correct (#99030)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99030
Approved by: https://github.com/Skylion007
2023-04-13 20:42:37 +00:00
9a04482a74 make ATen/native/cuda/SoftMax.cu data_ptr-const (#99029)
Test Plan: Rely on CI.

Reviewers: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99029
Approved by: https://github.com/Skylion007
2023-04-13 20:38:50 +00:00
5d758ea952 make ATen/native/cuda/MultiMarginLoss.cu data_ptr-correct (#99008)
Test Plan: Rely on CI.

Reviewers: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99008
Approved by: https://github.com/Skylion007
2023-04-13 20:38:27 +00:00
8e328762ff [FSDP] Include duplicate parameters and modules when calling named_parameters and named_modules (#98912)
The default option of `named_parameters` and `named_modules` is to remove the duplicated parameters and modules. However, in FSDP, we need to know what parameters are shared. As a result, setting `remove_duplicate` to False is required in FSDP. Without setting `remove_duplicate` to False, FSDP won't be able to discover shared weights in some cases (e.g., the shared weights are in the same module or there are shared modules).
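
A small standalone illustration (not FSDP code) of why the flag matters when weights are shared:

```
# Hedged example: a shared weight is hidden with the default
# remove_duplicate=True, but shows up under both names with False.
import torch.nn as nn

outer = nn.Linear(4, 4)
outer.tied = nn.Linear(4, 4)
outer.tied.weight = outer.weight  # share one Parameter between two modules

print([n for n, _ in outer.named_parameters()])
# ['weight', 'bias', 'tied.bias'] -- the alias is dropped

print([n for n, _ in outer.named_parameters(remove_duplicate=False)])
# ['weight', 'bias', 'tied.weight', 'tied.bias'] -- sharing is discoverable
```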

Differential Revision: [D44897935](https://our.internmc.facebook.com/intern/diff/D44897935/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98912
Approved by: https://github.com/awgu
2023-04-13 20:37:11 +00:00
a44813e6d7 trivial data reads to const_data_ptr (#99004)
Summary:

Test Plan: Rely on CI.

Reviewers: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99004
Approved by: https://github.com/Skylion007
2023-04-13 20:30:16 +00:00
35c6547f02 Adds 3D attn_mask support to merge_masks() for Multihead Attention fast path (#98991)
Fixes #97409

Adds support for 3D attn_mask by always expanding attn_mask to 4D as per https://github.com/pytorch/pytorch/pull/98375#issuecomment-1499504721
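
A hedged example of the mask shape this enables (whether the fast path is actually taken also depends on other conditions, so treat this purely as an illustration):

```
# Hedged illustration: a 3D boolean attn_mask of shape
# (batch * num_heads, L, S) passed to nn.MultiheadAttention.
import torch
import torch.nn as nn

batch, seq, embed, heads = 2, 5, 16, 4
mha = nn.MultiheadAttention(embed, heads, batch_first=True).eval()

x = torch.randn(batch, seq, embed)
attn_mask = torch.zeros(batch * heads, seq, seq, dtype=torch.bool)

with torch.no_grad():
    out, _ = mha(x, x, x, attn_mask=attn_mask, need_weights=False)
print(out.shape)  # torch.Size([2, 5, 16])
```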

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98991
Approved by: https://github.com/jbschlosser
2023-04-13 20:29:57 +00:00
bae304a5fc make ATen/native/cuda/WeightNorm.cu data_ptr-correct (#99006)
Test Plan: Rely on CI.

Reviewers: ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99006
Approved by: https://github.com/Skylion007
2023-04-13 20:21:59 +00:00
bba2090831 Enable fused optimizer for DP (#98270)
Differential Revision: [D42714482](https://our.internmc.facebook.com/intern/diff/D42714482/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D42714482/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98270
Approved by: https://github.com/awgu
2023-04-13 20:16:32 +00:00
079452ea0f Enable test_matmul_cuda UTs for ROCm (#98797)
test_file | test_name | test_class
-- | -- | --
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_10000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_1000_cuda_float32 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_bfloat16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float16 | (__main__.TestMatmulCudaCUDA)
test_matmul_cuda | test_cublas_addmm_size_100_cuda_float32 | (__main__.TestMatmulCudaCUDA)

This PR is the same fix as https://github.com/pytorch/pytorch/pull/88888. Creating this new PR to sanitize the history.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98797
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/malfet
2023-04-13 19:36:07 +00:00
fc53472ce4 Move/Fix FakeTensor logic for detecting multiple fake modes (#97186)
This was left over from when we had more logic in FakeTensor rather than FakeTensorMode, and it wasn't firing correctly. It also makes more sense for it to be in the other validation function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186
Approved by: https://github.com/bdhirsh
2023-04-13 19:20:01 +00:00
82b8764b75 [unwind] clarify warnings (#99005)
This PR defers warnings about potentially missing symbols
until we hit a situation where we can find a symbol.

It also hardens some of the logic around addresses that might
be out of the range of known unwind logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99005
Approved by: https://github.com/tugsbayasgalan
2023-04-13 19:06:06 +00:00
d8b09b0139 [FSDP] Full precision in eval mode (#97645)
If the model is in eval mode (`model.eval()`), it is run in full precision.

Changes:
- Changed _force_full_precision to check self.is_training
- Check for _force_full_precision when casting gradients to reduced dtype
- Small change when accessing _full_prec_param_padded
- tests for class based and fully_shard APIs

Differential Revision: [D43933690](https://our.internmc.facebook.com/intern/diff/D43933690/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97645
Approved by: https://github.com/awgu
2023-04-13 18:38:22 +00:00
c74310616d _mm_prefetch is for Intel, changed to __prefetch for Arm64 (#96638)
The current master build on Windows Arm64 is broken because of this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96638
Approved by: https://github.com/malfet
2023-04-13 18:11:14 +00:00
7a77961d63 trivially migrate std::transform output to mutable_data_ptr (#98756)
Differential Revision: [D44834598](https://our.internmc.facebook.com/intern/diff/D44834598/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98756
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2023-04-13 17:47:10 +00:00
e20981bda9 [Dynamo] Fix Lazy Module initialization with constant arg (#98996)
Fixes a Meta-internal use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98996
Approved by: https://github.com/williamwen42
2023-04-13 17:37:25 +00:00
7692243e40 [functorch] typo in merge rule's github handle (#99052)
Github handle was spelled incorrectly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99052
Approved by: https://github.com/kit1980
2023-04-13 17:09:54 +00:00
dda7ce4bb3 Revert "[core][pruning][be] Rename sparsifier folder to pruner (#98758)"
This reverts commit 778fd1922ae127250126e845ecd4a1cb9e335fb5.

Reverted https://github.com/pytorch/pytorch/pull/98758 on behalf of https://github.com/jcaip due to the need to fix a broken import in fbcode (https://www.internalfb.com/diff/D44905951)
2023-04-13 16:30:47 +00:00
e5501a967e [inductor] Support IndexPutFallback in cpp_wrapper (#98972)
Summary:
1) Make the fallback index_put generate the right cpp code in cpp_wrapper
2) Add a --cpp-wrapper option to common.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98972
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-13 15:41:03 +00:00
670c5cf962 AOTAutograd: fix 'Trying to backward through the graph a second time' error (#98960)
Fixes https://github.com/pytorch/pytorch/issues/97745. See discussion and comment in the PR for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98960
Approved by: https://github.com/bertmaher, https://github.com/albanD
2023-04-13 10:25:07 +00:00
39438c6803 trivially convert std::copy output to mutable_data_ptr (#98755)
Differential Revision: [D44834597](https://our.internmc.facebook.com/intern/diff/D44834597/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98755
Approved by: https://github.com/ezyang
2023-04-13 09:48:58 +00:00
3a400a5adc Enable passing a dict of module names: log level to set_logs python api (#98989)
Adds "module" kwarg to set_logs to allow a user to pass a dict of module qualified names to log level to the API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98989
Approved by: https://github.com/ezyang
2023-04-13 09:42:32 +00:00
6a568779b6 [quant][pt2e][improvement] Remove the need to annotate all nodes with default annotation (#99001)
Summary:
This PR changes prepare to use a default observer/fq constructor when "target_dtype_info" is not set; this allows users to not initialize all nodes with the default
observer/fq constructor. Note that we may still need to annotate intermediate nodes after this PR; there will be a follow-up PR to allow users to only annotate the things they
want to quantize.

Test Plan:
python test/test_quantization.py TestQuantizePT2E
python test/test_quantization.py TestQuantizePT2EModels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99001
Approved by: https://github.com/kimishpatel, https://github.com/andrewor14
2023-04-13 09:31:51 +00:00
f501234be0 Add test for squeeze.dims shape function (#98144)
Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98144
Approved by: https://github.com/davidberard98
2023-04-13 08:43:55 +00:00
2c337dd934 [fix] update the condition for aliveness of TensorWrapper (#98748)
Fixes https://github.com/pytorch/pytorch/issues/95561
Fixes https://github.com/pytorch/pytorch/issues/98021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98748
Approved by: https://github.com/zou3519
2023-04-13 08:17:20 +00:00
0ff3059ad0 [pt2] recursive IR check (#98887)
Summary: IR check needs to be recursive to accommodate Tuple[Tensor, Tuple[Tensor]] schema

Test Plan:
Run the repro cmd and make sure it no longer fails
  TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+dynamo,aot,inductor" buck2 run mode/opt scripts/ml_model_exploration/coffee:defi_local -- --baseline_model_entity_id 421946503 --meta_ids '{"union_meta":422685721}' -g -t -l --model_type mimo_ctr_mbl_feed

Differential Revision: D44809096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98887
Approved by: https://github.com/wconstab
2023-04-13 06:57:01 +00:00
1854e8ac5f convert trivial assignments to use mutable_data_ptr (#98752)
Differential Revision: [D44834422](https://our.internmc.facebook.com/intern/diff/D44834422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98752
Approved by: https://github.com/ezyang
2023-04-13 06:15:54 +00:00
388d269234 Use the same python version in MacOS workflows and add more debug messages (#98902)
After https://github.com/fairinternal/pytorch-gha-infra/pull/139 (https://github.com/fairinternal/pytorch-gha-infra/actions/runs/4683157903/jobs/8297905750), the flaky issue on MacOS points to sccache setup.  There are several issues there:
  * sccache is downloaded to `/usr/local/bin/sccache`.  Surprisingly, the build script doesn't find it in some cases (probably on the new runners); for example, https://github.com/pytorch/pytorch/actions/runs/4681216666/jobs/8293519052 has `which sccache` returning nothing despite the binary being there.  In that case, `/usr/local/bin` is not in the GitHub path.
  * Once sccache is used, we need to use the correct sccache binary arch.  Using sccache for x86-64 on M1 would end up producing an x86-64 torch binary, i.e. 01e011b07c
  * We don't need to set the AWS secret key on the MacOS runner anymore.  The AWS M1 runner has access to the S3 cache via its IAM profile while the GitHub x86-64 runner uses the GitHub cache https://github.com/pytorch/pytorch/pull/96142

Other minor changes:
* The same Python version is used in both MacOS build and test jobs.  This is set by the workflow via the `python-version` parameter.
* Add some debug information about the Python version used to run the tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98902
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-04-13 06:03:14 +00:00
8155b72c15 [ROCm] Sync updates in hipify_torch to Pytorch hipify utils for ROCm. (#93169)
This PR intends to sync the updates in the hipify_torch project (https://github.com/ROCmSoftwarePlatform/hipify_torch) to the hipify util used in Pytorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93169
Approved by: https://github.com/malfet
2023-04-13 04:59:31 +00:00
8a6dd0dc97 Disable logging in pattern matcher calls to AotAutograd (#98936)
Fixes #98778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98936
Approved by: https://github.com/wconstab
2023-04-13 04:51:08 +00:00
8a3f1be809 inductor: relax the dynamic variable check for cpu dynamic test case (#98815)
For generated code like the following in a dynamic-shape case:
```
            #pragma GCC ivdep
            for(long i1=static_cast<long>(0); i1<static_cast<long>(8); i1+=static_cast<long>(1))
            {
                #pragma GCC ivdep
                for(long i2=static_cast<long>(0); i2<static_cast<long>(4*ks2*ks3); i2+=static_cast<long>(1))
                {
                    auto tmp0 = in_ptr2[static_cast<long>(i2 + (4*i1*ks2*ks3) + (32*i0*ks2*ks3))];
                    out_ptr2[static_cast<long>(i1 + (8*i2) + (32*i0*ks2*ks3))] = tmp0;
                }
            }
```
we don't need to check that every loop has a dynamic variable. This PR relaxes the check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98815
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-13 04:35:12 +00:00
a408ed24ba Support module hooks in UnspecializedNNModuleVar (#98540)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98540
Approved by: https://github.com/yanboliang
2023-04-13 04:32:50 +00:00
731590bae5 Revert "[quant][fix] Compare resnet with quantizer api with the prepare_fx and decomposed convert flow (#98905)"
This reverts commit 43146bd49087bac9c58a274a06f52301ae8d1f7f.

Reverted https://github.com/pytorch/pytorch/pull/98905 on behalf of https://github.com/jerryzh168 due to breakage caused by the previous PR being reverted
2023-04-13 04:21:51 +00:00
296822c475 Make update_expected not fail on one missing file (#98982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98982
Approved by: https://github.com/voznesenskym
2023-04-13 03:59:20 +00:00
43146bd490 [quant][fix] Compare resnet with quantizer api with the prepare_fx and decomposed convert flow (#98905)
Summary:
Using a decomposed convert to make sure we get an exact match; this means the nodes in resnet are
annotated correctly.

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98905
Approved by: https://github.com/andrewor14
2023-04-13 03:35:37 +00:00
51ff9ce997 [Replicate] Simplify code a bit (#98889)
Simplifies the code, e.g., making self.modules a single module instead of a list.

Differential Revision: [D44899281](https://our.internmc.facebook.com/intern/diff/D44899281/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98889
Approved by: https://github.com/mrshenli, https://github.com/yhcharles
2023-04-13 03:21:06 +00:00
cfd1b4df94 [Composable] add checking key for check_fqn function (#98961)
add checking key for check_fqn function

ghstack-source-id: d856f560f1fc449a316135e3844609d0baaf6d66
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96705


Pull Request resolved: https://github.com/pytorch/pytorch/pull/98961
Approved by: https://github.com/awgu
2023-04-13 03:16:14 +00:00
ccc9a3d726 Automatic Dynamic Shapes (#98923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98923
Approved by: https://github.com/ezyang
2023-04-13 02:39:23 +00:00
46a31e9bab Revert "[quant][pt2e] Fix a bug in reference quantized module (decomposed mode) (#98903)"
This reverts commit a2e809f29bd66a0f314edeb602f37b4de05e5c41.

Reverted https://github.com/pytorch/pytorch/pull/98903 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks Windows tests on trunk a2e809f29b
2023-04-13 01:58:27 +00:00
c80592ff9c [ONNX] Remove torch dependencies in _beartype (#98958)
Fix circular dependencies

Fixes #98959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98958
Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi
2023-04-13 00:54:52 +00:00
75f55ca63b Support FQN as SPMD module override key (#98966)
Differential Revision: [D44940232](https://our.internmc.facebook.com/intern/diff/D44940232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98966
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-04-13 00:45:48 +00:00
6ebeefb4b0 remove merging label when merge is cancelled (#98967)
Adds a script to get rid of the "merging" label when a job is cancelled.
At the moment this can create a race condition if someone cancels a job and starts a new one, though these cases should be pretty rare, especially when the new job comes from a new merge command.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98967
Approved by: https://github.com/malfet
2023-04-13 00:45:28 +00:00
9a2a6fcfa5 add get_device_index for custom device (#98804)
Adds get_device_index support for custom devices, as the title says.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98804
Approved by: https://github.com/ngimel
2023-04-12 23:58:31 +00:00
c3186dc85e [inductor] Support integer pow (#88938)
This allows the `pow_recursive` form to be used for any integer power
greater than 0, or for tensor cases to fall back to ATen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88938
Approved by: https://github.com/lezcano
2023-04-12 23:51:45 +00:00
efc90c797d improvements to torch.gradient docs (#98824)
Fixes #98693

Clarified docs for `torch.gradient` on `h_l` and how the gradient is computed. For the mathematical equations, I followed this reference: https://www.dam.brown.edu/people/alcyew/handouts/numdiff.pdf.
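
A small example of the spacing (`h_l`) usage being documented; the numbers are illustrative only:

```
# Hedged example: gradient of f(x) = x**2 sampled on a non-uniform grid.
import torch

x = torch.tensor([0.0, 1.0, 3.0, 6.0])  # sample coordinates (the h_l spacing)
y = x ** 2

(grad,) = torch.gradient(y, spacing=(x,))
print(grad)  # close to 2*x at interior points; one-sided estimates at the ends
```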

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98824
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-04-12 23:43:40 +00:00
a8f5d72edf Guard color diagnostics opts by compiler type (#98952)
On Linux systems where `/usr/bin/c++` is not a symlink to either `g++` or `clang++`, `try_compile` can still incorrectly identify `gcc` as supporting the `-fcolor-diagnostics` flag.

Rather than introducing a super complex condition (i.e. `USE_CCACHE` and `LINUX` ...), just guard the checks by compiler identifier.

See https://github.com/ccache/ccache/issues/1275

Fixes https://github.com/pytorch/pytorch/issues/83500

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98952
Approved by: https://github.com/albanD
2023-04-12 23:39:37 +00:00
ab761605ae Revert "[export] Constraints API (#98433)"
This reverts commit 1510eb4072b103ef2a10d415067e9d70954efb64.

Reverted https://github.com/pytorch/pytorch/pull/98433 on behalf of https://github.com/izaitsevfb due to Breaks internal tests, asked by author to revert
2023-04-12 23:37:19 +00:00
99aacf5c68 [SPMD] Expedite the allreduce call before doing comm_fusion (#98922)
The allreduce call order and the gradient order may differ, which can undermine the benefit of comm_fusion. This PR reorders the graph so that each allreduce call happens right after its last input.

Differential Revision: [D44900738](https://our.internmc.facebook.com/intern/diff/D44900738/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98922
Approved by: https://github.com/mrshenli
2023-04-12 23:26:37 +00:00
78ff7ca24a [Dynamo] Fix Sequential nn module with duplicated submodule (#98880)
Fixes #98852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98880
Approved by: https://github.com/ngimel
2023-04-12 23:09:50 +00:00
8db04e080c [pt2] add SymInt support for cdist (#98881)
Fixes #98853.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98881
Approved by: https://github.com/ezyang
2023-04-12 23:06:40 +00:00
a2e809f29b [quant][pt2e] Fix a bug in reference quantized module (decomposed mode) (#98903)
Summary:
Fixed quant_min/quant_max for per-channel quantized weight for the reference quantized module in decomposed mode;
this bug was triggered while onboarding an internal model.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test__convert_to_reference_decomposed_fx_per_channel_quant_module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98903
Approved by: https://github.com/andrewor14
2023-04-12 22:35:24 +00:00
3c5a825f3c [AOTAutograd] Fix is-duplicate check in de-dup guard logic (#98932)
**Context**
The existing check to see if an arg is duped is `if dupe_arg_pos != kept_pos:`. However, this incorrectly considers every arg after a true duped arg to also be a duped arg.

Consider `flat_args = [a, b, b, c]`, where indices `1` and `2` are duped.
- `add_dupe_map = {0: 0, 1: 1, 2: 1, 3: 2}`
- For `dupe_arg_pos=2, kept_pos=1`, `2 != 1`, so the check correctly identifies the second `b` to be a duped arg.
- For `dupe_arg_pos=3, kept_pos=2`, `3 != 2`, so the check incorrectly identifies the `c` to be a duped arg.

Indeed, if there were more args like `[a, b, b, c, d, e, ...]`, every arg after the second `b` will be considered a duped arg since its `kept_pos` will always be 1 lower than its `dupe_arg_pos`.

**Overview**
This PR changes `add_dupe_map` to be implemented as a `List[int]`, where the list index implicitly represents the `dupe_arg_pos` and the list element represents the `kept_pos`. We use a list to have stable in-order iteration and because we know the keys to be in `{0, 1, ..., len(flat_args) - 1}`.

With `add_dupe_map` as a list, the `is_dupe_arg` condition is whether the entry in `add_dupe_map` shows a new not-yet-seen index in the iteration. One way to do this is to count the number of unique args so far and compare against that.
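
An illustrative, standalone sketch of the list-based `add_dupe_map` and the corrected duplicate check (this mirrors the description above, not the actual AOTAutograd source):

```
# Hedged sketch: build add_dupe_map as a list and find true duplicates by
# counting how many unique args have been seen so far.
def build_add_dupe_map(flat_args):
    kept = {}            # id(arg) -> kept position
    add_dupe_map = []    # index = dupe_arg_pos, value = kept_pos
    for arg in flat_args:
        kept.setdefault(id(arg), len(kept))
        add_dupe_map.append(kept[id(arg)])
    return add_dupe_map

def duped_positions(add_dupe_map):
    dupes, unique_seen = [], 0
    for dupe_arg_pos, kept_pos in enumerate(add_dupe_map):
        if kept_pos == unique_seen:
            unique_seen += 1             # a new, not-yet-seen arg
        else:
            dupes.append(dupe_arg_pos)   # a true duplicate
    return dupes

a, b, c = object(), object(), object()
print(build_add_dupe_map([a, b, b, c]))  # [0, 1, 1, 2]
print(duped_positions([0, 1, 1, 2]))     # [2] -- only the second b, not c
```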

This closes https://github.com/pytorch/pytorch/issues/98883, where now the guards change from
```
GUARDS ___guarded_code.valid
and ___check_type_id(L['self'], 93996836333040)
and ___check_obj_id(L['self'], 140119034997536)
and not ___are_deterministic_algorithms_enabled()
and ___check_tensors(L['x'])
and L['self']._buf is L['self']._buf_module._buf
and L['self']._buf_module._buf is L['self']._param
```
to without the final incorrect `L['self']._buf_module._buf is L['self']._param` guard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98932
Approved by: https://github.com/ezyang
2023-04-12 22:22:50 +00:00
bb4998b531 Add shape function for aten::cross_entropy_loss (#97875)
Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97875
Approved by: https://github.com/davidberard98
2023-04-12 22:11:56 +00:00
5c38c4cfa4 Improve symbolic shapes guard logging (#98941)
Billing of changes:
* Get rid of `print_guards`; instead, you control this with `TORCH_LOGS=torch.fx.experimental.symbolic_shapes`, debug logging toggles stack traces
* Don't incorrectly report the tracing context frame when we're compiling; we just don't have this info anymore! (TODO: use the saved frames instead). This is via a new TracingContext.clear_frame context manager
* Add TracingContext.extract_stack() which gives you the tracing context stack.
* Add ShapeEnvLoggingAdapter to report which ShapeEnv any given operation is from (this is helpful for debugging situations when there are too many ShapeEnvs floating around)
* Tweak create_symbol log message to also report Source
* Add a debug log whenever duck sizing occurs
* Report an excerpt of both the user and system backtrace whenever a guard is added in INFO mode. I found this is a good balance of "where did the guard come from" without full backtrace verbosity.

Example log output with the new output:

```
[2023-04-12 08:25:49,003] torch.fx.experimental.symbolic_shapes: [INFO] 0: create_env
[2023-04-12 08:25:49,021] torch.fx.experimental.symbolic_shapes: [INFO] 0: create_symbol s0 = 32 for L['x'].size()[0]
[2023-04-12 08:25:50,154] torch.fx.experimental.symbolic_shapes: [INFO] 0: evaluate_expr s0 < 128 [guard added] at w.py:11 in forward2 (_dynamo/variables/tensor.py:476 in evaluate_expr)
[2023-04-12 08:25:52,057] torch.fx.experimental.symbolic_shapes: [INFO] 0: evaluate_expr Eq(Mod(s0, 16), 0) [guard added] (_inductor/codegen/triton.py:77 in is_aligned)
```

from running

```
import torch
import torch._dynamo

def f(x, y):
    return x + y

def forward(x, y):
    return forward2(x, y)

def forward2(x, y):
    if x.size(0) < 128:
        x = x * 2
    else:
        x = x * 3
    r = f(x, y)
    r = r * y
    return r

def woof():
    fn_compiled = torch.compile(forward, dynamic=True)
    x = torch.randn(32, device='cuda')
    y = torch.randn(32, device='cuda')
    print(fn_compiled(x, y))

woof()
```

(To induce the Triton guard, I synthetically reverted https://github.com/pytorch/pytorch/pull/98471)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98941
Approved by: https://github.com/wconstab
2023-04-12 21:58:59 +00:00
1149ba5553 Revert "[NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)"
This reverts commit a33eac398881cfa9aad679ceffd28ace3fa44f01.

Reverted https://github.com/pytorch/pytorch/pull/95715 on behalf of https://github.com/PaliC due to This pr has caused a regression on distributed/test_dynamo_distributed.py::TestMultiProc::test_ddp_baseline_aot_eager_multiprocess causing it to timeout (https://hud.pytorch.org/failure/distributed%2Ftest_dynamo_distributed.py%3A%3ATestMultiProc%3A%3Atest_ddp_baseline_aot_eager_multiprocess)
2023-04-12 21:15:49 +00:00
c650d7b67f [inductor] add cumprod to make_fallback (#98898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98898
Approved by: https://github.com/ezyang, https://github.com/ngimel
2023-04-12 21:02:24 +00:00
a38ff4cfd1 documentation update (#98782)
change `parameters_and_buffers` to `parameter_and_buffer_dicts` in the function docstring

Fixes #98766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98782
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-04-12 20:34:30 +00:00
4828585019 Revert "Move/Fix FakeTensor logic for detecting multiple fake modes (#97186)"
This reverts commit 8a057c445d372e17501c1257e51f47ab7878b371.

Reverted https://github.com/pytorch/pytorch/pull/97186 on behalf of https://github.com/huydhn due to This breaks ONNX test in trunk and it looks like a landrace as the CI signal is green
2023-04-12 19:24:54 +00:00
dc52ba2906 Fix test_mps for macos 13.3 (#98739)
The expected dtype is changed from torch.int64 to torch.int32 on macOS versions
prior to 13.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98739
Approved by: https://github.com/kulinseth
2023-04-12 19:23:08 +00:00
419ad49e65 Make Tensor.__contains__ accept SymInt/Float/Bool. (#98933)
Fixes https://github.com/pytorch/pytorch/issues/98870

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98933
Approved by: https://github.com/albanD, https://github.com/Skylion007
2023-04-12 19:16:33 +00:00
ada7dfff71 fix allgather func collective to use maybe_wrap_tensor (#98866)
It looks like we forgot to switch allgather to use maybe_wrap_tensor;
this PR switches it to use that and adds a test to guard the tracing behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98866
Approved by: https://github.com/mrshenli
2023-04-12 19:13:46 +00:00
e99549526e [spmd] move the param_buffers to the front (#98437)
Makes it a bit easier to track the parameter count and the corresponding named
states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98437
Approved by: https://github.com/mrshenli
2023-04-12 19:13:46 +00:00
65070e1f0a Allow set pred with ConstantVariable (#98900)
It's part of the effort to improve PT2 Export UX. This PR improves the usability of `torch.cond()` by allowing the user to pass `pred` as a `ConstantVariable`, since it is not uncommon to see control flow on a rank or a tensor dim size, which is traced as a `ConstantVariable`.
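
A hedged sketch of what this enables (the import path reflects the experimental control-flow API of this era and may have moved since):

```
# Hedged sketch: pred is a plain Python bool (traced as a constant), not a tensor.
import torch
from functorch.experimental.control_flow import cond

def true_fn(x):
    return x * 2

def false_fn(x):
    return x * 3

@torch.compile
def f(x):
    return cond(x.shape[0] > 4, true_fn, false_fn, (x,))

print(f(torch.randn(8)))
```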

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98900
Approved by: https://github.com/jansel
2023-04-12 18:56:44 +00:00
a33eac3988 [NCCL] Add experimental Nonblocking NCCL Fault Tolerance/Checking (#95715)
Support for nonblocking NCCL communicators/fault tolerance/checking which was added in 2.14 as an experimental feature.
Enabled via the environment variable:
```
TORCH_NCCL_USE_COMM_NONBLOCKING=1
```

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95715
Approved by: https://github.com/kwen2501
2023-04-12 18:33:10 +00:00
09458a2bf1 introduce TensorBase::mutable_data_ptr() (#98163)
See D44409928 for motivation.

Note that we keep the const-ness of the existing data_ptr() member so
that we don't have to change all references atomically. We just change
the ones here that we have higher confidence in.

Differential Revision: [D44611466](https://our.internmc.facebook.com/intern/diff/D44611466/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98163
Approved by: https://github.com/ezyang
2023-04-12 18:15:18 +00:00
be8a4eb8e3 [MPS] Add index_fill op (#98694)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98694
Approved by: https://github.com/kulinseth
2023-04-12 18:13:33 +00:00
01e011b07c [MPS] Move bitwise ops registration to native_functions.yaml (#98908)
Per the offline discussion, there is no technical reason/limitation to have to register bitwise ops using `TORCH_LIBRARY_IMPL`.
Move the registration to `native_functions.yaml` for an easier lookup and consistent registration patterns as other mps ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98908
Approved by: https://github.com/kulinseth
2023-04-12 18:07:15 +00:00
c47464ed95 [PyTorch] Further reduce cost of TypeMeta::_typeMetaData (by 10x!) (#98105)
Currently we should be paying a small cost for the
thread-safe initialization of `index`. Now we should eliminate that
cost. (The 10x figure in the title comes from an internal benchmark that just
calls `TypeMeta::Match<caffe2::Tensor>()` in a loop.)

Differential Revision: [D44597852](https://our.internmc.facebook.com/intern/diff/D44597852/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98105
Approved by: https://github.com/ezyang
2023-04-12 17:44:48 +00:00
8a057c445d Move/Fix FakeTensor logic for detecting multiple fake modes (#97186)
This was left over from when we had more logic in FakeTensor rather than FakeTensorMode, and it wasn't firing correctly. It also makes more sense for it to be in the other validation function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97186
Approved by: https://github.com/bdhirsh
2023-04-12 17:40:41 +00:00
8654699c54 [dynamo] Remove _dynamo.skip and fold it in _dynamo.disable (#98899)
Summary
There is confusion between `_dynamo.skip` and `_dynamo.disable`. This removes the `_dynamo.skip` API. The functionality is still available via `_dynamo.disable(recursive=False)`.
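
A hedged usage sketch of the replacement:

```
# Hedged sketch: what used to be _dynamo.skip(helper) becomes
# _dynamo.disable(helper, recursive=False) -- skip compiling helper itself,
# but keep tracing the functions it calls.
import torch
import torch._dynamo as dynamo

def helper(x):
    return x + 1

helper = dynamo.disable(helper, recursive=False)

@torch.compile
def f(x):
    return helper(x) * 2

print(f(torch.randn(3)))
```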

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98899
Approved by: https://github.com/jansel
2023-04-12 17:33:26 +00:00
71aea7f56e [MPS] Add error inputs check (#98167)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98167
Approved by: https://github.com/kulinseth
2023-04-12 17:19:13 +00:00
286212080f introduce TensorBase::mutable_data_ptr<T> (#97874)
See D44409928 for motivation.

Note that we keep the const-ness of the existing data_ptr<T>() member so
that we don't have to change all references atomically. We just change
the ones here that we have higher confidence in.

Differential Revision: [D44497685](https://our.internmc.facebook.com/intern/diff/D44497685/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44497685/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97874
Approved by: https://github.com/ezyang
2023-04-12 15:13:30 +00:00
629377ea8b Revert "Replace _dynamo.config with an object instead of module (#96455)"
This reverts commit 420104a88654b0cf1b8600d042cd1f3c90ec5a59.

Reverted https://github.com/pytorch/pytorch/pull/96455 on behalf of https://github.com/jansel due to BC breaking, was landed prematurely
2023-04-12 15:06:14 +00:00
0c0e5c574e [inductor] Consolidate constant_args and cpp_constant_args (#98742)
Summary: Refactor code to simplify the logic. Support convolution as an
extern call in CudaWrapperCodeGen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98742
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-12 11:59:08 +00:00
ff9e34fb35 [inductor] Consolidate kernel and cpp_kernel for wrapper codegen (#98741)
Summary: refactor to simplify the wrapper codegen logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98741
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/ngimel
2023-04-12 11:59:08 +00:00
439a716785 remove unused TensorImpl::unsafe_data<T>() (#98720)
Differential Revision: [D44824809](https://our.internmc.facebook.com/intern/diff/D44824809/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98720
Approved by: https://github.com/ezyang
2023-04-12 09:58:41 +00:00
951df11af8 [dynamo] Raise exception on incorrect usage of disallow_in_graph (#98892)
Summary -
`disallow_in_graph` is mostly useful for backends. Suppose your backend does not support `torch.abs()`; you can then use `disallow_in_graph` to force a graph break (a usage sketch follows after the list below).

The assumption in the above statement is that `disallow_in_graph` is called on an `allowed` callable. `allowed` in Dynamo language refers to a callable that is put as-is in the Dynamo graph.

Therefore, if one uses `disallow_in_graph` on some non-torch, non-allowed function, we want to raise an exception to tell the user that they probably want something else.
* If they want to disable Dynamo - they should use torch._dynamo.disable
* If they want to stop inlining - they should use torch._dynamo.graph_break. However, this is not a decorator, so we would need to provide another API. But the question is: who would want to do this?
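
As a rough usage sketch of the supported case - disallowing an allowed torch op to force a graph break (the op and inputs are illustrative):
```python
import torch
import torch._dynamo as dynamo

# torch.abs is an "allowed" callable (normally put in the graph as-is);
# disallowing it makes Dynamo graph-break around each call instead.
dynamo.disallow_in_graph(torch.abs)

@torch.compile
def fn(x):
    return torch.abs(x) + 1

fn(torch.randn(4))
```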

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98892
Approved by: https://github.com/jansel
2023-04-12 07:50:56 +00:00
ee0143bf65 distinguish mutability of TensorImpl::data<T>() (#98719)
There already is a mutable_data<T>() with different semantics, so we
introduce new names:
TensorImpl::(mutable_)?data_dtype_initialized<T>().

Differential Revision: [D44824778](https://our.internmc.facebook.com/intern/diff/D44824778/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98719
Approved by: https://github.com/ezyang
2023-04-12 07:24:35 +00:00
9c98f2ceb7 inductor: rewrite mkldnn fx fusion using pattern_matcher(binary) (#97141)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97141
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-12 06:23:03 +00:00
d3a1a772b5 inductor: rewrite mkldnn fx fusion using pattern_matcher(conv_transpose_unary) (#97140)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97140
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-12 06:12:48 +00:00
73c3cb717d inductor: fix the issue of cat missing dim argument for sink_cat_after_pointwise (#98901)
Fix #98850, which reports an error when `cat` is called without a dim argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98901
Approved by: https://github.com/jansel
2023-04-12 06:08:11 +00:00
562e5d4942 inductor: rewrite mkldnn fx fusion using pattern_matcher(linear_unary) (#97139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97139
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-12 05:55:37 +00:00
c214c50355 inductor: rewrite mkldnn fx fusion using pattern_matcher(conv_unary) (#97007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97007
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-12 05:52:54 +00:00
0be65069d3 [BE] Use Literal from typing (#98846)
Since PyTorch is a Python-3.8+ compatible framework, `Literal` can be imported from the standard-library `typing` module.
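
For reference, a minimal illustration of the stdlib import this relies on (the function is just an example):
```python
from typing import Literal  # in the stdlib since Python 3.8

def set_mode(mode: Literal["train", "eval"]) -> None:
    print(f"mode set to {mode}")

set_mode("train")
```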

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98846
Approved by: https://github.com/janeyx99, https://github.com/ZainRizvi, https://github.com/Neilblaze
2023-04-12 05:49:37 +00:00
6ff32b5575 [MPS] Expose mps package in torch (#98837)
Fixes #98740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98837
Approved by: https://github.com/albanD, https://github.com/Neilblaze
2023-04-12 04:27:49 +00:00
d3a35956de Skip dtensor ops on CPU-only runner due to flaky timeout (#98868)
`distributed/_tensor/test_dtensor_ops` is still flaky in trunk with a curious timeout issue, for example ce4df4cc59.  It seems that the test just hangs without any failure.  The root cause is unclear.  On the other hand, https://github.com/pytorch/pytorch/issues/98816 might offer a solution for this.  Anyway, I'm disabling the test on CPU for now while the investigation is being done.

The test is still being run on CUDA-available runner because it's not flaky there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98868
Approved by: https://github.com/clee2000
2023-04-12 03:40:02 +00:00
60ebb2f116 [Gloo][BE] Print stacktrace on collectFullMesh (#98810)
Catch the error and torch_check it so the full C++ stacktrace is printed for
better debugging

Differential Revision: [D44860626](https://our.internmc.facebook.com/intern/diff/D44860626/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98810
Approved by: https://github.com/wanchaol
2023-04-12 03:27:53 +00:00
39fd7f945f Add Symbool support in python to C++ translation (#98453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98453
Approved by: https://github.com/ezyang
2023-04-12 03:21:57 +00:00
bc8cb62bcb torch.compile benchmark utility (#97699)
I've had many exchanges that look like this: https://github.com/rasbt/faster-pytorch-blog/pull/2, so this is an attempt to make this problem easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97699
Approved by: https://github.com/ezyang
2023-04-12 03:02:06 +00:00
455795c799 Enable fake_crossref unit tests on rocm (#97368)
This PR should enable 900+ fake_crossref unit tests for ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97368
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-04-12 02:38:35 +00:00
9c5473b79c [BE] Move mobile builds to python-3.8 (#98886)
As we've deprecated 3.7 support for PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98886
Approved by: https://github.com/PaliC, https://github.com/seemethere
2023-04-12 02:01:10 +00:00
1510eb4072 [export] Constraints API (#98433)
Wrapper for users to insert constraints into model code.

The constraints will not be maintained in the graph after tracing through make_fx, so retracing with dynamo/make_fx will not work. This will be supported after torch._assert support is implemented. Then we can convert the constrain_range calls to torch._asserts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98433
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2023-04-12 01:32:44 +00:00
ac5025cdad [llvm-17][ORC] Fix for move most ORC APIs to ExecutorAddr, introduce ExecutorSymbolDef. (#98811)
Summary:
Due to change in upstream there are multiple builds that fail to build with llvm-17.
8b1771bd9f
Added a llvm version check.

Test Plan: local testing on failing build with trunk/llvm-12

Reviewed By: zhuhan0

Differential Revision: D44851324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98811
Approved by: https://github.com/malfet, https://github.com/bertmaher
2023-04-12 01:12:37 +00:00
f3080997e5 [SPMD] Introduce remove_copy_for_optimizer optimization (#98580)
This PR adds the ability to remove unused `copy_` (`len(node.users) == 0`) that generated by tracing the optimizer.

Differential Revision: [D44761556](https://our.internmc.facebook.com/intern/diff/D44761556/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98580
Approved by: https://github.com/mrshenli
2023-04-12 00:51:22 +00:00
401320690b [SPMD] Add optimizer states and steps to the return (#98579)
This will correctly functionalize the optimizer. Otherwise, there are orphaned copy_ ops.

Differential Revision: [D44761512](https://our.internmc.facebook.com/intern/diff/D44761512/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98579
Approved by: https://github.com/mrshenli
2023-04-12 00:51:22 +00:00
07a1378f52 [SPMD] Introduce schedule_comm_wait (#98578)
`schedule_comm_wait` delays the wait_tensor ops as late as possible. Note that this optimization currently does not reorder the computation ops. For `foreach` based optimizer, we observe that reordering the computation ops is required to achieve a good performance.

Differential Revision: [D44761487](https://our.internmc.facebook.com/intern/diff/D44761487/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98578
Approved by: https://github.com/mrshenli
2023-04-12 00:51:19 +00:00
dd3e2ddc0a [SPMD] Introduce graph_optimization_pass and comm_fusion_with_cat (#98285)
This PR add `graph_optimization_pass` decorator which should be wrapped by all graph optimization passes. This PR also introduces the first graph optimization, `comm_fusion_with_cat`, as the first use case of `graph_optimization_pass`.

Differential Revision: [D44661608](https://our.internmc.facebook.com/intern/diff/D44661608/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98285
Approved by: https://github.com/yifuwang
2023-04-12 00:51:16 +00:00
78ad800a2a [nccl] Remove lock for nccl collective launch for 2.0+ (#97904)
Summary: It looks like NCCL 2.0+ no longer needs a lock to avoid being called concurrently with cudaFree.

Test Plan: sandcastle + OSS CI

Differential Revision: D44514446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97904
Approved by: https://github.com/malfet, https://github.com/kwen2501
2023-04-11 23:58:54 +00:00
e37986d48f [memory viz] support larger visualizations (#98865)
When there are > 15000 polygons trace_plot starts to get really slow.
So order the allocations and take the smallest allocations beyond the 15000
limit and put them into a single summarized polygon.
A slider allows this limit to be adjusted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98865
Approved by: https://github.com/yf225
2023-04-11 23:56:41 +00:00
a2e0f5128c [dynamo] Fix bug with torch._dynamo.skip (#98862)
Summary
* Fixed an issue with `skip`
* Also removed some tests from test_misc.py and moved them to test_decorators.py as test_misc.py is becoming a dumping ground.

~~~

# Code - fn1 was not getting skipped earlier
def fn2(x):
    return x.sin()

@torch._dynamo.skip
def fn1(x):
    x = x.sigmoid()
    return fn2(x.cos())

def fn(x):
    return fn1(x.tan())

# Extracted graph
def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    tan = l_x_.tan();  l_x_ = None
    return (tan,)

def forward(self, L_x_ : torch.Tensor):
    l_x_ = L_x_
    sin = l_x_.sin();  l_x_ = None
    return (sin,)
~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98862
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-04-11 23:20:08 +00:00
2de67eaaee [SPMD] Add a dump_graphs_to_files utils to facilitate graph transformation debug (#98284)
Throughout the compilation, multiple graphs are generated.  This PR adds a utility to dump the resulting graphs to a folder.

Differential Revision: [D44661599](https://our.internmc.facebook.com/intern/diff/D44661599/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98284
Approved by: https://github.com/mrshenli
2023-04-11 23:14:12 +00:00
0962114802 Fix 'fully_shard' may determine compute device incorrectly (#98831)
Fixes #98829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98831
Approved by: https://github.com/awgu
2023-04-11 22:42:48 +00:00
c93ff384c3 [Easy] Reuse source variable in wrap_tensor (#98845)
2fab2893aa/torch/_dynamo/variables/builder.py (L759-L760)
We already save `source = self.get_source()` at the beginning of `wrap_tensor()`. Since the source should be fixed at `VariableBuilder` construction time, we should be okay to reuse the `source` variable instead of calling `get_source()` every time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98845
Approved by: https://github.com/ezyang
2023-04-11 22:23:59 +00:00
ad373efe6d [ONNX] Skip flaky dynamic tests before ORT==1.15 in fx exporter (#98856)
Disable all flaky dynamic tests
From https://github.com/pytorch/pytorch/issues/98626#issuecomment-1502692018

Rerun all test cases and update skip reasons. The cases failing on both static and dynamic shapes are unittest.skipped. If a case only fails on dynamic shapes, it's skipped by skip_dynamic_test. There are a few skipped with skip_ort_min_version, since ORT does not support the dynamic fx exporter until the next version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98856
Approved by: https://github.com/BowenBao
2023-04-11 22:08:12 +00:00
6cbe5c5ef7 Fix Lint (#98873)
Fixes lint errors introduced by [#98433](https://github.com/pytorch/pytorch/pull/98779)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98873
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-11 21:47:21 +00:00
89894115ab [MTPG] add all_to_all collective to MTPG (#98791)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98791
Approved by: https://github.com/kumpera
2023-04-11 21:35:45 +00:00
420104a886 Replace _dynamo.config with an object instead of module (#96455)
Summary:
    Replace _dynamo.config with an object instead of module

    Current usage patterns of setting and reading fields on config will work
    unchanged.

    Only changes needed going forward:
    1. import torch._dynamo.config will not work. However, just doing
       import torch._dynamo is sufficient to access dynamo config
       as torch._dynamo.config.

    2. Files inside the _dynamo folder need to access config via
       from torch._dynamo.config_util import config instead of
       from torch._dynamo import config, because _dynamo/__init__.py
       imports some of those files, which would create a circular import.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96455
Approved by: https://github.com/williamwen42
2023-04-11 21:23:32 +00:00
06c206cea3 [SPMD] Add the default graph module transformation that is applied after tracing and expansion (#98182)
This PR adds the GraphModuleTransformation class that can be used as the
default transformation after the `train_step()` is traced and expanded. The
current implementation includes:
1. Wrap the input graph module with IterGraphModule. This will enable further graph optimizations, which are all implemented based on IterGraphModule.
2. Ability to lower the graph module to the Inductor. To achieve this goal, `lower_to_inductor()` is implemented.

TODO:
1. The `override` and `gm_transformation` have overlapping functionality -- `override.transform` can be used to achieve the same thing as `gm_transformation`. However, the current semantics of `override` is to override and transform partial graphs, while `gm_transformation` is to transform the entire expanded GM. The final UX of `compile()` needs some discussion.

2. The current `lower_to_inductor()` assumes that the entire graph can be lowered to Inductor. This assumption is okay for integration of graph optimizations but is too restrictive for many models. We should upstream `partial_lowering()`.

Differential Revision: [D44616783](https://our.internmc.facebook.com/intern/diff/D44616783/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98182
Approved by: https://github.com/mrshenli
2023-04-11 21:12:49 +00:00
367051e47e [docs] Add missing functions to autograd.rst (#98854)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98854
Approved by: https://github.com/albanD
2023-04-11 20:45:49 +00:00
3b6a78ea87 [Dynamo] Lazy Module support list/tuple input (#98809)
Fixes Meta internal user case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98809
Approved by: https://github.com/wconstab
2023-04-11 20:38:04 +00:00
def50d2534 Create a new unstable workflow for periodic jobs (#98858)
And move ROCm distributed job there as it's very flaky in trunk at the moment.  Also move ROCm slow job to `slow` workflow as it should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98858
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-04-11 20:12:26 +00:00
88dae230d0 dynamic range constraint API (#98779)
This diff adds the ability to specify range constraints on dynamic dimensions. (Previously we only supported declaring a dynamic dimension, which gets the default range `[2, sympy.oo]`.)

One point worth calling out: our initial design called for compound expressions like `lower <= dynamic_dim(x, d) <= upper`. However this seems difficult to support, because of a combination of desugaring and overloading semantics for such compound expressions in Python. Rather than silently doing the wrong thing, we explicitly error in this case and recommend users to specify multiple constraints, which is supported.
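
A minimal sketch of what two separate constraints might look like; the `torch._export` import path, the `export(...)` call, and the bounds are assumptions for illustration, not taken from the diff:
```python
import torch
from torch._export import dynamic_dim, export  # import path is an assumption

def f(x):
    return x * 2

x = torch.randn(8, 3)

# Instead of the unsupported compound form `2 <= dynamic_dim(x, 0) <= 16`,
# specify the lower and upper bounds as two separate constraints.
constraints = [
    dynamic_dim(x, 0) >= 2,
    dynamic_dim(x, 0) <= 16,
]

exported = export(f, (x,), constraints=constraints)
```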

Differential Revision: [D44847318](https://our.internmc.facebook.com/intern/diff/D44847318/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98779
Approved by: https://github.com/ezyang
2023-04-11 20:11:46 +00:00
1e807f1189 Log PT2 compile to Scuba (#98790)
Summary:
Modeled off of https://www.internalfb.com/code/fbsource/[5f363eaeab1b5d620b9df83ba0de65adfd96771b]/fbcode/caffe2/torch/fb/trainer/profilers/gpu_mem_signpost.py?lines=106-115

I didn't use the Scuba integration in torch/_inductor/fb/logging.py to avoid
having to make a new Scuba table; probably should do this.

Test Plan:
```
buck2 test //caffe2/test:test_dynamo
```

Differential Revision: D44850903

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98790
Approved by: https://github.com/desertfire, https://github.com/bertmaher
2023-04-11 20:10:35 +00:00
97889fa199 simplify indexing expression before trying to determine strides (#98783)
This fixes a few failing cases where we fail to compute stride_hint for an indexing expression with ModularIndexing

When can size_hint error out? It shouldn't happen when we are getting regular size hints for expressions where free vars are in ShapeEnv. But this is not the case when we try to recover strides from indexing expressions (which is what stride_hint is for). Suppose you have an indexing expression that looks like
```
289*d0 + ModularIndexing(7399*d1 + d2, 1, 17) + 17*ModularIndexing(7399*d1 + d2, 17, 17) + 46240*ModularIndexing(7399*d1 + d2, 289, 128)
```
and want to understand its stride wrt to variable `d1`. Let's ignore for a moment that stride for ModularIndexing is not well defined, it'll become negative around modulo divisor value, but even without that, the way we usually compute stride is we substitute `0` and `1` for `d1` and compute difference in indexing expression with those substitutions - this is our stride. But for the expression above, the difference would result in an expression that still has free variable `d2` that we don't have a substitution for.
The fix that this PR makes is it expands stride computation to substitute not only `0` and `1` for the variable we are computing a stride for, but also `0` for other variables in the indexing expression (`support_vars`).
Note that computing strides in `stride_hints` is a performance optimization that we use to reorder dimensions or make split decisions for split reduction. If it fails, it's not a hard error - we may incorrectly apply reordering, but it won't affect correctness.
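
A small, self-contained illustration of the substitution idea (using a simplified stand-in expression, not the real indexing expression):
```python
import sympy

d0, d1, d2 = sympy.symbols("d0 d1 d2", integer=True, nonnegative=True)
# A simplified stand-in for the indexing expression above.
expr = 289 * d0 + 17 * d1 + d2

def stride_wrt(expr, var, support_vars):
    # Substitute 0 for the other free variables so the difference is variable-free.
    zeros = {v: 0 for v in support_vars if v is not var}
    return expr.subs({var: 1, **zeros}) - expr.subs({var: 0, **zeros})

print(stride_wrt(expr, d1, [d0, d1, d2]))  # 17
```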

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98783
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-04-11 20:01:20 +00:00
4130e4f284 [hypothesis==6.70.1] Fix more test errors (#98685)
Summary:
This diff fixes more test failures (T150117218) caused by upgrading the "hypothesis" library to 6.70.1 (D44523679).

# //caffe2/caffe2/python:hypothesis_test
This test generates float numbers and filters out those whose absolute values are less than 1e-2.
It is a known issue of the new version of "hypothesis" that it generates zeros or floats with small absolute values too often:
https://github.com/HypothesisWorks/hypothesis/issues/3603
I'm circumventing this issue by suppressing the health check `filter_too_much`.
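
For reference, suppressing that health check in hypothesis looks roughly like this (the strategy and bound shown are illustrative):
```python
from hypothesis import HealthCheck, given, settings, strategies as st

@settings(suppress_health_check=[HealthCheck.filter_too_much])
@given(st.floats(allow_nan=False, allow_infinity=False).filter(lambda v: abs(v) >= 1e-2))
def test_small_values_filtered(value):
    assert abs(value) >= 1e-2
```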

# //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test
All arithmetic should be done in float32 when calculating the reference, since the network being tested uses float32 everywhere.
Mixing float32, float64 or even integers will result in intermediate values in float64.
The different precision may cause off-by-1 errors when converting to integer.

Test Plan:
Run all the tests in both "dev" and "opt" modes:
```
for mode in dev opt; do
  buck2 test mode/$mode //caffe2/caffe2/python:hypothesis_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/quantization/server:resize_nearest_dnnlowp_op_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/layers/tests:tum_history_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/dper/layer_models/tests:nn_ops_test -- --run-disabled
  buck2 test mode/$mode //caffe2/caffe2/fb/metrics:metrics_test -- --run-disabled
  buck2 test mode/$mode //deeplearning/numeric_suite/toolkit/test:net_transform_test -- --run-disabled
  buck2 test mode/$mode //f3/type_system:tests -- --run-disabled
done
```

**NOTE:** In the first test (`//caffe2/caffe2/python:hypothesis_test`), the two methods `test_constant_fill_from_tensor` and `test_recurrent` would crash.
But these crash on hypothesis 5.49.0, too, so I'm leaving them alone.

Differential Revision: D44812706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98685
Approved by: https://github.com/malfet
2023-04-11 19:07:55 +00:00
16beb636b8 Generalize summary script to work with more CSV names (#98500)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98500
Approved by: https://github.com/wconstab
2023-04-11 19:05:18 +00:00
6361c3debc Return zero_point from determine_qparams as a int64 (#98746)
Summary:
In some cases, zero_point is returned as an int tensor. We want it to be a long.

This fixes a failed assertion in Executorch op_choose_qparams:
https://www.internalfb.com/code/fbsource/[4609e7dbbf2e]/fbcode/executorch/kernels/quantized/cpu/op_choose_qparams.cpp?lines=49-52

Test Plan: CI

Reviewed By: jerryzh168

Differential Revision: D44764070

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98746
Approved by: https://github.com/jerryzh168
2023-04-11 19:01:05 +00:00
abafb1e6dc [fx] Minor bug fix for SubgraphMatcher when ignoring literals (#98458)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98458
Approved by: https://github.com/andrewor14
2023-04-11 18:54:30 +00:00
c9adc4c376 [Dynamo] De-dup graph inputs (#98775)
###  Overview
This PR de-duplicates graph inputs in TorchDynamo, using the `Source` as the unique identifier for each input. This closes https://github.com/pytorch/pytorch/issues/98743 and https://github.com/pytorch/pytorch/issues/98625.

### Details
`VariableBuilder.wrap_tensor()` should return a `VariableTracker` for the passed-in `value: Tensor`. If `value` is duplicated, we should avoid calling `OutputGraph.create_graph_input()` and `OutputGraph.add_grapharg()`.
- Note that `create_graph_input()` and `add_grapharg()` are not 1:1. For a constant source and either `wrap_sym()` or `wrap_unspecialized_primitive()`, TorchDynamo still calls `create_graph_input()` but not `add_grapharg()`.
- Note that `create_graph_input()` should be called before constructing the corresponding `VariableTracker`. TorchDynamo needs the `fx.Proxy` object to pass to `wrap_fx_proxy()`.

In this PR, the `OutputGraph` saves an additional mapping `input_source_to_var` from each graph input's `Source` to its `VariableTracker`, which works because `Source` is now hashable. This mapping should be updated each time `create_graph_input()` is called. However, since we must construct the `VariableTracker` after `create_graph_input()` returns, we must have a separate call to the `OutputGraph` to update the mapping.

If anyone has any suggestion on how to coalesce this logic and avoid having to remember to update `input_source_to_var` for each `create_graph_input()`, I would love to hear it.

<details>
<summary> Alternate Approach</summary>

Initially, I tried having TorchDynamo construct a new but equivalent `VariableTracker` for the duplicated tensor. However, I abandoned this approach after hitting an assertion in `def wrap_fx_proxy_cls()` due to `"example_value"` already being in the proxy node's metadata because we were reusing the primary tensor's `Proxy` object. Reusing the exact `VariableTracker` also seems less error-prone instead of requiring constructing a new but identical `VariableTracker`.
</details>

### Testing
#### Global Variable Test
```
import torch
@torch.compile()
def f():
    return x + x
x = torch.randn(3)
f()
```

Before:
```
====== Forward graph 0 ======
 <eval_with_key>.6 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3], arg1_1: f32[3]):
        # File: /data/users/ezyang/b/pytorch/ff.py:5, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
        return (add,)
```

After (only `arg0_1` and no more `arg1_1`):
```
 ====== Forward graph 0 ======
 <eval_with_key>.4 class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[3]):
        # File: dynamo/test_dup_global.py:8, code: return x + x
        add: f32[3] = torch.ops.aten.add.Tensor(arg0_1, arg0_1);  arg0_1 = None
        return (add,)
```

#### FSDP Test
Before we error on
```
File "/.../pytorch/torch/_guards.py", line 244, in __post_init__
    assert self.input_source_a != self.input_source_b
```
and now there is no error.

---
The rename from `name_to_input` to `input_name_to_proxy` is not part of the core logic change and is a remnant from initial attempts. I can undo it later if desired, but I also feel that the new name is more informative. It also fixes the type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98775
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-04-11 18:07:20 +00:00
ca791b6909 [MPS] Add higher order derivatives warning to max_pool2d (#98582)
The higher-order derivative calculations of `max_pool2d` require the indices to be provided, but the `mps_max_pool2d` kernel doesn't calculate them. If we calculated indices during back propagation afterwards, that would be expensive and unnecessary, since users can directly call `max_pool2d` with `return_indices=True`, which calculates `indices` along the way.

This PR adds a warning for it.
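
For reference, requesting the indices up front looks roughly like this (shapes are illustrative):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8, requires_grad=True)
# Ask for the indices up front so higher-order derivative code has them.
out, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
```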
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98582
Approved by: https://github.com/soulitzer
2023-04-11 18:03:46 +00:00
e2cfdf177b Remove un-used part of cuda rng state (#98787)
The comment is quite confusing: given the use of `sizeof()`, this was never backward compatible, as the state is not the same size as it used to be.

Running this through CI right now. If it turns out we serialize some rng_state Tensor, I will update the set function to be BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98787
Approved by: https://github.com/ngimel
2023-04-11 17:45:22 +00:00
778fd1922a [core][pruning][be] Rename sparsifier folder to pruner (#98758)
Summary:
att

Test Plan:
```
python test/test_ao_sparsity.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98758
Approved by: https://github.com/jerryzh168
2023-04-11 17:26:29 +00:00
583193e1d9 [MPS] Fix batch_norm_backwards key (#98794)
One needs different graphs for batch_norm_backwards depending on whether or
not gradients are required for some of the params.

Fixes https://github.com/pytorch/pytorch/issues/98602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98794
Approved by: https://github.com/kulinseth
2023-04-11 17:23:36 +00:00
2b38bd5bba [ONNX] Safely set node name for 'replace_placeholder_name_and_target' (#98633)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98633
Approved by: https://github.com/wschin
2023-04-11 17:02:19 +00:00
ad1d842234 [Dynamo] Make python random calls real random (#98812)
Fixes #95425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98812
Approved by: https://github.com/wconstab
2023-04-11 16:57:34 +00:00
abe96654de [reland][BE][autograd Function] Raise an error if input is returned a… (#98051)
…s-is and saved for forward or backward in setup_context

Fixes #ISSUE_NUMBER

Relanding this in a new non-ghstack PR so I can import this to do co-dev
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98051
Approved by: https://github.com/zou3519
2023-04-11 15:42:54 +00:00
97a756f57d Enable G004 lint check (#98843)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98843
Approved by: https://github.com/janeyx99, https://github.com/malfet
2023-04-11 14:57:15 +00:00
15686950b7 [spmd] quick fix on batch input view issue (#98813)
This is a quick fix/hack to get around the issue that some
"global" tensor view operations are invalid, but somehow they get
triggered by some models, as the mini-batch input itself won't have this
issue.

Since ultimately we should remove the dtensor expand and use the new
expansion, this hack is only a temporary unblock.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98813
Approved by: https://github.com/yifuwang, https://github.com/mrshenli
2023-04-11 14:27:01 +00:00
760967a284 Update _store_based_barrier implementation to reduce load on rank 0 (#98000)
Summary:

Switched from using add(), which overloads rank 0 with requests, to a single request every 10 seconds to handle the last joined worker.
Added an optional logging_interval arg to _store_based_barrier.

Test Plan:
```
pytest test/distributed/test_c10d_common.py -vsk test_store_based_barrier
```

Reviewed By: rohan-varma

Differential Revision: D44430531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98000
Approved by: https://github.com/kumpera
2023-04-11 14:25:29 +00:00
b8b840be3d Convert logging f-strings to use % format, part five (#98765)
This does some annoying but simple cases by hand.
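
For context, the conversion pattern is roughly the following (logger name and message are illustrative):
```python
import logging

log = logging.getLogger(__name__)
n_graphs = 3

# Before: the f-string is formatted even when the record is filtered out.
# log.info(f"compiled {n_graphs} graphs")

# After: formatting is deferred until the record is actually emitted.
log.info("compiled %s graphs", n_graphs)
```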

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98765
Approved by: https://github.com/wanchaol
2023-04-11 13:17:59 +00:00
5a7aad9681 Convert logging f-strings to use % format, part four (#98705)
This does multi-line concatenated string literals.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98705
Approved by: https://github.com/voznesenskym
2023-04-11 13:17:59 +00:00
5a458a9df4 Convert logging f-strings to use % format, part three (#98704)
This does triple-quoted strings.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98704
Approved by: https://github.com/voznesenskym, https://github.com/albanD
2023-04-11 13:17:56 +00:00
5ca3afd1bf torch.hub: add safe weights_only option to load_state_dict_from_url (#98479)
This adds a `weights_only` option to torch.hub.load_state_dict_from_url which is helpful for loading pretrained models from potentially untrusted sources.

Ex: https://github.com/d4l3k/torchdrive/blob/main/torchdrive/models/simple_bev.py#L618-L621

See https://github.com/pytorch/pytorch/pull/86812 for more info on weights_only
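
A minimal usage sketch (the URL is a placeholder):
```python
import torch

state_dict = torch.hub.load_state_dict_from_url(
    "https://example.com/checkpoint.pth",  # placeholder URL
    weights_only=True,  # restrict unpickling for potentially untrusted sources
)
```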

Test plan:

```
pytest test/test_hub.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98479
Approved by: https://github.com/NicolasHug
2023-04-11 12:44:25 +00:00
5907173022 Updated upsampling test to use parametrize_test decorator (#97769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97769
Approved by: https://github.com/NicolasHug
2023-04-11 12:20:00 +00:00
6145964ec9 distinguish implementation of data() and mutable_data() on TensorImpl (#98732)
The old style had them both going through a mutable method on Storage,
which would prevent us from implementing checks differently depending
on whether we are writing or reading.

Differential Revision: [D44831044](https://our.internmc.facebook.com/intern/diff/D44831044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98732
Approved by: https://github.com/ezyang
2023-04-11 11:37:29 +00:00
34961d416c Remove unused log config settings (#98795)
Summary: Removing deprecated log settings

Test Plan: Removing code, no tests needed

Differential Revision: D44853619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98795
Approved by: https://github.com/anijain2305
2023-04-11 10:07:29 +00:00
ce4df4cc59 Enable triton build in CI docker image for ROCm (#98096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98096
Approved by: https://github.com/malfet
2023-04-11 09:02:19 +00:00
7117c87489 torch.library.Library.impl: add missing param in docstring example (#98619)
previously this was missing the callable
```
            >>> my_lib = Library("aten", "IMPL")
            >>> def div_cpu(self, other):
            >>>     return self * (1 / other)
            >>> my_lib.impl("div.Tensor", "CPU")
            #                            ^ missing `div_cpu` here
```
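
Presumably the corrected example passes the callable, along the lines of:
```
>>> from torch.library import Library
>>> my_lib = Library("aten", "IMPL")
>>> def div_cpu(self, other):
>>>     return self * (1 / other)
>>> my_lib.impl("div.Tensor", div_cpu, "CPU")
```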
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98619
Approved by: https://github.com/ezyang
2023-04-11 06:09:46 +00:00
0c162adfa8 [dynamo] Support callable() on user defined functions (#98662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98662
Approved by: https://github.com/yanboliang
2023-04-11 05:43:46 +00:00
c377a8590b Add nonzero_static() op to pytorch to unblock export (#97417)
Summary: Add new experimental python op (`torch.nonzero_static`) for export. There is NO cuda impl included in this PR

Example:

Say input tensor is `x = torch.tensor([[1, 0], [3, 2]])`

calling regular `nonzero()` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1]])`
calling `nonzero_static(x, size=4)` on x will give you a tensor `tensor([[0, 0], [1, 0], [1, 1], [fill_value, fill_value]])` (padded)
calling `nonzero_static(x, size=2)` on x will give you a tensor `tensor([[0, 0], [1, 0]])` (truncated)
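
As a runnable sketch of the calls above (whether `fill_value` has a default, and what it is, is an assumption, so it is not spelled out here):
```python
import torch

x = torch.tensor([[1, 0], [3, 2]])

torch.nonzero(x)                 # tensor([[0, 0], [1, 0], [1, 1]])
torch.nonzero_static(x, size=4)  # padded to 4 rows with fill_value entries
torch.nonzero_static(x, size=2)  # truncated: tensor([[0, 0], [1, 0]])
```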

Test Plan:
**Unit Tests**
```
buck test @mode/dev-nosan //caffe2/test:test_dynamo -- 'caffe2/test:test_dynamo - test_export.py::ExportTests::test_export_with_nonzero_static' -- 'caffe2/test:test_dynamo - test_misc.py::MiscTests::test_nonzero_static'
```

**PT2 Export with `nonzero_static()`**
Example of `GraphModule` in the exported graph
```
def forward(self, x):
    arg0, = fx_pytree.tree_flatten_spec(([x], {}), self._in_spec)
    nonzero_static_default = torch.ops.aten.nonzero_static.default(arg0, size = 4);  arg0 = None
    return pytree.tree_unflatten([nonzero_static_default], self._out_spec)
```

Differential Revision: D44324808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97417
Approved by: https://github.com/ezyang
2023-04-11 05:13:36 +00:00
d4ce045cfc [Add] storage support for custom backend. (#98469)
Currently, storage only supports a subset of backends. We want storage to be created on a custom backend via the PrivateUse1 key.
@ezyang Could you review my changes?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98469
Approved by: https://github.com/ezyang
2023-04-11 03:55:23 +00:00
1ff0a03e3f Fix misuse of active mask (#98157) (#98159)
Fixes #98157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98159
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-04-11 03:30:10 +00:00
a7892802b9 [dynamo] Add einops to skipfiles (#98661)
This was causing failures in a torchbench model

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98661
Approved by: https://github.com/yanboliang
2023-04-11 03:21:36 +00:00
910d9224b5 [spmd compile api] use fake tensor for DTensor propagation (#98789)
Summary: When using real tensors for DTensor propagation, functionalized _fuse_adam causes a memory spike of size(params + optim_state), which causes OOM in memory-constrained environments.

Test Plan: Tested manually.

Differential Revision: D44845043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98789
Approved by: https://github.com/mrshenli
2023-04-11 03:11:22 +00:00
5a2de506fc [spmd compile api] run gm_transforms before running the first iteration (#98788)
Summary: The non-transformed graph module contains a functionalized optimizer which, in a memory-constrained environment, needs to be defunctionalized (via fx transformation or lowering to Inductor) before running the first iteration. Otherwise OOM may occur.

Test Plan: Manually tested.

Reviewed By: mrshenli

Differential Revision: D44843942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98788
Approved by: https://github.com/mrshenli
2023-04-11 03:02:32 +00:00
ec1d6580f1 [stronghold][bc-linter] correctly determine the base commit of the PR (#98538)
Currently `${{ github.event.pull_request.base.sha }}` returns the HEAD of the base branch, which is different from **the base of the PR**.

See:
https://github.com/github/docs/issues/431
https://github.com/orgs/community/discussions/39880

However, BC linter needs to know the base revision **of the PR**, as it looks at the changes **in the PR**.

This change is a workaround that determines the correct base of the PR. Hopefully in the future GH provides this information  in the event, and this workaround could be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98538
Approved by: https://github.com/PaliC
2023-04-11 03:00:22 +00:00
ab385bd49e docs: Linking ResNeXt PyTorch Hub Pipeline (#98689)
Introduces the ResNeXt model as a link to PyTorch Hub; see the Skip connections section.
Handles the issue in #98690.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98689
Approved by: https://github.com/zou3519, https://github.com/kit1980
2023-04-11 02:20:26 +00:00
85e1d74c52 [FSDP] Clarify CPU offload implicitly in reshard_doc (#98666)
Per title

Differential Revision: [D44812344](https://our.internmc.facebook.com/intern/diff/D44812344/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98666
Approved by: https://github.com/awgu
2023-04-11 02:13:23 +00:00
c00fd71a95 Workaround for CuDNN-8.7+ load bug (#98644)
Preload `cudnn_cnn_infer` and consume `dlerror` to prevent spurious call to `abort()` from `libcudnn.so.8`, if `libnvrtc.so` is missing on the system.

Fixes https://github.com/pytorch/pytorch/issues/97041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98644
Approved by: https://github.com/ngimel
2023-04-11 01:45:43 +00:00
fa077377ea [PtE][CoreML] Create modelID as value not reference (#98655)
Summary:
https://www.internalfb.com/logview/details/instagram_ios_crashes/d5fd49a99f3ee21a82b66861de797711

CoreML is crashing in torch::jit::mobile::coreml::CoreMLBackend::compile(c10::IValue, c10::Dict<c10::IValue, c10::IValue>) (PTMCoreMLBackend.mm<175>)

This is related to the crash here https://www.internalfb.com/logview/details/instagram_ios_crashes/a8a317c8da13cd577529e1763364f496/?trace_key=8002f84f5ea00ac68b0dfb91878c754a&selected-logview-tab=shared

kimishpatel's original fix (D44386623) passed modelID by value instead of by reference; however, I believe it just moved the error to the loadModel invocation.

When we create a copy of modelID on loadModel invocation, it is still a reference to the string within the preprocessed IValue payload. When the payload is deallocated, modelID is no longer valid and the dispatched thread still tries to use it, causing the error.

Test Plan:
```
Running with tpx session id: 2a77b7b1-7594-4479-8ac3-c01db29cf5cc
Trace available for this run at /tmp/tpx-20230407-173155.849234-2a77b7b1-7594-4479-8ac3-c01db29cf5cc/trace.log
RemoteExecution session id: reSessionID-2a77b7b1-7594-4479-8ac3-c01db29cf5cc-tpx
I0407 17:31:55.970502 780835 ConfigeratorDomainConfigs.cpp:177] Notify user with updated size: 92 removed size: 0
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/1970325002807752
    ✓ ListingSuccess: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests : 13 tests discovered (0.177)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchBITests/testBITextModel (0.028)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchBITests/testBIXRayModel (0.167)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmComplexDouble (0.001)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmComplexFloat (0.001)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmDouble (0.001)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCPUBlasTests/testGemmFloat (0.001)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testGanModel (0.303)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testMCSModel (0.395)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testMCSModelInvalidInputShape (0.305)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchCoreMLTests/testXirpModel (0.110)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPytorchFamFlDictModel (0.014)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPytorchFamFlModel (0.005)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - PyTorchDynamicPyTorchTests/testDynamicPyTorchXirpModel (0.065)
    ✓ Pass: //fbobjc/Apps/Internal/PyTorchPlayground:PyTorchPlaygroundTests - main (13.177)
```

Differential Revision: D44808433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98655
Approved by: https://github.com/SS-JIA, https://github.com/tiandiao123, https://github.com/kirklandsign
2023-04-11 01:05:13 +00:00
ef3ea30eed Add CUDA 12.1 workflows (#98492)
CC @atalman @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98492
Approved by: https://github.com/malfet
2023-04-11 00:33:53 +00:00
dda95236c9 Add fast path in our type checks and argparser (#98764)
Add fastpath for common use cases in our python arg parsing.
This uses the observation that an exact type check (pointer comparison) is a lot faster than a subtype check (isinstance call), so we make sure to do these before any isinstance check.

This can be pretty significant where `a.view((1, 1, 1, 1))` goes from ~1.13us to 800ns.

Full test:

Tested perf locally with cpu freq locked and script pinned to a single core to reduce jitter.
Benchmark results after doing each change in this PR one by one:
```
[albandes@albandes-fedora-K2202N0104138 test]$ # Original
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
827 ns ± 0.945 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
947 ns ± 1.23 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1.04 µs ± 0.882 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.14 µs ± 1.59 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
797 ns ± 0.955 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
937 ns ± 1.51 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.02 µs ± 3.52 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
823 ns ± 1.76 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
938 ns ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1.03 µs ± 0.801 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.13 µs ± 0.877 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
768 ns ± 2.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
927 ns ± 0.779 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.01 µs ± 1.34 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

[albandes@albandes-fedora-K2202N0104138 test]$ # checkLong fastpath
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
801 ns ± 0.982 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
900 ns ± 0.593 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1 µs ± 1.44 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.1 µs ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
782 ns ± 0.968 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
1.11 µs ± 424 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.09 µs ± 54.7 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
817 ns ± 0.65 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
912 ns ± 0.853 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1.02 µs ± 8.45 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.11 µs ± 2.53 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
781 ns ± 0.942 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
939 ns ± 1.57 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.01 µs ± 0.875 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

[albandes@albandes-fedora-K2202N0104138 test]$ # Tensor check fastpath
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
806 ns ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
903 ns ± 1.82 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1 µs ± 1.21 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.1 µs ± 1.17 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
770 ns ± 1.66 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
931 ns ± 3.36 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.02 µs ± 0.983 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
813 ns ± 2.42 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
915 ns ± 0.868 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
1.02 µs ± 1.09 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
1.11 µs ± 1.15 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
785 ns ± 0.807 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
941 ns ± 1.02 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
1.02 µs ± 0.857 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

[albandes@albandes-fedora-K2202N0104138 test]$ # Fast path number in intlist/symintlist
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
728 ns ± 0.503 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
749 ns ± 0.829 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
771 ns ± 0.727 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
800 ns ± 0.962 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
772 ns ± 0.622 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
883 ns ± 0.567 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
915 ns ± 0.638 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
[albandes@albandes-fedora-K2202N0104138 test]$ taskset 0x1 ipython foo.py
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
Running  a.view(1)
735 ns ± 1.27 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1))
753 ns ± 2.57 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1))
774 ns ± 1.38 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.view((1, 1, 1, 1))
801 ns ± 0.835 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze(0)
773 ns ± 0.677 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0,))
873 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
Running  a.squeeze((0, 1))
907 ns ± 0.836 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
```

<details>
  <summary>Test script</summary>

```python
import torch
from IPython import get_ipython

a = torch.empty(1)
print("Running ", "a.view(1)")
get_ipython().run_line_magic("timeit", "a.view(1)")
print("Running ", "a.view((1, 1))")
get_ipython().run_line_magic("timeit", "a.view((1, 1))")
print("Running ", "a.view((1, 1, 1))")
get_ipython().run_line_magic("timeit", "a.view((1, 1, 1))")
print("Running ", "a.view((1, 1, 1, 1))")
get_ipython().run_line_magic("timeit", "a.view((1, 1, 1, 1))")

a = torch.empty(1, 1, 1)
print("Running ", "a.squeeze(0)")
get_ipython().run_line_magic("timeit", "a.squeeze(0)")
print("Running ", "a.squeeze((0,))")
get_ipython().run_line_magic("timeit", "a.squeeze((0,))")
print("Running ", "a.squeeze((0, 1))")
get_ipython().run_line_magic("timeit", "a.squeeze((0, 1))")
```

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98764
Approved by: https://github.com/ngimel
2023-04-11 00:08:26 +00:00
7ecbce374e [DTensor][3/N] enable aten.native_dropout (#98577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98577
Approved by: https://github.com/wanchaol
2023-04-10 23:57:04 +00:00
e686a1e1b3 [DTensor][2/N] add Philox offset adjustment logic in operator_dispatch (#98199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98199
Approved by: https://github.com/wanchaol
2023-04-10 23:57:04 +00:00
67963c32bd [DTensor][1/N] add DTensor RNG state APIs (#98198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98198
Approved by: https://github.com/wanchaol
2023-04-10 23:57:00 +00:00
3c2bc0760b [EdgeML] Switch from BZL to BUCK for model resource testing (#98450)
Summary:
See [this post](https://fb.workplace.com/groups/devinfra.capacity.eng/permalink/1200060064273920/) for context and specifically [this solution](https://fb.workplace.com/groups/devinfra.capacity.eng/posts/1200060064273920/?comment_id=1200166060929987&reply_comment_id=1200177124262214) which this diff implements.

The gist is that updating a `bzl` file is *very* expensive for diff-time testing and triggers many flaky tests when attempting to land a model update from EdgeML.  The purpose of these bzl files (from what I can tell) is to unit test models via a CXX resources map.  Since they are only used for CXX resource generation, this can be accomplished by generating an `fb_xplat_cxx_library` BUCK target instead.  This required shuffling around some existing BUCK files due to buck rules around file ownership.

Since the EdgeML process already generates code to begin with, this is straightforward to do by just changing the code from generating bzl files to now generate a BUCK file and change the existing targets to use it thus we can now delete the old bzl files.

Test Plan:
Run the model gen script.

```
buck2 run mode/opt caffe2/torch/fb/mobile/cli:cli -- --concat_all_model_configs
```

Sanity test the new BUCK target.

```
buck2 build xplat/pytorch_models/build:test_resources
```

Run the model unit tests and confirm they still work.

```
buck2 run xplat/caffe2:for_each_prod_ptl_model_test
```

CI/CD for the rest.

I expect some flaky test given the `bzl` file deletion which triggers off a ton of unrelated tests.

Differential Revision: D44699671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98450
Approved by: https://github.com/JacobSzwejbka
2023-04-10 23:24:05 +00:00
803a1a041a [torch.package][easy] Add another opcode for matching pickle protocol 4+ correctly (#98674)
Summary: IL generates massive function names, which means that the pickle opcode used is BINUNICODE instead of the short version -- and then it would silently get skipped while pickling with protocol 4.

Differential Revision: D44815351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98674
Approved by: https://github.com/ezyang
2023-04-10 22:58:48 +00:00
76ac454146 Index expanded dims before checking memory overlap (#98656)
As the comment for `get_expanded_dims` says:

```
# copy_ fails when trying to write to tensors with memory overlap,
# for expanded dimensions (a dimension which used to have size 1 -> ?)
# we can select one element from that dimension and write to it
# to achieve writing to all values of that dimension of the input tensor
```

We were doing this for the copy, but not for checking whether we could copy. Update it so we index, then check for memory overlap. This covers all of the `complex_striding` warnings I observed in TB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98656
Approved by: https://github.com/ngimel, https://github.com/yf225
2023-04-10 22:58:32 +00:00
f011db345f Fix typos under torch/_inductor directory (#97592)
This PR fixes typos in comments and messages of `.py` files under `torch/_inductor` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97592
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 22:53:18 +00:00
822464567f Lazily format graphs for debug printing (#98776)
The current code unconditionally formats the graphs, which is a
waste of CPU if no one looks at them.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98776
Approved by: https://github.com/albanD, https://github.com/mlazos
2023-04-10 22:41:33 +00:00
f25f85546f add rng_state support for custom device (#98069)
Fixes #ISSUE_NUMBER
Extend the RNG device-related functions to support custom device extensions; the default device is `cuda`.
@bdhirsh @kit1980 would you please take a moment to review my changes?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98069
Approved by: https://github.com/bdhirsh
2023-04-10 22:36:55 +00:00
a13a63ae9a Fix typos under torch/ao directory (#97679)
This PR fixes typos in comments and messages of `.py` files under `torch/ao` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97679
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2023-04-10 22:25:15 +00:00
a531a464fd Fix typos under torch/nn directory (#97594)
This PR fixes typos in comments of `.py` files under `torch/nn` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97594
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 22:07:15 +00:00
105ef68f72 Fix typos under torch/fx directory (#97596)
This PR fixes typos in comments and messages of `.py` files under `torch/fx` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97596
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-04-10 21:57:36 +00:00
4584851da5 [core][pruning][be] rename BaseSparsifier to BasePruner (#98747)
Summary:

att

Test Plan:
`python test/test_ao_sparsity.py -- TestBasePruner`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98747
Approved by: https://github.com/jerryzh168
2023-04-10 21:25:19 +00:00
bd83b205cc Skip test test_triton_bsr_dense_bmm if not TEST_WITH_TORCHINDUCTOR (#98462)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98462
Approved by: https://github.com/zou3519
2023-04-10 21:21:06 +00:00
5bcbb9bca7 Skip testing distributed backend if the backend (UCC, NCCL, Gloo) is not available (#98576)
After the recent change in https://github.com/pytorch/pytorch/pull/88110 to add a new c10d test for the UCC backend, the test started to fail on the ROCm distributed job.  I guess ROCm doesn't support that backend yet, so I'm going ahead and disabling the test there.  Please let me know if support on ROCm is coming, and I will close this PR accordingly.  It's now failing in ROCm trunk with `AssertionError: Unknown c10d backend type UCC`, for example 4adba70cc6

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98576
Approved by: https://github.com/Fuzzkatt, https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/ZainRizvi
2023-04-10 20:04:40 +00:00
117da58b65 [dynamo 3.11] enable dynamo unittests in 3.11 (#98104)
Enable most dynamo unittests for 3.11. There are a few tests that are skipped due to failures that will be addressed in upcoming PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98104
Approved by: https://github.com/yanboliang, https://github.com/voznesenskym, https://github.com/albanD, https://github.com/jansel, https://github.com/jerryzh168, https://github.com/malfet
2023-04-10 20:04:10 +00:00
457afe48fd [caffe2] Micro-optimizations in BlobGetMutableTensor (#98103)
Make sure we don't call Tensor::GetDevice() twice. Remove redundant branch for the case when tensor->dtype() == options.dtype(); in this case we end up calling raw_mutable_data(options.dtype()) anyway!

Differential Revision: [D44596695](https://our.internmc.facebook.com/intern/diff/D44596695/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98103
Approved by: https://github.com/jerryzh168
2023-04-10 19:43:02 +00:00
02cff64784 Assert that there are not duplicate sources for distinct arguments (#98738)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98738
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-04-10 19:32:08 +00:00
b663f7e887 [better_engineering][multiplatform] Replace host_info() check with separate cmd and cmd_exe commands for protos (#98426)
Summary: Same as title

Test Plan: CI

Differential Revision: D44670281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98426
Approved by: https://github.com/ezyang
2023-04-10 18:34:13 +00:00
d5120ff18a [torch.library] Add ability to create library fragments (#98439)
In C++ we have TORCH_LIBRARY_FRAGMENT. This PR adds the same
functionality to the Python torch.library API.

The motivation for this is: for the simple custom op API, we don't want
users to need to deal with Library objects. One way to hide this from
users is to create library fragments.

Test Plan:
- tests that you can create multiple fragments and def+impl operators on each.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98439
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-10 18:04:53 +00:00
618ea6fac3 Fix test_python_dispatch under debug mode (#98609)
The problem for these operators is that they were returning the input
directly as the output. This isn't support and will raise debug asserts.

Test Plan:
- Test locally. The debug build in CI doesn't actually do anything.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98609
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2023-04-10 18:04:53 +00:00
01b2c45659 [autograd_function_db] Add NumpyTake as OpInfo (#98438)
Previously we used this to test the backward of NumpySort. It doesn't
hurt to test it separately (plus I want to use the sample_inputs for
something else).

Test Plan:
- run tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98438
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2023-04-10 18:04:50 +00:00
c139df407b Skip failing test_torchinductor_codegen_dynamic_shapes tests on CPU (#98621)
This test starts to fail in trunk after https://github.com/pytorch/pytorch/pull/97230.  The original PR missed this because these test are marked as slow and is only run periodically.  Is this ok to skip them like `test_upsample_cat_conv_dynamic_shapes`?

Here is an example failure https://github.com/pytorch/pytorch/actions/runs/4638277468/jobs/8208270657. The following tests are all failing with `Failed to find dynamic for loop variable` error like others in the list.  They are:

* `test_conv2d_binary_dynamic_shapes`.  Fixes https://github.com/pytorch/pytorch/issues/98679
* `test_conv2d_unary_dynamic_shapes`.  Fixes https://github.com/pytorch/pytorch/issues/98680
* `test_conv_bn_fuse_dynamic_shapes`.  Fixes https://github.com/pytorch/pytorch/issues/98681
* `test_conv_transpose2d_unary_dynamic_shapes`.  Fixes https://github.com/pytorch/pytorch/issues/98682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98621
Approved by: https://github.com/malfet
2023-04-10 17:52:30 +00:00
9abae6ae32 Make all Source subclasses frozen. (#98737)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98737
Approved by: https://github.com/albanD
2023-04-10 17:51:10 +00:00
69eef5a4be [CUDA12] set_device change (#94864)
This PR adds workaround for CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb) which will always create primary context on target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create primary context on on device `cuda:1` because it is creating a tensor on it and on device `cuda:0` because the destructor of CUDA Device guard calls `cudaSetDevice(0)`.
After this PR the CUDA Device guard will not call `cudaSetDevice(0)` if primary context does not exist on `cuda:0`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-10 17:31:12 +00:00
3fcc5ff0d6 Avoid passing buffers to optimizers during spmd rematerialization (#98714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98714
Approved by: https://github.com/fegin
2023-04-10 17:09:15 +00:00
a3701b6740 fix backward bug for custom device (#98586)
Fixes #ISSUE_NUMBER
In the backward pass on some devices, it may fail to get the device index because execution switches to a new thread.
So setting the device and checking the device index in the `setDevice` func may be better for many kinds of devices.
For CUDA, the device index check is also included in the `setDevice` func: https://github.com/pytorch/pytorch/blob/master/c10/cuda/impl/CUDAGuardImpl.h#:~:text=%7D-,void%20setDevice(Device%20d)%20const%20override%20%7B,%7D,-void%20uncheckedSetDevice(Device
```
void setDevice(Device d) const override {
    TORCH_INTERNAL_ASSERT(d.is_cuda());
    Device current_device = getDevice();
    if (current_device != d) {
      C10_CUDA_CHECK(cudaSetDevice(d.index()));
    }
  }
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98586
Approved by: https://github.com/albanD
2023-04-10 15:56:38 +00:00
537c346117 feat(add method is_private_use1() in class Device) (#98123)
As the title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98123
Approved by: https://github.com/bdhirsh
2023-04-10 12:30:37 +00:00
b09722f540 Convert logging f-strings to use % format, part two (#98700)
This hits multi-line logging strings
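
For context, a minimal sketch of the conversion (logger and variable names are made up); %-style arguments defer formatting until the record is actually emitted:

```python
import logging

log = logging.getLogger(__name__)
n_graphs, reason = 3, "guard failure"

# Before: the f-string is formatted even if this log level is disabled
log.debug(f"recompiling {n_graphs} graphs due to {reason}")

# After: formatting is deferred to the logging framework
log.debug("recompiling %s graphs due to %s", n_graphs, reason)
```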

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98700
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
9a8f71f23e Convert logging f-strings to use % format (#98697)
Codemod done with
https://gist.github.com/ezyang/2e8b0463cdc6be278478495b23ff0530 with
assistance from ChatGPT.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98697
Approved by: https://github.com/voznesenskym
2023-04-10 12:19:31 +00:00
ad88afcff8 [xla hash update] update the pinned xla hash (#98195)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98195
Approved by: https://github.com/pytorchbot
2023-04-10 10:17:34 +00:00
95621b3c2e [aot] fix disable amp for runtime wrapper (#97864)
For the current runtime wrapper in aot, `disable_amp` is always set to True. In fact, we would like to avoid disabling autocast when possible because accessing TLS is slow. In this PR, `disable_amp` depends on whether any autocast is enabled instead of always being True. Many operators see a performance improvement (inductor vs. eager) with this fix.
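
A rough sketch of the condition this describes (not the actual aot_autograd code; the exact checks used in the PR may differ):

```python
import torch

# Only disable autocast in the runtime wrapper when some autocast mode
# is actually enabled, instead of unconditionally setting disable_amp=True.
disable_amp = torch.is_autocast_enabled() or torch.is_autocast_cpu_enabled()
print(disable_amp)  # False unless running under torch.autocast(...)
```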

Examples of operators in torchbench with ~0.8x speedup (inductor vs. eager), before ("current") and after ("new") this fix:

| op | current | new |
| -- | -- | -- |
| aten.hardsigmoid.default | 0.709372349 | 0.81414306 |
| aten.tanh.default | 0.715227805 | 0.855556349 |
| aten.add.Scalar | 0.682292123 | 0.860371222 |
| aten.sigmoid_backward.default | 0.688039934 | 0.915606579 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97864
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5, https://github.com/bdhirsh
2023-04-10 05:00:12 +00:00
96fb64a159 Turn off cudagraph trees (#98709)
There were some recent failures on master, and I think it's fair to defer on turning it on till we get a bit of the Tensor construction overhead down because that shows up a lot in the TB benchmarks.

There may ultimately be an unavoidable tradeoff between memory and performance to some extent but we can get the overhead numbers down a bit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98709
Approved by: https://github.com/Chillee
2023-04-10 03:31:54 +00:00
fdfd370c10 [vision hash update] update the pinned vision hash (#98654)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98654
Approved by: https://github.com/pytorchbot
2023-04-10 03:06:47 +00:00
584244460b use float as accumulate type for reduce Ops: min, max, minmax on CPU (#96079)
Use float32 as the accumulation type for `min`, `max` and `minmax`: in the function `vec::reduce_all`, float16 inputs will be accumulated in float32.

The performance benefit basically comes from the vectorization of `Half` https://github.com/pytorch/pytorch/pull/96076
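
A small sketch of the kind of half-precision CPU reductions covered here (the shape matches the benchmark below; the exact benchmark script is not part of this message):

```python
import torch

x = torch.randn(64, 128, 1024, dtype=torch.half)  # CPU float16 input
# these reductions now accumulate in float32 internally on CPU
print(x.max(), x.min(), x.aminmax())
```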

Tested on Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz

**single socket**
```
(before)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024])  2.071 ms

(after)
### using OMP_NUM_THREADS=20
### using numactl --physcpubind=0-19 --membind=0
max: size: torch.Size([64, 128, 1024])  0.071 ms
```

**single core**
```
(before)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024])  33.488 ms

(after)
### using OMP_NUM_THREADS=1
### using numactl --physcpubind=0 --membind=0
max: size: torch.Size([64, 128, 1024])  0.953 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96079
Approved by: https://github.com/jgong5, https://github.com/kit1980
2023-04-10 01:48:31 +00:00
8fee46693c Fused attention patterns (#97741)
Patterns based on https://github.com/pytorch/pytorch/pull/94729 mainly as a forcing function for implementing joint graph replacements.

Up until now, we had two places to do pattern matching
1) Pre-grad has janky infra (graph not normalized or functional), but is
   desirable for many types of passes where you want your change to
   affect grad formulas.
2) Post-grad has good infra, but can't change grad formulas.

This PR adds a third place to do pattern matching: the joint
forward+backwards graph.  The idea is to take the patterns and lower
them to a joint graph and replace both the forwards+backwards before
we partition them.  This allows us to do something similar to pre-grad
transforms, but run after normalization and functionalization.

Note that we don't seem to have kernels for all of these patterns, some get decomposed in the dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97741
Approved by: https://github.com/Chillee
2023-04-10 00:35:22 +00:00
f4858fa8ef Improve dynamo support for autograd.Function (#98158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98158
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2023-04-10 00:33:51 +00:00
7e0c26d4d8 [JIT] Allow tuple and list generics (#98703)
As of Python 3.9+, `Dict`, `List`, and `Tuple` from the `typing` module are deprecated in favor of their `builtins` counterparts; see [PEP 585](https://peps.python.org/pep-0585/)

Test plan: Run:
```
import torch
from typing import Union

@torch.jit.script
def to_tuple(v: Union[int, tuple[int, int]]) -> tuple[int, int]:
    """Converts int or tuple to tuple of ints."""
    if torch.jit.isinstance(v, int):
        return v, v
    else:
        return v

print(to_tuple(1), to_tuple((3, 4)))
```

It's almost impossible to add a test to the existing CI, as the test script would not be parseable by Python 3.8, which is the oldest supported Python version.

Fixes https://github.com/pytorch/pytorch/issues/98521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98703
Approved by: https://github.com/kit1980
2023-04-09 22:58:58 +00:00
2400cb1d57 distinguish mutability of TensorImpl::data() (#97776)
See D44409928.

Differential Revision: [D44459999](https://our.internmc.facebook.com/intern/diff/D44459999/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97776
Approved by: https://github.com/ezyang
2023-04-09 20:21:56 +00:00
6b9a1cf858 Removed hip call hipDeviceSynchronize (#97209)
Similar to CUDA, fixed the test_profiler_tree.py::TestProfilerTree unit test suite by removing the hipDeviceSynchronize call.

@jithunnair-amd @pruthvistony
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97209
Approved by: https://github.com/kit1980, https://github.com/jithunnair-amd
2023-04-09 20:12:52 +00:00
ff825de442 [primTorch] add ref for cumprod (#98670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98670
Approved by: https://github.com/ezyang
2023-04-09 15:22:28 +00:00
9d36361601 make TensorImpl::data_ptr_impl() non-const and have mutable in the name (#97744)
See D44409928.

Differential Revision: [D44450468](https://our.internmc.facebook.com/intern/diff/D44450468/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97744
Approved by: https://github.com/ezyang
2023-04-09 11:08:41 +00:00
54b168484d Support LayerNorm without weight or bias parameters (#98687)
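A minimal sketch of one way to get a LayerNorm whose weight and bias parameters are absent (whether this is the exact configuration exercised by the PR is an assumption):

```python
import torch

# elementwise_affine=False leaves both ln.weight and ln.bias as None
ln = torch.nn.LayerNorm(64, elementwise_affine=False)
x = torch.randn(2, 8, 64)
print(ln(x).shape, ln.weight, ln.bias)
```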
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98687
Approved by: https://github.com/yifuwang
2023-04-09 02:13:10 +00:00
1be3549a27 Enable replicated embedding in SPMD for NLP models (#98686)
For models like NanoGPT, embeddings are replicated and input ids
are sharded. In this case, output lookups should be sharded to
match ids.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98686
Approved by: https://github.com/yifuwang
2023-04-09 02:13:10 +00:00
fdb04c6a86 Add overflow check for stride calculation (#94900)
Fixes #94120 and #94128.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94900
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-04-09 01:30:55 +00:00
3925f6edb2 add Half to cat fast path on CPU (#96078)
Extend current fast path on `cat` with `Half`: for non-arithmetic Ops, simply do `Vec::load` and `Vec::store`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96078
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-09 01:30:48 +00:00
d95ee64b58 ddp forward support custom backend. (#98283)
Currently DDP only considers the CUDA backend; the DDP forward pass transfers tensors to CUDA. We want DDP to also run on custom backends.
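
For reference, a minimal single-process DDP sketch on CPU with the gloo backend, just to show the wrapping step that a custom device backend would also go through (the custom backend itself is not shown, and the address/port values are placeholders):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)      # stays on CPU in this sketch
ddp = DDP(model)                   # a custom backend would supply its own device here
print(ddp(torch.randn(2, 8)).shape)
dist.destroy_process_group()
```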

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98283
Approved by: https://github.com/ezyang
2023-04-09 01:30:42 +00:00
a2e7910dfd [pt2] remove skip for masked.logsumexp in test_proxy_tensor.py (#98676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98676
Approved by: https://github.com/ezyang
2023-04-09 01:28:16 +00:00
b411238d76 [pt2] add meta function for logcumsumexp (#98683)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98683
Approved by: https://github.com/ezyang
2023-04-09 01:26:37 +00:00
387feaa131 add mutable to name of non-const Storage::data_ptr (#97694)
See D44409928.

Differential Revision: [D44432585](https://our.internmc.facebook.com/intern/diff/D44432585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97694
Approved by: https://github.com/ezyang
2023-04-08 12:44:30 +00:00
2edfcafd4b [inductor] remove RBLOCK from persistent reduction kernel's parameter list (#98653)
This PR resolves the comments in https://github.com/pytorch/pytorch/pull/97203#discussion_r1160491318. Sending it as a separate PR since it's easier to test and to make sure there is no perf impact.

Tests:
1. python test/inductor/test_torchinductor.py
2. run `python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only hf_Bert --disable-cudagraphs --training` before and after the change to make sure the perf change is neutral.

Now a persistent reduction kernel in hf_Bert looks like:
```
@persistent_reduction(
    size_hints=[4096, 1024],
    reduction_hint=ReductionHint.INNER,
    filename=__file__,
    meta={'signature': {0: '*fp32', 1: '*i64', 2: '*fp16', 3: '*i64', 4: '*fp16', 5: '*i64', 6: '*fp16', 7: '*fp16', 8: '*fp16', 9: '*fp16', 10: 'i32', 11: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['in_out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11), equal_to_1=())]}
)
@triton.jit
def triton_(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, in_ptr5, in_ptr6, in_ptr7, out_ptr2, xnumel, rnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    rnumel = 768
    RBLOCK: tl.constexpr = 1024
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98653
Approved by: https://github.com/jansel
2023-04-08 10:17:14 +00:00
d77d2f03a5 [ONNX] Fix scalar elements in op.Concat (#98509)
op.Concat wrongly concatenated scalar ints, which would raise errors in ORT. However, we didn't see this bug until the SegFault was fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98509
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-04-08 09:55:18 +00:00
70535d60fc Restore CPU distributed tests (#97424)
This looks pretty stable now on [HUD](https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=unstable%20%2F%20linux-focal-py3.8-gcc7%20%2F%20test%20(distributed)), so moving it back from unstable.  Pending some discussion on https://github.com/pytorch/pytorch/issues/97178

I'll monitor this a bit longer before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97424
Approved by: https://github.com/clee2000
2023-04-08 07:34:05 +00:00
0fa25cbd57 Fix broken MacOS build due to #97690 (#98665)
https://github.com/pytorch/pytorch/pull/97690 breaks MacOS build c68a94c5ea.  The fix looks easy enough so I try to go ahead with a forward fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98665
Approved by: https://github.com/dagitses
2023-04-08 07:02:12 +00:00
cb3c478069 Revert "refactor(add privateuseone floder in aten/src/ATen): add a PrivateUse… (#98127)"
This reverts commit 5a537e291d03baf3ea8b23e4102acb10bfd5db23.

Reverted https://github.com/pytorch/pytorch/pull/98127 on behalf of https://github.com/weiwangmeta due to Sorry, our internal code is not ready to take such changes
2023-04-08 05:32:21 +00:00
526d9bbc65 [ONNX] Refactor op level debugging (#97494)
Fixes #97728
Fixes #98622
Fixes https://github.com/microsoft/onnx-script/issues/393

Provide op_level_debug in the exporter, which creates randomized torch.Tensor inputs based on the real shapes from FakeTensorProp for both torch ops and ONNX symbolic functions. The PR leverages the Transformer class to create a new fx.Graph, but shares the same Module with the original one to save memory.

The test is different from [op_correctness_test.py](https://github.com/microsoft/onnx-script/blob/main/onnxscript/tests/function_libs/torch_aten/ops_correctness_test.py) in that op_level_debug generates real tensors based on the fake tensors in the model.

Limitations:
1. Some of the trace_only functions are not supported due to the lack of param_schema, which leads to args/kwargs being wrongly split and ndarray wrapping. (WARNINGS in SARIF)
2. Ops with dim/indices (INT64) are not supported, as they need information (shape) from other input args. (WARNINGS in SARIF)
3. sym_size and built-in ops are not supported.
4. op_level_debug only labels results in SARIF; it doesn't stop the exporter.
5. Introduces an ONNX-owned FakeTensorProp that supports int/float/bool.
6. Parametrized op_level_debug and dynamic_shapes into the FX tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97494
Approved by: https://github.com/justinchuby, https://github.com/BowenBao
2023-04-08 05:24:43 +00:00
5375e78b50 [Inductor] turn on vectorization with fallback for indirect indexing etc. (#98138)
Always do vectorization with scalar fallback for indirect indexing right now. We can vectorize the indirect indexing load/store by analyzing how the indirect indices are related to the loop variables. This will be done in future PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98138
Approved by: https://github.com/jansel
2023-04-08 05:14:56 +00:00
584a7ef35c [Inductor] cpp further code cleanup (#98135)
This PR primarily made two changes:
1. Support all ops (not only the load related ops) for `ops.masked`. Do recursive checks on masked body in `CppVecKernelChecker`. With this, we can remove `is_load_only_block` function and corresponding checking logic in `masked`.
2. Change the loop steps to the vectorized scaling factor instead of scaling the vectorized loop variables. With this, we can remove all the code that scales the loop variables explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98135
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-04-08 05:09:25 +00:00
85a90d9181 Rename assert options, turn off by default (#98616)
Rename the runtime assert checking options to be clearer. Also turn off the slow-path checking, since it is slow enough to significantly affect our compilation time in the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98616
Approved by: https://github.com/davidberard98, https://github.com/Neilblaze
2023-04-08 04:44:55 +00:00
a5f3468618 [Dynamo] Fix bug when dynamo generate guards for enum type (#98652)
Fixes a Meta-internal use case. Actually I think this is an `enum` bug; we provide a workaround in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98652
Approved by: https://github.com/jansel
2023-04-08 04:30:30 +00:00
0dbdc8a380 reenable lowmem dropout (#98631)
Fixes #98614.
I'll look into when lowmem dropout helps (as enabling it will interfere with sdpa pattern matching), but for now to avoid regressions let's return to the existing state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98631
Approved by: https://github.com/jansel
2023-04-08 04:18:14 +00:00
c68a94c5ea distinguish mutability of untyped Storage::data (#97690)
See D44409928.

Differential Revision: [D44429769](https://our.internmc.facebook.com/intern/diff/D44429769/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97690
Approved by: https://github.com/ezyang
2023-04-08 02:02:28 +00:00
d255c8e1ad Add NLLLoss to DTensor prop rule (#98512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98512
Approved by: https://github.com/wanchaol
2023-04-08 01:22:36 +00:00
a6155f34f6 Set up automated hash pinning for triton (#97568)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97568
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-04-08 01:13:05 +00:00
f959a0d56c Modify 'fake_tensor_unsupported' function (#98585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98585
Approved by: https://github.com/jansel
2023-04-08 01:04:00 +00:00
b7ff717232 [inductor] Use 64-bit indexing for large tensors in triton code (#97447)
This changes `TritonKernel` to have an `index_dtype` property which is
used as the dtype in indexing calculations. By default it is
`tl.int32` but if any input or output buffer is larger than `INT_MAX`
then we use `tl.int64` instead.
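
A hedged sketch of the selection rule described above (names are illustrative, not the actual inductor implementation):

```python
# Use 64-bit indexing only when some buffer can exceed INT_MAX elements;
# otherwise stay with the cheaper 32-bit indexing.
INT_MAX = 2**31 - 1

def choose_index_dtype(buffer_numels):
    return "tl.int64" if any(n > INT_MAX for n in buffer_numels) else "tl.int32"

print(choose_index_dtype([1024, 2**31]))  # tl.int64
```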

should fix #96978, #93606 (need to double check)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97447
Approved by: https://github.com/ngimel
2023-04-08 00:55:51 +00:00
48397cddd7 [inductor] Fix benchmark_compiled_module codegen with CppWrapperCodeGen (#98608)
The python function `benchmark_compiled_module` ends up using C++ expression printer to print the size for `rand_strided`, so you get a set e.g. `{2, 17}` instead of a
tuple `(2, 17)`. Here is a complete example from master:

```python
def benchmark_compiled_module(times=10, repeat=10):
    from torch._dynamo.testing import rand_strided
    from torch._inductor.utils import print_performance
    arg0_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.float32)
    arg1_1 = rand_strided({2, 17}, {17, 1}, device='cpu', dtype=torch.uint8)
    return print_performance(lambda: call([arg0_1, arg1_1]), times=times, repeat=repeat)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98608
Approved by: https://github.com/ngimel
2023-04-08 00:55:51 +00:00
917e9f1157 Fix pytest config (#98607)
`report` can be `TestReport` or `CollectReport`; the latter fails because there
is no duration attribute.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98607
Approved by: https://github.com/clee2000
2023-04-08 00:55:51 +00:00
4563adacc5 Update the use of nvidia-smi for GPU healthcheck (#98036)
This goes together with https://github.com/pytorch/test-infra/pull/3967 to:

* Provide a more accurate health check command with `nvidia-smi`
* Avoid running the check in the edge case when `nvidia-smi` doesn't even exist due to GitHub outage, i.e. https://github.com/pytorch/pytorch/actions/runs/4591098682/jobs/8107204277
* Also check for the number of GPU as part of the health check. The number of GPUs needs to be a power of 2 on a healthy runner.  Fixes https://github.com/pytorch/test-infra/issues/4000

### Testing

Luckily, the PR picked up the broken runner https://github.com/pytorch/pytorch/actions/runs/4640688249/jobs/8213191715, and the script correctly detected that the runner had only 3/4 GPUS and shut it down
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98036
Approved by: https://github.com/weiwangmeta
2023-04-08 00:53:20 +00:00
112dfa1415 Back out "[kineto] add SOFT_ASSERT when logging metdata" (#98630)
Summary:
Original commit changeset: 1089c4d95c54

Original Phabricator Diff: D44513152

Test Plan: signals

Reviewed By: eeggl

Differential Revision: D44804013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98630
Approved by: https://github.com/weiwangmeta
2023-04-08 00:48:51 +00:00
5ceae85f1c [Dynamo] Include UserDict in clone_inputs (#97725)
Fixes #97724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97725
Approved by: https://github.com/yanboliang
2023-04-08 00:19:35 +00:00
cf10fd827e Add comments about maybe_guard usage in Inductor (#98563)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98563
Approved by: https://github.com/jansel
2023-04-07 23:54:18 +00:00
ebd4c165ff Back out "GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)" (#98613)
Summary: This change causes a multi-GPU job from the XI team to hang after 8K steps.

Differential Revision: D44797248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98613
Approved by: https://github.com/ngimel
2023-04-07 23:31:31 +00:00
2d9f482d88 [fx] Subgraph rewriter matching on attributes (#98604)
Fixes #68534

Similar to how submodules are added, if there already exists an attribute with the same name in `gm` as in `replacement`, the attribute value in `gm` will take precedence.
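
For orientation, a generic usage sketch of the subgraph rewriter this PR extends (the attribute-matching case itself is not shown; the pattern here is made up):

```python
import torch
from torch import fx
from torch.fx import subgraph_rewriter

def pattern(x):
    return torch.relu(x) + torch.relu(x)

def replacement(x):
    return 2 * torch.relu(x)

def f(x):
    return torch.relu(x) + torch.relu(x) - 1

gm = fx.symbolic_trace(f)
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
print(gm.code)
```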
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98604
Approved by: https://github.com/andrewor14, https://github.com/SherlockNoMad
2023-04-07 23:24:13 +00:00
4adae2d1ae Enable flatbuffer tests properly. (#98363)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98363
Approved by: https://github.com/angelayi
2023-04-07 22:36:19 +00:00
0a0f107b50 Retry ONNX tests (the quick way) (#98627)
This is to mitigate a flaky ONNX test in trunk and also improve its reliability till we have https://github.com/pytorch/pytorch/issues/98626  (I figure that this is better than moving the job to unstable).

I try to disable the flaky test https://github.com/pytorch/pytorch/issues/98622, but that won't work as @clee2000 points out because ONNX isn't part of `run_test.py` to download and apply the list of disabled tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98627
Approved by: https://github.com/BowenBao
2023-04-07 22:20:39 +00:00
4f9dbc17a4 [ONNX] Enable xdoctests in CI (#98546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98546
Approved by: https://github.com/justinchuby, https://github.com/kit1980
2023-04-07 22:20:18 +00:00
b2b783ea3c Fix wrong SPMD test target in test_log_softmax (#98610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98610
Approved by: https://github.com/wanchaol
2023-04-07 21:58:05 +00:00
61c74ab0f8 Fix MPI rank and world size pg initialization (#98545)
Fixes https://github.com/pytorch/pytorch/issues/97507

Test command
`pytest test/distributed/test_c10d_common.py -vsk def test_init_process_group_for_all_backends`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98545
Approved by: https://github.com/malfet
2023-04-07 21:57:31 +00:00
24d9001527 Move functional collectives implementation to python. (#98595)
This greatly simplifies the work needed to add new ops.

This relands the previous PR; not sure why it was reverted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98595
Approved by: https://github.com/wconstab
2023-04-07 21:48:05 +00:00
c75dd7c413 grab bag of changes (#98572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98572
Approved by: https://github.com/shunting314, https://github.com/mlazos
2023-04-07 20:02:59 +00:00
9667f261c6 Remove MERGE_IN_PROGRESS when exiting merge (#98611)
I.e., do it in the finally section.
This should take care of cases like https://github.com/pytorch/pytorch/pull/97645#issuecomment-1500490754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98611
Approved by: https://github.com/PaliC
2023-04-07 19:59:08 +00:00
55724a5ec9 Revert "[experiment] More procs in CI (#98098)"
This reverts commit 9fd3eba6ceb048cfdcb430e34f9168eda888b4c8.

Reverted https://github.com/pytorch/pytorch/pull/98098 on behalf of https://github.com/clee2000 due to I think theres a bug
2023-04-07 19:50:54 +00:00
5210d7c423 [CI] Mark vision_maskrcnn as NONDETERMINISTIC (#98570)
Summary: vision_maskrcnn fails eager checking, so mark it as
NONDETERMINISTIC to reduce noise on the dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98570
Approved by: https://github.com/eellison, https://github.com/huydhn
2023-04-07 19:33:20 +00:00
c5269ad6c6 [quant][pt2e] Add support for a few ops in QNNPackQuantizer to enable quantizing internal model (#98560)
Summary:
This PR adds support for adaptive_avg_pool2d (traced as mean.dim), mean and hardtanh to QNNPackQuantizer

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_obs_sharing_ops

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98560
Approved by: https://github.com/andrewor14
2023-04-07 19:26:45 +00:00
89e5774482 Work around CI worker gpu issue for inductor_distributed (#98601)
Just run with 2 gpus no matter how many there are
(still skip if less than 2)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98601
Approved by: https://github.com/ngimel, https://github.com/mrshenli
2023-04-07 18:50:27 +00:00
1c226f5aad [pt2] add meta functions for cummax and cummin (#98552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98552
Approved by: https://github.com/Chillee
2023-04-07 17:58:28 +00:00
483fd3351a [Quant] Add get_symmetric_qnnpack_qat_qconfig_mapping (#98569)
Differential Revision: [D44776230](https://our.internmc.facebook.com/intern/diff/D44776230/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98569
Approved by: https://github.com/andrewor14
2023-04-07 17:57:56 +00:00
e016dec66e Clean up compile reason logic, report only graph break compiles (#98574)
context: https://fb.workplace.com/groups/1075192433118967/posts/1222935648344644/?comment_id=1223002365004639&reply_comment_id=1223501008288108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98574
Approved by: https://github.com/Chillee, https://github.com/xw285cornell
2023-04-07 17:40:00 +00:00
f55e72c0f6 Add option to log recomps (#98564)
Adds an option to TORCH_LOGS to log recompilations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98564
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-04-07 17:30:27 +00:00
9fd3eba6ce [experiment] More procs in CI (#98098)
Experiment with more procs, but only on master so PRs don't get affected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98098
Approved by: https://github.com/huydhn
2023-04-07 17:21:32 +00:00
e302f083bb Flip Switch Redux (#98341)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98341
Approved by: https://github.com/davidberard98
2023-04-07 16:05:58 +00:00
16ec7efa49 Don't use f-strings in logging calls (1/X) (#98591)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98591
Approved by: https://github.com/albanD
2023-04-07 15:52:50 +00:00
79e14f8fd6 [better_engineering][multiplatform] Repalce host_info() check with select for default_compiler_flags (#98306)
Summary: Same as title

Test Plan: CI

Differential Revision: D44667769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98306
Approved by: https://github.com/priyaramani, https://github.com/malfet
2023-04-07 15:39:38 +00:00
390c51bf87 Skip nnmodule hook guards by default (#98371)
This PR makes basic nnmodule forward hooks work by default, without any overhead. But it leaves silent correctness issues if users modify/remove their hooks later, so it also emits a warning.

- the usual case is to not use hooks, so avoid guard overhead here
- registering any hook before compile will trigger a warning about hook support
- registering a hook later (or removing one) requires user knowledge and opting in,
  currently this isn't warnable (but maybe we can observe compiled nnmodules to make it
  warnable).

Why skip hook guards by default instead of not tracing __call__/hooks by default?
- avoid having a mode flag that alters dynamo tracing behavior (harder to test both codepaths
  in CI with full coverage)
- the most basic hook usecase (registering a hook before compile, and never removing it)
  will work by default with this PR, while it would require enablement and incur overhead
  in the 'not tracing __call__' proposal.
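
A minimal sketch of the basic usecase described above, i.e. a hook registered before compile and never removed (module and hook are made up, not from the PR):

```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.lin(x)

m = M()
m.register_forward_hook(lambda mod, inp, out: out * 2)  # before compile, never removed
opt_m = torch.compile(m)
print(opt_m(torch.randn(2, 4)).shape)
```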

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98371
Approved by: https://github.com/jansel
2023-04-07 15:10:51 +00:00
46d765c15e [devX] make labels only count their own occurences (#98551)
Small QoL improvement such that add_numbered_label now works more intuitively. Now if we push different labels, instead of having `[reverted, mergedX2, revertX3, mergedX4, revertedX5, mergedX6]` we have `[reverted, merged, revertX2, mergedX2, revertedX3, mergedX3]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98551
Approved by: https://github.com/huydhn
2023-04-07 08:30:46 +00:00
d06662fb57 Add ephemeral merging label (#98543)
Addresses https://github.com/pytorch/test-infra/issues/3950

Test Plan: Ran a dry run on this pr. The label showed up while trying to merge
<img width="354" alt="Screenshot 2023-04-06 at 4 57 48 PM" src="https://user-images.githubusercontent.com/13758638/230514276-1ac70b58-d2d1-4e4b-892b-a957bf156063.png">

And then disappeared after failing
<img width="373" alt="Screenshot 2023-04-06 at 5 00 11 PM" src="https://user-images.githubusercontent.com/13758638/230514470-38b15ec7-cfd9-4efe-b6e8-0f9af5577c62.png">

There's also the trail of adding and removing the "merging" label at the bottom

Notes: This is slightly buggy sometimes. For example, when the merge failed while I was editing this textbox, the label did not disappear.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98543
Approved by: https://github.com/malfet
2023-04-07 08:24:54 +00:00
d643a00efc inductor(CPU): support dynamic shape for onednn fusion path (#97230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97230
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-07 06:53:31 +00:00
77d9742c24 [Inductor] Fix bug in lowering.slice_ when negative start out of range (#98517)
Fixes error from 14k github models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98517
Approved by: https://github.com/ngimel
2023-04-07 06:48:51 +00:00
45a2f6b70f Revert "Reduce includes of CUDACachingAllocator.h (#97072)"
This reverts commit 1bcb88089468a6ebc667bd76256c4dd6f58b7ee3.

Reverted https://github.com/pytorch/pytorch/pull/97072 on behalf of https://github.com/weiwangmeta due to breaking internal builds
2023-04-07 06:15:11 +00:00
5c8fea5647 Reduce overhead in CUDAGraph Trees (#98529)
Significantly reduces the overhead of constructing Tensors and Storages and checking Storage liveness. Removes the regression for the HF models that I tested and removes 75% of the overhead of the extremely overhead-bound resnet50 training we have in torchbench (0.91x base commit, 1.02x torchinductor default, 1.16x this PR, 1.25x previous cudagraphs impl).

This PR takes care of all of the lower hanging fruit.

- Computes storage aliasing at record time instead of during at runtime. We no longer need to use a runtime storage cache, and can instead index directly into the existing alias if there is one, or construct a new Storage

- Moves the heavyweight C++ calls into a batch - getting storage weakrefs and constructing tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98529
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-04-07 05:46:08 +00:00
616f50da3a [quant][pt2e] QNNPackQuantizer support annotation for resnet18 (#98507)
Summary:
This PR adds annotation support for conv2d relu, linear, maxpool2d, add and add relu so
that we can successfully quantize resnet18 with the prepare_pt2e_quantizer API and get the same result
as fx graph mode quantization

Test Plan:
python test/test_quantization.py TestQuantizePT2EModels.test_resnet18_with_quantizer_api

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98507
Approved by: https://github.com/vkuzo
2023-04-07 04:27:21 +00:00
5a537e291d refactor(add privateuseone floder in aten/src/ATen): add a PrivateUse… (#98127)
Add a PrivateUse1 folder to contain all the feature adaptations for PrivateUse1 under ATen, for example GetGeneratorPrivate, which is used by third-party backends to register their own Generator implementation. This makes it easier for us to centrally manage these features, and it will make adaptation more convenient for different backend manufacturers. For more info: https://github.com/pytorch/pytorch/issues/98073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98127
Approved by: https://github.com/bdhirsh
2023-04-07 03:43:16 +00:00
29608fd28d [pt2][inductor] hardcode autotuning names (#98351)
Summary: switch to hardcoded autotuning names; we want consistency in case the default choice changes

Test Plan: CI

Differential Revision: D44643318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98351
Approved by: https://github.com/jansel
2023-04-07 03:40:33 +00:00
3d8ead7ee1 [vision hash update] update the pinned vision hash (#98367)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98367
Approved by: https://github.com/pytorchbot
2023-04-07 02:56:14 +00:00
1fb8428d70 Fix off-by-1 error in dynamo coverage stats (#98558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98558
Approved by: https://github.com/malfet
2023-04-07 02:52:22 +00:00
2161be08c4 Disable test_torchinductor_dynamic_shapes on ASAN (#98544)
This is yet another wrong shard number calculation on ASAN causing flakiness. I figure that we don't really need to run this test on ASAN, so let's disable it. There is discussion at the moment about running ASAN periodically too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98544
Approved by: https://github.com/malfet
2023-04-07 02:27:52 +00:00
152d65ae1d [reland][inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98534)
Summary: This is a reland of #98264.

When _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then calls
load_inline to load that generated wrapper back into the python world.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98534
Approved by: https://github.com/huydhn
2023-04-07 02:04:03 +00:00
d4dbdee528 Update _linux-test.yml (#98317)
Skip "setup-ssh" for now for a100 runners from GCP, as they frequently encounter issues like "connect ETIMEDOUT 173.231.16.75:443", about 10 occurrences every day.

Examples for just today so far:
| Timestamp | Workflow | Job |
| -- | -- | -- |
| 2023-04-04T15:07:50.916331Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609056040/jobs/8146321650 |
| 2023-04-04T15:03:56.914692Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4609010125/jobs/8146217819 |
| 2023-04-04T14:39:58.004717Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608784966/jobs/8145641764 |
| 2023-04-04T14:19:28.854825Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4608561116/jobs/8145147916 |
| 2023-04-04T06:15:39.241848Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604422106/jobs/8135687673 |
| 2023-04-04T06:10:21.056131Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604406947/jobs/8135611094 |
| 2023-04-04T05:34:50.908482Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4604198332/jobs/8135201048 |
| 2023-04-04T03:04:36.628201Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4603162241/jobs/8133620905 |
| 2023-04-04T01:49:27.119830Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4600897505/jobs/8132760483 |
| 2023-04-04T01:18:06.141437Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602745871/jobs/8132387930 |
| 2023-04-04T00:38:30.610770Z | inductor | https://github.com/pytorch/pytorch/actions/runs/4602537869/jobs/8131938265 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98317
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-04-07 01:51:02 +00:00
a0a0b0c701 Dont decompose dropout so it can be pattern matched (#97931)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97931
Approved by: https://github.com/ngimel
2023-04-07 01:15:24 +00:00
482f87a7bc [quantized] Fix return values of _get_name() in quantized ConvTranspose (#97678)
This PR fixes incorrect return values of _get_name() in quantized `ConvTranspose?d`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97678
Approved by: https://github.com/vkuzo, https://github.com/kit1980
2023-04-07 01:14:42 +00:00
88208c6fdf [inductor][cpp] fix mul for uint8 (#98473)
Fixes #98149

The type of `mul`'s output is inconsistent with its input. This PR fixes the type of `mul`'s output.

Here is the output code for the newly added test case `pow+cos`. `tmp4` is 1024 before fixing and 0 after fixing.
#### Before fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)];     // tmp0 is unsigned_char
auto tmp1 = tmp0 * tmp0;                       // tmp1 is int
auto tmp2 = tmp1 * tmp1;                       // tmp2 is int
auto tmp3 = tmp2 * tmp0;                       // tmp3 is int
auto tmp4 = static_cast<float>(tmp3);          // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```

#### After fixing
```
auto tmp0 = in_ptr0[static_cast<long>(0)];     // tmp0 is unsigned_char
auto tmp1 = decltype(tmp0)(tmp0 * tmp0);       // tmp1 is unsigned_char
auto tmp2 = decltype(tmp1)(tmp1 * tmp1);       // tmp2 is unsigned_char
auto tmp3 = decltype(tmp2)(tmp2 * tmp0);       // tmp3 is unsigned_char
auto tmp4 = static_cast<float>(tmp3);          // tmp4 is float
auto tmp5 = std::cos(tmp4);
out_ptr0[static_cast<long>(0)] = tmp5;
```
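
For reference, a hedged Python-level sketch of a pow+cos computation that lowers to code like the above (the concrete input value is an assumption, not taken from the new test case):

```python
import torch

def fn(x):
    # x ** 5 lowers to a chain of uint8 multiplications, which must not be
    # silently promoted to int by the generated C++ code
    return torch.cos(x ** 5)

x = torch.tensor([3], dtype=torch.uint8)
print(fn(x), torch.compile(fn)(x))
```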

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98473
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-07 01:10:36 +00:00
06eaa0970b [Resubmit] Don't crash on retrieveDesyncReport (#98470)
Per title

Differential Revision: [D44736409](https://our.internmc.facebook.com/intern/diff/D44736409/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98470
Approved by: https://github.com/XilunWu
2023-04-07 01:10:30 +00:00
4adba70cc6 [inductor][easy] use num_stages=1 for reduction (#98524)
Since num_stages only matters for matmul and does not matter for pointwise/reduction, set num_stages to 1 uniformly for all reductions in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98524
Approved by: https://github.com/ngimel
2023-04-07 01:06:07 +00:00
86cb7f40a9 Fix the missing PATH in mps workflow after #98522 (#98559)
This was missed in #98522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98559
Approved by: https://github.com/malfet
2023-04-07 00:15:50 +00:00
22411b6f02 Revert "[dynamo 3.11] enable dynamo unittests in 3.11 (#98104)"
This reverts commit 0066f3405f290ab6ef379abea6945058f8eb7ce5.

Reverted https://github.com/pytorch/pytorch/pull/98104 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it is failing on CPU 3.11 test in trunk 0066f3405f.  This is probably a landrace
2023-04-07 00:05:30 +00:00
481ecffb5e Add test c10d ucc tests (#88110)
Creates the equivalent c10d test for ucc for https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_gloo.py and https://github.com/pytorch/pytorch/blob/master/test/distributed/test_c10d_nccl.py. Uses test_c10d_gloo.py as the reference and adds all the common ops. More detailed comparison of available ops here: https://docs.google.com/document/d/1yPsa_X9EiEiqo-j2Yn7ierhccBtEjwoqC-B7-amI0MI/edit?usp=sharing

Also removes extra line for ProcessGroupUCC.cpp barrier blocking wait that got duplicated from merging https://github.com/pytorch/pytorch/pull/85047.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88110
Approved by: https://github.com/zasdfgbnm, https://github.com/kit1980, https://github.com/kwen2501, https://github.com/malfet
2023-04-06 23:51:27 +00:00
8a29afe98a [RFC] Add warning about object-based collectives for GPU tensors to docs. (#97702)
Using GPU tensors in these collectives has caused SEVs, user
confusion, and slowness in the past. These APIs were only designed to
communicate arbitrary Python objects, and GPU tensors should either be copied
to CPU first or use the regular collectives. Add a warning indicating so.

Differential Revision: [D44435849](https://our.internmc.facebook.com/intern/diff/D44435849/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97702
Approved by: https://github.com/kumpera
2023-04-06 23:47:35 +00:00
eb5da4df8a Speed up LossCTC.cu (#97269)
For these two kernels, `grid.x == 1` is enough. `grid.x > 1` leads to repeated computation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97269
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-04-06 23:44:25 +00:00
a2bb2fae1b Add Autocast support to MatMult thourgh explicit cast (#98346)
Fixes external issue https://github.com/microsoft/onnx-converters-private/issues/157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98346
Approved by: https://github.com/BowenBao
2023-04-06 23:19:52 +00:00
0066f3405f [dynamo 3.11] enable dynamo unittests in 3.11 (#98104)
Enable most dynamo unittests for 3.11. There are a few tests that are skipped due to failures that will be addressed in upcoming PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98104
Approved by: https://github.com/yanboliang, https://github.com/voznesenskym, https://github.com/albanD, https://github.com/jansel, https://github.com/jerryzh168, https://github.com/malfet
2023-04-06 23:15:48 +00:00
dbfc4df075 Add $CONDA_ENV/bin to PATH on MacOS (#98522)
This PR explicitly adds $CONDA_ENV/bin to the MacOS PATH, so that it can always detect and use the correct Python.  $CONDA_ENV is always set to the correct value in setup-miniconda https://github.com/pytorch/test-infra/blob/main/.github/actions/setup-miniconda/action.yml#L141

### <samp>🤖 Generated by Copilot at b4de81a</samp>

This pull request fixes the conda-pip environment mismatch for the macOS build and test workflows by using consistent pip requirements files. It also adds a conditional block to the `.github/workflows/_mac-test-mps.yml` file to enable the test MPS job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98522
Approved by: https://github.com/malfet
2023-04-06 21:34:52 +00:00
531b8e8f1e stop using caffe2/core/logging.h forwarding header in serialize lib (#98168)
No need to create a library for this useless header.

Differential Revision: [D44612668](https://our.internmc.facebook.com/intern/diff/D44612668/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98168
Approved by: https://github.com/PaliC
2023-04-06 21:27:07 +00:00
fdb9441e7e Stop recursion on trivial replacement (#97903)
Pattern replacement behaves incorrectly when the replacement pattern maps inputs to outputs (such a pattern can be used to replace redundant code). However, current code in `torch.fx.subgraph_rewriter._replace_pattern` causes the list of replacement nodes to include the entire graph before that node, resulting in an exponential slowdown due to recursive calls traversing the entire graph multiple times.

The proposed fix is to add a check in `_replace_pattern` to prevent the call to `get_replacement_nodes`:
```python
        for ret_node in copied_returning_nodes:
            if ret_node in match.placeholder_nodes:
                replacement_nodes.append(ret_node)
            else:
                get_replacement_nodes(ret_node)
```

Fixes #97817

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97903
Approved by: https://github.com/angelayi
2023-04-06 20:49:08 +00:00
ca1fe9bae5 remove no-op C10_DISABLE_NUMA preprocessor flag (#98243)
Nothing reads this, so setting it does nothing.

Differential Revision: [D44642070](https://our.internmc.facebook.com/intern/diff/D44642070/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44642070/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98243
Approved by: https://github.com/PaliC
2023-04-06 20:38:10 +00:00
e4c8c75583 [PG NCCL] Add TDD, NCCL_DEBUG log (#97692)
Prints these env var settings during setup for easier debugging.

Differential Revision: [D44430875](https://our.internmc.facebook.com/intern/diff/D44430875/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97692
Approved by: https://github.com/kumpera
2023-04-06 20:37:46 +00:00
03a428a5b2 [ONNX] Introduce 'Functionalization' for fx exporter (#98245)
<img src="https://user-images.githubusercontent.com/9376104/229648898-7e85efc8-143f-42f9-93e0-298a8f86c0a1.png" width="80%" height="80%">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98245
Approved by: https://github.com/wschin, https://github.com/titaiwangms
2023-04-06 20:26:50 +00:00
edebe413d3 [inductor] fix scatter fallback and fallback in deterministic mode (#98339)
Fixes https://github.com/pytorch/pytorch/issues/93537

add `ir.ScatterFallback` to correctly handle the mutation of the scatter/scatter_reduce fallback, also handle the case where `src` is a scalar, and lastly fall back in deterministic mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98339
Approved by: https://github.com/jansel
2023-04-06 19:43:17 +00:00
68cb06c752 Make gen_annotated_args support kwargs (#98396)
This PR addresses the issue seen in PR #97417 where the newly added op requires `kwargs`; however, tools/autograd/gen_annotated_fn_args.py currently does not support `kwargs`, only `func_args` are generated for test_overrides.py.

The PR adds a new field "is_kwargs" to each argument, indicating whether it's a `kwargs` or not. See example:
```
annotated_args = {
    torch._C._VariableFunctions._cast_Byte: [{'is_kwarg_only': 'False', 'name': 'self', 'simple_type': 'Tensor'}],
    ...
```

The full comparison of the generated file `annotated_fn_args.py` can be found here:
  - **Before**: [P681991116](https://www.internalfb.com/phabricator/paste/view/P681991116)
  - **After**: [P681994218](https://www.internalfb.com/intern/paste/P681994218/)

Differential Revision: D44698310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98396
Approved by: https://github.com/ezyang
2023-04-06 19:42:26 +00:00
fe99d39fbd migrate PyTorch to c10::bit_cast (#98418)
Use the standardized version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98418
Approved by: https://github.com/ezyang
2023-04-06 19:38:06 +00:00
213cec3c45 Revert "Add typing_extensions as MacOS ci dependency (#98522)"
This reverts commit e6e33488d3e7de4f58359b6c86b3c43fa33cbfc5.

Reverted https://github.com/pytorch/pytorch/pull/98522 on behalf of https://github.com/huydhn due to This needs rework
2023-04-06 19:37:38 +00:00
12f340dcd9 Add round as UserError (#98376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98376
Approved by: https://github.com/anijain2305
2023-04-06 19:28:00 +00:00
e0b958f975 [SPMD] Allow IterGraph support a more general subgraph movement (#98360)
Resubmit D44444398 due to a merge conflict.

The original assumption of IterGraph was very restrictive and only allowed users to move a subgraph in which only one node has input from external nodes. This PR fixes that limitation.

Differential Revision: [D44689730](https://our.internmc.facebook.com/intern/diff/D44689730/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44689730/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98360
Approved by: https://github.com/lessw2020
2023-04-06 19:13:37 +00:00
f228b3977b Revert "[inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)"
This reverts commit 77f32eb6ccf9c276fba1724e463247930ef71ec3.

Reverted https://github.com/pytorch/pytorch/pull/98264 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but this is failing in trunk due to a name error fake_mode_from_tensors is not defined 67d1a77086. This is probably a landrace
2023-04-06 19:00:09 +00:00
3b6e94cb8c [small] replace with .format() with f-strings (#98514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98514
Approved by: https://github.com/awgu
2023-04-06 18:58:56 +00:00
0210481dcb Fix _like meta registrations (#98160)
The meta implementation for these _like functions is wrong whenever device != "meta" (it doesn't fill the memory!).
zeros_like is special due to sparse and is fixed directly by always filling it with zeros.
Every other one is a CompositeExplicit implementation; I went with removing their meta registration and tweaking code to avoid infinite recursions.
I could do the same as zeros_like (and add the proper filling for each), but that would duplicate the c++ logic and make the meta registrations non-trivial. I can do it if you prefer that to removal.

test_meta works fine with these fixes, relying on CI to see if other tests are breaking as well.
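A minimal sketch (with made-up helper names, not the actual registrations) of the bug pattern described above: a registration that only allocates the output never fills it, which is wrong whenever the requested device is not "meta".
```python
import torch

def broken_zeros_like(x):
    return torch.empty_like(x)        # allocates, never writes zeros

def fixed_zeros_like(x):
    out = torch.empty_like(x)
    if out.device.type != "meta":     # real devices need the explicit fill
        out.zero_()
    return out

print(fixed_zeros_like(torch.randn(3)))   # all zeros, as expected
```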
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98160
Approved by: https://github.com/ezyang
2023-04-06 18:44:34 +00:00
dcb9440af9 [kineto] add SOFT_ASSERT when logging metadata (#98442)
Summary: Having a valid `kineto_activity_` before logging metadata is a crucial invariant worthy of asserts

Test Plan:
## Test with D44362040

Verify that we get SOFT_ASSERT logs before and after the diff

## Log
```
W0329 11:29:34.269824 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.270107 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.270385 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.270653 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.270941 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.271199 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.271476 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.271724 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.272003 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.272280 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.272553 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.272822 718148 profiler_kineto.cpp:122] Warning:  (function operator())
W0329 11:29:34.273092 718148 profiler_kineto.cpp:122] Warning:  (function operator())
```

Reviewed By: aaronenyeshi

Differential Revision: D44513152

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98442
Approved by: https://github.com/aaronenyeshi
2023-04-06 18:39:13 +00:00
e394f6db5a Revert "Improve dynamo support for autograd.Function (#98158)"
This reverts commit 4716fa24115435fa87d04213382d757816b8f1f3.

Reverted https://github.com/pytorch/pytorch/pull/98158 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to break the MacOS trunk job 4716fa2411. The signal was missing from the PR because we disabled MacOS jobs yesterday due to https://github.com/pytorch/pytorch/issues/98362
2023-04-06 18:15:02 +00:00
e6e33488d3 Add typing_extensions as MacOS ci dependency (#98522)
MacOS jobs started to fail in trunk because of this missing dependency (938c5da61e), so I add it explicitly. Caching issue?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98522
Approved by: https://github.com/malfet
2023-04-06 17:48:25 +00:00
49b80c3ea2 [reland] remove typed StorageImpl::data() and StorageImpl::unsafe_data() (#98411)
Original commit changeset: a466b3cb6a0a

Original Phabricator Diff: D44629941

Differential Revision: [D44709004](https://our.internmc.facebook.com/intern/diff/D44709004/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98411
Approved by: https://github.com/ezyang
2023-04-06 17:42:48 +00:00
e663143871 [dynamo 3.11] fix 3.11.2 issues (#98364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98364
Approved by: https://github.com/albanD
2023-04-06 17:37:25 +00:00
1bcb880894 Reduce includes of CUDACachingAllocator.h (#97072)
On my machine the number of files including this header goes from >200 to ~80, making rebuilds faster.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97072
Approved by: https://github.com/wanchaol
2023-04-06 17:22:35 +00:00
e085acc9f3 Cleanup Copy.cu logic (#97071)
Some of the logic specific to the cudaMallocAsync allocator related to peer access is placed outside of the allocator itself. This PR refactors, documents, and encapsulates it, while maintaining the same behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97071
Approved by: https://github.com/ngimel, https://github.com/eellison
2023-04-06 17:22:35 +00:00
938c5da61e [inductor] do not generate loops when the condition doesn't hold (#98185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98185
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-06 17:22:16 +00:00
bb33173962 Add max-autotune compilers to benchmarks (#98464)
Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98464
Approved by: https://github.com/shunting314
2023-04-06 17:13:02 +00:00
67d1a77086 Revert "Move functional collectives implementation to python. (#98315)"
This reverts commit 8b0374f83c605c47b7c1ba9274011c4b961666ce.

Reverted https://github.com/pytorch/pytorch/pull/98315 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. This is failing in trunk probably due to a landrace
2023-04-06 16:49:40 +00:00
ce797795e1 Support getattr for ConstantVariable when compiling with Dynamo (#98153)
This PR enables `getattr` on ConstantVariable by implementing its `call_hasattr` function.

Fixes #97480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98153
Approved by: https://github.com/ezyang
2023-04-06 16:48:24 +00:00
4716fa2411 Improve dynamo support for autograd.Function (#98158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98158
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2023-04-06 16:44:37 +00:00
0c5389b401 Remove unnecessary schema_map from spmd API (#98444)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98444
Approved by: https://github.com/wanchaol
2023-04-06 16:04:22 +00:00
77f32eb6cc [inductor] Enable CudaWrapperCodeGen for non-AOT mode (#98264)
Summary: When _inductor.config.cpp_wrapper is specified, we run a
two-pass wrapper codegen to generate wrapper code in cpp which calls
cuLaunchKernel to launch pre-compiled cuda kernels, and then calls
load_inline to load that generated wrapper back into the python world.
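A minimal usage sketch (not from the PR), assuming a CUDA device is available: with cpp_wrapper enabled, Inductor emits the C++ wrapper described above instead of the Python one.
```python
import torch
import torch._inductor.config as inductor_config

inductor_config.cpp_wrapper = True   # opt in to the cpp wrapper codegen

def f(x, y):
    return (x @ y).relu()

a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")
print(torch.compile(f)(a, b).shape)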

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98264
Approved by: https://github.com/ngimel
2023-04-06 15:59:55 +00:00
348dcf51e5 [inductor] Combine CppWrapperCodeGen and CppAotWrapperCodeGen (#98088)
Summary: Make CppAotWrapperCodeGen generate kernels and wrapper in one
file, which unifies the codegen for AOT and non-AOT mode. There will be
more refactoring for the AOT part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98088
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-06 15:59:55 +00:00
7b25976323 [pt2] add meta function for take (#98451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98451
Approved by: https://github.com/ezyang
2023-04-06 14:48:35 +00:00
019914095e [Easy] remove unnecessary get_rank() in tests (#98445)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98445
Approved by: https://github.com/wanchaol
2023-04-06 14:46:40 +00:00
bbf180af9f Add new aten::device variant to TorchScript (#97023)
Fixes #96627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97023
Approved by: https://github.com/jgong5, https://github.com/BowenBao, https://github.com/davidberard98
2023-04-06 14:19:00 +00:00
d1e7434bcf Improved configuration naming for repetitive workflows (#98496)
Improved configuration naming for repetitive workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98496
Approved by: https://github.com/atalman, https://github.com/malfet
2023-04-06 14:14:16 +00:00
fa4cab8925 [Sparse] Raise exception when expand is called on sparse tensor (#98365)
It's already not working, but this makes error message a bit more readable. I.e. it turns:
```
% python -c "import torch;x=torch.eye(3).to_sparse().expand(3,3)"
```

from
```
NotImplementedError: Could not run 'aten::as_strided' with arguments from the 'SparseCPU' backend.
```
to
```
RuntimeError: Expand is unsupported for Sparse tensors.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98365
Approved by: https://github.com/pearu, https://github.com/cpuhrsch
2023-04-06 14:10:16 +00:00
8b0374f83c Move functional collectives implementation to python. (#98315)
This greatly simplifies the work needed to add new ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98315
Approved by: https://github.com/albanD, https://github.com/wconstab, https://github.com/Neilblaze
2023-04-06 14:06:16 +00:00
f98c1809a4 Add mark_static (#98427)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98427
Approved by: https://github.com/voznesenskym
2023-04-06 12:58:16 +00:00
bdb79a8f52 Turn off divisible_by_16 for dynamic shapes; support ablation (#98471)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98471
Approved by: https://github.com/ngimel, https://github.com/voznesenskym
2023-04-06 12:57:07 +00:00
3142ce208f [quant][pt2e] Support quantizer API in prepare_pt2e_quantizer (#97994)
Summary:
This PR added a quantizer API to prepare_pt2e_quantizer, which enables users to annotate the nodes in the graph
directly to configure quantization instead of relying on QConfigMapping; please see the test cases in
test_quantize_pt2e.py for examples. It also added a prototype for QNNPackQuantizer, which will be modified later
to fully support the different quantization capabilities of QNNPack/XNNPack.

The goal of introducing a quantizer is to add flexibility to the quantization API and to allow modeling users and backend developers to express their quantization intentions programmatically, which will free the architecture optimization team from supporting different use cases in the core API in the future. As a concrete example, we used to have https://pytorch.org/docs/master/generated/torch.ao.quantization.qconfig_mapping.QConfigMapping.html#torch.ao.quantization.qconfig_mapping.QConfigMapping as the API for users to express their intent for quantization in fx graph mode quantization, and it has some fancy options like `set_module_name_regex` and `set_module_name_object_type_order`. These are not needed for all backends and add a maintenance burden for the AO team. With the quantizer API we will move these options to a backend-specific `Quantizer` that needs the feature, and all backends, or even advanced modeling users, can implement their own quantizer to express their quantization intent by annotating the nodes. For example, to express the intention of quantizing a convolution node, a user will find the convolution node in the graph and do:
```
operator_spec = qnnpack_quantizer.get_default_per_channel_symmetric_qnnpack_operator_spec()
conv_node.meta["target_dtype_info"] = {
    "input_act_obs_or_fq_ctr": _get_act_obs_or_fq_ctr(operator_spec),
    "weight_obs_or_fq_ctr": _get_weight_obs_or_fq_ctr(operator_spec)
    "bias_obs_or_fq_ctr": _get_bias_obs_or_fq_ctr(operator_spec),
    "output_act_obs_or_fq_ctr": _get_act_obs_or_fq_ctr(operator_spec),
    # TODO: validation of weight_index must be set if weight_obs_or_fq_ctr is set
    "weight_index": 1,
    # TODO: validation of bias_index must be set if bias_obs_or_fq_ctr is set
    "bias_index": 2,
}
```
Each backend will introduce its own quantizer, e.g. QNNPackQuantizer, which may expose more convenient APIs for modeling users to configure the annotation, and different quantizers can compose with each other to annotate the graph correctly for quantization.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_simple_quantizer
python test/test_quantization.py TestQuantizePT2E.test_qnnpack_quantizer_conv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97994
Approved by: https://github.com/vkuzo
2023-04-06 11:34:10 +00:00
ccc27bc361 [Inductor] Fix convolution lowering if stride or padding or dilation is 1 element list (#98448)
Fixes error from 14k github models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98448
Approved by: https://github.com/ngimel
2023-04-06 10:40:06 +00:00
b8cf010139 Print collective (#97544)
Prints the collective running in TDD.

Differential Revision: [D44347417](https://our.internmc.facebook.com/intern/diff/D44347417/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97544
Approved by: https://github.com/zhaojuanmao
2023-04-06 06:47:19 +00:00
dab1a7e6a1 [PG Wrapper] Add sequence number (#97462)
Adds sequence number to PG wrapper

Differential Revision: [D44347419](https://our.internmc.facebook.com/intern/diff/D44347419/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97462
Approved by: https://github.com/zhaojuanmao
2023-04-06 06:47:19 +00:00
428c531d00 [FSDP] records for composable (#98428)
Add some function recording, since the composable API does record an FSDP.forward

Differential Revision: [D44715137](https://our.internmc.facebook.com/intern/diff/D44715137/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98428
Approved by: https://github.com/awgu
2023-04-06 06:40:48 +00:00
eadd84d065 [ROCm] Enable FSDP BF16 comm hooks unit tests (#97517)
This PR enables the following unit tests in FSDP feature on ROCm.
```
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_FULL_SHARD
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_NO_SHARD
test_bf16_hook_has_wrapping_False_sharding_strategy_ShardingStrategy_SHARD_GRAD_OP
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_FULL_SHARD
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_NO_SHARD
test_bf16_hook_has_wrapping_True_sharding_strategy_ShardingStrategy_SHARD_GRAD_OP
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97517
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
2023-04-06 05:45:17 +00:00
37dc47a1ac Make calling type on user defined class UserError (#98366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98366
Approved by: https://github.com/anijain2305
2023-04-06 05:20:50 +00:00
1cd1d9c24a [SPMD] Dedup collectives from DTensor expansion (#98216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98216
Approved by: https://github.com/wanchaol
2023-04-06 04:41:38 +00:00
11b0a84f3e Enable LogSoftmax for SPMD tracing (#98380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98380
Approved by: https://github.com/wanchaol
2023-04-06 04:41:37 +00:00
e2c81e44db backport std::bit_cast from c++20 to c10 (#98417)
This is already used in a few places in our codebase, so let's
standardize on a tested implementation.

Differential Revision: [D44712290](https://our.internmc.facebook.com/intern/diff/D44712290/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44712290/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98417
Approved by: https://github.com/ezyang
2023-04-06 04:17:45 +00:00
ab95b7a05f Support neg calls to dyn shapes (#94068)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94068
Approved by: https://github.com/jansel
2023-04-06 03:33:24 +00:00
11890156e7 fix grain size setting for baddbmm_cpu_kernel (#98297)
fix https://github.com/pytorch/pytorch/issues/92892

the `grain_size` setting for parallelization in baddbmm_cpu_kernel is wrong, which makes small input sizes go parallel, leading to OpenMP threading overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98297
Approved by: https://github.com/lezcano, https://github.com/nikitaved
2023-04-06 01:51:10 +00:00
cc5f64957b Add PrivateUse1 for dispatching PyTorch Distributed Collectives. (#98137)
Add PrivateUse1 for dispatching PyTorch Distributed Collectives to support custom device. This PR is to fix https://github.com/pytorch/pytorch/issues/97938#issue-1646833919
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98137
Approved by: https://github.com/kumpera
2023-04-06 01:41:43 +00:00
d3adbbf44b Clean up CUDA 11.6 Docker images (#98395)
We need to clean up 11.6 on PyTorch too after https://github.com/pytorch/builder/pull/1366.  Otherwise, docker build would fail with a `bad argument 11.6` error, i.e. https://github.com/pytorch/pytorch/actions/runs/4614525312/jobs/8159595038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98395
Approved by: https://github.com/atalman, https://github.com/malfet
2023-04-06 01:37:49 +00:00
bd78532020 [BE] Fix collect_env for python-path-with-space (#98415)
By invoking [`Popen`](https://docs.python.org/2.7/library/subprocess.html#popen-constructor) with a list of command-line arguments, rather than a string that would be parsed by the shell.
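A minimal sketch of the fix described above: passing an argument list to subprocess avoids shell parsing, so interpreter paths containing spaces keep working. Here sys.executable stands in for such a path.
```python
import subprocess
import sys

# Robust: each argument is passed through unmodified, spaces and all.
out = subprocess.check_output([sys.executable, "--version"])
print(out.decode().strip())

# The fragile alternative is a single shell-parsed string, e.g.
#   subprocess.check_output(f"{sys.executable} --version", shell=True)
# which breaks as soon as the path contains a space.
```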

Test plan:
```shell
% conda create -n py311 python=3.11
% cd ~/miniconda3/envs
% cp -a py311 py\ 311
% ./py\ 311/bin/python -mtorch.utils.collect_env
```

Fixes https://github.com/pytorch/pytorch/issues/98385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98415
Approved by: https://github.com/huydhn
2023-04-06 01:09:23 +00:00
680bf14a40 [EASY] Fix some more places where we incorrectly assume only Tensor (#98310)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98310
Approved by: https://github.com/voznesenskym
2023-04-06 00:57:59 +00:00
478df47fab Disable persistent reductions with dynamic shapes (#98405)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98405
Approved by: https://github.com/voznesenskym
2023-04-06 00:54:35 +00:00
007587aa00 [CI] Update update_expected.py to make it generate a combined csv file (#98407)
Summary: make update_expected.py combine csv files from all shards into a single csv file for each test suite

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98407
Approved by: https://github.com/wconstab, https://github.com/ezyang
2023-04-06 00:00:58 +00:00
a76114832a [quant][pt2e][fix] Fix the internal test failures caused by refactor (#98378)
Summary: att, this PR removes some incorrect assumptions from `_maybe_insert_observers_before_graph_output`

Test Plan:
internal test

Differential Revision: D44697212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98378
Approved by: https://github.com/andrewor14
2023-04-05 23:27:34 +00:00
2a48f43fe2 Add check for 0 to 1 inclusive for elements of target tensor in BCE loss (#97814)
TODO for @mikaylagawarecki : add BC breaking description

Fixes #87373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97814
Approved by: https://github.com/mikaylagawarecki
2023-04-05 23:26:09 +00:00
3112d2a2b6 Export function symbols to enable Windows build of Intel Extension for PyTorch (#98054)
This PR is to export specific function symbols into .dll shared library on Windows platform to support Windows build for [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
TORCH_API/TORCH_PYTHON_API/PYBIND11_EXPORT are macros that decorate a function as dllexport during compilation, so that the function symbol is exported into the .dll shared library file on the Windows platform. This is necessary for other libraries (such as IPEX) to import and call these functions through dynamic linking of PyTorch on Windows.
The code changes of this PR add decorators to export specific functions used by IPEX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98054
Approved by: https://github.com/ezyang
2023-04-05 23:23:18 +00:00
013c7f5ba4 [inductor] Move tl.broadcast call out codegen.common (#98304)
This makes only a cosmetic change to the generated code, but means
triton's broadcasting logic doesn't leak out into the CSE class.

Before:
```python
    tmp5_load = tl.load(in_ptr1 + (0))
    tmp5 = tl.broadcast_to(tmp5_load, [XBLOCK])
```

After:
```python
    tmp5 = tl.load(in_ptr1 + (0))
    tmp6 = tl.broadcast_to(tmp5, [XBLOCK])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98304
Approved by: https://github.com/ngimel
2023-04-05 23:10:46 +00:00
bb4174d2a3 [inductor] Enable CSE on masked loads (#98303)
Currently the `TritonKernel.mask_loads` context manager calls
`swap_buffers` which creates a new CSE context. So, code generated in
different mask contexts cannot be CSE'd even if their masks are the
same. This fixes the issue by not calling `swap_buffers` and instead
having `load` manually check if a `"tmp"` name appears in the mask
meaning the load needs to be generated in the compute buffer.

Currently, simple programs involving padding will result in duplicate
masked loads, e.g. the generated code for
```python
def forward():
    a = torch.nn.functional.pad(x, (0, 1))
    return a + a
```

contains the lines

```python
    tmp3 = tl.load(in_ptr0 + (x1 + tl.zeros([XBLOCK], tl.int32)), tmp2 & xmask, other=0)
    tmp4 = tl.where(tmp2, tmp3, 0.0)
    tmp5 = tl.load(in_ptr0 + (x1 + tl.zeros([XBLOCK], tl.int32)), tmp2 & xmask, other=0)
    tmp6 = tl.where(tmp2, tmp5, 0.0)
```

With this change, the duplicates are removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98303
Approved by: https://github.com/ngimel
2023-04-05 23:10:46 +00:00
aa7850c214 rewrite at::vec::*::convert_to_int_of_same_size (#98429)
This was failing to compile with unrelated changes in the
windows-binary-libtorch-release build job. This rewrite seems to avoid
that problem.

For an example failure, see:
144d5268a1

Differential Revision: [D44717809](https://our.internmc.facebook.com/intern/diff/D44717809/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98429
Approved by: https://github.com/huydhn
2023-04-05 23:04:08 +00:00
29d2e4b7fa Forward fix for DataLoader to accept custom Sharding DataPipe (#97287)
Fixes #96975

Changes:
- Make sure custom ShardingDataPipe with `apply_sharding` can be used by `DataLoader`
  - Allow the `apply_sharding` function without the last argument of `sharding_group`
- Make `DataLoader` not relying on `sharding_group`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97287
Approved by: https://github.com/NivekT
2023-04-05 22:33:37 +00:00
d01ee10b25 Add detect_fake_mode (#98321)
This replaces fake_mode_from_tensors but it preferentially looks for
fake_mode in TracingContext and also if there is an active fake mode
on the dispatch stack, before groveling in tensors to find it.

This advances PegasusForCausalLM, which was previously failing because
we generated a graph that had a parameter (non-fake) and a SymInt,
and thus previously we failed to detect the correct fake mode.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98321
Approved by: https://github.com/voznesenskym
2023-04-05 22:15:16 +00:00
5854923c17 Extract ExtraMeta symbolic shape fields into a dedicated SymbolicShapeMeta (#98399)

This modularizes ExtraMeta to bring down its creation cost when it is needed for functions other than symbolic shape handling.

Change-Id: Ife59b201b0c4fd75090fe8be5171a6dd73a10d10


Pull Request resolved: https://github.com/pytorch/pytorch/pull/98399
Approved by: https://github.com/ezyang
2023-04-05 22:06:10 +00:00
5af47dbb23 Add slow workflow to upload test stats workflow (#98447)
I wonder if it would be better to write an exclusion list instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98447
Approved by: https://github.com/huydhn
2023-04-05 22:03:16 +00:00
d0eafed7fb [Easy] Fix minor errors in DTensor examples (#98430)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98430
Approved by: https://github.com/wanchaol
2023-04-05 21:44:01 +00:00
b1c2925493 [Dynamo] Support typing.Union and typing.Optional (#98384)
Fixes #98265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98384
Approved by: https://github.com/ezyang
2023-04-05 21:31:52 +00:00
846415f6ea Add HPU to the storage tensor backends (#98404)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98404
Approved by: https://github.com/ezyang
2023-04-05 21:29:27 +00:00
29cde00701 [MPS] Add random_ overload (#98333)
That simply calls `torch.random_(from=0, to=None)`

Also, fix optional upper bound calculation for all `dtypes` but int64:
As one can see from https://pytorch.org/docs/stable/generated/torch.Tensor.random_.html
`from` boundary is inclusive, but `to` is exclusive, i.e. if `to` is
omitted for `torch.int8` dtype, it should be set to `128` and to `2`
for torch.bool.
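A small illustration of the bound semantics described above: with `to` omitted, random_ draws from [from, upper) where upper depends on the dtype, e.g. 128 for torch.int8 and 2 for torch.bool.
```python
import torch

x = torch.empty(1000, dtype=torch.int8).random_()
print(x.min().item(), x.max().item())   # min/max stay within the dtype's representable range

b = torch.empty(1000, dtype=torch.bool).random_()
print(b.unique())                       # only False/True
```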

Add test for `torch.random_`

Fixes https://github.com/pytorch/pytorch/issues/98118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98333
Approved by: https://github.com/kulinseth
2023-04-05 21:24:45 +00:00
c9b1e09958 [c10d] delete lengths offset checks (#98368)
According to @kwen2501, NCCL supports up to size_t send counts, so
PGNCCL shouldn't restrict it

A follow up is to think about whether we should add overflow protection
of offset
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98368
Approved by: https://github.com/kwen2501
2023-04-05 21:06:49 +00:00
9c7b03d51e [Dynamo] Fix bug of torch.is_floating_point & is_complex (#98393)
Fixes #95192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98393
Approved by: https://github.com/ezyang
2023-04-05 20:58:16 +00:00
ebeaf8adf1 Add hacky example inputs to dynamo produced graph (#96561)
Executorch currently uses the functorch.functionalize API, so we have to invoke make_fx twice: once for filtering out autograd-related stuff (happens in torchdynamo.export(aten=True)) and once for tracing the functionalized version of the graph. The previous PR changed the make_fx behaviour to pass in the fake tensors used in dynamo, but as Executorch invokes the second make_fx directly, we need access to the fake tensors that dynamo used. We cannot call torchdynamo.export again in the second round because we don't have a way to functionalize inside dynamo at the moment, hence I added this attribute in dynamo for now. Once we move to AOTAutograd functionalization, we don't have to deal with this anymore and I will remove it.

Differential Revision: [D43994692](https://our.internmc.facebook.com/intern/diff/D43994692)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96561
Approved by: https://github.com/zhxchen17, https://github.com/voznesenskym
2023-04-05 20:54:33 +00:00
3d36f6f18d Fix default argument of parse_ir stub (#98397)
It's `false` in the pybind code but not provided to the stub: 2987bc0758/torch/csrc/jit/python/init.cpp (L1625)


Pull Request resolved: https://github.com/pytorch/pytorch/pull/98397
Approved by: https://github.com/mikaylagawarecki
2023-04-05 20:46:16 +00:00
3c8e9e38a1 [pt2][inductor] retry add triton.__version__ as cache key, update cache layout (#98369)
Summary: retry of landing D44550100; try to import triton, otherwise consider the version as `None`

Test Plan: will make sure windows OSS tests run as well in CI

Differential Revision: D44694213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98369
Approved by: https://github.com/huydhn
2023-04-05 20:43:54 +00:00
f21a176c03 Python Dispatcher should respect FuncTorchBatchedDecomposition key (#98328)
Fixes https://github.com/pytorch/pytorch/issues/97425.

Python Dispatcher's resolve_key function should be equivalent to
computeDispatchTableEntryWithDebug. We added a section to
computeDispatchTableEntryWithDebug but forgot to add it to resolve_key.

This PR fixes that discrepancy.

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98328
Approved by: https://github.com/Chillee, https://github.com/kshitij12345, https://github.com/Neilblaze
2023-04-05 20:32:53 +00:00
78e991e575 Patch release process description (#98425)
Patch release process description
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98425
Approved by: https://github.com/seemethere
2023-04-05 20:25:13 +00:00
37b9143206 Require sequence length in huggingface to be dynamic (#98335)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98335
Approved by: https://github.com/voznesenskym
2023-04-05 19:40:22 +00:00
cf1bfca2ba Require batch dimensions to be compiled dynamically (#98334)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98334
Approved by: https://github.com/voznesenskym
2023-04-05 19:40:22 +00:00
69f9bd2323 Don't error if we mark_dynamic without dynamic_shapes on (#98324)
In the terminal state, it won't matter if you have dynamic_shapes
on or not, mark_dynamic will always work.

Today, it's helpful to make this not error so I can easily swap
between static or not and run experiments.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98324
Approved by: https://github.com/voznesenskym
2023-04-05 19:40:22 +00:00
2c6c7deeb3 Added ModuleInfos for Pooling ops (#98358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98358
Approved by: https://github.com/albanD
2023-04-05 19:39:07 +00:00
3a0ad3c194 [easy] Remove large LayerNorm sample input causing OOM from ModuleInfo (#98424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98424
Approved by: https://github.com/huydhn, https://github.com/albanD
2023-04-05 19:38:15 +00:00
3ed66f94b5 Add more debug logs to evaluate_expr (#98344)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98344
Approved by: https://github.com/voznesenskym
2023-04-05 19:35:07 +00:00
f557402e8d remove //c10:headers (#98420)
The c10 library is light enough that there's not really much benefit
to being very unbazel-y and providing an incomplete library that lacks
the source files.

Differential Revision: [D44713077](https://our.internmc.facebook.com/intern/diff/D44713077/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98420
Approved by: https://github.com/ezyang
2023-04-05 19:33:10 +00:00
937ba248eb Make the Index Rounding Mode Consistent Between the 2D and 3D GridSample Nearest Neighbor Interpolations (#97000)
## BC-breaking note:

This is technically a bugfix. Prior to this PR, for `torch.nn.functional.grid_sample(mode='nearest')` the 2D kernel used `std::nearbyint` whereas the 3D kernel used `std::round` in order to determine the nearest pixel locations after un-normalization of the grid. This PR fixes the 3D kernel to use `std::nearbyint` which rounds values that are exactly `<>.5` to the nearest even which is consistent with the behavior of `torch.round`. Unnormalized indices that are exactly `<>.5` will now be rounded to the nearest even instead of being rounded away from 0.

## Description
In the nearest neighbor interpolation mode, the 2D GridSample rounds the index to the nearest even using [std::nearbyint](https://github.com/pytorch/pytorch/blob/v2.0.0/aten/src/ATen/native/cpu/zmath.h#L182) whereas the 3D GridSample rounds the index away from zero using std::round. This discrepancy needs to be resolved. We are making both 2D GridSample and 3D GridSample round to the nearest even.
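A small illustration of the two rounding modes discussed above: exact halfway values round to the nearest even under torch.round/std::nearbyint, but away from zero (for positive inputs) under a floor(x + 0.5) scheme.
```python
import torch

halves = torch.tensor([0.5, 1.5, 2.5, 3.5])
print(torch.round(halves))          # tensor([0., 2., 2., 4.]) -> nearest even
print(torch.floor(halves + 0.5))    # tensor([1., 2., 3., 4.]) -> away from zero
```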

## Unit Test Goals
1. Make sure the x dimension and y dimension rounding behaviors are the same for 2D GridSample.
2. ~~Make sure the 2D GridSample rounding mode is rounding to the nearest even.~~
3. Make sure the x dimension, y dimension, and z dimension rounding behaviors are the same for 3D GridSample.
4. ~~Make sure the 3D GridSample rounding mode is rounding to the nearest even.~~
5. The 2D GridSample and 3D GridSample rounding behaviors are exactly the same.

After some experiments, I found 2 and 4 are difficult to achieve. Even though I can compute the normalized coordinates corresponding to the unnormalized coordinates including [0, 0.5, 1.0, 1.5, 2.0, 2.5, ..., 10.0], the unnormalization process in the GridSample implementations always has a small chance of floating point error. Therefore, it's not possible to unit test the rounding mode from the normalized coordinates.

## Unit Test Methods

The unit test is simple. By using the same values along the dimension to be tested in the input tensor and the same normalized indices in the grid tensor, the interpolations along the 2D GridSample x-dimension, 2D GridSample y-dimension, 3D GridSample x-dimension, 3D GridSample y-dimension, and 3D GridSample z-dimension should produce exactly the same interpolated values.
If any CPU/CUDA 2D/3D implementation uses a different rounding mode from the others, the unit test will fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97000
Approved by: https://github.com/mikaylagawarecki
2023-04-05 18:47:03 +00:00
dcec2100b1 [dtensor] add placement strategy and einsum strategy (#98227)
This adds placement strategy to the op schema and implements the einsum
strategy. It's the basic building block for compile-mode expansion
and new op implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98227
Approved by: https://github.com/XilunWu
2023-04-05 17:09:32 +00:00
93063768da [pruning][core][feature] Implement convert for pruner (#97545)
Summary:

This PR implements `BaseSparsifier.convert()`, which performs module swapping.
The modules and mappings will be merged in a future PR.

Test Plan:
`python test/test_ao_sparsity.py -- TestBaseSparsifier.test_convert`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97545
Approved by: https://github.com/jerryzh168
2023-04-05 16:57:11 +00:00
3657b37d6b Add forward_prefetch flag to fully_shard (#98277)
Summary: Per title

Test Plan: CI

Differential Revision: D44656110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98277
Approved by: https://github.com/awgu
2023-04-05 16:49:15 +00:00
49c130256d Clarify Tensor.is_sparse doc (#98408)
That it returns true only for COO tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98408
Approved by: https://github.com/cpuhrsch
2023-04-05 16:42:19 +00:00
d1de5f5f0d Change daily aggregates upload job to use sum and occurence counter instead of averages (#98359)
We used to keep track of the average of stats; however, when we munge the data to find interesting insights this makes things difficult (i.e. finding total test time for an oncall). The pin is updated such that we keep track of the sum instead, as well as an "occurrences" field, so that the average can be rederived from sum/occurrences.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98359
Approved by: https://github.com/huydhn
2023-04-05 16:31:58 +00:00
762a81cb7d [spmd compile api] pre-flatten state container and pass the flattened state container to transforms (#98392)
Move the responsibility of flattening the input arguments from the graph module to the caller. This serves two purposes:
- Transformations that add/remove state need to manipulate a state container that maintains the state tensors in the same order as they appear in graph placeholders.
- Reduced runtime cost. The state container is only flattened once upfront.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98392
Approved by: https://github.com/mrshenli
2023-04-05 16:31:47 +00:00
37dbd5bf76 [spmd compile API] add a (temporary) mechanism for overriding input tensors' placements (#98391)
Currently, the compile API assumes all input tensors' shard dimension is the first dimension. dtensor expansion doesn't work when there are input tensors whose shard dimension is not the first dimension.

In addition, respect non-tensor inputs beyond nn.Module and optim.Optimizers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98391
Approved by: https://github.com/mrshenli
2023-04-05 16:31:47 +00:00
970c08f92f [spmd expansion] support scalar_tensor (#98390)
scalar_tensor is a pure factory function that can't be handled by DTensor prop rule and needs to be currently handled in spmd expansion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98390
Approved by: https://github.com/mrshenli
2023-04-05 16:31:44 +00:00
0830808dde [spmd expansion] speed up expansion by ~5x (#98389)
According to profiling, the top two expensive operations in spmd expansion are propagate_op_sharding and make_fx (for every dispatcher op node). This PR makes the following changes to speed up spmd expansion:
- We are unnecessarily doing propagate_op_sharding twice for every op. Remove one.
- When no tensor redistribution is required, we only need to update non-tensor args of the node according to op_schema and avoid building a GraphModule just for the node.

On a DDP use cases + foreach Adam, this change speeds up spmd expansion by ~5x (~10 min -> ~2 min).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98389
Approved by: https://github.com/mrshenli
2023-04-05 16:31:40 +00:00
161f7c0b28 [spmd expansion] support torch.ops.aten.sym_numel (#98388)
The current logic assumes non-overload ops take two arguments; however, torch.ops.aten.sym_numel takes one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98388
Approved by: https://github.com/mrshenli
2023-04-05 16:31:36 +00:00
3344d79e3f Pattern matcher improvements (#97740)
This adds support for multi-output patterns and example-based
replacements.

Tests/usage are next in this PR stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97740
Approved by: https://github.com/ngimel
2023-04-05 15:25:34 +00:00
279ca5f9db Revert "[CUDA12] set_device change (#94864)"
This reverts commit c18be2b2ec00133abe28efcdd0462e50ddd45a1a.

Reverted https://github.com/pytorch/pytorch/pull/94864 on behalf of https://github.com/ezyang due to avoid affecting cuda 11
2023-04-05 14:53:00 +00:00
981f9f0408 Better Handling of Storage Cache (#98254)
Because we do not persist output memory of cudagraphs, we need to reconstruct tensors at their correct memory locations after we've done a run. We were using a storage cache for that, but it had a couple of issues:
- If a data ptr existed in the cache, we should only reuse the corresponding storage if the storage hadn't died.
- It didn't work across separate nodes. While you wouldn't think this would be an issue, it was in testing HF.
- StorageWeakRef maintains whether the Storage C++ object remains allocated, not whether the corresponding memory has been deallocated. In order to use them to track memory deallocations we must maintain a single StorageWeakRef for all Storages that reference that memory (even if we are constructing Storages that do not have a deallocator function).

This PR uses a single storage_cache as we execute any tree path. When we retrieve a storage from the cache we
check that it is still alive, and we hash based on both the observed recording data ptr and the StorageImpl weak ref.

Update to use a single storage cache across all executions in a path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98254
Approved by: https://github.com/jansel
2023-04-05 14:45:25 +00:00
f1b901b040 Make sure we dealloc on recording, not just replay (#97440)
Copy over non cuda graph inputs as we are allocating the recording inputs so they do not need to be allocated as we record the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97440
Approved by: https://github.com/ezyang
2023-04-05 14:41:51 +00:00
c18be2b2ec [CUDA12] set_device change (#94864)
This PR adds workaround for CUDA 12 [`cudaSetDevice` change](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g159587909ffa0791bbe4b40187a4c6bb) which will always create primary context on target device. So operations like this:
```Python
import torch
x = torch.randn(1, device="cuda:1")
```
would always create a primary context on device `cuda:1`, because it is creating a tensor on it, and on device `cuda:0`, because the destructor of the CUDA Device guard calls `cudaSetDevice(0)`.
After this PR the CUDA Device guard will not call `cudaSetDevice(0)` if a primary context does not exist on `cuda:0`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94864
Approved by: https://github.com/malfet, https://github.com/atalman, https://github.com/ezyang
2023-04-05 14:34:00 +00:00
7b08889074 Fix GridSample Activation Quantization (#98278)
The convention for activation quantization is rounding to the nearest even.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98278
Approved by: https://github.com/vkuzo
2023-04-05 13:22:08 +00:00
3da7e83250 Add test for pickle_module (#98373)
I.e. a regression test for https://github.com/pytorch/pytorch/issues/88438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98373
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-04-05 13:05:05 +00:00
ea00f850e9 add new() method identifier to _StorageBase (#98201)
The method torch.UntypedStorage.new is not detailed in the API docs. Adding a method identifier may make it easier to know that the new() method is only implemented in C++, like copy_() or nbytes().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98201
Approved by: https://github.com/ezyang
2023-04-05 12:47:40 +00:00
2a32bc50c6 Only print guard code when printing guards (#98347)
Before:

<img width="1138" alt="image" src="https://user-images.githubusercontent.com/13564/229915726-19bddea8-8fa4-46d2-8501-358e9f9ea639.png">

After:

```
[2023-04-04 13:44:23,109] torch._dynamo.convert_frame.__guards: [DEBUG] GUARDS:
  ___check_obj_id(L['self'], 139844089003936)
  not ___are_deterministic_algorithms_enabled()
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98347
Approved by: https://github.com/voznesenskym
2023-04-05 12:15:25 +00:00
555ab310dc Add itemsize and nbytes properties to Tensor (#98322)
Adds properties for itemsize and nbytes to Tensor matching the properties in NumPy.
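A quick illustration of the added properties, mirroring NumPy's naming:
```python
import torch

x = torch.zeros(4, 5, dtype=torch.float32)
print(x.itemsize)   # 4   bytes per element
print(x.nbytes)     # 80  20 elements * 4 bytes each
```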

Fixes https://github.com/pytorch/pytorch/issues/12728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98322
Approved by: https://github.com/ezyang
2023-04-05 12:11:55 +00:00
14ccad73b4 fix _slice_meta's shape calculation (#98326)
Fixes #98325.

This PR corrects the output shape calculation used in `_slice_meta` from:

```python
math.floor((end - start) / stride)
```

to

```python
1 + (end - start - 1) // stride
```
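A quick numeric check of the two formulas above: slicing 5 elements with stride 2 (x[0:5:2]) yields 3 elements, which only the corrected formula reproduces.
```python
import math

start, end, stride = 0, 5, 2
print(math.floor((end - start) / stride))   # 2 (old, wrong)
print(1 + (end - start - 1) // stride)      # 3 (new, correct)
print(len(list(range(5))[0:5:2]))           # 3 (ground truth)
```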
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98326
Approved by: https://github.com/ezyang
2023-04-05 12:07:18 +00:00
b4420f0fd5 Fix complex variable notation for division operator to be consistent. (#98057)
A readability improvement: changes notation in complex division to match comments `(a + bi)/(c + di)` for `constexpr FORCE_INLINE_APPLE complex<T>& operator/=(const complex<U>& rhs)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98057
Approved by: https://github.com/ezyang
2023-04-05 12:06:20 +00:00
526b564fa0 Uniformly use elem when checking ListType (#97873)
An initial attempt to make the arg parser code more readable (going through and understanding the logic behind *torchgen* as a newcomer)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97873
Approved by: https://github.com/ezyang
2023-04-05 12:06:03 +00:00
c4de7fdef5 [CI] Mark sebotnet33ts_256 as nondeterministic (#98356)
Summary: The goal is to make sure the new dashboard doesn't give noisy
alerts on this test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98356
Approved by: https://github.com/ezyang
2023-04-05 12:05:47 +00:00
a05d787eb6 [inductor] Fix slow tests not being run in CI (#97841)
PyTorch slow tests are run in CI with `PYTORCH_TEST_SKIP_FAST=1` which skips any
test not decorated with `@slowTest`. That means tests marked with
`skipIf(not TEST_WITH_SLOW)` are never run.
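A minimal sketch of the distinction described above, using the internal test decorators from torch.testing._internal.common_utils:
```python
from torch.testing._internal.common_utils import TestCase, run_tests, slowTest

class MyTests(TestCase):
    @slowTest                 # picked up by the slow shard (PYTORCH_TEST_SKIP_FAST=1)
    def test_heavy(self):
        self.assertTrue(True)

    # Guarding a test only with `unittest.skipIf(not TEST_WITH_SLOW, ...)`
    # skips it in fast shards, and the slow shard's PYTORCH_TEST_SKIP_FAST
    # filter skips it too because it lacks @slowTest, so it never runs.

if __name__ == "__main__":
    run_tests()
```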

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97841
Approved by: https://github.com/jansel
2023-04-05 10:30:20 +00:00
45edc58e4f Revert "remove typed StorageImpl::data() and StorageImpl::unsafe_data() (#98219)"
This reverts commit 144d5268a1ee55a348c36bb6f02b881cc67d5173.

Reverted https://github.com/pytorch/pytorch/pull/98219 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-04-05 09:08:08 +00:00
752e43c301 Move android-emulator-build-test to periodic (#98370)
Per internal discussion, this test is only to verify open source Android build, and it's not critical to run it on every commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98370
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-04-05 07:07:27 +00:00
2987bc0758 Inductor cpp wrapper: support dynamic shapes (#97965)
1. Fixed dynamic shapes support in cpp_wrapper
   - fixed the cpp codegen of `size()` and `stride()`
   - fixed the cpp codegen of `ShapeAsConstantBuffer`
   - changed to use `cexpr` instead of `pexpr` in the cpp codegen of the `sizevar`

2. Enabled dynamic shapes tests for cpp_wrapper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97965
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-05 07:02:30 +00:00
601e7dc0bb Fix typos under caffe2/operators directory (#98235)
This PR fixes typos in comments and messages of `.cc` and `.h` files under `caffe2/operators` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98235
Approved by: https://github.com/kit1980
2023-04-05 06:26:01 +00:00
feb9ec4282 Account for forwards whose corresponding backwards are not invoked (#98112)
Previously, when we would run a forward graph whose backward we never invoked, it would prevent us from switching from warmup to recording. Now, refine the heuristic to allow incrementing the generation as soon as we invoke a backward graph. This still handles the
```
mod1 = torch.compile(...)

mod2 = torch.compile(...)

mod2(mod1(x)).sum().backward()
```
case while accounting for graphs which we may not run backward of.

It also now handles the case where we skip cudagraphify the backward of a forward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98112
Approved by: https://github.com/jansel
2023-04-05 06:12:16 +00:00
ae0d06b42c Fix saving and loading pickle files on Big Endian systems (#95881)
This change fixes test/test_cpp_api_parity.py tests on Big Endian systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95881
Approved by: https://github.com/malfet
2023-04-05 06:11:31 +00:00
1e3abda31a Revert "[spmd expansion] support torch.ops.aten.sym_numel (#98229)" (#98382)
This reverts commit 4d13fcddeff01bc7d44f752173e8efecf25fcf9b.

Fixes diff train landing issue as the original diff was modified after the PR was merged in OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98382
Approved by: https://github.com/kit1980
2023-04-05 04:07:58 +00:00
e943b120a3 Fix incorrectly getting the name of OrderedDict's index in dynamo (#96940)
Fixes #96737

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96940
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-04-05 03:53:45 +00:00
30d47e4520 Do not track parameters, do not generate guards (#98350)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98350
Approved by: https://github.com/voznesenskym
2023-04-05 03:48:46 +00:00
144d5268a1 remove typed StorageImpl::data() and StorageImpl::unsafe_data() (#98219)
Typed data will now only be a tensor level concept.

Differential Revision: [D44629941](https://our.internmc.facebook.com/intern/diff/D44629941/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98219
Approved by: https://github.com/ezyang
2023-04-05 03:32:02 +00:00
6887333cf9 [inductor] Fix a perf regression caused by https://github.com/pytorch/pytorch/pull/98214 (#98343)
Summary: See the description in https://github.com/pytorch/pytorch/issues/98342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98343
Approved by: https://github.com/jansel
2023-04-05 01:46:20 +00:00
b923f84805 Switch accuracy CI to dynamic batch only (#98307)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98307
Approved by: https://github.com/wconstab
2023-04-05 01:20:12 +00:00
6514d71add Fix typos under torch/distributed directory (#98225)
This PR fixes typos in comments and messages of `.py` files under `torch/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98225
Approved by: https://github.com/soulitzer, https://github.com/kit1980
2023-04-05 00:21:33 +00:00
3af0228338 remove typed StorageImpl::unsafe_data() (#98218)
Typed data will now only be a tensor level concept.

Differential Revision: [D44629939](https://our.internmc.facebook.com/intern/diff/D44629939/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98218
Approved by: https://github.com/ezyang
2023-04-05 00:10:59 +00:00
a3365e1d0d Increment pending forwards after invocation (#98101)
Forwards are only pending following invocation, not before.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98101
Approved by: https://github.com/ngimel
2023-04-05 00:04:39 +00:00
3686416a57 [SyncBatchNorm] Support running with low precision parameters (#98332)
This PR fixes https://github.com/pytorch/pytorch/issues/96203.

**Details**
When using `nn.SyncBatchNorm` with the model converted to FP16, there is a dtype discrepancy in the `SyncBatchNorm.forward()` causing an error like:
```
 File "/.../pytorch/torch/nn/modules/_functions.py", line 91, in forward
    mean, invstd = torch.batch_norm_gather_stats_with_counts(
RuntimeError: Expected counts to have type Half but got Float
```
[`torch.batch_norm_gather_stats_with_counts()`](fe9da29842/torch/nn/modules/_functions.py (L88-L97)) requires the `running_mean`, `running_var`, and `counts` to have the same dtype. However, when the model has been converted to FP16, only `running_mean` and `running_var` use FP16, while the `counts` are in FP32 due to [`mean` being in FP32](fe9da29842/torch/nn/modules/_functions.py (L25-L30)). This PR resolves this by casting `counts` from FP32 to FP16 instead of the alternative to cast `mean` and `invstd` from FP32 to FP16.

Moreover, for the backward, this PR casts `weight` from FP16 to FP32 to match the dtype of `mean` and `invstd` as required by `torch.batch_norm_backward_elemt()` instead of the alternative to cast `mean` and `invstd` from FP32 to FP16.
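A minimal sketch of the dtype alignment described above; these tensors are illustrative stand-ins for SyncBatchNorm's buffers, not the real ones.
```python
import torch

running_mean = torch.zeros(8, dtype=torch.float16)   # model converted to FP16
counts = torch.tensor([32.0], dtype=torch.float32)   # stats math done in FP32

counts = counts.to(running_mean.dtype)               # the fix: cast counts, not mean/invstd
assert counts.dtype == running_mean.dtype
```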

**Test Plan**
I dug up this run command from 2021:
For `world_size` in `{1,2}` and `backend` in `{nccl, gloo}`:
```
WORLD_SIZE=world_size BACKEND=backend  python -m pytest test/distributed/test_distributed_spawn.py -k test_DistributedDataParallel_SyncBatchNorm_half -vs
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98332
Approved by: https://github.com/rohan-varma
2023-04-05 00:00:30 +00:00
2d9b2bcfba Extend TensorImpl with BackendMeta (#97429)
BackendMeta offers a binary interface for the backend to attach arbitrary data to TensorImpl. TensorImpl has exactly one "slot" for backend metadata; however, the backend is free to compose any structure that is opaque to the framework beyond inheriting the standard BackendMeta base.

Change-Id: I670fcdd16dd1c2b00f7eaa1cbc5b5dfea59a6221


Pull Request resolved: https://github.com/pytorch/pytorch/pull/97429
Approved by: https://github.com/ezyang
2023-04-04 23:47:03 +00:00
dd503376bd Revert "[pt2][inductor] add triton.__version__ as cache key, update cache layout (#98010)"
This reverts commit 0eab3ab51ec3e83bd9961b167bfbdbab7fc064e6.

Reverted https://github.com/pytorch/pytorch/pull/98010 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-04-04 22:22:46 +00:00
bd6db54285 [CI] Mark mobilenet_v3_large as nondeterministic (#98314)
Summary: Skip mobilenet_v3_large for accuracy checking to reduce
noise on the dashboard. The root cause still needs to be investigated.

mobilenet_v3_large shows random accuracy check failures with different
error values from time to time, and here are some examples:
```
cuda train mobilenet_v3_large                  [2023-04-04 14:54:50,990] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.02172, (ref-fp64): 0.01068 and shape=torch.Size([960, 1, 5, 5])
[2023-04-04 14:54:50,990] torch._dynamo.utils: [ERROR] Accuracy failed for key name features.14.block.1.0.weight.grad
```
```
cuda train mobilenet_v3_large                  [2023-04-04 14:57:59,972] torch._dynamo.utils: [ERROR] RMSE (res-fp64): 0.07744, (ref-fp64): 0.03073 and shape=torch.Size([72, 1, 5, 5])
[2023-04-04 14:57:59,973] torch._dynamo.utils: [ERROR] Accuracy failed for key name features.4.block.1.0.weight.grad
```

One observation is that turning off cudnn in the eager mode with
`torch.backends.cudnn.enabled = False` makes the non-deterministic
behavior go away, but then it fails accuracy checking consistently.
The minifier didn't help to narrow down the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98314
Approved by: https://github.com/huydhn
2023-04-04 21:55:23 +00:00
ecf08a0f8b [ROCm] Enable test_filtering_env_var (#84100)
The test "test_filtering_env_var" requires CPU device_type along with GPU.
Hence enable both device_types for ROCm, since the PYTORCH_TESTING_DEVICE_ONLY_FOR env var will have the same effect as the code being removed, making the latter redundant anyway:
9e81c0c3f4/.jenkins/pytorch/test.sh (L54-L59)
9e81c0c3f4/torch/testing/_internal/common_device_type.py (L626-L634)

Signed-off-by: Jagadish Krishnamoorthy <jagdish.krishna@gmail.com>

Enables the test disabled by #56178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84100
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2023-04-04 21:49:53 +00:00
51a978fe7b Set number of threads to be 1 for ARM (#97482) (#98267)
Summary:
In a highly multi-threaded environment, using a number of threads matching hardware_concurrency leads to high contention. The x86 path actually ends up using a different path (the MKL path), which results in using 1 thread for x86 as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98267
Approved by: https://github.com/malfet
2023-04-04 21:24:50 +00:00
aaae588727 [FSDP][Docs] Add warning about forward saving param refs (#98320)
This PR adds a warning following an issue hit by an internal user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98320
Approved by: https://github.com/rohan-varma
2023-04-04 21:11:57 +00:00
66d07e3b19 [FSDP] Only move current FSDP's states to GPU during init (#98319)
Fixes https://github.com/pytorch/pytorch/issues/95813
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98319
Approved by: https://github.com/rohan-varma
2023-04-04 21:03:47 +00:00
d7156175fe [FSDP] Add skip writeback check gated by env var (#98300)
This adds the option to skip the `_writeback_orig_params()` function that checks for parameter and gradient writeback in case storages changed, gated by an env var `FSDP_SKIP_WRITEBACK_CHECK`.

As described in the code comment, this writeback check is important for detecting a failure mode of FSDP `use_orig_params=True`. However, because the failure mode is an atypical case and performing the check incurs nontrivial CPU overhead each iteration, we add this option to skip the check altogether.

<details>
<summary>(Before) Pre-backward hook: 1.044 ms</summary>

![Screenshot 2023-04-04 at 9 05 53 AM](https://user-images.githubusercontent.com/31054793/229800917-9580ce6b-3721-469a-9212-f0cbfd8cbb52.png)

</details>

<details>
<summary>(After) Pre-backward hook: 0.500 ms</summary>

![Screenshot 2023-04-04 at 9 34 57 AM](https://user-images.githubusercontent.com/31054793/229810916-b16295d5-7da7-42c4-9168-04edeebe045c.png)

</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98300
Approved by: https://github.com/rohan-varma
2023-04-04 20:55:09 +00:00
96595617b9 Support Modules with custom __getitem__ method through fallback (#97932)
This PR allows torch.compile to handle torch.nn.Module subclasses with custom __getitem__ methods by falling back to Python.
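An illustrative module (not from the PR) with a custom __getitem__, the case this change lets torch.compile handle by falling back to Python for the indexing:
```python
import torch

class Container(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(torch.nn.Linear(4, 4) for _ in range(2))

    def __getitem__(self, idx):
        return self.layers[idx]

    def forward(self, x):
        return self[0](x) + self[1](x)

print(torch.compile(Container())(torch.randn(2, 4)).shape)
```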

Fixes #97720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97932
Approved by: https://github.com/yanboliang
2023-04-04 20:42:17 +00:00
057911741a [EASY] Teach requires_bwd_pass how to interpret int. (#98312)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98312
Approved by: https://github.com/wconstab
2023-04-04 20:41:26 +00:00
fd0be80dd1 [Dynamo] graph break when calling resize_() on graph input (#98279)
Fixes #97921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98279
Approved by: https://github.com/jansel, https://github.com/eellison
2023-04-04 20:39:12 +00:00
3c36f82fa2 [EASY] Handle new inference csv from CI (#98294)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98294
Approved by: https://github.com/wconstab
2023-04-04 20:37:51 +00:00
75ac6fdcdd Propagate dynamo shape_env to make_fx (#96437)
Currently, when we use the assume_static_by_default flag, dynamo won't produce any symbols for input tensors. But when we pass the dynamo-generated graph onto make_fx via torchdynamo.export(aten_graph=True), there is no way to pass this flag. We enable this by directly passing the fake tensors that dynamo used to make_fx, calling make_fx in "real" mode with those fake tensors.

Note that this is a modified version of (https://github.com/pytorch/pytorch/pull/96143)
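A rough usage sketch of the affected export path (API names as of this change; the toy function is illustrative):

```python
import torch
import torch._dynamo as torchdynamo

def f(x):
    return (x * 2).relu()

# Export through dynamo down to an ATen-level FX graph; with this change,
# the fake tensors (and their shape_env) that dynamo created are forwarded
# to make_fx instead of being re-fakeified.
gm, guards = torchdynamo.export(f, torch.randn(3, 4), aten_graph=True)
print(gm.graph)
```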

Differential Revision: [D44561753](https://our.internmc.facebook.com/intern/diff/D44561753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96437
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-04-04 20:37:30 +00:00
0eab3ab51e [pt2][inductor] add triton.__version__ as cache key, update cache layout (#98010)
Summary:
* change caching to have `system` and `cache` components, where `system` serves as an identifier for that machine's performance. This is similar to the original method of having GPU type and CUDA version be cache keys, and now also includes the Triton version. `cache` is similar to the original cache type, but now without the GPU name or CUDA version

```
{
    "system": {
        "device": "NVIDIA PG509-210",
        "version": {
            "cuda": "11.4.0",
            "triton": "2.1.0"
        },
        "hash": "e7cfb8786d2e1366b3df564bcb2f957d07545e98bf20c98d33a43b6ee80a91e0"
    },
    "cache": {
        "bias_addmm": {
            "[('cuda', 'torch.float32', 2048, 160, 0, 1, 0), ('cuda', 'torch.float32', 2048, 1140, 228148, 1, 206080), ('cuda', 'torch.float32', 1140, 160, 1, 1140, 0)]": {
                "bias_addmm-alpha=1-beta=1-c73frtshmeth2spjun3zc4l2q7ck43wl356pnlmsmxgmzbfsz7ef": 0.03654399886727333,
                "addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf": 0.03564799949526787,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cxgwpjkimm4azwffrfuqniwncnv4h5bxrpo4od4an4bstnh7qrqh": 0.04927999898791313,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cqlirysniekkuuvc4ue33dr4gpfzsb5e4bexarrsnsyei4slxvcz": 0.03651199862360954,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=4-cww5uss3k4d3ei2c4lx63pudyzxdwl3ieibhxcrue4zg424eqrnu": 0.03580800071358681,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-cqcla5edxdm3n6rrkmjehexsudravx6lpphfo5zazldpo3rzpqc4": 0.03558399900794029,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=4-num_warps=8-c7gdf2snt4bjlnuzdy3px4pyq3lbsdh4jp6jaie7lq6mdxccy6nl": 0.03455999866127968,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=64-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cjhcy4scxgy4lxbhjiinvxl3bbrqya63jilcckx2ltsg3mpzxyqr": 0.036288000643253326,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=32-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=5-num_warps=8-cu32a5vsbaln3t55jm2y6xhwgyggejmoatyakcm2huvxofw2zzva": 0.0398080013692379,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=32-BLOCK_M=128-BLOCK_N=128-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=8-croberh4l55jxlrlgkttigtebsnmosycc5rdtbtn3lp3bpovgz4a": 0.0732479989528656,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=64-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=3-num_warps=8-c6oxgunysrqpiwwoinylb3sb2hzvx66yhehma64drqvmz52h3r5t": 0.0306560005992651,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=128-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-cdrev5e3zno6z6flmhlbxgd26gkdpurljyhrw3ovx6pftoe62dpf": 0.04800000041723251,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=64-BLOCK_N=64-EVEN_K=False-GROUP_M=8-num_stages=2-num_warps=4-ce3ofrgngrwuo45hw5wqlzztium7gfkf4n5x25gwu4d6ygkea4bs": 0.0751039981842041,
                "triton_mm-ACC_TYPE='tl.float32'-ALLOW_TF32=True-BLOCK_K=16-BLOCK_M=32-BLOCK_N=32-EVEN_K=False-GROUP_M=8-num_stages=1-num_warps=2-cfkz2smezre4x7hyhc2kbeawhqup6qpwzgiavrai2ghe5ghouvn4": 0.07401599735021591
            },
            ...,
        },
        ...,
    }
}

```

Test Plan:
MAST no global: sw-966772723-OfflineTraining_df2509b8
MAST global: sw-966766969-OfflineTraining_19df7c20

Differential Revision: D44550100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98010
Approved by: https://github.com/jansel
2023-04-04 20:37:22 +00:00
9ddd97e1eb [kineto] make input shape collection opt-in for on-demand tracing (#746) (#97917)
Summary:
X-link: https://github.com/pytorch/kineto/pull/746

Make input shape collection opt-in as we are re-rolling out the feature for on-demand tracing. Making this the default right away could cause bloated trace sizes for inference

Test Plan:
# Repro

## Run
```
buck2 run mode/opt  kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test
```

## Default Path
```
dyno gputrace
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1679681884%2F127.0.0.1%2Flibkineto_activities_2125213.json.gz&bucket=gpu_traces

## Opt in
```
echo -e "CLIENT_INTERFACE_ENABLE_OP_INPUTS_COLLECTION=true" > /tmp/sigrid_kineto.conf && dyno gputrace --gpuconf /tmp/sigrid_kineto.conf
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2F0%2F1679682394%2F127.0.0.1%2Flibkineto_activities_2415763.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D44377341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97917
Approved by: https://github.com/aaronenyeshi
2023-04-04 20:33:20 +00:00
b04f86363f Fix ideep submodule (#98305)
It was incorrectly changed in #97157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98305
Approved by: https://github.com/H-Huang
2023-04-04 19:54:14 +00:00
34c7adf1d7 add Half support for sigmoid on CPU (#96077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96077
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-04-04 18:43:27 +00:00
89dc87a225 Deduplicate pointers to manually free (#98097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98097
Approved by: https://github.com/ngimel
2023-04-04 18:38:47 +00:00
a52cf3398c Revert "Add arm tests to mps workflow (#97279)"
This reverts commit dbd41cfa91170458092057397e4ff68f841b83e3.

Reverted https://github.com/pytorch/pytorch/pull/97279 on behalf of https://github.com/clee2000 due to not needed
2023-04-04 18:32:50 +00:00
558e5a240e Introduce torch.onnx.dynamo_export API (#97920)
This is the first phase of the new ONNX exporter API for exporting from TorchDynamo and FX, and represents the beginning of a new era for exporting ONNX from PyTorch.

The API here is a starting point upon which we will layer more capability and expressiveness in subsequent phases. This first phase introduces the following into `torch.onnx`:

```python
dynamo_export(
    model: torch.nn.Module,
    /,
    *model_args,
    export_options: Optional[ExportOptions] = None,
    **model_kwargs,
) -> ExportOutput:
    ...

class ExportOptions:
    opset_version: Optional[int] = None
    dynamic_shapes: Optional[bool] = None
    logger: Optional[logging.Logger] = None

class ExportOutputSerializer(Protocol):
    def serialize(
        self,
        export_output: ExportOutput,
        destination: io.BufferedIOBase,
    ) -> None:
        ...

class ExportOutput:
    model_proto: onnx.ModelProto

    def save(
        self,
        destination: Union[str, io.BufferedIOBase],
        *,
        serializer: Optional[ExportOutputSerializer] = None,
    ) -> None:
        ...
```
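Based on the surface above, a minimal usage sketch might look like the following (the toy model and input are illustrative, not from the PR):

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU())
x = torch.randn(2, 4)

export_output = torch.onnx.dynamo_export(
    model,
    x,
    export_options=torch.onnx.ExportOptions(dynamic_shapes=False),
)
export_output.save("model.onnx")
```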

In addition to the API in the first commit on this PR, we have a few experiments for exporting Dynamo and FX to ONNX that this PR rationalizes through the new Exporter API and adjusts tests to use the new API.

- A base `FXGraphModuleExporter` exporter from which all derive:
  - `DynamoExportExporter`: uses dynamo.export to acquire FX graph
  - `DynamoOptimizeExporter`: uses dynamo.optimize to acquire FX graph
  - `FXSymbolicTraceExporter`: uses FX symbolic tracing

The `dynamo_export` API currently uses `DynamoOptimizeExporter`.

### Next Steps (subsequent PRs):

* Combine `DynamoExportExporter` and `DynamoOptimizeExporter` into a single `DynamoExporter`.
* Make it easy to test `FXSymbolicTraceExporter` through the same API; eventually `FXSymbolicTraceExporter` goes away entirely when the Dynamo approach works for large models. We want to keep `FXSymbolicTraceExporter` around for now for experimenting and internal use.
* Parameterize (on `ExportOptions`) and consolidate Dynamo exporter tests.
  - This PR intentionally leaves the existing tests unchanged as much as possible except for the necessary plumbing.
* Subsequent API phases:
  - Diagnostics
  - Registry, dispatcher, and Custom Ops
  - Passes
  - Dynamic shapes

Fixes #94774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97920
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms, https://github.com/thiagocrepaldi, https://github.com/shubhambhokare1
2023-04-04 18:13:29 +00:00
fe9da29842 [FSDP][Easy] Remove unused requires_grad_mask (#98299)
Follow-up to https://github.com/pytorch/pytorch/pull/98221 to clean up the unused `requires_grad_mask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98299
Approved by: https://github.com/rohan-varma
2023-04-04 17:49:35 +00:00
4934dde310 Cleanup redundant CI jobs (#98044)
This cleanup some redundant CI jobs that I found:

* @malfet @ZainRizvi  Do we need the debug build in periodic for both 11.8 and 11.7?   This is rarely needed AFAIK.  I'm removing 11.8 here while keeping 11.7 to be consistent with the rest of the CI.  Or maybe it should be the other way around and we keep 11.8
* Remove the libtorch 11.7 and 11.8 builds in periodic as they are already covered in [trunk](https://github.com/pytorch/pytorch/blob/master/.github/workflows/trunk.yml#L86-L97)
* Clean up TSAN (I added this a while back, but there is no drive to go into that further, so let's just kill it) - If you want to keep it, please raise your hand.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4b3ec53</samp>

This pull request simplifies and consolidates the scripts and workflows for the thread sanitizer (TSAN) build and test configuration. It removes redundant and outdated logic, files, and workflows that were previously used to handle the TSAN build differently from the regular build. It enables all the tests for the TSAN build, which has been fixed by another pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98044
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2023-04-04 17:07:58 +00:00
10271a60a8 [FSDP] Skip _use_sharded_views() for SHARD_GRAD_OP (#98250)
This PR has `SHARD_GRAD_OP` (and `_HYBRID_SHARD_ZERO2`) skip `_use_sharded_views()` in the post-forward reshard since the strategy does not free the unsharded flat parameter and can preserve the unsharded views. This saves nontrivial CPU overhead both in the post-forward reshard (`_use_sharded_views()`) and the pre-backward unshard (`_use_unsharded_views()`).
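For reference, a minimal sketch of selecting this strategy (assumes a process group is already initialized, e.g. via torchrun, and a GPU is available; the Linear model is a placeholder):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# SHARD_GRAD_OP keeps the unsharded flat parameter alive after forward,
# which is what lets the post-forward _use_sharded_views() call be skipped.
model = FSDP(
    torch.nn.Linear(8, 8).cuda(),
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
)
```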

<details>
<summary>(Before) Pre-backward hook: 4.356 ms</summary>

<img width="812" alt="Screenshot 2023-04-03 at 6 32 19 PM" src="https://user-images.githubusercontent.com/31054793/229641309-778cf1f9-4b5b-42ec-b2d8-0a1e6e7ce330.png">

</details>

<details>
<summary>(After) Pre-backward hook: 1.044 ms</summary>

![Screenshot 2023-04-04 at 9 05 53 AM](https://user-images.githubusercontent.com/31054793/229800917-9580ce6b-3721-469a-9212-f0cbfd8cbb52.png)

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98250
Approved by: https://github.com/rohan-varma
2023-04-04 17:07:28 +00:00
69f1131178 Bring the fix to flaky missing libzstd on MacOS M1 to its build job (#98236)
I did this fix https://github.com/pytorch/pytorch/pull/92737 a while back on the MacOS M1 test job and haven't seen this flaky issue [Library not loaded: @rpath/libzstd.1.dylib](https://hud.pytorch.org/failure/Library%20not%20loaded%3A%20%40rpath%2Flibzstd.1.dylib) there again.  Recently, we started to build on M1 runners, and the flaky failure started to show up there too, i.e. https://github.com/pytorch/pytorch/actions/runs/4599605256/jobs/8125180118.  So I'm bringing the same fix to the MacOS M1 build job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98236
Approved by: https://github.com/ZainRizvi
2023-04-04 17:02:13 +00:00
ba6bc5080f Fix fused_8bit_rowwise_conversion_ops_test (#98183)
Summary:
This test tests an operator that quantizes and serializes a float array.
Among the data serialized, one element is the bias, i.e. the minimum value in the array.

The test may fail when the array contains both +0.0 and -0.0, while all other elements are positive.
(this happens quite frequently with a hypothesis version >= 6.17.4, due to [this issue](https://github.com/HypothesisWorks/hypothesis/issues/3606))
Depending on the exact settings of SIMD (single instruction, multiple data), the elements of the array may be visited in different orders while running the operator and while calculating the reference.
Because +0.0 and -0.0 compare equal, the minimum value may be either +0.0 or -0.0.
Nevertheless, the serialized forms of these two values differ in the sign bit, and can make the test fail because it's conducting an exact match on the serialized result.

To avoid this failure, I'm adding a line to replace all -0.0 with +0.0 in the input array.
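The underlying gotcha can be reproduced directly in Python (a small illustration, not code from the test):

```python
import struct

# +0.0 and -0.0 compare equal...
assert 0.0 == -0.0

# ...but their serialized (bit-level) forms differ in the sign bit, which is
# exactly what breaks an exact match on the serialized result.
print(struct.pack("<f", 0.0).hex())   # 00000000
print(struct.pack("<f", -0.0).hex())  # 00000080
```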

Test Plan:
Run this with both hypothesis < 6.17.4 and >= 6.17.4:
```
buck2 test mode/opt caffe2/caffe2/python:fused_8bit_rowwise_conversion_ops_test - test_quantize_op
```

Differential Revision: D44617022

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98183
Approved by: https://github.com/malfet
2023-04-04 16:06:13 +00:00
23a9e08d0d [bazel] Move torch/csrc/distributed/c10d/quantization/quantization_gpu.cu (#98188)
Fixes #79236

Avoid kernel de-registration problems in bazel by virtue of having a single cuda kernel lib.

Test plan: cherry-picked on a branch where we run all GPU tests and verified that this fixes majority of the tests.
https://github.com/pytorch/pytorch/actions/runs/4593347787/jobs/8111184857?pr=96202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98188
Approved by: https://github.com/malfet
2023-04-04 14:44:42 +00:00
42cbf7120a Add parentheses to FloorDiv. (#98290)
[SymPy incorrectly prints](https://github.com/sympy/sympy/issues/25026) multiplications that:
- Have a negative term; and
- Have a custom operation of smaller precedence than `Mul` (except for `Add` -- which is expanded)

I have observed this behavior when running `maml` with dynamic shapes (errors out on `master`, though). There was, for example, the following guard:

```python
# vars[12].size()[2] == 2
# vars[0].size()[2] == 3
# x.size()[2] == 28

vars[12].size()[2] ** 2 - \
    2 * vars[12].size()[2] * \
    (-vars[0].size()[2] + (-vars[0].size()[2] + (x.size()[2] - vars[0].size()[2]) // 2 + 1) // 2 + 1) // 2
```

Which translates into:

```python
>>> 2 ** 2 - 2 * 2 * (-3 + (-3 + (28 - 3) // 2 + 1) // 2 + 1) // 2
>>> 2 ** 2 - 2 * 2 * (-3 + (-3 + 25 // 2 + 1) // 2 + 1) // 2
>>> 2 ** 2 - 2 * 2 * (-3 + 10 // 2 + 1) // 2
>>> 2 ** 2 - 2 * 2 * 3 // 2
>>> 4 - 2 * 2 * 3 // 2  # floordiv and mul have same precedence in Python
>>> -2
```

Now, placing the parentheses correctly (this PR), we get:

```python
>>> 4 - 2 * 2 * (3 // 2)
>>> 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98290
Approved by: https://github.com/ezyang
2023-04-04 13:50:53 +00:00
0b31f87c18 [FSDP] Use correct handle training state when prefetching (#98249)
This PR ensures that when prefetching a `FlatParamHandle.unshard()`, we temporarily set the `FlatParamHandle._training_state` to the expected training state as if the `unshard()` were not prefetched since the `as_params` argument to `_use_unsharded_views()` depends on the handle's training state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98249
Approved by: https://github.com/rohan-varma
2023-04-04 13:34:02 +00:00
950431c334 extract out a caffe2 macros library (#98156)
Slowly carving out the minimal caffe2 dependencies to build PyTorch.

Differential Revision: [D44609764](https://our.internmc.facebook.com/intern/diff/D44609764/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44609764/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98156
Approved by: https://github.com/ezyang, https://github.com/PaliC
2023-04-04 10:04:21 +00:00
f6272ce79d [FSDP] Allow non-uniform requires_grad for use_orig_params=True (#98221)
Closes https://github.com/pytorch/pytorch/issues/91167.

Differential Revision: [D44660134](https://our.internmc.facebook.com/intern/diff/D44660134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98221
Approved by: https://github.com/rohan-varma
2023-04-04 09:47:43 +00:00
301f00f350 generate caffe2/core/macros.h in shared build structure (#98131)
This is only used by Bazel for now.

Differential Revision: [D44604078](https://our.internmc.facebook.com/intern/diff/D44604078/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44604078/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98131
Approved by: https://github.com/ezyang, https://github.com/PaliC
2023-04-04 09:23:03 +00:00
d47a4bf53f Align settings for new device key. (#98224)
Summary: As title.

Test Plan: All CI tests should pass.

Reviewed By: yuhc

Differential Revision: D44341331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98224
Approved by: https://github.com/jackm321, https://github.com/ezyang
2023-04-04 08:39:11 +00:00
86505c692f Disable inductor/test_minifier on ASAN (#98263)
This is to mitigate the timeout issue on ASAN https://github.com/pytorch/pytorch/issues/98262.  This test is slow on ASAN, and that seems to cause problems when correctly computing the number of shards needed to run it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98263
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-04-04 08:10:18 +00:00
e7874eea7a fix the use of incomplete vector<T> for C++20 compatibilities (#93978)
Avoid referring to std::vector<T> members and constructor/desctructors when T is incomplete.

Referring to incomplete members is [not legal](https://timsong-cpp.github.io/cppwp/n4868/vector#overview-4) according to the C++ standard.

Non-noexcept constructors need access to members' destructors. As of C++20, std::vector's destructor is constexpr and so forcefully requires a complete type for the vector's elements.

These issues cause build errors in newer toolchains under c++20 mode.

Fix them by moving code that needs complete types to a different place where the type is already defined.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93978
Approved by: https://github.com/Skylion007
2023-04-04 07:47:43 +00:00
a9c7e882ac [Dynamo] Support skip fbcode modules (#98192)
Fix Meta internal use case:
* We are going to skip tracing ```torchrec.distributed```; however, in fbcode, the structure is a bit different from OSS torchrec.
* Meta internally uses ```torch.package```, so we should support skipping the tracing of files like ```<torch_package_0>.torchrec/distributed/...```.
* We put the logic behind a flag ```is_fbcode``` to avoid misuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98192
Approved by: https://github.com/yf225
2023-04-04 06:33:55 +00:00
d16a9b7676 [inductor] be able to enable max-autotune and cudagraphs independently (#98255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98255
Approved by: https://github.com/williamwen42
2023-04-04 06:12:46 +00:00
7eaaefafb3 Revert "Extend TensorImpl with BackendMeta (#97429)"
This reverts commit bc38b278bf4c2890700f8fe751cfd15fcb01da60.

Reverted https://github.com/pytorch/pytorch/pull/97429 on behalf of https://github.com/huydhn due to Sorry for reverting your PR as I am trying to root cause a libtorch build failure on Windows starting from your change bc38b278bf.  AFAICT, there is no other change from the log.  I will reland this if the failure is unrelated
2023-04-04 05:13:18 +00:00
8f2f1a0b32 [torch/fx] add torch/utils/_stats.py to stack frame skiplist (#98117)
We added some @count decorators to stuff that show up now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98117
Approved by: https://github.com/SherlockNoMad
2023-04-04 05:03:56 +00:00
1fae179ee1 add support for SymNodeVariable in getitem_const (#97756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97756
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-04-04 03:33:25 +00:00
b109083098 [quant][pt2e][refactor] Remove backend_config from _maybe_insert_input_observers_for_node (#98094)
Summary:
The goal is to remove the need to use backend_config when the pt2e flow code calls this function

Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98094
Approved by: https://github.com/jcaip
2023-04-04 03:18:24 +00:00
bc38b278bf Extend TensorImpl with BackendMeta (#97429)
BackendMeta offers a binary interface for the backend to attach arbitrary data to TensorImpl. TensorImpl has exactly one "slot" for backend metadata; however, the backend is free to compose any structure that is opaque to the framework beyond inheriting the standard BackendMeta base.

Change-Id: I670fcdd16dd1c2b00f7eaa1cbc5b5dfea59a6221

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97429
Approved by: https://github.com/ezyang
2023-04-04 03:01:14 +00:00
c5963b7792 [vision hash update] update the pinned vision hash (#98261)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98261
Approved by: https://github.com/pytorchbot
2023-04-04 02:56:49 +00:00
4431509a54 introduce c10::DataPtr::mutable_get() and use it in c10 (#98217)
Differential Revision: [D44629940](https://our.internmc.facebook.com/intern/diff/D44629940/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98217
Approved by: https://github.com/ezyang
2023-04-04 02:26:18 +00:00
fa08e546f3 Revert "Add all_reduce_coalesced functional collective (#97157)"
This reverts commit a3fc3531f514d4c01de9c4a60f978d704d615494.

Reverted https://github.com/pytorch/pytorch/pull/97157 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it seems to have a land race with https://github.com/pytorch/pytorch/pull/96226 and fails lint on trunk
2023-04-04 01:50:49 +00:00
177994eb54 [inductor] [cpp] fix bitwise codegen (#98056)
Fixes #97968

Fix to maintain the data type after doing bitwise operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98056
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-04-04 01:33:31 +00:00
0f151ad2ed Inductor cpp wrapper: support LinearUnary (#97655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97655
Approved by: https://github.com/jansel
2023-04-04 01:31:15 +00:00
0e2bde3000 Create script to upload test aggregation data (#97954)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 79f1b37</samp>

This pull request improves the workflow and data processing for uploading contribution and testing statistics to Rockset and S3. It renames and updates a workflow file, removes unused code from a script, and adds a new script to aggregate and upload test results.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97954
Approved by: https://github.com/huydhn
2023-04-04 01:28:08 +00:00
4cf3e7c255 [dynamo benchmarks] Fix inference benchmark runs (#98248)
Update flags for dynamo inference benchmark runs. Add flag to not compute regressions/metric graphs (useful if there aren't previous runs to compare with).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98248
Approved by: https://github.com/shunting314
2023-04-04 01:24:13 +00:00
96ad739ddc Added ModuleInfos for {*}Norm modules (#97919)
Not adding Lazy variants yet pending investigation of #97915

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97919
Approved by: https://github.com/albanD
2023-04-04 01:15:25 +00:00
a3fc3531f5 Add all_reduce_coalesced functional collective (#97157)
Inductor codegen is suboptimal when calling all_reduce_coalesced with input args. We need to fix inductor's calling convention for that, or something else.

Might not work if any output is unused.

Test code:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F
from functorch import make_fx
import os

import torch.distributed._functional_collectives as ft_c
from torch.testing._internal.common_distributed import (
    spawn_threads_and_init_comms,
)
from torch._inductor.compile_fx import compile_fx_inner

def my_fun(a, b):
    c = a * 3
    tensors = ft_c.all_reduce_coalesced([a, c, b], "sum", [0])
    return ((tensors[1] + tensors[0] + tensors[2]).sum(), )

@spawn_threads_and_init_comms(world_size=1)
def inductor_main(self):

    x = torch.arange(4).cuda() * (dist.get_rank() + 1)
    y = torch.arange(4).cuda() * (dist.get_rank() + 1)
    x = x.to(torch.float)
    y = y.to(torch.float) * 0.5
    res = make_fx(my_fun)(x, y)
    print(f"fx graph:\n{res.graph}")
    ind = compile_fx_inner(res, [x, y])
    print(f"inductor done:\n{ind}")

os.environ["PROXY_TENSOR_TRACING"] = "1"
os.environ["TORCH_COMPILE_DEBUG"] = "1"
torch._dynamo.config.output_code = True

if __name__ == "__main__":
    inductor_main(None)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97157
Approved by: https://github.com/fegin
2023-04-04 01:13:18 +00:00
69ff39d2e7 Skip gat, gcn and sage for TorchBench CUDA test (#98244)
Summary: The three models only support CPU for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98244
Approved by: https://github.com/ezyang
2023-04-04 01:06:18 +00:00
f386312ec9 [PyTorch] Don't do extra numel() check in TensorImpl::data() (#98090)
`is_empty()` checks `numel() == 0`, but we don't need to access `numel_` at all (or the policy that `numel()` checks) in our happy path -- we just need the data pointer from `storage_`. Let's do the check we need to do using only the data we strictly need, rather than adding instructions loading other pieces of data.

Differential Revision: [D44586464](https://our.internmc.facebook.com/intern/diff/D44586464/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98090
Approved by: https://github.com/Skylion007
2023-04-04 00:59:52 +00:00
9ad66dd588 Switch reduce_scatter and all_gather in DeviceMesh to use functional collectives (#96226)
Among the changes is the introduction of gather_dim and scatter_dim in DeviceMesh collectives to simplify user code.

The current plan is to keep padding and gather/scatter dim support in DeviceMesh while we explore  optimization opportunities in Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96226
Approved by: https://github.com/wanchaol
2023-04-04 00:58:33 +00:00
2ac9086987 run buildifier on unified build files (#98141)
This is pretty tricky. buildifier by default doesn't do much to these
files. It does a little more if you tell it that they are
`BUILD.bazel` files with -type=build. But it can do even more if you
remove the target definitions from the `def define_rules()` wrapper
and dedent them.

I wrote a little wrapper that does that. I'll submit it at a later
date.

Differential Revision: [D44606558](https://our.internmc.facebook.com/intern/diff/D44606558/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44606558/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98141
Approved by: https://github.com/ezyang, https://github.com/PaliC
2023-04-04 00:37:19 +00:00
b1e60bfb6a Pass f_locals as a dict rather than kwargs (#98107)
Fixes https://github.com/pytorch/pytorch/issues/97688

One big problem is that instead of printing x < y we now print
`E["x"] < E["y"]` and now all of the tests wobbled and I'm mad.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98107
Approved by: https://github.com/ezyang
2023-04-04 00:30:08 +00:00
b96fe9b61c Fix issues related to ClassInstantier in HF models (#97997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997
Approved by: https://github.com/anijain2305
2023-04-04 00:01:08 +00:00
4d13fcddef [spmd expansion] support torch.ops.aten.sym_numel (#98229)
The current logic assumes non-overload ops take two arguments; however, torch.ops.aten.sym_numel takes one.

Differential Revision: [D44615037](https://our.internmc.facebook.com/intern/diff/D44615037/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98229
Approved by: https://github.com/mrshenli
2023-04-03 23:57:10 +00:00
a6bd21d935 [Dynamo] Eagerly initializing Lazy Module to reduce graph breaks (#97946)
Fixes a Meta internal use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97946
Approved by: https://github.com/wconstab
2023-04-03 22:24:43 +00:00
96f548a1ac [inductor] Add an AOT mode for the Triton backend (#98214)
Summary:
This is a copy of https://github.com/pytorch/pytorch/pull/97152 to make
the landing easier.

This PR implements a two-pass wrapper codegen for the Triton
backend to achieve ahead-of-time compilation. In the first pass, the
regular python wrapper code will be generated, and then the generated
code will be executed to perform Triton compilation and autotuning.
After that, the second pass wrapper codegen will generate C++ wrapper
with proper CUDA API to load and launch Triton-generated CUDA kernels.

Like the AOT mode for the cpp backend, the next step would be to provide
a more complete API for AOT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98214
Approved by: https://github.com/eellison
2023-04-03 22:19:18 +00:00
73b06a0268 Fix rendering of arguments for nn.functional ops that use boolean_dispatch (#98092)
Fix #97982

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98092
Approved by: https://github.com/albanD
2023-04-03 21:17:43 +00:00
eeb18d1e54 Fix dynamo tests and re-enable internally (#97937)
Summary:
`:test_dynamo` has been broken for a long time internally at Meta. This PR fixes the broken test and re-enables it internally.
- Using the root `pytest.ini` for pytest
- Decouple tests so that one can be disabled without affecting others
- Temporarily disable the test cases that require additional effort to fix

**OSS CI doesn't provide test code coverage info. Meta internal test infra does. The value of re-enabling these tests internally is not only to collect test coverage info but also to help fbcode developers build/test from fbcode.**

Test Plan:
`buck test mode/dev-nosan //caffe2/test:test_dynamo`
https://www.internalfb.com/intern/testinfra/testrun/7318349540623516

Differential Revision: D44325238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97937
Approved by: https://github.com/ezyang
2023-04-03 20:47:13 +00:00
3654552b8c add deterministic impl for scatter and scatter_reduction sum/mean mode (#98060)
using the existing deterministic implementation via `index_put`, which is based on sorting indices.

With the `accumulate` arg in `index_put`, this can work for both scatter and scatter_reduce with sum/mean reduction mode.
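A rough illustration of the equivalence being exploited (a sketch, not the actual kernel code):

```python
import torch

torch.use_deterministic_algorithms(True)  # opt in to deterministic kernels

src = torch.randn(6)
index = torch.tensor([0, 2, 2, 1, 0, 2])
out = torch.zeros(3)

# scatter-add expressed via index_put_ with accumulate=True, whose
# deterministic implementation sorts indices; a mean reduction can then be
# recovered by dividing by per-bucket counts.
out.index_put_((index,), src, accumulate=True)

ref = torch.zeros(3).scatter_add(0, index, src)
assert torch.allclose(out, ref)
```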

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98060
Approved by: https://github.com/mikaylagawarecki
2023-04-03 20:38:29 +00:00
13f169c9da Per Channel in brack-propagation function (#97475)
Summary:
Supporting Per Channel quantization in the gradient computation function.

One workaround that I have added here:
current QNNPACK is not designed to process a [transposed weight](https://fb.workplace.com/groups/pytorch.edge.users/permalink/1283737025829921/), so
here we are simply replacing Per Channel with Per Tensor to compute a gradient (some slow learning curve or WER degradation might be expected - we don't know, nothing is guaranteed).

Test Plan:
You can create your own synthetic model
(FP32 layer -> INT8 layer with Per Channel) and see if the loss is decreasing

Differential Revision: D43898794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97475
Approved by: https://github.com/weiwangmeta
2023-04-03 20:34:44 +00:00
8e5f57a2b1 add users to external contribution metrics (#97928)
:copilot summary
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97928
Approved by: https://github.com/kit1980
2023-04-03 19:52:31 +00:00
1ea528ef24 [bf16] bf16 support for conv_depthwise3d (#97819)
Add bf16 for this op

Differential Revision: [D44473429](https://our.internmc.facebook.com/intern/diff/D44473429/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97819
Approved by: https://github.com/fegin
2023-04-03 19:31:27 +00:00
55afaa46a4 Support functools.partial and itertools.product (#98120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98120
Approved by: https://github.com/anijain2305
2023-04-03 18:23:25 +00:00
2c905f2152 Extend Pattern Matcher to allow handling split-cat style patterns (#97726)
Summary:
This diff extends the pattern matcher by adding a few features which allow it to handle split-getitem-cat style patterns.

3 problems I encountered were:

1. In the handler, I only need one Arg() (the one which is the first input to split). None of the other args are relevant to the replacement graph.  So, we add a new Ignored() pattern for ignored args

2. The pattern matching was visiting the split node again and again during the DFS. By propagating the patterns with _users>1 or Any into the child MatchContext, we avoid this problem.

3. To avoid the unbundling issue, I switched to using KeywordArg() instead of Arg() - as for this pattern, we need a flat list of Arg() in the end

Example pattern: https://www.internalfb.com/intern/anp/view/?id=3325856

```
pass_patterns.append(defaultdict(list))

register_replacement_pattern(
    CallFunction(
        aten.cat,
            ListOf( CallFunction(operator.getitem, CallFunction(aten.split_with_sizes, KeywordArg("input_"), Ignored(), Ignored(), _users=Any),
                                                   Ignored()
                                                   ),),
        Ignored()
    ),
    pass_number=3
)
def split_cat_replace(input_):
    return input_
```

Test Plan: https://www.internalfb.com/intern/anp/view/?kernel=default&id=3317105

Reviewed By: jansel

Differential Revision: D44282499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97726
Approved by: https://github.com/jansel
2023-04-03 17:30:56 +00:00
095c129bd3 [CI] Add inference run for the performance dashboard (#98174)
Summary: Remove fp32 training performance run and trade for amp inference
performance run.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98174
Approved by: https://github.com/huydhn
2023-04-03 17:29:55 +00:00
ba7ee00f00 Add a --inference flag to dynamo benchmark script (#98173)
Summary: When calling benchmark scripts, make it a requirement to pass
--inference or --training

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98173
Approved by: https://github.com/huydhn
2023-04-03 17:12:28 +00:00
5a54eb0b15 [caffe2] miniz fix -Wstrict-prototypes (#98027)
Summary: this fixes -Wstrict-prototypes

Test Plan: eyes

Reviewed By: rmaz

Differential Revision: D44556017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98027
Approved by: https://github.com/albanD
2023-04-03 16:56:47 +00:00
0f0c1b6516 Flip back switch (#98099)
There are some errors occurring on the benchmark - switch back to the old cudagraph impl until they are figured out

https://torchci-git-fork-huydhn-add-compilers-bench-74abf8-fbopensource.vercel.app/benchmark/compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98099
Approved by: https://github.com/desertfire
2023-04-03 14:46:33 +00:00
55daa835e9 Added allowed_workflows to pytorch probot (#98082)
Added allowed_workflows to pytorch probot. This is a follow up PR [regarding the retry bot](https://github.com/pytorch/test-infra/pull/3942/files#diff-ee5e4f1e1fa962c6f62e5dcebde6e0bab573e74474601bf5749ccb668fd9c900R14-R16).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98082
Approved by: https://github.com/huydhn
2023-04-03 12:30:43 +00:00
ced5c89b6f add explicit vectorization for Half dtype on CPU (#96076)
This patch is part of the half-float performance optimization on CPU:
* add a specialization for dtype `Half` in `Vectorized<>` under both avx256 and avx512.
* add a specialization for dtype `Half` in the functional utils, e.g. `vec::map_reduce<>()`, which uses float32 as the accumulation type.

Also add a helper struct `vec_hold_type<scalar_t>`, since Vectorized<Half>::value_type points to its underlying storage type, which is `uint16_t`, leading to errors if the kernel uses `Vec::value_type`.

Half uses the same logic as BFloat16 in `Vectorized<>`: each half vector is mapped to two float vectors for computation.

Notice that this patch modifies the cmake files by adding **-mf16c** for the AVX2 build; from https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html, we can see that all the hardware platforms that support **avx2** already have **f16c**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96076
Approved by: https://github.com/malfet
2023-04-03 10:58:37 +00:00
c99895ca6f Move pull and trunk slow tests to periodic (#98040)
I notice that we are running some slow tests for CPU and `sm86` on pull and trunk.  They take much longer to run than other shards (1.5x to 2x longer).  I propose that we move them to periodic instead. Thoughts?

The correlation between them are:

* `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (slow)` and `linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test (default)` is 0.93
* `linux-bionic-py3.8-clang9-slow / test (slow)` and `linux-bionic-py3.8-clang9 / test (default)` is 0.98

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at db56750</samp>

This pull request updates the `.github/workflows` files to optimize the testing workflows for PyTorch. It adds new periodic workflows for more platforms and configurations, and removes some redundant or slow workflows from the pull and trunk workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98040
Approved by: https://github.com/malfet
2023-04-03 08:13:12 +00:00
c597d9c1f2 Revert "Inductor cpp wrapper: support LinearUnary (#97655)"
This reverts commit d03003ab8e0e00ff4c9e2b80065cf90a8fcef92d.

Reverted https://github.com/pytorch/pytorch/pull/97655 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it looks like the change causes a regression on CPU test time d03003ab8e  (inductor/test_cpp_wrapper.py)
2023-04-03 08:09:58 +00:00
d03003ab8e Inductor cpp wrapper: support LinearUnary (#97655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97655
Approved by: https://github.com/jansel
2023-04-03 04:26:10 +00:00
0c1f524b92 Inductor cpp wrapper: support MKLPackedLinear (#90755)
Invoke `torch.ops.mkl._mkl_linear` from c++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90755
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-04-03 04:07:38 +00:00
5d62d12557 [Inductor] support transpose vertical reduction in cpp (#97781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97781
Approved by: https://github.com/jansel
2023-04-03 02:02:15 +00:00
76074dc0a3 Improve support for dict subclasses (#98154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98154
Approved by: https://github.com/anijain2305
2023-04-03 01:42:08 +00:00
bf22ecba2a [Inductor] support vertical reduction in cpp (#97644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97644
Approved by: https://github.com/jansel
2023-04-03 01:29:12 +00:00
35b3309539 Fix graph break from inline patched init (#98150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98150
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2023-04-03 01:11:30 +00:00
8e5f491623 [Inductor] simplify CPP backend Tile2D code and support non-contiguous load/store (#97626)
Remove `CppTile2DTailKernel` and `CppTile2DKernelChecker` and reuse `CppVecKernel` and `CppVecKernelChecker` for them. Add vectorization with fallback for load/store in CppVecKernel for the non-contiguous load/store needed by `CppTile2DTailKernel`.

This PR also adds functional support for transposed copy of bfloat16 data types. Better performance requires vectorized intrinsics implemented for at::vec::transpose_mxn. cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97626
Approved by: https://github.com/jansel
2023-04-03 01:11:20 +00:00
71d850a100 [inductor] Fallback on complex64 kernels (#98155)
Later PRs in this stack fix graph breaks in GoogleFnet, which triggers errors from inductor trying to compile torch.complex64; this PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155
Approved by: https://github.com/anijain2305, https://github.com/ngimel
2023-04-03 01:06:43 +00:00
bc9dd969e1 Support inlining no_grad() decorator (#98121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98121
Approved by: https://github.com/anijain2305, https://github.com/voznesenskym
2023-04-03 00:24:56 +00:00
96403cfcec [Easy] Fix lint error on DTensor math_ops.py (#98170)
This lint error is caused by conflicts between #97996 and #98148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98170
Approved by: https://github.com/yifuwang
2023-04-02 19:11:05 +00:00
02179827cb [Easy] Include SPMD and DTensor files in UFMT checks (#98148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98148
Approved by: https://github.com/fegin
2023-04-02 15:34:49 +00:00
38609cc47d TensorExpr eval: fix copying variables from pointers on big endian systems (#96951)
When copying data from pointers, only the lowest bytes are copied. On little-endian systems they are located at the beginning of the pointer; on big-endian systems they are located at the end of the pointer.

This change fixes TestTensorExprPyBind::test_dynamic_shape and TestTensorExprPyBind::test_dynamic_shape_2d tests from test/test_tensorexpr_pybind.py on big endian systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96951
Approved by: https://github.com/ezyang, https://github.com/EikanWang
2023-04-02 12:49:14 +00:00
2ab18a23e1 Update ideep submodule (#97430)
### Description

This PR updates the ideep submodule in the following two aspects:

1. At the inductor side, we are supporting the dynamic shape path for packed linear, where we hope the packed weight of linear does not depend on the input shapes and we can still get good performance using a packed weight obtained from dummy input shapes. However, the current ideep has an accuracy issue for this case. This update fixes the issue.
2. Add an extra arg is_channels_last for deconv to notify ideep whether to go channels last or not, because the memory format checks of ideep (e.g. is_nhwc(), is_ndhwc()) are not 100% identical to suggest_memory_format() from pytorch.

### Performance Benchmark

Use TorchBench test in ICX with 40 cores
Intel OpenMP & tcmalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/229072474-193513ba-6727-4451-91ff-0d57e016736f.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97430
Approved by: https://github.com/jgong5
2023-04-02 06:42:09 +00:00
347c67d4a2 [Easy] Consolidate string startswith checks (#98147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98147
Approved by: https://github.com/fegin
2023-04-02 04:02:37 +00:00
7fcff01b50 [reland] switch mean to use reduction linear (#97996)
mean is actually a reduction linear formula if the final reduction
is a partial sum (which it currently is), so switch to using that instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97996
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
2023-04-02 03:19:56 +00:00
d9e5ab4606 Fix graph break from 'hasattr: HFPretrainedConfigVariable()' (#98119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98119
Approved by: https://github.com/anijain2305
2023-04-02 02:56:45 +00:00
b9d3b3f595 Improve support for contextlib.nullcontext (#98111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98111
Approved by: https://github.com/anijain2305
2023-04-02 02:33:14 +00:00
92b46202ef Add --stats option to benchmark scripts (#98109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98109
Approved by: https://github.com/anijain2305
2023-04-02 02:23:13 +00:00
e402259b8a avoid warning in irange for unsigned types (#97973)
Values of unsigned types should not be compared for being less than zero.

Differential Revision: [D44538384](https://our.internmc.facebook.com/intern/diff/D44538384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97973
Approved by: https://github.com/Skylion007
2023-04-01 23:52:37 +00:00
2af09393f9 masked_scatter should accept only bool masks (#97999)
Modify test_torch to check that an assertion is raised in this case.

torch.uint8 usage has been deprecated for a few releases, and errors have been raised for other dtypes on the CUDA device, but not on CPU.
This PR finally restricts the mask to just `torch.bool`
See https://github.com/pytorch/pytorch/pull/96594 as an example doing it for `torch.masked_fill`
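A small sketch of the now-enforced contract (illustrative values):

```python
import torch

t = torch.zeros(4)
src = torch.arange(4, dtype=torch.float)

mask_bool = torch.tensor([True, False, True, False])
t.masked_scatter_(mask_bool, src)        # OK: bool mask

mask_uint8 = mask_bool.to(torch.uint8)
# After this change a non-bool mask raises an error on CPU too, instead of
# only being deprecated/warned about:
# t.masked_scatter_(mask_uint8, src)     # RuntimeError
```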

Fixes https://github.com/pytorch/pytorch/issues/94634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999
Approved by: https://github.com/ngimel
2023-04-01 23:25:25 +00:00
bbc4e911c8 Move CPUReproTests to its own file (#97943)
test_torchinductor has gotten too big (almost 10k lines); this stack is trying to split it into smaller pieces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97943
Approved by: https://github.com/ngimel
2023-04-01 22:39:49 +00:00
db8abde9b6 [MPS] Enable conditional indexing tests (#97871)
The tests seem to be working now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97871
Approved by: https://github.com/kulinseth
2023-04-01 16:15:08 +00:00
e8d39606eb [SPMD] Enable fused Adam in full train step tracing (#98113)
Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98113
Approved by: https://github.com/yifuwang, https://github.com/fegin
2023-04-01 15:54:13 +00:00
bccf2ef0ce Format DTensor dispatch.py and _meta_registrations.py (#98114)
Format-only changes with black and lintrunner to prepare for the commit on top.

Differential Revision: [D44603809](https://our.internmc.facebook.com/intern/diff/D44603809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98114
Approved by: https://github.com/yifuwang, https://github.com/fegin
2023-04-01 15:54:13 +00:00
64077ce511 remove redundant typed StorageImpl::data() member (#97650)
This has the same implementation as the unsafe variants and the unsafe
variants match the original semantics of the code, given that they
don't check that the type matches.

Given that we're updating callsites anyways to address the mutability
aspect, we might as well just drop this method now.

Differential Revision: [D44410210](https://our.internmc.facebook.com/intern/diff/D44410210/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97650
Approved by: https://github.com/ezyang
2023-04-01 08:16:54 +00:00
13461e9767 [inductor] more cuda metrics in wrapper (#97723)
The following metrics should be helpful:
- percent of time the GPU is busy
- percent of time various categories of kernels (e.g. pointwise/reduction triton kernels) take
- percent of time each individual kernel takes compared to the total wall time of the benchmark

This PR adds those.

Example result from the hf_Bert inference graph:

```
  == triton_pointwise category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_poi_fused_gelu_6_0d1d                  0.48154  12.0     5.52%
triton_poi_fused_clone_1_0d1d2                0.29011  24.0     3.33%
triton_poi_fused_clone_2_0d1d2                0.17417  12.0     2.00%
triton_poi_fused_clone_4_0d1d2                0.10797  12.0     1.24%
Total                                         1.05379           12.08%

  == triton_persistent_reduction category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
triton_per_fused__softmax__to_                0.97188  12.0     11.14%
triton_per_fused_add_native_la                0.37401  24.0     4.29%
triton_per_fused_gelu_native_l                0.02     1.0      0.23%
triton_per_fused_add_embedding                0.01718  1.0      0.20%
Total                                         1.38307           15.86%

  == unknown category kernels ==
Kernel                            Self CUDA TIME (ms)  Count    Percent
------------------------------  ---------------------  -------  ---------
ampere_fp16_s16816gemm_fp16_12                2.24514  24.0     25.74%
ampere_fp16_s16816gemm_fp16_25                1.39796  49.0     16.03%
void cutlass::Kernel<cutlass_8                1.36093  1.0      15.61%
ampere_fp16_s16816gemm_fp16_64                0.74591  12.0     8.55%
ampere_fp16_s16816gemm_fp16_12                0.61989  12.0     7.11%
Memset (Device)                               0.024    12.0     0.28%
void at::native::(anonymous na                0.01543  2.03     0.18%
void at::native::vectorized_el                0.00011  0.03     0.00%
Total                                         6.40937           73.49%

Percent of time when GPU is busy: 101.44%
```

Note: the output shows that the total time the GPU is busy is larger than the total wall time. We measure total wall time with profiling disabled while measuring GPU time with profiling enabled, which may distort the measurement a bit. But I assume the effect is not too large, assuming the profiler mostly increases CPU time (rather than GPU time).

## interesting usages
1. I pick a model where cudagraphs improve perf significantly, like densenet121, and run the tool on its forward graph. It's no surprise that the GPU is idle for quite a lot of the time:
```
(Forward graph) Percent of time when GPU is busy: 32.69%
Total wall time 17.307 ms
```

Its backward graph has a lower percentage of GPU idle time, but it's still high:
```
(Backward graph) Percent of time when GPU is busy: 46.70%
Total wall time 17.422 ms
```

2. I profile a subset of torchbench models and plot a table to show the percent of execution time for pointwise/reduction/persistent_reduction/unknown_category. Since I plan to explore using the coordinate descent tuner to improve reductions, the models with a high percent of time spent on reduction should be good candidates (e.g. resnet50, mobilenet_v2).

NOTE: the same model appears twice. The first row is for the fwd graph and the second for the bwd graph. We profile the different graphs of a model separately.

```
benchmark_name           pointwise_percent    reduction_percent    persistent_reduction_percent    unknown_category_percent    GPU_busy_percent    wall_time_ms
-----------------------  -------------------  -------------------  ------------------------------  --------------------------  ------------------  --------------
resnet18                 19.73%               7.86%                4.81%                           41.25%                      73.65%              2.549ms
resnet18                 18.59%               7.13%                3.35%                           67.35%                      96.41%              3.467ms
resnet50                 29.57%               22.13%               2.07%                           51.68%                      105.46%             6.834ms
resnet50                 26.42%               15.27%               0.94%                           59.68%                      102.31%             13.346ms
vgg16                    26.23%               0.00%                0.00%                           74.20%                      100.43%             18.212ms
vgg16                    15.63%               5.61%                0.10%                           79.42%                      100.75%             33.485ms
BERT_pytorch             28.62%               4.82%                14.88%                          33.32%                      81.64%              7.162ms
BERT_pytorch             14.43%               13.41%               18.19%                          49.24%                      95.27%              10.395ms
densenet121              11.89%               2.14%                3.86%                           16.36%                      34.25%              16.531ms
densenet121              10.37%               2.06%                4.09%                           31.46%                      47.98%              16.934ms
hf_Bert                  23.94%               0.00%                29.88%                          46.09%                      99.90%              7.766ms
hf_Bert                  11.65%               10.54%               20.26%                          61.66%                      104.11%             11.892ms
nvidia_deeprecommender   42.92%               0.00%                0.00%                           56.75%                      99.67%              3.476ms
nvidia_deeprecommender   31.36%               3.44%                0.46%                           65.20%                      100.45%             3.872ms
alexnet                  30.99%               0.00%                0.00%                           69.16%                      100.14%             3.169ms
alexnet                  24.41%               4.83%                0.17%                           71.09%                      100.50%             4.709ms
mobilenet_v2             29.21%               27.79%               2.49%                           44.00%                      103.49%             10.160ms
mobilenet_v2             17.50%               15.05%               1.06%                           69.68%                      103.29%             20.715ms
resnext50_32x4d          18.96%               9.28%                2.31%                           28.79%                      59.33%              5.899ms
resnext50_32x4d          18.48%               11.01%               1.86%                           53.80%                      85.14%              7.167ms
mnasnet1_0               19.07%               14.52%               3.01%                           35.43%                      72.03%              6.028ms
mnasnet1_0               14.17%               12.00%               1.87%                           67.56%                      95.60%              9.225ms
squeezenet1_1            38.56%               0.00%                1.77%                           56.21%                      96.53%              2.221ms
squeezenet1_1            21.26%               7.57%                1.05%                           67.30%                      97.18%              4.942ms
timm_vision_transformer  17.05%               0.00%                18.80%                          65.79%                      101.64%             9.608ms
timm_vision_transformer  9.31%                9.07%                10.32%                          73.25%                      101.96%             16.814ms
```

## how to use
`python {compiled_module_wrapper.py} -p`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97723
Approved by: https://github.com/jansel
2023-04-01 08:04:14 +00:00
553bb01df9 [quant][pt2e][refactor] Remove extra arguments of _maybe_insert_observers_before_graph_output (#98029)
Summary:
This PR allows _maybe_insert_observers_before_graph_output to be reused by pt2e flow

Test Plan:
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestQuantizeFxModels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98029
Approved by: https://github.com/vkuzo
2023-04-01 05:38:36 +00:00
2630144786 Call to mkldnn_matmul from aten::addmm on AArch64 (#91763)
We have noticed that for BERT_pytorch in torchbenchmark the majority of time is spent running GEMM in aten::addmm. At the moment this calls into a BLAS routine, but on AArch64 it is faster to call into mkldnn_matmul. Performance-wise, compared to a build with OpenBLAS, it runs 1.2x faster on 16 cores with a batch size of 8 on Graviton3, and if fast math mode is enabled (mkldnn_matmul exposes, through oneDNN and Arm Compute Library, an option to run GEMM with FP32 inputs using BF16 operations) it is 2.3x faster.
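
For a rough sense of the operation involved, here is a minimal timing sketch of the GEMM that `aten::addmm` performs; the shapes, iteration count, and timing loop are illustrative rather than the actual torchbenchmark harness, and whether the call routes to mkldnn_matmul depends on the platform and build.

```python
import time

import torch

# Illustrative BERT-like GEMM sizes (batch 8 x 128 tokens, hidden 768);
# not the exact benchmark configuration.
M, K, N = 8 * 128, 768, 768
bias = torch.zeros(N)
a = torch.randn(M, K)
b = torch.randn(K, N)

start = time.perf_counter()
for _ in range(100):
    torch.addmm(bias, a, b)
elapsed = (time.perf_counter() - start) / 100
print(f"addmm {M}x{K}x{N}: {elapsed * 1e3:.3f} ms/iter")
```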

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91763
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/malfet
2023-04-01 04:25:57 +00:00
57c6f3fe90 [vision hash update] update the pinned vision hash (#98108)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98108
Approved by: https://github.com/pytorchbot
2023-04-01 03:07:44 +00:00
5df59f957f Fix G001,G002,G003 in logs to % syntax (#97812)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
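
As a sketch of what these rewrites look like, assuming the usual flake8-logging-format meanings (G001 for str.format, G002 for eager %-formatting, G003 for string concatenation inside a logging call); the logger and messages below are illustrative:

```python
import logging

log = logging.getLogger(__name__)
name, count = "resnet18", 3

# Flagged: the message is formatted eagerly, even if the log level is disabled.
log.info("compiled {} {} times".format(name, count))   # G001
log.info("compiled %s %d times" % (name, count))        # G002
log.info("compiled " + name)                            # G003

# Preferred lazy %-syntax: arguments are interpolated only if the record is emitted.
log.info("compiled %s %d times", name, count)
```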

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97812
Approved by: https://github.com/Skylion007, https://github.com/kiukchung, https://github.com/malfet, https://github.com/mlazos
2023-04-01 01:43:33 +00:00
7f9533e224 [Dynamo] Add UserError type (#97705)
To get started on the dynamo error message improvement effort, we discussed adding a new user error type which covers cases where the user used something that TorchDynamo doesn't support and there are clear actions they can take.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97705
Approved by: https://github.com/anijain2305, https://github.com/yanboliang
2023-04-01 01:18:00 +00:00
ee9a9b7add Remove old logging callsites (#98095)
Get around GH first issue, OSS only changes for https://github.com/pytorch/pytorch/pull/97182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98095
Approved by: https://github.com/anijain2305
2023-04-01 00:57:37 +00:00
7c60d7a24d Move CudaReproTests to its own file (#97942)
test_torchinductor has gotten too big (almost 10k lines); this stack is trying to split it into smaller pieces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97942
Approved by: https://github.com/ngimel
2023-04-01 00:47:42 +00:00
df216b5736 Disable dynamo tracing torchrec.distributed (#97824)
This was used to unblock Meta internal use cases where ```torchrec.distributed``` is used; however, it can't be traced by dynamo properly right now.
We sent the same fix (#90087) several months ago, but it was reverted due to ```fbgemm``` conflicts. This PR catches ```Exception``` rather than ```ImportError```, which can handle the conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97824
Approved by: https://github.com/wconstab
2023-04-01 00:39:59 +00:00
b89f74aa35 Mark Vulkan test as unstable (#98106)
This marks the test as unstable while investigating the new flaky issue in trunk, i.e. f9ca48ddb5. Curiously, it started to become flaky after https://github.com/pytorch/pytorch/pull/97698, so I'll reach out to the author.

Fixes https://github.com/pytorch/pytorch/issues/98071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98106
Approved by: https://github.com/clee2000
2023-04-01 00:39:18 +00:00
7aa010dcc9 [stronghold][bc-linter] add BC linter suppression by suppress-api-compatibility-check PR label (#97727)
Adds the ability to suppress BC-linter by adding `suppress-api-compatibility-check` label to the PR.

See #96977 for the context on the BC-linter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97727
Approved by: https://github.com/osalpekar
2023-04-01 00:25:40 +00:00
6b319d1525 [dynamo][graph break fix] inplace add for empty tuple (#97923)
Fixes one of the frequent graph breaks in HF models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97923
Approved by: https://github.com/yanboliang, https://github.com/jansel
2023-04-01 00:11:16 +00:00
7dde61ce46 [quant][pt2e][refactor] Remove extra arguments of _maybe_insert_output_observer_for_node (#97959)
Summary:
The goal is for this function to be reused by the pt2e flow

Test Plan:
python test/test_quantization.py TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97959
Approved by: https://github.com/andrewor14
2023-03-31 23:59:43 +00:00
8313b852cb Fallback getitem fix (#98041)
I'm working on enabling complex fallback. We will be getting additional coverage when that lands, but I also did a run through the inductor test suite.

Differential Revision: [D44564138](https://our.internmc.facebook.com/intern/diff/D44564138)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98041
Approved by: https://github.com/davidberard98
2023-03-31 23:51:00 +00:00
091177516e Rearrange the fields in at::OperandInfo to reduce padding. (#98037)
Summary:
Rearrange the fields in at::OperandInfo to reduce padding.

The current class layout is {64,3,1,1,8,1,1,1,16,16,8,8}. Moving the 5th
element in the class allows the small bytes/bools to be packed together.

This class is frequently read from places like the stack trace below, so
compacting the class could speed things up.

c10/util/MaybeOwned.h:198 operator*
aten/src/ATen/TensorIterator.h:187 tensor_base
aten/src/ATen/TensorIterator.h:322 tensor_base
aten/src/ATen/TensorIterator.cpp:1194 compute_mem_overlaps
aten/src/ATen/TensorIterator.cpp:1475 build

Test Plan: Rely on unit tests.

Differential Revision: D44559604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98037
Approved by: https://github.com/swolchok
2023-03-31 23:45:52 +00:00
9be9592f28 [Dynamo] Code refactor: move context managers out of misc.py (#97958)
misc.py and test_misc.py is too big, moving context managers to context.py and test_context.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97958
Approved by: https://github.com/ezyang, https://github.com/anijain2305, https://github.com/mlazos, https://github.com/voznesenskym
2023-03-31 23:15:39 +00:00
3c7b2b730f use libcusolver_lapack_static.a for CUDA>=12 (#98072)
Needed for https://github.com/pytorch/builder/pull/1374 to enable nightly CUDA12.1 builds.

From the cuSOLVER release notes (https://docs.nvidia.com/cuda/cusolver/index.html#link-third-party-lapack-library):
> The `liblapack_static.a` library is deprecated and will be removed in the next major release. Use the `libcusolver_lapack_static.a` instead.

Note that "next major release" corresponds to CUDA 12, not 13.
The fix was verified locally on an H100 using https://github.com/pytorch/builder/pull/1374 and pip wheels were properly built:
```
>>> torch.version.cuda
'12.1'
>>> torch.backends.cudnn.version()
8801
>>> conv =nn.Conv2d(3, 3, 3).cuda()
>>> x = torch.randn(1, 3, 224, 224).cuda()
>>> out = conv(x)
>>> out.sum()
tensor(5386.9219, device='cuda:0', grad_fn=<SumBackward0>)
```

CC @malfet @atalman @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98072
Approved by: https://github.com/malfet, https://github.com/atalman
2023-03-31 20:47:53 +00:00
5810f5ad1a Fix aten::squeeze.dims shape function (#98078)
Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>

Fixes https://github.com/llvm/torch-mlir/issues/1690#issuecomment-1491931180.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98078
Approved by: https://github.com/davidberard98
2023-03-31 20:24:09 +00:00
d158545b16 [pruning] Add gelu to list of supported activation functions (#95618)
Summary:

This PR adds nn.GELU and F.gelu respectively to the list of suppported
activation functions

Test Plan:
```
python test/test_ao_sparsity.py -- TestBaseSparsifier
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95618
Approved by: https://github.com/andrewor14
2023-03-31 19:55:12 +00:00
8564ed24a8 do not need to check if element in dict input is Tensor. (#97866)
Sometimes it's a tuple with tensor elements, such as the past key/value in the text generation case.

Fixes https://github.com/pytorch/pytorch/issues/97229

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97866
Approved by: https://github.com/jgong5, https://github.com/davidberard98
2023-03-31 19:39:00 +00:00
794f6e50a1 [PyTorch] Accept string_view in Pickler::pushGlobal (#96402)
This should make a difference for users building with libstdc++: we pass string literals to pushGlobal with length longer than 15 bytes, and 15 bytes is the maximum inline size of libstdc++'s std::string before it will heap allocate.

Differential Revision: [D43930698](https://our.internmc.facebook.com/intern/diff/D43930698/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96402
Approved by: https://github.com/ezyang
2023-03-31 19:33:46 +00:00
fb7b398479 [FSDP] Do not _unshard if already prefetched (#97981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97981
Approved by: https://github.com/fegin
2023-03-31 18:47:03 +00:00
195b92ab01 [FSDP][Easy] Minor cleanups to _runtime_utils.py (#97980)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97980
Approved by: https://github.com/H-Huang
2023-03-31 18:47:03 +00:00
adee9423bd [FSDP][Docs] Tidy up FSDP ctor docs (#97979)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97979
Approved by: https://github.com/fegin
2023-03-31 18:47:00 +00:00
3226ad21cf Revert "[Reland] fix some MKL detection issues of CMake (#94924)"
This reverts commit dc2b7aa95554188155a4e2e087412f06f2f3b642.

Reverted https://github.com/pytorch/pytorch/pull/94924 on behalf of https://github.com/atalman due to conda nightly build failures
2023-03-31 18:41:11 +00:00
0d73cfb3e9 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-31 18:36:53 +00:00
3b188c5883 Don't use subclass when tracing and call wait_tensor immediately. (#98001)
This change expects that proper scheduling of the wait_tensor call will happen over the traced graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98001
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2023-03-31 18:33:20 +00:00
f2127bbf47 [PyTorch] Add Vulkan support and tests for at::upsample_bilinear2d (#98022)
Summary: Bilinear upsampling is a [4D tensor upsampling operation](https://pytorch.org/docs/stable/generated/torch.nn.Upsample.html); this adds support for the operation on the Vulkan GPU backend.

Test Plan:
1. `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the added tests `*upsample_bilinear2d*` pass
2a. All tests P669847383
2b. `upsample_bilinear2d` tests P669866631
3. Overview:

```
...

[ RUN      ] VulkanAPITest.upsample_bilinear2d_align_false_small
[       OK ] VulkanAPITest.upsample_bilinear2d_align_false_small (1 ms)
[ RUN      ] VulkanAPITest.upsample_bilinear2d_align_false_large
[       OK ] VulkanAPITest.upsample_bilinear2d_align_false_large (2 ms)
[ RUN      ] VulkanAPITest.upsample_bilinear2d_align_true_small
[       OK ] VulkanAPITest.upsample_bilinear2d_align_true_small (2 ms)
[ RUN      ] VulkanAPITest.upsample_bilinear2d_align_true_large
[       OK ] VulkanAPITest.upsample_bilinear2d_align_true_large (1 ms)

...

[==========] 209 tests from 1 test suite ran. (6317 ms total)
[  PASSED  ] 201 tests.
[  SKIPPED ] 1 test, listed below:
[  SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
[  FAILED  ] 7 tests, listed below:
[  FAILED  ] VulkanAPITest.cat_dim1_singledepth_success
[  FAILED  ] VulkanAPITest.gru_success
[  FAILED  ] VulkanAPITest.gru_mclareninputs_success
[  FAILED  ] VulkanAPITest.gru_prepack_success
[  FAILED  ] VulkanAPITest.lstm_success
[  FAILED  ] VulkanAPITest.lstm_mclareninputs_success
[  FAILED  ] VulkanAPITest.lstm_prepack_success
```

Differential Revision: D43142564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98022
Approved by: https://github.com/SS-JIA
2023-03-31 18:32:42 +00:00
64b8d20a5c Fix typos under c10 directory (#98079)
This PR fixes typos in comments and messages of files under `c10` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98079
Approved by: https://github.com/Skylion007
2023-03-31 18:31:11 +00:00
762a2079c7 [dynamo 3.11] make create_instruction kwarg mandatory (#98032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98032
Approved by: https://github.com/albanD
2023-03-31 18:20:51 +00:00
089134bf66 [dynamo 3.11] implement 3.11 linetable (#96509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96509
Approved by: https://github.com/jansel
2023-03-31 18:20:28 +00:00
14ef91cea6 [dynamo 3.11] small bug fixes (#96508)
Bugs fixed:
	- CALL_FUNCTION_EX expects null pop in symbolic_convert
	- make_function_with_closure codegen requires a push_null
	- copy over the closure in eval_frame.c
	- add JUMP_FORWARD to terminal opcodes
	- enum repr fix in utils.py
	- fix symbolic_convert's break_graph_if_unsupported wrapper

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96508
Approved by: https://github.com/jansel
2023-03-31 18:18:12 +00:00
cb4bc8e0f5 [dynamo 3.11] support prefix instructions MAKE_CELL, COPY_FREE_VARS, RETURN_GENERATOR, RESUME (#96506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96506
Approved by: https://github.com/jansel
2023-03-31 18:16:17 +00:00
05641b81e5 [dynamo 3.11] fix jump if (not) none (#96505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96505
Approved by: https://github.com/jansel
2023-03-31 18:05:54 +00:00
27e06e1a28 Print test times for pytest in verbose mode (#98028)
Adds test time like
```
e.py::test1 PASSED [0.0001s]                                                                                        [ 33%]
e.py::test2 PASSED [1.0075s]                                                                                        [ 66%]
e.py::test3 PASSED [0.0002s]                                                                                        [100%]
```
but they also get colored
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98028
Approved by: https://github.com/huydhn
2023-03-31 18:04:54 +00:00
d03799f9a5 optimize the AMP func name in custom_device_mod (#98052)
Fixes #ISSUE_NUMBER
1. Optimize the func name of AMP in the custom device module: use `torch.foo.set_autocast_enable` instead of `torch.foo.set_autocast_foo_enable`.
2. In AMP with a custom device, use `custom_device_mod.set_autocast_enable` instead of `getattr(custom_device_mod, "set_autocast_enable")`, because we have already checked that `custom_device_mod` has the `set_autocast_enable` attribute.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98052
Approved by: https://github.com/bdhirsh
2023-03-31 17:04:32 +00:00
c699ac17df [CI] Bump up torchbench version to fix dynamo graph breaks in transformers (#98003)
Summary: When we bumped up the torchbench version pin last time, we found
there were new graph breaks introduced with the transformers version
upgrade, see https://github.com/pytorch/pytorch/pull/96782. It turns out
they are already fixed upstream, see
https://github.com/huggingface/transformers/pull/21648 and https://github.com/pytorch/benchmark/pull/1511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98003
Approved by: https://github.com/ngimel
2023-03-31 16:52:09 +00:00
9e3b34775b Revert "[dtensor] switch mean to use reduction linear (#97996)"
This reverts commit 1b323b313ce35c03583ece017f928079f4a86882.

Reverted https://github.com/pytorch/pytorch/pull/97996 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it fails a test on CPU 1b323b313c
2023-03-31 16:44:01 +00:00
87f5e92916 [dynamo] Add guards for deterministic algos (#96695)
Inductor now falls back to eager mode for deterministic algos. Add guards in dynamo to check if the deterministic algos mode changes.

See #93537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96695
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-31 16:28:45 +00:00
864ab93656 aot_autograd: avoid using intermediate_base logic unnecessarily (#97786)
fixes https://github.com/pytorch/pytorch/issues/97691, see the issue for the proposed design. Now that we are employing AOTAutograd's "intermediate base" logic a lot less frequently, we might see some speedups in the benchmark suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97786
Approved by: https://github.com/jansel, https://github.com/soulitzer
2023-03-31 16:25:13 +00:00
4e26ad786d fix load_sharded_optimizer_state_dict error on multi node (#98063)
Fixes #95892

This PR fixes the placement error in ChunkShardingSpec when training with multiple nodes. 'rank:{global_rank}/cuda:{local_rank}' should be used, but 'rank:{global_rank}/cuda:{global_rank}' is used instead, so this results in a CUDA error: invalid device ordinal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98063
Approved by: https://github.com/kumpera
2023-03-31 16:07:09 +00:00
cb8c0be54d add StorageImpl::mutable_unsafe_data (#97648)
See D44409928.

Differential Revision: [D44409945](https://our.internmc.facebook.com/intern/diff/D44409945/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97648
Approved by: https://github.com/ezyang
2023-03-31 16:04:07 +00:00
f4f1a5b5b3 Revert "Move functional collectives to the right namespace (#97793)"
This reverts commit 184bfbc3d7b37e8f202f4938f6ea9ba557c93b1e.

Reverted https://github.com/pytorch/pytorch/pull/97793 on behalf of https://github.com/atalman due to breaks internal builds
2023-03-31 16:02:07 +00:00
fa1a8b9f96 Fix device handling in nn.utils.rnn.unpad_sequence (#98042)
Without this change I get the following error.
```
line 444, in unpad_sequence
    mask = idx < length
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
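
With the fix, a round trip like the following works even when `lengths` lives on a different device than the padded batch; this minimal sketch stays on CPU so it is runnable anywhere.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, unpad_sequence

# Variable-length sequences, padded and then unpadded back to their shapes.
seqs = [torch.randn(3, 4), torch.randn(5, 4), torch.randn(2, 4)]
lengths = torch.tensor([3, 5, 2])

padded = pad_sequence(seqs)                 # (max_len, batch, 4)
unpadded = unpad_sequence(padded, lengths)  # list of (len_i, 4) tensors
assert all(u.shape == s.shape for u, s in zip(unpadded, seqs))
```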

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98042
Approved by: https://github.com/mikaylagawarecki
2023-03-31 16:00:49 +00:00
1c21cd2213 [quant][pt2e][refactor] Add input_output_share_observers to node.meta["target_dtype_info"] (#97949)
Summary:
The goal of this PR is to unify the flow of information to reduce fragmentation of implementations between fx graph mode quantization
and quantize_pt2e. Since quantize_pt2e will be using node.meta to store this information, we'd like to make sure fx graph mode quantization
gets this information from the same place.

Test Plan:
python test/test_quantization.py TestQuantizeFx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97949
Approved by: https://github.com/andrewor14
2023-03-31 15:54:19 +00:00
6b9e22f3f6 Clarify the saving of intermediates in the "extending torch.func" docs (#98020)
Fixes https://github.com/pytorch/pytorch/issues/97260

We got some feedback that the page reads like "in order to save an input
for backward, you must return it as an output of the
autograd.Function.forward".

Doing so actually raises an error (on master and as of 2.1), but results
in an ambiguous situation on 2.0.0. To avoid more users running into
this, we clarify the documentation so it doesn't read like the above
and clearly mentions that you can save things from the inputs or
outputs.
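
A small sketch of the clarified guidance, saving an input in setup_context without returning it from forward (names are illustrative):

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(x):
        return x * x

    @staticmethod
    def setup_context(ctx, inputs, output):
        (x,) = inputs
        ctx.save_for_backward(x)  # saving an *input* for backward is fine

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return 2 * x * grad_out

x = torch.randn(3, requires_grad=True)
Square.apply(x).sum().backward()
print(torch.allclose(x.grad, 2 * x))
```
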
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98020
Approved by: https://github.com/soulitzer, https://github.com/kshitij12345
2023-03-31 13:57:37 +00:00
91ad5984d8 Add script to summarize performance from CI performance run (#97977)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97977
Approved by: https://github.com/wconstab
2023-03-31 12:44:48 +00:00
e073979794 [Quant][FX] Add test case for lowering conv_transpose with kwargs (#97311)
**Summary**
As the title

**Test plan**
python test/test_quantization.py -k test_lowering_functional_conv_transpose_with_kwargs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97311
Approved by: https://github.com/jerryzh168
2023-03-31 10:39:29 +00:00
efdd08a8d0 [MPS] Move impl functions to mps namespace (#97238)
This PR moves impl functions to `at::native::mps` to prevent them from being exposed in `at::native`.

Because of the moves of functions being hard to review, this PR only refactors part of functions in the MPS codebase. Will check everything is correctly moved again before merging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97238
Approved by: https://github.com/kulinseth
2023-03-31 09:55:14 +00:00
e61b842001 [Quant][FX] lower functional conv_transpose ops (#97126)
**Summary**
Support quantizing and lowering functional `conv_transpose1d`, `conv_transpose2d` and `conv_transpose3d`.
Please note that
- `conv_transpose + relu` fusion is not supported. Remember to keep the `relu` node in the graph when lowering.
- `conv_transpose` requires a `per-tensor` scheme for weight. Use the default `qconfig_mappings` instead of the deprecated `qconfig_dict` for test cases.

**Test plan**
python test/test_quantization.py -k test_conv_transpose_not_reference
python test/test_quantization.py -k test_conv_transpose_reference
python test/test_quantization.py -k test_conv_transpose_relu_not_reference
python test/test_quantization.py -k test_conv_transpose_relu_reference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97126
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-03-31 07:17:29 +00:00
c797c7bc8b Clean up duplicate function run_test.py (#97914)
AFAICT they're the same thing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97914
Approved by: https://github.com/huydhn
2023-03-31 06:31:17 +00:00
675dfd2c1f Revert "Retry at test file level (#97506)"
This reverts commit 7d5d5beba27050a8da68675a0ae97a12b26b8a40.

Reverted https://github.com/pytorch/pytorch/pull/97506 on behalf of https://github.com/clee2000 due to test_jit_cuda_fuser having a rough time
2023-03-31 06:22:14 +00:00
3a5ca4bdd4 [quant][pt2e] Add support for conv bn fusion in et backend config (#97389)
Batch Norm was supported by XNNPACK via fusion with the preceding convolution op. We do the same here by fusing across q -> dq nodes.

We must update the original pass in order to fuse convolution weight/bias with batch norm parameters; this way quantization is supported for batch norm.

Differential Revision: [D43976324](https://our.internmc.facebook.com/intern/diff/D43976324/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97389
Approved by: https://github.com/salilsdesai
2023-03-31 05:33:42 +00:00
c091aa9a2c [vision hash update] update the pinned vision hash (#98043)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98043
Approved by: https://github.com/pytorchbot
2023-03-31 05:19:21 +00:00
fe2bdfb2cd [Executorch][XNNPACK] Quantized mean (#97388)
Support Quantized Mean.dim for xnnpack

Adding another pattern for Quantized Partitioner and test to ensure quantized operator works

Differential Revision: [D43915706](https://our.internmc.facebook.com/intern/diff/D43915706/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97388
Approved by: https://github.com/salilsdesai
2023-03-31 05:08:53 +00:00
f78b44b2d9 [quant][pt2e][refactor] Refactor prepare to remove the use of qconfig in _maybe_insert_input_observer_for_arg_or_kwarg (#97948)
Summary:
The goal is for this function to be reused by quantize_pt2e

Test Plan:
python test/test_quantization.py TestQuantizeFx

Differential Revision: [D44558929](https://our.internmc.facebook.com/intern/diff/D44558929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97948
Approved by: https://github.com/andrewor14
2023-03-31 05:07:58 +00:00
f9ca48ddb5 [Executorch][XNNPACK] Quantized hardtanh (#97387)
Lower Quantized Hardtanh to XNNPACK

Also add symmetric quantization support for hardtanh in executorch backend config

Differential Revision: [D43901222](https://our.internmc.facebook.com/intern/diff/D43901222/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97387
Approved by: https://github.com/salilsdesai
2023-03-31 04:58:24 +00:00
ae5b044ccb [XNNPACK] Enable S8 Operators (#97386)
Enabling S8 Operators for quantized Clamp.

This is only for clamp nodes by themselves. I believe once they are fused with the previous node they are no longer needed

Differential Revision: [D43901200](https://our.internmc.facebook.com/intern/diff/D43901200/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43901200/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97386
Approved by: https://github.com/salilsdesai
2023-03-31 04:32:29 +00:00
4befb84d49 [XNNPACK] Allow VCVT Operators (#97385)
Allow VCVT operators, i.e. operators that change datatype.

We want xnn_define_convert to convert from one data type to another. An explanation of its usage is provided in a follow-up diff.

Differential Revision: [D43844094](https://our.internmc.facebook.com/intern/diff/D43844094/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43844094/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97385
Approved by: https://github.com/salilsdesai
2023-03-31 03:56:53 +00:00
9c3fbe7475 [BE] Enable flake8-simplify checks (#97984)
Enable some sensible flake8-simplify rules. Mainly wanted to enable the SIM101, and `yield from` SIM103 checks. @kit1980 since you wanted to be tagged on this CI check.

Enabling this check also helped flag one logical bug so it's definitely beneficial (also fixed in this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97984
Approved by: https://github.com/ezyang
2023-03-31 03:40:21 +00:00
3dc4405278 Add a unit test for negative torch.arange() incorrect numerical behavior with dynamic shapes (#97926)
This unit test is for the fix in #97777 for issue #96971
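
A hedged illustration of the behavior the test guards (the exact repro in #96971 may differ): arange with a negative, shape-dependent bound should agree between eager mode and torch.compile with dynamic shapes.

```python
import torch

def f(x):
    # arange with a negative, shape-dependent start
    return torch.arange(-x.shape[0], 0, device=x.device)

x = torch.randn(5)
eager = f(x)
compiled = torch.compile(f, dynamic=True)(x)
print(torch.equal(eager, compiled))
```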

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97926
Approved by: https://github.com/ezyang
2023-03-31 03:04:50 +00:00
cyy
dc2b7aa955 [Reland] fix some MKL detection issues of CMake (#94924)
This is a reland of PR #94402 that tries to solve the additional link issues.
PR #94402 failed because caffe2::mkl had been converted to a private dependency while libtorch_cuda_linalg hadn't linked to it explicitly. This is fixed in commit 4373bf0ae3dee32afc178f9d51a4154d6c5904c6.
We also replace more references to MKL_LIBRARIES with caffe2::mkl in this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94924
Approved by: https://github.com/malfet
2023-03-31 02:01:52 +00:00
a1dc2b1774 [BE] Remove bool dtype from masked_scatter (#98015)
### <samp>🤖 Generated by Copilot at a9fa438</samp>

Simplified a test function for `torch.masked_scatter` in `test/test_torch.py` by removing redundant and unnecessary code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98015
Approved by: https://github.com/ezyang
2023-03-31 01:45:57 +00:00
26a90fb9c2 using accumulate type to do the computation of mean reduce(CPU) (#97351)
This PR uses an accumulate type to do the computation of mean reduce for the CPU path, matching the GPU path.
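
A small illustration of why the accumulation dtype matters; the dtype and sizes are illustrative:

```python
import torch

# Accumulating a long reduction directly in a low-precision dtype drifts;
# accumulating in float32 stays close to the true mean of 0.1.
x = torch.full((1_000_000,), 0.1, dtype=torch.bfloat16)
print(x.mean().item())           # reduced via the accumulate-type path
print(x.float().mean().item())   # explicit float32 accumulation for reference
```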

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97351
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/ezyang
2023-03-31 01:27:49 +00:00
a5b6f10c5d Fix format bug in NT docs (#97998)
Fixes a formatting bug in the NT docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97998
Approved by: https://github.com/jbschlosser
2023-03-31 01:00:25 +00:00
fae28fcdf5 separate deterministic scatter_add as a helper function (#97922)
Separate it out for better readability; this helper function can be reused for the deterministic implementation of `scatter` and `scatter_reduce` with sum reduction mode.
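
Not the helper itself, just a small usage sketch of the op it backs; with deterministic algorithms enabled, the CUDA path routes to the deterministic implementation, while CPU behavior is unchanged:

```python
import torch

torch.use_deterministic_algorithms(True)

# scatter_add_ following the shape rules from the torch docs:
# index values along dim 0 must be < out.size(0).
src = torch.ones(2, 5)
index = torch.tensor([[0, 1, 2, 0, 0], [0, 1, 2, 2, 2]])
out = torch.zeros(3, 5).scatter_add_(0, index, src)
print(out)

torch.use_deterministic_algorithms(False)
```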

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97922
Approved by: https://github.com/ngimel
2023-03-31 00:01:18 +00:00
99f25c2920 [Vulkan] Fix divide-by-zero with padded tensors (#97698)
Summary:
This fixes the divide-by-zero that arises when performing a division in which the denominator has a number of channels that isn't a multiple of 4, and therefore the channel dimension has been padded with 0s.

More details in the comments of this post: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1288546972015593/

Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64
```

```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

Differential Revision: D44392406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97698
Approved by: https://github.com/SS-JIA
2023-03-30 23:05:47 +00:00
38207a9e53 [ci][easy] Only print remaining logs if test step ran (#97713)
It sometimes spits out leftover logs from a previous run on the Windows ephemeral runner, but this might have been fixed by now. I get a bit annoyed when the step runs even though it obviously isn't going to be useful since the test step didn't run.

always() is needed to ensure that it runs on test step failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97713
Approved by: https://github.com/huydhn
2023-03-30 23:03:41 +00:00
1b323b313c [dtensor] switch mean to use reduction linear (#97996)
mean is actually a reduction-linear formula if the final reduction
is a partial sum (which it currently is), so switch to use that instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97996
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
2023-03-30 22:48:16 +00:00
184bfbc3d7 Move functional collectives to the right namespace (#97793)
This moves them from `torch._C._nn` to `torch._C._dist`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97793
Approved by: https://github.com/albanD
2023-03-30 22:18:13 +00:00
45acfc8574 Revert "[BE][autograd Function] Raise an error if input is returned as-is and saved for forward or backward in setup_context (#97212)"
This reverts commit 313db584f33991c8c2520c79b6dbe11fd93d4179.

Reverted https://github.com/pytorch/pytorch/pull/97212 on behalf of https://github.com/soulitzer due to Internally someone is rely on _wrap_outputs and we updated its signature
2023-03-30 22:03:07 +00:00
c218309f88 [dynamo] profiler.record_function on all dynamo_timed functions (#96495)
**Summary**: profiler.record_function inserts an event into the chrome trace generated by the pytorch profiler. This PR adds record_function everywhere that @dynamo_timed is annotated.

dynamo_timed and the CLI viewer torch._dynamo.utils.compile_times() are already useful on their own; but for identifying _when_ these get called, it's nice to be able to view in the profiler chrome trace.

Why not just turn on python stack traces in the profiler to get this information? Dynamo compilation is implemented in python and therefore produces a huge amount of events when it records compilation steps. The resulting trace files are often too large to load in chrome://tracing, and they take a long time to generate. Additionally, the stack traces are deep enough that they are often hard to read. This approach produces much more readable traces with lower overhead.

**Tests**:
- Added in test/dynamo/test_profiler.py. Verified in https://github.com/pytorch/pytorch/actions/runs/4559322864/jobs/8043307798?pr=96495 that the tests are actually running.
- Performance run with `ciflow/inductor-perf-compare` shows no noticeable change in compilation time or speedup numbers. Geomean speedup changes from 1.275 -> 1.277. Geomean compilation times change from 54.2s -> 53.8s. That's likely just due to noise. All individual benchmark numbers regressed by no more than 5% between the two runs; and we see improvements of around the same magnitude, suggesting this is, again, just noise. For meta employees, you can see the results in a google sheets here: https://docs.google.com/spreadsheets/d/1Ki69XvcgxcA3ZnqC5n_jav5KiD4u7Wojlad3VTnIdlk/edit?usp=sharing

**Example**:

Run this:

```python
import torch

def gn(x):
    return x.sin().cos()

def fn(x, y):
    return x.sin() * y.cos()

x, y = [torch.rand((2, 2), device='cuda') for _ in range(2)]

# just to clear out any lazy initialization
with torch.profiler.profile() as prof:
    torch.compile(gn)(x)

with torch.profiler.profile() as prof:
    torch.compile(fn)(x, y)

prof.export_chrome_trace("./dynamo_timed_profile.json")
```

and we can see that the resulting trace shows important dynamo steps, even when python tracing is turned off.

<img width="867" alt="Screenshot 2023-03-29 at 7 26 15 PM" src="https://user-images.githubusercontent.com/5067123/228712263-8ae67ab9-1a52-4765-a9c2-7c5cf0abe2f5.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96495
Approved by: https://github.com/ngimel, https://github.com/mlazos
2023-03-30 21:49:02 +00:00
ca135ed6b5 [PyTorch] Optimize TupleType::annotation_str_impl for small tuples (#97910)
In general, we can't profitably gather an array of all the elements' annotation strings so that we can reserve the final string because we'd have to heap-allocate that array. If we do it as a fast path for small tuples (which Tuple itself sets precedent for!), we can stack-allocate the array of annotation strings and make it profitable.

Differential Revision: [D44519675](https://our.internmc.facebook.com/intern/diff/D44519675/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97910
Approved by: https://github.com/suo, https://github.com/Skylion007
2023-03-30 21:38:44 +00:00
cadccf0daf Flip switch (#97993)
Turn on cudagraph trees (delayed in fbcode for now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97993
Approved by: https://github.com/davidberard98
2023-03-30 21:35:09 +00:00
f2e6b0837a make triton uses the wheel script now (#97995)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97995
Approved by: https://github.com/colesbury
2023-03-30 21:23:49 +00:00
1f85390eb2 Skip test_batch_norm in test_jit_fuser_te for asan (#98016)
it takes 10+ minutes on asan?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98016
Approved by: https://github.com/huydhn
2023-03-30 20:58:41 +00:00
7bb5fb3c6d [vmap] Fix index_select support when dim is negative (#97916)
Fixes https://github.com/pytorch/pytorch/issues/96854

Previously, this would segfault (via indexing -2 into a SmallVector).
This PR fixes it so that we wrap negative dimensions.

Test Plan:
- changed the index_select OpInfo to use dim=-1 instead of dim=1,
because it's much more common that the negative dimension doesn't work
instead of the positive one.
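
A minimal illustration of the fixed behavior (not the OpInfo test itself): with wrapping, a negative dim under vmap matches the equivalent positive dim.

```python
import torch
from torch.func import vmap

x = torch.randn(4, 3, 5)
idx = torch.tensor([0, 2])

# dim=-1 previously indexed out of bounds under vmap; now it wraps correctly.
neg = vmap(lambda t: torch.index_select(t, -1, idx))(x)
pos = vmap(lambda t: torch.index_select(t, 1, idx))(x)
print(torch.equal(neg, pos))
```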

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97916
Approved by: https://github.com/ngimel, https://github.com/janeyx99
2023-03-30 20:57:38 +00:00
7868e4b45b Revert "Disable dynamo tracing torchrec.distributed (#97824)"
This reverts commit 9d1d95099b0689e8bbd0be3e4fafbad76d8ca524.

Reverted https://github.com/pytorch/pytorch/pull/97824 on behalf of https://github.com/yanboliang due to need to catch more exception
2023-03-30 20:43:00 +00:00
ee1c539ecf Fix module backward pre-hooks to actually update gradient (#97983)
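
A hedged sketch of the mechanism in question: a full backward pre-hook can return a modified grad_output, and with this fix the modified value is actually used by the module's backward. The module and hook below are illustrative.

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 3)

# The pre-hook sees grad_output before the module's backward runs;
# returning a new tuple replaces it (here, scaling the gradients by 2).
def double_grad_output(module, grad_output):
    return tuple(g * 2 for g in grad_output)

lin.register_full_backward_pre_hook(double_grad_output)
lin(torch.randn(4, 3)).sum().backward()
print(lin.weight.grad.abs().sum() > 0)
```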
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97983
Approved by: https://github.com/albanD
2023-03-30 20:33:44 +00:00
06d677f41d [dynamo 3.11] fix push null timing in resume functions (#96504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96504
Approved by: https://github.com/jansel, https://github.com/albanD
2023-03-30 20:29:49 +00:00
5b6e4c48b1 [dynamo 3.11] properly determine cell/freevar index in bytecode_transformation.py (#96503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96503
Approved by: https://github.com/jansel
2023-03-30 20:23:59 +00:00
ba52268da5 [dynamo 3.11] properly copy free/cell vars in eval_frame.c (#96501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96501
Approved by: https://github.com/jansel, https://github.com/albanD
2023-03-30 20:23:38 +00:00
c681c52e01 [inductor] fix TritonTemplateCaller.__str__ (#97578)
We removed TritonTemplateCaller.to_callable previously, but this method was still used in `TritonTemplateCaller.__str__`. The to_callable method in the base class would be used and raise an exception.

This PR fixes TritonTemplateCaller.__str__ to return the string representation without calling to_callable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97578
Approved by: https://github.com/nmacchioni, https://github.com/ngimel
2023-03-30 20:23:02 +00:00
c905251f9f [dynamo 3.11] fix eval_frame.c debug prints for 3.11 (#96500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96500
Approved by: https://github.com/jansel, https://github.com/albanD
2023-03-30 20:20:12 +00:00
848bf8103b fix functional collective to not generate getattr node (#97924)
use mesh.get_dim_groups directly instead of doing mesh tensor operations

This helps us get rid of the getattr ops during tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97924
Approved by: https://github.com/kumpera
2023-03-30 20:14:50 +00:00
eqy
2fddcf0fc0 [CUDA][CUDA 11] Remove more CUDA 11 version checks (#92934)
Working on removing stragglers missed in previous CUDA version < 11.0 cleanup PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92934
Approved by: https://github.com/ngimel
2023-03-30 19:49:52 +00:00
90f69cad9a [inductor] test codegen with dynamic shapes (#96934)
Adds new tests that check for patterns in generated C++/Triton code to
see if it's dynamic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96934
Approved by: https://github.com/ezyang
2023-03-30 19:38:38 +00:00
da28af3286 distinguish mutability of StorageImpl::data_ptr() member (#97651)
See D44409928.

Differential Revision: [D44410323](https://our.internmc.facebook.com/intern/diff/D44410323/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97651
Approved by: https://github.com/ezyang
2023-03-30 19:13:56 +00:00
35090b869d set num_warps to at least 4 (#97950)
To avoid IMAs in https://gist.github.com/ngimel/25e81c996d9c8c652d97e33cc9c7d5f4
This is not a general fix (e.g. if inputs were a bit larger, num_warps would naturally be 4, and we could still have spills and hit ptxas bugs), but will do for now.
Longer term, we should check spills in kernels we generate and recompile with more warps if there are spills.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97950
Approved by: https://github.com/bertmaher
2023-03-30 18:58:14 +00:00
19706356b5 Fix TorchScript support in as_nested_tensor (#97960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97960
Approved by: https://github.com/cpuhrsch
2023-03-30 18:55:26 +00:00
b235e1f737 Compare len(fw_derivatives) with 0 w/o using not (#97953)
`fw_derivatives` is a list, as in the permalink, so we can make use of the fact that the empty list is evaluated as `False`, but I prefer `len(some_list) > 0` thanks to its clarity and its implication that the variable is a container.

53c9bc8c68/tools/autograd/gen_variable_type.py (L942)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97953
Approved by: https://github.com/soulitzer
2023-03-30 18:42:18 +00:00
97fc8ea5f4 Run the benchmark suite with dynamic batch only (#97912)
Symbolic shapes compile time on full CI with inductor is horribly long (even though our aot_eager local runs seemed to suggest that the added latency was only 10s per model.) To patch over the problem for now, run the benchmark suite with dynamic batch only.  This should absolve a lot of sins.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97912
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2023-03-30 18:04:48 +00:00
4cce60751b Move TestIndexingSimplification to its own file (#97941)
test_torchinductor has gotten too big (almost 10k lines); this stack is trying to split it into smaller pieces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97941
Approved by: https://github.com/ngimel
2023-03-30 17:55:34 +00:00
94bae36a1f Fix strip_function_call in GuardBuilder (#97810)
Repro:
From #92670, this addresses one of the bugs for TorchDynamo:

pytest ./generated/test_PeterouZh_CIPS_3D.py -k test_003

Issue:
In GuardBuilder, when parsing argnames with "getattr(a.layers[slice(2)][0]._abc, '0')" it returns "getattr(a", where it is supposed to return "a", thus causing a SyntaxError.

This PR fixes the regex and adds a couple of test cases.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97810
Approved by: https://github.com/yanboliang
2023-03-30 17:46:10 +00:00
ffd76d11c9 [fix] take : backward batching rule (#95772)
Fixes https://github.com/pytorch/pytorch/issues/95738
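
A small sketch of the path being fixed: vmapping the gradient of a function that uses torch.take exercises the batching rule for take's backward. Shapes and indices are illustrative.

```python
import torch
from torch.func import grad, vmap

x = torch.randn(5)
idx = torch.tensor([[0, 2], [1, 4], [3, 3]])

def f(x, i):
    return torch.take(x, i).sum()

# grad(f) differentiates w.r.t. x; vmapping over a batch of index tensors
# should match a plain per-sample loop.
per_sample = vmap(lambda i: grad(f)(x, i))(idx)
loop = torch.stack([grad(f)(x, i) for i in idx])
print(torch.allclose(per_sample, loop))
```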

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95772
Approved by: https://github.com/zou3519
2023-03-30 17:18:17 +00:00
7d5d5beba2 Retry at test file level (#97506)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97506
Approved by: https://github.com/huydhn
2023-03-30 17:12:19 +00:00
24a5d006f2 [dynamo 3.11] Refactor create_instruction (#96499)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96499
Approved by: https://github.com/jansel, https://github.com/albanD
2023-03-30 17:05:27 +00:00
e6888697c4 Revisit torch._six.string_classes removal (#94709) (#97863)
Revisit `torch._six.string_classes` (which is `(str, bytes)`) removal: `isinstance(obj, string_classes) -> isinstance(obj, str)`.

Both `str` and `bytes` are `Sequence` classes.

```python
In [1]: from typing import Sequence

In [2]: issubclass(bytes, Sequence)
Out[2]: True

In [3]: issubclass(str, Sequence)
Out[3]: True
```

Re-add `bytes` to type guards like:

```python
def is_seq(obj):
    return isinstance(obj, Sequence) and not isinstance(obj, (str, bytes))
```

Ref:

- https://github.com/pytorch/pytorch/pull/94709#issuecomment-1487282912
- #97737
- #97789
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97863
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-03-30 17:02:45 +00:00
9ec6fdb29b Enable adam foreach in full train step tracing (#97897)
Main changes:

1. Registered several foreach ops to both meta and DTensor
2. Skip redundant getitem node when expanding foreach ops with DTensor
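
For context, a plain (non-distributed, non-traced) sketch of the optimizer path involved; `foreach=True` selects the `torch._foreach_*` implementations whose ops are registered here, while the DTensor/tracing setup itself is not shown.

```python
import torch

model = torch.nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=True)

model(torch.randn(4, 8)).sum().backward()
opt.step()
opt.zero_grad()
```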
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97897
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-03-30 16:47:10 +00:00
19dcf55a6f [functorch] .data should not work for grad, jvp, vjp (#94817)
Improve error message

Fixes https://github.com/pytorch/pytorch/issues/94514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94817
Approved by: https://github.com/zou3519
2023-03-30 16:46:57 +00:00
96dbca69e6 Add unstable workflow to upload test stats (#97918)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97918
Approved by: https://github.com/huydhn
2023-03-30 16:30:18 +00:00
65e8c14948 Corrected batch norm docs with the exact computations of the standard deviation (#97974)
Fixes #77427

@jbschlosser sorry for taking so long to submit this, I just realized it had been sitting in my backlog for too long.
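
A quick numerical check of the convention the corrected docs describe (training-mode normalization uses the biased variance, while running_var is updated with the unbiased estimate):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 4)
bn = nn.BatchNorm1d(4, affine=False)
out = bn(x)  # training mode by default

# Normalize manually with the biased (divide-by-N) variance.
manual = (x - x.mean(0)) / torch.sqrt(x.var(0, unbiased=False) + bn.eps)
print(torch.allclose(out, manual, atol=1e-6))
```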
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97974
Approved by: https://github.com/albanD
2023-03-30 16:29:57 +00:00
cdb32dad3d [minifier] cuda.synchronize to better detect IMA (#97962)
Sometimes IMAs can trigger much later than the kernel invocation call, and they escape the minifier. Calling cuda.synchronize fixes this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97962
Approved by: https://github.com/mlazos
2023-03-30 15:46:52 +00:00
0e4ddc2b40 NT: Refactor for lazy computation of opt_sizes (#97895)
This PR changes the `opt_sizes_` metadata to be computed lazily if needed rather than at construction. Since this metadata is data-dependent, we can't calculate it if we have symbolic metadata (i.e. for dynamic shapes). Notes:
* `opt_size()` is the only public accessor of `opt_sizes_`; several kernels use it. During the first call to this, the metadata is computed.
* `size()` / `sym_size()` use `opt_size()`. For the symbolic case, this will have to change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97895
Approved by: https://github.com/drisspg
2023-03-30 14:58:26 +00:00
47dca20d80 [BE] Enable flake8-comprehension rule C417 (#97880)
Enables flake8-comprehension rule C417. Ruff autogenerated these fixes to the codebase.
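
For context, the kind of pattern C417 ("unnecessary map usage") flags and the comprehension it is rewritten to; the example is illustrative:

```python
xs = [1, 2, 3]

# Flagged by C417:
squares = list(map(lambda x: x * x, xs))

# The autogenerated fix rewrites it as a comprehension:
squares = [x * x for x in xs]
```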

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97880
Approved by: https://github.com/ezyang, https://github.com/kit1980, https://github.com/albanD
2023-03-30 14:34:24 +00:00
1d08b5b103 [fx] Replace literals with placeholder helper (#97683)
Helper function to replace literals that show up in call_function nodes in the graph to become placeholders so that they can be represented as wildcards when matching with the SubgraphMatcher. This pass causes the resulting graph to not be runnable with the original inputs since adding placeholders to the graph will change the number of inputs needed for the graph.

Test: `python test/test_fx.py TestMatcher`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97683
Approved by: https://github.com/kimishpatel, https://github.com/SherlockNoMad
2023-03-30 12:13:28 +00:00
19162083f8 Improved perfs for vectorized bilinear interpolate cpu uint8 RGB-case (channels last) (#96848)
## Description

- Based on https://github.com/pytorch/pytorch/pull/96651
  - Improved perfs for vectorized **bilinear** interpolate uint8 RGB-case, **channels last**
    - unified RGB and RGBA processing code such that RGB input is not copied into RGBA
  - Performance is closer to Pillow-SIMD (labeled as `Pillow (9.0.0.post1)` in the results)
  - RGBA case perfs are the same after refactoring (see Source link below)
- Fixed mem pointer alignment, added more comments (reviews from #96651)

## Results

- `Pillow (9.0.0.post1)` == Pillow-SIMD

```
[-------------------------------------------------------------------------------------------------- Resize -------------------------------------------------------------------------------------------------]
                                                                                 |  Pillow (9.0.0.post1)  |  torch (2.1.0a0+gitd6e220c) PR  |  torch (2.1.0a0+git2b75955) nightly  |  Speed-up: PR vs nightly
1 threads: --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=True        |    38.674 (+-0.323)    |         57.591 (+-0.244)        |          131.033 (+-1.448)           |      2.275 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (32, 32) aa=False       |                        |         39.471 (+-0.166)        |          113.911 (+-1.736)           |      2.886 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=True      |   128.512 (+-1.916)    |        161.592 (+-1.242)        |          299.679 (+-2.099)           |      1.855 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (224, 224) aa=False     |                        |        150.994 (+-1.180)        |          285.331 (+-1.919)           |      1.890 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=True      |   180.045 (+-2.223)    |        220.581 (+-1.363)        |          431.057 (+-3.536)           |      1.954 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (320, 320) aa=False     |                        |        219.391 (+-1.409)        |          429.410 (+-3.620)           |      1.957 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=True        |   113.911 (+-1.024)    |        129.457 (+-1.295)        |          459.610 (+-13.322)          |      3.550 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (32, 32) aa=False       |                        |         59.800 (+-0.199)        |          400.015 (+-11.815)          |      6.689 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=True      |   283.050 (+-2.664)    |        339.143 (+-1.209)        |          683.555 (+-4.466)           |      2.016 (+-0.000)
      3 torch.uint8 channels_last bilinear (520, 520) -> (224, 224) aa=False     |                        |        250.601 (+-1.236)        |          603.545 (+-2.644)           |      2.408 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=True        |   186.723 (+-2.213)    |        199.960 (+-1.343)        |          860.867 (+-21.763)          |      4.305 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (32, 32) aa=False       |                        |         79.188 (+-0.261)        |          703.019 (+-25.805)          |      8.878 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=True      |   412.353 (+-4.476)    |        462.230 (+-1.983)        |         1101.673 (+-49.299)          |      2.383 (+-0.000)
      3 torch.uint8 channels_last bilinear (712, 712) -> (224, 224) aa=False     |                        |        327.973 (+-1.852)        |          941.062 (+-5.549)           |      2.869 (+-0.000)

      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=True        |    61.191 (+-0.926)    |         80.795 (+-0.518)        |          160.853 (+-1.506)           |      1.991 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=True      |   134.488 (+-2.129)    |        169.147 (+-1.324)        |          327.343 (+-2.846)           |      1.935 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=True    |  1037.045 (+-24.982)   |        938.623 (+-9.010)        |         2603.360 (+-20.530)          |      2.774 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=True        |    52.792 (+-0.613)    |         73.692 (+-0.264)        |          131.829 (+-1.333)           |      1.789 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=True      |   139.596 (+-1.944)    |        173.778 (+-1.039)        |          320.063 (+-2.562)           |      1.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=True    |   690.132 (+-10.946)   |        772.758 (+-2.864)        |         2036.860 (+-36.109)          |      2.636 (+-0.000)
      3 torch.uint8 channels_last bilinear (64, 64) -> (224, 224) aa=False       |                        |         78.747 (+-0.799)        |          158.479 (+-1.702)           |      2.013 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (270, 268) aa=False     |                        |        167.046 (+-1.077)        |          322.104 (+-2.764)           |      1.928 (+-0.000)
      3 torch.uint8 channels_last bilinear (256, 256) -> (1024, 1024) aa=False   |                        |        918.967 (+-5.251)        |         2611.388 (+-29.917)          |      2.842 (+-0.000)
      3 torch.uint8 channels_last bilinear (224, 224) -> (64, 64) aa=False       |                        |         55.336 (+-0.251)        |          113.869 (+-1.243)           |      2.058 (+-0.000)
      3 torch.uint8 channels_last bilinear (270, 268) -> (224, 224) aa=False     |                        |        156.505 (+-1.095)        |          299.861 (+-2.710)           |      1.916 (+-0.000)
      3 torch.uint8 channels_last bilinear (1024, 1024) -> (256, 256) aa=False   |                        |        514.344 (+-1.905)        |         1776.796 (+-19.660)          |      3.454 (+-0.000)

```

Note: There is no perf regression for the other cases. There are some cases (see Source below) with small speed-ups; for the rest it is roughly around 1.0 +/- 0.1, which may be attributed to noisy measurements.

[Source](https://gist.github.com/vfdev-5/1c0778904a07ce40401306548b9525e8#file-20230329-181023-pr_vs_nightly-speedup-md)

## Context

- https://github.com/pytorch/pytorch/pull/90771

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96848
Approved by: https://github.com/NicolasHug, https://github.com/peterbell10
2023-03-30 11:51:02 +00:00
379fb47654 [SPMD] Support foreach optimizers with functionalization (#97853)
My first attempt was to apply the same solution as how proxy_tensor.py
handles other inplace ops. However, foreach is different in that
its schema in `native_functions.yaml` does not return anything,
whereas ops like `addcmul_` and `addcdiv_` do return Tensors (Thanks
bdhirsh for teaching me this!). As a result, the proxy output
during tracing does not wrap anything, and hence we cannot correctly
connect it with subsequent operators. Modifying `native_functions.yaml`
is not a preferred solution. After discussing with bdhirsh, the
temporary solution is to do foreach functionalization as a graph
pass for now. Later, when https://github.com/pytorch/pytorch/issues/97852
is addressed, we will switch to default functionalization.

Edit: the latest version follows @bdhirsh 's suggestion on using
`make_fx` `decomposition_table` instead of implementing manual
fx.Graph tranforms to functionalize `_foreach_add_`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97853
Approved by: https://github.com/fegin, https://github.com/wanchaol
2023-03-30 11:27:10 +00:00
0f3ffaf798 extract torch.proto to its own library (#97614)
This is used in libtorch.

Differential Revision: [D44400084](https://our.internmc.facebook.com/intern/diff/D44400084/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97614
Approved by: https://github.com/PaliC
2023-03-30 10:35:03 +00:00
428cb3a868 distinguish mutability of untyped StorageImpl::data() member (#97647)
To implement the warning when transitioning reshape to copy-on-write
storage, we want to be able to detect a write to one view family
followed by a read or a write to another one that shares the same
copy-on-write storage.

Because we have historically not been strict about the mutability of
our data pointers, any warning we have would likely be far too
aggressive.

Therefore, this is the first PR in a long series to ensure a strict
distinction between mutable and const data accessors in TensorBase,
TensorImpl, Storage, and StorageImpl.

The rough plan is to give the mutable accessor a new name that is
explicit about mutation, this will also force us to rewrite any code
that really needs a mutation.

Differential Revision: [D44409928](https://our.internmc.facebook.com/intern/diff/D44409928/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97647
Approved by: https://github.com/ezyang
2023-03-30 09:45:09 +00:00
0770ad3cae extract caffe2.proto to its own library (#97613)
This reduces the footprint of the caffe2_pb library.

Differential Revision: [D44400083](https://our.internmc.facebook.com/intern/diff/D44400083/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97613
Approved by: https://github.com/PaliC
2023-03-30 09:16:25 +00:00
5ab50cf048 Fix shoud/shoudl typos (#97930)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97930
Approved by: https://github.com/clee2000
2023-03-30 08:27:16 +00:00
7554c10899 Fix typos under tools directory (#97779)
Fix typos under tools directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97779
Approved by: https://github.com/clee2000, https://github.com/kit1980
2023-03-30 08:21:35 +00:00
5a81508bb6 Add NestedTensor ops: logical_not, logical_not_, masked_fill (#97934)
# Summary
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 7954302</samp>

This pull request adds support for `logical_not` and `masked_fill` operations on nested tensors, which are tensors that can have tensors as elements. It modifies the `native_functions.yaml` file to dispatch these operations to the nested tensor backend, implements the logic for these operations in `NestedTensorBinaryOps.cpp` and `NestedTensorUnaryOps.cpp`, adds documentation in `nested.rst`, and adds tests in `test_nestedtensor.py`.

## Description
<!--
copilot:walkthrough
-->
### <samp>🤖 Generated by Copilot at 7954302</samp>

*  Implement `logical_not` operation on nested tensors ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R1164), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R1172), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f7c94671810b3ce652f9ad5458518cb7bbd67e8bf7e84e0a2fba641d878ba7c5R45-R56), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-c8b131d009badb3f92031b2aaa6e7f93a793f13caee278ea78e1c57d78c0399eR203), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0L854-R867))
  - Add `NestedTensor_logical_not` and `NestedTensor_logical_not_` functions to `native_functions.yaml` for CPU and CUDA dispatch ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R1164), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R1172))
  - Define `NestedTensor_logical_not` and `NestedTensor_logical_not_` functions in `NestedTensorUnaryOps.cpp` using `map_nt` and `get_buffer` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f7c94671810b3ce652f9ad5458518cb7bbd67e8bf7e84e0a2fba641d878ba7c5R45-R56))
  - Document `torch.logical_not` function for nested tensors in `nested.rst` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-c8b131d009badb3f92031b2aaa6e7f93a793f13caee278ea78e1c57d78c0399eR203))
  - Add subtest for `logical_not` function in `test_activations` method in `TestNestedTensorDeviceType` class in `test_nestedtensor.py` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0L854-R867))
* Implement `masked_fill` operation on nested tensors ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R7439), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f847e41e3d373230df0b25574e993ec0e6b699bf16796b3df9ae9fb518048e25L210-R224), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-c8b131d009badb3f92031b2aaa6e7f93a793f13caee278ea78e1c57d78c0399eR197), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0R677-R688), [link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0R2515-R2528))
  - Add `NestedTensor_masked_fill` function to `native_functions.yaml` for CPU and CUDA dispatch ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-2f3dbd85efb9b5172f2264eedd3be47dd765e6ab7cc8bf3ade5e62c28ae35991R7439))
  - Define `NestedTensor_masked_fill` function in `NestedTensorBinaryOps.cpp` using `NestedTensor_elementwise_Tensor` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f847e41e3d373230df0b25574e993ec0e6b699bf16796b3df9ae9fb518048e25L210-R224))
  - Document `torch.Tensor.masked_fill` function for nested tensors in `nested.rst` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-c8b131d009badb3f92031b2aaa6e7f93a793f13caee278ea78e1c57d78c0399eR197))
  - Add test case for `masked_fill` function in `TestNestedTensorDeviceType` class in `test_nestedtensor.py` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0R677-R688))
  - Add test case for backward pass of `masked_fill` function in `TestNestedTensorAutograd` class in `test_nestedtensor.py` ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-6eef496a8ec635930b6e52507358e069c80021f3535b8737d39e14ffc38950c0R2515-R2528))
* Improve error message for unsupported element-wise binary operations on nested dense tensors ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f847e41e3d373230df0b25574e993ec0e6b699bf16796b3df9ae9fb518048e25L142-R150))
  - Modify `NestedTensor_elementwise_Tensor` function in `NestedTensorBinaryOps.cpp` to include operation name in error message ([link](https://github.com/pytorch/pytorch/pull/97934/files?diff=unified&w=0#diff-f847e41e3d373230df0b25574e993ec0e6b699bf16796b3df9ae9fb518048e25L142-R150))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97934
Approved by: https://github.com/cpuhrsch
2023-03-30 08:14:39 +00:00
f92cae4849 Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 08:14:01 +00:00
b093dfaefa Revert "Fix a grep-itself bug when checking for GPU healthcheck (#97929)"
This reverts commit f40b2ed59c4d1aca36d42ed208cfa9356fbe672d.

Reverted https://github.com/pytorch/pytorch/pull/97929 on behalf of https://github.com/huydhn due to Rework to get rid of grep completely
2023-03-30 07:52:20 +00:00
7776653a0c Add linear gradgrad (#97151)
Fixes #92206
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97151
Approved by: https://github.com/albanD
2023-03-30 07:25:02 +00:00
15271d353a [quant][pt2e] Support convtranspose + bn fusion (#97933)
Summary:
This PR extends `_fuse_conv_bn_` function to support fusing convtranspose and bn

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_transposed_conv_bn_fusion

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97933
Approved by: https://github.com/vkuzo
2023-03-30 07:02:39 +00:00
f7fe6e148e [test] Make environment variable name better (#97356)
This PR intends to use a better (or correct?) environment variable name (`TORCH_DOCTEST_ANOMALY` instead of `TORCH_DOCTEST_ANOMOLY`) in the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97356
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-03-30 06:21:28 +00:00
53c9bc8c68 Add DLPack support for XPU backend by mapping to kDLOneAPI in DLPack … (#94968)
# Motivation
The DLPack device type kDLOneAPI stands for Unified Shared Memory allocated on a oneAPI device. The corresponding PyTorch backend type is XPU.
Support exporting/importing a PyTorch XPU tensor as a DLPack tensor with the kDLOneAPI device type.

# Solution
1. Update the DLPack protocol to v0.7.
2. Add the XPU hooks to map the Aten device and DLPack device with the address value and device information.

# Additional Context
Reopen (#82867)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94968
Approved by: https://github.com/kit1980
2023-03-30 04:32:15 +00:00
88234540e7 Fix typo under torch/csrc/jit/tensorexpr directory (#97218)
This PR fixes typo in comments and messages under `torch/csrc/jit/tensorexpr` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97218
Approved by: https://github.com/davidberard98, https://github.com/jgong5, https://github.com/EikanWang, https://github.com/kit1980
2023-03-30 04:21:24 +00:00
721260e966 [3/n] Consolidate replicate and DDP: update replicate to reuse functions in DDP (#96660)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96660
Approved by: https://github.com/rohan-varma
2023-03-30 03:54:34 +00:00
af0264ae08 [BE] Pass -faligned-new if supported by compiler (#97887)
<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 507f7a2</samp>

> _`-faligned-new` flag_
> _always on for C++17_
> _simpler winter code_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97887
Approved by: https://github.com/atalman, https://github.com/Skylion007
2023-03-30 03:16:19 +00:00
a95815c6b7 fix compiler version detection on MacOS (#97883)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 43c1df6</samp>

Fix build error on macOS with Xcode 12 or newer by updating clang version detection in `CMakeLists.txt`.

Fixes https://github.com/pytorch/pytorch/issues/97882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97883
Approved by: https://github.com/malfet
2023-03-30 02:56:22 +00:00
1432a893ef Fix issue with single input cat (#97822)
Fixes #97695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97822
Approved by: https://github.com/ngimel, https://github.com/anijain2305
2023-03-30 02:51:43 +00:00
7fc100a290 support random for custom device (#97420)
Fixes #ISSUE_NUMBER
Set the seed for a custom device. @bdhirsh, please review my changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97420
Approved by: https://github.com/bdhirsh
2023-03-30 02:12:52 +00:00
3eecca764a Skip test_cpp_wrapper on mac (#97911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97911
Approved by: https://github.com/clee2000
2023-03-30 00:41:45 +00:00
f40b2ed59c Fix a grep-itself bug when checking for GPU healthcheck (#97929)
The logic works (https://github.com/pytorch/pytorch/actions/runs/4558327458), but it also greps itself because `set -x` is set (ugh, debug messages)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97929
Approved by: https://github.com/malfet, https://github.com/weiwangmeta
2023-03-30 00:25:43 +00:00
b23cfe5465 [Inductor] Remove fb custom ops dependency (#97907)
As it conflicts with other dependencies

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97907
Approved by: https://github.com/anijain2305
2023-03-29 23:53:21 +00:00
2f6c18d1a2 improve memory footprint of torch.testing.assert_close (#96131)
Redo of #90172 out of stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96131
Approved by: https://github.com/pearu, https://github.com/mruberry
2023-03-29 23:49:56 +00:00
8e5c5d2023 Revert "Propogate dynamo shape_env to make_fx (#96437)"
This reverts commit 3a22916c7a501499eec9053e3a568f2b1f49938c.

Reverted https://github.com/pytorch/pytorch/pull/96437 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-03-29 23:47:59 +00:00
3460b2b7d3 Add support for pin memory on custom device. (#97621)
Add support for pin memory on custom device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97621
Approved by: https://github.com/NivekT
2023-03-29 23:45:52 +00:00
f603873c1b add various NT ops needed for testing (#97837)
# Summary
Add some Simple unary and binary NT ops
- Sub
- sgn
- abs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97837
Approved by: https://github.com/cpuhrsch
2023-03-29 23:43:37 +00:00
2b56da139c [kineto] init kineto_activity for each event (#97550)
Summary:
There was a refactoring a while back to address Kineto <--> PyTorch Profiler buffer management issues. This made the Profiler API path safer, but it regressed the OnDemand path.

The proper long term solution is to merge those paths which would significantly improve the maintainability of the codebase.

Test Plan:
# Test on Resnet integration test
```
buck2 run mode/opt  kineto/libkineto/fb/integration_tests:pytorch_resnet_integration_test
dyno gputrace
```

# Trace
https://fburl.com/perfdoctor/t8nkda9z

Differential Revision: D44362040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97550
Approved by: https://github.com/aaronenyeshi
2023-03-29 23:33:28 +00:00
47ce41e732 [dtensor] remove DeviceMesh typing hack guard type imports (#97889)
This PR relands https://github.com/pytorch/pytorch/pull/94526
and tries to guard the type import for older version numpy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97889
Approved by: https://github.com/fegin
2023-03-29 23:29:41 +00:00
eqy
aa4ea6e1f3 [cuDNN][cuDNN V8 API] Fix incorrect use of emplace in the benchmark cache (#97838)
`emplace` does not overwrite the existing mapped value in a map if the key already exists, which can lead to repeated execution of a plan that, e.g., tries to allocate an OOM-inducing workspace size and retriggers a heuristic run (or worse, a benchmark run).

CC @ptrblck @ngimel @Fuzzkatt @syed-ahmed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97838
Approved by: https://github.com/ngimel
2023-03-29 23:14:05 +00:00
35be579701 Refactor TENSOR_MATCH guards to check dim (for NT support) (#97896)
Tweaks the TENSOR_MATCH guard logic to avoid saving sizes / strides for the case of dynamic shapes. Instead, the dim() is stored, which is enough for both dense tensors and NTs.
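
A minimal sketch of the check described above (a hypothetical helper, not the actual guard code):

```python
def tensor_match_sketch(example, candidate, dynamic_shapes: bool) -> bool:
    # With dynamic shapes only the rank is pinned down, which works for both
    # dense tensors and nested tensors; otherwise sizes/strides are compared.
    if dynamic_shapes:
        return candidate.dim() == example.dim()
    return (tuple(candidate.size()) == tuple(example.size())
            and candidate.stride() == example.stride())
```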

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97896
Approved by: https://github.com/ezyang
2023-03-29 23:08:03 +00:00
04ca3a289d Disable modes in preserve_rng_state (#97738)
This one allows make_fx to be called when already in a faketensor mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97738
Approved by: https://github.com/ezyang
2023-03-29 22:46:51 +00:00
3a22916c7a Propogate dynamo shape_env to make_fx (#96437)
Currently, when we use the assume_static_by_default flag, dynamo won't produce any symbols for input tensors. But when we pass the dynamo-generated graph on to make_fx via torchdynamo.export(aten_graph=True), there is no way to pass this flag. We enable this by directly passing the fake tensors dynamo used to make_fx, calling make_fx in "real" mode with the fake tensors from dynamo.

Note that this is a modified version of https://github.com/pytorch/pytorch/pull/96143.
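
A hedged usage sketch (assuming the `torch._dynamo.export` API of this era, which returns the graph module and guards):

```python
import torch
import torch._dynamo as dynamo

def fn(x):
    return (x + 1).relu()

# aten_graph=True makes export re-trace the dynamo graph with make_fx; with this
# change, the fake tensors (and their shape_env) from dynamo are reused there.
graph_module, guards = dynamo.export(fn, torch.randn(3, 4), aten_graph=True)
graph_module.print_readable()
```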

Differential Revision: [D43994693](https://our.internmc.facebook.com/intern/diff/D43994693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96437
Approved by: https://github.com/jansel, https://github.com/ezyang
2023-03-29 22:34:37 +00:00
7257de6eac Fix typos in torch/fx/_compatibility.py (#97618)
Fixes #ISSUE_NUMBER
Modify the _compatibility.py file global variable name and modify its test file simultaneously.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97618
Approved by: https://github.com/ezyang
2023-03-29 21:55:13 +00:00
2f86c9bc0b Update query version for update_expected.py (#97898)
Unclear why this wobbled, but rocks had an outage and fixed it,
maybe new endpoints were generated as a result of that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97898
Approved by: https://github.com/huydhn
2023-03-29 21:50:19 +00:00
099b2801db Stop runner service when its GPU crashes (#97585)
Per title, I'm looking for a way to take the runner out of service when its GPU crashes and can't recover.  Taking the faulty runner out of service would prevent future jobs from being assigned to it, as they would surely fail.

This is based on the observation that GPU crashes usually happen in the middle of the test or in the next `setup-nvidia` step.  This only happens on G5 runners with an A10G GPU, so the suspicion is that this is a hardware failure.  Updating to the newer NVIDIA driver (525.85.06) might or might not help with the issue (https://github.com/pytorch/pytorch/pull/96904), so I'm preparing this PR as a preemptive measure.  Here are the symptoms when the GPU crashes:

* Test fails with a "No CUDA GPUs are available" error when initializing CUDA.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4506110581/jobs/7932832519
  * https://github.com/pytorch/pytorch/actions/runs/4507220502/jobs/7935084759
* Calling nvidia-smi times out after 60 seconds.  For example:
  * https://github.com/pytorch/pytorch/actions/runs/4496201282/jobs/7910938448
* nvidia-smi fails with an "unable to determine the device handle for GPU" unknown error
  * https://github.com/pytorch/pytorch/actions/runs/4546343549/jobs/8015359600
* Running `docker --gpus all` fails with an error response from the daemon, while the command `nvidia-container-cli` fails with `detection error: nvml error: unknown error`
  * https://github.com/pytorch/pytorch/actions/runs/4545579871/jobs/8013667872

I assume that an offline runner with a stopped runner service will be torn down and recycled properly by the infra scaling process.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4517112069/jobs/7956204805.  When it runs, the code fetches the service name from the `${{ RUNNER_WORKSPACE }}/../../.service` file and issues `sudo systemctl stop ${RUNNER_SERVICE_NAME}` to stop the self-hosted runner service.

The job will show its status as `The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97585
Approved by: https://github.com/jeanschmidt
2023-03-29 21:17:13 +00:00
2806fa4470 Use the latest NVIDIA driver from setup-nvidia (#97840)
This goes with https://github.com/pytorch/test-infra/pull/3949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97840
Approved by: https://github.com/ZainRizvi
2023-03-29 21:14:27 +00:00
b93e1f377e [dynamo, benchmarks] Add inductor-mode (for max-autotune) and warm start options to dynamo benchmarks (#97719)
Title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97719
Approved by: https://github.com/shunting314
2023-03-29 21:09:00 +00:00
942e587d40 [SPMD] Make compile cache the compilation result and add option to perform transformation (#97836)
This PR changes the ``compile()`` decorator to cache the compilation result so that the compilation is done only once. A gm_transformation option is also added to ``compile()`` so that after the compilation is done, users can perform any graph transformation on the compiled graph module.

Differential Revision: [D44484033](https://our.internmc.facebook.com/intern/diff/D44484033/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97836
Approved by: https://github.com/mrshenli, https://github.com/wconstab
2023-03-29 20:51:22 +00:00
d70f9c7888 Fix typo under torch/csrc/jit/runtime directory (#97243)
This PR fixes typo in comments and messages under `torch/csrc/jit/runtime` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97243
Approved by: https://github.com/davidberard98
2023-03-29 20:17:10 +00:00
1f71ac785c [RFC][inductor][index_put] fallback to aten in torch deterministic mode (#96898)
Fixes #93537
fallback to aten for index_put and scatter ops in torch deterministic mode

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96898
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-29 19:28:37 +00:00
e6909f6ccc [Dynamo] Fix for tuple construction from tuple iterators (#97862)
Fixes #93405

In short - when calling the builtin function `Tuple` on a list variable we added a list length guard. This paired with converting tuple iterators to a ListIteratorVariable resulted in this guard being improperly added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97862
Approved by: https://github.com/yanboliang, https://github.com/jansel
2023-03-29 19:20:05 +00:00
477f3f555f Simplify by using yield from (#97831)
The issues were found by SIM104 flake8-simplify in a local run.

I'll look into adding the check to CI separately.
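
For reference, the kind of rewrite SIM104 suggests (a generic Python illustration, not a specific snippet from this PR):

```python
def iter_items_before(items):
    for item in items:   # pattern flagged by flake8-simplify (SIM104)
        yield item

def iter_items_after(items):
    yield from items     # equivalent and simpler
```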

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97831
Approved by: https://github.com/Skylion007
2023-03-29 19:15:24 +00:00
22b723132b Update ufmt to v2.1.0 (#97900)
Updates ufmt to the latest version with all the relevant bugfixes and performance improvements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97900
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-03-29 19:01:21 +00:00
e626be79a4 Add config setting to error on recompile (#97829)
Adds a config setting `error_on_recompile` - when set dynamo will raise an exception after compiling a function for the second time.

This was requested to help debugging in pyper
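
A hedged usage sketch (assuming the setting lives on `torch._dynamo.config`, as dynamo configs generally do):

```python
import torch
import torch._dynamo as dynamo

dynamo.config.error_on_recompile = True  # raise instead of silently recompiling

@torch.compile
def fn(x):
    return x * 2

fn(torch.randn(4))      # first compilation
fn(torch.randn(8, 8))   # a new input shape forces a second compilation -> raises
```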

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97829
Approved by: https://github.com/bertmaher
2023-03-29 19:00:43 +00:00
bb40b62501 Delete fusions_possible counter (#97881)
This is not used by anything, and it doesn't mean anything either.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97881
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-03-29 18:27:24 +00:00
313db584f3 [BE][autograd Function] Raise an error if input is returned as-is and saved for forward or backward in setup_context (#97212)
Fixes https://github.com/pytorch/pytorch/issues/96887

We error out in BOTH the case where the graph is created and the case where it is not.

Still bc-breaking, but not as severe because we are limiting to the case where someone uses setup_context.

This makes setup_context and non-setup_context versions diverge in their behavior
- With the non-setup_context version, saved variables are assumed to have the grad_fn of the inputs.
- But now with the setup_context version, we produce an error for this case.
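
An illustrative sketch of the now-erroring pattern (a hypothetical Function written against the setup_context-style API):

```python
import torch

class Identity(torch.autograd.Function):
    @staticmethod
    def forward(x):
        return x  # the input is returned as-is

    @staticmethod
    def setup_context(ctx, inputs, output):
        ctx.save_for_backward(output)  # saving the as-is returned input

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

x = torch.randn(3, requires_grad=True)
y = Identity.apply(x)  # expected to raise an error after this change
```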

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97212
Approved by: https://github.com/zou3519
2023-03-29 17:54:00 +00:00
4114c1ea02 Revert "[dtensor] remove typing hack of DeviceMesh (#94526)"
This reverts commit 70b063db0e2b55b24c096884b2375376c0925453.

Reverted https://github.com/pytorch/pytorch/pull/94526 on behalf of https://github.com/atalman due to breaking internal builds
2023-03-29 17:33:58 +00:00
8372c5dc68 Refactor dynamic dims api, stateless internals, higher level export API (#96699)
The purpose of this API is to execute a few large components of work:

1) Refactor all the internals of plumbing dynamic dimension information after dynamo to be stateless
2) Decouple allocation controls around dynamic dimensions from verification
3) For (2), for allocation, create an enum that dictates whether we are in DUCK (default today), STATIC (aka assume_static_default in the past), or DYNAMIC (aka user constrained, do not duck shape)
4) For (2), for verification, we separate out the list of dynamic ranges entirely from allocation. This means the shape_env does not track what we verify on; instead, it is the caller's job to invoke produce_guards() with the various things they want verified, specifically with the valid ranges. We do use constrain ranges to refine value ranges when doing analysis.
5) We have decided, therefore, as an extension of (4) to double down on "late" checks versus "eager" checks, primarily because the mechanisms for gathering what actually matters happens during guards, and should be a purview of the caller seeking guards, not the shape env. However, for dynamo, these structures are essentially one and the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96699
Approved by: https://github.com/avikchaudhuri, https://github.com/ezyang
2023-03-29 16:55:49 +00:00
2c16b73a1b Remove comma from parametrized test name (#97844)
Using the `name_fn` argument of the `@parametrize` decorator.

The internal test runner can't figure out how to parse those; otherwise this is a no-op.
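
A hedged sketch of the pattern (using torch's internal `parametrize` helper; the exact generated names are an assumption):

```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests)

class TestExample(TestCase):
    # Without name_fn the generated test name embeds the tuple repr (with a
    # comma); name_fn produces a comma-free suffix such as "2x3" instead.
    @parametrize("shape", [(2, 3), (4, 5)],
                 name_fn=lambda shape: "x".join(str(d) for d in shape))
    def test_rank(self, shape):
        self.assertEqual(len(shape), 2)

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```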

For those with intern access, see [T149211516](https://www.internalfb.com/intern/tasks/?t=149211516)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97844
Approved by: https://github.com/weiwangmeta
2023-03-29 14:20:13 +00:00
44e73db3c2 [2/n] Consolidate replicate and DDP: split forward function (#96658)
Split `forward` function into `pre_forward` and `post_forward`, so that they can be reused in the composable API of `replicate`.

Differential Revision: [D44377456](https://our.internmc.facebook.com/intern/diff/D44377456)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96658
Approved by: https://github.com/rohan-varma
2023-03-29 13:57:16 +00:00
170a1c3ace [ONNX] Fix typo "scipt" -> "script" (#97850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97850
Approved by: https://github.com/BowenBao
2023-03-29 13:42:13 +00:00
2ce6ad9aa9 [inductor] make run_and_get_cpp_code signature match run_and_get_triton_code (#97826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97826
Approved by: https://github.com/ezyang
2023-03-29 12:29:03 +00:00
4ae4c6f68a Fix typo when setting FSDP state dict config (#97110)
`get_state_dict_type` in FSDP looks for a key called `_optim_state_dict_config` when getting the optimizer state dict config.  However, `set_state_dict_type` sets the config at a key called `_optimstate_dict_config`.  This looks like a typo.

This fixes the discrepancy, so that when you set the state dict type, it is correctly used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97110
Approved by: https://github.com/awgu, https://github.com/fegin
2023-03-29 10:46:26 +00:00
004bb34f42 inductor: fix vision_maskrcnn dynamic_shapes error on CPU (#97312)
Fix several c++ compilation errors in `vision_maskrcnn` in dynamic_shapes cases:
1. convert `ceiling` to `std::ceil` in `CppPrinter`:
```bash
error: ‘ceiling’ was not declared in this scope
   17 |                 for(long i1=0; i1<ceiling(1.8735363483429*ks1); i1+=1)
```

2. convert index in `store` to `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   52 |                         out_ptr0[i2 + (i1*(floor(1.8735363483429*ks2))) + (i0*(std::ceil((1.87353634834290*ks1)))*(floor(1.8735363483429*ks2)))] = tmp30;
```

3. convert offset, size, steps in loop to  `INDEX_TYPE`:
```bash
error: invalid controlling predicate
   16 |                 for(long i1=0; i1<std::ceil((1.87353634834290*ks1)); i1+=1)
```

4. convert index in `load` to  `INDEX_TYPE`:
```bash
error: invalid types ‘float*[double]’ for array subscript
   64 |                     auto tmp0 = out_ptr0[i1 + (i0*(floor(1.8735363483429*ks2)))];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97312
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/jansel
2023-03-29 10:24:57 +00:00
2f06fc2422 prepare doc preview s3-prefix for future change (#97433)
Sister patch for pytorch/test-infra#3917. TL;DR: this moves the doc preview scheme to $OWNER/$REPO rather than the $REPO scheme that we are currently rolling with.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97433
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-03-29 10:13:43 +00:00
faccd87658 [NNC] Fix the issue that the void** could not store a scalar if the bit width of the scalar is greater than 32bit on a 32bit platform (#97669)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97669
Approved by: https://github.com/jgong5
2023-03-29 09:22:13 +00:00
6871665a97 Avoid copies in matmul (no ghstack) (#97355)
Resubmit of https://github.com/pytorch/pytorch/pull/76828 without using ghstack so that @ngimel can import it and help me debug why it was reverted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97355
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-29 06:54:09 +00:00
46faa79e09 Simplify by using yield from in torch/utils/data (#97839)
Also see https://github.com/pytorch/pytorch/pull/97831
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97839
Approved by: https://github.com/NivekT, https://github.com/Skylion007
2023-03-29 04:51:26 +00:00
f388bec985 [Dynamo] torch.Generator state should have a source and be reconstructed properly (#97403)
Fixes #97077 partially.

During FX graph propagation, we require that every tensor have a source:
a524123c91/torch/_dynamo/variables/builder.py (L929)
However, the output of ```torch.Generator.get_state()``` is a tensor without a source, since it's generated inside the FX graph. My change follows what we did for [Python random functions](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/variables/user_defined.py#L260) by adding a dedicated ```GeneratorStateSource```. We also have to update the reconstruction logic, since we will reuse the ```TensorVariable``` reconstruction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97403
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-03-29 04:31:23 +00:00
9d1d95099b Disable dynamo tracing torchrec.distributed (#97824)
This was used to unblock Meta-internal use cases where ```torchrec.distributed``` is used; however, it can't currently be traced by dynamo properly.
We sent the same fix (#90087) several months ago, but it was reverted due to ```fbgemm``` conflicts. This PR catches ```Exception``` rather than ```ImportError```, which handles the conflicts.
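
A minimal sketch of the import-guard pattern described above (module handling is illustrative):

```python
# Register torchrec.distributed to be skipped by tracing, but tolerate any
# import-time failure (not just ImportError), since fbgemm conflicts can
# surface as other exception types.
try:
    import torchrec.distributed  # noqa: F401
    MODULES_TO_SKIP = ("torchrec.distributed",)
except Exception:
    MODULES_TO_SKIP = ()
```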

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97824
Approved by: https://github.com/wconstab
2023-03-29 04:29:51 +00:00
f4ac8e0052 Add dynamo config skip_nnmodule_hook_guards (#97830)
This lets users who are sure they won't use hooks avoid the overhead
of dynamo guards on (presumably) empty hook dicts on all
nn modules.
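
A hedged usage sketch (assuming the flag is exposed via `torch._dynamo.config`):

```python
import torch
import torch._dynamo as dynamo

# Only safe if hook registration will not change after compiling (see the
# caveats below).
dynamo.config.skip_nnmodule_hook_guards = True

model = torch.nn.Linear(8, 8)
compiled = torch.compile(model)
compiled(torch.randn(2, 8))
```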

Only enable this flag if you are sure you won't change hook-behavior
after compiling.  It is ok to register a hook and then compile, if
you promise never to remove/alter the hook.  It is also ok to
not register a hook and compile, if you never register a hook later.

Note- this is not the best we can do, and hopefully in the future
we can avoid the need for this option following some of these paths
- make guards fast enough to not be an issue when guarding on hook
  dicts
- make a mode where dynamo actually skips tracing __call__ so
  hooks are consistently ignored by compiled programs
- use nnmodule versioning so hook changes can be guarded without
  explicit hook dict guards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97830
Approved by: https://github.com/jansel
2023-03-29 04:25:27 +00:00
91166ef7e7 Remove rocm python 3.11 restriction (#97818)
Removes restrictions for rocm 5.4.2 and python 3.11

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97818
Approved by: https://github.com/malfet
2023-03-29 02:51:13 +00:00
f754be897a Disable speedup_experiment_ds (#97806)
It seems to be broken.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97806
Approved by: https://github.com/jansel
2023-03-29 01:27:31 +00:00
60631aefe5 Disable test_variable_sharing on ASAN due to non-deterministically hang (#97742)
See https://github.com/pytorch/pytorch/issues/94024.  I disabled this test on ASAN a while ago for this exact issue.  The issue, unfortunately, was hard to reproduce, and the flaky bot closed it 3 weeks ago.  The ASAN job has been hanging flakily since then, e.g. 8313becefa.

I don't want to reopen the issue and forget about it after 2 weeks, so let's disable the test for ASAN and be at peace (for now).  Interestingly, there are other tests here that also hang on ASAN, e.g. `test_leaf_variable_sharing`:

```
# See https://github.com/pytorch/pytorch/issues/14997
@unittest.skipIf(TEST_WITH_ASAN,
                 "non-deterministically hangs with ASAN")
def test_leaf_variable_sharing(self):
```

I suspect that they have the same root cause.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97742
Approved by: https://github.com/clee2000
2023-03-29 01:18:44 +00:00
9e2e345af7 [inductor] avoid kernel cache miss because of different arg name (#97755)
We previously used the buffer name for the variable containing the randomly generated kernel input in the kernel benchmark. This has a big drawback: the same kernel may be used for different buffers, but if we use the buffer name as the argument name, the kernel source code differs for each invocation of the kernel. This causes the following downsides:
- compile time is longer since we cannot reuse the compiled kernel due to cache misses
- it causes inconsistent behavior with TORCHINDUCTOR_BENCHMARK_KERNEL enabled versus disabled; we may see more kernels (some essentially duplicates) in the compiled module if TORCHINDUCTOR_BENCHMARK_KERNEL is enabled
- it obscures some optimization opportunities; e.g., a kernel spending 6% of the time is worth looking at, but if the kernel is called 20 times and shows up as 20 different kernels each spending 0.3% of the time, it is less obvious that we should optimize it

In this PR, we just use a canonical name like `arg_{i}` rather than the buffer name to avoid all the issues above.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97755
Approved by: https://github.com/jansel
2023-03-29 01:07:46 +00:00
5949d86bec [Easy] Remove unnecessary graph lint (#97815)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97815
Approved by: https://github.com/fegin
2023-03-29 00:41:00 +00:00
70b063db0e [dtensor] remove typing hack of DeviceMesh (#94526)
This removes the typing hack, part of https://github.com/pytorch/pytorch/pull/92931
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94526
Approved by: https://github.com/XilunWu
2023-03-29 00:23:47 +00:00
8a45befcec [reland] add numpy typing plugin to mypy config (#94525)
reland of https://github.com/pytorch/pytorch/pull/92930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94525
Approved by: https://github.com/huydhn
2023-03-29 00:23:47 +00:00
2490ac561f Propagate inductor guards to ShapeEnv (#97777)
Fixes https://github.com/pytorch/pytorch/issues/96971

Antoni is going to help us get a small test case to put in the test
suite.  The root cause of the problem was Alibi, which has arange
with negative step.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97777
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-29 00:09:43 +00:00
597b558c51 [BE]: Update flake8 and plugins and fix bugs (#97795)
Update flake8 and flake8-plugins in lintrunner to a modern version. Enables more checks and makes flake8 checks significantly faster. Added a few additional rule ignores that will need to be fixed in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97795
Approved by: https://github.com/alexsio27444, https://github.com/janeyx99, https://github.com/ezyang
2023-03-28 23:51:55 +00:00
7282be3d91 Patch for nvfuser build (#97404)
1. Package nvfuser headers to support C++ builds against nvfuser;
2. Move `#include <torch/csrc/jit/codegen/fuser/interface.h>` from `torch/csrc/jit/runtime/register_ops_utils.h` to `torch/csrc/jit/runtime/register_prim_ops_fulljit.cpp` to avoid a missing header, since PyTorch doesn't package `interface.h`;
3. Patch the DynamicLibrary load of nvfuser to leak the handle; this avoids double de-allocation of `libnvfuser_codegen.so`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97404
Approved by: https://github.com/davidberard98
2023-03-28 23:36:08 +00:00
e0a647d8b5 new pin (#97278)
Contains a fix for https://github.com/openai/triton/issues/1372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97278
Approved by: https://github.com/Skylion007
2023-03-28 23:30:23 +00:00
bc86af0d37 Remove DeferredIndentedBuffer (#97616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97616
Approved by: https://github.com/desertfire
2023-03-28 23:13:41 +00:00
c92dfe2694 [Vulkan] Add convert_qconv2d_context op (#97714)
Summary:
This diff adds a convert_qconv2d_context op, which converts a CPU quantized Conv2dPackedParamsBase object (used by quantized::conv2d) into a Vulkan Conv2dPackedContext object.
This op is used in a later diff (D44189363) to do a graph rewrite of the quantized conv2d and conv2d_relu ops.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: SS-JIA

Differential Revision: D41595032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97714
Approved by: https://github.com/SS-JIA
2023-03-28 23:06:16 +00:00
662a8cf74d [FSDP][8/N] Simplify addr padding internals (#97796)
This is a follow-up to the last PR to greatly simplify the approach. This should be much cleaner.

**Details**
Let `N` denote the number of original parameters flattened into a given flat parameter with `M` extra padding tensors.
- `_numels_with_padding`: length `N + M`
- `_is_padding_mask`: length `N + M`
- `_numels`, `_param_infos`, `_shapes`, `_fqns`, `_param_extensions`: length `N`

`_shard_param_indices` and `_shard_param_offsets` were used to determine (1) if a given original parameter is in the local shard and if so, then (2) what is its offset in the _sharded_ flat parameter, and (3) how many numel are in the _sharded_ flat parameter.

This PR reworks how to achieve (1), (2), and (3) to allow for simplifying the previously mentioned data structures. In particular, it saves one extra tuple `_shard_param_infos: Tuple[_ShardParamInfo, ...]` of length `N`, where each `_ShardParamInfo` entry gives exactly the needed info. For example, the offset into the sharded flat parameter is now pre-computed, so we no longer need to accumulate `offset = 0; offset += numel_in_shard` in a `for` loop each time.
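
A hypothetical sketch of what each such entry might carry (field names below are assumptions for illustration, not the actual definition):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShardParamInfoSketch:
    in_shard: bool                         # (1) is the original parameter in the local shard?
    offset_in_shard: Optional[int] = None  # (2) pre-computed offset into the sharded flat parameter
    numel_in_shard: Optional[int] = None   # (3) numel of this parameter within the local shard
```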

For optimizer state dict, `FSDPParamInfo.param_indices` now maps to the indexes with respect to the length `N` data structures, not the length `N + M` ones. The only purpose of `param_indices` is to be able to index into `flat_param._shard_param_infos[i]` to get the contained info to flatten the unsharded original parameter optimizer state and extract the part in the local shard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97796
Approved by: https://github.com/rohan-varma
2023-03-28 22:19:44 +00:00
aee96e2cb3 Revert "[inductor] Refactor cpp_wrapper to be an attribute of GraphLowering (#97709)"
This reverts commit 8710dc8d5a09204391ebcaaed9839c1d885bdc44.

Reverted https://github.com/pytorch/pytorch/pull/97709 on behalf of https://github.com/malfet due to Broke cpu_wrapper tests on MacOS, see https://github.com/pytorch/pytorch/actions/runs/4545603517/jobs/8014327136#step:13:868
2023-03-28 22:07:33 +00:00
dc3d6fe6b0 [jit][easy] add missing quotes in namedtuple forwardref tests (#97736)
Follow-up to #96933. This test was intended to have quotes around the
type annotations for the attributes of the NamedTuple. This PR adds the
missing quotes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97736
Approved by: https://github.com/eellison
2023-03-28 21:44:01 +00:00
79d2a8dd9e [PyTorch] Second try: use c10::FastMap for memoizing in Pickler (#96688)
These maps don't rely on reference stability, so FastMap should be fine.

First try (#96360) was reverted because it broke internal tests.

Differential Revision: [D43995796](https://our.internmc.facebook.com/intern/diff/D43995796/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43995796/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96688
Approved by: https://github.com/malfet
2023-03-28 21:23:13 +00:00
d4829bd6c7 Use remote master as the linter merge base (#97800)
Fixes https://github.com/pytorch/pytorch/issues/96794

Sometimes people never update their local `master` branch. Their workflow instead consists of fetching commits from git and directly creating branches off of the remote `master` branch (e.g. via `git checkout -b <mybranch> origin/master`).

For those people, their local `master` is very old and out of date, creating an unreasonably old lint base that tends to catch all sorts of unrelated linter errors.

Anyone with an updated `master` branch will naturally have an updated pointer to the remote `master`, so this change makes lintrunner friendly to both behavior patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97800
Approved by: https://github.com/huydhn
2023-03-28 21:23:05 +00:00
c39f1c1490 Allow DTensor to trigger collecives before inplace ops (#97787)
Mainly two fixes:

1. `make_fx` seems to trace through DeviceMesh operations. This commit removes those from the DTensor expanded graph.
2. During DTensor expansion, autograd complains about inplace changes on leaf nodes. This commit wraps the entire DTensor expansion code in `torch.no_grad()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97787
Approved by: https://github.com/wanchaol
2023-03-28 21:06:51 +00:00
35a13a593e Revert "Updates NCCL to 2.17.1 (#97407)"
This reverts commit b113a09ef90decbc703722bfdc2064fc5eb54a19.

Reverted https://github.com/pytorch/pytorch/pull/97407 on behalf of https://github.com/clee2000 due to looks like it broke inductor distributed tests b113a09ef9 (12344853677)
2023-03-28 21:04:18 +00:00
b443198966 Fix sparse addmv ref impl for non-contig tensors (#97730)
Fix the logic in `test_block_addmm` that tested the op against itself rather than against the dense implementation, by implementing a `ref_addvm` function that converts the tensor back to dense before multiplying it with the vector.

Fix the reference implementation by passing the stride for the vector and result. (Not sure whether it will be more performant to iterate over the strided tensor or to request a dense copy as the MKL implementation does.)

Print a more verbose error message if values differ.
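
A hedged sketch of such a dense reference (the helper name and argument order below are assumptions based on the description):

```python
import torch

def ref_addmv_dense(beta, c, alpha, mat, vec):
    # Densify the (block-)sparse matrix first so the sparse kernel is checked
    # against plain dense math rather than against itself.
    return beta * c + alpha * torch.mv(mat.to_dense(), vec)
```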

Fixes https://github.com/pytorch/pytorch/issues/97629 , https://github.com/pytorch/pytorch/issues/97589 ,  https://github.com/pytorch/pytorch/issues/97563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97730
Approved by: https://github.com/cpuhrsch
2023-03-28 20:46:32 +00:00
bb42104fe8 [DataLoader] Fix collation logic (#97789)
Similar to #97737, a previous auto-refactor changed how `bytes` are handled during collation, which can potentially lead to performance regression. This PR undoes that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97789
Approved by: https://github.com/albanD
2023-03-28 20:25:34 +00:00
ae3316c16e Update CODEOWNERS for torch data (#97797)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97797
Approved by: https://github.com/NivekT
2023-03-28 20:15:16 +00:00
fee1407c8d [xla hash update] update the pinned xla hash (#91874)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91874
Approved by: https://github.com/pytorchbot, https://github.com/huydhn
2023-03-28 19:52:01 +00:00
fb7f983357 Graph break on operators that fake tensor doesn't support (#97708)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97708
Approved by: https://github.com/eellison
2023-03-28 19:49:54 +00:00
08f125bcac [ROCm] Remove usage of deprecated ROCm component header includes (#97620)
- clang parameter 'amdgpu-target' changed to 'offload-arch'
- HIP and MIOpen include paths updated for extensions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97620
Approved by: https://github.com/ezyang, https://github.com/jithunnair-amd
2023-03-28 19:28:38 +00:00
4afef85dda [MPS] Fix index_select_scalar test (#97773)
#96408 introduced a check that prevents the index into a scalar from being non-singleton.

Fixes #94162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97773
Approved by: https://github.com/kulinseth
2023-03-28 19:23:59 +00:00
196acc84b1 [UX] Advise users to rebase-and-merge for stale PRs (#97808)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97808
Approved by: https://github.com/clee2000, https://github.com/kit1980
2023-03-28 19:03:16 +00:00
b113a09ef9 Updates NCCL to 2.17.1 (#97407)
This PR updates NCCL submodule to 2.17.1.
Closes https://github.com/NVIDIA/nccl/issues/750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97407
Approved by: https://github.com/ngimel, https://github.com/ptrblck, https://github.com/malfet
2023-03-28 19:01:37 +00:00
8289120ef0 Revert "test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)" (#97759)
Tests were already fixed in https://github.com/pytorch/pytorch/pull/92834, and these changes, instead of also fixing the tests, are now breaking them again.

This reverts commit 7f94ea84927844842a1d0892b7a5e6a41518430b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97759
Approved by: https://github.com/janeyx99
2023-03-28 18:43:08 +00:00
2ef6ffdfa1 Revert "[BE][autograd Function] Raise an error if input is returned as-is and saved for forward or backward in setup_context (#97212)"
This reverts commit f3aca45a163cf1aafd4f5fa65a0adce53b33abfa.

Reverted https://github.com/pytorch/pytorch/pull/97212 on behalf of https://github.com/soulitzer due to TestAutogradFunctionCUDA.test_function_returns_input_inner_requires_grad_True_save_for_vjp_save_tensors_output_mark_dirty_True_cuda leaks
2023-03-28 18:30:51 +00:00
6854fd7189 Add Config to Skip Cpp Codegen, Enable in FBCode (#97204)
Differential Revision: [D44353662](https://our.internmc.facebook.com/intern/diff/D44353662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97204
Approved by: https://github.com/ngimel, https://github.com/bertmaher, https://github.com/mikekgfb, https://github.com/cpuhrsch
2023-03-28 18:21:15 +00:00
c0e0fbb6e1 inductor: fix _dynamic_reshape_indexer issue when tail index is sym (#97502)
For the TIMM **swin_base_patch4_window7_224** dynamic shape case, there is an error for the ```view``` op:

```
  File "/home/xiaobing/pytorch-offical/torch/_inductor/lowering.py", line 229, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/home/xiaobing/pytorch-offical/torch/_inductor/lowering.py", line 665, in view
    return TensorBox(View.create(x.data, sizes))
  File "/home/xiaobing/pytorch-offical/torch/_inductor/ir.py", line 1325, in create
    reindex = cls.dynamic_reshape_indexer(old_size, new_size)
  File "/home/xiaobing/pytorch-offical/torch/_inductor/ir.py", line 1351, in dynamic_reshape_indexer
    reindex2 = cls._dynamic_reshape_indexer(flat, new_size)
  File "/home/xiaobing/pytorch-offical/torch/_inductor/ir.py", line 1406, in _dynamic_reshape_indexer
    assert size_new == 1
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AssertionError:
  target: aten.view.default
  args[0]: TensorBox(StorageBox(
    Pointwise(
      'cpu',
      torch.float32,
      def inner_fn(index):
          i0, i1, i2, i3 = index
          tmp0 = ops.load(buf37, i3 + 49 * i2 + 2401 * i1 + 9604 * i0)
          tmp1 = ops.load(arg35_1, i3 + 49 * i2)
          tmp2 = ops.load(arg1_1, i1 + 4 * (tmp1))
          tmp3 = tmp0 + tmp2
          return tmp3
      ,
      ranges=[64, 4, 49, 49],
      origins={add_12}
    )
  ))
  args[1]: [64//s3, s3, 4, 49, 49]
```

The target shape of ```view``` is ```[64//s3, s3, 4, 49, 49]```, and ```Sym(s3)``` is 64; see

```
sym_size_16: Sym(s3) = torch.ops.aten.sym_size(arg34_1, 0)
floordiv_3: Sym(64//s3) = sym_size_13 // sym_size_16
view_33: f32[64//s3, 64//(64//s3), 4, 49, 49] = torch.ops.aten.view.default(add_12, [floordiv_3, sym_size_16, 4, sym_size_14, sym_size_14]);  add_12 = floordiv_3 = sym_size_16 = None
```

The tail index of the new size is ```Sym(64//s3)```, which is not a number, so we shouldn't directly compare it with ```1```.

Currently, I haven't found a simple test case to reproduce it, so I just tested it on the real model.
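
A hedged sketch of the kind of check the fix needs (purely illustrative; the actual code in `ir.py` may differ):

```python
import sympy

def is_concrete_one(size_new) -> bool:
    # Only treat the tail size as 1 when it is a concrete integer; a symbolic
    # size such as 64//s3 cannot be compared against 1 at lowering time.
    if isinstance(size_new, sympy.Expr) and size_new.free_symbols:
        return False
    return int(size_new) == 1
```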

```
python -m torch.backends.xeon.run_cpu --core_list 0 --ncores_per_instance 1 benchmarks/dynamo/timm_models.py --performance --float32 -dcpu -n50 --inductor --only swin_base_patch4_window7_224 --batch_size 1 --threads 1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97502
Approved by: https://github.com/jgong5, https://github.com/ezyang
2023-03-28 18:00:07 +00:00
b895a0a675 [BE] Move flatbuffer related python C bindings to script_init (#97476)
Summary:
An extra C binding module for flatbuffer was introduced because
not all dependencies of PyTorch want to (or can) bundle in flatbuffer.

However, flatbuffer is included by default now, so this separate binding is no longer needed.

Test Plan: existing unit tests

Differential Revision: D44352583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97476
Approved by: https://github.com/dbort
2023-03-28 17:56:32 +00:00
d8cc8ffebc [DataLoader] Short circuit pin_memory recursion when operating on bytes (#97737)
Slack thread: https://pytorch.slack.com/archives/GEEQ2K4MD/p1679962409906099

I was seeing some massive (~2x) slowdowns on a job after running it on PyTorch 2.0. From some profiling in `py-spy` it looked like the pin_memory thread was doing a lot more work than before. Looking at a trace in `nsys`, I saw that the thread doing the forward pass had a bunch of `pthread_cond_timedwait` (GIL reacquire) calls in its call stack, and it seemed like the thread doing the forward pass was getting blocked (waiting for the GIL) by the pin memory thread (which was holding the GIL).

After some debugging I found the issue. If a `bytes` object was passed into `pin_memory`, previously in 1.13 (before https://github.com/pytorch/pytorch/pull/94709) it would short-circuit and return here
d922c29a22/torch/utils/data/_utils/pin_memory.py (L54-L55)
since `bytes` was in `torch._six.string_classes`:
```
>>> from torch._six import string_classes
>>> string_classes
(<class 'str'>, <class 'bytes'>)
>>>
```

However after https://github.com/pytorch/pytorch/pull/94709, if a `bytes` was passed into `pin_memory` it would fall into here instead
c263bd43e8/torch/utils/data/_utils/pin_memory.py (L68-L73)
because the previous check is now doing `isinstance(data, str)` instead of `isinstance(data, (str, bytes))`!
c263bd43e8/torch/utils/data/_utils/pin_memory.py (L56-L57)

As a result, `pin_memory` gets called recursively for each element in the `bytes` leading to a ton of wasted recursion. This also explains the slowdown / GIL contention I was seeing.

This PR simply changes `isinstance(data, str)` to `isinstance(data, (str, bytes))` to match the behavior before https://github.com/pytorch/pytorch/pull/94709
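
A simplified stand-in for the recursive helper, showing the effect of the one-line change (not the actual torch code):

```python
import collections.abc

import torch

def pin_memory_sketch(data):
    if isinstance(data, torch.Tensor):
        return data.pin_memory()
    if isinstance(data, (str, bytes)):   # the fix: short-circuit bytes as well as str
        return data
    if isinstance(data, collections.abc.Sequence):
        # without the bytes short-circuit above, a bytes object lands here and
        # triggers one recursive call per byte
        return [pin_memory_sketch(d) for d in data]
    return data
```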

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97737
Approved by: https://github.com/albanD, https://github.com/NivekT
2023-03-28 17:39:23 +00:00
1a2dcff127 Added ModuleInfos for remaining activation functions (#97704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97704
Approved by: https://github.com/albanD
2023-03-28 17:11:41 +00:00
dbd41cfa91 Add arm tests to mps workflow (#97279)
Fixes #ISSUE_NUMBER
https://github.com/pytorch/pytorch/pull/94417/files broke Mac tests but was an MPS PR, so add ARM tests to the MPS workflow.

<!--
copilot:poem
-->
### <samp>🤖 Generated by Copilot at 4b0f4ed</samp>

> _`PyTorch` on `macOS`_
> _Testing new Python and ARM_
> _Autumn of Silicon_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97279
Approved by: https://github.com/ZainRizvi
2023-03-28 16:55:53 +00:00
9eea9d21a4 Update ONNX submodule from ONNX 1.13.1 with Protobuf 4.21 updates (#96138)
~Test the ONNX 1.13.1 package which was built with Protobuf v21. The onnx-test-protobufv21 package was created by this PR: https://github.com/onnx/onnx/pull/4973, which was based on the rel-1.13.1 release branch (which is what PyTorch is using now).~ Update the ONNX submodule from https://github.com/onnx/onnx/tree/1.13.1-protobuf4.21, which was created from the rel-1.13.1 branch with updates for Protobuf 4.21. Please note that PyTorch should still be able to build ONNX from source with Protobuf 3 using the same source code.

BTW, https://github.com/onnx/onnx/pull/4956/files is the PR targeting ONNX's main branch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96138
Approved by: https://github.com/kit1980, https://github.com/BowenBao
2023-03-28 16:55:10 +00:00
22e3f67cd2 Update vision pinned hash (#97706)
after https://github.com/pytorch/vision/pull/7448, the test is now ok

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97706
Approved by: https://github.com/tugsbayasgalan
2023-03-28 16:54:01 +00:00
8710dc8d5a [inductor] Refactor cpp_wrapper to be an attribute of GraphLowering (#97709)
Summary: to prepare for further AOT Inductor changes

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 7dff885</samp>

This pull request adds support for AOT compilation and C++ wrapper code generation for inductor models. It modifies the `GraphLowering` class in `torch/_inductor/graph.py` and the `compile_fx` function in `torch/_inductor/compile_fx.py` to enable this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97709
Approved by: https://github.com/jansel
2023-03-28 16:50:36 +00:00
2b369eb3c2 [fix] jacrev and jacfwd : support non-tensor args again (#97746)
Fixes https://github.com/pytorch/pytorch/issues/97636

The code that checks whether argument tensors are complex assumed that all arguments are tensors (which is not the case), which led to the error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97746
Approved by: https://github.com/zou3519
2023-03-28 16:37:33 +00:00
1c83888be8 [memory profiling] show pre-existing memory in trace_plot (#97590)
Previously we only plotted memory if it was allocated or freed while
trace recording was active. This change also adds any pre-existing blocks
to the visualization. This helps because it is common to enable trace recording
later and then not realize that there is a lot of allocated memory in
the trace even though a lot was allocated beforehand.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97590
Approved by: https://github.com/eellison
2023-03-28 16:31:10 +00:00
b1a83c4da4 [memory history] cleanup recording API (#97406)
This makes the options for recording memory history
easier to understand and makes the default record
the most information.

<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at 4706acf</samp>

This pull request enhances the memory profiling and debugging capabilities of PyTorch on CUDA devices. It introduces a new API for memory history recording in `torch/cuda/memory.py` and `test/test_cuda.py`, and adds new functions for memory snapshot management and visualization in `torch/cuda/memory.py`.

Also adds a quick _dump_snapshot function to make
it easier to look at the common visualizations.

### <samp>🤖 Generated by Copilot at 4706acf</samp>

*  Modify the `_record_memory_history` function to use a new API that accepts a string argument for the `enabled` parameter and more parameters to control the stack trace collection and memory event history ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L620-R696))
* Add a new function `_dump_snapshot` that allows users to dump a memory snapshot to a directory with HTML plots of the memory segments and events ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377R703-R713))
* Update the test cases in `test/test_cuda.py` to use the new API for memory history recording and check the expected output of the memory plots ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4946-R4946), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L4984-R4984), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5000-R5000), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5015-R5015), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5035-R5038), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450R5045-R5046), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5060-R5059), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5068-R5065), [link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-893b1eea27352f336f4cd832919e48d721e4e90186e63400b8596db6b82e7450L5088-R5085))
* Add missing imports and types to the `torch/cuda/memory.py` module ([link](https://github.com/pytorch/pytorch/pull/97406/files?diff=unified&w=0#diff-80bd98caafb20d758f45a4d23711810f7e0b9ce7a6505094f9dbb0e00a657377L5-R15))
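
For orientation, a hedged usage sketch of the recording flow described above; beyond the string-valued `enabled` argument and `_dump_snapshot`, the specific argument values and the disable call are assumptions:

```python
import torch

# Start recording allocation history; per this PR, `enabled` takes a string.
torch.cuda.memory._record_memory_history(enabled="all")

x = torch.randn(1024, 1024, device="cuda")
y = x @ x

# Dump the recorded history for visualization (path is illustrative).
torch.cuda.memory._dump_snapshot("memory_snapshot")

# Stop recording (assumed: passing None disables recording).
torch.cuda.memory._record_memory_history(enabled=None)
```
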
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97406
Approved by: https://github.com/ezyang
2023-03-28 16:31:10 +00:00
9e029f44b5 [EASY] Fix test that does nothing (#97722)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97722
Approved by: https://github.com/jansel
2023-03-28 16:31:03 +00:00
0176fb4cd6 Remove fast_nvcc entry in README.md (#97624)
After https://github.com/pytorch/pytorch/pull/96665 landed, the fast_nvcc tool is no longer available.
This commit removes the documentation for it so as not to confuse users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97624
Approved by: https://github.com/drisspg
2023-03-28 16:23:09 +00:00
26c5e34b47 Re-enable ProcessGroupMPITest in CI (#97687)
As https://github.com/pytorch/pytorch/issues/60756 has been fixed a while back

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97687
Approved by: https://github.com/clee2000, https://github.com/kit1980
2023-03-28 16:04:16 +00:00
428540001d Add shape function for squeeze.dims op (#93919)
Changes to `_native_batch_norm_legit` and `upsample_nearest2d` in `serialized_shape_function_registry.cpp` are made only because this file is auto-generated and had not been regenerated after the earlier changes to `_shape_functions.py` for those two ops.

Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93919
Approved by: https://github.com/davidberard98
2023-03-28 14:55:00 +00:00
b2f1edabfe Renaming all_known_overloads to all_py_loaded_overloads and add comment (#97672)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97672
Approved by: https://github.com/Skylion007
2023-03-28 14:10:38 +00:00
bb85b43c0b Move test_cpp_wrapper to its own file (#97634)
This test takes 5+ minutes to finish; this change breaks it into smaller
pieces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97634
Approved by: https://github.com/eellison
2023-03-28 14:08:51 +00:00
c785f1903a Add dynamic shapes to perf dashboard (#97673)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97673
Approved by: https://github.com/desertfire
2023-03-28 13:18:22 +00:00
08766b23de [Quant][FX] lower ConvTranspose3d (#97125)
**Summary**
Enable quantization and lowering of `ConvTranspose3d`.
Add test cases for `ConvTranspose1d`, `ConvTranspose2d` and `ConvTranspose3d` since there were no such test cases.

**Test plan**
python test/test_quantization.py -k test_conv_transpose_not_reference
python test/test_quantization.py -k test_conv_transpose_reference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97125
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-03-28 11:58:29 +00:00
a8065cc61f [dynamo] simplify get_item_dyn (#97637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97637
Approved by: https://github.com/ezyang
2023-03-28 08:34:50 +00:00
867b07b424 Sampler API described for customization. (#97338)
Added an explanation of sampler customization, with examples.

* fixed the TypeVar
* removed the unused init from the Sampler class
* added examples for a custom sampler and a batch sampler (a minimal illustrative sketch also follows below)
* fixed Distributed sampler typing
* fixed _InfiniteConstantSampler
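
A minimal illustrative custom sampler in the spirit of the examples added here (this sketch is not taken from the PR itself):

```python
from torch.utils.data import Sampler

class EvenIndexSampler(Sampler[int]):
    """Illustrative custom sampler: yields only the even indices of a dataset."""

    def __init__(self, data_source) -> None:
        self.data_source = data_source

    def __iter__(self):
        return iter(range(0, len(self.data_source), 2))

    def __len__(self) -> int:
        return (len(self.data_source) + 1) // 2
```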

Fixes #92268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97338
Approved by: https://github.com/NivekT
2023-03-28 06:40:38 +00:00
100b396b9b [Pytorch][coreml]Pass backend and modelid by value (#97566)
Summary:
Due to async dispatch, passing by reference may cause a crash.

Test Plan: CI

Reviewed By: mcr229

Differential Revision: D44386623

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97566
Approved by: https://github.com/mcr229
2023-03-28 06:34:55 +00:00
5aa4046743 [ONNX] Remove the _jit_pass_onnx_scalar_type_analysis pass (#97729)
It doesn't do anything because it doesn't analyze function calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97729
Approved by: https://github.com/titaiwangms
2023-03-28 05:17:36 +00:00
8624a2e88a Include missing header (#97453)
`std::exception_ptr` is defined in `<exception>`. This worked in the past because the header was transitively included by other headers. The situation has changed in the most recent LLVM (c9d36bd807)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97453
Approved by: https://github.com/ngimel
2023-03-28 05:12:47 +00:00
6df18260fc Fix typo in error message (#97716)
The check makes sure qkv_bias is 1-dimensional, but the error message says it should be 2-D.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97716
Approved by: https://github.com/drisspg
2023-03-28 04:36:43 +00:00
c1a6dde79e Make dynamo-FSDP skip guards (#97463)
Create a new GuardSource for FSDP modules, and use it
to opt out of guard installation.

Based on @awgu's work in https://github.com/pytorch/pytorch/pull/97091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97463
Approved by: https://github.com/voznesenskym, https://github.com/jansel, https://github.com/awgu
2023-03-28 04:04:34 +00:00
e9050ef74e explicitly list out caffe2 protos (#97612)
This will make it easier to split up this library.

Differential Revision: [D44400049](https://our.internmc.facebook.com/intern/diff/D44400049/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97612
Approved by: https://github.com/PaliC
2023-03-28 03:54:02 +00:00
1726c6f7a7 [fix] vmap: fix segfault on data access (#97237)
Fixes #97161

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97237
Approved by: https://github.com/zou3519
2023-03-28 03:35:44 +00:00
403905a37b cleanup caffe2/proto package (#97601)
Name according to Bazel recommendations, adjust order from public to
private, and restrict visibility.

Differential Revision: [D44396068](https://our.internmc.facebook.com/intern/diff/D44396068/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97601
Approved by: https://github.com/PaliC
2023-03-28 03:34:58 +00:00
5e6e984835 flake8 version reporting in collect_env (#94573)
Fixes #94571

# Testing
`[pip3] flake8==3.9.2` now appears under `Versions of relevant libraries:`
when running: `python torch/utils/collect_env.py`
### Output with this change
```
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 13.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz

Versions of relevant libraries:
[pip3] flake8==3.9.2
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.2
[conda] blas                      1.0                         mkl
[conda] mkl                       2021.4.0           hecd8cb5_637
[conda] mkl-service               2.4.0            py39h9ed2024_0
[conda] mkl_fft                   1.3.1            py39h4ab4a9b_0
[conda] mkl_random                1.2.2            py39hb2f4e1b_0
[conda] numpy                     1.21.5           py39h2e5f0a9_1
[conda] numpy-base                1.21.5           py39h3b1a694_1
[conda] numpydoc                  1.2                pyhd3eb1b0_0
```
### Output before
```
Collecting environment information...
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: N/A

OS: macOS 13.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.12 (main, Apr  5 2022, 01:53:17)  [Clang 12.0.0 ] (64-bit runtime)
Python platform: macOS-10.16-x86_64-i386-64bit
Is CUDA available: N/A
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: N/A

CPU:
Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz

Versions of relevant libraries:
[pip3] mypy==0.971
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.21.5
[pip3] numpydoc==1.2
[conda] blas                      1.0                         mkl
[conda] mkl                       2021.4.0           hecd8cb5_637
[conda] mkl-service               2.4.0            py39h9ed2024_0
[conda] mkl_fft                   1.3.1            py39h4ab4a9b_0
[conda] mkl_random                1.2.2            py39hb2f4e1b_0
[conda] numpy                     1.21.5           py39h2e5f0a9_1
[conda] numpy-base                1.21.5           py39h3b1a694_1
[conda] numpydoc                  1.2                pyhd3eb1b0_0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94573
Approved by: https://github.com/malfet, https://github.com/kit1980
2023-03-28 03:24:41 +00:00
f3aca45a16 [BE][autograd Function] Raise an error if input is returned as-is and saved for forward or backward in setup_context (#97212)
Fixes https://github.com/pytorch/pytorch/issues/96887

We error out in BOTH the case when graph is created and when it is not created.

Still bc-breaking, but not as severe because we are limiting to the case where someone uses setup_context.

This makes setup_context and non-setup_context versions diverge in their behavior
- With the non-setup_context version, saved variables are assumed to have the grad_fn of the inputs.
- But now with the setup_context version, we produce an error for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97212
Approved by: https://github.com/zou3519
2023-03-28 03:14:32 +00:00
c7fa648ea1 move caffe2/proto/ to its own Bazel package (#97600)
This is just to break up build files and make the system easier to
reason about during the transition to the common build system.

Differential Revision: [D44395826](https://our.internmc.facebook.com/intern/diff/D44395826/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97600
Approved by: https://github.com/PaliC
2023-03-28 03:14:26 +00:00
e1f44ee3b3 [inductor] correctly setup constant in the wrapper (#97571)
V.graph.constants entries like seed_cuda_0 are not handled properly in the wrapper. Recently we moved the code that initializes constants from the global scope into a function. That makes an assignment to seed_cuda_0 create a new local variable rather than set up the global variable.

Add 'global var_name' lines to maintain the same behavior as before, as the sketch below illustrates.
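
A minimal Python sketch of the scoping issue, using a hypothetical constant name:

```python
seed_cuda_0 = None  # module-level constant slot (name is illustrative)

def initialize_constants_buggy():
    # Without a `global` declaration this creates a *local* variable and
    # leaves the module-level `seed_cuda_0` untouched.
    seed_cuda_0 = 42

def initialize_constants_fixed():
    global seed_cuda_0  # conceptually, the extra line this PR adds
    seed_cuda_0 = 42

initialize_constants_buggy()
assert seed_cuda_0 is None
initialize_constants_fixed()
assert seed_cuda_0 == 42
```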

Test:

Run the forward graph for nvidia_deeprecommender's training run. It previously failed and now passes with the fix.

Thanks @ngimel for reporting the issue with a repro and @Chillee for pointing out the root cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97571
Approved by: https://github.com/ngimel
2023-03-28 03:10:53 +00:00
b756fd98bb Fix NumPy scalar arrays to tensor conversion (#97696)
By performing the cast from scalar to 0-dim array only if the object is not
already an array.
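
A small sketch of the guarded conversion described above; the helper is hypothetical and not the actual binding code:

```python
import numpy as np
import torch

def to_tensor(obj):
    # Only wrap the object into an array when it is not already an ndarray,
    # so an existing (possibly 0-dim) array keeps its dtype and shape.
    if not isinstance(obj, np.ndarray):
        obj = np.asarray(obj)
    return torch.from_numpy(obj)
```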

Fixes https://github.com/pytorch/pytorch/issues/97021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97696
Approved by: https://github.com/albanD
2023-03-28 03:00:18 +00:00
b2be14bcca Fix missing extra_traceback in InterpreterShim (#97615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97615
Approved by: https://github.com/Chillee, https://github.com/desertfire
2023-03-28 02:40:35 +00:00
08c1d1a871 [dtensor] set cuda device automatically, and refactor error handling (#97583)
This PR detects whether device_type is cuda; if cuda is passed in,
we set the current cuda device for each process/thread automatically
(this assumes homogeneous devices).

Also refactored the error handling code.
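
A rough sketch of the behavior described above, assuming one GPU per rank on homogeneous hosts and an initialized process group; this is not the DTensor internals:

```python
import torch
import torch.distributed as dist

def maybe_set_cuda_device(device_type: str) -> None:
    # Pick the local CUDA device from the global rank (illustrative mapping).
    if device_type == "cuda" and torch.cuda.is_available():
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```
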
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97583
Approved by: https://github.com/wz337, https://github.com/XilunWu
2023-03-28 02:25:45 +00:00
e9c4904915 [dtensor] remove custom dispatch op (#95629)
Since we removed all custom dispatch ops, we can safely delete this
table as we won't use it for other purposes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95629
Approved by: https://github.com/XilunWu
2023-03-28 02:25:45 +00:00
342ed0372f [DCP] Expose create_read_items_for_chunk_list helper. (#97570)
This function is needed by all ReadPlanner subclasses that are trying to implement support for a custom distributed tensor.

Better to expose it than have users reimplement it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97570
Approved by: https://github.com/wz337
2023-03-28 02:25:04 +00:00
1c15cd48e2 [FSDP][7/N] Add alignment padding for use_orig_params=True (#97667)
This PR adds intra-`FlatParameter` 16-byte alignment padding to the `use_orig_params=True` code path to avoid clones in TorchInductor.

**Approach**
The `FlatParameter` maintains several data structures about its original parameters. Notably, the data structures `_param_infos`, `_shapes`, `_numels`, and `_fqns` have the same length and index in the same way.

This PR treats alignment padding _like_ an original parameter in that the padding gets flattened into the `FlatParameter`. Therefore, it must be reflected in the aforementioned data structures. However, given the way in which the data structures are used, we choose to do the following if the `i`th tensor flattened into the `FlatParameter` is padding:
- `_numels[i]` is the numel of padding
- `_param_infos[i] == _shapes[i] == _fqns[i] == None`

This choice is because (1) we must record the padding numel to account for it (e.g. for views) and (2) we prefer to preserve the invariant that the data structures index in the same way over avoiding `None` entries.
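
For intuition, a small sketch of the alignment arithmetic only; the helper below is hypothetical and not FSDP's actual code:

```python
import torch

ALIGNMENT_BYTES = 16

def padding_numel(offset_numel: int, dtype: torch.dtype) -> int:
    # Number of padding elements needed so the next flattened tensor starts at
    # a 16-byte-aligned offset; assumes the element size divides 16 evenly
    # (true for fp32/fp16/bf16).
    elem_size = torch.empty((), dtype=dtype).element_size()
    remainder = (offset_numel * elem_size) % ALIGNMENT_BYTES
    return 0 if remainder == 0 else (ALIGNMENT_BYTES - remainder) // elem_size
```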

To ease the burden of other FSDP developers, we separate the parameter flattening logic:
- `_init_flat_param_and_metadata()`: This should be called only once in the `FlatParamHandle` constructor. The `FlatParameter` metadata is assumed to be static thereafter.
- `flatten_tensors()` / `flatten_tensors_into_flat_param()`: These can be used for optimizer and model state dict and can be called after construction time.

This separation allows `_init_flat_param_and_metadata()` to contain the much heavier metadata logic while keeping the latter methods much lighter. The only constraint is that the alignment padding logic must be kept consistent between the two, but this should be worth the simpler interface.

**Testing**
- This PR directly modifies the `use_orig_params=True` code path, so all existing tests passing gives good signal.
    - Some existing unit tests had to be adjusted to account for the alignment padding.
- This PR adds two tests in `test_fsdp_flatten_params.py` to explicitly test the sharding metadata with alignment for both parameter full precision and mixed precision since the latter requires possibly more padding elements due to the decreased per-element size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97667
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
b9049a7f11 [FSDP][6/N] Rename param/module name helpers for clarity (#97666)
This is an easy PR. It has some remaining local changes that I had that I felt clarified naming.
- `_param_fqns` -> `_param_name_infos` since it returns a tuple of `fqn, param_name, module_name`, not only `fqn`. (similarly for `_shared_param_fqns` -> `_shared_param_name_infos`)
- nit: `parameter_module_names` -> `param_module_names` for consistency since we almost never fully spell out `parameter`. (similarly for `shared_parameter_module_names` -> `shared_param_module_names`)
- nit: `full_fqn` -> `fqn_from_global_root`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97666
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
30a6ed34a0 [FSDP][5/N] Lift FSDPParamInfo to use FlatParamHandle (#97665)
This PR changes `FSDPParamInfo` in `_optim_utils.py` to save the `FlatParamHandle`, not directly the `FlatParameter`. This is in preparation for subsequent PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97665
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
5d554ca26f [FSDP][4/N] Document use_orig_params: bool (#97664)
This adds long-awaited documentation for `use_orig_params: bool`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97664
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
c622559968 [FSDP][3/N] Minor fixes (rename, assert message) (#97663)
This is an easy PR.
- It renames `_shard_indices` to `_shard_param_indices` for consistency.
- It fixes an old mention of `comm_module` in an assert message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97663
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
a27882ecd1 [FSDP][2/N] Rename "flattened parameter" -> "flat parameter" (pt. 2) (#97662)
From our recent experience, we refer to FSDP's `FlatParameter` as "flat parameter", not "flattened parameter". This PR renames that in `_optim_utils.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97662
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
bd979737cd [FSDP][1/N] Rename "flattened parameter" to "flat parameter" (#97661)
From our recent experience, we refer to FSDP's `FlatParameter` as "flat parameter", not "flattened parameter". This PR renames that in `flat_param.py`.

**This PR only changes documentation.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97661
Approved by: https://github.com/rohan-varma
2023-03-28 01:46:43 +00:00
2bca64ae28 [Vulkan] Merge upsample_nearest2d and quantized_upsample_nearest2d (#97467)
Summary: Merging quantized_upsample_nearest2d into upsample_nearest2d. Therefore, at::upsample_nearest2d can handle quantized vulkan input tensors.

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Reviewed By: SS-JIA

Differential Revision: D44118212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97467
Approved by: https://github.com/SS-JIA
2023-03-28 01:13:18 +00:00
a9a81ab7e3 [CI] Run benchmark test with dynamo_eager in periodic (#97543)
Summary: The idea is to catch any dynamo_eager regression earlier, and it also
lets us take that off the dashboard run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97543
Approved by: https://github.com/huydhn
2023-03-28 01:02:49 +00:00
82592f7e53 remove dead torch_pb.h library (#97599)
This is only used in one place; ensure it still builds.

Differential Revision: [D44395699](https://our.internmc.facebook.com/intern/diff/D44395699/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44395699/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97599
Approved by: https://github.com/PaliC
2023-03-28 00:55:17 +00:00
a283c15e34 Added ModuleInfos for {*}LU modules (#97375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97375
Approved by: https://github.com/albanD, https://github.com/jbschlosser
2023-03-28 00:36:31 +00:00
1c3ec7c4c5 [eazy][inductor] fix typo in mm max-autotune log (#97486)
A max-autotune log like
```
AUTOTUNE bias_addmm(512x197951, 512x512, 512x197951)
  triton_mm_61 1.2882s 100.0%
  triton_mm_62 1.3036s 98.8%
  bias_addmm 1.4889s 86.5%
  triton_mm_60 1.6159s 79.7%
  triton_mm_63 1.7060s 75.5%
  triton_mm_64 1.7777s 72.5%
  triton_mm_67 1.9722s 65.3%
  addmm 2.0603s 62.5%
  triton_mm_70 2.0675s 62.3%
  triton_mm_68 2.3552s 54.7%
SingleProcess AUTOTUNE takes 2.949904441833496 seconds
```
is confusing, since the sum of the runtimes of all the kernels is larger than the total time reported for tuning. In fact, `triton.testing.do_bench` returns times in milliseconds rather than seconds. Fix the typo in the log message to make that clear.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97486
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-28 00:02:27 +00:00
32fdd44577 SymIntify maybe_multiply (#97675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97675
Approved by: https://github.com/albanD
2023-03-27 23:20:23 +00:00
35c9ea89fa dont bake in defaults when tracing *_like factories (#97564)
Quick fix for https://github.com/pytorch/pytorch/issues/97541. Letting CI run to see if there's any fallout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97564
Approved by: https://github.com/ezyang
2023-03-27 22:53:44 +00:00
2ca911f2ac make_fx, make pre_autograd a kwarg (#97559)
Some of the other inputs to make_fx() should probably be kwargs too - I didn't want to risk dealing with internal failures in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97559
Approved by: https://github.com/wconstab, https://github.com/SherlockNoMad, https://github.com/ezyang
2023-03-27 22:53:44 +00:00
6c450c7880 Allow -ic when no pending jobs (#97707)
fixes https://github.com/pytorch/test-infra/issues/3933

https://github.com/pytorch/test-infra/pull/3937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97707
Approved by: https://github.com/kit1980
2023-03-27 22:14:44 +00:00
652592efa9 [inductor] use torch.prifiler in the triton wrapper (#97405)
I think it's helpful to use torch.profiler to profile the triton wrapper.

E.g., I tried it on nvidia_deeprecommender's inference graph.

Even with max-autotune, we see that for the majority of the time the GPU is running 2 mm/addmm ops. That's why max-autotune does not help for this model: tuning does not affect the external mm ops.

<img width="711" alt="Screenshot 2023-03-22 at 5 49 28 PM" src="https://user-images.githubusercontent.com/52589240/227072474-2f0d7205-4a10-4929-b1b7-551214788c61.png">

Next, I'll check why the triton mm kernels are not picked.

EDIT: the above screenshot was captured without max-autotune due to a typo. Below is the trace with max-autotune enabled:
<img width="712" alt="Screenshot 2023-03-22 at 6 43 26 PM" src="https://user-images.githubusercontent.com/52589240/227077624-fdccf928-be08-4211-871b-a9e3d7b76fbe.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97405
Approved by: https://github.com/ngimel
2023-03-27 21:54:25 +00:00
6c43e9fdbd Run _calculate-docker-image on 2xlarge with a larger disk space (#97551)
Not quite sure why I chose `linux.large` here; probably a copy-paste mistake. `linux.large` is a small runner with too little disk space (15GB), so there is a chance that it could run out of space, as in https://github.com/pytorch/pytorch/actions/runs/4513709983/jobs/7948892825
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97551
Approved by: https://github.com/clee2000
2023-03-27 21:36:40 +00:00
7d94493392 [easy] Update xla hash pin merge rule (#97700)
Fixes #ISSUE_NUMBER
We should really make this some sort of regex.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97700
Approved by: https://github.com/huydhn
2023-03-27 21:32:52 +00:00
5d33596c5f remove dead proto_convert library (#97598)
This has no code, only a collection of headers. Just make sure the
only thing that includes it still builds.

Differential Revision: [D44395700](https://our.internmc.facebook.com/intern/diff/D44395700/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44395700/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97598
Approved by: https://github.com/PaliC
2023-03-27 21:19:29 +00:00
35fd5c548e Fix typos under torch/distributed directory (#95638)
This PR fixes typos in comments and messages of `.py` files under torch/distributed directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95638
Approved by: https://github.com/usamah1, https://github.com/H-Huang, https://github.com/kit1980
2023-03-27 21:13:44 +00:00
8313becefa With Chillee's permission, add me to all Chillee's diffs (#97632)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97632
Approved by: https://github.com/kit1980
2023-03-27 21:10:49 +00:00
6430dad700 Apparently aot_function doesn't cache anymore (#97610)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97610
Approved by: https://github.com/albanD
2023-03-27 21:07:20 +00:00
b24052b1d9 Make test_binary_shape_functions actually test the ops (#90566)
Because of the break, only operator.__mul__ was actually tested.

### <samp>🤖 Generated by Copilot at 0e6aaa1</samp>

> _`break` statement gone_
> _loop tests all shape functions_
> _symbolic winter_
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90566
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-27 21:03:59 +00:00
b66a121c5e [Vulkan] Fix broadcasting in quantized elementwise ops (#97554)
Summary: Fixes broadcasting along the channel and batch dimensions in quantized add, sub, mul and div

Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

Reviewed By: SS-JIA

Differential Revision: D44359706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97554
Approved by: https://github.com/SS-JIA
2023-03-27 20:17:18 +00:00
5da86bbb68 Add decomposition for aten.squeeze.dims op (#97020)
Signed-Off By: Vivek Khandelwal <vivek@nod-labs.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97020
Approved by: https://github.com/jansel
2023-03-27 20:13:19 +00:00
91cce4c09a Sort: Use cub::WarpMergeSort for small sorts (32 < n <= 128) (#96223)
We currently use `bitonicSortKVInplace` for sorts of size `n <= 32`
but use `radixSortKVInplace` for `32 < n <= 4096`. Bitonic sort is
also unstable, which forces stable sorts to fall back to a path that is
up to 4x slower in this small regime.

This PR adds a new kernel `warpMergeSortKVInplace` using
`cub::WarpMergeSort` to implement sorts with `32 < n <= 128` and all
stable sorts with `n < 128`. This results in up to a 2x speedup for
unstable sorts and up to 15x for stable sorts, depending on the input
geometry.

This also doesn't increase the total number of kernels since we are
replacing radix-sorts of size 32 and 128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96223
Approved by: https://github.com/ngimel
2023-03-27 19:48:45 +00:00
236bac811a Add ModuleInfos for Adaptive{Max/Avg}Pool ops (#97291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97291
Approved by: https://github.com/albanD
2023-03-27 19:45:37 +00:00
8177081848 Add gather to MTPG (#97555)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97555
Approved by: https://github.com/H-Huang
2023-03-27 19:37:02 +00:00
759e527ea1 Use internal symbolizer for FBCODE (#97172)
Summary:
addr2line is not fast on fbcode binaries, so use the
internal symbolizer pathway instead.

Test Plan: sandcastle

Differential Revision: D44227690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97172
Approved by: https://github.com/eellison
2023-03-27 19:24:12 +00:00
008be795ce run buildifier on the root BUILD.bazel file (#97611)
Just a no-op cleanup.

Differential Revision: [D44400008](https://our.internmc.facebook.com/intern/diff/D44400008/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97611
Approved by: https://github.com/PaliC
2023-03-27 19:03:07 +00:00
bbc7c79b20 add device checks for sparse csr (#97520)
Fixes #95373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97520
Approved by: https://github.com/cpuhrsch
2023-03-27 18:57:27 +00:00
96e3b3ac72 [BE] Cleanup CMake flag suppressions (#97584)
Use `append_cxx_flag_if_supported` to determine whether or not `-Werror` is supported
Do not suppress deprecation warnings if glog is not used/installed; the way the check is written right now, it will suppress deprecations even if `glog` is not installed.
Similarly, do not suppress deprecations on MacOS simply because we are compiling with protobuf.
Fix deprecation warnings in:
 - MPS by replacing `MTLResourceOptionCPUCacheModeDefault`->`MTLResourceCPUCacheModeDefaultCache`
 - In GTests by replacing `TYPED_TEST_CASE`->`TYPED_TEST_SUITE`
 - In `codegen/onednn/interface.cpp`, by passing `Stack` by reference rather than by pointer.

Do not guard calls to `append_cxx_flag_if_supported` with `if(CLANG)` or `if(GCC)`.
Fix some deprecated calls in `Metal` and hide more complex exceptions under `C10_CLANG_DIAGNOSTIC_IGNORE`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97584
Approved by: https://github.com/kit1980
2023-03-27 18:46:09 +00:00
345714e372 Upload merge records to Rockset (#97471)
This uploads a record to a new Rockset `merges` collection in the `commons` workspace in the following format:

```
{
    "id": comment_id,
    "pr_num": pr_num,
    "owner": owner,
    "project": project,
    "pending_checks": pending_checks,  # At the time of the merge
    "failed_checks": failed_checks,  # At the time of the merge
    "is_failed": is_failed,  # This is set to True if the merge fails to get through for whatever reason
    "dry_run": dry_run,
    "skip_mandatory_checks": skip_mandatory_checks,
    "ignore_current": ignore_current,
    "error": error,  # The same Exception message that will be shown on PR
}
```

To achieve this, I need to tweak `find_matching_merge_rule` a bit to return the list of pending and failed checks in addition to the matching merge rule.  As this function is also used internally, I have confirmed that the internal call doesn't need the return values.  Thus, the change is safe to land.

### Testing

* Unit testing
* Dry-run locally `python3 .github/scripts/trymerge.py --comment-id 1478678477 --dry-run 97293` using an older PR.  The merge obviously failed, but the record was created successfully on Rockset
```
{
  "_id": "52d3152b-ec35-4b5a-91fc-0e7298fc54b5-1",
  "_event_time": "2023-03-23T21:10:32.754368Z",
  "_meta": null,
  "owner": "pytorch",
  "is_failed": true,
  "id": 1478678477,
  "failed_checks": [],
  "dry_run": true,
  "error": "Command `git -C pytorch cherry-pick -x cc0d2e0fba648bb5deda34a9056f2c4192b22314` returned non-zero exit code 1...",
  "ignore_current": false,
  "project": "pytorch",
  "pr_num": 97293,
  "skip_mandatory_checks": false,
  "pending_checks": []
}
```

* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run --force 97471` with `--force`
```
{
  "_id": "dd7d2580-f6e5-47e7-9441-17df86056c14-1",
  "_event_time": "2023-03-23T21:43:53.915911Z",
  "_meta": null,
  "owner": "pytorch",
  "is_failed": true,
  "id": 1481949104,
  "failed_checks": [],
  "dry_run": true,
  "error": "PR #97471 has not been reviewed yet",
  "ignore_current": false,
  "project": "pytorch",
  "pr_num": 97471,
  "skip_mandatory_checks": true,
  "pending_checks": []
}
```

* Dry-run locally with this PR `python3 .github/scripts/trymerge.py --comment-id 1481949104 --dry-run 97471` again with approval rule commented out

```
{
  "_id": "5d7de4e3-1af1-4869-a3b7-d1a9dbced6ce-1",
  "_event_time": "2023-03-24T00:10:41.914111Z",
  "_meta": null,
  "is_failed": false,
  "id": 1481949104,
  "failed_checks": [],
  "error": "",
  "last_commit_sha": "4657400513f0360a0a4f73d46e1aff0882221687",
  "merge_commit_sha": "416bac5b813a181753afade781ae30f4f0843586",
  "ignore_current": false,
  "pending_checks": [
    [
      "pull / linux-focal-py3.8-gcc7 / test (default, 1, 3, linux.2xlarge)",
      "https://github.com/pytorch/pytorch/actions/runs/4506464828/jobs/7933518379",
      12239935788
    ],
    ...
    [
      "trunk / linux-bionic-cuda11.8-py3.10-gcc7 / test (default, 5, 5, linux.4xlarge.nvidia.gpu)",
      "https://github.com/pytorch/pytorch/actions/runs/4506465633/jobs/7933621958",
      12240067113
    ],
    ...
  ],
  "owner": "pytorch",
  "skip_mandatory_checks": true,
  "author": "Huy Do <huydhn@gmail.com>",
  "project": "pytorch",
  "merge_base_sha": "a3b30c5025e3381022fa00b127b0d881f4ef66d4",
  "pr_num": 97471
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97471
Approved by: https://github.com/clee2000
2023-03-27 18:42:00 +00:00
2ea097071a fix device type bug for custom device (#97213)
Fixes #ISSUE_NUMBER
Support the custom renamed device. @bdhirsh, please review my changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97213
Approved by: https://github.com/bdhirsh, https://github.com/kit1980
2023-03-27 18:36:47 +00:00
fcc312e945 [BE] Update flake8-comprehensions to 3.11.1 (#97671)
Updates flake8-comprehensions in lintrunner so we can enforce new checks that have been implemented since the last update (including one implemented by me). I also added C417 to the flake8 ignore codes for now since we do not yet conform to that check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97671
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-03-27 18:07:37 +00:00
4d2611375b Fix typo in throughput_benchmark. (#97619)
Fixes #ISSUE_NUMBER
Fix typo in throughput_benchmark.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97619
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-27 18:07:30 +00:00
97711ac6db [CI] Reduce perf nightly run frequency and bump up its timeout limit (#97682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97682
Approved by: https://github.com/weiwangmeta
2023-03-27 17:22:02 +00:00
6db196b744 Specify the head branch when upload perf stats to Rockset (#97643)
Before this, my assumption was that the workflow was only run on the main branch. This is not correct anymore as it could also now be run as part of the PR, i.e. https://hud.pytorch.org/pr/91316.  So this change does two things:

* Always upload inductor-A100-perf-nightly artifacts to S3 once completed by removing the main branch gating.
* Add `head_branch` to Rockset records, so that the [dashboard](https://torchci-git-fork-huydhn-add-compilers-bench-74abf8-fbopensource.vercel.app/benchmark/compilers) knows if the records come from the daily schedule on the main branch or from experimental PR.  The `head_branch` would be set to `master` in the former.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97643
Approved by: https://github.com/desertfire
2023-03-27 17:17:52 +00:00
9d37cefcb0 Resubmit _int_mm (#96685)
Avoids any changes to gemm_and_bias

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96685
Approved by: https://github.com/drisspg, https://github.com/ngimel
2023-03-27 16:14:07 +00:00
5f88d86142 Remove hacky python dispatcher fallthrough (#96635)
Ed's previous PRs in the stack at https://github.com/pytorch/pytorch/pull/96306 fix #89037; this PR just removes the original hacky fallthrough.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96635
Approved by: https://github.com/zhxchen17
2023-03-27 16:09:45 +00:00
a6bc1f3a9f Dynamo size dim kwargs (#97450)
Fix https://github.com/pytorch/pytorch/pull/97098#discussion_r1145157874

@ngimel @voznesenskym

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97450
Approved by: https://github.com/ngimel
2023-03-27 15:36:46 +00:00
8275e5d2a8 [cpp_extension.py] fix bogus _check_cuda_version (#97602)
Currently, if `setuptools<49.4.0` and there is a minor version mismatch, `_check_cuda_version` fails with a misleading, non-actionable error:
```
2023-03-24T20:21:35.0625644Z   RuntimeError:
2023-03-24T20:21:35.0628441Z   The detected CUDA version (11.2) mismatches the version that was used to compile
2023-03-24T20:21:35.0630681Z   PyTorch (11.3). Please make sure to use the same CUDA versions.
```
This check shouldn't fail, since a minor version match isn't required.

It fails because the other requirement, having a recent enough version of `setuptools`, isn't met. But that requirement is only stated in a comment (!!!). So this PR changes the error to actually tell the user how to fix the problem.

While at it, I adjusted the version number, as the lower bound `setuptools>=49.4.0` is sufficient for this to work.
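
A hedged sketch of the intended logic; the function name and structure are assumptions, not torch's actual `_check_cuda_version` code:

```python
import setuptools
from packaging.version import Version  # assumption: `packaging` is available

def check_cuda_version(torch_cuda: str, detected_cuda: str) -> None:
    # A major mismatch is always an error; a minor mismatch is only an error
    # when setuptools is too old to handle it.
    torch_v, detected_v = Version(torch_cuda), Version(detected_cuda)
    if torch_v.major != detected_v.major:
        raise RuntimeError(
            f"CUDA major version mismatch: detected {detected_v}, "
            f"PyTorch built with {torch_v}."
        )
    if torch_v.minor != detected_v.minor and Version(setuptools.__version__) < Version("49.4.0"):
        raise RuntimeError(
            "Detected a CUDA minor version mismatch; please upgrade to "
            "setuptools>=49.4.0 to build against a different minor CUDA version."
        )
```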

Thanks.

p.s. this problem manifests on `nvidia/cuda:11.2.2-cudnn8-devel-ubuntu20.04` docker image.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97602
Approved by: https://github.com/ezyang
2023-03-27 15:15:57 +00:00
a1ada050f8 do not insert to_dtype for memory copy only buffers (#97147)
Remove redundant to_dtype conversions, e.g.
`load_bf16 + to_fp32 + to_bf16 + store_bf16` => `load_bf16 + store_bf16`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97147
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-27 14:55:41 +00:00
e1f153f3b1 Add support for copysign operator in functorch (#96018)
Fixes #91176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96018
Approved by: https://github.com/zou3519
2023-03-27 14:20:57 +00:00
d0abc31428 Remove unnecessary retain_grad call from gradcheck (#96923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96923
Approved by: https://github.com/albanD
2023-03-27 13:38:28 +00:00
51c3fd39a5 Modify all calls to checkpoint pass use_reentrant explicitly (#97376)
Fixes #ISSUE_NUMBER

This is the first step toward making use_reentrant=False the default.
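
For reference, a call updated in the style this PR mandates looks roughly like this (the wrapped function is illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x) * 2

x = torch.randn(8, 8, requires_grad=True)
# Pass use_reentrant explicitly instead of relying on the current default.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```
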
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97376
Approved by: https://github.com/albanD
2023-03-27 13:37:42 +00:00
38da54e9c9 Split rnn primitive for inference and training (#96736)
## Description
Currently, both inference and training use `forward_training` in the rnn primitive, which degrades inference performance (the drop comes from the rnn primitive itself and the unnecessary creation of `pd` and `workspace`). This PR splits them into `forward_inference` and `forward_training`.

## Performance
With this fix, RNN-T inference time is reduced by 167 ms, which is about `3.7%` of the E2E time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96736
Approved by: https://github.com/jgong5
2023-03-27 11:14:15 +00:00
e3df6a7c8a [Dynamo] Unspec int list if enabling dynamic_shapes (#97557)
Fixes #97348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97557
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-27 06:12:43 +00:00
542fb0b1fa Specify file encoding in test_torch.py (#97628)
Attempt to fix
```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 5260: ordinal not in range(128)
```
in https://github.com/pytorch/pytorch/actions/runs/4522628359/jobs/7965372405

In general, it's good practice to explicitly specify the encoding, as it otherwise depends on environment variables and makes test failures unpredictable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97628
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-03-26 20:03:25 +00:00
b73e8cd4fa [BE] Use nested namespaces in sparse (#97581)
### <samp>🤖 Generated by Copilot at 59a5205</samp>

This pull request refactors the namespace declarations in several files under `aten/src/ATen/native/sparse` to use a more concise and consistent syntax. This improves the readability and reusability of the sparse tensor operations code.

Also, do not rely on deprecated `TensorBase::data` and instead use `TensorBase::data_ptr`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97581
Approved by: https://github.com/kit1980, https://github.com/huydhn
2023-03-26 18:20:27 +00:00
461f088c96 add -std=c++17 to windows cuda compilations (#97515)
add -std=c++17 to windows cuda compilations

Summary:
We're using C++17 in headers that are compiled by C++
extensions. Support for this was not added when we upgraded to C++17.

Test Plan: Rely on CI.

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97515).
* #97175
* __->__ #97515
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97515
Approved by: https://github.com/ezyang
2023-03-26 15:23:52 +00:00
4c0dce50fd [BE] Apply ufmt to run_test and GitHub Python util scripts (#97588)
This has been bugging me for a while as I work on these Python scripts and they are not tracked by the ufmt linter. So I added these scripts to that linter.

```
[[linter]]
code = 'UFMT'
include_patterns = [
    '.github/**/*.py',
    'test/run_test.py',
```

This change should just work and not break anything, as the ufmt (black + usort) linter is very safe to use for standalone util scripts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97588
Approved by: https://github.com/kit1980
2023-03-26 04:52:55 +00:00
f09347a9f1 [inductor] Fix broadcast of random seed in mm epilogue (#97591)
Fixes #96468, #97553

In the matmul codegen epilogue, we use the `mask` shape to infer the broadcasted shape in case we need to broadcast a scalar value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97591
Approved by: https://github.com/jansel
2023-03-26 03:35:03 +00:00
4f2ac8abac Fixes double printing of code in debug mode (#97608)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97608
Approved by: https://github.com/mlazos
2023-03-26 02:39:38 +00:00
dc45ad7024 [inductor] support SymPy exprs in reflection_pad2d_backward lowering (#97604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97604
Approved by: https://github.com/ezyang
2023-03-26 00:38:50 +00:00
9585a7ffd3 [inductor] support non-tensor ops with dynamic shapes (#97519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97519
Approved by: https://github.com/jansel
2023-03-26 00:38:50 +00:00
13dcf635e0 Dynamo stride dim kwargs (#97444)
Fixes #97441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97444
Approved by: https://github.com/ezyang
2023-03-25 23:43:05 +00:00
233742cb2f Add accuracy tests for traced optimizers (#97577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97577
Approved by: https://github.com/yifuwang
2023-03-25 15:45:11 +00:00
1b08a01361 Default to aot_eager for torch.compile on MPS (#96980)
Fixes https://github.com/pytorch/pytorch/issues/96976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96980
Approved by: https://github.com/kulinseth, https://github.com/albanD, https://github.com/ZainRizvi
2023-03-25 14:21:39 +00:00
75fb0b6c9f Enable full train_step tracing and customizable dist graph expansion (#97416)
This commit adds an entry point for full `train_step` tracing and
expansion. Model forward, backward, and optimizer step will be included
in one graph. DTensor expansion will be applied on top to insert
collective communications. Users can also provide an `Override`
implementation to skip non-traceable submodules and directly install
submodule logic into the DTensor-expanded graph by inserting `fx.Nodes`.

Differential Revision: [D44325177](https://our.internmc.facebook.com/intern/diff/D44325177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97416
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
2023-03-25 09:24:21 +00:00
e67b58105a Enable lowering to inductor (#96927)
**Summary**
Enable the lowering path from a quantized 2.0 fx graph into Inductor. The basic usage will be
```
export_module, guards = torchdynamo.export(m, *args)
prepare_module = prepare_pt2e(export_module, *args)
convert_module = convert_pt2e(prepare_module)
optimized_module = compile_fx(convert_module, example_inputs)
```
Most of the issues we met previously have already been fixed in PyTorch master. So in this PR, we mainly do 2 things:
1. Add the basic usage into a UT.
2. Move `handle_dynamo_export_graph` before the fusion passes; otherwise the dynamo_export_graph will hit the fusion passes twice, which is unexpected.

**Test Plan**
```
clear && python -m pytest test_quantization.py -k test_inductor_backend_config_conv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96927
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/jerryzh168
2023-03-25 08:29:40 +00:00
3b1b585a59 [FSDP] Fix bug in determining whether parameters need to be materialized (#97488)
Previously, `_need_to_materialize_module` would return false because:

* `managed_params = _get_orig_params(module, ignored_params)` returns a generator
* `is_meta_module = any(param.is_meta for param in managed_params)` exhausts the generator in its check
* `any(fake.is_fake(param) for param in managed_params)` would then iterate over the already-exhausted generator, see an empty sequence, and thus return `False` (a minimal illustration of this pitfall follows below)
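
A minimal illustration of the generator pitfall and the straightforward fix of materializing it once; names are illustrative, not the FSDP code:

```python
def classify_params(params):
    # Fix sketch: materialize the generator once so both any() calls see all items.
    params = list(params)
    has_meta = any(getattr(p, "is_meta", False) for p in params)
    has_fake = any(getattr(p, "is_fake", False) for p in params)
    return has_meta, has_fake

# With a raw generator and *without* list(), the second any() would iterate an
# already-exhausted generator and silently return False.
```
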
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97488
Approved by: https://github.com/ngimel, https://github.com/awgu
2023-03-25 08:24:57 +00:00
14177f0d3d [BE] Make USE_FLASH_ATTENTION private (#97579)
### <samp>🤖 Generated by Copilot at b07152e</samp>

This pull request refactors the CMake configuration to enable the `USE_FLASH_ATTENTION` feature for the `torch_cuda` target only, using a target-specific macro. This avoids conflicts with other libraries that also use this feature, such as fairseq.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97579
Approved by: https://github.com/kit1980
2023-03-25 05:41:07 +00:00
5e014bfbbd [vmap] ldl_factor: batch rule (#97518)
Ref https://github.com/pytorch/pytorch/issues/96855

Will look into `ldl_solve` separately.
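
For reference, a small usage sketch of the newly batched op under vmap (shapes and values are illustrative):

```python
import torch

# Batch of symmetric positive-definite matrices.
A = torch.randn(4, 3, 3)
A = A @ A.mT + 3 * torch.eye(3)

LD, pivots = torch.vmap(torch.linalg.ldl_factor)(A)
print(LD.shape, pivots.shape)  # torch.Size([4, 3, 3]) torch.Size([4, 3])
```
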
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97518
Approved by: https://github.com/zou3519
2023-03-25 04:37:32 +00:00
f89af60183 Rewrite NCCL watchdog to more reliably throw timeout (#97066)
Fixes #97191

This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid silently stuck jobs.

### Previous output in #97191
```
Rank 0 is the problematic rank
Rank 4 completed
Rank 5 completed
Rank 3 completed
Rank 6 completed
Rank 2 completed
Rank 7 completed
Rank 1 completed
[E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out.
Rank 0 completed
[E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down.
```
Although it says that it is taking the process down, it sometimes fails to do so.

### New output after this PR:
```
...
[E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python
Traceback (most recent call last):
  File "/pytorch-dev-env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/pytorch-dev/torch/distributed/run.py", line 794, in main
    run(args)
  File "/pytorch-dev/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
hang.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-20_22:00:42
  host      : node0
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 194470)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 194470
============================================================
```

The log suggests that TorchX monitor is triggered, and job is torn down.

### Major changes in this PR:
1. Merge the ncclWatchDog thread and the workCleanupLoop thread into one so that the watch action and the throw action are streamlined.
Previously, ncclWatchDog was responsible for watching comm errors and timeouts, and workCleanupLoop was responsible for watching Work item errors and throwing exceptions. This two-thread design was not streamlined, raising the chance of missing the throw. It was also duplicative to watch at multiple levels.
2. Rethrow exception at watchdog thread.
3. Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`.
4. Turn on ASYNC_ERROR_HANDLING by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066
Approved by: https://github.com/rohan-varma
2023-03-25 04:30:20 +00:00
ee934fd633 Use unordered NEQ comparison for vec512 operator!= implementations (#97466)
This is consistent with the vec256 operator!= implementations. _CMP_NEQ_UQ is the logical opposite of the _CMP_EQ_OQ comparison used in the operator== implementations.

Using the ordered NEQ operation results in nan != nan being false, which is incorrect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97466
Approved by: https://github.com/jgong5, https://github.com/sanchitintel
2023-03-25 03:29:45 +00:00
c757647dd8 [Better Transformer] make is_causal a hint and force attn_mask to be set on is_causal=True in F.MHA (#97214)
Summary:
This fixes an issue raised in [is_causal parameter in torch.nn.TransformerEncoderLayer.forward does not work #96941](https://github.com/pytorch/pytorch/issues/96941) where results computed with is_causal do not properly reflect causal masking.

In PyTorch 2.0, Accelerated PT Transformers added the is_causal parameter to legacy nn.Transformer* and nn.MHA APIs aligned with and intended to engage the is_causal parameter of the new scaled_dot_product_attention (SDPA) operator.

At present is_causal works differently for Transformer* modules, the nn.MHA and F.MHA:
* The nn.Transformer* modules treat is_causal as an optional indicator about the format of attn_mask. This is because some layers (such as the CLIP layer) use the attention mask in the layer, and thus attn_mask was a required feature.
* Initially, nn.MHA and F.MHA were defined to align with F.SDPA in behavior: a user may specify either the attention mask, or is_causal, but not both.  It seemed to make sense at the time to align SDPA and MHA, esp since there was a larger overlap of parameters which have since changed, e.g., with the removal of need_weights from SDPA. (See below for why this makes sense.)

Unfortunately, this does not work because of how MHA was changed to handle the need_weights parameter.  When need_weights is present, we do not (any more) call SDPA because support for need_weights was removed from SDPA before the release.  The rationale is that need_weights defeats all optimization at the foundation of SDPA performance.  Having the flag might thus mislead users into thinking they get good performance and have them disappointed when they enable a legacy feature of MHA which massively degrades performance.  (They might not think anything of enabling that, because it is on by default in MHA today, which leads to more  issues.)

Since SDPA does not (no longer) support need_weights, we need to pick a separate path which implements attention using a set of discrete operations that allocates a tensor for weights.  Alas, this code path does not have support for is_causal, because attention is implemented as matmul and using the attention mask.  Thus, is_causal has no impact.  (A substantially similar situation arises with how kpm is implemented today because Nested Tensors are not supported by torch.compile() in 2.0)

This problem was masked because all uses of legacy nn.MHA (and F.MHA) come through nn.Transformer*, which called self-attention (i.e., nn.MHA) only ever with the attention mask attn_mask, and never with is_causal, a missed optimization opportunity that would have been addressed in a future performance update.

Regrettably, always calling nn.MHA with attn_mask prevented diagnosing the issue of not having a suitable attention mask when need_weights support was dropped from SDPA and a discrete implementation of attention was added for that scenario, and for the execution path with key_padding_mask.

We have two options to address this issue:

Solution 1: Whenever nn.MHA and F.MHA are executed with is_causal set, we internally create a causal mask at significant expense of allocating a tensor and filling it with a triangular causal matrix.  This increases memory usage, and runtime, for allocating a causal mask.  To add insult to injury, in all current (and likely future) execution scenarios, MHA is called by a model using the nn.Transformer API which already has that matrix and passes it from nn.module to nn.module.  Then the passing in of attn_mask has to be suppressed by nn.TransformerEncoderLayer, only for nn.MHA to immediately allocate the very same tensor again to satisfy the requirement to have an attention mask for the computation. (We expect new use cases to use SDPA directly.)

Solution 2: We align the behavior of nn.MHA and F.MHA with the rest of the existing nn.Transformer API, and require the attention mask to be passed into nn.MHA in addition to is_causal as an optional indicator about the nature of the attention mask rather than as an alternative to attn_mask.  Then, when we choose the code path for processing MHA with need_weights or a key_padding_mask, we have the attn_mask passed down through the nn.Transformer* hierarchy, without the added overhead of allocating an attention mask as in scenario 1.

This PR implements solution 2 which offers better performance and in retrospect aligns MHA better with the rest of the Transformer modules as the definition of SDPA evolved into a more streamlined high-performance operator.  It ostensibly changes how is_causal works, by requiring the attention mask to be specified.  However, as described here, and as shown in the submitted issue, is_causal is not working as intended today, so it requires a change regardless.

In that sense, a change in API does not occur per se, as the current implementation is not working, and a change has to occur either way to resolve the submitted issue, breaking any use cases that depend on the current implementation.  Checks exist (and more can be added) that flag any scenarios where is_causal is passed as True but no attention mask is provided, ensuring that there is no quiet change from even the faulty behavior present in 2.0.

As an upside, the present implementation will improve performance by addressing the passing of the is_causal flag from Transformer modules to MHA, speeding up training for examples such as finetuning BERT, RoBERTa, and XLM-R models.
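
A minimal sketch (not from the PR) of what calling MHA looks like under solution 2, assuming the usual square-subsequent-mask helper: is_causal acts only as a hint about attn_mask, which must still be provided explicitly.

```python
import torch
import torch.nn as nn

L, N, E, H = 16, 2, 64, 4  # hypothetical sequence length, batch, embed dim, heads
mha = nn.MultiheadAttention(E, H)
x = torch.randn(L, N, E)

# The causal mask is passed in explicitly (e.g. from the Transformer layer that
# already owns it); is_causal=True merely declares what the mask represents.
causal_mask = nn.Transformer.generate_square_subsequent_mask(L)
out, _ = mha(x, x, x, attn_mask=causal_mask, is_causal=True)
```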

Differential Revision: D44245725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97214
Approved by: https://github.com/albanD
2023-03-25 01:36:30 +00:00
2e8086b0a1 Add myself to nn codeowners (#97277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97277
Approved by: https://github.com/awgu, https://github.com/albanD
2023-03-25 01:26:23 +00:00
0781188e64 [NCCL] Cleanup NCCL-no-record streams, move to TORCH_NCCL_AVOID_RECORD_STREAMS (#97053)
Cleanup of #89880 including moving environment variable to `TORCH_*` prefix and a warning condition fix.

CC @ptrblck @kwen2501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97053
Approved by: https://github.com/kwen2501
2023-03-25 01:10:06 +00:00
021de486ff [Easy] Apply black to format _spmd files (#97534)
No real changes. Format code to prepare for the PR on top.

Differential Revision: [D44376380](https://our.internmc.facebook.com/intern/diff/D44376380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97534
Approved by: https://github.com/wanchaol
2023-03-25 01:09:41 +00:00
a8f7e0b213 [Easy] Improve error message for meta_mm (#97533)
Differential Revision: [D44376381](https://our.internmc.facebook.com/intern/diff/D44376381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97533
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2023-03-25 01:09:41 +00:00
b32afbbdb6 [Kineto] Improve Config Options Part 2 - update to new Kineto Submodule (#97556)
Summary: Remove the old client code, and replace with the new client interface after updating Kineto submodule.

Test Plan: CI Tests

Reviewed By: chaekit

Differential Revision: D44314909

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97556
Approved by: https://github.com/anupambhatnagar, https://github.com/davidberard98
2023-03-25 00:52:23 +00:00
129e03905d disallow invalid value ranges in torch.testing.make_tensor (#96334)
Fixes #96179.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96334
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
47bfb192a7 deprecate low==high in torch.testing.make_tensor (#96333)
Addresses #96179.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96333
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
76fb9a1c7f fix low and high in torch.testing.make_tensor for integral inputs (#96124)
Fixes #96178.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96124
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
779cd1f15b only apply domain eps for floating and complex types (#97010)
As discussed in https://github.com/pytorch/pytorch/pull/96124#issuecomment-1471973352.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97010
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
9029361f24 honor low and high for torch.bool in torch.testing.make_tensor (#96332)
Fixes #96101.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96332
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
7602aade0f fix random mask creation in test_maskedtensor (#97017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97017
Approved by: https://github.com/pearu, https://github.com/mruberry
2023-03-24 23:55:17 +00:00
303eb37e38 QoL improvements for torch.testing.make_tensor (#96125)
Per title. The major ones:

- Enforce keyword only parameters for `_modify_low_high`, which takes 7 parameters.
  28aa2efd14/torch/testing/_creation.py (L147)
  is just impossible to comprehend without multiple trips back and forth.
- Improve the error messages by including the offending values in the message

I'll highlight the smaller ones inline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96125
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
090af4aa71 add proper tests for torch.testing.make_tensor (#96331)
We had some minimal tests for `torch.testing.make_tensor` before, but nothing exhaustive. This led to quite a few edge cases going undetected. This PR adds comprehensive tests and leaves a few FIXMEs in there for behavior that needs to be fixed in `make_tensor`. This will happen in later commits of this stack. Meaning, at the end of this stack, there shouldn't be any FIXME left in the tests added here.
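
A rough sketch (not taken from the PR) of the kinds of edge cases this stack pins down, assuming the documented `low`/`high` semantics:

```python
import torch
from torch.testing import make_tensor

# low/high should be honored for integral dtypes.
t = make_tensor((4, 4), dtype=torch.int32, device="cpu", low=0, high=10)
assert int(t.min()) >= 0 and int(t.max()) <= 10

# Clearly invalid ranges (low > high) should be rejected rather than
# silently accepted.
try:
    make_tensor((2,), dtype=torch.float32, device="cpu", low=1.0, high=0.0)
except Exception as e:
    print("rejected invalid range:", e)
```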

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96331
Approved by: https://github.com/mruberry
2023-03-24 23:55:17 +00:00
dbe6da797a Revert "Sort: Use cub::WarpMergeSort for small sorts (32 < n <= 128) (#96223)"
This reverts commit 5d8c7e7ea47cb6e1faf333430889a804de87536e.

Reverted https://github.com/pytorch/pytorch/pull/96223 on behalf of https://github.com/osalpekar due to Causing numerous Internal build failures emerging from SortUtils.cuh. Details in [D44378546](https://www.internalfb.com/diff/D44378546)
2023-03-24 23:48:04 +00:00
85885301fd fix ignored qualifiers errors (#97443)
fix ignored qualifiers errors

Summary:
These errors exist in GCC 11, which is the default compiler on CentOS
9.

Test Plan: Rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97443).
* __->__ #97443
* #97442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97443
Approved by: https://github.com/ezyang
2023-03-24 23:05:50 +00:00
39c8188194 Inductor: fall back bernoulli on cpu (#97002)
data type: float32
Input size: torch.Size([64, 4, 128, 128])
single socket (32cores):
```
Before: bernoulli 0.001327775239944458 s      dropout 0.0014216173489888509 s
After:  bernoulli 0.0002424612840016683 s     dropout 0.00039757410685221353 s
```

single core:
```
Before: bernoulli 0.04154032731056213 s      dropout 0.04382548745473226 s
After: bernoulli 0.006143261671066284 s      dropout 0.0065830423831939695 s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97002
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-24 22:13:51 +00:00
2b75955c9f [CI] Add missing --cold-start-latency for the dashboard run (#97547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97547
Approved by: https://github.com/huydhn
2023-03-24 20:15:11 +00:00
95c166cd3d Add is_causal API for TransformerDecoder (#97166)
The same API is implemented for `TransformerEncoder`, where this argument is passed through to the sublayers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97166
Approved by: https://github.com/mikekgfb
2023-03-24 20:00:53 +00:00
92605ee776 Support per channel tensor with unpacking in QNNPACK (#96268)
Summary: Supporting Per Channel quantizer with unpacking for QNNPACK.

Test Plan: buck2 test //caffe2/test/quantization:quantization --  test_qlinear_per_channel_qnnpack_free_memory_and_unpack

Differential Revision: D43865268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96268
Approved by: https://github.com/kimishpatel
2023-03-24 19:52:47 +00:00
c5135ff2a6 [DataPipe] Fix missing imports in DataPipe interface file (#97458)
Fixes https://github.com/pytorch/data/issues/1106

Ran linter locally on `datapipes.pyi` (which is generated during installation) to confirm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97458
Approved by: https://github.com/mikaylagawarecki
2023-03-24 19:25:43 +00:00
827b2aee97 Warn once on dynamo module w/ hooks (#97535)
Fixes #97347

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97535
Approved by: https://github.com/jbschlosser
2023-03-24 19:13:07 +00:00
197434df96 [Kineto] Improve Config Options for Input Shapes, Memory, Stack, Flops, and Modules - Part 1 (#97380)
Summary:
Improve On-Demand Kineto config to enable toggling of [profiler options](https://pytorch.org/docs/stable/profiler.html) via the config file. New config strings:
- PROFILE_REPORT_INPUT_SHAPES
- PROFILE_PROFILE_MEMORY
- PROFILE_WITH_STACK
- PROFILE_WITH_FLOPS
- PROFILE_WITH_MODULES

Also marked for deprecation, but still valid, old config options:
- CLIENT_INTERFACE_ENABLE_OP_INPUTS_COLLECTION
- PYTHON_STACK_TRACE

Test Plan: CI Tests (internal testing)

Reviewed By: leitian

Differential Revision: D44275220

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97380
Approved by: https://github.com/davidberard98
2023-03-24 19:12:19 +00:00
cf0ba1b9c0 Use L1 loss for Smooth L1 loss with beta=0 (#97022)
Fixes #96813.

Comments:

1. Wasn't able to test since tools/nightly.py does not allow for GPU build (and I don't want to build from scratch).
2. In theory, the bug (i.e. NaNs) can still occur when beta is very small (e.g. `beta=1e-50`), but not sure whether anybody cares.
3. Some checks within the smooth_l1_loss C++ code could be changed to check for `beta > 0` instead of `beta >= 0`.
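
A minimal sketch (not from the PR) of the equivalence this change targets: with `beta=0`, smooth L1 should reduce exactly to L1.

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, requires_grad=True)
y = torch.randn(8)

# With beta=0 the quadratic branch disappears, so the result should match l1_loss
# (and the gradient should no longer produce NaNs from the degenerate 0/0 case).
assert torch.allclose(F.smooth_l1_loss(x, y, beta=0.0), F.l1_loss(x, y))
```
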
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97022
Approved by: https://github.com/jbschlosser
2023-03-24 19:10:32 +00:00
17567e5b29 [pytorch@arvr/windows] Fix pytorch build/import on Windows @ ovrsource (#97193)
Summary:

- Importing torch on Windows can cause a crash within python.
- The problem was introduced by the change in `Module.cpp` from https://github.com/pytorch/pytorch/pull/94927
- The cause is that a call to `PyObject* initModule(void)` declared with a `__declspec(dllimport)` specifier can lead to a crash if the definition doesn't include the `__declspec(dllexport)` counterpart.
- The way to mitigate the problem without introducing customized macros or changing the build system (note: `#include <c10/macros/Export.h>` doesn't work in `stub.c`) is to simply remove the `__declspec(dllimport)` specifier.
- According to https://web.archive.org/web/20140808231508/http://blogs.msdn.com/b/russellk/archive/2005/03/20/399465.aspx and other sources, `__declspec(dllimport)` only leads to some code optimizations, and since `initModule()` is only called once at startup, this is marginal.
- Note: the `stub_with_flatbuffer.c` file counterpart wasn't affected, therefore, not touched.

Differential Revision: D44236183

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97193
Approved by: https://github.com/ezyang
2023-03-24 18:32:43 +00:00
baf71a8aad [ROCm] Update clock intrinsic handling for AMD gfx11 family (#97005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97005
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2023-03-24 18:29:49 +00:00
5170995b2a Revert "Upgrade NVTX to NVTX3 (#90689)"
This reverts commit e64ddd1ab9d46cfc921c19269969ffc5cd7d6f6c.

Reverted https://github.com/pytorch/pytorch/pull/90689 on behalf of https://github.com/osalpekar due to Build Failures due to not being able to find one nvtx3 header in FRL jobs: [D42332540](https://www.internalfb.com/diff/D42332540)
2023-03-24 18:16:06 +00:00
a96ccaa362 Code update for vectorized interpolate cpu uint8 (#96847)
- code style update
- use idx_ptr_xmin/idx_ptr_size instead of bounds
- compute wt_max inside _compute_indices_weights_aa (no significant overhead)
- added comments and explanations
- renamed xmin/xmax into ids_min, ids_size

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96847
Approved by: https://github.com/peterbell10, https://github.com/NicolasHug, https://github.com/lezcano
2023-03-24 18:11:03 +00:00
4ff71c91d3 backport std::ssize to c10 (#97442)
backport std::ssize to c10

Summary:
Now that we have -Werror=sign-compare enabled, we encounter a lot of
friction comparing standard containers and our tensors which are
signed.

std::ssize will make it easier and safer to succinctly convert
container sizes to a signed type.

Test Plan: Added a unit test.

Reviewers: ezyang

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97442).
* #97443
* __->__ #97442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97442
Approved by: https://github.com/ezyang
2023-03-24 17:56:05 +00:00
b5edf18334 GradScaler recomputes optimizer_state["found_inf_per_device"] before optimizer.step (#97415)
I found a discrepancy between non-fused and fused optimizers: whether to use `optimizer_state["found_inf"]` or to recompute `found_inf`.

- non fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L289)
- fused: e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353)
    - where `_check_inf_per_device` is e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L564-L573)

The other way to align the behavior is to use the existing `found_inf` in e64ddd1ab9/torch/cuda/amp/grad_scaler.py (L353).

I'd say this PR is for the sake of "safety" and the alternative is to keep the existing behavior.
I honestly have no idea if it's expected to double-check the sanity of gradients in `GradScaler.step`.

---

What I've observed in the huggingface/transformers T5-base example so far is that non-fused optimizers lead to invalid parameters while the fused ones do not.
The cause seems to be that `gradients` become inf/nan before `GradScaler.step(optimizer)` after `GradScaler._unscale_grads_` (more precisely, the call of `torch._amp_foreach_non_finite_check_and_unscale_`) in the script of the issue linked below, i.e. the gradient clipping and/or unscaling lead to inf/nan as these happen after the grad check. See
788300cc2a/aten/src/ATen/native/cuda/AmpKernels.cu (L165-L174).
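
For context, a generic AMP training loop (a sketch, not from the PR; assumes a CUDA device) showing where `GradScaler.step` sits relative to unscaling and gradient clipping:

```python
import torch

model = torch.nn.Linear(10, 10).cuda()
opt = torch.optim.AdamW(model.parameters(), fused=True)
scaler = torch.cuda.amp.GradScaler()

for _ in range(3):
    opt.zero_grad(set_to_none=True)
    with torch.autocast("cuda"):
        loss = model(torch.randn(4, 10, device="cuda")).sum()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                                   # grads may turn inf/nan here or below
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)   # with this PR, found_inf_per_device is recomputed right before stepping
    scaler.update()
```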

Fixes #96755 🙏

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97415
Approved by: https://github.com/ngimel, https://github.com/janeyx99
2023-03-24 17:36:47 +00:00
6fcd671574 Complex support for expm1 (#96644)
Fixes #92619
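
A quick sanity check (not from the PR) of the new complex support:

```python
import torch

z = torch.tensor([0.1 + 0.2j, -0.3 + 1.5j])
# expm1(z) should agree with exp(z) - 1 for complex inputs.
assert torch.allclose(torch.expm1(z), torch.exp(z) - 1)
```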

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96644
Approved by: https://github.com/soulitzer
2023-03-24 17:24:50 +00:00
1b8b82f835 [ROCm] Update magma commit for ROCm (#97491)
Updated magma to more recent commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97491
Approved by: https://github.com/jeffdaily
2023-03-24 17:15:51 +00:00
c55d1a6049 [CI] Experiment with a newer CUDA driver (#96904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96904
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-24 17:05:18 +00:00
622a11d512 Fix typos under torch/utils directory (#97516)
This PR fixes typos in comments and messages of `.py` files under `torch/utils` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97516
Approved by: https://github.com/ezyang
2023-03-24 16:53:39 +00:00
d305d4a57f [Dynamo] Fix TIMM benchmark compute_loss (#97423)
Fixes #97382

#95416 fixed a critical bug in the dynamo benchmark, where AMP tests fell back to eager mode before that PR. However, after that PR, we found [a list of TIMM models failing amp + eager + training testing](https://docs.google.com/spreadsheets/d/1DEhirVOkj15Lu4UNawIUon9MqkVLaWqyT-DQPif5NHk/edit#gid=0).
We have now identified the root cause: high loss values make gradient checking harder, as small changes in accumulation order upset accuracy checks. We should switch to the helper function ```reduce_to_scalar_loss``` which has been used by Torchbench tests.
After switching to ```reduce_to_scalar_loss```, TIMM models' accuracy pass rate grows from 67.74% to 91.94% in my local test. The remaining 5 failing models (ese_vovnet19b_dw, fbnetc_100, mnasnet_100, mobilevit_s, sebotnet33ts_256) need further investigation and handling, but the reason is likely similar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97423
Approved by: https://github.com/Chillee
2023-03-24 16:50:28 +00:00
5f5d675587 remove unused CAFFE2_VERSION macros (#97337)
remove unused CAFFE2_VERSION macros

Summary:
Nothing reads these and they are completely subsumed by TORCH_VERSION.

Getting rid of these will be helpful for build unification, since they
are also not used internally.

Test Plan: Rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97337
Approved by: https://github.com/malfet
2023-03-24 16:02:35 +00:00
605a77fd59 Log FSDP mixed precision (#97367)
Log to clarify the mp config in jobs

Differential Revision: [D44307044](https://our.internmc.facebook.com/intern/diff/D44307044/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97367
Approved by: https://github.com/awgu
2023-03-24 16:01:59 +00:00
51ce02232b [ONNX] Support converting fx graph with symbolic shape to ONNX (#96350)
~~Need https://github.com/microsoft/onnx-script/pull/484~~

Support dynamic export on the fx-ONNX exporter. Essentially, we set input sizes and nodes to be fully dynamic in TorchScript, and leverage `aten::sym_size` to catch dynamic sizes between Ops.

1. Add a `dynamic_axes` switch between symbolic tracing (dynamic sizes) and fake mode (static). Set it to default True, as most of our tests are happy with symbolic tracing. Except GPT2, which stays with fake mode with error: https://github.com/microsoft/onnx-script/issues/523
2. Add test_fx_dynamic_onnruntime.py to test some ad-hoc cases we have from the old exporter. This can be removed once the tests are integrated with https://github.com/pytorch/pytorch/pull/96479
3. Since `aten::sym_size` is operated on with built-in functions, a built-in function mapping is added to support SymFloat/SymInt. (FIXME: https://github.com/pytorch/pytorch/issues/97201). The sym_size output value is also an fx.Node and can be found in `fx_name_to_onnxscipt_value`, so its operation stays the same as other ONNX ops in the ONNX graph.
4. Fully deprecated FakeTensorProp as make_fx() should provide all node meta info.
5. Put complicated fx.Node related ArgumentType into _type_utils.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96350
Approved by: https://github.com/wschin, https://github.com/justinchuby
2023-03-24 15:47:55 +00:00
bcff4773da add /std:c++17 to windows compilations when not using Ninja (#97445)
add /std:c++17 to windows compilations when not using Ninja

Summary:
This was overlooked when we upgraded to C++17.

Test Plan: Rely on CI.

Reviewers: ezyang

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97445).
* #96603
* #97473
* #97175
* #97515
* __->__ #97445
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97445
Approved by: https://github.com/ezyang
2023-03-24 14:52:29 +00:00
6e46f47227 [inductor] xfail tests by default (#97331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97331
Approved by: https://github.com/ezyang
2023-03-24 11:11:05 +00:00
36d64760d9 Disable inductor developer warnings in official releases (#97451)
Fixes https://github.com/pytorch/pytorch/issues/97449

We shouldn't match on: torch             2.0.0+cu118
We should match on: torch             2.1.0.dev20230323+cu118
We should match on: torch              2.1.0a0+git63e1f12
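
A hypothetical sketch (not the actual check in the PR) of the intended matching behavior on `torch.__version__` strings:

```python
import re

def looks_like_dev_build(version: str) -> bool:
    # Treat nightly/dev and local "a0+git" builds as developer builds;
    # released wheels like "2.0.0+cu118" should not match.
    return bool(re.search(r"(\.dev\d+|a0\+git)", version))

assert not looks_like_dev_build("2.0.0+cu118")
assert looks_like_dev_build("2.1.0.dev20230323+cu118")
assert looks_like_dev_build("2.1.0a0+git63e1f12")
```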

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97451
Approved by: https://github.com/SherlockNoMad
2023-03-24 07:16:51 +00:00
73fadd523b Use a single stream for cuda graph pool (#97419)
Previously, we would use the same memory pool but not actually reuse the same memory. The peak memory showed good numbers, but real memory use was much higher because we had a bunch of unallocated segments that could not be reused.

As stated in comments:

NB: cuda caching allocator will remember the stream a segment is allocated to
and only allocate that segment to the same stream. we need to use a single stream
for all allocations to the memory pool, otherwise the allocations to separate streams
will not be reused; separate recordings would have used the same memory pool, but not
the same memory.

Thanks to @zdevito for help debugging this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97419
Approved by: https://github.com/ngimel
2023-03-24 07:04:12 +00:00
b11ce4bbca Bring back tensor_has_compatible_shallow_copy_type (#97455)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97455
Approved by: https://github.com/clee2000
2023-03-24 06:43:20 +00:00
f25cdf8aeb Revert "Rewrite NCCL watchdog to more reliably throw timeout (#97066)"
This reverts commit 95e8d0c39ec523f5a35c31155285fd4242928d8a.

Reverted https://github.com/pytorch/pytorch/pull/97066 on behalf of https://github.com/clee2000 due to sorry but I think this broke periodic multigpu tests 416bac5b81 https://github.com/pytorch/pytorch/actions/runs/4505085943/jobs/7930826040
2023-03-24 06:27:00 +00:00
ad5d81adda [Sparse] Add reference implementation for addmv (#97353)
Partially addresses the problem raised in https://github.com/pytorch/pytorch/issues/96972

Add `test_addmv` and enable `test_block_addmv` on all platforms (so the test could be run on M1)

TODO: Make sure that test_block_addmv non-contiguous mode actually
generates non-contiguous inputs, as right now it probably does not: the
test passes assuming values are contiguous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97353
Approved by: https://github.com/cpuhrsch
2023-03-24 06:14:32 +00:00
31e858e8fc Add missing aot_autograd_arg_pos_to_source (#97487)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97487
Approved by: https://github.com/malfet, https://github.com/ezyang
2023-03-24 05:17:59 +00:00
9320cae1da Add GPU frequency lock option to inductor workflows running on A100 (#97465)
Fixes #97459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97465
Approved by: https://github.com/xuzhao9
2023-03-24 05:15:21 +00:00
fa4c77e39b Rename PyOperator to HigherOrderOperator (#97493)
Twice this week I have had people confuse "operator defined with Python
operator registration aka torch.library" and "PyOperator which is used
to define control flow operators and other operators that cannot be
represented in JIT schema."  Renaming PyOperator for clarity.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97493
Approved by: https://github.com/SherlockNoMad
2023-03-24 05:04:02 +00:00
763c5a33e7 [Vulkan] Fix quantized cpu to vulkan broken by padding (#97372)
Summary:
Previous diff D43068669 introduced channel padding, and in doing so, it broke the quantized copy of cpu to vulkan tensors.
This diff updates the quantized nchw to image shaders, in order to work with padded channels.

Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

Differential Revision: D44309956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97372
Approved by: https://github.com/SS-JIA
2023-03-24 03:37:29 +00:00
a66625da3b [PyTorch] Optimize DictType::annotation_str_impl (#96498)
stringstream construction is expensive, and we can exactly reserve space for the output string while doing the same number of string copies. (If we wanted to improve performance further, we could introduce annotation_str_out to append the output to a given std::string and thus avoid copying subtype annotation strings. It occurs to me that the existing approach is quadratic in the number of layers of nesting, so we should probably do this!)

Differential Revision: [D43919651](https://our.internmc.facebook.com/intern/diff/D43919651/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96498
Approved by: https://github.com/Skylion007
2023-03-24 02:38:21 +00:00
000cfeb848 [PyTorch] Optimize TupleType::annotation_str_impl (#96497)
stringstream is expensive to create, so we use ostringstream instead of stringstream, and we can easily specialize the empty tuple. Also, anybody compiling with C++20 support can move out of the stream and it shouldn't hurt people without C++20 support to do so. I would consider specializing the 1-element case as well but I don't have evidence that that's necessary right now.

Differential Revision: [D43882402](https://our.internmc.facebook.com/intern/diff/D43882402/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96497
Approved by: https://github.com/Skylion007
2023-03-24 02:35:32 +00:00
33dfdedb28 CUDAGraph Trees - Warn on dealloc (#97171)
Differential Revision: [D44228370](https://our.internmc.facebook.com/intern/diff/D44228370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97171
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-24 01:19:19 +00:00
24e280d5e2 clean up triton mathlib (#97460)
now both OSS and internal use tl.math

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97460
Approved by: https://github.com/ngimel
2023-03-24 01:08:07 +00:00
bb74d04353 Remove inductor-perf-test-nightly label (#97290)
Remove labels according to Ed's suggestion
"I do NOT think performance dashboard should be label triggered. It easy to put a label on the PR, and then forget about it and keep spamming our limited A100 capacity when you push updates to your PR." Instead, one can use the "Run workflow" option and specify their feature branch in https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-compare.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97290
Approved by: https://github.com/ezyang, https://github.com/desertfire
2023-03-24 01:02:29 +00:00
63e1f12b49 Speedup bincount and histc on CUDA (#97090)
This is to speed up torch.bincount and torch.histc on CUDA.

1. Speed up int64_t gpuAtomicAdd,
2. and optimize the histogram kernel.

# Fixes #96626
After speedup, time cost in #96626 would be

```
... (run 2 times and ignore the first run)
case 1 CPU  0.0003631114959716797 seconds
case 1 CUDA 0.0005860328674316406 seconds
case 2 CPU  0.0013742446899414062 seconds
case 2 CUDA 0.0008623600006103516 seconds
```

Note that in "*case 1 CUDA*", the **max** op takes the most time, i.e., 5ee5a164ff/aten/src/ATen/native/cuda/SummaryOps.cu (L334-L335), which is not to be optimized in this PR.

# Benchmark

Time is measured on i7-10700 + RTX 3080, Ubuntu 22.04 (in WSL). The baseline is PyTorch 2.0.0+cu117. My dev version of PyTorch is compiled with CUDA 11.8. Each case is measured 15 times to take the median.

## torch.bincount
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.000834 | 0.005783 | 0.000266 | 21.8x
2**20 | 80 | narrow in 1 bin | 0.001576 | 0.003967 | 0.000563 | 7.0x
2**20 | 500 | random.uniform | 0.000852 | 0.003641 | 0.000334 | 10.9x
2**20 | 500 | narrow in 1% bins | 0.000894 | 0.001878 | 0.000349 | 5.4x
2**20 | 2048 | random.uniform | 0.000891 | 0.000820 | 0.000298 | 2.8x
2**20 | 2048 | narrow in 1% bins | 0.000958 | 1.043251 | 0.000335 | 3,116.6x
2**26 | 80 | random.uniform | 0.067715 | 0.322409 | 0.003032 | 106.3x
2**26 | 80 | narrow in 1 bin | 0.110940 | 0.194644 | 0.017651 | 11.0x
2**26 | 500 | random.uniform | 0.066666 | 0.192302 | 0.002535 | 75.8x
2**26 | 500 | narrow in 1% bins | 0.066130 | 0.092237 | 0.005462 | 16.9x
2**26 | 2048 | random.uniform | 0.066371 | 0.035308 | 0.002476 | 14.3x
2**26 | 2048 | narrow in 1% bins | 0.068453 | 72.122858 | 0.003185 | 22,644.3x

## torch.histc (float32)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | 0.001261 | 0.000145 | 9.47E-05 | 1.5x
2**20 | 80 | narrow in 1 bin | 0.001074 | 0.000356 | 0.000311 | 1.1x
2**20 | 500 | random.uniform | 0.001162 | 0.000227 | 9.18E-05 | 2.5x
2**20 | 500 | narrow in 1% bins | 0.001082 | 0.000201 | 0.000152 | 1.3x
2**20 | 2048 | random.uniform | 0.001100 | 0.000203 | 0.000118 | 1.7x
2**20 | 2048 | narrow in 1% bins | 0.001089 | 0.000396 | 0.000107 | 3.7x
2**26 | 80 | random.uniform | 0.064219 | 0.001170 | 0.000786 | 1.5x
2**26 | 80 | narrow in 1 bin | 0.056471 | 0.013283 | 0.011939 | 1.1x
2**26 | 500 | random.uniform | 0.078183 | 0.003411 | 0.000562 | 6.1x
2**26 | 500 | narrow in 1% bins | 0.056711 | 0.002763 | 0.002738 | 1.0x
2**26 | 2048 | random.uniform | 0.059296 | 0.003503 | 0.000533 | 6.6x
2**26 | 2048 | narrow in 1% bins | 0.061754 | 0.015703 | 0.000962 | 16.3x

## torch.histc (int64)
#elem | nbins | distribution | CPU | PyTorch 2.0.0 | this PR | speedup
-- | -- | -- | -- | -- | -- | --
2**20 | 80 | random.uniform | N/A | 0.005614 | 9.47E-05 | 59.3x
2**20 | 80 | narrow in 1 bin | N/A | 0.003799 | 0.000395 | 9.6x
2**20 | 500 | random.uniform | N/A | 0.003665 | 9.58E-05 | 38.2x
2**20 | 500 | narrow in 1% bins | N/A | 0.001760 | 0.000178 | 9.9x
2**20 | 2048 | random.uniform | N/A | 0.000693 | 0.000111 | 6.2x
2**20 | 2048 | narrow in 1% bins | N/A | 1.082904 | 0.000123 | 8,802.4x
2**26 | 80 | random.uniform | N/A | 0.320400 | 0.001145 | 279.9x
2**26 | 80 | narrow in 1 bin | N/A | 0.193668 | 0.015229 | 12.7x
2**26 | 500 | random.uniform | N/A | 0.182897 | 0.000823 | 222.2x
2**26 | 500 | narrow in 1% bins | N/A | 0.089363 | 0.00376 | 23.8x
2**26 | 2048 | random.uniform | N/A | 0.033190 | 0.000832 | 39.9x
2**26 | 2048 | narrow in 1% bins | N/A | 71.721012 | 0.001525 | 47,017.8x

## Benchmark code

Here is the benchmark code:

```python3
import time
import torch

cases = [
    ("bincount    bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.bincount(x, minlength=80)),
    ("bincount    bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.bincount(x, minlength=500)),
    ("bincount    bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.bincount(x, minlength=2048)),
    ("bincount    bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.bincount(x, minlength=2048)),
    ("histc_float bins=80   wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=80   narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=80, min=0., max=1.)),
    ("histc_float bins=500  wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=500  narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=500, min=0., max=1.)),
    ("histc_float bins=2048 wide  ", torch.rand(2**20),            lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_float bins=2048 narrow", torch.rand(2**20)*.01,        lambda x: torch.histc(x, bins=2048, min=0., max=1.)),
    ("histc_int   bins=80   wide  ", torch.randint(80, [2**20]),   lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=80   narrow", torch.randint(1, [2**20]),    lambda x: torch.histc(x, bins=80, min=0., max=80.)),
    ("histc_int   bins=500  wide  ", torch.randint(500, [2**20]),  lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=500  narrow", torch.randint(5, [2**20]),    lambda x: torch.histc(x, bins=500, min=0., max=500.)),
    ("histc_int   bins=2048 wide  ", torch.randint(2048, [2**20]), lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
    ("histc_int   bins=2048 narrow", torch.randint(20, [2**20]),   lambda x: torch.histc(x, bins=2048, min=0., max=2048.)),
]

def test(case, device):
    name, x, func = case
    x = x.to(device)
    time_samples = []
    for _ in range(15):
        torch.cuda.synchronize()
        t1 = time.time()
        func(x)
        torch.cuda.synchronize()
        t2 = time.time()
        time_samples.append(t2 - t1)
    median = sorted(time_samples)[len(time_samples) // 2]
    print(device, name, median)

for case in cases:
    test(case, device="cuda")

# for case in cases:
#     test(case, device="cpu")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97090
Approved by: https://github.com/ngimel
2023-03-24 00:25:34 +00:00
f3cf3d7620 [DTensor] Fix the default PG condition for DeviceMesh (#97384)
The current condition to use the default PG is `len(unique_mesh_values) == WORLD_SIZE - 1`. The `- 1` is not correct and seems to be an incorrect fix from https://github.com/pytorch/pytorch/pull/96861.

Differential Revision: [D44314317](https://our.internmc.facebook.com/intern/diff/D44314317/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97384
Approved by: https://github.com/wanchaol
2023-03-24 00:04:34 +00:00
e4b365a9a0 Use an equal operator that doesn't depend on nonzero for flatbuffer_serializer (#97298)
Summary: call to is_nonzero here is not desirable: https://www.internalfb.com/code/fbsource/[ed0407ba3bf520baa2e9333483b274c5b40b54eb]/fbcode/caffe2/aten/src/ATen/core/ivalue.cpp?lines=278

Differential Revision: D44276685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97298
Approved by: https://github.com/larryliu0820
2023-03-23 23:54:41 +00:00
12da0c7037 Revert "remove dead torch_pb.h library (#97323)"
This reverts commit 364d92f9b6864ce284fa13519c7ca5c87460e477.

Reverted https://github.com/pytorch/pytorch/pull/97323 on behalf of https://github.com/malfet due to Reverting as PR dependent on https://github.com/pytorch/pytorch/pull/97322 that has been reverted
2023-03-23 23:19:05 +00:00
b531eb974a Revert "move caffe2/proto/ to its own Bazel package (#97324)"
This reverts commit 6273c0af9513895f0597ae1801f37164d7b46d2a.

Reverted https://github.com/pytorch/pytorch/pull/97324 on behalf of https://github.com/malfet due to Reverting as PR dependent on https://github.com/pytorch/pytorch/pull/97322 that has been reverted
2023-03-23 23:13:43 +00:00
91a3040b4b Revert "cleanup caffe2 cc_proto_library (#97325)"
This reverts commit 603a32c96458af870fd1653cdf57453bc7d9905d.

Reverted https://github.com/pytorch/pytorch/pull/97325 on behalf of https://github.com/malfet due to Reverting as PR dependent on https://github.com/pytorch/pytorch/pull/97322 that has been reverted
2023-03-23 23:08:26 +00:00
0d66db1b2a Implement last dim split_with_sizes for NT (forward only, non-SymInt-ified) (#97446)
This is needed for the HSTU model.

Details:
* ~~NT `chunk` now calls into NT `split_with_sizes` since the latter is more general~~ (removed; they're totally separate)
* Throws for backward
* Only operates over the last dim (`dim=-1`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97446
Approved by: https://github.com/cpuhrsch
2023-03-23 22:17:06 +00:00
37f7c13b7b [ci] disable some dtensor tests (#97358)
fixes https://github.com/pytorch/pytorch/issues/96454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97358
Approved by: https://github.com/rohan-varma
2023-03-23 22:08:31 +00:00
13fbf93238 Revert "remove dead proto_convert library (#97322)"
This reverts commit d850c33bfe3f1d1f0040738718cacb04ee449bdc.

Reverted https://github.com/pytorch/pytorch/pull/97322 on behalf of https://github.com/osalpekar due to This broke a large number of internal builds due to not being able to find proto_convert.h. See here: [D44319486](https://www.internalfb.com/diff/D44319486)
2023-03-23 21:38:01 +00:00
95e8d0c39e Rewrite NCCL watchdog to more reliably throw timeout (#97066)
Fixes #97191

This PR aims to propagate collective exceptions (async error or timeout) up to the program, so as to avoid a silently stuck job.

### Previous output in #97191
```
Rank 0 is the problematic rank
Rank 4 completed
Rank 5 completed
Rank 3 completed
Rank 6 completed
Rank 2 completed
Rank 7 completed
Rank 1 completed
[E ProcessGroupNCCL.cpp:464] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10917 milliseconds before timing out.
Rank 0 completed
[E ProcessGroupNCCL.cpp:478] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:483] To avoid data inconsistency, we are taking the entire process down.
```
Although it says that it is taking the process down, it sometimes fails to do so.

### New output after this PR:
```
...
[E ProcessGroupNCCL.cpp:459] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:473] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:479] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:818] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=10000) ran for 10599 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 194470) of binary: /data/home/kw2501/repos/pytorch-dev-env/bin/python
Traceback (most recent call last):
  File "/pytorch-dev-env/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
  File "/pytorch-dev/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/pytorch-dev/torch/distributed/run.py", line 794, in main
    run(args)
  File "/pytorch-dev/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/pytorch-dev/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/pytorch-dev/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
hang.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-20_22:00:42
  host      : node0
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 194470)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 194470
============================================================
```

The log suggests that the TorchX monitor is triggered, and the job is torn down.

### Major changes in this PR:
1. Merge ncclWatchDog thread and workCleanupLoop thread into one so that the watch action and the throw action are streamlined.
Previously, ncclWatchDog was responsible for watching comm errors and timeouts, and workCleanupLoop was responsible for watching Work item errors and throwing exceptions. This two-thread design is not streamlined, raising the chance of missing the throw, and it duplicates the watching at multiple levels.
2. Rethrow exception at watchdog thread.
3. Clean up a bunch of duplicated functions, e.g. `checkAndThrowException` and `handleNcclException`.
4. Turn on ASYNC_ERROR_HANDLING by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97066
Approved by: https://github.com/rohan-varma
2023-03-23 21:31:21 +00:00
416bac5b81 [Vulkan] Fix static analysis errors in vulkan_quantized_api_test.cpp (#97400)
Summary:
Fixes many static analysis and linter issues present in vulkan_quantized_api_test.cpp.
Replaces the C-style rand function with C++ random functions.

Test Plan:
```
buck run --target-platforms ovr_config//platform/macos:arm64-fbsource -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

Differential Revision: D44315235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97400
Approved by: https://github.com/kimishpatel
2023-03-23 20:32:43 +00:00
c2d7508276 [DTensor] default value for DTensor ops on non-participating devices (#95852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95852
Approved by: https://github.com/wanchaol
2023-03-23 19:30:02 +00:00
103f4c99f0 [DTensor] implement aten.equal sharding prop (#97170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97170
Approved by: https://github.com/wanchaol
2023-03-23 19:30:02 +00:00
5f57b36318 Rename torch._inductor.triton_ops.autotune to torch._inductor.triton_heuristics (#95558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95558
Approved by: https://github.com/Chillee
2023-03-23 17:41:19 +00:00
f0649d4723 update flatten.py docstring (#97276)
Carried over the comment from the tensor.flatten docstring to clarify when a view vs. a copy is returned - this has been a [minor point of confusion in forums](https://discuss.pytorch.org/t/what-is-the-difference-of-flatten-and-view-1-in-pytorch/51790/5).  This comment is:

```
    Unlike NumPy’s flatten, which always copies input’s data, this function may return the original object, a view, or copy.
    If no dimensions are flattened, then the original object input is returned.
    Otherwise, if input can be viewed as the flattened shape, then that view is returned.
    Finally, only if the input cannot be viewed as the flattened shape is input’s data copied.
    See torch.Tensor.view() for details on when a view will be returned.
```
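
A small illustration (not from the docstring itself) of the view-vs-copy behavior described above:

```python
import torch

x = torch.arange(6).reshape(2, 3)       # contiguous
f = x.flatten()
assert f.data_ptr() == x.data_ptr()      # can be viewed -> shares storage (a view)

y = x.t()                                # transposed, cannot be viewed as 1-D
g = y.flatten()
assert g.data_ptr() != y.data_ptr()      # falls back to copying the data
```
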
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97276
Approved by: https://github.com/mikaylagawarecki
2023-03-23 17:10:10 +00:00
a3b30c5025 update internal triton (#97422)
Reviewed By: bertmaher

Differential Revision: D44276873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97422
Approved by: https://github.com/brad-mengchi
2023-03-23 17:01:35 +00:00
29c061bb90 Remove non existent files in multigpu tests (#97393)
They were removed in https://github.com/pytorch/pytorch/pull/96989/files and https://github.com/pytorch/pytorch/pull/96985/files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97393
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/fduwjj, https://github.com/malfet
2023-03-23 17:00:29 +00:00
4a88f71f65 Fix potential naming clash when writing traces with tensorboard_trace_handler (#97392)
Fixes https://github.com/pytorch/pytorch/issues/82915

This rare flaky issue caught my attention today when it failed flakily on MacOS in https://github.com/pytorch/pytorch/actions/runs/4494182574/jobs/7906827531.  The test expected 3 traces to be written but got only 2 of them.

Looking a bit closer into the `tensorboard_trace_handler` function, it looks like there is a potential filename clash here.  The milliseconds-since-epoch timestamp `"{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))` is used as part of the name.  As `tensorboard_trace_handler` is used as a callback handler in the test, the names look too close to each other (1 millisecond apart), i.e.

```
huydo-mbp_13494.1679526197252.pt.trace.json
huydo-mbp_13494.1679526197253.pt.trace.json
huydo-mbp_13494.1679526197250.pt.trace.json
```

Switching to nanosecond reduces the chance of two or more of them having the same timestamp while keeping the naming convention intact, i.e. `huydo-mbp_13804.1679526325182878000.pt.trace.json`
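
A rough sketch (assumed naming, not the exact handler code) of the difference between the two timestamp resolutions:

```python
import time

worker_name = "worker_0"  # hypothetical worker name

# Millisecond resolution can collide when two traces are written ~1 ms apart:
fname_ms = "{}.{}.pt.trace.json".format(worker_name, int(time.time() * 1000))
# Nanosecond resolution keeps the same convention but makes collisions very unlikely:
fname_ns = "{}.{}.pt.trace.json".format(worker_name, time.time_ns())
print(fname_ms)
print(fname_ns)
```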

I suspect that this is also the cause of Windows flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97392
Approved by: https://github.com/malfet, https://github.com/aaronenyeshi
2023-03-23 16:53:11 +00:00
d499b7d750 [inductor] Fix a multi-gpu context error (#97398)
Summary: The problem only appears when we enable multi-gpu test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97398
Approved by: https://github.com/ngimel, https://github.com/shunting314
2023-03-23 15:59:22 +00:00
7711d24717 vmap support for linalg.lu_factor (#94328)
Differential Revision: D43093457

Fix #91415

### Expected behaviour

No fallback performance warning should be raised.

```python
import torch
from functorch import vmap

x = torch.randn(4, 3, 2)
z = vmap(torch.linalg.lu_factor)(x)
```
Same behaviour as for-loop:

```python
x = torch.randn(4, 3, 2)
results = []
for xi in x:
  y = torch.linalg.lu_factor(xi)
  results.append(y)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94328
Approved by: https://github.com/zou3519, https://github.com/Skylion007, https://github.com/Chillee
2023-03-23 14:18:57 +00:00
bdaf402565 build C++ extensions on windows with /std:c++17 (#97413)
build C++ extensions on windows with /std:c++17

Summary:
We added -std=c++17 to Posix builds, but neglected to add this for
Windows. This just brings back parity.

Test Plan: Rely on CI.

Reviewers: ezyang

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97413).
* #97175
* __->__ #97413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97413
Approved by: https://github.com/ezyang
2023-03-23 13:31:29 +00:00
feace5d66f [inductor] handle integer Symbols in is_integer_type (#97217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97217
Approved by: https://github.com/alexsio27444, https://github.com/ezyang
2023-03-23 13:29:45 +00:00
a331cd4314 [inductor] fix cpp legalize bf16 reduction (#97228)
When legalizing bf16 for reduction, operators with a result dtype of torch.int64, like argmax, would currently encounter an assertion error. The PR fixes the int64 case, enabling several bf16 models (hf_Reformer, doctr_reco_predictor) to run successfully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97228
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire
2023-03-23 08:52:25 +00:00
580b4702bc [FSDP][optim_state_dict] Consolidate the arguments and logic of optim_state_dict and optim_state_dict_to_load (#96534)
Summary:
The current `optim_state_dict()` does not require users to call `optim.state_dict()` first, while `optim_state_dict_to_load()` requires users to call `optim.load_state_dict()`. This PR makes both APIs provide the option of not having to call the extra API.

This PR also changes the argument order of `optim_state_dict_to_load`, which is a breaking change, so we should do this ASAP before the API is adopted in production use cases.

Test Plan: CI

Differential Revision: D43925068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96534
Approved by: https://github.com/rohan-varma
2023-03-23 07:56:08 +00:00
1fb1c6e135 Retry download and install NDK when testing Android (#97067)
As this step uses the network to download and install the NDK, it can fail flakily, e.g. https://github.com/pytorch/pytorch/actions/runs/4452757793/jobs/7820701670.  So I'm adding retries to the workflow.

I could try to figure out a way to Dockerize this, but I'm not sure yet how to handle the GitHub action `reactivecircus/android-emulator-runner@v2` in Docker.  So let's opt for the easy fix of retrying.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97067
Approved by: https://github.com/malfet
2023-03-23 07:10:32 +00:00
37faa48844 DCE inference graphs too (#97275)
I added a bunch of asserts to verify that I didn't accidentally kill copy_ in the graph; hopefully this, combined with our existing tests, is good enough.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97275
Approved by: https://github.com/bdhirsh
2023-03-23 07:02:52 +00:00
fbc803df0c Only warn once for TypedStorage deprecation (#97379)
Fixes #97207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97379
Approved by: https://github.com/ezyang
2023-03-23 05:40:23 +00:00
b507d7d798 Fix Device Idx Setting (#97399)
We weren't always setting the device indices, which led to a StopIteration Exception on next(iter(device_idxs))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97399
Approved by: https://github.com/yanboliang, https://github.com/ngimel
2023-03-23 04:55:31 +00:00
5d8c7e7ea4 Sort: Use cub::WarpMergeSort for small sorts (32 < n <= 128) (#96223)
We currently use `bitonicSortKVInplace` for sorts of size `n <= 32`
but use `radixSortKVInplace` for `32 < n <= 4096`. Bitonic sort is
also unstable, which forces stable sorts to fall back to a path that is up to
4x slower in this small regime.

This PR adds a new kernel `warpMergeSortKVInplace` using
`cub::WarpMergeSort` to implement sorts with `32 < n <= 128` and all
stable sorts with `n < 128`. This results in up to a 2x speedup for
unstable sorts and up to 15x for stable sorts, depending on the input
geometry.

This also doesn't increase the total number of kernels since we are
replacing radix-sorts of size 32 and 128.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96223
Approved by: https://github.com/ngimel
2023-03-23 04:24:54 +00:00
3b54592050 [PyTorch] Add annotation_str benchmark (#96496)
To be used to evaluate performance of following improvements. Baseline numbers:

https://gist.github.com/swolchok/c8bcb92be1dc6e67c4f7efad498becd5

Differential Revision: [D43919653](https://our.internmc.facebook.com/intern/diff/D43919653/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43919653/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96496
Approved by: https://github.com/Skylion007
2023-03-23 04:18:07 +00:00
a34d35d569 [vision hash update] update the pinned vision hash (#97396)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97396
Approved by: https://github.com/pytorchbot
2023-03-23 04:14:58 +00:00
62ecfa8b79 Fix typo under torch/csrc/jit/passes directory (#97222)
This PR fixes typo in comments under `torch/csrc/jit/passes` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97222
Approved by: https://github.com/davidberard98, https://github.com/kit1980
2023-03-23 04:08:42 +00:00
603a32c964 cleanup caffe2 cc_proto_library (#97325)
cleanup caffe2 cc_proto_library

Summary:
This doesn't need to be public, nor does it need a long name. Since
this is the most private library, we move it to the bottom of the
file.

Test Plan: Should be a no-op, verify in CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97325).
* #97337
* #97336
* #97335
* #97334
* __->__ #97325
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97325
Approved by: https://github.com/malfet, https://github.com/PaliC
2023-03-23 03:34:12 +00:00
35439e8610 [Inductor] add guards to guarantee vector int32 only used by comparison ops (for masked load) (#97144)
Fix https://github.com/pytorch/pytorch/issues/97124 and https://github.com/pytorch/pytorch/issues/97127

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97144
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-23 03:12:50 +00:00
c5b65032ac Restore ROCm trunk jobs (#97354)
Move it back from unstable as the job looks stable now.  The one remaining flaky test I have seen is `functorch/test_ops.py::TestOperatorsCUDA::test_vmapjvpvjp_svd_cuda_float32` b04363ead4, so I'll just try to skip that one on ROCm.

I will monitor the job a bit longer, and have this PR at the ready.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97354
Approved by: https://github.com/zou3519, https://github.com/ZainRizvi
2023-03-23 02:56:44 +00:00
a74ecaf0f6 Revert "Retry download and install NDK when testing Android (#97067)"
This reverts commit d9b289b74749f76ea9a8c27e74fb60251ff42a66.

Reverted https://github.com/pytorch/pytorch/pull/97067 on behalf of https://github.com/huydhn due to Need to rework this a bit as sdkmanager does not correctly treat a failed download as failure (surprise) https://github.com/pytorch/pytorch/actions/runs/4495666042/jobs/7909537961
2023-03-23 02:53:49 +00:00
da7c42f89a Uninstall PyTorch after testing on non-ephemeral Windows runners (#97285)
Per title, I suspect that having a leftover PyTorch built from CUDA 11.7 installed in non-ephemeral Windows runners could cause some flakiness on Windows CUDA 11.8 jobs also running on the same type of runners, for example `win-vs2019-cuda11.8-py3` in 5d3c347bf6 failed with a PATH error:

```
nvrtc: error: failed to open nvrtc-builtins64_117.dll.
Make sure that nvrtc-builtins64_117.dll is installed correctly.
```

This also cleans up the dead code about `pytorch_env_restore.bat` under the `ci_scripts` temp directory.  This directory is always cleaned up by [teardown-win](https://github.com/pytorch/pytorch/blob/master/.github/actions/teardown-win/action.yml#L33), so the bat script will never be there for the next job anyway.  Windows test jobs are doing fine, proving that we don't need this ad-hoc script anymore.

### Testing
https://github.com/pytorch/pytorch/actions/runs/4485931686/jobs/7888513795
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97285
Approved by: https://github.com/seemethere
2023-03-23 02:26:26 +00:00
cyy
e64ddd1ab9 Upgrade NVTX to NVTX3 (#90689)
Due to the recent upgrade to CUDA 11, we can upgrade NVTX to NVTX3 as well, which is a header-only library that can simplify the build system a lot.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90689
Approved by: https://github.com/soumith, https://github.com/malfet
2023-03-23 01:56:42 +00:00
4b75583052 Add autocast_test_lists.py to the merge patterns (#94381)
Add autocast_test_lists.py to the merge patterns.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94381
Approved by: https://github.com/jgong5, https://github.com/kit1980, https://github.com/malfet
2023-03-23 01:56:02 +00:00
4610ce49f6 Fix typo under torch/testing directory (#97254)
This PR fixes typo in comments and messages under `torch/testing` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97254
Approved by: https://github.com/kit1980, https://github.com/malfet
2023-03-23 01:46:17 +00:00
788300cc2a [cudnn] Support v8 API in fbcode (#96512)
Summary: It turns out we never turn on the cuDNN v8 API, which blocks bf16 conv. Enable the new v8 API.

Test Plan: buck run mode/dev-nosan scripts/xdwang/example:fc_pytorch

Reviewed By: ngimel

Differential Revision: D43784279

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96512
Approved by: https://github.com/malfet
2023-03-23 01:41:04 +00:00
fe0afc5852 use accumulate type in BF16 gemm(include dot, mv) ref path (#96074)
Fix https://github.com/pytorch/pytorch/issues/95125 and https://github.com/pytorch/pytorch/issues/83863 for bf16 accumulation in gemm ref path

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96074
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2023-03-23 01:22:59 +00:00
b45880c537 Optionally ignore utf-8 decoding error when converting std::string to python str. (#97282)
Summary: When language models use a C++ tokenizer, the outputs are C++ strings that are not necessarily valid UTF-8. Default pybind11 casting uses strict UTF-8 decoding. We relax the decoding using the 'ignore' argument.
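
A minimal Python-level sketch of the relaxed decoding behavior (an analogy only — the actual change is in the C++/pybind11 casting, not this Python call):

```python
raw = b"token \xff suffix"  # bytes that are not valid UTF-8

try:
    raw.decode("utf-8")  # strict decoding (the old behavior) raises
except UnicodeDecodeError as e:
    print("strict decode failed:", e)

print(raw.decode("utf-8", errors="ignore"))  # invalid bytes are dropped: 'token  suffix'
```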

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97282
Approved by: https://github.com/davidberard98
2023-03-23 01:19:08 +00:00
a524123c91 [torchgen] Bump native function max namespace levels due for internal use case (#97381)
Summary: As titled. Should be trivial

Test Plan: Rely on unit test

Differential Revision: D44314834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97381
Approved by: https://github.com/cccclai
2023-03-23 00:40:37 +00:00
13ca08435c [test_foreach] add cases of zero size tensors (#95028)
Supply zero-size tensors only if multi_tensor_apply_kernel would be called with high probability, i.e. the device is cuda and the dtype is float32.

rel:
- https://github.com/pytorch/pytorch/pull/94655
- https://github.com/pytorch/pytorch/issues/94865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95028
Approved by: https://github.com/ngimel
2023-03-23 00:12:13 +00:00
116a4f2301 linemaps for inductor: python 3.9 and lower doesn't have bisect key argument (#97369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97369
Approved by: https://github.com/eellison
2023-03-22 23:39:52 +00:00
3303f5447a [inductor] use real data for cudagraphify (#97363)
Using zeros is unsafe and results in a bad memory
access in GPT2SequenceClassification that only occurs
when a tensor gets put at the beginning of a segment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97363
Approved by: https://github.com/eellison
2023-03-22 23:39:52 +00:00
a1edf5f63c [EASY] Do hook sizes check with SymInt (#97362)
I don't think this matters for any uses right now, but I found
it during an audit; might as well fix it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97362
Approved by: https://github.com/wconstab
2023-03-22 23:26:00 +00:00
5425191f57 Update xla pin merge rule for python3.8 (#97371)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97371
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-22 23:11:48 +00:00
bc268284de [ci] Onnx test 3->2 shards (#97383)
Nit: not entirely sure why ONNX needs an extra shard; it also doesn't seem to be doing anything.
https://github.com/pytorch/pytorch/actions/runs/4494513193/jobs/7907327958
The test step is 2 minutes long
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97383
Approved by: https://github.com/huydhn
2023-03-22 23:11:39 +00:00
191a2322f0 [WIP][Stronghold] Integrate python API BC-linter from test-infra (#96977)
See:
https://github.com/pytorch/test-infra/tree/main/tools/stronghold
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96977
Approved by: https://github.com/osalpekar
2023-03-22 22:15:34 +00:00
712bd9ae88 Upload failed and rerun tests (#97304)
Upload data for any test that didn't cleanly succeed to S3 for ingestion by Rockset.

About 0.001% of tests fall under this category, keeping the data usage low.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97304
Approved by: https://github.com/clee2000
2023-03-22 22:03:56 +00:00
545abc292b [aot autograd] refactor to make functionalization self-contained (#96341)
This refactor should make it easier to add an export hook into aot autograd.

(1) I killed `create_forward_or_joint_functionalized()` (and the functions that it called, like `forward_or_joint()`) which used to handle autograd + functionalization all-in-one-go for the joint case, and was also used in the inference case.

I added a few separate helper functions:

`create_functionalized_graph()`: this takes a flat fn, and returns a functionalized fx graph. It is mostly just a thin wrapper around functionalization + make_fx(), but also has some extra logic to manually append `copy_()` ops to the end of the graph.

`fn_no_extra_mutations()`: this creates the fn that we want to trace in the inference code path. It takes in a function that it then calls, and returns the outputs + any (updated) mutated inputs.

`joint_fn_no_external_mutations()`: this creates the fn that we want to trace in the joint code path. It takes in a function, and traces out its joint. It also does the work of cloning inputs that are mutated and require gradients, returning mutated inputs as outputs, and returning intermediate bases as outputs

We should be able to add an export hook by basically adding a similar version of `joint_fn_no_external_mutations` but with a lot more restrictions (guaranteed to have no tangents, no synthetic bases, etc.), and calling `create_functionalized_graph()` on it.

Differential Revision: [D44204090](https://our.internmc.facebook.com/intern/diff/D44204090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96341
Approved by: https://github.com/ezyang
2023-03-22 21:41:52 +00:00
e8a722b9cb Fix missing dynamo cache lookup registration in profiler.profiler (#97305)
This follows https://github.com/pytorch/pytorch/pull/96199 and supports the 'other' profiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97305
Approved by: https://github.com/voznesenskym
2023-03-22 21:09:16 +00:00
ec54f186fe Add an issue template to disable CI jobs (#97045)
Per title, I will update the runbook to point to this after the review
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97045
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi
2023-03-22 20:32:35 +00:00
5cc2e4d7c9 [10/N] Remove ST init ops (#96985)
Differential Revision: [D44158326](https://our.internmc.facebook.com/intern/diff/D44158326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96985
Approved by: https://github.com/wz337, https://github.com/wanchaol
2023-03-22 20:26:18 +00:00
11114ab8be rename to need_attn_weights to match elsewhere (#97102)
Change variable spelling from `need_atten_weights` to `need_attn_weights` to match naming convention elsewhere in pytorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97102
Approved by: https://github.com/drisspg
2023-03-22 20:12:23 +00:00
7a8b691388 Make early stop the default for checkpoint and expose a way to disable (#96866)
Why did I choose a context manager instead of per-call? Early stopping is not part of the model definition, and depending on how a particular model is used, e.g., with PT2 or not, we may or may not want to disable early stopping.
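
A minimal usage sketch, assuming the context manager is exposed as `torch.utils.checkpoint.set_checkpoint_early_stop` (the name isn't stated in this message, so treat it as an assumption):

```python
import torch
from torch.utils.checkpoint import checkpoint, set_checkpoint_early_stop  # name assumed

def block(x):
    return torch.relu(x @ x.t())

x = torch.randn(8, 8, requires_grad=True)

# Early stopping is now the default; disable it for a region where it is undesirable.
with set_checkpoint_early_stop(False):
    y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```
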
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96866
Approved by: https://github.com/albanD
2023-03-22 20:03:56 +00:00
546835c45a [9/N] Remove ST multiple ops (#96989)
Differential Revision: [D44158327](https://our.internmc.facebook.com/intern/diff/D44158327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96989
Approved by: https://github.com/wz337, https://github.com/wanchaol
2023-03-22 20:02:58 +00:00
5d5f43abea [prims] Fix schema of minimum_value for a primitive operation (#97327)
This PR fixes incorrect schema for `minimum_value` in creating a primitive operation.

This PR also fixes typo in comment and python doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97327
Approved by: https://github.com/zou3519
2023-03-22 20:01:33 +00:00
726fc366a2 Add missing __main__ in two unittests (#97302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97302
Approved by: https://github.com/zou3519
2023-03-22 19:09:08 +00:00
28929b1205 Add as_strided_ to tensor docs (#97300)
Closes #87365

I added `as_strided_` to the tensor docs, following what seemed to be a pattern consistent with similar functions. More specifically, both the in-place and out-of-place function counterparts are defined in `_tensor_docs.py`, with the in-place version linking to the out-of-place version and the out-of-place version pointing to the corresponding `_torch_docs.py` definition.
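
For readers landing on the new docs entry, a tiny usage sketch of the in-place variant (illustrative only):

```python
import torch

t = torch.arange(9.)
t.as_strided_((2, 2), (3, 1))  # in-place: t becomes a 2x2 view over the same storage
print(t)
# tensor([[0., 1.],
#         [3., 4.]])
```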

If the above is not what we want (e.g. we want to add a more robust description, examples, etc.), let me know and I will be happy to update accordingly!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97300
Approved by: https://github.com/zou3519
2023-03-22 19:08:30 +00:00
a7856e18a7 Revert "DCE inference graphs too (#97275)"
This reverts commit aa3a57b80d39fc803f3f85e6a84a49926d99b4ba.

Reverted https://github.com/pytorch/pytorch/pull/97275 on behalf of https://github.com/ezyang due to this broke a test
2023-03-22 18:55:52 +00:00
d779dadda1 Remove stack trace captures from import (#97274)
Summary:
Calls to this function without an argument will get a stack trace at
import time. This is expensive; we can skip it by passing in a value.

Test Plan: Wait for tests

Differential Revision: D44244345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97274
Approved by: https://github.com/kiukchung
2023-03-22 18:34:13 +00:00
9c144bc4fe Dont increment generation if forward of backward exists, and warning on deallocation of live tensors (#97168)
Refining the logic for when it is okay to ignore previously live outputs from cudagraphs. If there is a forward that has been invoked without invocation of the corresponding backward, don't allow overwriting outputs.

Differential Revision: [D44228369](https://our.internmc.facebook.com/intern/diff/D44228369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97168
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-22 18:27:36 +00:00
9370f253e3 [inductor] Rewrite convolution triton templates (#95556)
Fixes #95775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95556
Approved by: https://github.com/Chillee, https://github.com/ngimel
2023-03-22 18:12:23 +00:00
da96ae230b [CI] Add a missing dtype flag in nightly perf run (#97357)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97357
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-22 17:28:07 +00:00
73b7702b7e Revert "FIX make sure we import the correct object from multiprocessing (#81862)"
This reverts commit 701cdbb6a5baa65cfbd90b91aff70dc262dcf31f.

Reverted https://github.com/pytorch/pytorch/pull/81862 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2023-03-22 17:22:47 +00:00
6273c0af95 move caffe2/proto/ to its own Bazel package (#97324)
move caffe2/proto/ to its own Bazel package

Summary:
This is just to break up build files and make the system easier to
reason about during the transition to the common build system.

Test Plan: Verified locally and rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97324).
* #97337
* #97336
* #97335
* #97334
* #97325
* __->__ #97324
* #97323
* #97322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97324
Approved by: https://github.com/malfet
2023-03-22 17:19:26 +00:00
364d92f9b6 remove dead torch_pb.h library (#97323)
remove dead torch_pb.h library

Summary: This is only used in one place, ensure it still builds.

Test Plan: Rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97323).
* #97337
* #97336
* #97335
* #97334
* #97325
* #97324
* __->__ #97323
* #97322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97323
Approved by: https://github.com/malfet
2023-03-22 17:06:21 +00:00
89d116d961 [BE][docs]Improve and update checkpoint documentation (#96862)
Updates:
- ~recommend user to use non-reentrant, mention that reentrant will be deprecated in the future~
- merges all the warnings into a single list of non-reentrant improvements over reentrant
- adds an additional entry to the list about allowing backward inside checkpointed region

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96862
Approved by: https://github.com/albanD
2023-03-22 16:53:29 +00:00
0f424f7f05 Fixed broken link to troubleshooting.html docs page (#97330)
Seen first in error message:
```
[2023-03-22 10:30:39,786] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
   function: '<resume in paste_mask_in_image>' (/vision/torchvision/models/detection/roi_heads.py:407)
   reasons:  w == 857
to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
[2023-03-22 10:30:40,036] torch._dynamo.convert_frame: [WARNING] torch._dynamo hit config.cache_size_limit (64)
   function: '<resume in paste_mask_in_image>' (/vision/torchvision/models/detection/roi_heads.py:406)
   reasons:  ___stack0 == 207
to diagnose recompilation issues, see https://pytorch.org/docs/master/dynamo/troubleshooting.html.
```

Broken link:
- https://pytorch.org/docs/master/dynamo/troubleshooting.html.

Good link:
- https://pytorch.org/docs/master/compile/troubleshooting.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97330
Approved by: https://github.com/zou3519
2023-03-22 16:40:21 +00:00
a133b5081c [JIT] Partially support ForwardRef type annotations for NamedTuple attributes (#96933)
**Summary** NamedTuple attributes can be annotated to declare their type:
```python
class MyNamedTuple(NamedTuple):
    x: int
    y: torch.Tensor
    z: MyOtherType
```
Normally in python you can also declare your types as strings, `x: 'int'`. But NamedTuples previously didn't support this, because their annotation evaluation process was slightly different. This PR updates the NamedTuple attribute type annotation evaluation method to support ForwardRef declarations (i.e. declaring as strings).
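
A minimal sketch of the pattern this enables (assuming the string annotations resolve through the rcb as described below):

```python
import torch
from typing import NamedTuple

class Pair(NamedTuple):
    x: 'int'             # string (ForwardRef) annotation
    y: 'torch.Tensor'    # also declared as a string

@torch.jit.script
def shift(p: Pair) -> torch.Tensor:
    return p.y + p.x

print(shift(Pair(2, torch.ones(3))))  # tensor([3., 3., 3.])
```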

**Details**

Below I repeat the comment I left in _jit_internal.py:

NamedTuple types are slightly different from normal types.

Normally, annotations are evaluated like this (during jit.script):
1. Load strings of python code into c++ and parse.
2. Get annotations as strings
3. Use the PythonResolver's resolution callback (rcb) to convert the string into a python object
4. We call into annotations.py:ann_to_type to convert python obj from step 3 into a type that torchscript understands.

NamedTuples are more complicated, because they have sub-types. Normally, once we have the NamedTuple type object from step 3, we can just look at the annotation literal values and use ann_to_type directly on them.

But sometimes, users will annotate with string literals, e.g.
```
   x: 'int'
```
This also happens with PEP 563 (`from __future__ import annotations`)

These annotations appear in the annotation dict as ForwardRef('int').

Then, we need to convert the string into a python object. This requires having local context for custom objects or imported types. rcb() is what gives us this. So, we plumb rcb through the stack so it can be used in this context for the if block below.

FAQ:
- Why do we need this special handling for NamedTuple but string annotations work fine for normal types? Normally, we parse the string directly and then call rcb() directly from C++.
- Why not use ForwardRef._evaluate? For that, we need globals() and locals() for the local context where the NamedTuple was defined. rcb is what lets us look up into these. So, basically rcb does the hard work for us.
- What is rcb? rcb is a ResolutionCallback - python callable that takes a string and returns a type. It's generated by `createResolutionCallback.*` in _jit_internal.py.

**Why is this only partial support**:

This only plumbs the rcb through some paths. In particular, the `toSugaredValue` path uses a fake rcb.

**Alternatives**:

We could also treat this the way we treat non-nn.Module classes: we evaluate them separately, ahead of time. That solution is probably better, but probably requires a more risky refactor for the way NamedTuples are handled.

Fixes #95858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96933
Approved by: https://github.com/qihqi
2023-03-22 15:20:38 +00:00
d850c33bfe remove dead proto_convert library (#97322)
remove dead proto_convert library

Summary:
This has no code, only a collection of headers. Just make sure the
only thing that includes it still builds.

Test Plan: Rely on CI.

Reviewers: sahanp

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/97322).
* #97337
* #97336
* #97335
* #97334
* #97325
* #97324
* #97323
* __->__ #97322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97322
Approved by: https://github.com/malfet
2023-03-22 14:40:31 +00:00
5537792307 [dynamo] handle dim in size kwargs (#96992) (#97098)
Fixes #96992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97098
Approved by: https://github.com/ezyang
2023-03-22 14:19:59 +00:00
9d5ac03b9a Deprecate gradcheck check_sparse_nnz argument as duplicate of masked argument (#97187)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97187
Approved by: https://github.com/soulitzer
2023-03-22 14:11:03 +00:00
cff4826f28 pytorch_unet is now passing (#97309)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97309
Approved by: https://github.com/janeyx99, https://github.com/zou3519
2023-03-22 13:55:05 +00:00
be49d3b170 [CI] Turn on debug logging for dla102 and gernet_l (#97307)
Summary: Log the generated code for those two flaky tests to see if
there is any codegen difference when they fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97307
Approved by: https://github.com/ezyang
2023-03-22 13:42:13 +00:00
c37ab85d96 Improve TORCH_LOGS settings error msg (#97264)
Lists registered loggable entities if an invalid settings string is passed via TORCH_LOGS
[before](https://gist.github.com/mlazos/91fcbc3d577f874bcb3daea44f8b41f2)
[after](https://gist.github.com/mlazos/815ea9e76aca665602228f960e0eb0d6)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97264
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-22 13:26:53 +00:00
aab34a476f inductor(cpu): support mkldnn packed linear to improve bfloat16 performance (#96954)
As title, enable mkldnn packed linear to improve bfloat16 performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96954
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/desertfire
2023-03-22 12:25:59 +00:00
e49b4d3827 Changed logging in aotautograd a little (#97289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97289
Approved by: https://github.com/mlazos
2023-03-22 09:33:30 +00:00
4ab1588d99 Enhance error message for dependency check (#96642)
If python development library is missing when building pytorch from source, cmake will raise the error like:
```
CMake Error at cmake/Dependencies.cmake:1079 (if):
  if given arguments:

    "VERSION_LESS" "3"

  Unknown arguments specified
```

This is quite misleading: users may think it's a syntax error or a CMake version problem.

This PR adds a check to ensure `PYTHONLIBS_VERSION_STRING` exists before using it.

Related  #87993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96642
Approved by: https://github.com/kit1980
2023-03-22 08:42:48 +00:00
f6bafcde6f Added current buck target as minifier dep (#97183)
Summary: Have minifier include the current buck target as a dependency to make sure all deps are included.

Test Plan: TORCH_COMPILE_DEBUG_DIR=”.” buck2 run mode/dev-nosan //caffe2/test/inductor:minifier_smoke

Differential Revision: D44231209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97183
Approved by: https://github.com/anijain2305
2023-03-22 08:30:53 +00:00
a6d8c70933 Init quantization backend config for inductor (#96476)
**Summary**
Init the backend config file with quantization recipes for quantization 2.0 inductor path. In this PR, we only init the recipe for `convolution` and `convolution_relu`.

**Test Plan**
```
clear && python -m pytest test_quantization.py -k test_inductor_backend_config_conv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96476
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jerryzh168
2023-03-22 07:56:56 +00:00
517a432d6e [Inductor] Enable CppWrapper to support BF16 (#97089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97089
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-22 05:54:09 +00:00
573b2deb4b [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:38:55 +00:00
37e1d85848 [Inductor] Load a BF16 scalar and broadcast it as a float vector (#97070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97070
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-22 04:25:44 +00:00
c5d7ed9423 [Inductor] Fix the issue that cannot pass lint check for debug mode (#97249)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97249
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/jansel
2023-03-22 04:25:44 +00:00
b72bddabe9 Move empty check to the start of _pack_padded_sequence (#94885)
Fixes #94122.
Move empty check to the start of `_pack_padded_sequence`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94885
Approved by: https://github.com/kshitij12345, https://github.com/jgong5, https://github.com/malfet
2023-03-22 04:16:58 +00:00
f9a9a88812 Remove chhillee from autoreview (#97293)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97293
Approved by: https://github.com/kit1980
2023-03-22 03:45:43 +00:00
db15d191b6 Update NestedTensor add to support non identical striding for NT+NT (#97195)
# Summary
NestedTensors currently don't support addition with non-identical striding. When accumulating grad, it is possible to try to accumulate a grad with different striding than the old var, and there is no way to change this in user code. This is a solution; we should probably support strided addition for NT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97195
Approved by: https://github.com/albanD, https://github.com/cpuhrsch
2023-03-22 03:34:47 +00:00
4733de18fd [Inductor] Add debug logging to explain reasons of disabling vectorization (#97108)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97108
Approved by: https://github.com/EikanWang, https://github.com/jansel
2023-03-22 02:38:34 +00:00
c1025af012 [Dynamo] throw better error message if assert with non-string message (#97297)
Error message before this PR:
```
torch._dynamo.exc.Unsupported: missing: LOAD_ASSERTION_ERROR
```
After:
```
torch._dynamo.exc.Unsupported: assert with non-string message
```
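
A hedged sketch of the kind of code that hits this path (the exact trigger conditions are an assumption on my part — here the assert condition is data-dependent and the message is a tensor rather than a string literal):

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    assert x.sum() > 0, x  # non-string assert message
    return x * 2

try:
    f(torch.ones(3))
except Exception as e:
    print(type(e).__name__, ":", e)  # expected to surface the clearer message above
```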

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97297
Approved by: https://github.com/tugsbayasgalan
2023-03-22 02:24:04 +00:00
57c13fde18 Test and fix guard fail message in CompileProfiler (#97055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97055
Approved by: https://github.com/voznesenskym, https://github.com/jansel
2023-03-22 02:17:57 +00:00
1e4e256790 Mention pytorchbot command on label error (#97267)
It's not clear how to add a label, especially for contributors without write permissions - they don't have a UI for that.

One recent struggle example - https://github.com/pytorch/pytorch/pull/94671
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97267
Approved by: https://github.com/PaliC, https://github.com/malfet, https://github.com/seemethere
2023-03-22 02:13:14 +00:00
688427b5ae Add sympy to binary linux test - fix conda nightly (#97281)
Try to fix the following nightly conda Linux failure:
https://github.com/pytorch/pytorch/actions/runs/4476375944/jobs/7868006749

We will have to revert builder sympy install:
ce427de8a8

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97281
Approved by: https://github.com/malfet
2023-03-22 01:53:51 +00:00
c7fad13310 [Dynamo] Support nn.Module.named_children (#97216)
Fixes Meta internal export case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97216
Approved by: https://github.com/jansel
2023-03-22 01:43:10 +00:00
aa3a57b80d DCE inference graphs too (#97275)
I added a bunch of asserts to verify that I didn't accidentally kill copy_ in the graph, hopefully this combined with our existing tests is good enough.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97275
Approved by: https://github.com/bdhirsh
2023-03-22 01:02:21 +00:00
3282030fa4 [inductor] reusing autotuning sub-processes (#97219)
The major cost of doing autotuning in a subprocess is process creation and initialization. Previously we did that for each benchmark task. This PR reuses a child process as long as it has not crashed yet. This improves compile time a lot. It's still a bit slower than single-process tuning, though. Here is the comparison between single-process tuning and multi-process tuning:
- if a benchmark task will crash the process, then single process tuning is a no-go
- if a benchmark task works fine, then tuning in a child process will be slower. We will try leveraging multi-GPU to further speed this up.

TLDR for the compilation time: we reduce the 11x slowdown to 1.5x. We'll try to further improve that.

Here are the compilation time comparison:

Single process autotuning:
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0307s 90.0%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_7 0.0379s 73.0%
  ref_mm_plus_mm 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
SingleProcess AUTOTUNE takes 9.04686689376831 seconds
```

Naive multi-process tuning (not reusing the child process): 11x slower than single-process autotuning

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0287s 100.0%
  triton_mm_plus_mm_6 0.0287s 100.0%
  triton_mm_plus_mm_1 0.0317s 90.3%
  triton_mm_plus_mm_5 0.0317s 90.3%
  triton_mm_plus_mm_7 0.0379s 75.7%
  ref_mm_plus_mm 0.0389s 73.7%
  triton_mm_plus_mm_2 0.0399s 71.8%
  triton_mm_plus_mm_3 0.0399s 71.8%
  triton_mm_plus_mm_4 0.0420s 68.3%
SubProcess AUTOTUNE takes 101.22216320037842 seconds
```

Multi-process tuning reusing the child process (this PR): 1.5x slower than single-process autotuning
```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0307s 90.0%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_7 0.0379s 73.0%
  ref_mm_plus_mm 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
SubProcess AUTOTUNE takes 13.752070665359497 seconds
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97219
Approved by: https://github.com/ngimel
2023-03-22 00:52:57 +00:00
0b094ca37f Add gradcheck_nondet_tol to a few padding moduleinfos (#97265)
Fixes #96739, see https://github.com/pytorch/pytorch/issues/96739#issuecomment-1478327704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97265
Approved by: https://github.com/albanD
2023-03-21 23:46:28 +00:00
af440c427b [draft for discussion] add per-dispatch key modes (#97052)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97052
Approved by: https://github.com/ezyang, https://github.com/zou3519
2023-03-21 23:45:45 +00:00
793cf0cbb0 Fix dispatching issue of the new device type. (#97273)
Summary: Fix the device type dispatching issue.

Test Plan: All CI should pass.

Reviewed By: scottxu0730

Differential Revision: D44179512

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97273
Approved by: https://github.com/ezyang
2023-03-21 23:23:06 +00:00
2b32a74ab0 moving nvfuser benchmark to third_party/nvfuser (#96725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96725
Approved by: https://github.com/davidberard98
2023-03-21 23:19:15 +00:00
a1ef0be30c [BE] Remove spurious semicolon in XPUHooksInterface.h (#97296)
Semicolon in `void foo() {};` is redundant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97296
Approved by: https://github.com/kit1980
2023-03-21 23:15:27 +00:00
6dded5d63e Fixes warning to refer to SMs instead of Cuda Cores (#97224)
Fixes https://github.com/pytorch/pytorch/issues/97179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97224
Approved by: https://github.com/eellison, https://github.com/voznesenskym
2023-03-21 22:37:31 +00:00
47f18b78ec leave libdevice name for fbcode (#97257)
The fbcode Triton version is not updated yet, so leave the libdevice name there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97257
Approved by: https://github.com/bertmaher, https://github.com/jansel
2023-03-21 21:51:15 +00:00
9a18968253 Fix kDefaultTimeout multiple definition build failure (#97270)
Make the namespace explicit to avoid the constexpr conflict on GCC 11.

Fixes #90448

@ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97270
Approved by: https://github.com/ezyang
2023-03-21 21:44:53 +00:00
e7d9331688 [inductor] hoist symbolic padding expressions (#97099)
Towards fixing pnasnet5large, see #96709. The generated kernel looks much better
```
@pointwise(size_hints=[1048576], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32', 3: 'i32', 4: 'i32', 5: 'i32', 6: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': [], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 6), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, out_ptr0, ks0, ks1, ks2, ks3, xnumel, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x1 = (xindex // ks0) % ks0
    x0 = xindex % ks0
    x2 = (xindex // ks3)
    x4 = xindex
    tmp0 = x1 + ((-1)*ks1)
    tmp1 = 0
    tmp2 = tmp0 >= tmp1
    tmp3 = ks2
    tmp4 = tmp0 < tmp3
    tmp5 = x0 + ((-1)*ks1)
    tmp6 = tmp5 >= tmp1
    tmp7 = tmp5 < tmp3
    tmp8 = tmp2 & tmp4
    tmp9 = tmp8 & tmp6
    tmp10 = tmp9 & tmp7
    tmp11 = tl.load(in_ptr0 + (x0 + ((-1)*ks1) + (ks2*x1) + (x2*(ks2*ks2)) + ((-1)*ks1*ks2) + tl.zeros([XBLOCK], tl.int32)), tmp10 & xmask, other=0)
    tmp12 = tl.where(tmp10, tmp11, 0.0)
    tl.store(out_ptr0 + (x4 + tl.zeros([XBLOCK], tl.int32)), tmp12, xmask)
 ```
Interestingly, removing `expand` in the index `simplify` function makes the `load` expression a little bit better, but `store` fails to simplify to a flat store in this case, so I'm leaving `expand` in.
 Full pnasnet still chokes on `ceiling` in batch_norm kernels. Additionally, it looks like shape propagation goofs in inductor and generates overly complicated expressions; we should switch to metadata from the fx graph.
 I'm still not adding a `ceil` print to triton, because we should be able to hoist all indexing expressions (and just printing ceil without converting to int64 doesn't work).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97099
Approved by: https://github.com/jansel
2023-03-21 21:43:32 +00:00
b615b7ef9e use a proper cc_library for the miniz library (#96957)
use a proper cc_library for the miniz library

Summary:
Using "include" is hostile to the Bazel way of doing things.

Test Plan: Rely on CI.

Reviewers:

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/96957).
* __->__ #96957
* #96956
* #96955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96957
Approved by: https://github.com/PaliC
2023-03-21 21:39:43 +00:00
d9b289b747 Retry download and install NDK when testing Android (#97067)
As this step uses the network to download and install the NDK, it can fail flakily, e.g. https://github.com/pytorch/pytorch/actions/runs/4452757793/jobs/7820701670.  So I'm adding retrying to the workflow.

I could try to figure out a way to Dockerize this, but I'm not sure yet how to handle the GitHub action `reactivecircus/android-emulator-runner@v2` in Docker.  So let's opt for the easy fix of retrying.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97067
Approved by: https://github.com/malfet
2023-03-21 21:36:59 +00:00
19b5b67bc5 exclude all generated files from torch_headers (#96956)
exclude all generated files from torch_headers

Summary:
This allows Bazel to build without having to wipe the standard CMake
build.

The standard CMake build produces generated files in the source tree,
this causes a problem because Bazel has its own way of generating
them, and then both sets of generated files conflict with each other.

Test Plan: Rely on CI.

Reviewers:

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/96956).
* #96957
* __->__ #96956
* #96955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96956
Approved by: https://github.com/PaliC
2023-03-21 21:34:58 +00:00
d785d0c0a1 [reland][inductor] do benchmark in sub processes for max autotuning (#97215)
Previous attempt of landing this PR is reverted due to a landrace: https://github.com/pytorch/pytorch/pull/96410 .

The reason is `PyCodeCache.load` has a new linemap argument being added but my previous PR does not handle it (due to a stale checkout). Fix is trivial.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97215
Approved by: https://github.com/Chillee, https://github.com/jansel
2023-03-21 21:19:45 +00:00
b759134152 update Bazel to the latest release 6.1.1 (#96955)
update Bazel to the latest release 6.1.1

Summary:

Test Plan: Rely on CI.

Reviewers:

Subscribers:

Tasks:

Tags:

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/96955).
* #96957
* #96956
* __->__ #96955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96955
Approved by: https://github.com/PaliC
2023-03-21 21:02:44 +00:00
ea9194a4f2 [inductor] Make the original ATen info dumped in alphabetical order (#97261)
Summary: To avoid a lot of noise when comparing output_code.py from two
runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97261
Approved by: https://github.com/Chillee
2023-03-21 20:34:49 +00:00
01885cea43 [Typo] mulithreading_enabled => multithreading_enabled (#97054)
Summary: Fix typo

Test Plan: Continuous integration - Expected NoOp since it is just a variable renaming

Differential Revision: D44118850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97054
Approved by: https://github.com/Skylion007
2023-03-21 20:11:59 +00:00
b04363ead4 [easy] Expose documentation for a few global nn.Module hooks (#97185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97185
Approved by: https://github.com/albanD
2023-03-21 20:09:29 +00:00
7a93865c46 Fix regression on loading jit module from flatbuffer (#97190)
Summary:
https://fb.workplace.com/groups/pytorch.edge.users/permalink/1287477365455887

Root cause:
Introduced in D44106776. But this loop is weird because class_dep can grow, so it cannot be replaced with c10::irange.

Test Plan:
Used model at `fbpkg fetch speech.tuna.milan.ondevice.en_us.transducer:6`
Then
`buck run xplat/caffe2/fb/lite_predictor:convert_model -- --model=$HOME/20230320debug/pytorchmodel.pt --output_name=/tmp/ffmodel.ff`

Differential Revision: D44234894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97190
Approved by: https://github.com/larryliu0820
2023-03-21 19:54:44 +00:00
de2230baa7 [dynamo] Improve error message for missing backend (#97255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97255
Approved by: https://github.com/msaroufim
2023-03-21 19:36:04 +00:00
ec3894ec0a Fix typo in settings regex logging (#97245)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97245
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-03-21 19:02:04 +00:00
77e73b9b7a Refactor NT offsets metadata to be a Tensor (#96909)
It's tedious work, but somebody's gotta do it.

Benefits:
* Enable access to offsets metadata from Python via private API (for validation, etc.)
* Consistency with nested sizes / strides metadata
* Needed for SymInt-ifying offsets metadata
* more TBD

Bonus:
* Remove `_tensor` suffixes from metadata / getter names
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96909
Approved by: https://github.com/drisspg
2023-03-21 18:51:35 +00:00
22ea21da3d Change 1D Tensor of 1 element to 0D Tensor (#96994)
add 0d tensor to graph adam/adamw test

Affected:
- `torch.cuda.amp.GradScaler`'s `found_inf`, `_scale`, and `_growth_tracker`
- `step` of Adam & AdamW when `capturable=True`

Fixes #96776 🤞

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96994
Approved by: https://github.com/janeyx99
2023-03-21 18:24:19 +00:00
c47cf9bc7f Update parallel_apply.py for assertion error when len(modules) != len(inputs) (#94671)
Print the reason why it is wrong.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94671
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-03-21 17:46:23 +00:00
a6bbeec2e1 Fix required template (#97247)
Fixes https://github.com/pytorch/pytorch/pull/96878#issuecomment-1477776378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97247
Approved by: https://github.com/ezyang
2023-03-21 17:43:44 +00:00
dbb31672b2 Fix the compatible issue of the Dynamo and the PyDev.Debugger. (#96721)
The PyDev.Debugger uses the _PyFrameEvalFunction to debug the Python script.
Fall back to the previous _PyFrameEvalFunction to fix the Dynamo-with-PyDev.Debugger issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96721
Approved by: https://github.com/ezyang
2023-03-21 17:36:14 +00:00
b95896c578 [CI] Fix perf_nightly output file naming error (#97263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97263
Approved by: https://github.com/huydhn
2023-03-21 17:35:36 +00:00
acd9df8a72 [inductor] Add scaled_dot_product_attention to fallback kernels (#93339)
Summary:
We don't have decomps/lowerings for SDPA (and probably won't for a
while) so don't warn.

Test Plan: code inspection

Differential Revision: D42878203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93339
Approved by: https://github.com/desertfire, https://github.com/drisspg
2023-03-21 17:06:18 +00:00
0a2b527abe Update mkl_verbose return value check due to API change in mkl (#96283)
As title.
Originally, the `mkl_verbose()` function returned `0` and `1`, indicating failure and success respectively. However, the version that PyTorch uses now changes the output of `mkl_verbose()` to reflect its input level. Thus, the check logic needs to be changed to compare the output of the `mkl_verbose()` function with -1.
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/support-functions/miscellaneous/mkl-verbose.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96283
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-03-21 16:56:47 +00:00
244736a5a5 Mark ROCm tests as flaky (#97259)
Before https://github.com/pytorch/pytorch/pull/96464, ROCm tests in trunk were already quite flaky: https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=trunk%20%2F%20linux-focal-rocm5.4.2-py3.8%20%2F%20test%20(default).

After https://github.com/pytorch/pytorch/pull/96464, there is a new group of flaky failures coming from functorch.  So let's mark the test as flaky to monitor without impacting trunk.

Two flaky tests currently seeing in trunk are:

* https://github.com/pytorch/pytorch/issues/97256
* `functorch/test_memory_efficient_fusion.py` OOM

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97259
Approved by: https://github.com/malfet, https://github.com/zou3519
2023-03-21 16:55:00 +00:00
5d3c347bf6 Make split reduction warning only emit once (#97112)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97112
Approved by: https://github.com/Skylion007
2023-03-21 14:57:31 +00:00
701cdbb6a5 FIX make sure we import the correct object from multiprocessing (#81862)
Fixes #44687.

The issue was that the Process object is not the one from the `_default_context` which should be `loky` when nesting `loky` calls.

This is a revamp of #53282 that was reverted because it broke some other tests.
How can I run the failing tests so I can see why this breaks?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81862
Approved by: https://github.com/VitalyFedyunin, https://github.com/janeyx99
2023-03-21 14:48:17 +00:00
4e054175d6 Fix uniform returning end point for BFloat16 and Half (#96962)
Fixes #96947

If we generate `1.0 - float_eps`, the BFloat16 and Half constructors will round this to 1.0, which is outside the half-open range. Instead, we delay the bounds change until after the value has been rounded.
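
A quick demonstration of the rounding that motivates the fix (standard IEEE rounding behavior, not specific to this PR):

```python
import torch

almost_one = 1.0 - 2**-24  # representable in float32, just below 1.0
print(torch.tensor(almost_one, dtype=torch.bfloat16).item())  # 1.0 -- rounded up, outside [0, 1)
print(torch.tensor(almost_one, dtype=torch.float32).item())   # stays strictly below 1.0
```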

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96962
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-03-21 14:01:29 +00:00
5acf403088 Run functorch tests in default shards; delete functorch-specific shards (#96464)
Fixes #96347

This PR:

- Makes the functorch tests run as a part of the "default" shards
- Delete the functorch CI shard from all CI job configurations (if it exists)
- Increase the "default" shard count by 1 for each job, unless it was
previously set to 1, to accommodate the new functorch tests and not
regress time-to-signal.
- Adds a bunch of skips for ROCM and torchdynamo configurations. We can
investigate them later.

NB: I might go through some more iterations to figure out what other
skips need to be added, but this iteration of the PR seems to pass most of the CI suite.

Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96464
Approved by: https://github.com/huydhn
2023-03-21 13:53:01 +00:00
b004819f91 Re-enable TestJit.test_profiler (#94391)
Test to see if TestJit.test_profiler still fails on Windows on CI.
I was not able to reproduce the crash locally. Also I tested 3 times on CI and the test passed.
Even with this change the test will still be disabled due to https://github.com/pytorch/pytorch/issues/81626
Fixes #62820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94391
Approved by: https://github.com/huydhn
2023-03-21 13:52:23 +00:00
2c588b3ad5 Allow new_full's fill_value argument type to be complex (#91345)
It seems that this code should type-check but doesn't:
```python
torch.zeros((2,3),dtype=torch.cdouble).new_full((4,5),complex(6,7))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91345
Approved by: https://github.com/zou3519, https://github.com/ezyang
2023-03-21 12:34:00 +00:00
38b687ed4d [PTD][Checkpoint] Add checkpointing support for DTensor submesh (#96802)
DTensor submesh support is added in https://github.com/pytorch/pytorch/pull/95458.

This PR adds support for DTensor submesh by adding an extra check when creating the local save/load plan.
If the rank is not participating in the mesh, we simply skip creating WriteItem/ReadItem for the local SavePlan/LoadPlan.

Updated the associated test as well.

cc. @wanchaol, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96802
Approved by: https://github.com/wanchaol
2023-03-21 08:17:17 +00:00
a9b9fd90a2 [Inductor] index_put - unsqueeze indices[0] if self and indices[0] are not broadcastable (#97105)
Fixes #97104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97105
Approved by: https://github.com/ngimel
2023-03-21 07:07:41 +00:00
141a2ebcf1 Clean up Compilation Profiler (#97029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97029
Approved by: https://github.com/voznesenskym
2023-03-21 06:24:22 +00:00
f9ce593267 Extend aot autograd dedup guards to params, stop using positions (#96774)
The purpose of this PR is to remove reliance on argument positions in dedup guards, AND extend the functionality to params.

A version of this PR was stamped previously (https://github.com/pytorch/pytorch/pull/95831), but was kinda gross, because it was based on an underlying PR that did way too much with source names.

This PR leaves most of that alone, in favor of just reusing the same name standardization logic that dynamo module registration does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96774
Approved by: https://github.com/ezyang
2023-03-21 05:59:33 +00:00
e8be6d813b [Quant][FX] Fix issue of lowering weighted functional ops with kwargs (#95865)
Fixes #95492

**Summary**
This PR fixes the issue that weighted functional ops with kwargs are not lowered correctly since kwargs are ignored.
These kwargs should be moved from the functional op to its corresponding prepack op, e.g., from `F.conv2d` to `quantized.conv2d_prepack`.

**Test plan**
python test/test_quantization.py -k test_lowering_functional_conv_with_kwargs
python test/test_quantization.py -k test_lowering_functional_conv_transpose_with_kwargs
python test/test_quantization.py -k test_lowering_functional_linear_with_kwargs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95865
Approved by: https://github.com/jgong5, https://github.com/supriyar
2023-03-21 05:29:03 +00:00
7beac103ee [PyTorch] Remove unnecessary unpickler.h #include in jit/serialization/import.h (#96687)
A forward declaration will do here.

Differential Revision: [D43995795](https://our.internmc.facebook.com/intern/diff/D43995795/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96687
Approved by: https://github.com/suo
2023-03-21 03:43:05 +00:00
d2f5722996 [ONNX] 'Transform' as base class for passes (#95935)
Base class `Transform` provides basic diagnostics functionality. Diagnostics
are automatically recorded for inherited passes.
New base class `Pass` will be added when `analysis` is introduced.

Example diagnostics for `test_mnist`:

Decompose:
<img src="https://user-images.githubusercontent.com/9376104/222615465-689e76eb-6b30-4670-aed5-a0d419583bfe.png" width="80%" height="80%">

Shape inference:
<img src="https://user-images.githubusercontent.com/9376104/222615527-0484e504-f9d5-4f5c-b018-3e45ef15c138.png" width="80%" height="80%">

Moving placeholders:
<img src="https://user-images.githubusercontent.com/9376104/222852379-36caf263-6965-4e5d-9dce-f63075a3812f.png" width="80%" height="80%">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95935
Approved by: https://github.com/justinchuby
2023-03-21 03:31:22 +00:00
45296f87ec Fix for verify_dynamo on ROCm (#97013)
Prior to this change, ROCm was not exiting check_cuda, causing an exception at packaging.version.parse(torch.version.cuda). This change exits check_cuda if torch.version.cuda is None.

```
python verify_dynamo.py

Python version: 3.9.16
`torch` version: 2.1.0a0+git2b2f10c
CUDA version: None
ROCM version: 5.4

All required checks passed
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97013
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet, https://github.com/kit1980
2023-03-21 03:19:31 +00:00
ee6b19bd4c Error only if autocast actually enabled (#96097)
I am trying to use bfloat16 AMP on a range of devices, using the `enabled` argument to actually enable/disable AMP, like this:
```python
with torch.cuda.amp.autocast(enabled=use_amp, dtype=torch.bfloat16):
```

However, this raises a RuntimeError even if enabled=False.

```
  File "/venv/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 221, in __init__
    raise RuntimeError('Current CUDA Device does not support bfloat16. Please switch dtype to float16.')
RuntimeError: Current CUDA Device does not support bfloat16. Please switch dtype to float16.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96097
Approved by: https://github.com/ngimel, https://github.com/kit1980
2023-03-21 03:13:13 +00:00
cc0701e5b3 [inductor] Move fx-fusion tests to a separate file (#97028)
They're sort of independent of the rest of inductor, and this makes
them a bit easier to find and marginally faster to run.

Differential Revision: [D44168337](https://our.internmc.facebook.com/intern/diff/D44168337/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D44168337/)!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97028
Approved by: https://github.com/jansel
2023-03-21 03:11:39 +00:00
695d98b0bc [inductor] Allow tensors kwarg in sink_cat_after_pointwise (#97019)
Lacking handling of kwargs strikes again.

Differential Revision: [D44166740](https://our.internmc.facebook.com/intern/diff/D44166740/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97019
Approved by: https://github.com/jansel
2023-03-21 03:07:53 +00:00
e20e5f5578 [RFC] Add an API to remove autograd hooks from DDP (#96490)
Summary:
When creating a new DDP instance for the same model while an old DDP instance exists, the autograd hooks from the old DDP instance might not be cleared. Also, relying on the Python GC to clear out old autograd hooks is fragile and may not work 100% of the time.

As a result, in this PR I'm adding a way to explicitly remove these hooks from DDP

Test Plan:
Unit test added

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96490
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-03-21 02:56:16 +00:00
fa82080016 Don't run fallback if symbolic sizes in fake tensor (#97148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97148
Approved by: https://github.com/Skylion007, https://github.com/eellison, https://github.com/bdhirsh
2023-03-21 02:23:44 +00:00
adcd1b3077 inductor: support profiler_mark_wrapper_call in cpp wrapper (#97119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97119
Approved by: https://github.com/alexsio27444, https://github.com/jgong5, https://github.com/desertfire
2023-03-21 01:40:09 +00:00
50ed38a7eb Fix typo under docs directory (#97202)
This PR fixes typo in `.rst` files under docs directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97202
Approved by: https://github.com/kit1980
2023-03-21 01:24:10 +00:00
793cb3f424 [FSDP][optim_state_dict] Print out more useful error message for optim_state_dict (#96860)
Summary: Print out more useful error message for optim_state_dict

Test Plan: CI

Reviewed By: wz337

Differential Revision: D43556073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96860
Approved by: https://github.com/rohan-varma, https://github.com/wz337
2023-03-21 01:04:24 +00:00
f5612758d8 [SPMD] Make the IterGraphModule less verbose and more profiling friendly (#96969)
Make the IterGraphModule less verbose and more profiling friendly

Differential Revision: [D44110594](https://our.internmc.facebook.com/intern/diff/D44110594/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96969
Approved by: https://github.com/mrshenli
2023-03-20 23:54:48 +00:00
9c288b992b minor spelling fixes NestedTensorImpl.h (#97103)
Minor spelling fixes in comments in NestedTensorImpl.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97103
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-20 23:46:40 +00:00
a269e5fa04 Add forward and backward support for silu to NestedTensors (#97181)
# Summary
Add forward and backward support for silu to NestedTensors
- Add forward support to silu
- Add forward support to silu_
- Add backward support to silu
- Add to NT docs
- Add tests
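
A small usage sketch of what this enables (construction via `torch.nested.nested_tensor` is assumed from the public nested-tensor API, not from this PR):

```python
import torch
import torch.nn.functional as F

nt = torch.nested.nested_tensor([torch.randn(2, 3), torch.randn(4, 3)])
out = F.silu(nt)                         # forward on a nested tensor
print([t.shape for t in out.unbind()])   # [torch.Size([2, 3]), torch.Size([4, 3])]
```
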
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97181
Approved by: https://github.com/cpuhrsch, https://github.com/jbschlosser
2023-03-20 23:46:12 +00:00
9a5fed1bd0 Harmonize BCELoss example to F.binary_cross_entropy (#95178)
About that line:

```
torch.empty(3).random_(2)
```
* Since BCE supports targets in the interval [0, 1], a better example is to sample from uniform(0, 1) using `rand` (see the sketch after this list)
* BCE supports multiple dimensions, and the example in `F.binary_cross_entropy` highlights it
* `rand` is more well known than `random_`, which is a bit obscure (`rand` is in the [Random Sampling section in the docs](https://pytorch.org/docs/stable/torch.html#random-sampling))
* Chaining `empty` and `random_` gives binary values as floats, which is a weird way to get that result
* Why do it in two steps when we have sampling functions that do it in a single step?
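
A short sketch of the harmonized pattern (it mirrors the `F.binary_cross_entropy` example rather than inventing new behavior):

```python
import torch

m = torch.nn.Sigmoid()
loss = torch.nn.BCELoss()
input = torch.randn(3, requires_grad=True)
target = torch.rand(3)          # targets sampled uniformly from [0, 1)
output = loss(m(input), target)
output.backward()
```
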
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95178
Approved by: https://github.com/albanD, https://github.com/kit1980
2023-03-20 23:45:01 +00:00
252c6f25e0 Update vec256_complex_float_vsx.h (#95658)
Update vec256_complex_float_vsx.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95658
Approved by: https://github.com/jgong5, https://github.com/kit1980
2023-03-20 23:44:21 +00:00
c089c6bf15 update triton pin (#96730)
Fixes #96694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96730
Approved by: https://github.com/malfet
2023-03-20 23:42:33 +00:00
485cc7515d [Inductor CI] Fix concurrency cancellation rule of inductor-perf-compare job (#97197)
Currently old commits' jobs are not cancelled: https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-compare.yml
This PR tries to fix the concurrency rule so that when new commit is pushed, old job gets cancelled immediately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97197
Approved by: https://github.com/atalman
2023-03-20 23:32:38 +00:00
ea6113ea20 Update loss.py (#95367)
Fix the dimension bug in the document

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95367
Approved by: https://github.com/albanD, https://github.com/kit1980
2023-03-20 23:24:49 +00:00
b1e8f2fc11 Update torch.fx docs (#97058)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97058
Approved by: https://github.com/svekars, https://github.com/SherlockNoMad
2023-03-20 23:13:16 +00:00
663e7c9eeb Fix TestBufferProtocolCPU::test_byte_to_int_cpu test on Big Endian (#96424)
Fix TestBufferProtocolCPU::test_byte_to_int_cpu test on Big Endian
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96424
Approved by: https://github.com/janeyx99
2023-03-20 21:27:33 +00:00
270b42d279 Fix test_schema_check CUDA illegal memory access (#97062)
I'm seeing some recent [CUDA illegal memory access](https://hud.pytorch.org/failure/FAILED%20test_schema_check.py%3A%3ATestSchemaCheckModeOpInfoCUDA%3A%3Atest_schema_correctness_fft_fft_cuda_bool%20-%20RuntimeError%3A%20CUDA%20error%3A%20an%20illegal%20memory%20access%20was%20encountered) error related to this test.  So a cheap fix is to run it serially.

Fixes https://github.com/pytorch/pytorch/issues/95749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97062
Approved by: https://github.com/clee2000
2023-03-20 20:57:27 +00:00
c848a777e8 DOC: Various typo fixes (#97095)
Various typos found while browsing documentation/source code.

Thank you for a wonderful deep-learning library!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97095
Approved by: https://github.com/mikaylagawarecki, https://github.com/kit1980
2023-03-20 20:46:04 +00:00
8a6e28ccd3 Fix typo for generator. (#97136)
Fix typo for generator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97136
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-20 20:43:56 +00:00
13398d8b95 [inductor] improve bandwidth computation (#97057)
When we compute bandwidth for a kernel, we should double the memory usage for in-place arguments since we need to read them once and write them once.
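A minimal sketch of the accounting described above (function and variable names are illustrative, not the actual inductor code):
```python
def estimated_bytes(read_bytes, write_bytes, inplace_bytes):
    # in-place arguments are read once and written once, so they count twice
    return read_bytes + write_bytes + 2 * inplace_bytes

def bandwidth_gb_per_s(total_bytes, runtime_ms):
    # bytes moved divided by runtime, expressed in GB/s
    return total_bytes / (runtime_ms * 1e-3) / 1e9
```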

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97057
Approved by: https://github.com/Chillee
2023-03-20 20:30:46 +00:00
6b691b99da add amp support for custom backend (#96188)
Fixes #ISSUE_NUMBER
1. Add amp support for custom backends.
2. Optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`, so that other functions for custom backends can be registered there later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188
Approved by: https://github.com/bdhirsh
2023-03-20 20:27:35 +00:00
a37b4fa03a [mergebot] An ignore current flag (#96756)
with https://github.com/pytorch/test-infra/pull/3882

Add -ic/--ignore-current flag for merge.  It ignores the currently failing checks but will stop when there is a new failure.  If there are no pending checks, it fails and tells you to use -f/--force.

Doesn't work on ghstacks with more than 1 PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96756
Approved by: https://github.com/huydhn
2023-03-20 19:07:01 +00:00
aacbf091db Allow fused optimizers to call _foreach_zero_ in zero_grad (#97159)
Fixes #97032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97159
Approved by: https://github.com/Skylion007
2023-03-20 19:03:26 +00:00
1c40ce4f19 handle SymInt shape/input when debugging in dynamic shape (#96645)
Handle SymInt shape/input when debugging in dynamic shape. Fixes #96272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96645
Approved by: https://github.com/bdhirsh
2023-03-20 18:19:03 +00:00
100641aadf [MPS] Fix torch.eye unsupported bool constant on macOS 12 (#97027)
Fixes #91620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97027
Approved by: https://github.com/kulinseth
2023-03-20 18:08:36 +00:00
16e7e5a24b [dtensor] lazy init process groups in device mesh (#96700)
This PR adds a private flag to allow lazy process group initialization, replacing
the previous `dim_groups` arg, as no one is using that now.

This could help avoid creating process groups when not necessary

Differential Revision: [D44044664](https://our.internmc.facebook.com/intern/diff/D44044664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96700
Approved by: https://github.com/fduwjj, https://github.com/XilunWu
2023-03-20 17:50:04 +00:00
ead5186462 [CI] Change tests used by the new dashboard (#96986)
Summary: Stop using runn.py to trigger the new dashboard run. Instead,
we spell out the actual cmd which will be easier to extend. Dropping
perf tests for dynamo_eager and aot_eager in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96986
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-20 17:28:12 +00:00
bda9d7ba73 [pytorch][2/3] Pytorch profiler permits CPU events with CUPTI Range profiler mode (#97048)
Summary:
## Motivation
The initial version of the CUPTI Range profiler was conservative in turning off all other event types in the kineto/pytorch profiler.
However, there is value in enabling CPU-side activity logging. This lets us correlate CPU operator -> GPU kernel statistics and helps us analyze flops and other performance metrics at the operator level.

## Details
1. Update the pytorch profiler experimental config parsing to allow setting CPU activities along with the range profiler. This is only enabled in per-kernel measurement mode (see the sketch after this list).
2. Fixed Clang-tidy issues (added nolint for 2 of them)
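A minimal sketch of combining CPU activities with per-kernel range profiling (assumes a CUDA device and a CUPTI-capable build; the metric name and workload are illustrative):
```python
import torch
from torch.profiler import profile, ProfilerActivity
from torch._C._profiler import _ExperimentalConfig

exp_cfg = _ExperimentalConfig(
    profiler_metrics=["kineto__tensor_core_insts"],  # assumed metric name
    profiler_measure_per_kernel=True,
)
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],  # CPU ops now allowed
    experimental_config=exp_cfg,
) as prof:
    torch.mm(torch.rand(64, 64, device="cuda"), torch.rand(64, 64, device="cuda"))
print(prof.key_averages().table())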

Test Plan: Testplan see bottom diff

Differential Revision: D44165079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97048
Approved by: https://github.com/aaronenyeshi
2023-03-20 14:44:31 +00:00
16d85160d5 Fix standalone compile for op with multiple outputs (#96936)
Op-benchmark directly uses fx.Graph to create nodes without dynamo and then compiles the graph with inductor. Currently, operators with multiple outputs, e.g. native_layer_norm, would fail to run when compiled through the standalone torch._inductor.compile() API (#95594). Actually, the graph's result is a single node with several outputs instead of a tuple of several nodes. However, the standalone API forces a non-tuple result to be a tuple, i.e., a tuple with one node-type element that has several outputs. This PR treats a return node with several outputs as a tuple to avoid errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96936
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-20 07:56:18 +00:00
4a99b4f12b enable Half for cat serial kernel (#96021)
Summary:
1.31 x speedup.

| dtype | shape | before | after |
| --- | --- | --- | --- |
| half | 1024 * (100 + i) | 235.75 us | 179.11 us |

Benchmark with
```
import torch
import torch.utils.benchmark as benchmark

def cat(*args, dim=0):
    return torch.cat(args, dim)

tensors = []
for i in range(10):
    tensors.append(torch.rand(1024, 100 + i).half())

t0 = benchmark.Timer(
    stmt="cat(*tensors, dim=1)",
    setup="from __main__ import cat",
    globals={"tensors": tensors},
    num_threads=1,
)
# run the benchmark and print the measurement (blocked_autorange chosen for illustration)
print(t0.blocked_autorange())

```

Test Plan: CI

Differential Revision: D43810514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96021
Approved by: https://github.com/ngimel, https://github.com/houseroad, https://github.com/jgong5
2023-03-20 05:33:04 +00:00
dba9487324 Add helpful pretty printing summaries to torch for lldb debugging (#97101)
# Summary
Add support for pretty printing of tensors when using lldb, similar to what is currently available for gdb.

<img width="772" alt="Screenshot 2023-03-18 at 6 20 34 PM" src="https://user-images.githubusercontent.com/32754868/226148687-b4e6cfe1-8be1-4657-9ebc-d134f697dd37.png">

<img width="254" alt="Screenshot 2023-03-18 at 6 20 43 PM" src="https://user-images.githubusercontent.com/32754868/226148690-caca6f76-d873-419e-b5e4-6bb403b3d179.png">

I changed it to override the variable formatting, so instead of having to call a separate command you can just do `print <tensor>`.

I also added one for sizes.
<img width="309" alt="Screenshot 2023-03-19 at 1 05 49 PM" src="https://user-images.githubusercontent.com/32754868/226206458-e3f0111b-6a97-4d75-8125-48455aa2cf43.png">

Last one:
<img width="815" alt="Screenshot 2023-03-19 at 1 39 23 PM" src="https://user-images.githubusercontent.com/32754868/226207687-20bd014f-9e0e-4c01-b2c8-190b7365aa70.png">

If you use the codelldb extension, be sure to add
    `"lldb.launch.initCommands": ["command source ${env:HOME}/.lldbinit"]`
to your settings.json.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97101
Approved by: https://github.com/ngimel
2023-03-20 01:27:44 +00:00
5471621497 [BE] Remove unnecessary dict comprehensions (#97116)
Removes unnecessary dict comprehensions, optimizing the creation of dicts from iterables.
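For illustration, the kind of rewrite this covers (a generic sketch, not a specific change from this PR):
```python
pairs = [("a", 1), ("b", 2)]

# before: a dict comprehension that merely repackages an iterable of pairs
d_before = {k: v for k, v in pairs}

# after: the dict constructor does the same thing directly
d_after = dict(pairs)
assert d_before == d_after
```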

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97116
Approved by: https://github.com/kit1980
2023-03-20 00:56:57 +00:00
be0b415a5a [ONNX] Set shape/type into torchscript (#96349)
Fixes https://github.com/pytorch/pytorch/pull/95676#issuecomment-1460588229

PS: It doesn't seem that the exported ONNX proto has type information now. I wonder if there was an ONNX pass doing this for us (converting torch dtype to onnx dtype during exporting).

A type promotion issue would be raised as an error if we want to set the type:
```python
onnxscript_value.dtype = expected_value.dtype
```
onnx.onnx_cpp2py_export.shape_inference.InferenceError: [ShapeInferenceError] Shape inference error(s): (op_type:aten_add, node name: aten_add_1): [ShapeInferenceError] (op_type:Add, node name: n3): B has inconsistent type tensor(int64)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96349
Approved by: https://github.com/justinchuby, https://github.com/wschin
2023-03-19 21:58:10 +00:00
722c4e59a4 Replace source check with assert (#95640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95640
Approved by: https://github.com/ezyang
2023-03-19 21:51:59 +00:00
c8030b5406 Revert "Update mkl_verbose return value check due to API change in mkl (#96283)"
This reverts commit c1214ce5c26fce541a920bdf9917c9ca9f63ecb0.

Reverted https://github.com/pytorch/pytorch/pull/96283 on behalf of https://github.com/kit1980 due to Looks like this broke inductor tests on macos-12-py3-arm64 https://github.com/pytorch/pytorch/actions/runs/4458194071/jobs/7830194137
2023-03-19 21:48:01 +00:00
e74c5e5637 rexnet_100 is disabled for static, does not need dynamic listing (#97100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97100
Approved by: https://github.com/Skylion007
2023-03-19 20:57:49 +00:00
5d33f9cddb Revert "Fix standalone compile for op with multiple outputs (#96936)"
This reverts commit 37cde56658e20afae6d94b70d53e4131043e09e8.

Reverted https://github.com/pytorch/pytorch/pull/96936 on behalf of https://github.com/kit1980 due to Broke inductor tests on macos-12-py3-arm64 https://github.com/pytorch/pytorch/actions/runs/4458548491/jobs/7830566793
2023-03-19 20:32:13 +00:00
90537a779c Update FlashAttention to work with sm90 Gpus (#97051)
# Summary
FlashAttention was confirmed to work on h100 and sm90 hardware so we update the checks to account for this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97051
Approved by: https://github.com/cpuhrsch
2023-03-19 19:33:57 +00:00
37cde56658 Fix standalone compile for op with multiple outputs (#96936)
Op-benchmark directly uses fx.Graph to create nodes without dynamo and then compiles the graph with inductor. Currently, operators with multiple outputs, e.g. native_layer_norm, would fail to run when compiled through the standalone torch._inductor.compile() API (#95594). Actually, the graph's result is a single node with several outputs instead of a tuple of several nodes. However, the standalone API forces a non-tuple result to be a tuple, i.e., a tuple with one node-type element that has several outputs. This PR treats a return node with several outputs as a tuple to avoid errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96936
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-19 02:44:03 +00:00
c1214ce5c2 Update mkl_verbose return value check due to API change in mkl (#96283)
As title.
Originally, the `mkl_verbose()` function returned `0` and `1`, indicating failure and success respectively. However, the version that PyTorch now uses changed the output of `mkl_verbose()` to reflect its input level. Thus, the check logic needs to be changed to compare the output of `mkl_verbose()` with -1.
https://www.intel.com/content/www/us/en/develop/documentation/onemkl-developer-reference-c/top/support-functions/miscellaneous/mkl-verbose.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96283
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-03-18 20:30:07 +00:00
5ee5a164ff [aot] disable inference view tracking (#96478)
For inference, we should disable unnecessary view tracking to save time. Most operators get a performance improvement (inductor vs. eager). This PR fixes the general regression of operators for inductor.

Example of operators' speedup in torchbench (inductor vs. eager):
| op | current | new |
| --- | --- | --- |
| aten.hardsigmoid.default | [0.6426090814905988, 0.6791992931354925, 0.7046010955095103] | [0.7921782106271767, 0.8919522525991529, 0.9128089963571694] |
| aten.tanh.default | [0.6135534976747065, 0.7588851221588919, 0.898274076411234] | [0.857534066531159, 1.0524121834821605, 1.2535141671420165] |
| aten.floor.default | [0.6115868728087821, 0.6115868728087821, 0.6115868728087821] | [0.9472870784346195, 0.9472870784346195, 0.9472870784346195] |
| aten.exp.default | [0.7784016216625718, 0.9279358274876591, 1.1201178548406794] | [0.5777145055206203, 0.8610140436473923, 1.1850714193498957] |
| aten.mul_.Tensor | [0.14381872531802153, 0.14638969818507447, 0.14947766446663138] | [0.37695307573466363, 0.3832122689450142, 0.38963470437456904] |
| aten.hardtanh_.default | [0.49502896822398157, 0.5897512505705527, 0.8052969399847189] | [0.4915338157706071, 0.6098169585316151, 0.8587605051115021] |
| aten.relu_.default | [0.47776870021339685, 0.54452322796367, 0.6516167164223963] | [0.4764791289773786, 0.5608095328163419, 0.6753350976452626] |


Pull Request resolved: https://github.com/pytorch/pytorch/pull/96478
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/jgong5, https://github.com/bdhirsh
2023-03-18 13:58:24 +00:00
4805441b4a [dtensor] remove unused tests and fix ci (#97064)
fix ci
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97064
Approved by: https://github.com/huydhn
2023-03-18 06:01:37 +00:00
a5923ab3f3 Revert "[inductor] do benchmark in sub processes for max autotuning (#96410)" (#97075)
This reverts commit 34256bc73080d7898138c821273b9f31fab777f8.

@kit1980: I'm not sure how best to revert a co-dev PR like https://github.com/pytorch/pytorch/pull/96410#issuecomment-1474704337.  IIRC, Ivan and Eli did a revert PR like this before, so I created one here just in case we need to use it.  If that's the case, please feel free to merge this to fix trunk.  Otherwise, this can be closed.

@shunting314 If you can do a forward fix faster than this, please help do so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97075
Approved by: https://github.com/kit1980
2023-03-18 05:07:18 +00:00
a1c46e5f8f component-level configurable logging for dynamo, inductor, aot (#94858)
Summary:

Adds NNC-like logging that is configured through an env var `TORCH_LOGS`
Examples:
`TORCH_LOGS="dynamo,guards" python script.py` - prints dynamo logs at level INFO with guards of all functions that are compiled

`TORCH_LOGS="+dynamo,guards,graph" python script.py` - prints dynamo logs at level DEBUG with guards and graphs (in tabular) format of all graphs that are compiled

[More examples with full output](https://gist.github.com/mlazos/b17f474457308ce15e88c91721ac1cce)

Implementation:
The implementation parses the log settings from the environment, finds any components (aot, dynamo, inductor) or other loggable objects (guards, graph, etc.) and generates a log_state object. This object contains all of the enabled artifacts, and a qualified log name -> level mapping. _init_logs then adds handlers to the highest level logs (the registered logs), and sets any artifact loggers to level DEBUG if the artifact is enabled.

Note: set_logs is an alternative for manipulating the log_state, but if the environment contains TORCH_LOGS, the environment settings will be prioritized.

Adding a new log:
To add a new log, a dev should add their log name to torch._logging._registrations (there are examples there already).

Adding a new artifact:
To add a new artifact, a dev should add their artifact name to torch._logging._registrations as well.
Additionally, wherever the artifact is logged, `torch._logging.getArtifactLogger(__name__, <artifact_name>)` should be used instead of the standard logging implementation.
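A minimal sketch combining the pieces above (the component/artifact names come from the examples above; treat the exact calls as illustrative):
```python
import logging
import torch._logging

# programmatic alternative to the TORCH_LOGS env var (the env var wins if both are set)
torch._logging.set_logs(dynamo=logging.DEBUG, graph=True)

# inside PyTorch code, an artifact logger is obtained as described above
graph_log = torch._logging.getArtifactLogger(__name__, "graph")
graph_log.debug("tabular graph dump would be emitted here")
```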

[design doc](https://docs.google.com/document/d/1ZRfTWKa8eaPq1AxaiHrq4ASTPouzzlPiuquSBEJYwS8/edit#)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94858
Approved by: https://github.com/ezyang
2023-03-18 04:17:31 +00:00
086ce765a5 Add new parameter materialize_grads to torch.autograd.grad() (#97015)
Fixes #44189
Adds a new parameter, zero_grad_unused, to the torch.autograd.grad() function. This parameter allows the gradient to be set to 0 instead of None when a variable is unused, which can be helpful for higher-order partial derivatives.

Here is an example of using this new parameter to solve d^3y/dx^3 given y = a * x:

```python
x = torch.tensor(0.5, dtype=torch.float32, requires_grad=True)
a = torch.tensor(1, dtype=torch.float32, requires_grad=True)
y = x * a
dydx = torch.autograd.grad(y, x, create_graph=True, allow_unused=True)
d2ydx2 = torch.autograd.grad(dydx, x, allow_unused=True, zero_grad_unused=True)
try:
    d3ydx3 = torch.autograd.grad(d2ydx2, x, allow_unused=True, zero_grad_unused=True)
except RuntimeError as e:
    assert False, "Should not raise error"
```

With `zero_grad_unused`, d2ydx2 could be 0 instead of None, enabling d3ydx3 to be calculated as defined in math without throwing an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97015
Approved by: https://github.com/soulitzer
2023-03-18 03:11:12 +00:00
34256bc730 [inductor] do benchmark in sub processes for max autotuning (#96410)
This PR implements support for benchmarking max-autotune choices in subprocesses. This way, a crash like https://github.com/openai/triton/issues/1298 will only abort the autotuning child process, and the parent process can continue.

There are a few things to note:
- The CUDA runtime does not work with fork, so we have to use spawn to create child processes. Check the best practices from the pytorch multiprocessing notes: https://pytorch.org/docs/stable/notes/multiprocessing.html
- To run a job in a child process, the multiprocessing module needs to pickle both the target function and the arguments and pass them to the child process. This is the major complexity of this prototype since there are quite a lot of corner cases making pickle fail.

Here I list the pickle related issues I encountered:
- Pickling a StorageBox causes infinite recursion. Error: https://gist.github.com/171e5ab404b7855dee2dfa1d9f093442 . Worked around by pickling the inner buffer.
- IRNode stores fx.Nodes in its origin fields. However, we cannot pickle an fx.Node. It fails with the following error when pickling fx.Node.graph: https://gist.github.com/9c289e895d7091d7ec787c67bc3c0d70. Worked around by skipping origins when pickling an IRNode.
- The jinja Template in TritonTemplateKernel cannot be pickled: `TypeError: Template.__new__() missing 1 required positional argument: 'source' `. Worked around by pickling the source rather than the jinja Template; during unpickling, the jinja template is rebuilt.
- Due to how select_algorithm.template_kernels is populated, it is empty in the child process. Worked around by passing select_algorithm.template_kernels from the parent process to the child process directly.
  - There is some change in TritonTemplate.generate to make a TritonTemplateKernel picklable. A TritonTemplate is referred to in the closure for a TritonTemplateKernel object.
- We cannot pass a choice to the child process directly because pickling fails for the lambdas/local functions being used. However, cloudpickle can handle lambdas. Worked around by passing the cloudpickle'd choice object to the child process. The child process needs to unpickle it explicitly.

Test:
```
python test/inductor/test_max_autotune.py -k test_max_autotune_mm_plus_mm
```
This is basically the repro I get from Bert Maher.

Benchmarking in a subprocess is about 4x slower than benchmarking in the same process. Without doing any profiling, I feel the time may be spent on starting a new process and doing initialization. Some ~thread~ process pool may help.
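A minimal sketch of the spawn + cloudpickle flow described above (the `choice.benchmark()` call and the queue-based result passing are illustrative assumptions, not the actual inductor internals):
```python
import multiprocessing
import cloudpickle

def _bench_in_child(pickled_choice, out_queue):
    choice = cloudpickle.loads(pickled_choice)  # the child unpickles explicitly
    out_queue.put(choice.benchmark())           # hypothetical benchmark call

def benchmark_in_subproc(choice):
    ctx = multiprocessing.get_context("spawn")  # fork does not work with the CUDA runtime
    queue = ctx.Queue()
    proc = ctx.Process(target=_bench_in_child,
                       args=(cloudpickle.dumps(choice), queue))
    proc.start()
    proc.join()
    if proc.exitcode != 0:  # a crashing choice only aborts the child, not the parent
        return float("inf")
    return queue.get()
```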

```
AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_5 0.0317s 87.1%
  triton_mm_plus_mm_1 0.0328s 84.4%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0379s 73.0%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 12.001659393310547 seconds

AUTOTUNE ref_mm_plus_mm(2048x64, 64x1536, 2048x64, 64x1536)
  triton_mm_plus_mm_0 0.0276s 100.0%
  triton_mm_plus_mm_6 0.0287s 96.4%
  triton_mm_plus_mm_1 0.0317s 87.1%
  triton_mm_plus_mm_5 0.0317s 87.1%
  ref_mm_plus_mm 0.0379s 73.0%
  triton_mm_plus_mm_7 0.0389s 71.1%
  triton_mm_plus_mm_2 0.0399s 69.2%
  triton_mm_plus_mm_3 0.0410s 67.5%
  triton_mm_plus_mm_4 0.0410s 67.5%
AUTOTUNE takes 51.39659810066223 seconds
```

The feature is disabled by default and can be enabled by setting the following config or envvar:
```
autotune_in_subproc = os.environ.get("TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC") == "1"
```

Differential Revision: [D43996048](https://our.internmc.facebook.com/intern/diff/D43996048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96410
Approved by: https://github.com/jansel
2023-03-18 02:43:28 +00:00
b132220309 Update MHA doc string (#97046)
Summary: Update MHA doc string

Test Plan: sandcastle & github

Differential Revision: D44179519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97046
Approved by: https://github.com/voznesenskym
2023-03-18 02:14:59 +00:00
915cbf8208 [Inductor] Eliminate redundant to_dtype node (#96650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96650
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-18 01:51:38 +00:00
679dec847e Use is_available instead of device_count to check for CUDA availability (#97043)
There are some tests that incorrectly use the number of GPU devices (`torch.cuda.device_count() > 0`) to check for CUDA availability instead of the default `torch.cuda.is_available()` call.  This makes these tests more brittle when encountering infra flakiness on G5 runners using A10G, for example [test_pytorch_np](https://hud.pytorch.org/failure/FAILED%20test_tensorboard.py%3A%3ATestTensorBoardPyTorchNumpy%3A%3Atest_pytorch_np%20-%20RuntimeError%3A%20No%20CUDA%20GPUs%20are%20available).

The underlying problem is that GPU devices could crash on these runners.  While the root cause is unclear and we will try to upgrade to a new NVIDIA driver https://github.com/pytorch/pytorch/pull/96904 to see if it helps, we can also make these tests more resilient by using the correct check so that they are skipped correctly when the GPU crashes.
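A minimal sketch of the pattern being fixed (the test body is illustrative):
```python
import unittest
import torch

class ExampleGpuTest(unittest.TestCase):
    # before: @unittest.skipIf(torch.cuda.device_count() == 0, "no GPU")
    @unittest.skipIf(not torch.cuda.is_available(), "CUDA not available")
    def test_uses_gpu(self):
        x = torch.ones(1, device="cuda")
        self.assertEqual(x.item(), 1.0)
```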
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97043
Approved by: https://github.com/clee2000
2023-03-18 00:39:42 +00:00
c62fc81cc5 Increase the timeout value for linter calculate-docker-image (#96993)
I should have known that this step rebuilds the linter Docker image if it doesn't exist.  When it does so, it takes close to 15 minutes to finish, i.e. https://github.com/pytorch/pytorch/actions/runs/4443046530/attempts/1, instead of the regular 2-minute run, i.e. https://github.com/pytorch/pytorch/actions/runs/4442455480/jobs/7798700609.

This doubles the timeout value of this step to 30 minutes to avoid flaky timeouts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96993
Approved by: https://github.com/clee2000
2023-03-18 00:06:39 +00:00
b390e7037e [docs] passing LogSoftmax into NLLLoss (#97001)
Fixes https://github.com/pytorch/pytorch/issues/96795
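For reference, a minimal sketch of the documented pattern of passing LogSoftmax output into NLLLoss (shapes are arbitrary):
```python
import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()

input = torch.randn(3, 5, requires_grad=True)  # raw scores for 3 samples, 5 classes
target = torch.tensor([1, 0, 4])               # class indices
output = loss(m(input), target)                # NLLLoss expects log-probabilities
output.backward()
```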

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97001
Approved by: https://github.com/soulitzer
2023-03-17 23:22:13 +00:00
410210b351 Remove obsolete "merge -g" flag from update_commit_hashes.py (#97033)
The flag is deprecated and is being removed in https://github.com/pytorch/test-infra/pull/3882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97033
Approved by: https://github.com/huydhn
2023-03-17 22:51:58 +00:00
db2c1ea8c8 Re-enable test_ops_jit on Windows (#96859) (#96931)
Fixes https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96931
Approved by: https://github.com/kit1980
2023-03-17 22:42:22 +00:00
a4c706bcbc [dynamo][dashboard] fix triton clone step in dashboard (#96623)
Previously this would clone triton and then try to check out without being in the git repo directory. This wasn't usually a problem because the environment already had a triton repo downloaded, but I ran into it while trying to construct a new environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96623
Approved by: https://github.com/anijain2305
2023-03-17 22:36:26 +00:00
4a90aca60d Make keep-going work for more than linux (#96974)
cc. asked by @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96974
Approved by: https://github.com/huydhn
2023-03-17 22:08:37 +00:00
b59a60ddff Fix CPU bitwise shifts for out-of-limit shift values (#96659)
Negative shift values and positive shift values greater than the bit size of the dtype (limit `0..bits`) now yield expected results which are consistent with numpy.

Left shift with an out-of-limit shift value results in a value of `0`. Right shift with an out-of-limit shift value results in a value of `-1` for negative inputs and `0` for non-negative inputs (sign preserving).
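A small sketch of the new semantics for an int32 tensor, where the in-limit range is 0..31 (the expected values follow the description above):
```python
import torch

a = torch.tensor([16, -16], dtype=torch.int32)

print(a << 40)   # out-of-limit left shift  -> tensor([0, 0], dtype=torch.int32)
print(a >> 40)   # out-of-limit right shift -> tensor([0, -1], dtype=torch.int32), sign preserving
print(a << -1)   # negative shifts are also out-of-limit -> tensor([0, 0], dtype=torch.int32)
```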

Fixes https://github.com/pytorch/pytorch/issues/70904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96659
Approved by: https://github.com/ngimel, https://github.com/albanD, https://github.com/zou3519, https://github.com/jgong5, https://github.com/malfet
2023-03-17 21:35:34 +00:00
dd9ade6377 Remove unnecessary items() call in zero_grad (#97040)
Micro-optimization to zero_grad() which is performance critical
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97040
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-03-17 21:34:14 +00:00
98a5cf090d [SDPA] Remove the chunk_grad from mem-eff attention (#96880)
# Summary

There exists an optimization within the scaled_dot_product_efficient backward attention path to, under the right conditions, output grad_q, grad_k, grad_v all as aliases of the same storage. This was done to optimize for the hot path where mha does packed linear_projection -> chunk -> (view stuff) -> sdpa. The thought was that the chunk op would be able to "trivially" cat inputs to chunk.backward(). However, upon closer inspection, chunk.backward will call `cat` regardless of the inputs, so this is not being utilized.

I validated this by profiling on main and then on this branch, and the traces produced were the same in both cases, with `split.backward()` calling into cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96880
Approved by: https://github.com/cpuhrsch
2023-03-17 21:28:25 +00:00
d4b8ed2b11 Fail fast when dynamo attempts to add unspecialized int/float as additional graph inputs (#96786)
Summary:
Verified the changes to catch unspecialized ints/floats being added as additional graph inputs in D44037548, prior to PR https://github.com/pytorch/pytorch/pull/95621.

However, with #95621 the issue originally being solved is no longer valid because ints & floats in `forward` will always be specialized in export. This PR adds the assertion anyway *(though it will not be hit unless there is a regression)* to immediately catch any attempt to add an unspecialized int/float to additional graphargs.

Test Plan:
Example of the error message would look like:
```
Dynamo attempts to add additional input: value=9.999999747378752e-06, source=NNModuleSource(inner=AttrSource(base=NNModuleSource(inner=AttrSource(base=LocalInputSource(local_name='self', pos=0), member='torch_module')), member='eps'))
```
Passed all export tests
```
Buck UI: https://www.internalfb.com/buck2/fea72653-5549-47e7-a9bf-740eb86a8e26
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8725724422167257
RE: reSessionID-7b3470b1-c293-4c4a-9671-dd0b7a2839b8  Up: 6.0 KiB  Down: 0 B
Jobs completed: 101. Time elapsed: 115.7s.
Tests finished: Pass 98. Fail 0. Fatal 0. Skip 0. 0 builds failed
```

Differential Revision: D44075910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96786
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
2023-03-17 21:15:18 +00:00
cea13ad9fa Improve size mismatch error messaging referencing mat/vec sizes (#96863)
Fixes #94841

This fixes the error messages in the following files, the same as those referenced in the linked issue. I was not able to find any additional examples, but am happy to add commits for any that I may have missed!

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got ", self.size(0), ", ", mat.size(0), "x", mat.size(1), ",", vec.size(0));
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got {self.size(0)}x{self.size(1)},{vec.size(0)}",
```

Example output for `Blas.cpp` before:
```
size mismatch, got 3, 3x4,1
```

The new error messages have the following format:

```
aten/src/ATen/native/Blas.cpp:     "size mismatch, got bias (", self.size(0), "), matrix (", mat.size(0), "x", mat.size(1), "), vector (", vec.size(0), ")");
torch/_decomp/decompositions.py:        lambda: f"size mismatch, got matrix ({self.size(0)}x{self.size(1)}), vector ({vec.size(0)})",
```

Example output for `Blas.cpp` after:
```
size mismatch, got bias (3), matrix (3x4), vector (1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96863
Approved by: https://github.com/albanD
2023-03-17 21:07:48 +00:00
985fc66b30 Bind increment_version to python (#96852)
Should be convenient when writing python-only kernels (with triton) that don't have access to the C++ APIs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96852
Approved by: https://github.com/soulitzer
2023-03-17 20:36:33 +00:00
1983b31711 Fixed print tensor.type() issue. (#96381)
Fixes #95954
Updating the cpp printing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96381
Approved by: https://github.com/albanD
2023-03-17 20:26:43 +00:00
57bb5b159d [static-runtime] one more attempt to improve crash log readability (#96903)
Summary:
* add human readable type and ivalue printout
* fix internal linter warnings

Test Plan:
error message now looks like e.g.
```
E0315 16:27:32.409082 422313 ExceptionTracer.cpp:222] exception stack complete
terminate called after throwing an instance of 'c10::Error'
  what():  List[int] is not a subtype of List[int]; schema arg name: 'split_sizes', ivalue: [1, 1]
```

Differential Revision: D44112297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96903
Approved by: https://github.com/davidberard98
2023-03-17 17:56:26 +00:00
44d7bbfe22 [cpp extension] Allow setting PYTORCH_NVCC to a customized nvcc in torch cpp extension build (#96987)
per title

I can write a script named `nvcc` like this
```bash
#!/bin/bash
/opt/cache/bin/sccache /usr/local/cuda/bin/nvcc $@
```
and set its path to `PYTORCH_NVCC` (added in this PR), along with another `sccache-g++` script to env var `CXX`.
cfa6b52e02/torch/utils/cpp_extension.py (L2106-L2109)

With ninja, I can fully enable sccache-cached builds of my CUDA extensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96987
Approved by: https://github.com/ezyang
2023-03-17 17:05:17 +00:00
8ce296ae2c [ez][inductor] show kernel category in kernel benchmark result (#96991)
I feel it's useful to show whether a kernel is pointwise/reduction/persistent_reduction in the benchmark output. Only the first 3 letters are printed, in upper case, to avoid wrapping the line:
- POI for pointwise
- RED for reduction
- PER for persistent_reduction

<img width="1091" alt="Screenshot 2023-03-16 at 5 10 21 PM" src="https://user-images.githubusercontent.com/52589240/225780546-07b8d345-2bbe-40bd-9e65-185e9294743e.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96991
Approved by: https://github.com/Chillee
2023-03-17 17:02:43 +00:00
46eaf4be7d Fix Typo in pytorch/torch/autograd/__init__.py (#97024)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97024
Approved by: https://github.com/Skylion007, https://github.com/soulitzer
2023-03-17 16:24:18 +00:00
95575f0a5f [DTensor] Fix _get_or_create_default_group() (#96961)
Summary:
This PR fixes `_get_or_create_default_group()` of `DeviceMesh`. When `mesh` of the first created `DeviceMesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]` and `is_initialized() == False`, it wrongly asserts. This PR fixes this issue by removing these assertions.

 ---

More specifically, `_get_or_create_default_group()` has 4 checks:

1. `DeviceMesh must include every process in WORLD`
2. `DeviceMesh cannot have duplicate values`
3. `DeviceMesh ranks must start from 0`
4. `DeviceMesh should have all ranks of WORLD`

1, 3, and 4 are not satisfied when `self.mesh` is not `[0, 1, 2, ... WORLD_SIZE - 1]`.

2 is a valid check, but it is also checked in `__init__()`, so we don't need to check it again in this function.

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D44098849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96961
Approved by: https://github.com/wanchaol
2023-03-17 15:52:19 +00:00
ffddb2219a Change THPStorage::cdata to be a MaybeOwned<Storage>, add unpack func (#96801)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96801
Approved by: https://github.com/ezyang
2023-03-17 14:58:21 +00:00
7f94ea8492 test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)
Use opposite encoding on big endian systems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96952
Approved by: https://github.com/ezyang
2023-03-17 14:36:33 +00:00
18cf30fb2a [Inductor] preserve AliasedLayout on View (#96948)
Fix https://github.com/pytorch/pytorch/issues/96728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96948
Approved by: https://github.com/Chillee
2023-03-17 14:29:13 +00:00
92eb9d363a Decoder native functions join the dead code society (#96025)
Summary: Decoder native joins the dead code society

With the recent introduction of PT2, we no longer need native decoder operators:
1 - full-function SDPA kernels can be used to implement cross-attention efficiently without the (slower) decoder MHA blob.
2 - torch.compile() generates more efficient code across many platforms from the python implementation of decoders than the decoder layer blob, by tailoring code to the target

Test Plan: github & sandcastle

Differential Revision: D43811808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96025
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-03-17 09:45:55 +00:00
b5ecf727be Revert "[aot autograd] refactor to make functionalization self-contained (#96341)"
This reverts commit 3cd9c7a16d8b19c28d12bf5b56a8a7c20405476a.

Reverted https://github.com/pytorch/pytorch/pull/96341 on behalf of https://github.com/DanilBaibak due to Break internal build
2023-03-17 09:24:05 +00:00
238b06086f inductor: fix cpp wrapper ExternKernel check (#96799)
Fix cpp_wrapper functionality for ExternKernel. Changes in https://github.com/pytorch/pytorch/pull/91575 have disabled the cpp_wrapper for ExternKernel cases.

1. Need to set the `cpp_wrapper` attr before `V.graph.register_buffer(self)`.
`register_buffer` will invoke the below check:
c6a82e4339/torch/_inductor/graph.py (L220-L223)
The current code, which sets `cpp_wrapper` after `V.graph.register_buffer(self)`, will always disable the cpp wrapper.

2. Fix the missing `ordered_kwargs_for_cpp_kernel` attr for `at::addmm_out`

3. Enhance the UT to check that cpp_wrapper has been turned on for the supported cases to prevent being unintentionally disabled by future changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96799
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-17 08:58:35 +00:00
13538c88b3 [1/n] Consolidate replicate and DDP: setup ufmt for distributed.py (#96597)
As we already enabled ufmt for composable APIs in https://github.com/pytorch/pytorch/pull/90873, it seems a good idea to enable ufmt for other distributed APIs as well. This change sets up ufmt for DDP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96597
Approved by: https://github.com/rohan-varma
2023-03-17 06:25:11 +00:00
24ce3a7c34 Move hasPrimaryContext to c10::cuda (#96800)
This method has to be accessible from `c10` to enable CUDA-12 integration.
Implemented by providing a private `c10::cuda::_internal::setHasPrimaryContext` that passes the pointer to the implementation (in `torch_cuda`) back to c10.
Use global class constructor/destructor to guarantee RAII.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96800
Approved by: https://github.com/ngimel
2023-03-17 04:50:35 +00:00
cbd3df93c4 [vision hash update] update the pinned vision hash (#96990)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96990
Approved by: https://github.com/pytorchbot
2023-03-17 03:13:22 +00:00
4de1bc16e3 [PyTorch][XNNPACK] Update wrappers for internal only x86 SSE2 kernels (#96896)
Summary:
Same as D43747173 (https://github.com/pytorch/pytorch/pull/95911) except for the newly added x86 SSE2 kernels.

For future reference, wrappers can be generated by

```
cd ~/fbsource/xplat/third-party/XNNPACK
# Update the list of internal only kernels in generate-wrappers.py
python3 generate-wrappers.py
```

Test Plan: CI

Reviewed By: digantdesai

Differential Revision: D44072764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96896
Approved by: https://github.com/digantdesai
2023-03-17 03:07:39 +00:00
f865e23abc [MPS] Introduce MPSUnaryGradCachedGraph & MPSBinaryGradCachedGraph (#95289)
This PR introduces `MPSUnaryGradCachedGraph` & `MPSBinaryGradCachedGraph` to replace duplicate CachedGraph creation in backward functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95289
Approved by: https://github.com/kulinseth
2023-03-17 02:50:51 +00:00
571f96bf59 cudagraph trees (#89146)
CUDA Graph Trees

Design doc: https://docs.google.com/document/d/1ZrxLGWz7T45MSX6gPsL6Ln4t0eZCSfWewtJ_qLd_D0E/edit

Not currently implemented :

- Right now, we are using weak tensor refs from outputs to check if a tensor has died. This doesn't work because of a) aliasing, and b) aot_autograd detaching tensors (see note [Detaching saved tensors in AOTAutograd]). Would need either https://github.com/pytorch/pytorch/issues/91395 to land to use storage weak refs, or to manually add a deleter fn that does what I want. This is doable but there are some interactions with the caching allocator checkpointing, so saving for a stacked pr.

- Reclaiming memory from the inputs during model recording. This isn't terribly difficult but deferring to another PR. You would need to write over the input memory during warmup, and therefore copy the inputs to cpu. Saving for a stacked pr.

- Warning on overwriting previous generation outputs, and handling nested torch.compile() calls in generation tracking

Differential Revision: [D43999887](https://our.internmc.facebook.com/intern/diff/D43999887)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89146
Approved by: https://github.com/ezyang
2023-03-17 02:47:03 +00:00
cf732053e4 nn.EmbeddingBag bound check (#96022)
Summary: Today, if we're accessing out-of-bound embedding rows, it'll either go through or throw an IMA (illegal memory access). This is not ideal - adding bounds checks. This will probably slow things down - need to benchmark it.

Test Plan:
TODO: add some tests

Tried a simple example and it's showing this:
```
aten/src/ATen/native/cuda/EmbeddingBag.cu:143: EmbeddingBag_updateOutputKernel_sum_mean: block: [0,0,0], thread: [0,1,0] Assertion `input[emb] < numRows` failed.
```
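A minimal repro sketch that should trip the new check (assumes a CUDA device; the out-of-bounds index is illustrative):
```python
import torch

emb = torch.nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode="sum").cuda()
indices = torch.tensor([3, 12], device="cuda")   # 12 >= num_embeddings -> out of bounds
offsets = torch.tensor([0], device="cuda")
emb(indices, offsets)  # expected to hit the device-side assertion shown above
```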

Differential Revision: D43810777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96022
Approved by: https://github.com/cpuhrsch, https://github.com/ngimel
2023-03-17 02:01:43 +00:00
50beab2978 [MPS] Fix the failure with ReplicatePad3D (#96988)
- Only ReflectPad needs the torch checks for input arguments and not the ReplicatePad
- Added a test case
- The failure was originally found in test_modules with test `test_forward_nn_ReplicationPad3d_mps_float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96988
Approved by: https://github.com/DenisVieriu97
2023-03-17 01:41:12 +00:00
417e7bc09f Revert "[PTD][Checkpoint] Add checkpointing support for DTensor submesh (#96802)"
This reverts commit cfa6b52e02eb61f71c0034d5b7e73e365420f35a.

Reverted https://github.com/pytorch/pytorch/pull/96802 on behalf of https://github.com/huydhn due to This breaks distributed test cfa6b52e02. Probably a landrace as PR signal was green
2023-03-17 01:04:43 +00:00
c9135e4408 Stop using my channel for 3.11 builds (#96973)
As `numpy` for Python 3.11 is now available from the default anaconda channel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96973
Approved by: https://github.com/kit1980, https://github.com/atalman
2023-03-17 00:55:38 +00:00
e4e761b277 record caller frame instead of function frame (#96882)
Previously, when starting to trace a function, we would record a frame summary recording the definition loc. This would lead to an unconventional-looking stack trace when used for debugging, e.g., shape guards.

```
  File ".../scripts/avik/pt2/example.py", line 407, in forward
    def forward(self, x):
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 912, in forward
    @add_start_docstrings_to_model_forward(BERT_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 562, in forward
    def forward(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 484, in forward
    def forward(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 416, in forward
    def forward(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 275, in forward
    def forward(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 351, in forward
    attention_scores = attention_scores + attention_mask
```

As noted in https://github.com/pytorch/pytorch/pull/95848#discussion_r1134397096, we would like to change this to record function calls instead, like conventional stack traces do. This diff makes this change. The above stack now looks like the following, which is way more helpful at a glance to understand what's going on.

```
  File ".../scripts/avik/pt2/example.py", line 408, in forward
    bert_out = self.bert(**x)
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 1021, in forward
    encoder_outputs = self.encoder(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 610, in forward
    layer_outputs = layer_module(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 496, in forward
    self_attention_outputs = self.attention(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 426, in forward
    self_outputs = self.self(
  ...
  File ".../transformers/models/bert/modeling_bert.py", line 351, in forward
    attention_scores = attention_scores + attention_mask
```

Differential Revision: [D44101882](https://our.internmc.facebook.com/intern/diff/D44101882/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96882
Approved by: https://github.com/ezyang
2023-03-17 00:06:16 +00:00
ea7415087a Expose Stream Recording Apis in python (#96384)
Differential Revision: [D43999891](https://our.internmc.facebook.com/intern/diff/D43999891)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96384
Approved by: https://github.com/zdevito
2023-03-16 23:45:43 +00:00
b01d6f2cdb addmv decomp #2 (#96264)
Fixes #94617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96264
Approved by: https://github.com/ngimel, https://github.com/ezyang
2023-03-16 23:09:45 +00:00
5842e5c175 vmap support for torch.tril and torch.triu (#94287)
Summary:
Add vmap support for torch.tril and torch.triu.

Fix: #91403

Test Plan: GitHub pipeline

Differential Revision: D43016624

### Expected behavior
Same as using for-loop:

```python
import torch

x = torch.randn(32, 3)
results = []
for xi in x:
  y = torch.triu(xi)
  results.append(y)
"""
triu: input tensor must have at least 2 dimensions
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-d726203efb0e> in <module>
      4 results = []
      5 for xi in x:
----> 6   y = torch.triu(xi)
      7   results.append(y)
RuntimeError: triu: input tensor must have at least 2 dimensions
"""
```
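With vmap support added, the batched call below should mirror the for-loop above and raise the same error, since each per-sample input is 1-D (a sketch based on the expected behavior described above):
```python
import torch

x = torch.randn(32, 3)
torch.vmap(torch.triu)(x)  # expected: RuntimeError: triu: input tensor must have at least 2 dimensions
```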

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94287
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-03-16 22:33:18 +00:00
cfa6b52e02 [PTD][Checkpoint] Add checkpointing support for DTensor submesh (#96802)
DTensor submesh support is added in https://github.com/pytorch/pytorch/pull/95458.

This PR adds support for DTensor submesh by adding an extra check when create local save/load plan.
If the rank is not participating in the mesh, we simply skip creating WriteItem/ReadItem for the local SavePlan/LoadPlan.

Updated the associated test as well.

cc. @wanchaol, @kumpera
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96802
Approved by: https://github.com/wanchaol
2023-03-16 22:16:58 +00:00
2abcafcfd8 Add masked_grad kw argument to to_dense (#96095)
As in the title.

The `masked_grad` kw argument is required for `to_dense` backward to distinguish the expected semantics of sparse tensors. `masked_grad=True` means that the `to_dense` backward will apply a mask to the returned gradient where the mask is defined by the input indices. The default semantics implies `masked_grad==True` for BC but see the [comment](https://github.com/pytorch/pytorch/pull/96095/files#diff-d4df180433a09071e891d552426911c227b30ae9b8a8e56da31046e7ecb1afbeR501-R513) in `to_dense_backward`.

As a consequence, existing code that is run through autograd engine must replace `.to_dense()` calls with `.to_dense(masked_grad=False)`. For example,
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense())
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense())
```
(recall, gradcheck has `masked=False` as default) must be updated to
```python
torch.autograd.gradcheck(lambda x: torch.sum(x, [0]).to_dense(masked_grad=False))
torch.autograd.gradcheck(lambda x: torch.sparse.sum(x, [0]).to_dense(masked_grad=True), masked=True)
```

Fixes https://github.com/pytorch/pytorch/issues/95550

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96095
Approved by: https://github.com/cpuhrsch
2023-03-16 21:38:11 +00:00
9d80969fa4 Retry brew and gem installation in trunk ios workflow (#96970)
Per title, I don't want to see network flakiness like this https://github.com/pytorch/pytorch/actions/runs/4439991996/jobs/7793213476 ever again :P
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96970
Approved by: https://github.com/clee2000
2023-03-16 21:30:57 +00:00
b02fd701fb [FSDP] Reduce CPU overhead (#96958)
I experimented with 200 `nn.Linear`s with `bias=True` for a total of 400 `nn.Parameter`s all wrapped into the same FSDP instance and world size of 2.

**`unshard()` -> `_use_unsharded_views()`**
- (From previous PR) unsafe `setattr`: 6.112 ms -> 4.268 ms

**`pre_unshard()` -> `_writeback_orig_params()`**
- Factor out `flat_param` and `flat_param_grad` data pointers: ~1.8 ms -> 1.071 ms
    - Now dominated by calling `_typed_storage()` on each original parameter and its gradient

**`reshard()` -> `_use_sharded_views()`**
- Factor out `torch.empty(0, ...)`: ~4.6 - 4.7 ms -> ~2.7 - 2.8 ms
    - Now dominated by `aten::slice()` and (unsafe) `setattr`, which are required

I removed some `assert` calls that were only needed for mypy or if the subsequent call would provide the same error message anyway. These have negligible overhead, but I think it is still okay to remove them and avoid the type check. We need to address type checking more holistically anyway.

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96958
Approved by: https://github.com/rohan-varma
2023-03-16 21:13:57 +00:00
931a4913b1 [inductor] Refactor memory management code in wrapper codegen (#96768)
Summary: use inheritance to simplify CppWrapperCodeGen and to prepare for AOT codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96768
Approved by: https://github.com/jansel
2023-03-16 18:36:35 +00:00
3f4090652c Passing LinearPackedParamBase Capsule as a saved_data to backward stage (#96269)
Summary:
The initial implementation unpacked the original weight in the custom forward function, which doubles the weight tensor's footprint in memory.

Hence it is better to unpack the weight in the backward function:
store the Capsule object in saved_data storage and unpack it in the backward function.

Detail :
https://github.com/pytorch/pytorch/pull/94432#discussion_r1126669178

Test Plan: buck2 run //scripts/kwanghoon/pytorch:torch_playground - [D43809980](https://www.internalfb.com/diff/D43809980)
You can plug and play with above script.

Differential Revision: D43895790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96269
Approved by: https://github.com/kimishpatel
2023-03-16 17:37:05 +00:00
397fb2762e [DTensor] Fix DeviceMesh (#96861)
Summary: This Diff fixes some DeviceMesh issues which block internal DTensor integration.  Specifically, when `self.mesh = [2, 3]` while `world_size = 4`, because `unique_mesh_values[-1] == 3`, it takes the first short-cut branch and uses `default_pg`. Let's check the length instead of the last value of `unique_mesh_values`.

Test Plan: CI

Reviewed By: wanchaol

Differential Revision: D44079872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96861
Approved by: https://github.com/wanchaol
2023-03-16 16:40:38 +00:00
6718e3ca7c Cache the transformers model used in ONNX test (#96793)
Also updating merge_rule to allow ONNX exporter team to update the Docker script by themselves.  By default, the model is cached at ~/.cache/huggingface/hub/ under CI jenkins user.

The model is cached so that we don't need to re-download it every time in CI, which causes flaky [CI failures](https://hud.pytorch.org/failure/FAILED%20test%2Fonnx%2Ftest_fx_to_onnx_with_onnxruntime.py%3A%3ATestFxToOnnxWithOnnxRuntime%3A%3Atest_large_scale_exporter_with_tiny_gpt2%20-%20requests.exceptions.ReadTimeout%3A%20HTTPSConnectionPool(host%3D'huggingface.co'%2C%20port%3D443)%3A%20Read%20timed%20out.%20(read%20timeout%3D10.0)).

This is the second part after https://github.com/pytorch/pytorch/pull/96590

### Testing

Confirm that the model is cached in the Docker image before running the test:

```
jenkins@dd0db85dd34f:~/workspace$ ls -la ~/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/*
/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/blobs:
total 2460
drwxrwxr-x 2 jenkins jenkins     126 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins      48 Mar 15 05:48 ..
-rw-rw-r-- 1 jenkins jenkins     662 Mar 15 05:48 2c81a6c4c984e95a45338c64a7445c1f0f88077f
-rw-rw-r-- 1 jenkins jenkins 2514146 Mar 15 05:48 b706b24034032bdfe765ded5ab6403d201d295a995b790cb24c74becca5c04e6

/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/refs:
total 4
drwxrwxr-x 2 jenkins jenkins 18 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins 48 Mar 15 05:48 ..
-rw-rw-r-- 1 jenkins jenkins 40 Mar 15 05:48 main

/var/lib/jenkins/.cache/huggingface/hub/models--sshleifer--tiny-gpt2/snapshots:
total 0
drwxrwxr-x 3 jenkins jenkins 54 Mar 15 05:48 .
drwxrwxr-x 5 jenkins jenkins 48 Mar 15 05:48 ..
drwxrwxr-x 2 jenkins jenkins 50 Mar 15 05:48 5f91d94bd9cd7190a9f3216ff93cd1dd95f2c7be
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96793
Approved by: https://github.com/titaiwangms, https://github.com/ZainRizvi
2023-03-16 16:38:22 +00:00
aeb3db8aa0 Back out "Fixing a bug where allocating a 4GB block results in using 8GB of memory (#95827)" (#96796)
Summary:
Original commit changeset: a19273017a2a

Original Phabricator Diff: D43969564

-----------------------------------------------------------------------------------------------------------------------

Test Plan: unlandayc

Reviewed By: terrycsy

Differential Revision: D44080273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96796
Approved by: https://github.com/jianyuh, https://github.com/davidberard98
2023-03-16 16:35:33 +00:00
0eb9e01cbd Enable TestTorchbind on Windows (#96507)
This PR addresses the issues opened in #25155. However, those specific tests are no longer used since after #37473 they were moved to test_torchbind.
This PR enables TestTorchbind on Windows.
test_custom_class.py is no longer used after that commit.
In the original issue, there were problems on Windows with those tests so I tested the updated ones to see if they work.
I had no issues with them so this enables them on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96507
Approved by: https://github.com/ezyang
2023-03-16 16:18:08 +00:00
62eb7a2e97 [MPS] LSTM grad_y missing fix (#96601)
Fixes #96416
Added tests that do not use the LSTM output, similarly to the issue.

Seems like this fix once again introduces backward incompatibility.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96601
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-16 15:53:56 +00:00
b249b44bc1 Turn off split reductions for dynamic shapes (#96850)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96850
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-16 14:39:57 +00:00
bf08d1387c [primTorch] handle out in sort meta function (#96719)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96719
Approved by: https://github.com/ezyang
2023-03-16 07:38:53 +00:00
577d930c39 [CI] Revert https://github.com/pytorch/pytorch/pull/96195 (#96897)
Summary: https://github.com/pytorch/pytorch/pull/96195 was an experiment
for debugging flaky failures on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96897
Approved by: https://github.com/ngimel
2023-03-16 06:28:18 +00:00
8187c0de88 Add xpu device type to torch dispatch overload (#96849)
# Motivation
Add XPU device type to CppFunction dispatch overload function.
We previously omitted it.

# Solution
Add XPU device type.

# Additional
This list is synchronized with the k-constants in c10/core/DeviceType.h
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96849
Approved by: https://github.com/ezyang
2023-03-16 05:52:51 +00:00
0a53c9624a Back out "Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)" (#96885)
Summary:
Backing out  _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96885
Approved by: https://github.com/drisspg
2023-03-16 05:32:55 +00:00
06054d7df0 fix random output issue on index_select when src is scalar and index is empty (#96408)
Fix https://github.com/pytorch/pytorch/issues/94340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96408
Approved by: https://github.com/ngimel
2023-03-16 05:30:45 +00:00
3405ac8a08 [TP][DTensor Op] Enable Embedding op for DTensor (#96702)
We enabled col-wise embedding for TP users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96702
Approved by: https://github.com/wanchaol
2023-03-16 05:18:07 +00:00
44c9ecad8d fix flop formulas for sdpa (#96690)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96690
Approved by: https://github.com/drisspg
2023-03-16 04:55:56 +00:00
8c2341c1b9 Remove pytest block list (#96698)
Enables the last few files under pytest.

xdist was causing problems with `profiler/test_profiler` `test_source_multithreaded` due to creating extra threads.  Luckily we don't use it so we can disable it with `-p no:xdist`, but this is incompatible with pytest-rerunfailures==10.2, so upgrade to 10.3.  I'd update the windows ami but idk how.

`dynamo/test_optimizers` and `dynamo/test_repros` both had tests that used skip_if_pytest.  https://github.com/pytorch/pytorch/pull/93251/files suggests that it is due to pytest assertion rewriting, so I added `PYTEST_DONT_REWRITE` to their module docstrings to prevent pytest from rewriting assertions.
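
For illustration, the marker sits in the module docstring roughly like this (a sketch, not the actual test files):
```python
"""Tests for some module (illustrative).

PYTEST_DONT_REWRITE: pytest skips assertion rewriting for this module.
"""

def test_example():
    assert 1 + 1 == 2  # this assert is left as-is rather than rewritten by pytest
```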

Disabling tests by issue in `dynamo/test_dynamic_shapes` seems sane.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96698
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-16 04:22:42 +00:00
3162f71787 [memory debugging] Extract frame information from inductor (#95753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95753
Approved by: https://github.com/Chillee
2023-03-16 04:12:54 +00:00
e74f70d212 Revert "Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"" (#96878)
This reverts commit e1ea584b1caf9c50de25ce69396dfeb523a452c0.
Adds __has_include check to fix fbcode build.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96878
Approved by: https://github.com/ezyang
2023-03-16 04:12:54 +00:00
1f340df33c [vision hash update] update the pinned vision hash (#96906)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96906
Approved by: https://github.com/pytorchbot
2023-03-16 02:59:05 +00:00
3606f59366 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-16 02:54:18 +00:00
308a58ebca [FSDP] Rename to _get_orig_buffer_dtypes (#96790)
Reland this PR

Differential Revision: [D44078430](https://our.internmc.facebook.com/intern/diff/D44078430/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96790
Approved by: https://github.com/awgu
2023-03-16 00:31:29 +00:00
71adb32ddc [DDP] API to get data parallel parameters (#95097)
Add a private API to retrieve data parallel parameters. This is
useful for example for apply_optimizer_in_backward in the case user wishes to
ensure it is applied only on DDP managed parameters.

Differential Revision: [D43383878](https://our.internmc.facebook.com/intern/diff/D43383878/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95097
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
2023-03-16 00:30:37 +00:00
3ce9aac786 Add environment variable to force flattening of 3D input tensor (#96761)
Adding an environment variable `TORCH_LINEAR_FLATTEN_3D` to force flattening of 3D input tensor even when it is non-contiguous.

Today, the `Linear` op flattens a 3D input tensor if it is contiguous.

It was found that even for some non-contiguous inputs (esp. with BF16 data type), flattening would also yield higher performance.
For example:
```
import torch

x_size = (3072, 1196, 128)
x = torch.rand(x_size, device="cuda", dtype=torch.bfloat16)
x = torch.transpose(x, 1, 2)  # now non-contiguous, shape (3072, 128, 1196)
weight = torch.rand(1196, 1196, device="cuda", dtype=torch.bfloat16)  # added so the snippet runs
bias = torch.rand(1196, device="cuda", dtype=torch.bfloat16)
torch._C._nn.linear(x, weight, bias)
```

Since the detailed auto-tuning is unknown, this PR adds an environment variable for users to make a choice.
(Default value is still 0.)
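
A usage sketch (the `"1"`/`"0"` on/off convention is my assumption based on the default above): set the variable in the environment before running the workload.
```python
import os

# Opt in to flattening non-contiguous 3D inputs in Linear (default "0" leaves it off).
os.environ["TORCH_LINEAR_FLATTEN_3D"] = "1"
```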
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96761
Approved by: https://github.com/ngimel
2023-03-16 00:24:09 +00:00
dcafe3f271 Updates to the release notes scripts and documentation (#94560)
# Summary
This PR makes some significant changes to the release notes scripts. At a high level:
- Turned the quips into docs and updated links
- Updated the common.categorizes list in the hope of making it the source of truth for releases. This is hard since the release_notes labels can be changed at will; an alternative would be to poll the GitHub API, but I think that is overkill. The notebook does a set compare and will show you new categories. I think we want this to be manual so that the release notes engineer decides how to categorize.
- Created category groups after speaking with folks on distributed and AO, who told me these release categories can be merged.
- I am the newest person on Core and don't use ghstack, so I made the token retrieval a little more generic.
- Added a classifier.py file that trains a commit categorizer for you, hopefully with decent accuracy; I was able to achieve 75%. I drop the highest-frequency class, "skip", since this yields a more useful categorizer.
- Updated the categorize.py script so that the prompt defaults to what the classifier predicts, gated by a flag.
- Added a README that will hopefully help future release notes engineers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94560
Approved by: https://github.com/albanD
2023-03-16 00:09:26 +00:00
731bb6e61b Fix periodic job by excluding check_graph_breaks (try 2) (#96803)
**note about second try**
First try (https://github.com/pytorch/pytorch/pull/96780) was reverted because while it fixed periodic,
it broke inductor cpu-accuracy (which strangely didn't show up as failures on this PR).  This try keeps the cpu-accuracy filter and also adds the inductor filter to get rid of periodic jobs.

**the actual PR desc**
It's going to be harder to properly support check_graph_breaks across multiple baselines.

Periodic and Inductor workflows are different baselines since they include different sets of models.

It's not as simple as checking in the csv for the superset (periodic), because update_expected.py is designed to run given the sha of your failing PR and reset the baseline to that PR's artifacts. This is a nice workflow, and would be harder to manage if it had to always point to a periodic job.

For now just do the check on the inductor job and ignore the other models covered only on periodic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96803
Approved by: https://github.com/desertfire
2023-03-15 23:24:47 +00:00
6d62134f2c fix aminmax output resize issue when input is a zero dimension tensor (#96171)
Fix https://github.com/pytorch/pytorch/issues/96042

### before
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
__main__:1: UserWarning: An output with one or more elements was resized since it had shape [], which does not match the required output shape [1]. This behavior is deprecated, and in a future PyTorch release outputs will not be resized unless they have zero elements. You can explicitly reuse an out tensor t by resizing it, inplace, to zero elements with t.resize_(0). (Triggered internally at ../aten/src/ATen/native/Resize.cpp:24.)
torch.return_types.aminmax(
min=tensor([1]),
max=tensor([1]))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
```
### after
```
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=True)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))
>>> torch.aminmax(torch.tensor(1, device='cpu'), dim=0, keepdim=False)
torch.return_types.aminmax(
min=tensor(1),
max=tensor(1))

```

Marked the following test as expected_fail:
`test_vmap.py TestVmapOperatorsOpInfoCPU.test_op_has_batch_rule_aminmax_cpu_float32`

Given an input of shape (2,), the loop output has shape (2,) while the batched vmap output has shape (2, 1), which mismatch.
The loop path computes twice on a tensor of shape (): without this patch each output has shape (1,) and they are stacked into (2, 1); with this patch each output has shape () and they are stacked into (2,).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96171
Approved by: https://github.com/jgong5, https://github.com/ngimel, https://github.com/zou3519
2023-03-15 22:44:13 +00:00
7c525823c7 Remove un-used list. And disable pytest for public binding test. (#96684)
This contains a temporary change to make sure the test fails nicely now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96684
Approved by: https://github.com/clee2000
2023-03-15 22:12:00 +00:00
f3db2a6341 Expose API to specify custom context manager for checkpoint (#96783)
Per [design](https://docs.google.com/document/d/1v-yqRqiWA6dIUOw5OpqFs2PqSQIbDEkwRPGk9FcYnxg/edit) we want (1) to allow the user to pass in a function that returns two context managers (2) a per-call API only for now, and (3) do not upstream selective checkpoint for the short term.
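
A rough sketch of the per-call usage described above; the keyword name `context_fn` and exact signature are assumptions based on this description rather than a confirmed API:
```python
import contextlib
import torch
from torch.utils.checkpoint import checkpoint

def context_fn():
    # returns two context managers: one for the original forward,
    # one for the recomputation during backward
    return contextlib.nullcontext(), contextlib.nullcontext()

x = torch.randn(4, 4, requires_grad=True)
out = checkpoint(lambda t: t.sin().cos(), x, use_reentrant=False, context_fn=context_fn)
out.sum().backward()
```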
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96783
Approved by: https://github.com/albanD
2023-03-15 20:37:33 +00:00
ac7329b323 Add exceptionhandler to more distributed_c10d APIs (#96770)
Summary: Adding exception handler to a few more APIs so that internal errors are logged to the c10d errors scuba table

Test Plan: sandcastle

Differential Revision: D44068557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96770
Approved by: https://github.com/wz337
2023-03-15 20:31:46 +00:00
1716709d46 [CUDA] Use accumulate type to improve accuracy of grid_sample on half precision inputs [v2] (#96586)
Fixes #96429

This PR is also a follow up for #90427. In that PR, we also discussed whether calculations of grid indices `grid_sampler_compute_source_index` should also be upcasted to `opmath_t` https://github.com/pytorch/pytorch/pull/90427/files#r1048876708. Due to another unit test failure, we didn't upcast those calculations in that PR.

After some investigation, I found that the inaccurate results have nothing to do with the internals of `affine_grid`, even if it is calculated using `double` internally. As long as the input `grid` is passed to `grid_sample` in **half** precision, the results will be less inaccurate than with a **float** `grid`. This can be verified with a short C++ program like the one below (by setting `TYPE_T` to `__half` or `float` at compile time):

```cpp
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>

#include <iostream>

#ifndef TYPE_T
    #define TYPE_T float
#endif

int main() {
    using type_t = TYPE_T;
    type_t d = static_cast<__half>((double)2.0 / 3.0);
    type_t s = (((float)d + 1.f) * 3 - 1) / 2;

    printf("%.15f %.15f\n", (double)d, (double)s);
}
```

Outputs are
```
./float.out
0.666503906250000 1.999755859375000

./half.out
0.666503906250000 2.000000000000000
```

To resolve the discussion back in https://github.com/pytorch/pytorch/pull/90427/files#r1048876708, I've also increased the test tolerance in the failed unit test `issue_24823_1(torch.half)`.

For the original script in #96429, I got more accurate results with `align_corners = True`
```
align_corners = True
Expected result has mean absolute value of 0.5285 and maximum absolute value of 3.2067.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.

align_corners = False
Expected result has mean absolute value of 0.5189 and maximum absolute value of 3.0101.
Half precision result is off by 0.0001 (0.02%) on average and 0.0010 (0.03%) at maximum.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96586
Approved by: https://github.com/ngimel
2023-03-15 19:25:20 +00:00
54cd4a67d0 Output peak memory stats from dynamo torchbench perf CI (#95666)
Adds absolute memory usage numbers (in addition to compression ratio) to performance jobs.

Example output:
<img width="1211" alt="image" src="https://user-images.githubusercontent.com/4984825/225419950-500908c5-00ce-4711-afa2-c995bf90d35d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95666
Approved by: https://github.com/ezyang, https://github.com/williamwen42
2023-03-15 19:24:47 +00:00
445863128b Use .to instead of contiguous to generate channels last tensor (#96791)
Fix for https://github.com/pytorch/pytorch/issues/95693.

From https://pytorch.org/tutorials/intermediate/memory_format_tutorial.html:
> There are minor difference between the two APIs to and contiguous. We suggest to stick with to when explicitly converting memory format of tensor.
For general cases the two APIs behave the same. However in special cases for a 4D tensor with size NCHW when either: C==1 or H==1 && W==1, only to would generate a proper stride to represent channels last memory format.
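
For reference, a small sketch of the quoted special case (C == 1), where only `.to` produces channels_last strides (the stride values in the comments are what I would expect, not verified output):
```python
import torch

x = torch.randn(2, 1, 4, 4)  # N=2, C=1 hits the ambiguous case from the quote above
a = x.contiguous(memory_format=torch.channels_last)
b = x.to(memory_format=torch.channels_last)
print(a.stride())  # likely unchanged NCHW strides, e.g. (16, 16, 4, 1)
print(b.stride())  # proper channels_last strides, e.g. (16, 1, 4, 1)
```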

We hit this case in convolution_backward when calling `contiguous()`. Even though we had determined that we should run the backward in channels_last, as FakeTensor had gathered from the output of [determine_backend_memory_format](https://github.com/pytorch/pytorch/blob/master/torch/_subclasses/fake_tensor.py#L559), we were still outputting a contiguous tensor. That led to the mismatch in strides in the issue.

Should we be calling `to` instead of `contiguous` more liberally throughout the codebase, especially in convolution-related code? Not sure if there are reasons not to do this.

Another fix would be to update `cudnn_conv_suggest_memory_format` so that it would output a contiguous_format in this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96791
Approved by: https://github.com/ngimel
2023-03-15 19:12:04 +00:00
24c49dbf14 [functorch] batch rule : few decomposition ops (#96744)
Fixes https://github.com/pytorch/pytorch/issues/96741

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96744
Approved by: https://github.com/zou3519
2023-03-15 18:55:05 +00:00
9b1b3fdd2d add to functorch codeowner (#96746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96746
Approved by: https://github.com/zou3519
2023-03-15 18:52:22 +00:00
11e708dd6b [doc] fix torch.cuda.mem_get_info doc (#96621)
The current `torch.cuda.mem_get_info` doc is incorrect: this util returns `free, total`, not `free, used`.

```
__host__ ​cudaError_t cudaMemGetInfo ( size_t* free, size_t* total )
    Gets free and total device memory.
```
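
For example, a quick sketch of the corrected semantics:
```python
import torch

if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes: (free, total), not (free, used)
    used = total - free
    print(f"used {used / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")
```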

Also this util isn't mentioned in https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management - should it be included there as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96621
Approved by: https://github.com/kit1980
2023-03-15 18:11:00 +00:00
6339ee5d23 Temporarily disable test_ops_jit on Windows (#96859)
See https://github.com/pytorch/pytorch/issues/96858
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96859
Approved by: https://github.com/kit1980
2023-03-15 17:51:32 +00:00
aa09f8891c add 2d tests to ci (#96711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96711
Approved by: https://github.com/huydhn, https://github.com/fduwjj
2023-03-15 17:18:47 +00:00
5612aa6acd Fixes a layer_norm_nested backwards edge case. (#96788)
# Summary
Add a test and a fix for the case where the input NT to layernorm doesn't require grad.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96788
Approved by: https://github.com/cpuhrsch
2023-03-15 17:16:13 +00:00
80e8e41ca7 Fix type hint for torch.Tensor.grad_fn (#96804)
Fix type hint for `torch.Tensor.grad_fn`, which can be a `torch.autograd.graph.Node` or `None`.

This is a regression in `torch` 2.0. It causes `mypy` failures in downstream projects.
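
A small sketch of what the corrected hint lets `mypy` accept:
```python
from typing import Optional

import torch
from torch.autograd.graph import Node

x = torch.randn(3, requires_grad=True)
y = x.sum()
fn: Optional[Node] = y.grad_fn   # a Node for non-leaf results...
assert x.grad_fn is None         # ...and None for leaf tensors
```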

Ref:

- https://github.com/pytorch/pytorch/issues/94937#issuecomment-1469344993
- metaopt/torchopt#149
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96804
Approved by: https://github.com/Skylion007
2023-03-15 17:14:05 +00:00
a7d2e451fd Fix build, shadowed variable (#96778)
Had an internal build error with this

Differential Revision: [D44071892](https://our.internmc.facebook.com/intern/diff/D44071892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96778
Approved by: https://github.com/Chillee, https://github.com/voznesenskym
2023-03-15 16:41:06 +00:00
e9d9151eec [aot autograd] avoid cloning some inputs unnecessarily when they dont require grad (#96342)
When constructing the joint graph, we normally have to clone any inputs that are mutated, so that we can pass in the original, pre-mutation inputs as leaves to autograd.

Previously, we were doing this for all mutated inputs - but we only need to do it for inputs that require gradients and participate in autograd.

Hopefully this should speed up code like batch norm - I think before this we were unnecessarily cloning the running stats during training.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96342
Approved by: https://github.com/albanD, https://github.com/ezyang
2023-03-15 16:34:04 +00:00
3cd9c7a16d [aot autograd] refactor to make functionalization self-contained (#96341)
This refactor should make it easier to add an export hook into aot autograd.

(1) I killed `create_forward_or_joint_functionalized()` (and the functions that it called, like `forward_or_joint()`) which used to handle autograd + functionalization all-in-one-go for the joint case, and was also used in the inference case.

I added a few separate helper functions:

`create_functionalized_graph()`: this takes a flat fn, and returns a functionalized fx graph. It is mostly just a thin wrapper around functionalization + make_fx(), but also has some extra logic to manually append `copy_()` ops to the end of the graph.

`fn_no_extra_mutations()`: this creates the fn that we want to trace in the inference code path. It takes in a function that it then calls, and returns the outputs + any (updated) mutated inputs.

`joint_fn_no_external_mutations()`: this creates the fn that we want to trace in the joint code path. It takes in a function, and traces out its joint. It also does the work of cloning inputs that are mutated and require gradients, returning mutated inputs as outputs, and returning intermediate bases as outputs

We should be able to add an export hook by basically adding a similar version of `joint_fn_no_external_mutations` but with a lot more restrictions (guaranteed to have no tangents, not synthetic bases, etc), and calling `create_functionalized_graph()` on it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96341
Approved by: https://github.com/ezyang
2023-03-15 16:34:04 +00:00
8e6287264d [Optim in backward] register_hook=False API (#95096)
Use this API to avoid registering hooks for applications that do their
own custom logic. This eliminates the need for DDP to have to de-register these
hooks.

Differential Revision: [D43383794](https://our.internmc.facebook.com/intern/diff/D43383794/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43383794/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95096
Approved by: https://github.com/zhaojuanmao
2023-03-15 14:33:13 +00:00
8ce8d49cc4 aot autograd: consolidate metadata (#96340)
Another bonus of factoring the synthetic_base logic into one place: we used to have a `CompiledRuntimeMetadata` object that encapsulated `ViewAndMutationMeta`, plus a bunch of extra synthetic base metadata that was plumbed around. Now I can kill that first metadata object, and use `ViewAndMutationMeta` on its own everywhere.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96340
Approved by: https://github.com/ezyang
2023-03-15 13:45:45 +00:00
070cefaef9 aot_autograd: dont requires_grad on tangents (#96339)
Ed pointed it out a few days ago - I probably added this mistakenly a few months ago. I can't think of any reason it's necessary, and removing it doesn't cause any tests to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96339
Approved by: https://github.com/ezyang
2023-03-15 13:45:45 +00:00
a269469982 aot autograd refactor: make all synthetic base logic layered in a single location (#96235)
This refactor doesn't significantly change LoC in aot autograd, but I think it nets out to being clearer (interested in people's thoughts).

The idea is that I tried to re-write the part of aot autograd that deals with synthetic bases in a layered way, similar to how Ed wrote the logic for dedup'ing inputs: it happens in one place, and all of the downstream transformation in aot autograd don't have to worry about it.

Specifically, I added a new function `aot_wrapper_synthetic_base`, similar to the existing `aot_wrapper_dedupe`.

The benefit: none of the other code in aot autograd needs to think about synthetic bases (previously, synthetic base code was intertwined in several places).

The downsides: there are two.

(1) `aot_wrapper_synthetic_base()` needs to have its own epilogue. There is one particularly hairy case, where factoring the synthetic base logic to a single location was painful: If you have two inputs that alias each other, where one gets a data mutation, and the other gets a metadata mutation.

Ordinarily, metadata mutations are handled by the runtime epilogue, in `create_runtime_wrapper`. However, now that things are factored this way, the runtime wrapper operates only on synthetic bases instead of operating on the original inputs. For data mutations, it is fine to apply the data mutation to the synthetic base instead of the original input alias. But for metadata mutations, we **need** to apply the metadata mutation directly to the original inputs.
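
To make that case concrete, a user program hitting it might look roughly like this sketch (illustrative only, not the aot_autograd internals):
```python
import torch

def f(a, b):
    a.mul_(2)            # data mutation on one alias
    b.transpose_(0, 1)   # metadata mutation on the other alias
    return a + b.transpose(0, 1)

base = torch.randn(4, 4)
a, b = base.view(4, 4), base.view(4, 4)  # two inputs aliasing the same storage
out = f(a, b)
```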

The way that I handled this was by tracking which inputs slot into this specific case (part of a synthetic base, and get metadata mutations), and updating the flat_fn() that we pass downstream to return these updated inputs as extra outputs. From the perspective of downstream logic, these are real user outputs, which it can treat like any other user outputs. `aot_wrapper_synthetic_base` will know to grab these extra outputs and use them to apply the metadata mutations.

This was pretty annoying, but has the benefit that all of that logic is encapsulated entirely in `aot_wrapper_synthetic_base()`.

(2) input mutations are now performed on the synthetic base instead of the individual aliases.

You can see the original code comment [here](b0b5f3c6c6/torch/_functorch/aot_autograd.py (L1131)) for details. We used to do the optimized thing in this case, and now we do the less optimized thing (copying the entire synthetic base, instead of the potentially smaller alias).

To be fair, we had no data showing that this optimization was showing improvements on any models in practice. I also think that the main reason anyone would ever run across this problem is because of a graph break - so if you care about perf, you probably want to avoid the extra graph breaks to begin with. I haven't added any warnings for this, but we probably could depending on what people think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96235
Approved by: https://github.com/ezyang
2023-03-15 13:45:43 +00:00
7a076b7b93 [aot_autograd] only perform functionalization analysis pass once (#95992)
For a while now, we've been re-running our functionalization analysis pass twice - once to get metadata when dedup'ing, and an entire second time during aot_dispatch_base/autograd.

This should also probably speed up compile times pretty noticeably, since we're going from:

(a) inference-only trace case: 3 fw traces -> 2 fw traces
(b) autograd trace case: 2 fw traces + 1 joint trace -> 1 fw trace + 1 joint trace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95992
Approved by: https://github.com/ezyang
2023-03-15 13:45:40 +00:00
e1ea584b1c Revert "[memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)"
This reverts commit 4e1060c609c094fd5f58041ebed803f74410ee36.

Reverted https://github.com/pytorch/pytorch/pull/95541 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-15 13:28:41 +00:00
33c7be360f [reland][CI] switch torchbench to a pinned version (#96782)
Summary: This is reland of https://github.com/pytorch/pytorch/pull/96553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96782
Approved by: https://github.com/huydhn
2023-03-15 12:46:36 +00:00
3fd24fb608 COO intersection: allow external hash + hash reuse in sparse_mask (#94596)
External hash implies more flexibility in the op coverage + performance improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94596
Approved by: https://github.com/cpuhrsch
2023-03-15 09:11:14 +00:00
93f7996995 [MPS] Fix the MacOS 13.3 selector check. (#96733)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96733
Approved by: https://github.com/DenisVieriu97
2023-03-15 06:43:18 +00:00
60a68477a6 Bump black version to 23.1.0 (#96578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96578
Approved by: https://github.com/ezyang
2023-03-15 06:27:59 +00:00
a229e78544 [BE] Enforce sign-compare (#96723)
A number of OSS PRs were reverted because of new signed-unsigned comparison warnings, which are treated as errors in some internal builds.
Not sure how those selective rules are applied, but this PR removes `-Wno-sign-compare` from the PyTorch codebase.

The only tricky part in this PR was making sure that non-ASCII character detection works for both signed and unsigned chars here:
6e3d51b08a/torch/csrc/jit/serialization/python_print.cpp (L926)

Several files are excluded from sign-compare when flash attention is used, due to a violation in cutlass that is to be fixed by https://github.com/NVIDIA/cutlass/pull/869.
This PR does not try to fix sign-compare violations in the caffe2 codebase.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96723
Approved by: https://github.com/albanD
2023-03-15 06:04:20 +00:00
96c745dfdc Fix int() casting in torch.nn.RNN to have correctly traced JIT and ONNX graph. (#92970)
Signed-off-by: Boris Fomitchev <bfomitchev@nvidia.com>

Fixes #91351

As for unit tests: in this PR I only fixed the LSTM unit test to properly use dynamic axes and expose the export issue by running the test with the same ONNX model on additional inputs.
If the changes are approved, we should also fix the rest of the tests (RNN/GRU and beyond).

I have verified the following updated tests are working with new code and failing with the old code:
test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset_version_14_is_script_False_keep_initializers_as_inputs_True::test_rnn_name_lstm_nonlinearity_None_unilayer_bidirectional_no_initial_state_with_variable_length_sequences_with_dropout
test/onnx/test_pytorch_onnx_onnxruntime.py::TestONNXRuntime_opset_version_14_is_script_False_keep_initializers_as_inputs_True::test_rnn_name_lstm_nonlinearity_None_unilayer_bidirectional_with_initial_state_with_variable_length_sequences_with_dropout

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92970
Approved by: https://github.com/titaiwangms, https://github.com/kit1980
2023-03-15 05:33:41 +00:00
d3a38bdd47 Revert "Update xnnpack to the latest commit (#95884)"
This reverts commit 0da89664cc21b14b84c5d438d358278699f8a51e.

Reverted https://github.com/pytorch/pytorch/pull/95884 on behalf of https://github.com/kit1980 due to Broke buck-build-test https://github.com/pytorch/pytorch/actions/runs/4421715166/jobs/7752808844#logs
2023-03-15 05:32:36 +00:00
6110effa86 Rework torch.compile docs (#96706)
Chatted with @stas00 on slack and here are some great improvements he suggested to the compile docs

- [x] Rename `dynamo` folder to `compile`
- [x] Link `compile` docstring on `torch.html` to main index page for compile
- [x] Create a new index page that describes why people should care
  - [x] easy perf, memory reduction, 1 line
  - [x] Short benchmark table
  - [x] How to guide
  - [x] TOC that links to the more technical pages folks have written, make the existing docs we have a Technical overview
- [x] Highlight the new APIs for `torch._inductor.list_options()` and `torch._inductor.list_mode_options()` - clarify these are inductor specific and add more prose around which ones are most interesting
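
The two inductor-specific helpers from the last item can already be poked at directly, e.g.:
```python
import torch._inductor as inductor

print(inductor.list_mode_options())  # per-mode presets (e.g. what "max-autotune" toggles)
print(inductor.list_options())       # the full set of inductor config knobs
```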

He also highlighted an interesting way to think about who reads these docs:

- [x] End users, that just want things to run fast
- [x] Library maintainers wrapping torch.compile, who care, for example, about when in their code they should compile a model and which backends are supported
- [x] Debuggers, whose needs are somewhat addressed by the troubleshooting guide and FAQ, though those could be dramatically reworked to say what we expect to break

And in a separate PR I'll work on the below with @SherlockNoMad
- [ ] Authors of new backends that care about how to plug into dynamo or inductor layer so need to explain some more internals like
  - [ ] IR
  - [ ] Where to plugin, dynamo? inductor? triton?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96706
Approved by: https://github.com/svekars
2023-03-15 04:41:13 +00:00
2795233668 Revert "Fix periodic job by excluding check_graph_breaks (#96780)"
This reverts commit 8ec9beacec54fb6c56102f60180063f0fca7f24c.

Reverted https://github.com/pytorch/pytorch/pull/96780 on behalf of https://github.com/wconstab due to broke trunk builds that didn't run on PR? didn't see those trunk failures on PR CI even with trunk label
2023-03-15 04:30:47 +00:00
85639c1a88 [allocator] Generalize recording to a pool (#96542)
Previously the allocator would query whether a stream was recording a graph,
and look up the pool associated with a graph. This change has the allocator
directly associate a stream with a mempool, decoupling "record this stream to a pool"
from the action of "record all actions to a cuda graph".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96542
Approved by: https://github.com/eellison
2023-03-15 04:28:49 +00:00
e01b092705 inductor: don't remember pre-loop order if pre loop has loop collapse (#96640)
Given the following case from timm **ese_vovnet19b_dw**:

```
import torch
import torch._dynamo

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = torch.nn.Conv2d(256, 256, kernel_size=1, padding=0)
        self.conv2 = torch.nn.Conv2d(256, 256, kernel_size=1, padding=0)
        self.pool = torch.nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)

    def forward(self, x):
        x = self.conv1(x)
        x2 = self.conv2(x)
        y = x2 * x
        return self.pool(y)

model = Model().to(memory_format=torch.channels_last).eval()
x = torch.randn(128, 256, 56, 56).to(memory_format=torch.channels_last)
opt_model = torch._dynamo.optimize('inductor')(model)
with torch.no_grad():
    for i in range(2):
        y = opt_model(x)
```

Before this PR, the max pooling could not be vectorized:

```
extern "C" void kernel(float* in_out_ptr0,
                       const float* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 16*i0);
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(in_out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_out_ptr0[i0];
                auto tmp2 = tmp0 * tmp1;
                in_out_ptr0[i0] = tmp2;
            }
        }
        {
            #pragma omp for
            for(long i0=0; i0<128; i0+=1)
            {
                #pragma GCC ivdep
                for(long i1=0; i1<256; i1+=1)
                {
                    #pragma GCC ivdep
                    for(long i2=0; i2<28; i2+=1)
                    {
                        #pragma GCC ivdep
                        for(long i3=0; i3<28; i3+=1)
                        {
                            auto tmp0 = static_cast<long>(2*i2);
                            auto tmp1 = static_cast<long>(0);
                            auto tmp2 = tmp0 >= tmp1;
                            auto tmp3 = static_cast<long>(56);
                            auto tmp4 = tmp0 < tmp3;
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = static_cast<long>(2*i3);
                            auto tmp7 = tmp6 >= tmp1;
                            auto tmp8 = tmp6 < tmp3;
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = in_out_ptr0[i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp12;
                            }
                            ;
                            auto tmp13 = tmp10 ? tmp11() : -std::numeric_limits<decltype(tmp11())>::infinity();
                            auto tmp14 = static_cast<long>(1 + (2*i3));
                            auto tmp15 = tmp14 >= tmp1;
                            auto tmp16 = tmp14 < tmp3;
                            auto tmp17 = tmp15 & tmp16;
                            auto tmp18 = tmp5 & tmp17;
                            auto tmp19 = [&]
                            {
                                auto tmp20 = in_out_ptr0[256 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp20;
                            }
                            ;
                            auto tmp21 = tmp18 ? tmp19() : -std::numeric_limits<decltype(tmp19())>::infinity();
                            auto tmp22 = (tmp13 != tmp13) ? tmp13 : std::max(tmp21, tmp13);
                            auto tmp23 = static_cast<long>(2 + (2*i3));
                            auto tmp24 = tmp23 >= tmp1;
                            auto tmp25 = tmp23 < tmp3;
                            auto tmp26 = tmp24 & tmp25;
                            auto tmp27 = tmp5 & tmp26;
                            auto tmp28 = [&]
                            {
                                auto tmp29 = in_out_ptr0[512 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp29;
                            }
                            ;
                            auto tmp30 = tmp27 ? tmp28() : -std::numeric_limits<decltype(tmp28())>::infinity();
                            auto tmp31 = (tmp22 != tmp22) ? tmp22 : std::max(tmp30, tmp22);
                            auto tmp32 = static_cast<long>(1 + (2*i2));
                            auto tmp33 = tmp32 >= tmp1;
                            auto tmp34 = tmp32 < tmp3;
                            auto tmp35 = tmp33 & tmp34;
                            auto tmp36 = tmp35 & tmp9;
                            auto tmp37 = [&]
                            {
                                auto tmp38 = in_out_ptr0[14336 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp38;
                            }
                            ;
                            auto tmp39 = tmp36 ? tmp37() : -std::numeric_limits<decltype(tmp37())>::infinity();
                            auto tmp40 = (tmp31 != tmp31) ? tmp31 : std::max(tmp39, tmp31);
                            auto tmp41 = tmp35 & tmp17;
                            auto tmp42 = [&]
                            {
                                auto tmp43 = in_out_ptr0[14592 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp43;
                            }
                            ;
                            auto tmp44 = tmp41 ? tmp42() : -std::numeric_limits<decltype(tmp42())>::infinity();
                            auto tmp45 = (tmp40 != tmp40) ? tmp40 : std::max(tmp44, tmp40);
                            auto tmp46 = tmp35 & tmp26;
                            auto tmp47 = [&]
                            {
                                auto tmp48 = in_out_ptr0[14848 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp48;
                            }
                            ;
                            auto tmp49 = tmp46 ? tmp47() : -std::numeric_limits<decltype(tmp47())>::infinity();
                            auto tmp50 = (tmp45 != tmp45) ? tmp45 : std::max(tmp49, tmp45);
                            auto tmp51 = static_cast<long>(2 + (2*i2));
                            auto tmp52 = tmp51 >= tmp1;
                            auto tmp53 = tmp51 < tmp3;
                            auto tmp54 = tmp52 & tmp53;
                            auto tmp55 = tmp54 & tmp9;
                            auto tmp56 = [&]
                            {
                                auto tmp57 = in_out_ptr0[28672 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp57;
                            }
                            ;
                            auto tmp58 = tmp55 ? tmp56() : -std::numeric_limits<decltype(tmp56())>::infinity();
                            auto tmp59 = (tmp50 != tmp50) ? tmp50 : std::max(tmp58, tmp50);
                            auto tmp60 = tmp54 & tmp17;
                            auto tmp61 = [&]
                            {
                                auto tmp62 = in_out_ptr0[28928 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp62;
                            }
                            ;
                            auto tmp63 = tmp60 ? tmp61() : -std::numeric_limits<decltype(tmp61())>::infinity();
                            auto tmp64 = (tmp59 != tmp59) ? tmp59 : std::max(tmp63, tmp59);
                            auto tmp65 = tmp54 & tmp26;
                            auto tmp66 = [&]
                            {
                                auto tmp67 = in_out_ptr0[29184 + i1 + (512*i3) + (28672*i2) + (802816*i0)];
                                return tmp67;
                            }
                            ;
                            auto tmp68 = tmp65 ? tmp66() : -std::numeric_limits<decltype(tmp66())>::infinity();
                            auto tmp69 = (tmp64 != tmp64) ? tmp64 : std::max(tmp68, tmp64);
                            out_ptr0[i1 + (256*i3) + (7168*i2) + (200704*i0)] = tmp69;
                        }
                    }
                }
            }
        }
    }
}
```

We always avoid reordering when pre-loop has a loop collapse: 2cbce06fee/torch/_inductor/ir.py (L2267-L2273).

This PR adds a check so that the pre-loop ordering is only reused when there is no loop collapse.

After this PR, the generated code is
```

extern "C" void kernel(float* in_out_ptr0,
                       const float* in_ptr0,
                       float* out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 16*i0);
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(in_out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = in_out_ptr0[i0];
                auto tmp2 = tmp0 * tmp1;
                in_out_ptr0[i0] = tmp2;
            }
        }
        {
            #pragma omp for
            for(long i0=0; i0<128; i0+=1)
            {
                #pragma GCC ivdep
                for(long i1=0; i1<28; i1+=1)
                {
                    #pragma GCC ivdep
                    for(long i2=0; i2<28; i2+=1)
                    {
                        for(long i3=0; i3<16; i3+=1)
                        {
                            auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(2*i1));
                            auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(0));
                            auto tmp2 = tmp0 >= tmp1;
                            auto tmp3 = at::vec::Vectorized<int>(static_cast<int>(56));
                            auto tmp4 = tmp0 < tmp3;
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = at::vec::Vectorized<int>(static_cast<int>(2*i2));
                            auto tmp7 = tmp6 >= tmp1;
                            auto tmp8 = tmp6 < tmp3;
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp12;
                            }
                            ;
                            auto tmp13 = decltype(tmp11())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp11(), to_float_mask(tmp10) != at::vec::Vectorized<float>(0));
                            auto tmp14 = at::vec::Vectorized<int>(static_cast<int>(1 + (2*i2)));
                            auto tmp15 = tmp14 >= tmp1;
                            auto tmp16 = tmp14 < tmp3;
                            auto tmp17 = tmp15 & tmp16;
                            auto tmp18 = tmp5 & tmp17;
                            auto tmp19 = [&]
                            {
                                auto tmp20 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 256 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp20;
                            }
                            ;
                            auto tmp21 = decltype(tmp19())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp19(), to_float_mask(tmp18) != at::vec::Vectorized<float>(0));
                            auto tmp22 = at::vec::maximum(tmp21, tmp13);
                            auto tmp23 = at::vec::Vectorized<int>(static_cast<int>(2 + (2*i2)));
                            auto tmp24 = tmp23 >= tmp1;
                            auto tmp25 = tmp23 < tmp3;
                            auto tmp26 = tmp24 & tmp25;
                            auto tmp27 = tmp5 & tmp26;
                            auto tmp28 = [&]
                            {
                                auto tmp29 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 512 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp29;
                            }
                            ;
                            auto tmp30 = decltype(tmp28())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp28(), to_float_mask(tmp27) != at::vec::Vectorized<float>(0));
                            auto tmp31 = at::vec::maximum(tmp30, tmp22);
                            auto tmp32 = at::vec::Vectorized<int>(static_cast<int>(1 + (2*i1)));
                            auto tmp33 = tmp32 >= tmp1;
                            auto tmp34 = tmp32 < tmp3;
                            auto tmp35 = tmp33 & tmp34;
                            auto tmp36 = tmp35 & tmp9;
                            auto tmp37 = [&]
                            {
                                auto tmp38 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 14336 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp38;
                            }
                            ;
                            auto tmp39 = decltype(tmp37())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp37(), to_float_mask(tmp36) != at::vec::Vectorized<float>(0));
                            auto tmp40 = at::vec::maximum(tmp39, tmp31);
                            auto tmp41 = tmp35 & tmp17;
                            auto tmp42 = [&]
                            {
                                auto tmp43 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 14592 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp43;
                            }
                            ;
                            auto tmp44 = decltype(tmp42())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp42(), to_float_mask(tmp41) != at::vec::Vectorized<float>(0));
                            auto tmp45 = at::vec::maximum(tmp44, tmp40);
                            auto tmp46 = tmp35 & tmp26;
                            auto tmp47 = [&]
                            {
                                auto tmp48 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 14848 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp48;
                            }
                            ;
                            auto tmp49 = decltype(tmp47())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp47(), to_float_mask(tmp46) != at::vec::Vectorized<float>(0));
                            auto tmp50 = at::vec::maximum(tmp49, tmp45);
                            auto tmp51 = at::vec::Vectorized<int>(static_cast<int>(2 + (2*i1)));
                            auto tmp52 = tmp51 >= tmp1;
                            auto tmp53 = tmp51 < tmp3;
                            auto tmp54 = tmp52 & tmp53;
                            auto tmp55 = tmp54 & tmp9;
                            auto tmp56 = [&]
                            {
                                auto tmp57 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 28672 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp57;
                            }
                            ;
                            auto tmp58 = decltype(tmp56())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp56(), to_float_mask(tmp55) != at::vec::Vectorized<float>(0));
                            auto tmp59 = at::vec::maximum(tmp58, tmp50);
                            auto tmp60 = tmp54 & tmp17;
                            auto tmp61 = [&]
                            {
                                auto tmp62 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 28928 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp62;
                            }
                            ;
                            auto tmp63 = decltype(tmp61())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp61(), to_float_mask(tmp60) != at::vec::Vectorized<float>(0));
                            auto tmp64 = at::vec::maximum(tmp63, tmp59);
                            auto tmp65 = tmp54 & tmp26;
                            auto tmp66 = [&]
                            {
                                auto tmp67 = at::vec::Vectorized<float>::loadu(in_out_ptr0 + 29184 + (16*i3) + (512*i2) + (28672*i1) + (802816*i0));
                                return tmp67;
                            }
                            ;
                            auto tmp68 = decltype(tmp66())::blendv(at::vec::Vectorized<float>(-std::numeric_limits<float>::infinity()), tmp66(), to_float_mask(tmp65) != at::vec::Vectorized<float>(0));
                            auto tmp69 = at::vec::maximum(tmp68, tmp64);
                            tmp69.store(out_ptr0 + (16*i3) + (256*i2) + (7168*i1) + (200704*i0));
                        }
                        #pragma omp simd simdlen(8)
                        for(long i3=256; i3<256; i3+=1)
                        {
                            auto tmp0 = static_cast<long>(2*i1);
                            auto tmp1 = static_cast<long>(0);
                            auto tmp2 = tmp0 >= tmp1;
                            auto tmp3 = static_cast<long>(56);
                            auto tmp4 = tmp0 < tmp3;
                            auto tmp5 = tmp2 & tmp4;
                            auto tmp6 = static_cast<long>(2*i2);
                            auto tmp7 = tmp6 >= tmp1;
                            auto tmp8 = tmp6 < tmp3;
                            auto tmp9 = tmp7 & tmp8;
                            auto tmp10 = tmp5 & tmp9;
                            auto tmp11 = [&]
                            {
                                auto tmp12 = in_out_ptr0[i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp12;
                            }
                            ;
                            auto tmp13 = tmp10 ? tmp11() : -std::numeric_limits<decltype(tmp11())>::infinity();
                            auto tmp14 = static_cast<long>(1 + (2*i2));
                            auto tmp15 = tmp14 >= tmp1;
                            auto tmp16 = tmp14 < tmp3;
                            auto tmp17 = tmp15 & tmp16;
                            auto tmp18 = tmp5 & tmp17;
                            auto tmp19 = [&]
                            {
                                auto tmp20 = in_out_ptr0[256 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp20;
                            }
                            ;
                            auto tmp21 = tmp18 ? tmp19() : -std::numeric_limits<decltype(tmp19())>::infinity();
                            auto tmp22 = (tmp13 != tmp13) ? tmp13 : std::max(tmp21, tmp13);
                            auto tmp23 = static_cast<long>(2 + (2*i2));
                            auto tmp24 = tmp23 >= tmp1;
                            auto tmp25 = tmp23 < tmp3;
                            auto tmp26 = tmp24 & tmp25;
                            auto tmp27 = tmp5 & tmp26;
                            auto tmp28 = [&]
                            {
                                auto tmp29 = in_out_ptr0[512 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp29;
                            }
                            ;
                            auto tmp30 = tmp27 ? tmp28() : -std::numeric_limits<decltype(tmp28())>::infinity();
                            auto tmp31 = (tmp22 != tmp22) ? tmp22 : std::max(tmp30, tmp22);
                            auto tmp32 = static_cast<long>(1 + (2*i1));
                            auto tmp33 = tmp32 >= tmp1;
                            auto tmp34 = tmp32 < tmp3;
                            auto tmp35 = tmp33 & tmp34;
                            auto tmp36 = tmp35 & tmp9;
                            auto tmp37 = [&]
                            {
                                auto tmp38 = in_out_ptr0[14336 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp38;
                            }
                            ;
                            auto tmp39 = tmp36 ? tmp37() : -std::numeric_limits<decltype(tmp37())>::infinity();
                            auto tmp40 = (tmp31 != tmp31) ? tmp31 : std::max(tmp39, tmp31);
                            auto tmp41 = tmp35 & tmp17;
                            auto tmp42 = [&]
                            {
                                auto tmp43 = in_out_ptr0[14592 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp43;
                            }
                            ;
                            auto tmp44 = tmp41 ? tmp42() : -std::numeric_limits<decltype(tmp42())>::infinity();
                            auto tmp45 = (tmp40 != tmp40) ? tmp40 : std::max(tmp44, tmp40);
                            auto tmp46 = tmp35 & tmp26;
                            auto tmp47 = [&]
                            {
                                auto tmp48 = in_out_ptr0[14848 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp48;
                            }
                            ;
                            auto tmp49 = tmp46 ? tmp47() : -std::numeric_limits<decltype(tmp47())>::infinity();
                            auto tmp50 = (tmp45 != tmp45) ? tmp45 : std::max(tmp49, tmp45);
                            auto tmp51 = static_cast<long>(2 + (2*i1));
                            auto tmp52 = tmp51 >= tmp1;
                            auto tmp53 = tmp51 < tmp3;
                            auto tmp54 = tmp52 & tmp53;
                            auto tmp55 = tmp54 & tmp9;
                            auto tmp56 = [&]
                            {
                                auto tmp57 = in_out_ptr0[28672 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp57;
                            }
                            ;
                            auto tmp58 = tmp55 ? tmp56() : -std::numeric_limits<decltype(tmp56())>::infinity();
                            auto tmp59 = (tmp50 != tmp50) ? tmp50 : std::max(tmp58, tmp50);
                            auto tmp60 = tmp54 & tmp17;
                            auto tmp61 = [&]
                            {
                                auto tmp62 = in_out_ptr0[28928 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp62;
                            }
                            ;
                            auto tmp63 = tmp60 ? tmp61() : -std::numeric_limits<decltype(tmp61())>::infinity();
                            auto tmp64 = (tmp59 != tmp59) ? tmp59 : std::max(tmp63, tmp59);
                            auto tmp65 = tmp54 & tmp26;
                            auto tmp66 = [&]
                            {
                                auto tmp67 = in_out_ptr0[29184 + i3 + (512*i2) + (28672*i1) + (802816*i0)];
                                return tmp67;
                            }
                            ;
                            auto tmp68 = tmp65 ? tmp66() : -std::numeric_limits<decltype(tmp66())>::infinity();
                            auto tmp69 = (tmp64 != tmp64) ? tmp64 : std::max(tmp68, tmp64);
                            out_ptr0[i3 + (256*i2) + (7168*i1) + (200704*i0)] = tmp69;
                        }
                    }
                }
            }
        }
    }
}
```

After this PR, we get an **18%** performance improvement for timm **ese_vovnet19b_dw** on skx-4148 (```python -m torch.backends.xeon.run_cpu --node_id 0 benchmarks/dynamo/timm_models.py --performance --float32 -dcpu -n50 --inductor  --channels-last --no-skip --dashboard --only ese_vovnet19b_dw```):

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96640
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-15 04:06:37 +00:00
c6a82e4339 [vision hash update] update the pinned vision hash (#96787)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96787
Approved by: https://github.com/pytorchbot
2023-03-15 03:05:28 +00:00
8ec9beacec Fix periodic job by excluding check_graph_breaks (#96780)
It's going to be harder to properly support check_graph_breaks
across multiple baselines.

Periodic and Inductor workflows are different baselines since they include
different sets of models.

It's not as simple as checking in the csv for the superset (periodic),
because `update_expected.py` is designed to run given the sha of your
failing PR and reset the baseline to that PR's artifacts.  This is a
nice workflow, and would be harder to manage if it had to always point to
a periodic job.

For now just do the check on the inductor job and ignore the other models
covered only on periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96780
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-03-15 02:54:53 +00:00
6d4d9840cd Stop including of PassManagerBuilder for llvm >= 15 (#96762)
Summary: LLVM trunk / llvm-16 removes the `PassManagerBuilder.h` file. But we are using the new pass manager for llvm>=15 anyway.

Test Plan: sandcastle

Differential Revision: D44064301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96762
Approved by: https://github.com/bertmaher
2023-03-15 02:22:42 +00:00
a8f36dd646 Revert "add amp support for custom backend (#96188)"
This reverts commit cf12edee02a44009c4f06e36efa97d9a7372ab35.

Reverted https://github.com/pytorch/pytorch/pull/96188 on behalf of https://github.com/kit1980 due to Broke some linalg tests : https://github.com/pytorch/pytorch/actions/runs/4420037607/jobs/7750708339
2023-03-15 00:03:19 +00:00
5396f85c91 Propagate dynamo dynamic_shapes config to backwards (#96771)
This fixes

```
  File "/data/users/ezyang/a/pytorch/torch/_inductor/codegen/triton.py", line 1642, in codegen_node_schedule
    indexing_dtype_strength_reduction(node._body)
  File "/data/users/ezyang/a/pytorch/torch/_inductor/optimize_indexing.py", line 310, in indexing_dtype_strength_reduction
    OptimizeIndexing(loop_body, indices, indexing).run()
  File "/data/users/ezyang/a/pytorch/torch/_inductor/optimize_indexing.py", line 96, in __init__
    self.replace_indirect(k, ValueRanges(0, v))
  File "/data/users/ezyang/a/pytorch/torch/utils/_sympy/value_ranges.py", line 67, in __init__
    upper = simple_sympify(upper)
  File "/data/users/ezyang/a/pytorch/torch/utils/_sympy/value_ranges.py", line 33, in simple_sympify
    assert not e.free_symbols, f"free variables NYI: {e}"
AssertionError: free variables NYI: s0
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96771
Approved by: https://github.com/eellison
2023-03-14 23:45:55 +00:00
0da89664cc Update xnnpack to the latest commit (#95884)
After trying to update cpuinfo submodule to the latest, I saw that an update on xnnpack is also necessary.

Fixes the 2 failing checks on [#95379](https://github.com/pytorch/pytorch/pull/95379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95884
Approved by: https://github.com/Skylion007
2023-03-14 23:25:55 +00:00
707d892564 Debug logging around DDP mixed precision copies (#96438)
Per title

Differential Revision: [D43859976](https://our.internmc.facebook.com/intern/diff/D43859976/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96438
Approved by: https://github.com/zhaojuanmao
2023-03-14 23:06:59 +00:00
b60d6e246e [inductor] Consolidate codegen functions in sizevars.py into wrapper.py (#96654)
Summary: Refactor the code so that wrapper codegen doesn't mix Python and C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96654
Approved by: https://github.com/jansel
2023-03-14 22:55:12 +00:00
037acd5a22 Update CI skips (#96745)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96745
Approved by: https://github.com/wconstab
2023-03-14 22:19:10 +00:00
a198ce6d76 [PyTorch][XNNPACK] Update build files for newly added kernels (#95911)
Same thing as D43747173 with some modifications to make sure internal only kernels are disabled in open source.

Differential Revision: [D43747173](https://our.internmc.facebook.com/intern/diff/D43747173/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43747173/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95911
Approved by: https://github.com/digantdesai
2023-03-14 22:13:24 +00:00
dd5e6e8553 [BE]: Merge startswith calls - rule PIE810 (#96754)
Merges startswith/endswith calls into a single call that takes a tuple of prefixes. Not only are these calls more readable, they are also more efficient, since each string is iterated through only once.
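A minimal sketch of the kind of rewrite this rule performs (hypothetical string and prefixes, not taken from the PR):

```python
# Hypothetical example of the PIE810 rewrite: two prefix checks collapsed
# into one startswith call that takes a tuple.
name = "torch.nn.functional.relu"

# Before: the string may be scanned once per call.
old_match = name.startswith("torch.nn.") or name.startswith("torch.ao.")

# After: a single call with a tuple of prefixes.
new_match = name.startswith(("torch.nn.", "torch.ao."))

assert old_match == new_match
```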
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96754
Approved by: https://github.com/ezyang
2023-03-14 22:05:20 +00:00
be4eaa69c2 Revert "[CI] switch torchbench to a pinned version (#96553)"
This reverts commit 61d6ccd29a1806f75b7604aa55d44a918ea6a3fb.

Reverted https://github.com/pytorch/pytorch/pull/96553 on behalf of https://github.com/desertfire due to land race
2023-03-14 21:39:45 +00:00
2951a75c3a Revert "Update perf smoke test threshold in check_hf_bert_perf_csv.py (#96772)"
This reverts commit 2eed44933b5460623d135c2f453000b1d636c333.

Reverted https://github.com/pytorch/pytorch/pull/96772 on behalf of https://github.com/desertfire due to land race
2023-03-14 21:37:30 +00:00
e7d795dccd [Inductor] aten.{avg_pool2d/max_pool2d_with_indices} arguments can be 1 element tuple (#96727)
Fixes failure from 14k github models: ```pytest ./generated/test_ProGamerGov_neural_dream.py -k test_000```
Error:
```
......
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 357, in call_function
    raise LoweringException(e, target, args, kwargs).with_traceback(
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/graph.py", line 354, in call_function
    out = lowerings[target](*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 228, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/scratch/ybliang/work/repos/pytorch/torch/_inductor/lowering.py", line 3124, in avg_pool2d
    assert len(padding) == 2
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
LoweringException: AssertionError:
  target: aten.avg_pool2d.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cuda', torch.float32, size=[4, 4, 64, 64], stride=[16384, 4096, 64, 1]))
  ))
  args[1]: [7, 7]
  args[2]: [1, 1]
  args[3]: [0]
  args[4]: False
  args[5]: False

```
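A minimal sketch of the kind of argument normalization such a fix needs; the helper name and exact behavior are assumptions for illustration, not the PR's code:

```python
# Hypothetical helper: expand a 1-element pooling argument (e.g. padding=[0])
# to the spatial rank expected by the lowering's length-2 assertion.
def expand_pool_arg(arg, ndim=2):
    arg = list(arg)
    if len(arg) == 1:
        arg = arg * ndim
    assert len(arg) == ndim
    return arg

print(expand_pool_arg([0]))     # [0, 0] -- the failing case above
print(expand_pool_arg([1, 1]))  # [1, 1] -- already conformant
```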

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96727
Approved by: https://github.com/jansel
2023-03-14 21:34:30 +00:00
784dd583a6 Automatically register/clear dynamo profiler hooks while profiling (#96199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96199
Approved by: https://github.com/jansel
2023-03-14 21:19:33 +00:00
2eed44933b Update perf smoke test threshold in check_hf_bert_perf_csv.py (#96772)
Reduce the threshold a little further due to runner-to-runner performance variations, e.g. https://github.com/pytorch/pytorch/actions/runs/4419276220/jobs/7747985757 and https://github.com/pytorch/pytorch/actions/runs/4419548525/jobs/7748553775 failed to meet 1.145 but were above 1.140.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96772
Approved by: https://github.com/seemethere, https://github.com/huydhn, https://github.com/atalman
2023-03-14 21:00:13 +00:00
159145a19e Add support for torch.complex in functorch (#96032)
Fixes #91175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96032
Approved by: https://github.com/Skylion007, https://github.com/kshitij12345, https://github.com/zou3519
2023-03-14 20:47:53 +00:00
06b7285163 Add torch._check* functions analogous to C++ TORCH_CHECK* (#88725)
Adds `_check`, `_check_index`, `_check_value`, `_check_type`, `_check_not_implemented`, `_check_tensor_all`

Part of #72948
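A small usage sketch of the new checks; the lazy message-callable form mirrors the C++ macros, but the function below and its messages are made up for illustration:

```python
import torch

def pick_row(x: torch.Tensor, i: int) -> torch.Tensor:
    # Analogous to TORCH_CHECK: raises RuntimeError when the condition is false.
    torch._check(x.dim() == 2, lambda: f"expected a 2D tensor, got {x.dim()}D")
    # Analogous to TORCH_CHECK_INDEX: raises IndexError instead.
    torch._check_index(0 <= i < x.size(0), lambda: f"row {i} is out of range")
    return x[i]

print(pick_row(torch.arange(6.0).reshape(2, 3), 1))  # tensor([3., 4., 5.])
```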
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88725
Approved by: https://github.com/albanD
2023-03-14 20:44:50 +00:00
cf12edee02 add amp support for custom backend (#96188)
Fixes #ISSUE_NUMBER
1. Add amp support for the custom backend.
2. Optimize the file `backend_registration.py` and rename it to `custom_backend_registration.py`, so that other funcs for the custom backend can be registered there later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96188
Approved by: https://github.com/bdhirsh
2023-03-14 20:43:21 +00:00
d30db9a251 Replace non-reentrant checkpoint with a rewrite that can be nested and contain grad (#90105)
Changes:
- bc-breaking change: The main difference between this and the old non-reentrant impl that it replaces is that we clear recomputed tensors on backward immediately upon unpack, even if retain_graph=True. This has the following additional implications:
   - Accessing _saved_tensors multiple times will silently recompute forward multiple times.
   - Accessing ctx.saved_tensor twice in the same backward will now raise an error.
- To avoid dealing with the potential consequences, early stopping has been hidden behind a global flag that is by default False, and can be enabled via a context manager. We can remove this in a follow up. Some features of nesting as a result do not work by default.

Before land:
- import to check for more bc-breakingness
- implement any workarounds for the bc-breaking-ness, if we decide on any
- update docs to reflect new lifetime of recomputed variables
- update docs to mention the early stop feature

Follow ups:
- enable early-stopping by default
- update docs/tutorial to feature nested use cases

Related docs:
  - code comment: https://github.com/pytorch/pytorch/pull/90105/files#diff-9dcd955620b52ce128e18e3567be88edbb238810460d1288a86fabc20e483b30R448
  - design doc: https://docs.google.com/document/d/1UDLhTNv6_kvuDTRlsjfj9WdqtNaQNr8ahrvdBIB6914/edit#
  - retains_grad <> checkpoint https://docs.google.com/document/d/1maiGmuFUxysQL0AdYUU88kngAaXh_L0XpDcLDh_5Ors/edit
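
A minimal usage sketch of the non-reentrant path this rewrite replaces under the hood (standard `torch.utils.checkpoint` API, nothing PR-specific):

```python
import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    return torch.relu(x @ x.t())

x = torch.randn(4, 4, requires_grad=True)
# use_reentrant=False selects the non-reentrant implementation; with this
# change, recomputed activations are cleared as soon as they are unpacked
# in backward, even when retain_graph=True.
loss = checkpoint(block, x, use_reentrant=False).sum()
loss.backward()
print(x.grad.shape)  # torch.Size([4, 4])
```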

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90105
Approved by: https://github.com/albanD
2023-03-14 20:38:36 +00:00
234df29901 [MPS] Add C++ API support for MPS backend (#96668)
- This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++.
- Added test case for C++ APIs to `mps_test_allocator.cpp`

Fixes #96425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668
Approved by: https://github.com/kulinseth, https://github.com/albanD, https://github.com/malfet
2023-03-14 20:27:40 +00:00
ba4fb9b6ad Revert "Default specialize_int to False (#96624)"
This reverts commit 1ac8782db244f6cd3d2fd109e3fe94500745e0dd.

Reverted https://github.com/pytorch/pytorch/pull/96624 on behalf of https://github.com/kit1980 due to Broke inductor/test_torchinductor_dynamic_shapes.py
2023-03-14 19:43:47 +00:00
da1489e405 Fix signed-unsigned compare in FlattenIndicesCommon.h (#96765)
One more regression from https://github.com/pytorch/pytorch/pull/94401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96765
Approved by: https://github.com/izaitsevfb, https://github.com/Skylion007
2023-03-14 19:41:47 +00:00
66871d61bb One line print for check_graph_breaks (#96750)
New output looks like this

<img width="1040" alt="image" src="https://user-images.githubusercontent.com/4984825/225059313-fbac5152-ea8b-46ba-893d-dc1e2f8d82cc.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96750
Approved by: https://github.com/ezyang
2023-03-14 19:35:54 +00:00
6ea790c5b6 Make share_memory_ call thread safe within itself. (#96664)
To achieve this, a per-StorageImpl lock (it was per-data_ptr in the previous version of this PR, but moved to StorageImpl to ensure the key is stable before/after sharing) is created when we are about to share a storage, and all other calls to share memory wait on this lock before moving forward.
This does NOT make this call generally thread safe as any call that is not sharing memory will race and lead to UB.

This ensures that the sample from @robertolat in https://github.com/pytorch/pytorch/issues/95606 works fine.
This does NOT fix the example from @imurray in that same issue, as the call still races with the `.sum()` call. This race is expected and there is no easy way for us to make it work, I'm afraid (see the issue for more details).
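
A small sketch of the now-supported pattern from the issue above, several threads calling `share_memory_` on the same tensor (an illustration, not the issue's exact reproducer):

```python
import threading
import torch

t = torch.randn(1000)

# With the per-StorageImpl lock, concurrent share_memory_ calls on the same
# storage serialize against each other instead of racing.
threads = [threading.Thread(target=t.share_memory_) for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(t.is_shared())  # True
```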

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96664
Approved by: https://github.com/colesbury
2023-03-14 19:27:01 +00:00
cyy
799521fae5 Fixes 96676 (#96714)
Fixes #96676

PR #95942 introduced some changes in function implementations to replace const parameters with const references. However, GetBackendDevice was missed and retains the old signature. This quick fix resolves the type mismatch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96714
Approved by: https://github.com/antoniojkim, https://github.com/Skylion007
2023-03-14 19:00:59 +00:00
61d6ccd29a [CI] switch torchbench to a pinned version (#96553)
Summary: Previously we were using a branch on torchbench which skips
torchaudio. We should switch to a pinned version to ensure good test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96553
Approved by: https://github.com/huydhn, https://github.com/ezyang
2023-03-14 18:42:22 +00:00
1ac8782db2 Default specialize_int to False (#96624)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96624
Approved by: https://github.com/janeyx99
2023-03-14 18:37:47 +00:00
02f6d14b97 Only allow SymInt across partitioner boundaries, and fixes (#96653)
This PR does a few things all at once, as I needed to fix several bugs on the way here.  The main goal of the PR is to fix the  `'float' object has no attribute '_has_symbolic_sizes_strides'` error. The general idea is to heavily penalize non-SymInt but still SymNode cuts in the graph. This doesn't work for default partitioner, so essentially, dynamic shapes with default partitioner is not supported.

While doing this, I had to fix a few other bugs in the partitioner:
* SymNode operations weren't considered recomputable. But they are very cheap, go wild.
* zeros_like wasn't considered recomputable, and this prevented some gradient formulas (e.g., for angle with real inputs) from successfully finding a cut at all
* AOTAutograd tests use the default partitioner. I switch them to use min-cut partitioner...
* ...but this reveals a bug where if we have nodes in backward outputs that don't depend on tangents, they never get assigned to the backward graph. I fix this by making the backward outputs mandatory to be in backwards. I have to be careful to filter out None backward outputs; those never participate in flow analysis!

This causes some wobbling for the min-cut tests, but these seem legitimate: since we're now willing to recompute, the partitioner can reduce the number of SymInts it transmits by just doing some recompute in the backend.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96653
Approved by: https://github.com/ngimel
2023-03-14 18:30:56 +00:00
9cb02b2e72 Mark empty, rand, randn as core aten op (#96661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96661
Approved by: https://github.com/ngimel
2023-03-14 18:27:25 +00:00
4e1060c609 [memory profiling] add a facility to gather combined C++/Python/TorchScript stack traces. (#95541)
This refactors the stack trace facility specific to memory profiling
    in python+cuda to make a generic facility to generate combined stack
    traces.

    The generic facility (combined_traceback.h) does not require
    python to be around to work, but will return python stacks if it is
    present.

    This facility is then used to add support for stack trace gathering in memory profiling that
    happens directly from C++.

    It is also used to expose a python API for gathering and symbolizing
    combined stacks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95541
Approved by: https://github.com/ezyang
2023-03-14 18:26:05 +00:00
6e3d51b08a [inductor][CI] also skip rexnet_100 on non-dynamic shapes (#96691)
Recent failures show rexnet_100 accuracy is flaky also on non-dynamic shapes (was already disabled for dynamic shapes in #96474). The failure occurs for the same reason (stem.bn.weight.grad).
e.g. https://github.com/pytorch/pytorch/actions/runs/4402868441/jobs/7710977874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96691
Approved by: https://github.com/desertfire
2023-03-14 18:11:59 +00:00
a916d64900 [FSDP] Relax sharded_grad assert to allow IDLE (#96584)
`_use_sharded_grad_views()` can be called when re-registering the original parameters in `load_state_dict()`, in which case the training state is `IDLE`. Previously, I only expected `_use_sharded_grad_views()` to be called in `FORWARD` when the sharded gradient is not in `_saved_grad_shard` or `_cpu_grad`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96584
Approved by: https://github.com/fegin, https://github.com/zhaojuanmao
2023-03-14 18:05:57 +00:00
05dda7ff65 bsr_dense_mm Triton kernel: fix out kwarg (#96648)
As per title. The kernel did not handle `out=` correctly and returned a different tensor which only shared storage with `out`.
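
For reference, a tiny illustration of the `out=` contract the fix restores, using a stock op rather than the Triton kernel itself:

```python
import torch

a = torch.randn(3, 3)
b = torch.randn(3, 3)
out = torch.empty(3, 3)

# The returned tensor must be the very same object as `out`, not merely a
# tensor that shares out's storage.
res = torch.mm(a, b, out=out)
print(res is out)  # True
```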

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96648
Approved by: https://github.com/cpuhrsch
2023-03-14 18:01:22 +00:00
40df3b41aa [AO] Update qLSTM implementation to remove unsupported backend ops (#96436)
Summary:
The reference quantized LSTM implementation uses unbind and inplace squeeze, both of which are not supported when building BoltNN's Espresso IR graph.

This change adjusts the reference AO Quantizable LSTM implementation without affecting numerics, while enabling removal of the unsupported ops in BoltNN.

Modifications & Adjustments
1. Unbind ops appear when unstacking a tensor in a loop. Replaced this by getting the first dim from the shape and looping with a ranged index.
2. Removed unbind op calls where the pattern `[x = t.unbind(0) -> x[i]]` can simply be replaced by `t[i]`, since creating a tuple from unbind is unnecessary (see the sketch below).
3. Inplace squeeze (`squeeze_`) uses that were not required have been replaced by `squeeze`.
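
A minimal sketch of the substitution described in items 1-2 (hypothetical shapes, not the qLSTM code):

```python
import torch

t = torch.randn(3, 2, 4)

# Pattern removed: materialize a tuple with unbind, then index into it.
slices = t.unbind(0)
first_old = slices[0]

# Replacement: index the tensor directly and loop over a ranged index,
# so no unbind op appears in the traced graph.
first_new = t[0]
for i in range(t.shape[0]):
    step = t[i]  # equivalent to slices[i]

print(torch.equal(first_old, first_new))  # True
```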

See notebook N3235193 which was used for testing quantization flow and inspect the torch scripted quantized model for the set of ops used(See last cell).

Test Plan: N3235193

Reviewed By: andrewor14

Differential Revision: D43935389

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96436
Approved by: https://github.com/andrewor14
2023-03-14 17:58:34 +00:00
7ec0d6f006 Moves SDPA backward helper native function to functionsmanual.cpp (#95821)
## Summary
chunk_grad_outputs should have been created within functionsmanual.cpp to begin with. This removes it as a native function and adds it to its appropriate home.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95821
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-03-14 17:49:07 +00:00
152c1529ca Add tests for all padding layers to module_db in common_modules.py (#96641)
Adding the PR discussed in #96295.

- Adds tests for all current padding layers to `module_db` in `torch/testing/_internal/common_modules.py` ( `nn.ReflectionPad`, `nn.ReplicationPad`, `nn.ZeroPad`, `nn.ConstantPad` ) for 1D, 2D, and 3D variants.
- Removes tests for the same padding layers from `torch/testing/_internal/common_nn.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96641
Approved by: https://github.com/albanD
2023-03-14 17:42:10 +00:00
4562898ad1 Disable flaky linalg.det.singular tests on ROCm (#96707)
Related issues: https://github.com/pytorch/pytorch/issues/93044 and https://github.com/pytorch/pytorch/issues/93045.

* No access to runner to debug ROCm flakiness
* Haven't seen any update on the two issues above
* Tests are still flaky whenever they are closed

### Testing

The tests are skipped https://ossci-raw-job-status.s3.amazonaws.com/log/11976899251

```
2023-03-14T03:39:02.1336514Z test_ops_gradients.py::TestBwdGradientsCUDA::test_fn_grad_linalg_det_singular_cuda_complex128 SKIPPED (Flaky on ROCm https://github.com/pytorch/pytorch/issues/93044) [ 27%]
...
2023-03-14T03:41:46.4234072Z test_ops_fwd_gradients.py::TestFwdGradientsCUDA::test_forward_mode_AD_linalg_det_singular_cuda_complex128 SKIPPED (Flaky on ROCm https://github.com/pytorch/pytorch/issues/93045) [ 44%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96707
Approved by: https://github.com/clee2000
2023-03-14 17:35:00 +00:00
0d3bf2fdca Add missing ceil for libdevice in triton (#96709)
Towards fixing pnasnet5large

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96709
Approved by: https://github.com/jansel
2023-03-14 17:34:06 +00:00
1ec655565d [fix] resize_, resize_as_ : version bump in ADInplaceOrView (#96598)
Ref: https://github.com/pytorch/pytorch/pull/96403#discussion_r1132553277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96598
Approved by: https://github.com/albanD
2023-03-14 16:15:34 +00:00
f03db8d6cb [reland2][inductor] Add an AOT compilation mode for Inductor CPP backend (#96520)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822.
Solved the long compilation issue for inductor cpp tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96520
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-14 16:10:54 +00:00
178d2a38e0 debug shape guards (#95848)
Adds logging when shape guards are added and when symbols are specialized to constants.

Differential Revision: [D43719743](https://our.internmc.facebook.com/intern/diff/D43719743/)

Differential Revision: [D43719743](https://our.internmc.facebook.com/intern/diff/D43719743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95848
Approved by: https://github.com/ezyang
2023-03-14 16:05:28 +00:00
a37197df99 [Inductor] Enable Inductor to support BF16 atomic_add (#96620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96620
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 15:15:09 +00:00
ff7e510d1e Correctly use PythonPrinter for generating wrapper code referencing sympy (#96710)
Otherwise you get stuff like ceiling(s0) which is not valid Python code. Fixes volo_d1_224

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96710
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-14 14:35:52 +00:00
f1d4d291b0 update_expected.py to parse artifacts and update graph break stats (#96480)
TODO (cc @soumith @voznesenskym @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @desertfire @ZainRizvi) hopefully i can convert the rocks query i'm using to a public API and delete the rocs api usage (and need for apikey) from this before landing.  If that's not easy or if i need to make a new query first, maybe i should land this as-is and at least people can use it if they get an apikey.  Also, any bad practices in how i parsed/mangled the filenames?  Would be nice to make the naming of artifacts more consistent with the job names so less mangling is needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96480
Approved by: https://github.com/ZainRizvi
2023-03-14 13:37:21 +00:00
9239279cc0 Support tensor type for XPU (#96656)
# Motivation
Support the tensor type scenario for XPU, like CUDA:
```python
>>> import torch
>>> torch.rand(2,3).cuda(0).type(torch.cuda.IntTensor)
tensor([[0, 0, 0],
        [0, 0, 0]], device='cuda:0', dtype=torch.int32)
```
without this PR:
```python
>>> import torch
>>> import intel_extension_for_pytorch
>>> torch.rand(2,3).xpu('xpu:0').type(torch.xpu.IntTensor)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid type: 'torch.xpu.IntTensor'
```
with this PR:
```python
>>> import torch
>>> import intel_extension_for_pytorch
>>> torch.rand(2,3).xpu('xpu:0').type(torch.xpu.IntTensor)
tensor([[0, 0, 0],
        [0, 0, 0]], device='xpu:0', dtype=torch.int32)
```

# Solution
Add allXPUTypes in the type method to parse all XPU tensor types.

# Additional
UT pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96656
Approved by: https://github.com/albanD
2023-03-14 13:30:41 +00:00
a07817ad8f Revert "[MPS] Add C++ API support for MPS backend (#96668)"
This reverts commit 069ace131c7889c7aaf2ea64fe8eb44a8ff1e983.

Reverted https://github.com/pytorch/pytorch/pull/96668 on behalf of https://github.com/DanilBaibak due to breaking internal builds
2023-03-14 12:43:04 +00:00
bdd09e68e4 [Inductor] Legalize BF16 (#96183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96183
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 10:16:15 +00:00
190e284bd3 [Inductor] apply vec float mask on logical comparison ops in cpp (#96502)
Fix https://github.com/pytorch/pytorch/issues/96446
The root cause is that the logical comparison op works on the integer vector which is later used in the `where` op that expects a float vector.
1. Make sure float vec mask is applied on logical comparison ops.
2. Fix vec int specialization for `to_float_mask`. Assume int mask as input and returns the float mask with reinterpret cast.
3. Add a no-op specialization for `to_float_mask` function with the float vec as input.
4. Pass value instead of ref to `to_float_mask`. Passing by value should be efficient enough.
5. Remove a conditional check `!=0` in `masked()` since `to_float_mask` is guaranteed to return a float mask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96502
Approved by: https://github.com/EikanWang, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-14 08:47:14 +00:00
3f7235463a [Inductor] Fix the incorrect fusion if a Conv/Linear moduel is called from multiple places (#96485)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96485
Approved by: https://github.com/jansel, https://github.com/jgong5
2023-03-14 07:40:20 +00:00
3cad8d23d0 [Inductor] Skip the hf_T5_base due to intermittent failure on CI (#96649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96649
Approved by: https://github.com/desertfire
2023-03-14 07:40:20 +00:00
166117e050 control_flow.{cond/map} allows tracked_fakes divergence (#96546)
Fixes #96473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96546
Approved by: https://github.com/ezyang
2023-03-14 07:06:54 +00:00
ec536232a3 [primTorch] add meta implementation for upsample_nearest2d_backward (#96612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96612
Approved by: https://github.com/ezyang
2023-03-14 06:51:42 +00:00
6a2dcfd738 Move all ONNX test dependencies to Docker (#96590)
Per title.  This is the first one of a two-part process:

[x] Move all ONNX test dependencies to Docker https://github.com/pytorch/pytorch/pull/96590
[ ] Move the test model used by [TestFxToOnnxWithOnnxRuntime.test_gpt2_tiny](https://hud.pytorch.org/failure/FAILED%20test%2Fonnx%2Ftest_fx_to_onnx_with_onnxruntime.py%3A%3ATestFxToOnnxWithOnnxRuntime%3A%3Atest_large_scale_exporter_with_tiny_gpt2%20-%20requests.exceptions.ReadTimeout%3A%20HTTPSConnectionPool(host%3D'huggingface.co'%2C%20port%3D443)%3A%20Read%20timed%20out.%20(read%20timeout%3D10.0))
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96590
Approved by: https://github.com/ZainRizvi
2023-03-14 06:19:00 +00:00
70090b4daf [CUDA] Abate spurious resize warnings in MultiMarginLoss backward (#96382)
Follow-up of #75000 for backward.

CC @ngimel @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96382
Approved by: https://github.com/ngimel
2023-03-14 05:54:23 +00:00
906a1952c6 [DDP] Enable delayed all reduce in DDP (#96673)
Summary: Enable delayed all-reduce in DDP by letting users specify parameters whose all-reduce will be hooked onto a specific param. This prevents AllReduce from blocking All2All in some recommendation models.

Test Plan: GitHub CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96673
Approved by: https://github.com/zhaojuanmao
2023-03-14 04:25:25 +00:00
d0a4881d95 [vision hash update] update the pinned vision hash (#96703)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96703
Approved by: https://github.com/pytorchbot
2023-03-14 04:07:59 +00:00
2a08a62777 Add extra metadata (as comments) to Inductor generated code (#96581)
New output
<img width="942" alt="image" src="https://user-images.githubusercontent.com/6355099/224794006-a993a2a8-d6ff-49da-8891-7b2373030a3d.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96581
Approved by: https://github.com/ngimel, https://github.com/shunting314, https://github.com/voznesenskym
2023-03-14 03:59:59 +00:00
f56cb41c2e Fix calls to sizes to enable dynamic shapes with sdpa (#96674)
Fixes part of #96414

Replaces any calls to sizes, with sym_sizes. Still seeing an error with the repro script:
``` Bash
Exception raised from sizes_default at /scratch/drisspg/work/pytorch/c10/core/TensorImpl.h:635 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x7d (0x7f697f4a141d in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xdd (0x7f697f49fbcd in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #2: c10::TensorImpl::sizes_custom() const + 0x95 (0x7f697f4824c5 in /scratch/drisspg/work/pytorch/torch/lib/libc10.so)
frame #3: at::native::empty_like(at::Tensor const&, c10::optional<c10::ScalarType>, c10::optional<c10::Layout>, c10::optional<c10::Device>, c10::optional<bool>, c10::optional<c10::MemoryFormat>) + 0x92c (0x7f69809d18ac in /scratch/drisspg/work/pytorch/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x23f5ce7 (0x7f698193bce7 in /scratch/drisspg/work/pytorch/torch/lib/libtorch_cpu.so)
```

Still trying to track down this empty call.

From the looks of it, it might be coming from at::layer_norm?
The lldb backtrace is 221 frames, however, so there is a lot of noise.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96674
Approved by: https://github.com/ezyang
2023-03-14 03:47:43 +00:00
218eeacacd Check dynamo graph-breaks in CI (#96346)
- add graph-breaks baselines
- add check_graph_breaks script (message users on regress or improvement)
- hook up test.sh for existing accuracy job

Refactor graph-break CI check

Take steps toward merging checker with existing check flow,
consider merging it all the way inside the bench runner.

csvs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96346
Approved by: https://github.com/ezyang
2023-03-14 03:39:36 +00:00
2cc8368af3 Clean up duplicated retry function in common.sh (#96682)
I just realize that this `retry` function is defined twice in:

* https://github.com/pytorch/pytorch/blob/master/.ci/pytorch/common_utils.sh#L12-L14
* and https://github.com/pytorch/pytorch/blob/master/.ci/pytorch/common.sh#L26-L28

Also they step on each other toes as `common.sh` load `common_utils.sh` in https://github.com/pytorch/pytorch/blob/master/.ci/pytorch/common.sh#L5

This will keep only the definition in `common_utils.sh` where it should be.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96682
Approved by: https://github.com/clee2000
2023-03-14 03:24:49 +00:00
62c1e33fc9 [BE] Remove fast_nvcc tool (#96665)
As of CUDA-11.4+ this functionality can be mimicked by passing
[`--threads`](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/#threads-number-t) option to CUDA compiler

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96665
Approved by: https://github.com/atalman, https://github.com/PaliC
2023-03-14 03:17:31 +00:00
82daf98151 [Sparse] Move SparseTensorUtils.* to native/ (#96696)
Fixes internal linking problem after `DECLARE_DISPATCH` was introduced in SparseTensorUtils.cpp, but implemented inside the native library.

Also, fix `sign-unsigned` compare in `_flatten_indices_impl`
Followups:
 Move code declared/implemented in `SparseTensorUtils.*` to `at::native` namespace
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96696
Approved by: https://github.com/albanD
2023-03-14 02:56:52 +00:00
c31f5cc26a Update functional_bfloat16.h (#96027)
Fix a typo in functional_bfloat16.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96027
Approved by: https://github.com/Skylion007, https://github.com/jgong5, https://github.com/kit1980
2023-03-14 02:35:37 +00:00
a66474b411 Update vml.h (#96028)
Fix a typo in vml.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96028
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-14 02:22:01 +00:00
5a8a4030a2 [BE] Add regression test for aten shared build (#96697)
To expose errors similar to https://github.com/pytorch/pytorch/pull/94401#issuecomment-1466654593 in OSS CI

Building `aten_cpu` as a shared library with `-Wl,--no-undefined` simulates behavior of Android NDK toolchain.

Test plan: It should fail, see https://github.com/pytorch/pytorch/actions/runs/4410571970/jobs/7728232916#step:14:1386

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96697
Approved by: https://github.com/kit1980
2023-03-14 02:19:17 +00:00
a22b92d8ba Revert "Enable thp(transparent huge pages) for buffer sizes >=2MB (#95963)"
This reverts commit 3bb16a084298ed8b9a1e59622afd80418ff4a2f1.

Reverted https://github.com/pytorch/pytorch/pull/95963 on behalf of https://github.com/izaitsevfb due to Breaks internal android builds: unused function c10_compute_alignment  [-Werror,-Wunused-function]
2023-03-14 02:15:08 +00:00
86a9fe8abc Update Exceptions.cpp (#96031)
Fix a typo in Exceptions.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96031
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-14 02:02:34 +00:00
507feb805f Don't specialize torch.Size with specialize_int = False (#96419)
Fixes https://github.com/pytorch/pytorch/issues/95868

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96419
Approved by: https://github.com/jansel, https://github.com/ngimel
2023-03-14 01:32:58 +00:00
da265652d6 Return Live Data Pointers from Checkpoint, swap onto tensors (#95020)
When we checkpoint the state of the private pool allocator, we need to make sure that its currently live allocated blocks get properly cleaned up when the tensors they correspond to die. Return DataPtrs for these newly allocated blocks so that the callee can swap them onto live Tensors.

The exact API for setting the checkpoint can be adjusted later as the cudagraph implementation is built out, but this at least shows it's sufficiently general.

This should be the last PR touching cuda caching allocator necessary for new cudagraphs integration.

Differential Revision: [D43999888](https://our.internmc.facebook.com/intern/diff/D43999888)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95020
Approved by: https://github.com/zdevito
2023-03-14 01:22:19 +00:00
1cc32aedb0 Handle additional live allocations not in checkpointed state (#94943)
We choose to ignore certain blocks that are currently allocated when we set the pool to its checkpoint. For those blocks, we need to swap out the deleter function of their corresponding blocks so that a deallocation is not triggered when they die.

Differential Revision: [D43999886](https://our.internmc.facebook.com/intern/diff/D43999886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94943
Approved by: https://github.com/zdevito
2023-03-14 01:00:47 +00:00
d798de2b05 Checkpoint CUDA Allocator Private Pool State (#94653)
Copying note from cuda caching allocator:

```
   * Note [Checkpointing PrivatePoolState]
   *
   * Refer above to Note [Interaction with CUDA graph capture]. Allocations made
   * during graph capture are made from a separate private pool. During graph
   * capture allocations behave as usual. During graph replay the allocator
   * state does not change even as new tensors are created. The private pool
   * will not free its blocks to the main caching allocator until cuda graph use
   * is finished to prevent an allocation from eager clobbering the memory from
   * a live but unaccounted for tensor that was created during replay.
   *
   * `make_graphed_callables`, a series of separate callables chained in
   * successive cuda graphs, can share a memory pool because after a cuda graph
   * recording the allocations in the shared private pool exactly reflect the
   * tensors that are allocated.
   *
   * We would like to extend callable chaining to support a graphed callable
   * tree. In this scenario, we have a tree of callable chains which will be
   * captured with cuda graphs. In the diagram below, we have a tree with four
   * callables, A, B, C, and D. Suppose we have captured, and subsequently
   * replayed, A, B, and C. Then on a new invocation, we replay A and B, but
   * would now like to record D. At this point the private pool will not reflect
   * any of the live tensors created during graph replay. Allocations made
   * during a new recording with the pool could overwrite those live tensors.
   *
   * In order to record a new graph capture after replaying prior callables in
   * the tree, we need the allocator to reflect the state of the live tensors.
   * We checkpoint the state of the private pool after each recording, and then
   * reapply it when we are starting a new recording chain. Additionally, we
   * must free the allocations for any tensors that died between the end of our
   * previous graph replaying and our new recording (TODO). All of the allocated
   * segments that existed in the checkpointed state must still exist in the
   * pool. There may also exist new segments, which we will free (TODO : link
   * note [live tensors between iterations] when it exists).
   *
   *
   *  ---------------> A ---------------> B ---------------> C
   *                                |
   *                                |
   *                                |
   *                                |
   *                                  ---------------> D
```

A few TODOs:
- need to add logic for freeing tensors that have died between a last replay and current new recording
- Add logic for free that might be called on a pointer multiple times (because we are manually freeing live tensors)

The two scenarios above have not been exercised in the tests yet.

Differential Revision: [D43999889](https://our.internmc.facebook.com/intern/diff/D43999889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94653
Approved by: https://github.com/zdevito
2023-03-14 00:47:30 +00:00
c95bcb6694 [MPS] Fix flip where no dims need to be flipped (#96605)
Fixes #96558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96605
Approved by: https://github.com/kulinseth
2023-03-14 00:34:30 +00:00
ca7e53324f [Quant][fx] Remove unused is_qat args in prepare_fx (#96631)
Test Plan:
python test/test_quantization.py TestQuantizeFx

Reviewers: vkuzo, jcaip

Subscribers: vkuzo, jcaip
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96631
Approved by: https://github.com/vkuzo
2023-03-14 00:33:18 +00:00
eqy
6e3e22d58c [CUDA][cuFFT] Minor fix for cuFFT plan cache docs (#96373)
The attributes described in the docs require indexing in to the plan cache manager, as there is a separate plan cache per device.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96373
Approved by: https://github.com/ngimel
2023-03-14 00:28:14 +00:00
6a492908cc Update conv_fused.py (#95551)
Fix typos in conv_fused.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95551
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2023-03-13 23:42:34 +00:00
ae4d690931 Make linter image available on ECR when rebuilding (#96671)
This is to fix the annoying error where lint jobs can't find the new image on ECR and fail.  They succeed once the image has been pushed there, for example https://github.com/pytorch/pytorch/actions/runs/4407785646/jobs/7722166975

### Testing

* Lint jobs successfully pull new linter image if there are changes to Docker https://github.com/pytorch/pytorch/actions/runs/4408362343
* No change to the Docker image. The existing one is pulled from ECR https://github.com/pytorch/pytorch/actions/runs/4408992880
* Remove `force_push` https://github.com/pytorch/pytorch/actions/runs/4410045959
* Retrying works fine https://github.com/pytorch/pytorch/actions/runs/4410045959/jobs/7727515932
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96671
Approved by: https://github.com/malfet, https://github.com/seemethere
2023-03-13 23:24:23 +00:00
f330281fb2 Add torch.nn.LayerNorm() to documented list of supported nested tensor ops (#96434)
Layer norm is supported and this updates the documentation to reflect that.
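
A small sketch of the documented support, assuming the usual case where `normalized_shape` matches the trailing dimension:

```python
import torch
import torch.nn as nn

# Ragged batch: two sequences of different lengths with the same feature size.
nt = torch.nested.nested_tensor([torch.randn(2, 8), torch.randn(3, 8)])
ln = nn.LayerNorm(8)
out = ln(nt)
print(out.is_nested)  # True
```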

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96434
Approved by: https://github.com/cpuhrsch, https://github.com/jbschlosser
2023-03-13 23:16:09 +00:00
069ace131c [MPS] Add C++ API support for MPS backend (#96668)
- This enables the APIs `torch::mps::is_available()/synchronize()/manual_seed()` for use in PyTorch C++.
- Added test case for C++ APIs to `mps_test_allocator.cpp`

Fixes #96425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96668
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-03-13 23:15:37 +00:00
c28b224e0f Update CUDAGraph.cpp (#96029)
Fix typos in CUDAGraph.cpp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96029
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-13 23:06:25 +00:00
2ea0cb1207 Fix the typo for the docstring of args in the observer (#95887)
This PR fixes the typo in `torch.ao.quantization.observer.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95887
Approved by: https://github.com/kit1980
2023-03-13 23:03:57 +00:00
9159599cd5 Grammatically updated the tech docs (#92896)
Small typo change in the torch tech docs
<img width="1209" alt="Torch storage doc" src="https://user-images.githubusercontent.com/76240270/214272201-5e9cce2a-13cf-48b7-8806-9c492a0eb665.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92896
Approved by: https://github.com/mikaylagawarecki, https://github.com/kit1980
2023-03-13 22:51:42 +00:00
1d792288a5 [dynamo][dashboard] Clear local changes before pulling git repos (#96667)
The current dashboard issue is due to a .pt file in torchbench that has been modified for some reason. This clears any local changes before pulling.

Tested in a duplicate dashboard environment with the same .pt file modified:
* Before the change to this makefile, `make pull-deps` fails
* After the change to this makefile, `make pull-deps` succeeds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96667
Approved by: https://github.com/anijain2305
2023-03-13 22:50:38 +00:00
16a16d1803 Incorrect links #96515 (#96536)
Fixes #96515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96536
Approved by: https://github.com/kit1980
2023-03-13 22:26:21 +00:00
a48d518e45 test_foreach: remove skipMeta (#96599)
Happened to notice that the test doesn't seem to require the guard (at least on my local environment)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96599
Approved by: https://github.com/bdhirsh
2023-03-13 22:14:36 +00:00
f5a0b31a95 [FSDP][optim_state_dict] Make FSDP optim_state_dict aware of DDP prefix (#96415)
Summary: When wrapping FSDP within DDP, optimizer state_dict may be broken due to the prefix of DDP. This PR fixes the issue.

Test Plan: CI

Differential Revision: D43893609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96415
Approved by: https://github.com/zhaojuanmao
2023-03-13 21:07:34 +00:00
b992199487 [pytorch][coreml] Use from_blob instead of empty in pack_outputs (#96564)
Summary:
We don't want to load extra ops when loading a model on Core ML, and `at::empty` is considered an op.

So replace it with from_blob.

Test Plan:
Run Core ML backend to ensure it works for existing use cases.

Also test running Core ML backend without any ops.

Differential Revision: D43961679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96564
Approved by: https://github.com/f-meloni, https://github.com/kimishpatel
2023-03-13 20:23:43 +00:00
c69b3b8d4f [CUDA12] Autograd engine use current device only (#92354)
This is a device agnostic version #91191.
This PR exists because of the device-agnostic policy of the autograd engine. Hence, the compile-time `USE_CUDA` is not supported, so doing something like:
fa1ea9f9bc/torch/csrc/autograd/engine.cpp (L351-L357)
is not effective.

In this PR, a check on CUDA devices in the device registry is added so that threads set the same CUDA device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92354
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-03-13 20:04:12 +00:00
31137a63a7 Changed flop formulas for flop counter to take in shapes directly (#96565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96565
Approved by: https://github.com/zdevito
2023-03-13 19:58:43 +00:00
3f1efadea5 [inductor] fixes addmm pattern matcher to exclude non-conformant patterns (#96634)

Fixes #96625, #96569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96634
Approved by: https://github.com/jansel
2023-03-13 19:55:04 +00:00
30d56dd8c1 Support randn_like() for NT (#96528)
To satisfy an internal ask.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96528
Approved by: https://github.com/mikaylagawarecki, https://github.com/cpuhrsch
2023-03-13 19:39:51 +00:00
f673ad6d5c Add a new knob to separately enable the autotuning in Triton. (#96440)
Summary: separate triton pointwise autotune from matmul autotune, work done by ckluk

Test Plan: sandcastle + CI

Differential Revision: D43955699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96440
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-13 19:09:27 +00:00
4454655a4c Add triton to relevant packages (#96663)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96663
Approved by: https://github.com/janeyx99, https://github.com/malfet, https://github.com/atalman
2023-03-13 19:02:07 +00:00
a8d1eb1961 Convenience script for getting correct Triton nightly binary (#96669)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96669
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-13 18:58:38 +00:00
120c6f6637 Revert all_reduce workaround as it might be causing issues on other parts of the codebase (#96460)
Recent master breakage on focal and bionic PTD tests since we switched to all_reduce in #95897
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96460
Approved by: https://github.com/fegin
2023-03-13 18:56:55 +00:00
19833486dc Autorun binary builds when a commit pin is updated (#96526)
Automatically trigger binary builds when a commit pin is updated to ensure the new versions actually get tested.

This is to prevent a recurrence of the build breaks introduced by https://github.com/pytorch/pytorch/pull/95896#issuecomment-1463312996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96526
Approved by: https://github.com/huydhn, https://github.com/seemethere
2023-03-13 18:33:29 +00:00
7eef469793 Add merge_rule for "functorch" module (#96657)
This PR enables our non-meta contributors to be able to approve
"functorch" PRs without intervention from meta contributors.

A PR is deemed a "functorch" PR if it matches one of the patterns in
merge_rules.yaml. These patterns are definitely not exhaustive
(we modify core pytorch pieces quite often), but should be a good starting
point.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96657
Approved by: https://github.com/albanD
2023-03-13 18:05:47 +00:00
55a1bd3fc6 [PT-D] Update CODEOWNERS, merge_rules, and Persons-of-Interest for to… (#96321)
Synchronize CODEOWNERS, merge_rules, and POI files to reflect kiukchung and d4l3k (Tristan Rice) as maintainers for the distributed module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96321
Approved by: https://github.com/d4l3k, https://github.com/albanD, https://github.com/malfet
2023-03-13 17:38:43 +00:00
bb8dc7f7d9 Dockerize torch deploy setup (#96593)
The step `conda_install "libpython-static=${ANACONDA_PYTHON_VERSION}"` could fail flakily, for example 5f89d147a1, so let's put that into Docker.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96593
Approved by: https://github.com/ZainRizvi
2023-03-13 17:26:52 +00:00
0b5040b329 sparse_mask: remove syncs by removing calls to coalesce (#94406)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94406
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-13 16:30:27 +00:00
13011afb87 Fix vmap registration for t, t_ (#96539)
- t, t_ are not CompositeImplicitAutograd
- They were previously registered in BatchRulesDecompositions.cpp.
- The only thing that should get registered in BatchRulesDecompositions.cpp
are CompositeImplicitAutograd
- This PR moves their registrations out of there and into
BatchRulesViews.cpp.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96539
Approved by: https://github.com/srossross, https://github.com/kshitij12345, https://github.com/Chillee
2023-03-13 16:08:32 +00:00
024ea1a21e Support zeros_like() for NT (#96527)
This is used for the fake tensor fallbacks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96527
Approved by: https://github.com/cpuhrsch
2023-03-13 15:15:08 +00:00
3cdf18cb4f Corrected HingeEmbeddingLoss documentation (#95140)
Minor correction. `HingeEmbeddingLoss`'s documentation had this piecewise function; but there is no $\Delta$ in the function definition, it was used to denote `margin`.

$$l_n = \begin{cases}
            x_n, & \text{if}\; y_n = 1,\\
            \max \{0, \Delta - x_n\}, & \text{if}\; y_n = -1,
        \end{cases}$$

Following other documentation guidelines, `HuberLoss` has a parameter `delta`, and its piecewise function is defined as follows; using $delta$ as a reference to the `delta` parameter and not $\Delta$.

$$l_n = \begin{cases}
        0.5 (x_n - y_n)^2, & \text{if } |x_n - y_n| < delta \\
        delta * (|x_n - y_n| - 0.5 * delta), & \text{otherwise }
        \end{cases}$$

So by analogy, `HingeEmbeddingLoss` should also be the same, thus, the right piecewise function for it should be like the following instead.

$$l_n = \begin{cases}
            x_n, & \text{if}\; y_n = 1,\\
            \max \{0, margin- x_n\}, & \text{if}\; y_n = -1,
        \end{cases}$$
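
A quick numeric check of the corrected formula with `margin=1.0` (illustrative values):

```python
import torch
import torch.nn as nn

loss = nn.HingeEmbeddingLoss(margin=1.0, reduction="none")
x = torch.tensor([0.3, 0.8])
y = torch.tensor([1.0, -1.0])
# y =  1: l = x                     -> 0.3
# y = -1: l = max(0, margin - x)    -> 0.2
print(loss(x, y))  # tensor([0.3000, 0.2000])
```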
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95140
Approved by: https://github.com/albanD
2023-03-13 14:32:04 +00:00
32f11f58c9 DDP native mixed precision (#92882)
Implements native mixed precision support for DDP in a similar fashion to how it is enabled for FSDP. The implementation works as follows:

1. In DDP init, we save `_mp_param` and `_fp_param` variables to manage mixed precision parameter usage. In particular, _mp_param will represent the parameter in the reduced precision, while _fp_param will represent the param in regular precision. During forward/backward, we swap back and forth as needed.
2. The root module gets a root pre-forward hook that kicks off copies to the reduced precision for all submodules. An event is recorded for each submodule to allow for waiting, as we run these asynchronously.
3. Each module gets a pre-forward hook that waits on its corresponding event. Note that modules might be reused during training; in this case the wait is only done for the first module. After this wait, the module's parameters are in reduced precision.
4. In the pre-forward hook, we register a backward hook on the lower precision parameters in order to run reduced precision allreduce + parameter upcast. We can't rely on the Reducer's constructor setting up these hooks because the gradient is accumulated on the low precision param, so we need to register them ourselves.
5. In the backward pass, when the hook runs, we first run allreduce + divide in the reduced precision. Next, we upcast parameters and gradients back to fp32 asynchronously. We also queue a callback at the end of backward to wait on these upcasts so that the upcast is complete before optim.step() runs.
6. Parameters that don't require grad are also cast since they may be used in computation, they are upcast back in the final autograd callback.
7. DDP Ignored parameters are not touched.

Follow-ups:

1. Unify comm hooks and make it work with apply optimizer in backward
2. implement keep_low_precision_grads,
3. allow BN, LN, or custom units to run in reduced precision,
4. support for cast_forward_inputs
5. Unify certain APIs / helpers with FSDP where possible, such as for _cast_forward_inputs
6. Integrate this with replicate() API.
7. The order in which we kick off copies and wait for them is set by the iteration order of module.modules(), but this might not be how the modules are used in the actual training. In the worst case, the last module in module.modules() could be used first which would result in waiting for all copies unnecessarily. For static graphs, we should record the module execution order and copy / wait in this order.
8. Entirely unused modules probably don't need to be cast.

Differential Revision: [D42515803](https://our.internmc.facebook.com/intern/diff/D42515803/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92882
Approved by: https://github.com/zhaojuanmao
2023-03-13 14:10:31 +00:00
c7f39c0820 Update CI skips (#96554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96554
Approved by: https://github.com/janeyx99
2023-03-13 13:40:45 +00:00
279ada515a inductor(cpu): make variable number used of masked vectorization path align with scalar path (#96510)
Fix https://github.com/pytorch/pytorch/issues/96484. For the CPP reduction vectorization path, there is an assumption that the variable numbering used in the vectorized path is aligned with the scalar path, but currently `masked` does not meet this requirement and reports a 'var not defined' error.

before:
```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp7 = std::numeric_limits<float>::infinity();
            auto tmp7_vec = at::vec::Vectorized<float>(tmp7);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp5 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = at::vec::Vectorized<float>(0.0);
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    tmp3 = decltype(tmp4)::blendv(tmp3, tmp4, to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                }
                auto tmp6 = tmp3 + tmp5;
                tmp7_vec = at::vec::minimum(tmp7_vec, tmp6);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp7 = std::min(tmp7, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp7_vec));
            out_ptr0[0] = tmp8;
        }
    }
}
```
after:

```
{
    {
        {
            #pragma omp declare reduction(min:at::vec::Vectorized<float>:omp_out = at::vec::minimum(omp_out, omp_in)) initializer(omp_priv={{std::numeric_limits<float>::infinity()}})
            float tmp8 = std::numeric_limits<float>::infinity();
            auto tmp8_vec = at::vec::Vectorized<float>(tmp8);
            for(long i0=0; i0<0; i0+=1)
            {
                auto tmp6 = at::vec::Vectorized<float>::loadu(in_ptr1 + 16*i0);
                auto tmp0 = at::vec::Vectorized<int>(static_cast<int>(0));
                auto tmp1 = at::vec::Vectorized<int>(static_cast<int>(2));
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = at::vec::Vectorized<float>(in_ptr0[0]);
                    return tmp4;
                }
                ;
                auto tmp5 = decltype(tmp3())::blendv(at::vec::Vectorized<float>(0.0), tmp3(), to_float_mask(tmp2) != at::vec::Vectorized<float>(0));
                auto tmp7 = tmp5 + tmp6;
                tmp8_vec = at::vec::minimum(tmp8_vec, tmp7);
            }
            #pragma omp simd simdlen(8)  reduction(min:tmp8)
            for(long i0=0; i0<2; i0+=1)
            {
                auto tmp6 = in_ptr1[i0];
                auto tmp0 = static_cast<long>(0);
                auto tmp1 = static_cast<long>(2);
                auto tmp2 = tmp0 < tmp1;
                auto tmp3 = [&]
                {
                    auto tmp4 = in_ptr0[0];
                    return tmp4;
                }
                ;
                auto tmp5 = tmp2 ? tmp3() : static_cast<decltype(tmp3())>(0.0);
                auto tmp7 = tmp5 + tmp6;
                tmp8 = std::min(tmp8, tmp7);
            }
            tmp8 = std::min(tmp8, at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return at::vec::minimum(x, y);}, tmp8_vec));
            out_ptr0[0] = tmp8;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96510
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel
2023-03-13 09:41:23 +00:00
2cbce06fee Enable test_inverse_errors_large (#94727)
Test to see if TestLinAlgCUDA.test_inverse_errors_large_cuda_float64 still fails on CI.
The test was not failing in multiple CI runs.
I was not able to reproduce the crash locally.
Fixes #57482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94727
Approved by: https://github.com/lezcano
2023-03-13 08:31:41 +00:00
760ad90518 [Dynamo] User defined functions support torch & builtin functions as default arguments (#96563)
Fixes #96197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96563
Approved by: https://github.com/jansel
2023-03-13 08:28:52 +00:00
6eca391e83 inductor(cpu): remove __restrict__ keyword to avoid generating wrong result when two pointer point same memory (#96492)
Fix https://github.com/pytorch/pytorch/issues/93365, https://github.com/pytorch/pytorch/issues/93357 and https://github.com/pytorch/pytorch/issues/96432. For now, remove the `__restrict__` keyword to avoid generating wrong results when two pointers alias the same memory. There is a draft PR https://github.com/pytorch/pytorch/pull/96404 that adds memory-alias checks before applying the `__restrict__` keyword, but the alias-check logic in that PR needs to be redesigned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96492
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-03-13 07:12:04 +00:00
be220690d9 Revert "[primTorch] add meta implementation for upsample_nearest2d_backward (#96612)"
This reverts commit fe180596b854164db0ce500d938def8df45790ba.

Reverted https://github.com/pytorch/pytorch/pull/96612 on behalf of https://github.com/malfet due to broke lint
2023-03-13 03:07:23 +00:00
2b9d9bcb85 Deprecate non-bool masks in masked_fill (#96594)
__What?__
Per the discussion at #94634, deprecate `masked_fill` with non-bool masks. Deprecation warnings were previously added by #22261, but not for Apple MPS. I can revert the MPS changes if deprecation warnings are wanted first, though. See also #96112 and the short usage example below.

Fixes #85063 and #89320.

__Further Development?__
- Fixed the mask dtype checking for the cuda dispatch for `masked_fill` in `aten/src/ATen/native/cuda/Indexing.cu`
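
As a quick illustration of the deprecation (a minimal sketch, not taken from the PR's tests), passing an integer mask now warns, while a `torch.bool` mask is the supported path:
```python
import torch

x = torch.zeros(4)
mask = torch.tensor([1, 0, 1, 0], dtype=torch.uint8)

# Deprecated: a non-bool (e.g. uint8) mask triggers a deprecation warning
x.masked_fill(mask, 1.0)

# Preferred: pass a boolean mask
x.masked_fill(mask.bool(), 1.0)
```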

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96594
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-03-13 01:41:47 +00:00
fe180596b8 [primTorch] add meta implementation for upsample_nearest2d_backward (#96612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96612
Approved by: https://github.com/ezyang
2023-03-13 00:25:23 +00:00
99efe3ef5a Generate type match guard for torch.Size input (#96421)
I suppose hypothetically, if the user code ends up working
polymorphically over the SizeVariable, in such a way that a tuple would
work, this type match is not necessary.  But we do not carefully test
for this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96421
Approved by: https://github.com/jansel, https://github.com/voznesenskym
2023-03-12 23:04:55 +00:00
1ab883797a [BE] Dedup hardcoded triton versions (#96580)
Define it once in `.ci/docker/trition_version.txt` and use it everywhere.

Also, patch the version defined in `triton/__init__.py`, as it currently always returns `2.0.0` even if the package name is `2.1.0`.

Follow-up to https://github.com/pytorch/pytorch/pull/95896, where the version needed to be updated in 4+ places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96580
Approved by: https://github.com/huydhn
2023-03-12 20:00:48 +00:00
30b968f60d Revert "[BE] Dedup hardcoded triton versions (#96580)"
This reverts commit c131e51e6248cf04135db317040b5be3ab944d41.

Reverted https://github.com/pytorch/pytorch/pull/96580 on behalf of https://github.com/malfet due to Forgot to fix lint
2023-03-12 19:37:52 +00:00
c131e51e62 [BE] Dedup hardcoded triton versions (#96580)
Define it once in `.ci/docker/trition_version.txt` and use it everywhere.

Also, patch the version defined in `triton/__init__.py`, as it currently always returns `2.0.0` even if the package name is `2.1.0`.

Follow-up to https://github.com/pytorch/pytorch/pull/95896, where the version needed to be updated in 4+ places.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96580
Approved by: https://github.com/huydhn
2023-03-12 16:56:04 +00:00
4b372e3958 [memory profiling] C++ tracing support (#95357)
Adds the ability to quickly generate stack traces for C++,
and combine Python, TorchScript, and C++ frames into a single trace.

This makes it possible for the memory tracer to record allocations inside
C++ code (e.g. convolution temporaries, backward operators).

The unwinder code is ~10x faster than execinfo.h's backward because it
caches fast unwinder routines for instruction pointers that have already been seen.
It is also only 1.2--2x slower than copying the entire stack (the approach perf takes),
while using 2 orders of magnitude less space per stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95357
Approved by: https://github.com/bertmaher
2023-03-12 07:24:14 +00:00
48490cec28 [memory profiling] Move Context object to c10 (#96280)
Minor refactor so that a follow-up PR can have objects that meet the GatheredContext
interface without having to depend on CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96280
Approved by: https://github.com/eellison
2023-03-12 07:24:14 +00:00
266089a3fe [memory snapshots] record scripted stack traces (#95356)
Adds support for seeing both python and script stack traces in memory
debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95356
Approved by: https://github.com/aaronenyeshi
2023-03-12 07:24:14 +00:00
e8b0f504e2 Fix unpicklable object in AveragedModel (#95979)
Fixes #95376

Don't store the callable `avg_fn`; instead, check whether an `avg_fn` was provided and
fall back to the default implementation when it was not.
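
A minimal sketch of the idea (illustrative only, not the actual `torch.optim.swa_utils.AveragedModel` code): keep `avg_fn` optional and compute the default average inline instead of storing a lambda on the instance:
```python
class AveragedParamSketch:
    def __init__(self, avg_fn=None):
        # avg_fn may be None; no unpicklable default callable is stored
        self.avg_fn = avg_fn

    def update(self, averaged, current, num_averaged):
        if self.avg_fn is None:
            # default running-average formula
            return averaged + (current - averaged) / (num_averaged + 1)
        return self.avg_fn(averaged, current, num_averaged)
```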
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95979
Approved by: https://github.com/janeyx99
2023-03-12 05:13:22 +00:00
82d3d053b9 Properly capturing argument names for decorated/wrapped functions (#96557)
`inspect.getfullargspec` does not properly handle functions/methods wrapped by functools.wraps(). As a result, it gets an empty list of `args` in FullArgSpec.

This PR rewrites the logic using `inspect.signature`, which handles functools.wraps() correctly.
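
For example (a small self-contained repro, not the PR's test case):
```python
import functools
import inspect

def decorate(fn):
    @functools.wraps(fn)
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)
    return inner

@decorate
def f(x, y=1):
    return x + y

print(inspect.getfullargspec(f).args)         # [] -- sees only the wrapper's *args
print(list(inspect.signature(f).parameters))  # ['x', 'y'] -- follows __wrapped__
```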

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96557
Approved by: https://github.com/jansel
2023-03-12 01:40:06 +00:00
a7a09adb86 Add location information for assertions in torch.jit.annotations.try_ann_to_type (#96423)
There are two assertions in `torch.jit.annotations.try_ann_to_type` that could benefit from adding source level location information.

For example, the current assertion:
```
        msg = "Unsupported annotation {} could not be resolved because {} could not be resolved."
        assert valid_type, msg.format(repr(ann), repr(contained))
```
reports:
```
AssertionError: Unsupported annotation typing.Union[typing.Dict, NoneType] could not be resolved because typing.Dict could not be resolved at
```
I find it beneficial to know from which line of code this assertion was triggered. Adding the location information then reports:
```
AssertionError: Unsupported annotation typing.Union[typing.Dict, NoneType] could not be resolved because typing.Dict could not be resolved at
  File "/home/schuetze/Documents/work/github/prediction_net/multimodal/models/heads/retina_head.py", line 189
    def forward(self, fpn_features: t.Dict[str, torch.Tensor],
                inputs: t.Dict[str, torch.Tensor],
                gts: t.Optional[t.Dict] = None) -> t.Dict[str, t.Any]:
                     ~~~~~~~~~~~~~~~~~~ <--- HERE
        """
        """
```

Adding this location information is related to #96420, but the changes in this PR can be made without any API changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96423
Approved by: https://github.com/davidberard98
2023-03-11 21:49:13 +00:00
12735952a0 Symintify _gather_sparse_backward (#96591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96591
Approved by: https://github.com/Skylion007
2023-03-11 20:48:06 +00:00
cb7c796b4b Enable min.unary_out (#96441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96441
Approved by: https://github.com/ngimel
2023-03-11 19:23:33 +00:00
31a6730411 [pt2][inductor] Ignore trace.upload_tar when pickling config (#96519)
Summary: if trace.upload_tar is set, it's a function, and it can't be pickled.
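
A minimal sketch of the workaround (illustrative names, not the actual inductor config machinery): drop callable entries such as `trace.upload_tar` before pickling:
```python
import pickle

def picklable_config(cfg: dict) -> dict:
    # skip callables (e.g. trace.upload_tar), which pickle cannot handle
    return {k: v for k, v in cfg.items() if not callable(v)}

cfg = {"trace.upload_tar": lambda path: None, "trace.enabled": True}
blob = pickle.dumps(picklable_config(cfg))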

Test Plan:
Used on a Meta-internal workload; also, hacked up
test/inductor/test_smoke.py to set trace.upload_tar and ran with
TORCH_COMPILE_DEBUG=1

Reviewed By: mlazos

Differential Revision: D43915178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96519
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-11 19:20:42 +00:00
0d7c44096a Add baddbmm meta function (#96548)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96548
Approved by: https://github.com/ezyang
2023-03-11 19:09:24 +00:00
8e0d5bf538 [primTorch] add meta implementation for aten.min.dim (#96442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96442
Approved by: https://github.com/ngimel
2023-03-11 18:51:51 +00:00
ab148da66c Add fsspec to requirements.txt (#96532)
This package is needed to enable distributed checkpoint support for different backends.

Fsspec package size:
```
du  -h /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
264K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/__pycache__
58K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations/__pycache__
377K    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec/implementations
1017K   /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/fsspec
96K     /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg/EGG-INFO
1.2M    /fsx/users/irisz/conda/envs/pytorch/lib/python3.9/site-packages/fsspec-2023.3.0-py3.9.egg
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96532
Approved by: https://github.com/osalpekar
2023-03-11 06:42:48 +00:00
f3fc4d035d add timeout and retry to metric upload job (#96582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96582
Approved by: https://github.com/huydhn
2023-03-11 04:25:41 +00:00
b4f434a731 [JIT] mark _exchange_device op as having side effects (#96364)
In #95305 the _exchange_device ops are getting dead-code-eliminated, so they don't get called. #95306 fixes this by using the output of the op, but it's still possible that JIT might reorder the op around other ops.

This PR marks _exchange_device as having side effects so that the ops won't get dead code eliminated or reordered, even if the return is not used.

Differential Revision: [D43966285](https://our.internmc.facebook.com/intern/diff/D43966285)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96364
Approved by: https://github.com/eellison
2023-03-11 04:17:58 +00:00
f89bd26fe4 update options (#96551)
Fix for https://github.com/pytorch/pytorch/issues/96540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96551
Approved by: https://github.com/msaroufim
2023-03-11 03:33:27 +00:00
362958125a [vision hash update] update the pinned vision hash (#96570)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96570
Approved by: https://github.com/pytorchbot
2023-03-11 03:17:37 +00:00
c3614c7a61 Add a flag to benchmarks script to keep the test report directory (#96398)
I noticed from the Rockset data that there are only `float32` records, while there should be records for both dtypes.  It turns out that the benchmark script generated by `runner.py` always removes the output directory by default, so only records from the `float32` runs executed later are left.

For example, `rm -rf /var/lib/jenkins/workspace/test/test-reports` appeared twice in the CI log https://ossci-raw-job-status.s3.amazonaws.com/log/11840774308.

I'm adding a new flag `--keep-output-dir` to keep the output directory.  This is off by default as I'm not sure how this script is used internally; people probably expect to see the output directory cleaned up every time.

### Testing

I don't really want to start the 10-hour jobs just to test this small flag, so I triple-checked the change to make sure that there is no bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96398
Approved by: https://github.com/weiwangmeta
2023-03-11 03:16:56 +00:00
bdecf50b47 [fix] reshape_dim_outof to handle 0 sized dim (#96493)
Fixes https://github.com/pytorch/pytorch/issues/96345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96493
Approved by: https://github.com/zou3519
2023-03-11 02:52:00 +00:00
1be04be3b2 Remove fetch-depth from _calc_docker_img (#96588)
In its current form it will only work for PRs with one commit.
Check out the full PyTorch repo (but skip submodules).

Example of failure in PR with multiple commits, see https://github.com/pytorch/pytorch/actions/runs/4389777316/jobs/7687694067#step:4:68

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96588
Approved by: https://github.com/huydhn, https://github.com/izaitsevfb, https://github.com/osalpekar
2023-03-11 02:18:39 +00:00
61cb544397 Align mask formatting of both masks more closely (#96286)
Summary: Align mask formatting of both masks more closely

Test Plan: sandcastle & github

Differential Revision: D43878634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96286
Approved by: https://github.com/cpuhrsch
2023-03-11 02:18:05 +00:00
1e6961586b [Profiler] Memory timeline to show actual timestamps (#96535)
Summary: Rather than starting the timeline at t=0, keep the actual timestamps of the memory events.

Test Plan: CI Tests

Reviewed By: leitian, chaekit

Differential Revision: D43807624

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96535
Approved by: https://github.com/davidberard98
2023-03-11 00:25:30 +00:00
51b8ab7879 Clean up references to test_megatron_prototype (#96431)
This test has been deleted in #96254

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96431
Approved by: https://github.com/clee2000, https://github.com/fduwjj
2023-03-10 23:50:32 +00:00
4242e698a3 [BE][MPS] Add MPS to clang format (#96562)
I'm getting tired of asking people to add a space after `if` and all that jazz, so let the linter do that.
Add a section for the Objective-C language, where the column width is extended to 120 characters and `AlignAfterOpenBracket` is set to `Align`.

All `.mm` changes in this PR are made by running linter as follows:
```
lintrunner --take CLANGFORMAT --all-files --apply-patches
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96562
Approved by: https://github.com/seemethere, https://github.com/janeyx99, https://github.com/ZainRizvi, https://github.com/izaitsevfb, https://github.com/PaliC, https://github.com/albanD
2023-03-10 23:17:54 +00:00
a7689e73f6 [Docs] Document of RReLU about its different behavior between training and evaluation (#95624)
The current documentation of [Randomized Leaky ReLU (RReLU)](https://pytorch.org/docs/stable/generated/torch.nn.RReLU.html#torch.nn.RReLU) does not demonstrate its different behavior between training and evaluation. This PR adds illustrations of this.

Fixes #95605.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95624
Approved by: https://github.com/albanD, https://github.com/H-Huang
2023-03-10 22:33:24 +00:00
0bf2ed2eb4 Remove duplicate windows job (#96552)
They are already present in trunk.yml

During the migration from 11.6->11.7 to 11.7->11.8, the 11.6 trunk jobs were migrated to 11.7, but the 11.7 periodic jobs were not migrated; the 11.8 jobs were simply added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96552
Approved by: https://github.com/huydhn
2023-03-10 22:28:56 +00:00
80ce1a934e Fix flaky Dynamo export tests (#96488)
Planning to do a full writeup later. The short story is, sometimes the following chain of events happens:

1. We turn on Dynamo's custom frame handler
2. GC triggers (and all of the finalizers run under Dynamo)
3. GC hits a GeneratorExit frame
4. You end up in the custom frame handler with throw_flag == TRUE and PyErr_Occurred() != NULL

If this happens and we blindly call into other Python functions (like the Python callback), the executed Python code will immediately raise an exception (because there's already an ambient exception set.) This is very, very confusing. The fix is to defer to the regular handler when throw_flag is TRUE.

I triggered this locally with

```
PYTHONUNBUFFERED=1 pytest test/dynamo/test_dynamic_shapes.py   -k 'Unspec and export and not dupes and not reorder' -v -x -s
```

But I also have some tests which trigger the problem synthetically.

Fixes https://github.com/pytorch/pytorch/issues/93781

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96488
Approved by: https://github.com/albanD
2023-03-10 21:51:54 +00:00
7fcf8b1829 [Dynamo] Support torch.{cuda/cpu}.amp.autocast (#95416)
For Meta internal use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95416
Approved by: https://github.com/jansel
2023-03-10 21:48:08 +00:00
d05f2ae476 Require DOCTEST_SHOW environ to run plt.show (#96522)
@ezyang This is a minor change.

I was using the doctests to check that my install wasn't broken via:

```bash
xdoctest -m torch --style=google --global-exec "from torch import nn\nimport torch.nn.functional as F\nimport torch" --options="+IGNORE_WHITESPACE"
```

I noticed that it stops in the middle to show a matplotlib figure. I added a condition so it only calls pyplot's show if a DOCTEST_SHOW environment variable exists. With this fix, the above command runs to completion and is an easy way for users to put torch through its paces given just a fresh install.
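
A minimal sketch of the gating (illustrative, not the exact doctest text in torch):
```python
import os
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1])
if os.environ.get("DOCTEST_SHOW"):
    plt.show()  # only blocks when the user explicitly opts in
```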
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96522
Approved by: https://github.com/ezyang
2023-03-10 21:47:20 +00:00
384545bf84 [ONNX] Preserve stacktrace info for decomp (#95929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95929
Approved by: https://github.com/justinchuby
2023-03-10 21:07:03 +00:00
b97ce3774a [ONNX] Move graph transform functions to 'passes' (#95664)
This PR only moves code to its new location. There are no other actual code changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95664
Approved by: https://github.com/justinchuby
2023-03-10 21:07:03 +00:00
41991710b2 Revert "[PyTorch] Use c10::FastMap for memoizing in Pickler (#96360)" (#96547)
This reverts commit 69d3fa2e4d93f3367ceb3af62d78aedd317dca6c.

Reason: breaks internal meta tests. See [D43926671](https://www.internalfb.com/diff/D43926671)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96547
Approved by: https://github.com/seemethere, https://github.com/malfet
2023-03-10 20:57:06 +00:00
429091140e [BE][MPS] Use convenience functions (#96521)
Introduce `getMPSScalarType(const Tensor&)` that calls `getMPSScalarType(t.scalar_type())`,
and replace `getMPSScalarType(t.scalar_type())` with `getMPSScalarType(t)` throughout the codebase.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96521
Approved by: https://github.com/seemethere
2023-03-10 20:31:10 +00:00
85961f5728 Fix broken anchor in RELEASE.md (#96525)
Fixes https://github.com/pytorch/pytorch/issues/96514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96525
Approved by: https://github.com/atalman
2023-03-10 19:35:03 +00:00
4519228f60 Reduce pytest blocklist part 2 (#96397)
Enable pytest for a few unique files.  pytest runs tests in a different order than unittest (but still a consistent ordering with respect to itself) and some tests change global state, causing other tests to fail.

`test_transpose_non_contiguous` in `test_torchinductor.py` gets impacted by some other test, but I'm not sure which one, so my solution is to reset the metrics before the rest of the test is run.

`test_register_patterns` in `test_quantize_fx.py` adds extra keys to global variables, so remove them when the test is done via unittest's `addCleanup`, which also works under pytest.

pytest doesn't really have an equivalent for `load_tests` so change it to be like `test_jit` that imports all the classes.  I also attempted to dynamically import them, but I failed.

`test_public_api_surface` in `test_fx.py` checks for a backwards compatibility classification.  There is a different test in test_fx that results in `fuser_utils` being imported.  pytest runs this test before `test_public_api_surface` while unittest runs it after, so pytest sees `fuser_utils` when crawling through the modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96397
Approved by: https://github.com/huydhn
2023-03-10 19:10:43 +00:00
49eed50d19 [Inductor Perf CI] Lower the threshold of performance smoke test speedup. (#96531)
Avoids issues with https://github.com/pytorch/pytorch/issues/96530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96531
Approved by: https://github.com/seemethere
2023-03-10 18:58:28 +00:00
e948ba07d4 [Profiler] Add export_memory_timeline to save memory timeline plot to file (#96137)
Summary: Added the functionality to export the memory timeline plot as a list of times and sizes, which the post processing visualization can parse and plot.

Test Plan: CI Tests

Reviewed By: leitian, fengxizhou

Differential Revision: D43680760

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96137
Approved by: https://github.com/chaekit
2023-03-10 18:20:25 +00:00
29cd60dfb7 [CI] handle more dynamo benchmark models that are not expected to be deterministic (#96324)
Follow-up to #96245. alexnet, Background_Matting, vision_maskrcnn, and vgg16 all have the same problem; but on float32 they were also failing on the previous day so I missed this. Once the amp jobs became available I could see that these have the same issue (on both float32 and amp).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96324
Approved by: https://github.com/desertfire
2023-03-10 18:15:34 +00:00
481582eb11 Remove land checks in trymerge (#96401)
Remove all references to land checks (rebase on viable strict in a different branch) since they are no longer used.  Adding ciflow/trunk on merge and/or rebasing the entire PR is preferred.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96401
Approved by: https://github.com/huydhn
2023-03-10 18:11:05 +00:00
6894bb0a85 Remove on_green and mandatory_only (#96400)
Our default behavior is on green, and currently the two main modes are on green and force.
Surprisingly, both these flags are pretty much not used anywhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96400
Approved by: https://github.com/huydhn
2023-03-10 18:11:05 +00:00
219d5eb4f1 [QOL] Raise a NameError when accessing non-existent variable (#96418)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96418
Approved by: https://github.com/albanD
2023-03-10 17:54:02 +00:00
55cf7eef86 add/add_ for sparse compressed formats: fix silent index downcast int64 -> int32 (#95294)
Fixes https://github.com/pytorch/pytorch/issues/95224.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95294
Approved by: https://github.com/cpuhrsch, https://github.com/amjames
2023-03-10 17:51:40 +00:00
a651e6253a [CI] Change compile_threads to 1 when running benchmark accuracy test on CI (#96195)
Summary: This is not a pretty solution, but it is a way to verify whether the flakiness is coming from parallel compilation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96195
Approved by: https://github.com/ngimel
2023-03-10 17:39:38 +00:00
939c4ae6cd [DataPipe] Add copy option to fork DataPipe (#96030)
Fixes pytorch/data#1061 and fixes pytorch/data#1032
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96030
Approved by: https://github.com/ejguan, https://github.com/NivekT
2023-03-10 17:31:56 +00:00
e35f020418 Retry XLA dependencies installation step (#96352)
XLA installs some dependencies as part of the CI job, and the step can sometimes fail due to network flakiness, e.g. 3f840cc627 where it failed to get some nodejs packages.  A quick fix is to retry the step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96352
Approved by: https://github.com/JackCaoG, https://github.com/ZainRizvi
2023-03-10 17:16:50 +00:00
55d4842a48 [SPMD] Add defunctionalize_optimizer feature (#96323)
Summary: Manually adding dependencies between _foreach_add_, _fused_adam_, and the output can cause issues when lowering to Inductor. This API removes those dependencies.

Test Plan: CI

Differential Revision: D43916450

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96323
Approved by: https://github.com/kumpera
2023-03-10 16:05:23 +00:00
c7bd9b9490 Switch AsyncCollectiveTensor to be a wrapper subclass. (#96105)
Our usage is of a wrapper, so it makes sense that we use one.

This makes it possible for FakeTensorMode to work.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96105
Approved by: https://github.com/wanchaol, https://github.com/wconstab
2023-03-10 15:13:32 +00:00
3bb16a0842 Enable thp(transparent huge pages) for buffer sizes >=2MB (#95963)
The 2MB thp pages provide better allocation latencies compared to the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if the THP_MEM_ALLOC_ENABLE environment variable is set.

re-landing https://github.com/pytorch/pytorch/pull/93888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95963
Approved by: https://github.com/malfet
2023-03-10 13:58:01 +00:00
b053a0f2ba [XPU][Profiler] Add API support for XPU profiler to Kineto path (#94502)
This patch aims to add support for an XPU profiler that will work together with Kineto. After this PR, Kineto will follow these APIs to integrate with it. The development of the Python interface is also nearly done.

Signed-off-by: Huang, Xunsong <xunsong.huang@intel.com>

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94502
Approved by: https://github.com/ezyang
2023-03-10 12:17:14 +00:00
d0f4d62961 flatten_indices: remove syncs (#94401)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94401
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-10 12:03:26 +00:00
1b59c3feb5 Add PyObjectSlot member to StorageImpl (#93342)
Part of #91395

Also modifies how `StorageImpl`s are stored in JIT static runtime's `MemoryPlanner`, which used to `std::move` `StorageImpl`s into a vector. But `StorageImpl` can no longer be moved. Instead, `MemoryPlanner` now contains a malloced buffer to which we add new `StorageImpl`s using placement new

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93342
Approved by: https://github.com/ezyang
2023-03-10 10:40:01 +00:00
987eade3f3 [fix] resize_ and resize_as_ : version bump (#96403)
Fixes https://github.com/pytorch/pytorch/issues/93776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96403
Approved by: https://github.com/ezyang
2023-03-10 06:46:30 +00:00
8bce88d9de [caffe2] dont call cudnnDestroy on thread exit (crashes on windows with cuda 11/12) (#95382)
Summary:
My team has been hitting a mysterious crash for a few months on a windows binary that uses Caffe2 inside a worker thread.

When this thread gets destroyed, there is an error at this line in context_gpu.h where the state of this operation gives CUDNN_STATUS_INTERNAL_ERROR instead of CUDNN_STATUS_SUCCESS.

When enabling cudnn debug logs (via the env variables nvidia specifies), I can see that the context is destroyed twice, even though this code only destroys it once, so something mysterious is causing a double free.

This seems very very similar to the issue/fix described here for pytorch:
https://github.com/pytorch/pytorch/issues/17658
https://github.com/apache/tvm/pull/8267

And pytorch handles this in the same way, by just not calling cudnnDestroy

This seems to have become an issue with cuda11, but I tested cuda12 as well and found that the issue persists so this needs to be somehow fixed.

Test Plan:
CI

I checked that the specific windows binary I am using is able to create and destroy caffe2-invoking threads without causing the application to crash.

buck run arvr/mode/win/cuda11/opt //arvr/projects/nimble/prod/tools/MonoHandTrackingVis

Differential Revision: D43538017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95382
Approved by: https://github.com/malfet
2023-03-10 06:42:51 +00:00
76cac70939 new triton main pin (#95896)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95896
Approved by: https://github.com/jansel, https://github.com/malfet
2023-03-10 06:30:41 +00:00
9aa216cb46 reland #96249: [inductor] show more kernel specific metrics in the benchmark result (#96461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96461
Approved by: https://github.com/ngimel
2023-03-10 06:18:21 +00:00
d0731271cd Revert "new triton main pin (#95896)"
This reverts commit 6e0359dd4233b0cec51521bec8869f0a46ebd98b.

Reverted https://github.com/pytorch/pytorch/pull/95896 on behalf of https://github.com/huydhn due to I am not quite sure what this is about yet, but testing 3.8 wheel starts to fail 6e0359dd42
2023-03-10 05:41:45 +00:00
076792a3e1 [ONNX][Diagnostics] Speed up 'python_call_stack' by 'traceback' (#96348)
`inspect.stack()` retrieves all stack frames with code context and is not performant. `inspect.stack(0)`
speeds up the call greatly, but loses the line snippets.
Rewrite with `traceback.extract_stack`, which is better in both regards.
Speeds up `export` call in `test_gpt2_tiny` from ~30s to ~4s under profiling.

Before
```log
│...├─ 30.794 export_after_normalizing_args_and_kwargs  <@beartype(torch.onnx._internal.fx.exporter.export_after_normalizing_args_and_kwargs) at 0x7f815cba0700>:1
│...│  └─ 30.794 export_after_normalizing_args_and_kwargs  torch/onnx/_internal/fx/exporter.py:580
```

After
```log
│...├─ 4.427 export_after_normalizing_args_and_kwargs  <@beartype(torch.onnx._internal.fx.exporter.export_after_normalizing_args_and_kwargs) at 0x7fd8281b3700>:1
│...│  └─ 4.427 export_after_normalizing_args_and_kwargs  torch/onnx/_internal/fx/exporter.py:580
```
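
As a rough illustration of the difference between the two APIs (not the diagnostics code itself):
```python
import inspect
import traceback

# inspect.stack() builds FrameInfo objects (with code context) for every frame
frames = inspect.stack(0)  # context=0 is faster but loses the line snippets

# traceback.extract_stack() returns lightweight FrameSummary objects that still
# carry filename, lineno, function name, and the source line text
for s in traceback.extract_stack()[-3:]:
    print(f"{s.filename}:{s.lineno} in {s.name}: {s.line}")
```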

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96348
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby
2023-03-10 05:11:47 +00:00
15e58c19ec [FSDP][optim_state_dict] Copy step tensor so that each parameter has its own step (#96313)
Summary: When parameters are flattened, multiple parameters share the same step. When unflattening the parameters, the current implementation still makes these parameters share the same step. While this is not wrong, some training infra gets confused by the shared tensor storage. This PR fixes the issue.
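
A hypothetical sketch of the fix (names are illustrative, not FSDP's actual optimizer state-dict code): clone the shared `step` tensor when unflattening so each parameter owns its own copy:
```python
import torch

def unflatten_state(flat_param_state, param_names):
    unflat = {}
    for name in param_names:
        state = dict(flat_param_state)
        step = state.get("step")
        if torch.is_tensor(step):
            state["step"] = step.clone()  # break the shared storage
        unflat[name] = state
    return unflat
```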

Test Plan: CI

Reviewed By: awgu

Differential Revision: D43893592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96313
Approved by: https://github.com/zhaojuanmao
2023-03-10 04:51:30 +00:00
cdab1d676c pt2e short term quant: respect qmin/qmax for linear weight (#96232)
Summary:

Makes the `nnqr.Linear` module respect the qmin/qmax attributes of the weight observer.  This is to unblock some customer teams that depend on non-default values of these attributes.
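
For illustration, a weight observer with non-default quant_min/quant_max might look like this (the values are hypothetical, not taken from the PR):
```python
import torch
from torch.ao.quantization import MinMaxObserver

# a weight observer configured with non-default quant_min/quant_max that the
# reference linear module should now respect
weight_observer_ctor = MinMaxObserver.with_args(
    dtype=torch.qint8,
    qscheme=torch.per_tensor_symmetric,
    quant_min=-64,
    quant_max=63,
)
observer = weight_observer_ctor()
```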

Test plan:

```
python test/test_quantization.py -k TestReferenceQuantizedModule.test_linear_decomposed
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96232
Approved by: https://github.com/andrewor14
2023-03-10 04:46:20 +00:00
bea6b1d29a [vision hash update] update the pinned vision hash (#96369)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96369
Approved by: https://github.com/pytorchbot
2023-03-10 03:54:21 +00:00
f3b8638074 Adding nn.ZeroPad1d and nn.ZeroPad3d (#96295)
Fixes #95796

### Implementation
Adds the Python implementation for `nn.ZeroPad1d` and `nn.ZeroPad3d` in `torch/nn/modules/padding.py`.

Adds the cpp implementation for `nn::ZeroPad1d` and `nn::ZeroPad3d` in the following 3 files, refactored with templates similarly to `nn::ConstantPad`'s implementation:
- `torch/csrc/api/include/torch/nn/modules/padding.h`
- `torch/csrc/api/include/torch/nn/options/padding.h`
- `torch/csrc/api/src/nn/modules/padding.cpp`

Also added relevant definitions in `torch/nn/modules/__init__.py`.
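
Example usage of the new modules (shapes follow the existing `ZeroPad2d`/`ConstantPad` conventions):
```python
import torch
import torch.nn as nn

pad1d = nn.ZeroPad1d(2)                    # pad 2 zeros on each side of the last dim
x = torch.randn(1, 3, 5)
print(pad1d(x).shape)                      # torch.Size([1, 3, 9])

pad3d = nn.ZeroPad3d((1, 1, 2, 2, 0, 0))   # (left, right, top, bottom, front, back)
y = torch.randn(1, 2, 4, 4, 4)
print(pad3d(y).shape)                      # torch.Size([1, 2, 4, 8, 6])
```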
### Testing
Adds the following tests:
-  cpp tests of similar length and structure as `ConstantPad` and the existing `ZeroPad2d` impl in `test/cpp/api/modules.cpp`
- cpp API parity tests in `torch/testing/_internal/common_nn.py`
- module init tests in `test/test_module_init.py`

Also added relevant definitions in `test/cpp_api_parity/parity-tracker.md`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96295
Approved by: https://github.com/soulitzer
2023-03-10 03:51:41 +00:00
cyy
d0e4ca233e some reference and move fixes (#95942)
This PR introduces some modifications:
1. We find some const function parameters that can be passed by reference and add the reference.
2. We find more opportunities for passing by value and change them accordingly.
3. Some use-after-move errors are fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95942
Approved by: https://github.com/Skylion007
2023-03-10 03:44:09 +00:00
6e0359dd42 new triton main pin (#95896)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95896
Approved by: https://github.com/jansel
2023-03-10 03:40:37 +00:00
065de43012 Fixing a bug where allocating a 4GB block results in using 8GB of memory (#95827)
I added two constants. The first avoids rounding once an allocation exceeds a certain threshold, and the second controls which blocks can be cached.

Allocations larger than `kMaxRoundThreshold` will not be rounded to the next power of two anymore. Generally it is expected that larger allocations happen less frequently, and this more or less matches what happens in `CudaCachingAllocator`.

Blocks larger than `kMaxCachedSize` will not be cached. This is a separate problem from the above, but I noticed that the caching here is poorly implemented and does nothing to avoid fragmentation or to help with good resource utilization. For example, the following allocations:
```
t1 = alloc(4GB)
del t1
t2 = alloc(10k)
t3 = alloc(4GB)
```
this results in allocating 8GB, because the cached 4GB block gets assigned to the 10k allocation, wasting the rest of the block.

Lastly, ideally I would make these constants configurable, but looking around the code I didn't see any existing mechanisms in ATen to configure things at runtime.

Fixes #95823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95827
Approved by: https://github.com/ngimel
2023-03-10 03:21:06 +00:00
a87f3f612e [MPS] Fall back multi-layer LSTM on macOS 12 (#90909)
The native implementation of LSTM has been fixed on macOS 13.

On macOS 12, the multi-layer LSTM still has a numerical correctness issue that cannot be resolved on the OS side.

Thus, we fall back to LSTMCell iteration for multi-layer LSTM on macOS 12. It might have a performance impact but makes LSTM on macOS 12 fully usable.

Fixes: #90421
Issues related: #80306, #83144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90909
Approved by: https://github.com/albanD, https://github.com/kulinseth
2023-03-10 03:10:49 +00:00
b0a580a21d [ONNX] Export logical_not (#96315)
Fixes https://github.com/pytorch/pytorch/issues/95154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96315
Approved by: https://github.com/justinchuby
2023-03-10 02:25:08 +00:00
5f89d147a1 [ONNX] STFT Support (#92087)
This PR addresses issue [#81075](https://github.com/pytorch/pytorch/issues/81075),  making `torch.stft` compatible with ONNX Opset 17's STFT operator.

The conversion works for _most_ of `torch.stft` functionality:

- Batched or unbatched inputs
- Normalization
- Pre-computed windows
- Rectangular windows
- One-sided returns
- Window centering (implicitly supported)

What is currently _not_ supported is **complex types**, due to the lack of conversion functionality between PyTorch and ONNX (https://github.com/pytorch/pytorch/issues/86746).

Regardless, this is easy to bypass by setting `return_complex=False` when using `torch.stft`.
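
A minimal sketch of an export that stays within the supported surface (assuming an opset-17-capable exporter; the file name and sizes are illustrative):
```python
import torch

class Spectrogram(torch.nn.Module):
    def forward(self, x):
        # return_complex=False avoids the unsupported complex output type
        return torch.stft(x, n_fft=64, window=torch.hann_window(64),
                          return_complex=False)

x = torch.randn(2, 1024)
torch.onnx.export(Spectrogram(), (x,), "stft.onnx", opset_version=17)
```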

Note that there is already a draft PR to address this (https://github.com/pytorch/pytorch/pull/83944), but it is currently closed and it only partially addresses the conversion (i.e., most of `torch.stft` functionality is lacking, and unit tests are missing).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92087
Approved by: https://github.com/justinchuby
2023-03-10 02:20:58 +00:00
69d3fa2e4d [PyTorch] Use c10::FastMap for memoizing in Pickler (#96360)
These maps don't rely on reference stability, so FastMap should be fine.

Differential Revision: [D43926671](https://our.internmc.facebook.com/intern/diff/D43926671/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96360
Approved by: https://github.com/ezyang
2023-03-10 02:18:16 +00:00
cc798f1a4f [PyTorch] add c10/util/FbcodeMaps.h (#96359)
Allow us to use folly maps in fbcode and std maps for compatibility in OSS, extending what static runtime is already doing.

Differential Revision: [D43926670](https://our.internmc.facebook.com/intern/diff/D43926670/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96359
Approved by: https://github.com/ezyang
2023-03-10 02:18:16 +00:00
cc699c56dc reland #96248 [inductor] show performance for each autotune config for a kernel (#96458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96458
Approved by: https://github.com/ngimel
2023-03-10 01:40:04 +00:00
cf3d3a583e Add env PYTORCH_TEST_DO_NOT_USE_PYTEST as an option to not use pytest in unit testing (#96444)
Set environment variable
```
PYTORCH_TEST_DO_NOT_USE_PYTEST=1
```
to not use pytest in pytorch unit testing.

This change is related to some recent changes, e.g. #96210, #96016, #95844, #95659, that enabled the use of pytest in many test modules. Those test modules passed normally before, but failed immediately once pytest was used. A sample stack trace is:

```python
root@8e3168a83ee2:/opt/pytorch/pytorch# python test/run_test.py -v -i test_optim -- -v --save-xml
Ignoring disabled issues:  []
/opt/pytorch/pytorch/test/run_test.py:1225: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if torch.version.cuda is not None and LooseVersion(torch.version.cuda) >= "11.6":
Selected tests:
 test_optim
parallel (file granularity) tests:
 test_optim
serial (file granularity) tests:

Ignoring disabled issues:  []
Ignoring disabled issues:  []
Running test_optim ... [2023-03-09 12:51:59.358110]
Executing ['/usr/local/bin/python', '-bb', 'test_optim.py', '-v', '--save-xml', '-v', '--use-pytest', '-vv', '-rfEX', '-x', '--reruns=2'] ... [2023-03-09 12:51:59.358810]

Test results will be stored in test-reports/python-pytest/test_optim/test_optim-5e41643c8bac8ace.xml
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/test_optim.py", line 4581, in <module>
    run_tests()
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 796, in run_tests
    exit_code = pytest.main(args=pytest_args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 148, in main
    config = _prepareconfig(args, plugins)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 329, in _prepareconfig
    config = pluginmanager.hook.pytest_cmdline_parse(
  File "/usr/local/lib/python3.10/site-packages/pluggy/_hooks.py", line 265, in __call__
    return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 80, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 55, in _multicall
    gen.send(outcome)
  File "/usr/local/lib/python3.10/site-packages/_pytest/helpconfig.py", line 103, in pytest_cmdline_parse
    config: Config = outcome.get_result()
  File "/usr/local/lib/python3.10/site-packages/pluggy/_result.py", line 60, in get_result
    raise ex[1].with_traceback(ex[2])
  File "/usr/local/lib/python3.10/site-packages/pluggy/_callers.py", line 39, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1060, in pytest_cmdline_parse
    self.parse(args)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1348, in parse
    self._preparse(args, addopts=addopts)
  File "/usr/local/lib/python3.10/site-packages/_pytest/config/__init__.py", line 1231, in _preparse
    self.pluginmanager.load_setuptools_entrypoints("pytest11")
  File "/usr/local/lib/python3.10/site-packages/pluggy/_manager.py", line 287, in load_setuptools_entrypoints
    plugin = ep.load()
  File "/usr/local/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "/usr/local/lib/python3.10/site-packages/_pytest/assertion/rewrite.py", line 168, in exec_module
    exec(co, module.__dict__)
  File "/usr/local/lib/python3.10/site-packages/xdist/looponfail.py", line 16, in <module>
    import execnet
  File "/usr/local/lib/python3.10/site-packages/execnet/__init__.py", line 14, in <module>
    from .gateway_base import DataFormatError
  File "/usr/local/lib/python3.10/site-packages/execnet/gateway_base.py", line 1138, in <module>
    FLOAT_FORMAT_SIZE = struct.calcsize(FLOAT_FORMAT)
BytesWarning: Comparison between bytes and string
FINISHED PRINTING LOG FILE of test_optim (/opt/pytorch/pytorch/test/test-reports/test_optim_1pnlesrz.log)

test_optim failed!
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/test/run_test.py", line 1428, in <module>
    main()
  File "/opt/pytorch/pytorch/test/run_test.py", line 1386, in main
    raise RuntimeError(
RuntimeError: test_optim failed!

Tip: You can keep running tests even on failure by passing --keep-going to run_test.py.
If running on CI, add the 'keep-going' label to your PR and rerun your jobs.
```

I'd like to propose this option, which allows users to run their tests in CI with the plain Python unittest runner instead of pytest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96444
Approved by: https://github.com/malfet
2023-03-10 01:32:15 +00:00
ff2e14f200 Skip rexnet_100 in dynamic CI (#96474)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96474
Approved by: https://github.com/yanboliang, https://github.com/msaroufim
2023-03-10 01:23:19 +00:00
79313345e8 Fix missing items() typo (#96417)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96417
Approved by: https://github.com/Skylion007
2023-03-10 01:13:58 +00:00
8ef8bd023d [CI] Use different subdirectories for amp and float32 nightly perf run (#96470)
Summary: runner.py deletes its output_dir as its first step, so we need
to keep two separate subdirectories.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96470
Approved by: https://github.com/huydhn
2023-03-10 01:12:14 +00:00
384d3ec2b6 Extra CR comments from #95621 (#96043)
Specifically:
063e441471 (r1120306196)
https://github.com/pytorch/pytorch/pull/95621#discussion_r1125015510

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96043
Approved by: https://github.com/Chillee, https://github.com/albanD
2023-03-10 01:10:48 +00:00
2f6a371ae9 Revert "Optimize nn.Module __call__ fast path for dynamo (#95931)" (#96242)
Reverting due to concerns over silent unsoundness (skipped hooks) if users have directly added hooks dicts without using official torch APIs.

This reverts commit 26045336ca323fd27cff2a7340fe896117d5fb6e.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96242
Approved by: https://github.com/albanD
2023-03-10 01:05:01 +00:00
6154be1dd1 [ONNX] Fix circular padding to support dynamic axes (#95647)
This commit fixes a bug where the ONNX exporter for circular padding queried the input tensor shape in order to get the correct 'end' index for a slice node. This doesn't work when the axis in question has a dynamic size. The commit fixes this by setting the 'end' index to INT_MAX, which is the recommended way of slicing to the end of a dimension with unknown size per the ONNX spec.

See https://onnx.ai/onnx/operators/onnx__Slice.html

Also adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95647
Approved by: https://github.com/BowenBao
2023-03-10 00:29:33 +00:00
faa4cb29b2 [Quant][fx] Create new FX-based LSTM reference module (#96343)
Summary: The previous LSTM reference module implementation did
not handle dtypes other than quint8 correctly. This is because
the internal LSTM custom module quantization used eager mode,
which did not insert the q-dq ops properly. E.g., we want the
following reference quantized model:

```
[dq -> linear1_fp32 -> q_to_qint32] -> dq -> q_to_quint8 ->
  [dq - linear2_fp32 -> q_to_quint8] -> dq -> ...
```

This requires two sets of `q - dq` pairs between two adjacent
ops that have different dtypes (linear1 and linear2). However,
these `q - dq` pairs were not inserted in the old flow, because
eager mode required users to insert Quant/DeQuantStubs manually.

This commit changes the internal LSTM custom module quantization
to use FX graph mode quantization, which automatically inserts
the `q - dq` ops that convert the dtypes between adjacent ops
correctly. However, using FX graph mode quantization here comes
with its own set of challenges that required some hacks to get
the end-to-end flow to work. These hacks are detailed in the
comments in the util functions.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams

This commit also updates the corresponding test to verify the
dtypes as well as the qparams in the reference quantized graph.
This test case should serve as an example for users to set up
their own LSTM reference module flows.

Reviewers: vkuzo, supriyar, jcaip

Subscribers: vkuzo, supriyar, jcaip
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96343
Approved by: https://github.com/vkuzo
2023-03-09 23:23:48 +00:00
05b679ce6a [inductor] don't match indirect indexing in fusion (#96273)
Fixes #96064

When deciding whether to fuse nodes, we match indexing like `c0 + 5 * tmp0`, but `tmp0` in the different nodes can refer to totally different values. Even when `tmp0` is the same (like in the added test), inductor still generates wrongly ordered loads and stores (loads come before stores), so it is better to just disable this fusion altogether. We should also fix the wrong ordering:
```
@pointwise(size_hints=[8], filename=__file__, meta={'signature': {0: '*i64', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3), equal_to_1=())]})
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, out_ptr1, xnumel, XBLOCK : tl.constexpr):
    xnumel = 5
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0_load = tl.load(in_ptr0 + (0))
    tmp0 = tl.broadcast_to(tmp0_load, [XBLOCK])
    tmp1 = tl.load(in_ptr1 + (x0), xmask)
    tmp2 = tl.load(out_ptr0 + (x0 + (5*tmp0)), xmask)
    tl.store(out_ptr0 + (x0 + (5*tmp0) + tl.zeros([XBLOCK], tl.int32)), tmp1, xmask)
    tl.store(out_ptr1 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
```
Note: we are loading from `out_ptr0` here (that shouldn't happen), we are loading from it before storing to it.
After this PR, the kernel above is split in 2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96273
Approved by: https://github.com/jansel
2023-03-09 23:03:46 +00:00
1bde36ba41 test only smaller block_k for mm_plus_mm (#96385)
Trim number of tested mm_plus_mm configs to work around https://github.com/openai/triton/issues/1298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96385
Approved by: https://github.com/bertmaher, https://github.com/jansel
2023-03-09 23:03:30 +00:00
090b3b95b8 [PyTorch] Add Vulkan support and tests for at::select.int operator, 4 dim/rank tensor case (#96228)
Summary:
Currently, selection along a dimension/rank is only supported for 3D/rank tensors in PyTorch Vulkan. This adds support for 4D/rank tensors, for selection along batch, channel (depth), height, and width.

Additionally:
- The existing implementations have been name-refactored to reflect whether they operate on 3d or 4d tensors.
- The params buffer for all select operations now uses only `ivec2` or `ivec4`, for memory-alignment safety.

Test Plan:
1. `buck run --target-platforms ovr_config//platform/macos:arm64-fbsource  //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -c pt.vulkan_full_precision=1` on Apple M1 MacBook
2. Confirm all tests pass with no regression, and the directly affected tests `select_4d_*`, refactored `select_3d_`, pass
3. Test output P636928908, in particular:
```
[...bunch of other tests...]

[ RUN      ] VulkanAPITest.select_3d_depth_small
[       OK ] VulkanAPITest.select_3d_depth_small (1 ms)
[ RUN      ] VulkanAPITest.select_3d_depth_medium
[       OK ] VulkanAPITest.select_3d_depth_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_depth_large
[       OK ] VulkanAPITest.select_3d_depth_large (1 ms)
[ RUN      ] VulkanAPITest.select_3d_height_small
[       OK ] VulkanAPITest.select_3d_height_small (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_medium
[       OK ] VulkanAPITest.select_3d_height_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_medium1
[       OK ] VulkanAPITest.select_3d_height_medium1 (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_medium2
[       OK ] VulkanAPITest.select_3d_height_medium2 (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_large
[       OK ] VulkanAPITest.select_3d_height_large (1 ms)
[ RUN      ] VulkanAPITest.select_3d_width_small
[       OK ] VulkanAPITest.select_3d_width_small (0 ms)
[ RUN      ] VulkanAPITest.select_3d_width_medium
[       OK ] VulkanAPITest.select_3d_width_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_width_medium2
[       OK ] VulkanAPITest.select_3d_width_medium2 (0 ms)
[ RUN      ] VulkanAPITest.select_3d_width_large
[       OK ] VulkanAPITest.select_3d_width_large (1 ms)
[ RUN      ] VulkanAPITest.select_4d_batch_small
[       OK ] VulkanAPITest.select_4d_batch_small (0 ms)
[ RUN      ] VulkanAPITest.select_4d_batch_medium
[       OK ] VulkanAPITest.select_4d_batch_medium (0 ms)
[ RUN      ] VulkanAPITest.select_4d_batch_large
[       OK ] VulkanAPITest.select_4d_batch_large (1 ms)
[ RUN      ] VulkanAPITest.select_4d_depth_small
[       OK ] VulkanAPITest.select_4d_depth_small (1 ms)
[ RUN      ] VulkanAPITest.select_4d_depth_medium
[       OK ] VulkanAPITest.select_4d_depth_medium (0 ms)
[ RUN      ] VulkanAPITest.select_4d_depth_large
[       OK ] VulkanAPITest.select_4d_depth_large (1 ms)
[ RUN      ] VulkanAPITest.select_4d_height_small
[       OK ] VulkanAPITest.select_4d_height_small (0 ms)
[ RUN      ] VulkanAPITest.select_4d_height_medium
[       OK ] VulkanAPITest.select_4d_height_medium (0 ms)
[ RUN      ] VulkanAPITest.select_4d_height_large
[       OK ] VulkanAPITest.select_4d_height_large (1 ms)
[ RUN      ] VulkanAPITest.select_4d_width_small
[       OK ] VulkanAPITest.select_4d_width_small (0 ms)
[ RUN      ] VulkanAPITest.select_4d_width_medium
[       OK ] VulkanAPITest.select_4d_width_medium (0 ms)
[ RUN      ] VulkanAPITest.select_4d_width_large
[       OK ] VulkanAPITest.select_4d_width_large (1 ms)

[...bunch of other tests...]

[  FAILED  ] 7 tests, listed below:
[  FAILED  ] VulkanAPITest.cat_dim1_singledepth_success
[  FAILED  ] VulkanAPITest.gru_success
[  FAILED  ] VulkanAPITest.gru_mclareninputs_success
[  FAILED  ] VulkanAPITest.gru_prepack_success
[  FAILED  ] VulkanAPITest.lstm_success
[  FAILED  ] VulkanAPITest.lstm_mclareninputs_success
[  FAILED  ] VulkanAPITest.lstm_prepack_success
```

Reviewed By: SS-JIA

Differential Revision: D42623181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96228
Approved by: https://github.com/SS-JIA
2023-03-09 22:39:33 +00:00
6a675f7cac Correctly resolve dispatch keys for PyOperator (#96306)
Previously, we never actually used resolve_key, which meant that
you had to register CPU/CUDA/etc all manually; none of the alias
keys worked.  Now they work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96306
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2023-03-09 22:16:31 +00:00
30c4ea138f Assert that there are no None arguments to backwards (#96300)
This assert would have caught https://github.com/pytorch/pytorch/pull/96219

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96300
Approved by: https://github.com/bdhirsh
2023-03-09 22:14:39 +00:00
bbe1b9bbd4 Fix https://github.com/pytorch/pytorch/issues/96278 (#96299)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96299
Approved by: https://github.com/ngimel
2023-03-09 22:13:52 +00:00
075a49442d [MPS] Allow float16 input to float32 LayerNorm (#96430)
Only for forward pass

Subset of https://github.com/pytorch/pytorch/pull/96208

Create constant with scalar using `input_mps_dtype` and use
`reciprocalWithTensor` instead of `divisionWithPrimaryTensor:1.0
secondaryTensor:`

Fixes https://github.com/pytorch/pytorch/issues/96113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96430
Approved by: https://github.com/kulinseth
2023-03-09 22:09:10 +00:00
457396fcdc [Autograd] expand_as instead of clone to get AccumulateGrad (#96356)
This PR makes a minor change to the multi-grad hook implementation. This should decrease peak memory since we avoid one `clone()` per tensor passed into the multi-grad hook. Let me know if there are technical reasons why we need to clone. If so, is there a way for some use cases to not clone?
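
For reference, the common pattern for reaching a parameter's `AccumulateGrad` node (also used elsewhere, e.g. in DDP) looks roughly like the sketch below; this is an illustration, not the multi-grad hook code itself:
```python
import torch

p = torch.randn(3, requires_grad=True)

# clone() reaches AccumulateGrad but allocates and copies memory:
acc_via_clone = p.clone().grad_fn.next_functions[0][0]

# expand_as(p) reaches the same node without a device-to-device copy:
acc_via_expand = p.expand_as(p).grad_fn.next_functions[0][0]

print(type(acc_via_expand).__name__)  # AccumulateGrad
```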

Before with `clone()`:
![Screenshot 2023-03-08 at 6 08 41 PM](https://user-images.githubusercontent.com/31054793/223873111-ad9105ab-2958-45a1-a2f5-18e9b254c710.png)

After with `expand_as()` -- no more "Memcpy DtoD" kernels:
![Screenshot 2023-03-08 at 6 08 48 PM](https://user-images.githubusercontent.com/31054793/223873104-670b6abc-cd5c-4d1e-a316-cea1bef5832a.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96356
Approved by: https://github.com/soulitzer
2023-03-09 21:58:42 +00:00
cb42bc2cf8 [FSDP] Add unsafe setattr gated by env var (#96326)
This adds the option to use an unsafe `setattr` for `_use_sharded_views()` and `_use_unsharded_views()` gated by the environment variable `FSDP_USE_UNSAFE_SETATTR`, where a value of `1` means to use the unsafe `setattr`. The unsafe option is disabled by default.

The unsafe `setattr` may be able to save CPU overhead and may be used to intentionally bypass `setattr` checks. Both `_use_sharded_views()` and `_use_unsharded_views()` must use the unsafe version or use the safe versions atomically.
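
A hedged usage sketch (only `1` is documented above to enable the unsafe path; how other values are treated is an assumption):

```python
import os

# Opt in to the unsafe setattr path for _use_sharded_views()/_use_unsharded_views();
# set this before FSDP runs. Leaving it unset keeps the safe default.
os.environ["FSDP_USE_UNSAFE_SETATTR"] = "1"
```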
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96326
Approved by: https://github.com/zhaojuanmao, https://github.com/fegin
2023-03-09 21:58:35 +00:00
fe05266fda Revert "[reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)"
This reverts commit deaf9e5e659a1f73656cbbacb39448498e857163.

Reverted https://github.com/pytorch/pytorch/pull/95985 on behalf of https://github.com/huydhn due to Sorry for reverting this. It increased the test time significantly for ASAN (and may be other test shards). ASAN tests on PR passed but it was barely not timing out. I have updated my initial findings in https://github.com/pytorch/pytorch/issues/96378
2023-03-09 01:45:24 +00:00
44d8e6c2aa Retry CI Android emulator test (#96163)
This is not the first time I've spotted Android test flakiness such as
893aa5df3f.  From some StackOverflow results, it looks like the failure `Unknown failure: Error: Could not access the Package Manager.  Is the system running?` could be fixed by waiting a bit for the emulator to start fully: https://stackoverflow.com/questions/15524185/could-not-access-the-package-manager-is-the-system-running-while-installing-and

So, I'm adding retry capability here to give the test another chance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96163
Approved by: https://github.com/ZainRizvi
2023-03-09 00:09:10 +00:00
df0ff34bcb [ONNX] Bump onnx submodule to release 1.13.1 from rc2 (#96325)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96325
Approved by: https://github.com/justinchuby
2023-03-09 00:00:44 +00:00
32ffd70644 Rewrite fallthrough to more closely match how C++ works (#96304)
Fallthrough is modeled as a mask which we use to remove keys from the
compute dispatch key set for eligibility.
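
A toy illustration of the masking idea, using plain ints as stand-ins for the C++ DispatchKeySet bitsets (purely illustrative, not the real implementation):

```python
ALL_KEYS = 0b1111      # pretend this is the computed dispatch key set
FALLTHROUGH = 0b0010   # keys whose registered kernel is a fallthrough
eligible = ALL_KEYS & ~FALLTHROUGH  # fallthrough keys are removed from eligibility
assert eligible == 0b1101
```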

It's possible this addresses https://github.com/pytorch/pytorch/issues/89037
in a better way than https://github.com/pytorch/pytorch/pull/95891 but I
cannot easily tell as the original repro no longer works and the new PR
does not have a test.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96304
Approved by: https://github.com/zou3519, https://github.com/albanD, https://github.com/zhxchen17
2023-03-08 23:00:26 +00:00
67c329bc9b Refactor to reduce duplicate logic in torch._ops (#96302)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96302
Approved by: https://github.com/zou3519
2023-03-08 23:00:26 +00:00
4662ae5b62 Add missing types to inductor IR assert (#96221)
Unclear if there is a more efficient way to define the allowed types for IR (or if we even need this; perhaps we just ditch the assert?).  But Inductor experts can determine whether these added ops are appropriate and, if so, this fixes the reported issue.

Fixes #96204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96221
Approved by: https://github.com/ezyang
2023-03-08 22:55:43 +00:00
038e838e7b Make setup linux action be more friendly with gcp linux runners (#96289)
Fixes issues like the following:
https://github.com/pytorch/pytorch/actions/runs/4362155257/jobs/7627059487 has a more serious core dump failure but the log of curl failures (GCP linux trying to get EC2 specific metadata like EC2 AMI-ID, Instance ID, and Instance Type) confused the HUD.
<img width="848" alt="image" src="https://user-images.githubusercontent.com/109318740/223670567-330521ba-050a-41c3-9efb-fae6ea3398c0.png">
This PR gets rid of those curl failures.

This may have contributed to the impression of "flaky GCP" in #95416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96289
Approved by: https://github.com/huydhn, https://github.com/yanboliang
2023-03-08 22:17:36 +00:00
78e04f8272 Update nvfuser_executor.py (#96218)
In https://github.com/csarofeen/pytorch/pull/2517 the return value of `compute_contiguity` is changed from tuple to list. This PR handles that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96218
Approved by: https://github.com/jjsjann123, https://github.com/davidberard98
2023-03-08 22:07:58 +00:00
7863efbd76 [BE][8/N] Remove ShardedTensor from TP FSDP integration test and other tests depending on Sharded Linear (#96254)
We removed ShardedLinear in https://github.com/pytorch/pytorch/pull/95948, but it broke the TP_FSDP integration test because that test uses ShardedTensor. Migrating to DTensor fixes the test. DTensor shards the bias too, so we need to change the test a little bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96254
Approved by: https://github.com/huydhn
2023-03-08 21:56:41 +00:00
f5c39b7ba2 [inductor] fix typos in test_torchinductor.py (#96233)
Fixes typos in `test_torchinductor.py::test_recompile_on_index_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96233
Approved by: https://github.com/jansel
2023-03-08 21:24:46 +00:00
0f4652f498 [ONNX] Merge 'initializers' into 'TorchScriptGraph' (#95676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95676
Approved by: https://github.com/titaiwangms, https://github.com/wschin
2023-03-08 21:12:20 +00:00
e9e6b3b6c5 [EASY] Add complex dtypes to partitioner (#96297)
Also, delete some redundant dtype setting.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96297
Approved by: https://github.com/Chillee
2023-03-08 21:08:26 +00:00
a7fe11dec0 --subprocess for pytest (#96210)
Implements --subprocess flag for pytest, which previously only worked with unittest

Pretty much all the tests in the custom handler list use --subprocess
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96210
Approved by: https://github.com/huydhn
2023-03-08 21:04:50 +00:00
8921b22297 Set ref for linux_job checkout in lint (#96317)
test-infra's linux_job uses github.ref as the default value for the ref, which is the branch, so it checks out the most recent commit on the branch.
Might be better to fix this on the test-infra side instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96317
Approved by: https://github.com/huydhn
2023-03-08 21:04:30 +00:00
c8216e558b Add basic Module serialization BC test (#96238)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96238
Approved by: https://github.com/ezyang
2023-03-08 21:01:27 +00:00
5bbec680d7 Fix usages of contextmanager without finally (#96170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96170
Approved by: https://github.com/ngimel, https://github.com/malfet
2023-03-08 20:59:27 +00:00
34d18c8bee Remove unimported expecttest deps and usage (#96314)
expecttest is not imported into the OSS BUCK build yet. Using it in the target test_torchgen_executorch breaks the build.

Remove it first to fix the build. Will import and fix in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96314
Approved by: https://github.com/huydhn
2023-03-08 20:54:11 +00:00
0f6d6d6124 [TorchScript] Fix torch.cuda._exchange_device (#95306)
Fixes #95305
I am not sure why these one-line changes fix TorchScript, but it works...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95306
Approved by: https://github.com/ngimel
2023-03-08 20:29:05 +00:00
deaf9e5e65 [reland][inductor] Add an AOT compilation mode for Inductor CPP backend (#95985)
Summary: This is a reland of https://github.com/pytorch/pytorch/pull/94822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95985
Approved by: https://github.com/jansel
2023-03-08 20:02:32 +00:00
b9c25f186c Ignore shape inference exception from Caffe2 ATen fallback (#90408)
Fixes #87318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90408
Approved by: https://github.com/BowenBao
2023-03-08 20:02:11 +00:00
c988de1040 [EASY] Update inductor training dynamic skips (#96298)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96298
Approved by: https://github.com/Chillee, https://github.com/janeyx99
2023-03-08 19:31:46 +00:00
b3a079810e [CI] Add a workflow for quick perf comparison (#96166)
Summary: ciflow/inductor-perf-test-nightly now contains a full dashboard
run, which takes a very long time. Ed proposed a simplification of the
perf run there, but it is still worthwhile to have a set of fast perf tests
which only includes one configuration (--training --amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96166
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
2023-03-08 19:09:04 +00:00
4a1b971748 Move MacOS x86_64 build and test jobs to periodic (#96279)
The correlation result can be found at https://github.com/pytorch/test-infra/pull/3852.  This is the first step toward reducing the redundancy of having both x86_64 and Apple silicon M1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96279
Approved by: https://github.com/ZainRizvi, https://github.com/seemethere, https://github.com/malfet
2023-03-08 18:52:18 +00:00
9137f53ec2 Revert "Error when jit.trace/script is used with torch.compile (#91681)"
This reverts commit fa92b6a7b0e12779baa92d0d11e4161a130fea58.

Reverted https://github.com/pytorch/pytorch/pull/91681 on behalf of https://github.com/izaitsevfb due to Breaks internal tests, see T147501786
2023-03-08 18:47:38 +00:00
7362e22f8b Notify on outdated lintrunner (#96241)
Let users know if they have an outdated version of lintrunner installed on their box

Sets the minimum version to one which uses master as a default mergebase (see https://github.com/pytorch/pytorch/pull/95938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96241
Approved by: https://github.com/huydhn
2023-03-08 18:41:31 +00:00
11aab72dc9 [SDPA] Add an optional scale kwarg (#95259)
# Summary
This PR adds an optional kwarg to torch.nn.functional.scaled_dot_product_attention().
The new kwarg is a scaling factor that is applied after the q@k.T step of the computation. The efficient kernel was updated to support it, and the flash and math backends were minimally updated to support it as well.

Will reduce the complexity of: #94729 and has been asked for by a couple of users.

# Review Highlights
- As far as I know I did this the correct way and it is both BC and FC compliant. However, I always seem to break internal workloads, so I would love it if someone could advise on whether I did this right.
- I named the optional arg 'scale'. This is probably dumb and I should name it 'scale_factor'. I will make this change, but it is annoying and it will require someone deciding that we should rename.
- 'scale' is interpreted as `Q@K.T * (scale)`
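
A hedged usage sketch of the new kwarg (shapes are arbitrary; passing 1/sqrt(E) reproduces the default scaling):

```python
import math
import torch
import torch.nn.functional as F

q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))  # (batch, heads, seq, embed)
out = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(q.size(-1)))
```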

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95259
Approved by: https://github.com/cpuhrsch
2023-03-08 18:07:40 +00:00
3f840cc627 Revert "Ignore shape inference exception from Caffe2 ATen fallback (#90408)"
This reverts commit 1d4e8723705280a82497d366cdf37e6aef49725d.

Reverted https://github.com/pytorch/pytorch/pull/90408 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it breaks lint check https://hud.pytorch.org/pr/90408#11855039599. Please fix the error and reland your change
2023-03-08 17:28:21 +00:00
9c5a24b9df [BE] Delete `pre-cxx-11-abi MacOS libtorch builds (#96301)
Those ABI flags make sense only for Linux; libc++ binaries shipped with macOS have only one ABI flavor.

Moreover, those binaries were uploaded to the same location anyway, see
[upload job for pre-cxx-11 abi](https://github.com/pytorch/pytorch/actions/runs/4362299843/jobs/7628815268#step:7:97) and [upload job for cxx-11 abi](https://github.com/pytorch/pytorch/actions/runs/4362299812/jobs/7628879450#step:7:97)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96301
Approved by: https://github.com/atalman
2023-03-08 17:25:19 +00:00
e7dd9b1138 [Quant][test] Add test for mixed dtypes in the same model (#96104)
Summary: This commit adds a test for mixing multiple dtypes
for different layers in the same model. The test verifies that
FX graph mode quantization converts the dtypes correctly
between the layers.

Test Plan:
python test/test_quantization.py TestQuantizeFx.test_mixed_dtypes

Reviewers: jcaip, vkuzo, supriyar

Subscribers: jcaip, vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96104
Approved by: https://github.com/jcaip
2023-03-08 17:08:12 +00:00
1d4e872370 Ignore shape inference exception from Caffe2 ATen fallback (#90408)
Fixes #87318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90408
Approved by: https://github.com/BowenBao
2023-03-08 16:57:48 +00:00
98ece75043 [aot autograd] merge all outputs of funtionalization analysis into single metadata (#95991)
This makes the next PR in the stack cleaner: having the top level entry point to aot autograd perform the functionalization analysis pass once, and plumb the metadata everywhere else that we need it.

I put it in a separate PR because I recently learned that this function is used in fbcode, so I'll need to fix up internals when I land this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95991
Approved by: https://github.com/ezyang
2023-03-08 16:22:54 +00:00
29b216acd5 aot autograd: handle detach() and no_grad() mutations on input (#95980)
Fixes https://github.com/pytorch/pytorch/issues/95167

More details are in that issue. To summarize, the issue shows up when we have some code like this:

```
def f(x):
    x.detach().mul_(2) # can also happen if the mul_() happens under torch.no_grad()
    return x + 1
```

AOTAutograd will then spit out code like this:
```
def compiled_fn(x):
    x_updated = x.mul(2)
    out = x_updated + 1
    return x_updated, out

def CompiledFunction.forward(x):  # pseudocode, this is part of an autograd.Function
    x_updated, out = compiled_function(x):
    return x_updated, out

def runtime_wrapper(x):
    x_updated, out = CompiledFunction.apply(x)
    x.copy_(x_updated)

x = torch.ones(2, requires_grad=True)
out = runtime_wrapper(x)
```

However, the call to `x.copy_(x_updated)` will fail with the error: `a leaf Variable that requires grad is being used in an in-place operation`. This is because `x` is an autograd leaf, and autograd doesn't allow you to mutate leaves.

In this case though, the data mutation should be entirely opaque to autograd - all mutations happened underneath a `.detach()` or a `torch.no_grad()`.

As Ed pointed out in the issue, we can detect this situation by checking if the mutated input is an autograd leaf. If it is, then it must have been the case that any mutations on it must have been hidden from autograd, since otherwise the eager code would have error'd. The solution I added is to detect this situation, and manually run `x.detach().copy_(x_updated)`, to hide the update from autograd.
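
A minimal sketch of the runtime check described above (not AOTAutograd's actual code; names are illustrative):

```python
def apply_input_mutation(x, x_updated):
    # If the mutated input is an autograd leaf that requires grad, the mutation must
    # have been hidden from autograd in the user code (via .detach() or no_grad()),
    # so hide the copy_ from autograd as well.
    if x.requires_grad and x.is_leaf:
        x.detach().copy_(x_updated)
    else:
        x.copy_(x_updated)
```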

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95980
Approved by: https://github.com/ezyang
2023-03-08 16:11:06 +00:00
bb650b34c4 [inductor] do not handle int in placeholder (#96230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96230
Approved by: https://github.com/ezyang
2023-03-08 13:50:40 +00:00
f96bd52841 aot autograd: dont allow symint outputs to get tangents in the bw graph (#96219)
Previously, if dynamic shapes were turned on and we had a forward graph that returns a symint, then we would generate a backward graph that takes in a tangent input for that symint fwd output. This causes problems for downstream - inductor will see an input that it expects to be a symint, but it gets a `None` from autograd.

Confirmed that this repro now passes:
```
benchmarks/dynamo/torchbench.py --devices cuda --inductor --dynamic-shapes --unspecialize-int --accuracy --training --only drq
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96219
Approved by: https://github.com/ezyang
2023-03-08 13:02:34 +00:00
6bbae86253 Revert "Fix hooks handling for unpickled nnmodule (#96224)"
This reverts commit 8ca264ef3666ce865520f9877b2980ec109d95da.

Reverted https://github.com/pytorch/pytorch/pull/96224 on behalf of https://github.com/ezyang due to inductor regression
2023-03-08 13:01:16 +00:00
a1d7014c0f Hooking backward for QNNPACK (#94432)
Summary: Enabling quantized gradient.

Test Plan:
Algorithmic correctness - Dequantized matmul vs QNNPACK matmul for gradient - P616202766

```
dequantized matmul : [1.5463, -0.2917, -2.1735, 0.5689, -1.0795]
QNNPACK matmul : tensor([[ 1.5463, -0.2917, -2.1735,  0.5689, -1.0795]])
```

Differential Revision: D42593235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94432
Approved by: https://github.com/malfet, https://github.com/kimishpatel
2023-03-08 10:21:32 +00:00
92edac72aa [FSDP][optim_state_dict] Fix a memory leakage in optim_state_dict (#96263)
Summary: The original code uses a class variable to store flat_parameter result. This could cause memory leakage.

Test Plan: CI and a E2E run

Reviewed By: awgu

Differential Revision: D43893577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96263
Approved by: https://github.com/zhaojuanmao
2023-03-08 08:43:42 +00:00
2bb022e902 [MPS] Adding xfaillist with all categories of failures. (#96176)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96176
Approved by: https://github.com/malfet
2023-03-08 08:41:21 +00:00
b90a9c7db2 [static-runtime] fix one forwarding usage (#96271)
Summary: as titled

Test Plan: ci

Differential Revision: D43897369

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96271
Approved by: https://github.com/davidberard98
2023-03-08 07:38:21 +00:00
3ce1e15cf7 Revert "[Dynamo] Support torch.{cuda/cpu}.amp.autocast (#95416)"
This reverts commit c88aa336aa0734f42b4d9db7f624d6cfd9b5065e.

Reverted https://github.com/pytorch/pytorch/pull/95416 on behalf of https://github.com/huydhn due to Sorry for reverting your PR. But it seems that the smoke test issue is related as it starts to fail consistently in trunk https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=50&name_filter=inductor_torchbench_smoketest_perf
2023-03-08 06:51:57 +00:00
941ff109d3 dl_open_guard should restore flag even after exception (#96231)
I.e. follow pattern outlined in https://docs.python.org/3.8/library/contextlib.html#contextlib.contextmanager

Also, return early on non-unix platforms (when `sys.getdlopenflags` is not defined)
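
A minimal sketch of the try/finally pattern, not torch's exact `dl_open_guard` implementation:

```python
import contextlib
import os
import sys

@contextlib.contextmanager
def dl_open_guard_sketch():
    if not hasattr(sys, "getdlopenflags"):
        # Non-unix platforms: nothing to restore, return early.
        yield
        return
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(old_flags | os.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restored even if the body raises, unlike the previous implementation.
        sys.setdlopenflags(old_flags)
```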

Fixes https://github.com/pytorch/pytorch/issues/96159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96231
Approved by: https://github.com/atalman
2023-03-08 06:01:27 +00:00
8ca264ef36 Fix hooks handling for unpickled nnmodule (#96224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96224
Approved by: https://github.com/albanD
2023-03-08 05:33:15 +00:00
08fb13db65 [Quant] Add lowering for pixel_unshuffle/narrow (#96160)
Summary:
## Summary
torch.nn.functional.pixel_unshuffle and torch.narrow accept both float
and quantized inputs. However, previously we would unnecessarily
dequantize quantized inputs into floats before passing them to
the function. This commit fixes this by lowering the patterns
[dequant - pixel_unshuffle - quant] and
[dequant - narrow - quant].

Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_pixel_unshuffle
```

```
python test/test_quantization.py TestQuantizeFxOps.test_narrow
```

Differential Revision: D43858199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96160
Approved by: https://github.com/andrewor14
2023-03-08 05:25:03 +00:00
9e3f173636 [1/n] Add verifier for EXIR Aten dialect (#94783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94783
Approved by: https://github.com/zhxchen17
2023-03-08 04:55:54 +00:00
3a4275278b Use GH cache for sccache on GH mac runners (#96142)
sccache added GH cache as a storage option, so try to use it for the GH provided mac runners.

My experiments with this are varied.  I tried a couple of different releases and the first run with a cold cache took 1hr (v0.3.3), 1hr (v0.4.0 pre7), 2hr (v0.3.3).

Afterwards it usually takes 30 minutes, sometimes longer, but no more than 1hr.

I am using v0.4.0 pre7 because they reduced the amount of configuration/env vars you need to set and the GH cache keys get managed by sccache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96142
Approved by: https://github.com/huydhn, https://github.com/malfet
2023-03-08 04:18:54 +00:00
d7db5b05b4 Context manager to push/pop frame summaries (#96054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96054
Approved by: https://github.com/avikchaudhuri, https://github.com/ezyang
2023-03-08 04:01:49 +00:00
bb8645acda [vision hash update] update the pinned vision hash (#96243)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96243
Approved by: https://github.com/pytorchbot
2023-03-08 03:57:12 +00:00
664381b293 [CI] Avoid calling torch.use_deterministic_algorithms for some models (#96245)
tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96245
Approved by: https://github.com/davidberard98
2023-03-08 03:35:32 +00:00
93ff71ec37 [ET] Add RuntimeContext to ET Aten mode (#96084)
Summary:
In ATen mode, we add the RuntimeContext arg, so we have something like
```
TORCH_API inline at::Tensor & gelu_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, c10::string_view approximate, at::Tensor & out) {
    return at::gelu_outf(self, approximate, out);
}
```
and user can use `<namespace like aten>::gelu_outf` and we will automatically dispatch the registered function in aten kernel using `at::gelu_outf` (dispatched by ATen/Functions.h header)

In optimized kernel tests, we can now automatically handle between aten kernel and optimized kernel.

The implication is that the test must depend on the correctness of codegen; an error in codegen can break the kernel tests.

Test Plan: CI

Differential Revision: D43777848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96084
Approved by: https://github.com/larryliu0820
2023-03-08 02:51:47 +00:00
c88aa336aa [Dynamo] Support torch.{cuda/cpu}.amp.autocast (#95416)
For Meta internal use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95416
Approved by: https://github.com/jansel
2023-03-08 01:40:27 +00:00
b8f7bd593c [Dynamo] Guard name should be valid Python identifier (#96174)
Fixes #96149

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96174
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-08 01:33:29 +00:00
738cc5e644 Fix validate_input_col for nn.Module or Callable (#96213)
Forward fix the problem introduced in https://github.com/pytorch/pytorch/pull/95067

Not all `Callable` objects have `__name__` implemented. Using `repr` as the backup solution to get function name or reference.
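
A minimal sketch of the fallback (helper name is illustrative):

```python
def callable_name(fn):
    # Some callables (e.g. functools.partial objects, nn.Module instances) have no
    # __name__, so fall back to repr() for a usable reference in error messages.
    return getattr(fn, "__name__", repr(fn))
```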
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96213
Approved by: https://github.com/NivekT
2023-03-08 01:30:17 +00:00
fdd7e76b95 [PyTorch][easy] Don't call IValue::type twice in Pickler::endTypeTag (#96214)
The duplicate call is unlikely to be eliminated by the compiler (it can return a new heap-allocated object).

Differential Revision: [D43877846](https://our.internmc.facebook.com/intern/diff/D43877846/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96214
Approved by: https://github.com/zhxchen17
2023-03-08 01:29:21 +00:00
3623cfb790 [FSDP] Speed up first iter order check (part 2) (#96220)
For a tensor on GPU, moving it to CPU once and operating on it on CPU is faster than moving it element by element from GPU to CPU. This is a follow-up to also move `world_indices`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96220
Approved by: https://github.com/zhaojuanmao
2023-03-08 01:08:54 +00:00
7324aef9a8 Add torch.empty_like() to documented list of supported nested tensor ops (#96211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96211
Approved by: https://github.com/drisspg
2023-03-07 23:33:34 +00:00
b0b5f3c6c6 Fix gumbel cdf (#91698)
Fix `Gumbel.cdf` function.

**Description**
Handle the case where the transformed parameter is outside of the support of the underlying Uniform distribution. This makes the behavior of `Gumbel.cdf` consistent with other `TransformedDistribution`s that pass the value of validate_args to the base distribution.

**Issue**
Running `Gumbel(0.0,1.0,validate_args=False).cdf(20.0)` would raise a `ValueError` from `_validate_sample`.

**Testing**
A test was added to `test_distributions.py` to check that `Gumbel(0.0,1.0,validate_args=False).cdf(20.0)` successfully returns `1.0`.
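
Example from the description above:

```python
import torch
from torch.distributions import Gumbel

# With validate_args=False this now returns ~1.0 instead of raising a
# ValueError from _validate_sample.
print(Gumbel(0.0, 1.0, validate_args=False).cdf(torch.tensor(20.0)))
```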

This is a second attempt to push these changes, after https://github.com/pytorch/pytorch/pull/82488.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91698
Approved by: https://github.com/fritzo, https://github.com/zou3519
2023-03-07 23:04:47 +00:00
203890e1e0 Properly show buck target to run (#96089)
Summary: Makes the debug dir location configurable with TORCH_COMPILE_DEBUG_DIR env var

Test Plan: TORCH_COMPILE_DEBUG_DIR=”.” buck2 run mode/dev-nosan //caffe2/test/inductor:minifier_smoke

Reviewed By: bertmaher

Differential Revision: D43639955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96089
Approved by: https://github.com/bertmaher
2023-03-07 22:52:27 +00:00
79d49c60c1 [ONNX] Fix expand_as (#95962)
Fixes [#ISSUE_NUMBER](https://github.com/pytorch/pytorch/issues/95961)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95962
Approved by: https://github.com/BowenBao, https://github.com/justinchuby
2023-03-07 22:11:50 +00:00
bdb076ab43 [ONNX] Skip doctest torch.onnx._internal.fx if ImportError (#95686)
Need to use `exclude` to skip the module altogether, because xdoctest
triggers `ImportError` when trying to import the module, so the whole
test fails regardless of whether a skip was added in the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95686
Approved by: https://github.com/kit1980, https://github.com/titaiwangms
2023-03-07 22:05:27 +00:00
82dba844bb [ONNX] Move symbolic export to separate file (#95650)
Move things around in the effort of preparing to refactor
the code structure.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95650
Approved by: https://github.com/titaiwangms
2023-03-07 22:05:27 +00:00
d06729746c [RFC] Add _abort method to ProcessGroupNCCL (#96017)
**Summary:**

Currently the only way to destroy a process group is calling `dist.destroy_process_group`. However, this API does not guarantee destruction of the ProcessGroup object since it only deletes references inside `distributed_c10d.py`. In cases where the process group is used in multiple places it is not feasible to hunt down all the references and delete them.

In particular, for NCCL, if a collective gets stuck the only way to recover is to call ncclCommAbort on all the communicators. Currently there is no API to achieve this.

To address this, this PR adds an `_abort` method to ProcessGroupNCCL, giving us a guaranteed way to kill all NCCL communicators associated with a ProcessGroup.
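
A hedged usage sketch (how to reach the ProcessGroupNCCL object and the exact call shape of `_abort` are assumptions based on the description above):

```python
import torch.distributed as dist

dist.init_process_group(backend="nccl")
pg = dist.distributed_c10d._get_default_group()
# Abort all NCCL communicators owned by this process group, e.g. from a
# watchdog when a collective is stuck (assumed call shape).
pg._abort()
```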

**Test Plan:**

Added a unit test to validate this works as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96017
Approved by: https://github.com/wanchaol
2023-03-07 21:37:38 +00:00
d6d8d3484e _memory_viz.py: Visualize how blocks fit into segments. (#91336)
Add a segment_plot command that visualizes how blocks are allocated into segments.
This is similar to the 'stats' command but produces an interactive html viewer rather
than text dump, allowing exploration of stack traces.

It also adds the ability to see the layout at any point in the trace by starting from the
snapshot and then applying the events backwards to reconstruct what memory would have looked like.

Example:
![Screen Shot 2022-12-22 at 3 32 49 PM](https://user-images.githubusercontent.com/370202/209242650-b952372e-37ac-400a-a01c-13be2b5426fa.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91336
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
71f369092d Revert "Revert "memory viz: Add colors for categories and a legend (#90587)"" (#96133)
This reverts commit b38b39c441f12be90fd5d7eafe74246d050665c8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96133
Approved by: https://github.com/bhosmer
2023-03-07 21:07:18 +00:00
32eb3ab7e8 [FSDP] Speed up first iter order check (#96146)
For a tensor on GPU, moving it to CPU once and operating on it on CPU is faster than moving it element by element from GPU to CPU. The relevant tensor in this case is `world_num_valid_indices`.

This closes https://github.com/pytorch/pytorch/issues/95728.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96146
Approved by: https://github.com/zhaojuanmao, https://github.com/rohan-varma
2023-03-07 20:58:56 +00:00
3d5eba811a Add shape function for stack op (#92205)
As @ramiro050 requested in https://github.com/llvm/torch-mlir/pull/1747, this PR moved the shape code for stack op from torch-mlir to pytorch upstream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92205
Approved by: https://github.com/eellison
2023-03-07 20:45:56 +00:00
5e73cc9310 Update lintrunner to version that uses master as default mergebase (#95938)
Fixes https://github.com/pytorch/pytorch/issues/93156

Upgrades lintrunner to the latest version, which can use the `master` branch as the merge base by default (provided it's specified in `.lintrunner.toml`), and updates `.lintrunner.toml` accordingly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95938
Approved by: https://github.com/huydhn
2023-03-07 20:25:02 +00:00
a5aceba61f [static-runtime] a pass through checks throwing exceptions (#95983)
Summary: increasing verbosity where possible

Test Plan: CI

Differential Revision: D43761268

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95983
Approved by: https://github.com/davidberard98
2023-03-07 19:16:27 +00:00
576762d9d2 Clean up duplicate line in logit sample inputs (#95163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95163
Approved by: https://github.com/ngimel
2023-03-07 18:57:40 +00:00
eea0733045 Reduce pytest blocklist (#96016)
`TestCase = object` or variations of it get switched to `TestCase = NoTest`.

unittest collects tests based on subclassing unittest.TestCase, so setting TestCase = object removes them from unittest test collection.  pytest collects based on name (https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_classes) but can be told to ignore a class (bottom of https://docs.pytest.org/en/7.1.x/example/pythoncollection.html#changing-naming-conventions).
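
A sketch of the pattern (the import location of `NoTest` is an assumption):

```python
import torch
from torch.testing._internal.common_utils import NoTest, TestCase

if not torch.cuda.is_available():
    # Rebinding TestCase removes these tests from both unittest and pytest collection.
    TestCase = NoTest  # noqa: F811
```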
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96016
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-07 18:30:27 +00:00
30237e7aec Provide more informative kernel names in Inductor (#95940)
Before: `triton_fused_add_83_add_84_relu_13_squeeze_46_var_mean_15_14`
After: `triton_fused__native_batch_norm_legit_functional_convolution_relu_14`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95940
Approved by: https://github.com/SherlockNoMad, https://github.com/ngimel, https://github.com/jansel
2023-03-07 18:02:10 +00:00
c74f09403b [inductor] make philox_rand_like work with dynamic shapes (#95461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95461
Approved by: https://github.com/ezyang
2023-03-07 18:01:50 +00:00
02a18b1a97 Properly avoid wrapping numbers as tensors before backend (#96193)
This partially reverts https://github.com/pytorch/pytorch/pull/96051 with a proper fix.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96193
Approved by: https://github.com/jansel
2023-03-07 17:57:47 +00:00
2f66b57a7a [MPS] Fix in-place add and sub with alpha == 0.0 (#96184)
Apart from fixing the below issue, this PR integrates the test for `sub` into the test for `add` as they are implemented using the same template.

Fixes #96065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96184
Approved by: https://github.com/kulinseth
2023-03-07 17:17:53 +00:00
d4f5f9fdb4 Profile dynamo guards (#96119)
Adds profiler start and end callbacks to dynamo's C eval_frame impl, which can be used to profile a region, providing a name for visualization.  Currently only one usage is hooked up, to profile the cache lookup (primarily covering guards and the linear search through the linked list).

Example profile taken from toy model:
`python benchmarks/dynamo/distributed.py --toy_model --profile --dynamo aot_eager`
<img width="1342" alt="image" src="https://user-images.githubusercontent.com/4984825/223225931-b2f6c5a7-505a-4c90-9a03-34982f6dc033.png">

Planning to measure overhead in CI, and probably can't afford to check this in enabled by default.  Will have to evaluate UX options such as `config.profile_dynamo_cache = True` or some other way.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96119
Approved by: https://github.com/jansel
2023-03-07 16:12:22 +00:00
d0641ed247 [TEST] Turn on unspecialize int dynamic training inductor CI (#96058)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96058
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-07 16:08:45 +00:00
9a575e77ca inductor: align baddbmm behavior with eager mode for beta=0 and input has nan value (#96087)
For ```torch.baddbmm(input, mat1, mat2, beta=0)```, if ```beta``` is zero, the multiplication ```input*beta``` is ignored in eager mode (it always yields zero; see https://pytorch.org/docs/stable/generated/torch.baddbmm.html?highlight=torch+baddbmm#torch.baddbmm), but inductor does not do this, so inductor will get a different value if the input has a ```nan``` or ```inf``` value:

```
import torch
import torch._dynamo

def fn_test(input, mat1, mat2):
    return torch.baddbmm(input, mat1, mat2, beta=0.0)

opt_fn = torch._dynamo.optimize("inductor")(fn_test)
a, b, c = [torch.rand((3, 2, 2)) for _ in range(3)]

real_out = fn_test(a, b, c)
a[:] = torch.nan
compiled_out = opt_fn(a, b, c)

print(compiled_out)
print(real_out)

```
before this PR, the output will be like this:

```
tensor([[[0.4272, 0.6037],
         [0.4279, 0.4219]],

        [[0.0838, 0.4873],
         [0.1210, 0.5516]],

        [[   nan,    nan],
         [   nan,    nan]]])
tensor([[[0.4272, 0.6037],
         [0.4279, 0.4219]],

        [[0.0838, 0.4873],
         [0.1210, 0.5516]],

        [[0.4985, 0.1072],
         [0.0857, 0.0186]]])

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96087
Approved by: https://github.com/jansel, https://github.com/ngimel, https://github.com/jgong5
2023-03-07 14:59:21 +00:00
ac77883e48 fix issue of baddbmm when out has nan value for beta=0 (#96086)
Fix https://github.com/pytorch/pytorch/issues/96037.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96086
Approved by: https://github.com/ngimel, https://github.com/lezcano
2023-03-07 14:54:05 +00:00
cyy
666efd8d5d Improve ASAN and TSAN handling in cmake (#93147)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93147
Approved by: https://github.com/malfet
2023-03-07 14:10:13 +00:00
8dbc549517 Correctly pass $@ to the runner benchmark script (#96190)
Alternative to https://github.com/pytorch/pytorch/pull/96168

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96190
Approved by: https://github.com/desertfire
2023-03-07 13:49:57 +00:00
40ca20bb7b [Easy] Fix typo "steams" -> "streams" (#95706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95706
Approved by: https://github.com/Skylion007, https://github.com/H-Huang
2023-03-07 13:38:03 +00:00
803e30441f [FSDP][Docs] Per-device NCCL stream is per PG (#95705)
71ad1005f6/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L647-L649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95705
Approved by: https://github.com/fegin
2023-03-07 13:38:03 +00:00
98a4d74a68 COO intersection primitives: performance improvement (#96094)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96094
Approved by: https://github.com/pearu
2023-03-07 13:21:29 +00:00
98ff841a75 Use maxint to bound integers. (#96121)
We don't actually support arbitrary precision integers.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96121
Approved by: https://github.com/tugsbayasgalan, https://github.com/lezcano
2023-03-07 12:46:19 +00:00
a6e3e7905e Turn on unspecialize int dynamic inductor CI (#96034)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96034
Approved by: https://github.com/voznesenskym
2023-03-07 12:39:55 +00:00
3326c14e86 Add a sample for index_fill to test framework (#91534)
Currently the index_fill test doesn't include a sample with tensor `value` input.

This PR adds one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91534
Approved by: https://github.com/ngimel
2023-03-07 08:36:04 +00:00
12ab4f08b7 [Dynamo] No graph break on namedtuple and potential other functions (#96122)
```collections.namedtuple``` caused 40+ ```dynamo.export``` test failures in the 14k GitHub models suite.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96122
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-03-07 08:00:21 +00:00
8ca3c881db Dynamo.export to preserve names of args & kwargs (#95851)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95851
Approved by: https://github.com/jansel
2023-03-07 05:07:08 +00:00
e38f48ff11 Refactor unittest around dynamo.export wrt function signature (#95850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95850
Approved by: https://github.com/jansel
2023-03-07 05:07:08 +00:00
c596504292 Type annotate dynamo.export (#95742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95742
Approved by: https://github.com/jansel
2023-03-07 05:07:08 +00:00
18e8aa95f1 Restore _graph_executor_optimize flag after JIT test_profiler (#96135)
Fixes https://github.com/pytorch/pytorch/issues/91483

Using a separate test class here, so that there is no need to run setup and teardown for all tests in TestJit.  The root cause here is that test_profiler could be flaky and fail in the middle without the chance to restore `torch._C._set_graph_executor_optimize` to its original value (https://github.com/pytorch/pytorch/issues/81626). This causes issues for all tests running afterwards, as shown in https://github.com/pytorch/pytorch/issues/91483.

I suspect that is also the root cause for several other flaky tests in the same file https://github.com/search?q=repo%3Apytorch%2Fpytorch+DISABLED+test_jit.TestScript&type=issues.  After this fix is merged, I would let the retry bot do its job and close these issues after 2 weeks.

### Testing
The issue https://github.com/pytorch/pytorch/issues/91483 can now be reproduced by adding `torch._C._set_graph_executor_optimize(False)` locally to see if the test fails:

```
diff --git a/test/test_jit.py b/test/test_jit.py
index 2d1161d7466..17745d39182 100644
--- a/test/test_jit.py
+++ b/test/test_jit.py
@@ -5413,6 +5413,8 @@ a")
             FileCheck().check("int =").check("ListConstruct").check("aten::cat").run(str(g))

     def test_stack(self):
+        torch._C._set_graph_executor_optimize(False)
+
         with enable_profiling_mode_for_profiling_tests():
             @torch.jit.script
             def func(x):
```

It indeed fails:

```
======================================================================
FAIL [0.006s]: test_stack (test_jit.TestScript)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_jit.py", line 5437, in test_stack
    self.assertAutodiffNode(func2.graph_for(x, y), True, ['aten::stack'], [])
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_jit.py", line 282, in assertAutodiffNode
    self.assertEqual(should_autodiff_node,
##[endgroup]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 2975, in assertEqual
    raise error_metas[0].to_error(
AssertionError: Booleans mismatch: True is not False

Failure in testing nodes' autodifferentiation. One or more nodes were expected to be autodiffed, but were not found in specified fusible/nonfusible DifferentiableGraph groups.
Specifically:
  ['aten::stack'] were not in one of the DifferentiableGraphs when they were expected to be. Did you intend for these nodes to be autodiffed? If not, remove them from the list of nonfusible nodes.

----------------------------------------------------------------------
Ran 2677 tests in 84.596s

FAILED (failures=1, skipped=136, expected failures=13)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96135
Approved by: https://github.com/clee2000
2023-03-07 04:21:19 +00:00
769cc8a614 [MPS] Add type promotion to torch.addcmul (#96164)
Fixes crash while running something like `python -c "import torch;x=torch.rand(3, 3, dtype=torch.float16, device='mps');y=x.addcmul(torch.ones(3, device='mps'), torch.ones(3, device='mps'));print(y)"`

Modify `castMPSTensor` to become a no-op if cast is not needed

Define `common_dtype` via `c10::promoteTypes` between self, tensor1 and
tensor2. Cast to any output type.

Add mixed-types test to `TestMPS.test_addcmul`, though it does not cover
all the permutations

Discovered while looking at https://github.com/pytorch/pytorch/issues/96113

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96164
Approved by: https://github.com/kulinseth
2023-03-07 04:19:30 +00:00
7038458c5b Add Generator register for the privateuse1 backend (#93920)
Fixes #92202
Add Generator registration for the `privateuseone` backend.

module: backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93920
Approved by: https://github.com/bdhirsh
2023-03-07 03:43:23 +00:00
e9ca902894 fix typo under aten/src/ATen/cudnn (#96063)
This PR fixes typos in comments under the `aten/src/ATen/cudnn` directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96063
Approved by: https://github.com/ngimel
2023-03-07 03:17:29 +00:00
19ee61f7fa Upload torch dynamo performance stats to S3 before Rockset (#96165)
I have a minor tweak on the uploading workflow to upload to S3 first before Rockset, as `upload-test-stats` and `upload-torch-dynamo-perf-stats` both run when inductor-A100-perf finishes.  There is a potential race condition where the test reports are not yet on S3 when `upload-torch-dynamo-perf-stats` runs (it gets the data from S3).  `inductor-A100-perf` is now handled exclusively by `upload-torch-dynamo-perf-stats` to avoid this.  It will upload test reports to S3 first before sending them to Rockset.

The uploading script works fine with the test reports from https://hud.pytorch.org/pr/95685.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96165
Approved by: https://github.com/desertfire
2023-03-07 03:02:59 +00:00
2973994259 fix typo in comments under torch/csrc/distributed (#96062)
This PR fixes typos in comments and messages of `.cpp` and `.hpp` files under `torch/csrc/distributed` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96062
Approved by: https://github.com/ngimel
2023-03-07 02:56:41 +00:00
fe4fec37a4 [inductor] Refactor IR printing (#96024)
Reland #95567 part 2.  The previous version of this had a bug that the
added test triggers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96024
Approved by: https://github.com/ngimel
2023-03-07 02:23:06 +00:00
4ab4d88147 Stop printing data dependent variable stacks by default (#96120)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96120
Approved by: https://github.com/tugsbayasgalan
2023-03-07 02:14:22 +00:00
1cd0929bf7 [BC] Allow only bool tensors as mask in masked_select (#96112)
`byte` support was marked as deprecated in 1.8, so it's fine to remove this in 2.1 (or even 2.0)
Deprecation warning was added by https://github.com/pytorch/pytorch/pull/22261

Also, fix a bunch of syntactic errors in comments.
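
Usage note for the change above, as a small example:

```python
import torch

x = torch.arange(4)
mask = torch.tensor([True, False, True, False])
print(x.masked_select(mask))             # tensor([0, 2])
# x.masked_select(mask.to(torch.uint8))  # byte masks, deprecated since 1.8, now raise an error
```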

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96112
Approved by: https://github.com/ezyang
2023-03-07 01:43:14 +00:00
e70ea8d58d enable taskset core pinning in addition to numactl (#96011)
- port https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/pull/740 to `run_cpu`
- use-case from https://github.com/pytorch/serve/pull/2166 where `numactl` is unavailable (e.g., it requires `privileged` mode)

This PR automatically tries taskset if numactl core binding doesn't work.

Reference:
`taskset` is added to adapt to launcher use-cases such as in docker, where `numactl` needs to be run in `privileged` mode, and `privileged` mode "wont work for deployments like sagemaker for example" as raised by TorchServe. Please see [torchserve ipex docker discussion](https://github.com/pytorch/serve/pull/1401#issuecomment-1090817704) for reference. To address such use-cases, `taskset` can be used in place of `numactl` to set core affinity. Note that, unlike `numactl`, `taskset` does not provide memory binding to local memories; however, memory binding may not be needed in these use-cases, which typically do not span multiple sockets. Hence we automatically try taskset if numactl doesn't work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96011
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-03-07 01:19:46 +00:00
e168dbb90a [inductor] improve cpp vec implementation of square (#96072)
For cpp vectorization of `square`, the current implementation is not efficient. This also affects the performance of `batch normalization`, as it uses `square` when calculating variance. This PR replaces the `power` operation with a `multiplication` to gain more performance.

Micro-benchmark performance for eager v.s. inductor:
op=`aten.native_batch_norm.default`

suite | improvement_0.2 | improvement_0.5 | improvement_0.8 | current_speedup_0.2 | new_speedup_0.2 | current_speedup_0.5 | new_speedup_0.5 | current_speedup_0.8 | new_speedup_0.8
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
torchbench | 8.82% | 5.53% | 32.19% | 0.608006834 | 0.661613139 | 0.691743711 | 0.729987622 | 0.76176223 | 1.00694842
timm | 59.30% | 63.01% | 94.77% | 0.650648524 | 1.036498047 | 0.676425152 | 1.102667387 | 0.695693384 | 1.354992423

Model training performance for eager v.s. inductor:

model | improvement | current_speedup | new_speedup
-- | -- | -- | --
lcnet_050 multi-thread | 5.16% | 1.046 | 1.1
lcnet_050 single-thread | 21.81% | 0.94 | 1.145
mobilenet_v2 multi-thread | 3.88% | 1.135 | 1.179
mobilenet_v2 single-thread | 37.46% | 0.929 | 1.277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96072
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2023-03-07 01:13:39 +00:00
bf01caf27b Fix broken contribution stats upload job (#96003)
Actually allows uploads to S3 to work properly for external contribution stats.
Test: https://github.com/pytorch/pytorch/actions/runs/4347343883/jobs/7594534296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96003
Approved by: https://github.com/huydhn
2023-03-07 01:01:02 +00:00
e6b361bd47 Refactor dynamo benchmark test script to reduce duplication (#96096)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96096
Approved by: https://github.com/desertfire
2023-03-07 00:41:55 +00:00
969586c373 Handle GitHub request failure when filter test config (#96145)
This is to make the script more resilient to GitHub network flakiness, i.e. https://github.com/pytorch/pytorch/actions/runs/4347281268/jobs/7594804454.  When the script couldn't get the list of labels from the PR to do filtering, it should just continue normally and run everything by default.

I also refactor the code a bit to re-use the existing `download_json` function which support retries.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96145
Approved by: https://github.com/clee2000, https://github.com/malfet
2023-03-07 00:40:12 +00:00
847d6520ed Don't guard on the exact int value on conversion to bool (#96008)
Fixes https://github.com/pytorch/pytorch/issues/95981

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96008
Approved by: https://github.com/ngimel
2023-03-07 00:40:06 +00:00
680214ac11 SymIntify a few more relatively non-controversial schemas (#96100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96100
Approved by: https://github.com/Skylion007
2023-03-06 23:12:40 +00:00
95d17dc93d [inductor] Reland #95567 part 1 (#96023)
This is the non-problematic part of #95567.  The errors were coming from
IR printing changes which will be next in the stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96023
Approved by: https://github.com/ngimel, https://github.com/mlazos
2023-03-06 22:57:22 +00:00
39e8311a29 unwrap sizevars passed as args (#96051)
Undoes int arg wrapping that dynamo does
Generated line:
```
s1 = arg2_1.item() if isinstance(arg2_1, torch.Tensor) else arg2_1
```
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96051
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-03-06 22:45:47 +00:00
8c8148c887 Revert D43643526: Multisect successfully blamed D43643526 for test or build failures (#96126)
Summary:
This diff is reverting D43643526
Depends on D43693521
D43643526: Avoid copies in matmul (#76828) by generatedunixname499836121 has been identified to be causing the following test or build failures:

Tests affected:
- [mle/favour:tests - favour_test.py::TestLinears::test_psd](https://www.internalfb.com/intern/test/562950027104300/)

Here's the Multisect link:
https://www.internalfb.com/intern/testinfra/multisect/1611690
Here are the tasks that are relevant to this breakage:
T146911536: 5 tests started failing for oncall prob in the last 2 weeks
We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

Test Plan: NA

Differential Revision: D43693526

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96126
Approved by: https://github.com/weiwangmeta
2023-03-06 22:30:07 +00:00
962b3f78bd [inductor] run all kernel benchmarks individually in a compiled module (#95845)
This is a follow up for PR #95506 to run all the triton kernels in a compiled module individually as suggested by Horace.

Here are the steps:
1. Run the model as usual with a benchmark script and with TORCHINDUCTOR_BENCHMARK_KERNEL enabled. e.g.
```
TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --dashboard --only resnet18 --disable-cudagraphs --training
```
2. From the output we will see 3 lines like
```
Compiled module path: /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py
```
That's because we have one graph module for fwd/bwd/optimizer respectively. Each graph module will have one such output corresponding to the compiled module.

3. We can run the compiled module directly. Without any extra arguments, we just maintain the previous behavior of running the call function -- which does what the original graph module does, but in a more efficient way. If we add the '-k' argument, we instead run a benchmark for each individual kernel in the file.

```
python /tmp/torchinductor_shunting/rs/crsuc6zrt3y6lktz33jjqgpkuahya56xj6sentyiz7iv4pjud43j.py -k
```

Example output:
<img width="430" alt="Screenshot 2023-03-01 at 4 51 06 PM" src="https://user-images.githubusercontent.com/52589240/222302996-814a85be-472b-463c-9e85-39d2c9d20e1a.png">

Note: I use the first 10 characters of the hash to identify each kernel since
1. hash is easier to get in the code :)
2. name like `triton__3` only makes sense within a compiled module, but a hash can make sense even without specifying the compiled module (assuming we have enough bytes for the hash)

If we find a triton kernel with a hash like c226iuf2wi that has poor performance, we can look it up in the original compiled module file. This works since we annotate each compiled triton kernel with its full hash in a comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95845
Approved by: https://github.com/Chillee
2023-03-06 21:30:33 +00:00
6c9a51cdc9 Install NVIDIA driver in bazel workflow (#96020)
This is to address the missing NVIDIA driver in the new Bazel GPU build and test workflow as documented in https://github.com/pytorch/pytorch/issues/79851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96020
Approved by: https://github.com/vors, https://github.com/malfet, https://github.com/ZainRizvi, https://github.com/seemethere
2023-03-06 21:21:24 +00:00
893aa5df3f Promote "Skipping frame" message to INFO log level (#95968)
Without this, when you skip frame because of a graph break, at INFO logging level all you see is:

```
[INFO] Step 1: torchdynamo start tracing run_n_iterations
[INFO] Step 1: torchdynamo start tracing forward_pass
```

With this promotion, you now see:

```
[INFO] Step 1: torchdynamo start tracing run_n_iterations
[INFO] Skipping frame because there is a graph break in a for/while loop
[INFO] Step 1: torchdynamo start tracing forward_pass
```

This is MUCH more useful, while only adding a single log message per
already existing INFO log message.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95968
Approved by: https://github.com/albanD, https://github.com/janeyx99
2023-03-06 20:16:42 +00:00
28aa2efd14 [7/N][BE] Remove Partial Tensor and its dependency (#95949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95949
Approved by: https://github.com/wanchaol
2023-03-06 19:57:46 +00:00
6dddc0d689 [6/N][BE] Remove Sharded Linear Op for ShardedTensor (#95948)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95948
Approved by: https://github.com/wanchaol
2023-03-06 19:57:19 +00:00
4e396a54e8 [5/N][BE] Remove Replicated Tensor class (#95947)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95947
Approved by: https://github.com/wanchaol
2023-03-06 19:50:17 +00:00
b38b39c441 Revert "memory viz: Add colors for categories and a legend (#90587)"
This reverts commit ee4384250589f870f24e4d24894a03824ed1c49e.
2023-03-06 11:38:58 -08:00
22c9896ea4 Use original arg names if possible (#95898)
Use graphargs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95898
Approved by: https://github.com/suo
2023-03-06 19:04:49 +00:00
6fff232280 Delete torch._functorch.config.use_dynamic_shapes (#96102)
As requested in
https://github.com/pytorch/pytorch/pull/95975#discussion_r1124837162

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96102
Approved by: https://github.com/Skylion007
2023-03-06 18:50:20 +00:00
483e193d0e [CI] Use CUDA 11.8 to run inductor benchmark tests (#96059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96059
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-03-06 18:34:54 +00:00
000d9ec189 Use working pre-built sccache binary (#95997)
From https://github.com/pytorch/pytorch/pull/95938, where a new Docker image build fails to start sccache. This issue started happening today (Mar 3rd). The server fails to start with a cryptic `sccache: error: Invalid argument (os error 22)`:

```
=================== sccache compilation log ===================
ERROR 2023-03-03T20:31:14Z: sccache::server: failed to start server: Invalid argument (os error 22)

sccache: error: Invalid argument (os error 22)

=========== If your build fails, please take a look at the log above for possible reasons ===========

+ sccache --show-stats
sccache: error: Connection to server timed out
```

I don't have a good explanation for this yet. The version of sccache we build from https://github.com/pytorch/sccache is ancient. If I build the exact same version on the Ubuntu Docker image now, the issue manifests. But the older binary built only a few days ago (e50ff3fcdb) works without any issue. So I pin the sccache binary to that version instead of rebuilding it every time in the image, as a temporary mitigation while trying to root-cause this further.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95997
Approved by: https://github.com/ZainRizvi
2023-03-06 18:24:56 +00:00
69bfdcd244 Only print bytecode of inlined function if output_code is True (#95969)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95969
Approved by: https://github.com/janeyx99, https://github.com/jansel
2023-03-06 18:06:35 +00:00
69aa6b4bb9 fix typo in comments under torch/csrc/autograd (#96061)
This PR fixes typos in comments of `.cpp` and `.h` files under `torch/csrc/autograd` directory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96061
Approved by: https://github.com/soulitzer
2023-03-06 18:05:14 +00:00
301a28bf8c [primTorch] move diagonal & add linalg.diagonal refs (#95774)
Fixes #85419

Also, add `_refs.linalg.diagonal`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95774
Approved by: https://github.com/lezcano
2023-03-06 17:59:47 +00:00
1fd7ea1ba8 Update skips for RecursionError (#96109)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96109
Approved by: https://github.com/huydhn
2023-03-06 17:55:38 +00:00
5b2ab0dd4f Multiple fixes for functional collectives. (#95897)
_functional_collectives.py: Ensure we always wait all collectives.
derivatives.yaml: mark all_reduce as non differentiable
gen_variable_type.py: Add all_reduce to DONT_ENFORCE_TENSOR_IMPL_USE_COUNT
common_dtensor.py: replace dist.barrier with all_reduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95897
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-03-06 15:35:07 +00:00
3beafc91d1 USE_FAST_NVCC Windows (#95206)
USE_FAST_NVCC now works on Windows.

Fixes #67100

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95206
Approved by: https://github.com/ezyang
2023-03-06 15:04:24 +00:00
7a192cc51c dynamo: wrap graph break inst in try except block - with context manager setup/teardown (#94758)
Replacement to https://github.com/pytorch/pytorch/pull/94672.

Follow up to https://github.com/pytorch/pytorch/pull/94137.

We simply replace the set grad mode try except blocks with one for a more generic contextmanager (using `__enter__` and `__exit__`), storing the context manager into a `symbolic_local` for the duration of the try block.

(see https://github.com/pytorch/torchdynamo/issues/207 for the original motivation)

This allows us to handle calling inner functions with graph breaks for any arbitrarily deep nesting of live context managers subclassing `AbstractContextManager`. (see tests)
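
For illustration, here is a minimal user-level sketch (an assumed example, not taken from the PR's tests) of the pattern this enables: a graph break occurring while a user context manager is live, with dynamo entering and exiting it correctly around the break.

```python
import contextlib

import torch
import torch._dynamo

class MyCtx(contextlib.AbstractContextManager):
    def __exit__(self, *exc):
        return False  # do not suppress exceptions

def inner(x):
    torch._dynamo.graph_break()  # force a graph break while MyCtx is live
    return x + 1

@torch.compile(backend="eager")
def fn(x):
    with MyCtx():
        return inner(x) * 2

print(fn(torch.ones(3)))  # tensor([4., 4., 4.])
```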

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94758
Approved by: https://github.com/yanboliang
2023-03-06 14:04:17 +00:00
18d6ce9622 sparse compressed tensor validation without syncs for low-(batch)dim tensors. (#94048)
As per title. Sync is still unavoidable for super high-dim tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94048
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-03-06 13:37:30 +00:00
ebaf9af76e use float to init reduction value (#95452)
Fixes https://github.com/pytorch/pytorch/issues/95195, https://github.com/pytorch/pytorch/issues/95185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95452
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-06 08:49:36 +00:00
dcc159d3b6 inductor: pre-convert a TensorBox's layout to FixedLayout at FX side if one user of it is a CPU external customer kernel (#95873)
Given the following case:

```
import torch
import torch._dynamo

class Model(torch.nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 =  torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.conv2 =  torch.nn.Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
        self.silu = torch.nn.SiLU(inplace=False)

    def forward(self, x,):
        x = self.silu(x)
        y1 = self.conv1(x)
        y2 = self.conv2(x)
        return y1, y2

model = Model().eval()
model = model.to(memory_format=torch.channels_last).eval()
opt_model = torch._dynamo.optimize('inductor')(model)

x = torch.randn(128, 64, 112, 112).to(memory_format=torch.channels_last)
with torch.no_grad():
    for i in range(3):
        out = opt_model(x)
```

the silu output is used by two external kernels, and there is always a redundant memory copy:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
                tmp2.store(out_ptr1 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
                out_ptr1[i0] = tmp2;
            }
        }
    }
}
''')
```
This PR pre-converts the `silu`'s layout to FixedLayout on the FX side (it will be realized to avoid multiple realizations at the external kernel) if one of its users is a CPU external customer kernel. After this PR, the output code is:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/dl/cdljpywww2h2ag4o35mwbvm45hhasxnxkhqgbupxnk3y7olula65.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<6422528; i0+=1)
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + 16*i0);
                auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                auto tmp2 = tmp0 * tmp1;
                tmp2.store(out_ptr0 + 16*i0);
            }
            #pragma omp for simd simdlen(8)
            for(long i0=102760448; i0<102760448; i0+=1)
            {
                auto tmp0 = in_ptr0[i0];
                auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                auto tmp2 = tmp0 * tmp1;
                out_ptr0[i0] = tmp2;
            }
        }
    }
}
''')
```

Currently, this PR only considers CPU external customer kernels, but other external kernels may have the same issue.

For Timm **eca_halonext26ts** , this PR will give about **8%** performance improvement(BS=128, 20 cores on SKX).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95873
Approved by: https://github.com/jansel
2023-03-06 06:28:40 +00:00
cc775fb8c4 Upload torch dynamo perf stats to Rockset (#95675)
The new workflow is run after `inductor` or `inductor-perf-test-nightly` finish in trunk (not on PR).  All test reports CSV files are ingested into https://console.rockset.com/collections/details/inductor.torch_dynamo_perf_stats

### Testing

Run

* inductor-A100-perf
```
python -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 4272892998 --workflow-run-attempt 1 --repo pytorch/pytorch
```

to ingest some data from 9b7abc4fac

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95675
Approved by: https://github.com/weiwangmeta
2023-03-06 05:28:20 +00:00
02792ff16f [CI] Make inductor-perf-test-nightly produce data for dashboard (#95685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95685
Approved by: https://github.com/ezyang, https://github.com/huydhn
2023-03-06 03:14:03 +00:00
fa92b6a7b0 Error when jit.trace/script is used with torch.compile (#91681)
Fixes https://github.com/pytorch/pytorch/issues/93485

```python
import torch
from torchvision.models import resnet50

model = resnet50(weights=None)
compile_model = torch.compile(model)
print(type(compile_model))
example_forward_input = torch.rand(1, 3, 224, 224)
c_model_traced = torch.jit.trace(compile_model, example_forward_input) # or torch.jit.script
torch.jit.save(c_model_traced, "c_trace_model.pt")
```

Should I raise a warning if a user tries to compile a scripted or traced model as well? It works just fine now on resnet, but I'm not sure if that's something we want to explicitly discourage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91681
Approved by: https://github.com/desertfire
2023-03-06 02:03:35 +00:00
e8cd173aae Fix node provenance tracking (#95901)
Before:
```
triton_fused_add_83_add_84_convolution_15_relu_12_relu_13_squeeze_46_var_mean_15_14
```

After:
```
triton_fused_add_83_add_84_relu_13_squeeze_46_var_mean_15_14
```

For this kernel
```
@persistent_reduction(
    size_hints=[512, 64],
    reduction_hint=ReductionHint.INNER,
    filename=__file__,
    meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32', 6: '*fp32', 7: '*fp32', 8: '*fp32', 9: '*fp32', 10: 'i32', 11: 'i32'}, 'device': 0, 'constants': {}, 'mutated_arg_names': ['in_out_ptr0'], 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10), equal_to_1=())]}
)
@triton.jit
def triton_(in_out_ptr0, in_ptr0, in_ptr1, in_ptr2, in_ptr3, in_ptr4, out_ptr0, out_ptr2, out_ptr3, out_ptr4, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr):
    xnumel = 512
    rnumel = 49
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[None, :]
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (49*x0)), rmask & xmask, other=0)
    tmp8 = tl.load(in_ptr1 + (x0), xmask)
    tmp22 = tl.load(in_ptr2 + (x0), xmask)
    tmp24 = tl.load(in_ptr3 + (x0), xmask)
    tmp30 = tl.load(in_ptr4 + (x0), xmask)
    tmp2 = tl.where(rmask & xmask, tmp0, 0)
    tmp3 = tl.sum(tmp2, 1)[:, None]
    tmp4 = 49.0
    tmp5 = tmp3 / tmp4
    tmp6 = 0.1
    tmp7 = tmp5 * tmp6
    tmp9 = 0.9
    tmp10 = tmp8 * tmp9
    tmp11 = tmp7 + tmp10
    tmp12 = tmp0 - tmp5
    tmp13 = tmp12 * tmp12
    tmp15 = tl.where(rmask & xmask, tmp13, 0)
    tmp16 = tl.sum(tmp15, 1)[:, None]
    tmp17 = tmp16 / tmp4
    tmp18 = 1e-05
    tmp19 = tmp17 + tmp18
    tmp20 = tl.libdevice.rsqrt(tmp19)
    tmp21 = tmp12 * tmp20
    tmp23 = tmp21 * tmp22
    tmp25 = tmp23 + tmp24
    tmp26 = tl.where(0 != 0, 0, tl.where(0 > tmp25, 0, tmp25))
    tmp27 = 1.0208333333333333
    tmp28 = tmp17 * tmp27
    tmp29 = tmp28 * tmp6
    tmp31 = tmp30 * tmp9
    tmp32 = tmp29 + tmp31
    tl.store(in_out_ptr0 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp5, xmask)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp11, xmask)
    tl.store(out_ptr2 + (r1 + (49*x0) + tl.zeros([XBLOCK, RBLOCK], tl.int32)), tmp26, rmask & xmask)
    tl.store(out_ptr3 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp20, xmask)
    tl.store(out_ptr4 + (x0 + tl.zeros([XBLOCK, 1], tl.int32)), tmp32, xmask)
```

Tbh this still isn't super great provenance tracking, since ops like layernorms are decomposed. I might add some extra provenance tracking during decompositions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95901
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-03-05 21:52:48 +00:00
36a6e2c54b Automated submodule update: kineto (#95798)
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: d9954ad558

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95798
Approved by: https://github.com/cpuhrsch
2023-03-05 19:59:41 +00:00
5dd52e250f [inductor] Add some simple decomps (#96039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96039
Approved by: https://github.com/ngimel
2023-03-05 17:07:56 +00:00
60cf95610d [CI] Skip xcit_large_24_p8_224 in TIMM (#96048)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96048
Approved by: https://github.com/jansel
2023-03-05 14:54:46 +00:00
1359d16fe8 [CI] Further tighten the checking of two eager runs (#95902)
Summary: To catch nondeterminism in eager if there is any.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95902
Approved by: https://github.com/jansel
2023-03-05 14:53:02 +00:00
43e71cddb0 [inductor] use triu ref instead of lowering (#96040)
Fixes #95958
Generated code is functionally identical with ref and lowering, only minor differences

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96040
Approved by: https://github.com/jansel
2023-03-05 07:24:34 +00:00
789fc4c292 [dtensor] refactor shape/offset calculation (#95923)
Shape offset calculation is commonly used and extract them into a separate util

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95923
Approved by: https://github.com/fduwjj
2023-03-05 06:33:32 +00:00
af8dbe7ec2 Fix training enablement in AOTAutograd (#95975)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95975
Approved by: https://github.com/ngimel, https://github.com/voznesenskym
2023-03-05 04:28:29 +00:00
b7a3f331f1 Add doc test in graph_drawer.py (#95919)
Add a doc test, extending #95534 .

I found I need to put the xdoctest under a class method. Otherwise, if it's right under the class definition, the test cannot be found. @Erotemic Am I missing anything?

The xdoctest has been tested:
```
$ pytest --xdoctest torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0
=========== test session starts ==================
platform linux -- Python 3.9.15, pytest-7.2.1, pluggy-1.0.0
rootdir: /localdisk/wenzhexu/dev/forked_pytorch, configfile: pytest.ini
plugins: xdoctest-1.1.1
collected 1 item

torch/fx/passes/graph_drawer.py .                                                                                                                                                                               [100%]

============ 1 passed in 1.13s ===================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95919
Approved by: https://github.com/ezyang
2023-03-05 02:23:18 +00:00
5da6da659a [inductor] Enable some decomps (#96038)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96038
Approved by: https://github.com/ngimel
2023-03-05 02:03:35 +00:00
03b6e6979c Transformers: fix src and key padding mask bool regression (#96009)
Summary: fix src and pad mask bool regression

This fixes a regression introduced previously with #92733. That PR unified the testing of masks to remove Byte tensors as a permissible mask, and introduced a mask compatibility check plus conversion of masks to FP masks. The problem addressed in this PR is that after the first mask had been converted, the check for mask compatibility would fail.

Test Plan: sandcastle & github

Differential Revision: D43782858

Fixes  https://github.com/pytorch/pytorch/issues/95702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96009
Approved by: https://github.com/malfet
2023-03-05 01:50:46 +00:00
78da315afd [MPS] Fix bidirectional LSTM & small one-direction LSTM fix (#95563)
Fixes #94754

With this PR I hope to finish my breathtaking journey of fixing MPS LSTM.

Here, I enable `bidirectional` on MPS. Also, I've noticed that the cache key did not account for all parameters, so there could have been problems with a one-directional LSTM when it was first created without bias or dropout and then created with one of them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95563
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/malfet
2023-03-05 00:19:54 +00:00
c7c4a20321 Update dynamic skips (#95966)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95966
Approved by: https://github.com/janeyx99, https://github.com/voznesenskym
2023-03-04 23:01:58 +00:00
49f6849f58 Fix codegen logic for foreach derivatives (#95263)
follow-up https://github.com/pytorch/pytorch/pull/93901.

Unexpected numerical mismatches observed in some foreach functions' backward results seem to be caused by the wrong order of `IndexRangeGenerator::range` calls.
This PR makes `args_with_derivatives` follow the same (or a similar) order as `foreach_native_function.func.arguments.flat_non_out`.

---

what the current master generates for `_foreach_mul.List`:
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto other_ix = gen.range(other_size_);
  auto self_ix = gen.range(self_size_);
  variable_list grad_inputs(gen.size());
  auto other = unpack_list(other_);
  auto self = unpack_list(self_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}
```

with this PR the generated backward is
```cpp
variable_list ForeachMulBackward0List::apply(variable_list&& grads) {
  std::lock_guard<std::mutex> lock(mutex_);
  TORCH_CHECK(!self_released_, ERR_BACKWARD_TWICE);
  TORCH_CHECK(!other_released_, ERR_BACKWARD_TWICE);
  IndexRangeGenerator gen;
  auto self_ix = gen.range(self_size_);                                         <----- diff
  auto other_ix = gen.range(other_size_);                                       <----- diff
  variable_list grad_inputs(gen.size());
  auto self = unpack_list(self_);
  auto other = unpack_list(other_);
  if (task_should_compute_output({ other_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], self[i], other[i].scalar_type()));
    }
    copy_range(grad_inputs, other_ix, grad_result);
  }
  if (task_should_compute_output({ self_ix })) {
    std::vector<Tensor> grad_result;
    grad_result.reserve(grads.size());
    for (const auto & i : c10::irange(grads.size())) {
      grad_result.emplace_back(mul_tensor_backward(grads[i], other[i], self[i].scalar_type()));
    }
    copy_range(grad_inputs, self_ix, grad_result);
  }
  return grad_inputs;
}

```

The change is to fix the order of `self_ix` and `other_ix`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95263
Approved by: https://github.com/soulitzer
2023-03-04 20:03:54 +00:00
a10897a344 [Dynamo] Fix number of inputs in onnxrt and tvm backend (#95429)
This PR intends to fix #95428 by only binding active inputs to onnxrt's inference session and tvm's runtime lib after model conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95429
Approved by: https://github.com/jansel
2023-03-04 18:14:53 +00:00
26045336ca Optimize nn.Module __call__ fast path for dynamo (#95931)
This PR optimizes the guards overhead introduced by dynamo tracing module forward hooks.

It can and maybe should be followed by a wider change proposed by @voznesenskym to optimize specialized nnmodules by 'observing' any user mutations and directly invalidating the root guard, obviating the need to install other nnmodule guards.  (But this observer change seems more involved...)

Idea: maintain a flag, and keep it up to date whenever adding or removing hooks. Use the flag rather than dict checks to enter the call fast path (a minimal sketch follows after the list below).
  - need to extend RemovableHandle to keep a ref to nnModule so it can update the flag on removal.
  - also need to handle the flag in ScriptModule which still uses the python call impl when called from python.
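
A minimal, hypothetical sketch of the flag idea (not PyTorch's actual `nn.Module` implementation; names are illustrative):

```python
class Module:
    """Toy module illustrating the hook-flag fast path."""

    def __init__(self):
        self._forward_hooks = {}
        self._has_hooks = False  # kept in sync whenever hooks are added/removed

    def register_forward_hook(self, fn):
        handle = id(fn)
        self._forward_hooks[handle] = fn
        self._has_hooks = True
        return handle

    def remove_forward_hook(self, handle):
        self._forward_hooks.pop(handle, None)
        self._has_hooks = bool(self._forward_hooks)

    def forward(self, x):
        return x

    def __call__(self, x):
        if not self._has_hooks:      # fast path: a single attribute check
            return self.forward(x)
        out = self.forward(x)        # slow path: also run the hooks
        for hook in self._forward_hooks.values():
            hook(self, x, out)
        return out
```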

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95931
Approved by: https://github.com/ezyang, https://github.com/voznesenskym
2023-03-04 15:09:40 +00:00
6ca286df69 [Dynamo] Support call dict with list/tuple as input (#95928)
Fixes Meta internal use case

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95928
Approved by: https://github.com/jansel
2023-03-04 05:52:33 +00:00
43dd043ea7 Revert "[inductor] Improve error messages (#95567)" (#96014)
This reverts commit 62b775583f008effc510e5f5c3e2b30a85a53465.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96014
Approved by: https://github.com/Chillee
2023-03-04 04:03:31 +00:00
dc70e8175f Add various uninterpreted bit tensor data types (try 2) (#95860)
Summary:

This is a retry of https://github.com/pytorch/pytorch/pull/94992 which was reverted due to CI issues.

This PR adds a set of unintrepreted data types on PyTorch which can be used to implement experimental functionality out of core (think fp8, int4, int16 quant, etc).

@bypass-github-export-checks

Test Plan:

```
python test/test_quantization.py -k TestBits
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95860
Approved by: https://github.com/atalman
2023-03-04 03:35:59 +00:00
5e1067bcc2 [vision hash update] update the pinned vision hash (#95932)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95932
Approved by: https://github.com/pytorchbot
2023-03-04 03:32:32 +00:00
7ff9612e34 Improve error message for instance norm when channels is incorrect (#94624)
Fixes https://github.com/pytorch/pytorch/issues/90514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94624
Approved by: https://github.com/jbschlosser
2023-03-04 02:06:48 +00:00
436993d52b [MPS] Error on unsupported types (#95982)
I.e., attempt to create tensors of all possible dtypes and make sure that a structured error is raised for dtypes the MPS backend does not support.
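
A rough sketch of the kind of check this describes (an assumed example, not the test added by this PR):

```python
import torch

# Constructing a tensor with an unsupported dtype (e.g. float64) on MPS should
# raise a clear error rather than crash.
if torch.backends.mps.is_available():
    try:
        torch.zeros(2, dtype=torch.float64, device="mps")
    except (TypeError, RuntimeError) as e:
        print("unsupported dtype rejected:", e)
else:
    print("MPS not available on this machine")
```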

Also, rename `test_resize_as_all_dtypes_and_devices` to `test_resize_as_mps_dtypes` and `test_resize_all_dtypes_and_devices` to `test_resize_mps_dtypes`, and run both tests for all MPS dtypes (rather than just bool, float16, and bfloat16 as before).

Fixes https://github.com/pytorch/pytorch/issues/95976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95982
Approved by: https://github.com/kulinseth
2023-03-04 01:29:07 +00:00
f4b33791fd [BE] Remind people there are rsts to update in docs/source (#95914)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95914
Approved by: https://github.com/msaroufim
2023-03-04 01:23:46 +00:00
d303665d33 Make int unspecialization actually work (#95621)
OK, so this PR used to be about reducing the number of constants we specialize on, but it turns out that unspecialization was ~essentially never used (because we still constant specialized way too aggressively) and I ended up having to fix a bunch of issues to actually get tests to pass. So this PR is now "make int unspecialization actually work". As part of this, I have to turn off unspecialization by default, as there are still latent bugs in inductor.

The general strategy is that an unspecialized int is represented as a SymInt. Representing it as a 0d tensor (which is what the code used to do) is untenable: (1) we often need unspecialized ints to participate in size computations, but we have no way of propagating sympy expressions through tensor compute, and (2) a lot of APIs work when passed SymInt, but not when passed a Tensor. However, I continue to represent Numpy scalars as Tensors, as they are rarely used for size computation and they have an explicit dtype, so they are more accurately modeled as 0d tensors.

* I folded in the changes from https://github.com/pytorch/pytorch/pull/95099 as I cannot represent unspecialized ints as SymInts without also turning on dynamic shapes. This also eliminates the necessity for test_unspec.py, as toggling specialization without dynamic shapes doesn't do anything. As dynamic shapes defaults to unspecializing, I just deleted this entirely; for the specialization case, I rely on regular static shape tests to catch it. (Hypothetically, we could also rerun all the tests with dynamic shapes, but WITH int/float specialization, but this seems... not that useful? I mean, I guess export wants it, but I'd kind of like our Source heuristic to improve enough that export doesn't have to toggle this either.)
* Only 0/1 integers get specialized by default now
* A hodgepodge of fixes. I'll comment on the PR about them.

Fixes https://github.com/pytorch/pytorch/issues/95469

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95621
Approved by: https://github.com/jansel, https://github.com/Chillee
2023-03-04 01:22:08 +00:00
7d765cdc66 Fix wrong handling of grad_scale & found_inf in fused optimizers (#95847)
Fixes #95781.
The cause seems to be that the current implementation doesn't correctly pass `found_inf` when `grad_scale` is `None`. Therefore parameters can get mistakenly updated by gradients some of whose elements are invalid, i.e. nan or inf.
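
A minimal sketch of the intended semantics (an illustration only, not the fused-kernel code): when `found_inf` signals inf/nan gradients, the update must be skipped even if `grad_scale` is `None`.

```python
import torch

def maybe_step(param, grad, lr, grad_scale=None, found_inf=None):
    # Skip the update entirely if inf/nan was detected, regardless of grad_scale.
    if found_inf is not None and found_inf.item():
        return
    if grad_scale is not None:
        grad = grad / grad_scale
    param.add_(grad, alpha=-lr)

p = torch.zeros(3)
maybe_step(p, torch.full((3,), float("inf")), lr=0.1,
           grad_scale=None, found_inf=torch.tensor(1.0))
print(p)  # still zeros: the invalid update was skipped
```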

Related #94060

I forgot about this wrong handling after #94344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95847
Approved by: https://github.com/janeyx99
2023-03-04 01:21:21 +00:00
d214d82acd Prettify assert expr in self.symbol_to_source failure (#95972)
Main QOL improvement is to print the name() of Source, not the repr.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95972
Approved by: https://github.com/Skylion007
2023-03-04 01:04:28 +00:00
4d9c499dc2 [SPMD] Introduce the cross-iteration graph optimization framework (#94803)
Introduce the cross-iteration graph optimization framework that allows users to write graph optimizations that move nodes across iterations.

Differential Revision: [D43247247](https://our.internmc.facebook.com/intern/diff/D43247247/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94803
Approved by: https://github.com/anj-s
2023-03-04 00:59:40 +00:00
5a07c3d3d1 Remove fake inputs from control flow (#95988)
Previously running make_fx with tracing_mode="symbolic" resulted in `RuntimeError: Creating a new Tensor subclass FakeTensor but the raw Tensor object is already associated to a python object of type FakeTensor`. This is probably due to there existing multiple FakeTensorModes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95988
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2023-03-04 00:58:52 +00:00
9a781ce3e1 Add flop formulas for sdpa (#95854)
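For context, a commonly used rough estimate of the forward FLOPs of scaled dot-product attention (an assumption for illustration, not necessarily the exact formula added by this PR), with batch B, heads H, query/key lengths S_q/S_k, and head dimension D:

```latex
% Two large matmuls dominate; softmax and scaling are ignored here.
\mathrm{FLOPs}_{\mathrm{fwd}} \approx
  \underbrace{2\,B\,H\,S_q\,S_k\,D}_{Q K^{\top}}
  + \underbrace{2\,B\,H\,S_q\,S_k\,D}_{\mathrm{attn}\cdot V}
  = 4\,B\,H\,S_q\,S_k\,D
```
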
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95854
Approved by: https://github.com/drisspg
2023-03-04 00:33:34 +00:00
7db5f8c765 Improve Discoverability of Inductor Optimizations (#95824)
Finding out what the inductor configs mean has been a confusing point for the community, so this adds some top-level functions that print them to the console so people don't have to dig through the source code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95824
Approved by: https://github.com/jansel
2023-03-04 00:30:10 +00:00
7d02ecfabb Fix typo in RELEASE.md and CONTRIBUTING.md (#95965)
This PR fixes typos in `RELEASE.md` and `CONTRIBUTING.md`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95965
Approved by: https://github.com/kit1980
2023-03-04 00:14:05 +00:00
ac07de4a61 Add export docs, improve asserts (#94961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94961
Approved by: https://github.com/tugsbayasgalan
2023-03-03 23:40:00 +00:00
027ebca4d7 Don't use guardless contiguity/stride-like implementations (#95733)
These prevent us from simplifying tests involving unbacked SymInts,
and then you end up with unbacked SymInt in guards, which is bad.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95733
Approved by: https://github.com/tugsbayasgalan
2023-03-03 21:56:41 +00:00
f8b57ba635 [EASY] Unindent some blocks (#95967)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95967
Approved by: https://github.com/Skylion007
2023-03-03 21:05:36 +00:00
e6f3e16d89 Fix: validate_input_col for partial functions (#95067)
Fixes #95066

#### Proposed change:
do not call `str()` on a `Callable` to determine its name
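
A minimal sketch of the idea (hypothetical code, not the PR's implementation): resolve the callable's display name via attribute lookup instead of `str()`, which can be extremely slow for `functools.partial` objects that wrap large arguments.

```python
import functools

def foo(x, table=None):
    return x

# str(fn) would render the (potentially huge) bound arguments; getattr does not.
fn = functools.partial(foo, table=list(range(10)))

name = getattr(fn, "__name__", None)
if name is None and isinstance(fn, functools.partial):
    name = getattr(fn.func, "__name__", repr(fn.func))
print(name)  # -> "foo"
```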

#### Reasoning:
Please see https://github.com/pytorch/pytorch/issues/95066 for reasoning and examples

#### Effect:
* The code example given in https://github.com/pytorch/pytorch/issues/95066 now executes instantly.
* If invalid input is provided, the stacktrace now prints nicely as
  ```
  ValueError: The function foo takes 1 parameters, but 2 are required.
  ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95067
Approved by: https://github.com/NivekT, https://github.com/ejguan
2023-03-03 21:05:07 +00:00
ee43842505 memory viz: Add colors for categories and a legend (#90587)
Adds a category legend to memory trace plots that colors allocations by their role (activation, parameter, gradient, etc.) as captured by kineto.

Differential Revision: [D43757381](https://our.internmc.facebook.com/intern/diff/D43757381)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90587
Approved by: https://github.com/aaronenyeshi
2023-03-03 20:42:22 +00:00
c72fbf2e5a [inductor] do not use ceil in arange ref (#95773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95773
Approved by: https://github.com/ezyang
2023-03-03 20:38:18 +00:00
feffcafe09 [inductor] use FP div in CPP expr printer (#95698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95698
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-03-03 20:38:18 +00:00
6c061e5145 [DTensor][Shampoo] add _tenso.zero function (#95863)
Summary:
implement zeros function inside DTensor API
- users specify the zeros tensor shape, and the function creates the local zero tensor given the placement information

Test Plan:
- unit test for the util function compute_local_tensor_size
- unit test for _tensor.zeros

Reviewed By: wanchaol

Differential Revision: D43630718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95863
Approved by: https://github.com/wanchaol
2023-03-03 19:36:44 +00:00
1d3c394d5e [MTPG] Improve all_reduce and handle bwd thread support (#95524)
This implements all reduce ops in all_reduce and supports a PG being used from a thread different from the one that created it.

We should be this >< close to getting complex training tests working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95524
Approved by: https://github.com/H-Huang
2023-03-03 18:53:36 +00:00
a7698a8260 [DCP] Add DCP FSDP sharded_state_dict checkpoint example to DCP .rst file (#95517)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95517
Approved by: https://github.com/kumpera
2023-03-03 18:09:10 +00:00
4026c62174 Revert "Don't use guardless contiguity/stride-like implementations (#95733)"
This reverts commit deaf077de82789656c707d4b4b2c2e0d1ecee684.

Reverted https://github.com/pytorch/pytorch/pull/95733 on behalf of https://github.com/ezyang due to apparently this regresses executorch tests internally
2023-03-03 17:43:05 +00:00
7f5f0b3665 Run _nvfuser/test_torchscript serially (#95951)
Started at ce4cbac914 (11734276291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95951
Approved by: https://github.com/huydhn
2023-03-03 17:41:09 +00:00
879400e4e8 Revert "[inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)"
This reverts commit 73b66098b2f43be508e1975fd6a425ed6308b993.

Reverted https://github.com/pytorch/pytorch/pull/94822 on behalf of https://github.com/clee2000 due to broke inductor_tmm_cpu_accuracy, 73b66098b2 (11745396725)
2023-03-03 17:33:27 +00:00
d21577f28c Run more tests through pytest (#95844)
Run more tests through pytest.

Use a block list for tests that shouldn't run through pytest.  As far as I can tell, the number of tests run, skipped, and xfailed for those not on the blocklist are the same.

Regarding the main module:

Usually when tests are run in CI, we call `python <test file>`, which causes the file to be imported under the module name `__main__`. However, pytest searches for the module to be imported under the file name, so the file will be re-imported. This can cause issues for tests that run module-level code and change global state, like test_nn, which modifies lists imported from another file, or tests in test/lazy, which initialize a backend that cannot coexist with a second copy of itself.

My workaround for this is to run tests from the `__main__` module.  However, this results in pytest being unable to rewrite assertions (and possibly other things but I don't know what other things pytest does right now).  A better solution might be to call `pytest <test file>` directly and move all the code in run_tests(argv) to be module level code or put it in a hook in conftest.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95844
Approved by: https://github.com/huydhn
2023-03-03 17:32:26 +00:00
004bcffc6a Fix formatting (#95906)
Fixing list formatting by adding a missing blank line:

Before:
![Screenshot 2023-03-02 at 3 17 28 PM (2)](https://user-images.githubusercontent.com/5317992/222585127-9b6ed4dd-4719-4756-b2ac-1ba6e8f97b87.png)

After:
![Screenshot 2023-03-02 at 3 16 48 PM (2)](https://user-images.githubusercontent.com/5317992/222585172-3ef35a48-641f-4b73-9f7b-f419a122196b.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95906
Approved by: https://github.com/orionr
2023-03-03 16:18:12 +00:00
f3c25cd348 [Quant][PT2.0] fix issues for rearranging weight observer for decomposed linear (#94296)
**Summary**
Linear is decomposed to `t - addmm/mm` after `dynamo.export`. And weight's observer is inserted between `t` and `addmm/mm` in the first place. `_rearrange_weight_observer_for_addmm()` is then called to move the observer between weight and `t`.
```
    before:
         weight - t - observer \
           input - observer - addmm/mm
    after:
         weight - observer - t \
           input - observer - addmm/mm
```
We found two issues with `_rearrange_weight_observer_for_addmm()`:
- It does not call `m.recompile()` at the end, so it does not function correctly.
- It does not support `aten.mm.default`, which comes from decomposed linear without bias.

This PR fixes the two issues and renames the function to `_rearrange_weight_observer_for_decomposed_linear`.

**Test plan**
python test/test_quantization.py -k test_rearrange_weight_observer_for_decomposed_linear

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94296
Approved by: https://github.com/jgong5, https://github.com/andrewor14
2023-03-03 15:54:11 +00:00
d809020fc8 Triton kernel for bsr @ dense (#94823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94823
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2023-03-03 15:11:28 +00:00
88e554b513 Move label check failure to mergebot (#94707)
Fixes https://github.com/pytorch/pytorch/issues/88098

This is a mirror of the same PR (https://github.com/Goldspear/pytorch/pull/3) that has been reviewed in my fork (since it's a stacked PR).

==============
## Context
This is the 3rd of the 3 PRs to address issue 88098.

## What Changed
1. check_labels.py no longer fails, but only leaves a comment
2. trymerge.py now fails if no required labels are provided

## Tests
* dummy-repo trymerge run [fails without required label](https://github.com/Goldspear/pytorch-dummy/actions/runs/4162819216) and resulted in [a label error comment](https://github.com/Goldspear/pytorch-dummy/pull/3#issuecomment-1427756769)
* the above pr was [correctly merged](https://github.com/Goldspear/pytorch-dummy/pull/3) after label is added.

## Note to Reviewers
1st PR: https://github.com/pytorch/pytorch/pull/94179
2nd PR: https://github.com/pytorch/pytorch/pull/94899
3rd PR: this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94707
Approved by: https://github.com/ZainRizvi
2023-03-03 15:09:14 +00:00
73b66098b2 [inductor] Add an AOT compilation mode for Inductor CPP backend (#94822)
Summary: The AOT mode currently works for the CPP backend. When turned on, Inductor compiles the model code into a .so file with aot_inductor_entry as the entry function. If the AOT compilation fails, Inductor will explicitly fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94822
Approved by: https://github.com/jansel
2023-03-03 14:18:09 +00:00
3eb8eaa177 Inductor: fix crash when indexing an empty tensor by invalid index (#95046)
This PR is for #94830. It aligns the behavior with eager.
Reference: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/TensorAdvancedIndexing.cpp#L550-L558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95046
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-03 14:12:02 +00:00
d4e0d895e9 inductor: fix permute_linear_fusion/linear_permute_fusion missing 'bias' KeyError issue (#95930)
Fix https://github.com/pytorch/pytorch/issues/95912.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95930
Approved by: https://github.com/ngimel, https://github.com/jiawenliu64
2023-03-03 12:02:49 +00:00
35bf5bac26 Fix "sandcastle_skip_if decorator name is confusing" (#95649)
Fixes https://github.com/pytorch/pytorch/issues/89473
See the issue https://github.com/pytorch/pytorch/issues/89473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95649
Approved by: https://github.com/atalman, https://github.com/malfet
2023-03-03 09:29:40 +00:00
0147a408c3 Refactor inductor collectives around base class (#95920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95920
Approved by: https://github.com/wanchaol
2023-03-03 09:00:48 +00:00
92a2107375 Support Inductor collectives with wait or collective outside graph (#95893)
Inductor implementations of collectives/wait must match
eager impls in _functional_collectives in terms of interacting
with _register_tensor_work API.  If they do, then splitting
a collective-wait pair so one half is in a compiled graph should
work fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95893
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
7206b5e9e5 Remove pydispatcher from test since no longer needed (#95890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95890
Approved by: https://github.com/kumpera
2023-03-03 09:00:48 +00:00
738beaa6b8 [dtensor] fix experimental_op slice_scatter (#95894)
Test Plan: test with spmd e2e flow

Differential Revision: D43740349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95894
Approved by: https://github.com/fegin
2023-03-03 08:41:22 +00:00
304a95435d [MPS] Disallow reshape in slice (#95905)
Disallow reshapes for arrayViews.
Current code allows a base shape of `[2, 4, 256]` to be sliced into `[4, 1, 256]` (view's shape) - which is not possible. Slicing a smaller dimension into a bigger one will always error out.

Fixes https://github.com/pytorch/pytorch/issues/95883
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95905
Approved by: https://github.com/razarmehr, https://github.com/kulinseth
2023-03-03 08:08:34 +00:00
cyy
a32be76a53 Disable more warnings on Windows CI test (#95933)
These warnings are disabled to avoid long logs in Windows tests. They are also disabled in the CMake builds currently.
- '/wd4624': MSVC complains "destructor was implicitly defined as deleted" on c10::optional and other templates
- '/wd4076': "unexpected tokens following preprocessor directive - expected a newline" on some headers
- '/wd4068': "The compiler ignored an unrecognized [pragma]"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95933
Approved by: https://github.com/ezyang
2023-03-03 07:11:13 +00:00
22d3ac79d2 [torchgen] Prettify generated type annotations (#95877)
Changes:

1. Use class inheritance for `torch/return_types.pyi`:

    Before:

    ```python
    max = NamedTuple("max", [("values", Tensor), ("indices", Tensor)])
    ```

    After:

    ```python
    class max(NamedTuple):
        values: Tensor
        indices: Tensor
    ```

------

2. Add missing spaces in generated type annotations.

    1. Always have a space after `,`.
    2. If an argument is annotated, then there need to be spaces around `=` when it has a default value.

        ```diff
        - def func(..., out: Optional[Tensor]=None, ...) -> Tensor:
        + def func(..., out: Optional[Tensor] = None, ...) -> Tensor:
        ```

    3. If an argument is not annotated, then there should be no spaces around `=` when it has a default value.

        ```python
        def contiguous(self, memory_format=torch.contiguous_format) -> Tensor: ...
        ```

------

3. ~Remove redundant import alias in `torch/nn/functional.pyi`:~ (Reverted)

    UPDATE: `mypy` needs the alias to work.

    Before:

    ```python
    from .. import conv1d as conv1d
    from .. import conv2d as conv2d
    from .. import conv3d as conv3d
    from .. import conv_transpose1d as conv_transpose1d
    from .. import conv_transpose2d as conv_transpose2d
    from .. import conv_transpose3d as conv_transpose3d
    from .. import conv_tbc as conv_tbc
    from .. import avg_pool1d as avg_pool1d
    from .. import relu_ as relu_
    from .. import selu_ as selu_
    from .. import celu_ as celu_
    from .. import rrelu_ as rrelu_
    from .. import pixel_shuffle as pixel_shuffle
    from .. import pixel_unshuffle as pixel_unshuffle
    from .. import channel_shuffle as channel_shuffle
    from .. import native_channel_shuffle as native_channel_shuffle
    from .. import pdist as pdist
    from .. import cosine_similarity as cosine_similarity
    ```

    After:

    ```python
    from .. import (
        conv1d,
        conv2d,
        conv3d,
        conv_transpose1d,
        conv_transpose2d,
        conv_transpose3d,
        conv_tbc,
        avg_pool1d,
        relu_,
        selu_,
        celu_,
        rrelu_,
        pixel_shuffle,
        pixel_unshuffle,
        channel_shuffle,
        native_channel_shuffle,
        pdist,
        cosine_similarity,
    )
    ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95877
Approved by: https://github.com/ezyang
2023-03-03 07:08:40 +00:00
3bb76e6ced [static-runtime] increase verbosity for schema check (#95937)
Summary: as titled

Differential Revision: D43758690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95937
Approved by: https://github.com/wushirong, https://github.com/hl475, https://github.com/houseroad
2023-03-03 06:50:28 +00:00
76ade51307 [pt2][inductor] turn off cache search by default (#95662)
Summary: set `search_autotune_cache=False` by default due to inductor compilation regression on HF models, while working on reducing overhead

Test Plan: CI

Differential Revision: D43641286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95662
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-03-03 06:09:54 +00:00
53b4f6c0f6 Revert "[jit] Add c++ stacktraces for jit::ErrorReport (#94842)" (#95886)
This reverts commit 70029214f300f611e7dd816b5f64426224f6ab96.

It broke some internal tests.

Differential Revision: [D43735833](https://our.internmc.facebook.com/intern/diff/D43735833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95886
Approved by: https://github.com/malfet, https://github.com/qihqi
2023-03-03 05:49:40 +00:00
d1a168f176 equal_quantized_cpu requires both inputs are quantized tensor (#95875)
**Summary**
Fix the issue https://github.com/pytorch/pytorch/issues/95291: `equal_quantized_cpu` requires both inputs to be quantized tensors.
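
A small illustration of the intended usage (an assumed example, not taken from the PR's test):

```python
import torch

x = torch.randn(4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

# Both arguments quantized: supported.
print(torch.equal(qx, qx))  # True

# Mixing a quantized and a plain float tensor is what the fix guards against;
# dequantize first if a floating-point comparison is what you want.
print(torch.equal(qx.dequantize(), x))  # likely False due to quantization rounding
```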

**Test Plan**
```
python -m pytest test_quantization.py -k test_quantized_equal
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95875
Approved by: https://github.com/vkuzo, https://github.com/jgong5
2023-03-03 05:33:23 +00:00
4e02ad7538 Rename inductor collectives test (#95889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95889
Approved by: https://github.com/kumpera
2023-03-03 04:58:39 +00:00
ce4cbac914 Change linux.gcp.a100 to linux.gcp.a100.large (#95913)
To avoid making workloads like https://github.com/pytorch/pytorch/blob/master/.github/workflows/inductor.yml#L52 queue for a long time.

For example, in the past the runners were all used to run https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly.yml and perf smoke test jobs in https://github.com/pytorch/pytorch/actions/workflows/inductor.yml did not get runners.

<img width="614" alt="image" src="https://user-images.githubusercontent.com/109318740/222570066-7aec611d-0feb-42cb-8b1b-d93bd36f4d17.png">

This PR makes sure the queue only happens to long-running workloads and we have a plan to address it with more runners with basic auto-scaling feature enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95913
Approved by: https://github.com/malfet, https://github.com/huydhn
2023-03-03 04:32:03 +00:00
2ee85f9e8b Extend filter logic to remaining CI workflows (#95437)
Per title, this extends CI filter logic to all remaining *smaller* workflows across pull, trunk, and periodic.  These jobs can then be disabled dynamically.  Before this, the filter logic only exists in major platform workflows including linux, windows, macos, and rocm.

* These *smaller* workflows now accept the `test-matrix` input with one the default shard
* `filter-test-configs` logic is added as a filter step

This is needed after https://github.com/pytorch/pytorch/pull/95442 in the event where we need to disable these jobs

### Testing

* Disable https://github.com/pytorch/pytorch/issues/95746 for testing. Confirm in https://github.com/pytorch/pytorch/actions/runs/4299047707/jobs/7493851429 that the job is disabled (skipped)
* Disable https://github.com/pytorch/pytorch/issues/95752 for testing. Confirm in https://github.com/pytorch/pytorch/actions/runs/4299049008/jobs/7512566000 that the job is disabled (skipped). Note that MPS is the special case where it could be triggered by `Mac MPS` or `trunk` workflows.  So disabling MPS would strictly require both to be disabled.  IMO, this is not a big issue as we would only need to disable `trunk` most of the times.  Devs who attach `ciflow/mps` and use `Mac MPS` workflow know what they are doing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95437
Approved by: https://github.com/clee2000
2023-03-03 04:23:21 +00:00
53c9866ffa Print the actual values with scalar mismatch. (#95879)
When you do assertEqual between two ints, previously it would only print

```
Absolute difference: 1
Relative difference: 0.3333333333333333
```

Now it prints:

```
Expected 3 but got 2.
Absolute difference: 1
Relative difference: 0.3333333333333333
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95879
Approved by: https://github.com/dagitses, https://github.com/albanD
2023-03-03 02:22:20 +00:00
c22e7c7bf3 Revert "sparse compressed tensor validation without syncs for low-(batch)dim tensors. (#94048)"
This reverts commit 7901f2d1560bb858f62fc8c28ff5672dd8d53914.

Reverted https://github.com/pytorch/pytorch/pull/94048 on behalf of https://github.com/seemethere due to Sign compare between size_t and int64_t is not allowed
2023-03-03 02:03:14 +00:00
d7637801d3 Revert "COO intersection primitives: performance improvement (#92976)"
This reverts commit b033594943876d68b9278d4c2ed04fc3c968f001.

Reverted https://github.com/pytorch/pytorch/pull/92976 on behalf of https://github.com/seemethere due to Need to revert this so I can revert https://github.com/pytorch/pytorch/pull/94048 cleanly
2023-03-03 01:38:56 +00:00
1b1b9c8706 Add flop counter utility (#95751)
Overall, here is an example usage. Note that this *also* captures backward FLOPs.
```
import torchvision.models as models
import torch
from torch.utils.flop_counter import FlopCounterMode

inp = torch.randn(1, 3, 224, 224, device='cpu')
mod = models.resnet18()

flop_counter = FlopCounterMode(mod, depth=1)
with flop_counter:
    mod(inp).sum().backward()
```
<img width="326" alt="image" src="https://user-images.githubusercontent.com/6355099/222023068-3491e405-f195-4e11-b679-36b19a1380c7.png">

You can control the depth of the module hierarchy with the `depth` attribute (which defaults to 2). For example, if I don't limit it, this is what it outputs.

<img width="366" alt="image" src="https://user-images.githubusercontent.com/6355099/222023306-3d880bb6-f534-4f98-bf10-83c4353acefc.png">

## Other APIs

- `FlopCounterMode(custom_mapping=...)`: allows for custom flop counting functions
- `FlopCounterMode.get_table(depth=...)`: explicitly get the table as a string
- `FlopCounterMode.flop_counts`: contains the flop information as a `Dict[hierarchy: str, Dict[Op, int]]`
- `FlopCounterMode.register_hierarchy(f, name)`: allows you to register additional "hierarchies" for a function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95751
Approved by: https://github.com/ngimel, https://github.com/albanD
2023-03-02 23:19:49 +00:00
3095c95828 Fixes for PyTorch/XLA functionalization integration (#94537)
Fixes for PyTorch/XLA functionalization integration

---
Some notable changes include:
- More asserts in `FunctionalTensorWrapper`, so bugs show up more cleanly in cases where we e.g. forget to wrap an output
- Make the *_scatter ops `CompositeExplicitAutogradNonFunctional`, so we get a better error message and XLA doesn't accidentally try to us them
- Fix LTC/XLA codegen in core to handle multi-tensor out= ops with no returns
- Better erroring: Allow XLA to use the CPU fallback from core in a way so that it always errors on view ops, which XLA should no longer see.
- Update MetaConverter to exclude XLA tensors in raising NotImplemented…
- Add `_propagate_xla_data` op
- Add meta tensor support for some ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94537
Approved by: https://github.com/bdhirsh
2023-03-02 23:02:34 +00:00
f397d1700f Inductor reduce_scatter_tensor (#95764)
This adds reduce_scatter to the functional collective and adds the
inductor lowering support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95764
Approved by: https://github.com/kumpera
2023-03-02 22:05:30 +00:00
3df1a9baca Upload external contribution data to s3 (#95747)
Context: We want to create a metric panel to track external contributions to the PyTorch repo

This PR creates a daily job to track how many external contributions occurred the day before and uploads it to a s3 collection which is accessible by rockset.

`upload_external_contrib_stats.py` is a Python script which grabs the necessary stats from GitHub and sticks them into an S3 bucket. It is used here to do daily uploads, but can generally be used for larger queries as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95747
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-03-02 21:57:28 +00:00
02fa2291f7 Add support for custom backend (#95072)
Fixes https://github.com/pytorch/pytorch/issues/92344

A custom backend can be specified by passing in a string with format `"<device_type1>:<backend_name>,<device_type2>:<backend_name>"`, e.g. `"cpu:gloo,cuda:custom_backend"`.
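
A hedged usage sketch (assumed, not from the PR; built-in backend names are used as stand-ins, and a third-party backend registered via `torch.distributed.Backend.register_backend` could replace "nccl"):

```python
import os

import torch
import torch.distributed as dist

# Single-process settings so the example can initialize locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Per-device-type backend string: "<device>:<backend>,<device>:<backend>".
backend = "cpu:gloo,cuda:nccl" if torch.cuda.is_available() else "cpu:gloo"

dist.init_process_group(backend=backend, rank=0, world_size=1)
print(dist.is_initialized())  # True
dist.destroy_process_group()
```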

Differential Revision: [D43630050](https://our.internmc.facebook.com/intern/diff/D43630050)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95072
Approved by: https://github.com/kwen2501
2023-03-02 21:41:49 +00:00
b2875268c9 [bazel] use GPU machine and run GPU tests (#95721)
Fixes #79354

Run bazel build and tests on GPU machine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95721
Approved by: https://github.com/huydhn
2023-03-02 21:09:24 +00:00
61fa43a1f2 [GHF] Add submodule updates check (#95885)
Originally I planned to integrate it somehow into `lintrunner`, but this poses too many challenges, one of which is that it deliberately ignores submodule updates.

On the other hand, almost all the information, other than the list of submodules, is already present in the GitHubPR info.

Incorporate small BE change into `test_trymerge.py`, that moves `@mock.patch` from individual test to the class definition.

Fixes https://github.com/pytorch/pytorch/issues/74326 and https://github.com/pytorch/test-infra/issues/1521
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95885
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn
2023-03-02 18:05:26 +00:00
7ebd816aab Switch DTensor to use funcol::all_reduce. (#95804)
This is relanding the troubling part of #95009 that caused a regression.

BC: This changes the signature and semantics of DeviceMesh::all_reduce.

DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.

all_reduce now is async only and uses AsyncCollectiveTensor to ensure proper stream synchronization.

Signature changed: removed async_op param and changes return type from Optional[Work] to torch.Tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95804
Approved by: https://github.com/fegin
2023-03-02 17:55:01 +00:00
00ebbba623 Remove torch._inductor.config.triton.convolution (#95842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95842
Approved by: https://github.com/ngimel
2023-03-02 17:44:41 +00:00
b033594943 COO intersection primitives: performance improvement (#92976)
This PR improves COO intersection primitives by:
* making it sync-less for dims <= 8 (the limit can be changed to any value that fits on the stack).
* improving performance with far fewer kernel calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92976
Approved by: https://github.com/cpuhrsch, https://github.com/pearu
2023-03-02 17:42:39 +00:00
06562529d2 Revert "Upload external contribution data to s3 (#95747)"
This reverts commit f418e1f8b63c0c15f52b373a57bfd9d65d02b172.

Reverted https://github.com/pytorch/pytorch/pull/95747 on behalf of https://github.com/clee2000 due to broke lint on master, merge base is too old, https://github.com/pytorch/pytorch/actions/runs/4315881630/jobs/7531170401 f418e1f8b6 (11721314649)
2023-03-02 17:34:14 +00:00
1712a18170 Fix typos under torch/_C directory (#95710)
This PR fixes typos in files under `torch/_C` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95710
Approved by: https://github.com/H-Huang
2023-03-02 17:29:38 +00:00
e83d0a1893 Improve unittest class printing for generated classes (#95806)
Previously they printed like `torch._dynamo.testing.make_test_cls_with_patches.<locals>.DummyTestClass`; now they print as `torch._dynamo.testing.StaticDefaultDynamicShapesUnspecTests`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95806
Approved by: https://github.com/dagitses
2023-03-02 17:03:41 +00:00
f418e1f8b6 Upload external contribution data to s3 (#95747)
Context: We want to create a metric panel to track external contributions to the PyTorch repo

This PR creates a daily job to track how many external contributions occurred the day before and uploads the results to an s3 collection which is accessible by Rockset.

`upload_external_contrib_stats.py` is a Python script which grabs the necessary stats from GitHub and sticks them into an s3 bucket. It is used here to do daily uploads, but can generally be used for larger queries as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95747
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-03-02 16:03:32 +00:00
5309c44210 [inductor] enable test_alexnet_prefix_dynamic_shapes on CUDA (#95766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95766
Approved by: https://github.com/ezyang
2023-03-02 14:25:52 +00:00
db8e91ef73 [CUDA] Split out compute capability 8.7 and 7.2 from others (#95803)
Follow up of #95008 to avoid building Jetson compute capabilities unnecessarily, also adds missing 7.2.

CC @ptrblck @malfet
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95803
Approved by: https://github.com/ezyang
2023-03-02 14:13:15 +00:00
d0dd898943 [MPS] Remove remaining casts from 13.3 (#95870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95870
Approved by: https://github.com/kulinseth
2023-03-02 12:44:59 +00:00
3a7fd20108 fix nll loss decomposition to properly ignore ignore_index (#95833)
Fixes #95794
This is a hotfix for the decomposition only (which is currently used by inductor); the reference implementation still accesses invalid indices. Perhaps `_nll_loss_nd` and this decomp should be unified. cc @soumith @voznesenskym @yanboliang @penguinwu @anijain2305 @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @desertfire @lezcano
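
A small repro-style sketch of the behavior the fix is about (shapes and values here are illustrative, not taken from the original issue): rows whose target equals `ignore_index` must contribute nothing to the loss and must never be used to index the input.

```python
import torch
import torch.nn.functional as F

log_probs = torch.randn(4, 5).log_softmax(dim=-1)
target = torch.tensor([1, 3, -100, 0])  # the -100 row must be skipped entirely

# ignore_index entries are out of range for the class dimension, so a
# decomposition that gathers with them would read invalid indices.
loss = F.nll_loss(log_probs, target, ignore_index=-100)
print(loss)
```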

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95833
Approved by: https://github.com/lezcano, https://github.com/Chillee
2023-03-02 08:37:56 +00:00
c86d23a1ef Allow point-ranges on floating point inf (#95799)
Fixes https://github.com/pytorch/pytorch/issues/95797

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95799
Approved by: https://github.com/eellison
2023-03-02 08:14:11 +00:00
7bdfdbbd5f [MPS] Add macOS 13.3 selector check (#95866)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95866
Approved by: https://github.com/DenisVieriu97
2023-03-02 07:11:48 +00:00
d9f822b566 Add dimension check in tensordot (#94497)
This PR adds a dimension check in tensordot. The requested number of contracted dimensions should not exceed `dim_a` or `dim_b`.
Fix #91589
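
A minimal sketch of the kind of input the new check is meant to reject (assuming the check surfaces as a RuntimeError rather than a later failure):

```python
import torch

a = torch.randn(3, 4)
b = torch.randn(4, 5)

print(torch.tensordot(a, b, dims=1).shape)  # ok: dims fits both operands

try:
    torch.tensordot(a, b, dims=3)  # more contracted dims than either tensor has
except RuntimeError as e:
    print("rejected:", e)
```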

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94497
Approved by: https://github.com/jgong5, https://github.com/albanD
2023-03-02 05:45:11 +00:00
75cb99e549 [optim] Widen the cases for defaulting to foreach (#95820)
Big OOP correction continued. Also added a test this time to verify the defaulting was as expected.

The key here is realizing that the grouping for foreach already assumes that the non-param tensorlists follow suit in dtype and device, so it is too narrow to check that _all_ tensors were on CUDA. The main leeway this allowed was state_steps, which are sometimes cpu tensors. Since foreach _can_ handle cpu tensors, this should not introduce breakage.
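
For context, a minimal sketch of the defaulting being widened (assuming a CUDA device is available; `foreach=None` means "decide automatically"):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()

# With CUDA parameters, the foreach implementation is picked by default even
# though some internal state (e.g. state_steps) may live on CPU.
opt = torch.optim.Adam(model.parameters(), lr=1e-3, foreach=None)
```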

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95820
Approved by: https://github.com/albanD
2023-03-02 04:15:33 +00:00
2bcf863fad [optim] include nn.Parameter as foreach supported (#95811)
This PR is a result of a realization that models are NOT subscribed to the foreach defaulting as has been claimed in our documentation for months now. BIG OOPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95811
Approved by: https://github.com/albanD
2023-03-02 04:15:33 +00:00
45fd1b390e [vision hash update] update the pinned vision hash (#95843)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95843
Approved by: https://github.com/pytorchbot
2023-03-02 03:51:11 +00:00
4973ca5e3e [sdpa] Add broadcasting for batch and num_heads dimensions to fused kernel nested preproc (#95657)
Adds a path with the strategy mentioned [here](https://github.com/pytorch/pytorch/pull/95346#issuecomment-1441283506)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95657
Approved by: https://github.com/drisspg
2023-03-02 03:44:55 +00:00
62b775583f [inductor] Improve error messages (#95567)
Example error message before/after (710 to 131 lines):
https://gist.github.com/jansel/6fecad057738089fa95bf08c3de9fc8a

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95567
Approved by: https://github.com/mlazos
2023-03-02 02:20:55 +00:00
d1ec9a51e9 Bump version 2.0.0 -> 2.1.0 (#95790)
Same as: https://github.com/pytorch/pytorch/pull/90491
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95790
Approved by: https://github.com/albanD, https://github.com/malfet
2023-03-02 00:38:46 +00:00
4d3352ed90 [MPS] Remove casts from reduction/cumsum/sort ops starting with macOS 13.3 (#95817)
MPS in macOS 13.3 has added support for int64 in reduction ops / cumsum / sort / argsort. This change removes the hard-coded casts and error messages required prior to macOS 13.3, allowing these ops to run natively with int64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95817
Approved by: https://github.com/kulinseth
2023-03-02 00:26:24 +00:00
184fb9f11d Small doc update for torch_compile_debug (#95809)
Updates the troubleshooting documentation with the folder structure of the debug directory
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95809
Approved by: https://github.com/msaroufim
2023-03-02 00:25:28 +00:00
1fd119948e [3/3] Update .pyi Python stub files and enable 'UFMT' linter (#95268)
Changes:

- #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- #95267

3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `Namedtuple`: Python 3.6+

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- => this PR: #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95268
Approved by: https://github.com/huydhn
2023-03-01 23:50:56 +00:00
b3d8fae042 Fix typos in documents under torch directory (#95709)
This PR fixes typo in `.md` files under `torch` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95709
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-03-01 23:43:35 +00:00
b35e67142c [JIT] Improve source attribution for NamedTuple type inference (#95761)
Most errors thrown during torchscript scripting or execution have a SourceRange attached that can be used to identify where the error is coming from. NamedTuple type inference previously didn't have SourceRanges attached; this PR adds them.

Differential Revision: [D43685662](https://our.internmc.facebook.com/intern/diff/D43685662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95761
Approved by: https://github.com/eellison
2023-03-01 23:40:13 +00:00
053205aab5 [dynamo] Fix OrderedDict reconstruction bytecode (#95800)
Fixes OrderedDict reconstruction issue found in https://github.com/pytorch/pytorch/pull/95250 with an attempt to fix it here https://github.com/pytorch/pytorch/pull/95725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95800
Approved by: https://github.com/yanboliang, https://github.com/clee2000
2023-03-01 23:39:09 +00:00
cyy
6786a24fd2 fix some tiny code issues (#95757)
This PR tries to fix:
1. a misspelled NDEBUG preprocessing condition.
2. get rid of all writable-strings warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95757
Approved by: https://github.com/soulitzer
2023-03-01 23:27:32 +00:00
f7b26bdd22 Remove mention of dynamo.optimize() in docs (#95802)
This should be self-contained enough to merge, but other stuff that's been bugging me:
* Instructions on debugging IMA issues
* Dynamic shape instructions
* Explaining config options better

Will look at adding a config options doc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95802
Approved by: https://github.com/svekars
2023-03-01 23:24:09 +00:00
deaf077de8 Don't use guardless contiguity/stride-like implementations (#95733)
These prevent us from simplifying tests involving unbacked SymInts,
and then you end up with unbacked SymInt in guards, which is bad.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95733
Approved by: https://github.com/tugsbayasgalan
2023-03-01 23:14:58 +00:00
a9a3a1bd14 Apply peephole for eval mode when constant folding is enabled only (#95801)
Fixes https://github.com/microsoft/onnx-converters-private/issues/150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95801
Approved by: https://github.com/BowenBao
2023-03-01 23:07:38 +00:00
8093abce3e Always get attr static out (#95771)
Discussion here https://github.com/pytorch/pytorch/issues/95630#issuecomment-1449596766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95771
Approved by: https://github.com/jansel
2023-03-01 23:05:44 +00:00
34a7c79eac Rename func (#95639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95639
Approved by: https://github.com/ezyang
2023-03-01 23:03:09 +00:00
de86538f55 [ROCM] Restrict pytorch rocm to only use triton 2.0.x (#95793)
To align with upstream, we are requiring the triton dependency to be between 2.0.0 and 2.1. This will allow PyTorch 2.0 on ROCm to stay flexible enough to pick up any performance/stability improvements from Triton, without needing to cut a separate PyTorch version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95793
Approved by: https://github.com/huydhn
2023-03-01 22:50:44 +00:00
2936c8b9ce Revert "Enable thp(transparent huge pages) for buffer sizes >=2MB (#93888)"
This reverts commit 2cc845eb1a45c7ea494c33262a97f9a348818261.

Reverted https://github.com/pytorch/pytorch/pull/93888 on behalf of https://github.com/seemethere due to Reverting due to internal build issues, Meta employees see: https://fburl.com/sandcastle/1p4zvldk
2023-03-01 22:33:04 +00:00
13340638f4 Update inductor-perf-test-nightly.yml (#95807)
Try to cancel workflow runs for previous commits to avoid wasting CI on older commits. Not sure if a different user's push would cancel an ongoing job.

Currently multiple commits from the same open PR would be running, even though most likely the latest commit's status is of interest.

This tries to see if old workflows could get cancelled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95807
Approved by: https://github.com/huydhn
2023-03-01 22:15:37 +00:00
63796d35ef [sdpa] move seq_len_1 check and replace with seq_len_0 check in sdp_utils (#95486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95486
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2023-03-01 21:46:45 +00:00
72b9d45e76 Clean up install_triton and install_filelock in CI (#95754)
After Dockerize triton in https://github.com/pytorch/pytorch/pull/95233, we can now clean up `install_triton` and `install_filelock` in the CI to improve its reliability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95754
Approved by: https://github.com/weiwangmeta
2023-03-01 21:41:58 +00:00
dd88954511 Preserve specialize_int_float during export (#95741)
In the next PR, I will error out when dynamo tries to add an "implicit" input so that it doesn't fail during the sanity check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95741
Approved by: https://github.com/yanboliang
2023-03-01 21:26:16 +00:00
5d9d8c6154 [MPS] Add fixes for div with floor and raise error for div_trunc (#95769)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95769
Approved by: https://github.com/DenisVieriu97
2023-03-01 20:52:28 +00:00
5ba4dafccd Retry Merge: extract utils from check labels ptr (#94899)
Fixes #88098

This is the rebased and retry merging branch of the reverted PR: https://github.com/pytorch/pytorch/pull/94597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94899
Approved by: https://github.com/kit1980
2023-03-01 20:40:30 +00:00
975333d80c Logaddexp for complex in CPU (#95717)
Continuation of PR #93153, where I implemented logaddexp for complex but didn't expose it through `torch.logaddexp`. This PR exposes the complex logaddexp through `torch.logaddexp`.
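
A minimal sketch of the newly exposed path (the values are arbitrary; the complex support described here applies to CPU tensors):

```python
import torch

a = torch.tensor([1.0 + 2.0j, -0.5 + 0.3j])
b = torch.tensor([0.2 - 1.0j, 0.7 + 0.0j])

# log(exp(a) + exp(b)), now usable with complex inputs via torch.logaddexp
print(torch.logaddexp(a, b))
```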

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95717
Approved by: https://github.com/lezcano
2023-03-01 20:37:46 +00:00
97fbceead4 [EASY] Make has_hint work on more things than just SymInt. (#95792)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95792
Approved by: https://github.com/Skylion007
2023-03-01 20:30:23 +00:00
879f0c3fee [CI] Increase the timeout limit for benchmark test (#95787)
Summary: xcit_large_24_p8_224 occasionally hits TIMEOUT on CI. Bump up
the limit to reduce flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95787
Approved by: https://github.com/ezyang, https://github.com/ZainRizvi
2023-03-01 19:54:25 +00:00
ef731cdaf0 [2/3] Update .pyi Python stub files: Prettify rnn.py by using type annotated NamedTuple (#95267)
Changes:

- #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- => this PR: #95267

3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `Namedtuple`: Python 3.6+

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95267
Approved by: https://github.com/janeyx99
2023-03-01 19:37:23 +00:00
a46e550d06 [1/3] Recognize .py.in and .pyi.in files as Python in VS Code (#95200)
Changes:

- => this PR: #95200

1. Recognize `.py.in` and `.pyi.in` files as Python in VS Code for a better development experience.
2. Fix deep setting merge in `tools/vscode_settings.py`.

- #95267

3. Use `Namedtuple` rather than `namedtuple + __annotations__` for `torch.nn.utils.rnn.PackedSequence_`:

    `namedtuple + __annotations__`:

    ```python
    PackedSequence_ = namedtuple('PackedSequence_',
                                 ['data', 'batch_sizes', 'sorted_indices', 'unsorted_indices'])

    # type annotation for PackedSequence_ to make it compatible with TorchScript
    PackedSequence_.__annotations__ = {'data': torch.Tensor, 'batch_sizes': torch.Tensor,
                                       'sorted_indices': Optional[torch.Tensor],
                                       'unsorted_indices': Optional[torch.Tensor]}
    ```

    `Namedtuple`: Python 3.6+

    ```python
    class PackedSequence_(NamedTuple):
        data: torch.Tensor
        batch_sizes: torch.Tensor
        sorted_indices: Optional[torch.Tensor]
        unsorted_indices: Optional[torch.Tensor]
    ```

- #95268

4. Sort import statements and remove unnecessary imports in `.pyi`, `.pyi.in` files.
5. Format `.pyi`, `.pyi.in` files and remove unnecessary ellipsis `...` in type stubs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95200
Approved by: https://github.com/janeyx99
2023-03-01 19:16:56 +00:00
e096bca5f9 adding symbolic link to get CI to run tests where cmake is not run on CI node (#95402)
Fixes #95155, which breaks CI so that no nvfuser Python tests are run on CI nodes.

Thanks to @davidberard98 for noticing this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95402
Approved by: https://github.com/davidberard98
2023-03-01 19:01:18 +00:00
5d29b68bbc [inductor] generate triton kernel benchmark (#95506)
A PR to generate benchmark code for individual triton kernels. We can explore improving autotuning with the saved compiled kernel directly. This potentially can speedup our iteration and separate the concern with the upstream components that generate the compiled module.

Since I'm still ramping up on inductor, I'll reflect what I learned here so people can correct me if I'm wrong. In inductor, the WrapperCodeGen class is used to generate the compiled module for CUDA (or triton). Here is an example compiled module for a toy model like `def f(x): return sin(x) + cos(x)`: https://gist.github.com/shunting314/c6ed9f571919e3b414166f1696dcc61b . A compiled module contains the following parts:
- various triton kernels
- a wrapper (a method named `call`; the name is hardcoded) that calls the triton kernels, and potentially ATen kernels, to efficiently do the same work as the original FX graph being compiled by inductor
- some utility code that generates random inputs and runs the wrapper

The triton kernels in the compiled module are annotated with decorators like `pointwise`, which are used for autotuning.

This PR adds a config; enabling it triggers printing the path of the compiled module. It can also be controlled via an environment variable.

The path to each compiled triton kernel is added as a comment in the compiled module, e.g.
```
# kernel path: /tmp/torchinductor_shunting/gn/cgn6x3mqoltu7q77gjnu2elwfupinsvcovqwibc6fhsoiy34tvga.py
triton__0 = async_compile.triton('''
import triton
import triton.language as tl
...
""")
````

Example command:
```
TORCHINDUCTOR_OUTPUT_COMPILED_MODULE_PATH=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --training --dashboard --only AlbertForMaskedLM --disable-cudagraphs
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95506
Approved by: https://github.com/Chillee
2023-03-01 18:29:07 +00:00
e9c70b0b20 Fix typo and grammatical errors in community docs and dynamo docs (#95692)
Fixes typo and grammatical errors in community docs and dynamo docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95692
Approved by: https://github.com/H-Huang
2023-03-01 18:10:46 +00:00
3e8eedd78e Round of fixes for functional collectives (#95714)
Move collective registration to torch.__init__ to handle multipy warmup.
Fix all_reduce with non-contiguous tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95714
Approved by: https://github.com/wconstab
2023-03-01 17:52:14 +00:00
46f092dc66 Add jinja2 as mandatory dependency (#95691)
Should fix #95671, the nightly wheels issue. The v2.0.0 RC does not need this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95691
Approved by: https://github.com/malfet
2023-03-01 17:28:55 +00:00
2bcc0e9e18 Expand sparse.softmax zero nnz tests to cover cases of previously reported FPE. (#95646)
- Test cases with zero `nnz` added for `sparse.log_softmax`.
- Test cases with zero `nnz` for both `sparse.log_softmax` and
`torch.sparse_softmax` expanded to cover the backward pass.

These test additions prove resolution to #95371 and #82107.
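
A minimal sketch of the kind of zero-`nnz` input these tests exercise (constructed ad hoc here; the actual test cases live in the sparse test suite):

```python
import torch

# 2x3 sparse COO tensor with zero specified elements (nnz == 0)
indices = torch.empty((2, 0), dtype=torch.int64)
values = torch.empty(0)
s = torch.sparse_coo_tensor(indices, values, size=(2, 3))

print(torch.sparse.softmax(s, dim=1))      # previously could hit an FPE
print(torch.sparse.log_softmax(s, dim=1))
```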

Fixes #82107 #95371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95646
Approved by: https://github.com/cpuhrsch, https://github.com/pearu, https://github.com/nikitaved
2023-03-01 17:26:51 +00:00
c5f6092591 Use FindCUDAToolkit to find cuda dependencies (#82695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695
Approved by: https://github.com/malfet
2023-03-01 17:26:36 +00:00
7901f2d156 sparse compressed tensor validation without syncs for low-(batch)dim tensors. (#94048)
As per title. Sync is still unavoidable for super high-dim tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94048
Approved by: https://github.com/alexsamardzic, https://github.com/cpuhrsch
2023-03-01 17:25:11 +00:00
e5a959a2d4 [MPS] Fix views with 3 or more sliced dimensions (#95762)
Fixes https://github.com/pytorch/pytorch/issues/95482
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95762
Approved by: https://github.com/razarmehr
2023-03-01 16:16:49 +00:00
7d097e3695 [CI] Reduce the frequency of running inductor-perf-test-nightly (#95778)
Summary: This is to prepare for extending inductor-perf-test-nightly to
collect dashboard numbers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95778
Approved by: https://github.com/ezyang
2023-03-01 14:34:04 +00:00
9835c93aba [CI] Change the way tests are triggered with dynamo and inductor (#94539)
Summary: Currently running PyTorch tests with dynamo and inductor is
controlled by environment variables, and CI sets them based on test
config name matching. Change them to use options of run_test.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94539
Approved by: https://github.com/huydhn
2023-03-01 13:06:23 +00:00
e3892fd16b [inductor] correctly infer dtype of full (#95593)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95593
Approved by: https://github.com/ezyang, https://github.com/ngimel
2023-03-01 10:13:21 +00:00
9da903f180 [Inductor] Fix the logical_and/logical_or vectorization issue (#95609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95609
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-03-01 08:21:57 +00:00
c1f5e50fd1 [Inductor] Vectorize channels-last adaptive_avg_pool2d (#95608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95608
Approved by: https://github.com/jansel
2023-03-01 08:21:57 +00:00
074ae720f4 [Inductor] Fix the issue that at::vec does not support indexing (#95459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95459
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/jansel
2023-03-01 08:21:57 +00:00
7a772bfff9 [dtensor] add submesh example to checkpoint_example (#95655)
This PR adds a submesh example for checkpointing purposes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95655
Approved by: https://github.com/XilunWu
2023-03-01 08:19:27 +00:00
3fa939625b Rearrange some transformer tests (#95745)
This changes the test placement to be more in line with the class hierarchy in test_transformers.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95745
Approved by: https://github.com/cpuhrsch
2023-03-01 07:18:49 +00:00
1e2e149570 Dynamic dim guards (#95584)
Guards for dynamic dims, essentially authored/co-authored by @ezyang by triple checking my (originally faulty) logic. Comments in code explain the guard decision tree.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95584
Approved by: https://github.com/ezyang
2023-03-01 06:17:41 +00:00
e628a3e724 Don't generate guards that refer to unbacked SymInts (#95732)
This regresses unbacked batch resnet, but I have a plan to recover that
too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95732
Approved by: https://github.com/tugsbayasgalan
2023-03-01 06:14:27 +00:00
9b86b53285 allow privateuse1 key to be used with legacy constructor (#95748)
fixes https://github.com/pytorch/pytorch/issues/95734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95748
Approved by: https://github.com/ezyang
2023-03-01 06:11:00 +00:00
93f1aa5511 raw_values is dead (#95703)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95703
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-03-01 05:38:43 +00:00
9227fd741c Avoid recursion in graph traverse (#95723)
It's easy to hit Python's recursion limit when calling `dfs_find_cycle` on big graphs (e.g., searching for attention heads in GPT-2 via SubgraphMatcher). Let's switch to queue-based graph traversal, as sketched below.
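
A generic sketch of what recursion-free cycle detection looks like with an explicit stack (illustrative only, not the actual fx implementation; `graph` maps a node to its successors):

```python
def has_cycle(graph, start):
    """Iterative DFS with an explicit stack, avoiding Python's recursion limit."""
    stack = [(start, iter(graph.get(start, ())))]
    on_path = {start}   # nodes on the current DFS path
    visited = {start}
    while stack:
        node, children = stack[-1]
        for child in children:
            if child in on_path:
                return True            # back edge -> cycle
            if child not in visited:
                visited.add(child)
                on_path.add(child)
                stack.append((child, iter(graph.get(child, ()))))
                break                  # descend into the child first
        else:
            on_path.discard(node)      # exhausted this node's children
            stack.pop()
    return False

print(has_cycle({"a": ["b"], "b": ["a"]}, "a"))  # True
print(has_cycle({"a": ["b"], "b": []}, "a"))     # False
```
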
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95723
Approved by: https://github.com/SherlockNoMad, https://github.com/Skylion007
2023-03-01 04:35:22 +00:00
e970dd9dcf [CI] Compile on M1 natively (#95719)
We have plenty of runners now, let's use them for compilation as well.
To achieve that, remove the `xcode-version: "13.3.1"` property and tweak the Metal framework detection logic to work with the command line tools (installed in `/Library/Developer/CommandLineTools`, with the SDK in `/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk`) rather than a full Xcode installation.

TODO: Fix/enable OpenMP accelerated native builds (which are currently broken with `OMP: Error #15: Initializing libomp.dylib, but found libomp.dylib already initialized.`), but this matches existing behavior as cross-builds are compiled  with OpenMP disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95719
Approved by: https://github.com/huydhn
2023-03-01 04:20:42 +00:00
e79b2b7792 [CI] Force clear triton cache between running each test (#95729)
Summary: The idea is to see if this reduces some of the flakiness
we have seen on CI. If it does help, then we have a problem in our
caching implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95729
Approved by: https://github.com/ngimel
2023-03-01 04:10:03 +00:00
d3d75a5cd8 [vision hash update] update the pinned vision hash (#95665)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95665
Approved by: https://github.com/pytorchbot
2023-03-01 04:07:29 +00:00
21b1134be6 [inductor] fix type promotion for comparison operations (#95736)
Fixes #95695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95736
Approved by: https://github.com/Skylion007, https://github.com/desertfire, https://github.com/jansel
2023-03-01 03:29:55 +00:00
6930f30ccd Small bugfix in nested matmul bmm path head_dim acquisition (#95744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95744
Approved by: https://github.com/drisspg
2023-03-01 03:27:08 +00:00
e50ff3fcdb Fix kernel name bug (#95739)
[T146374491](https://www.internalfb.com/intern/tasks/?t=146374491): [Inductor] Descriptive kernel names not displaying in trace

Use the descriptive kernel name for the triton function name if indicated in the config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95739
Approved by: https://github.com/ngimel
2023-03-01 03:02:47 +00:00
65f49ab663 [Inductor Perf Test Workflow] Remove pull request trigger and rely on ciflow/ label only (#95755)
Mitigates A100 queue issue.
Workflow seems to run twice upon pull request changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95755
Approved by: https://github.com/seemethere
2023-03-01 02:47:49 +00:00
1c526664d5 feat(dockerfile): shrink layers & build cleaner (#95375)
This change will reduce the layer size, as it will not save intermediate layers; it will also build more cleanly on other machines, since it won't ask for user interaction when running the build.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95375
Approved by: https://github.com/ezyang
2023-03-01 02:39:56 +00:00
60a1d29585 Correct OneCycleLR doc example code to explicitly call optimizer.step() (#95730)
Fixes #89358 as suggested in the issue comment

A screenshot of the example code in the built docs:
<img width="1168" alt="Screenshot 2023-02-28 at 4 46 45 PM" src="https://user-images.githubusercontent.com/31816267/221999156-02b28f2a-85b3-4aa8-841d-e4c66a39a33f.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95730
Approved by: https://github.com/janeyx99
2023-03-01 02:15:50 +00:00
ed1957dc19 [MPS] Add support for masked_scatter (#95743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95743
Approved by: https://github.com/kulinseth
2023-03-01 01:36:36 +00:00
d9cd9a13bc [BE][DDPOptimizer] De-dup p and param (#95654)
The `param` from `param = target.get_parameter(name)` should be the same as `p` from `target.named_parameters()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95654
Approved by: https://github.com/wconstab
2023-03-01 01:17:09 +00:00
94bec94f5a Initial minifier smoke test + runbook (#95670)
Summary:
Adds a manual smoke test for the minifier in fb code to use as an example for the runbook. (We already have automatic tests which should be running)

See draft runbook:
https://docs.google.com/document/d/18I0KYhWiYo4taC4foR2UcijJXYyEcZV4McBJQIUSSJw/edit#

Test Plan:
buck2 run mode/dev-nosan //caffe2/test/inductor:minifier_smoke

Run displayed minifier launcher script, and it should reduce the graph from 5 to 3 nodes

Differential Revision: D43415890

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95670
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2023-03-01 01:02:23 +00:00
7ea3aab45d Remove dead ZeroGuard (#95701)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95701
Approved by: https://github.com/Skylion007
2023-03-01 01:02:04 +00:00
cf3638a9cc [dynamo] Clear cache on dynamo dashboard accuracy tests (#95726)
Might fix some flaky accuracy tests?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95726
Approved by: https://github.com/ngimel, https://github.com/anijain2305, https://github.com/desertfire
2023-03-01 00:50:19 +00:00
40d54cf8bf Apply filter logic to disabled jobs dynamically (#95442)
Apply filter logic to disabled jobs dynamically.  The list of disabled jobs is published at https://ossci-metrics.s3.amazonaws.com/disabled-jobs.json.  When the workflow (i.e. `pull`) and the platform (i.e. `linux-bionic-py3.8-clang9`) names match, the job will be disabled (skipped) if it is in the list.

Note that getting the current job name within the GitHub action is fairly hacky.  This is a TODO item.

### Testing

* Unit testing
* This PR. https://github.com/pytorch/pytorch/issues/94861 disables `pull / linux-bionic-py3.8-clang9 / test (dynamo)` in the CI.  We have:
   * No dynamo tests running in `pull / linux-bionic-py3.8-clang9` https://github.com/pytorch/pytorch/actions/runs/4272505289/jobs/7437706181
   * Other dynamo tests, i.e. `pull / linux-bionic-py3.11-clang9`, are run normally https://github.com/pytorch/pytorch/actions/runs/4272505289/jobs/7437706054
 * This PR. https://github.com/pytorch/pytorch/issues/95642 disables `pull / linux-bionic-cuda11.7-py3.10-gcc7-sm86 / test`.  All test jobs for `pull / linux-bionic-cuda11.7-py3.10-gcc7-sm86` are skipped https://github.com/pytorch/pytorch/actions/runs/4287330986/jobs/7468179694
 * This PR. https://github.com/pytorch/pytorch/issues/95656 disables `pull / linux-bionic-py3_8-clang8-xla / build`.  All build and test jobs for `pull / linux-bionic-py3_8-clang8-xla` are skipped https://github.com/pytorch/pytorch/actions/runs/4287330986/jobs/7470478905
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95442
Approved by: https://github.com/clee2000
2023-03-01 00:10:35 +00:00
2fbbc3362b [ONNX] Support 'dtype' argument for 'aten::norm' (#95637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95637
Approved by: https://github.com/titaiwangms
2023-03-01 00:07:34 +00:00
88a31f4be6 hoist precomputed exprs from indices (#95690)
This generates compilable code for maskrcnn graph 13, with ceilings hoisted to be computed on the host. But it now fails with
```
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/symbolic_convert.py", line 379, in wrapper
    self.output.compile_subgraph(self, reason=reason)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/output_graph.py", line 562, in compile_subgraph
    pass1.foreach(stack_values)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 166, in foreach
    self(i)
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/codegen.py", line 148, in __call__
    output.extend(value.reconstruct(self))
  File "/scratch/ngimel/work/pytorch/torch/_dynamo/variables/dicts.py", line 40, in reconstruct
    codegen.create_load_python_module(collections),
TypeError: create_load_python_module() missing 1 required positional argument: 'push_null'

from user code:
   File "/scratch/ngimel/work/env/lib/python3.9/site-packages/torchvision-0.15.0a0+928b05c-py3.9-linux-x86_64.egg/torchvision/models/detection/backbone_utils.py", line 58, in forward
    x = self.fpn(x)
```
looks like we never execute this `create_load_python_module()` path for other subgraphs.
Any advice on how to fix this @voznesenskym @jansel ?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95690
Approved by: https://github.com/jansel
2023-02-28 23:32:36 +00:00
dc10ab15b7 Warn on modification of OptimizedModule.forward (#95673)
Partially addresses #95641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95673
Approved by: https://github.com/ezyang
2023-02-28 23:21:23 +00:00
6bdef7a5ff Warn on dynamo OptimizedModule.forward() (#95672)
Partially addresses #95641

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95672
Approved by: https://github.com/ezyang
2023-02-28 23:21:03 +00:00
20dfce591c Add support for Inductor + symbolic shapes + training (#93059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93059
Approved by: https://github.com/ezyang
2023-02-28 22:44:31 +00:00
70029214f3 [jit] Add c++ stacktraces for jit::ErrorReport (#94842)
**Summary**: This PR adds C++ stacktraces to jit::ErrorReports. After this PR, if you run with `TORCH_SHOW_CPP_STACKTRACES=1` environment variable and a jit::ErrorReport is thrown, then the C++ stacktrace should be displayed.

**More background**: This behavior already occurs for c10::Error; but not for jit::ErrorReport. jit::ErrorReport _does_ usually have a python stacktrace for the python source, but it is sometimes still helpful to know where in the C++ codebase the error came from.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94842
Approved by: https://github.com/qihqi
2023-02-28 22:37:51 +00:00
e3c5c369ba Run tests in USE_PYTEST_LIST through run_tests (#95659)
Part of my effort to move everything to pytest and decrease the number of testrunner frameworks in ci

Gives XML reports, but they might look a bit weird because of module-level tests vs. tests in classes.

Doesn't give skip/disable test infra because those are tied to classes. (for future ref, could either put tests in classes or move the check_if_enable stuff into a pytest hook)

Tested in CI and checked that the same number of tests are run

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95659
Approved by: https://github.com/huydhn
2023-02-28 22:09:01 +00:00
e5b9d98752 Rephrase zero_grad docs (#95643)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95643
Approved by: https://github.com/albanD
2023-02-28 22:04:23 +00:00
ba43d908f9 Build Triton in Docker image (#95233)
Saw a bunch of timeout errors when trying to clone and build Triton today (c6d8d10b3e), so let's build Triton as part of the Docker image.

* The pinned commit file is moved to the Docker context at `.ci/docker/ci_commit_pins/triton.txt`, and `.github/ci_commit_pins/triton.txt` is now a soft link pointing to it
* New Docker images are built whenever the pinned commit is updated
* The build logic is in `.ci/docker/common/install_triton.sh` which copies `install_triton` step in the CI.  The latter can be removed in a separate PR after this one

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95233
Approved by: https://github.com/weiwangmeta, https://github.com/malfet
2023-02-28 22:01:37 +00:00
b55d0d2aef Fix trymerge changed files count (#95720)
The value from the PR info counts only unique files, which is not the same as the number of files changed (both are technically correct, depending on how you view it).

I'm trying to merge this PR https://github.com/pytorch/pytorch/pull/95233 which makes `.github/ci_commit_pins/triton.txt` a softlink.  So the PR includes 2 changes to that file 1) to delete the file and 2) to add it as a symlink.

```
[
  ".ci/docker/build.sh",
  ".ci/docker/ci_commit_pins/triton.txt",
  ".ci/docker/common/common_utils.sh",
  ".ci/docker/common/install_triton.sh",
  ".ci/docker/requirements-ci.txt",
  ".ci/docker/ubuntu-cuda/Dockerfile",
  ".ci/docker/ubuntu/Dockerfile",
  ".github/ci_commit_pins/triton.txt", <--
  ".github/ci_commit_pins/triton.txt", <--
  ".github/workflows/build-triton-wheel.yml"
]
```

Trymerge doesn't like that and rejects the merge due to `Changed file count mismatch` https://github.com/pytorch/pytorch/actions/runs/4295438799/jobs/7485853815 . This is because the PRInfo GraphQL result from GitHub only counts 9 of them https://paste.sh/zVsOnWoT#p_3RKX_VMjj-e71vwsTeA01W (search for `changedFiles`). It means that the names are deduplicated, so only unique file names are counted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95720
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ZainRizvi
2023-02-28 21:55:21 +00:00
2cc845eb1a Enable thp(transparent huge pages) for buffer sizes >=2MB (#93888)
The 2MB thp pages provide better allocation latencies compared to the standard 4KB pages. This change has shown significant improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if `THP_MEM_ALLOC_ENABLE` environment variable is set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93888
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-28 21:12:46 +00:00
f1dbfe2f2a [ao][fx] Enable observed -> quantized float for static quantized MultiheadAttention (#95636)
Test Plan:
Sandcastle

cc andrewor14 any suggestions here?

Differential Revision: D43631794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95636
Approved by: https://github.com/andrewor14
2023-02-28 20:50:19 +00:00
fafb410985 Clean up unused fill_ sample inputs (#95117)
The OpInfo of it has been integrated into `UnaryUfuncInfo('fill',...)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95117
Approved by: https://github.com/ngimel
2023-02-28 20:27:13 +00:00
835122c89f Add missing f-string specifiers (#95707)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95707
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-02-28 20:20:05 +00:00
e13b804105 Add standalone torch._inductor.compile() API (#95594)
This fixes support for inductor compiling non-dynamo generated FX graphs.
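
A rough sketch of what this enables, compiling an FX graph that did not come from dynamo (the exact signature of `torch._inductor.compile` is inferred from the PR description and should be treated as an assumption):

```python
import torch
import torch._inductor

def f(x):
    return torch.sin(x) + torch.cos(x)

gm = torch.fx.symbolic_trace(f)     # an FX graph produced without dynamo
example_inputs = [torch.randn(8)]

# Returns a compiled callable for the traced graph; the calling convention of
# the returned callable (list vs. unpacked args) is not shown here.
compiled = torch._inductor.compile(gm, example_inputs)
```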

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95594
Approved by: https://github.com/bertmaher, https://github.com/desertfire
2023-02-28 20:05:03 +00:00
fc324d3485 [quant][pt2e] Add support for dynamic quantization with symmetric quant for input (#94854)
Summary:
Previously we assumed asymmetric quantization for dynamic quantization, this diff adds the support of symmetric quantization
for the input in dynamic quantization

Test Plan: buck run executorch/exir/tests:quant_lowering_custom_backend_pass -- "executorch.exir.tests.test_quant_lowering_custom_backend_pass.TestQuantLoweringCustomBackendPass.test_quantized_linear_dynamic"

Reviewed By: digantdesai

Differential Revision: D43134794

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94854
Approved by: https://github.com/digantdesai
2023-02-28 19:39:31 +00:00
cyy
f8ad64d5eb [dynamo] avoid truncation of python pointers (#95619)
This PR is separated from #94927. It aims to fix the MSVC warnings about passed Python pointers being truncated to a smaller integer type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95619
Approved by: https://github.com/Skylion007
2023-02-28 19:38:34 +00:00
1e15a272ff [dtensor][BE] remove redundant tests (#94838)
All test cases in test_tp_sharding_ops.py already been covered by
test_dtensor_ops.py, deleting it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94838
Approved by: https://github.com/XilunWu
2023-02-28 18:42:49 +00:00
2a1cb9640c [dtensor] support creating DTensor in submesh (#95458)
This PR supports creating DTensor in a submesh, if the rank is not
participating in the mesh, we assign the local tensor to be empty
tensor, and do nothing in the operator dispatch

Differential Revision: [D43643577](https://our.internmc.facebook.com/intern/diff/D43643577)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95458
Approved by: https://github.com/XilunWu
2023-02-28 17:54:26 +00:00
261eb46ddd [dtensor] refactor get_coordiniate (#95457)
This refactor get_coordinate to return a optional[list] instead of
directly the coordinate on dim, this is so that we can check if the
rank is inside the mesh easily

Differential Revision: [D43643579](https://our.internmc.facebook.com/intern/diff/D43643579)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95457
Approved by: https://github.com/XilunWu
2023-02-28 17:54:26 +00:00
bb9a05b116 [dtensor] use tracing for metadata prop (#95456)
This PR uses tracing for metadata prop, so that we can get correct
shape/stride metadata without manual calculation by ourselves.

The follow up PR on this would be adopt tracing for the sharding
prop itself

Differential Revision: [D43643578](https://our.internmc.facebook.com/intern/diff/D43643578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95456
Approved by: https://github.com/XilunWu
2023-02-28 17:54:22 +00:00
80614783e3 Enabling FlashAttention for SDPA when given NestedTensor (#95438)
# Summary
Previously, flash_attention was disabled for NestedTensor inputs due to an Illegal Memory Access error that occurred on the "cutlass" branch of flash-attention that had been incorporated into core. Since we switched to the main branch of flash_attention, the existing repro script no longer produces the same memory error. This PR re-enables the FlashAttention path for NTs. It also unifies the nested preprocessing between the two implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95438
Approved by: https://github.com/mikaylagawarecki
2023-02-28 17:49:38 +00:00
57f2c5888f Update skip message to reflect why test is being skipped (#95127)
Summary: Update skip message to reflect why test is being skipped

Test Plan: github

Differential Revision: D43423288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95127
Approved by: https://github.com/cpuhrsch
2023-02-28 17:37:13 +00:00
4fada6eb95 MHA torch.jit.script fix for in_proj_weight = None (#95653)
Summary: MHA fix to support in_proj_weight being None

Test Plan: sandcastle

Differential Revision: D43628206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95653
Approved by: https://github.com/davidberard98, https://github.com/cpuhrsch
2023-02-28 17:29:29 +00:00
1a72712645 Add dynamo graph break stats to CI (#95635)
Adds columns, including dynamo graph break stats, to the CSV produced by the accuracy job.

Example output from torchbench CI job:
<img width="771" alt="image" src="https://user-images.githubusercontent.com/4984825/221716236-9276684e-1be8-43e1-837e-f41671d4e0e3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95635
Approved by: https://github.com/ezyang
2023-02-28 16:17:46 +00:00
f33180fb7f [MPS] Add pow.Scalar (#95201)
1. Adds `pow.Scalar`.
2. Modifies testing `atol` and `rtol` to get pow output match tests pass.
3. Xfails numerically incorrect dtypes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95201
Approved by: https://github.com/kulinseth
2023-02-28 16:11:15 +00:00
71ad1005f6 Add prelu into Autocast CPU whitelist (#95366)
### Motivation
Add `prelu` to lower precision cast policy on AutocastCPU to fix https://github.com/pytorch/pytorch/issues/95365 :

Before: within the scope of `torch.cpu.amp.autocast(dtype=torch.bfloat16)`, `prelu` could not handle `input` and `weight` having different dtypes and raised a RuntimeError. This scenario is common in autocast, e.g., with `autocast` to `bf16`, if the op before `prelu` produces a `bf16` output (which becomes the input of `prelu`) while `prelu`'s weight is `fp32`, a RuntimeError is raised.

After: within the scope of `torch.cpu.amp.autocast(dtype=torch.bfloat16)`, `prelu` is forced to run with the `bf16` data type.

Before https://github.com/pytorch/pytorch/pull/91238, when the input was `bf16`, the weight was force-cast to `bf16`. After https://github.com/pytorch/pytorch/pull/91238, this kind of scenario raises a RuntimeError. There is no precision loss, since the previously working behavior also cast to `bf16`.

This also aligns with the Autocast CUDA whitelist; see the sketch below.
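
A minimal sketch of the scenario described above (module and shapes are arbitrary):

```python
import torch
import torch.nn as nn

m = nn.Sequential(nn.Linear(4, 4), nn.PReLU())  # PReLU keeps an fp32 weight
x = torch.randn(2, 4)

with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    # Linear emits bf16; with prelu on the lower-precision cast policy its
    # fp32 weight is cast to bf16 instead of raising a dtype-mismatch error.
    out = m(x)

print(out.dtype)  # torch.bfloat16
```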

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95366
Approved by: https://github.com/ngimel, https://github.com/lezcano, https://github.com/leslie-fang-intel
2023-02-28 13:13:18 +00:00
b87229f19d Reland #94719 - Update ideep to add primitive cache for ARM (#95688)
### Description
This PR is to update ideep to add primitive cache in order to speed up ARM's PyTorch workloads.
Reland of https://github.com/pytorch/pytorch/pull/94719, which was unintentionally reverted by https://github.com/pytorch/pytorch/pull/94939#issuecomment-1447501258.
Fixes https://github.com/pytorch/pytorch/issues/94264.

### Performance test
Use TorchBench test in ICX with 40 cores
Intel OpenMP & jemalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/221760391-fb6cbabe-6d88-4155-b216-348e718e68b9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95688
Approved by: https://github.com/ezyang
2023-02-28 12:25:11 +00:00
05943712a4 [MTA] Skip size-0 tensors in multi_tensor_apply (#94655)
This PR skips size-0 tensors to avoid possible stack corruption in `multi_tensor_apply()`. A follow-up PR will add more unit tests in `test_foreach.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94655
Approved by: https://github.com/ngimel
2023-02-28 07:14:32 +00:00
9e16f1281f [MPS] Add copysign op. (#95552)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95552
Approved by: https://github.com/kulinseth
2023-02-28 06:49:46 +00:00
b7c2a65139 [MPS] Fix type casting copy with storage offset (#95573)
This PR handles the case where the `dst` tensor of a type-casting copy has a storage offset by creating a temporary buffer to store the results and then copying them back to `dst` with the offset applied.

Fixes #95417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95573
Approved by: https://github.com/kulinseth
2023-02-28 05:24:31 +00:00
7c66333c08 [pt] add share_memory_ to aten TensorBase (#95557)
Summary:
This is part 2 of adding `share_memory_()` support to the C++ ATen lib.

See inline comments for API considerations and current behavior rationale.

Test Plan:
Since https://github.com/pytorch/pytorch/pull/95228 already adds the UT, this is not repeating it.

Github CI

Differential Revision: D43575383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95557
Approved by: https://github.com/ezyang
2023-02-28 05:07:24 +00:00
58648822b6 Handle int/float arguments for cpp codegen in inductor (#95533)
This is a little questionable because we don't actually know what the dtype of the sympy expression is, and it's not clear we can rely on the assumptions.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95533
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-28 03:57:35 +00:00
447f5b5e2d [bazel] enable sccache+nvcc in CI (#95528)
Fixes #79348

This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.

Along the way we had to do couple tweaks:
1. Split rules_cc from the rules_cuda that embedded them before. This is needed in order to apply a different patch to rules_cc compared to the one that rules_cuda applies by default. This in turn is needed because we need to work around an nvcc behavior where it doesn't send `-iquote xxx` to the host compiler, but it does send `-isystem xxx`. So we work around this problem by (ab)using `-isystem` instead. Without it we get errors like `xxx` is not found.

2. Work around a bug in bazel https://github.com/bazelbuild/bazel/issues/10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate an ad-hoc, bazel-specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to work around the issue with CUDA symlinks. Without it we get `undeclared inclusion(s) in rule` errors all over the place for CUDA headers.

## Test plan

Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740

Note that now it says "CUDA" in the sccache output

```
+ sccache --show-stats
Compile requests                    9784
Compile requests executed           6726
Cache hits                          6200
Cache hits (C/C++)                  6131
Cache hits (CUDA)                     69
Cache misses                         519
Cache misses (C/C++)                 201
Cache misses (CUDA)                  318
Cache timeouts                         0
Cache read errors                      0
Forced recaches                        0
Cache write errors                     0
Compilation failures                   0
Cache errors                           7
Cache errors (C/C++)                   7
Non-cacheable compilations             0
Non-cacheable calls                 2893
Non-compilation calls                165
Unsupported compiler calls             0
Average cache write                0.116 s
Average cache read miss           23.722 s
Average cache read hit             0.057 s
Failed distributed compilations        0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95528
Approved by: https://github.com/huydhn
2023-02-28 03:51:11 +00:00
49ba11962e Update Dispatcher.cpp (#95589)
Update Dispatcher.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95589
Approved by: https://github.com/ezyang
2023-02-28 03:46:05 +00:00
3944e7c3e8 Fix grammatical errors in contribution guide (#95454)
Fixed following errors in contribution guide.

"deep neural networks using a **on** tape-based autograd systems." to "deep neural networks **using a tape-based** autograd systems."

"the best entrance **point** and are great places to start." to "the best entrance **points** and are great places to start."
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95454
Approved by: https://github.com/ezyang
2023-02-28 03:44:40 +00:00
46385b3e48 Fix typos under torch/_dynamo directory (#95599)
This PR fixes typos in comments and messages of `.py` files under `torch/_dynamo` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95599
Approved by: https://github.com/ezyang
2023-02-28 03:44:24 +00:00
38c32e19c8 fix DeprecationWarning (#95545)
This PR fixes 2 `DeprecationWarning` instances:

```
python3.8/site-packages/torch/utils/tensorboard/__init__.py:4
  /home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if not hasattr(tensorboard, "__version__") or LooseVersion(

python3.8/site-packages/torch/utils/tensorboard/__init__.py:6
  /home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/utils/tensorboard/__init__.py:6: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    ) < LooseVersion("1.15"):
```
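
For reference, the usual replacement pattern (shown generically here; the stand-in version string is a placeholder, not the exact change made in this PR):

```python
from packaging.version import parse as parse_version

installed = "2.11.0"  # stand-in for tensorboard.__version__

# deprecated:  LooseVersion(installed) < LooseVersion("1.15")
# replacement:
print(parse_version(installed) < parse_version("1.15"))
```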
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95545
Approved by: https://github.com/ezyang
2023-02-28 03:43:57 +00:00
3762e801ba Update dynamic skips (#95587)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95587
Approved by: https://github.com/voznesenskym
2023-02-28 03:26:55 +00:00
8b0543381b [Inductor] Support sparse_grad for torch.gather (#95490)
Summary: https://github.com/pytorch/pytorch/issues/95187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95490
Approved by: https://github.com/ngimel
2023-02-28 03:26:39 +00:00
454c48b987 Add experimental torch.export prototype (#95070)
This is WIP PR for adding torch.export API in OSS. Couple of points:
- I intentionally named it experimental_export so that people don't get confused into thinking this is our official API
- We don't plan to use AOTAutograd backend just yet. The reason we have it here is because the functionalization AOTAutograd uses is what we need for export (handling of param/buffer mutation etc). In the near future, I will extract the functionalization part and use it on top of make_fx. What we have right now is merely a placeholder.
- The reason we want to do it now is because we want to have some minimal tests running in OSS so that we can catch regressions earlier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95070
Approved by: https://github.com/gmagogsfm, https://github.com/zhxchen17
2023-02-28 02:40:19 +00:00
801b3f8fc7 Revert "Use FindCUDAToolkit to find cuda dependencies (#82695)"
This reverts commit 7289d22d6749465d3bae2cb5a6ce04729318f55b.

Reverted https://github.com/pytorch/pytorch/pull/82695 on behalf of https://github.com/peterbell10 due to Breaks torchaudio build
2023-02-28 02:29:09 +00:00
f8692dcc4a Node.stack_trace should have innermost frame last (#95592)
Both fx.Tracer and Dynamo should store node.stack_trace in the "innermost frame last" order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95592
Approved by: https://github.com/ezyang
2023-02-28 02:14:40 +00:00
b818b3fe1c better error message when functionalization cant handle op (#95392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95392
Approved by: https://github.com/mikekgfb, https://github.com/cpuhrsch, https://github.com/ezyang, https://github.com/xw285cornell
2023-02-28 00:24:40 +00:00
ddd6b53d80 fix embedding_backward_dense decomp with broadcasting (#95499)
Fixes https://github.com/pytorch/pytorch/issues/95182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95499
Approved by: https://github.com/ezyang, https://github.com/ngimel
2023-02-28 00:24:40 +00:00
84e2d957a1 fix primtorch handling for sub.scalar with alpha and float64 arg (#95421)
This fixes the primtorch issue stemming from https://github.com/pytorch/pytorch/issues/95181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95421
Approved by: https://github.com/ngimel, https://github.com/SherlockNoMad
2023-02-28 00:24:38 +00:00
eff5ae8746 Better mark_dynamic assertions (#95566)
This PR allows us to reuse the static per tensor decision making we make at fake tensorification time. We can use this to avoid setting up dynamic dim guards later if the tensor was never a candidate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95566
Approved by: https://github.com/ezyang
2023-02-28 00:02:22 +00:00
4e926db1f8 Add super().setUp() in test_symbolic_shape_analysis (#95336)
Instead of the usual `super().setUp()`, use `super(JitTestCase, self).setUp()` since JitTestCase.setUp() seems to interfere with the test (see the results on the first commit of this PR).  `super(JitTestCase, self).setUp()` skips the setUp method of JitTestCase

Fixes https://github.com/pytorch/pytorch/issues/95341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95336
Approved by: https://github.com/huydhn
2023-02-27 23:22:36 +00:00
d7146e7870 Update copyright (#95652)
Updating the copyright to reflect on the website.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95652
Approved by: https://github.com/atalman
2023-02-27 23:15:55 +00:00
10bf019b71 [jit] Add shapes info to the output type of CallFunction nodes after tracing, if the output is a tensor (#95544)
**Summary**: jit.trace usually adds shape information to all the jit::Values in its graph. This is mostly a side effect of how jit tracing is performed, but many users use this behavior for debugging and for better understanding the graph. Previously, CallFunction nodes (inserted by calling jit.script-ed functions) did _not_ have this information attached. This PR attaches this information for the tensor output values.

**Details**:
* First the jit tracer sets a global TracerState object
* Then the jit tracer invokes the python callable that is to be traced
* When the python function gets to a jit.script-ed function, [invokeScriptFunctionFromPython](8693604bc6/torch/csrc/jit/python/pybind_utils.h (L1060)) is called. It inserts a FunctionCall.
* Then after the actual scripted function gets called and we have a concrete output, we attach the concrete output [IValue to the TracerState](8693604bc6/torch/csrc/jit/python/pybind_utils.h (L1001))
* ^^ the setValueTrace call (linked in previous list item) is where this PR makes changes; we revise the jit::Value output of the CallFunction node to use the type of the concrete tensor, which will have actual shapes associated.

**Test**: added a test verifying that shape info appears in the output type for a CallFunction node in a jit-traced graph.
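
A hedged repro sketch of the behavior being tested (the names below are made up for illustration; the actual test lives in the JIT test suite):

```python
import torch


@torch.jit.script
def scripted(x):
    return x + 1


def fn(x):
    return scripted(x) * 2  # becomes a CallFunction node in the traced graph


traced = torch.jit.trace(fn, torch.randn(2, 3))
print(traced.graph)  # the CallFunction output type should now carry the (2, 3) shape
```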

Differential Revision: [D43592880](https://our.internmc.facebook.com/intern/diff/D43592880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95544
Approved by: https://github.com/qihqi
2023-02-27 22:50:29 +00:00
5272d6e6e5 Remove mentions of distributed/_shard/test_replicated_tensor (#95632)
The file was removed in https://github.com/pytorch/pytorch/pull/95453, which caused some issues with the multigpu job in periodic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95632
Approved by: https://github.com/huydhn
2023-02-27 22:41:02 +00:00
38fdd28db4 [4/N][Deprecate ST][BE] Move warnings of Partial Tensor to functions (#95631)
To solve https://github.com/pytorch/pytorch/issues/95623
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95631
Approved by: https://github.com/wanchaol
2023-02-27 22:28:04 +00:00
33cf62359d Revert "Convert operator.not_ to torch.logical_not (#94626)"
This reverts commit 97510c6d50e2c8215aa0dd0c703497a29c774598.

Reverted https://github.com/pytorch/pytorch/pull/94626 on behalf of https://github.com/ezyang due to not correct
2023-02-27 21:50:51 +00:00
cc6da7b901 Inductor allgather_into_tensor (#95530)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95530
Approved by: https://github.com/kumpera
2023-02-27 21:38:36 +00:00
68eec90cfd Support elementwise add / mul for [B, *] nested, [B, 1] dense (CUDA only) (#95620)
Small hack to reuse the 3D custom kernel from #88289 for [B, *] nested, [B, 1] dense elementwise add / mul. Simply treat the inputs as [B, *, 1], [B, 1, 1]. This is added to satisfy an internal ask.

Future work: full general broadcasting support between mixed nested / dense.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95620
Approved by: https://github.com/cpuhrsch, https://github.com/drisspg
2023-02-27 21:07:09 +00:00
1fe2a9d122 Add _int_mm to expose cuBLAS int8@int8 -> int32 matmul (#94339)
Add an _int_mm primitive that binds the cuBLAS int8@int8 -> int32 matmul and that translates to Triton-based mm templates under max autotune. This is a very useful first step towards better supporting quantization on the GPU. This is not a user-facing API, but an internal primitive.
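
A hedged usage sketch (the shapes and the requirement of CUDA int8 inputs are assumptions; this is an internal primitive, not a public API):

```python
import torch

a = torch.randint(-128, 127, (32, 64), dtype=torch.int8, device="cuda")
b = torch.randint(-128, 127, (64, 32), dtype=torch.int8, device="cuda")

c = torch._int_mm(a, b)  # int8 @ int8 -> int32 matmul via cuBLAS
print(c.dtype)           # torch.int32
```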

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94339
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-27 20:27:25 +00:00
32558910f3 make overriding operator warning message only print once (#95179)
Fixes #ISSUE_NUMBER

When I want to override some operators for a new backend, this warning message is printed for every op, which is too much. So just print it once for all operators.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95179
Approved by: https://github.com/bdhirsh
2023-02-27 20:17:43 +00:00
29f9a702cc [NCCL] (re-open) Optionally avoid recordStream calls in ProcessGroupNCCL (#89880)
Rebased version of @mcarilli's #76861

CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89880
Approved by: https://github.com/kwen2501
2023-02-27 20:15:53 +00:00
f43ce9553b [meta_tensor] polish error strings in meta registrations (#95052)
I found that some error messages should be formatted to give more detailed information, so I polished those error messages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95052
Approved by: https://github.com/bdhirsh
2023-02-27 20:12:09 +00:00
fa5a4b0dfc [CI] Do not compare two eager run results against fp64 result (#95616)
Summary: When running the benchmark test with --accuracy, two eager runs
should return the same result. If not, we want to detect it early, but
comparing against fp64_output may hide the non-determinism in eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95616
Approved by: https://github.com/ZainRizvi
2023-02-27 20:11:21 +00:00
34617d7eb8 dynamo export should be able to export identity function (#94962)
Summary:
While working on increasing coverage
(https://github.com/jansel/pytorch-jit-paritybench/pull/5) I found that identity functions are not exportable because the generated graph has no call_function.

Test Plan:
Unit test

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94962
Approved by: https://github.com/yanboliang
2023-02-27 19:41:45 +00:00
868640e094 Re-enable a FX-to-ONNX kwargs Test (#94763)
As titled. The refactoring of the ONNX test framework disabled one exporter test. This PR just brings that test back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94763
Approved by: https://github.com/justinchuby, https://github.com/abock, https://github.com/titaiwangms
2023-02-27 19:37:37 +00:00
8dfac7b887 Update fx.pass.graph_drawer usage doc to draw fx graph (#95534)
Previous usage gave this error:
```
f.write(g.get_dot_graph().create_svg())
TypeError: write() argument must be str, not bytes
```

pydot has functions to save to different types, e.g. `save_svg()`. I updated the usage doc with working code.
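
A hedged end-to-end sketch (assumes pydot and graphviz are installed; `write_svg` is pydot's writer and the exact helper name may vary across versions):

```python
import torch
from torch.fx.passes.graph_drawer import FxGraphDrawer


def f(x):
    return torch.relu(x) + 1


gm = torch.fx.symbolic_trace(f)
drawer = FxGraphDrawer(gm, "f")
drawer.get_dot_graph().write_svg("f.svg")  # writes the rendered FX graph to disk
```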

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95534
Approved by: https://github.com/ezyang
2023-02-27 19:27:29 +00:00
cyy
f27e09de04 Cleanup Windows warning suppression in CMake and fix some warnings in the source code (#94927)
This PR does two things:
1. It moves some Windows warning suppressions from various CMake files into the main CMakeLists.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
2023-02-27 19:22:20 +00:00
d950f45577 Revert "[Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)"
This reverts commit 0765dbc25ed9368f41225e7de231ee3dd6b188a3.

Reverted https://github.com/pytorch/pytorch/pull/95009 on behalf of https://github.com/jeanschmidt due to this PR is causing internal breakages. Check https://fburl.com/diff/me41urq8
2023-02-27 19:21:58 +00:00
1cf11c1c86 Add bfloat16 support to upsample (#95500)
Fixes https://github.com/pytorch/pytorch/issues/80339

This PR was previously here: https://github.com/pytorch/pytorch/pull/95159
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95500
Approved by: https://github.com/ezyang
2023-02-27 19:21:52 +00:00
c44a733018 Fix split_module bug (#95493)
Summary: As titled. The mapping currently has lots of unused keys because the `or` condition always returns True, but this does not affect correctness.

Test Plan: N/A

Differential Revision: D43579510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95493
Approved by: https://github.com/Skylion007
2023-02-27 19:11:49 +00:00
a3b505c55e [Quant] Fix setting fixed qparams for inner LSTM ops (#95537)
Summary: The existing util function did not quantize all inner
ops in the quantizable LSTM module, resulting in the error
"Could not run X with arguments from the 'QuantizedCPU' backend."
This commit fixes this by ensuring that all the other ops whose
qparams were not specifically configured are still quantized as
before, as in `torch.ao.nn.quantizable.LSTM.from_float`.

Test Plan: This commit also adds an additional check in the test
to ensure that the final converted model is in fact quantized,
in addition to just checking the qparams in the observers have
the right values.

python test/test_quantization.py TestQuantizeFx.test_static_lstm_with_custom_fixed_qparams

Reviewers: vkuzo

Subscribers: vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95537
Approved by: https://github.com/vkuzo
2023-02-27 19:08:51 +00:00
31ce32b03d Fix typos in documents under torch (#95597)
This PR fixes typos in documents (`.md` files) under the `torch` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95597
Approved by: https://github.com/ezyang
2023-02-27 19:07:47 +00:00
3beb644578 [dynamo] Fix keyword argument name of all_dim (#95600)
This PR changes the keyword argument name of the `all_dim` function from `keeepdim` to `keepdim`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95600
Approved by: https://github.com/ezyang
2023-02-27 19:05:49 +00:00
4f84c57c87 Fix potential deadlock when recording memory traces (#95273)
See comment in the diff

Differential Revision: [D43490668](https://our.internmc.facebook.com/intern/diff/D43490668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95273
Approved by: https://github.com/eellison
2023-02-27 19:04:47 +00:00
9a4cb9bcaf Fix typos under torch/_inductor directory (#95601)
This PR fixes typos in comments and messages of `.py` files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95601
Approved by: https://github.com/ezyang
2023-02-27 19:00:17 +00:00
5d70ee93fa Expose more headers for extensions. (#95447)
Fixes #ISSUE_NUMBER

Expose more headers for extensions of distributed methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95447
Approved by: https://github.com/ezyang
2023-02-27 18:59:40 +00:00
cyy
c1fa403e57 suppress nvfuser loading warning when we disable nvfuser (#95603)
To avoid annoying warnings such as "[W interface.cpp:47] Warning: Loading nvfuser library failed"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95603
Approved by: https://github.com/ezyang
2023-02-27 18:56:46 +00:00
97ec340fe9 Fix double-a typo (#95470)
Fixes a typo where there was a repeated "a" in a warning message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95470
Approved by: https://github.com/ezyang
2023-02-27 18:54:43 +00:00
4930ae7f82 [MPS] Add roll op (#95168)
Reuse the cpu implementation here, as currently there is no native roll implementation in the MPS API (if there is one, please let me know).

Compared to falling back to cpu using `PYTORCH_ENABLE_MPS_FALLBACK=1`, this way we keep tensors on MPS.

Did a small benchmark:

```python
import time

import torch

for num in [10, 100, 1000, 10000]:
    for shft in [1, 5]:
        sz = num * num
        x = torch.arange(sz, device="cpu").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        cpu_e = time.time() - s
        x = torch.arange(sz, device="mps").view(num, num)
        s = time.time()
        r = torch.roll(x, shft)
        mps_e = time.time() - s
        print(f"size: ({num}, {num}) shft: {shft} cpu: {cpu_e} mps: {mps_e}")
```

```
size: (10, 10) shft: 1 cpu: 0.00015163421630859375 mps: 0.003078937530517578
size: (10, 10) shft: 5 cpu: 6.794929504394531e-05 mps: 0.0014979839324951172
size: (100, 100) shft: 1 cpu: 0.0001621246337890625 mps: 0.0016200542449951172
size: (100, 100) shft: 5 cpu: 0.00016379356384277344 mps: 0.00154876708984375
size: (1000, 1000) shft: 1 cpu: 0.0022068023681640625 mps: 0.0017690658569335938
size: (1000, 1000) shft: 5 cpu: 0.009071111679077148 mps: 0.0020020008087158203
size: (10000, 10000) shft: 1 cpu: 0.16785407066345215 mps: 0.011695146560668945
size: (10000, 10000) shft: 5 cpu: 0.1160881519317627 mps: 0.011452913284301758
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95168
Approved by: https://github.com/albanD
2023-02-27 18:31:17 +00:00
448c97ca10 Revert "Disable MacOS M1 test jobs (#95509)"
This reverts commit afece1992aace1b2dd334f5b61978605b3ac6c2b.

Reverted https://github.com/pytorch/pytorch/pull/95509 on behalf of https://github.com/huydhn due to https://github.com/pytorch/pytorch/issues/95510 has been mitigated, macos m1 runners have been added back
2023-02-27 18:27:17 +00:00
b89fda51cd Implement sparse semantics support in gradcheck (2nd try) (#95405)
Replaces https://github.com/pytorch/pytorch/pull/94714 that was reverted due to https://github.com/pytorch/pytorch/pull/94714#issuecomment-1442355648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95405
Approved by: https://github.com/albanD
2023-02-27 17:48:02 +00:00
ea367347c0 [inductor] Allow list of decompositions to be overridden (#95468)
Partially addresses #95021 by exposing decompositions as an argument.

The reason for the `is None` check is to enable passing an empty list of decompositions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95468
Approved by: https://github.com/ngimel
2023-02-27 17:45:41 +00:00
325b43661e add/add_ for compressed sparse inputs: bypass BLAS in some trivial cases (#95293)
In `add(self, other, out=...)` we can bypass calls to BLAS in cases when `self == other == out` and `self == other`.

This PR fixes the repro from https://github.com/pytorch/pytorch/issues/94966, but the issue is still present when `x.add_(x)` is replaced, say, with `x = x.clone().add_(x)`.
Could that be a synchronization issue? CC @IvanYashchuk .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95293
Approved by: https://github.com/cpuhrsch
2023-02-27 16:06:02 +00:00
d301caa890 Deepcopy output node metadata (#95426)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95426
Approved by: https://github.com/SherlockNoMad
2023-02-27 15:25:54 +00:00
b3175ae95f Avoid copies in matmul (#76828)
With this PR, matmul folds a bmm into an mm or mv if and only if it
can do so without copying. We add tests for this to make sure that
our algorithm for detecting this is accurate.

For the cases where it was copying before see https://github.com/pytorch/pytorch/pull/75197#discussion_r843413208 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489479 https://github.com/pytorch/pytorch/pull/75197#discussion_r863489805

Fixes https://github.com/pytorch/pytorch/issues/76702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76828
Approved by: https://github.com/ngimel
2023-02-27 15:24:59 +00:00
d83a14e7f6 [inductor] enable test_grid_sampler_2d_dynamic_shapes (#95575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95575
Approved by: https://github.com/ezyang
2023-02-27 15:19:33 +00:00
03cc0f587c Don't create large intermediary tensors in the backward of matmul (#95261)
Currently, if we multiply a transposed batch of matrices with shape
[b, m, n] and a matrix with shape [n, k], when computing the gradient
of the matrix, we instantiate a matrix of shape [b, n, k]. This may be
a very large matrix. Instead, we fold the batch of matrices into a
matrix, which avoids creating any large intermediary tensor.

Note that multiplying a batch of matrices and a matrix naturally occurs
within an attention module, so this case surely happens in the wild.
In particular, this issue was found while investigating the OOMs caused by the
improved folding algorithm in the next PR of this stack. See https://github.com/pytorch/pytorch/pull/76828#issuecomment-1432359980
This PR fixes those OOMs and decreases the memory footprint of the
backward of matmul.

I understand this is a tricky one, so I put it on its own PR to discuss it.
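
A rough shape sketch of the idea, with illustrative numbers (an assumption-based example, not the actual autograd formula for matmul):

```python
import torch

b, m, n, k = 64, 512, 128, 256
A = torch.randn(b, m, n)                   # batch of matrices
B = torch.randn(n, k, requires_grad=True)  # single matrix
grad_out = torch.ones(b, m, k)             # gradient flowing into A @ B

# A naive backward materializes a [b, n, k] intermediate before reducing over b:
#   (A.transpose(-2, -1) @ grad_out).sum(0)
# Folding the batch into the matrix dimension avoids that large intermediate:
grad_B = A.reshape(b * m, n).t() @ grad_out.reshape(b * m, k)  # [n, k]
```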

Differential Revision: [D43541495](https://our.internmc.facebook.com/intern/diff/D43541495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95261
Approved by: https://github.com/ezyang
2023-02-27 15:19:09 +00:00
fd8367a7b1 [MPS][BE] Introduce xfail (#95045)
Add an `mps_ops_modifier` function that adds `unittest.expectedFailure` decorators to the operators that are supposed to fail on MPS.

This allows one to know whether or not an operation will fail, rather than skipping it.
For example:
```
% python test_mps.py -v -k test_output_match_dot
test_output_match_dot_cpu_float32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int16 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int32 (__main__.TestConsistencyCPU) ... ok
test_output_match_dot_cpu_int64 (__main__.TestConsistencyCPU) ... expected failure
test_output_match_dot_cpu_uint8 (__main__.TestConsistencyCPU) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.175s

OK (expected failures=1)
```

Moved a few functions from the blocklist to xfail, and found that some of the functions in the list actually work, for example `torch.long`.

Also, allow `None` to be used in `ALLOWLIST` instead of specifying all types explicitly (which aligns with the `DecorateInfo` semantics).

Eventually, we should get rid of `ALLOWLIST` (i.e. all ops are allowed), keep a small `BLOCKLIST`, and move the rest to `XFAILLIST`.

Add a step to print HW/SW info before running MPS tests.

Fix type promotion in `trace_mps_out`.

Introduce `MACOS_12_X_XFAILLIST` and skip almost every function for `torch.uint8`, although some of those don't make much sense and feel like a regression from PyTorch 1.13.

Re-enabled MPS testing on MacOS 12, as runners seem to be available again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95045
Approved by: https://github.com/albanD
2023-02-27 15:01:01 +00:00
11f293a74e Comment about Meta-internal usage of trymerge.py (#95536)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95536
Approved by: https://github.com/malfet
2023-02-27 14:16:04 +00:00
fb10e66d35 Bulk convert numel() to sym_numel() in FunctionsManual (#95543)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95543
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2023-02-27 13:46:13 +00:00
21f680e8ad Follow up on CUDA 12 support for PyTorch/Caffe2 (#95582)
Differential Revision: D43610669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95582
Approved by: https://github.com/ngimel
2023-02-27 04:39:56 +00:00
5265170029 [inductor] enable test_recompile_on_index_dynamic_shapes (#95581)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95581
Approved by: https://github.com/ezyang
2023-02-27 03:17:33 +00:00
6624a73837 Move istype and object identity tests into a dispatching dictionary. (#95476)
The idea is to make it a little more obvious which branch you're going to go down in a subset of cases, and make it easier to detect if you've accidentally shadowed one condition with another (the reason I wrote this in the first place.) The type dictionary also makes it harder for people to accidentally use isinstance when they should have used istype.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95476
Approved by: https://github.com/jansel
2023-02-27 02:50:58 +00:00
d6dd67a248 Dynamo: Use out-of-place binary ops instead of in-place (#95446)
Fixes issues with things like:
```python
x = 2
x += y.shape[0]
```

resulting in invalid `2 += y.shape[0]` code in the FX graph.

Fix: Whenever dynamic shapes are involved, insert the out-of-place op into the FX graph instead of the in-place op.
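
A hedged end-to-end sketch of the pattern that now compiles cleanly (the `dynamic=True` flag is an assumption about how to force dynamic shapes here):

```python
import torch


@torch.compile(dynamic=True)
def f(y):
    x = 2
    x += y.shape[0]      # traced as x = x + y.shape[0], not an in-place add on a constant
    return torch.zeros(x)


f(torch.randn(5))
```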

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95446
Approved by: https://github.com/ezyang
2023-02-27 02:10:37 +00:00
7dd95ad7f3 Add a convenience shortcut for accessing size on ComptimeVar (#95404)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95404
Approved by: https://github.com/voznesenskym
2023-02-27 02:02:02 +00:00
56c3e4ce35 [inductor] Shrink mm configs for small sizes (#95555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95555
Approved by: https://github.com/ngimel
2023-02-26 22:42:53 +00:00
6e61629f10 [inductor] Refactors/improvements to max-autotune (#95554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95554
Approved by: https://github.com/ngimel, https://github.com/nmacchioni
2023-02-26 22:39:04 +00:00
d3e1f165b3 Copy helper next_power_of_2 from triton (#95436)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95436
Approved by: https://github.com/ngimel
2023-02-26 20:49:36 +00:00
261b019a64 Copy nn_module_stack meta data when creates create node in tracer (#95358)
This PR allows the tracer to always preserve the nn_module_stack metadata (if there is any) when creating a node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95358
Approved by: https://github.com/SherlockNoMad
2023-02-26 20:21:40 +00:00
bc51ee4ed7 fix spurious aot autograd warning (#95521)
The _make_boxed logic probably needs a cleanup, but this fixes a spurious warning that we should get in before the release.

Confirmed that this used to emit a warning and no longer does:
```
import torch

lin = torch.nn.Linear(100, 10)
def f(x):
    return lin(x)

opt_f = torch.compile(f)
opt_f(torch.randn(10, 100, requires_grad=False))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95521
Approved by: https://github.com/ngimel
2023-02-26 18:17:36 +00:00
6c30dc6cee [FSDP] Save _all_handles; _all_fsdp_states to root (#95465)
- The previous PR addressed one tree traversal in `_root_pre_forward()` but not the main one from `_get_fsdp_handles()` that runs for all settings.
- This PR saves `_all_handles` to cache `_get_fsdp_handles()` and `_all_fsdp_states` to cache `_get_fsdp_states()` (renamed from `_fsdp_states` compared to last PR) on the root state.
- This PR introduces a dummy `_RootFSDPState` class that inherits from `_FSDPState` to be used only for type checking since some attributes are only defined for root states.
    - I found this approach to be better than adding `_p_assert(state.root_only_attr is not None, ...)` upon each usage of `root_only_attr`.
    - This hopefully also helps readers to quickly see which attributes are defined only on root states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95465
Approved by: https://github.com/fduwjj
2023-02-26 13:59:53 +00:00
ac9b305afe Back out "cherry-picking autodiff support for gather/index_select (#93333)" (#95565)
Summary: A bisect blamed #93333 for GPU memory leakage. This diff backs it out.

Test Plan: Monitor max GPU memory usage to see if there's a leak.

Reviewed By: hyuen, yinbinm

Differential Revision: D43511893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95565
Approved by: https://github.com/ngimel
2023-02-26 10:24:46 +00:00
3064bc4060 [dynamo] Reserve the tensorrt backend name for torch-tensorrt (#94632)
In PR #93822 the `fx2trt` backend, which registered the `tensorrt` backend name to point to `fx2trt` / `torch_tensorrt`, was removed, and the name was moved to `onnxrt`. We want to reserve the name `tensorrt` for `torch_tensorrt` to prevent any confusion, but due to the code freeze we cannot complete the integration and set up testing for the next release. So we propose leaving out the `tensorrt` name until we can set up the backend and testing for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94632
Approved by: https://github.com/frank-wei
2023-02-26 09:40:31 +00:00
fa7f17799a [3/N][BE][ST Deprecate] Remove Replicated Tensor (#95453)
Please use distributed tensor instead. We are deprecating ShardedTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95453
Approved by: https://github.com/wanchaol
2023-02-26 06:18:31 +00:00
a88bfc60c7 [2/N][ST deprecate][BE] Remove Replicate Tensor convert from DDP and PTD (#95450)
No use was found for this ST/ReplicatedTensor-based DDP. As part of the ShardedTensor migration, let's remove this logic. This tries to undo everything in https://github.com/pytorch/pytorch/pull/75753.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95450
Approved by: https://github.com/wanchaol
2023-02-26 03:03:37 +00:00
9b7abc4fac Run slow gradcheck tests sequentially (#95494)
Also redo https://github.com/pytorch/pytorch/pull/95246, as many more tests still run OOM.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95494
Approved by: https://github.com/clee2000
2023-02-26 00:44:25 +00:00
9bca9df42b [BE] Fix TORCH_WARN_ONCE (#95559)
It does not take a condition as its first argument, unlike `TORCH_CHECK`.
Test plan: run `python3 -c "import torch;print(torch.arange(1., 10.,device='mps').view(3, 3).trace())"` and observe no warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95559
Approved by: https://github.com/Skylion007
2023-02-25 20:47:27 +00:00
407b0f3214 fix for debug crash build (#95464)
Fixes https://github.com/pytorch/pytorch/issues/94376

⚠️ Hacky fix

Details about use of `noop_vtable`:
d677432b70/c10/core/impl/PyInterpreter.h (L92-L102)

Currently, at destruction, `noop_vtable` goes out of scope first while dangling references to it are still present in other objects like `PythonKernelHolder`, which is held by the singleton `Dispatcher`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95464
Approved by: https://github.com/ezyang
2023-02-25 19:42:06 +00:00
d78274b759 Automatically guard when SymInt is converted to int (#95479)
During enablement, we disabled int() conversions because they were
an easy way to footgun guards.  We have enough of dynamic shapes
working now that this is causing spurious errors; e.g., if you feed
a symbolic int to x.size(symint).  We now allow implicit conversions
of SymInt to int here, posting a guard.  We expect guard provenance
to help people debug overspecialization.

Fixes https://github.com/pytorch/pytorch/issues/95328

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95479
Approved by: https://github.com/wconstab, https://github.com/voznesenskym, https://github.com/ngimel
2023-02-25 19:41:51 +00:00
a530446f57 Manual submodule update: kineto and libfmt bazel issue (#94756) (#95535)
Summary:
This is a manual pull request to update the third_party submodule for [pytorch/kineto](https://github.com/pytorch/kineto). It also tries to fix the failure in the libfmt bazel build, similar to https://github.com/pytorch/pytorch/pull/93219.

New submodule commit: 92c5344f0b

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95535

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Differential Revision: D43588413

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95535
Approved by: https://github.com/davidberard98
2023-02-25 19:26:08 +00:00
02d44e5de4 [Dynamo] Support CUDA stream passed from outside of torch.compile decorator (#94627)
Fixes #94499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94627
Approved by: https://github.com/jansel
2023-02-25 19:15:59 +00:00
ab1ab3ab19 [CI] Specify more torch.backends.cudnn options to reduce non-determinism (#95478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95478
Approved by: https://github.com/ezyang
2023-02-25 18:54:12 +00:00
4dca9bde05 [MPS] Add fmax fmin op (#95191)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95191
Approved by: https://github.com/kulinseth
2023-02-25 07:21:48 +00:00
057bc7191d [Dynamo] Remove torch.autograd.profiler.profile workaround in UserDefined (#95504)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95504
Approved by: https://github.com/williamwen42
2023-02-25 05:15:01 +00:00
f5cf1a8b43 Update triton hash (#95540)
Fixes #95523

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95540
Approved by: https://github.com/ngimel
2023-02-25 03:56:29 +00:00
ee6610ddf6 [vision hash update] update the pinned vision hash (#95532)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95532
Approved by: https://github.com/pytorchbot
2023-02-25 03:24:53 +00:00
b8151d2ba9 Utility for running delta comparisons between two flag configs (#95411)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95411
Approved by: https://github.com/Chillee
2023-02-25 02:30:22 +00:00
69d62373aa Move multi-line wrap functions to helper (#95472)
My intention is to collapse all of the istype() and isinstance() and object identity tests into a more structured form involving a dict lookup. To do this conveniently, I need every continuation to be expressible in a single expression. Thus, all multi-line wrap methods are moved. This is code motion only, no logic changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95472
Approved by: https://github.com/Skylion007
2023-02-25 02:23:40 +00:00
a33d8133a5 Slight cleanup of VariableBuilder giant if condition (#95471)
Some of these changes are semantics preserving, some are not. Please review carefully.

* Use `istype(x, y)` over `type(x) is y`
* Use istype over isinstance in frozenset. If the user subclassed the type in question, we must treat it as a user defined class as it may have custom behavior
* The `isinstance(value, (int, float))` condition for `wrap_unspecialized_primitive` is dead-ish; direct int/float values are caught by an earlier istype check. Technically, if you subclassed int/float it would pass through, but this is almost assuredly not intended behavior

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95471
Approved by: https://github.com/Skylion007
2023-02-25 02:23:40 +00:00
8693604bc6 coreml - Wrap Core ML execute and forward calls in autorelease pool (#95384)
Summary:
When performing inference using the Core ML delegate, memory increases indefinitely. This is due to Core ML allocating memory within `predictionFromFeatures:error:`. It seems that the autorelease pool does not release the return values from the prediction method until inference is stopped completely, so we need to release with `autoreleasepool` manually ([per Apple guidance in the Apple Developer Forums](https://developer.apple.com/forums/thread/692425)).

This commit wraps `autoreleasepool` around the `execute` function of `PTMCoreMLBackend`, which is the scope of where the return values of `predictionFromFeatures:error:` are. Also added in `PTMCoreMLExecutor` for good measure.

Differential Revision: D43520767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95384
Approved by: https://github.com/mcr229
2023-02-25 01:06:36 +00:00
ca59b2d375 Fix co-dev regresssion in github-exports-check job (#95345)
Summary:
Regression introduced in #91134 (github-exports-check calls git, which is not available internally at Meta).

Meta employees, see T145865943 for the context.

Test Plan: Unit tests, `github-export-checks` job.

Differential Revision: D43521051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95345
Approved by: https://github.com/kit1980
2023-02-24 22:40:28 +00:00
acb81c1c5a [pytorch] Bump SoLoader version to 0.10.5 (#95498)
Summary: Use system linker by default on Android N and above devices.

Test Plan: sandcastle and Circle CI

Differential Revision: D43581588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95498
Approved by: https://github.com/kit1980
2023-02-24 22:37:47 +00:00
afece1992a Disable MacOS M1 test jobs (#95509)
We have an outage with the MacOS M1 runners, so we need to disable the job until next Monday, when infra has capacity to look into the issue.

Note: Do we want to keep MPS tests on `macos-m1-13`? (As long as these new runners are still there.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95509
Approved by: https://github.com/clee2000
2023-02-24 22:36:07 +00:00
eqy
cc39cd6938 [CUDA][CUBLAS] Explicitly link against cuBLASLt (#95094)
An issue surfaced recently that revealed that we were never explicitly linking against `cuBLASLt`; this fixes that by linking explicitly rather than depending on linker magic.

CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95094
Approved by: https://github.com/malfet, https://github.com/ngimel, https://github.com/atalman
2023-02-24 21:44:32 +00:00
b215af2db8 [optim] Add general documentation on our algorithm defaults (#95391)
I added a section + table under Algorithms
https://docs-preview.pytorch.org/95391/optim.html?highlight=optim#module-torch.optim
<img width="725" alt="image" src="https://user-images.githubusercontent.com/31798555/221246256-99325a27-9016-407b-a9fe-404d61e41a82.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95391
Approved by: https://github.com/albanD
2023-02-24 21:35:30 +00:00
0520a680c0 Rebuild LICENSES_BUNDLED.txt (#95505)
A re-run of third_party/build_bundled.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95505
Approved by: https://github.com/seemethere
2023-02-24 21:24:05 +00:00
b855b5eaac SymIntify topk (#95015)
Companion PR for https://github.com/pytorch/xla/pull/4644.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95015
Approved by: https://github.com/ezyang
2023-02-24 21:20:50 +00:00
f53671e46e [inductor] Bugfix in autotuning cache handling (#95435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95435
Approved by: https://github.com/nmacchioni, https://github.com/yanboliang
2023-02-24 21:19:52 +00:00
76cbe5797d [MPS] Add TORCH_CHECK for Conv (#95480)
- Also remove FFTs from fallback

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95480
Approved by: https://github.com/DenisVieriu97
2023-02-24 19:52:35 +00:00
4c8ad93a7c [Inductor][CI] Remove hf_GPT2_large from CPU inference test (#95473)
Summary: hf_GPT2_large shows random failures on CI for CPU inference. Created https://github.com/pytorch/pytorch/issues/95474 for the Intel team to investigate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95473
Approved by: https://github.com/anijain2305
2023-02-24 18:21:36 +00:00
01c861af14 Added utilities to instrument kernel bandwidth numbers (#95355)
Looks like

![image](https://user-images.githubusercontent.com/6355099/221048077-33aeff50-0951-42c9-89e9-22049db4f94d.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95355
Approved by: https://github.com/ngimel, https://github.com/jansel
2023-02-24 17:51:11 +00:00
d677432b70 Remove non-existing third_party/catch from CMake (#95420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95420
Approved by: https://github.com/huydhn
2023-02-24 08:00:07 +00:00
9ded087bac During export, generate Python TENSOR_MATCH guards (#94970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94970
Approved by: https://github.com/ezyang
2023-02-24 05:37:31 +00:00
80a6b24ee1 [pt] move csrc shm logic to aten storage utils (#95228)
Summary:
This is part 1 of the effort to support `share_memory_()` in the C++ aten library.

This allows C++ code to replace the tensor storage, in place, with a shm-based one.
For now, fd-based shm is the only supported implementation, to simplify memory management in general.

This first part intentionally avoids public API changes (to `TensorBase`, see comments in `StorageUtil.h`) so that we can get the core features usable outside pt/csrc first. The API addition to `Tensor` or `TensorBase` would involve more distracting changes and make the change harder to review.

Test Plan:
```
buck test caffe2:StorageUtils_test
```

Differential Revision: D43467616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95228
Approved by: https://github.com/ezyang
2023-02-24 05:30:00 +00:00
a12e92d8e4 Support nn.Module forward hooks in torchdynamo (#92125)
Tweak dynamo behavior in 2 places when calling nn.Modules,
to route the call to __call__  instead of .forward(), since
__call__ is the codepath that eager users hit and will dispatch
to hooks correctly.
 (1) inside NNModuleVariable.call_function, which covers the common case
     of calling a module from code dynamo is already tracing
 (2) at the OptimizedModule layer, which is the entrypoint
     into a top-level nn.Module dynamo is about to compile

This exposes a new bug: NNModuleVariable used to special-case calling
module.forward() (which is a method) as a UserFunctionVariable with an extra
'self' arg.  After tracing into module.__call__, there is no longer a special
case for the eventual call into .forward, and it gets wrapped in a
UserDefinedObjectVariable following standard behavior of ._wrap().  UDOV can't be
called, so this broke some tests.

- Fix: add a new special case in _wrap() that treats methods as a UserDefinedMethod
  instead of UserDefinedObjectVariable.  Now, the forward method can be called.

Also, fix NNModuleVar.call_method routing forward back to __call__
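
A hedged sketch of the user-visible behavior this enables; hook dispatch through __call__ under torch.compile is the point of the change, while the exact graph-break behavior around the print is not guaranteed here:

```python
import torch

lin = torch.nn.Linear(4, 4)
lin.register_forward_hook(lambda mod, inp, out: print("forward hook fired"))

compiled = torch.compile(lin)
compiled(torch.randn(2, 4))  # routes through __call__, so the hook runs
```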

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92125
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/voznesenskym
2023-02-24 05:10:29 +00:00
d89bfa16e7 [quant] add serialization method for quantized hardswish (#94486)
**Summary**
Fix the issue: https://github.com/pytorch/pytorch/issues/91877. The root cause is that the serialization and deserialization methods for `state_dict` were not enabled for `QuantizedHardswish`. Added these methods in this PR.

**Test plan**
```
python -m pytest quantization/core/test_quantized_module.py -k test_hard_swish
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94486
Approved by: https://github.com/jgong5, https://github.com/vkuzo
2023-02-24 04:43:27 +00:00
9d04d376d8 docs: Match open bracket with close bracket in unsqueeze (#95215)
Was going to fix something else that I thought was an issue, but isn't, so just leaving this tiny thing in case it's wanted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95215
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2023-02-24 03:56:59 +00:00
6665fe9e65 [vision hash update] update the pinned vision hash (#95427)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95427
Approved by: https://github.com/pytorchbot
2023-02-24 03:39:47 +00:00
a641d60757 hotfix for memory leak in aot autograd induced by saving tensors for backward (#95101)
Workaround fix in AOTAutograd for https://github.com/pytorch/pytorch/issues/94990 (see the comments for more details / discussion)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95101
Approved by: https://github.com/albanD
2023-02-24 03:02:55 +00:00
4846d52212 inductor: fix compiler error when trying to vectorize logit_and and logit_or (#95361)
Currently, `operator&&` and `operator||` don't have a vectorized implementation; disable them for now as a quick fix for the 2.0 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95361
Approved by: https://github.com/ngimel, https://github.com/EikanWang
2023-02-24 02:30:13 +00:00
0765dbc25e [Functional Collectives] Migrate DeviceMesh::all_reduce to use functional all_reduce. (#95009)
BC: This changes the signature and semantics of DeviceMesh::all_reduce.

DeviceMesh::all_reduce now uses a functional collective under the hood which makes it more easily traceable.
You no longer need to use CommTensor to get a trace.

all_reduce now is async only and uses AsyncCollectiveTensor to ensure proper stream synchronization.

Signature changed: removed `async_op` param and changes return type from `Optional[Work]` to `torch.Tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95009
Approved by: https://github.com/wanchaol
2023-02-24 02:10:55 +00:00
5cad542e43 [MPS] Add log_sigmoid op (#95280)
1. Add log_sigmoid.
2. Make log1p a common function. Operators that use log1p: mish, softplus, log_sigmoid (maybe more).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95280
Approved by: https://github.com/kulinseth
2023-02-24 01:38:30 +00:00
9f707f164e Add more GPU metric instrumentation (#91717)
Fixes https://github.com/pytorch/serve/issues/1937

A fairly common query I see folks running while using pytorch is

`nvidia-smi --format=csv,noheader,nounits --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.used,temperature.gpu,power.draw,clocks.current.sm,clocks.current.memory -l 10`

Existing metrics we have
* For kernel utilization`torch.cuda.utilization()`
* For memory utilization we have them under `torch.cuda.memory` the memory allocated with `torch.cuda.memory.memory_allocated()`
* For total available memory we have `torch.cuda.get_device_properties(0).total_memory`

Which means the only metrics we're missing are (see the combined sketch after the details below):
* Temperature: now in `torch.cuda.temperature()`
* Power draw: now in `torch.cuda.power()`
* Clock speed: now in `torch.cuda.clock_speed()`

With some important details on each

* Clock speed settings: I picked the SM clock domain which is documented here https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceEnumvs.html#group__nvmlDeviceEnumvs_1g805c0647be9996589fc5e3f6ff680c64
* Temperature: I use `pynvml.nvmlDeviceGetTemperature(handle, 0)` where 0 refers to the GPU die temperature
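
A combined usage sketch, hedged: the existing calls are the public API, while the three new function names follow this message as written and should be treated as assumptions (the released names may differ):

```python
import torch

if torch.cuda.is_available():
    print("kernel utilization (%):", torch.cuda.utilization())
    print("memory allocated (bytes):", torch.cuda.memory_allocated())
    print("memory total (bytes):", torch.cuda.get_device_properties(0).total_memory)
    # New metrics added by this PR (names as given above, assumed):
    print("die temperature:", torch.cuda.temperature())
    print("power draw:", torch.cuda.power())
    print("SM clock:", torch.cuda.clock_speed())
```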
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91717
Approved by: https://github.com/ngimel
2023-02-24 00:38:03 +00:00
8efe4fd590 Memoize repeated nonzero calls to the same fake tensor (#95399)
This removes the need to explicitly constrain_unify `x[mask]` and `y[mask]` when mask is a boolean tensor. It's very narrow but it seems to work in practice.

To invalidate the nonzero call when mutation occurs, I use version counter. I know there are ways to bypass this but I think it's good enough for now.
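
A hedged illustration of the user-visible effect (the function below is a made-up example, not code from the PR):

```python
import torch


def f(x, y, mask):
    a = x[mask]  # boolean-mask indexing lowers to a nonzero call
    b = y[mask]  # same fake mask tensor -> memoized, so both sizes share one unbacked SymInt
    return a + b  # the sizes now unify without an explicit constrain_unify
```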

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95399
Approved by: https://github.com/eellison
2023-02-24 00:27:45 +00:00
4833e47feb Add support for nonzero, some improvements to reduce guards (#95387)
This takes the strategy described in https://docs.google.com/document/d/1lFRYAJo5nrfxRhwIzGnfi2pbLpU6T4ytSRSuLJ5qebI/edit#

It is essentially https://github.com/pytorch/pytorch/pull/95222 but squashed and with changes that are unnecessary given that we assume nonzero returns > 1.

What's in the PR:

* nonzero now supports meta propagation. When `capture_dynamic_output_shape_ops` is set, it will return a tensor with an unbacked SymInt representing the size in question.
* The unbacked SymInt is UNSOUNDLY assumed to be not equal to 0/1. We will still error if you guard otherwise.
* PrimTorch pointwise operators are updated to use empty_permuted, to avoid guarding on unbacked SymInt from empty_strided (tested in `test_dynamic_pointwise_scalar`)
* Convolution is updated to skip backend selection if batch is unbacked, to avoid guarding on unbacked SymInt (tested in `test_unbacked_batch_resnet`)
* I kept the helper utilities like `definitely_true` for working with possibly unbacked SymInts. They're not used right now but maybe someone will find them useful.
* Added `constrain_unify` to let you specify two unbacked SymInts must have the same value

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95387
Approved by: https://github.com/voznesenskym
2023-02-24 00:27:45 +00:00
627282fa6c Corrected grammar in contribution guide (#93014)
Corrected the grammar of a sentence in "Implementing Features or Fixing Bugs" section of the contribution guide.

**Before:**
Issues that are labeled first-new-issue, low, or medium priority provide the best entrance point are great places to start.

**After:**
Issues that are labeled first-new-issue, low, or medium priority provide the best entrance point _and_ are great places to start.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93014
Approved by: https://github.com/albanD, https://github.com/kit1980
2023-02-24 00:22:14 +00:00
3bafecf719 Revert "Add various uninterpreted bit tensor data types (#94992)"
This reverts commit 9dbfca7840680ccd8d43f3e12594420ab9cd82e4.

Reverted https://github.com/pytorch/pytorch/pull/94992 on behalf of https://github.com/atalman due to breaks libtorch windows nightly builds see: https://github.com/pytorch/pytorch/pull/95406
2023-02-23 23:54:23 +00:00
f172c7c60a Improve retries when ECR login is flaky (#95398)
We had a few failures on master where the AWS ECR login was flaky
- [example 1](https://github.com/pytorch/pytorch/actions/runs/4255994694/jobs/7404316780)
- [example 2](https://github.com/pytorch/pytorch/actions/runs/4255390043/jobs/7402936370)
- [example 3](https://github.com/pytorch/pytorch/actions/runs/4255390040/jobs/7403356275)

Most likely the failure happened when getting the AWS_ACCOUNT_ID (which wasn't protected by a retry).

Retrying getting the account id, and also moving the whole step into a retry action to retry on slightly longer lasting ECR outages
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95398
Approved by: https://github.com/huydhn
2023-02-23 23:47:10 +00:00
6dc81f7bdd Update docs that Parameters are immune to no_grad mode (#95232)
Fixes https://github.com/pytorch/pytorch/issues/83998

![image](https://user-images.githubusercontent.com/31798555/220971800-4af57d92-9f15-4e13-bfe4-73e2ff1cd943.png)
![image](https://user-images.githubusercontent.com/31798555/221019508-d7330a16-7f01-4d37-a1af-a4905e9596c4.png)
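
A short sketch of the documented behavior (my reading of the docs change; treat it as an assumption):

```python
import torch

with torch.no_grad():
    t = torch.randn(3)                      # plain tensor: requires_grad stays False
    p = torch.nn.Parameter(torch.randn(3))  # Parameter: still created with requires_grad=True

print(t.requires_grad, p.requires_grad)  # False True
```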

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95232
Approved by: https://github.com/soulitzer
2023-02-23 23:33:19 +00:00
98c5921ed5 Upload artifacts from inductor-A100-perf to S3 (#95401)
This addresses the missing artifacts from inductor A100 perf workflows on HUD https://github.com/pytorch/pytorch/issues/95075#issuecomment-1441924840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95401
Approved by: https://github.com/clee2000, https://github.com/wconstab
2023-02-23 21:46:04 +00:00
24dd37ef51 Add BOOL_FALSE guard to optimize empty container case (#95248)
There is a fast way to implement a guard for an empty dict, which is to check its bool() value.

However, we can't use this guard in general, since we can only safely apply it at runtime if the runtime value actually is a dict (or another type that works with 'bool' in the same way). A counterexample is when a tensor is passed instead of a dict and throws on the bool() operator.

So we can put a type check in the guard, but that is slow enough it defeats the purpose.

Instead, we note that for the case of NNModuleVariables (which are specialized NNModules not unspecialized ones), we already have a hook in place to invalidate the guards if setattr is called.  I am claiming that setattr is the only way that the type of a property on an NNModule could change.  If I'm right, then it's safe to (a) only use this guard for NNModuleVariables, (b) not do a type check inside the guard.
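
An illustrative sketch of the tradeoff described above (hypothetical helper names, not the actual Dynamo guard code):

```python
def bool_false_guard(value):
    # Fast path: an empty dict (or any empty container) is falsy.
    # Skipping the type check is safe only for specialized NNModule attributes,
    # whose guards are invalidated via the setattr hook if they are replaced.
    return not value


def guarded_with_type_check(value):
    # Safe for arbitrary runtime values (e.g. a tensor would raise on bool()),
    # but the isinstance check erases the speed advantage.
    return isinstance(value, dict) and not value
```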

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95248
Approved by: https://github.com/voznesenskym
2023-02-23 21:35:15 +00:00
9c45f47bbe [FSDP] Save _fsdp_states on root (#95343)
This saves an attribute `_fsdp_states: Optional[_FSDPState]`. For root, it is populated with all `_FSDPState`s in the root's tree. For non-root, it is `None`.

This is used to avoid doing the tree traversal during `_root_pre_forward()` when `forward_prefetch=True`.

Differential Revision: [D43536895](https://our.internmc.facebook.com/intern/diff/D43536895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95343
Approved by: https://github.com/fegin
2023-02-23 21:18:05 +00:00
cba8b12fa7 [quant][bug fix] Fix qrange_len in torch.ao.quantization.utils.py (#95297)
Summary:

It looks like there is a typo and qrange_len should be 2^32 instead of 2^31, as it is currently set.

Test Plan:
```
python test/test_quantization.py TestObserver.test_per_tensor_observers

```

Reviewers:

Subscribers:

Tasks: https://github.com/pytorch/pytorch/issues/95295

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95297
Approved by: https://github.com/vkuzo
2023-02-23 20:23:45 +00:00
0eeb04652a [vulkan] Pad channels when using texture storage instead of "tight packing" (#95251)
Currently, in Vulkan 4D tensors are represented in GPU textures by simply combining the batch and channel dimensions into the depth axis. However, if the number of channels is not a multiple of 4, then data belonging to the same batch can cross texel boundaries.

For instance, consider a tensor with `N=2`, `C=3`. The depth axis of the texture would contain the data

```
|tex1|tex2|
-----------
|AAAB|BB00|
```
Where A represents data from `n=1` and B represents data from `n=2`.

This packing structure ("tight packing") makes some ops that care about batch boundaries more complex and inefficient to implement. Therefore this diff introduces channel padding when storing tensors as image textures.

The same tensor with `N=2`, `C=3` would now have the depth axis contain

```
|tex1|tex2|
-----------
|AAA0|BBB0|
```

Differential Revision: [D43068669](https://our.internmc.facebook.com/intern/diff/D43068669/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D43068669/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95251
Approved by: https://github.com/salilsdesai
2023-02-23 19:08:00 +00:00
d4882a9445 Make the cuda device assert error message clearer (#95360)
Summary: Easier to debug

Test Plan: CI

Differential Revision: D43525303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95360
Approved by: https://github.com/ngimel
2023-02-23 18:28:54 +00:00
ec10d23c51 [dynamo] Fix list contains check (#95092)
Original issue was something like:
```
def func(x):
    assert x.size(-1) in [4, 5, 6], "bad"
    return x + x
```
where the contains check is comparing a symint (x.size(-1)) with other integers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95092
Approved by: https://github.com/voznesenskym, https://github.com/yanboliang
2023-02-23 18:22:32 +00:00
0c0694495b Fix a bug in nesting check_sparse_tensor_invariants context managers (#95372)
As in the title. The bug was reported in https://github.com/pytorch/pytorch/pull/94728#discussion_r1108892366 and has the following reproducer:
```python
>>> import torch
>>> check_ctx = torch.sparse.check_sparse_tensor_invariants(True)
>>> no_check_ctx = torch.sparse.check_sparse_tensor_invariants(False)
>>> with check_ctx:
...   assert torch.sparse.check_sparse_tensor_invariants.is_enabled()
...   with no_check_ctx:
...     assert not torch.sparse.check_sparse_tensor_invariants.is_enabled()
...   assert torch.sparse.check_sparse_tensor_invariants.is_enabled()
...
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
AssertionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95372
Approved by: https://github.com/cpuhrsch
2023-02-23 18:22:13 +00:00
808879ec8b Revert "Implement sparse semantics support in gradcheck (#94714)" (#95386)
This reverts commit 7ac511c29ad365f6dc078b8353d9c189720970a2 from https://github.com/pytorch/pytorch/pull/94714 since it breaks periodic.

Git thinks there's a merge conflict due to an unfortunately located newline deletion, so reverting this one manually

Details behind the failure in https://github.com/pytorch/pytorch/pull/94714#issuecomment-1442160593
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95386
Approved by: https://github.com/clee2000
2023-02-23 18:02:37 +00:00
fb3ff77438 [mergebot] Fix for pagination error (#95333)
Fix for a weird bug that happens very rarely. My solution is to retrieve all checksuites before retrieving their checkruns.

Sometimes `cs_cursor=edges[edge_idx - 1]["cursor"] if edge_idx > 0 else None,` is None when it shouldn't be because of how we reset `checksuites = get_next_checksuites(checksuites)` on every loop.

Example:
- Page 1 of checksuites contains some stuff.
- Page 2 of checksuites: pull {a bunch of checkruns}.

cs_cursor gets set to None for the pull checksuite on page 2 because `checksuites = get_next_checksuites(checksuites)` resets the edges on every loop. Then the checkruns can't be retrieved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95333
Approved by: https://github.com/huydhn
2023-02-23 17:48:56 +00:00
254b161def Revert "During export, generate Python TENSOR_MATCH guards (#94970)"
This reverts commit 5a8092f0584590796e1f64a1f51ac0c834750449.

Reverted https://github.com/pytorch/pytorch/pull/94970 on behalf of https://github.com/voznesenskym due to Clowny comparison bug on edge cases for devices
2023-02-23 17:47:59 +00:00
cb6e38d89d Revert "Update docs that Parameters are immune to no_grad mode (#95232)"
This reverts commit 5783cee2a3a1457fc93b00a4a50e61ba02f148db.

Reverted https://github.com/pytorch/pytorch/pull/95232 on behalf of https://github.com/ZainRizvi due to This caused the test_doc_examples test to fail on trunk
2023-02-23 17:43:45 +00:00
b9e95158d5 [MPS] Fix LSTM backward and forward pass (#95137)
Fixes #91694
Fixes #92615

Several transpositions were missing from the backward graph in the `batch_first=True` case. #91694 does not reproduce with `batch_first=False`.

After fixing the transpose issue, I thought I could finally use LSTM freely in my project, but then I got horrific results during training. This seems related to #92615.

After that I decided to fix LSTM's backward step completely. I collected all my findings in this thread, and it seems I succeeded.

Funny enough, backward tests were completely disabled before and were not passing:
```python
    @unittest.skipIf(True, "Backward of lstm returns wrong result")
    def test_lstm_2(self, device="mps", dtype=torch.float32):
```

UPD: the forward pass of the multi-layer version was also wrong due to incorrect `initState, initCell` slices. Tests were passing because the states were initialized with zeros. *Accidentally* fixed this too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95137
Approved by: https://github.com/jhavukainen, https://github.com/kulinseth, https://github.com/soulitzer
2023-02-23 17:32:42 +00:00
86efa104f5 [MPS] Fix view op slicing for 2nd dim in case of 0 offset (#95381)
* Fix view op slicing for 2nd dim in case of 0 offset

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95381
Approved by: https://github.com/razarmehr
2023-02-23 17:26:10 +00:00
5783cee2a3 Update docs that Parameters are immune to no_grad mode (#95232)
Fixes https://github.com/pytorch/pytorch/issues/83998

![image](https://user-images.githubusercontent.com/31798555/220971800-4af57d92-9f15-4e13-bfe4-73e2ff1cd943.png)
![image](https://user-images.githubusercontent.com/31798555/220971892-35554d17-fc44-4211-9017-7a5555ae3bb1.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95232
Approved by: https://github.com/soulitzer
2023-02-23 16:41:54 +00:00
af202aea34 Add knobs for globally turning off 0/1 specialization and duck shaping (#95352)
They're not wired up to anything right now, but the most logical wiring would be to add torch._dynamo.config options to toggle them.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95352
Approved by: https://github.com/voznesenskym
2023-02-23 16:29:10 +00:00
94fd063f3f Stop subclassing sympy Symbol (#95313)
According to ngimel (and also noticed by me), printing
x1*s0**2 doesn't work correctly in Sympy as it complains
'<' not supported between instances of 'tuple' and 'str'

This is probably a Sympy bug, but the real answer is that subclassing
is more trouble than it's worth and we ought not do it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95313
Approved by: https://github.com/ngimel
2023-02-23 16:28:50 +00:00
cece63f197 Add warn-once deprecation warning to legacy sparse constructors (#94850)
Addresses https://github.com/pytorch/pytorch/issues/68323#issuecomment-1425174341

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94850
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2023-02-23 15:05:12 +00:00
3b966a6ce3 [autograd] disable backward/grad for complex scalar output (#92753)
Fixes https://github.com/pytorch/pytorch/issues/92750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92753
Approved by: https://github.com/ezyang
2023-02-23 11:38:27 +00:00
b5ff41a47a [Dynamo] No graph break on calling dict & collections.OrderedDict() (#95250)
It's common to call ```dict()``` or ```collections.OrderedDict()``` inside a ```forward``` function, so we should not graph break.

This pattern has been used in many places including:
* The use case in [torchvision](
928b05cad3/torchvision/models/_utils.py (L66-L73)).
* It causes ~100 model failures(nopython=True) in the 14k github models.
* Also it hits several Meta internal use cases.
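
A minimal sketch of the pattern this enables (the module and shapes are made up for illustration):
```python
import collections
import torch

class Wrapper(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.a = torch.nn.Linear(4, 4)
        self.b = torch.nn.Linear(4, 4)

    def forward(self, x):
        # Building a dict/OrderedDict inside forward previously caused a graph break.
        out = collections.OrderedDict()
        out["a"] = self.a(x)
        out["b"] = self.b(x)
        return out

compiled = torch.compile(Wrapper(), fullgraph=True)
compiled(torch.randn(2, 4))
```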

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95250
Approved by: https://github.com/jansel
2023-02-23 09:03:07 +00:00
bc438af6fe std/var: support floating point correction value (#94073)
Ref https://github.com/pytorch/pytorch/issues/61492#issuecomment-1413003480

The array API specifies correction to be `Union[int, float]` while we currently only support integers.
https://data-apis.org/array-api/latest/API_specification/generated/array_api.std.html

As std/var is calculated currently, the final count of elements is already done
in floating point so we can make the correction floating point without any loss
of precision or generality.
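
A minimal usage sketch of the new behavior (the values here are illustrative):
```python
import torch

x = torch.randn(100)
# The correction no longer has to be an integer.
print(torch.var(x, correction=1))    # same as unbiased=True
print(torch.var(x, correction=0.5))  # fractional correction now accepted
```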

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94073
Approved by: https://github.com/ezyang
2023-02-23 05:50:45 +00:00
56aed2a6bb SymFloat: Expose comparison operators in C++ API (#94812)
This is adapted from the corresponding methods in `SymInt.h`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94812
Approved by: https://github.com/ezyang
2023-02-23 05:50:45 +00:00
5730cabdd0 using float type to do the computation of norm reduce for cpu half and bfloat16 dtype (#95166)
As the title says, we should use a higher-precision dtype to compute the norm reduction for the half and bfloat16 dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95166
Approved by: https://github.com/peterbell10, https://github.com/jgong5, https://github.com/ngimel, https://github.com/lezcano
2023-02-23 05:00:25 +00:00
6912cf4053 [DCP] Update DCP to use the updated FSDP optim state_dict APIs (#95303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95303
Approved by: https://github.com/fegin
2023-02-23 03:55:02 +00:00
c97275acf6 Fix OOMing periodic shards (#95246)
Tests have been consistently failing with the error on the following shards with the error `RuntimeError: CUDA error: out of memory`
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 1, 2, linux.4xlarge.nvidia.gpu)`
- `periodic / linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck / test (default, 2, 2, linux.4xlarge.nvidia.gpu)`

Seeing if serializing those test files makes the periodic jobs succeed again.  This feels a bit off since there are so many different test files that have failed and need to be serialized, indicating a potential perf regression somewhere

Failures on hud: https://hud.pytorch.org/hud/pytorch/pytorch/master/1?per_page=100&name_filter=periodic%20%2F%20linux-bionic-cuda11.7-py3-gcc7-slow-gradcheck%20%2F%20test%20(default%2C%20
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95246
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2023-02-23 03:50:56 +00:00
bdb78e529e [PTD][DCP] Add fsdp checkpoint example (#95258)
Add an example to show recommended way to checkpoint FSDP.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95258
Approved by: https://github.com/kumpera
2023-02-23 03:40:27 +00:00
c594a32f60 [vision hash update] update the pinned vision hash (#95340)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95340
Approved by: https://github.com/pytorchbot
2023-02-23 03:34:14 +00:00
29c235e555 [SDPA] Fix bug in parsing scaled_dot_product_attention arguments (#95311)
Fixes #95266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95311
Approved by: https://github.com/cpuhrsch
2023-02-23 03:12:46 +00:00
8e391c735f use 4 warps for small block config in mm (#95339)
Temporary Fix for #95312
In triton, 1 warp computes 16x16 tile of output, so for 32x32 block we only need 4 warps. 8 warps IMA, which is a bug, but it's not a good config anyway.
Triton main is supposed to have better behavior for these pathological, but we are not on main yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95339
Approved by: https://github.com/ezyang, https://github.com/Chillee
2023-02-23 03:03:42 +00:00
ba8ff4be4d [inductor] enable test_nll_loss_forward_dynamic_shapes (#95176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95176
Approved by: https://github.com/ezyang
2023-02-23 02:52:21 +00:00
f98733e976 Fix disbale typos (#95322)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95322
Approved by: https://github.com/clee2000
2023-02-23 02:08:45 +00:00
586ac98cde Bugfix nested mem_efficient path in SDPA when E_qk != E_v (#95330)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95330
Approved by: https://github.com/drisspg, https://github.com/cpuhrsch
2023-02-23 02:06:46 +00:00
78175ceeab [FSDP][Docs] Re-add why reg. post-bwd hook on 1st forward (#95326)
This PR adds back some explanation for why we have the heuristic to only register the post-backward hook on the first forward in the case of multiple forwards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95326
Approved by: https://github.com/fegin
2023-02-23 01:50:25 +00:00
f247129f23 Avoid FPE when running batch norm with zero batch size. (#95324)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95324
Approved by: https://github.com/bdhirsh
2023-02-23 01:26:03 +00:00
a257486bdd coreml_delegate - Add input shape in error when throwing from predicting (#95249)
Summary: This change adds input shape when CoreML throws an errors.

Test Plan: testMCSModelInvalidInputShape tests that the assert throws when invalid input shapes are provided.

Differential Revision: D43449112

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95249
Approved by: https://github.com/mcr229
2023-02-23 00:45:44 +00:00
3ebab9aeff [pt2][inductor] switch dinfo representation (#95302)
Summary:
bypass-github-export-checks

use `dinfo.name` instead of `repr(dinfo)`, as initial results have shown that `dinfo.total_memory` may unexpectedly fluctuate

Test Plan: sandcastle + CI

Differential Revision: D43503558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95302
Approved by: https://github.com/bertmaher
2023-02-23 00:15:29 +00:00
ca7eb1bab2 Preserve meta["val"] on export (#95314)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95314
Approved by: https://github.com/yinghai, https://github.com/voznesenskym
2023-02-22 23:24:57 +00:00
f6f413c6b6 Second part of splitting #91254 in two (#92749)
This handles disabling masks when numel is a multiple of BLOCK.
It currently introduces a performance regression, but the Triton code
it generates does not seem to have any issues: all the change does
is remove xmask from loads/stores in cases where it can safely
be removed. The regression seems to be coming from some issue in the
Triton optimizer.

FWIW, if you try this change with current triton master (instead of
pinned version) it does _not_ cause a performance regression.
However, upgrading to Triton master by itself already causes
significant performance regressions so it's not an option
to just bump up the pin.

I'm going to leave this PR open until we manage to increase
the triton pin past the big refactoring. Once we do that
I will check if it still causes a performance regression.

UPDATE:

The triton pin has been moved and I retried this PR. As expected, there's no longer a performance regression for hf_Bert:

```
tspin python benchmarks/dynamo/torchbench.py  --performance  --backend inductor --float16 --training --batch-size-file $(realpath benchmarks/dynamo/torchbench_models_list.txt) --only hf_Bert -n 5 --diff-branch viable/strict 2> err
batch size: 16
cuda train hf_Bert                             numel_BLOCK                1.175x p=0.00
batch size: 16
cuda train hf_Bert                             viable/strict              1.161x p=0.00
```
Re-opening this, should be okay to merge now I expect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92749
Approved by: https://github.com/jansel
2023-02-22 22:55:05 +00:00
cbac56e244 [BE] Simplify Source.is_nn_module; add some types (#95292)
I am still reading Dynamo source code...

This is an easy PR to simplify `Source.is_nn_module()` to reuse `GuardSource.is_nn_module()` instead of having the `in (...)` check implemented twice. While simplifying that, I thought I might as well add some type annotations for `Source` methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95292
Approved by: https://github.com/ezyang
2023-02-22 22:33:58 +00:00
674ef1f9be Make fx.Transformer.get_attr call tracer to preserve node.meta (#95245)
Currently, transformer creates proxy objects directly for get_attr method. node.meta is lost in this step. In order to keep it, we invoke tracer.create_proxy. Meta data is copied over in tracer.create_proxy and tracer.create_node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95245
Approved by: https://github.com/SherlockNoMad, https://github.com/tugsbayasgalan
2023-02-22 22:33:37 +00:00
c0fa0669f6 Update isend/irecv warning messages for nccl (#95236)
Summary: nccl backend does not support `tag` as mentioned in https://github.com/pytorch/pytorch/issues/94819. Adding a note in the documentation for it.

Example:

<img width="888" alt="image" src="https://user-images.githubusercontent.com/14858254/220464900-094c8063-797a-4bdc-8e25-657f17593fe9.png">

Differential Revision: D43475756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95236
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-02-22 22:00:13 +00:00
7ac511c29a Implement sparse semantics support in gradcheck (#94714)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94714
Approved by: https://github.com/soulitzer, https://github.com/albanD
2023-02-22 20:03:25 +00:00
b6a1c238bd [MPS] Remove mps specialized path in BCE backward (#95220)
Remove mps specialized path in BCE backward as `logit` op has been implemented for mps.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95220
Approved by: https://github.com/soulitzer
2023-02-22 19:43:53 +00:00
69c76ff05e [MPS] Add xlogy op (#95213)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95213
Approved by: https://github.com/kulinseth, https://github.com/soulitzer
2023-02-22 19:43:12 +00:00
5fa937886c [DCP][nit] Rename variables + minor documentation fix for optimizer.py (#95264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95264
Approved by: https://github.com/rohan-varma
2023-02-22 19:07:10 +00:00
3758559a58 Reland "Introduce constrain_range; remove old expr_subs (#95063)" (#95209)
This reverts commit 4e88547c957cdc3a3c87e7b873520638ccfbd667.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95209
Approved by: https://github.com/albanD
2023-02-22 18:16:25 +00:00
d6a8d397da Fix formatting for merge failed message (#95234)
Fixes formatting so that the merge rule shows up on a different line than the "Raised by" text

Follow up to https://github.com/pytorch/pytorch/pull/94932

New version
<img width="433" alt="image" src="https://user-images.githubusercontent.com/4468967/220441349-ac99096d-590a-42c1-b995-4a23b2d9b810.png">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95234
Approved by: https://github.com/huydhn
2023-02-22 18:11:22 +00:00
d88d4145c3 [MPS] Fix Float16 issue with Reduction ops for macOS 12 (#94952)
This would fix the issue with `__rdiv__` with float16
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94952
Approved by: https://github.com/kulinseth
2023-02-22 18:07:56 +00:00
5e47571a13 [MPS] Convolution cleanup; remove unnecessary contiguous calls (#95078)
- Fixes convolution crashes in backward with weights
- Removes unnecessary contiguous calls
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95078
Approved by: https://github.com/kulinseth
2023-02-22 18:04:12 +00:00
02a6d4334b [MPS] Handle broadcasting by expanding src tensor in Copy.mm (#95272)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95272
Approved by: https://github.com/DenisVieriu97
2023-02-22 18:02:42 +00:00
5a8092f058 During export, generate Python TENSOR_MATCH guards (#94970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94970
Approved by: https://github.com/ezyang
2023-02-22 17:28:17 +00:00
8475af7761 [MPS] Cast int64 to int32 for reduction ops (#95231)
- give warnings of converting int64 for reduction ops
- use cast tensor for reduction sum on trace
- unblock trace from running
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95231
Approved by: https://github.com/razarmehr
2023-02-22 17:23:25 +00:00
6ae60b19b7 Revert "During export, generate Python TENSOR_MATCH guards (#94970)"
This reverts commit 5d2eb6d636069a255754289572dfa36ffa35e5a7.

Reverted https://github.com/pytorch/pytorch/pull/94970 on behalf of https://github.com/jeanschmidt due to Requires codev to land internal test changes
2023-02-22 16:49:37 +00:00
5f24b2b1f0 [pt2][inductor] search caches by default (#95134)
Summary: Attempt two at enabling search of the global/local cache by default, regardless of `max_autotune`. The main problem is that Triton template generation seems to be broken in some cases for CI tests (maybe dynamic shapes), but this is going to take more time to figure out. For now, we can just cancel template generation instead of raising an assertion error and filter out the failed templates.

Test Plan: sandcastle + CI

Differential Revision: D43424922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95134
Approved by: https://github.com/jansel
2023-02-22 06:02:17 +00:00
8de4238a31 Add dynamo bench arg --per_process_memory_fraction (#95260)
Simply pipes the arg to the existing torch.cuda API by the same name.

Useful for locally debugging OOMs that happened on a smaller GPU.
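
For reference, a sketch of the torch.cuda call it forwards to (the fraction here is arbitrary):
```python
import torch

# Limit this process to ~50% of the device's memory to reproduce OOMs locally.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.5, device=0)
```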

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95260
Approved by: https://github.com/davidberard98
2023-02-22 05:11:18 +00:00
a4b02a15d3 Support registering op returning symint in python (#95240)
Running an operator registered in python returning a symint will result in the following error:
```
RuntimeError: Unable to cast Python instance of type <class 'torch.SymInt'> to C++ type 'long'
```

The interaction of 2 things make the issue being triggered:
- We use boxed kernel here. For boxed kernel, we need convert py::object to IValue in torch/csrc/autograd/python_variable.cpp pushPyOutToStack .
- In the schema parsing code in torch/csrc/jit/frontend/schema_type_parser.cpp SchemaTypeParser::parseFakeAndRealType , if a SymInt is found, we register a Int type instead (not sure why we do this), and register SymInt as the real type.

The result is we would convert an SymInt to int in pushPyOutToStack and cause the issue.

The fix is to use real type when we convert py::object to IValue.

BTW, registering the same op using C++ API does not trigger the issue.
```
TORCH_LIBRARY(clib, m) {
  m.def("sqsum(SymInt a, SymInt b) -> SymInt", [](SymInt a, SymInt b) -> SymInt {
    return a * a + b * b;
  });
}
```
The reason is, the kernel registered in C++ is unboxed kernel and it does not trigger the code path above that converts an py::object to IValue.
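
A rough sketch of the Python-side registration that used to hit this error, mirroring the C++ example above (the namespace, schema, and dispatch key are illustrative):
```python
import torch
from torch.library import Library

# Hypothetical namespace/op for illustration only.
lib = Library("clib", "DEF")
lib.define("sqsum(SymInt a, SymInt b) -> SymInt")

def sqsum(a, b):
    return a * a + b * b

lib.impl("sqsum", sqsum, "CompositeExplicitAutograd")
print(torch.ops.clib.sqsum(2, 3))  # 13 for plain ints; SymInt returns previously hit the cast error
```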

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95240
Approved by: https://github.com/larryliu0820, https://github.com/ezyang
2023-02-22 04:56:37 +00:00
097679478e [optim] Set defaults to foreach, NOT fused (#95241)
Rolling back the default change for Adam and rectifying the docs to reflect that AdamW never defaulted to fused.

Since our fused implementations are relatively newer, let's give them a longer bake-in time before flipping the switch for every user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95241
Approved by: https://github.com/ngimel
2023-02-22 04:47:32 +00:00
2f547ae613 Remove SHA checksum for bazel http_archive from GitHub (#95039)
An action item from https://github.com/pytorch/pytorch/issues/94346

Although the security practice of setting the checksum is good, it doesn't work when the archive is downloaded from some sites like GitHub because it can change. Specifically, GitHub gives no guarantee to keep the same value forever https://github.com/community/community/discussions/46034.

This also adds a new linter to make sure that SHA checksum from GitHub can be removed quickly.  The WORKSPACE file is actually updated using the new linter:

```
>>> Lint for WORKSPACE:

  Advice (BAZEL_LINTER) format
    Redundant SHA checksum. Run `lintrunner -a` to apply this patch.

    You can run `lintrunner -a` to apply this patch.

     5   5 |
     6   6 | http_archive(
     7   7 |     name = "rules_cuda",
     7     |-    sha256 = "f80438bee9906e9ecb1a8a4ae2365374ac1e8a283897281a2db2fb7fcf746333",
     9   8 |     strip_prefix = "runtime-b1c7cce21ba4661c17ac72421c6a0e2015e7bef3/third_party/rules_cuda",
    10   9 |     urls = ["b1c7cce21b.tar.gz"],
    11  10 | )
--------------------------------------------------------------------------------
    29  28 |   name = "pybind11_bazel",
    30  29 |   strip_prefix = "pybind11_bazel-992381ced716ae12122360b0fbadbc3dda436dbf",
    31  30 |   urls = ["992381ced7.zip"],
    31     |-  sha256 = "3dc6435bd41c058453efe102995ef084d0a86b0176fd6a67a6b7100a2e9a940e",
    33  31 | )
    34  32 |
    35  33 | new_local_repository(
--------------------------------------------------------------------------------
    52  50 |     urls = [
    53  51 |         "https://github.com/gflags/gflags/archive/v2.2.2.tar.gz",
    54  52 |     ],
    54     |-    sha256 = "34af2f15cf7367513b352bdcd2493ab14ce43692d2dcd9dfc499492966c64dcf",
    56  53 | )
    57  54 |
    58  55 | new_local_repository(
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95039
Approved by: https://github.com/ZainRizvi
2023-02-22 04:39:19 +00:00
8d22eb61aa Upgrade setuptools before building wheels (#95265)
Should fix https://github.com/pytorch/builder/issues/1318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95265
Approved by: https://github.com/ngimel
2023-02-22 04:21:09 +00:00
a4d866b1eb Update triton hash (#95247)
Should fix #95082
This commit hash is supposed to fix the sm_89 issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95247
Approved by: https://github.com/ngimel, https://github.com/seemethere
2023-02-22 04:05:00 +00:00
e769371781 [vision hash update] update the pinned vision hash (#95252)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95252
Approved by: https://github.com/pytorchbot
2023-02-22 03:44:38 +00:00
cf6e078c34 Revert "Reland "Introduce constrain_range; remove old expr_subs (#95063)" (#95209)"
This reverts commit f7bf31fff1b72752227459bb51e5682abefcfed7.

Reverted https://github.com/pytorch/pytorch/pull/95209 on behalf of https://github.com/ezyang due to internal sympy is too old
2023-02-22 01:58:58 +00:00
f67d2df933 [ONNX] Refactor validation op-level (#94920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94920
Approved by: https://github.com/BowenBao
2023-02-22 00:06:59 +00:00
c399ee09fe Use PyTorch wheel in Windows CI (#94958)
Per the convo in https://github.com/pytorch/pytorch/pull/93139/files#r1107487994, switching Windows CI to use built PyTorch wheel like other platforms instead of 7z-ing stuffs over.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94958
Approved by: https://github.com/malfet
2023-02-21 23:56:05 +00:00
f70a3430aa [MPS] Add hypot op (#95196)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95196
Approved by: https://github.com/kulinseth
2023-02-21 22:40:20 +00:00
7289d22d67 Use FindCUDAToolkit to find cuda dependencies (#82695)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82695
Approved by: https://github.com/malfet
2023-02-21 22:35:17 +00:00
5d1fec80e3 [BE][CI] remove .jenkins entirely (#92625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92625
Approved by: https://github.com/huydhn
2023-02-21 21:36:04 +00:00
f20c4d2345 Stop printing giant container in test failure message (#95226)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95226
Approved by: https://github.com/albanD
2023-02-21 21:15:02 +00:00
ed4b6d2113 [profiler] update docs with repeat=1 (#95085)
Specifying the number of times to repeat is now required when defining the schedule.
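
A minimal example of a schedule with an explicit repeat (the wait/warmup/active values are arbitrary):
```python
import torch

sched = torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=1)
with torch.profiler.profile(schedule=sched) as prof:
    for _ in range(6):
        torch.randn(8, 8) @ torch.randn(8, 8)
        prof.step()  # advance the schedule each iteration
```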
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95085
Approved by: https://github.com/aaronenyeshi
2023-02-21 21:09:10 +00:00
640b9c80f9 [primTorch] Redefine prim.collapse{,_view} end point to be inclusive (#92017)
This makes `prims.collapse(a, start, end)` match the behavior of
`torch.flatten(a, start, end)` more closely.
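
For reference, `torch.flatten` treats the end dim as inclusive, which `prims.collapse` now matches (shapes below are illustrative):
```python
import torch

x = torch.randn(2, 3, 4, 5)
# Dims 1..2 are collapsed together; the end dim is inclusive.
print(torch.flatten(x, 1, 2).shape)  # torch.Size([2, 12, 5])
```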
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92017
Approved by: https://github.com/mruberry
2023-02-21 20:36:50 +00:00
2622adb980 [primTorch] Make prims.collapse a real prim (#91748)
`prims.collapse` is currently just a plain python function wrapping
`prims.reshape`. This turns it into a real prim, and also factors out some of
the code duplicated with `_collapse_view_aten`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91748
Approved by: https://github.com/lezcano, https://github.com/ngimel
2023-02-21 20:36:50 +00:00
0d2e91573e Reorder the Fx execution order to in-time get_attr rather than putting all get_attr ahead (#95014)
Summary:
Basically, today we do:
[getattr, ..., getattr, call partition1, call partition2]
This change makes getattr just-in-time:
[getattr, call partition1, getattr, call partition2, ...]

Test Plan:
CMF and MAI test result:
https://fb.quip.com/K5J9A7G246Ox

Differential Revision: D43376080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95014
Approved by: https://github.com/angelayi
2023-02-21 20:05:30 +00:00
e5785f1e34 If the input is contiguous, short-circuit infer_size_dv in reshape (#95216)
The main improvement is that this avoids guards from infer_size_dv,
although this also counts as a minor perf improvement too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95216
Approved by: https://github.com/albanD
2023-02-21 19:45:41 +00:00
7b403c8c75 Nvfuser moving python tests and files under nvfuser (#95155)
1. Moving `test_jit_cuda_fuser.py` `test_nvfuser_dynamo.py` `test_nvfuser_frontend.py` under `third_party/nvfuser/python_tests/`.
2. Moving `nvfuser/__init__.py` to `third_party/nvfuser/python/`.
3. Leaving dummy test scripts under `./test/` for CI.
4. Patching `torch/_prims/nvfuser_prims.py` for view/reshape renaming in nvfuser
5. Installing `third_party/nvfuser/python` and `third_party/nvfuser/python_tests` to pytorch root/test directy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95155
Approved by: https://github.com/davidberard98
2023-02-21 19:27:24 +00:00
5d2eb6d636 During export, generate Python TENSOR_MATCH guards (#94970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94970
Approved by: https://github.com/ezyang
2023-02-21 19:12:57 +00:00
307ebacf94 [dynamo 3.11] fix to eval_frame.c (#94102)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94102
Approved by: https://github.com/albanD, https://github.com/jansel, https://github.com/malfet
2023-02-21 18:47:36 +00:00
1123ab8647 [dynamo 3.11] changes to with contexts (#94101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94101
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-21 18:47:36 +00:00
04d931d979 [dynamo 3.11] changes to MAKE_FUNCTION and MATCH_KEYS (#94100)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94100
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-21 18:47:34 +00:00
d5aaf54261 [dynamo 3.11] fix cell/freevar offsets (#94099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94099
Approved by: https://github.com/albanD, https://github.com/jansel
2023-02-21 18:47:32 +00:00
055a9e45aa [dynamo 3.11] changes to LOAD_GLOBAL and function calls (#94098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94098
Approved by: https://github.com/albanD
2023-02-21 18:47:30 +00:00
da98053c6d Fix bug where a github api failure would prevent the label check from failing (#95098)
Fix bug where a github api failure would prevent the check from failing even if we already saw that labels were needed.

Also adds more debugging info to the rate-limit-exceeded error since it's weird to see an error claiming the rate limit has been exceeded when the "Used" amount is way below the limit.  I suspect these happen when the request arrived just before the rate reset time, but the response was generated right after the reset time, hence the apparently tiny "used" amounts.

Example run where the check should have failed, but passed instead:
https://github.com/pytorch/pytorch/actions/runs/4200205209/jobs/7285979824
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95098
Approved by: https://github.com/huydhn
2023-02-21 18:42:12 +00:00
311b20aae1 [fix] torch.pow handle real negative base and complex exponent (#95198)
Fixes https://github.com/pytorch/pytorch/issues/89903 https://github.com/pytorch/pytorch/issues/95111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95198
Approved by: https://github.com/albanD, https://github.com/ngimel
2023-02-21 18:36:20 +00:00
976d289e86 Fix update_pytorch_labels workflow (#95227)
Pass in repo args now that they're required (after a recent refactor). Also changes the script to pass in the repo name instead of being hardcoded to pytorch/pytorch.

I'm guessing this wasn't noticed earlier since the workflow is only triggered when a label is created/edited/deleted
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95227
Approved by: https://github.com/huydhn
2023-02-21 18:26:21 +00:00
b0f22f8d2b Use run_subtests utility in FSDP test_state_dict_save_load_flow test (#95090)
Converts the single-instance of `self.subTest` in `test_fsdp_state_dict.py` to use the `run_subtests` utility.

Related: #84171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95090
Approved by: https://github.com/awgu
2023-02-21 18:24:41 +00:00
bef3c02330 try triton with remat fix (#94882)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94882
Approved by: https://github.com/malfet
2023-02-21 18:06:48 +00:00
f7bf31fff1 Reland "Introduce constrain_range; remove old expr_subs (#95063)" (#95209)
This reverts commit 4e88547c957cdc3a3c87e7b873520638ccfbd667.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95209
Approved by: https://github.com/albanD
2023-02-21 18:02:48 +00:00
ce950b412f Reland "Add torch.empty_permuted (#95069)" (#95208)
This reverts commit 92e03cd583c027a4100a13682cf65771b80569da.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95208
Approved by: https://github.com/albanD
2023-02-21 18:02:48 +00:00
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
c6d8d10b3e Fix warning if backend registers timer (#91702)
Currently the logger timer is registered by default for CPU/CUDA. Other backends may or may not register this timer. The code reports a warning and returns for any other backend, which is not expected when that backend has actually registered the timer. For example, the HPU (Habana) backend registers this timer, yet a warning is still reported and the function returns early, which is incorrect.

The other case is the lazy backend, whose timer is never registered; that is why the check was added, but it fails for other backends.

Add a generic check: if the timer is registered, don't report the warning.

Signed-off-by: Jeeja <jeejakp@habana.ai>

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91702
Approved by: https://github.com/kit1980
2023-02-21 14:09:47 +00:00
7ca623c2e1 Fix convit_base (#95174)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95174
Approved by: https://github.com/ngimel, https://github.com/jansel, https://github.com/atalman
2023-02-21 14:07:59 +00:00
92e03cd583 Revert "Add torch.empty_permuted (#95069)"
This reverts commit bedeb1f014795c497f11942ff4c772431d1c157a.

Reverted https://github.com/pytorch/pytorch/pull/95069 on behalf of https://github.com/jeanschmidt due to Breaking internal builds. More in https://fburl.com/phabricator/ztrxrroq
2023-02-21 12:05:20 +00:00
079476c6b2 Add a check for n<0 and a test for it (#95144)
Fixes [#94740](https://github.com/pytorch/pytorch/issues/94740). Adds a check in `aten/src/ATen/native/ReduceOps.cpp` and a test case in test/test_torch.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95144
Approved by: https://github.com/lezcano
2023-02-21 10:57:06 +00:00
4e88547c95 Revert "Introduce constrain_range; remove old expr_subs (#95063)"
This reverts commit 3711f7c59f772190059ebee7fbd58978e1082267.

Reverted https://github.com/pytorch/pytorch/pull/95063 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, more details can be found: https://fburl.com/phabricator/fq5b6k8a
2023-02-21 10:43:39 +00:00
cyy
1ab112cfab code is clean enough that some warnings can be enabled (#95139)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95139
Approved by: https://github.com/Skylion007
2023-02-21 07:24:20 +00:00
e0a0329a67 [MPS] Add hardsigmoid op (#95164)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95164
Approved by: https://github.com/kulinseth
2023-02-21 07:06:37 +00:00
d96aac8d2a [MPS] Add logit op (#95162)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95162
Approved by: https://github.com/kulinseth
2023-02-21 07:02:45 +00:00
062380db91 Fix Typo (#95173)
Summary: Fix Typo

Test Plan: sandcastle & github

Differential Revision: D43417472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95173
Approved by: https://github.com/nmacchioni, https://github.com/Skylion007
2023-02-21 04:07:07 +00:00
aa042a57cd [inductor] fix max_pool2d with ceil mode (#94887)
Fixes #94775

When ceil mode turns on, max_pool2d has a bug allowing a sliding window to be entirely off bounds. This PR restricts sliding windows to start within the input or left padding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94887
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
2023-02-21 01:58:19 +00:00
1aea2d2ec3 for SymInt nodes in fx graph, get result from node meta in inductor GraphLowering (#95152)
Finally, swin is passing, with no floors in the generated code.
I don't know how to write a test for it though, and swin patterns triggering this are pretty complicated (even prior to this PR we were already good at pulling `floors` out of device code).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95152
Approved by: https://github.com/ezyang
2023-02-21 01:35:41 +00:00
77dae43767 Don't truncate leading 1s if they are unbacked (#95141)
This prevents us from guarding on leading unbacked SymInts.

The previous attempt at https://github.com/pytorch/pytorch/pull/94521 I got the logic a bit wrong. My idea there was to avoid slicing when the values to be set have low enough dimensionality that they definitely aren't too long. To do this, I need to compute the difference between the data to be set, and the post-slice space for the values. But I incorrectly compared against the *pre-slice* space in the original PR. Another version of this PR which is wrong is to compare against variableIndices.size(); but remember that in advanced indexing with tensors/lists, each of the individual indices specify what coordinates to read out of each dimension! A third incorrect attempt tested `variableIndices[0].dim()`, which is only correct if you don't broadcast one of the later variable indices, and if there are enough variableIndices to cover all dims. This is all quite complicated, so I went for a simpler solution of checking if the leading dim had a hint before testing if it is not equal to one.

BTW, previously there was no test for this 1-stripping behavior. There is now a test for it, based on the real code that caused the problem.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95141
Approved by: https://github.com/ngimel
2023-02-21 00:22:24 +00:00
f54233e273 [foreach] bump tensor's version and define backward via torchgen (as possible) (#93901)
## summary
- increment tensor versions in inplace foreach functions
- add a logic to take care of `ArrayRef<Scalar>`

rel: https://github.com/pytorch/pytorch/issues/58833, https://github.com/pytorch/pytorch/pull/89591

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93901
Approved by: https://github.com/albanD
2023-02-20 23:18:07 +00:00
83b5eb4e16 [sympy] fix ValueRanges.pow error when b.lower is float (#95151)
Summary:
fix `TypeError: 'Float' object cannot be interpreted as an integer` for `ValueRanges.pow(a, b)` when `not a.is_singleton() and b.is_singleton() and not isinstance(b.lower, int)`

this is breaking  `cuda11.7-py3.10-gcc7-sm86 / test (inductor_timm, 1, 2, linux.g5.4xlarge.nvidia.gpu)`
{F878635541}

Test Plan: sandcastle + CI

Differential Revision: D43430385

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95151
Approved by: https://github.com/Skylion007
2023-02-20 22:55:24 +00:00
679e5dbfa1 [executorch] Always generate CustomOpsNativeFunctions.h if custom_ops.yaml is present (#95084)
To match the build system logic, enforce CustomOpsNativeFunctions.h to be generated if we have custom_ops.yaml, even if we don't select any custom ops.

Added unit test.

Differential Revision: [D43402718](https://our.internmc.facebook.com/intern/diff/D43402718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95084
Approved by: https://github.com/iseeyuan
2023-02-20 18:54:41 +00:00
da41003b5f [MPS] Fix the uint8 type issue with View ops kernels (#95145)
This should fix the problem in Resnet model with image artifacts due to saturation on int8 type and also the incorrect class recognition reported in #86954.

Fixes #86954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95145
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-20 18:09:20 +00:00
08370ddad8 Update model skips (#95089)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95089
Approved by: https://github.com/albanD
2023-02-20 13:24:49 +00:00
4d753b5045 [WIP][dynamo] simplify module_key creation logic (#94945)
After some thought, I find it difficult to come up with a robust naming convention that satisfies the following constraints at the same time: 1. the new name should be a valid nn.Module attribute (as required by the minifier, and a good thing to have in general) 2. it can cover various cases such as GetItemSource, GetAttrSource 3. it's easy to recover the original path 4. it's robust to users' naming schemes.

Thanks to @yanboliang for pointing out that the original access path is preserved in Source; now we just need to add an additional value, source.name(), to node.meta["nn_module_stack"] to get the access path in the original module.

We also address some TODO in quantization, which relies on the original naming convention in nn_module_stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94945
Approved by: https://github.com/jansel, https://github.com/yanboliang
2023-02-20 07:28:04 +00:00
954c767bc6 [Inductor] Enable accuracy test for CPPBackend (#94898)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94898
Approved by: https://github.com/jgong5, https://github.com/desertfire
2023-02-20 05:02:15 +00:00
3dcf8b6140 [Fix] Inbound check of sorter indices in searchsorted (#95109)
Fixes https://github.com/pytorch/pytorch/issues/91606, but in C++14 style.

Prior fix (https://github.com/pytorch/pytorch/pull/94863) was in C++17, which might break some builds.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95109
Approved by: https://github.com/ngimel
2023-02-20 04:59:11 +00:00
286d821e61 Don't replace FloorDiv with floor in simplify, do simplifications for divisible exprs (#95076)
I don't see why `floor` is better than `FloorDiv` and solve with `FloorDiv` doesn't work anyway (the solution wouldn't be unique even if it worked).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95076
Approved by: https://github.com/jansel, https://github.com/malfet, https://github.com/nkaretnikov
2023-02-20 01:53:54 +00:00
bedeb1f014 Add torch.empty_permuted (#95069)
torch.empty_permuted is a generalized version of torch.empty(memory_format=...), where you can pass an arbitrary physical layout as a tuple of dims to allow you to setup dense, non-overlapping tensors with non-standard memory format. Check the docblock for a full description of semantics.

The initial motivation for this PR is with guard-less unbacked SymInts. Traditionally, the way we allocate dense tensors with arbitrary layout is with `empty_strided`. However, `empty_strided` does not know that the given strides are actually contiguous, and must test this manually to find out if it is the case. With `empty_permuted`, this is known statically to be the case and helps us skip some 0/1 guards.

However, I also think torch.empty_permuted is a useful API in its own right. It is technically possible to simulate this with an empty and a permute; however, there are some downsides:

* The manual incant is tricky to work out. To allocate an NHWC tensor, the invocation is `torch.empty(N, H, W, C).permute(0, 3, 1, 2)`; the permute call has to take NHWC to NCHW, and is the *inverse* of the permutation people are typically thinking of when they talk about NHWC (0, 2, 3, 1). Instead, torch.empty_permuted lets you say `torch.empty_permuted((N, C, H, W), (0, 2, 3, 1))`, letting you provide the intuitive permutation. It can be literally be read off as NHWC if you assign N=0, C=1, H=2, W=3.
* An empty(requires_grad=True).permute() is no longer a leaf tensor. You can force it to be a leaf with a detach(), but it is more straightforward and less error prone to allow directly allocating a tensor with the correct permutation.

It is also technically possible to simulate this with empty_strided. However, this requires the user to manually compute the contiguous output strides and is bad from a reduction of guards perspective. For what it's worth, this is one of the more common uses of as_strided in the wild, and it would be nice to get rid of it.

A nice enhancement of this feature would be to accept `physical_layout` anywhere `memory_format` is accepted. However, this would be a pretty involved change, so I'm doing the easy thing instead.
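
A minimal sketch of the NHWC case described above:
```python
import torch

N, C, H, W = 2, 3, 4, 5
# Allocate an NCHW-shaped tensor laid out physically as NHWC.
t = torch.empty_permuted((N, C, H, W), (0, 2, 3, 1))
print(t.shape)     # torch.Size([2, 3, 4, 5])
print(t.stride())  # matches torch.empty(N, H, W, C).permute(0, 3, 1, 2).stride()
```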

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95069
Approved by: https://github.com/malfet, https://github.com/ngimel, https://github.com/albanD, https://github.com/dagitses
2023-02-20 00:23:10 +00:00
50ec4ddb70 fix 'sympy.core.logic' has no attribute 'boolalg' (#95130)
Summary: fix module error by directly importing `sympy.logic.boolalg.Boolean`

Test Plan: CI

Differential Revision: D43423823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95130
Approved by: https://github.com/Skylion007
2023-02-20 00:09:57 +00:00
567362cedb [inductor] move dynamic shapes tests into a new file (#94971)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94971
Approved by: https://github.com/ezyang
2023-02-20 00:01:48 +00:00
3711f7c59f Introduce constrain_range; remove old expr_subs (#95063)
This PR introduces a new `constrain_range` function which can be used to constrain the possible values a SymInt/SymFloat can take on. This knowledge can be then used to discharge potential guards (by running the range analysis, and then seeing if the guard must be true given the original range) without adding another guard.

The usage of ranges is very limited right now; ranges are only constrained when the user explicitly instructs the system so. However, we can also infer range constraints based on guards as well; this is left for future work.
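
To illustrate the idea only (a toy sketch, not the actual API or its signature): once a symbol is known to lie in a range, a guard can be checked against that range instead of being recorded.
```python
# Toy illustration of discharging a guard via range analysis.
# The real constrain_range works on SymInt/SymFloat inside the symbolic
# shapes machinery; the names and types here are made up.
from dataclasses import dataclass

@dataclass
class Range:
    lower: int
    upper: int

def guard_lt(r: Range, bound: int) -> bool:
    # "s < bound" holds for every value in the range, so no new guard is needed.
    return r.upper < bound

s0 = Range(lower=2, upper=8)   # e.g. the user asserted 2 <= s0 <= 8
assert guard_lt(s0, 10)        # discharged statically; no guard added
```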

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95063
Approved by: https://github.com/eellison
2023-02-19 23:17:09 +00:00
f89ae0a7f4 Revert "Only truncate leading 1s if the value is too big. (#94521)"
This reverts commit 03f4a63fd86fe2d22202c7aee6a4e62c13b4f561.

Reverted https://github.com/pytorch/pytorch/pull/94521 on behalf of https://github.com/ezyang due to fails internal tests
2023-02-19 15:05:56 +00:00
06489a3c1c [functorch] roll : fix batching rule for scalar tensor (#95048)
Fixes https://github.com/pytorch/pytorch/issues/94925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95048
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2023-02-19 09:30:30 +00:00
039b4c8809 Add meta function for _upsample_bilinear2d_aa (#94982)
Differential Revision: D43353000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94982
Approved by: https://github.com/ezyang
2023-02-19 07:11:20 +00:00
17d0b7f532 [pt2][inductor]global autotuning cache (#94922)
Summary:
This diff adds logic to handle a global autotuning cache, stored in JSON format at config.global_cache_path.

What is changing from `DiskCache`:
* `DiskCache` is renamed to `PersistentCache`
* the local cache is now stored as a single file in json format, located at `/tmp/torchinductor_{$USER}/local_cache`. the file contains a dictionary structure like `local_cache[name][inputs][choice]` where `name` is the type of operation, like `addmm`, `inputs` is the repr of the inputs, and `choice` is the hash of a `ChoiceCaller`. the stored value is the benchmark time for that `ChoiceCaller`.
* a global cache is added, initially stored at `fbcode/caffe2/torch/_inductor/global_cache`, with almost identical format as the local cache. since the global cache exists over different machines, there is an additional `dinfo` field, such that `global_cache[dinfo] = local_cache` (at least structure wise, there is no guarantee that the global cache and local cache share the same values). `dinfo` is just a repr of the cuda device properties.
* the autotuner will prioritize the global cache, and return values from there first, before looking in the local cache
* the autotuner will look in both the global cache and the local cache even when `max_autotune=False`, but will still only generate values if `max_autotune=True`.
* the autotuner will log global cache hits and misses to a scuba table (inductor_autotuning_cache) which will be used to update the global cache at regular intervals

Test Plan: D43285472

Differential Revision: D42785435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94922
Approved by: https://github.com/jansel
2023-02-19 05:35:18 +00:00
3f381473cd [blob inspector] free memory from workspace for di blobs post stats (#95064)
Differential Revision: D43250357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95064
Approved by: https://github.com/michaelay
2023-02-19 05:05:35 +00:00
a17a7ccc92 [MPS] LogSoftmax numerical stability (#95091)
Fixes #94043

Calculations are now consistent with numericaly stable formula and CPU:

$\mathrm{LogSoftmax}(X, \dim) = X - \max(X, \dim) - \log\big(\operatorname{sum}(\exp(X - \max(X, \dim)), \dim)\big)$

@malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95091
Approved by: https://github.com/malfet, https://github.com/kulinseth
2023-02-18 18:26:29 +00:00
9511b9fad2 [MPS] Fix copy_cast_mps() on tensors with storage offset (#95093)
- The copy_cast path requires storage_offset to be applied before casting
- This should fix some correctness issues in transformer models

Fixes #94980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95093
Approved by: https://github.com/kulinseth
2023-02-18 16:29:01 +00:00
25ee6dd335 [MPS] Fix fill_ where input tensor has a storage offset (#95113)
Fixes #94390

Apart from fixing the issue above, this PR also fixes a bug that when an input tensor can be sliced, a sliced array view is created. This array view seems to be not writable or have a different storage from the original tensor, causing incorrect results with the in-place `fill`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95113
Approved by: https://github.com/kulinseth
2023-02-18 16:19:15 +00:00
57830a9655 [vision hash update] update the pinned vision hash (#95106)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95106
Approved by: https://github.com/pytorchbot
2023-02-18 03:30:18 +00:00
9bb2fe3eae fix numpy1.24 deprecations in unittests (#93997)
Fixes https://github.com/pytorch/pytorch/issues/91329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93997
Approved by: https://github.com/ngimel, https://github.com/jerryzh168
2023-02-18 00:59:09 +00:00
9dbfca7840 Add various uninterpreted bit tensor data types (#94992)
Summary:

This PR adds a set of uninterpreted data types to PyTorch which can be used to implement experimental functionality out of core (think fp8, int4, int16 quant, etc).

Note: this is a copy-paste of https://github.com/pytorch/pytorch/pull/89990 with a bug fix for clang9; it was easier to just put up another PR since I'm not sure how commandeering works with Meta-only changes.

@bypass-github-export-checks

Test Plan:

```
python test/test_quantization.py -k TestBits
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94992
Approved by: https://github.com/angelayi
2023-02-18 00:04:30 +00:00
e44737e619 Revert "Update error messages to reflect why test is skipped (#95049)"
This reverts commit 22e797a8786ffbb1f3b947b70cd8647cc43d6f3e.
2023-02-17 15:41:02 -08:00
8928e7bdb8 Raise error on 3.11 dynamo export (#95088)
For https://github.com/pytorch/pytorch/issues/94914. Realized that `dynamo.export` doesn't immediately raise an error when dynamo is trying to run on 3.11/windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95088
Approved by: https://github.com/weiwangmeta
2023-02-17 23:33:38 +00:00
4fc277c338 [Quant] Add lowering for pixel_shuffle (#94769)
Summary: `torch.nn.functional.pixel_shuffle` accepts both float
and quantized inputs. However, previously we would unnecessarily
dequantize quantized inputs into floats before passing them to
the function. This commit fixes this by lowering the pattern
[dequant - pixel_shuffle - quant].
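
A minimal sketch of the pattern (scale/zero_point are arbitrary):
```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 8, 4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)
# pixel_shuffle accepts the quantized tensor directly; no dequant/requant is needed.
out = F.pixel_shuffle(qx, upscale_factor=2)
print(out.shape, out.is_quantized)  # torch.Size([1, 2, 8, 8]) True
```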

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_pixel_shuffle

Reviewers: vkuzo

Subscribers: vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94769
Approved by: https://github.com/vkuzo
2023-02-17 23:11:17 +00:00
c16b2916f1 Back out "fix: make sure sorter indices are inbound in searchsorted (#94863)" (#95086)
Summary:
Original commit changeset: 96a2200d1fd8

Original Phabricator Diff: D43342962

Test Plan: Sandcastle and land castle as well as buck2 build mode/opt //frl/et/projects/Masquerade/stable/datasets/masquerade/c6p7:post_processing

Reviewed By: seemethere, bigfootjon

Differential Revision: D43402398

@bypass-github-export-checks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95086
Approved by: https://github.com/bigfootjon
2023-02-17 22:48:22 +00:00
22e797a878 Update error messages to reflect why test is skipped (#95049)
Summary: Update error messages to reflect why test is skipped

Test Plan: github

Differential Revision: D43386390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95049
Approved by: https://github.com/nmacchioni, https://github.com/cpuhrsch
2023-02-17 22:42:25 +00:00
500ebb2cd6 Fine grained dynamic shape controls (#94787)
https://docs.google.com/document/d/1aoIyYE8_6cYpWqS25thzVoIiKsT5aaUEOiiPwbIXt8k/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94787
Approved by: https://github.com/ezyang
2023-02-17 22:28:37 +00:00
30c07722d1 Revert "Inductor: fix incorrect result of inplace unsqueeze (#94797)"
This reverts commit 6ae06e49ac92442e583f05e6b88f58670cecebaa.

Reverted https://github.com/pytorch/pytorch/pull/94797 on behalf of https://github.com/ezyang due to bad approach, and can lead to subtle further bugs
2023-02-17 22:22:27 +00:00
17c149ad9e Revert "[CI] Use prebuilt triton from nightly repo (#94732)"
This reverts commit 18d93cdc5dba50633a72363625601f9cf7253162.

Reverted https://github.com/pytorch/pytorch/pull/94732 on behalf of https://github.com/kit1980 due to Reverting per offline discussion to try to fix dynamo test failures after triton update
2023-02-17 21:51:25 +00:00
0205ffb8d9 Fix expired deprecation of comparison dtype for NumPy 1.24+ (#91517)
> The `dtype=` argument to comparison ufuncs is now applied correctly. That
> means that only `bool` and `object` are valid values and `dtype=object` is
> enforced.

Source: https://numpy.org/doc/stable/release/1.24.0-notes.html#expired-deprecations
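
For illustration, assuming NumPy >= 1.24:
```python
import numpy as np

a = np.arange(3)
np.equal(a, a, dtype=bool)          # fine: bool (and object) remain valid
# np.equal(a, a, dtype=np.float64)  # now raises instead of emitting a DeprecationWarning
```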

Fixes #91516

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91517
Approved by: https://github.com/zou3519, https://github.com/huydhn
2023-02-17 21:11:03 +00:00
d5d55363d9 Add broadcastable check to index_put (#94849)
Copy-n-paste it from
989299802c/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L582-L583)

Which is used for both CPU and CUDA checks, unless op is called for GPU with `deterministicAlgorithms()` set to true

Followup: do the same for XLA and fix the case when indices are not null
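
A rough sketch of the kind of call the new check rejects (shapes are illustrative):
```python
import torch

x = torch.zeros(5)
idx = torch.tensor([0, 1, 2])
x.index_put_((idx,), torch.tensor([1.0, 2.0, 3.0]))  # values match the 3 indexed elements: OK
try:
    x.index_put_((idx,), torch.ones(4))  # 4 values are not broadcastable to 3 indexed elements
except RuntimeError as e:
    print("rejected:", e)
```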

Fixes https://github.com/pytorch/pytorch/issues/94667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94849
Approved by: https://github.com/ngimel
2023-02-17 20:37:23 +00:00
e0ede1cc30 Revert "Fine grained dynamic shape controls (#94787)"
This reverts commit 2aa806608bc28a401292255a621f03ec507134f9.

Reverted https://github.com/pytorch/pytorch/pull/94787 on behalf of https://github.com/kit1980 due to After this PR, test_autocast_sdpa_dynamic_shapes_static_default started to fail with RuntimeError: Cannot call sizes() on tensor with symbolic sizes/strides: https://github.com/pytorch/pytorch/actions/runs/4206176846/jobs/7299657478
2023-02-17 19:52:16 +00:00
0a9c608461 [MPS] Fix tensor with non-zero storage offset graph gathering (#91071)
Previously, the "can slice" flag in Placeholder constructor in `OperationUtils.mm` is conditioned on whether the numbers of dimensions of base shape and view shape are the same. This doesn't consider the situation that a view tensor could be the base tensor's sliced and then unsqueezed version, resulting in different num of dims.

For example, if we want to stack `y_mps` and `x_mps` on the last dim:
```
t_mps = torch.tensor([1, 2, 3, 4], device="mps")
x_mps = t_mps[2:]  # [3, 4]
y_mps = t_mps[:2]  # [1, 2]

res_mps = torch.stack((y_mps, x_mps), dim=-1)
```

the kernel will unsqueeze both of them on the last dim and then concatenate them, which is equivalent to:

```
res_mps = torch.cat((y_mps.unsqueeze(-1), x_mps.unsqueeze(-1)), dim=-1)
```

`x_mps.unsqueeze(-1)` is an unsqueezed and contiguous tensor with a storage offset, this kind of tensors should be sliceable without cloning its storage.

Fixes #87856
Fixes #91065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91071
Approved by: https://github.com/kulinseth
2023-02-17 18:44:20 +00:00
5de3ead712 [MPS] Add optional minor argument to is_macos13_or_newer (#95065)
Will be needed if one wants to make accurate XFAIL validation

I.e. `torch.backends.mps.is_macos13_or_newer()` will return True if PyTorch is running on MacOS 13.0 or newer, `torch.backends.mps.is_macos13_or_newer(1)` will return True if running on MacOS 13.1 or newer and `torch.backends.mps.is_macos13_or_newer(2)` will return True  if running on MacOS 13.2 or newer

Do not use 13.3 check as `@available` does not really work for shared libraries

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95065
Approved by: https://github.com/albanD
2023-02-17 18:30:20 +00:00
c43e88665a [Resubmit] helpers to torch.dist.utils (#95025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95025
Approved by: https://github.com/fegin
2023-02-17 18:24:20 +00:00
2aa806608b Fine grained dynamic shape controls (#94787)
https://docs.google.com/document/d/1aoIyYE8_6cYpWqS25thzVoIiKsT5aaUEOiiPwbIXt8k/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94787
Approved by: https://github.com/ezyang
2023-02-17 17:39:22 +00:00
766d51b496 [export] Add a data type for representing export workflow information. (#95013)
Upstreaming some of our internal work to OSS so that we can get a better
preview of how the export pipeline works. There'll be more modularized work
sent in later.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95013
Approved by: https://github.com/tugsbayasgalan
2023-02-17 16:28:17 +00:00
c137d3d688 inductor: enable lowering for bitwise_right_shift (#94997)
The triton pin has been moved past the relevant bug fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94997
Approved by: https://github.com/Skylion007, https://github.com/jansel
2023-02-17 15:51:34 +00:00
d978395f55 Deprecate Caffe2 ONNX exporter (#94994)
Discussed on Weekly meeting with Meta on 2/16/2023 with @kit1980 @malfet

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94994
Approved by: https://github.com/Skylion007, https://github.com/BowenBao
2023-02-17 15:41:11 +00:00
2f9ffe7b0a Add torch.utils._sympy.interp (#94985)
This utility allows us to conveniently abstract interpret Sympy expressions with respect to some alternative domain. I am particularly interested in using ValueRanges to do range analysis on expressions (not this PR).
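
A hypothetical sketch of the general idea (abstract interpretation of a sympy expression over an interval domain); the helper below is purely illustrative and is not the actual `torch.utils._sympy.interp` API:

```python
import sympy

def interval_interp(expr, env):
    # Evaluate `expr` over (lo, hi) intervals, given per-symbol ranges in `env`.
    if expr.is_Symbol:
        return env[expr]
    if expr.is_Number:
        return (expr, expr)
    if expr.is_Add:
        los, his = zip(*(interval_interp(a, env) for a in expr.args))
        return (sum(los), sum(his))
    if expr.is_Mul:
        lo, hi = interval_interp(expr.args[0], env)
        for a in expr.args[1:]:
            alo, ahi = interval_interp(a, env)
            corners = [lo * alo, lo * ahi, hi * alo, hi * ahi]
            lo, hi = min(corners), max(corners)
        return (lo, hi)
    raise NotImplementedError(expr.func)

x, y = sympy.symbols("x y")
print(interval_interp(x * y + 2, {x: (0, 3), y: (-1, 4)}))  # (-1, 14)
```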

Some minor house-keeping:
* ReferenceAnalysis got moved to its own file, sprouted a constant() implementation, and some uses of math.* got converted to sympy.*
* ValueRangeAnalysis now understands mod
* Test file gets moved from `test_value_ranges.py` to `test_sympy_utils.py`

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94985
Approved by: https://github.com/eellison
2023-02-17 14:28:18 +00:00
ccef485221 Add boolean/comparison operator support to ValueRanges (#94944)
Pretty straightforward.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94944
Approved by: https://github.com/lezcano
2023-02-17 14:28:18 +00:00
08ef83f07c Add exhaustive testing to ValueRanges, fix bugs (#94939)
Since I didn't want to deal with nondeterministic tests, I went the exhaustive testing route for a fixed list of constants to look at. The tests generate random ranges, propagate the range through the function, and then pick elements in the range and check that the result on the operation is in the resulting range. This caught bugs in log, sqrt and pow.
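
A rough sketch of that soundness check for a single op over plain float intervals (illustrative, not the actual test harness):

```python
import math

def sqrt_range(lo, hi):
    # Propagate an input range [lo, hi] through sqrt (assumes lo >= 0).
    return (math.sqrt(lo), math.sqrt(hi))

def check_soundness(fn, fn_range, lo, hi, samples=100):
    out_lo, out_hi = fn_range(lo, hi)
    for i in range(samples + 1):
        x = lo + (hi - lo) * i / samples   # sample points inside the input range
        y = fn(x)
        assert out_lo <= y <= out_hi, (x, y, out_lo, out_hi)

check_soundness(math.sqrt, sqrt_range, 0.25, 9.0)
```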

My resolution for pow was a little special, because I had trouble figuring out the correct semantics under all inputs domains. Instead, I picked two input domains (pow on two point ranges, and pow where exponent is known) and only implemented those. Everything else we give up. I think this is unlikely to affect perf.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94939
Approved by: https://github.com/lezcano, https://github.com/eellison, https://github.com/nunoplopes
2023-02-17 14:28:15 +00:00
12c9a932ca Assert more invariants on ValueRanges (#94906)
The main new invariant is lower/upper must be a Sympy expression of some sort (filtered through `simple_sympify`). There are some simpler sanity checks (mostly making sure the range is well formed). There is a type confusion problem (it's not immediately obvious if a range is for float/int/bool) but we aren't going to solve this for now as it is more complicated.

Billing of changes:

* ValueRanges.wrap() now accepts sympy expressions
* ValueRanges now accepts non-sympy expressions and will sympyify them appropriately. Rewrite calls to ValueRanges to not sympify manually as it is unnecessary
* Don't attempt to test sqrt(-1)
* Add ValuesRanges.unknown() which gives -oo, oo bounds, and rewrite direct calls to -math.inf, math.inf to use it
* Make multiply work between ValueRanges.unknown() and ValueRanges.wrap(0)
* Consistently use sympy.oo instead of math.inf

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94906
Approved by: https://github.com/eellison
2023-02-17 14:28:11 +00:00
950a9efcc3 [Dynamo] Enable test_autocast_sdpa (#95011)
Enable test_autocast_sdpa since the blocker has been removed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95011
Approved by: https://github.com/drisspg
2023-02-17 09:37:25 +00:00
cyy
2cf1a7d79b Fix clang warnings and other minor issues (#94975)
Fix various clang warnings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94975
Approved by: https://github.com/Skylion007
2023-02-17 08:59:14 +00:00
45d775cedb [BE] Cleanup triton builds (#95026)
Remove Python-3.7 clause
Do not install llvm-11, as llvm-14 is installed by triton/python/setup.py script

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95026
Approved by: https://github.com/osalpekar, https://github.com/weiwangmeta
2023-02-17 05:55:36 +00:00
a2afc657da [MPS] Fix upsample for NHWC output (#94963)
Fixes https://github.com/huggingface/diffusers/issues/941

**Before**:
<img width="1144" alt="Screenshot 2023-02-15 at 8 11 53 PM" src="https://user-images.githubusercontent.com/104024078/219266709-6a77636a-2fc0-4802-b130-85069b95953f.png">

**After**:
<img width="1144" alt="Screenshot 2023-02-15 at 8 12 02 PM" src="https://user-images.githubusercontent.com/104024078/219266694-ea743c02-fb55-44f1-b7d6-5946106527c3.png">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94963
Approved by: https://github.com/razarmehr
2023-02-17 05:07:22 +00:00
a8cbf70ffc Inductor support for aten::all_reduce (#93111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93111
Approved by: https://github.com/jansel, https://github.com/wanchaol
2023-02-17 04:42:04 +00:00
5d1e9fd214 [MPS] Fix prelu backward pass (#94933)
Allocate the correct shape for the weights gradient
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94933
Approved by: https://github.com/razarmehr
2023-02-17 03:45:12 +00:00
acc1dfe670 [vision hash update] update the pinned vision hash (#95017)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95017
Approved by: https://github.com/pytorchbot
2023-02-17 03:32:32 +00:00
16a4579335 [FSDP] [composable] [BE] warning should read TorchRec, not DMP (#95010)
Summary: as title

Test Plan: N/A

Differential Revision: D43375189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95010
Approved by: https://github.com/awgu, https://github.com/fegin
2023-02-17 03:31:30 +00:00
e5496ebcac [torch] [composable] [analytics] add analytics logging to PT-D composable APIs (#95016)
Summary: as title

Test Plan: N/A

Differential Revision: D43376274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95016
Approved by: https://github.com/awgu, https://github.com/rohan-varma, https://github.com/fegin
2023-02-17 02:49:16 +00:00
13ebffe088 [CUDA] sm_87 / Jetson Orin support (#95008)
Surfaced from #94438 CC @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95008
Approved by: https://github.com/ezyang
2023-02-17 02:22:23 +00:00
0dffbcd4fa Remove unnecessary TensorMeta rewrap (#95004)
Extracted from https://github.com/pytorch/pytorch/pull/94523

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95004
Approved by: https://github.com/voznesenskym, https://github.com/ngimel, https://github.com/Skylion007
2023-02-17 00:52:37 +00:00
d9950c5215 Hard code known true contiguity settings for unbacked SymInts (#95003)
Extracted from https://github.com/pytorch/pytorch/pull/94523 which has E2E test

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95003
Approved by: https://github.com/voznesenskym, https://github.com/ngimel
2023-02-17 00:42:41 +00:00
a2f44d82f8 Flag guard unbacked SymInt/SymFloat support (#94987)
I believe this fixes the AllenaiLongformerBase problem in periodic.

The longer version of the problem: we are currently optimistically converting all item() calls into unbacked SymInt/SymFloat, but sometimes this results in a downstream error due to a data-dependent guard. Fallbacks for this case are non-existent; this will just crash the model. This is bad. So we flag guard until we get working fallbacks.

What could these fallbacks look like? One idea I have is to optimistically make data-dependent calls unbacked, but then if it results in a crash, restart Dynamo analysis with the plan of graph breaking when the item() call immediately happened.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94987
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-17 00:25:05 +00:00
30d0112bf3 fix performance issue in torch.sparse.mm reduce mode (#94969)
Fix performance bug for `torch.sparse.mm()` with reduce flag.

Found this bug within internal benchmarking.
A mistake made when updating the previous patch caused load imbalance between threads:

Test on ogbn-products datasets on Xeon CLX with 24 cores:

#### before
```
sparse.mm: mean: 1156.148 ms
sparse.mm: sum: 1163.754 ms
sparse.mm: (using mkl): 703.227 ms
```

#### after
```
sparse.mm: mean: 662.578 ms
sparse.mm: sum: 662.301 ms
sparse.mm: (using mkl): 700.178 ms
```

The result also indicates that the current spmm kernel is no worse than MKL's sparse_mm.

Also update results on `pyg benchmark` with:
```
python gnn.py --use_sage --epochs=3 --runs=1 --inference
```

* Out of box: `13.32s`
* Without the fix in this PR: `5.87s`
* With the fix in this PR: `3.19s`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94969
Approved by: https://github.com/jgong5
2023-02-17 00:20:00 +00:00
bb347dc3c3 [PTD][DCP] Add 1D DTensor based DCP (#94868)
Add 1D DTensor based DCP along with its test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94868
Approved by: https://github.com/wanchaol, https://github.com/fegin
2023-02-16 23:38:04 +00:00
5cdedab0cc Raise error if torch.compile is called from windows or py 3.11 (#94940)
For https://github.com/pytorch/pytorch/issues/94914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94940
Approved by: https://github.com/albanD
2023-02-16 23:34:52 +00:00
8126bb5529 Mark linux-focal-py3.8-gcc7 / test (distributed) as unstable temporarily (#95002)
This has become flaky recently (5.11% > 5% threshold) https://hud.pytorch.org/reliability/pytorch/pytorch?jobName=pull%20%2F%20linux-focal-py3.8-gcc7%20%2F%20test%20(distributed), moving it to unstable makes sense because the more important CUDA distributed jobs are still run in trunk.  The issue is being investigated in https://github.com/pytorch/pytorch/issues/94954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95002
Approved by: https://github.com/ZainRizvi
2023-02-16 23:28:33 +00:00
b45ec156a8 Revert "Temporarily disable ROCm trunk tests (#94995)"
This reverts commit 920ad2415c5fadc171279059136ab3836b6822a0.

Reverted https://github.com/pytorch/pytorch/pull/94995 on behalf of https://github.com/huydhn due to ROCm runners have been cleaned up
2023-02-16 23:17:55 +00:00
e0106e1850 Use the run_subtests utility instead of self.subTest (#94983)
Using the run_subtests utility is better test practice.

Related #84071

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94983
Approved by: https://github.com/awgu
2023-02-16 22:13:13 +00:00
ee0e7f0529 [dtensor] add checkpointing example (#94743)
This PR adds some DTensor sharding example on a simple MLP model
for checkpointing reference purposes

Note that checkpointing itself is not implemented yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94743
Approved by: https://github.com/wz337
2023-02-16 22:04:09 +00:00
59005bb998 Fix segmentation fault in script_type_parser.cpp and unpickler.cpp (#94815)
Hi!

I've been fuzzing different pytorch modules, and found a few crashes.

The proposed checks fix multiple segmentation faults and heap buffer overflows that were found while fuzzing PyTorch with [sydr-fuzz](https://github.com/ispras/oss-sydr-fuzz/tree/master/projects/pytorch).

### Crash files ###
1) Heap buffer overflow that leads to crash
[crash-842314913bf1820ec19cddfbb7400ffdbb756920.zip](https://github.com/pytorch/pytorch/files/9461316/crash-842314913bf1820ec19cddfbb7400ffdbb756920.zip)

```
  "AsanReport": [
    "==3751==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000033478 at pc 0x0000005f9bc3 bp 0x7fffffff1eb0 sp 0x7fffffff1ea8\n",
    "READ of size 4 at 0x619000033478 thread T0\n",
    "[Detaching after fork from child process 3762]\n",
    "    #0 0x5f9bc2 in c10::IValue::IValue(c10::IValue&&) /pytorch_fuzz/aten/src/ATen/core/ivalue.h:192:43\n",
    "    #1 0x9ecd0a7 in torch::jit::pop(std::vector<c10::IValue, std::allocator<c10::IValue> >&) /pytorch_fuzz/aten/src/ATen/core/stack.h:102:12\n",
    "    #2 0x9ecd0a7 in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:380:17\n",
    "    #3 0x9ecafc7 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:226:27\n",
    "    #4 0x9ecac62 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:183:3\n",
    "    #5 0x9e45996 in torch::jit::unpickle(std::function<unsigned long (char*, unsigned long)>, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:127:20\n",
    "    #6 0x9e4626d in torch::jit::unpickle(char const*, unsigned long, std::function<c10::StrongTypePtr (c10::QualifiedName const&)>, c10::ArrayRef<at::Tensor>, c10::Type::SingletonOrSharedTypePtr<c10::Type> (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)) /pytorch_fuzz/torch/csrc/jit/serialization/pickle.cpp:137:10\n",
```

2) Segmentation fault
[crash-e690c58718e88921350562f0b4d9180938145d77.zip](https://github.com/pytorch/pytorch/files/9461331/crash-e690c58718e88921350562f0b4d9180938145d77.zip)

```
 "AsanReport": [
    "==3744==ERROR: AddressSanitizer: SEGV on unknown address (pc 0x000009122754 bp 0x7fffffff5290 sp 0x7fffffff5270 T0)\n",
    "==3744==The signal is caused by a READ memory access.\n",
    "==3744==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.\n",
    "[Detaching after fork from child process 3763]\n",
    "    #0 0x9122754 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::retain_() /pytorch_fuzz/c10/util/intrusive_ptr.h:269:54\n",
    "    #1 0x9127929 in c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> >::intrusive_ptr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch_fuzz/c10/util/intrusive_ptr.h:352:5\n",
    "    #2 0x9127929 in torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&) /pytorch_fuzz/torch/csrc/jit/frontend/tree_views.h:269:49\n",
    "    #3 0x91b1bbb in torch::jit::Maybe<torch::jit::Expr>::get() const /pytorch_fuzz/torch/csrc/jit/frontend/tree_views.h:211:12\n",
    "    #4 0x92a8f74 in torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&) /pytorch_fuzz/torch/csrc/jit/frontend/script_type_parser.cpp:461:41\n",
    "    #5 0x9e1c09b in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:549:34\n",
    "    #6 0x9e13f00 in torch::jit::SourceImporterImpl::importNamedType(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, torch::jit::ClassDef const&) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:288:5\n",
    "    #7 0x9e11fbc in torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&) /pytorch_fuzz/torch/csrc/jit/serialization/import_source.cpp:140:5\n",
```

3) Unhandled out of bounds access in a vector
[crash-ccd524e7ba19a37982dd91e0d6fc06bb26dd0b10.zip](https://github.com/pytorch/pytorch/files/9461367/crash-ccd524e7ba19a37982dd91e0d6fc06bb26dd0b10.zip)

```
  "AsanReport": [
    "==3792== ERROR: libFuzzer: deadly signal\n",
    "[Detaching after fork from child process 3809]\n",
    "    #0 0x59cc11 in __sanitizer_print_stack_trace /llvm-project/compiler-rt/lib/asan/asan_stack.cpp:87:3\n",
    "    #1 0x511547 in fuzzer::PrintStackTrace() /llvm-project/compiler-rt/lib/fuzzer/FuzzerUtil.cpp:210:5\n",
    "    #2 0x4f7753 in fuzzer::Fuzzer::CrashCallback() /llvm-project/compiler-rt/lib/fuzzer/FuzzerLoop.cpp:233:3\n",
    "    #3 0x7ffff7c6741f  (/lib/x86_64-linux-gnu/libpthread.so.0+0x1441f)\n",
    "    #4 0x7ffff7a8700a in __libc_signal_restore_set /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/internal-signals.h:86:3\n",
    "    #5 0x7ffff7a8700a in raise /build/glibc-SzIz7B/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:48:3\n",
    "    #6 0x7ffff7a66858 in abort /build/glibc-SzIz7B/glibc-2.31/stdlib/abort.c:79:7\n",
    "    #7 0x7ffff7e73910  (/lib/x86_64-linux-gnu/libstdc++.so.6+0x9e910)\n",
    "    #8 0x7ffff7e7f38b  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa38b)\n",
    "    #9 0x7ffff7e7f3f6 in std::terminate() (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa3f6)\n",
    "    #10 0x7ffff7e7f6a8 in __cxa_throw (/lib/x86_64-linux-gnu/libstdc++.so.6+0xaa6a8)\n",
    "    #11 0x7ffff7e763aa  (/lib/x86_64-linux-gnu/libstdc++.so.6+0xa13aa)\n",
    "    #12 0x6aeedf in std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_range_check(unsigned long) const /usr/bin/../lib/gcc/x86_64-linux-gnu/10/../../../../include/c++/10/bits/stl_vector.h:1073:4\n",
    "    #13 0x9ecd66c in torch::jit::Unpickler::readInstruction() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp\n",
    "    #14 0x9ecafc7 in torch::jit::Unpickler::run() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:226:27\n",
    "    #15 0x9ecac62 in torch::jit::Unpickler::parse_ivalue() /pytorch_fuzz/torch/csrc/jit/serialization/unpickler.cpp:183:3\n",
```

Some other crashes found by fuzzer:
[crash-0cab888cbd1e9fea92ab6ddeadf40b958b87d62b.zip](https://github.com/pytorch/pytorch/files/9461406/crash-0cab888cbd1e9fea92ab6ddeadf40b958b87d62b.zip)
[crash-04c9ba8e3b0f15028fd0fb0ed014fd352e182a1d.zip](https://github.com/pytorch/pytorch/files/9461407/crash-04c9ba8e3b0f15028fd0fb0ed014fd352e182a1d.zip)
[crash-422ad8c3a3472980ba751f4c7f79cf2b53e49927.zip](https://github.com/pytorch/pytorch/files/9461408/crash-422ad8c3a3472980ba751f4c7f79cf2b53e49927.zip)

### How to reproduce ###

1. To reproduce the crashes, use provided docker: [Dockerfile](https://github.com/ispras/oss-sydr-fuzz/blob/master/projects/pytorch/Dockerfile)

2. Build the container: `docker build -t oss-sydr-fuzz-pytorch-reproduce .`

3. Copy crash file to the current directory

4. Run the container: `` docker run --privileged --network host -v `pwd`:/homedir --rm -it oss-sydr-fuzz-pytorch-reproduce /bin/bash ``

5. And execute fuzz-targets with provided crash-files.

After execution completes you will see ASAN reports.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94815
Approved by: https://github.com/davidberard98
2023-02-16 21:41:11 +00:00
03f4a63fd8 Only truncate leading 1s if the value is too big. (#94521)
If it's just right, broadcasting will do the right thing
automatically.

This helps with unbacked SymInts as I can avoid testing one
equality on the inside.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94521
Approved by: https://github.com/voznesenskym
2023-02-16 21:33:12 +00:00
4f257a507c [Dynamo] Support Python builtin sorted function (#94949)
Fixes #94750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94949
Approved by: https://github.com/jansel, https://github.com/Skylion007
2023-02-16 21:27:11 +00:00
5747a51657 Fix flaky StaticRuntime.Nonzero test (#94418)
If the operator produces a zero size tensor, the memory
may be equal to the original.  With nonzero, we would sometimes
get unlucky and everything was zero.

See failing tests at https://hud.pytorch.org/failure/%5B%20%20FAILED%20%20%5D%20StaticRuntime.Nonzero

Arguably we should also fix the seeding but it was less obvious
to me where to do that.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94418
Approved by: https://github.com/albanD
2023-02-16 21:25:15 +00:00
b209d8fa0d [PT-D][Sequence Parallelism] Enable DTensor based Naive sequence parallelism (#94369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94369
Approved by: https://github.com/wanchaol
2023-02-16 21:21:00 +00:00
29fdb354ff [MPS] Fix embedding_backward() issue with Float16 (#94950)
- Cast the float16 input tensor to float32, then cast the output tensor back to float16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94950
Approved by: https://github.com/DenisVieriu97
2023-02-16 20:55:09 +00:00
21eb7f70f1 Nvfuser python API import fix (#94036)
1. Make the nvfuser Python API import work with both devel and upstream;
2. Add an environment variable to allow a custom nvfuser code base to be built with upstream PyTorch core.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94036
Approved by: https://github.com/malfet, https://github.com/davidberard98
2023-02-16 20:10:40 +00:00
7aaebe00ee Fail dynamic_aot_eager AllenaiLongformerBase model (#94986)
```
GuardOnDataDependentSymNode: It appears that you're trying to get a value out of symbolic int/float whose value is data-dependent (and thus we do not know the true value.)  The expression we were trying to evaluate is Eq(i3, -1).  Scroll up to see where each of these data-dependent accesses originally occurred.

While executing %as_strided : [#users=1] = call_method[target=as_strided](args = (%pad,), kwargs = {size: (12, %add, 768, 64), stride: (%getitem, %mul, %getitem_1, %getitem_2)})
Original traceback:
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/transformers/models/longformer/modeling_longformer.py", line 928, in <graph break in _sliding_chunks_matmul_attn_probs_value>
    chunked_value = padded_value.as_strided(size=chunked_value_size, stride=chunked_value_stride)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94986
Approved by: https://github.com/albanD
2023-02-16 20:02:46 +00:00
920ad2415c Temporarily disable ROCm trunk tests (#94995)
ROCm tests are failing with No space left on device https://github.com/pytorch/pytorch/actions/runs/4197259561/jobs/7279713058

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94995
Approved by: https://github.com/huydhn
2023-02-16 19:59:36 +00:00
641cb4243c Fix c10d regression during cleanup. (#94988)
This fixes a regression introduced earlier today with a change to c10d global state.

The global state must be cleaned up in destroy_process_group, or the root PG and its Store will stay alive.

Fixes regression in test_c10d_nccl.py :: RendezvousEnvTest.test_common_errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94988
Approved by: https://github.com/H-Huang, https://github.com/wanchaol, https://github.com/malfet
2023-02-16 19:12:00 +00:00
b652577d8e Change test_torchinductor_opinfo.py to mark skips/xfails in a better way (#94813)
With this change, expected failures will be correctly reported as such by pytest (instead of passes as before).
It was sometimes a little confusing to see operators you did not expect to work in inductor reported as passing their tests.

One downside is that expected failures/skips for test variants have now to be identified by tuples. I.e., `("max", "reduction_no_dim"): {f16},` instead of just `"max.reduction_no_dim": {f16}`. It seems to me it is worth it.

This change would also allow to simplify `TestInductorOpInfo` class a little, since it doesn't have to handle the skips/xfails anymore, but that might require dropping support for things like `PYTORCH_COLLECT_EXPECT` and `PYTORCH_FAIL_ON_SUCCESS` so I didn't do it.

Also couple of other minor changes:

 - Got rid of c32, c64, c128 in torchinductor_opinfo. We don't support complex numbers, so they shouldn't be necessary.
 - Renamed TestExpect Enum to ExpectedTestResult to get rid of a pytest warning that thinks it is a class that has tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94813
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-02-16 18:57:01 +00:00
981511d0fe Upload coredump from ROCm and print the stacktrace (#94938)
There was a burst of `test_cuda` SIGSEGV or SIGIOT from ROCm today, for example 5705199fb1.  So, I'm trying to apply the same logic from Linux [test workflows](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_linux-test.yml#L248-L261) here to uploading the core dump to GitHub and print the stack trace.  This would help debug similar issues in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94938
Approved by: https://github.com/ZainRizvi
2023-02-16 17:56:36 +00:00
ef5de0a4cf Don't use PrimTorch decomposition for empty (#94512)
This PR removes the unnecessary == 0 guard when constructing empty tensors, by ensuring that when we create a contiguous tensor we go directly to the C++ torch.empty implementation (instead of indirecting through empty_strided), where we can bypass doing zero tests when computing the size of the storage. This probably also speeds up trace time.

When I did this, I found out that `empty_tensor_restride_symint` was flagrantly wrong (we had never exercised it before because we redirected to `empty_strided` in PrimTorch decomp, which doesn't hit this codepath.) The bugs:

* Stride computation was wrong (only `last_idx` was ever written to)
* Using set_sizes_and_strides with `sym_sizes` input doesn't work, because there is some sort of ordering problem where `clone_symvec` isn't safe when you clone a vector into itself. Probably should fix this.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94512
Approved by: https://github.com/ngimel
2023-02-16 16:04:41 +00:00
2f32fd7762 Introduce branchless implementations of TensorImpl bools (#94473)
This is the main payload of this diff stack. With it, we are able to construct a 1D tensor from unbacked SymInt with guards that are equivalent to asserting that the size is non-negative (which makes sense!) To get here, I had to arrange for all of the guards that occur when doing contiguity tests to be lazy. This was done by writing non-branching implementations of each of the tests in `sympy_is_contiguous` etc functions, and then using those implementations when we don't branch.

I also had to do some bug fixes for `is_non_overlapping_and_dense`, as unbacked SymInts were very untested previously (and that was the only time you would actually hit the Python version of the code.) In particular, we now consistently pass separate sizes/strides lists into each of the boolean computation functions (and only pack them into a single argument list when going to Sympy, which doesn't support lists of variables in custom functions.)

Finally, to actually test that this is doing something, I add a simple assumptions system from https://github.com/pytorch/pytorch/pull/90985 and use this to get the end to end test test_item_to_constructor passing. Soon, I intend to replace this with a range analysis system which will be used for assumptions in the short term. (We still might use Z3, but for all the stray assumptions I've seen range analysis will be good enough.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94473
Approved by: https://github.com/albanD
2023-02-16 16:02:13 +00:00
e22d791287 [PTD] Introduce tracing friendly collectives. (#93990)
This change adds torch.distributed.traceable_collectives.

This experimental API enables collectives to be fully traced by dynamo and FX.

See #93173 for the RFC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93990
Approved by: https://github.com/wconstab, https://github.com/wanchaol, https://github.com/H-Huang
2023-02-16 15:35:01 +00:00
d0fbed76c6 Test inductor with stock g++ (#90710)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90710
Approved by: https://github.com/jansel
2023-02-16 15:10:17 +00:00
89e16c4f18 Assume sympy is always installed (#94903)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94903
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-16 14:09:58 +00:00
23b1af0399 Inductor cache clear (#94918)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94918
Approved by: https://github.com/ezyang, https://github.com/jansel
2023-02-16 14:09:01 +00:00
68600fc7c6 avoid extra copies in batchnorm inference by introducing a new op, _native_batch_norm_legit_no_training (#94946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94946
Approved by: https://github.com/ezyang
2023-02-16 11:41:20 +00:00
2ef6659107 [Dynamo] Raise warning if user has hooks installed on the module (#94848)
We don't support hooks for ```nn.Module``` yet, so we should raise warnings if we detect that hooks have been installed.
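
A sketch of the situation that now triggers the warning (the exact warning text is not reproduced here):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

m = M()
m.register_forward_hook(lambda mod, inp, out: out)   # hook that dynamo does not trace yet

opt_m = torch._dynamo.optimize("eager")(m)
opt_m(torch.randn(2))   # expected to emit a warning about the installed hook
```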

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94848
Approved by: https://github.com/jansel
2023-02-16 10:01:42 +00:00
bfec4965a1 [inductor] Get compiler from environment variable if exists (#94926)
Fixes an issue where the default `g++` is not the right compiler to use (or does not exist).
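
A sketch of the intended workflow; the exact variable name (`CXX` here) is an assumption for illustration:

```python
import os
import torch

os.environ["CXX"] = "clang++"   # assumed env var: point inductor at a C++ compiler that exists

@torch._dynamo.optimize("inductor")
def f(x):
    return x.sin() + 1

f(torch.randn(8))
```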

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94926
Approved by: https://github.com/ngimel
2023-02-16 08:23:32 +00:00
28e69954a1 [ONNX] Support aten::bit_wise_not in fx-onnx exporter (#94919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94919
Approved by: https://github.com/justinchuby, https://github.com/wschin
2023-02-16 06:21:59 +00:00
a0389681c2 [complex] nansum & nanmean (#93199)
Follows: #71472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93199
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/kshitij12345
2023-02-16 06:13:42 +00:00
6ae06e49ac Inductor: fix incorrect result of inplace unsqueeze (#94797)
This PR fixes the incorrect result in the following test case.
```
@torch._dynamo.optimize("inductor")
def fn(a):
    unsqueeze_ = torch.ops.aten.unsqueeze_.default(a, 0)
    return unsqueeze_

args = [
      ((1, 1, 1, 12, 11, 3), (396, 396, 396, 33, 3, 1), torch.int64, "cpu")
       ]
args = [rand_strided(sh, st, dt, dev) for (sh, st, dt, dev) in args]

with torch.no_grad():
    out = fn(*args)

# expected result: (396, 396, 396, 396, 33, 3, 1) torch.Size([1, 1, 1, 1, 12, 11, 3])
print(args[0].stride(), args[0].shape) # incorrect result: (396, 396, 396, 396, 396, 396, 33, 3, 1) torch.Size([1, 1, 1, 1, 1, 1, 12, 11, 3])
```
**Root cause**

1. [fake_tensor](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/variables/builder.py#L140) is changed during [tracer.run](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/convert_frame.py#L311), then it will [pass incorrect inputs to inductor](https://github.com/pytorch/pytorch/blob/master/torch/_dynamo/output_graph.py#L670).
2. example_inputs are changed during [propagate](https://github.com/pytorch/pytorch/blob/master/torch/_inductor/mkldnn.py#L509)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94797
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-02-16 05:57:42 +00:00
aa9e481e0c Revert "Re-enable a FX-to-ONNX kwargs Test (#94763)"
This reverts commit 04b4704a0bbf2d3831ca7685264db574ff71216d.

Reverted https://github.com/pytorch/pytorch/pull/94763 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it has a tiny lint error that breaks trunk https://github.com/pytorch/pytorch/actions/runs/4190787551/jobs/7264666070.  This looks weird cause your PR lint signal was green
2023-02-16 05:24:07 +00:00
a049bbb100 Revert "Change test_torchinductor_opinfo.py to mark skips/xfails in a better way (#94813)"
This reverts commit bfc0d5e22c34e5888c394735bf696e2f45e07816.

Reverted https://github.com/pytorch/pytorch/pull/94813 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but it causes failures on trunk bfc0d5e22c due to a landrace with b6df987671
2023-02-16 05:08:23 +00:00
e751553848 [vision hash update] update the pinned vision hash (#94866)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94866
Approved by: https://github.com/pytorchbot
2023-02-16 05:00:48 +00:00
753c33bf86 Enable half type support for unique cpu (#91666)
Test Plan: CI

Differential Revision: D42326527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91666
Approved by: https://github.com/jgong5, https://github.com/ngimel
2023-02-16 04:59:35 +00:00
04b4704a0b Re-enable a FX-to-ONNX kwargs Test (#94763)
As title. The re-factorization of ONNX test framework disabled one exporter. This PR just brings that test back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94763
Approved by: https://github.com/justinchuby, https://github.com/abock, https://github.com/titaiwangms
2023-02-16 04:46:34 +00:00
4b2d1beca2 [dynamo] keep submodule's name for nn.Sequential when unrolling (#94913)
Currently, when unrolling an nn.Sequential, we use an integer to represent its submodule's name. This produces some difficulty in tracking the origin of the parameters in the export path:
```python
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))
```
Currently, the submodules will have names such as model.0, model.1 instead of model.conv1, model.relu1. This discrepancy makes it difficult to track the origin of parameters because they are represented as model.conv1.foo and model.relu1.foo in model.named_parameters().

We replace enumerate() with named_children() to keep submodule's name.
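
An illustrative sketch of the difference:

```python
from collections import OrderedDict
import torch.nn as nn

model = nn.Sequential(OrderedDict([("conv1", nn.Conv2d(1, 20, 5)), ("relu1", nn.ReLU())]))

print([i for i, _ in enumerate(model)])              # [0, 1] -- integer names
print([name for name, _ in model.named_children()])  # ['conv1', 'relu1'] -- original names kept
```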

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94913
Approved by: https://github.com/jansel
2023-02-16 04:43:05 +00:00
8c44ae2f5d [inductor] enable test_lowmem_dropout1_dynamic_shapes (#94884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94884
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-02-16 04:41:19 +00:00
5e1de31548 fix: make sure sorter indices are inbound in searchsorted (#94863)
Fixes #91606

Add a checker to `sorter` to make sure indices are inbound (as NumPy).
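
A small sketch of the case the new check guards against (an out-of-bounds `sorter` entry):

```python
import torch

seq = torch.tensor([3.0, 1.0, 2.0])
vals = torch.tensor([1.5])
sorter = torch.tensor([1, 2, 5])   # index 5 is out of bounds for a length-3 sequence

try:
    torch.searchsorted(seq, vals, sorter=sorter)
except RuntimeError as err:
    print("rejected:", err)        # expected to raise instead of reading out of bounds
```
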
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94863
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-02-16 04:28:39 +00:00
a863d5e37c Hide failing merge rule's name in the internal debugging section (#94932)
Fixes https://github.com/pytorch/test-infra/issues/1081

The merge rule name is not helpful to most readers, and most of the time it's just "superuser."  Move this to a less prominent place in the "Details for Dev Infra team" section
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94932
Approved by: https://github.com/huydhn
2023-02-16 04:20:10 +00:00
a4085ab837 [dynamo] support custom __getattr__ on torch.nn.Modules (#94658)
**Summary**: torch.nn.Module implementations previously did not support custom implementations of `__getattr__`; if a torch.nn.Module subclass implemented `__getattr__` and we tried to access an attribute that was expected to be present in `__getattr__`, dynamo would not check `__getattr__` and would error out with an AttributeError. This PR copies the functionality from UserDefinedObjectVariable into torch.nn.Module so that it also supports `__getattr__`

Example of a module which previously would fail:

```python
class MyMod(torch.nn.Module):
		def __init__(self):
				super().__init__()
				self.custom_dict = {"queue": [torch.rand((2, 2)) for _ in range(3)]}
				self.other_attr = torch.rand((2, 2))

		def __getattr__(self, name):
				custom_dict = self.custom_dict
				if name in custom_dict:
						return custom_dict[name]
				return super().__getattr__(name)

		def forward(self, x):
				return x @ self.other_attr + self.queue[-1]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94658
Approved by: https://github.com/yanboliang, https://github.com/jansel
2023-02-16 04:00:51 +00:00
bfc0d5e22c Change test_torchinductor_opinfo.py to mark skips/xfails in a better way (#94813)
With this change, expected failures will be correctly reported as such by pytest (instead of passes as before).
It was sometimes a little confusing to see operators you did not expect to work in inductor reported as passing their tests.

One downside is that expected failures/skips for test variants have now to be identified by tuples. I.e., `("max", "reduction_no_dim"): {f16},` instead of just `"max.reduction_no_dim": {f16}`. It seems to me it is worth it.

This change would also allow to simplify `TestInductorOpInfo` class a little, since it doesn't have to handle the skips/xfails anymore, but that might require dropping support for things like `PYTORCH_COLLECT_EXPECT` and `PYTORCH_FAIL_ON_SUCCESS` so I didn't do it.

Also couple of other minor changes:

 - Got rid of c32, c64, c128 in torchinductor_opinfo. We don't support complex numbers, so they shouldn't be necessary.
 - Renamed TestExpect Enum to ExpectedTestResult to get rid of a pytest warning that thinks it is a class that has tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94813
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-02-16 03:32:01 +00:00
3d40a86acd [ONNX] Enable skipped gpt2 test (#94930)
I think the skip is outdated. Test passed in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94930
Approved by: https://github.com/wschin
2023-02-16 03:23:57 +00:00
b4c8186774 [BE][1/N] Add deprecate msg to Sharded Partial and Replicate Tensor (#94928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94928
Approved by: https://github.com/wanchaol
2023-02-16 03:23:53 +00:00
07bc6b9587 [SDPA] Update dispatch logic to check for sm86 and head_size == 128 for flash attention (#94921)
Fixes #94883

Where backward for flash_attention on sm86 hardware with head_size == 128 is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94921
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-02-16 03:11:16 +00:00
41865bd8ed [executorch] Add RuntimeContext to generated C++ API Signature (#94570)
Summary:
Pass runtime context all the way to kernel level.

RegisterCodegenUnboxedKernels.cpp:

```
static Operator operators_to_register[] = {
    Operator(
        "aten::add.out",
        [](torch::executor::RuntimeContext & context, EValue** stack) {

            EValue& self = *stack[0];
    	EValue& other = *stack[1];
    	EValue& alpha = *stack[2];
    	EValue& out = *stack[3];
    	const torch::executor::Tensor & self_base = self.to<torch::executor::Tensor>();
    	const torch::executor::Tensor & other_base = other.to<torch::executor::Tensor>();
    	const torch::executor::Scalar & alpha_base = alpha.to<torch::executor::Scalar>();
    	torch::executor::Tensor & out_base = out.to<torch::executor::Tensor>();

            EXECUTORCH_SCOPE_PROF("native_call_add.out");
            torch::executor::aten::add_outf(context, self_base, other_base, alpha_base, out_base);

        }
    ),
}
```

Functions.h
```

// aten::add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
TORCH_API inline at::Tensor & add_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha, at::Tensor & out) {
    return torch::executor::native::add_out(self, other, alpha, out);
}

```

Test Plan: TBD

Differential Revision: D41325633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94570
Approved by: https://github.com/cccclai
2023-02-16 02:43:18 +00:00
e5c2a35d83 Add check that embedding_bag's weight is 2D (#94931)
Fixes https://github.com/pytorch/pytorch/issues/94445
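
A small sketch of the input the new check rejects:

```python
import torch
import torch.nn.functional as F

weight = torch.randn(10)          # 1-D weight: embedding_bag expects a 2-D weight matrix
inp = torch.tensor([0, 1])
offsets = torch.tensor([0])

try:
    F.embedding_bag(inp, weight, offsets)
except RuntimeError as err:
    print("rejected:", err)
```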

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94931
Approved by: https://github.com/albanD
2023-02-16 02:37:47 +00:00
3e9df622fb [mta] implement _foreach_pow (#92303)
Mainly for foreach path of `Adam` and `AdamW`
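
A minimal usage sketch of the new op:

```python
import torch

xs = [torch.tensor([1.0, 2.0]), torch.tensor([3.0])]
print(torch._foreach_pow(xs, 2.0))   # [tensor([1., 4.]), tensor([9.])]
```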

rel: https://github.com/pytorch/pytorch/issues/58833
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92303
Approved by: https://github.com/albanD
2023-02-16 02:28:26 +00:00
e28ba6813d Enable persistent reductions (#94847)
Now that we have newer triton this might be safe

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94847
Approved by: https://github.com/Chillee
2023-02-16 01:47:29 +00:00
0d7913c9c1 add backwards for layer norm nested (#94781)
Fixes #94702

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94781
Approved by: https://github.com/cpuhrsch
2023-02-16 01:42:57 +00:00
904d549ca4 Add some simple sanity tests to ValueRanges (#94905)
To start, I simply test that unary/binary ops agree with reference when
the ranges are singleton.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94905
Approved by: https://github.com/lezcano, https://github.com/eellison
2023-02-16 01:29:45 +00:00
e8dc34eaeb [MPS] Move max_pool2d to mps dispatch key (#90772)
Related issue: #77394

This PR also modifies some assertions in the codegen, an explanatory comment for it has been added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90772
Approved by: https://github.com/albanD
2023-02-16 01:13:08 +00:00
250c054bdd [SPMD] Pull the minimal working distribute API and SPMD module to PyTorch (#94802)
Pull the minimal working distribute API and SPMD module to PyTorch. The original code is on https://github.com/pytorch/tau/tree/main/spmd/compiler.

Other main contributors to the original code base: @anj-s, @lessw2020, @wanchaol @aazzolini

Differential Revision: [D43197230](https://our.internmc.facebook.com/intern/diff/D43197230/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94802
Approved by: https://github.com/anj-s, https://github.com/wanchaol
2023-02-16 00:36:16 +00:00
bc361fdfdf [MPS] Fix bilinear backward pass (#94892)
Fixes backward pass for bilinear.

Summary of changes:
- the bilinear op is able to produce **contiguous, non-view** tensors with a storage offset, such as shape=`[1, 1, 1, 1]`, `storage_offset=12`. This seems like a weird case, but it is valid, and for this type of tensor we wouldn't be able to gather/scatter since we look at the view flag (which is not set here). This change looks at `storage_offset` only rather than the is_view flag.
- **reduction sum** must return a zeroed-out output if passed an input with 0 elements (e.g. a shape of (0, 5)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94892
Approved by: https://github.com/kulinseth
2023-02-16 00:30:29 +00:00
dd7e2b7c0e [pt2][inductor] update choice caller hashes (#94853)
Summary:
update the hashing method for `ChoiceCaller` class.

`TritonTemplateCaller` objects will now be hashed to:
`{name}-({BLOCK_M}, {BLOCK_N}, {BLOCK_K})-{num_stages}-{num_warps}-{code_hash}`

for example:
`triton_mm-(64, 32, 32)-4-8-cptlntwzcl2gaaofd2oabdwhaqv4ox3lluvbuxitjfhhpz6cyl4o`

`ExternKernelCaller` objects will now be hashed to:
`{name}-{kwargs.keys()[0]}={kwargs.vals()[0]}-...-{code_hash}`

for example:
`addmm-alpha=1-beta=1-c4xxd3iocu4yt6z4udrlqnumays7q6mfnfd3qprh4fxgsvyhqdkf`

Test Plan: sandcastle

Differential Revision: D43285470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94853
Approved by: https://github.com/jansel, https://github.com/bertmaher
2023-02-16 00:11:26 +00:00
0698af67c7 Revert "Fix XNNPACK OSS Buck build (#94935)"
This reverts commit 9d2fddf820f8cf4273b12a8be5a556ba230c21cf.

Reverted https://github.com/pytorch/pytorch/pull/94935 on behalf of https://github.com/kit1980 due to The issue already mitigated by https://github.com/pytorch/pytorch/pull/94785
2023-02-15 23:14:44 +00:00
c01f5118a6 Add float to list of allowed ops (#94910)
By adding `BINFLOAT` op support

Fixes https://github.com/pytorch/pytorch/issues/94670
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94910
Approved by: https://github.com/albanD
2023-02-15 23:13:21 +00:00
9d2fddf820 Fix XNNPACK OSS Buck build (#94935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94935
Approved by: https://github.com/huydhn, https://github.com/seemethere, https://github.com/malfet
2023-02-15 23:06:32 +00:00
a005dd1c01 [MPS] Fix nn.functional.conv_transpose2d grad (#94871)
- add _mps_convolution_impl that takes optional shape
- for conv_tranpose2d grad, use the shape from forward pass directly
- for conv, calculate the shape from input
- remove nn.functional.conv_transpose2d grad from blocklist

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94871
Approved by: https://github.com/kulinseth
2023-02-15 21:45:11 +00:00
cd9ca4c73f [tp] additional doc fixes (#94786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94786
Approved by: https://github.com/fduwjj
2023-02-15 21:25:26 +00:00
b6df987671 [Inductor] Added aten.normal_ decomp (#91207)
Fixes #91085

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91207
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/lezcano
2023-02-15 21:21:46 +00:00
092e28f17f Make the glue compute short circuit only if possible (#94437)
If the inputs are unhinted, they will use the branchless implementation.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94437
Approved by: https://github.com/voznesenskym
2023-02-15 21:06:42 +00:00
ff7772317b Stub all TensorImpl bools; do not go to Python if not hinted. (#94431)
The basic idea behind this PR is that we want to continue using the guarding implementations of contiguity tests if all of the elements are backed (aka, have hints). If they don't have hints, we'll have to do something slower (use the non-short-circuiting, non-guarding implementations of contiguity), but most of the time you aren't dealing with unbacked SymInts.

So this PR has three parts.

1. We expose `has_hint` on `SymNode`. This allows us to query whether or not a SymInt is backed or not from C++. Fairly self explanatory. Will require LTC/XLA updates; but for backends that don't support unbacked SymInts you can just always return true.
2. We update `compute_non_overlapping_and_dense` to test if the inputs are hinted. If they are all hinted, we use the conventional C++ implementation. Otherwise we call into Python. The Python case is not heavily tested right now because I haven't gotten all of the pieces for unbacked SymInts working yet. Coming soon.
3. We add stubs for all of the other contiguity tests. The intention is to apply the same treatment to them as well, but this is not wired up yet for safety reasons.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94431
Approved by: https://github.com/voznesenskym
2023-02-15 21:06:42 +00:00
6da88bc966 try to fix OSS CI error (#94785)
Differential Revision: D43259005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94785
Approved by: https://github.com/weiwangmeta, https://github.com/digantdesai
2023-02-15 21:00:55 +00:00
dea05cdbf0 [MPS] Fix the crash in elu_backward() (#94923)
Fixes a crash where the inputTensor could go null and cause a crash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94923
Approved by: https://github.com/DenisVieriu97, https://github.com/kulinseth
2023-02-15 20:49:30 +00:00
66bea59538 Clarify meaning of pin_memory_device argument (#94349)
I don't think the docstring explaining `pin_memory_device` is very clear. If it weren't for the string type, I would not have guessed that this was about the device that is referred to in the `pin_memory` option (and honestly, it took me a few minutes before noticing the type).
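
A usage sketch, assuming a CUDA device is the pinning target:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(8, 2))
# pin_memory_device names the device that the pinned host memory is intended for
dl = DataLoader(ds, batch_size=4, pin_memory=True, pin_memory_device="cuda")
```
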
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94349
Approved by: https://github.com/ejguan
2023-02-15 20:40:28 +00:00
f2c26420f2 [pytorch] Add support for "height" and "width" dimension for the "select" operator on pytorch vulkan backend (#94612)
Summary: Add support for "height" and "width" dimension for the "select" operator on pytorch vulkan backend.

Test Plan:
```
yipjustin@yipjustin-mbp fbsource % buck run  -c pt.vulkan_full_precision=1  --target-platforms ovr_config//platform/macos:arm64-fbsource //xplat/caffe2:pt_vulkan_api_test_binAppleMac\#macosx-arm64 -- --gtest_filter="*select_3d*"
Downloaded 1/2 artifacts, 1.29 Mbytes, 0.0% cache miss (for updated rules)
Building: finished in 3.7 sec (100%) 450/450 jobs, 2/450 updated
  Total time: 3.8 sec
BUILD SUCCEEDED
Running main() from xplat/third-party/gmock/googletest-1.12.1/googletest/src/gtest_main.cc
Note: Google Test filter = *select_3d*
[==========] Running 9 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 9 tests from VulkanAPITest
[ RUN      ] VulkanAPITest.select_3d_depth_small
[       OK ] VulkanAPITest.select_3d_depth_small (30 ms)
[ RUN      ] VulkanAPITest.select_3d_depth_medium
[       OK ] VulkanAPITest.select_3d_depth_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_depth_large
[       OK ] VulkanAPITest.select_3d_depth_large (1 ms)
[ RUN      ] VulkanAPITest.select_3d_height_small
[       OK ] VulkanAPITest.select_3d_height_small (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_medium
[       OK ] VulkanAPITest.select_3d_height_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_height_large
[       OK ] VulkanAPITest.select_3d_height_large (3 ms)
[ RUN      ] VulkanAPITest.select_3d_width_small
[       OK ] VulkanAPITest.select_3d_width_small (0 ms)
[ RUN      ] VulkanAPITest.select_3d_width_medium
[       OK ] VulkanAPITest.select_3d_width_medium (0 ms)
[ RUN      ] VulkanAPITest.select_3d_width_large
[       OK ] VulkanAPITest.select_3d_width_large (1 ms)
[----------] 9 tests from VulkanAPITest (40 ms total)

[----------] Global test environment tear-down
[==========] 9 tests from 1 test suite ran. (40 ms total)
[  PASSED  ] 9 tests.
```

Reviewed By: SS-JIA

Differential Revision: D43020796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94612
Approved by: https://github.com/SS-JIA
2023-02-15 19:15:17 +00:00
fa1ea9f9bc Revert "Re-enable a FX-to-ONNX kwargs Test (#94763)"
This reverts commit ea657726d951662005688e03115a44a658c4144c.

Reverted https://github.com/pytorch/pytorch/pull/94763 on behalf of https://github.com/wschin due to One line conflict with https://github.com/pytorch/pytorch/pull/94878
2023-02-15 18:07:25 +00:00
b46b2e35d4 [BE] Add flake8-logging-format linter (#94840)
Follow up to #94708
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94840
Approved by: https://github.com/ezyang
2023-02-15 17:54:50 +00:00
dc4f2af6f6 Take CUDA_VISIBLE_DEVICES into account for nvml calls (#94568)
Fixes #94472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94568
Approved by: https://github.com/ngimel
2023-02-15 17:50:12 +00:00
ea657726d9 Re-enable a FX-to-ONNX kwargs Test (#94763)
As title. The re-factorization of ONNX test framework disabled one exporter. This PR just brings that test back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94763
Approved by: https://github.com/justinchuby, https://github.com/abock
2023-02-15 17:31:04 +00:00
1d7133c542 inductor(cpu): fix C++ compile error when sigmoid's post ops is a reduction op (#94890)
For timm **nfnet_l0** model. CPU path has the following error: `torch._dynamo.exc.BackendCompilerFailed: inductor raised CppCompileError: C++ compile error`.

There has a simple test case:

```
def fn(x):
    x = torch.ops.aten.sigmoid.default(x)
    return torch.ops.aten.mean.dim(x, [-1, -2], True)

x = torch.randn((1, 8, 8, 8))
opt_fn = torch._dynamo.optimize("inductor")(fn)
opt_fn(x)

real_out = fn(x)
compiled_out = opt_fn(x)
tol = 0.0001
print(torch.allclose(real_out, compiled_out, atol=tol, rtol=tol))

```

before:

```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    {
        #pragma GCC ivdep
        for(long i0=0; i0<8; i0+=1)
        {
            {
                #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                float tmp2 = 0;
                auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                for(long i1=0; i1<4; i1+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                    auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                    tmp2_vec += tmp1;
                }
                #pragma omp simd simdlen(8)  reduction(+:tmp3)
                for(long i1=64; i1<64; i1+=1)
                {
                    auto tmp0 = in_ptr0[i1 + (64*i0)];
                    auto tmp1 = std::exp(-tmp0);
                    auto tmp2 = 1 / (1 + tmp1);
                    tmp3 += tmp2;
                }
                tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                out_ptr0[i0] = tmp3;
            }
        }
    }
    {
        for(long i0=0; i0<0; i0+=1)
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
            auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
            auto tmp2 = tmp0 / tmp1;
            tmp2.store(in_out_ptr0 + 16*i0);
        }
        #pragma omp simd simdlen(8)
        for(long i0=0; i0<8; i0+=1)
        {
            auto tmp0 = out_ptr0[i0];
            auto tmp1 = static_cast<float>(64);
            auto tmp2 = tmp0 / tmp1;
            in_out_ptr0[i0] = tmp2;
        }
    }
}
```

after:
```
extern "C" void kernel(float* __restrict__ in_out_ptr0,
                       const float* __restrict__ in_ptr0)
{
    auto out_ptr0 = in_out_ptr0;
    #pragma omp parallel num_threads(40)
    {
        {
            #pragma omp for
            for(long i0=0; i0<8; i0+=1)
            {
                {
                    #pragma omp declare reduction(+:at::vec::Vectorized<float>:omp_out += omp_in) initializer(omp_priv={{0}})
                    float tmp2 = 0;
                    auto tmp2_vec = at::vec::Vectorized<float>(tmp2);
                    for(long i1=0; i1<4; i1+=1)
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + (16*i1) + (64*i0));
                        auto tmp1 = decltype(tmp0)(1)/(decltype(tmp0)(1) + tmp0.neg().exp());
                        tmp2_vec += tmp1;
                    }
                    #pragma omp simd simdlen(8)  reduction(+:tmp2)
                    for(long i1=64; i1<64; i1+=1)
                    {
                        auto tmp0 = in_ptr0[i1 + (64*i0)];
                        auto tmp1 = decltype(tmp0)(1) / (decltype(tmp0)(1) + std::exp(-tmp0));
                        tmp2 += tmp1;
                    }
                    tmp2 += at::vec::vec_reduce_all<float>([](at::vec::Vectorized<float>& x, at::vec::Vectorized<float>&y) {return x + y;}, tmp2_vec);
                    out_ptr0[i0] = tmp2;
                }
            }
        }
        #pragma omp single
        {
            {
                for(long i0=0; i0<0; i0+=1)
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(out_ptr0 + 16*i0);
                    auto tmp1 = at::vec::Vectorized<float>(static_cast<float>(64));
                    auto tmp2 = tmp0 / tmp1;
                    tmp2.store(in_out_ptr0 + 16*i0);
                }
                #pragma omp simd simdlen(8)
                for(long i0=0; i0<8; i0+=1)
                {
                    auto tmp0 = out_ptr0[i0];
                    auto tmp1 = static_cast<float>(64);
                    auto tmp2 = tmp0 / tmp1;
                    in_out_ptr0[i0] = tmp2;
                }
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94890
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/lezcano
2023-02-15 17:13:45 +00:00
7dd7dde033 [MPS] Convert output back to ChannelsLast for MaxPool2D (#94877)
Since we re-stride the indices and output in MPS pooling from ChannelsLast to Contiguous, we need to convert the results back to ChannelsLast.
This fixes the test_memory_format failure for MaxPool2D in test_modules.py; a minimal repro sketch follows.
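
A minimal sketch of the kind of memory-format check this addresses, assuming an MPS-enabled build (the actual assertions live in test_modules.py and may differ):

```
import torch
import torch.nn as nn

# Run MaxPool2d on a channels_last input on the MPS device and check that
# the output comes back in channels_last as well.
x = torch.randn(2, 3, 8, 8, device="mps").to(memory_format=torch.channels_last)
pool = nn.MaxPool2d(kernel_size=2)
out = pool(x)
print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True with this fix
```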

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94877
Approved by: https://github.com/kulinseth, https://github.com/DenisVieriu97
2023-02-15 16:19:21 +00:00
54ebf255ab [MPS] Fixes for LSTM. (#94889)
- The backward pass has to supply an explicit bias tensor of zeros if none is passed to the op, otherwise the bias gradient will not be calculated.
- Fixed the bias tensor mistakenly getting overwritten to zeros.
- Fixes a crash when the lstm op is called with has_biases set to false. The change takes into account the changed shape of the input params TensorList depending on the bias flag; a minimal sketch of this case is shown below.
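
A minimal sketch of the previously crashing no-bias path, assuming an MPS-enabled build; the sizes are arbitrary:

```
import torch
import torch.nn as nn

# bias=False exercises the has_biases=False path, where the params
# TensorList passed to the MPS op has a different shape.
lstm = nn.LSTM(input_size=4, hidden_size=8, num_layers=1, bias=False).to("mps")
x = torch.randn(5, 2, 4, device="mps", requires_grad=True)  # (seq_len, batch, input_size)
out, (h, c) = lstm(x)
# Backward needs an explicit zero bias tensor internally to produce gradients.
out.sum().backward()
print(x.grad.shape)  # torch.Size([5, 2, 4])
```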

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94889
Approved by: https://github.com/DenisVieriu97
2023-02-15 16:10:40 +00:00
799df90d0e [ONNX] Add bloom ops (#94878)
449a85bdbf should be included
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94878
Approved by: https://github.com/justinchuby
2023-02-15 16:05:39 +00:00
4987 changed files with 566291 additions and 176399 deletions

3
.bazelignore Normal file
View File

@ -0,0 +1,3 @@
# We do not use this library in our Bazel build. It contains an
# infinitely recursing symlink that makes Bazel very unhappy.
third_party/ittapi/

View File

@ -69,10 +69,6 @@ build --per_file_copt='^//.*\.(cpp|cc)$'@-Werror=all
# The following warnings come from -Wall. We downgrade them from error
# to warnings here.
#
# sign-compare has a tremendous amount of violations in the
# codebase. It will be a lot of work to fix them, just disable it for
# now.
build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-sign-compare
# We intentionally use #pragma unroll, which is compiler specific.
build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-error=unknown-pragmas
@ -100,6 +96,9 @@ build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-unused-parameter
# likely want to have this disabled for the most part.
build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-missing-field-initializers
build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-unused-function
build --per_file_copt='^//.*\.(cpp|cc)$'@-Wno-unused-variable
build --per_file_copt='//:aten/src/ATen/RegisterCompositeExplicitAutograd\.cpp$'@-Wno-error=unused-function
build --per_file_copt='//:aten/src/ATen/RegisterCompositeImplicitAutograd\.cpp$'@-Wno-error=unused-function
build --per_file_copt='//:aten/src/ATen/RegisterMkldnnCPU\.cpp$'@-Wno-error=unused-function

View File

@ -1 +1 @@
4.2.1
6.1.1

View File

@ -14,6 +14,7 @@
[cxx]
cxxflags = -std=c++17
ldflags = -Wl,--no-undefined
should_remap_host_platform = true
cpp = /usr/bin/clang
cc = /usr/bin/clang

View File

@ -1,7 +1,7 @@
# Docker images for Jenkins
# Docker images for GitHub CI
This directory contains everything needed to build the Docker images
that are used in our CI
that are used in our CI.
The Dockerfiles located in subdirectories are parameterized to
conditionally run build stages depending on build arguments passed to
@ -12,13 +12,13 @@ each image as the `BUILD_ENVIRONMENT` environment variable.
See `build.sh` for valid build environments (it's the giant switch).
Docker builds are now defined with `.circleci/cimodel/data/simple/docker_definitions.py`
## Contents
* `build.sh` -- dispatch script to launch all builds
* `common` -- scripts used to execute individual Docker build stages
* `ubuntu` -- Dockerfile for Ubuntu image for CPU build and test jobs
* `ubuntu-cuda` -- Dockerfile for Ubuntu image with CUDA support for nvidia-docker
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
## Usage

View File

@ -53,7 +53,7 @@ dependencies {
implementation 'androidx.appcompat:appcompat:1.0.0'
implementation 'com.facebook.fbjni:fbjni-java-only:0.2.2'
implementation 'com.google.code.findbugs:jsr305:3.0.1'
implementation 'com.facebook.soloader:nativeloader:0.10.4'
implementation 'com.facebook.soloader:nativeloader:0.10.5'
implementation 'junit:junit:' + rootProject.junitVersion
implementation 'androidx.test:core:' + rootProject.coreVersion

View File

@ -46,9 +46,7 @@ if [[ "$image" == *xla* ]]; then
exit 0
fi
if [[ "$image" == *-bionic* ]]; then
UBUNTU_VERSION=18.04
elif [[ "$image" == *-focal* ]]; then
if [[ "$image" == *-focal* ]]; then
UBUNTU_VERSION=20.04
elif [[ "$image" == *-jammy* ]]; then
UBUNTU_VERSION=22.04
@ -81,18 +79,18 @@ fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=31e74cac7bee0ef66bef2af72e7d86d9c282e5ab
_UCC_COMMIT=1c7a7127186e7836f73aafbd7697bbc274a77eee
_UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af
_UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-bionic-cuda11.6-cudnn8-py3-gcc7)
CUDA_VERSION=11.6.2
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=7
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
@ -100,12 +98,13 @@ case "$image" in
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-bionic-cuda11.7-cudnn8-py3-gcc7)
CUDA_VERSION=11.7.0
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=7
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
@ -113,8 +112,24 @@ case "$image" in
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-bionic-cuda11.8-cudnn8-py3-gcc7)
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc7)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
@ -126,14 +141,36 @@ case "$image" in
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang7-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=7
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc7-inductor-benchmarks)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=7
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.8
@ -142,9 +179,10 @@ case "$image" in
DB=yes
VISION=yes
CONDA_CMAKE=yes
ONNX=yes
;;
pytorch-linux-focal-py3-clang7-android-ndk-r19c)
ANACONDA_PYTHON_VERSION=3.7
ANACONDA_PYTHON_VERSION=3.8
CLANG_VERSION=7
LLVMDEV=yes
PROTOBUF=yes
@ -153,45 +191,38 @@ case "$image" in
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-bionic-py3.8-clang9)
pytorch-linux-focal-py3.8-clang10)
ANACONDA_PYTHON_VERSION=3.8
CLANG_VERSION=9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-bionic-py3.11-clang9)
pytorch-linux-focal-py3.11-clang10)
ANACONDA_PYTHON_VERSION=3.11
CLANG_VERSION=9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
VISION=yes
VULKAN_SDK_VERSION=1.2.162.1
SWIFTSHADER=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-bionic-py3.8-gcc9)
pytorch-linux-focal-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.3
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
@ -200,6 +231,18 @@ case "$image" in
ROCM_VERSION=5.4.2
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=5.6
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.8-gcc7)
ANACONDA_PYTHON_VERSION=3.8
@ -209,24 +252,20 @@ case "$image" in
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
;;
pytorch-linux-jammy-cuda11.6-cudnn8-py3.8-clang12)
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
CUDA_VERSION=11.6
CUDNN_VERSION=8
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
;;
pytorch-linux-jammy-cuda11.7-cudnn8-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
CUDA_VERSION=11.7
CUDNN_VERSION=8
CLANG_VERSION=12
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
@ -236,6 +275,27 @@ case "$image" in
PROTOBUF=yes
DB=yes
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.8-gcc11)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=11
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
CONDA_CMAKE=yes
TRITON=yes
DOCS=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
@ -260,6 +320,7 @@ case "$image" in
if [[ "$image" == *rocm* ]]; then
extract_version_from_image_name rocm ROCM_VERSION
NINJA_VERSION=1.9.0
TRITON=yes
fi
if [[ "$image" == *centos7* ]]; then
NINJA_VERSION=1.10.2
@ -323,11 +384,15 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx906;gfx90a}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
-f $(dirname ${DOCKERFILE})/Dockerfile \
-t "$tmp_tag" \
"$@" \

View File

@ -1,60 +0,0 @@
#!/bin/bash
set -ex
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*)
}
# If UPSTREAM_BUILD_ID is set (see trigger job), then we can
# use it to tag this build with the same ID used to tag all other
# base image builds. Also, we can try and pull the previous
# image first, to avoid rebuilding layers that haven't changed.
#until we find a way to reliably reuse previous build, this last_tag is not in use
# last_tag="$(( CIRCLE_BUILD_NUM - 1 ))"
tag="${DOCKER_TAG}"
registry="308535385114.dkr.ecr.us-east-1.amazonaws.com"
image="${registry}/pytorch/${IMAGE_NAME}"
login() {
aws ecr get-authorization-token --region us-east-1 --output text --query 'authorizationData[].authorizationToken' |
base64 -d |
cut -d: -f2 |
docker login -u AWS --password-stdin "$1"
}
# Only run these steps if not on github actions
if [[ -z "${GITHUB_ACTIONS}" ]]; then
# Retry on timeouts (can happen on job stampede).
retry login "${registry}"
# Logout on exit
trap "docker logout ${registry}" EXIT
fi
# Try to pull the previous image (perhaps we can reuse some layers)
# if [ -n "${last_tag}" ]; then
# docker pull "${image}:${last_tag}" || true
# fi
# Build new image
./build.sh ${IMAGE_NAME} -t "${image}:${tag}"
# Only push if `DOCKER_SKIP_PUSH` = false
if [ "${DOCKER_SKIP_PUSH:-true}" = "false" ]; then
# Only push if docker image doesn't exist already.
# ECR image tags are immutable so this will avoid pushing if only just testing if the docker jobs work
# NOTE: The only workflow that should push these images should be the docker-builds.yml workflow
if ! docker manifest inspect "${image}:${tag}" >/dev/null 2>/dev/null; then
docker push "${image}:${tag}"
fi
fi
if [ -z "${DOCKER_SKIP_S3_UPLOAD:-}" ]; then
trap "rm -rf ${IMAGE_NAME}:${tag}.tar" EXIT
docker save -o "${IMAGE_NAME}:${tag}.tar" "${image}:${tag}"
aws s3 cp "${IMAGE_NAME}:${tag}.tar" "s3://ossci-linux-build/pytorch/base/${IMAGE_NAME}:${tag}.tar" --acl public-read
fi

View File

@ -64,9 +64,9 @@ ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh install_vision.sh
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install rocm

View File

@ -0,0 +1 @@
4.27.4

View File

@ -0,0 +1 @@
b9d43c7dcac1fe05e851dd7be7187b108af593d2

View File

@ -0,0 +1 @@
05d67b9418cacda0d356c2102d7c1a887948b013

View File

@ -0,0 +1 @@
e6216047b8b0aef1fe8da6ca8667a3ad0a016411

View File

@ -0,0 +1,18 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
# Cache the test models at ~/.cache/torch/hub/
IMPORT_SCRIPT_FILENAME="/tmp/torchvision_import_script.py"
as_jenkins echo 'import torchvision; torchvision.models.mobilenet_v2(pretrained=True); torchvision.models.mobilenet_v3_large(pretrained=True);' > "${IMPORT_SCRIPT_FILENAME}"
pip_install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cpu
# Very weird quoting behavior here https://github.com/conda/conda/issues/10972,
# so echo the command to a file and run the file instead
conda_run python "${IMPORT_SCRIPT_FILENAME}"
# Cleaning up
conda_run pip uninstall -y torch torchvision
rm "${IMPORT_SCRIPT_FILENAME}" || true

View File

@ -13,7 +13,7 @@ as_jenkins() {
# NB: Pass on PATH and LD_LIBRARY_PATH to sudo invocation
# NB: This must be run from a directory that jenkins has access to,
# works around https://github.com/conda/conda-package-handling/pull/34
$SUDO -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
$SUDO -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env "PATH=$PATH" "LD_LIBRARY_PATH=$LD_LIBRARY_PATH" $*
}
conda_install() {
@ -30,3 +30,7 @@ conda_run() {
pip_install() {
as_jenkins conda run -n py_$ANACONDA_PYTHON_VERSION pip install --progress-bar off $*
}
get_pinned_commit() {
cat "${1}".txt
}

View File

@ -107,3 +107,6 @@ chgrp -R jenkins /var/lib/jenkins/.gradle
popd
rm -rf /var/lib/jenkins/.gradle/daemon
# Cache vision models used by the test
source "$(dirname "${BASH_SOURCE[0]}")/cache_vision_models.sh"

View File

@ -31,10 +31,13 @@ install_ubuntu() {
maybe_libomp_dev=""
fi
# TODO: Remove this once nvidia package repos are back online
# Comment out nvidia repositories to prevent them from getting apt-get updated, see https://github.com/pytorch/pytorch/issues/74968
# shellcheck disable=SC2046
sed -i 's/.*nvidia.*/# &/' $(find /etc/apt/ -type f -name "*.list")
# HACK: UCC testing relies on libnccl library from NVIDIA repo, and version 2.16 crashes
# See https://github.com/pytorch/pytorch/pull/105260#issuecomment-1673399729
if [[ "$UBUNTU_VERSION" == "20.04"* && "$CUDA_VERSION" == "11.8"* ]]; then
maybe_libnccl_dev="libnccl2=2.15.5-1+cuda11.8 libnccl-dev=2.15.5-1+cuda11.8 --allow-downgrades --allow-change-held-packages"
else
maybe_libnccl_dev=""
fi
# Install common dependencies
apt-get update
@ -63,6 +66,7 @@ install_ubuntu() {
libasound2-dev \
libsndfile-dev \
${maybe_libomp_dev} \
${maybe_libnccl_dev} \
software-properties-common \
wget \
sudo \
@ -77,20 +81,6 @@ install_ubuntu() {
# see: https://github.com/pytorch/pytorch/issues/65931
apt-get install -y libgnutls30
# cuda-toolkit does not work with gcc-11.2.0 which is default in Ubunutu 22.04
# see: https://github.com/NVlabs/instant-ngp/issues/119
if [[ "$UBUNTU_VERSION" == "22.04"* ]]; then
apt-get install -y g++-10
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 30
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 30
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-10 30
# https://www.spinics.net/lists/libreoffice/msg07549.html
sudo rm -rf /usr/lib/gcc/x86_64-linux-gnu/11
wget https://github.com/gcc-mirror/gcc/commit/2b2d97fc545635a0f6aa9c9ee3b017394bc494bf.patch -O noexecpt.patch
sudo patch /usr/include/c++/10/bits/range_access.h noexecpt.patch
fi
# Cleanup package manager
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

View File

@ -36,14 +36,11 @@ if [ -n "$ROCM_VERSION" ]; then
curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
else
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
install_ubuntu
;;
*)
install_binary
;;
esac
# TODO: Install the pre-built binary from S3 as building from source
# https://github.com/pytorch/sccache has started failing mysteriously
# in which sccache server couldn't start with the following error:
# sccache: error: Invalid argument (os error 22)
install_binary
fi
chmod a+x /opt/cache/bin/sccache

View File

@ -4,10 +4,7 @@ set -ex
if [ -n "$CLANG_VERSION" ]; then
if [[ $CLANG_VERSION == 7 && $UBUNTU_VERSION == 16.04 ]]; then
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
sudo apt-add-repository "deb http://apt.llvm.org/xenial/ llvm-toolchain-xenial-7 main"
elif [[ $CLANG_VERSION == 9 && $UBUNTU_VERSION == 18.04 ]]; then
if [[ $CLANG_VERSION == 9 && $UBUNTU_VERSION == 18.04 ]]; then
sudo apt-get update
# gpg-agent is not available by default on 18.04
sudo apt-get install -y --no-install-recommends gpg-agent
@ -28,11 +25,11 @@ if [ -n "$CLANG_VERSION" ]; then
fi
# Use update-alternatives to make this version the default
# TODO: Decide if overriding gcc as well is a good idea
# update-alternatives --install /usr/bin/gcc gcc /usr/bin/clang-"$CLANG_VERSION" 50
# update-alternatives --install /usr/bin/g++ g++ /usr/bin/clang++-"$CLANG_VERSION" 50
update-alternatives --install /usr/bin/clang clang /usr/bin/clang-"$CLANG_VERSION" 50
update-alternatives --install /usr/bin/clang++ clang++ /usr/bin/clang++-"$CLANG_VERSION" 50
# Override cc/c++ to clang as well
update-alternatives --install /usr/bin/cc cc /usr/bin/clang 50
update-alternatives --install /usr/bin/c++ c++ /usr/bin/clang++ 50
# clang's packaging is a little messed up (the runtime libs aren't
# added into the linker path), so give it a little help

View File

@ -7,6 +7,7 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
case "$MAJOR_PYTHON_VERSION" in
2)
@ -52,21 +53,23 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
# Install PyTorch conda deps, as per https://github.com/pytorch/pytorch README
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# TODO: Stop using `-c malfet`
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS} llvmdev=8.0.0 -c malfet
conda_install numpy=1.23.5 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.10" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} llvmdev=8.0.0
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.9" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy=1.19.2 ${CONDA_COMMON_DEPS} llvmdev=8.0.0
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
elif [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} llvmdev=8.0.0
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
else
# Install `typing-extensions` for 3.7
conda_install numpy=1.18.5 ${CONDA_COMMON_DEPS} typing-extensions
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS} typing-extensions
fi
# This is only supported in 3.8 upward
if [ "$MINOR_PYTHON_VERSION" -gt "7" ]; then
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
conda_install llvmdev=8.0.0 "libpython-static=${ANACONDA_PYTHON_VERSION}"
fi
# Use conda cmake in some cases. Conda cmake will be newer than our supported
@ -94,5 +97,22 @@ if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
pip_install scikit-learn==0.20.3
fi
if [ -n "$DOCS" ]; then
apt-get update
apt-get -y install expect-dev
# We are currently building docs with python 3.8 (min support version)
pip_install -r /opt/conda/requirements-docs.txt
fi
# HACK HACK HACK
# gcc-9 for ubuntu-18.04 from http://ppa.launchpad.net/ubuntu-toolchain-r/test/ubuntu
# Pulls llibstdc++6 13.1.0-8ubuntu1~18.04 which is too new for conda
# So remove libstdc++6.so.3.29 installed by https://anaconda.org/anaconda/libstdcxx-ng/files?version=11.2.0
# Same is true for gcc-12 from Ubuntu-22.04
if grep -e [12][82].04.[623] /etc/issue >/dev/null; then
rm /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/lib/libstdc++.so.6
fi
popd
fi

View File

@ -4,9 +4,9 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
CUDNN_NAME="cudnn-linux-x86_64-8.3.2.44_cuda11.5-archive"
if [[ ${CUDA_VERSION:0:4} == "11.7" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.5.0.96_cuda11-archive"
curl --retry 3 -OLs https://ossci-linux.s3.amazonaws.com/${CUDNN_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz

View File

@ -7,7 +7,7 @@ if [ -n "$KATEX" ]; then
# Ignore error if gpg-agent doesn't exist (for Ubuntu 16.04)
apt-get install -y gpg-agent || :
curl --retry 3 -sL https://deb.nodesource.com/setup_12.x | sudo -E bash -
curl --retry 3 -sL https://deb.nodesource.com/setup_16.x | sudo -E bash -
sudo apt-get install -y nodejs
curl --retry 3 -sS https://dl.yarnpkg.com/debian/pubkey.gpg | sudo apt-key add -

View File

@ -7,17 +7,10 @@ if [ -n "$GCC_VERSION" ]; then
# Need the official toolchain repo to get alternate packages
add-apt-repository ppa:ubuntu-toolchain-r/test
apt-get update
if [[ "$UBUNTU_VERSION" == "16.04" && "${GCC_VERSION:0:1}" == "5" ]]; then
apt-get install -y g++-5=5.4.0-6ubuntu1~16.04.12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-5 50
else
apt-get install -y g++-$GCC_VERSION
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-"$GCC_VERSION" 50
fi
apt-get install -y g++-$GCC_VERSION
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-"$GCC_VERSION" 50
update-alternatives --install /usr/bin/gcov gcov /usr/bin/gcov-"$GCC_VERSION" 50
# Cleanup package manager

View File

@ -0,0 +1,28 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
function install_huggingface() {
local version
version=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "transformers==${version}"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}
# Pango is needed for weasyprint which is needed for doctr
conda_install pango
install_huggingface
# install_timm

View File

@ -0,0 +1,51 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
# A bunch of custom pip dependencies for ONNX
pip_install \
beartype==0.10.4 \
filelock==3.9.0 \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.0 \
numpy==1.22.4
# ONNXRuntime should be installed before installing
# onnx-weekly. Otherwise, onnx-weekly could be
# overwritten by onnx.
pip_install \
onnxruntime==1.15.1 \
parameterized==0.8.1 \
pytest-cov==4.0.0 \
pytest-subtests==0.10.0 \
tabulate==0.9.0 \
transformers==4.31.0
# Using 1.15dev branch for the following not yet released features and fixes.
# - Segfault fix for shape inference.
# - Inliner to workaround ORT segfault.
pip_install onnx-weekly==1.15.0.dev20230717
# TODO: change this when onnx-script is on testPypi
# pip_install onnxscript-preview==0.1.0.dev20230809 --no-deps
# NOTE: temp change for CI to run on unpublished onnxscript PR.
pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@f69be19ebd3f2e0d7efe64b0c7be3329cbab3822" --no-deps
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/
IMPORT_SCRIPT_FILENAME="/tmp/onnx_import_script.py"
as_jenkins echo 'import transformers; transformers.AutoModel.from_pretrained("sshleifer/tiny-gpt2"); transformers.AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2");' > "${IMPORT_SCRIPT_FILENAME}"
# Need a PyTorch version for transformers to work
pip_install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
# Very weird quoting behavior here https://github.com/conda/conda/issues/10972,
# so echo the command to a file and run the file instead
conda_run python "${IMPORT_SCRIPT_FILENAME}"
# Cleaning up
conda_run pip uninstall -y torch
rm "${IMPORT_SCRIPT_FILENAME}" || true

View File

@ -61,13 +61,23 @@ install_ubuntu() {
rocprofiler-dev \
roctracer-dev
# precompiled miopen kernels added in ROCm 3.5; search for all unversioned packages
# precompiled miopen kernels added in ROCm 3.5, renamed in ROCm 5.5
# search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails
MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available"
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then
MIOPENHIPGFX=$(apt-cache search --names-only miopen-hip-gfx | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENHIPGFX}
fi
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}
MIOPENKERNELS=$(apt-cache search --names-only miopenkernels | awk '{print $1}' | grep -F -v . || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available" && exit 1
else
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated ${MIOPENKERNELS}
fi
fi
# Cleanup
@ -123,6 +133,24 @@ install_centos() {
rocprofiler-dev \
roctracer-dev
# precompiled miopen kernels; search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails
if [[ $(ver $ROCM_VERSION) -ge $(ver 5.5) ]]; then
MIOPENHIPGFX=$(yum -q search miopen-hip-gfx | grep miopen-hip-gfx | awk '{print $1}'| grep -F kdb. || true)
if [[ "x${MIOPENHIPGFX}" = x ]]; then
echo "miopen-hip-gfx package not available" && exit 1
else
yum install -y ${MIOPENHIPGFX}
fi
else
MIOPENKERNELS=$(yum -q search miopenkernels | grep miopenkernels- | awk '{print $1}'| grep -F kdb. || true)
if [[ "x${MIOPENKERNELS}" = x ]]; then
echo "miopenkernels package not available" && exit 1
else
yum install -y ${MIOPENKERNELS}
fi
fi
# Cleanup
yum clean all
rm -rf /var/cache/yum

View File

@ -6,7 +6,7 @@ set -ex
git clone https://bitbucket.org/icl/magma.git
pushd magma
# Fixes memory leaks of magma found while executing linalg UTs
git checkout 5959b8783e45f1809812ed96ae762f38ee701972
git checkout 28592a7170e4b3707ed92644bf4a689ed600c27f
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
@ -18,7 +18,7 @@ else
amdgpu_targets=`rocm_agent_enumerator | grep -v gfx000 | sort -u | xargs`
fi
for arch in $amdgpu_targets; do
echo "DEVCCFLAGS += --amdgpu-target=$arch" >> make.inc
echo "DEVCCFLAGS += --offload-arch=$arch" >> make.inc
done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc

View File

@ -0,0 +1,66 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
get_conda_version() {
as_jenkins conda list -n py_$ANACONDA_PYTHON_VERSION | grep -w $* | head -n 1 | awk '{print $2}'
}
conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/ROCmSoftwarePlatform/triton"
TRITON_TEXT_FILE="triton-rocm"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton"
fi
# The logic here is copied from .ci/pytorch/common_utils.sh
TRITON_PINNED_COMMIT=$(get_pinned_commit ${TRITON_TEXT_FILE})
apt update
apt-get install -y gpg-agent
if [ -n "${CONDA_CMAKE}" ]; then
# Keep the current cmake and numpy version here, so we can reinstall them later
CMAKE_VERSION=$(get_conda_version cmake)
NUMPY_VERSION=$(get_conda_version numpy)
fi
if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
fi
if [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
elif [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
else
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
fi
if [ -n "${CONDA_CMAKE}" ]; then
# TODO: This is to make sure that the same cmake and numpy version from install conda
# script is used. Without this step, the newer cmake version (3.25.2) downloaded by
# triton build step via pip will fail to detect conda MKL. Once that issue is fixed,
# this can be removed.
#
# The correct numpy version also needs to be set here because conda claims that it
# causes inconsistent environment. Without this, conda will attempt to install the
# latest numpy version, which fails ASAN tests with the following import error: Numba
# needs NumPy 1.20 or less.
conda_reinstall cmake="${CMAKE_VERSION}"
conda_reinstall numpy="${NUMPY_VERSION}"
fi

View File

@ -43,3 +43,6 @@ case "$ID" in
exit 1
;;
esac
# Cache vision models used by the test
source "$(dirname "${BASH_SOURCE[0]}")/cache_vision_models.sh"

View File

@ -25,10 +25,10 @@ coremltools==5.0b5
#Pinned versions:
#test that import:
expecttest==0.1.3
expecttest==0.1.6
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.1.3
#Pinned versions: 0.1.6
#test that import:
flatbuffers==2.0
@ -62,7 +62,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#mkl-devel
# see mkl
#mock # breaks ci/circleci: docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c
#mock
#Description: A testing library that allows you to replace parts of your
#system under test with mock objects
#Pinned versions:
@ -75,16 +75,16 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==0.960
mypy==1.4.1
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 0.960
#Pinned versions: 1.4.1
#test that import: test_typing.py, test_type_hints.py
networkx==2.6.3
networkx==2.8.8
#Description: creation, manipulation, and study of
#the structure, dynamics, and functions of complex networks
#Pinned versions: 2.6.3 (latest version that works with Python 3.7+)
#Pinned versions: 2.8.8
#test that import: functorch
#ninja
@ -124,7 +124,8 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
#pillow
pillow==9.3.0 ; python_version <= "3.8"
pillow==9.5.0 ; python_version > "3.8"
#Description: Python Imaging Library fork
#Pinned versions:
#test that import:
@ -139,17 +140,17 @@ psutil
#Pinned versions:
#test that import: test_profiler.py, test_openmp.py, test_dataloader.py
pytest
pytest==7.3.2
#Description: testing framework
#Pinned versions:
#test that import: test_typing.py, test_cpp_extensions_aot.py, run_test.py
pytest-xdist
pytest-xdist==3.3.1
#Description: plugin for running pytest in parallel
#Pinned versions:
#test that import:
pytest-shard
pytest-shard==0.1.2
#Description: plugin spliting up tests in pytest
#Pinned versions:
#test that import:
@ -159,7 +160,7 @@ pytest-flakefinder==1.1.0
#Pinned versions: 1.1.0
#test that import:
pytest-rerunfailures
pytest-rerunfailures>=10.3
#Description: plugin for rerunning failure tests in pytest
#Pinned versions:
#test that import:
@ -179,7 +180,7 @@ xdoctest==1.1.0
#Pinned versions: 1.1.0
#test that import:
pygments==2.12.0
pygments==2.15.0
#Description: support doctest highlighting
#Pinned versions: 2.12.0
#test that import: the doctests
@ -199,7 +200,8 @@ pygments==2.12.0
#Pinned versions: 10.9.0
#test that import:
scikit-image
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.20.0 ; python_version >= "3.10"
#Description: image processing routines
#Pinned versions:
#test that import: test_nn.py
@ -211,7 +213,7 @@ scikit-image
scipy==1.6.3 ; python_version < "3.10"
scipy==1.8.1 ; python_version == "3.10"
scipy==1.9.3 ; python_version == "3.11"
scipy==1.10.1 ; python_version == "3.11"
# Pin SciPy because of failing distribution tests (see #60347)
#Description: scientific python
#Pinned versions: 1.6.3
@ -224,7 +226,7 @@ scipy==1.9.3 ; python_version == "3.11"
#Pinned versions:
#test that import:
tb-nightly
tb-nightly==2.13.0a20230426
#Description: TensorBoard
#Pinned versions:
#test that import:
@ -244,9 +246,9 @@ unittest-xml-reporting<=3.2.0,>=2.0.0
#Pinned versions:
#test that import:
lintrunner==0.9.2
#Description: all about linters
#Pinned versions: 0.9.2
lintrunner==0.10.7
#Description: all about linters!
#Pinned versions: 0.10.7
#test that import:
rockset==1.0.3
@ -258,3 +260,18 @@ ghstack==0.7.1
#Description: ghstack tool
#Pinned versions: 0.7.1
#test that import:
jinja2==3.1.2
#Description: jinja2 template engine
#Pinned versions: 3.1.2
#test that import:
pytest-cpp==2.3.0
#Description: This is used by pytest to invoke C++ tests
#Pinned versions: 2.3.0
#test that import:
z3-solver
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:

View File

@ -0,0 +1,49 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git#egg=pytorch_sphinx_theme
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought is probably
# something related to Docker setup. We can investigate this later
sphinxcontrib.katex==0.8.6
#Description: This is used to generate PyTorch docs
#Pinned versions: 0.8.6
matplotlib==3.5.3
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
tensorboard==2.13.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 2.13.0
breathe==4.34.0
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 4.34.0
exhale==0.2.3
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 0.2.3
docutils==0.16
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 0.16
bs4==0.0.1
#Description: This is used to generate PyTorch C++ docs
#Pinned versions: 0.0.1
IPython==8.12.0
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 8.12.0
myst-nb==0.17.2
#Description: This is used to generate PyTorch functorch docs
#Pinned versions: 0.13.2
# The following are required to build torch.distributed.elastic.rendezvous.etcd* docs
python-etcd==0.4.5
sphinx-copybutton==0.5.0
sphinx-panels==0.4.1
myst-parser==0.18.1

View File

@ -0,0 +1 @@
2.1.0

View File

@ -58,9 +58,9 @@ ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh install_vision.sh
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install UCC
@ -85,6 +85,24 @@ COPY ./common/install_cmake.sh install_cmake.sh
RUN if [ -n "${CMAKE_VERSION}" ]; then bash ./install_cmake.sh; fi
RUN rm install_cmake.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
@ -127,6 +145,7 @@ RUN rm install_cudnn.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
USER jenkins
CMD ["bash"]

View File

@ -55,9 +55,9 @@ ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh install_vision.sh
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# Install rocm
@ -68,6 +68,7 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
ENV ROCM_PATH /opt/rocm
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
@ -89,6 +90,16 @@ COPY ./common/install_ninja.sh install_ninja.sh
RUN if [ -n "${NINJA_VERSION}" ]; then bash ./install_ninja.sh; fi
RUN rm install_ninja.sh
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -36,12 +36,14 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
# Install conda and other packages (e.g., numpy, pytest)
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
COPY requirements-ci.txt /opt/conda/requirements-ci.txt
ENV DOCS=$DOCS
COPY requirements-ci.txt requirements-docs.txt /opt/conda/
COPY ./common/install_conda.sh install_conda.sh
COPY ./common/common_utils.sh common_utils.sh
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt
RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt
# Install gcc
ARG GCC_VERSION
@ -86,20 +88,20 @@ ENV INSTALLED_DB ${DB}
# (optional) Install vision packages like OpenCV and ffmpeg
ARG VISION
COPY ./common/install_vision.sh install_vision.sh
COPY ./common/install_vision.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
RUN if [ -n "${VISION}" ]; then bash ./install_vision.sh; fi
RUN rm install_vision.sh
RUN rm install_vision.sh cache_vision_models.sh common_utils.sh
ENV INSTALLED_VISION ${VISION}
# (optional) Install Android NDK
ARG ANDROID
ARG ANDROID_NDK
ARG GRADLE_VERSION
COPY ./common/install_android.sh install_android.sh
COPY ./common/install_android.sh ./common/cache_vision_models.sh ./common/common_utils.sh ./
COPY ./android/AndroidManifest.xml AndroidManifest.xml
COPY ./android/build.gradle build.gradle
RUN if [ -n "${ANDROID}" ]; then bash ./install_android.sh; fi
RUN rm install_android.sh
RUN rm install_android.sh cache_vision_models.sh common_utils.sh
RUN rm AndroidManifest.xml
RUN rm build.gradle
ENV INSTALLED_ANDROID ${ANDROID}
@ -134,6 +136,29 @@ ENV OPENSSL_ROOT_DIR /opt/openssl
ENV OPENSSL_DIR /opt/openssl
RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface.txt huggingface.txt
COPY ci_commit_pins/timm.txt timm.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface.txt
ARG TRITON
# Install triton, this needs to be done before sccache because the latter will
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./
RUN if [ -n "${ONNX}" ]; then bash ./install_onnx.sh; fi
RUN rm install_onnx.sh common_utils.sh
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH

View File

@ -3,72 +3,18 @@
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if [[ ${BUILD_ENVIRONMENT} == *onnx* ]]; then
pip install click mock tabulate networkx==2.0
pip -q install --user "file:///var/lib/jenkins/workspace/third_party/onnx#egg=onnx"
fi
# Use to retry ONNX test, only retry it twice
retry () {
"$@" || (sleep 60 && "$@")
}
# Skip tests in environments where they are not built/applicable
if [[ "${BUILD_ENVIRONMENT}" == *-android* ]]; then
echo 'Skipping tests'
exit 0
fi
if [[ "${BUILD_ENVIRONMENT}" == *-rocm* ]]; then
# temporary to locate some kernel issues on the CI nodes
export HSAKMT_DEBUG_LEVEL=4
fi
# These additional packages are needed for circleci ROCm builds.
if [[ $BUILD_ENVIRONMENT == *rocm* ]]; then
# Need networkx 2.0 because bellmand_ford was moved in 2.1 . Scikit-image by
# defaults installs the most recent networkx version, so we install this lower
# version explicitly before scikit-image pulls it in as a dependency
pip install networkx==2.0
# click - onnx
pip install --progress-bar off click protobuf tabulate virtualenv mock typing-extensions
fi
################################################################################
# Python tests #
################################################################################
if [[ "$BUILD_ENVIRONMENT" == *cmake* ]]; then
exit 0
fi
# If pip is installed as root, we must use sudo.
# CircleCI docker images could install conda as jenkins user, or use the OS's python package.
PIP=$(which pip)
PIP_USER=$(stat --format '%U' $PIP)
CURRENT_USER=$(id -u -n)
if [[ "$PIP_USER" = root && "$CURRENT_USER" != root ]]; then
MAYBE_SUDO=sudo
fi
# Uninstall pre-installed hypothesis and coverage to use an older version as newer
# versions remove the timeout parameter from settings which ideep/conv_transpose_test.py uses
$MAYBE_SUDO pip -q uninstall -y hypothesis
$MAYBE_SUDO pip -q uninstall -y coverage
# "pip install hypothesis==3.44.6" from official server is unreliable on
# CircleCI, so we host a copy on S3 instead
$MAYBE_SUDO pip -q install attrs==18.1.0 -f https://s3.amazonaws.com/ossci-linux/wheels/attrs-18.1.0-py2.py3-none-any.whl
$MAYBE_SUDO pip -q install coverage==4.5.1 -f https://s3.amazonaws.com/ossci-linux/wheels/coverage-4.5.1-cp36-cp36m-macosx_10_12_x86_64.whl
$MAYBE_SUDO pip -q install hypothesis==4.57.1
##############
# ONNX tests #
##############
if [[ "$BUILD_ENVIRONMENT" == *onnx* ]]; then
# TODO: This can be removed later once vision is also part of the Docker image
pip install -q --user --no-use-pep517 "git+https://github.com/pytorch/vision.git@$(cat .github/ci_commit_pins/vision.txt)"
pip install -q --user transformers==4.25.1
pip install -q --user ninja flatbuffers==2.0 numpy==1.22.4 onnxruntime==1.14.0 beartype==0.10.4
# TODO: change this when onnx 1.13.1 is released.
pip install --no-use-pep517 'onnx @ git+https://github.com/onnx/onnx@e192ba01e438d22ca2dedd7956e28e3551626c91'
# TODO: change this when onnx-script is on testPypi
pip install 'onnx-script @ git+https://github.com/microsoft/onnx-script@a71e35bcd72537bf7572536ee57250a0c0488bf6'
# numba requires numpy <= 1.20, onnxruntime requires numpy >= 1.21.
# We don't actually need it for our tests, but it's imported if it's present, so uninstall.
pip uninstall -q --yes numba
# JIT C++ extensions require ninja, so put it into PATH.
export PATH="/var/lib/jenkins/.local/bin:$PATH"
"$ROOT_DIR/scripts/onnx/test.sh"
# NB: ONNX test is fast (~15m) so it's ok to retry it few more times to avoid any flaky issue, we
# need to bring this to the standard PyTorch run_test eventually. The issue will be tracked in
# https://github.com/pytorch/pytorch/issues/98626
retry "$ROOT_DIR/scripts/onnx/test.sh"
fi

View File

@ -1,42 +0,0 @@
#!/bin/bash
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
# need to set it yourself.
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Clang version:"
clang --version
python tools/stats/export_test_times.py
# detect_leaks=0: Python is very leaky, so we need suppress it
# symbolize=1: Gives us much better errors when things go wrong
export ASAN_OPTIONS=detect_leaks=0:detect_stack_use_after_return=1:symbolize=1:detect_odr_violation=0
if [ -n "$(which conda)" ]; then
export CMAKE_PREFIX_PATH=/opt/conda
fi
# TODO: Make the ASAN flags a centralized env var and unify with USE_ASAN option
CC="clang" CXX="clang++" LDSHARED="clang --shared" \
CFLAGS="-fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -fsanitize-address-use-after-scope -shared-libasan" \
USE_ASAN=1 USE_CUDA=0 USE_MKLDNN=0 \
python setup.py bdist_wheel
pip_install_whl "$(echo dist/*.whl)"
# Test building via the sdist source tarball
python setup.py sdist
mkdir -p /tmp/tmp
pushd /tmp/tmp
tar zxf "$(dirname "${BASH_SOURCE[0]}")/../../dist/"*.tar.gz
cd torch-*
python setup.py build --cmake-only
popd
print_sccache_stats
assert_git_not_dirty

View File

@ -1,29 +0,0 @@
#!/bin/bash
# Required environment variable: $BUILD_ENVIRONMENT
# (This is set by default in the Docker images we build, so you don't
# need to set it yourself.
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Clang version:"
clang --version
python tools/stats/export_test_times.py
if [ -n "$(which conda)" ]; then
export CMAKE_PREFIX_PATH=/opt/conda
fi
CC="clang" CXX="clang++" LDSHARED="clang --shared" \
CFLAGS="-fsanitize=thread" \
USE_TSAN=1 USE_CUDA=0 USE_MKLDNN=0 \
python setup.py bdist_wheel
pip_install_whl "$(echo dist/*.whl)"
print_sccache_stats
assert_git_not_dirty

View File

@ -11,14 +11,6 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
if [[ "$BUILD_ENVIRONMENT" == *-clang7-asan* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-asan.sh" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" == *-clang7-tsan* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-tsan.sh" "$@"
fi
if [[ "$BUILD_ENVIRONMENT" == *-mobile-*build* ]]; then
exec "$(dirname "${BASH_SOURCE[0]}")/build-mobile.sh" "$@"
fi
@ -44,6 +36,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *cuda11.3* && "$BUILD_ENVIRONMENT" != *clang* ]]; then
# TODO: there is a linking issue when building with UCC using clang,
# disable it for now and to be fix later.
# TODO: disable UCC temporarily to enable CUDA 12.1 in CI
export USE_UCC=1
export USE_SYSTEM_UCC=1
fi
@ -171,6 +164,15 @@ if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
export CXX=clang++
fi
if [[ "$BUILD_ENVIRONMENT" == *-clang*-asan* ]]; then
export LDSHARED="clang --shared"
export USE_CUDA=0
export USE_ASAN=1
export USE_MKLDNN=0
export UBSAN_FLAGS="-fno-sanitize-recover=all;-fno-sanitize=float-divide-by-zero;-fno-sanitize=float-cast-overflow"
unset USE_LLVM
fi
if [[ "${BUILD_ENVIRONMENT}" == *no-ops* ]]; then
export USE_PER_OPERATOR_HEADERS=0
fi
@ -191,16 +193,19 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e
get_bazel
install_sccache_nvcc_for_bazel
# Leave 1 CPU free and use only up to 80% of memory to reduce the change of crashing
# the runner
BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" //...
# Build torch, the Python module, and tests for CPU-only
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" --config=cpu-only :torch :_C.so :all_tests
if [[ "$CUDA_VERSION" == "cpu" ]]; then
# Build torch, the Python module, and tests for CPU-only
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" --config=cpu-only :torch :torch/_C.so :all_tests
else
tools/bazel build --config=no-tty "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" //...
fi
else
# check that setup.py would fail with bad arguments
echo "The next three invocations are expected to fail with invalid command error messages."

View File

@ -31,7 +31,7 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
# as though sccache still gets used even when the sscache server isn't started
# explicitly
echo "Skipping sccache server initialization, setting environment variables"
export SCCACHE_IDLE_TIMEOUT=1200
export SCCACHE_IDLE_TIMEOUT=0
export SCCACHE_ERROR_LOG=~/sccache_error.log
export RUST_LOG=sccache::server=error
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* ]]; then
@ -39,11 +39,12 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
else
# increasing SCCACHE_IDLE_TIMEOUT so that extension_backend_test.cpp can build after this PR:
# https://github.com/pytorch/pytorch/pull/16645
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=1200 RUST_LOG=sccache::server=error sccache --start-server
SCCACHE_ERROR_LOG=~/sccache_error.log SCCACHE_IDLE_TIMEOUT=0 RUST_LOG=sccache::server=error sccache --start-server
fi
# Report sccache stats for easier debugging
sccache --zero-stats
# Report sccache stats for easier debugging. It's ok if this commands
# timeouts and fails on MacOS
sccache --zero-stats || true
fi
if which ccache > /dev/null; then

View File

@ -22,7 +22,3 @@ fi
# TODO: Renable libtorch testing for MacOS, see https://github.com/pytorch/pytorch/issues/62598
# shellcheck disable=SC2034
BUILD_TEST_LIBTORCH=0
retry () {
"$@" || (sleep 1 && "$@") || (sleep 2 && "$@")
}

View File

@ -80,19 +80,34 @@ function get_exit_code() {
}
function get_bazel() {
if [[ $(uname) == "Darwin" ]]; then
# download bazel version
retry curl https://github.com/bazelbuild/bazel/releases/download/4.2.1/bazel-4.2.1-darwin-x86_64 -Lo tools/bazel
# verify content
echo '74d93848f0c9d592e341e48341c53c87e3cb304a54a2a1ee9cff3df422f0b23c tools/bazel' | shasum -a 256 -c >/dev/null
else
# download bazel version
retry curl https://ossci-linux.s3.amazonaws.com/bazel-4.2.1-linux-x86_64 -o tools/bazel
# verify content
echo '1a4f3a3ce292307bceeb44f459883859c793436d564b95319aacb8af1f20557c tools/bazel' | shasum -a 256 -c >/dev/null
fi
# Download and use the cross-platform, dependency-free Python
# version of Bazelisk to fetch the platform specific version of
# Bazel to use from .bazelversion.
retry curl --location --output tools/bazel \
https://raw.githubusercontent.com/bazelbuild/bazelisk/v1.16.0/bazelisk.py
shasum --algorithm=1 --check \
<(echo 'd4369c3d293814d3188019c9f7527a948972d9f8 tools/bazel')
chmod u+x tools/bazel
}
chmod +x tools/bazel
# This function is bazel specific because of the bug
# in the bazel that requires some special paths massaging
# as a workaround. See
# https://github.com/bazelbuild/bazel/issues/10167
function install_sccache_nvcc_for_bazel() {
sudo mv /usr/local/cuda/bin/nvcc /usr/local/cuda/bin/nvcc-real
# Write the `/usr/local/cuda/bin/nvcc`
cat << EOF | sudo tee /usr/local/cuda/bin/nvcc
#!/bin/sh
if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
exec sccache /usr/local/cuda/bin/nvcc "\$@"
else
exec external/local_cuda/cuda/bin/nvcc-real "\$@"
fi
EOF
sudo chmod +x /usr/local/cuda/bin/nvcc
}
function install_monkeytype {
@ -105,16 +120,62 @@ function get_pinned_commit() {
cat .github/ci_commit_pins/"${1}".txt
}
function install_torchtext() {
function install_torchaudio() {
local commit
commit=$(get_pinned_commit text)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${commit}"
commit=$(get_pinned_commit audio)
if [[ "$1" == "cuda" ]]; then
# TODO: This is better to be passed as a parameter from _linux-test workflow
# so that it can be consistent with what is set in build
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
else
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/audio.git@${commit}"
fi
}
function install_torchtext() {
local data_commit
local text_commit
data_commit=$(get_pinned_commit data)
text_commit=$(get_pinned_commit text)
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/data.git@${data_commit}"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/text.git@${text_commit}"
}
function install_torchvision() {
local orig_preload
local commit
commit=$(get_pinned_commit vision)
orig_preload=${LD_PRELOAD}
if [ -n "${LD_PRELOAD}" ]; then
# Silence dlerror to work around a glibc ASAN bug, see https://sourceware.org/bugzilla/show_bug.cgi?id=27653#c9
echo 'char* dlerror(void) { return "";}'|gcc -fpic -shared -o "${HOME}/dlerror.so" -x c -
LD_PRELOAD=${orig_preload}:${HOME}/dlerror.so
fi
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/vision.git@${commit}"
if [ -n "${LD_PRELOAD}" ]; then
LD_PRELOAD=${orig_preload}
fi
}
function install_torchrec_and_fbgemm() {
local torchrec_commit
torchrec_commit=$(get_pinned_commit torchrec)
local fbgemm_commit
fbgemm_commit=$(get_pinned_commit fbgemm)
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
}
function install_numpy_pytorch_interop() {
local commit
commit=$(get_pinned_commit numpy_pytorch_interop)
# TODO: --no-use-pep517 will result in failure.
pip_install --user "git+https://github.com/Quansight-Labs/numpy_pytorch_interop.git@${commit}"
}
function clone_pytorch_xla() {
@ -129,59 +190,15 @@ function clone_pytorch_xla() {
fi
}
function install_filelock() {
pip_install filelock
}
function install_triton() {
local commit
commit=$(get_pinned_commit triton)
local short_hash
short_hash=$(echo "${commit}"|cut -c -10)
local index_url
index_url=https://download.pytorch.org/whl/nightly/cpu
if [[ "${TEST_CONFIG}" == *rocm* ]]; then
echo "skipping triton due to rocm"
elif pip install "pytorch-triton==2.0.0+${short_hash}" --index-url "${index_url}"; then
echo "Using prebuilt version ${short_hash}"
else
if [[ "${BUILD_ENVIRONMENT}" == *gcc7* ]]; then
# Triton needs gcc-9 to build
sudo apt-get install -y g++-9
CXX=g++-9 pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
elif [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
# Triton needs <filesystem>, which surprisingly is not available with the clang-9 toolchain
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt-get install -y g++-9
CXX=g++-9 pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
else
pip_install --user "git+https://github.com/openai/triton@${commit}#subdirectory=python"
fi
pip_install --user jinja2
fi
}
function setup_torchdeploy_deps(){
conda install -y -n "py_${ANACONDA_PYTHON_VERSION}" "libpython-static=${ANACONDA_PYTHON_VERSION}"
local CC
local CXX
CC="$(which gcc)"
CXX="$(which g++)"
export CC
export CXX
pip install --upgrade pip
}
function checkout_install_torchdeploy() {
local commit
commit=$(get_pinned_commit multipy)
setup_torchdeploy_deps
pushd ..
git clone --recurse-submodules https://github.com/pytorch/multipy.git
pushd multipy
git checkout "${commit}"
python multipy/runtime/example/generate_examples.py
pip install -e . --install-option="--cudatests"
BUILD_CUDA_TESTS=1 pip install -e .
popd
popd
}
@ -195,26 +212,21 @@ function test_torch_deploy(){
popd
}
function install_huggingface() {
local commit
commit=$(get_pinned_commit huggingface)
pip_install pandas
pip_install scipy
pip_install "git+https://github.com/huggingface/transformers.git@${commit}#egg=transformers"
}
function install_timm() {
local commit
commit=$(get_pinned_commit timm)
pip_install pandas
pip_install scipy
pip_install z3-solver
pip_install "git+https://github.com/rwightman/pytorch-image-models@${commit}"
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
git clone https://github.com/pytorch/benchmark torchbench
pushd torchbench
git checkout no_torchaudio
git checkout "$commit"
if [ "$1" ]; then
python install.py --continue_on_fail models "$@"
@ -226,10 +238,6 @@ function checkout_install_torchbench() {
popd
}
function test_functorch() {
python test/run_test.py --functorch --verbose
}
function print_sccache_stats() {
echo 'PyTorch Build Statistics'
sccache --show-stats

View File

@ -1,8 +1,4 @@
# =================== The following code **should** be executed inside Docker container ===================
# Install dependencies
sudo apt-get -y update
sudo apt-get -y install expect-dev
#!/bin/bash
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
@ -20,7 +16,7 @@ echo "cpp_doc_push_script.sh: Invoked with $*"
# but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to
# try and gather it first, just so we don't potentially break people who rely on this script
# Argument 2: What version of the Python API docs we are building.
version="${2:-${DOCS_VERSION:-master}}"
version="${2:-${DOCS_VERSION:-main}}"
if [ -z "$version" ]; then
echo "error: cpp_doc_push_script.sh: version (arg2) not specified"
exit 1
@ -34,11 +30,6 @@ echo "error: cpp_doc_push_script.sh: install_path (arg1) not specified"
exit 1
fi
is_main_doc=false
if [ "$version" == "master" ]; then
is_main_doc=true
fi
echo "install_path: $install_path version: $version"
# ======================== Building PyTorch C++ API Docs ========================
@ -53,7 +44,6 @@ set -ex
# Generate ATen files
pushd "${pt_checkout}"
pip install -r requirements.txt
time python -m torchgen.gen \
-s aten/src/ATen \
-d build/aten/src/ATen
@ -68,7 +58,6 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
pip install -r requirements.txt
time make VERBOSE=1 html -j
popd
@ -79,7 +68,7 @@ pushd cppdocs
# Purge everything with some exceptions
mkdir /tmp/cppdocs-sync
mv _config.yml README.md /tmp/cppdocs-sync/
rm -rf *
rm -rf ./*
# Copy over all the newly generated HTML
cp -r "${pt_checkout}"/docs/cpp/build/html/* .
@ -102,4 +91,3 @@ if [[ "${WITH_PUSH:-}" == true ]]; then
fi
popd
# =================== The above code **should** be executed inside Docker container ===================

View File

@ -1,10 +1,10 @@
from datetime import datetime, timedelta
from tempfile import mkdtemp
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
temp_dir = mkdtemp()
print(temp_dir)
@ -16,37 +16,43 @@ def genrsa(path):
key_size=2048,
)
with open(path, "wb") as f:
f.write(key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption(),
))
f.write(
key.private_bytes(
encoding=serialization.Encoding.PEM,
format=serialization.PrivateFormat.TraditionalOpenSSL,
encryption_algorithm=serialization.NoEncryption(),
)
)
return key
def create_cert(path, C, ST, L, O, key):
subject = issuer = x509.Name([
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
])
cert = x509.CertificateBuilder().subject_name(
subject
).issuer_name(
issuer
).public_key(
key.public_key()
).serial_number(
x509.random_serial_number()
).not_valid_before(
datetime.utcnow()
).not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow() + timedelta(days=10)
).add_extension(
x509.BasicConstraints(ca=True, path_length=None), critical=True,
).sign(key, hashes.SHA256())
subject = issuer = x509.Name(
[
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
cert = (
x509.CertificateBuilder()
.subject_name(subject)
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
+ timedelta(days=10)
)
.add_extension(
x509.BasicConstraints(ca=True, path_length=None),
critical=True,
)
.sign(key, hashes.SHA256())
)
# Write our certificate out to disk.
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
@ -54,43 +60,65 @@ def create_cert(path, C, ST, L, O, key):
def create_req(path, C, ST, L, O, key):
csr = x509.CertificateSigningRequestBuilder().subject_name(x509.Name([
# Provide various details about who we are.
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
])).sign(key, hashes.SHA256())
csr = (
x509.CertificateSigningRequestBuilder()
.subject_name(
x509.Name(
[
# Provide various details about who we are.
x509.NameAttribute(NameOID.COUNTRY_NAME, C),
x509.NameAttribute(NameOID.STATE_OR_PROVINCE_NAME, ST),
x509.NameAttribute(NameOID.LOCALITY_NAME, L),
x509.NameAttribute(NameOID.ORGANIZATION_NAME, O),
]
)
)
.sign(key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(csr.public_bytes(serialization.Encoding.PEM))
return csr
def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
cert = x509.CertificateBuilder().subject_name(
csr_cert.subject
).issuer_name(
ca_cert.subject
).public_key(
csr_cert.public_key()
).serial_number(
x509.random_serial_number()
).not_valid_before(
datetime.utcnow()
).not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow() + timedelta(days=10)
# Sign our certificate with our private key
).sign(private_ca_key, hashes.SHA256())
cert = (
x509.CertificateBuilder()
.subject_name(csr_cert.subject)
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
+ timedelta(days=10)
# Sign our certificate with our private key
)
.sign(private_ca_key, hashes.SHA256())
)
with open(path, "wb") as f:
f.write(cert.public_bytes(serialization.Encoding.PEM))
return cert
ca_key = genrsa(temp_dir + "/ca.key")
ca_cert = create_cert(temp_dir + "/ca.pem", u"US", u"New York", u"New York", u"Gloo Certificate Authority", ca_key)
ca_cert = create_cert(
temp_dir + "/ca.pem",
"US",
"New York",
"New York",
"Gloo Certificate Authority",
ca_key,
)
pkey = genrsa(temp_dir + "/pkey.key")
csr = create_req(temp_dir + "/csr.csr", u"US", u"California", u"San Francisco", u"Gloo Testing Company", pkey)
csr = create_req(
temp_dir + "/csr.csr",
"US",
"California",
"San Francisco",
"Gloo Testing Company",
pkey,
)
cert = sign_certificate_request(temp_dir + "/cert.pem", csr, ca_cert, ca_key)

View File

@ -6,5 +6,4 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch docs"
cd docs
pip_install -r requirements.txt
make doctest

View File

@ -0,0 +1,40 @@
#!/bin/bash
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
version=${DOCS_VERSION:-nightly}
echo "version: $version"
# Build functorch docs
pushd $pt_checkout/functorch/docs
make html
popd
git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages
pushd functorch_ghpages
if [ "$version" == "main" ]; then
version=nightly
fi
git rm -rf "$version" || true
mv "$pt_checkout/functorch/docs/build/html" "$version"
git add "$version" || true
git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin gh-pages
fi
popd

View File

@ -40,8 +40,14 @@ cross_compile_arm64() {
USE_DISTRIBUTED=0 CMAKE_OSX_ARCHITECTURES=arm64 MACOSX_DEPLOYMENT_TARGET=11.0 USE_MKLDNN=OFF USE_QNNPACK=OFF WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_arm64() {
# Compilation for arm64
# TODO: Compile with OpenMP support (but this causes CI regressions, as cross-compilation was done with OpenMP disabled)
USE_DISTRIBUTED=0 USE_OPENMP=0 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
}
compile_x86_64() {
USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel
USE_DISTRIBUTED=0 WERROR=1 python setup.py bdist_wheel --plat-name=macosx_10_9_x86_64
}
build_lite_interpreter() {
@ -62,8 +68,14 @@ build_lite_interpreter() {
"${CPP_BUILD}/caffe2/build/bin/test_lite_interpreter_runtime"
}
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} = *arm64* ]]; then
cross_compile_arm64
if [[ $(uname -m) == "arm64" ]]; then
compile_arm64
else
cross_compile_arm64
fi
elif [[ ${BUILD_ENVIRONMENT} = *lite-interpreter* ]]; then
export BUILD_LITE_INTERPRETER=1
build_lite_interpreter

View File

@ -9,6 +9,25 @@ sysctl -a | grep machdep.cpu
# These are required for both the build job and the test job.
# In the latter to test cpp extensions.
export MACOSX_DEPLOYMENT_TARGET=10.9
export MACOSX_DEPLOYMENT_TARGET=11.0
export CXX=clang++
export CC=clang
print_cmake_info() {
CMAKE_EXEC=$(which cmake)
echo "$CMAKE_EXEC"
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate cmake signature, so signing it again here
# to trust the executable. EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# with an exit code 137 otherwise
codesign -f -s - "${CMAKE_EXEC}" || true
}

View File

@ -25,6 +25,7 @@ setup_test_python() {
# using the address associated with the loopback interface.
export GLOO_SOCKET_IFNAME=lo0
echo "Ninja version: $(ninja --version)"
echo "Python version: $(which python) ($(python --version))"
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
@ -70,37 +71,19 @@ test_libtorch() {
VERBOSE=1 DEBUG=1 python "$BUILD_LIBTORCH_PY"
popd
python tools/download_mnist.py --quiet -d test/cpp/api/mnist
MNIST_DIR="${PWD}/test/cpp/api/mnist"
python tools/download_mnist.py --quiet -d "${MNIST_DIR}"
# Unfortunately it seems like the test can't load from miniconda3
# without these paths being set
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$PWD/miniconda3/lib"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:$PWD/miniconda3/lib"
TORCH_CPP_TEST_MNIST_PATH="test/cpp/api/mnist" "$CPP_BUILD"/caffe2/bin/test_api
TORCH_CPP_TEST_MNIST_PATH="${MNIST_DIR}" CPP_TESTS_DIR="${CPP_BUILD}/caffe2/bin" python test/run_test.py --cpp --verbose -i cpp/test_api
assert_git_not_dirty
fi
}
print_cmake_info() {
CMAKE_EXEC=$(which cmake)
echo "$CMAKE_EXEC"
CONDA_INSTALLATION_DIR=$(dirname "$CMAKE_EXEC")
# Print all libraries under cmake rpath for debugging
ls -la "$CONDA_INSTALLATION_DIR/../lib"
export CMAKE_EXEC
# Explicitly add conda env lib folder to cmake rpath to address the flaky issue
# where cmake dependencies couldn't be found. This seems to point to how conda
# links $CMAKE_EXEC to its package cache when cloning a new environment
install_name_tool -add_rpath @executable_path/../lib "${CMAKE_EXEC}" || true
# Adding the rpath will invalidate cmake signature, so signing it again here
# to trust the executable. EXC_BAD_ACCESS (SIGKILL (Code Signature Invalid))
# with an exit code 137 otherwise
codesign -f -s - "${CMAKE_EXEC}" || true
}
test_custom_backend() {
print_cmake_info
@ -166,9 +149,7 @@ test_jit_hooks() {
assert_git_not_dirty
}
if [[ "${TEST_CONFIG}" == *functorch* ]]; then
test_functorch
elif [[ $NUM_TEST_SHARDS -gt 1 ]]; then
if [[ $NUM_TEST_SHARDS -gt 1 ]]; then
test_python_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_libtorch

View File

@ -8,6 +8,7 @@
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch"
time python test/run_test.py --include test_cuda_multigpu test_cuda_primary_ctx --verbose
# Disabling these tests to see if that solves the timeout issues; see https://github.com/pytorch/pytorch/issues/70015
# python tools/download_mnist.py --quiet -d test/cpp/api/mnist
@ -27,23 +28,25 @@ time python test/run_test.py --verbose -i distributed/checkpoint/test_checkpoint
time python test/run_test.py --verbose -i distributed/checkpoint/test_file_system_checkpoint
time python test/run_test.py --verbose -i distributed/_shard/sharding_spec/test_sharding_spec
time python test/run_test.py --verbose -i distributed/_shard/sharding_plan/test_sharding_plan
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_megatron_prototype
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/test_sharded_tensor_reshard
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_chunk
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_elementwise_ops
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_embedding
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_embedding_bag
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_binary_cmp
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_init
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_linear
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_math_ops
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_matrix_ops
time python test/run_test.py --verbose -i distributed/_shard/sharded_tensor/ops/test_softmax
time python test/run_test.py --verbose -i distributed/_shard/sharded_optim/test_sharded_optim
time python test/run_test.py --verbose -i distributed/_shard/test_partial_tensor
time python test/run_test.py --verbose -i distributed/_shard/test_replicated_tensor
# functional collective tests
time python test/run_test.py --verbose -i distributed/test_functional_api
# DTensor tests
time python test/run_test.py --verbose -i distributed/_tensor/test_device_mesh
time python test/run_test.py --verbose -i distributed/_tensor/test_random_ops
time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compile
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k optimizers_with_varying_tensors
time python test/run_test.py --verbose -i test_foreach -- -k test_tensors_grouping
assert_git_not_dirty

View File

@ -1,32 +1,41 @@
import sys
import argparse
import json
import math
import argparse
import sys
parser = argparse.ArgumentParser()
parser.add_argument('--test-name', dest='test_name', action='store',
required=True, help='test name')
parser.add_argument('--sample-stats', dest='sample_stats', action='store',
required=True, help='stats from sample')
parser.add_argument('--update', action='store_true',
help='whether to update baseline using stats from sample')
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"
)
parser.add_argument(
"--sample-stats",
dest="sample_stats",
action="store",
required=True,
help="stats from sample",
)
parser.add_argument(
"--update",
action="store_true",
help="whether to update baseline using stats from sample",
)
args = parser.parse_args()
test_name = args.test_name
if 'cpu' in test_name:
backend = 'cpu'
elif 'gpu' in test_name:
backend = 'gpu'
if "cpu" in test_name:
backend = "cpu"
elif "gpu" in test_name:
backend = "gpu"
data_file_path = '../{}_runtime.json'.format(backend)
data_file_path = f"../{backend}_runtime.json"
with open(data_file_path) as data_file:
data = json.load(data_file)
if test_name in data:
mean = float(data[test_name]['mean'])
sigma = float(data[test_name]['sigma'])
mean = float(data[test_name]["mean"])
sigma = float(data[test_name]["sigma"])
else:
# Let the test pass if the baseline number doesn't exist
mean = sys.maxsize
@ -43,37 +52,39 @@ if math.isnan(mean) or math.isnan(sigma):
sample_stats_data = json.loads(args.sample_stats)
sample_mean = float(sample_stats_data['mean'])
sample_sigma = float(sample_stats_data['sigma'])
sample_mean = float(sample_stats_data["mean"])
sample_sigma = float(sample_stats_data["sigma"])
print("sample mean: ", sample_mean)
print("sample sigma: ", sample_sigma)
if math.isnan(sample_mean):
raise Exception('''Error: sample mean is NaN''')
raise Exception("""Error: sample mean is NaN""")
elif math.isnan(sample_sigma):
raise Exception('''Error: sample sigma is NaN''')
raise Exception("""Error: sample sigma is NaN""")
z_value = (sample_mean - mean) / sigma
print("z-value: ", z_value)
if z_value >= 3:
raise Exception('''\n
raise Exception(
f"""\n
z-value >= 3, there is high chance of perf regression.\n
To reproduce this regression, run
`cd .ci/pytorch/perf_test/ && bash {}.sh` on your local machine
`cd .ci/pytorch/perf_test/ && bash {test_name}.sh` on your local machine
and compare the runtime before/after your code change.
'''.format(test_name))
"""
)
else:
print("z-value < 3, no perf regression detected.")
if args.update:
print("We will use these numbers as new baseline.")
new_data_file_path = '../new_{}_runtime.json'.format(backend)
new_data_file_path = f"../new_{backend}_runtime.json"
with open(new_data_file_path) as new_data_file:
new_data = json.load(new_data_file)
new_data[test_name] = {}
new_data[test_name]['mean'] = sample_mean
new_data[test_name]['sigma'] = max(sample_sigma, sample_mean * 0.1)
with open(new_data_file_path, 'w') as new_data_file:
new_data[test_name]["mean"] = sample_mean
new_data[test_name]["sigma"] = max(sample_sigma, sample_mean * 0.1)
with open(new_data_file_path, "w") as new_data_file:
json.dump(new_data, new_data_file, indent=4)
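# Worked example of the z-value check above, using made-up numbers that do not
# come from any real baseline file (illustrative only):
example_mean, example_sigma = 100.0, 5.0  # hypothetical baseline mean/sigma
example_sample_mean = 118.0  # hypothetical mean measured for the current run
example_z = (example_sample_mean - example_mean) / example_sigma
print(example_z)  # 3.6 -> >= 3, so the check above would flag a perf regression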

View File

@ -1,5 +1,6 @@
import sys
import json
import sys
import numpy
sample_data_list = sys.argv[1:]
@ -9,8 +10,8 @@ sample_mean = numpy.mean(sample_data_list)
sample_sigma = numpy.std(sample_data_list)
data = {
'mean': sample_mean,
'sigma': sample_sigma,
"mean": sample_mean,
"sigma": sample_sigma,
}
print(json.dumps(data))

View File

@ -1,5 +1,5 @@
import sys
import json
import sys
data_file_path = sys.argv[1]
commit_hash = sys.argv[2]
@ -7,7 +7,7 @@ commit_hash = sys.argv[2]
with open(data_file_path) as data_file:
data = json.load(data_file)
data['commit'] = commit_hash
data["commit"] = commit_hash
with open(data_file_path, 'w') as data_file:
with open(data_file_path, "w") as data_file:
json.dump(data, data_file)

View File

@ -9,9 +9,9 @@ for line in lines:
# Ignore errors from CPU instruction set checks, symbol existence tests,
# or compilation error formatting
ignored_keywords = [
'src.c',
'CheckSymbolExists.c',
'test_compilation_error_formatting',
"src.c",
"CheckSymbolExists.c",
"test_compilation_error_formatting",
]
if all([keyword not in line for keyword in ignored_keywords]):
if all(keyword not in line for keyword in ignored_keywords):
print(line)

View File

@ -1,8 +1,4 @@
# =================== The following code **should** be executed inside Docker container ===================
# Install dependencies
sudo apt-get -y update
sudo apt-get -y install expect-dev
#!/bin/bash
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
@ -23,7 +19,7 @@ set -ex
# but since DOCS_INSTALL_PATH can be derived from DOCS_VERSION it's probably better to
# try and gather it first, just so we don't potentially break people who rely on this script
# Argument 2: What version of the docs we are building.
version="${2:-${DOCS_VERSION:-master}}"
version="${2:-${DOCS_VERSION:-main}}"
if [ -z "$version" ]; then
echo "error: python_doc_push_script.sh: version (arg2) not specified"
exit 1
@ -38,7 +34,7 @@ echo "error: python_doc_push_script.sh: install_path (arg1) not specified"
fi
is_main_doc=false
if [ "$version" == "master" ]; then
if [ "$version" == "main" ]; then
is_main_doc=true
fi
@ -55,7 +51,7 @@ echo "install_path: $install_path version: $version"
build_docs () {
set +e
set -o pipefail
make $1 2>&1 | tee /tmp/docs_build.txt
make "$1" 2>&1 | tee /tmp/docs_build.txt
code=$?
if [ $code -ne 0 ]; then
set +x
@ -72,12 +68,12 @@ build_docs () {
}
git clone https://github.com/pytorch/pytorch.github.io -b $branch --depth 1
git clone https://github.com/pytorch/pytorch.github.io -b "$branch" --depth 1
pushd pytorch.github.io
export LC_ALL=C
export PATH=/opt/conda/bin:$PATH
if [ -n $ANACONDA_PYTHON_VERSION ]; then
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
export PATH=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:$PATH
fi
@ -88,10 +84,9 @@ pushd "$pt_checkout"
pushd docs
# Build the docs
pip -q install -r requirements.txt
if [ "$is_main_doc" = true ]; then
build_docs html
[ $? -eq 0 ] || exit $?
build_docs html || exit $?
make coverage
# Now we have the coverage report, we need to make sure it is empty.
# Count the number of lines in the file and turn that number into a variable
@ -102,19 +97,19 @@ if [ "$is_main_doc" = true ]; then
# Also: see docs/source/conf.py for "coverage_ignore*" items, which should
# be documented then removed from there.
lines=$(wc -l build/coverage/python.txt 2>/dev/null |cut -f1 -d' ')
undocumented=$(($lines - 2))
undocumented=$((lines - 2))
if [ $undocumented -lt 0 ]; then
echo coverage output not found
exit 1
elif [ $undocumented -gt 0 ]; then
echo undocumented objects found:
cat build/coverage/python.txt
echo "Make sure you've updated relevant .rsts in docs/source!"
exit 1
fi
else
# skip coverage, format for stable or tags
build_docs html-stable
[ $? -eq 0 ] || exit $?
build_docs html-stable || exit $?
fi
# Move them into the docs repo
@ -146,4 +141,3 @@ if [[ "${WITH_PUSH:-}" == true ]]; then
fi
popd
# =================== The above code **should** be executed inside Docker container ===================

File diff suppressed because it is too large

View File

@ -15,13 +15,6 @@ source "$SCRIPT_PARENT_DIR/common.sh"
# shellcheck source=./common-build.sh
source "$SCRIPT_PARENT_DIR/common-build.sh"
IMAGE_COMMIT_ID=$(git rev-parse HEAD)
export IMAGE_COMMIT_ID
export IMAGE_COMMIT_TAG=${BUILD_ENVIRONMENT}-${IMAGE_COMMIT_ID}
if [[ ${JOB_NAME} == *"develop"* ]]; then
export IMAGE_COMMIT_TAG=develop-${IMAGE_COMMIT_TAG}
fi
export TMP_DIR="${PWD}/build/win_tmp"
TMP_DIR_WIN=$(cygpath -w "${TMP_DIR}")
export TMP_DIR_WIN
@ -30,14 +23,6 @@ if [[ -n "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
mkdir -p "$PYTORCH_FINAL_PACKAGE_DIR" || true
fi
# This directory is used only to hold "pytorch_env_restore.bat", called via "setup_pytorch_env.bat"
CI_SCRIPTS_DIR=$TMP_DIR/ci_scripts
mkdir -p "$CI_SCRIPTS_DIR"
if [ -n "$(ls "$CI_SCRIPTS_DIR"/*)" ]; then
rm "$CI_SCRIPTS_DIR"/*
fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
set +ex
@ -59,7 +44,4 @@ set -ex
assert_git_not_dirty
if [ ! -f "${TMP_DIR}"/"${IMAGE_COMMIT_TAG}".7z ] && [ ! "${BUILD_ENVIRONMENT}" == "" ]; then
exit 1
fi
echo "BUILD PASSED"

View File

@ -111,23 +111,8 @@ if "%USE_CUDA%"=="1" (
set CMAKE_CUDA_COMPILER_LAUNCHER=%TMP_DIR%/bin/randomtemp.exe;%TMP_DIR%\bin\sccache.exe
)
@echo off
echo @echo off >> %TMP_DIR_WIN%\ci_scripts\pytorch_env_restore.bat
for /f "usebackq tokens=*" %%i in (`set`) do echo set "%%i" >> %TMP_DIR_WIN%\ci_scripts\pytorch_env_restore.bat
@echo on
if "%REBUILD%" == "" (
if NOT "%BUILD_ENVIRONMENT%" == "" (
:: Create a shortcut to restore pytorch environment
echo @echo off >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
echo call "%TMP_DIR_WIN%/ci_scripts/pytorch_env_restore.bat" >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
echo cd /D "%CD%" >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
aws s3 cp "s3://ossci-windows/Restore PyTorch Environment.lnk" "C:\Users\circleci\Desktop\Restore PyTorch Environment.lnk"
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
)
)
:: Print all existing environment variables for debugging
set
python setup.py bdist_wheel
if errorlevel 1 exit /b
@ -138,18 +123,12 @@ python -c "import os, glob; os.system('python -mpip install --no-index --no-deps
if "%BUILD_ENVIRONMENT%"=="" (
echo NOTE: To run `import torch`, please make sure to activate the conda environment by running `call %CONDA_PARENT_DIR%\Miniconda3\Scripts\activate.bat %CONDA_PARENT_DIR%\Miniconda3` in Command Prompt before running Git Bash.
) else (
if "%USE_CUDA%"=="1" (
7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\nvfuser && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
) else (
7z a %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torchgen %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\functorch && copy /Y "%TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z" "%PYTORCH_FINAL_PACKAGE_DIR%\"
)
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
copy /Y "dist\*.whl" "%PYTORCH_FINAL_PACKAGE_DIR%"
:: export test times so that potential sharded tests that'll branch off this build will use consistent data
python tools/stats/export_test_times.py
copy /Y ".pytorch-test-times.json" "%PYTORCH_FINAL_PACKAGE_DIR%"
copy /Y ".pytorch-test-file-ratings.json" "%PYTORCH_FINAL_PACKAGE_DIR%"
:: Also save build/.ninja_log as an artifact
copy /Y "build\.ninja_log" "%PYTORCH_FINAL_PACKAGE_DIR%\"

View File

@ -1,19 +0,0 @@
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
:: exit the batch once there's an error
if not errorlevel 0 (
echo "setup pytorch env failed"
echo %errorlevel%
exit /b
)
echo "Test functorch"
pushd test
python run_test.py --functorch --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
popd
if ERRORLEVEL 1 goto fail
:eof
exit /b 0
:fail
exit /b 1

View File

@ -1,7 +1,7 @@
#!/usr/bin/env python3
import subprocess
import os
import subprocess
COMMON_TESTS = [
(
@ -31,8 +31,7 @@ GPU_TESTS = [
if __name__ == "__main__":
if 'USE_CUDA' in os.environ and os.environ['USE_CUDA'] == '1':
if "USE_CUDA" in os.environ and os.environ["USE_CUDA"] == "1":
TESTS = COMMON_TESTS + GPU_TESTS
else:
TESTS = COMMON_TESTS
@ -44,8 +43,10 @@ if __name__ == "__main__":
try:
subprocess.check_call(command_args)
except subprocess.CalledProcessError as e:
sdk_root = os.environ.get('WindowsSdkDir', 'C:\\Program Files (x86)\\Windows Kits\\10')
debugger = os.path.join(sdk_root, 'Debuggers', 'x64', 'cdb.exe')
sdk_root = os.environ.get(
"WindowsSdkDir", "C:\\Program Files (x86)\\Windows Kits\\10"
)
debugger = os.path.join(sdk_root, "Debuggers", "x64", "cdb.exe")
if os.path.exists(debugger):
command_args = [debugger, "-o", "-c", "~*g; q"] + command_args
command_string = " ".join(command_args)

View File

@ -1,8 +1,3 @@
if exist "%TMP_DIR%/ci_scripts/pytorch_env_restore.bat" (
call %TMP_DIR%/ci_scripts/pytorch_env_restore.bat
exit /b 0
)
set PATH=C:\Program Files\CMake\bin;C:\Program Files\7-Zip;C:\ProgramData\chocolatey\bin;C:\Program Files\Git\cmd;C:\Program Files\Amazon\AWSCLI;C:\Program Files\Amazon\AWSCLI\bin;%PATH%
:: Install Miniconda3
@ -14,6 +9,13 @@ call %INSTALLER_DIR%\activate_miniconda3.bat
if errorlevel 1 exit /b
if not errorlevel 0 exit /b
:: PyTorch is now installed using the standard wheel on Windows into the conda environment.
:: However, the test scripts still frequently refer to the workspace temp directory
:: build\torch. Rather than changing all these references, it is easier to copy the torch
:: folder from conda to the current workspace. The workspace will be cleaned up after
:: the job anyway
xcopy /s %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
pushd .
if "%VC_VERSION%" == "" (
call "C:\Program Files (x86)\Microsoft Visual Studio\%VC_YEAR%\%VC_PRODUCT%\VC\Auxiliary\Build\vcvarsall.bat" x64
@ -48,26 +50,5 @@ set NUMBAPRO_NVVM=%CUDA_PATH%\nvvm\bin\nvvm64_32_0.dll
set PYTHONPATH=%TMP_DIR_WIN%\build;%PYTHONPATH%
if NOT "%BUILD_ENVIRONMENT%"=="" (
pushd %TMP_DIR_WIN%\build
copy /Y %PYTORCH_FINAL_PACKAGE_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z %TMP_DIR_WIN%\
:: 7z: -aos skips extraction if the file already exists, because this .bat can be called multiple times
7z x %TMP_DIR_WIN%\%IMAGE_COMMIT_TAG%.7z -aos
popd
) else (
xcopy /s %CONDA_PARENT_DIR%\Miniconda3\Lib\site-packages\torch %TMP_DIR_WIN%\build\torch\
)
@echo off
echo @echo off >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore.bat
for /f "usebackq tokens=*" %%i in (`set`) do echo set "%%i" >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore.bat
@echo on
if NOT "%BUILD_ENVIRONMENT%" == "" (
:: Create a shortcut to restore pytorch environment
echo @echo off >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
echo call "%TMP_DIR_WIN%/ci_scripts/pytorch_env_restore.bat" >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
echo cd /D "%CD%" >> %TMP_DIR_WIN%/ci_scripts/pytorch_env_restore_helper.bat
aws s3 cp "s3://ossci-windows/Restore PyTorch Environment.lnk" "C:\Users\circleci\Desktop\Restore PyTorch Environment.lnk"
)
:: Print all existing environment variables for debugging
set

View File

@ -5,14 +5,16 @@ if "%USE_CUDA%" == "0" IF NOT "%CUDA_VERSION%" == "cpu" exit /b 0
call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
if errorlevel 1 exit /b 1
cd %TMP_DIR_WIN%\build\torch\bin
set TEST_OUT_DIR=%~dp0\..\..\..\test\test-reports\cpp-unittest
md %TEST_OUT_DIR%
:: Save the current working directory so that we can go back there
set CWD=%cd%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\bin
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set TEST_API_OUT_DIR=%TEST_OUT_DIR%\test_api
md %TEST_API_OUT_DIR%
test_api.exe --gtest_filter="-IntegrationTest.MNIST*" --gtest_output=xml:%TEST_API_OUT_DIR%\test_api.xml
set TORCH_CPP_TEST_MNIST_PATH=%CWD%\test\cpp\api\mnist
python tools\download_mnist.py --quiet -d %TORCH_CPP_TEST_MNIST_PATH%
python test\run_test.py --cpp --verbose -i cpp/test_api
if errorlevel 1 exit /b 1
if not errorlevel 0 exit /b 1
@ -25,6 +27,10 @@ for /r "." %%a in (*.exe) do (
goto :eof
:libtorch_check
cd %CWD%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\test
:: Skip verify_api_visibility as it is a compile-level test
if "%~1" == "verify_api_visibility" goto :eof
@ -42,12 +48,12 @@ if "%~1" == "utility_ops_gpu_test" goto :eof
echo Running "%~2"
if "%~1" == "c10_intrusive_ptr_benchmark" (
:: NB: This is not a gtest executable file, so it can't be handled by pytest-cpp
call "%~2"
goto :eof
)
:: Differentiating the test report directories is crucial for test time reporting.
md %TEST_OUT_DIR%\%~n2
call "%~2" --gtest_output=xml:%TEST_OUT_DIR%\%~n2\%~1.xml
python test\run_test.py --cpp --verbose -i "cpp/%~1"
if errorlevel 1 (
echo %1 failed with exit code %errorlevel%
exit /b 1

View File

@ -2,6 +2,7 @@ call %SCRIPT_HELPERS_DIR%\setup_pytorch_env.bat
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-file-ratings.json" "%PROJECT_DIR_WIN%"
pushd test

View File

@ -23,6 +23,7 @@ if "%SHARD_NUMBER%" == "1" (
echo Copying over test times file
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-times.json" "%PROJECT_DIR_WIN%"
copy /Y "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.pytorch-test-file-ratings.json" "%PROJECT_DIR_WIN%"
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose

View File

@ -5,13 +5,6 @@ SCRIPT_PARENT_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
# shellcheck source=./common.sh
source "$SCRIPT_PARENT_DIR/common.sh"
IMAGE_COMMIT_ID=$(git rev-parse HEAD)
export IMAGE_COMMIT_ID
export IMAGE_COMMIT_TAG=${BUILD_ENVIRONMENT}-${IMAGE_COMMIT_ID}
if [[ ${JOB_NAME} == *"develop"* ]]; then
export IMAGE_COMMIT_TAG=develop-${IMAGE_COMMIT_TAG}
fi
export TMP_DIR="${PWD}/build/win_tmp"
TMP_DIR_WIN=$(cygpath -w "${TMP_DIR}")
export TMP_DIR_WIN
@ -21,22 +14,12 @@ export PROJECT_DIR_WIN
export TEST_DIR="${PWD}/test"
TEST_DIR_WIN=$(cygpath -w "${TEST_DIR}")
export TEST_DIR_WIN
export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/users/circleci/workspace/build-results}"
export PYTORCH_FINAL_PACKAGE_DIR="${PYTORCH_FINAL_PACKAGE_DIR:-/c/w/build-results}"
PYTORCH_FINAL_PACKAGE_DIR_WIN=$(cygpath -w "${PYTORCH_FINAL_PACKAGE_DIR}")
export PYTORCH_FINAL_PACKAGE_DIR_WIN
mkdir -p "$TMP_DIR"/build/torch
# This directory is used only to hold "pytorch_env_restore.bat", called via "setup_pytorch_env.bat"
CI_SCRIPTS_DIR=$TMP_DIR/ci_scripts
mkdir -p "$CI_SCRIPTS_DIR"
if [ -n "$(ls "$CI_SCRIPTS_DIR"/*)" ]; then
rm "$CI_SCRIPTS_DIR"/*
fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
if [[ "$TEST_CONFIG" = "force_on_cpu" ]]; then
@ -51,6 +34,12 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
fi
# TODO: Move both of them to Windows AMI
python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do
@ -60,9 +49,7 @@ run_tests() {
fi
done
if [[ "${TEST_CONFIG}" == *functorch* ]]; then
"$SCRIPT_HELPERS_DIR"/install_test_functorch.bat
elif [[ $NUM_TEST_SHARDS -eq 1 ]]; then
if [[ $NUM_TEST_SHARDS -eq 1 ]]; then
"$SCRIPT_HELPERS_DIR"/test_python_shard.bat
"$SCRIPT_HELPERS_DIR"/test_custom_script_ops.bat
"$SCRIPT_HELPERS_DIR"/test_custom_backend.bat

View File

@ -106,7 +106,7 @@ All binaries are built in CircleCI workflows except Windows. There are checked-i
Some quick vocab:
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/master/.circleci/config.yml to see the workflows.
* A \**workflow** is a CircleCI concept; it is a DAG of '**jobs**'. ctrl-f 'workflows' on https://github.com/pytorch/pytorch/blob/main/.circleci/config.yml to see the workflows.
* **jobs** are a sequence of '**steps**'
* **steps** are usually just a bash script or a builtin CircleCI command. *All steps run in new environments; environment variables declared in one script DO NOT persist to following steps*
* CircleCI has a **workspace**, which is essentially a cache between steps of the *same job* in which you can store artifacts between steps.
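To make the vocab above concrete, here is a minimal, hypothetical Python sketch (illustrative only, not the exact generated config) of the kind of job entry that the cimodel helpers further down in this diff assemble before it is rendered into the workflow YAML; the job name, build environment string, and filter values are made up:

    from collections import OrderedDict

    # Hypothetical job entry, shaped like what gen_workflow_job() below produces.
    # A workflow is then a named DAG whose nodes are entries like this one.
    job_def = OrderedDict(
        name="binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_build",
        build_environment='"manywheel 3.7m cu113 devtoolset7"',
        filters={"branches": {"only": ["nightly"]}},
    )
    workflow_entry = {"binary_linux_build": job_def}
    print(workflow_entry)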
@ -117,8 +117,8 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
1. binary_builds
1. every day midnight EST
2. linux: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml
2. linux: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
3. macos: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
4. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. binary_linux_conda_3.7_cpu_build
1. Builds the build. On linux jobs this uses the 'docker executor'.
@ -133,12 +133,12 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
2. Uploads the package
2. update_s3_htmls
1. every day 5am EST
2. https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
3. See below for what these are for and why they're needed
4. Three jobs that each examine the current contents of aws and the conda repo and update some html files in s3
3. binarysmoketests
1. every day
2. https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
2. https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
3. For each binary configuration, e.g. linux_conda_3.7_cpu there is a
1. smoke_linux_conda_3.7_cpu
1. Downloads the package from the cloud, e.g. using the official pip or conda instructions
@ -146,26 +146,26 @@ The nightly binaries have 3 workflows. We have one job (actually 3 jobs: build,
## How are the jobs structured?
The jobs are in https://github.com/pytorch/pytorch/tree/master/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/master/.circleci/scripts .
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps. There are some shared steps used by all the binaries/smokes. Steps of these jobs are all delegated to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts .
* Linux jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* Linux jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/linux-binary-build-defaults.yml
* binary_linux_build.sh
* binary_linux_test.sh
* binary_linux_upload.sh
* MacOS jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/macos-binary-build-defaults.yml
* MacOS jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/macos-binary-build-defaults.yml
* binary_macos_build.sh
* binary_macos_test.sh
* binary_macos_upload.sh
* Update html jobs: https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/binary_update_htmls.yml
* Update html jobs: https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/binary_update_htmls.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/master/cron/update_s3_htmls.sh
* https://github.com/pytorch/builder/blob/master/cron/upload_binary_sizes.sh
* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
* https://github.com/pytorch/builder/blob/main/cron/update_s3_htmls.sh
* https://github.com/pytorch/builder/blob/main/cron/upload_binary_sizes.sh
* Smoke jobs (both linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-build-smoke-tests-defaults.yml
* These delegate from the pytorch/builder repo
* https://github.com/pytorch/builder/blob/master/run_tests.sh
* https://github.com/pytorch/builder/blob/master/smoke_test.sh
* https://github.com/pytorch/builder/blob/master/check_binary.sh
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/master/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* https://github.com/pytorch/builder/blob/main/run_tests.sh
* https://github.com/pytorch/builder/blob/main/smoke_test.sh
* https://github.com/pytorch/builder/blob/main/check_binary.sh
* Common shared code (shared across linux and macos): https://github.com/pytorch/pytorch/blob/main/.circleci/verbatim-sources/nightly-binary-build-defaults.yml
* binary_checkout.sh - checks out pytorch/builder repo. Right now this also checks out pytorch/pytorch, but it shouldn't. pytorch/pytorch should just be shared through the workspace. This can handle being run before binary_populate_env.sh
* binary_populate_env.sh - parses BUILD_ENVIRONMENT into the separate env variables that make up a binary configuration. Also sets lots of default values, the date, the version strings, the location of folders in s3, all sorts of things. This generally has to be run before other steps.
* binary_install_miniconda.sh - Installs miniconda, cross platform. Also hacks this for the update_binary_sizes job that doesn't have the right env variables
@ -308,7 +308,7 @@ Note that the Windows Python wheels are still built in conda environments. Some
* These should all be consolidated
* These must run on all OS types: MacOS, Linux, and Windows
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on master and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn't mess anything up.
* These all run smoke tests at the moment. They inspect the packages some, maybe run a few import statements. They DO NOT run the python tests nor the cpp tests. The idea is that python tests on main and PR merges will catch all breakages. All these tests have to do is make sure the special binary machinery didn't mess anything up.
* There are separate run_tests.sh and smoke_test.sh because one used to be called by the smoke jobs and one used to be called by the binary test jobs (see circleci structure section above). This is still true actually, but these could be united into a single script that runs these checks, given an installed pytorch package.
### Note on libtorch
@ -340,7 +340,7 @@ The Dockerfiles are available in pytorch/builder, but there is no circleci job o
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to master and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/master/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to check out what you want.
## How to test changes to the binaries via .circleci

View File

@ -9,9 +9,10 @@ should be "pruned".
from collections import OrderedDict
from cimodel.lib.conf_tree import ConfigNode
import cimodel.data.dimensions as dimensions
from cimodel.lib.conf_tree import ConfigNode
LINKING_DIMENSIONS = [
"shared",
@ -26,12 +27,18 @@ DEPS_INCLUSION_DIMENSIONS = [
def get_processor_arch_name(gpu_version):
return "cpu" if not gpu_version else (
"cu" + gpu_version.strip("cuda") if gpu_version.startswith("cuda") else gpu_version
return (
"cpu"
if not gpu_version
else (
"cu" + gpu_version.strip("cuda")
if gpu_version.startswith("cuda")
else gpu_version
)
)
CONFIG_TREE_DATA = OrderedDict(
)
CONFIG_TREE_DATA = OrderedDict()
# GCC config variants:
#
@ -41,8 +48,8 @@ CONFIG_TREE_DATA = OrderedDict(
#
# Libtorch with new gcc ABI is built with gcc 5.4 on Ubuntu 16.04.
LINUX_GCC_CONFIG_VARIANTS = OrderedDict(
manywheel=['devtoolset7'],
conda=['devtoolset7'],
manywheel=["devtoolset7"],
conda=["devtoolset7"],
libtorch=[
"devtoolset7",
"gcc5.4_cxx11-abi",
@ -63,7 +70,9 @@ class TopLevelNode(ConfigNode):
self.props["smoke"] = smoke
def get_children(self):
return [OSConfigNode(self, x, c, p) for (x, (c, p)) in self.config_tree_data.items()]
return [
OSConfigNode(self, x, c, p) for (x, (c, p)) in self.config_tree_data.items()
]
class OSConfigNode(ConfigNode):
@ -85,12 +94,20 @@ class PackageFormatConfigNode(ConfigNode):
self.props["python_versions"] = python_versions
self.props["package_format"] = package_format
def get_children(self):
if self.find_prop("os_name") == "linux":
return [LinuxGccConfigNode(self, v) for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]]
elif self.find_prop("os_name") == "windows" and self.find_prop("package_format") == "libtorch":
return [WindowsLibtorchConfigNode(self, v) for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS]
return [
LinuxGccConfigNode(self, v)
for v in LINUX_GCC_CONFIG_VARIANTS[self.find_prop("package_format")]
]
elif (
self.find_prop("os_name") == "windows"
and self.find_prop("package_format") == "libtorch"
):
return [
WindowsLibtorchConfigNode(self, v)
for v in WINDOWS_LIBTORCH_CONFIG_VARIANTS
]
else:
return [ArchConfigNode(self, v) for v in self.find_prop("gpu_versions")]
@ -106,23 +123,29 @@ class LinuxGccConfigNode(ConfigNode):
# XXX devtoolset7 on CUDA 9.0 is temporarily disabled
# see https://github.com/pytorch/pytorch/issues/20066
if self.find_prop("gcc_config_variant") == 'devtoolset7':
if self.find_prop("gcc_config_variant") == "devtoolset7":
gpu_versions = filter(lambda x: x != "cuda_90", gpu_versions)
# XXX disabling conda rocm build since docker images are not there
if self.find_prop("package_format") == 'conda':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
if self.find_prop("package_format") == "conda":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
# XXX libtorch rocm build is temporarily disabled
if self.find_prop("package_format") == 'libtorch':
gpu_versions = filter(lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions)
if self.find_prop("package_format") == "libtorch":
gpu_versions = filter(
lambda x: x not in dimensions.ROCM_VERSION_LABELS, gpu_versions
)
return [ArchConfigNode(self, v) for v in gpu_versions]
class WindowsLibtorchConfigNode(ConfigNode):
def __init__(self, parent, libtorch_config_variant):
super().__init__(parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant))
super().__init__(
parent, "LIBTORCH_CONFIG_VARIANT=" + str(libtorch_config_variant)
)
self.props["libtorch_config_variant"] = libtorch_config_variant
@ -161,11 +184,15 @@ class LinkingVariantConfigNode(ConfigNode):
super().__init__(parent, linking_variant)
def get_children(self):
return [DependencyInclusionConfigNode(self, v) for v in DEPS_INCLUSION_DIMENSIONS]
return [
DependencyInclusionConfigNode(self, v) for v in DEPS_INCLUSION_DIMENSIONS
]
class DependencyInclusionConfigNode(ConfigNode):
def __init__(self, parent, deps_variant):
super().__init__(parent, deps_variant)
self.props["libtorch_variant"] = "-".join([self.parent.get_label(), self.get_label()])
self.props["libtorch_variant"] = "-".join(
[self.parent.get_label(), self.get_label()]
)

View File

@ -1,13 +1,24 @@
from collections import OrderedDict
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.data.binary_build_data as binary_build_data
import cimodel.data.simple.util.branch_filters as branch_filters
import cimodel.lib.conf_tree as conf_tree
import cimodel.lib.miniutils as miniutils
class Conf(object):
def __init__(self, os, gpu_version, pydistro, parms, smoke, libtorch_variant, gcc_config_variant, libtorch_config_variant):
class Conf:
def __init__(
self,
os,
gpu_version,
pydistro,
parms,
smoke,
libtorch_variant,
gcc_config_variant,
libtorch_config_variant,
):
self.os = os
self.gpu_version = gpu_version
self.pydistro = pydistro
@ -18,7 +29,11 @@ class Conf(object):
self.libtorch_config_variant = libtorch_config_variant
def gen_build_env_parms(self):
elems = [self.pydistro] + self.parms + [binary_build_data.get_processor_arch_name(self.gpu_version)]
elems = (
[self.pydistro]
+ self.parms
+ [binary_build_data.get_processor_arch_name(self.gpu_version)]
)
if self.gcc_config_variant is not None:
elems.append(str(self.gcc_config_variant))
if self.libtorch_config_variant is not None:
@ -26,7 +41,7 @@ class Conf(object):
return elems
def gen_docker_image(self):
if self.gcc_config_variant == 'gcc5.4_cxx11-abi':
if self.gcc_config_variant == "gcc5.4_cxx11-abi":
if self.gpu_version is None:
return miniutils.quote("pytorch/libtorch-cxx11-builder:cpu")
else:
@ -37,30 +52,41 @@ class Conf(object):
if self.gpu_version is None:
return miniutils.quote("pytorch/conda-builder:cpu")
else:
return miniutils.quote(
f"pytorch/conda-builder:{self.gpu_version}"
)
return miniutils.quote(f"pytorch/conda-builder:{self.gpu_version}")
docker_word_substitution = {
"manywheel": "manylinux",
"libtorch": "manylinux",
}
docker_distro_prefix = miniutils.override(self.pydistro, docker_word_substitution)
docker_distro_prefix = miniutils.override(
self.pydistro, docker_word_substitution
)
# The cpu nightlies are built on the pytorch/manylinux-cuda102 docker image
# TODO cuda images should consolidate into tag-base images similar to rocm
alt_docker_suffix = "cuda102" if not self.gpu_version else (
"rocm:" + self.gpu_version.strip("rocm") if self.gpu_version.startswith("rocm") else self.gpu_version)
docker_distro_suffix = alt_docker_suffix if self.pydistro != "conda" else (
"cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
return miniutils.quote("pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix)
alt_docker_suffix = (
"cuda102"
if not self.gpu_version
else (
"rocm:" + self.gpu_version.strip("rocm")
if self.gpu_version.startswith("rocm")
else self.gpu_version
)
)
docker_distro_suffix = (
alt_docker_suffix
if self.pydistro != "conda"
else ("cuda" if alt_docker_suffix.startswith("cuda") else "rocm")
)
return miniutils.quote(
"pytorch/" + docker_distro_prefix + "-" + docker_distro_suffix
)
def get_name_prefix(self):
return "smoke" if self.smoke else "binary"
def gen_build_name(self, build_or_test, nightly):
parts = [self.get_name_prefix(), self.os] + self.gen_build_env_parms()
if nightly:
@ -78,7 +104,9 @@ class Conf(object):
def gen_workflow_job(self, phase, upload_phase_dependency=None, nightly=False):
job_def = OrderedDict()
job_def["name"] = self.gen_build_name(phase, nightly)
job_def["build_environment"] = miniutils.quote(" ".join(self.gen_build_env_parms()))
job_def["build_environment"] = miniutils.quote(
" ".join(self.gen_build_env_parms())
)
if self.smoke:
job_def["requires"] = [
"update_s3_htmls",
@ -116,47 +144,48 @@ class Conf(object):
os_name = miniutils.override(self.os, {"macos": "mac"})
job_name = "_".join([self.get_name_prefix(), os_name, phase])
return {job_name : job_def}
return {job_name: job_def}
def gen_upload_job(self, phase, requires_dependency):
"""Generate binary_upload job for configuration
Output looks similar to:
Output looks similar to:
- binary_upload:
name: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_upload
context: org-member
requires: binary_linux_manywheel_3_7m_cu113_devtoolset7_nightly_test
filters:
branches:
only:
- nightly
tags:
only: /v[0-9]+(\\.[0-9]+)*-rc[0-9]+/
package_type: manywheel
upload_subfolder: cu113
"""
return {
"binary_upload": OrderedDict({
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [self.gen_build_name(
requires_dependency,
nightly=True
)],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
})
"binary_upload": OrderedDict(
{
"name": self.gen_build_name(phase, nightly=True),
"context": "org-member",
"requires": [
self.gen_build_name(requires_dependency, nightly=True)
],
"filters": branch_filters.gen_filter_dict(
branches_list=["nightly"],
tags_list=[branch_filters.RC_PATTERN],
),
"package_type": self.pydistro,
"upload_subfolder": binary_build_data.get_processor_arch_name(
self.gpu_version,
),
}
)
}
def get_root(smoke, name):
def get_root(smoke, name):
return binary_build_data.TopLevelNode(
name,
binary_build_data.CONFIG_TREE_DATA,
@ -165,7 +194,6 @@ def get_root(smoke, name):
def gen_build_env_list(smoke):
root = get_root(smoke, "N/A")
config_list = conf_tree.dfs(root)
@ -176,7 +204,8 @@ def gen_build_env_list(smoke):
c.find_prop("gpu"),
c.find_prop("package_format"),
[c.find_prop("pyver")],
c.find_prop("smoke") and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("smoke")
and not (c.find_prop("os_name") == "macos_arm64"), # don't test arm64
c.find_prop("libtorch_variant"),
c.find_prop("gcc_config_variant"),
c.find_prop("libtorch_config_variant"),
@ -185,9 +214,11 @@ def gen_build_env_list(smoke):
return newlist
def predicate_exclude_macos(config):
return config.os == "linux" or config.os == "windows"
def get_nightly_uploads():
configs = gen_build_env_list(False)
mylist = []
@ -197,6 +228,7 @@ def get_nightly_uploads():
return mylist
def get_post_upload_jobs():
return [
{
@ -210,8 +242,8 @@ def get_post_upload_jobs():
},
]
def get_nightly_tests():
def get_nightly_tests():
configs = gen_build_env_list(False)
filtered_configs = filter(predicate_exclude_macos, configs)

View File

@ -16,9 +16,4 @@ ROCM_VERSION_LABELS = ["rocm" + v for v in ROCM_VERSIONS]
GPU_VERSIONS = [None] + ["cuda" + v for v in CUDA_VERSIONS] + ROCM_VERSION_LABELS
STANDARD_PYTHON_VERSIONS = [
"3.7",
"3.8",
"3.9",
"3.10"
]
STANDARD_PYTHON_VERSIONS = ["3.7", "3.8", "3.9", "3.10"]

View File

@ -1,8 +1,7 @@
from cimodel.lib.conf_tree import ConfigNode
CONFIG_TREE_DATA = [
]
CONFIG_TREE_DATA = []
def get_major_pyver(dotted_version):
@ -96,6 +95,7 @@ class SlowGradcheckConfigNode(TreeConfigNode):
def child_constructor(self):
return ExperimentalFeatureConfigNode
class PureTorchConfigNode(TreeConfigNode):
def modify_label(self, label):
return "PURE_TORCH=" + str(label)
@ -117,6 +117,7 @@ class XlaConfigNode(TreeConfigNode):
def child_constructor(self):
return ImportantConfigNode
class MPSConfigNode(TreeConfigNode):
def modify_label(self, label):
return "MPS=" + str(label)
@ -254,8 +255,11 @@ class XenialCompilerConfigNode(TreeConfigNode):
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return XenialCompilerVersionConfigNode if self.props["compiler_name"] else PyVerConfigNode
return (
XenialCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class BionicCompilerConfigNode(TreeConfigNode):
@ -267,8 +271,11 @@ class BionicCompilerConfigNode(TreeConfigNode):
# noinspection PyMethodMayBeStatic
def child_constructor(self):
return BionicCompilerVersionConfigNode if self.props["compiler_name"] else PyVerConfigNode
return (
BionicCompilerVersionConfigNode
if self.props["compiler_name"]
else PyVerConfigNode
)
class XenialCompilerVersionConfigNode(TreeConfigNode):

View File

@ -111,10 +111,10 @@ class Conf:
parameters["resource_class"] = resource_class
if phase == "build" and self.rocm_version is not None:
parameters["resource_class"] = "xlarge"
if hasattr(self, 'filters'):
parameters['filters'] = self.filters
if hasattr(self, "filters"):
parameters["filters"] = self.filters
if self.build_only:
parameters['build_only'] = miniutils.quote(str(int(True)))
parameters["build_only"] = miniutils.quote(str(int(True)))
return parameters
def gen_workflow_job(self, phase):
@ -122,7 +122,6 @@ class Conf:
job_def["name"] = self.gen_build_name(phase)
if Conf.is_test_phase(phase):
# TODO When merging the caffe2 and pytorch jobs, it might be convenient for a while to make a
# caffe2 test job dependent on a pytorch build job. This way we could quickly dedup the repeated
# build of pytorch in the caffe2 build job, and just run the caffe2 tests off of a completed
@ -143,7 +142,7 @@ class Conf:
# TODO This is a hack to special case some configs just for the workflow list
class HiddenConf(object):
class HiddenConf:
def __init__(self, name, parent_build=None, filters=None):
self.name = name
self.parent_build = parent_build
@ -160,7 +159,8 @@ class HiddenConf(object):
def gen_build_name(self, _):
return self.name
class DocPushConf(object):
class DocPushConf:
def __init__(self, name, parent_build=None, branch="master"):
self.name = name
self.parent_build = parent_build
@ -173,11 +173,13 @@ class DocPushConf(object):
"branch": self.branch,
"requires": [self.parent_build],
"context": "org-member",
"filters": gen_filter_dict(branches_list=["nightly"],
tags_list=RC_PATTERN)
"filters": gen_filter_dict(
branches_list=["nightly"], tags_list=RC_PATTERN
),
}
}
def gen_docs_configs(xenial_parent_config):
configs = []
@ -185,8 +187,9 @@ def gen_docs_configs(xenial_parent_config):
HiddenConf(
"pytorch_python_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=["master", "main", "nightly"],
tags_list=RC_PATTERN),
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
@ -201,8 +204,9 @@ def gen_docs_configs(xenial_parent_config):
HiddenConf(
"pytorch_cpp_doc_build",
parent_build=xenial_parent_config,
filters=gen_filter_dict(branches_list=["master", "main", "nightly"],
tags_list=RC_PATTERN),
filters=gen_filter_dict(
branches_list=["master", "main", "nightly"], tags_list=RC_PATTERN
),
)
)
configs.append(
@ -226,13 +230,11 @@ def gen_tree():
def instantiate_configs(only_slow_gradcheck):
config_list = []
root = get_root()
found_configs = conf_tree.dfs(root)
for fc in found_configs:
restrict_phases = None
distro_name = fc.find_prop("distro_name")
compiler_name = fc.find_prop("compiler_name")
@ -351,8 +353,7 @@ def instantiate_configs(only_slow_gradcheck):
and compiler_name == "gcc"
and fc.find_prop("compiler_version") == "5.4"
):
c.filters = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
c.filters = gen_filter_dict(branches_list=r"/.*/", tags_list=RC_PATTERN)
c.dependent_tests = gen_docs_configs(c)
config_list.append(c)
@ -361,16 +362,13 @@ def instantiate_configs(only_slow_gradcheck):
def get_workflow_jobs(only_slow_gradcheck=False):
config_list = instantiate_configs(only_slow_gradcheck)
x = []
for conf_options in config_list:
phases = conf_options.restrict_phases or dimensions.PHASES
for phase in phases:
# TODO why does this not have a test?
if Conf.is_test_phase(phase) and conf_options.cuda_version == "10":
continue

View File

@ -1,39 +1,39 @@
from collections import OrderedDict
from cimodel.lib.miniutils import quote
from cimodel.data.simple.util.branch_filters import gen_filter_dict, RC_PATTERN
from cimodel.lib.miniutils import quote
# NOTE: All hardcoded docker image builds have been migrated to GHA
IMAGE_NAMES = [
]
IMAGE_NAMES = []
# This entry should be an element from the list above
# This should contain the image matching the "slow_gradcheck" entry in
# pytorch_build_data.py
SLOW_GRADCHECK_IMAGE_NAME = "pytorch-linux-xenial-cuda10.2-cudnn7-py3-gcc7"
def get_workflow_jobs(images=IMAGE_NAMES, only_slow_gradcheck=False):
"""Generates a list of docker image build definitions"""
ret = []
for image_name in images:
if image_name.startswith('docker-'):
image_name = image_name.lstrip('docker-')
if image_name.startswith("docker-"):
image_name = image_name.lstrip("docker-")
if only_slow_gradcheck and image_name is not SLOW_GRADCHECK_IMAGE_NAME:
continue
parameters = OrderedDict({
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
})
parameters = OrderedDict(
{
"name": quote(f"docker-{image_name}"),
"image_name": quote(image_name),
}
)
if image_name == "pytorch-linux-xenial-py3.7-gcc5.4":
# pushing documentation on tags requires CircleCI to also
# build all the dependencies on tags, including this docker image
parameters['filters'] = gen_filter_dict(branches_list=r"/.*/",
tags_list=RC_PATTERN)
ret.append(OrderedDict(
{
"docker_build_job": parameters
}
))
parameters["filters"] = gen_filter_dict(
branches_list=r"/.*/", tags_list=RC_PATTERN
)
ret.append(OrderedDict({"docker_build_job": parameters}))
return ret
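
As an aside, a minimal sketch of the entry this loop appends for a single image name; `quote` here is a stand-in for `cimodel.lib.miniutils.quote`, and the image name is purely illustrative since `IMAGE_NAMES` is empty in this revision.

```
from collections import OrderedDict


def quote(s):
    # stand-in for cimodel.lib.miniutils.quote (assumed to wrap in double quotes)
    return f'"{s}"'


image_name = "pytorch-linux-focal-py3.8-gcc7"  # hypothetical image name
parameters = OrderedDict(
    {
        "name": quote(f"docker-{image_name}"),
        "image_name": quote(image_name),
    }
)
print(OrderedDict({"docker_build_job": parameters}))
```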

View File

@ -1,6 +1,6 @@
from cimodel.data.simple.util.versions import MultiPartVersion
from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude
import cimodel.lib.miniutils as miniutils
from cimodel.data.simple.util.branch_filters import gen_filter_dict_exclude
from cimodel.data.simple.util.versions import MultiPartVersion
XCODE_VERSION = MultiPartVersion([12, 5, 1])
@ -11,7 +11,9 @@ class ArchVariant:
self.custom_build_name = custom_build_name
def render(self):
extra_parts = [self.custom_build_name] if len(self.custom_build_name) > 0 else []
extra_parts = (
[self.custom_build_name] if len(self.custom_build_name) > 0 else []
)
return "-".join([self.name] + extra_parts).replace("_", "-")
@ -20,7 +22,9 @@ def get_platform(arch_variant_name):
class IOSJob:
def __init__(self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None):
def __init__(
self, xcode_version, arch_variant, is_org_member_context=True, extra_props=None
):
self.xcode_version = xcode_version
self.arch_variant = arch_variant
self.is_org_member_context = is_org_member_context
@ -29,11 +33,15 @@ class IOSJob:
def gen_name_parts(self):
version_parts = self.xcode_version.render_dots_or_parts("-")
build_variant_suffix = self.arch_variant.render()
return [
"ios",
] + version_parts + [
build_variant_suffix,
]
return (
[
"ios",
]
+ version_parts
+ [
build_variant_suffix,
]
)
def gen_job_name(self):
return "-".join(self.gen_name_parts())
@ -59,8 +67,12 @@ class IOSJob:
WORKFLOW_DATA = [
IOSJob(XCODE_VERSION, ArchVariant("x86_64"), is_org_member_context=False, extra_props={
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64"),
is_org_member_context=False,
extra_props={"lite_interpreter": miniutils.quote(str(int(True)))},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64"), extra_props={
# "lite_interpreter": miniutils.quote(str(int(True)))}),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "metal"), extra_props={
@ -69,9 +81,15 @@ WORKFLOW_DATA = [
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "custom-ops"), extra_props={
# "op_list": "mobilenetv2.yaml",
# "lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(XCODE_VERSION, ArchVariant("x86_64", "coreml"), is_org_member_context=False, extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True)))}),
IOSJob(
XCODE_VERSION,
ArchVariant("x86_64", "coreml"),
is_org_member_context=False,
extra_props={
"use_coreml": miniutils.quote(str(int(True))),
"lite_interpreter": miniutils.quote(str(int(True))),
},
),
# IOSJob(XCODE_VERSION, ArchVariant("arm64", "coreml"), extra_props={
# "use_coreml": miniutils.quote(str(int(True))),
# "lite_interpreter": miniutils.quote(str(int(True)))}),

View File

@ -2,17 +2,14 @@
PyTorch Mobile PR builds (use linux host toolchain + mobile build options)
"""
import cimodel.lib.miniutils as miniutils
import cimodel.data.simple.util.branch_filters
import cimodel.lib.miniutils as miniutils
class MobileJob:
def __init__(
self,
docker_image,
docker_requires,
variant_parts,
is_master_only=False):
self, docker_image, docker_requires, variant_parts, is_master_only=False
):
self.docker_image = docker_image
self.docker_requires = docker_requires
self.variant_parts = variant_parts
@ -40,13 +37,14 @@ class MobileJob:
}
if self.is_master_only:
props_dict["filters"] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
props_dict[
"filters"
] = cimodel.data.simple.util.branch_filters.gen_filter_dict()
return [{"pytorch_linux_build": props_dict}]
WORKFLOW_DATA = [
]
WORKFLOW_DATA = []
def get_workflow_jobs():

View File

@ -3,11 +3,7 @@ import cimodel.lib.miniutils as miniutils
class IOSNightlyJob:
def __init__(self,
variant,
is_full_jit=False,
is_upload=False):
def __init__(self, variant, is_full_jit=False, is_upload=False):
self.variant = variant
self.is_full_jit = is_full_jit
self.is_upload = is_upload
@ -16,19 +12,24 @@ class IOSNightlyJob:
return "upload" if self.is_upload else "build"
def get_common_name_pieces(self, sep):
extra_name_suffix = [self.get_phase_name()] if self.is_upload else []
extra_name = ["full_jit"] if self.is_full_jit else []
common_name_pieces = [
"ios",
] + extra_name + [
] + ios_definitions.XCODE_VERSION.render_dots_or_parts(sep) + [
"nightly",
self.variant,
"build",
] + extra_name_suffix
common_name_pieces = (
[
"ios",
]
+ extra_name
+ []
+ ios_definitions.XCODE_VERSION.render_dots_or_parts(sep)
+ [
"nightly",
self.variant,
"build",
]
+ extra_name_suffix
)
return common_name_pieces
@ -37,10 +38,14 @@ class IOSNightlyJob:
def gen_tree(self):
build_configs = BUILD_CONFIGS_FULL_JIT if self.is_full_jit else BUILD_CONFIGS
extra_requires = [x.gen_job_name() for x in build_configs] if self.is_upload else []
extra_requires = (
[x.gen_job_name() for x in build_configs] if self.is_upload else []
)
props_dict = {
"build_environment": "-".join(["libtorch"] + self.get_common_name_pieces(".")),
"build_environment": "-".join(
["libtorch"] + self.get_common_name_pieces(".")
),
"requires": extra_requires,
"context": "org-member",
"filters": {"branches": {"only": "nightly"}},
@ -56,11 +61,13 @@ class IOSNightlyJob:
if self.is_full_jit:
props_dict["lite_interpreter"] = miniutils.quote(str(int(False)))
template_name = "_".join([
"binary",
"ios",
self.get_phase_name(),
])
template_name = "_".join(
[
"binary",
"ios",
self.get_phase_name(),
]
)
return [{template_name: props_dict}]
@ -75,10 +82,14 @@ BUILD_CONFIGS_FULL_JIT = [
IOSNightlyJob("arm64", is_full_jit=True),
]
WORKFLOW_DATA = BUILD_CONFIGS + BUILD_CONFIGS_FULL_JIT + [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]
WORKFLOW_DATA = (
BUILD_CONFIGS
+ BUILD_CONFIGS_FULL_JIT
+ [
IOSNightlyJob("binary", is_full_jit=False, is_upload=True),
IOSNightlyJob("binary", is_full_jit=True, is_upload=True),
]
)
def get_workflow_jobs():

View File

@ -15,10 +15,7 @@ RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"
MAC_IOS_EXCLUSION_LIST = ["nightly", "postnightly"]
def gen_filter_dict(
branches_list=NON_PR_BRANCH_LIST,
tags_list=None
):
def gen_filter_dict(branches_list=NON_PR_BRANCH_LIST, tags_list=None):
"""Generates a filter dictionary for use with CircleCI's job filter"""
filter_dict = {
"branches": {

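
The body above is cut off by the hunk boundary; below is a hedged sketch of the filter dictionary it appears to build, inferred from the upload-job YAML shown earlier. The `NON_PR_BRANCH_LIST` value here is illustrative only, and the real handling of `tags_list` may differ.

```
NON_PR_BRANCH_LIST = ["master", "main", "nightly"]  # illustrative default only
RC_PATTERN = r"/v[0-9]+(\.[0-9]+)*-rc[0-9]+/"


def gen_filter_dict_sketch(branches_list=NON_PR_BRANCH_LIST, tags_list=None):
    # Build a CircleCI-style branch filter; add a tag filter only when requested.
    filter_dict = {
        "branches": {
            "only": branches_list,
        },
    }
    if tags_list is not None:
        filter_dict["tags"] = {"only": tags_list}
    return filter_dict


print(gen_filter_dict_sketch(branches_list=["nightly"], tags_list=[RC_PATTERN]))
# {'branches': {'only': ['nightly']}, 'tags': {'only': ['/v[0-9]+(\\.[0-9]+)*-rc[0-9]+/']}}
```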
View File

@ -1,11 +1,13 @@
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"
def gen_docker_image(container_type):
return (
"/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
f"docker-{container_type}",
)
def gen_docker_image_requires(image_name):
return [f"docker-{image_name}"]
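
For reference, a self-contained copy of these two helpers with a usage example; the image name is taken from elsewhere in this diff and is only illustrative.

```
AWS_DOCKER_HOST = "308535385114.dkr.ecr.us-east-1.amazonaws.com"


def gen_docker_image(container_type):
    # Returns (full ECR image path, docker build job name).
    return (
        "/".join([AWS_DOCKER_HOST, "pytorch", container_type]),
        f"docker-{container_type}",
    )


def gen_docker_image_requires(image_name):
    return [f"docker-{image_name}"]


image, build_job = gen_docker_image("pytorch-linux-xenial-py3.7-gcc5.4")
print(image)      # .../pytorch/pytorch-linux-xenial-py3.7-gcc5.4
print(build_job)  # docker-pytorch-linux-xenial-py3.7-gcc5.4
print(gen_docker_image_requires("pytorch-linux-xenial-py3.7-gcc5.4"))
```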

View File

@ -12,7 +12,9 @@ class MultiPartVersion:
with the prefix string.
"""
if self.parts:
return [self.prefix + str(self.parts[0])] + [str(part) for part in self.parts[1:]]
return [self.prefix + str(self.parts[0])] + [
str(part) for part in self.parts[1:]
]
else:
return [self.prefix]
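
A hedged, standalone sketch of the prefixed-parts logic above, assuming a constructor along the lines of `MultiPartVersion(parts, prefix="")`; the real class lives in `cimodel/data/simple/util/versions.py`.

```
class MultiPartVersionSketch:
    def __init__(self, parts, prefix=""):
        self.parts = parts
        self.prefix = prefix

    def prefixed_parts(self):
        # First part carries the prefix; remaining parts are plain strings.
        if self.parts:
            return [self.prefix + str(self.parts[0])] + [
                str(part) for part in self.parts[1:]
            ]
        else:
            return [self.prefix]


print(MultiPartVersionSketch([12, 5, 1]).prefixed_parts())    # ['12', '5', '1']
print(MultiPartVersionSketch([3, 9], "py").prefixed_parts())  # ['py3', '9']
```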

View File

@ -1,5 +1,5 @@
from dataclasses import dataclass, field
from typing import Optional, Dict
from typing import Dict, Optional
def X(val):
@ -19,6 +19,7 @@ class Ver:
"""
Represents a product with a version number
"""
name: str
version: str = ""
@ -28,7 +29,7 @@ class Ver:
@dataclass
class ConfigNode:
parent: Optional['ConfigNode']
parent: Optional["ConfigNode"]
node_name: str
props: Dict[str, str] = field(default_factory=dict)
@ -40,7 +41,11 @@ class ConfigNode:
return []
def get_parents(self):
return (self.parent.get_parents() + [self.parent.get_label()]) if self.parent else []
return (
(self.parent.get_parents() + [self.parent.get_label()])
if self.parent
else []
)
def get_depth(self):
return len(self.get_parents())
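
A minimal sketch of the parent-chain helpers above, assuming `get_label()` simply returns `node_name` (the real implementation may derive the label differently).

```
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class NodeSketch:
    parent: Optional["NodeSketch"]
    node_name: str
    props: Dict[str, str] = field(default_factory=dict)

    def get_label(self):
        # assumption: the label is just the node name
        return self.node_name

    def get_parents(self):
        return (
            (self.parent.get_parents() + [self.parent.get_label()])
            if self.parent
            else []
        )

    def get_depth(self):
        return len(self.get_parents())


root = NodeSketch(None, "root")
child = NodeSketch(root, "linux")
leaf = NodeSketch(child, "cuda10.2")
print(leaf.get_parents())  # ['root', 'linux']
print(leaf.get_depth())    # 2
```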
@ -69,13 +74,13 @@ class ConfigNode:
def dfs_recurse(
node,
leaf_callback=lambda x: None,
discovery_callback=lambda x, y, z: None,
child_callback=lambda x, y: None,
sibling_index=0,
sibling_count=1):
node,
leaf_callback=lambda x: None,
discovery_callback=lambda x, y, z: None,
child_callback=lambda x, y: None,
sibling_index=0,
sibling_count=1,
):
discovery_callback(node, sibling_index, sibling_count)
node_children = node.get_children()
@ -96,7 +101,6 @@ def dfs_recurse(
def dfs(toplevel_config_node):
config_list = []
def leaf_callback(node):

View File

@ -25,7 +25,6 @@ def render(fh, data, depth, is_list_member=False):
indentation = " " * INDENTATION_WIDTH * depth
if is_dict(data):
tuples = list(data.items())
if type(data) is not OrderedDict:
tuples.sort()
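
A small illustration of the ordering rule above: plain dicts have their items sorted, while `OrderedDict` keeps insertion order, presumably to keep the generated YAML deterministic.

```
from collections import OrderedDict


def ordered_items_sketch(data):
    # Same rule as render() above: sort items unless the mapping is an OrderedDict.
    tuples = list(data.items())
    if type(data) is not OrderedDict:
        tuples.sort()
    return tuples


print(ordered_items_sketch({"b": 1, "a": 2}))                   # [('a', 2), ('b', 1)]
print(ordered_items_sketch(OrderedDict([("b", 1), ("a", 2)])))  # [('b', 1), ('a', 2)]
```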

View File

@ -2,10 +2,11 @@
import os
import sys
import yaml
# Need to import modules that lie on an upward-relative path
sys.path.append(os.path.join(sys.path[0], '..'))
sys.path.append(os.path.join(sys.path[0], ".."))
import cimodel.lib.miniyaml as miniyaml

.circleci/config.yml (generated)
View File

@ -652,7 +652,7 @@ jobs:
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json .pytorch-test-file-ratings.json
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:

View File

@ -21,7 +21,6 @@ Please re-run the "%s" script in the "%s" directory and commit the result. See "
def check_consistency():
_, temp_filename = tempfile.mkstemp("-generated-config.yml")
with open(temp_filename, "w") as fh:
@ -30,7 +29,10 @@ def check_consistency():
try:
subprocess.check_call(["cmp", temp_filename, CHECKED_IN_FILE])
except subprocess.CalledProcessError:
sys.exit(ERROR_MESSAGE_TEMPLATE % (CHECKED_IN_FILE, REGENERATION_SCRIPT, PARENT_DIR, README_PATH))
sys.exit(
ERROR_MESSAGE_TEMPLATE
% (CHECKED_IN_FILE, REGENERATION_SCRIPT, PARENT_DIR, README_PATH)
)
finally:
os.remove(temp_filename)

View File

@ -10,15 +10,16 @@ import shutil
import sys
from collections import namedtuple
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.data.simple.docker_definitions
import cimodel.data.simple.mobile_definitions
import cimodel.data.simple.nightly_ios
import cimodel.data.simple.anaconda_prune_defintions
import cimodel.lib.miniutils as miniutils
import cimodel.lib.miniyaml as miniyaml
class File(object):
class File:
"""
Verbatim copy the contents of a file into config.yml
"""
@ -57,7 +58,7 @@ def horizontal_rule():
return "".join("#" * 78)
class Header(object):
class Header:
def __init__(self, title, summary=None):
self.title = title
self.summary_lines = summary or []
@ -82,15 +83,19 @@ def _for_all_items(items, functor) -> None:
def filter_master_only_jobs(items):
def _is_main_or_master_item(item):
filters = item.get('filters', None)
branches = filters.get('branches', None) if filters is not None else None
branches_only = branches.get('only', None) if branches is not None else None
return ('main' in branches_only or 'master' in branches_only) if branches_only is not None else False
filters = item.get("filters", None)
branches = filters.get("branches", None) if filters is not None else None
branches_only = branches.get("only", None) if branches is not None else None
return (
("main" in branches_only or "master" in branches_only)
if branches_only is not None
else False
)
master_deps = set()
def _save_requires_if_master(item_type, item):
requires = item.get('requires', None)
requires = item.get("requires", None)
item_name = item.get("name", None)
if not isinstance(requires, list):
return
@ -107,9 +112,9 @@ def filter_master_only_jobs(items):
item_name = item_name.strip('"') if item_name is not None else None
if not _is_main_or_master_item(item) and item_name not in master_deps:
return None
if 'filters' in item:
if "filters" in item:
item = item.copy()
item.pop('filters')
item.pop("filters")
return {item_type: item}
# Scan of dependencies twice to pick up nested required jobs
@ -123,12 +128,12 @@ def generate_required_docker_images(items):
required_docker_images = set()
def _requires_docker_image(item_type, item):
requires = item.get('requires', None)
requires = item.get("requires", None)
if not isinstance(requires, list):
return
for requirement in requires:
requirement = requirement.replace('"', '')
if requirement.startswith('docker-'):
requirement = requirement.replace('"', "")
if requirement.startswith("docker-"):
required_docker_images.add(requirement)
_for_all_items(items, _requires_docker_image)
@ -191,5 +196,4 @@ def stitch_sources(output_filehandle):
if __name__ == "__main__":
stitch_sources(sys.stdout)

View File

@ -48,7 +48,7 @@ if [[ -n "${CIRCLE_PR_NUMBER:-}" ]]; then
git checkout -q -B "$CIRCLE_BRANCH"
git reset --hard "$CIRCLE_SHA1"
elif [[ -n "${CIRCLE_SHA1:-}" ]]; then
# Scheduled workflows & "smoke" binary build on master on PR merges
# Scheduled workflows & "smoke" binary build on trunk on PR merges
DEFAULT_BRANCH="$(git remote show $CIRCLE_REPOSITORY_URL | awk '/HEAD branch/ {print $NF}')"
git reset --hard "$CIRCLE_SHA1"
git checkout -q -B $DEFAULT_BRANCH
@ -61,7 +61,7 @@ echo "Using Pytorch from "
git --no-pager log --max-count 1
popd
# Clone the Builder master repo
# Clone the Builder main repo
retry git clone -q https://github.com/pytorch/builder.git "$BUILDER_ROOT"
pushd "$BUILDER_ROOT"
echo "Using builder from "

View File

@ -33,7 +33,7 @@ fi
cp ${PROJ_ROOT}/LICENSE ${ZIP_DIR}/
# zip the library
export DATE="$(date -u +%Y%m%d)"
export IOS_NIGHTLY_BUILD_VERSION="2.0.0.${DATE}"
export IOS_NIGHTLY_BUILD_VERSION="2.1.0.${DATE}"
if [ "${BUILD_LITE_INTERPRETER}" == "1" ]; then
# libtorch_lite_ios_nightly_1.11.0.20210810.zip
ZIPFILE="libtorch_lite_ios_nightly_${IOS_NIGHTLY_BUILD_VERSION}.zip"

View File

@ -11,7 +11,7 @@ NUM_CPUS=$(( $(nproc) - 2 ))
# Defaults here for **binary** linux builds so they can be changed in one place
export MAX_JOBS=${MAX_JOBS:-$(( ${NUM_CPUS} > ${MEMORY_LIMIT_MAX_JOBS} ? ${MEMORY_LIMIT_MAX_JOBS} : ${NUM_CPUS} ))}
if [[ "${DESIRED_CUDA}" =~ cu11[0-9] ]]; then
if [[ "${DESIRED_CUDA}" =~ cu1[1-2][0-9] ]]; then
export BUILD_SPLIT_CUDA="ON"
fi
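
To illustrate the widened pattern (using Python's `re`, which treats these simple character classes the same way as bash's `=~` does here): the old pattern only matched `cu11x`, the new one also accepts `cu12x`.

```
import re

old_pattern = re.compile(r"cu11[0-9]")
new_pattern = re.compile(r"cu1[1-2][0-9]")

for desired_cuda in ["cu102", "cu113", "cu118", "cu121"]:
    print(desired_cuda, bool(old_pattern.search(desired_cuda)), bool(new_pattern.search(desired_cuda)))
# cu102 matches neither, cu11x matches both, cu12x matches only the widened pattern.
```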

View File

@ -38,10 +38,6 @@ fi
EXTRA_CONDA_FLAGS=""
NUMPY_PIN=""
PROTOBUF_PACKAGE="defaults::protobuf"
if [[ "\$python_nodot" = *311* ]]; then
# Numpy is yet not avaiable on default conda channel
EXTRA_CONDA_FLAGS="-c=malfet"
fi
if [[ "\$python_nodot" = *310* ]]; then
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
@ -81,6 +77,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
"numpy\${NUMPY_PIN}" \
mkl>=2018 \
ninja \
sympy \
typing-extensions \
${PROTOBUF_PACKAGE}
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
@ -98,7 +95,13 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
conda install \${EXTRA_CONDA_FLAGS} -y "\$pkg" --offline
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
pip install "\$pkg" --extra-index-url "https://download.pytorch.org/whl/nightly/${DESIRED_CUDA}"
if [[ "$(uname -m)" == aarch64 ]]; then
# Using "extra-index-url" until all needed aarch64 dependencies are
# added to "https://download.pytorch.org/whl/nightly/"
pip install "\$pkg" --extra-index-url "https://download.pytorch.org/whl/nightly/${DESIRED_CUDA}"
else
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/nightly/${DESIRED_CUDA}"
fi
retry pip install -q numpy protobuf typing-extensions
fi
if [[ "$PACKAGE_TYPE" == libtorch ]]; then

View File

@ -59,7 +59,7 @@ PIP_UPLOAD_FOLDER='nightly/'
# We put this here so that OVERRIDE_PACKAGE_VERSION below can read from it
export DATE="$(date -u +%Y%m%d)"
#TODO: We should be pulling semver version from the base version.txt
BASE_BUILD_VERSION="2.0.0.dev$DATE"
BASE_BUILD_VERSION="2.1.0.dev$DATE"
# Change BASE_BUILD_VERSION to git tag when on a git tag
# Use 'git -C' to make doubly sure we're in the correct directory for checking
# the git tag
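
For illustration, the nightly base version string assembled above (dates are UTC), mirroring `export DATE="$(date -u +%Y%m%d)"` from this script; the exact date shown is only an example.

```
from datetime import datetime, timezone

date = datetime.now(timezone.utc).strftime("%Y%m%d")
base_build_version = f"2.1.0.dev{date}"
print(base_build_version)  # e.g. 2.1.0.dev20230827
```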

View File

@ -8,62 +8,7 @@ export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export USE_SCCACHE=1
export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2022
if [[ "${DESIRED_CUDA}" == *"cu11"* ]]; then
export BUILD_SPLIT_CUDA=ON
fi
echo "Free Space for CUDA DEBUG BUILD"
if [[ "${CIRCLECI:-}" == 'true' ]]; then
export NIGHTLIES_PYTORCH_ROOT="$PYTORCH_ROOT"
if [[ -d "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\Community"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft Visual Studio 14.0"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft.NET" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft.NET"
fi
if [[ -d "C:\\Program Files\\dotnet" ]]; then
rm -rf "C:\\Program Files\\dotnet"
fi
if [[ -d "C:\\Program Files (x86)\\dotnet" ]]; then
rm -rf "C:\\Program Files (x86)\\dotnet"
fi
if [[ -d "C:\\Program Files (x86)\\Microsoft SQL Server" ]]; then
rm -rf "C:\\Program Files (x86)\\Microsoft SQL Server"
fi
if [[ -d "C:\\Program Files (x86)\\Xamarin" ]]; then
rm -rf "C:\\Program Files (x86)\\Xamarin"
fi
if [[ -d "C:\\Program Files (x86)\\Google" ]]; then
rm -rf "C:\\Program Files (x86)\\Google"
fi
set +x
export AWS_ACCESS_KEY_ID=${CIRCLECI_AWS_ACCESS_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
export AWS_SECRET_ACCESS_KEY=${CIRCLECI_AWS_SECRET_KEY_FOR_SCCACHE_S3_BUCKET_V4:-}
set -x
if [[ -d "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages\\_Instances" ]]; then
mv "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages\\_Instances" .
rm -rf "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
mkdir -p "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
mv _Instances "C:\\ProgramData\\Microsoft\\VisualStudio\\Packages"
fi
if [[ -d "C:\\Microsoft" ]]; then
# don't use quotes here
rm -rf /c/Microsoft/AndroidNDK*
fi
fi
export VC_YEAR=2019
echo "Free space on filesystem before build:"
df -h

View File

@ -4,7 +4,7 @@ set -eux -o pipefail
source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2022
export VC_YEAR=2019
pushd "$BUILDER_ROOT"

View File

@ -24,7 +24,7 @@ popd
git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages
pushd functorch_ghpages
if [ $version == "master" ]; then
if [ $version == "main" ]; then
version=nightly
fi

View File

@ -1,12 +1,13 @@
# Documentation: https://docs.microsoft.com/en-us/rest/api/azure/devops/build/?view=azure-devops-rest-6.0
import re
import json
import os
import re
import sys
import requests
import time
import requests
AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"
@ -19,54 +20,68 @@ build_base_url = AZURE_PIPELINE_BASE_URL + "_apis/build/builds?api-version=6.0"
s = requests.Session()
s.headers.update({"Authorization": "Basic " + AZURE_DEVOPS_PAT_BASE64})
def submit_build(pipeline_id, project_id, source_branch, source_version):
print("Submitting build for branch: " + source_branch)
print("Commit SHA1: ", source_version)
run_build_raw = s.post(build_base_url, json={
"definition": {"id": pipeline_id},
"project": {"id": project_id},
"sourceBranch": source_branch,
"sourceVersion": source_version
})
run_build_raw = s.post(
build_base_url,
json={
"definition": {"id": pipeline_id},
"project": {"id": project_id},
"sourceBranch": source_branch,
"sourceVersion": source_version,
},
)
try:
run_build_json = run_build_raw.json()
except json.decoder.JSONDecodeError as e:
print(e)
print("Failed to parse the response. Check if the Azure DevOps PAT is incorrect or expired.")
print(
"Failed to parse the response. Check if the Azure DevOps PAT is incorrect or expired."
)
sys.exit(-1)
build_id = run_build_json['id']
build_id = run_build_json["id"]
print("Submitted bulid: " + str(build_id))
print("Bulid URL: " + run_build_json['url'])
print("Bulid URL: " + run_build_json["url"])
return build_id
def get_build(_id):
get_build_url = AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}?api-version=6.0"
get_build_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}?api-version=6.0"
)
get_build_raw = s.get(get_build_url)
return get_build_raw.json()
def get_build_logs(_id):
get_build_logs_url = AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}/logs?api-version=6.0"
get_build_logs_url = (
AZURE_PIPELINE_BASE_URL + f"/_apis/build/builds/{_id}/logs?api-version=6.0"
)
get_build_logs_raw = s.get(get_build_logs_url)
return get_build_logs_raw.json()
def get_log_content(url):
resp = s.get(url)
return resp.text
def wait_for_build(_id):
build_detail = get_build(_id)
build_status = build_detail['status']
build_status = build_detail["status"]
while build_status == 'notStarted':
print('Waiting for run to start: ' + str(_id))
while build_status == "notStarted":
print("Waiting for run to start: " + str(_id))
sys.stdout.flush()
try:
build_detail = get_build(_id)
build_status = build_detail['status']
build_status = build_detail["status"]
except Exception as e:
print("Error getting build")
print(e)
@ -76,7 +91,7 @@ def wait_for_build(_id):
print("Bulid started: ", str(_id))
handled_logs = set()
while build_status == 'inProgress':
while build_status == "inProgress":
try:
print("Waiting for log: " + str(_id))
logs = get_build_logs(_id)
@ -86,38 +101,39 @@ def wait_for_build(_id):
time.sleep(30)
continue
for log in logs['value']:
log_id = log['id']
for log in logs["value"]:
log_id = log["id"]
if log_id in handled_logs:
continue
handled_logs.add(log_id)
print('Fetching log: \n' + log['url'])
print("Fetching log: \n" + log["url"])
try:
log_content = get_log_content(log['url'])
log_content = get_log_content(log["url"])
print(log_content)
except Exception as e:
print("Error getting log content")
print(e)
sys.stdout.flush()
build_detail = get_build(_id)
build_status = build_detail['status']
build_status = build_detail["status"]
time.sleep(30)
build_result = build_detail['result']
build_result = build_detail["result"]
print("Bulid status: " + build_status)
print("Bulid result: " + build_result)
return build_status, build_result
if __name__ == '__main__':
if __name__ == "__main__":
# Convert the branch name for Azure DevOps
match = re.search(r'pull/(\d+)', TARGET_BRANCH)
match = re.search(r"pull/(\d+)", TARGET_BRANCH)
if match is not None:
pr_num = match.group(1)
SOURCE_BRANCH = f'refs/pull/{pr_num}/head'
SOURCE_BRANCH = f"refs/pull/{pr_num}/head"
else:
SOURCE_BRANCH = f'refs/heads/{TARGET_BRANCH}'
SOURCE_BRANCH = f"refs/heads/{TARGET_BRANCH}"
MAX_RETRY = 2
retry = MAX_RETRY
@ -126,7 +142,7 @@ if __name__ == '__main__':
build_id = submit_build(PIPELINE_ID, PROJECT_ID, SOURCE_BRANCH, TARGET_COMMIT)
build_status, build_result = wait_for_build(build_id)
if build_result != 'succeeded':
if build_result != "succeeded":
retry = retry - 1
if retry > 0:
print("Retrying... remaining attempt: " + str(retry))

View File

@ -177,7 +177,7 @@
- run:
name: Archive artifacts into zip
command: |
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json
zip -1 -r artifacts.zip dist/ build/.ninja_log build/compile_commands.json .pytorch-test-times.json .pytorch-test-file-ratings.json
cp artifacts.zip /Users/distiller/workspace
- persist_to_workspace:

View File

@ -60,9 +60,6 @@ MacroBlockBegin: ''
MacroBlockEnd: ''
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
ObjCBlockIndentWidth: 2
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: false
PenaltyBreakBeforeFirstCallParameter: 1
PenaltyBreakComment: 300
PenaltyBreakFirstLessLess: 120
@ -85,4 +82,11 @@ SpacesInSquareBrackets: false
Standard: Cpp11
TabWidth: 8
UseTab: Never
---
Language: ObjC
ColumnLimit: 120
AlignAfterOpenBracket: Align
ObjCBlockIndentWidth: 2
ObjCSpaceAfterProperty: false
ObjCSpaceBeforeProtocolList: false
...

View File

@ -9,6 +9,7 @@ bugprone-*,
-bugprone-lambda-function-name,
-bugprone-reserved-identifier,
-bugprone-swapped-arguments,
clang-diagnostic-missing-prototypes,
cppcoreguidelines-*,
-cppcoreguidelines-avoid-do-while,
-cppcoreguidelines-avoid-magic-numbers,
@ -41,8 +42,6 @@ modernize-*,
-modernize-use-trailing-return-type,
-modernize-use-nodiscard,
performance-*,
-performance-noexcept-move-constructor,
-performance-unnecessary-value-param,
readability-container-size-empty,
'
HeaderFilterRegex: '^(c10/(?!test)|torch/csrc/(?!deploy/interpreter/cpython)).*$'

.devcontainer/Dockerfile (new file)
View File

@ -0,0 +1,34 @@
FROM mcr.microsoft.com/vscode/devcontainers/miniconda:0-3
# I am surprised this is needed
RUN conda init
# Copy environment.yml (if found) to a temp location so we update the environment. Also
# copy "noop.txt" so the COPY instruction does not fail if no environment.yml exists.
COPY .devcontainer/cuda/environment.yml .devcontainer/noop.txt /tmp/conda-tmp/
RUN if [ -f "/tmp/conda-tmp/environment.yml" ]; then umask 0002 && /opt/conda/bin/conda env update -n base -f /tmp/conda-tmp/environment.yml; fi \
&& sudo rm -rf /tmp/conda-tmp
# Tools needed for llvm
RUN sudo apt-get -y update
RUN sudo apt install -y lsb-release wget software-properties-common gnupg
# Install CLANG if version is specified
ARG CLANG_VERSION
RUN if [ -n "$CLANG_VERSION" ]; then \
sudo wget https://apt.llvm.org/llvm.sh; \
chmod +x llvm.sh; \
sudo ./llvm.sh "${CLANG_VERSION}"; \
echo 'export CC=clang' >> ~/.bashrc; \
echo 'export CXX=clang++' >> ~/.bashrc; \
sudo apt update; \
sudo apt install -y clang; \
sudo apt install -y libomp-dev; \
fi
# Install cuda if version is specified
ARG CUDA_VERSION
RUN if [ -n "$CUDA_VERSION" ]; then \
conda install cuda -c "nvidia/label/cuda-${CUDA_VERSION}"; \
fi

View File

@ -0,0 +1,37 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/anaconda
{
"name": "PyTorch - CPU",
"build": {
"context": "../..",
"dockerfile": "../Dockerfile",
"args": {
"USERNAME": "vscode",
"BUILDKIT_INLINE_CACHE": "0",
"CLANG_VERSION": ""
}
},
// Features to add to the dev container. More info: https://containers.dev/features.
"features": {
// This is needed for lintrunner
"ghcr.io/devcontainers/features/rust:1" : {}
},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Use 'postCreateCommand' to run commands after the container is created.
"postCreateCommand": "bash .devcontainer/scripts/install-dev-tools.sh",
// Configure tool-specific properties.
// "customizations": {},
"customizations": {
"vscode": {
"extensions": ["streetsidesoftware.code-spell-checker"]
}
}
// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
// "remoteUser": "root"
}

View File

@ -0,0 +1,6 @@
# This environment is specific to Debian
name: PyTorch
dependencies:
- cmake
- ninja
- libopenblas

View File

@ -0,0 +1,37 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/anaconda
{
"name": "PyTorch - CUDA",
"build": {
"context": "../..",
"dockerfile": "../Dockerfile",
"args": {
"USERNAME": "vscode",
"BUILDKIT_INLINE_CACHE": "0",
"CUDA_VERSION": "11.8.0",
"CLANG_VERSION": ""
}
},
"runArgs": ["--gpus", "all"],
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Use 'postCreateCommand' to run commands after the container is created.
"postCreateCommand": "bash .devcontainer/scripts/install-dev-tools.sh",
// Configure tool-specific properties.
// "customizations": {},
"customizations": {
"vscode": {
"extensions": ["streetsidesoftware.code-spell-checker"]
}
},
// Features to add to the dev container. More info: https://containers.dev/features.
"features": {
// This is needed for lintrunner
"ghcr.io/devcontainers/features/rust:1" : {}
}
// Uncomment to connect as root instead. More info: https://aka.ms/dev-containers-non-root.
// "remoteUser": "root"
}

Some files were not shown because too many files have changed in this diff.