Commit Graph

84 Commits

Author SHA1 Message Date
cf31724db7 Fix and improvements to toward 3.13t (#136319)
Small part of https://github.com/pytorch/pytorch/pull/130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-09-20 04:22:18 +00:00
cyy
8967d55b01 [18/N] Fix clang-tidy warnings in jit (#132963)
Follows #132753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132963
Approved by: https://github.com/Skylion007
2024-08-09 01:27:32 +00:00
cyy
2988d33c80 [3/N] Fix clang-tidy warnings in jit (#131830)
Follows #131735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131830
Approved by: https://github.com/ezyang
2024-07-26 15:46:28 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
1e27af335e [easy] enhance local model loading (#129897)
Summary:
1. add one more model lib dep.
2. add error message when torchscript failed to find a class in python compilation unit.

Test Plan: CI

Reviewed By: jingsh

Differential Revision: D59243250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897
Approved by: https://github.com/jingsh
2024-07-03 00:29:02 +00:00
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
d3817d8a60 Don't create python tuple when _maybe_handle_torch_function is called from C++ (#128187)
Marginal overhead reduction when calling through the `torch.ops` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128187
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184, #128185
2024-06-10 00:16:59 +00:00
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
b04dca1502 Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290)
The important comment:

```
        # Whenever we allocate a fresh unbacked Symbol, we add it to this
        # pending list.  Unbacked symbol allocation can occur at unpredictable
        # points during meta tensor propagation, but at some point, the we
        # have to know what the binding site for an unbacked symbol is, and
        # this is computed when we actually place the node in the graph.  The
        # important thing is that we always actually handle every unaccounted
        # for unbacked symbol, so this list helps us keep track of them and
        # then make sure they are all accounted for.
        #
        # We could potentially give rise to errors earlier by lexically
        # scoping when we do propagation, and only allowing unbacked symbols
        # to be allocated at this point in time.  However this is inconvenient
        # to do in Dynamo, because fake tensor propagation is far from when we
        # analyze binding sites (set_example_value), so we do it in a more
        # mutatey way.
        #
        # NB: fresh unbacked symbols NEVER get substitutions applied to them,
        # they are binding sites!
```

The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment:

```
    After having run fake tensor propagation and producing example_value
    result, traverse example_value looking for freshly bound unbacked
    symbols and record their paths for later.  It is an error if
    we have allocated an unbacked SymInt but it cannot be found in
    example_value.  (NB: this means if you have a multi-output
    function, you must call this on the tuple of tensor output, you
    cannot wait!)
```

For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it.

I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290
Approved by: https://github.com/lezcano
2024-04-24 09:11:34 +00:00
e62169a8fa Support torchbind op dispatch in python (#123367)
We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table.

The idea is:
1. when inputs contains FakeScriptObject,  we dispatch it through _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor.
2. when inputs are not fakified, we dispatch through the original c++ dispatcher.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367
Approved by: https://github.com/zou3519
2024-04-19 17:17:27 +00:00
6ba85cfc2a Fixed memory leak in Python dispatcher w.r.t. THPDevice. (#122439)
Fixes the memory leak reported in #122417.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122439
Approved by: https://github.com/soulitzer
2024-03-22 06:44:12 +00:00
8e14e1d514 Fix gradient refcounts in pybind and compiled autograd (#118817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817
Approved by: https://github.com/jansel
2024-02-07 10:25:42 +00:00
16373bbc1f fix error message in pytorch (#115349)
Fixes https://dev-discuss.pytorch.org/t/typo-in-error-message/1709 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115349
Approved by: https://github.com/Skylion007
2023-12-07 19:27:29 +00:00
fdaddec2c3 make_fx can now SymIntify int inputs (#113452)
This PR also contains a basket of fixes that were turned up by now testing more arguments with SymInt. I fixed as many of the easy ones as I could easily get earlier in this stack and a bunch here, but there are some more annoying ones I xfailed.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113452
Approved by: https://github.com/Chillee
ghstack dependencies: #113877, #113911
2023-11-18 06:39:09 +00:00
d0e50d9094 Move overloaded_args from FunctionSignature to PythonArgs (#106983)
This moves the `overloaded_args` field from FunctionSignature to PythonArgs. FunctionSignature is shared by all calls and should be immutable. PythonArgs contains the parsing results for an single call to the PyTorch API.

I did not measure a difference in performance in the "overrides_benchmark", although I expect there to be a bit more work in the common case. Note that the noise factor for the benchmark is much larger than the differences reported below:

Before:
```
Type tensor had a minimum time of 2.3615360260009766 us and a standard deviation of 0.7833134150132537 us.
Type SubTensor had a minimum time of 10.473251342773438 us and a standard deviation of 0.1973132457351312 us.
Type WithTorchFunction had a minimum time of 5.484819412231445 us and a standard deviation of 0.13305981701705605 us.
Type SubWithTorchFunction had a minimum time of 11.098146438598633 us and a standard deviation of 0.15598918253090233 us.
```
After:
```
Type tensor had a minimum time of 2.2134780883789062 us and a standard deviation of 0.802064489107579 us.
Type SubTensor had a minimum time of 10.625839233398438 us and a standard deviation of 0.15155907021835446 us.
Type WithTorchFunction had a minimum time of 5.520820617675781 us and a standard deviation of 0.23115111980587244 us.
Type SubWithTorchFunction had a minimum time of 11.227846145629883 us and a standard deviation of 0.23032321769278497 us.
```

Fixes #106974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106983
Approved by: https://github.com/zou3519, https://github.com/ezyang, https://github.com/albanD
2023-08-16 15:59:26 +00:00
effe1425dd ASAN: fix use-after-free (#101400)
arguments() returns vector member of object returned by schema() call.
When object returned by schema() call is destroyed, the vector is deallocated as well,
it's lifetime isn't extended.

This issue detected while running `pytest -v test/mobile/test_lite_script_type.py -k test_nest_typing_namedtuple_custom_classtype` with ASAN.

<details>
<summary>ASAN output</summary>

```
==1134126==ERROR: AddressSanitizer: heap-use-after-free on address 0x60d0005a5790 at pc 0x03ff844488d8 bp 0x03fff584afe8 sp 0x03fff584afd8
READ of size 8 at 0x60d0005a5790 thread T0
    #0 0x3ff844488d7 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&) /usr/lib/gcc/s390x-i
bm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028
    #1 0x3ff8444293f in std::vector<c10::Argument, std::allocator<c10::Argument> >::begin() const /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_vector.h:821
    #2 0x3ff84d807d1 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:617
    #3 0x3ff84d80305 in torch::jit::toPyObject(c10::IValue) /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
    #4 0x3ff84856871 in pybind11::detail::type_caster<c10::IValue, void>::cast(c10::IValue, pybind11::return_value_policy, pybind11::handle) /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
    #5 0x3ff85318191 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const /home/user/pytorch/cmake/../third_party/pybin
d11/include/pybind11/pybind11.h:249
    #6 0x3ff85317cfd in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is
_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_me
thod const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) /home/user/pytorch/cmake/../third_party/pybind11/incl
ude/pybind11/pybind11.h:224
    #7 0x3ff82ee52e9 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
    #8 0x3ffab002903 in cfunction_call Objects/methodobject.c:543
    #9 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #10 0x3ffaaf8e919 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #11 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #12 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #13 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #14 0x3ffab105447 in call_function Python/ceval.c:5891
    #15 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #16 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #17 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #18 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #19 0x3ffaaf8a615 in _PyObject_FastCallDictTstate Objects/call.c:142
    #20 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #21 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #22 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #23 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #24 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #25 0x3ffab105447 in call_function Python/ceval.c:5891
    #26 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #27 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #28 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #29 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #30 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #31 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #32 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #33 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #34 0x3ffab105447 in call_function Python/ceval.c:5891
    #35 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #36 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #37 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #38 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #39 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #40 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #41 0x3ffab105447 in call_function Python/ceval.c:5891
    #42 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #43 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #44 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #45 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #46 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #47 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #48 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #49 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #50 0x3ffab105447 in call_function Python/ceval.c:5891
    #51 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #52 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #53 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #54 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #55 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #56 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #57 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #58 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #59 0x3ffab105447 in call_function Python/ceval.c:5891
    #60 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #61 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #62 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #63 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #64 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #65 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #66 0x3ffaaf8ab9b in PyVectorcall_Call Objects/call.c:267
    #67 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #68 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #69 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #70 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #71 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #72 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #73 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #74 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #75 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #76 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #77 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #78 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #79 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #80 0x3ffab105447 in call_function Python/ceval.c:5891
    #81 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #82 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #83 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #84 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #85 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #86 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #87 0x3ffab105447 in call_function Python/ceval.c:5891
    #88 0x3ffab0ff7d7 in _PyEval_EvalFrameDefault Python/ceval.c:4198
    #89 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #90 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #91 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #92 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #93 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #94 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #95 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #96 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #97 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #98 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #99 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #100 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #101 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #102 0x3ffab105447 in call_function Python/ceval.c:5891
    #103 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #104 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #105 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #106 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #107 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #108 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #109 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #110 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #111 0x3ffab105447 in call_function Python/ceval.c:5891
    #112 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #113 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #114 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #115 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #116 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #117 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #118 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #119 0x3ffaaf8ad17 in _PyObject_Call Objects/call.c:305
    #120 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #121 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #122 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #123 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #124 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #125 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #126 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #127 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #128 0x3ffab105447 in call_function Python/ceval.c:5891
    #129 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #130 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #131 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #132 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #133 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #134 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #135 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #136 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #137 0x3ffab105447 in call_function Python/ceval.c:5891
    #138 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #139 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #140 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #141 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #142 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #143 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #144 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #145 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #146 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #147 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #148 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #149 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #150 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #151 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #152 0x3ffab105447 in call_function Python/ceval.c:5891
    #153 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #154 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #155 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #156 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #157 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #158 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #159 0x3ffab105447 in call_function Python/ceval.c:5891
    #160 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #161 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #162 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #163 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #164 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #165 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #166 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #167 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #168 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #169 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #170 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #171 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #172 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #173 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #174 0x3ffab105447 in call_function Python/ceval.c:5891
    #175 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #176 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #177 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #178 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #179 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #180 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #181 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #182 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #183 0x3ffab105447 in call_function Python/ceval.c:5891
    #184 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #185 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #186 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #187 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #188 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #189 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #190 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #191 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #192 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #193 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #194 0x3ffab105447 in call_function Python/ceval.c:5891
    #195 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #196 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #197 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #198 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #199 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #200 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290
    #201 0x3ffaaf8ada9 in PyObject_Call Objects/call.c:317
    #202 0x3ffab1059c7 in do_call_core Python/ceval.c:5943
    #203 0x3ffab0ffd39 in _PyEval_EvalFrameDefault Python/ceval.c:4277
    #204 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #205 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #206 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #207 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #208 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #209 0x3ffab105447 in call_function Python/ceval.c:5891
    #210 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #211 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #212 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #213 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #214 0x3ffaaf8e941 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #215 0x3ffaaf8eddd in method_vectorcall Objects/classobject.c:53
    #216 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #216 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #217 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #218 0x3ffab105447 in call_function Python/ceval.c:5891
    #219 0x3ffab0ff779 in _PyEval_EvalFrameDefault Python/ceval.c:4181
    #220 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #221 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #222 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #223 0x3ffaaf8a695 in _PyObject_FastCallDictTstate Objects/call.c:153
    #224 0x3ffaaf8b271 in _PyObject_Call_Prepend Objects/call.c:431
    #225 0x3ffab03f307 in slot_tp_call Objects/typeobject.c:7494
    #226 0x3ffaaf8a933 in _PyObject_MakeTpCall Objects/call.c:215
    #227 0x3ffab0f0081 in _PyObject_VectorcallTstate Include/cpython/abstract.h:112
    #228 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #229 0x3ffab105447 in call_function Python/ceval.c:5891
    #230 0x3ffab0ffa57 in _PyEval_EvalFrameDefault Python/ceval.c:4231
    #231 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #232 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #233 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #234 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #235 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #236 0x3ffab105447 in call_function Python/ceval.c:5891
    #237 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #238 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #239 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #240 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #241 0x3ffab0f00a9 in _PyObject_VectorcallTstate Include/cpython/abstract.h:114
    #242 0x3ffab0f013d in PyObject_Vectorcall Include/cpython/abstract.h:123
    #243 0x3ffab105447 in call_function Python/ceval.c:5891
    #244 0x3ffab0ff905 in _PyEval_EvalFrameDefault Python/ceval.c:4213
    #245 0x3ffab0f052b in _PyEval_EvalFrame Include/internal/pycore_ceval.h:46
    #246 0x3ffab102b67 in _PyEval_Vector Python/ceval.c:5065
    #247 0x3ffaaf8aec1 in _PyFunction_Vectorcall Objects/call.c:342
    #248 0x3ffaaf8ab15 in PyVectorcall_Call Objects/call.c:255
    #249 0x3ffaaf8ac65 in _PyObject_Call Objects/call.c:290

0x60d0005a5790 is located 80 bytes inside of 136-byte region [0x60d0005a5740,0x60d0005a57c8)
freed by thread T0 here:
    #0 0x3ffab537de5 in operator delete(void*) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
    #1 0x3ff55984fdb in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate(std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>*, unsigned long) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145

previously allocated by thread T0 here:
    #0 0x3ffab53734f in operator new(unsigned long) /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x3ff5598443f in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
    #2 0x3fff5849ecf  ([stack]+0xb2ecf)

SUMMARY: AddressSanitizer: heap-use-after-free /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/stl_iterator.h:1028 in __gnu_cxx::__normal_iterator<c10::Argument const*, std::vector<c10::Argument, std::allocator<c10::Argument> > >::__normal_iterator(c10::Argument const* const&)
Shadow bytes around the buggy address:
  0x100c1a000b4aa0: fd fd fd fd fd fd fd fd fd fd fd fa fa fa fa fa
  0x100c1a000b4ab0: fa fa fa fa fd fd fd fd fd fd fd fd fd fd fd fd
  0x100c1a000b4ac0: fd fd fd fd fd fa fa fa fa fa fa fa fa fa fd fd
  0x100c1a000b4ad0: fd fd fd fd fd fd fd fd fd fd fd fd fd fd fd fa
  0x100c1a000b4ae0: fa fa fa fa fa fa fa fa fd fd fd fd fd fd fd fd
=>0x100c1a000b4af0: fd fd[fd]fd fd fd fd fd fd fa fa fa fa fa fa fa
  0x100c1a000b4b00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x100c1a000b4b40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==1134126==ABORTING
```

Additional backtraces (not full):
Allocation:
```
#0  __memset_z196 () at ../sysdeps/s390/memset-z900.S:144
#1  0x000003ff96f3072a in __asan::Allocator::Allocate (this=this@entry=0x3ff97041eb8 <__asan::instance>, size=size@entry=136, alignment=8, alignment@entry=0, stack=<optimized out>,
    stack@entry=0x3ffdbb45d78, alloc_type=<optimized out>, can_fill=true) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:599
#2  0x000003ff96f2c088 in __asan::asan_memalign (alignment=alignment@entry=0, size=size@entry=136, stack=stack@entry=0x3ffdbb45d78, alloc_type=alloc_type@entry=__asan::FROM_NEW)
    at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_allocator.cpp:1039
#3  0x000003ff96fb73b0 in operator new (size=136) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:99
#4  0x000003ff41404440 in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::allocate (this=0x3ffdbb468c0,
    __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:127
#5  0x000003ff414042a0 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::allocate (__a=...,
    __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:464
#6  0x000003ff41403b66 in std::__allocate_guarded<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > > (__a=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:98
#7  0x000003ff4140372a in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47888, __p=@0x3ffdbb47880: 0x0, __a=..., __args=..., __args=..., __args=..., __args=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:648
#8  0x000003ff41403328 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1342
#9  0x000003ff41402f06 in std::shared_ptr<c10::FunctionSchema>::shared_ptr<std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (
    this=0x3ffdbb47880, __tag=..., __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:409
#10 0x000003ff41402b6e in std::allocate_shared<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__a=...,
    __args=..., __args=..., __args=..., __args=...) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:862
#11 0x000003ff4140215c in std::make_shared<c10::FunctionSchema, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::vector<c10::Argument, std::allocator<c10::Argument> >, std::vector<c10::Argument, std::allocator<c10::Argument> > > (__args=..., __args=..., __args=..., __args=...)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:878
#12 0x000003ff413d180c in c10::TupleType::createWithSpec<c10::basic_string_view<char> > (qualName=..., field_names=std::vector of length 1, capacity 1 = {...},
    field_types=std::vector of length 1, capacity 1 = {...}, field_defaults=std::vector of length 0, capacity 0) at /home/user/pytorch/aten/src/ATen/core/type.cpp:769
#13 0x000003ff413b9ca6 in c10::TupleType::createNamed (qualName=..., field_names=std::vector of length 1, capacity 1 = {...}, field_types=std::vector of length 1, capacity 1 = {...})
    at /home/user/pytorch/aten/src/ATen/core/type.cpp:725
#14 0x000003ff4115fbac in c10::ivalue::TupleTypeFactory<c10::TupleType>::fallback (type=...) at /home/user/pytorch/aten/src/ATen/core/dynamic_type.cpp:383
#15 0x000003ff708217fe in c10::ivalue::Tuple::type<c10::TupleType> (this=0x6080004b8520) at /home/user/pytorch/aten/src/ATen/core/ivalue_inl.h:781
#16 0x000003ff70800740 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff70800306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
#18 0x000003ff702d6872 in pybind11::detail::type_caster<c10::IValue, void>::cast (src=...) at /home/user/pytorch/torch/csrc/jit/python/pybind.h:138
#19 0x000003ff70d98192 in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::operator()(pybind11::detail::function_call&) const (this=0x3ffdbb4ca20, call=...)
    at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:249
#20 0x000003ff70d97cfe in pybind11::cpp_function::initialize<torch::jit::initJitScriptBindings(_object*)::$_45, c10::IValue, torch::jit::mobile::Module&, pybind11::tuple const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg>(torch::jit::initJitScriptBindings(_object*)::$_45&&, c10::IValue (*)(torch::jit::mobile::Module&, pybind11::tuple const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#1}::__invoke(pybind11::detail::function_call&) (call=...)
    at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:224
#21 0x000003ff6e9652ea in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x3ff83e27720>,
    args_in=(<torch._C.LiteScriptModule at remote 0x3ff811844b0>, (<Tensor at remote 0x3ff814efb00>,)), kwargs_in=0x0) at /home/user/pytorch/cmake/../third_party/pybind11/include/pybind11/pybind11.h:929
```

Deallocation:
```
#0  operator delete (ptr=0x60d0005a5740) at /var/tmp/portage/sys-devel/gcc-11.3.1_p20230303/work/gcc-11-20230303/libsanitizer/asan/asan_new_delete.cpp:160
#1  0x000003ff44904fdc in __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> >::deallocate (this=0x3ffc5dc8020,
    __p=0x60d0005a5740, __t=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/ext/new_allocator.h:145
#2  0x000003ff44904fa8 in std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::deallocate (
    __a=..., __p=0x60d0005a5740, __n=1) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/alloc_traits.h:496
#3  0x000003ff449041f2 in std::__allocated_ptr<std::allocator<std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2> > >::~__allocated_ptr (
    this=0x3ffc5dc8030) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/allocated_ptr.h:74
#4  0x000003ff44904888 in std::_Sp_counted_ptr_inplace<c10::FunctionSchema, std::allocator<c10::FunctionSchema>, (__gnu_cxx::_Lock_policy)2>::_M_destroy (this=0x60d0005a5740)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:538
#5  0x000003ff43895a62 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x60d0005a5740) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:184
#6  0x000003ff43895420 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x611000c40648) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#7  0x000003ff4466e7f4 in std::__shared_ptr<c10::FunctionSchema, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x611000c40640)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#8  0x000003ff4466d820 in std::shared_ptr<c10::FunctionSchema>::~shared_ptr (this=0x611000c40640) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#9  0x000003ff448d82f6 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#10 0x000003ff448d8346 in c10::TupleType::~TupleType (this=0x611000c40580) at /home/user/pytorch/aten/src/ATen/core/jit_type.h:1142
#11 0x000003ff731296a4 in std::_Sp_counted_ptr<c10::TupleType*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=0x603000c43ae0)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:348
#12 0x000003ff71eaf666 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x603000c43ae0) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:168
#13 0x000003ff71eaf330 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x3ffc5dc9368) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:705
#14 0x000003ff73129ee4 in std::__shared_ptr<c10::TupleType, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x3ffc5dc9360)
    at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr_base.h:1154
#15 0x000003ff73122390 in std::shared_ptr<c10::TupleType>::~shared_ptr (this=0x3ffc5dc9360) at /usr/lib/gcc/s390x-ibm-linux-gnu/11/include/g++-v11/bits/shared_ptr.h:122
#16 0x000003ff73d00788 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:613
#17 0x000003ff73d00306 in torch::jit::toPyObject (ivalue=...) at /home/user/pytorch/torch/csrc/jit/python/pybind_utils.cpp:604
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101400
Approved by: https://github.com/zou3519
2023-05-15 15:32:10 +00:00
0e017af35b make torch/csrc/jit/python/pybind_utils.cpp data_ptr-correct (#100682)
make torch/csrc/jit/python/pybind_utils.cpp data_ptr-correct

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100682
Approved by: https://github.com/Skylion007
2023-05-05 15:53:06 +00:00
24bf15fe8d Support record_stream in dispatch mode (#99529)
Summary:
Issuing a `t.record_stream(s)` call while a `TorchDispatchMode` is active was throwing because PyTorch was unable to convert a c10::Stream back to a Python object. It's now fixed.

Fixes https://github.com/pytorch/pytorch/issues/94403

Test Plan: Added a unit test

Differential Revision: D45117566

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99529
Approved by: https://github.com/albanD
2023-04-21 07:17:19 +00:00
39fd7f945f Add Symbool support in python to C++ translation (#98453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98453
Approved by: https://github.com/ezyang
2023-04-12 03:21:57 +00:00
b45880c537 Optionally ignore utf-8 decoding error when converting std::string to python str. (#97282)
Summary: When language models use c++ tokenizer, outputs are a c++ strings that are not necessarily valid utf-8 encodings. Default pybind11 casting uses strict utf-8 decoding. We relax the decoding using 'ignore' argument.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/6473924609918070

Reviewed By: Nayef211

Differential Revision: D43970697

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97282
Approved by: https://github.com/davidberard98
2023-03-23 01:19:08 +00:00
0d0ebcdfe5 feature: adding the ability to restore shapes after loading a traced model (#90744)
Adds the ability to store inputs used in tracing models when calling torch.jit.save and restore the input shapes using torch.jit.load if the appropriate variables are set.

Fixes [89185](https://github.com/pytorch/pytorch/issues/89185)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90744
Approved by: https://github.com/davidberard98
2023-02-10 17:12:52 +00:00
cyy
bfe5e1258b avoid unnecessary static_cast (#93898)
avoid unnecessary static_cast
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93898
Approved by: https://github.com/Skylion007
2023-02-03 03:44:43 +00:00
2fc73622f8 [jit] Support Awaitable type (#90863)
We want to make TorchRec sharded models TorchScriptable.

TorchRec sharded models uses generic types Awaitable[W] and LazyAwaitable[W] (https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/types.py#L212).
In sharded model those types are used instead of contained type W, having the initialization function that produces object of type W.

At the moment when the first attribute of W is requested - `LazyAwaitable[W]` will call its initialization function (on the same stack), cache the result inside and work transparently as an object of W. So we can think about it as a delayed object initialization.

To support this behavior in TorchScript - we propose a new type to TorchScript - `Await`.
In eager mode it works the same as `LazyAwaitable[W]` in TorchRec, being dynamically typed - acting as a type `W` while it is `Await[W]`.

Within torchscript it is `Await[W]` and can be only explicitly converted to W, using special function `torch.jit.awaitable_wait(aw)`.
Creation of this `Await[W]` is done via another special function `torch.jit.awaitable(func, *args)`.

The semantic is close to `torch.jit.Future`, fork, wait and uses the same jit mechanics (inline fork Closures) with the difference that it does not start this function in parallel on fork. It only stores as a lambda inside IValue that will be called on the same thread when `torch.jit.awaitable_wait` is called.

For example (more examples in this PR `test/jit/test_await.py`)
```
      def delayed(z: Tensor) -> Tensor:
          return Tensor * 3

      @torch.jit.script
      def fn(x: Tensor):
          aw: Await[int] = torch.jit._awaitable(delayed, 99)
          a = torch.eye(2)
          b = torch.jit._awaitable_wait(aw)
          return a + b + x
```

Functions semantics:

`_awaitable(func -> Callable[Tuple[...], W], *args, **kwargs) -> Await[W]`

Creates Await object, owns args and kwargs. Once _awaitable_wait calls, executes function func and owns the result of the function. Following _awaitable_wait calls will return this result from the first function call.

`_awaitable_wait(Await[W]) -> W`
Returns either cached result of W if it is not the first _awaitable_wait call to this Await object or calls specified function if the first.

`_awaitable_nowait(W) -> Await[W]`

Creates trivial Await[W] wrapper on specified object To be type complaint for the corner cases.

Differential Revision: [D42502706](https://our.internmc.facebook.com/intern/diff/D42502706)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90863
Approved by: https://github.com/davidberard98
2023-01-30 17:38:59 +00:00
0247ed27cc Apply Clang-Tidy readability-container-size-empty (#93236)
Not only is this change usually shorter and more readable, it also can yield better performance. size() is not always a constant time operation (such as on LinkedLists), but empty() always is.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93236
Approved by: https://github.com/malfet
2023-01-29 23:28:19 +00:00
8f1c3c68d3 [BE] Use nested namespaces in .cpp/.cu files (#92100)
As we live in C++17 world

This is a functional no-op, just
- `s/namespace at { namespace native {/namespace at::native {/`
- `s/namespace torch { namespace jit {/namespace torch::jit {/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92100
Approved by: https://github.com/izaitsevfb
2023-01-13 16:32:34 +00:00
e096d2db5a [BC-Breaking] Separate stream_id, device_index, and device_type in pack and unpack for Streams (#81596)
#75854

A naive attempt at working around the limitations of using a single 64-bit integer to pack `stream_id`, `device_index`, and `device_type`.

Stills needs sanity checks, testing, and minimization of BC-breaking changes.

Currently a Holder for the `StreamData3` struct is used for `IValue` compatibility. While doing this seems to work for `ivalue.h` and `ivalue_inl.h`, this doesn't seem to be naively working for the JIT CUDA stream wrapper? (Something about ambiguous calls if an `intrusive_ptr` to `c10::ivalue::StreamData3Holder` is used as the return type for `pack()`. It turns out that the methods required to access the fields for rematerializing a CUDA Stream are basically already present anyway, so `pack` is simply removed in the wrapper for now and the methods to access the required fields are called directly.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81596
Approved by: https://github.com/ezyang
2023-01-12 14:16:49 +00:00
c470ad4f4a Add missing overload for ivalue toSym(Int|Float) (#91405)
Noticed the toSymFloat / toSymInt overloads always copied the internal pointer of an ivalue even if it was an rvalue unlike other overloads (like toTensor). This fixes that issue by adding the appropriate methods needed to facilitate that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91405
Approved by: https://github.com/ezyang
2022-12-28 11:07:37 +00:00
1ff52225f1 Unify SymIntNode and SymFloatNode into SymNode (#87817)
This refactor was prompted by challenges handling mixed int/float
operations in C++.  A previous version of this patch
added overloads for each permutation of int/float and was unwieldy
https://github.com/pytorch/pytorch/pull/87722/  This PR takes a different
approach.

The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode.  This is type erased; we
no longer know statically at C++ if we have an int/float and have to test
it with the is_int()/is_float() virtual methods.  This has a number of
knock on effects.

- We no longer have C++ classes to bind to Python.  Instead, we take an
  entirely new approach to our Python API, where we have a SymInt/SymFloat
  class defined entirely in Python, which hold a SymNode (which corresponds
  to the C++ SymNode).  However, SymNode is not pybind11-bound; instead,
  it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
  when it goes into C++.  This implies a userland rename.

  In principle, it is also possible for the canonical implementation of SymNode
  to be written in C++, and then bound to Python with pybind11 (we have
  this code, although it is commented out.)  However, I did not implement
  this as we currently have no C++ implementations of SymNode.

  Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
  code needs to know how to find these classes.  Currently, this is done
  just by manually importing torch and getting the attributes.

- Because SymInt/SymFloat are easy Python wrappers, __sym_dispatch__ now
  takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
  __torch_dispatch__ works.

Some miscellaneous improvements:

- SymInt now has a constructor that takes SymNode.  Note that this
  constructor is ambiguous if you pass in a subclass of SymNode,
  so an explicit downcast is necessary.  This means toSymFloat/toSymInt
  are no more.  This is a mild optimization as it means rvalue reference
  works automatically.

- We uniformly use the caster for c10::SymInt/SymFloat, rather than
  going the long way via the SymIntNode/SymFloatNode.

- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
  functions, pretty sure this doesn't do anything.

- guard_int is now a free function, since to guard on an int you cannot
  assume the method exists.  A function can handle both int and SymInt
  inputs.

- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
  ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
  plain methods; this is to help avoid confusion between the two types.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
2022-10-27 20:56:02 +00:00
169ec120ef [Modes] refactor modes to only use a stack in cpp (#86458)
Refactors the mode code to only have the C++ mode stack and not the "C++ mode" like we originally had. This also simplifies the mode logic in a number of places
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86458
Approved by: https://github.com/zou3519
2022-10-21 19:18:23 +00:00
48f0231223 Fix Scalar(bool) handling in toIValue (#87179)
At the moment, they were casted to `int64`, which breaks quite a few
casting rules for example in `ops.aten`.

Quite a vintage bug, circa 2020.

With this fix, the following code prints `torch.bool`, rather than `torch.int64`.
```python
import torch
msk = torch.tensor([False])
b = torch.tensor([False])
print(torch.ops.aten.where.ScalarSelf(msk, True, b).dtype)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87179
Approved by: https://github.com/albanD
2022-10-18 18:53:03 +00:00
cb87983cb8 Decay integer-only (Optional)SymIntArrayRef to IntList in IValue (#86094)
We have logic that says if you ask for a SymIntList from an IValue, but the IValue is actually an IntList, we will still give it to you in that case (check ivalue_to_arg in aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h). However, we also need the *inverse* version of this logic, which says that if you construct an IValue from a SymIntArrayRef, and it is actually integer only, we need to store it as an IntList, so that toIntList on the IValue will work.

The way this works is a bit twisty, but our basic strategy is to disable construction of IValue from list container types that contain SymInt directly, and then directly implement variants of these constructors by hand, which iterate over the elements of the list and test if there are any SymInts or not to decide what type to construct the underlying List. These variants have to be templated, otherwise we will run afoul ambiguous overloads. I only did the overloads that actually occurred in practice; you may need to add more if you SymIntify more stuff.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86094
Approved by: https://github.com/anjali411, https://github.com/albanD
2022-10-03 20:12:32 +00:00
8753703b68 Fix some bugs in SymFloat IValue and toPyObject handling (#86072)
- Test for symbolic cases first before non-symbolic, as symbolic
  ints/floats advertise as being ints/floats
- Add missing case for toPyObject

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86072
Approved by: https://github.com/wconstab
2022-10-03 02:06:38 +00:00
07800c9c81 Miscellaneous fixes from symbolic-shapes branch (#86042)
- Make toIValue accept SymIntNode and SymFloatNode where number (aka Scalar) is
  expected
- Binding for symintlistOptional in python arg parser
- Teach translate to convert from IntArrayRef to ArrayRef<int64_t>
- Don't query _symint function for meta info in LTC unless LTC is
  code generating a symint function

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86042
Approved by: https://github.com/Chillee
2022-10-01 13:57:58 +00:00
61b4e8a7bf More SymFloat support (#85411)
- Support storing SymFloat in IValue
- Add SymFloat to JIT type system (erases to float)
- Printing support for SymFloat
- add/sub/mul/truediv operator support for SymFloat
- Support truediv on integers, it returns a SymFloat
- Support parsing SymFloat from Python object

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85411
Approved by: https://github.com/albanD
2022-09-22 08:07:22 +00:00
7e900f204f Avoid throwing an exception when ScriptList doesn't match. (#84921)
This prevents 'catch throw' gdb breakpoint pollution and
should also improve performance.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84921
Approved by: https://github.com/Chillee
2022-09-13 14:40:01 +00:00
7a9ab5c232 Move Python argument related functions to cpp file (#84919)
No changes to contents, just moving things out of header.
I only moved the stuff I suspected I'd be editing; maybe more
things from this header could migrate out.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84919
Approved by: https://github.com/suo
2022-09-13 07:22:23 +00:00
2a332afbf4 Add SymFloat, support SymInt to SymFloat conversion (#84284)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84284
Approved by: https://github.com/albanD
2022-09-03 01:30:32 +00:00
ad44670fa1 Back out "Revert D38984222: Don't introduce new overload for SymInt (#83628)" (#84173)
Also Back out "Revert D39075159: [acc_tensor] Use SymIntArrayRef for overloaded empty.memory_format's signature"

Original commit changeset: dab4a9dba4fa
Original commit changeset: dcaf16c037a9

Original Phabricator Diff: D38984222
Original Phabricator Diff: D39075159

Also update Metal registrations for C++ registration changes.

Also update NNPI registration to account for tightened schema checking

Differential Revision: [D39084762](https://our.internmc.facebook.com/intern/diff/D39084762/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D39084762/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84173
Approved by: https://github.com/Krovatkin
2022-08-29 18:01:07 +00:00
3aae6ff1e1 Add nvprims.var_mean (#83508)
This PR adds nvfuser-specific primitive - `var_mean`.
Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-28 18:45:25 +00:00
b159a5230f Revert "Add nvprims.var_mean (#83508)"
This reverts commit 7e7694b6615fbf46abfab234615fa891c2819eb7.

Reverted https://github.com/pytorch/pytorch/pull/83508 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-08-28 11:30:27 +00:00
7e7694b661 Add nvprims.var_mean (#83508)
This PR adds nvfuser-specific primitive - `var_mean`.
Interpretation `torch.var_mean` -> `torch.ops.nvprims.var_mean` is handled by `TorchRefsNvfuserCapabilityMode` context manager.

I moved some helper code from `_prims/__init__.py` to `_prims_common`. Correctness is tested with OpInfo tests (see `PythonRefInfo("ops.nvprims.var_mean"`).

Layer norm reference now uses `torch.var_mean` instead of `torch._refs.var_mean` to allow interception. Here's a simple comparison of performance with this PR and master (on 3080ti):
```py
import torch
from torch._prims.context import TorchRefsNvfuserCapabilityMode
from torch.fx.experimental.proxy_tensor import make_fx
from torch._prims.executor import execute

def func(a):
    return torch.native_layer_norm(a, (1024,), None, None, 1e-6)

a = torch.randn(10, 512, 1024, dtype=torch.float16, device="cuda")

with TorchRefsNvfuserCapabilityMode():
    gm = make_fx(func)(a)

for _ in range(10):
    execute(gm, a, executor="strictly_nvfuser");
```
run with `PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth python script.py`
```py
# WITH THIS PR
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.033792 ms, achieved: 621.818 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.032608 ms, achieved: 644.396 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.032768 ms, achieved: 641.25 GB/s
# kernel1 run in 0.03072 ms, achieved: 684 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s
# kernel1 run in 0.031744 ms, achieved: 661.935 GB/s

# ON MASTER
# kernel1 run in 0.05632 ms, achieved: 373.091 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043808 ms, achieved: 479.649 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.044032 ms, achieved: 477.209 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
# kernel1 run in 0.043008 ms, achieved: 488.571 GB/s
```
So this PR gives about 35% improvement in performance using nvfuser executor with this specific normalized shape.

Also this PR fixes https://github.com/pytorch/pytorch/issues/83506 (see the change in `torch/csrc/jit/python/pybind_utils.cpp`).

Ref. https://github.com/pytorch/pytorch/issues/80187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83508
Approved by: https://github.com/ngimel
2022-08-27 09:05:20 +00:00
c7edcd6968 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 9790d90e4b0288796ab44a6b4979db0a67580ba8.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to Breaks internal builds, see D39076487
2022-08-27 01:23:17 +00:00
9790d90e4b Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-26 01:35:40 +00:00
a7edf71360 Revert "Don't introduce new overload for SymInt (#83628)"
This reverts commit 8fae7027b399e65e6071d335aa874497682c84d0.

Reverted https://github.com/pytorch/pytorch/pull/83628 on behalf of https://github.com/malfet due to breaking internal builds, see https://www.internalfb.com/diff/D38984222
2022-08-25 00:49:40 +00:00
8fae7027b3 Don't introduce new overload for SymInt (#83628)
Previously, we introduced new SymInt overloads for every function we wanted.  This led to a lot of boilerplate, and also a lot of confusion about how the overloads needed to be implemented.

This PR takes a simpler but more risky approach: just take the original function and changes its ints to SymInts.

This is BC-breaking in the following ways:

* The C++ API for registering implementations for aten operators will change from int64_t to SymInt whenever you make this change. Code generated registrations in PyTorch do not change as codegen handles the translation automatically, but manual registrations will need to follow the change.  Typically, if you now accept a SymInt where you previously only took int64_t, you have to convert it back manually.  This will definitely break XLA, see companion PR https://github.com/pytorch/xla/pull/3914 Note that not all dispatch keys get the automatic translation; all the composite keys and Meta keys are modified to take SymInt directly (because they should handle them directly), and so there are adjustments for this.

This is not BC-breaking in the following ways:

* The user facing C++ API remains compatible.  Even if a function changes from int to SymInt, the default C++ binding still takes only ints.  (e.g., at::empty(IntArrayRef, ...).  To call with SymInts, you must call at::empty_symint instead. This involved adding two more signatures to CppSignatureGroup; in many cases I refactored code to iterate over all signatures in the group instead of hard-coding the two that previously existed.
* This is TorchScript compatible; internally we treat SymInts as ints so there is no change to what happens at runtime in TorchScript. In particular, it's OK to reference an empty schema by its old type (using int types), as long as you're not doing string equality (which you shouldn't be), these parse to the same underyling type.

Structure of the PR:

* The general strategy of this PR is that, even when you write `SymInt` inside `native_functions.yaml`, sometimes, we will treat it *as if* it were an `int`. This idea pervades the codegen changes, where we have a translation from SymInt to c10::SymInt or int64_t, and this is controlled by a symint kwarg which I added and then audited all call sites to decide which I wanted. Here are some of the major places where we pick one or the other:
  * The C++ FunctionSchema representation represents `SymInt` as `int`. There are a few places we do need to know that we actually have a SymInt and we consult `real_type()` to get the real type in this case. In particular:
    * When we do schema validation of C++ operator registration, we must compare against true schema (as the C++ API will provide `c10::SymInt`, and this will only be accepted if the schema is `SymInt`. This is handled with cloneWithRealTypes before we check for schema differences.
    * In `toIValue` argument parsing, we parse against the true schema value. For backwards compatibility reasons, I do still accept ints in many places where Layout/SymInt/etc were expected. (Well, accepting int where SymInt is expected is not BC, it's just the right logic!)
  * In particular, because SymInt never shows up as type() in FunctionSchema, this means that we no longer need a dedicated Tag::SymInt. This is good, because SymInts never show up in mobile anyway.
* Changes to functorch/aten are mostly about tracking changes to the C++ API registration convention. Additionally, since SymInt overloads no longer exist, registrations for SymInt implementations are deleted. In many cases, the old implementations did not properly support SymInts; I did not add any new functionality with this PR, but I did try to annotate with TODOs where this is work to do. Finally, because the signature of `native::` API changed from int to SymInt, I need to find alternative APIs for people who were directly calling these functions to call. Typically, I insert a new dispatch call when perf doesn't matter, or use `at::compositeexplicitautograd` namespace to handle other caes.
* The change to `make_boxed_from_unboxed_functor.h` is so that we accept a plain IntList IValue anywhere a SymIntList is expected; these are read-only arguments so covariant typing is OK.
* I change how unboxing logic works slightly. Previously, we interpret the C++ type for Layout/etc directly as IntType JIT type, which works well because the incoming IValue is tagged as an integer. Now, we interpret the C++ type for Layout as its true type, e.g., LayoutType (change to `jit_type.h`), but then we accept an int IValue for it anyway. This makes it symmetric with SymInt, where we interpret the C++ type as SymIntType, and then accept SymInt and int IValues for it.
* I renamed the `empty.names` overload to `empty_names` to make it less confusing (I kept mixing it up with the real empty overload)
* I deleted the `empty.SymInt` overload, which ended up killing a pile of functions. (This was originally a separate PR but the profiler expect test was giving me grief so I folded it in.)
* I deleted the LazyDynamicOpsTest tests. These were failing after these changes, and I couldn't figure out why they used to be passing: they make use of `narrow_copy` which didn't actually support SymInts; they were immediately converted to ints.
* I bashed LTC into working. The patches made here are not the end of the story. The big problem is that SymInt translates into Value, but what if you have a list of SymInt? This cannot be conveniently represented in the IR today, since variadic Values are not supported. To work around this, I translate SymInt[] into plain int[] (this is fine for tests because LTC dynamic shapes never actually worked); but this will need to be fixed for proper LTC SymInt support. The LTC codegen also looked somewhat questionable; I added comments based on my code reading.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83628
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2022-08-23 22:04:07 +00:00
5b621205f4 Revert "Revert "adding a custom caster for c10::SymInt (#82692)"" (#83223)
This should fix the MacOS build errors and reland #82692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83223
Approved by: https://github.com/albanD
2022-08-12 00:46:50 +00:00
daeea7d2c3 Revert "adding a custom caster for c10::SymInt (#82692)"
This reverts commit dee63f4f7bb35559fc34539422229d9ab375dfb9.

Reverted https://github.com/pytorch/pytorch/pull/82692 on behalf of https://github.com/seemethere due to Broke internal builds, see [logs](https://www.internalfb.com/intern/sandcastle/job/4503600373141339/insights)
2022-08-09 22:17:41 +00:00
dee63f4f7b adding a custom caster for c10::SymInt (#82692)
### Description
Adding a custom caster for `c10::SymInt`. This simplifies handling of c10::SymInt on C++/Pytorch boundary. Namely, removing if statements to handle the union nature (e.g. SymIntNode, int) of c10::SymInt.

### Issue
<!-- Link to Issue ticket or RFP -->

### Testing
<!-- How did you test your change? -->

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82692
Approved by: https://github.com/ezyang
2022-08-08 21:40:53 +00:00
fd5ac1e6b5 Rename SymbolicIntNode to SymIntNodeImpl (#82350)
Done via

```
git grep -l 'SymbolicIntNode' | xargs sed -i 's/SymbolicIntNode/SymIntNodeImpl/g'
```

Reasoning for the change:

* Sym is shorter than Symbolic, and consistent with SymInt
* You usually will deal in shared_ptr<...>, so we're going to
  reserve the shorter name (SymIntNode) for the shared pointer.

But I don't want to update the Python name, so afterwards I ran

```
 git grep -l _C.SymIntNodeImpl | xargs sed -i 's/_C.SymIntNodeImpl/_C.SymIntNode/'
```

and manually fixed up the binding code

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82350
Approved by: https://github.com/Krovatkin
2022-07-28 18:27:45 +00:00