As called out in https://github.com/pytorch/pytorch/pull/137999, preserving signatures of multiple calls when buffer mutations are present was NYI. The main problem was that intermediate values of buffers were not tracked, so couldn't be propagated statefully between multiple calls (i.e., they would need to be explicitly passed around, defeating the unlifting needed for preserving signatures).
This PR fixes this situation, by introducing module attributes that carry the necessary intermediate values of buffer mutations. In general, a buffer mutation can have several intermediate values it depends on recursively, even other buffers. So rather than tying an intermediate value with a particular buffer, we tie it with the submodules that create and read it. We install an attribute on all modules that create or read a particular intermediate value, sharing the same initial storage (i.e., initialized with the same empty tensor). For the module that creates this intermediate value, we copy the value into the corresponding attribute; and for the modules that read it, we read the corresponding attribute instead.
Another complication that needed to be addressed was that a `run_decompositions` following an `export_for_training` was not preserving module call graphs, which is needed for unflattening and, in particular, used when remapping inputs. Fortunately some existing metadata already tracks provenance of nodes, which we could use to update a module call graph after functionalization / decomposition.
Differential Revision: D64806175
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138669
Approved by: https://github.com/tugsbayasgalan
Previously we would error when trying to preserve the call signature for a module when it was called multiple times. This PR can now do this without erroring. The fix is to propagate call indices in a few more places.
Note that while this works in the presence of params, buffers, and tensor constants, preserving call signatures for multiple calls to a module when buffers are mutated is not supported yet. This is future work. The main problem is that we do not have enough metadata to `copy_` mutated buffers at the end of each call to a module, so the next call can read those buffers at the beginning. Making this work will likely need some explicit tracking of intermediate values of mutated buffers when collecting metadata during functionalization in export.
Note also that we stop short of creating a single graph out of multiple graphs: that is still future work. So the unflattened module will still have different targets `n`, `n@1`, `n@2`, etc. for each call when we ask the module call signature of `n` to be preserved. However it is way easier to swap all of these targets with a replacement that behaves similar to the original, because all of these calls will respect the original module call signature. (In particular, any constant inputs will be carried by the calls.)
Differential Revision: D64406945
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137999
Approved by: https://github.com/tugsbayasgalan
We use nn_module_stack in unflatten to recognize when module calls begin and end. However the current format is not sufficient to detect module call boundaries when we have successive calls to the same module, because the successive instructions (end of one call, begin of next call) have the same nn_module_stack. This causes us to effectively "unroll" successive calls to a single call. This can cause problems when preserving module call signatures because the outputs of the successive calls might be concatenated in the single call.
Previously we introduced the concept of a "call index" to generate multiple graphs when unflattening, one per call. This PR pushes this concept into nn_module_stack itself. In particular, the keys of nn_module_stack now go from `key` to `key@call_index`. (In a previous attempt, https://github.com/pytorch/pytorch/pull/137457, instead values in nn_module_stack go from (fqn, type) to (fqn, type, call_index), which is BC-breaking.)
Note that we still do not have the ability to preserve module call signatures for multiple calls to the same module. But now instead of randomly crashing we give a proper error. OTOH when not preserving module call signatures we simply generate multiple calls, each with its own graph, possibly deduplicated, matching what we would do for non-successive calls.
Test Plan: Like D64014936
Differential Revision: D64136277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137646
Approved by: https://github.com/angelayi
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).
Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.
Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.
Test Plan: added a test
Reviewed By: angelayi
Differential Revision: D63913750
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
When we populate unlifted graph module, we actually only "unlift" constant tensor inputs which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two constants. This PR fixes that by querying ep.constants table instead of ep.graph_signature.lifted_tensor_constants.
Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.
```
class N(torch.nn.Module):
def forward(self, x, b):
if b:
return x + 1
else:
return x + 2
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.n = N()
def forward(self, x):
x0 = x + 3
x1 = self.n(x0, True)
x2 = x1 + 4
x3 = self.n(x2, False)
return x3 + 5
m = M()
inp = (torch.ones(1),)
print(m(*inp)) # tensor([16.])
ep = torch.export.export(m, inp)
print(ep.module()(*inp)) # tensor([16.])
unflattened = torch.export.unflatten(ep)
print(unflattened(*inp)) # tensor([15.])
```
However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.
The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.
However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.
A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.
Differential Revision: D63642479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* basica idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs are very special and a must in most of the cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.
# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module. Retrieve their fqns and sort to in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because currently the fpEBC is swapped as a whole, so we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.
WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added if can't find a `KTRegroupAsDict` module, or `finalize_interpreter_modules` is not `True`.
# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicing the APF model structure.
Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler. --_ |""---__
Executed 1 example in 203.1s: |'.| || . """|
Successful: 1 | || || /|\""-. |
Failed: 0 | || || | | |
Skipped: 0 | || || | \|/ |
Not executed: 8 |."| || --"" '__|
https://testslide.readthedocs.io/ --" |__---"""
```
Differential Revision: D62606738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
Summary:
Added the contextmanager, `_disable_interpreter`, which is meant to put around a call to `unflatten`. This will generate an UnflattendModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.
This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030
Test Plan: CI
Differential Revision: D60939034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.
Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled
Differential Revision: D62014035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
Summary: Fixes T192448049. The module call form an unusal call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by unflattener and need some extra design to make it work.
Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"
Reviewed By: zhxchen17
Differential Revision: D60528900
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)
Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Test Plan:
contbuild & OSS CI, see 940e4477ab
Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D59543603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
Summary:
Export, through AOTAutograd, [deduplicates](11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules.
If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener
Test Plan: export unflatten test case
Reviewed By: TroyGarden, angelayi
Differential Revision: D58697231
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153
Approved by: https://github.com/angelayi
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Code snippet from TorchTitan (LLaMa):
```
for layer in self.layers.values():
h = layer(h, self.freqs_cis)
```
`self.freqs_cis` is a buffer of root module (`self`).
It is also an explicit arg in the call signature of original `layer` modules.
If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error.
There are two fixes in this PR:
1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching.
2. We delay the removal of param inputs from `call_module` nodes' `args`, till `_sink_param` call on that submodule returns. The return now returns information on which input is actually removed by the submodule, thus more accurate than just doing:
```
for node in call_module_nodes:
node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args))
```
Before the PR:

After the PR:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607
Approved by: https://github.com/pianpwk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758
Aliased and unused params are currently an issue for strict-mode export. For a model like this:
```
def __init__(self):
# ...
self.alpha = nn.Parameter(torch.randn(4))
self.beta = self.alpha
self.gamma = self.alpha
def forward(self, x):
return x + self.beta
```
Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).
This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.
Another issue that exists is weight-sharing is not maintained in unflattening (all params/buffers are re-cloned) - handle this by checking tensor ids too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758
Approved by: https://github.com/zhxchen17
A re-land of #124239.
This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to only fakify the script object during tracing (i.e. aot_export). After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.
Concretely, we
fakify all the script object inputs, and module attributes (gathered by constant_attrs).
patch the module's attributes with fakified script object
right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125490
Approved by: https://github.com/angelayi
In the given test case, we have a ModuleList of 3 modules (`norm.0`, `norm.1`, `norm.2`) which share the same `weight` and `bias` tensors. However when we trace, they all end up pointing to one state dict name, (ex. `norm.2`).
```
graph():
%p_norms_0_weight : [num_users=0] = placeholder[target=p_norms_0_weight]
%p_norms_0_bias : [num_users=0] = placeholder[target=p_norms_0_bias]
%p_norms_1_weight : [num_users=0] = placeholder[target=p_norms_1_weight]
%p_norms_1_bias : [num_users=0] = placeholder[target=p_norms_1_bias]
%p_norms_2_weight : [num_users=3] = placeholder[target=p_norms_2_weight]
%p_norms_2_bias : [num_users=3] = placeholder[target=p_norms_2_bias]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
%native_layer_norm_1 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_3 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_1, 0), kwargs = {})
%native_layer_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%getitem_3, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem_6 : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm_2, 0), kwargs = {})
return (getitem_6,)
```
This causes an error in the unflattener where after constructing the submodules for `norm.0`, it will have the graph pointing to `norm.2.weight` and `norm.2.bias`:
```
graph():
%p_norms_2_bias : [num_users=1] = placeholder[target=p_norms_2_bias]
%p_norms_2_weight : [num_users=1] = placeholder[target=p_norms_2_weight]
%input_ : [num_users=1] = placeholder[target=input_]
%native_layer_norm : [num_users=1] = call_function[target=torch.ops.aten.native_layer_norm.default](args = (%input_, [2, 2, 3], %p_norms_2_weight, %p_norms_2_bias, 1e-05), kwargs = {})
%getitem : [num_users=1] = call_function[target=operator.getitem](args = (%native_layer_norm, 0), kwargs = {})
return getitem
```
Since the attributes are not within the same scope of the graph, (`norm.0` vs. `norm.2`), they will not be added to the subgraph, causing an error.
So this PR handles the duplicate state dict attributes by modifying the `inputs_to_state` dict to map from node names to a list of possible state dict target names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125192
Approved by: https://github.com/zhxchen17
This PR fakify ScriptObject inputs and attributes in export non-strict mode by default.
The basic idea is to `only fakify the script object during tracing (i.e. aot_export)`. After we get the traced graph module, eagerly executing, serializing, or running more passes will use the real script objects. This is essentially treating the script object as constant tensor.
Concretely, we
1. fakify all the script object inputs, and module attributes (gathered by constant_attrs).
2. patch the module's attributes with fakified script object
3. right after aot_export, remove the patching (to avoid changing the original module) then modify the exported graph module's attribute to real script object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124239
Approved by: https://github.com/zou3519