Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explictly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).
This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.
------
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:
Part of FSDP2's tracing strategy right now is that:
(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated
(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)
(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.
It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.
However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:
(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)
(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225
Co-authored-by: Will Feng <yf225@cornell.edu>
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions.
This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215
Approved by: https://github.com/jansel
More details further down, but first a more high-level description of "how do we functionalize storage resizing"
Today, dynamo converts `param.untyped_storage().resize_(x)` calls that it sees from fsdp into a custom op, `ops.inductor.resize_storage_bytes_(x)`
So given this setup, there are 3 main cases that I think we want to handle:
(1) graph input starts with a real storage size, gets resized down to zero in the graph
(2) graph input starts with 0 storage size, gets resized up in the graph
(3) graph input starts with 0 storage size, gets resized up and used in some compute, then resized back down to 0
For case (1) we need to emit a `resize_storage_bytes_` at the end of the graph, similar to how we emit `copy_()` for data mutations.
For case (2), we need to emit a `resize_storage_bytes_` in the graph, and we **also** need to emit a `copy_()` (the input had its storage resized up, and filled in with data, which is we need to reflect as an input mutation)
For case (3), the net effect is that the input had no data on entry and exit of the function, so we don't need to emit any mutable ops in the end of the graph.
The main thing to call out is that: we need to write a functionalization rule for `resize_storage_byte_`, (`FunctionalTensorWrapper::storage_resize_()`) and this rule actually does very little. We would like to **not** emit any new ops in the graph (like say, a functional resize op). Instead, we should expect / rely on the fact that any resize up will be immediately followed by a `copy_()`/`foreach_copy_`/`out=` op, that will fill in the data of the tensor. So `FunctionalTensor` can temporarily live in a state where its data is invalid, until the `x.copy_(y)` "updates" its data with the new tensor.
So effectively, all that this rule does is:
(1) it stores metadata on the storage, indicating that the tensor was resized, as well as the updated storage size. We need this info in AOTAutograd, so it knows whether to emit a mutable resize_() op in the graph epilogue
(2) There is also a corner case: if we are resizing down to zero, but our tensor had **previously** had a zero size storage, then we update `value_` to point to the original value of the tensor. The reason this seems safe is because if we have a zero storage sized tensor `x`, and we resize it up, use it in some compute, resize it back down to zero, and use it somewhere, we would want the functional version of this code to use the original `x` after the second resize. For FSDP, this is important because we end up saving parameters (graph inputs) for backward, and we want to make sure that the thing we save (and the output to the forward graph) is the original, zero-storage-sized parameter, and not the "version 2" of the parameter after the first resize_()
I think a good order to look at changes in this PR would be:
(1) `test_aotdispatch.py` shows the 3 main cases I focused on as well as the expected functionalized graphs
(2) In `FunctionalStorageImpl.h/cpp`, I had to add a notion of "original base", and "original/curr_size". The first is so I can re-use the zero-size tensor after multiple resizes, and the second is so I can tell in AOTAutograd whether any resizes canceled each other out into a no-op
(3) FunctionalTensorWrapper.h/cpp has the new resize functionalizion rule + some extra utils
(4) `_functorch/_autograd`: the main changes in this folder were around adding the logic at trace-time to detect when we need to put a resize_() in the graph. I also have some assertions to check that any inputs that experience storage resizing will **always be in the graph** and not the opaque epilogue, and I also limited the resize_() mutation case so that you can only ever start with zero storage, or end with zero storage (you can't do e.g. `torch.ones(2).storage().resize_(3)`), and banned it on tensor subclasses
(5) `fake_tensor.py`/`meta_utils.py`: we now need to be able to fakeify tensors with zero storage, so I added a quick version of it in meta_utils.py. This also.. has ramifications for fake tensor caching that I need to fix (include the storage size on the cache key, maybe?)
------------------
This PR subsumes https://github.com/pytorch/pytorch/pull/120971.
This PR is enough to **almost** get a simple ppFSDP forward pass tracing with a functionalized resize_() properly. It also attempts to do the updated version from @jansel, where we don't have any notion of `resize_()` in the graph at all, post functionalization. It would probably be good to test it with @yf225 's FSDP changes, and see how many of the FX passes it allows us to remove. I think that in theory, it should allow us to remove all FX passes that affect the forward graph / partitioner, **except** the one that forces views to be recomputed in the backward (more details below).
There are a few things worth calling out:
(1) failed attempt at functionalizing `aten.copy_()`. I originally wanted to get a version takes these operations:
```
param.storage().resize_(all_gather_size)
param.copy_(all_gather_buffer)
out = aten.matmul(param, param)
```
and functionalizes them into:
```
out = aten.matmul(all_gather_buffer, all_gather_buffer)
```
This would involve getting functionalization to turn `x.copy_(y)` into a giant no-op that just returns `y`. Unfortunately, we can't actually do this in a reasonable way within functionalization (instead, there's a functional `aten.copy` in the graph - see the test case graph expecttest for details). Why? In order for that transformation to be safe, `x` and `y` need to have the same metadata. However, it's possible for `x` and `y` to be subclasses of different types. This is not something we can easily tell from within functionalization, and would be a layering violation. So for now I'm leaving it to downstream code to optimize away the `aten.copy` (this is already the case today, so I think inductor can handle this)
(2) The forward doesn't **actually** run successfully in this PR (see the `assertRaisesRegex` in the test). Why?
The final forward graph looks like this:
```
def forward(self, primals_1, primals_2):
_foreach_copy = torch.ops.aten._foreach_copy.default([primals_1], [primals_2]); primals_2 = None
getitem = _foreach_copy[0]; _foreach_copy = None
mm = torch.ops.aten.mm.default(getitem, getitem); getitem = None
t_1 = torch.ops.aten.t.default(primals_1); primals_1 = None
return [mm, t_1]
```
Where `primals_1` starts out as a secretly-zero-storage-size parameter, and gets resized up and back down within the forward (these are functionalized away).
Importantly, the matmul happy on the result of the `foreach_copy`, **but** the activation that we save for backward (`t_1`) is the result of transposing the **original parameter** (the zero-storage-size param). This is exactly the optimization in fsdp that allows us to have good peak memory usage.
The problem is that the min-cut partitioner decides to save `t_1` for backward. Running this code in eager breaks, because the kernel for `aten.permute(x)` is not happy when `x` has secretly-zero-sized-storage.
The real problem here is that in eager mode the `permute` kernel runs during the backward, after backward hooks have properly resized the saved activation. Here, we are running the transpose in the forward.
One option would be to turn off the checks in our view kernels and allow them to work on zero-storage-sized tensors, which feels pretty bad. Another option is to tweak the partitioner (or use one of Will's FX passes) to force the partitioner to not save views for backward, and allow the views to be recomputed in the backward. This seems kind of silly, but is also probably harmless.
(3) The backward is still broken. To be fair, this issue is pretty separable from "functionalizing storage resize calls", and can be fixed later (either by a real fix to our tracing infra, or via another hacky FX pass). More description of this problem is described at issue (8) of my PR description in https://github.com/pytorch/pytorch/pull/120971
(4) I only added support for "full graph" resizing: basically, the limited case where a param starts with zero storage size, and gets resized up and back down. I think we can add support for the graph break case, but I think we can keep that add-on separate from this PR unless we need it immediately. I also added asserts so we should fail loudly when we hit this case
(5) I have a change to FakeTensor creation when inputs have zero storage size that.. is probably ok. But I also removed FakeTensor caching on view ops, which I probably need to fix before I can land this PR
(6) I added a notion of "original_base" to `FunctionalStorageImpl`. More details are in the comments, but my rational for this was that we basically need it to ensure that autograd saves the **original**, zero-storage-sized param for backward, after resizing up and back down
(7) I had to update our eager kernels for `aten.copy` and `aten._foreach_copy`, to handle the case where the `self` argument has secretly-zero-storage. Inductor can probably generate correct code for this case, but we need these ops to work properly in this situation for the `aot_eager` backend to do the right thing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122434
Approved by: https://github.com/jansel
### Context
In today's Dynamo, we lift all tensors encountered during tracing to be individual graph inputs, even when they were in a container.
And [Dynamo generates](fdc281f258/torch/_dynamo/codegen.py (L371)) the runtime function's signature using the graph's graphargs.
This means that the generated function will have each grapharg as an argument, which is problematic if we want to free the inputs in inductor codegen. See [python function arguments are kept alive for the duration of the function call](https://github.com/pytorch/pytorch/pull/83137#issuecomment-1211320670).
```python
# original code
def forward(inputs):
a, b, c, d, e = inputs
inputs.clear()
out = a
out += b
del b # frees memory
out += c
del c # frees memory
out += d
del d # frees memory
out += e
del e # frees memory
return out
# compiled code:
def forward(a, b, c, d, e):
# b, c, d, e can't be freed before end of function
```
This isn't a concern when compiling forward because a, b, c, d, e are all from user code, and should be kept alive. But when compiling backwards, a, b, c, d, e may be intermediate results i.e. activations, that we DO want to clear ASAP to remain on par with eager peak memory.
### Solution
We have encountered similar memory problems in AOTAutograd before, where we adopted the boxed calling convention (wrapping to-be-freed objects in a list), adding list clearing to inductor codegen, and being careful about holding references to elements in the input list. We need to do something similar, but for inputs from the user program (compiled autograd fx graph in this case).
This PR support lists as graphargs/placeholder nodes. When tracing a list of tensors, we create a node for it, and pre-emptively initialize variable trackers for its elements before they are used in the user program. Subsequent uses of those variables will find hits in the lookup table `input_source_to_var`.
With the inputs as a list in the graph args, our compiled code can free inputs just like in the eager case.
```python
def forward(inputs):
# a, b, c, d, e can be freed within the function now
```
Currently, AOT/Inductor flattens list input via [flatten_graph_inputs wrapper](597f479643/torch/_inductor/compile_fx.py (L1454-L1478)), which is why this PR's CI can be green. Additional changes are needed to its runtime wrapper, done in the next PR. The next step is to ensure that we are careful in forwarding the list to inductor codegen without holding additional references.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122353
Approved by: https://github.com/jansel
ghstack dependencies: #123630, #123674
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan