This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR.
This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way.
Misc changes:
* Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test
* Adds compile support to the ring attention implementations (required some tweaks to process groups)
Test plan:
```
pytest test/distributed/_tensor/test_attention.py
pytest test/distributed/test_functional_api.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215
Approved by: https://github.com/wanchaol
**Summary**
We wrap DTensor's local tensor in `LocalShardsWrapper` for torchrec's table-wise sharding. The exception is on non-participating ranks: for non-participating ranks, the local tensor is an empty torch.Tensor object. The reason of this design is to avoid complexity on supporting empty tensor case on `LocalShardsWrapper`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122853
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392, #122843
**Summary**
Always wrap local tensor into a `LocalShardsWrapper`. This is for uniformity and it leads to easiness on adoption of DTensor as a wrapper for local shard(s) representation. To support more tensor ops over `LocalShardsWrapper`, users need to extend its `__torch_dispatch__`.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise-even`
**Result**
```
Row-wise even sharding example in DTensor
Col 0-15
------- ----------
Row 0-1 cuda:0
Row 2-3 cuda:1
Row 4-5 cuda:2
Row 6-7 cuda:3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122843
Approved by: https://github.com/wz337
ghstack dependencies: #120265, #121392
**Summary**
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.TABLE_WISE` using DTensor.
**Test**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e table-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120265
Approved by: https://github.com/wanchaol
Fixes https://github.com/pytorch/pytorch/issues/122459, https://github.com/pytorch/torchtrain/issues/61
Even with the previous PR ("support DTensor/subclass constructors directly in the graph"), I still see some errors when running the repro above that start some logs showing that dynamo is inlining `__new__`.
I noticed that putting `@torch._dynamo.disable` on DTensor's `__new__` makes the entire repro pass.
Why does having dynamo try to inline `Subclass.__new__` run into problems? Morally, dynamo probably shouldn't be inlining __new__ ("creating a subclass" is a blackbox operation that AOTAutograd can trace through anyway). But concretely, we can end up with a node in the dynamo FX graph that has a "partially initialized tensor subclass" as its example value, because the subclass has been created but its fields have not been assigned to yet.
This breaks a bunch of invariants throughout dynamo: there are many places where if we have a tensor subclass node, we want to look at its inner tensors, to see if they are FakeTensors, what their FakeTensorMode is, and if they have dynamic shapes.
One option is to decide that "uninitialized subclass" is a first-class thing that anyone looking at the FX node examples values on the dynamo graph needs to handle, but this seems like a lot of work when in reality we don't need dynamo to trace the __new__ at all. Hence the `torch._dynamo.disable`.
I still wasn't very satisfied, since it was unclear to me **why** dynamo was inlining the `__new__` call, instead of interposing on the `DTensor()` constructor directly. After a long chat with @anijain2305, he explained that with code like this:
```
@torch._dynamo.disable(recursive=False)
def f(x):
out = SubclassConstructor(x)
```
Dynamo will never get the chance to interpose on the subclass constructor. Instead, what will happen is:
(1) Dynamo hands back control to cpython to run `f()`, since we disabled that frame
(2) `SubclassConstructor(x)` is run in eager mode
(3) `SubclassConstructor(x)` eventually calls `SubclassConstructor__new__`
(4) this is a new frame, that cpython then allows dynamo to intercept and start compiling
So it looks like we are basically forced to handle the situation where dynamo might directly start compiling `Subclass.__new__`
All of the above does not explain the story for `__torch_dispatch__` though. Empirically, I have a repro in torchtrain where looking at the dynamo logs, we see dynamo try to inline `__torch_dispatch__`.
```
[rank0]:DEBUG: Skipping frame because no content in function call _prepare_output_fn /data/users/hirsheybar/b/pytorch/torch/distributed/tensor/parallel/style.py 318
[rank0]:DEBUG: torchdynamo start compiling __torch_dispatch__ /data/users/hirsheybar/b/pytorch/torch/distributed/_tensor/api.py:297, stack (elided 5 frames):
```
I haven't been able to create a smaller repro of the problem (even using `_dynamo.disable(recursive=False)`), although in theory, if there is a `torch.*` op that you were to inline (where one of the inputs is a subclass), the next frame would likely be `__torch_dispatch__`. Dynamo always treats `torch.*` operations as not-inlinable though, so in theory we shouldn't ever see dynamo inline `__torch_dispatch__`, but a `_dynamo.disable()` fixes the problem.
I asked Animesh if we can have dynamo automatically apply this behavior to subclasses instead of needing it to be added explicitly. He pointed out that for `disable(recursive=False)`, we can't really do this within dynamo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123347
Approved by: https://github.com/zou3519
ghstack dependencies: #122502, #122751, #123348
Ring attention support for _scaled_dot_product_flash_attention with DTensor.
This assumes the query and key/value are sharded along the sequence length dimension. See the tests for example usage with PT Transformer as well as direct usage with _scaled_dot_product_flash_attention.
## Notable caveats
* Numerical accuracy: The backwards pass doesn't match numerically with the non-chunked version but the forwards pass does. I assume this is due to accumulated errors. I've added a chunked version that uses autograd to verify that the distributed version matches the chunked version.
* nn.Linear has incorrect behavior when running on a sharded tensor of size (bs, heads, seq_len, dim) with `Shard(2)` and does an unnecessary accumulate which requires `Replicate()` on QKV when using `nn.MultiHeadedAttention` to work around the issue.
* If enabled, it forces sequence parallelism and doesn't interop with tensor parallelism.
## SDPA usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
dquery = distribute_tensor(query, device_mesh, [Shard(2)])
dkey = distribute_tensor(key, device_mesh, [Shard(2)])
dvalue = distribute_tensor(value, device_mesh, [Shard(2)])
dout: DTensor = torch.nn.functional.scaled_dot_product_attention(
dquery, dkey, dvalue, is_causal=is_causal
)
out = dout.to_local()
```
## Transformer usage
```py
with attention_context_parallel(), sdpa_kernel(backends=[SDPBackend.FLASH_ATTENTION]):
encoder_layer = nn.TransformerEncoderLayer(
d_model=dim,
nhead=nheads,
dim_feedforward=dim,
batch_first=True,
).to(dtype)
encoder_layer = parallelize_module(
module=encoder_layer,
device_mesh=device_mesh,
parallelize_plan={
"self_attn": ContextParallel(),
},
)
model = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
```
## Test plan
```
pytest test/distributed/_tensor/test_attention.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122460
Approved by: https://github.com/drisspg, https://github.com/wanchaol
as titled, previously we could possibly return the expected input spec
that shared by multiple args, this is not ok since different args might
have different tensor metas, why it was working before is because
redistribute in these cases become a no-op.
This PR fixes it by making each expected input spec to shallow clone the
corresponding input metadata
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122949
Approved by: https://github.com/tianyu-l
ghstack dependencies: #122929
This PR refactors the schema_suggestions in OuputSharding to be a single
OpSchema instead of list of schemas, which in practice we only have one,
for the multiple resharding case we also moved to OpStrategy so there's
no case that needs it to be a list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122929
Approved by: https://github.com/tianyu-l
This PR is enough to fix https://github.com/pytorch/pytorch/issues/118600.
More description of the problem is in the issue, but the high-level problem is similar to the "tangents might be non-contiguous" problem that we handle today, via forcing all tangents to be contiguous. There, the problem was something like:
"We guessed the tangent strides incorrectly, because strides on the runtime tangents were different from strides on the forward outputs, which we used to generate tangents"
Here, the problem is similar:
"We guessed the tangent tensor subclass's metadata incorrectly, because the runtime tangent was a subclass with different metadata than the forward output subclass".
This happened in an internal DTensor issue, where the metadata in question was the `placements` (shard vs. replicate vs. Partial).
One option is to solve this problem via backward guards. This is needed to unblock internal though, so I figured handling this similarly to how we handle non-contiguous tangents would be reasonable. I did this by:
(1) Assert that the metadata on subclass tangents is the same as what we guessed, and if not raise a loud error
(2) In the error message, provide the name of an optional method that the subclass must implement to handle this case:
`def __force_same_metadata__(self, metadata_tensor):`: If the forward output had a `Replicate()` placement, but the runtime tangent had a `Shard(1)` placement, this method allows a subclass to take the tangent and "convert" it to one with a `Replicate()` placement.
`__force_standard_metadata__(self)`: One issue is that there is another placement called `_Partial`, and its semantics are such that DTensor is **unable** to convert a DTensor with some placement type into another DTensor with a `_Partial` placement.
`__force_standard_metadata__` is now called on all (fake) subclass forward outs at trace-time to generate tangents, and gives subclasses a chance to "fix" any outputs with metadata that they cannot convert to later. Morally, this is similar to the fact that we force a `contiguous()` call on all tangents at trace-time.
I'm interested in thoughts/feedback! Two new dunder methods on traceable subclasses is definitely a contentious change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118670
Approved by: https://github.com/ezyang
We would like to improve consistency for nn_module_stack metadata in torch.export.
This PR ensures that all tests in test/export/test_export.py has the following constraints:
- Remove nn_module_stack for all placeholder & output nodes, for all modules and submodules
- Ensure nn_module_stack is present for all other node types for the top-level module (there is still an issue with torch.cond submodules having empty fields)
- Add these checks to _export() in _trace.py (we would add this in the Verifier, but downstream apps construct ExportedPrograms separate from _export(), and metadata may not be maintained there)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120661
Approved by: https://github.com/avikchaudhuri
Enable ASGD foreach optimizer and add DTensor optimizer unit test for ASGD.
Note that we need to investigate why when using ASGD we need higher atol and rtol when comparing model parameters. Listing it as a TODO now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121942
Approved by: https://github.com/wanchaol
This PR adds support for 2D `clip_grad_norm_` (`foreach=True`).
- This PR changes `OpSchema.args_spec` to use pytree if the runtime schema info specifies it.
- This PR includes a unit test for 2D FSDP2 + SP with `clip_grad_norm_` enabled, which serves as a complete numerics test for 2D.
Note: With this PR patched, 2-way SP + 4-way FSDP matches 8-way FSDP numerics on Llama-7B (doubling local batch size for the 2-way SP run).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121945
Approved by: https://github.com/wanchaol
ghstack dependencies: #121747, #121869
This PR rewrite the stack strategy to be more generalized, basically
stack/cat like strategy follow pattern need to be smarter, i.e. it
should be able to identify:
1. PR, PP, RP -> follow PP
2. RR, SR, RS -> follow SS
So this PR refactors how the follow strategy should work, and make sure
we start following the strategy that incurred lowest cost. i.e. for
multiple PR, RP placements, we should be able to further delay the
pending sum reductions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121869
Approved by: https://github.com/awgu
This PR fixed the bug of redistribute to move early return check into the
redistribute autograd function, so that even though we redistribute the
same placement, the grad_placements from the `to_local` call might be
different, the redistribute backward still need to happen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
**Summary**
We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. If not, exception will be thrown in current visualize logic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
async output option was only available in `full_tensor()` call, but I think it's
generally good to make this option available in the `redistribute` call directly
so that user can control it
This PR adds async_op option to redistribute call, to allow user control
whether to perform tensor redistribution asynchronously or not.
By default we set this to False, this is to follow the semantics of the c10d
collectives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121477
Approved by: https://github.com/wz337
This does not introduce a new test but is tested by checking that all the classes we already have still behave as before now that they don't explicitly disable torch_function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
to_local accepts a `grad_placements` if user choose to pass, previously
we enforce the grad_out to be the "same" placement as the current
DTensor for safety.
But I realized that we DO NOT need to enforce this constraint. Why?
backward placement does not need to be the same as fwd tensor placement, this
is already the case for param vs param.grad (i.e. param can be replicate
and grad can be partial), so we should not restrict this to activation
vs activation grad too
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121474
Approved by: https://github.com/awgu, https://github.com/yoyoyocmu, https://github.com/yifuwang
This PR removes the deprecated tp_mesh_dim arg to prepare for release.
As we deprecated this arg for a while (by throwing deprecating
messages), we should remove it before the release
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121432
Approved by: https://github.com/wz337
ghstack dependencies: #121431
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank) which means calling `visualize_sharind` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on rank whose mesh coordinate is (0, 0, ..., 0) instead of whose global rank is 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
**Summary**
our goal is to demonstrate that DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor` and theoretically `DTensor` can replace it with minor change.
This PR serves as a start of this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.
**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
Since we are already checking if the RNG tracker is initialized, there is no real performance difference between erroring vs. just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).
```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
This is a BC breaking change to distribute_module. The underlying rationle
for this change is that sometimes in the input_fn/output_fn, user would want
to access to the current module for some attributes. This might not be
common enough, but in some cases it's worth to access to the module.
An outstanding use case we want to support is float8, if we want to make
float8 works with the TP API, the input_fn/output_fn of TP parallel
styles would need to get access to the module, where the module might
encapsulates `dynamic_linear.emulate` attribute, that is useful for
input/output casting
Since this is needed for fp8 and DTensor still under prototype release,
I feel it's worth the change and it's better we make the change as
early.
Right now making it a soft BC breaking, which means we maintain BC still
but throw deprecation messages.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120895
Approved by: https://github.com/tianyu-l
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
Here are the underlying rationales why we are going through these op replacements:
1. `nn.functional.cross_entropy` is the common method that OSS user is using for things like transformer training, to avoid changing user code, we want user to still use this function for loss calculation if they are already using it.
2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723#119255#118917#119256). They are doing computation with input *replicated* on the class dimension.
3. However when the input of this loss calculation is **sharded on the class dimension**, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to **decompose** these two ops into smaller ops to have collectives run in the middle of these two ops.
4. We explored the existing decompositions (#118950). It seems working, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in a inefficient way, which would trigger an additional expensive collective. Recently some user also reported similar issues https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
This PR adds support for `clip_grad_norm_(foreach=True)` by implementing `aten._foreach_norm.Scalar` and `aten._foreach_mul_.Tensor`. `foreach=True` is required to get competitive performance with `DTensor`.
`foreach=True` reduces CPU overhead for Llama-7B from 388 ms to 63 ms. Existing flat-parameter FSDP's `clip_grad_norm_` takes 3 ms on CPU 😢 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120910
Approved by: https://github.com/wanchaol, https://github.com/janeyx99
ghstack dependencies: #120238
This PR adds `DTensor` support for `aten.linalg_vector_norm.default` and `aten.stack.default` so that we can run `clip_grad_norm_` (with `foreach=False`).
To implement `linalg_vector_norm`, we introduce a `_NormPartial` placement since the reduction op for norm is the norm itself.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120238
Approved by: https://github.com/wanchaol
This PR refactors the tuple strategy handling logic, and allow
TupleStrategy to have both input/output specs for each OpStrategy child,
so that we could further enable operators like foreach norm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120695
Approved by: https://github.com/awgu
We have many tests that use CommDebugMode to verify the occurrence of collectives. These tests do so by querying comm_counts with legacy funcol ops as key. For the purpose of native funcol migration, we need these tests to work for both legacy and native funcol. To avoid the need to modify all tests to accommodate the two implementations, we make CommDebugMode translate native funcol ops into legacy funcol ops until the migration finishes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120070
Approved by: https://github.com/wconstab, https://github.com/wanchaol
ghstack dependencies: #120042, #120043