TP style still have some regression due to negative dim specifications,
fix it by allow DTensor API to handle negative dims and normalize them.
i.e. TP uses `Shard(-1)`, and then try to redistribute `Shard(1) -> Shard(-1)`, this should ideally be no-op but current it runs a decompose sharding phrase and it would turn this transformation to `Shard(1) -> Replicate -> Shard(-1)`, which is wrong and triggers unnecessary allgathers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111750
Approved by: https://github.com/rohan-varma
The compute_local_shape_and_global_offset API does the following:
1) Calculate both local_shape and global_offset in one API to replace two API calls (compute_local_size and compute_local_shape).
2) Generate the correct global_offset for checkpointing purposes. We are currently using compute_local_offset for downstream checkpoint components, which could lead to incorrect results. For checkpointing, we need global_offset instead of local_offset. In some cases, global_offset does not equal to local_offset, when a dimension is sharded multipe times on different mesh dimension (e.g. placements = [Shard(0), Shard(0)]).
Follow-up PRs:
1) Replace related downstream components to use compute_local_shape_and_global_offset instead of compute_local_size and compute_local_offset.
2) Audit existing code base to see if we can remove compute_local_size and compute_local_offset, since they are currently being used.
cc. @wanchaol
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107996
Approved by: https://github.com/wanchaol