10 Commits

Author SHA1 Message Date
93e249969b [BE] enable ruff rule RSE and remove useless parentheses in raise statements (#124261)
Remove useless parentheses in `raise` statements if the exception type is raised with no argument.
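
As a small illustration of what the rule targets (a sketch only; the function names below are hypothetical and not taken from the PR): raising an exception class with no arguments behaves the same with or without the trailing parentheses, because Python instantiates the class when it is raised.

```python
def before() -> None:
    # Form flagged by the rule: the empty parentheses are redundant.
    raise NotImplementedError()


def after() -> None:
    # Equivalent form: Python instantiates the class when raising it.
    raise NotImplementedError


def raised_type(fn) -> type:
    """Return the type of the exception raised by fn."""
    try:
        fn()
    except Exception as exc:
        return type(exc)
    raise AssertionError("fn did not raise")
```

Both `before` and `after` raise an instance of the same exception type, so the rewrite is purely stylistic.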

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261
Approved by: https://github.com/albanD
2024-04-17 19:29:34 +00:00
46a25cc0db [DCP] Adds support for non-primatives in async_save by deep copying during cpu offloading (#123941)
Adds support for non-primitives in async_save by deep copying them during CPU offloading.

If users are not type checking, the likely expectation for an async save is that the object is copied.
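
The idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DCP implementation: `offload_copy` and `Schedule` are hypothetical names, and tensors (which the real code path offloads to CPU) are not modeled here.

```python
import copy
from typing import Any


def offload_copy(obj: Any) -> Any:
    """Recursively copy a state-dict-like object (hypothetical helper)."""
    if isinstance(obj, dict):
        return {key: offload_copy(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [offload_copy(value) for value in obj]
    if isinstance(obj, (int, float, str, bool, type(None))):
        return obj  # immutable primitives can be shared safely
    # Assumption: any other object is a non-primitive and gets deep copied.
    return copy.deepcopy(obj)


class Schedule:
    """Hypothetical non-primitive value stored in a state dict."""

    def __init__(self) -> None:
        self.milestones = [10, 20]


state = {"step": 3, "schedule": Schedule()}
snapshot = offload_copy(state)
# Mutating the original after the copy does not affect the snapshot.
state["schedule"].milestones.append(30)
```

Because the snapshot owns its own copy of the non-primitive, training code can keep mutating the original object while the save proceeds in the background.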

Differential Revision: [D56065237](https://our.internmc.facebook.com/intern/diff/D56065237/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123941
Approved by: https://github.com/fegin
2024-04-16 20:49:25 +00:00
d838cc8f66 [DCP] Returns a copy of sd in copy sd (#123567)
I found that returning the copy is actually useful in situations where you might do something like:

```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```

and would like `cache` not to change structure from `ret.update(some_other_values)`. Open to notes here; not returning a copy might force the user to make additional copies in this case.
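
A minimal sketch of the behavior described above, assuming only the top-level structure is copied. `copy_into_cache` is a hypothetical stand-in, not the real `_copy_state_dict`, whose signature and copy depth are not shown in this message.

```python
from typing import Any, Dict


def copy_into_cache(src: Dict[str, Any], cache: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in: fill cache from src and return a separate dict."""
    for key, value in src.items():
        cache[key] = value
    # Return a new top-level dict so callers can add keys without touching cache.
    return dict(cache)


obj = {"weight": [1.0, 2.0]}
cache: Dict[str, Any] = {}
ret = copy_into_cache(obj, cache)
ret.update({"extra": 1})  # changes the structure of ret only
```

After the `update`, `ret` has the extra key while `cache` keeps its original set of keys.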

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
2024-04-16 15:29:32 +00:00
620aaaf0cb [DCP] Adds ability to create a CPU state dict that is both shared and pinned (#122338)
Adds the ability to create a CPU state dict that is both shared and pinned, as well as a new utility specific to copying the state dict.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1ge8d5c17670f16ac4fc8fcb4181cb490c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122338
Approved by: https://github.com/fegin
2024-04-03 20:05:01 +00:00
0811f15270 [DCP][state_dict] Let _offload_state_dict_to_cpu to return the companion_obj if it exist. (#121273)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121273
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-03-08 00:24:29 +00:00
5abf7972d1 [DCP][state_dict] Implement pin_memory and shared_memory copy for _offload_state_dict_to_cpu (#120378)
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not `None`, `_offload_state_dict_to_cpu` uses `copy_` to copy the GPU data into the CPU tensors. This allows users to pass a pin_memory or share_memory version of `cpu_offload_state_dict`.

This PR also adds `_create_cpu_state_dict` to allow users to easily create a pin_memory or share_memory cpu state_dict.

**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, and each tensor is 50MB.
# The micro-benchmark is run on an H100 machine with PCIe 5.

cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)

# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)

# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)

# GPU->shared memory: 0.5509 seconds (variation is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)

# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```

Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
2024-03-05 17:48:15 +00:00
2bda6b4cb8 [DTensor] Only wait on AsyncCollectiveTensor after DTensor-based state dict loading (#119716)
Summary:
This PR is a follow-up fix to address the numerical correctness concerns identified in PR #118197. We should only call `wait()` on an `AsyncCollectiveTensor`.

Without the change, we occasionally ran into the exception `AttributeError("'Tensor' object has no attribute 'wait'")`.
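
The guard can be sketched with stand-in classes. This is an illustration under stated assumptions rather than the actual patch: `PlainTensor`, `AsyncTensor`, and `maybe_wait` are hypothetical names standing in for a regular tensor, `AsyncCollectiveTensor`, and the call site.

```python
class PlainTensor:
    """Stand-in for a regular tensor, which has no wait() method."""


class AsyncTensor(PlainTensor):
    """Stand-in for AsyncCollectiveTensor."""

    def __init__(self) -> None:
        self.waited = False

    def wait(self) -> "AsyncTensor":
        # Mark the pending collective as complete and return the result.
        self.waited = True
        return self


def maybe_wait(value: PlainTensor) -> PlainTensor:
    # Only asynchronous results expose wait(); plain tensors pass through.
    if isinstance(value, AsyncTensor):
        return value.wait()
    return value
```

Calling `wait()` unconditionally would fail on `PlainTensor` with the same kind of `AttributeError` quoted above; the `isinstance` check avoids that.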

Test Plan:
**CI**:
Wait for the CI test

**Test with prod model**:
- Tested with models and no longer ran into the exception after checkpoint loading.

Differential Revision: D53680406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337
2024-02-13 04:30:45 +00:00
f9971daaee Fix divergence between internal + external (#118509)
D53049807 and https://github.com/pytorch/pytorch/pull/118197 got out of sync somehow

Fixing externally since I'm pretty sure the internal version is correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118509
Approved by: https://github.com/malfet
2024-01-29 14:53:50 +00:00
4f78869c18 [state_dict] Calls wait() for the DTensor to_local() result (#118197)
See the discussion in https://github.com/pytorch/pytorch/pull/117799.

There are some issues when returning an AsyncCollectiveTensor (the root causes
have not been found yet), including OOM and unexpected values.

This PR forces `_gather_state_dict()` to be synchronous with respect to the main stream.

Differential Revision: [D53049807](https://our.internmc.facebook.com/intern/diff/D53049807/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118197
Approved by: https://github.com/wz337, https://github.com/LucasLLC
2024-01-25 17:14:08 +00:00
cc28f61fa3 [DCP][BE] Move DCP._state_dict_utils out from DCP (#115523)
DCP._state_dict_utils is also used by FSDP, which can sometimes cause a circular import. Move it out of DCP to avoid this.

Differential Revision: [D52022440](https://our.internmc.facebook.com/intern/diff/D52022440/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115523
Approved by: https://github.com/wz337
2023-12-13 08:59:48 +00:00