I found that returning the copy is actually useful in situations where you might do something like:
```
ret = _copy_state_dict(obj, cache)
ret.update(some_other_values)
```
and would like the structure of `cache` not to change as a result of `ret.update(some_other_values)`. Open to notes here; not returning a copy would force the user to make additional copies for this case.
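To make the aliasing concern concrete, here is a minimal illustration (plain dicts stand in for the real state_dict and the internal cache):
```
# Hypothetical sketch: `cache` stands in for the cache that
# _copy_state_dict maintains across calls.
cache = {"weight": 1.0}

# If the function returned `cache` itself, update() would mutate the cache:
ret = cache
ret.update({"extra": 2.0})
assert "extra" in cache  # the cache's structure silently changed

# Returning a (shallow) copy keeps the cache's structure stable:
cache = {"weight": 1.0}
ret = dict(cache)
ret.update({"extra": 2.0})
assert "extra" not in cache
```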
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123567
Approved by: https://github.com/wz337
**Summary**
This PR extends `_offload_state_dict_to_cpu` to accept a `cpu_offload_state_dict` argument. If `cpu_offload_state_dict` is not None, `_offload_state_dict_to_cpu` uses `copy_` to copy the GPU data into the CPU tensors. This allows users to pass a pinned-memory or shared-memory `cpu_offload_state_dict`.
This PR also adds `_create_cpu_state_dict` to let users easily create a pinned-memory or shared-memory CPU state_dict.
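A minimal sketch of the intended usage, assuming the helpers live in `torch.distributed._state_dict_utils` as in this PR (the model tensors below are illustrative):
```
import torch
from torch.distributed._state_dict_utils import (
    _create_cpu_state_dict,
    _offload_state_dict_to_cpu,
)

# Illustrative GPU state_dict.
state_dict = {f"layer{i}.weight": torch.randn(1024, 1024, device="cuda") for i in range(4)}

# Allocate the pinned-memory CPU buffers once, then reuse them on every offload.
cpu_state_dict = _create_cpu_state_dict(state_dict, pin_memory=True)

# With cpu_offload_state_dict set, the GPU data is copied into the
# preallocated CPU tensors via copy_ instead of allocating new ones.
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict)
```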
**Performance improvement**
```
# The micro-benchmark has a source state_dict with 150 tensors, each 50MB.
# The micro-benchmark is run on an H100 machine with PCIe 5.
cpu_state_dict_2 = _create_cpu_state_dict(state_dict, pin_memory=True)
cpu_state_dict_3 = _create_cpu_state_dict(state_dict, share_memory=True)
# GPU->CPU memory: 4.6556 seconds
cpu_state_dict = _offload_state_dict_to_cpu(state_dict)
# GPU->pin memory: 0.1566 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
# GPU->shared memory: 0.5509 seconds (variance across runs is quite large)
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_3)
# GPU->pin memory->shared memory: 0.2550 seconds
_offload_state_dict_to_cpu(state_dict, cpu_offload_state_dict=cpu_state_dict_2)
_offload_state_dict_to_cpu(cpu_state_dict_2, cpu_offload_state_dict=cpu_state_dict_3)
```
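For intuition on these numbers: pinned (page-locked) host memory is what enables the fast path, since the GPU can DMA directly into it and `copy_(non_blocking=True)` does not have to stage through pageable memory. A single-tensor illustration of that mechanism:
```
import torch

src = torch.randn(50 * 1024 * 1024 // 4, device="cuda")         # one 50MB fp32 tensor
dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)  # page-locked CPU buffer

dst.copy_(src, non_blocking=True)  # asynchronous DMA into pinned memory
torch.cuda.synchronize()           # ensure the copy has landed before reading dst
```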
Differential Revision: [D54045845](https://our.internmc.facebook.com/intern/diff/D54045845/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120378
Approved by: https://github.com/LucasLLC
Summary:
This PR is a follow-up fix for the numerical correctness concerns identified in PR #118197: we should only call `wait()` on `AsyncCollectiveTensor` instances.
Without this change, we occasionally ran into the exception: `AttributeError("'Tensor' object has no attribute 'wait'")`.
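The guard this fix implies looks roughly like the following (a sketch, not the exact diff; `_maybe_wait` is a hypothetical name):
```
import torch
from torch.distributed._functional_collectives import AsyncCollectiveTensor

def _maybe_wait(value: torch.Tensor) -> torch.Tensor:
    # Only AsyncCollectiveTensor carries wait(); calling it on a plain
    # Tensor is what raised the AttributeError above.
    if isinstance(value, AsyncCollectiveTensor):
        return value.wait()
    return value
```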
Test Plan:
**CI**:
Wait for the CI tests to pass.
**Test with prod model**:
- Tested with production models and no longer ran into the exception after checkpoint loading.
Differential Revision: D53680406
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119716
Approved by: https://github.com/fegin, https://github.com/Skylion007, https://github.com/wz337