Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 21:14:14 +08:00)
[dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:

This implements staging in a way that does not break checkpointing semantics. We want to stay close to torch.save/load semantics; when async checkpointing is used today, it breaks shared storages and does not handle custom objects or tensors well. For example: a user passes a state_dict containing a CUDA tensor. That tensor is deep-cloned, so the staging tensor is created on GPU, which can cause OOMs and is hard to debug.

This diff hooks into the deepcopy of storages to move them to CPU, using the cached storages created for async checkpoint staging. This allows reusing storages created for staging, avoiding re-creating them on each checkpoint, while remaining flexible enough to handle changes - cleaning up old storages or creating new ones as needed. The lifetime of a staging storage is tied to the original storage object: when the original storage is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the data_ptr of the storage to keep track of this. Please share thoughts on this. The alternative is to use FQNs instead of a storage id and verify that the underlying storage object has the same shape/size, etc. to make the caching logic work. The current implementation is much simpler and cleaner.

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
with staging_context(stager):
    cpu_state_dict = copy.deepcopy(state_dict)
```

Also adds support for pinned memory.

One problem this implementation does not address is that we lose the original device. The only alternatives here are:
- Pickle synchronously like torch.save, but with special handling for storages. It is valuable to keep the state_dict throughout the checkpointing process so users can manipulate and debug as needed, so we would need to unpickle in the background process. I think this is flexible, but it is not performant, not very different from the current solution, and needs more code.
- If we really want to address this, stick the original device in a variable on the storage and use it to recover the device on the load side.

I think we do not need this for now and can be explicit about losing the device type for async checkpointing.

Update:

Note: Due to reservations about hooking into deepcopy to customize it, the PR is now updated to use deepcopy-like logic to clone the state_dict. There are some caveats to this solution:

1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types such as NamedTuples, frozen dataclasses, and nested dataclasses; the deepcopy logic relies on `__reduce_ex__` to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like `_clear_non_serializable_cached_data()` and other logic. Would like thoughts on which logic, or whether everything, should be copied.
3. If any object implements `__deepcopy__`, we will not be able to handle tensors in its attributes with this logic, because such objects will likely just call copy.deepcopy on their attributes instead of this deepcopy logic. We take care of subclasses of torch.Tensor to work around this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = stager.stage(state_dict)
```

Test Plan: unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
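For concreteness, here is a minimal sketch of how the new stager could be driven from a training loop, based on the API above and the behavior exercised by the unit tests in this PR. The model, checkpoint paths, and the background `save_to_disk` handoff are hypothetical placeholders and not part of this change; only the `StateDictStager` calls come from this PR, and a CUDA-capable runtime is assumed.
```
import threading

import torch
from torch.distributed.checkpoint._state_dict_stager import StateDictStager

# Hypothetical model; any state_dict with CUDA tensors works the same way.
model = torch.nn.Linear(8, 8).cuda()


def save_to_disk(cpu_state_dict, path):
    # Placeholder for whatever the async checkpoint writer does with the
    # already-staged, CPU-resident state_dict.
    torch.save(cpu_state_dict, path)


# Construct a stager once per job; pin_memory/share_memory mirror the flags
# covered by test_tensor_pinned_and_shared.
stager = StateDictStager(pin_memory=True, share_memory=True)

threads = []
for step in range(3):
    state_dict = model.state_dict()

    # stage() deep-copies the state_dict onto CPU, preserving storage sharing
    # between views and reusing the cached CPU storages on every call.
    cpu_state_dict = stager.stage(state_dict)

    # The staged copy is independent of the GPU tensors, so serialization can
    # happen off the training thread.
    t = threading.Thread(target=save_to_disk, args=(cpu_state_dict, f"ckpt_{step}.pt"))
    t.start()
    threads.append(t)

for t in threads:
    t.join()
```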
committed by PyTorch MergeBot
parent d4ad280429
commit 19ffdf4ea0
test/distributed/checkpoint/test_state_dict_stager.py (new file, 821 lines)
@@ -0,0 +1,821 @@
# Owner(s): ["oncall: distributed"]

import dataclasses

import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor
from torch.distributed._tensor.placement_types import Shard
from torch.distributed.checkpoint._state_dict_stager import StateDictStager
from torch.testing._internal.common_distributed import requires_nccl, skip_if_lt_x_gpu
from torch.testing._internal.common_utils import requires_cuda, run_tests, TestCase
from torch.testing._internal.distributed._tensor.common_dtensor import (
    DTensorTestBase,
    with_comms,
)


def create_cpu_state_dict(state_dict):
    cpu_state_dict = {}
    for key, value in state_dict.items():
        cpu_state_dict[key] = value.cpu()
    return cpu_state_dict


def compare_state_dicts(cuda_state_dict, cpu_state_dict, rtol=1e-5, atol=1e-8):
    """
    Compare if two state dictionaries (one on CUDA, one on CPU) are otherwise the same.

    This function checks if the tensors in both state dictionaries have the same values,
    shapes, dtypes, etc., ignoring the device difference. It also checks if tensors that
    share storage in one state dict also share storage in the other.

    Args:
        cuda_state_dict: The state dictionary with tensors on CUDA
        cpu_state_dict: The state dictionary with tensors on CPU
        rtol: Relative tolerance for comparing tensor values
        atol: Absolute tolerance for comparing tensor values

    Returns:
        bool: True if the state dictionaries are equivalent, False otherwise
        str: Error message if the state dictionaries are not equivalent, empty string otherwise
    """
    # Track storage data pointers to check storage sharing
    cuda_storage_ptrs = {}
    cpu_storage_ptrs = {}

    def compare_objects(cuda_obj, cpu_obj, path=""):
        # If objects are tensors, compare them
        if isinstance(cuda_obj, torch.Tensor) and isinstance(cpu_obj, torch.Tensor):
            # Check if devices are as expected
            if cuda_obj.device.type != "cuda":
                return (
                    False,
                    f"Expected CUDA tensor, got {cuda_obj.device.type} tensor at {path}",
                )
            if cpu_obj.device.type != "cpu":
                return (
                    False,
                    f"Expected CPU tensor, got {cpu_obj.device.type} tensor at {path}",
                )
            if cuda_obj.storage_offset() != cpu_obj.storage_offset():
                return (
                    False,
                    f"Storage offset mismatch at {path}: {cuda_obj.storage_offset()} vs {cpu_obj.storage_offset()}",
                )

            if not torch.equal(cuda_obj.cpu(), cpu_obj):
                return (
                    False,
                    f"Tensors are not same at {path}",
                )

            # Track storage sharing
            cuda_storage_ptr = cuda_obj.storage().data_ptr()
            cpu_storage_ptr = cpu_obj.storage().data_ptr()

            if cuda_storage_ptr in cuda_storage_ptrs:
                # This CUDA tensor shares storage with another tensor
                # Check if the corresponding CPU tensors also share storage
                if cpu_storage_ptr != cuda_storage_ptrs[cuda_storage_ptr]:
                    return (
                        False,
                        f"Storage sharing mismatch: CUDA tensors share storage but CPU tensors don't at {path}",
                    )
            else:
                # First time seeing this storage
                cuda_storage_ptrs[cuda_storage_ptr] = cpu_storage_ptr
                cpu_storage_ptrs[cpu_storage_ptr] = cuda_storage_ptr

            return True, ""

        # If objects are dictionaries, compare them recursively
        elif isinstance(cuda_obj, dict) and isinstance(cpu_obj, dict):
            if cuda_obj.keys() != cpu_obj.keys():
                return (
                    False,
                    f"Dictionary keys mismatch at {path}: {cuda_obj.keys()} vs {cpu_obj.keys()}",
                )

            for key in cuda_obj:
                result, error = compare_objects(
                    cuda_obj[key], cpu_obj[key], f"{path}.{key}" if path else key
                )
                if not result:
                    return False, error

            return True, ""

        # If objects are lists, tuples, or sets, compare them recursively
        elif isinstance(cuda_obj, (list, tuple, set)) and isinstance(
            cpu_obj, (list, tuple, set)
        ):
            if len(cuda_obj) != len(cpu_obj):
                return (
                    False,
                    f"Collection length mismatch at {path}: {len(cuda_obj)} vs {len(cpu_obj)}",
                )
            if type(cuda_obj) != type(cpu_obj):
                return (
                    False,
                    f"Collection type mismatch at {path}: {type(cuda_obj)} vs {type(cpu_obj)}",
                )

            for i, (cuda_item, cpu_item) in enumerate(zip(cuda_obj, cpu_obj)):
                result, error = compare_objects(cuda_item, cpu_item, f"{path}[{i}]")
                if not result:
                    return False, error

            return True, ""

        # If objects are custom classes, compare their attributes
        elif hasattr(cuda_obj, "__dict__") and hasattr(cpu_obj, "__dict__"):
            if type(cuda_obj) != type(cpu_obj):
                return (
                    False,
                    f"Object type mismatch at {path}: {type(cuda_obj)} vs {type(cpu_obj)}",
                )

            result, error = compare_objects(
                cuda_obj.__dict__, cpu_obj.__dict__, f"{path}.__dict__"
            )
            if not result:
                return False, error

            return True, ""

        # For other types, use direct equality comparison
        else:
            if type(cuda_obj) != type(cpu_obj):
                return (
                    False,
                    f"Type mismatch at {path}: {type(cuda_obj)} vs {type(cpu_obj)}",
                )
            if cuda_obj != cpu_obj:
                return False, f"Value mismatch at {path}: {cuda_obj} vs {cpu_obj}"

            return True, ""

    # Start the recursive comparison
    result, error = compare_objects(cuda_state_dict, cpu_state_dict)
    return result, error


@dataclasses.dataclass
class TestStruct:
    tensor1: torch.Tensor


@dataclasses.dataclass
class NestedTensorStruct:
    tensor: torch.Tensor
    value: int = 42


@dataclasses.dataclass
class ComplexDataClass:
    tensor: torch.Tensor
    name: str
    values: list[float]
    nested: NestedTensorStruct


@dataclasses.dataclass(frozen=True)
class FrozenDataClass:
    tensor: torch.Tensor
    value: int = 100


class TestStateDictStager(TestCase):
    @requires_cuda
    def test_views(self):
        test_configs = [
            (False, False),  # pin_memory=False, share_memory=False,
            (True, False),  # pin_memory=True, share_memory=False
            (False, True),  # pin_memory=False, share_memory=True
            (True, True),  # pin_memory=True, share_memory=True
        ]
        for pin_memory, share_memory in test_configs:
            with self.subTest(pin_memory=pin_memory, share_memory=share_memory):
                tensor1 = torch.randn(4, 4).cuda()
                tensor2 = tensor1.view(16)
                tensor3 = torch.randn(4, 4).cuda()
                state_dict = {
                    "tensor1": tensor1,
                    "tensor2": tensor2,
                    "recursive": {
                        "tensor3": tensor3,
                        "type": TestStruct(tensor1=tensor3.narrow(0, 0, 2)),
                    },
                }
                assert (
                    state_dict["tensor1"].storage().data_ptr()
                    == state_dict["tensor2"].storage().data_ptr()
                )

                stager = StateDictStager(
                    pin_memory=pin_memory, share_memory=share_memory
                )

                cpu_state_dict = stager.stage(state_dict)

                # Calculate stats
                num_storages = len(stager._cached_storage_mapping)
                num_bytes = sum(
                    storage.nbytes()
                    for storage in stager._cached_storage_mapping.values()
                )

                # Validate tensor count and bytes
                expected_storage_cnt = 2
                assert (
                    num_storages == expected_storage_cnt
                ), f"Expected {expected_storage_cnt} storages, got {num_storages}"

                # Calculate expected bytes
                # Note: Only unique storages are counted in the byte count
                expected_bytes = (
                    tensor1.numel() * tensor1.element_size()
                    + tensor3.numel()  # tensor1 and tensor2 share storage
                    * tensor3.element_size()  # tensor3 and its narrow view share storage
                )
                assert (
                    num_bytes == expected_bytes
                ), f"Expected {expected_bytes} bytes, got {num_bytes}"
                # Verify that the CPU state dict is equivalent to the original CUDA state dict
                result, error = compare_state_dicts(state_dict, cpu_state_dict)
                assert result, f"State dicts are not equivalent: {error}"

                # Additional checks for storage sharing
                assert cpu_state_dict["tensor1"].device == torch.device("cpu")
                assert cpu_state_dict["tensor2"].device == torch.device("cpu")
                assert (
                    cpu_state_dict["tensor1"].storage().data_ptr()
                    == cpu_state_dict["tensor2"].storage().data_ptr()
                )

                recursive = cpu_state_dict["recursive"]
                assert recursive["tensor3"].device == torch.device("cpu")
                assert recursive["type"].tensor1.device == torch.device("cpu")
                assert (
                    recursive["tensor3"].storage().data_ptr()
                    == recursive["type"].tensor1.storage().data_ptr()
                )

    @requires_cuda
    def test_caching(self):
        """
        Test that the StateDictStager correctly caches and reuses storages.
        """
        test_configs = [
            (False, False),  # pin_memory=False, share_memory=False,
            (True, False),  # pin_memory=True, share_memory=False
            (False, True),  # pin_memory=False, share_memory=True
            (True, True),  # pin_memory=True, share_memory=True
        ]
        for pin_memory, share_memory in test_configs:
            with self.subTest(pin_memory=pin_memory, share_memory=share_memory):
                # Create test tensors and state dict
                tensor1 = torch.randn(4, 4).cuda()
                tensor2 = tensor1.view(16)
                tensor3 = torch.randn(4, 4).cuda()
                state_dict = {
                    "tensor1": tensor1,
                    "tensor2": tensor2,
                    "recursive": {
                        "tensor3": tensor3,
                        "type": TestStruct(tensor1=tensor3.narrow(0, 0, 2)),
                    },
                }

                # Create a StateDictStager instance
                stager = StateDictStager(
                    pin_memory=pin_memory, share_memory=share_memory
                )

                # First call to stage with staging context
                cpu_state_dict1 = stager.stage(state_dict)

                # Get the number of cached storages after first stage
                num_storages1 = len(stager._cached_storage_mapping)

                # Verify the first result is correct
                result, error = compare_state_dicts(state_dict, cpu_state_dict1)
                assert (
                    result
                ), f"First state dict is not equivalent to original: {error}"

                # Modify the original tensors
                tensor1.fill_(0)
                tensor3.fill_(0)

                # Second call to stage with staging context
                cpu_state_dict2 = stager.stage(state_dict)

                # Get the number of cached storages after second stage
                num_storages2 = len(stager._cached_storage_mapping)

                # Verify that the second CPU state dict is equivalent to the modified original state dict
                result, error = compare_state_dicts(state_dict, cpu_state_dict2)
                assert (
                    result
                ), f"Second state dict is not equivalent to modified original: {error}"

                # Verify that the number of cached storages hasn't changed
                assert (
                    num_storages1 == num_storages2
                ), f"Storage count changed: {num_storages1} vs {num_storages2}"

                # Verify that the tensors in the second state dict have the same storage pointers as the first
                assert (
                    cpu_state_dict1["tensor1"].storage().data_ptr()
                    == cpu_state_dict2["tensor1"].storage().data_ptr()
                ), "Storage pointers should match for tensor1"
                assert (
                    cpu_state_dict1["tensor2"].storage().data_ptr()
                    == cpu_state_dict2["tensor2"].storage().data_ptr()
                ), "Storage pointers should match for tensor2"
                assert (
                    cpu_state_dict1["recursive"]["tensor3"].storage().data_ptr()
                    == cpu_state_dict2["recursive"]["tensor3"].storage().data_ptr()
                ), "Storage pointers should match for tensor3"

                # Modify the original tensors again with different values
                tensor1.fill_(42.0)

                # Third call to stage with staging context
                cpu_state_dict3 = stager.stage(state_dict)

                # Verify that the third CPU state dict reflects the updated values
                assert torch.all(
                    cpu_state_dict3["tensor1"] == 42.0
                ), "Updated values should be reflected in the cached state dict"
                assert torch.all(
                    cpu_state_dict3["tensor2"] == 42.0
                ), "Updated values should be reflected in the cached state dict"

    @requires_cuda
    def test_tensor_attrs(self):
        """
        Test that tensor attributes are preserved during stage with StateDictStager.
        """
        tensor1 = torch.randn(4, 4).cuda()
        tensor2 = tensor1.view(16)
        tensor3 = torch.randn(4, 4).cuda()

        # Add custom attributes to tensors
        tensor1.a = 42
        tensor1.b = 43
        tensor3.c = 44

        state_dict = {
            "tensor1": tensor1,
            "tensor2": tensor2,
            "recursive": {
                "tensor3": tensor3,
                "type": TestStruct(tensor1=tensor3.narrow(0, 0, 2)),
            },
        }

        stager = StateDictStager(pin_memory=True, share_memory=True)
        cpu_state_dict = stager.stage(state_dict)

        # Verify that tensor attributes are preserved
        assert hasattr(
            cpu_state_dict["tensor1"], "a"
        ), "Tensor attribute 'a' was not preserved"
        assert (
            cpu_state_dict["tensor1"].a == 42
        ), "Tensor attribute 'a' has incorrect value"
        assert hasattr(
            cpu_state_dict["tensor1"], "b"
        ), "Tensor attribute 'b' was not preserved"
        assert (
            cpu_state_dict["tensor1"].b == 43
        ), "Tensor attribute 'b' has incorrect value"
        assert hasattr(
            cpu_state_dict["recursive"]["tensor3"], "c"
        ), "Tensor attribute 'c' was not preserved"
        assert (
            cpu_state_dict["recursive"]["tensor3"].c == 44
        ), "Tensor attribute 'c' has incorrect value"

    @requires_cuda
    def test_different_dtypes(self):
        """
        Test that StateDictStager works correctly with tensors of different data types.
        """
        # Create tensors with different dtypes
        tensors = {
            "float32": torch.randn(4, 4, dtype=torch.float32).cuda(),
            "float64": torch.randn(4, 4, dtype=torch.float64).cuda(),
            "int32": torch.randint(-100, 100, (4, 4), dtype=torch.int32).cuda(),
            "int64": torch.randint(-100, 100, (4, 4), dtype=torch.int64).cuda(),
            "bool": torch.randint(0, 2, (4, 4), dtype=torch.bool).cuda(),
        }

        # Create a state dict with these tensors
        state_dict = tensors.copy()

        stager = StateDictStager()
        cpu_state_dict = stager.stage(state_dict)

        # Verify that all tensors have been correctly copied to CPU with the right dtypes
        for dtype_name, original_tensor in tensors.items():
            cpu_tensor = cpu_state_dict[dtype_name]
            self.assertEqual(
                cpu_tensor.device.type, "cpu", f"Tensor {dtype_name} should be on CPU"
            )
            self.assertEqual(
                cpu_tensor.dtype,
                original_tensor.dtype,
                f"Tensor {dtype_name} has incorrect dtype",
            )
            self.assertTrue(
                torch.allclose(cpu_tensor, original_tensor.cpu()),
                f"Tensor {dtype_name} has incorrect values",
            )

    @requires_cuda
    def test_empty_tensors(self):
        """
        Test that StateDictStager works correctly with empty tensors.
        """
        test_configs = [
            (False, False),  # pin_memory=False, share_memory=False,
            (True, False),  # pin_memory=True, share_memory=False
            (False, True),  # pin_memory=False, share_memory=True
            (True, True),  # pin_memory=True, share_memory=True
        ]
        for pin_memory, share_memory in test_configs:
            with self.subTest(pin_memory=pin_memory, share_memory=share_memory):
                # Create empty tensors with different shapes
                tensors = {
                    "empty_0d": torch.tensor([], dtype=torch.float32).cuda(),
                    "empty_1d": torch.tensor([], dtype=torch.float32).reshape(0).cuda(),
                    "empty_2d": torch.tensor([], dtype=torch.float32)
                    .reshape(0, 0)
                    .cuda(),
                    "empty_3d": torch.tensor([], dtype=torch.float32)
                    .reshape(0, 0, 0)
                    .cuda(),
                    "zero_dim": torch.tensor(0.0).cuda(),  # scalar tensor
                }

                # Create a state dict with these tensors
                state_dict = tensors.copy()

                cpu_state_dict = StateDictStager(pin_memory, share_memory).stage(
                    state_dict
                )

                # Verify that all tensors have been correctly copied to CPU
                for tensor_name, original_tensor in tensors.items():
                    cpu_tensor = cpu_state_dict[tensor_name]

                    self.assertEqual(
                        cpu_tensor.device.type,
                        "cpu",
                        f"Tensor {tensor_name} should be on CPU",
                    )
                    self.assertEqual(
                        cpu_tensor.shape,
                        original_tensor.shape,
                        f"Tensor {tensor_name} has incorrect shape",
                    )
                    self.assertEqual(
                        cpu_tensor.dtype,
                        original_tensor.dtype,
                        f"Tensor {tensor_name} has incorrect dtype",
                    )

    @requires_cuda
    def test_complex_storage_sharing(self):
        """
        Test that StateDictStager correctly handles complex storage sharing scenarios.
        """
        # Create a base tensor
        base_tensor = torch.randn(10, 10).cuda()

        # Create various views and slices that share storage
        view1 = base_tensor.view(100)
        view2 = base_tensor.view(10, 10)
        slice1 = base_tensor[2:8, 2:8]
        slice2 = base_tensor[:, :5]
        slice3 = view1[10:60]

        # Create a state dict with these tensors
        state_dict = {
            "base": base_tensor,
            "view1": view1,
            "view2": view2,
            "slice1": slice1,
            "slice2": slice2,
            "slice3": slice3,
        }
        cpu_state_dict = StateDictStager().stage(state_dict)

        # Verify that all tensors have been correctly copied to CPU
        result, error = compare_state_dicts(state_dict, cpu_state_dict)
        self.assertTrue(result, f"State dicts are not equivalent: {error}")

        # Verify storage sharing is preserved
        # All these tensors should share the same storage
        storage_ptr = cpu_state_dict["base"].storage().data_ptr()
        self.assertEqual(
            cpu_state_dict["view1"].storage().data_ptr(),
            storage_ptr,
            "view1 should share storage with base",
        )
        self.assertEqual(
            cpu_state_dict["view2"].storage().data_ptr(),
            storage_ptr,
            "view2 should share storage with base",
        )
        self.assertEqual(
            cpu_state_dict["slice1"].storage().data_ptr(),
            storage_ptr,
            "slice1 should share storage with base",
        )
        self.assertEqual(
            cpu_state_dict["slice2"].storage().data_ptr(),
            storage_ptr,
            "slice2 should share storage with base",
        )
        self.assertEqual(
            cpu_state_dict["slice3"].storage().data_ptr(),
            storage_ptr,
            "slice3 should share storage with base",
        )

        # Verify that modifying the base tensor affects all views and slices
        cpu_state_dict["base"].fill_(42.0)
        self.assertTrue(
            torch.all(cpu_state_dict["view1"] == 42.0),
            "view1 should reflect changes to base",
        )
        self.assertTrue(
            torch.all(cpu_state_dict["view2"] == 42.0),
            "view2 should reflect changes to base",
        )
        self.assertTrue(
            torch.all(cpu_state_dict["slice1"] == 42.0),
            "slice1 should reflect changes to base",
        )
        self.assertTrue(
            torch.all(cpu_state_dict["slice2"] == 42.0),
            "slice2 should reflect changes to base",
        )
        self.assertTrue(
            torch.all(cpu_state_dict["slice3"] == 42.0),
            "slice3 should reflect changes to base",
        )

    @requires_cuda
    def test_dataclasses(self):
        # Create tensors
        tensor1 = torch.randn(4, 4).cuda()
        tensor2 = torch.randn(8, 8).cuda()
        tensor3 = torch.randn(2, 6).cuda()
        tensor4 = torch.randn(3, 5).cuda()

        # Create dataclass instances
        nested = NestedTensorStruct(tensor=tensor3)
        complex_dc = ComplexDataClass(
            tensor=tensor1, name="test", values=[1.0, 2.0, 3.0], nested=nested
        )
        frozen_dc = FrozenDataClass(tensor=tensor4)

        # Create a state dict with these dataclasses
        state_dict = {
            "regular_tensor": tensor2,
            "complex_dataclass": complex_dc,
            "frozen_dataclass": frozen_dc,
        }

        # Stage the state dict
        stager = StateDictStager(pin_memory=False, share_memory=False)
        cpu_state_dict = stager.stage(state_dict)

        # Verify regular tensor
        self.assertEqual(cpu_state_dict["regular_tensor"].device.type, "cpu")
        self.assertTrue(torch.allclose(cpu_state_dict["regular_tensor"], tensor2.cpu()))

        # Verify complex dataclass
        complex_cpu = cpu_state_dict["complex_dataclass"]
        self.assertEqual(complex_cpu.name, "test")
        self.assertEqual(complex_cpu.values, [1.0, 2.0, 3.0])
        self.assertEqual(complex_cpu.tensor.device.type, "cpu")
        self.assertTrue(torch.allclose(complex_cpu.tensor, tensor1.cpu()))

        # Verify nested dataclass inside complex dataclass
        nested_cpu = complex_cpu.nested
        self.assertEqual(nested_cpu.value, 42)
        self.assertEqual(nested_cpu.tensor.device.type, "cpu")
        self.assertTrue(torch.allclose(nested_cpu.tensor, tensor3.cpu()))

        # Verify frozen dataclass
        frozen_cpu = cpu_state_dict["frozen_dataclass"]
        self.assertEqual(frozen_cpu.value, 100)
        self.assertEqual(frozen_cpu.tensor.device.type, "cpu")
        self.assertTrue(torch.allclose(frozen_cpu.tensor, tensor4.cpu()))

        # Verify that modifying the original tensors doesn't affect the staged ones
        tensor1.fill_(99.0)
        tensor3.fill_(88.0)
        tensor4.fill_(77.0)

        self.assertFalse(torch.allclose(complex_cpu.tensor, tensor1.cpu()))
        self.assertFalse(torch.allclose(nested_cpu.tensor, tensor3.cpu()))
        self.assertFalse(torch.allclose(frozen_cpu.tensor, tensor4.cpu()))

    def test_cpu_storage_independence(self):
        """
        Test ensures CPU tensors passed to StateDictStager are actually cloned
        """
        # Create test tensors
        tensor1 = torch.randn(4, 4)
        tensor2 = torch.randn(8, 8)

        # Create a state dict with these tensors
        state_dict = {
            "tensor1": tensor1,
            "tensor2": tensor2,
        }

        cpu_state_dict = StateDictStager().stage(state_dict)
        cpu_tensor1 = cpu_state_dict["tensor1"]
        cpu_tensor2 = cpu_state_dict["tensor2"]

        # Verify that the CPU tensors have different storage pointers than the original tensors
        self.assertNotEqual(
            tensor1.storage().data_ptr(),
            cpu_tensor1.storage().data_ptr(),
            "CPU tensor should have a different storage pointer than the original tensor",
        )
        self.assertNotEqual(
            tensor2.storage().data_ptr(),
            cpu_tensor2.storage().data_ptr(),
            "CPU tensor should have a different storage pointer than the original tensor",
        )

        self.assertTrue(
            torch.allclose(tensor1, cpu_tensor1),
            "CPU tensor should have the same values as the original tensor",
        )
        self.assertTrue(
            torch.allclose(tensor2, cpu_tensor2),
            "CPU tensor should have the same values as the original tensor",
        )

        # Modify the original CPU tensors and validate staged tensors are not modified
        cloned_orginial1 = tensor1.clone()
        cloned_orginia2 = tensor2.clone()
        tensor1.fill_(99.0)
        tensor2.fill_(88.0)

        self.assertFalse(torch.allclose(cloned_orginial1, tensor1))
        self.assertTrue(
            torch.allclose(cloned_orginial1, cpu_tensor1),
            "CPU tensor should have the same values as the original tensor",
        )
        self.assertTrue(
            torch.allclose(cloned_orginia2, cpu_tensor2),
            "CPU tensor should have the same values as the original tensor",
        )

    @requires_cuda
    def test_tensor_pinned_and_shared(self):
        """
        Test that verifies tensors are actually pinned and shared using tensor.is_pinned() and tensor.is_shared() methods.
        """
        # Create test tensors
        tensor1 = torch.randn(4, 4).cuda()
        tensor2 = torch.randn(8, 8).cuda()

        # Create a state dict with these tensors
        state_dict = {
            "tensor1": tensor1,
            "tensor2": tensor2,
        }

        # Test all combinations of pin_memory and share_memory
        test_configs = [
            (False, False),  # pin_memory=False, share_memory=False
            (True, False),  # pin_memory=True, share_memory=False
            (False, True),  # pin_memory=False, share_memory=True
            (True, True),  # pin_memory=True, share_memory=True
        ]

        for pin_memory, share_memory in test_configs:
            with self.subTest(pin_memory=pin_memory, share_memory=share_memory):
                # Create stager with specific configuration
                stager = StateDictStager(
                    pin_memory=pin_memory, share_memory=share_memory
                )
                cpu_state_dict = stager.stage(state_dict)

                # Get the staged tensors
                cpu_tensor1 = cpu_state_dict["tensor1"]
                cpu_tensor2 = cpu_state_dict["tensor2"]

                # Verify tensor device
                self.assertEqual(
                    cpu_tensor1.device.type, "cpu", "Staged tensor should be on CPU"
                )
                self.assertEqual(
                    cpu_tensor2.device.type, "cpu", "Staged tensor should be on CPU"
                )

                # Verify tensor values
                self.assertTrue(
                    torch.allclose(cpu_tensor1, tensor1.cpu()),
                    "CPU tensor should have the same values as the original tensor",
                )
                self.assertTrue(
                    torch.allclose(cpu_tensor2, tensor2.cpu()),
                    "CPU tensor should have the same values as the original tensor",
                )

                # Verify pinned memory status
                self.assertEqual(
                    cpu_tensor1.is_pinned(),
                    pin_memory,
                    f"Tensor pinned status should be {pin_memory}",
                )
                self.assertEqual(
                    cpu_tensor2.is_pinned(),
                    pin_memory,
                    f"Tensor pinned status should be {pin_memory}",
                )

                # Verify shared memory status
                self.assertEqual(
                    cpu_tensor1.is_shared(),
                    share_memory,
                    f"Tensor shared status should be {share_memory}",
                )
                self.assertEqual(
                    cpu_tensor2.is_shared(),
                    share_memory,
                    f"Tensor shared status should be {share_memory}",
                )

                # Verify storage sharing is consistent with tensor sharing
                if share_memory:
                    # When share_memory is True, the storage should also be shared
                    self.assertTrue(
                        cpu_tensor1.storage().is_shared(),
                        "When share_memory=True, tensor storage should be shared",
                    )
                    self.assertTrue(
                        cpu_tensor2.storage().is_shared(),
                        "When share_memory=True, tensor storage should be shared",
                    )
                else:
                    # When share_memory is False, the storage should not be shared
                    self.assertFalse(
                        cpu_tensor1.storage().is_shared(),
                        "When share_memory=False, tensor storage should not be shared",
                    )
                    self.assertFalse(
                        cpu_tensor2.storage().is_shared(),
                        "When share_memory=False, tensor storage should not be shared",
                    )


class TestDTensorStateDictStager(DTensorTestBase):
    @with_comms
    @requires_nccl()
    @skip_if_lt_x_gpu(2)
    def test_dtensor(self):
        """
        Test that StateDictStager works correctly with DTensors.
        """
        # Create a DTensor
        device_mesh = dist.DeviceMesh("cuda", list(range(dist.get_world_size())))
        tensor = torch.randn(3, 3, device="cuda")
        dtensor = DTensor.from_local(tensor, device_mesh, [Shard(0)])

        dtensor = dtensor + 1
        dtensor = dtensor * 2

        state_dict = {
            "dtensor": dtensor,
        }

        stager = StateDictStager(pin_memory=True, share_memory=True)
        cpu_state_dict = stager.stage(state_dict)

        # Verify the original DTensor has the expected values
        self.assertTrue(torch.allclose(dtensor.to_local(), (tensor + 1) * 2))
        self.assertTrue(
            torch.allclose(
                cpu_state_dict["dtensor"].to_local(), dtensor.to_local().cpu()
            )
        )
        self.assertEqual(cpu_state_dict["dtensor"]._spec, dtensor._spec)


if __name__ == "__main__":
    run_tests()
torch/cuda/_pin_memory_utils.py (new file, 24 lines)
@@ -0,0 +1,24 @@
import torch


def pin_memory(data_ptr: int, size: int) -> None:
    cudart = torch.cuda.cudart()
    succ = int(
        cudart.cudaHostRegister(
            data_ptr,
            size,
            1,  # lines up with 'cudaHostRegisterPortable'
        )
    )

    if succ != 0:
        raise RuntimeError(
            f"Registering memory failed with cudaError: {succ}."
            " It's possible that this is an asynchronous error raised from a previous cuda operation."
            " Consider launching with CUDA_LAUNCH_BLOCKING=1 to debug."
        )


def unpin_memory(data_ptr: int) -> None:
    succ = int(torch.cuda.cudart().cudaHostUnregister(data_ptr))
    assert succ == 0, f"Unpinning shared memory failed with error-code: {succ}"
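For reference, the stager further below uses these helpers by registering a finalizer so the memory is unpinned when the CPU storage is collected. The following standalone sketch mirrors that pattern with a plain CPU tensor; the tensor and its size are illustrative only, and a CUDA-enabled build and runtime are assumed since `cudaHostRegister` is used under the hood.
```
import weakref

import torch
from torch.cuda._pin_memory_utils import pin_memory, unpin_memory

# Allocate ordinary pageable CPU memory, then page-lock it in place.
t = torch.empty(1024, dtype=torch.float32)
pin_memory(t.data_ptr(), t.numel() * t.element_size())

# Unpin automatically once the tensor is garbage collected, mirroring how
# StateDictStager attaches a finalizer to each cached CPU storage.
f = weakref.finalize(t, unpin_memory, t.data_ptr())
f.atexit = False  # avoid unpinning after the CUDA context is torn down
```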
@@ -7,6 +7,7 @@ from collections.abc import Mapping, MutableMapping
 from typing import Any, Callable, cast, NamedTuple, Optional, TYPE_CHECKING, Union
 
 import torch
+import torch.cuda._pin_memory_utils as pin_memory_utils
 import torch.distributed as dist
 import torch.nn.functional as F
 from torch.distributed._functional_collectives import AsyncCollectiveTensor
@@ -421,24 +422,9 @@ def _create_cpu_state_dict(
         t = torch.empty(*tuple(obj.size()), dtype=obj.dtype)
         t = t.share_memory_()
         if pin_memory:
-
-            def unpin_memory(t):
-                succ = int(torch.cuda.cudart().cudaHostUnregister(t.data_ptr()))
-                assert succ == 0, (
-                    f"Unpinning shared memory failed with error-code: {succ}"
-                )
-
-            weakref.finalize(t, unpin_memory, t)
-            succ = int(
-                torch.cuda.cudart().cudaHostRegister(
-                    t.data_ptr(),
-                    t.numel() * t.element_size(),
-                    1,  # lines up with 'cudaHostRegisterPortable'
-                )
-            )
-            assert succ == 0, (
-                f"Pinning shared memory failed with error-code: {succ}"
-            )
+            pin_memory_utils.pin_memory(t.data_ptr(), t.numel() * t.element_size())
+            weakref.finalize(t, pin_memory_utils.unpin_memory, t)
         return t
     elif pin_memory:
         return torch.empty(*tuple(obj.size()), dtype=obj.dtype).pin_memory()
torch/distributed/checkpoint/_state_dict_stager.py (new file, 354 lines)
@@ -0,0 +1,354 @@
# mypy: allow-untyped-defs
import logging
import types
import weakref
from copyreg import dispatch_table
from logging import getLogger
from typing import Any

import torch
import torch.cuda._pin_memory_utils as pin_memory_utils
from torch.storage import UntypedStorage
from torch.utils.weak import WeakIdKeyDictionary


logger = getLogger()
logger.setLevel(logging.INFO)


class StateDictStager:
    """
    A class for optimizing storage objects during staging for async checkpointing.

    StateDictStager stages the state_dict to CPU DRAM while applying optimizations
    like memory sharing and pinning to improve performance. It caches storage objects
    to avoid redundant copies and can be configured to automatically share memory
    (for multi-process usage) and pin memory (for faster CPU-GPU transfers).

    Attributes:
        pin_memory (bool): Whether to pin CPU memory for faster CPU-GPU transfers
        share_memory (bool): Whether to share memory across processes
        _cached_storage_mapping (WeakIdKeyDictionary): Maps storage objects to optimized CPU storages using weak references
    """

    def __init__(self, pin_memory: bool = False, share_memory: bool = False):
        if pin_memory and not torch.cuda.is_available():
            logger.warning(
                "Ignoring pin_memory flag for checkpoint staging as pinning memory "
                "requires CUDA, but CUDA is not available."
            )
            self.pin_memory = False
        else:
            self.pin_memory = pin_memory
        self.share_memory = share_memory
        # Mapping from original storage objects to CPU storages using weak references
        self._cached_storage_mapping = WeakIdKeyDictionary()

        def _deepcopy_atomic(x, _):
            return x

        def _deepcopy_list(x, memo):
            y: list = []
            memo[id(x)] = y
            append = y.append
            for a in x:
                append(self.deepcopy_with_tensor_offload(a, memo))
            return y

        def _deepcopy_tuple(x, memo):
            y = [self.deepcopy_with_tensor_offload(a, memo) for a in x]
            # We're not going to put the tuple in the memo, but it's still important we
            # check for it, in case the tuple contains recursive mutable structures.
            try:
                return memo[id(x)]
            except KeyError:
                pass

            # Check if any elements changed during deepcopy
            for k, j in zip(x, y):
                if k is not j:
                    # At least one element changed, create new tuple
                    return tuple(y)

            # No elements changed, return original tuple
            return x

        def _deepcopy_dict(x, memo):
            y: dict = {}
            memo[id(x)] = y
            for key, value in x.items():
                y[self.deepcopy_with_tensor_offload(key, memo)] = (
                    self.deepcopy_with_tensor_offload(value, memo)
                )
            return y

        def _deepcopy_method(x, memo):  # Copy instance methods
            return type(x)(
                x.__func__, self.deepcopy_with_tensor_offload(x.__self__, memo)
            )

        d: dict[Any, Any] = {}
        self._deepcopy_dispatch = d
        d[type(None)] = _deepcopy_atomic
        d[int] = _deepcopy_atomic
        d[float] = _deepcopy_atomic
        d[bool] = _deepcopy_atomic
        d[complex] = _deepcopy_atomic
        d[bytes] = _deepcopy_atomic
        d[str] = _deepcopy_atomic
        d[types.CodeType] = _deepcopy_atomic
        d[type] = _deepcopy_atomic
        d[range] = _deepcopy_atomic
        d[types.BuiltinFunctionType] = _deepcopy_atomic
        d[types.FunctionType] = _deepcopy_atomic
        d[weakref.ref] = _deepcopy_atomic
        d[property] = _deepcopy_atomic
        d[types.MethodType] = _deepcopy_method
        d[dict] = _deepcopy_dict
        d[tuple] = _deepcopy_tuple
        d[list] = _deepcopy_list

    def _stage_untyped_storage(
        self, storage: UntypedStorage, non_blocking: bool = False
    ):
        """
        Called from the hooked storage_deepcopy function in torch.Tensor.__deepcopy__.

        This method handles the storage optimization logic for the StagingStateDict class.
        It checks if the storage has already been cached, and if so, reuses it.
        Otherwise, it creates a new CPU storage and applies memory optimizations.

        Args:
            storage: The storage to optimize

        Returns:
            The optimized storage
        """
        # Check if we've already cached this storage
        if storage in self._cached_storage_mapping:
            cached_storage = self._cached_storage_mapping[storage]
            assert cached_storage.size() == storage.size(), (
                "For async checkpointing, we cache storages in DRAM and reuse them. "
                "Cached storage size does not match original storage size. "
                "This should never happen as we track the original storage weakref "
                "and clean up the cache storage. Please report this to PyTorch Distributed Checkpointing."
            )
            # Reuse cached storage but update with new data
            cached_storage.copy_(storage, non_blocking=non_blocking)
            return cached_storage

        # Create new CPU storage
        if self.share_memory:
            new_storage = type(storage)._new_shared(storage.size(), device="cpu")
        else:
            new_storage = type(storage)(storage.size(), device="cpu")

        if self.pin_memory and new_storage.nbytes() > 0:
            pin_memory_utils.pin_memory(new_storage.data_ptr(), new_storage.nbytes())
            # Set up a weak reference to unpin when cpu storage is garbage collected
            f = weakref.finalize(
                new_storage, pin_memory_utils.unpin_memory, new_storage.data_ptr()
            )
            # This makes sure that the finalizer is not called after
            # cuda context is destroyed.
            f.atexit = False

        new_storage.copy_(storage, non_blocking=non_blocking)

        # Cache the storage - WeakIdKeyDictionary will automatically clean up when storage is garbage collected
        self._cached_storage_mapping[storage] = new_storage
        return new_storage

    @torch.no_grad()
    def stage(
        self,
        state_dict: dict[str, Any],
        non_blocking: bool = False,
    ) -> dict[str, Any]:
        return self.deepcopy_with_tensor_offload(state_dict, non_blocking=non_blocking)

    def _offload_tensor(self, x, memo, non_blocking=False):
        """
        Deep copy a PyTorch tensor with optimized storage handling.

        This method creates a CPU copy of a tensor while applying memory optimizations
        like sharing and pinning based on the StateDictStager configuration.

        Args:
            x: The tensor to copy
            memo: Memo dictionary for tracking already copied objects
            non_blocking: Whether to perform non-blocking copies where possible

        Returns:
            A CPU copy of the tensor with optimized storage
        """
        # Create a new empty tensor on CPU
        y = x.new_empty([], device="cpu")

        # Store in memo dict early to handle recursive references
        d = id(x)
        memo[d] = y

        if type(x) is torch.Tensor or x.data_ptr() != 0:
            # Try to get the untyped storage and optimize it
            untyped_storage = x.untyped_storage()
            copied_storage = self._stage_untyped_storage(
                untyped_storage, non_blocking=non_blocking
            )
            # Set the tensor data using the optimized storage
            y.set_(copied_storage, x.storage_offset(), x.size(), x.stride())

        # Copy any attributes the tensor might have
        if hasattr(x, "__dict__"):
            for attr_name, attr_value in x.__dict__.items():
                setattr(
                    y,
                    attr_name,
                    self.deepcopy_with_tensor_offload(
                        attr_value, memo, non_blocking=non_blocking
                    ),
                )

        if hasattr(x, "__slots__"):
            for slot in x.__slots__:
                if hasattr(x, slot):
                    setattr(
                        y,
                        slot,
                        self.deepcopy_with_tensor_offload(
                            getattr(x, slot), memo, non_blocking=non_blocking
                        ),
                    )

        return y

    @torch.no_grad()
    def deepcopy_with_tensor_offload(self, x, memo=None, _nil=[], non_blocking=False):  # noqa: B006
        """Deep copy operation on arbitrary Python objects with special handling for PyTorch tensors.

        This implementation extends the standard deepcopy functionality to handle PyTorch tensors
        and their storages in a way that optimizes memory usage and performance, similar to the
        stage method. It applies memory sharing and pinning optimizations based on the StateDictStager
        configuration.

        Args:
            x: The object to deep copy
            memo: Memo dictionary for tracking already copied objects
            _nil: Sentinel value for memo dictionary
            non_blocking: Whether to perform non-blocking copies where possible

        Returns:
            A deep copy of the input object with optimized tensor storage handling
        """
        if memo is None:
            memo = {}

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        cls = type(x)

        # tensors and subclasses of tensors are handled separately
        if isinstance(x, torch.Tensor):
            y = self._offload_tensor(x, memo, non_blocking=non_blocking)
        else:
            # Use the dispatch table for standard types
            copier = self._deepcopy_dispatch.get(cls)
            if copier is not None:
                y = copier(x, memo)
            else:
                if issubclass(cls, type):
                    y = self._deepcopy_dispatch[type](x, memo)
                else:
                    copier = getattr(x, "__deepcopy__", None)
                    if copier is not None:
                        y = copier(memo)
                    else:
                        reductor = dispatch_table.get(cls)
                        if reductor:
                            rv = reductor(x)
                        else:
                            reductor = getattr(x, "__reduce_ex__", None)
                            if reductor is not None:
                                rv = reductor(4)
                            else:
                                reductor = getattr(x, "__reduce__", None)
                                if reductor:
                                    rv = reductor()
                                else:
                                    raise RuntimeError(
                                        f"un(deep)copyable object of type {cls}"
                                    )
                        if isinstance(rv, str):
                            y = x
                        else:
                            y = self._reconstruct(x, memo, *rv)

        # If is its own copy, don't memoize.
        if y is not x:
            memo[d] = y
            self._keep_alive(x, memo)  # Make sure x lives at least as long as d
        return y

    def _keep_alive(self, x, memo):
        """Keeps a reference to the object x in the memo.

        Because we remember objects by their id, we have
        to assure that possibly temporary objects are kept
        alive by referencing them.
        We store a reference at the id of the memo, which should
        normally not be used unless someone tries to deepcopy
        the memo itself...
        """
        try:
            memo[id(memo)].append(x)
        except KeyError:
            # aha, this is the first one :-)
            memo[id(memo)] = [x]

    def _reconstruct(
        self, x, memo, func, args, state=None, listiter=None, dictiter=None
    ):
        deep = memo is not None
        if deep and args:
            args = (self.deepcopy_with_tensor_offload(arg, memo) for arg in args)
        y = func(*args)
        if deep:
            memo[id(x)] = y

        if state is not None:
            if deep:
                state = self.deepcopy_with_tensor_offload(state, memo)
            if hasattr(y, "__setstate__"):
                y.__setstate__(state)
            else:
                if isinstance(state, tuple) and len(state) == 2:
                    state, slotstate = state
                else:
                    slotstate = None
                if state is not None:
                    y.__dict__.update(state)
                if slotstate is not None:
                    for key, value in slotstate.items():
                        setattr(y, key, value)

        if listiter is not None:
            if deep:
                for item in listiter:
                    item = self.deepcopy_with_tensor_offload(item, memo)
                    y.append(item)
            else:
                for item in listiter:
                    y.append(item)
        if dictiter is not None:
            if deep:
                for key, value in dictiter:
                    key = self.deepcopy_with_tensor_offload(key, memo)
                    value = self.deepcopy_with_tensor_offload(value, memo)
                    y[key] = value
            else:
                for key, value in dictiter:
                    y[key] = value
        return y
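As a quick illustration of the deepcopy-like traversal above, the sketch below stages a frozen dataclass holding CUDA tensors and checks that a view keeps sharing storage with its base after staging. It condenses what test_views and test_dataclasses in this PR verify; the shapes and the `Payload` class are arbitrary, and a CUDA device is assumed.
```
import dataclasses

import torch
from torch.distributed.checkpoint._state_dict_stager import StateDictStager


@dataclasses.dataclass(frozen=True)
class Payload:
    weight: torch.Tensor


base = torch.randn(4, 4, device="cuda")
state_dict = {
    "base": base,
    "flat": base.view(16),  # view sharing storage with "base"
    "frozen": Payload(weight=base.narrow(0, 0, 2)),
}

cpu_state_dict = StateDictStager(pin_memory=False, share_memory=False).stage(state_dict)

# All staged tensors are CPU copies, and the view relationship survives staging.
assert cpu_state_dict["base"].device.type == "cpu"
assert (
    cpu_state_dict["base"].untyped_storage().data_ptr()
    == cpu_state_dict["flat"].untyped_storage().data_ptr()
)
assert cpu_state_dict["frozen"].weight.device.type == "cpu"
```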