pytorch/checkpoint at 381d0cb239fa35f23b34ffb51f94784a3f9798b4 - pytorch - Gitea: Git for Me

frozenleaves/pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Saurabh Mishra 381d0cb239 [DCP] Avoid in-place update and deepcopy during dedupe (#149320 )

Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs. This manifested in the Ads 2K experiment as well. Here are the results from the TextRay model in Mitra:

Control job with deepcopy regression:
First save: ~24.8s
Global step latency: ~7-8s

Test job with the new fix to avoid deepcopy:
First save: ~21s
Global step latency: ~2s
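
To illustrate the idea in the summary, here is a minimal sketch of deduplicating per-rank plans without mutating the inputs and without `copy.deepcopy`. It is not the code from `_dedup_save_plans.py`: `Item`, `Plan`, and `dedup_plans` are hypothetical stand-ins, and the "least-loaded rank keeps the item" policy is an assumption made for the example. The point is only that `dataclasses.replace` builds a shallow new plan that reuses the existing item objects.

```python
import dataclasses
from collections import defaultdict


@dataclasses.dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for a write item, keyed by its FQN."""
    fqn: str
    size: int


@dataclasses.dataclass(frozen=True)
class Plan:
    """Hypothetical stand-in for a per-rank save plan."""
    items: list


def dedup_plans(plans):
    # Map each FQN to the ranks whose plan wants to write it.
    owners = defaultdict(list)
    sizes = {}
    for rank, plan in enumerate(plans):
        for item in plan.items:
            owners[item.fqn].append(rank)
            sizes[item.fqn] = item.size

    # Assumed policy: the currently least-loaded rank keeps each item.
    load = [0] * len(plans)
    drop = [set() for _ in plans]
    for fqn, ranks in owners.items():
        keep = min(ranks, key=lambda r: load[r])
        load[keep] += sizes[fqn]
        for r in ranks:
            if r != keep:
                drop[r].add(fqn)

    # Build new plan objects; the inputs are never mutated or deep-copied.
    return [
        dataclasses.replace(p, items=[i for i in p.items if i.fqn not in drop[r]])
        for r, p in enumerate(plans)
    ]
```

Because only the `items` list is rebuilt, the cost is proportional to the number of items rather than to the size of everything reachable from each plan, which is what a deep copy would traverse.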

Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822

Differential Revision: D71245218

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery

2025-03-18 16:08:40 +00:00

..

[reland][dtensor] move DTensor to public namespace (#134203 )

2024-09-08 17:08:40 +00:00

__init__.py

Add new hf storage class to torch.distributed package (#148361 )

2025-03-05 21:52:06 +00:00

_async_executor.py

[DCP] Introduce process based async checkpointing (#147039 )

2025-03-04 13:33:28 +00:00

_async_process_executor.py

[DCP] Introduce process based async checkpointing (#147039 )

2025-03-04 13:33:28 +00:00

_async_thread_executor.py

[DCP] Introduce process based async checkpointing (#147039 )

2025-03-04 13:33:28 +00:00

_checkpointer.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

_dedup_save_plans.py

[DCP] Avoid in-place update and deepcopy during dedupe (#149320 )

2025-03-18 16:08:40 +00:00

_dedup_tensors.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

_extension.py

PEP585: Missed conversions (#145342 )

2025-01-29 05:24:36 +00:00

_fsspec_filesystem.py

Revert "Build a storage reader/writer to write checkpoints in HF format (#146352 )"

2025-02-21 07:30:52 +00:00

_hf_storage.py

Build a storage reader/writer to write checkpoints in HF format (#148089 )

2025-02-28 07:38:10 +00:00

_nested_dict.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

_sharded_tensor_utils.py

[BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866 )

2024-06-18 13:51:53 +00:00

_storage_utils.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

_traverse.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

_version.py

[DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158 )

2024-08-28 16:31:44 +00:00

api.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

default_planner.py

[DCP] Avoid in-place update and deepcopy during dedupe (#149320 )

2025-03-18 16:08:40 +00:00

filesystem.py

Build a storage reader/writer to write checkpoints in HF format (#148089 )

2025-02-28 07:38:10 +00:00

format_utils.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

logger.py

[DCP] Introduce process based async checkpointing (#147039 )

2025-03-04 13:33:28 +00:00

logging_handlers.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

metadata.py

[DCP] Introduce modules metadata in the storage_meta (#146654 )

2025-02-13 17:44:30 +00:00

optimizer.py

[BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547 )

2025-02-28 07:35:56 +00:00

planner_helpers.py

[DCP] Cache save plans in default planner (#147343 )

2025-02-25 20:59:25 +00:00

planner.py

[BE][PYFMT] migrate PYFMT for torch.{distributed,distributions} to ruff format (#144547 )

2025-02-28 07:35:56 +00:00

resharding.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

staging.py

Fix staging for CPU tensors in OSS DCP async_save (#145408 )

2025-01-23 12:49:26 -08:00

state_dict_loader.py

Typo Errors fixed in multiple files (#148262 )

2025-03-09 12:21:40 +00:00

state_dict_saver.py

[DCP] Introduce process based async checkpointing (#147039 )

2025-03-04 13:33:28 +00:00

state_dict.py

[BE]: Apply ruff PERF403 to use dict comprehensions more often (#149257 )

2025-03-18 00:46:07 +00:00

stateful.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

storage.py

PEP585 update - torch/distributed/elastic torch/distributed/checkpoint (#145163 )

2025-01-19 20:55:59 +00:00

utils.py

[DCP] fix dcp gather_object/scatter_object_list (#147675 )

2025-03-06 21:20:38 +00:00