378 Commits

Author SHA1 Message Date
3255e7872b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99, https://github.com/mlazos
2025-10-19 00:59:28 +00:00
2bcd892c86 [distributed] Replace assert statements in distributed checkpoint with explicit checks (#165256)
Fixes partially #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165256
Approved by: https://github.com/albanD
2025-10-17 20:14:35 +00:00
fbe0d20a17 [2/N] More ruff SIM fixes (#165031)
This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-14 14:22:54 +00:00
b8be796a57 Revert "[2/N] More ruff SIM fixes (#165031)"
This reverts commit 38095fbd1323ee4a9541fbcbb9b28bd20f2cd956.

Reverted https://github.com/pytorch/pytorch/pull/165031 on behalf of https://github.com/albanD due to One of the changed line started to fail on trunk ([comment](https://github.com/pytorch/pytorch/pull/165031#issuecomment-3390190870))
2025-10-10 13:42:14 +00:00
70925bdf82 [1/N] Use "is" in python type comparison (#165037)
It generally recommended to use `is/is not` to compare types. Therefore this series of changes apply this suggestion in the code base, and it aims to finally enabling related linter checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
2025-10-10 12:36:50 +00:00
38095fbd13 [2/N] More ruff SIM fixes (#165031)
This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-10 05:37:46 +00:00
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
fd4bde430a Revert "list_stored_sd_metadata API. (#160610)"
This reverts commit da903b6a8be422529d47649e89c0d50bb95c37ca.

Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022))
2025-10-08 15:10:38 +00:00
da903b6a8b list_stored_sd_metadata API. (#160610)
Summary:
1\ Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2\ These usecases include data loader checkpoints, reading data for post processing (when the original model definition is not available).
3\ There, we have to use saved checkpoint  (metadata) as our source of truth.
4\ This RFC proposal exposes the checkpoint metadata using a public API.

In this proposal we expose the stored state-dict metadata  (minus associated storage/chunk metadata).

Chunk/storage details should not be exposed to the users and is a impl detail of the storage writer/reader.

Test Plan:
UT.

Rollback Plan:

Differential Revision: D80231457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
2025-10-08 04:33:51 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
a10207e61b Revert "[DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)"
This reverts commit 0925c644edafbb6a8ff42fef5f3bd48b6042fad3.

Reverted https://github.com/pytorch/pytorch/pull/162760 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162760#issuecomment-3358630631))
2025-10-02 00:44:44 +00:00
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is the follow-up of #164054.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, the dequantization process was producing numerically incorrect results due to improper block-to-slice coordinate mapping. The previous implementation calculated block boundaries relative to the sliced tensor dimensions instead of the original full tensor dimensions, causing scale factors to be applied to wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

3. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to full tensor shape
3. Scale factors are applied to correct tensor regions
4. Tensor shapes map is built efficiently using only metadata

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
559e8d1c20 [doc]: Small typos (#162982)
Small typo fixes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162982
Approved by: https://github.com/ezyang, https://github.com/zou3519
2025-09-16 17:42:19 +00:00
3ae31782cc [DCP] Add timeout for checkpoint background process join (#162828)
Summary:
Cleaning up checkpoint background process can currently block trainer thread indefinitely if the process is hanging (notably due to Gloo pg init timeout).

This diff adds a 5s grace period for normal termination and sends SIGTERM if unable to shut down in that period.

Rollback Plan:

Differential Revision: D82268979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
2025-09-16 02:32:50 +00:00
0925c644ed [DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)
Summary:
Sometimes checkpoint background process creation times out during gloo pg init.
Attempting to destroy the process during that time can block the trainer thread until the timeout completes.

This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.

Test Plan:
CI

Rollback Plan:

Differential Revision: D81724668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
2025-09-13 01:50:40 +00:00
33589374b6 [DCP] Avoid multiple storage writer resets in async save (#159448)
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.

Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052

Rollback Plan:

Differential Revision: D79230339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
2025-09-10 00:43:03 +00:00
01ab325cc2 [DCP][Quantization] Fix the issue when scale vector is in a different SafeTensors file (#162214)
Summary: The current dequantization implementation assumes that the weight and scale tenors are in the same SafeTensors files. This diff fixes the issue to support the case when these could be in different files.

Test Plan:
buck test fbcode//caffe2/test/distributed/checkpoint\:test_quantized_hf_storage

Buck UI: https://www.internalfb.com/buck2/532bf151-bb40-41fd-b080-ff898675afe2
Test UI: https://www.internalfb.com/intern/testinfra/testrun/15199648851011082

Rollback Plan:

Differential Revision: D81718598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162214
Approved by: https://github.com/wwwjn
2025-09-05 22:43:58 +00:00
06da7c0730 [DCP][Quantization] Fix for FP8 multiplication during dequantization (#162202)
Summary:
Weight vector needs to be upcasted since some FP8 formats (like Float8_e4m3fn) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13

We will use FP32 for the scale vector multiplication and convert to the target dtype.

Upcasting helps with the following:

1.  **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2.  **Numerical stability**: `float32` provides more precision during intermediate calculations
3.  **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions

Test Plan:
UTs

Rollback Plan:

Differential Revision: D81711093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
2025-09-05 16:06:21 +00:00
1281470155 [DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682)
This PR introduces the QuantizedHuggingFaceReader component which enables the reading and dequantization of the quantized tensors in the SafeTensors checkpoint. Following capabilities are inrtoduced:
- Configuration the target DType and the block size.
- Multi threaded dequantization for efficiency

Test Plan:
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D80174674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
2025-09-04 01:09:53 +00:00
f0a65cd6d6 Add pg argument to consolidate_safetensors_files_on_every_rank (#161421)
Summary: Based on feedback on https://github.com/pytorch/torchtitan/pull/1625, adding a pg argument to consolidate_safetensors_files_on_every_rank so that we don't infer the pg and users can supply one if needed.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D80954339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161421
Approved by: https://github.com/fegin
2025-08-29 13:31:11 +00:00
46429be723 [DCP][HF] Add option to parallelize reads in HF Storage Reader (#160205)
Parallelize reading of data behind thread_count argument to HFStorageReader
Test plan: ensure existing tests pass and run a job successfully with these changes

Differential Revision: [D79478188](https://our.internmc.facebook.com/intern/diff/D79478188/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160205
Approved by: https://github.com/meetv18
2025-08-21 23:58:02 +00:00
fb241d0a44 [dcp][hf] Fix multi-rank consolidation for no files to process case (#160660)
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size, than there are safetensors file to consolidate, then some ranks don't have to do any work. When I had tested, this case wasn't caught, and there was an extra barrier call, causing issues for the ranks that had no work to do. They should wait at the end, as do the ranks with work.

Test Plan:
tested this case on a job e2e
added a unit test

Rollback Plan:

Differential Revision: D80273616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
2025-08-21 21:18:03 +00:00
d8fcb2a4ac [dcp_poc] Fix parameter order in distributed checkpoint API to use path-first for consistency (#160986)
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.

Test Plan:
sandcastle tests

Rollback Plan:

Differential Revision: D80549014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
2025-08-20 04:09:18 +00:00
6ee175195a [DCP][OSS] Rank local checkpointing in DCP without collectives (#147758)
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing which basically saves and loads the checkpoint without any collective. The trade off for now is the dedupe and re-sharding. Support for these would be introduced soon.

Differential Revision: D70112642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
2025-08-13 16:20:28 +00:00
118bc97b14 Write full tensors out at once in HF consolidation script (#159394)
Not all storage systems support writing at random offsets. This PR changes the writes of the consolidation script to write each tensor to a buffer, and then write out the buffer, sequentially going through every tensor in the output file. This will also help in the case where the sharded files weren't just sharded in the row-wise dimension. The reason is because small writes are expensive and we were writing each write for every chunk that was the largest number of contiguous bytes in the final tensor, but this could be a small amount of bytes for col-wise sharding. Now the full tensor is needed for the write, making the number of small writes smaller.

Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
2025-08-13 03:51:16 +00:00
2c5e10a5fc Add new function consolidate_safetensors_files_on_every_rank for HF consolidation (#159393)
Currently we are only using rank-0 for HF consolidation. But we should be able to use every rank to consolidate the sharded files, which will speed up the consolidation by Nx (where N is the number of ranks). Adding a new method consolidate_safetensors_files_on_every_rank to do this.

Differential Revision: [D79000720](https://our.internmc.facebook.com/intern/diff/D79000720/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159393
Approved by: https://github.com/saumishr
ghstack dependencies: #159392
2025-08-13 03:31:36 +00:00
f95b58c284 Remove usage of fsspec in HF consolidation script (#159392)
Moving towards just supporting local storage to take advantage of HF apis such as safe_open. This was already done in Storage component in https://github.com/pytorch/pytorch/pull/159405. This PR removes fsspec usages in consolidation script and relies on local storage only

Differential Revision: [D78997975](https://our.internmc.facebook.com/intern/diff/D78997975/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159392
Approved by: https://github.com/sibuachu
2025-08-12 20:41:06 +00:00
e96c7c4bb0 [dcp][hf] Improve HF consolidation algorithm (#158648)
Before we had a bunch of if-else cases based on sharding strategy to decide how to save the tensor with different logic for different strategies. This can be consolidated into one function that uses an algorithm to handle all cases by finding the max possible contiguous bytes that can be written

Differential Revision: [D78489438](https://our.internmc.facebook.com/intern/diff/D78489438/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158648
Approved by: https://github.com/saumishr
2025-08-09 00:11:22 +00:00
8399cf88ce Use only safetensors APIs in HFStorageReader (#159681)
Get rid of the logic to read the metadata from the header of the safetensors file manually and use the functions as part of safe_open() to get the metadata. This is much cleaner and allows us to not rely on our own custom methods to get metadata, but use safetensors provided APIs

Differential Revision: [D79460272](https://our.internmc.facebook.com/intern/diff/D79460272/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159681
Approved by: https://github.com/saumishr
ghstack dependencies: #159405, #159406
2025-08-07 17:23:03 +00:00
0b187b3114 DCP HF reader: use safe_open instead of reading the bytes (#159406)
Reading the bytes and converting to tensors is much slower than using safe_open. For a 8B model across 8 ranks, took ~30s to load before this change and ~4s after.

Differential Revision: [D78994259](https://our.internmc.facebook.com/intern/diff/D78994259/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159406
Approved by: https://github.com/saumishr
ghstack dependencies: #159405
2025-08-07 17:23:03 +00:00
69cc606fda HF component update to not use fsspec components (#159405)
Update HF components to not inherit from fsspec components and instead use filesystem writer/reader. The reason is because there doesn't seem to be much of a need for fsspec, since users are using mounted storage. Using local storage will allow for performance improvements because we can take advantage of the safe_open API provided by HF safetensors (30s vs 4s for load of 8b model), which is signifcant performance wins over reading bytes and converting to tensors which is what we are doing now. Also, we can use the official methods provided by HF instead of relying on reading the metadata by bytes and loading it

Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
2025-08-07 17:22:54 +00:00
4c01991b38 [DCP][Prototype] Checkpoint replication via PGTransport (#157963) (#159801)
Summary:

### PR Context

Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport, in this impl we assume world_sizes are equal allowing us to create perfect bi-directional pairs for the purpose of choosing replica "partners".

Test Plan:
CI

Rollback Plan:

Differential Revision: D79590797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
2025-08-06 16:52:03 +00:00
002f18807e [DCP] Improve error handling for process based async checkpointing (#159374)
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.

Test Plan:
CI

Rollback Plan:

Differential Revision: D79177410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
2025-07-30 16:25:28 +00:00
02ca965560 Device agnostic for DCP (#158337)
Enable device-agnostic implementation of DCP-related functionality, allowing the new DCP features to be supported on XPU as well.
use_cuda_non_blocking_copy to use_non_blocking_copy because non-blocking copy is supported by most GPUs and is not exclusive to CUDA devices.

Test plan: test cases have not yet been updated to be fully device agnostic; this will be addressed in future work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158337
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/Saiteja64

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-25 05:24:09 +00:00
f5e2de928b [BE] fix remaining flake8 v7 warnings (#159044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159044
Approved by: https://github.com/Skylion007
ghstack dependencies: #159043
2025-07-25 02:56:34 +00:00
f903bc475c [BE] add noqa for flake8 rule B036: found except BaseException without re-raising (#159043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159043
Approved by: https://github.com/Skylion007
2025-07-25 02:56:34 +00:00
febf3c475e fix forced loglevel in pytorch oss code (#158820)
Differential Revision: [D78715806](https://our.internmc.facebook.com/intern/diff/D78715806/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158820
Approved by: https://github.com/Skylion007, https://github.com/pradeepfn
2025-07-24 00:40:28 +00:00
5e17932c22 [DCP] Add support for ShardedTensor to PgTransport (#158573)
Add support for ShardedTensors in when PGTransport is used for send/recv checkpoints

Test is pulled from https://github.com/pytorch/pytorch/pull/157963

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158573
Approved by: https://github.com/meetv18
2025-07-21 21:04:23 +00:00
4b02bd76d3 DCP safetensors test fix (#158685)
https://github.com/pytorch/pytorch/pull/158069 removed the consolidated output path argument without updating the test. Reported by a user here https://github.com/pytorch/pytorch/pull/156705#issuecomment-3090748034.
Adding back the logic from the original PR https://github.com/pytorch/pytorch/pull/158069 and fixing the test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158685
Approved by: https://github.com/teja-rao
2025-07-20 22:52:54 +00:00
e3351b3ddf Revert "[DCP][HF] [ez]Change where sharded tensors are saved (#158069)"
This reverts commit 627ba411366bcc15019c49756d3f22fd3914bd50.

Reverted https://github.com/pytorch/pytorch/pull/158069 on behalf of https://github.com/jithunnair-amd due to Didn't remove reference to `consolidated_output_path` in test_hf_safetensor_e2e.py; CUDA runs do not surface issue because safetensors is not installed and the test silently passes ([comment](https://github.com/pytorch/pytorch/pull/158069#issuecomment-3090692336))
2025-07-18 20:54:19 +00:00
725cdb218e Name threads in caffe2/torch/distributed/checkpoint AsyncCheckpointExecutor (#158612)
Differential Revision: D78493333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158612
Approved by: https://github.com/d4l3k
2025-07-18 17:33:12 +00:00
38c04415a9 [oss][hf][bug fix] Remove buggy consolidation logic (#158380)
Summary: I tried to add some logic that could optimize for the non-row wise sharded case and do it more efficiently, but this has some bugs, so removing it for now and will find a better algorithm for the non-row wise sharded case to find the maximum number of bytes that we can write at a time.

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D78366701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
2025-07-17 13:05:06 +00:00
627ba41136 [DCP][HF] [ez]Change where sharded tensors are saved (#158069)
Summary: Previously was saving sharded tensors to same directory as full tensors. But am realizing this doesn't make sense because on load(), you would be loading for a directory which contains both, with no way to distinguish them, so they should be in separate folders.

Test Plan:
ensure existing tests pass

Rollback Plan:

Differential Revision: D78108144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
2025-07-12 01:02:17 +00:00
3232b57cd8 Updates to safetensors checkpoint consolidation script to be faster (#157936)
Summary:
- adding mmap-ing
- more efficient writing in larger chunks

latency from ~150s to ~6s for simple row-wise consolidation of a 7gb model sharded across 4 ranks

Test Plan:
ran consolidation with the following code:

```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time

start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```

With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors

Rollback Plan:

Differential Revision: D77960054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
2025-07-10 02:50:20 +00:00
3404c1f0cf [HF][DCP] Upload local consolidated files to remote storage if needed (#157371)
If the final output file is in remote storage, then create a local temp directory to write the files and upload the files to the remotes storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.

Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
2025-07-10 02:40:25 +00:00
4cc8b60d1b [BE][1/16] fix typos in torch/ (#156311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156311
Approved by: https://github.com/albanD
2025-07-09 11:02:22 +00:00
dea4864ce0 HF loads dcp - don't do a full deserialize on every file (#157715)
Summary: These changes in D76442012 got reverted after the PR landed due to aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failing with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`, but that test failing is definitely transient and unrelated to my changes, so re-creating the diff

Test Plan:
ensure tests pass

Rollback Plan:

Differential Revision: D77871099

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18
2025-07-08 18:13:27 +00:00