It is generally recommended to use `is`/`is not` to compare types. This series of changes applies that suggestion across the code base, with the goal of finally enabling the related linter checks.
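A minimal illustration of the recommended pattern (types are singletons, so identity comparison is correct and is what linters such as flake8's E721 expect):

```python
def describe(value):
    # Preferred: identity comparison against the type object,
    # rather than `type(value) == int`.
    if type(value) is int:
        return "int"
    if type(value) is not str:
        return "not a string"
    return "str"
```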
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
Summary:
1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2. These use cases include data loader checkpoints and reading data for post-processing (when the original model definition is not available).
3. In those cases, we have to use the saved checkpoint (metadata) as our source of truth.
4. This RFC proposal exposes the checkpoint metadata using a public API.
In this proposal we expose the stored state-dict metadata (minus the associated storage/chunk metadata).
Chunk/storage details should not be exposed to users; they are an implementation detail of the storage writer/reader.
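A minimal sketch of what such an API could look like. All names here (`TensorMeta`, `read_state_dict_metadata`, the index layout) are hypothetical, chosen only to illustrate surfacing state-dict metadata while withholding storage/chunk details:

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    # Hypothetical per-tensor metadata surfaced to users; storage/chunk
    # layout is deliberately omitted (an implementation detail of the
    # storage writer/reader).
    shape: tuple
    dtype: str

def read_state_dict_metadata(checkpoint_index):
    # checkpoint_index: hypothetical mapping of fully-qualified tensor
    # names to raw saved metadata entries.
    return {
        fqn: TensorMeta(shape=tuple(m["shape"]), dtype=m["dtype"])
        for fqn, m in checkpoint_index.items()
    }
```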
Test Plan:
UT.
Rollback Plan:
Differential Revision: D80231457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
Summary:
Cleaning up the checkpoint background process can currently block the trainer thread indefinitely if the process hangs (notably due to a Gloo process-group init timeout).
This diff adds a 5s grace period for normal termination and sends SIGTERM if the process has not shut down within that period.
Rollback Plan:
Differential Revision: D82268979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
Summary:
Sometimes creation of the checkpoint background process times out during Gloo process-group init.
Attempting to destroy the process during that window can block the trainer thread until the timeout completes.
This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81724668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
Summary: Avoid multiple storage writer resets in async save. Currently the reset is called by the `async_save` method and then again in the `save` method. In the async path, `async_save` should only do the staging; the reset should happen only in the synchronous `save` path.
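The intended split can be sketched as follows. This is illustrative only (the class and method shapes are invented for the sketch, not the real DCP classes): the async entry point stages the state dict without touching the writer, and only the synchronous save path resets it:

```python
class FakeStorageWriter:
    # Minimal stand-in writer that counts resets.
    def __init__(self):
        self.resets = 0
        self.written = None

    def reset(self):
        self.resets += 1

    def write(self, state_dict):
        self.written = state_dict

class Saver:
    def __init__(self, writer):
        self.writer = writer

    def async_save(self, state_dict):
        # Staging only -- no writer.reset() here.
        staged = dict(state_dict)
        return lambda: self.save(staged)

    def save(self, state_dict):
        # The single reset lives in the synchronous save path.
        self.writer.reset()
        self.writer.write(state_dict)
```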
Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052
Rollback Plan:
Differential Revision: D79230339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
Summary:
The weight vector needs to be upcast, since some FP8 formats (like `Float8_e4m3fn`) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13
We will use FP32 for the scale-vector multiplication and then convert to the target dtype.
Upcasting helps with the following:
1. **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2. **Numerical stability**: `float32` provides more precision during intermediate calculations
3. **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions
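The points above can be sketched as an upcast-multiply-downcast helper. This is a minimal illustration of the pattern, not the actual PR code; the test uses `float16` as a stand-in target dtype, since FP8 dtype availability varies across PyTorch builds:

```python
import torch

def scale_to_dtype(weight, scale, target_dtype):
    # Upcast both operands to float32 before the scale multiplication:
    # float32 has complete CPU kernels and gives extra precision for the
    # intermediate result; only the final cast targets the output dtype.
    result = weight.to(torch.float32) * scale.to(torch.float32)
    return result.to(target_dtype)
```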
Test Plan:
UTs
Rollback Plan:
Differential Revision: D81711093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced:
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency.
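The multi-threaded dequantization idea can be sketched with the standard library. This is a toy illustration (per-block scale multiplication on plain lists; the names and the dequantization formula are placeholders, not the actual component's API):

```python
from concurrent.futures import ThreadPoolExecutor

def dequantize_block(block, scale):
    # Hypothetical per-block dequantization: scale each quantized value
    # back to its original range.
    return [v * scale for v in block]

def dequantize_all(blocks, scales, max_workers=4):
    # Dequantize independent blocks concurrently; blocks share no state,
    # so a thread pool maps cleanly over them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(dequantize_block, blocks, scales))
```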
Test Plan:
```
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, some ranks have no work to do. This case wasn't caught in my earlier testing, and an extra barrier call caused issues for the ranks that had no work. Those ranks should wait at the final barrier, just as the ranks with work do.
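The barrier pitfall can be illustrated with threads standing in for ranks (a minimal sketch using `threading.Barrier`, not the actual distributed code): every rank must reach the same barrier exactly once, whether or not it had a file to consolidate.

```python
import threading

def consolidate_on_rank(rank, num_files, barrier):
    # Ranks beyond the number of files have nothing to consolidate, but
    # every rank must still reach the same barrier; otherwise the ranks
    # with work block forever waiting for the idle ones.
    if rank < num_files:
        pass  # ... consolidate this rank's safetensors file ...
    barrier.wait()  # all ranks synchronize here, with or without work
```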
Test Plan:
tested this case on a job e2e
added a unit test
Rollback Plan:
Differential Revision: D80273616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.
Test Plan:
sandcastle tests
Rollback Plan:
Differential Revision: D80549014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off for now is the loss of dedupe and re-sharding; support for these will be introduced soon.
Differential Revision: D70112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
Not all storage systems support writing at random offsets. This PR changes the consolidation script to write each tensor to a buffer and then write out the buffer, going through every tensor in the output file sequentially. This also helps when the sharded files weren't sharded only in the row-wise dimension: small writes are expensive, and we previously issued one write per largest run of contiguous bytes in the final tensor, which can be very few bytes for col-wise sharding. Now the full tensor is assembled before the write, reducing the number of small writes.
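The buffering idea can be sketched in a few lines (an illustration of the write pattern only, using raw byte chunks in place of real tensor shards):

```python
import io

def write_tensor_sequentially(out_stream, chunks):
    # Assemble every chunk of one output tensor into a single in-memory
    # buffer, then issue one sequential write. For col-wise sharding the
    # largest contiguous run per chunk can be tiny, so per-chunk writes
    # are expensive; one write per full tensor avoids that, and needs no
    # random-offset support from the storage system.
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
    out_stream.write(bytes(buf))
```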
Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
Update the HF components to not inherit from the fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the `safe_open` API provided by HF safetensors (30s vs. 4s to load an 8B model), a significant win over reading bytes and converting them to tensors, which is what we do now. We can also use the official methods provided by HF instead of reading the metadata by bytes and loading it ourselves.
Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
Summary:
### PR Context
Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport. In this implementation we assume world sizes are equal, which lets us create perfect bi-directional pairs when choosing replica "partners".
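One way bi-directional pairing can work under the equal-world-size assumption is sketched below. The pairing scheme here (first half pairs with second half) is an assumption for illustration; the prototype's actual scheme may differ:

```python
def replica_partner(rank, world_size):
    # Illustrative pairing: rank i in the first half pairs with rank
    # i + world_size // 2, giving symmetric bi-directional pairs
    # (partner(partner(r)) == r). Assumes an even world size.
    half = world_size // 2
    return rank + half if rank < half else rank - half
```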
Test Plan:
CI
Rollback Plan:
Differential Revision: D79590797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79177410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
Summary: I had added logic to handle the non-row-wise sharded case more efficiently, but it has some bugs, so I'm removing it for now. I'll find a better algorithm for the non-row-wise sharded case to determine the maximum number of bytes we can write at a time.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D78366701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
Summary: Previously, sharded tensors were saved to the same directory as the full tensors. This doesn't make sense: on load(), you would be reading from a directory that contains both, with no way to distinguish them, so they should live in separate folders.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D78108144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
Summary:
- add mmap-ing
- write in larger, more efficient chunks
Latency drops from ~150s to ~6s for a simple row-wise consolidation of a 7 GB model sharded across 4 ranks.
Test Plan:
ran consolidation with the following code:
```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time

base_path = "..."          # directory containing the sharded safetensors files
consolidated_path = "..."  # destination for the consolidated output

start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```
With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors
Rollback Plan:
Differential Revision: D77960054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
If the final output file is in remote storage, create a local temp directory to write the files to, and upload the files to the remote storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.
Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
Summary: The changes in D76442012 were reverted after the PR landed because aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failed with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`. That failure is transient and unrelated to my changes, so I'm re-creating the diff.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D77871099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18