It is generally recommended to use `is`/`is not` to compare types. This series of changes applies that suggestion across the code base, with the goal of finally enabling the related linter checks.
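A minimal illustration of the recommended pattern (types are singletons, so identity comparison is correct and is what linters such as flake8's E721 expect):

```python
def describe(value):
    # Preferred: identity comparison against the type object,
    # rather than `type(value) == int`.
    if type(value) is int:
        return "int"
    if type(value) is not str:
        return "not a string"
    return "str"
```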
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
Summary:
1. Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2. These use cases include data loader checkpoints and reading data for post-processing (when the original model definition is not available).
3. In those cases, we have to use the saved checkpoint (metadata) as our source of truth.
4. This RFC proposal exposes the checkpoint metadata using a public API.
In this proposal we expose the stored state-dict metadata (minus the associated storage/chunk metadata).
Chunk/storage details should not be exposed to users; they are an implementation detail of the storage writer/reader.
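A minimal sketch of what such an API could look like. All names here (`TensorMeta`, `read_state_dict_metadata`, the index layout) are hypothetical, chosen only to illustrate surfacing state-dict metadata while withholding storage/chunk details:

```python
from dataclasses import dataclass

@dataclass
class TensorMeta:
    # Hypothetical per-tensor metadata surfaced to users; storage/chunk
    # layout is deliberately omitted (an implementation detail of the
    # storage writer/reader).
    shape: tuple
    dtype: str

def read_state_dict_metadata(checkpoint_index):
    # checkpoint_index: hypothetical mapping of fully-qualified tensor
    # names to raw saved metadata entries.
    return {
        fqn: TensorMeta(shape=tuple(m["shape"]), dtype=m["dtype"])
        for fqn, m in checkpoint_index.items()
    }
```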
Test Plan:
UT.
Rollback Plan:
Differential Revision: D80231457
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
Summary:
Cleaning up the checkpoint background process can currently block the trainer thread indefinitely if the process hangs (notably due to a Gloo process-group init timeout).
This diff adds a 5s grace period for normal termination and sends SIGTERM if the process has not shut down within that period.
Rollback Plan:
Differential Revision: D82268979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162828
Approved by: https://github.com/meetv18
Summary:
Sometimes creation of the checkpoint background process times out during Gloo process-group init.
Attempting to destroy the process during that window can block the trainer thread until the timeout completes.
This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time.
Test Plan:
CI
Rollback Plan:
Differential Revision: D81724668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760
Approved by: https://github.com/meetv18
Summary: Avoid multiple storage writer resets in async save. Currently the reset is called by the `async_save` method and then again in the `save` method. In the async path, `async_save` should only do the staging; the reset should happen only in the synchronous `save` path.
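The intended split can be sketched as follows. This is illustrative only (the class and method shapes are invented for the sketch, not the real DCP classes): the async entry point stages the state dict without touching the writer, and only the synchronous save path resets it:

```python
class FakeStorageWriter:
    # Minimal stand-in writer that counts resets.
    def __init__(self):
        self.resets = 0
        self.written = None

    def reset(self):
        self.resets += 1

    def write(self, state_dict):
        self.written = state_dict

class Saver:
    def __init__(self, writer):
        self.writer = writer

    def async_save(self, state_dict):
        # Staging only -- no writer.reset() here.
        staged = dict(state_dict)
        return lambda: self.save(staged)

    def save(self, state_dict):
        # The single reset lives in the synchronous save path.
        self.writer.reset()
        self.writer.write(state_dict)
```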
Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052
Rollback Plan:
Differential Revision: D79230339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
Summary:
The weight vector needs to be upcast, since some FP8 formats (like `Float8_e4m3fn`) don't have CPU implementations in PyTorch. Reference: https://docs.pytorch.org/docs/stable/tensors.html#id13
We will use FP32 for the scale-vector multiplication and then convert to the target dtype.
Upcasting helps with the following:
1. **Full CPU support**: `float32` has complete CPU kernel implementations for all operations
2. **Numerical stability**: `float32` provides more precision during intermediate calculations
3. **Compatibility**: Works across all devices (CPU/GPU) and PyTorch versions
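The points above can be sketched as an upcast-multiply-downcast helper. This is a minimal illustration of the pattern, not the actual PR code; the test uses `float16` as a stand-in target dtype, since FP8 dtype availability varies across PyTorch builds:

```python
import torch

def scale_to_dtype(weight, scale, target_dtype):
    # Upcast both operands to float32 before the scale multiplication:
    # float32 has complete CPU kernels and gives extra precision for the
    # intermediate result; only the final cast targets the output dtype.
    result = weight.to(torch.float32) * scale.to(torch.float32)
    return result.to(target_dtype)
```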
Test Plan:
UTs
Rollback Plan:
Differential Revision: D81711093
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162202
Approved by: https://github.com/wwwjn
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced:
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency.
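The multi-threaded dequantization idea can be sketched with the standard library. This is a toy illustration (per-block scale multiplication on plain lists; the names and the dequantization formula are placeholders, not the actual component's API):

```python
from concurrent.futures import ThreadPoolExecutor

def dequantize_block(block, scale):
    # Hypothetical per-block dequantization: scale each quantized value
    # back to its original range.
    return [v * scale for v in block]

def dequantize_all(blocks, scales, max_workers=4):
    # Dequantize independent blocks concurrently; blocks share no state,
    # so a thread pool maps cleanly over them.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(dequantize_block, blocks, scales))
```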
Test Plan:
```
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
```
```
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
Summary: In the consolidate_safetensors_files_on_every_rank method, where we use multiple ranks to combine sharded safetensors files, if there are more ranks in the world size than there are safetensors files to consolidate, some ranks have no work to do. This case wasn't caught in my earlier testing, and an extra barrier call caused issues for the ranks that had no work. Those ranks should wait at the final barrier, just as the ranks with work do.
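The barrier pitfall can be illustrated with threads standing in for ranks (a minimal sketch using `threading.Barrier`, not the actual distributed code): every rank must reach the same barrier exactly once, whether or not it had a file to consolidate.

```python
import threading

def consolidate_on_rank(rank, num_files, barrier):
    # Ranks beyond the number of files have nothing to consolidate, but
    # every rank must still reach the same barrier; otherwise the ranks
    # with work block forever waiting for the idle ones.
    if rank < num_files:
        pass  # ... consolidate this rank's safetensors file ...
    barrier.wait()  # all ranks synchronize here, with or without work
```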
Test Plan:
tested this case on a job e2e
added a unit test
Rollback Plan:
Differential Revision: D80273616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160660
Approved by: https://github.com/sibuachu
Summary: This commit standardizes the parameter order across PyTorch's experimental distributed checkpoint (DCP) API, changing all checkpoint operations from (state_dict, path) to (path, state_dict) for consistency with standard file I/O patterns.
Test Plan:
sandcastle tests
Rollback Plan:
Differential Revision: D80549014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160986
Approved by: https://github.com/pradeepfn
Summary:
DCP metadata collectives become prohibitively expensive as the job scale grows. This PR introduces rank-local checkpointing, which saves and loads the checkpoint without any collectives. The trade-off for now is the loss of dedupe and re-sharding; support for these will be introduced soon.
Differential Revision: D70112642
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147758
Approved by: https://github.com/meetv18
Not all storage systems support writing at random offsets. This PR changes the consolidation script to write each tensor to a buffer and then write out the buffer, going through every tensor in the output file sequentially. This also helps when the sharded files weren't sharded only in the row-wise dimension: small writes are expensive, and we previously issued one write per largest run of contiguous bytes in the final tensor, which can be very few bytes for col-wise sharding. Now the full tensor is assembled before the write, reducing the number of small writes.
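The buffering idea can be sketched in a few lines (an illustration of the write pattern only, using raw byte chunks in place of real tensor shards):

```python
import io

def write_tensor_sequentially(out_stream, chunks):
    # Assemble every chunk of one output tensor into a single in-memory
    # buffer, then issue one sequential write. For col-wise sharding the
    # largest contiguous run per chunk can be tiny, so per-chunk writes
    # are expensive; one write per full tensor avoids that, and needs no
    # random-offset support from the storage system.
    buf = bytearray()
    for chunk in chunks:
        buf += chunk
    out_stream.write(bytes(buf))
```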
Differential Revision: [D78684452](https://our.internmc.facebook.com/intern/diff/D78684452/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159394
Approved by: https://github.com/saumishr
ghstack dependencies: #159392, #159393
Update the HF components to not inherit from the fsspec components and instead use the filesystem writer/reader. There doesn't seem to be much need for fsspec, since users are using mounted storage. Using local storage allows for performance improvements because we can take advantage of the `safe_open` API provided by HF safetensors (30s vs. 4s to load an 8B model), a significant win over reading bytes and converting them to tensors, which is what we do now. We can also use the official methods provided by HF instead of reading the metadata by bytes and loading it ourselves.
Differential Revision: [D78993550](https://our.internmc.facebook.com/intern/diff/D78993550/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159405
Approved by: https://github.com/saumishr
Summary:
### PR Context
Introduce simple replication logic via PGTransport. The goal is to showcase a working prototype of replication via PGTransport. In this implementation we assume world sizes are equal, which lets us create perfect bi-directional pairs when choosing replica "partners".
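One way bi-directional pairing can work under the equal-world-size assumption is sketched below. The pairing scheme here (first half pairs with second half) is an assumption for illustration; the prototype's actual scheme may differ:

```python
def replica_partner(rank, world_size):
    # Illustrative pairing: rank i in the first half pairs with rank
    # i + world_size // 2, giving symmetric bi-directional pairs
    # (partner(partner(r)) == r). Assumes an even world size.
    half = world_size // 2
    return rank + half if rank < half else rank - half
```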
Test Plan:
CI
Rollback Plan:
Differential Revision: D79590797
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159801
Approved by: https://github.com/saumishr
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79177410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
Summary: I had added logic to handle the non-row-wise sharded case more efficiently, but it has some bugs, so I'm removing it for now. I'll find a better algorithm for the non-row-wise sharded case to determine the maximum number of bytes we can write at a time.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D78366701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
Summary: Previously, sharded tensors were saved to the same directory as the full tensors. This doesn't make sense: on load(), you would be reading from a directory that contains both, with no way to distinguish them, so they should live in separate folders.
Test Plan:
ensure existing tests pass
Rollback Plan:
Differential Revision: D78108144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158069
Approved by: https://github.com/teja-rao
Summary:
- add mmap-ing
- write in larger, more efficient chunks
Latency drops from ~150s to ~6s for a simple row-wise consolidation of a 7 GB model sharded across 4 ranks.
Test Plan:
ran consolidation with the following code:
```
from torch.distributed.checkpoint._consolidate_hf_safetensors import consolidate_safetensors_files
import time

base_path = "..."          # directory containing the sharded safetensors files
consolidated_path = "..."  # destination for the consolidated output

start_time = time.time()
consolidate_safetensors_files(base_path, consolidated_path)
end_time = time.time()
print(f"Time taken: {end_time - start_time} seconds")
```
With the old code this was taking a couple minutes and this is now down to ~6s.
Internal users can find the tensor shards in the manifold path: manifold://ankita_test_bucket/tree/safetensors
Rollback Plan:
Differential Revision: D77960054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157936
Approved by: https://github.com/teja-rao, https://github.com/pradeepfn
If the final output file is in remote storage, create a local temp directory to write the files to, and upload the files to the remote storage after they are written.
Add a new config to the storage writer, `enable_consolidation`, so we don't need to rely on the presence of the `consolidation_output_path` to decide if consolidation is enabled. If `enable_consolidation` is True and `consolidation_output_path` isn't provided, the consolidated safetensors will be added to the same path as the sharded ones.
Differential Revision: [D77554585](https://our.internmc.facebook.com/intern/diff/D77554585/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157371
Approved by: https://github.com/pradeepfn
Summary: The changes in D76442012 were reverted after the PR landed because aps_models/ads/launchers/pearl/tests/ne/e2e_deterministic_tests:pearl_e2e_ne_tests failed with `Config not loaded due to no timely response from configerator. Likely configerator_proxy or falcon_proxy are not healthy`. That failure is transient and unrelated to my changes, so I'm re-creating the diff.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D77871099
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157715
Approved by: https://github.com/meetv18