633a3b7f67
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit fa0db212e717b6cb225159cb32ea3d83baa52381.
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to being reverted automatically by pytorch's autorevert; to avoid this behaviour, add the tag `autorevert: disable` ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3419893217))
2025-10-19 19:20:45 +00:00
fa0db212e7
shrink_group implementation to expose ncclCommShrink API (#164518)
...
Closes #164529
This exposes the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault-tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)
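For illustration, a minimal sketch of how the Python-side API might be used: the `shrink_group` name comes from the title above, but its module path, signature, and return value are assumptions rather than facts from this log.
```python
import torch
import torch.distributed as dist

# Hypothetical sketch: shrink_group's module path, arguments, and return value
# are assumed; only the intent (excluding ranks from an existing NCCL process
# group, backed by ncclCommShrink) comes from the PR description.
def drop_ranks_and_continue(ranks_to_exclude):
    # Assumes the default process group was already created with the NCCL
    # backend, e.g. dist.init_process_group("nccl") under torchrun.
    if dist.get_rank() in ranks_to_exclude:
        return None  # excluded ranks no longer participate

    # Surviving ranks obtain a smaller communicator that skips the excluded
    # GPUs/nodes, e.g. after detecting a failure.
    new_group = dist.distributed_c10d.shrink_group(ranks_to_exclude)  # assumed signature

    t = torch.ones(1, device="cuda")
    dist.all_reduce(t, group=new_group)  # collective runs only on surviving ranks
    return new_group
```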
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/kwen2501
2025-10-19 18:00:08 +00:00
fae74cd52f
Revert "shrink_group implementation to expose ncclCommShrink API ( #164518 )"
...
This reverts commit a032510db38e8331afa08f7635d146f9cefdd0ab.
Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/pytorch-auto-revert due to being reverted automatically by pytorch's autorevert; to avoid this behaviour, add the tag `autorevert: disable` ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3416718767))
2025-10-17 18:55:53 +00:00
a032510db3
shrink_group implementation to expose ncclCommShrink API (#164518)
...
Closes #164529
This exposes the new [ncclCommShrink](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/comms.html#ncclcommshrink) API to PyTorch.
This is useful when you need to exclude certain GPUs or nodes from a collective operation, for example in fault-tolerance scenarios or when dynamically adjusting resource utilization.
For more info: [Shrinking a communicator](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html#shrinking-a-communicator)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164518
Approved by: https://github.com/Skylion007, https://github.com/syed-ahmed, https://github.com/kwen2501
2025-10-17 17:55:03 +00:00
8c60f4ae08
[Distributed] update table in docs (#165009)
...
Fixes #162248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165009
Approved by: https://github.com/ezyang
2025-10-14 18:17:22 +00:00
fa95882093
[BE] document distributed apis (#165194)
...
This PR documents some `torch.distributed.distributed_c10d` APIs. Below are some screenshots of the rendered docs.
<img width="909" height="527" alt="Screenshot 2025-10-10 at 10 18 40 PM" src="https://github.com/user-attachments/assets/555ae886-bead-47f3-8c67-9bc91c14bd11 " />
<img width="885" height="548" alt="Screenshot 2025-10-10 at 10 18 47 PM" src="https://github.com/user-attachments/assets/1d6f7af1-db28-40f9-927e-5c47668a1a88 " />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165194
Approved by: https://github.com/janeyx99
2025-10-13 20:13:59 +00:00
1281470155
[DCP][HuggingFace] Add Support for dequantization of SafeTensors checkpoints (#160682)
...
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced (see the usage sketch below):
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency.
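A hedged usage sketch with DCP: the import path and the keyword names (`target_dtype`, `block_size`, `thread_count`) are assumptions derived from the capability list above, not the component's actual signature.
```python
import torch
import torch.distributed.checkpoint as dcp

# Hypothetical sketch: only the component name (QuantizedHuggingFaceReader) and
# its capabilities (target dtype / block size configuration, multi-threaded
# dequantization) come from this description; the import path and keyword
# names below are assumed.
from torch.distributed.checkpoint import QuantizedHuggingFaceReader  # assumed path

reader = QuantizedHuggingFaceReader(
    "/path/to/quantized_safetensors_checkpoint",  # HuggingFace SafeTensors checkpoint
    target_dtype=torch.bfloat16,  # dtype to dequantize into (assumed kwarg)
    block_size=128,               # quantization block size (assumed kwarg)
    thread_count=8,               # parallel dequantization workers (assumed kwarg)
)

model = torch.nn.Linear(16, 16)
state_dict = {"model": model.state_dict()}
dcp.load(state_dict, storage_reader=reader)  # tensors arrive already dequantized
```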
Test Plan:
```
buck test //caffe2/test/distributed/checkpoint:test_quantized_hf_storage
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
2025-09-04 01:09:53 +00:00
fd47401536
[doc] Updates to distributed.md for XCCL backend (#155834)
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155834
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke, https://github.com/d4l3k
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-22 21:01:43 +00:00
b146ca74f0
docs: add get_default_backend_for_device to distributed documentation (#156783)
...
The `torch.distributed.get_default_backend_for_device()` API was added in torch 2.6 but is still missing from the distributed documentation. This commit addresses that gap.
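A brief sketch of querying the API, assuming it accepts a device-type string; the backend names in the comments are the usual defaults rather than guaranteed values.
```python
import torch
import torch.distributed as dist

# Look up the default backend for a device type before initializing the
# process group; defaults can vary by build (e.g. "xccl" for XPU builds).
print(dist.get_default_backend_for_device("cpu"))       # typically "gloo"
if torch.cuda.is_available():
    print(dist.get_default_backend_for_device("cuda"))  # typically "nccl"
```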
CC: @guangyey, @EikanWang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156783
Approved by: https://github.com/guangyey, https://github.com/malfet
2025-07-10 05:11:30 +00:00
ed051c3084
torch.distributed: add initial _dist2 prototype API (#157841)
...
This adds the initial dist2 API as proposed in https://docs.google.com/document/d/13R-1t_yESTvmAjcCN-wQjQQadIEu0JNIdS65uZawZzY/edit?tab=t.0#heading=h.3ctbqqopzc89
This is a WIP, experimental API and serves as a sandbox for a number of new features and quality-of-life improvements and changes to c10d.
Test plan:
```
pytest test/distributed/test_dist2.py
```
Docs
```
cd docs
make html
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157841
Approved by: https://github.com/fduwjj
2025-07-09 23:40:43 +00:00
bf798a2f01
Change _hfstorage to hfstorage (#155837)
...
Summary: Change the HF classes to drop the leading underscore, thereby making them public; documentation for them will be added in a follow-up.
Test Plan:
ensure existing tests pass
Differential Revision: D76364024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr
2025-06-13 20:19:51 +00:00
7986c0dba6
rename distributed.rst to md (#155767)
...
Fixes #155019
As a sanity check, the PR is split so that this one only includes the distributed.rst -> distributed.md rename.
Preview -> [distributed.md](https://docs-preview.pytorch.org/pytorch/pytorch/155767/distributed.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155767
Approved by: https://github.com/sekyondaMeta
2025-06-12 18:42:15 +00:00