pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00


Ke Wen 8fbf866904 [PGNCCL] Use non-blocking mode by default in eager init (#138527 )

### Why use non-blocking mode in eager init?
To overlap comm init with model init and other setup work.
![image](https://github.com/user-attachments/assets/9b0bf7a9-be26-4d16-827b-dbe861f083cd)

### Why can we set non-blocking as default?
If the setting is dangling (i.e. neither passed in by the user nor set via env), `ProcessGroupNCCL` is free to apply its own preferred default. Torch-level API semantics also do not change depending on whether the NCCL comm is blocking or non-blocking; that difference is handled within `ProcessGroupNCCL`.
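
The "dangling setting" rule above can be sketched in a few lines of Python. This is only an illustrative model, not the actual `ProcessGroupNCCL` code: the function name `resolve_nonblocking`, the set of accepted truthy strings, and the precedence of an explicit user option over the environment are all assumptions made for the example. The environment variable name `TORCH_NCCL_USE_COMM_NONBLOCKING` is the one PyTorch documents for this setting.

```python
import os
from typing import Mapping, Optional

# Environment variable PyTorch documents for the NCCL non-blocking setting.
ENV_VAR = "TORCH_NCCL_USE_COMM_NONBLOCKING"


def resolve_nonblocking(
    user_option: Optional[bool],
    eager_init: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical model of how a dangling setting could be resolved.

    Assumed precedence (not confirmed by the commit message):
    explicit user option, then environment variable, then a default
    that depends on whether init is eager.
    """
    if env is None:
        env = os.environ
    if user_option is not None:
        # The user passed the setting explicitly: respect it.
        return user_option
    raw = env.get(ENV_VAR)
    if raw is not None:
        # The setting came from the environment: respect it.
        return raw.strip().lower() in ("1", "true", "t", "yes", "y")
    # Dangling: non-blocking for eager init, blocking for lazy init.
    return eager_init
```

With nothing set, `resolve_nonblocking(None, eager_init=True, env={})` yields non-blocking mode, while the same call with `eager_init=False` stays blocking.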

### Why not make non-blocking default for lazy mode as well?
PR https://github.com/pytorch/pytorch/pull/137544 tried it.
Two reasons why that's not preferred today:
1. It is hard: the blast radius is too big.
2. There is no gain from doing lazy init in non-blocking mode, because the very next CPU call is a collective, and we would block there waiting for the comm to be ready. The effect is the same as blocking init, with no "opening" for overlap as there is in eager mode.
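
The timing argument in reason 2 can be illustrated with a toy model. The sketch below is not taken from the PR: the function `time_to_first_collective` and its parameters are hypothetical, and it assumes comm init and model init are the only two costs before the first collective and that they can overlap perfectly in the eager non-blocking case.

```python
def time_to_first_collective(
    comm_init_s: int,
    model_init_s: int,
    eager: bool,
    nonblocking: bool,
) -> int:
    """Toy estimate of wall time until the first collective can run.

    Hypothetical model: only comm init and model init are accounted for.
    """
    if eager and nonblocking:
        # Comm init starts early and runs in the background while the
        # CPU thread continues with model init: the two overlap.
        return max(comm_init_s, model_init_s)
    if eager:
        # Eager but blocking: the CPU thread waits for comm init,
        # then does model init.
        return comm_init_s + model_init_s
    # Lazy: comm init is only triggered by the first collective, after
    # model init. Even in non-blocking mode the collective has to wait
    # for the comm to be ready, so nothing overlaps.
    return model_init_s + comm_init_s
```

For example, with a 5 s comm init and an 8 s model init, the eager non-blocking case takes 8 s in this model, while lazy init takes 13 s whether or not it is non-blocking.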

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138527
Approved by: https://github.com/wconstab
ghstack dependencies: #137855, #138488, #138374, #138384

2024-10-23 08:51:54 +00:00

control_collectives

Reapply "c10d: add Collectives abstraction (#125978 )" (#126695 )

2024-05-21 18:00:09 +00:00

control_plane

Do not use <filesystem> on Linux (#134494 )

2024-08-27 14:44:10 +00:00

quantization

[Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102 )

2024-05-30 16:19:53 +00:00

Backend.cpp

Revert "Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 )"

2023-10-16 23:03:26 +00:00

Backend.hpp

[c10d] Rename PG name and PG ID attribute (#132915 )

2024-08-09 21:26:56 +00:00

Backoff.cpp

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

Backoff.hpp

TCPStore: improve connect and retry logic (#129261 )

2024-06-25 19:24:22 +00:00

c10d.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

comm.cpp

[1/N] Replace std::tie with structural binding (#119774 )

2024-02-14 09:25:04 +00:00

comm.hpp

[codemod] c10:optional -> std::optional (#126135 )

2024-05-14 19:35:51 +00:00

CudaDMAConnectivity.cpp

[CudaDMAConnectivityDetector] improve the detection robustness (#137530 )

2024-10-09 23:30:16 +00:00

CUDASymmetricMemory-inl.h

[ROCm] CK-based GEMM (#131004 )

2024-10-20 02:57:43 +00:00

CUDASymmetricMemory.cu

[SymmetricMemory] fix incorrect numel caculations that are using int as std::accumulate's accumulator (#138038 )

2024-10-16 03:34:26 +00:00

CUDASymmetricMemory.hpp

[SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643 )

2024-10-15 21:35:14 +00:00

CUDASymmetricMemoryOps.cu

[SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529 )

2024-10-09 23:30:16 +00:00

debug.cpp

[Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942 )

2024-10-15 07:47:24 +00:00

debug.h

…

default_comm_hooks.cpp

[ncclx] Rename NCCL-EXP to NCCLX (#125238 )

2024-05-01 23:29:55 +00:00

default_comm_hooks.hpp

[Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987 )

2024-05-11 00:03:52 +00:00

DMAConnectivity.cpp

[c10d] Introduce a util for detecting DMA connectivity among devices (#129510 )

2024-06-27 23:02:07 +00:00

DMAConnectivity.hpp

[c10d] Introduce a util for detecting DMA connectivity among devices (#129510 )

2024-06-27 23:02:07 +00:00

error.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

exception.h

[RESUBMIT] Standardize on error types for distributed errors. (#108191 )

2023-08-30 21:47:39 +00:00

FakeProcessGroup.hpp

[reland] Fix estimate_nccl_collective_runtime (#118986 )

2024-02-12 18:48:06 +00:00

FileStore.cpp

Replace c10::invoke_result with std::invoke_result (#124169 )

2024-05-25 02:42:13 +00:00

FileStore.hpp

Apply clang-format to distributed/c10d folder (#107140 )

2023-08-14 23:16:38 +00:00

Functional.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

Functional.hpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

GlooDeviceFactory.cpp

[Environment Variable][4/N] Use thread-safe getenv functions (#137843 )

2024-10-21 02:58:59 +00:00

GlooDeviceFactory.hpp

…

GroupRegistry.cpp

[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032 )

2024-04-16 00:42:18 +00:00

GroupRegistry.hpp

Disable GroupRegistry's thread isolation by default (#121457 )

2024-03-08 19:31:24 +00:00

HashStore.cpp

[Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312 )

2024-04-13 11:45:00 +00:00

HashStore.hpp

Reapply "c10d: add Collectives abstraction (#125978 )" (#126695 )

2024-05-21 18:00:09 +00:00

init.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

intra_node_comm.cpp

[IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475 )

2024-10-09 23:30:16 +00:00

intra_node_comm.cu

[IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475 )

2024-10-09 23:30:16 +00:00

intra_node_comm.hpp

[IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475 )

2024-10-09 23:30:16 +00:00

logger.cpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

logger.hpp

[Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043 )

2024-04-23 00:43:50 +00:00

logging.cpp

Revert "c10d/logging: add C10D_LOCK_GUARD (#134131 )"

2024-08-26 15:19:27 +00:00

logging.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

NanCheck.cu

[Distributed] add pack-check method for float8_e5m2 (#136115 )

2024-09-15 21:37:43 +00:00

NanCheck.hpp

[1/N] Move NaN check onto NCCL stream (#134300 )

2024-08-29 08:28:49 +00:00

NCCLUtils.cpp

[PGNCCL] Use non-blocking mode by default in eager init (#138527 )

2024-10-23 08:51:54 +00:00

NCCLUtils.hpp

[PGNCCL] Use non-blocking mode by default in eager init (#138527 )

2024-10-23 08:51:54 +00:00

Ops.cpp

[codemod] c10:optional -> std::optional (#126135 )

2024-05-14 19:35:51 +00:00

ParamCommsUtils.cpp

[Distributed/Profiler] Fix input/output dimension overflow (#134360 )

2024-08-25 16:25:56 +00:00

ParamCommsUtils.hpp

fix sequence number for group (#134578 )

2024-10-10 04:24:06 +00:00

PrefixStore.cpp

[Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884 )

2024-03-31 09:06:35 +00:00

PrefixStore.hpp

[Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043 )

2024-04-23 00:43:50 +00:00

ProcessGroup.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

ProcessGroup.hpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

ProcessGroupGloo.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

ProcessGroupGloo.hpp

Fix comment in ProcessGroupGloo (#137746 )

2024-10-12 01:04:41 +00:00

ProcessGroupMPI.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

ProcessGroupMPI.hpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

ProcessGroupNCCL.cpp

[PGNCCL] Use non-blocking mode by default in eager init (#138527 )

2024-10-23 08:51:54 +00:00

ProcessGroupNCCL.hpp

[PGNCCL] Use non-blocking mode by default in eager init (#138527 )

2024-10-23 08:51:54 +00:00

ProcessGroupUCC.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

ProcessGroupUCC.hpp

[codemod] c10:optional -> std::optional (#126135 )

2024-05-14 19:35:51 +00:00

ProcessGroupWrapper.cpp

C10_UNUSED to [[maybe_unused]] (#6357 ) (#138364 )

2024-10-19 13:17:43 +00:00

ProcessGroupWrapper.hpp

[Distributed] [1/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#122884 )

2024-03-31 09:06:35 +00:00

PyProcessGroup.hpp

Fix build warnings for torch_python (#134981 )

2024-09-12 03:59:34 +00:00

python_comm_hook.cpp

…

python_comm_hook.h

…

RankLocal.hpp

Handle unwaited work objects on process termination (#119881 )

2024-02-19 02:46:02 +00:00

reducer_cuda.cpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

reducer_timer.hpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

reducer.cpp

C10_UNUSED to [[maybe_unused]] (#6357 ) (#138364 )

2024-10-19 13:17:43 +00:00

reducer.hpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

sequence_num.cpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

sequence_num.hpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

socket_fmt.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

socket.cpp

More appropriate socket errors and debug messages (#130347 )

2024-10-21 21:28:40 +00:00

socket.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

Store.cpp

[Reland] Add -Wdeprecated and related fixes (#110019 )

2023-09-28 03:34:29 +00:00

Store.hpp

[12/N] Use std::optional (#132361 )

2024-08-02 13:46:46 +00:00

SymmetricMemory.cpp

[SymmetricMemory] fix incorrect numel caculations that are using int as std::accumulate's accumulator (#138038 )

2024-10-16 03:34:26 +00:00

SymmetricMemory.hpp

[SymmetricMemory] implement timeout for barrier(), put_signal() and wait_signal() (#137643 )

2024-10-15 21:35:14 +00:00

TCPStore.cpp

[TCPStore] Remove deprecated constructor (#136004 )

2024-09-14 04:25:47 +00:00

TCPStore.hpp

[TCPStore] Remove deprecated constructor (#136004 )

2024-09-14 04:25:47 +00:00

TCPStoreBackend.cpp

TCPStore: add ping to verify network connectivity on connect (#129985 )

2024-07-03 02:09:44 +00:00

TCPStoreBackend.hpp

TCPStore: add ping to verify network connectivity on connect (#129985 )

2024-07-03 02:09:44 +00:00

TCPStoreLibUvBackend.cpp

TCPStoreLibUvBackend: trace operations (#136320 )

2024-09-20 00:53:21 +00:00

TraceUtils.h

[Distributed] [16/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#137404 )

2024-10-10 18:05:34 +00:00

Types.hpp

[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301 )

2024-07-08 07:03:53 +00:00

UCCTracing.cpp

[Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942 )

2024-10-15 07:47:24 +00:00

UCCTracing.hpp

…

UCCUtils.cpp

…

UCCUtils.hpp

Apply clang-format to distributed/c10d folder (#107140 )

2023-08-14 23:16:38 +00:00

UnixSockUtils.hpp

[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032 )

2024-04-16 00:42:18 +00:00

Utils.cpp

[Distributed] [7/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124987 )

2024-05-11 00:03:52 +00:00

Utils.hpp

[Environment Variable][4/N] Use thread-safe getenv functions (#137843 )

2024-10-21 02:58:59 +00:00

WinSockUtils.hpp

[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032 )

2024-04-16 00:42:18 +00:00

Work.cpp

[c10d][Partial-Graph Overlap] Support calling .wait_tensor() within compiled region on output tensor of eager async_op=True collective (#137763 )

2024-10-21 06:02:57 +00:00

Work.hpp

[c10d] differentiate timeout errors from nccl errors (#138240 )

2024-10-18 01:36:32 +00:00