[c10d/nccl-pg] allow user to pass process group description (#123472)

Summary:
We need a way to let users set a custom description for a process group, e.g. FSDP, PP.

Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line to a PG and understand what the collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc, since trace analysis and benchmarks will be able to easily differentiate PG purposes such as FSDP and PP.
- Debugging lower-layer collectives (e.g. NCCL): we will be able to expose the PG desc to the NCCL communicator so that NCCL-layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d
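
A minimal sketch of the idea, not the actual c10d code: `DemoProcessGroup` and `logPrefix` are hypothetical names used only for illustration, and the empty default description is an assumption. It shows a description stored next to the group name and used to tag a log line, as in the logging use case above.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for c10d::ProcessGroup. Only the description
// plumbing is modeled here; nothing below is the real PyTorch class.
class DemoProcessGroup {
 public:
  explicit DemoProcessGroup(std::string name) : pg_name_(std::move(name)) {}

  const std::string& getGroupName() const { return pg_name_; }

  // Accessors mirroring the getGroupDesc/setGroupDesc pair in the diff.
  const std::string& getGroupDesc() const { return pg_desc_; }
  void setGroupDesc(const std::string& desc) { pg_desc_ = desc; }

 private:
  std::string pg_name_;
  std::string pg_desc_;  // Assumption: empty until the user sets it.
};

// Hypothetical helper: prefixes a collective's name with the group name
// and description so a log line can be matched to the group's purpose.
inline std::string logPrefix(const DemoProcessGroup& pg,
                             const std::string& collective) {
  return "[PG " + pg.getGroupName() + " (" + pg.getGroupDesc() + ")] " +
         collective;
}
```

With a group named `0` whose description is set to `FSDP`, `logPrefix(pg, "allreduce")` yields `[PG 0 (FSDP)] allreduce`.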

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
Shengbao Zheng
2024-04-12 08:44:21 +00:00
committed by PyTorch MergeBot
parent 73f0ecc1ac
commit 4e9094533e
9 changed files with 124 additions and 25 deletions


@@ -694,6 +694,8 @@ class TORCH_API ProcessGroup : public torch::CustomClassHolder {
const std::string& getGroupName() const;
void setGroupName(const std::string& name);
const std::string& getGroupDesc() const;
void setGroupDesc(const std::string& name);
void enableCollectivesTiming();
void release_resources() override;
@@ -724,6 +726,7 @@ class TORCH_API ProcessGroup : public torch::CustomClassHolder {
const int size_;
const c10::intrusive_ptr<Options> options_;
const BackendType backendType_;
std::string pg_desc_;
// Debug level setting. It is parsed once when ProcessGroup is constructed and
// remains the same across use of this process group.