Files
pytorch/torch/csrc/distributed/c10d/exception.h
Pritam Damania 704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00

34 lines
968 B
C++

// Copyright (c) Facebook, Inc. and its affiliates.
// All rights reserved.
//
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree.
#pragma once
#include <stdexcept>
#include <c10/macros/Macros.h>
#include <c10/util/Exception.h>
// Utility macro similar to C10_THROW_ERROR, the major difference is that this
// macro handles exception types defined in the c10d namespace, whereas
// C10_THROW_ERROR requires an exception to be defined in the c10 namespace.
#define C10D_THROW_ERROR(err_type, msg) \
throw ::c10d::err_type( \
{__func__, __FILE__, static_cast<uint32_t>(__LINE__)}, msg)
namespace c10d {
using c10::DistNetworkError;
class TORCH_API SocketError : public DistNetworkError {
using DistNetworkError::DistNetworkError;
};
class TORCH_API TimeoutError : public DistNetworkError {
using DistNetworkError::DistNetworkError;
};
} // namespace c10d