Commit Graph

104 Commits

Author SHA1 Message Date
451462dbff [1/N] Add missing constructors or assignment operators (#131077)
Just mark them as deleted in most cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131077
Approved by: https://github.com/ezyang
2024-07-24 12:09:39 +00:00
cyy
05fa05cbae [2/N] Change static functions in headers to inline (#127764)
Follows #127727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764
Approved by: https://github.com/Skylion007
2024-06-04 00:49:04 +00:00
e5cce35c21 Remove use of USE_C10D (#126120)
As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271 the USE_DISTRIBUTED and USE_C10D are equivalent. In another PR I was cleaning this usage up so also cleaning it up here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126120
Approved by: https://github.com/aaronenyeshi
2024-05-15 00:00:26 +00:00
d1f45a93af Check for releasing GIL at compiletime (#116695)
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch making the code cleaner and faster.

@albanD This is the GIL change extracted from #112607 as discussed.

Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be used called times
- After `std::move(f)` the lambda is not safe to call anymore

CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-03-11 22:04:56 +00:00
cyy
5c17f66a3d [Exception] [5/N] Remove torch::IndexError (#117713)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713
Approved by: https://github.com/ezyang
2024-01-19 03:36:15 +00:00
cyy
396a5c3091 [Exception] [4/N] Replace torch::IndexError and torch::ValueError with C10 counterparts (#117317)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117317
Approved by: https://github.com/ezyang
2024-01-18 00:35:29 +00:00
cyy
2b5a201aa6 [Exception] [3/N] Replace torch::NotImplementedError and torch::LinAlgError with C10 counterparts. (#116824)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116824
Approved by: https://github.com/albanD
2024-01-11 11:27:04 +00:00
cyy
2f17a21b2b [Reland] [13/N] Enable clang-tidy on headers of torch/csrc (#117088)
Reland of #116560 and fixes the issued reported by #116695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117088
Approved by: https://github.com/albanD
2024-01-10 23:58:04 +00:00
cyy
20f769544c [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
This PR follows #116751.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2024-01-10 08:48:14 +00:00
0aa50909f3 Revert "[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)"
This reverts commit 5aa258eb09d5ecd62aea4d2bd02bbfa5eda0d554.

Reverted https://github.com/pytorch/pytorch/pull/116486 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on https://github.com/pytorch/pytorch/pull/116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116486#issuecomment-1876042948))
2024-01-03 22:18:54 +00:00
791db94c62 Revert "[13/N] Enable clang-tidy on headers of torch/csrc (#116560)"
This reverts commit b0629cdd67ea5dd264250262e0af75579ed26952.

Reverted https://github.com/pytorch/pytorch/pull/116560 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on #116353, which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116560#issuecomment-1876033363))
2024-01-03 22:08:40 +00:00
cyy
b0629cdd67 [13/N] Enable clang-tidy on headers of torch/csrc (#116560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116560
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-01-02 05:33:04 +00:00
cyy
5aa258eb09 [12/N] Apply clang-tidy and fix warnings in headers of torch/csrc (#116486)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2023-12-30 18:38:53 +00:00
cyy
d250b2158e [4/N] Fixes clang-tidy warnings in header files (#115163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115163
Approved by: https://github.com/Skylion007
2023-12-06 05:00:01 +00:00
cyy
efc7c366f4 Remove auto_gil.h (#108492)
auto_gil.h has been deprecated for a long time. We can switch to pybind11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108492
Approved by: https://github.com/Skylion007
2023-09-05 08:26:13 +00:00
704b0b3c67 [RESUBMIT] Standardize on error types for distributed errors. (#108191)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108191
Approved by: https://github.com/H-Huang
2023-08-30 21:47:39 +00:00
d4ff06ec84 Revert "Standardize on error types for distributed errors. (#107651)"
This reverts commit 0e2317479b3cb987e1f3230876654f156bd11a09.

Reverted https://github.com/pytorch/pytorch/pull/107651 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor test in trunk for one of its model moco ([comment](https://github.com/pytorch/pytorch/pull/107651#issuecomment-1696578138))
2023-08-28 23:58:33 +00:00
0e2317479b Standardize on error types for distributed errors. (#107651)
We have a plethora of error types for various errors raised from c10d. These include `RuntimeError`, `TimeoutError`, `SocketError`, `DistBackendError` etc.

This results in messy code during error handling somewhat like this:
```
if "NCCL" in exception_str:
  ...
if "Timed out initializing process group in store based barrier on rank" in exception_str:
  ...
if "The client socket has timed out after" in exception_str:
  ...
if "Broken pipe" in exception_str:
  ...
if "Connection reset by peer" in exception_str:
  ...
```

To address this issue, in this PR I've ensured added these error types:

1. **DistError** - the base type of all distributed errors
2. **DistBackendError** - this already existed and referred to PG backend errors
3. **DistStoreError** - for errors originating from the store
4. **DistNetworkError** - for general network errors coming from the socket library
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107651
Approved by: https://github.com/H-Huang
2023-08-28 21:58:15 +00:00
5970fb402e C++ CustomClass in Python: indicate which methods are not implemented (#100171)
Without these changes, it can be hard to know which magic methods are not implemented on a given ScriptObject.

before:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError
```

after:
```py
torch.ops.load_library("somelib.so")
c = torch.classes.somelib.SomeClass()
print(len(c))
# raise NotImplementedError: '__len__' is not implemented for __torch__.torch.classes.somelib.SomeClass
```

------

I could not find a linked issue, if you want me to open one as well I can do this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100171
Approved by: https://github.com/ezyang
2023-05-09 18:41:40 +00:00
3112d2a2b6 Export function symbols to enable Windows build of Intel Extension for PyTorch (#98054)
This PR is to export specific function symbols into .dll shared library on Windows platform to support Windows build for [Intel Extension for PyTorch](https://github.com/intel/intel-extension-for-pytorch).
TORCH_API/TORCH_PYTHON_API/PYBIND11_EXPORT are macros that decorate the function as dllexport while compilation, so that the function symbol will be exported into the .dll shared library file on Windows platform. It is necessary for other libraries (such as IPEX) to import and call these functions through dynamic linking of PyTorch on Windows platform.
The code changes of this PR adds decorators to export specific functions used by IPEX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98054
Approved by: https://github.com/ezyang
2023-04-05 23:23:18 +00:00
cyy
f27e09de04 Cleanup Windows warning suppression in CMake and fix some warnings in the source code (#94927)
This PR do two things:
1. It moves some Windows warning suppression from various CMake files into the main CMakeList.txt, following the conventions of gcc and clang.
2. It fixes some Windows warnings in the source code. Most importantly, it fixes lots of dll warnings by adjusting C10_API to TORCH_API or TORCH_PYTHON_API. There are still some dll warnings because some TORCH_API functions are actually built as part of libtorch_python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94927
Approved by: https://github.com/malfet
2023-02-27 19:22:20 +00:00
434eb16deb Correctly restore pybind11 error_already_set (#93238)
We would handle py::error_already_set correctly from pybind11 bindings,
but not from our regular TH bindings, which meant that anything from
an inner pybind11 function call was getting unconditionally transformed
into a RuntimeError.  Not too many cases where we do this, but
PySymNodeImpl was one of them.

To test this, I need to raise a non-RuntimeError from a function which
is invoked from pybind11 and then propagated to a non-pybind11 call
site.  I introduce GuardOnDataDependentSymNode for expressly this
purpose (this is how I discovered the bug anyway.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93238
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-01-30 16:43:01 +00:00
d8aa68c683 make sure that our error handling runs with the GIL enabled (#92848)
Fixes https://github.com/pytorch/pytorch/issues/92684

I checked the other use case of this API and they never release the GIL

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92848
Approved by: https://github.com/ngimel
2023-01-24 09:30:42 +00:00
cyy
9b716a0682 Clean up more clang-tidy supression (#92203)
1. remove unused NOLINTNEXTLINE(performance-move-const-arg)
2. add more std::move

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92203
Approved by: https://github.com/Skylion007
2023-01-17 05:43:08 +00:00
3779a75fc9 Apply noexcept to relevant move methods to improve performance (#92156)
This clang-tidy check is disabled globally due to false positives on containers, but there are a few places here where adding clang-tidy would actually improve performance (by allowing STL containers to use the move operator / assignment)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92156
Approved by: https://github.com/ngimel
2023-01-14 00:17:26 +00:00
b3603f8129 Revert "Deduplicate c10 error and PyTorchError hierarchy (#87855)"
This reverts commit 34f2d3e6ae56744c20c2f859f97101dff291bbbc.

Reverted https://github.com/pytorch/pytorch/pull/87855 on behalf of https://github.com/osalpekar due to perf regression in quantization tests
2023-01-06 19:56:35 +00:00
34f2d3e6ae Deduplicate c10 error and PyTorchError hierarchy (#87855)
Fixes #53370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87855
Approved by: https://github.com/albanD
2023-01-02 15:53:36 +00:00
77c2a8a11f Clang-Tidy: Improve ctors by removing unnecessary copies and initializations (#91538)
Apply clang-tidy fixups to prefer member initializer and modernize-pass-by-value. This is a mostly a noop, but it should make a few ctors slighlty more readable and more efficient. Also drops in some missing moves that prevents a lot of unnecessary copying.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91538
Approved by: https://github.com/ezyang
2022-12-31 07:19:30 +00:00
3d79ced8cf wrap_pybind_function: support member function pointers (#88932)
This updates `wrap_pybind_function` to use `invoke` and adds the
`invoke_traits` object which is analogous to `function_traits` but
for member functions it includes the class as an explicit argument.

To test this is working properly, I've also applied it to the
`CUDAGraph` binding code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88932
Approved by: https://github.com/albanD
2022-11-14 18:47:34 +00:00
bc66ddb5cb Add torch.distributed.DistBackendError exception type, thrown from C10D_NCCL_CHECK (#88134)
Currently all of the distributed errors are thrown from the `TORCH_CHECK` macro which throws a generic `RuntimeError`. This change introduced a new error type `DistBackendError` which derives from `RuntimeError` to signify there was an error with the backend communication library. This allows for better error handling and analysis at higher levels in the stack. Motivation: https://docs.google.com/document/d/1j6VPOkC6znscliFuiDWMuMV1_fH4Abgdq7TCHMcXai4/edit#heading=h.a9rc38misyx8

Changes:
- introduce new error type
- Update `C10D_NCCL_CHECK`

Sample script to demonstrate new error type

```python
# python -m torch.distributed.run --nproc_per_node=2 <script>.py

import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group("nccl")
    dist.broadcast(torch.tensor([1, 2, 3]).cuda(), 0)
```

Differential Revision: [D40998803](https://our.internmc.facebook.com/intern/diff/D40998803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88134
Approved by: https://github.com/rohan-varma
2022-11-08 13:26:42 +00:00
1dbc8ad3b7 Add Warning class and refactor C++ warnings to use it (#84101)
Also adds `TORCH_WARN_WITH` and `TORCH_WARN_DEPRECATION` macros

Part of #72948

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84101
Approved by: https://github.com/albanD
2022-10-18 20:02:42 +00:00
4128712397 Propagate CUDAOutOfMemoryError to Python. (#83146)
The intention is to make it easier to catch this situation for debugging,
logging, or application-specific recovery.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83146
Approved by: https://github.com/albanD
2022-08-11 21:32:11 +00:00
df69660832 Revert "Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"" (#82599)
This reverts commit 532b8a9e00d7eea2636e67621bfcfa34d9c85bcb.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82599
Approved by: https://github.com/albanD
2022-08-02 19:37:02 +00:00
532b8a9e00 Revert "Add a lint rule for torch/csrc/util/pybind.h include (#82552)"
This reverts commit 9465c0e0b50f3c37bc150ef0016238ba33eca6f4.

Reverted https://github.com/pytorch/pytorch/pull/82552 on behalf of https://github.com/zengk95 due to This seems to be breaking windows binary wheels
2022-08-01 20:25:35 +00:00
9465c0e0b5 Add a lint rule for torch/csrc/util/pybind.h include (#82552)
We define specializations for pybind11 defined templates
(in particular, PYBIND11_DECLARE_HOLDER_TYPE) and consequently
it is important that these specializations *always* be #include'd
when making use of pybind11 templates whose behavior depends on
these specializations, otherwise we can cause an ODR violation.

The easiest way to ensure that all the specializations are always
loaded is to designate a header (in this case, torch/csrc/util/pybind.h)
that ensures the specializations are defined, and then add a lint
to ensure this header is included whenever pybind11 headers are
included.

The existing grep linter didn't have enough knobs to do this
conveniently, so I added some features.  I'm open to suggestions
for how to structure the features better.  The main changes:

- Added an --allowlist-pattern flag, which turns off the grep lint
  if some other line exists.  This is used to stop the grep
  lint from complaining about pybind11 includes if the util
  include already exists.

- Added --match-first-only flag, which lets grep only match against
  the first matching line.  This is because, even if there are multiple
  includes that are problematic, I only need to fix one of them.
  We don't /really/ need this, but when I was running lintrunner -a
  to fixup the preexisting codebase it was annoying without this,
  as the lintrunner overall driver fails if there are multiple edits
  on the same file.

I excluded any files that didn't otherwise have a dependency on
torch/ATen, this was mostly caffe2 and the valgrind wrapper compat
bindings.

Note the grep replacement is kind of crappy, but clang-tidy lint
cleaned it up in most cases.

See also https://github.com/pybind/pybind11/issues/4099

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82552
Approved by: https://github.com/albanD
2022-08-01 17:16:58 +00:00
71c24a6a2e Reduce string formatting overhead in PyWarningHandler
Closes #76952

This does `processErrorMsg` inplace on the warning string, so that in
the fast-path of no type translation it doesn't need to allocate a new
string just to copy the contents over. I also replaced `ostringstream`
with `fmt::format_to` which has noticably better performance.

Overall in a benchmark of `torch.floor_divide`, this drops the
callgrind instruction count from 703,168 to 571,774 and the bechmark
improves by 300 ns from 2.26 us to 1.94 us.

This brings the callgrind count for `~PyWarningHandler` up to ~80%
from `PyErr_WarnEx` so this is probably about as fast as our warning
handling can reasonably get.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76977

Approved by: https://github.com/swolchok
2022-06-21 00:04:17 +00:00
30fb2c4aba [lint] autoformat test/cpp and torch/csrc
Let's have some fun.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78828

Approved by: https://github.com/ezyang
2022-06-11 21:11:16 +00:00
272890998e [JIT] pass more exception info through the JIT interpreter
If TORCH_SHOW_CPP_STACKTRACES=1, then dump e.what() into the RuntimeError, which should make it easier to debug exceptions that happen within interpreted sections.

Test:
```patch
diff --git a/test/cpp/jit/test_dce.cpp b/test/cpp/jit/test_dce.cpp
index 6f9161d0d9..7c574787cf 100644
--- a/test/cpp/jit/test_dce.cpp
+++ b/test/cpp/jit/test_dce.cpp
@@ -3,6 +3,10 @@
 #include <torch/csrc/jit/ir/irparser.h>
 #include <torch/csrc/jit/passes/dead_code_elimination.h>
 #include <torch/csrc/jit/testing/file_check.h>
+#include <torch/csrc/jit/runtime/interpreter.h>
+#include <test/cpp/jit/test_utils.h>
+
+#include <ATen/ATen.h>

 namespace torch {
 namespace jit {
@@ -48,5 +52,30 @@ graph():
   // Check that dead code elimin
   testing::FileCheck().run(input, *graph);
 }
+
+TEST(EliminateDeadCodeTest, interpreterfailure) {
+  const std::string input = R"IR(
+graph(%x.1 : Tensor):
+  %2 : int = prim::Constant[value=128]() # /data/users/dberard/scripts/DGB/sz.py:4:38
+  %3 : int = prim::Constant[value=256]() # /data/users/dberard/scripts/DGB/sz.py:4:43
+  %5 : int = prim::Constant[value=1]() # /data/users/dberard/scripts/DGB/sz.py:4:53
+  %4 : int[] = prim::ListConstruct(%2, %3)
+  %6 : Tensor[] = aten::split_with_sizes(%x.1, %4, %5) # /data/users/dberard/scripts/DGB/sz.py:4:11
+  return (%6)
+)IR";
+  auto graph = std::make_shared<Graph>();
+  parseIR(input, graph.get());
+
+  //auto stack = createStack({at::randn({2, 383}, at::kCPU)});
+  auto stack = createStack({at::Tensor{}});
+
+  Code code(graph, "");
+  InterpreterState interpreter{code};
+  interpreter.run(stack);
+ ASSERT_EQ(2, stack.size());
+  ASSERT_FALSE(stack[0].toTensor().defined());
+  ASSERT_FALSE(stack[1].toTensor().defined());
+}
+
 } // namespace jit
 } // namespace torch
```

^ use this to repro the interpreter issue: `TORCH_SHOW_CPP_STACKTRACES=1 ./bin/test_jit --gtest_filter="EliminateDeadCodeTest.interpreterfailure"` and the stack trace is shown.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75682

Approved by: https://github.com/eellison
2022-04-21 18:26:49 +00:00
e1db2f13ce Refactor TORCH_DISTRIBUTED_DEBUG implementation (#73166)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73166

This PR refactors, cleans up, and optimizes the implementation of `TORCH_DISTRIBUTED_DEBUG`. It also introduces three new user APIs: `get_debug_level()`, `set_debug_level()`, and `set_debug_level_from_env()` to retrieve and modify the debug level after a process has started.
ghstack-source-id: 149778566

Test Plan: Run the existing unit tests.

Reviewed By: rohan-varma

Differential Revision: D34371226

fbshipit-source-id: e18443b411adcbaf39b2ec999178c198052fcd5b
(cherry picked from commit 26d6bb1584b83a0490d8b766482656a5887fa21d)
2022-02-24 02:33:05 +00:00
d100d98db8 torch.linalg routines return torch.linalg.LinAlgError when a numerical error in the computation is found. (#68571)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/64785 by introducing a `torch.LinAlgError` for reporting errors caused by bad values in linear algebra routines which should allow users to easily catch errors caused by numerical errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68571

Reviewed By: malfet

Differential Revision: D33254087

Pulled By: albanD

fbshipit-source-id: 94b59000fdb6a9765e397158e526d1f815f18f0f
2021-12-23 10:53:26 -08:00
96fe82ac3c HANDLE_TH_ERRORS: Move exception translation out of line (#69974)
Summary:
I've noticed that the `HANDLE_TH_ERRORS` macros are actually very expensive in terms of compile time.  Moving the bulk of the catch statements out of line using a lippincott function significantly improves compile times and object file binary sizes. For just the generated autograd bindings, this halves serial build time from 8 minutes to 4 and binary size is more than halved for most files with the biggest difference being `python_variable_methods.cpp` which went from 126 MB to 43 MB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69974

Reviewed By: mruberry

Differential Revision: D33160899

Pulled By: albanD

fbshipit-source-id: fc35fa86f69ffe5a0752557be30b438c8564e998
2021-12-16 11:04:48 -08:00
cd9da3267c Rationalize API exports in torch_python (#68095)
Summary:
This renames `WindowsTorchApiMacro.h` to `Export.h` to mirror the c10 header `c10/macros/Export.h` and also updates it to use `C10_EXPORT`/`C10_IMPORT`. This also removes the `THP_API` macro from `THP_export.h` which appears to serve the same purpose.

cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68095

Reviewed By: jbschlosser

Differential Revision: D32810881

Pulled By: albanD

fbshipit-source-id: d6949ccd0d80d6c3e5ec1264207611fcfe2503e3
2021-12-07 15:24:37 -08:00
6e640a0acf Revise the socket implementation of c10d (#68226)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68226

**Note that this PR is unusually big due to the urgency of the changes. Please reach out to me in case you wish to have a "pair" review.**

This PR introduces a major refactoring of the socket implementation of the C10d library. A big portion of the logic is now contained in the `Socket` class and a follow-up PR will further consolidate the remaining parts. As of today the changes in this PR offer:

 - significantly better error handling and much more verbose logging (see the example output below)
 - explicit support for IPv6 and dual-stack sockets
 - correct handling of signal interrupts
 - better Windows support

A follow-up PR will consolidate `send`/`recv` logic into `Socket` and fully migrate to non-blocking sockets.

## Example Output

```
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[W logging.h:28] The server socket on [localhost]:29501 is not yet listening (Error: 111 - Connection refused), retrying...
[I logging.h:21] The server socket will attempt to listen on an IPv6 address.
[I logging.h:21] The server socket is attempting to listen on [::]:29501.
[I logging.h:21] The server socket has started to listen on [::]:29501.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42650.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42650.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42722.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42722.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42724.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42724.
[I logging.h:21] The client socket will attempt to connect to an IPv6 address on (127.0.0.1, 29501).
[I logging.h:21] The client socket is attempting to connect to [localhost]:29501.
[I logging.h:21] The client socket has connected to [localhost]:29501 on [localhost]:42726.
[I logging.h:21] The server socket on [::]:29501 has accepted a connection from [localhost]:42726.
```
ghstack-source-id: 143501987

Test Plan: Run existing unit and integration tests on devserver, Fedora, Ubuntu, macOS Big Sur, Windows 10.

Reviewed By: Babar, wilson100hong, mrshenli

Differential Revision: D32372333

fbshipit-source-id: 2204ffa28ed0d3683a9cb3ebe1ea8d92a831325a
2021-11-16 20:49:25 -08:00
5f45927d15 Autograd: Delay warnings until the end of backward execution (#66235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209

This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.

For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235

Reviewed By: ejguan

Differential Revision: D31505413

Pulled By: albanD

fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
2021-10-13 15:38:04 -07:00
db4b68b3ac Back out "Eagerly populate python_error::what() when TORCH_SHOW_CPP_STACKTRACES=1"
Summary: Original commit changeset: 9cfda47cafb3

Test Plan: unland

Reviewed By: ezyang

Differential Revision: D31116643

fbshipit-source-id: 631eea446ed48c63ca39281d24163a2eadbe8d12
2021-09-22 10:37:27 -07:00
3c6d9fd124 Eagerly populate python_error::what() when TORCH_SHOW_CPP_STACKTRACES=1 (#65376)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65376

Let's suppose there's a bug in PyTorch and python_error gets thrown
and never gets caught.  Typically, you'll get a very useless error
message like this:

```
terminate called after throwing an instance of 'python_error'
  what():
  Aborted (core dumped)
```

Now, you'll get:

```
what():  unknown Python error (for more information, try rerunning with TORCH_SHOW_CPP_STACKTRACES=1)
```

and with TORCH_SHOW_CPP_STACKTRACES=1 you'll get:

```
what():  error message from Python object
```

If we're OK with making Python exceptions go even slower, we could
eagerly populate unconditionally.  I'm also not so happy we don't get
a Python backtrace or the Python error name, that's worth improving
(this is a minimal diff to get the discussion going.)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31067632

Pulled By: ezyang

fbshipit-source-id: 9cfda47cafb349ee3d6853cdfb0f319073b87bff
2021-09-22 07:12:28 -07:00
ee44d73e59 Modernize override (#61744)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61744

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D29717320

fbshipit-source-id: 6eea4295ee2e5572ab337620be412376fcc2f3cc
2021-07-23 23:04:46 -07:00
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
As GoogleTest `TEST` macro is non-compliant with it as well as `DEFINE_DISPATCH`

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
b162d95e46 Fix a number of lint perf and safety issues in torch (#59897)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/59897

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D29037012

fbshipit-source-id: 7c16286d5fc2b67964fb65f8374dfff4d1a7aefb
2021-06-15 13:14:51 -07:00
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00