distributed debug handlers (#126601)

This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
This commit is contained in:
Tristan Rice
2024-05-30 02:21:05 +00:00
committed by PyTorch MergeBot
parent e1c322112a
commit 3d541835d5
18 changed files with 548 additions and 0 deletions

View File

@ -178,6 +178,12 @@ new_patched_local_repository(
path = "third_party/tbb",
)
new_local_repository(
name = "cpp-httplib",
build_file = "//third_party:cpp-httplib.BUILD",
path = "third_party/cpp-httplib",
)
new_local_repository(
name = "tensorpipe",
build_file = "//third_party:tensorpipe.BUILD",