Flight recorder data as JSON (#129505)

Summary:
Provide a new API to retrieve flight recorder data as JSON.
The one minor difference between the Pickle and JSON flight recorder APIs is
that the JSON API does not currently retrieve stack traces, because
including them produces far too much data.
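
A rough sketch of how a caller might consume the new API. The binding returns bytes, so the result has to be decoded and parsed before use. Here `load_flight_recorder_json` and `fake_dump` are hypothetical helpers and the payload keys are invented for illustration; only the keyword names `includeCollectives` and `onlyActive`, their defaults, and the bytes return type come from the binding added in this commit.

```python
import json
from typing import Any, Callable, Optional


def load_flight_recorder_json(
    dump_fn: Callable[..., bytes],
    include_collectives: Optional[bool] = None,
    only_active: Optional[bool] = None,
) -> Any:
    """Call a JSON dump function and parse the bytes it returns."""
    raw = dump_fn(includeCollectives=include_collectives, onlyActive=only_active)
    return json.loads(raw.decode("utf-8"))


def fake_dump(includeCollectives=None, onlyActive=None) -> bytes:
    # Hypothetical stand-in for the real binding so the sketch runs without NCCL.
    # It mirrors the binding's defaults: includeCollectives=True, onlyActive=False.
    include = True if includeCollectives is None else includeCollectives
    active = False if onlyActive is None else onlyActive
    # These keys are made up; the real JSON schema is not shown in this diff.
    payload = {"include_collectives": include, "only_active": active}
    return json.dumps(payload).encode("utf-8")


trace = load_flight_recorder_json(fake_dump)
print(trace)
```

In a PyTorch build with NCCL support, `dump_fn` would presumably be the `_dump_nccl_trace_json` binding itself; the Python module path it is exposed under is not shown in this diff, so it is left out here.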

Test Plan:
unit test

Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505
Approved by: https://github.com/wconstab, https://github.com/d4l3k
Chirag Pandya
2024-07-09 11:45:06 -07:00
committed by PyTorch MergeBot
parent 86bca69c5f
commit 83c95c48f7
5 changed files with 253 additions and 74 deletions

@@ -3194,6 +3194,23 @@ such as `dist.all_reduce(tensor, async_op=True)`.
Arguments:
tensors(List[torch.Tensor]): List of tensors we want to hash.
)");
module.def(
"_dump_nccl_trace_json",
[](std::optional<bool> includeCollectives,
std::optional<bool> onlyActive) {
return py::bytes(::c10d::dump_nccl_trace_json(
includeCollectives.value_or(true), onlyActive.value_or(false)));
},
py::arg("includeCollectives") = std::optional<bool>(),
py::arg("onlyActive") = std::optional<bool>(),
R"(
Arguments:
includeCollectives(bool, optional): Whether to include collective work traces. Default is True.
onlyActive (bool, optional): Whether to only include active collective work traces. Default is False.
Returns:
Stringified json work traces.
Default settings return everything - i.e. contains NCCL comm dumps and collective traces.
)");
module.def(
"_dump_nccl_trace",
[](std::optional<bool> includeCollectives,