No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace the Boost optional implementation with a `std::optional` wrapper (see the sketch after this list)
- Make it compilable with gcc-14.x and newer by including `cstddef` in a few headers
- Avoid using deprecated enums for macOS builds
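For illustration, the replacement amounts to a thin alias layer over `std::optional` (a minimal sketch; the namespace and names here are hypothetical, not the exact code):
```
#include <optional>

// Hypothetical compatibility shim: the vendored Boost-style optional is
// replaced by aliases onto the standard library.
namespace compat {
template <typename T>
using optional = std::optional<T>;
using nullopt_t = std::nullopt_t;
inline constexpr std::nullopt_t nullopt = std::nullopt;
} // namespace compat
```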
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what the reason was for warning about extra semicolons but not erroring on them when building with `-Werror`, introduced by https://github.com/pytorch/pytorch/pull/140236)
300+ violations of that rule were fixed by running `sed -i -e "s/});/})/"` against `torch/nativert`
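For reference, a generic illustration of the pattern `-Wextra-semi` flags (not taken from the actual diff):
```
struct Handler {
  void run() {};  // -Wextra-semi: redundant ';' after member function definition
};
```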
Other third-party deps that need updates:
- TensorPipe
- LLVM
- FBGEMM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62960
A bug was filed a few years ago about sending sparse tensors over RPC (#30807).
This PR updates the RPC/TensorPipe logic for CUDA sparse tensors. During serialization, the pickler.cpp implementation breaks a sparse tensor down into two tensors plus metadata. torch/csrc/distributed/rpc/tensorpipe_agent.cpp needs to be updated because it has no logic for sparse tensors: it pushes a single device for a sparse tensor, which is wrong because after serialization there are two tensors, and the second tensor would be left without a device and end up on the wrong target device. tensorpipe_utils.cpp needs to be updated because deserialization happens after the data is received on the target pipe: it takes the two tensors and the metadata that were sent and rebuilds the sparse tensor. There are two tpDescriptors but only one tensor after deserialization, so the logic is updated to verify that the sparse tensor is on the correct device using the first tpDescriptor.
This PR also updates ivalue.cpp and ivalue.h to support more paths for Sparse COO tensors.
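For illustration, a sparse COO tensor is backed by two dense tensors, which is why pushing a single device is not enough (a sketch using the C++ API; not taken from the tests):
```
#include <torch/torch.h>

void example() {
  // A sparse COO tensor is an indices tensor (2 x nnz), a values tensor
  // (nnz), and metadata (sizes, dtype, ...).
  auto indices = torch::tensor({{0, 1}, {2, 0}});
  auto values = torch::tensor({3.0, 4.0});
  auto sparse = torch::sparse_coo_tensor(indices, values, {3, 3});
  // After pickling, the payload carries indices and values as two separate
  // tensors, so the agent must push a target device for each of them.
}
```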
I tested these changes by adding sparse tests to rpc_test.py and dist_autograd_test.py.
Test Plan: Imported from OSS
Reviewed By: gchanan
Differential Revision: D30717285
Pulled By: gcramer23
fbshipit-source-id: daee9a56764550f56b131f9dd8e74e23113d6714
Summary:
As the GoogleTest `TEST` macro is non-compliant with the `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes except the ones to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
           | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
           | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59377
This PR demonstrates that now the CUDA parts of the TensorPipe agent just "plug on top" of the CPU-only parts. Thus ideally the CPU-only parts could go in libtorch while the CUDA-only parts could go in libtorch_cuda. Unfortunately we can't do that just yet, because the TensorPipe agent depends on c10d (for its Store and its ProcessGroup), which lives in libtorch_python.
ghstack-source-id: 131326168
Test Plan: CI
Reviewed By: cbalioglu
Differential Revision: D28796429
fbshipit-source-id: 41b2eb8400c0da282f3750a4eea21ad83ee4a175
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59376
This is an experiment. The end goal is to separate the CUDA-specific aspects of the TensorPipe agent so that they can be plugged "on top" of the CPU-only parts. This will then allow moving the TP agent to libtorch (because libtorch is split into a CPU and a CUDA part; right now the agent lives in libtorch_python), although unfortunately other conditions also need to be met for this to happen.
The only instance where we had CPU and CUDA logic within the same code, guarded by `#ifdef USE_CUDA`, is the serialization/deserialization code. I'm thus introducing a sort-of registry in order to "decentralize it". It's not a c10::Registry, because that's overkill (it uses an unordered_map, with strings as keys): here we can just use an array with integers as "keys".
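A minimal sketch of the array-based registration scheme described above (identifier names are hypothetical, not the actual ones):
```
#include <array>
#include <atomic>
#include <c10/core/DeviceType.h>

// One converter interface per device type; CPU and CUDA implementations
// register themselves at static-initialization time from their own library.
struct TensorpipeConverter {
  virtual ~TensorpipeConverter() = default;
  // virtual serialize()/deserialize() hooks would go here
};

// Plain array indexed by the integer value of the device type; no
// unordered_map or string keys needed.
std::array<std::atomic<const TensorpipeConverter*>,
           static_cast<size_t>(c10::DeviceType::COMPILE_TIME_MAX_DEVICE_TYPES)>
    converter_registry;

void registerConverter(c10::DeviceType type, const TensorpipeConverter* conv) {
  converter_registry[static_cast<size_t>(type)].store(conv);
}
```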
ghstack-source-id: 131326167
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28796428
fbshipit-source-id: b52df832e0c0abf489a9e418353103496382ea41
Summary:
Switches most of the simple for loops outside of `jit` directories to use `c10::irange`.
Generated with D28874212.
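The transformation is mechanical; roughly:
```
#include <c10/util/irange.h>

void example(int64_t n) {
  // before:
  for (int64_t i = 0; i < n; i++) { /* ... */ }
  // after:
  for (const auto i : c10::irange(n)) { /* ... */ }
}
```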
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59481
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D28909681
fbshipit-source-id: ec9ab1bd602933238d9d0f73d4d8d027b75d9d85
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59212
Reland of https://github.com/pytorch/pytorch/pull/58428
Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
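Conceptually, the synchronization moves from "wait on the input streams" to "wait on whatever events the Future recorded"; a hedged sketch (how the events are obtained from the Future is elided, and the container is hypothetical):
```
#include <ATen/cuda/CUDAEvent.h>
#include <c10/cuda/CUDAStream.h>
#include <vector>

// Make the caller's current streams wait on the events the Future recorded,
// instead of assuming the outputs live on the input streams.
void syncWithFutureEvents(std::vector<at::cuda::CUDAEvent>& events) {
  for (auto& event : events) {
    event.block(c10::cuda::getCurrentCUDAStream(event.device_index()));
  }
}
```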
ghstack-source-id: 130202842
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623885
fbshipit-source-id: 29333bcb75d077ab801eac92017d0e381e8f5569
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59205
Reland of https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere I think we eliminate, or at least simplify, many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
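The shape of the change, in simplified form (the `Message` members and `send` signature are sketches, not the real ones):
```
#include <c10/util/intrusive_ptr.h>
#include <utility>

struct Message : c10::intrusive_ptr_target {
  // type, payload, tensors, ...
};

// before: void send(Message&& message);  // rvalue ref just to force a move
// after: ownership transfer is explicit and copies are refcount bumps
void send(c10::intrusive_ptr<Message> message) {
  // hand the message to the pipe ...
  (void)message;
}

void caller() {
  auto m = c10::make_intrusive<Message>();
  send(std::move(m));  // move to transfer, or pass `m` to share
}
```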
ghstack-source-id: 130202849
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28623891
fbshipit-source-id: c9aeea3440679a11741ca78c06b03c57cb815a5e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58428
Until now, the TP agent expected the output of a remote function to be on the same streams as the inputs. In other words, it used the lazy stream context of the inputs to synchronize the output tensors. This was true in the most common case of a synchronous remote function. However it wasn't true for async functions, for fetching RRefs, ... The more generic way is to use the CUDA events held by the Future to perform this synchronization. (These events may be on the input streams, or they may not be!).
ghstack-source-id: 129567045
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474982
fbshipit-source-id: c0034eb3f2a2ea525efb63a31b839bc086060e7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58422
Similar to Future (which I tackled recently), Message is an ivalue type (a "custom class" one), and the natural way to represent it is inside an intrusive_ptr. However in the RPC code we had a mix of usages, often passing Message by value. This has undesirable consequences, as it could easily trigger a copy by accident, which I believe is why in many places we accepted _rvalue references_ to Message, in order to force the caller to move. In my experience this is non-idiomatic in C++ (normally a function signature specifies how the function consumes its arguments, and it's up to the caller to then decide whether to copy or move).
By moving to intrusive_ptr everywhere I think we eliminate, or at least simplify, many of the problems above.
In this PR I do half of the migration, by updating everything except the `toMessageImpl` methods, which will come in the next PR.
ghstack-source-id: 129567053
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28474878
fbshipit-source-id: 5b76d45e05f6fa58c831e369c5c964d126187a6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57294
With the advent of CPUs in the device maps, and to be more generic (e.g., to support AMD GPUs), and to avoid conversions when passing to Future and RRef and such, it's easier to use Devices instead of DeviceIndices. This started by just migrating the TensorPipe agent but the RPC layer is quite intertwined so I had to migrate a lot of stuff.
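The gist of the type change (the aliases are illustrative):
```
#include <c10/core/Device.h>
#include <unordered_map>

// before: indices only, which is CUDA-centric (CPU has no meaningful index)
using OldDeviceMap = std::unordered_map<c10::DeviceIndex, c10::DeviceIndex>;

// after: full Devices, so cpu, cuda:0, cuda:1, ... can all appear as keys
// (c10::Device provides a std::hash specialization, so this just works)
using NewDeviceMap = std::unordered_map<c10::Device, c10::Device>;
```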
ghstack-source-id: 127916562
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D28092733
fbshipit-source-id: 024dcb3648c5898ab13e770413c43958f04f1a8a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56346
Now that TensorPipe's API has `targetDevice`, use that instead of
manually writing the CUDA device index in `metadata`.
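A simplified analogue of the descriptor change (not the actual TensorPipe types):
```
#include <optional>
#include <string>

// A structured targetDevice field replaces a device index that was
// previously smuggled through the free-form metadata string.
struct Device {
  std::string type; // "cpu" or "cuda"
  int index;
};

struct TensorDescriptor {
  std::string metadata;                // before: carried "0", "1", ... as text
  std::optional<Device> targetDevice;  // after: explicit and typed
};
```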
Test Plan: CI
Reviewed By: lw
Differential Revision: D27703235
fbshipit-source-id: c5b620e3b3ce619367412efdbe9fa3778f6b8869
Summary: The `tensorpipe::Buffer::deviceType()` method is going away.
Test Plan: CI
Reviewed By: lw
Differential Revision: D27478436
fbshipit-source-id: 3962257bc6237d1dde7e5f4fddae38abe8384c68
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55136
This will ease the transition to the new API where `Buffer` does not
store a length anymore.
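In other words, call sites move from reading the length off the buffer to passing it alongside (a simplified sketch with hypothetical names, not the actual signatures):
```
#include <cstddef>

struct CpuBuffer {
  void* ptr;
  // before: size_t length;  // stored in the buffer itself
};

// after: the length travels next to the buffer in channel APIs
void send(CpuBuffer buffer, size_t length) {
  // enqueue {buffer.ptr, length} on the channel ...
  (void)buffer;
  (void)length;
}
```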
Test Plan: CI
Reviewed By: lw
Differential Revision: D27466385
fbshipit-source-id: 9a167f8c501455a3ab49ce75257c69d8b4869925
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54251
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/324
In order to merge the channel hierarchies, we need a generic `Buffer` type, that can wrap either a `CpuBuffer` or a `CudaBuffer`.
The constraints are that, since this type is used by the channels, it cannot explicitly refer to `CudaBuffer`. We propose here a type-erasure based solution, with small-buffer optimization to avoid heap-allocating the wrapped concrete buffer.
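A minimal sketch of the type-erasure-with-small-buffer-optimization idea (simplified, assumption-level code, not the actual TensorPipe implementation):
```
#include <cstddef>
#include <new>
#include <utility>

// The concrete buffer is stored inline in `storage_`, so wrapping it
// requires no heap allocation.
class Buffer {
 public:
  template <typename TBuffer>
  /* implicit */ Buffer(TBuffer b) {
    static_assert(sizeof(Impl<TBuffer>) <= kSlotSize, "enlarge kSlotSize");
    new (&storage_) Impl<TBuffer>(std::move(b));
  }

  Buffer(Buffer&& other) noexcept { other.impl().moveConstructInto(&storage_); }
  ~Buffer() { impl().~ImplBase(); }

 private:
  struct ImplBase {
    virtual ~ImplBase() = default;
    virtual void moveConstructInto(void* dst) = 0;
  };

  template <typename TBuffer>
  struct Impl final : ImplBase {
    explicit Impl(TBuffer b) : buffer(std::move(b)) {}
    void moveConstructInto(void* dst) override {
      new (dst) Impl(std::move(buffer));
    }
    TBuffer buffer;
  };

  static constexpr std::size_t kSlotSize = 64;  // inline slot for the Impl
  ImplBase& impl() {
    return *std::launder(reinterpret_cast<ImplBase*>(&storage_));
  }
  alignas(std::max_align_t) unsigned char storage_[kSlotSize];
};

// usage with hypothetical concrete buffer types:
//   Buffer b{CpuBuffer{ptr, len}};  // channels only ever see Buffer
```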
This is a new version of D27001339 (c618dc13d2) which broke PyTorch OSS build.
Test Plan: CI
Reviewed By: lw, mrshenli
Differential Revision: D27156053
fbshipit-source-id: 4244302af33a3be91dcd06093c0d6045d081d3cc
Summary:
Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/322
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54145
In order to merge the channel hierarchies, we need a generic `Buffer` type, that can wrap either a `CpuBuffer` or a `CudaBuffer`.
The constraints are that, since this type is used by the channels, it cannot explicitly refer to `CudaBuffer`. We propose here a type-erasure based solution, with small-buffer optimization to avoid heap-allocating the wrapped concrete buffer.
ghstack-source-id: 124131499
Test Plan: CI
Reviewed By: lw
Differential Revision: D27001339
fbshipit-source-id: 26d7dc19d69d7e3336df6fd4ff6ec118dc17c5b6