Unify PyTorch mobile's threadpool usage. (#37243)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37243

*** Why ***

As it stands, we have two thread pool solutions concurrently in use in PyTorch mobile: (1) the open source pthreadpool library under third_party, and (2) Caffe2's implementation of pthreadpool under caffe2/utils/threadpool. Since the primary use case of the latter has been to act as a drop-in replacement for the third-party version, enabling integration and usage from within NNPACK and QNNPACK, Caffe2's implementation is intentionally written to the exact same interface as the third-party version.

The original argument in favor of C2's implementation was improved performance as a result of using spin locks, as opposed to relinquishing the thread's time slice and putting it to sleep - a less expensive operation, up to a point. That seemed to give C2's implementation the upper hand in performance, justifying the added maintenance complexity, until the third-party version improved to the point of surpassing the efficiency of C2's implementation, as I have verified in benchmarks. With that advantage gone, there is no reason to continue using C2's implementation in PyTorch mobile, either from the perspective of performance or of code hygiene. As a matter of fact, there is considerable performance benefit to be had from using the third-party version as it currently stands.

This is a tricky change though, mainly because, in order to avoid potential performance regressions (of which I have witnessed none, but out of an abundance of caution), we have decided to continue using C2's internal implementation whenever building for Caffe2. Again, this is mainly to avoid potential performance regressions in production C2 use cases, even if, as far as I can tell, doing so results in reduced performance.

To summarize: today, we are using C2's implementation for (1) NNPACK, (2) PyTorch QNNPACK, and (3) ATen parallel_for on mobile builds, while using the third-party version of pthreadpool for XNNPACK, since XNNPACK, unlike NNPACK and QNNPACK, does not provide any build options to link against an external implementation. The goal of this PR, then, is to unify all usage on mobile to the third-party implementation, both for improved performance and for better code hygiene. This applies to PyTorch's use of NNPACK, QNNPACK, XNNPACK, and mobile's implementation of ATen parallel_for, all of which get routed to the exact same third-party implementation in this PR. Considering that NNPACK, QNNPACK, and XNNPACK are not mobile specific, these benefits carry over to non-mobile builds of PyTorch (but not Caffe2) as well. The implementation of ATen parallel_for on non-mobile builds remains unchanged.

*** How ***

This is where things get tricky. A good deal of the build system complexity in this PR arises from our desire to keep C2's implementation intact for C2's use. pthreadpool is a C library with no concept of namespaces, which means two copies of the library cannot exist in the same binary without symbol collisions that violate the one-definition rule (ODR). This means that somehow, and based on some condition, we must decide on the choice of a pthreadpool implementation. In practice, this has become more complicated as a result of all the possible combinations that USE_NNPACK, USE_QNNPACK, USE_PYTORCH_QNNPACK, USE_XNNPACK, USE_SYSTEM_XNNPACK, USE_SYSTEM_PTHREADPOOL, and other variables can produce.
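As a minimal illustration of the symbol-collision constraint just described (a sketch, not code from either library: file names and bodies are hypothetical, the signature is simplified, and only the function name mirrors the real pthreadpool C entry point):

#include <cstddef>  // for std::size_t (assume this include in each sketched file)

// a.cc - stands in for third_party/pthreadpool's definition
extern "C" std::size_t pthreadpool_get_threads_count(void* /*threadpool*/) {
  return 4;  // placeholder body
}

// b.cc - stands in for Caffe2's drop-in replacement before this PR
extern "C" std::size_t pthreadpool_get_threads_count(void* /*threadpool*/) {
  return 1;  // placeholder body
}

// Linking a.o and b.o into the same binary fails with a duplicate-symbol error,
// because C provides no namespaces to disambiguate the two definitions. Renaming
// one copy, as this PR does with the legacy_pthreadpool_* prefix, gives the two
// implementations distinct symbols so both can be built when needed.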
Having said that, I have done my best in this PR to surgically cut through this complexity in a way that minimizes the side effects, considering the significance of the performance we are leaving on the table; yet, as a result of the combinatorial explosion explained above, I cannot guarantee that every single combination will work as expected on the first try. I am heavily relying on CI to find any issues, as local testing can only go so far.

This PR provides a simple, non mobile-specific C++ thread pool implementation on top of pthreadpool, namely caffe2::PThreadPool, that automatically routes to C2's implementation or the third-party version depending on the build configuration. This simplifies the logic at the cost of pushing the complexity into the build scripts. From there on, this thread pool is used in ATen parallel_for and in NNPACK and family, again routing all usage of threading to C2's or the third-party pthreadpool depending on the build configuration. When all is said and done, the layering will look like this:

a) aten::parallel_for, uses
b) caffe2::PThreadPool, which uses
c) the pthreadpool C API, which delegates to
   c-1) the third_party implementation of pthreadpool, if that is what the build has requested, and the rabbit hole ends here, or
   c-2) C2's implementation of pthreadpool, if that is what the build has requested, which itself delegates to
        c-2-1) caffe2::ThreadPool, and the rabbit hole ends here.

NNPACK and (PyTorch) QNNPACK directly hook into (c); they never go through (b).

Differential Revision: D21232894

Test Plan: Imported from OSS

Reviewed By: dreiss

Pulled By: AshkanAliabadi

fbshipit-source-id: 8b3de86247fbc3a327e811983e082f9d40081354
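To make the layering above concrete, here is a minimal sketch of a C++ wrapper over the pthreadpool C API in the spirit of caffe2::PThreadPool. It is an illustration under stated assumptions, not the actual PyTorch source: the class name, member names, and std::function-based interface are invented for this example; only the pthreadpool_* calls are the real third-party C API.

#include <cstddef>
#include <functional>

#include <pthreadpool.h>  // third-party pthreadpool C API

class ThreadPoolWrapper {
 public:
  explicit ThreadPoolWrapper(std::size_t thread_count)
      : pool_(pthreadpool_create(thread_count)) {}

  ~ThreadPoolWrapper() {
    pthreadpool_destroy(pool_);
  }

  std::size_t get_thread_count() const {
    return pthreadpool_get_threads_count(pool_);
  }

  // Runs fn(i) for every i in [0, range), distributing iterations across the
  // pool's worker threads. A layer like aten::parallel_for can sit on top of
  // an interface of this shape.
  void run(const std::function<void(std::size_t)>& fn, std::size_t range) {
    struct Context {
      const std::function<void(std::size_t)>& fn;
    } context{fn};

    pthreadpool_parallelize_1d(
        pool_,
        // A captureless lambda decays to the C callback type void (*)(void*, size_t).
        [](void* raw_context, std::size_t index) {
          static_cast<Context*>(raw_context)->fn(index);
        },
        &context,
        range,
        /*flags=*/0u);
  }

 private:
  pthreadpool_t pool_;
};

Whether the pthreadpool_* symbols above resolve to the third-party implementation or to Caffe2's shim is exactly the decision this PR pushes into the build scripts.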
Committed by: Facebook GitHub Bot
Parent: c7d79f35e3
Commit: b9d3869df3
@@ -6,9 +6,9 @@
 // External API
 //
 
-void pthreadpool_compute_1d(
-    pthreadpool_t threadpool,
-    pthreadpool_function_1d_t function,
+void legacy_pthreadpool_compute_1d(
+    legacy_pthreadpool_t threadpool,
+    legacy_pthreadpool_function_1d_t function,
     void* argument,
     size_t range) {
   if (threadpool == nullptr) {
@@ -27,30 +27,31 @@ void pthreadpool_compute_1d(
       range);
 }
 
-size_t pthreadpool_get_threads_count(pthreadpool_t threadpool) {
-  // The current fix is only useful when XNNPACK calls pthreadpool_get_threads_count with nullptr.
+void legacy_pthreadpool_parallelize_1d(
+    const legacy_pthreadpool_t threadpool,
+    const legacy_pthreadpool_function_1d_t function,
+    void* const argument,
+    const size_t range,
+    uint32_t) {
+  legacy_pthreadpool_compute_1d(threadpool, function, argument, range);
+}
+
+size_t legacy_pthreadpool_get_threads_count(legacy_pthreadpool_t threadpool) {
+  // The current fix is only useful when XNNPACK calls legacy_pthreadpool_get_threads_count with nullptr.
   if (threadpool == nullptr) {
     return 1;
   }
   return reinterpret_cast<caffe2::ThreadPool*>(threadpool)->getNumThreads();
   // TODO: Future fix: If we keep maintaining two different threadpools,
   // the old C2 one and the new one for XNNPACK, then we have two different pthreadpool pointer
   // types. One is caffe2::ThreadPool*, the other is pthreadpool* (pthreadpool_new_if_impl.c).
   // XNNPACK calls pthreadpool_get_threads_count during op setup using pthreadpool*, and
   // uses the _parallelize_ interface for actual work.
   // NNPACK, by contrast, uses caffe2::ThreadPool*.
   // Thus, if pthreadpool_get_threads_count is called from XNNPACK, we cannot
   // reinterpret_cast it to ThreadPool. It will segfault or, worse, exhibit undefined behavior.
 }
 
-pthreadpool_t pthreadpool_create(size_t threads_count) {
+legacy_pthreadpool_t legacy_pthreadpool_create(size_t threads_count) {
   std::mutex thread_pool_creation_mutex_;
   std::lock_guard<std::mutex> guard(thread_pool_creation_mutex_);
 
-  return reinterpret_cast<pthreadpool_t>(new caffe2::ThreadPool(threads_count));
+  return reinterpret_cast<legacy_pthreadpool_t>(new caffe2::ThreadPool(threads_count));
 }
 
-void pthreadpool_destroy(pthreadpool_t pthreadpool) {
+void legacy_pthreadpool_destroy(legacy_pthreadpool_t pthreadpool) {
   if (pthreadpool) {
     caffe2::ThreadPool* threadpool =
         reinterpret_cast<caffe2::ThreadPool*>(pthreadpool);
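For reference, this is how a NNPACK/QNNPACK-style caller drives the legacy C shim shown in the diff above: a plain (argument, index) callback invoked once per element of a 1-D range. This is a hypothetical usage sketch; the header path is assumed, and everything other than the legacy_pthreadpool_* names is invented for the example.

#include <cstddef>
#include <vector>

#include "caffe2/utils/threadpool/pthreadpool.h"  // assumed location of the legacy_ declarations

namespace {

struct ScaleTask {
  float* data;
  float scale;
};

// Matches the legacy_pthreadpool_function_1d_t shape: (void* argument, size_t index).
void scale_one(void* argument, std::size_t index) {
  auto* task = static_cast<ScaleTask*>(argument);
  task->data[index] *= task->scale;
}

}  // namespace

void scale_in_parallel(std::vector<float>& values, float scale) {
  // Backed by caffe2::ThreadPool via the reinterpret_cast shims in the diff above.
  legacy_pthreadpool_t pool = legacy_pthreadpool_create(/*threads_count=*/4);

  ScaleTask task{values.data(), scale};
  legacy_pthreadpool_compute_1d(pool, scale_one, &task, values.size());

  legacy_pthreadpool_destroy(pool);
}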