Fix flaky test_udf_remote_message_delay_timeout_to_self (#41217)

Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41217

Fixes this flaky test. The callback finishCreatingOwnerRRef can run after
request_callback has processed the request and created the owner RRef. Because
the callback removes the RRef from the owners_ map, the node can end up with 0
owners. In this case, shutdown is fine since there are no owners left to
delete. On the other hand, if the callback runs first, there will be 1 owner,
which shutdown deletes once it detects that the owner has no forks. Either way
shutdown works fine, so we don't need to enforce that there is exactly 1
owner.
ghstack-source-id: 107883497

Test Plan: Ran the test 500 times with TSAN.

Reviewed By: ezyang

Differential Revision: D22469806

fbshipit-source-id: 02290d6d5922f91a9e2d5ede21d1cf1c4598cb46
Rohan Varma
2020-07-16 11:17:31 -07:00
committed by Facebook GitHub Bot
parent 94e4248d80
commit b5e32528d0
5 changed files with 26 additions and 12 deletions


@@ -273,6 +273,9 @@ void RRefContext::delAllUsersAndUnforkedOwners(
   for (auto& rrefId : unforkedOwners) {
     LOG(INFO) << "Removing unforked OwnerRRef with RRefId: " << rrefId;
     auto iter = owners_.find(rrefId);
+    TORCH_CHECK(
+        iter != owners_.end(),
+        c10::str("Did not find OwnerRRef with RRefId: ", rrefId));
     owners_.erase(iter);
   }
 }
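
The two orderings described in the summary, and the checked find-then-erase in the hunk above, can be sketched with a toy model. This is not the PyTorch implementation: OwnerTable, removeOwnerIfPresent, and ownersLeftAfterShutdown are hypothetical names, std::runtime_error stands in for TORCH_CHECK, and the model assumes the callback's removal is a no-op when the entry does not exist yet.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the owners_ bookkeeping; not the real RRefContext.
class OwnerTable {
 public:
  // Models request_callback creating the owner RRef.
  void addOwner(int64_t rrefId) { owners_[rrefId] = "owner"; }

  // Models the callback removing the entry (assumed no-op if it is absent).
  void removeOwnerIfPresent(int64_t rrefId) { owners_.erase(rrefId); }

  // Ids of all owners still in the table; in this toy model none has forks.
  std::vector<int64_t> unforkedOwners() const {
    std::vector<int64_t> ids;
    for (const auto& entry : owners_) {
      ids.push_back(entry.first);
    }
    return ids;
  }

  // Checked find-then-erase: erasing an end() iterator is undefined
  // behavior, so fail loudly when the id is missing.
  void eraseUnforkedOwners(const std::vector<int64_t>& ids) {
    for (auto rrefId : ids) {
      auto iter = owners_.find(rrefId);
      if (iter == owners_.end()) {
        throw std::runtime_error(
            "Did not find OwnerRRef with RRefId: " + std::to_string(rrefId));
      }
      owners_.erase(iter);
    }
  }

  std::size_t size() const { return owners_.size(); }

 private:
  std::unordered_map<int64_t, std::string> owners_;
};

// Runs one ordering and returns how many owners remain after shutdown.
std::size_t ownersLeftAfterShutdown(bool callbackRunsAfterCreation) {
  OwnerTable table;
  const int64_t rrefId = 1;
  if (callbackRunsAfterCreation) {
    table.addOwner(rrefId);              // owner created first
    table.removeOwnerIfPresent(rrefId);  // callback removes it: 0 owners
  } else {
    table.removeOwnerIfPresent(rrefId);  // callback runs first: nothing to do
    table.addOwner(rrefId);              // 1 owner remains until shutdown
  }
  table.eraseUnforkedOwners(table.unforkedOwners());  // shutdown cleanup
  return table.size();
}
```

In this model both orderings leave the table empty after shutdown, so a test that asserts exactly 1 owner before shutdown would fail for one of the orderings. The explicit check matters only when an id passed to eraseUnforkedOwners is missing from the map.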