c10d/Store: add nonblocking mode to queue_pop (#151485)

This adds a non-blocking mode to queue_pop, which lets workers poll for ready work without blocking the main loop. This is useful when items arrive on the queue only periodically and you want to keep the GPU at maximum utilization in between.

We also expose a `torch.distributed.QueueEmptyError` so users can catch the error and handle it accordingly.
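
The polling pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it uses Python's standard-library `queue.Queue` as a stand-in for the store, with `queue.Empty` playing the role of `torch.distributed.QueueEmptyError`. The helper names `poll_once` and `run_loop` are hypothetical. With a real store, the analogous call would presumably be a `queue_pop` with blocking disabled (the `block` parameter name is taken from the C++ signature in this commit; the exact Python keyword is an assumption).

```python
import queue

# Stand-in for the store-backed queue. In this sketch, queue.Empty plays the
# role that torch.distributed.QueueEmptyError plays for a c10d Store.
work_queue = queue.Queue()


def poll_once(q):
    """Return the next item if one is ready, else None, without blocking."""
    try:
        return q.get(block=False)
    except queue.Empty:
        # Nothing queued right now; the caller keeps doing its own work.
        return None


def run_loop(q, steps):
    """Hypothetical main loop: poll each step, but never wait on the queue."""
    received = []
    local_steps = 0
    for _ in range(steps):
        item = poll_once(q)
        if item is not None:
            received.append(item)
        local_steps += 1  # placeholder for the GPU work that must not stall
    return received, local_steps


work_queue.put(b"task-1")
received, local_steps = run_loop(work_queue, steps=3)
print(received, local_steps)
```

The point of the design is that an empty queue is reported as an exception rather than by waiting, so the loop completes every step regardless of how often work arrives.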

Test plan:

```
pytest test/distributed/test_store.py -k queue -v -s -x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151485
Approved by: https://github.com/fduwjj, https://github.com/tianfengfrank
Tristan Rice
2025-04-18 02:14:47 +00:00
committed by PyTorch MergeBot
parent 3ed5f1fb77
commit 98c892749b
16 changed files with 64 additions and 23 deletions

@@ -117,7 +117,7 @@ class TORCH_API TCPStore : public Store {
   void queuePush(const std::string& key, const std::vector<uint8_t>& value)
       override;
-  std::vector<uint8_t> queuePop(const std::string& key) override;
+  std::vector<uint8_t> queuePop(const std::string& key, bool block) override;
   int64_t queueLen(const std::string& key) override;