[inductor] Reduce device context manager overhead (#91045)

This adds `torch.cuda._DeviceGuard`, a stripped-down version of
`torch.cuda.device` with lower overhead. To achieve this, it only accepts an
`int` as the device, so we don't need to call `_get_device_index`, and its
`__enter__` is implemented as a single call to a new C++ helper,
`torch._C._cuda_exchangeDevice`. On my machine, overhead drops from 3.8 us to
0.94 us with this simple benchmark:

```python
import torch

def set_device():
    # The body is empty, so the timing measures only the context manager's
    # enter/exit overhead.
    with torch.cuda.device(0):
        pass

%timeit set_device()  # run under IPython
```
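
For context, here is a minimal sketch of the kind of context manager described above. It assumes `torch._C._cuda_exchangeDevice(index)` switches the current CUDA device and returns the previously active index (inferred from the description of `_DeviceGuard.__enter__`); the real `_DeviceGuard` in `torch/cuda/__init__.py` may differ in details such as how it handles negative indices.

```python
import torch

class _DeviceGuardSketch:
    """Illustrative low-overhead device guard; not the actual _DeviceGuard."""

    def __init__(self, index: int):
        self.idx = index      # target device, already an int: no _get_device_index needed
        self.prev_idx = -1    # previous device index, filled in by __enter__

    def __enter__(self):
        # Single C++ call: switch to self.idx and remember the old device index.
        self.prev_idx = torch._C._cuda_exchangeDevice(self.idx)

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the previous device if there was a valid one to restore.
        if self.prev_idx >= 0:
            torch._C._cuda_exchangeDevice(self.prev_idx)
        return False
```

Keeping the device as a plain `int` and doing the swap in one C++ call avoids the Python-side argument normalization that `torch.cuda.device` performs, which is the overhead this change targets.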

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91045
Approved by: https://github.com/ngimel, https://github.com/anijain2305
Authored by Peter Bell on 2023-01-12 11:41:40 +00:00
Committed by PyTorch MergeBot
Parent: db466ae057
Commit: eece6da162
7 changed files with 62 additions and 1 deletion

```diff
@@ -229,6 +229,7 @@ std::shared_ptr<SugaredValue> CUDAPythonModuleValue::attr(
 "current_stream",
 "default_stream",
 "current_device",
+"_exchange_device",
 "set_device",
 "device_index",
 "device_count",
```