In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.
Testing:
```
import torch
if torch.cuda.is_available():
device = torch.cuda.current_device()
mod = torch.get_device_module('cuda')
hw = mod._device_limits.GPULimits(device)
print(hw.get_tflops_per_second(torch.float16))
print(hw.get_tflops_per_second(torch.float32))
print(hw.get_tflops_per_second(torch.float64))
print(hw.get_tflops_per_second(torch.bfloat16))
print(hw.get_tflops_per_second(torch.int8))
print(hw.get_memory_bandwidth_Bps() / 1e9)
print(hw.get_shared_memory_bandwidth_Bps() / 1e9)
# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel