dcc3cf7066
[BE] fix ruff rule E226: add missing whitespace around operator in f-strings (#144415)
...
The fixes are generated by:
```bash
ruff check --fix --preview --unsafe-fixes --select=E226 .
lintrunner -a --take "RUFF,PYFMT" --all-files
```
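For reference, a minimal sketch (not taken from this PR's diff) of the kind of change E226 enforces inside f-string expressions:
```python
# Hypothetical example: E226 flags arithmetic operators with no surrounding
# whitespace, including inside f-string replacement fields.
total, n = 10, 4
print(f"avg={total/n}")    # before: flagged by E226
print(f"avg={total / n}")  # after the autofix
```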
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144415
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2025-01-08 21:55:00 +00:00
5a10b56083
[dynamo] Small microbenchmark changes (#122032)
...
Used to generate numbers in #122029
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122032
Approved by: https://github.com/yanboliang
2024-03-18 18:08:06 +00:00
c5702a0891
[dynamo] Optimize BACKEND_MATCH guard (#118065)
...
As measured by `benchmarks/dynamo/microbenchmarks/overheads.py`:
- Before `22.5us`
- After `18.1us`
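For context, a minimal sketch of how that kind of per-call overhead is typically measured (an illustration, not the actual `overheads.py`): compile a trivial function once, then time warm calls so guard evaluation and dispatch dominate rather than compilation or compute.
```python
# Illustrative only: time warm calls to a trivially compiled function so that
# compilation and the compute itself are negligible next to dynamo's overhead.
import timeit
import torch

@torch.compile(backend="eager")
def f(x):
    return x + 1

x = torch.empty(1)
f(x)  # warm-up call: triggers compilation and installs guards
n = 10_000
per_call = timeit.timeit(lambda: f(x), number=n) / n
print(f"{per_call * 1e6:.1f}us per call")
```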
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118065
Approved by: https://github.com/ydwu4
2024-01-24 07:47:52 +00:00
a669319450
[inductor] Faster C++ kernel python bindings (#117500)
...
Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)
async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings: ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings: 0.000013
```
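For context on the "ctypes is notoriously slow" claim, a standalone sketch (not part of the PR; assumes a POSIX system where `ctypes.CDLL(None)` exposes libc) comparing the per-call cost of going through ctypes with a builtin call:
```python
# Illustrative only: measure raw ctypes call overhead against a builtin.
import ctypes
import timeit

libc = ctypes.CDLL(None)  # POSIX: load symbols already linked into the process
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

n = 1_000_000
t_ctypes = timeit.timeit(lambda: libc.abs(-3), number=n) / n
t_builtin = timeit.timeit(lambda: abs(-3), number=n) / n
print(f"ctypes:  {t_ctypes * 1e9:.0f} ns/call")
print(f"builtin: {t_builtin * 1e9:.0f} ns/call")
```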
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-18 16:20:12 +00:00
a1afd1b195
Revert "[inductor] Faster C++ kernel python bindings ( #117500 )"
...
It should never have been landed, but was landed again due to
ghstack grafting/ungrafting; see the discussion on https://github.com/pytorch/pytorch/pull/116910
This reverts commit e457b6fb18782425661e8a09d0222d0b29518ad1.
2024-01-17 17:06:32 -08:00
e457b6fb18
[inductor] Faster C++ kernel python bindings (#117500)
...
Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)
async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings: ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings: 0.000013
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 23:03:15 +00:00
da6abaeeac
Revert "[inductor] Faster C++ kernel python bindings ( #117500 )"
...
This reverts commit bb0fd1bd3ca145b77159427bc5bacf5f98ec3896.
Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal, discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896516512))
2024-01-17 19:34:26 +00:00
bb0fd1bd3c
[inductor] Faster C++ kernel python bindings (#117500)
...
Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)
async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings: ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings: 0.000013
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
ghstack dependencies: #117409, #116667, #117591
2024-01-17 19:12:24 +00:00
9da01affd3
Revert "[inductor] Faster C++ kernel python bindings ( #117500 )"
...
This reverts commit 3a52147cc59b240737602d3d046080bbf6f567f1.
Reverted https://github.com/pytorch/pytorch/pull/117500 on behalf of https://github.com/PaliC due to breaking internal, discussed with author offline ([comment](https://github.com/pytorch/pytorch/pull/117500#issuecomment-1896426304))
2024-01-17 18:42:39 +00:00
3a52147cc5
[inductor] Faster C++ kernel python bindings (#117500)
...
Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark:
```python
from ctypes import c_void_p
import torch
from torch import empty
from torch._inductor.codecache import AsyncCompile
from torch._dynamo.testing import rand_strided
from torch._inductor.utils import print_performance
from torch._inductor.wrapper_benchmark import compiled_module_main

async_compile = AsyncCompile()

src = '''
#include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h"
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        auto tmp0 = in_ptr0[static_cast<long>(0L)];
        auto tmp1 = static_cast<float>(1.0);
        auto tmp2 = decltype(tmp0)(tmp0 + tmp1);
        out_ptr0[static_cast<long>(0L)] = tmp2;
    }
}
'''

cpp_fused_add_ctypes = async_compile.cpp(src)
cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float*", "float*"], src)
async_compile.wait(globals())
del async_compile

def call(arg0_1):
    buf0 = empty((1,), device='cpu', dtype=torch.float32)
    if use_ctypes:
        for _ in range(100):
            cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    else:
        for _ in range(100):
            cpp_fused_add_cpython(arg0_1, buf0)
    del arg0_1
    return (buf0,)

def benchmark_compiled_module(times=1000, repeat=100):
    arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32)
    return print_performance(lambda: call(arg0_1), times=times, repeat=repeat)

print("old ctypes bindings: ", end='')
use_ctypes = True
compiled_module_main('None', benchmark_compiled_module)
print("new bindings: ", end='')
use_ctypes = False
compiled_module_main('None', benchmark_compiled_module)
```
Output:
```
old ctypes bindings: 0.000073
new bindings: 0.000013
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500
Approved by: https://github.com/desertfire
2024-01-16 22:30:04 +00:00