Add scalar information to the kernel configuration.
#### Additional Context
Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b`, and then `alpha`. The same order is used when loading from the cache.
However, this orchestration mechanism does not yet support kwargs, because the call-site order of kwargs is not meaningful. For example, in `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)`, the `out` argument may be passed before `approximate`. We will support kwargs in subsequent PRs.
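To make the ordering issue concrete, here is a minimal Python sketch (not the PR's implementation) showing why the call-site kwargs order cannot serve as the cache order, and how normalizing kwargs against the op schema gives a stable order; `normalize_by_schema` is a hypothetical helper.
```python
import torch

# Schema: aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)
schema = torch._C._get_schema("aten::gelu", "out")
schema_arg_order = [arg.name for arg in schema.arguments]  # ['self', 'approximate', 'out']


def normalize_by_schema(args, kwargs):
    # Hypothetical helper: arrange inputs by schema position, not by call-site order.
    named = dict(zip(schema_arg_order, args))
    named.update(kwargs)
    return [named[name] for name in schema_arg_order if name in named]


x = torch.randn(8)
out = torch.empty(8)

# Both calls are equivalent even though the call-site kwargs order differs,
# so the call-site order cannot be used as the cache order in the kernel configuration.
torch.ops.aten.gelu.out(x, approximate="none", out=out)
torch.ops.aten.gelu.out(x, out=out, approximate="none")

n1 = normalize_by_schema((x,), {"out": out, "approximate": "none"})
n2 = normalize_by_schema((x,), {"approximate": "none", "out": out})
assert all(u is v for u, v in zip(n1, n2))  # same schema order: [self, approximate, out]
```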
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we store a persistent config that caches the kernel information for each aten operation.
The persistent config consists of two parts: `meta_info` and `kernel_path`.
- `meta_info`: the input tensors' shapes, strides, device type, data type, and symbolic flag.
- `kernel_path`: the path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` loads the persistent config and parses it to build the cache map, where the `meta_info` is the key and the kernel library is the value.
Currently, this PR only supports static shapes to guard the kernel. From the C++ class perspective, we define a base class `TensorChecker`, and `StaticTensorChecker` inherits from it. In the future, we will add another subclass of `TensorChecker` to support dynamic shapes.
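To make the cache-map construction and the static-shape guard concrete, here is a minimal Python sketch under assumed names: the PR's actual implementation lives in the C++ `kernel_holder`, and `make_key`/`load_eager_cache` are hypothetical helpers that mirror the JSON config format shown further below.
```python
import json


def make_key(tensors):
    # Static-shape guard (cf. StaticTensorChecker): key on each input tensor's metadata.
    # is_symbolic is always "false" for the static checker, so it is omitted from the key here.
    return tuple(
        (t.device.type, str(t.dtype), tuple(t.size()), tuple(t.stride()))
        for t in tensors
    )


def load_eager_cache(config_path):
    # Parse the persistent config into a map: meta_info key -> kernel library path.
    with open(config_path) as f:
        entries = json.load(f)
    cache = {}
    for entry in entries:
        key = tuple(
            (
                m["device_type"],
                m["dtype"],
                tuple(json.loads(m["sizes"])),
                tuple(json.loads(m["strides"])),
            )
            for m in entry["meta_info"]
        )
        cache[key] = entry["kernel_path"]
    return cache


# Hypothetical usage:
#   cache = load_eager_cache("<persistent config path>")
#   kernel_path = cache.get(make_key([a, b]))  # None -> cache miss: compile and append an entry
```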
Take `mul` as an example.
```python
import time
from typing import Any

import torch

device = "cuda"


class MulKernel:
    def __init__(self) -> None:
        pass

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
            opt_fn = torch.compile(
                torch.ops.aten.mul,
                dynamic=False,
                options={
                    "aot_inductor.eager_mode": True,
                    "aot_inductor.eager_op_name": "mul_Tensor",
                },
            )
            return opt_fn(*args, **kwargs)


torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
assert "Tensor" in overload_names
# Register the Tensor overload, i.e. aten::mul.Tensor
schema = torch._C._get_schema("aten::mul", "Tensor")
reg_name = schema.name
if schema.overload_name:
    reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
    reg_name,
    MulKernel(),
    "CUDA",
    compile_mode=True)

a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul

# Warm up
for _ in range(warm_up_iter):
    fn(a, b)

# Collect performance
beg = time.time()
for _ in range(iter):
    fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
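To connect this config back to the benchmark above, the sketch below shows how a cache hit would be decided for the two 1024x1024 CUDA float32 inputs. The config path is a placeholder and the matching logic is a Python stand-in for the check the C++ kernel holder performs.
```python
import json

import torch

# Hypothetical config path; the real location is chosen by the kernel holder.
config_path = "<persistent config for mul_Tensor>.json"

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")


def matches(meta, t):
    # Compare one meta_info entry against a runtime tensor (static-shape check).
    return (
        meta["device_type"] == t.device.type
        and meta["dtype"] == str(t.dtype)
        and json.loads(meta["sizes"]) == list(t.size())
        and json.loads(meta["strides"]) == list(t.stride())
    )


with open(config_path) as f:
    entries = json.load(f)

hit = any(
    len(e["meta_info"]) == 2
    and matches(e["meta_info"][0], a)
    and matches(e["meta_info"][1], b)
    for e in entries
)
# On a hit, the kernel holder dispatches into the cached kernel at kernel_path;
# on a miss, it falls back to torch.compile and appends a new entry to the config.
print("cache hit" if hit else "cache miss")
```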
Performance-wise, we collected data for `mul.Tensor` through torch.compile with 10000 end-to-end runs. The numbers are as follows; we will collect data for dynamic shapes once they are supported.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: NVIDIA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to enable a cache mechanism to accelerate **eager-through-torch.compile**. When **eager-through-torch.compile** is enabled, we will store a persistent config to cache the kernel information for the aten operation.
The persistent config consists of two parts - meta_info and kernel_path.
- meta_info: The input tensors' shape, stride, device type, data type, and symbolic flag.
- kernel_path: The path of the kernel produced by Inductor.
When an aten operation is registered, the `kernel_holder` will load the persistent config and parse it to build the cache map; the meta_info is key, and the kernel library is the value.
Currently, this PR only supports static shape to guard the kernel. From the C++ class perspective, we defined to base class `TensorChecker` and `StaticTensorChecker` inherits the base class. In the future, we will implement a class to support dynamic shape by inheriting this base class as well.
Take a `mul` as an example.
```python
class MulKernel:
def __init__(self) -> None:
pass
def __call__(self, *args: Any, **kwargs: Any) -> Any:
with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
opt_fn = torch.compile(torch.ops.aten.mul, dynamic=False, options={
"aot_inductor.eager_mode": True,
"aot_inductor.eager_op_name": "mul_Tensor"
}
)
return opt_fn(*args, **kwargs)
torch_compile_op_lib_impl = torch.library.Library("aten", "IMPL")
_, overload_names = torch._C._jit_get_operation("aten::mul")
schema = torch._C._get_schema("aten::mul", overload_name)
reg_name = schema.name
if schema.overload_name:
reg_name = f"{reg_name}.{schema.overload_name}"
torch_compile_op_lib_impl.impl(
reg_name,
MulKernel(),
"CUDA",
compile_mode=True)
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
warm_up_iter = 1000
iter = 10000
fn = torch.mul
# Warm up
for _ in range(warm_up_iter):
fn(a, b)
# Collect performance
beg = time.time()
for _ in range(iter):
fn(a, b)
end = time.time()
print(f"E2E run: {end - beg}")
```
It will produce the config as follows.
```json
[
{
"meta_info": [
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
},
{
"is_symbolic": "false",
"device_type": "cuda",
"dtype": "torch.float32",
"sizes": "[1024, 1024]",
"strides": "[1024, 1]"
}
],
"kernel_path": "/tmp/torchinductor_eikan/e4/ce4jw46i5l2e7v3tvr2pyglpjmahnp7x3hxaqotrvxwoeh5t6qzc.so"
}
]
```
Performance-wise, we collected mul.Tensor through torch.compile w/ 10000 runs(e2e). The data is as follows. And we will collect data when we support dynamic shape.
- Eager: ~266.11ms
- W/O Cache: ~3455.54ms
- W/ Cache and Cache Miss: ~3555.3ms
- W/ Cache and Cache Hit: ~267.12ms
Hardware:
- CPU: Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
- GPU: CUDA A10
Software:
- PyTorch Version: 39df084001c54cca5fe3174176f9b0206ddb7dcf
- GPU Driver Version: 525.147.05
- CUDA Version: 12.0
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to provide a cache mechanism to accelerate torch.compile-for-eager.
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
This PR is a follow-up of RFC https://github.com/pytorch/pytorch/issues/115545.
In this PR, we are trying to provide a registration mode that implements a single aten operation on top of `torch.compile` and then registers it back to aten.
Currently, Python-based aten kernel implementations assume a hermetic Python object. A `torch.compile`-based aten kernel implementation breaks that assumption, because:
> While HermeticPyObject was enabled, we attempted to create a tensor subclass with __torch_dispatch__. This violates the invariant that operations in HermeticPyObject have equivalent C++ implementations.
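A minimal sketch of this registration mode, condensing the `mul` example shown earlier down to the registration step (the `compile_mode=True` flag and the `aot_inductor.*` options are the knobs added by this stack; everything else is existing PyTorch API):
```python
from typing import Any

import torch


class CompiledMul:
    # Lower aten::mul.Tensor through torch.compile on each call. As in the
    # fuller example, the Python dispatch key is excluded so that the
    # compiled region does not re-enter this very kernel.
    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        with torch._C._SetExcludeDispatchKeyGuard(torch._C.DispatchKey.Python, False):
            opt_fn = torch.compile(
                torch.ops.aten.mul,
                dynamic=False,
                options={
                    "aot_inductor.eager_mode": True,
                    "aot_inductor.eager_op_name": "mul_Tensor",
                },
            )
            return opt_fn(*args, **kwargs)


# Register the compiled callable as the CUDA kernel for aten::mul.Tensor.
# compile_mode=True marks it as a torch.compile-based kernel (cf. the
# hermetic-PyObject note above).
lib = torch.library.Library("aten", "IMPL")
lib.impl("mul.Tensor", CompiledMul(), "CUDA", compile_mode=True)
```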
cc voznesenskym penguinwu jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler
[ghstack-poisoned]
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is designed to support CD operations within a PyPI manylinux aarch64 container and is meant to be executed inside that container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script assumes the PyTorch repo is located at ```/pytorch``` and puts the wheels into ```/artifacts```.
This app allows a person to build using AWS EC2 resources and requires the AWS CLI and Boto3 with AWS credentials to launch EC2 instances for the wheel builds. It can be used in a CodeBuild CD pipeline or from a local system.
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
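As a mental model of how `calculate-docker-image` keeps images in sync with `.ci/docker` (an assumption for illustration, not the actual CI implementation), the image tag can be thought of as a hash of that directory's contents: any change there yields a new tag and therefore a rebuild, while unchanged contents reuse the already-built image. A toy sketch:

```python
import hashlib
from pathlib import Path

def docker_folder_tag(root: str = ".ci/docker") -> str:
    """Derive a stable tag from the files under `root` (illustrative only)."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Hash both the relative path and the file contents, so renames
            # and edits each produce a different tag.
            digest.update(path.relative_to(root).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()[:12]

print("docker image tag suffix:", docker_folder_tag())
```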
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
#### 2. Configure the Base Docker Image
1. **Add new Base Docker image configuration** (if applicable):
Add the new image's configuration in `.ci/docker/build.sh`.
#lintrunner is supported on aarch64-linux only from version 0.12.4
lintrunner==0.12.7
lintrunner==0.12.5
#Description: all about linters!
#Pinned versions: 0.12.7
#Pinned versions: 0.12.5
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
#test that import:
ghstack==0.8.0
#Description: ghstack tool
#Pinned versions: 0.8.0
#test that import:
jinja2==3.1.6
jinja2==3.1.4
#Description: jinja2 template engine
#Pinned versions: 3.1.4
#test that import:
@@ -306,92 +289,26 @@ pytest-cpp==2.3.0
#Pinned versions: 2.3.0
#test that import:
z3-solver==4.15.1.0 ; platform_machine != "s390x"
z3-solver==4.12.2.0
#Description: The Z3 Theorem Prover Project
#Pinned versions:
#test that import:
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"
tensorboard==2.13.0
#Description: Also included in .ci/docker/requirements-docs.txt
#Pinned versions:
#test that import: test_tensorboard
pywavelets==1.4.1 ; python_version < "3.12"
pywavelets==1.7.0 ; python_version >= "3.12"
pywavelets==1.5.0 ; python_version >= "3.12"
#Description: This is a requirement of scikit-image, we need to pin
# it here because 1.5.0 conflicts with numpy 1.21.2 used in CI
#Pinned versions: 1.4.1
#test that import:
lxml==5.3.0
lxml==5.0.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.13.3
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.18.0
#Description: Required by onnx tests, and mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.5.3
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 1.24.0
#test that import: test_sac_estimator.py
pwlf==2.2.1
#Description: required for testing torch/distributed/_tools/sac_estimator.py
#Pinned versions: 2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
pyyaml==6.0.2
pyzstd
setuptools==78.1.1
packaging==23.1
six
scons==4.5.2 ; platform_machine == "aarch64"
pulp==2.9.0
#Description: required for testing ILP formulation under torch/distributed/_tools
#Pinned versions: 2.9.0
#test that import: test_sac_ilp.py
dataclasses_json==0.6.7
#Description: required for data pipeline and scripts under tools/stats
#Pinned versions: 0.6.7
#test that import:
cmake==3.31.6
#Description: required for building
tlparse==0.4.0
#Description: required for log parsing
filelock==3.18.0
#Description: required for inductor testing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x" and platform_system != "Darwin"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py
setuptools-git-versioning==2.1.0
scikit-build==0.18.1
pyre-extensions==0.0.32
tabulate==0.9.0
#Description: These packages are needed to build FBGEMM and torchrec on PyTorch CI