From a9015d81d468c000cc05076e3757710c615e26f8 Mon Sep 17 00:00:00 2001
From: Vitaly Fedyunin <vitalyf@fb.com>
Date: Fri, 12 Jul 2019 11:53:23 -0400
Subject: [PATCH] Replace binary_kernel with cpu_kernel

---
 How-to-use-TensorIterator.md | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/How-to-use-TensorIterator.md b/How-to-use-TensorIterator.md
index f05912c..af3cf7f 100644
--- a/How-to-use-TensorIterator.md
+++ b/How-to-use-TensorIterator.md
@@ -16,13 +16,15 @@ As the first step, we are naively introducing `TensorIterator` to replace `CPU_t
 
 [[images/tensor_iterator/change0.png]] 
 
+**Update: `binary_kernel` was recently replaced by the more general `cpu_kernel`; the two have exactly the same API.**
+
 code: https://github.com/pytorch/pytorch/pull/21025/commits/c5593192e1f21dd5eb1062dbacfdf7431ab1d47f
 
 In compare to TH_APPLY_* and CPU_tensor_apply* and  solutions, TensorIterator usage is separated by two steps.
 1) Defining iterator configuration with the TensorIterator::Builder. Under the hood, builder calculates tensors shapes and types to find the most performant way to traverse them (https://github.com/pytorch/pytorch/blob/dee11a92c1f1c423020b965837432924289e0417/aten/src/ATen/native/TensorIterator.h#L285)
 2) Loop implementation.
-There are multiple different kernels in Loops.h depending on the number of inputs (binary_kernel; unary_kernel), the ability to do parallel calculations, vectorized version availability (binary_kernel_vec), type of operation (vectorized_inner_reduction).
-In our case, we have one output and two inputs, in this type of scenario we can use binary_kernel.
+Loops.h provides several kernels that differ in their ability to do parallel calculations, the availability of a vectorized version (`cpu_kernel_vec`), and the type of operation (`vectorized_inner_reduction`); the number of inputs is dispatched automatically by `cpu_kernel`.
+In our case, we have one output and two inputs, so we can use `cpu_kernel`.
 TensorIterator automatically picks the best way to traverse tensors (such as taking into account contiguous layout) as well as using parallelization for bigger tensors.
 
 As a result, we have a 24x performance gain.
@@ -119,7 +121,7 @@ In [*4*]: timeit torch.lerp(x,y,0.5)
 440 µs ± 2.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
 ```
 
-## Vectorization with the binary_kernel_vec
+## Vectorization with cpu_kernel_vec
 
 In many cases, we can also benefit from the explicit vectorization (provided by Vec256 library). TensorIterator provides the easy way to do it by using _vec loops.
 
@@ -127,4 +129,4 @@ In many cases, we can also benefit from the explicit vectorization (provided by
 
 code: https://github.com/pytorch/pytorch/pull/21025/commits/83a23e745e839e8db81cf58ee00a5755d7332a43
 
-We are doing so by replacing the binary_kernel with the binary_kernel_vec. At this particular case, weight_val check was omitted (to simplify example code), and performance benchmark show no significant gain.
\ No newline at end of file
+We do so by replacing `cpu_kernel` with `cpu_kernel_vec`. In this particular case, the weight_val check was omitted (to simplify the example code), and the performance benchmark shows no significant gain.
\ No newline at end of file