[[Page Maintainers|Where or how should I add documentation?]]: [@bdhirsh](mailto:briandhirsh@gmail.com)

# Codegen + Structured Kernels Overview

Ed has a really great overview of code generation and why we have it in PyTorch: check out his podcast episode at https://pytorch-dev-podcast.simplecast.com/episodes/code-generation.

This document will go over our codegen subsystem and structured kernels in more detail, and will involve you using gdb to jump through the different code-generated files that are part of a call into `torch.add()`.

### What it is

We have a code-generation pipeline that runs as part of the PyTorch build: it reads in some yaml files and spits out a bunch of C++ files.

### Why we have it

So, why do we have codegen? One big motivating factor is to reduce boilerplate. PyTorch has a lot of operators, and there's a lot of stuff that should "just work" for every operator. We don't want to make someone hand-write all of that functionality whenever a new operator is added. Instead, we code-generate it.
A (non-exhaustive) list of the functionality we need for every operator (so multiply by ~2000):

- bindings to Python
- the frontend C++ API
- autograd support
- registering kernels to the dispatcher
- special logic for factory functions
- torch.jit.trace functionality
- other stuff

### Inputs

We have a yaml file, `native_functions.yaml`, which describes metadata about each operator and gets consumed by the codegen:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/native_functions.yaml
We're going to focus on the operator `torch.add(a, b, out=c)`, which corresponds to the yaml entry `add.out`:

```yaml
- func: add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  ufunc_inner_loop:
    Generic: add (AllAndComplex, BFloat16, Half, ComplexHalf)
    ScalarOnly: add (Bool)
  dispatch:
    SparseCPU: add_out_sparse_cpu
    SparseCUDA: add_out_sparse_cuda
    SparseCsrCPU: add_out_sparse_compressed_cpu
    SparseCsrCUDA: add_out_sparse_compressed_cuda
    MkldnnCPU: mkldnn_add_out
    MPS: add_out_mps
  tags: pointwise
```
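If you want to poke at this metadata yourself, here's a minimal sketch (an illustration, not part of the codegen) that parses `native_functions.yaml` with `pyyaml` and prints the `add.out` entry; it assumes you run it from the root of a PyTorch source checkout:

```python
# Minimal sketch: load native_functions.yaml and inspect the add.out entry.
# Assumes the working directory is the root of a PyTorch source checkout
# and that pyyaml is installed.
import yaml

with open("aten/src/ATen/native/native_functions.yaml") as f:
    entries = yaml.safe_load(f)  # a list of dicts, one per operator schema

add_out = next(e for e in entries if e["func"].startswith("add.out("))
print(add_out["func"])                 # the full schema string
print(add_out.get("structured"))       # True: this op is a structured kernel
print(add_out.get("dispatch", {}))     # backend -> kernel-name overrides
```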
There's public documentation on each of the different pieces of yaml here:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/README.md

The codegen is written in a functional style, using Python dataclasses to represent the different inputs/intermediates/outputs. For example, each entry in `native_functions.yaml` is represented in the codegen as a `NativeFunction` object:
https://github.com/pytorch/pytorch/blob/6596a3f23dfe1ea4175637fa979bcbfbff397737/torchgen/model.py#L427

Finally, one of the main entry points to the codegen is in `torchgen/gen.py` (there's also a separate entry point for the autograd codegen pipeline). You can see the part of the file where we generate the C++ API, for example `Functions.h` (https://github.com/pytorch/pytorch/blob/f8e14f3b46e68a5271a8c57ce749ad8057d77ddd/torchgen/gen.py#L1781). It reads in a template file, `aten/src/ATen/templates/Functions.h` (https://github.com/pytorch/pytorch/blob/f8e14f3b46e68a5271a8c57ce749ad8057d77ddd/aten/src/ATen/templates/Functions.h), and generates the file `build/aten/src/ATen/Functions.h`.
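To make the "read a template, emit C++" step concrete, here's a tiny self-contained sketch of the same idea. This is not the real torchgen implementation - just an illustration of how a template plus a list of per-operator strings turns into a generated file:

```python
# Toy illustration of the codegen idea: substitute per-operator declarations
# into a C++ header template. The template and operator list here are made up.
from string import Template

header_template = Template("""\
// @generated by a toy codegen - do not edit by hand
#pragma once
namespace at {
$declarations
}  // namespace at
""")

ops = ["add", "mul", "sub"]
declarations = "\n".join(
    f"TORCH_API Tensor {op}(const Tensor& self, const Tensor& other);" for op in ops
)
print(header_template.substitute(declarations=declarations))
```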
# Exercise 0: Full Stack Trace of torch.add

For this exercise you'll need to have PyTorch built with debug symbols. I usually do that with `USE_CUDA=0 DEBUG=1 python setup.py develop` (the `USE_CUDA=0` is because we don't need it, and building with CUDA takes a long time).

We're going to run a small Python program using gdb to view the full stack trace. Create a Python script, `tmp.py`, with the following:

```python
import torch

a = torch.tensor([1, 1])
b = torch.tensor([1, 1])
c = torch.add(a, b)
```

Run `gdb python` (or `lldb python -- tmp.py`) to start up `gdb`. We're going to set a breakpoint in the `add` kernel - to do that, in the `gdb` prompt, type `break structured_ufunc_add_CPU::impl` (or `b structured_ufunc_add_CPU::impl` in lldb). Then run your script inside of `gdb` with `run tmp.py` (or `r` in lldb).

The debugger should pause inside of the add kernel. Type `bt` to view the current stack trace.

Ignoring the first ~10 function calls through the Python interpreter, you should see a stack trace that looks something like the following:

```
* thread #1, name = 'python', stop reason = breakpoint 1.1
* frame #0: 0x00007fffd38ed42a libtorch_cpu.so`at::native::structured_ufunc_add_CPU::impl(this=0x00007fffffffae60, self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0, out=0x00007fffffffb190) at UfuncCPU_add.cpp:30:11
frame #1: 0x00007fffd2ae81aa libtorch_cpu.so`at::(anonymous namespace)::wrapper_CPU_add_Tensor(self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at RegisterCPU.cpp:1576:8
frame #2: 0x00007fffd2c9079d libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(const at::Tensor&, const at::Tensor&, const c10::Scalar&), at::(anonymous namespace)::wrapper_CPU_add_Tensor>, at::Tensor, c10::guts::typelist::typelist<const at::Tensor&, const at::Tensor&, const c10::Scalar&> >, at::Tensor(const at::Tensor&, const at::Tensor&, const c10::Scalar&)>::call(c10::OperatorKernel *, c10::DispatchKeySet, const at::Tensor &, const at::Tensor &, const c10::Scalar &) [inlined] operator(args#2=0x00007fffffffbbb0, args#1=0x00007fffffffbbd8, args#0=0x00007fffffffbbe0, this=0x0000555556633ac0) at WrapFunctionIntoFunctor.h:13:72
frame #3: 0x00007fffd2c90759 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(const at::Tensor&, const at::Tensor&, const c10::Scalar&), at::(anonymous namespace)::wrapper_CPU_add_Tensor>, at::Tensor, c10::guts::typelist::typelist<const at::Tensor&, const at::Tensor&, const c10::Scalar&> >, at::Tensor(const at::Tensor&, const at::Tensor&, const c10::Scalar&)>::call(functor=0x0000555556633ac0, (null)=(repr_ = 32769), args#0=0x00007fffffffbbe0, args#1=0x00007fffffffbbd8, args#2=0x00007fffffffbbb0) at make_boxed_from_unboxed_functor.h:468:63
frame #4: 0x00007fffd1f5dec7 libtorch_cpu.so`at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar const&>(unboxed_kernel_func=0x00007fffd2c906ee, functor=0x0000555556633ac0, dispatchKeySet=(repr_ = 32769), (null)=0x00007fffffffbbe0, (null)=0x00007fffffffbbd8, (null)=0x00007fffffffbbb0) at KernelFunction_impl.h:52:72
frame #5: 0x00007fffd1e1ea24 libtorch_cpu.so`at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar const&>(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&)> const&, c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) const at KernelFunction_impl.h:104:87
frame #6: 0x00007fffd1e1e9aa libtorch_cpu.so`at::Tensor c10::Dispatcher::redispatch<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar const&>(this=0x00007fffe61a1de0, op=0x00007fffe61c7db0, currentDispatchKeySet=(repr_ = 32769), (null)=0x00007fffffffbbe0, (null)=0x00007fffffffbbd8, (null)=0x00007fffffffbbb0) const at Dispatcher.h:712:102
frame #7: 0x00007fffd2332b7c libtorch_cpu.so`at::_ops::add_Tensor::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, c10::Scalar const&) [inlined] c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&)>::redispatch(args#2=0x00007fffffffbbb0, args#1=0x00007fffffffbbd8, args#0=0x00007fffffffbbe0, currentDispatchKeySet=(repr_ = 32769), this=<unavailable>) const at Dispatcher.h:532:126
frame #8: 0x00007fffd2332acd libtorch_cpu.so`at::_ops::add_Tensor::redispatch(dispatchKeySet=(repr_ = 32769), self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at Operators_2.cpp:1049:60
frame #9: 0x00007fffd502cbf2 libtorch_cpu.so`at::redispatch::add(dispatchKeySet=(repr_ = 32769), self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at RedispatchFunctions.h:607:83
frame #10: 0x00007fffd4ef7650 libtorch_cpu.so`operator(__closure=0x00007fffffffb5c0) at VariableType_2.cpp:5969:85
frame #11: 0x00007fffd4ef7b7c libtorch_cpu.so`torch::autograd::VariableType::(anonymous namespace)::add_Tensor(ks=(repr_ = 274877939713), self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at VariableType_2.cpp:5970:6
frame #12: 0x00007fffd4ff0ad9 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&), torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&> >, at::Tensor(c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&)>::call(c10::OperatorKernel *, c10::DispatchKeySet, const at::Tensor &, const at::Tensor &, const c10::Scalar &) [inlined] operator(args#3=0x00007fffffffbbb0, args#2=0x00007fffffffbbd8, args#1=0x00007fffffffbbe0, args#0=(repr_ = 274877939713), this=0x0000555557b02710) at WrapFunctionIntoFunctor.h:13:72
frame #13: 0x00007fffd4ff0a80 libtorch_cpu.so`c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor(c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&), torch::autograd::VariableType::(anonymous namespace)::add_Tensor>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&> >, at::Tensor(c10::DispatchKeySet, const at::Tensor&, const at::Tensor&, const c10::Scalar&)>::call(functor=0x0000555557b02710, dispatchKeySet=(repr_ = 274877939713), args#0=0x00007fffffffbbe0, args#1=0x00007fffffffbbd8, args#2=0x00007fffffffbbb0) at make_boxed_from_unboxed_functor.h:485:79
frame #14: 0x00007fffd1f5dec7 libtorch_cpu.so`at::Tensor c10::callUnboxedKernelFunction<at::Tensor, at::Tensor const&, at::Tensor const&, c10::Scalar const&>(unboxed_kernel_func=0x00007fffd4ff0a0b, functor=0x0000555557b02710, dispatchKeySet=(repr_ = 274877939713), (null)=0x00007fffffffbbe0, (null)=0x00007fffffffbbd8, (null)=0x00007fffffffbbb0) at KernelFunction_impl.h:52:72
frame #15: 0x00007fffd233293b libtorch_cpu.so`at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) at KernelFunction_impl.h:104:87
frame #16: 0x00007fffd23328ac libtorch_cpu.so`at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) at Dispatcher.h:694:97
frame #17: 0x00007fffd2332695 libtorch_cpu.so`at::_ops::add_Tensor::call(at::Tensor const&, at::Tensor const&, c10::Scalar const&) [inlined] c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, at::Tensor const&, c10::Scalar const&)>::call(args#2=0x00007fffffffbbb0, args#1=0x00007fffffffbbd8, args#0=0x00007fffffffbbe0, this=<unavailable>) const at Dispatcher.h:527:97
frame #18: 0x00007fffd233257b libtorch_cpu.so`at::_ops::add_Tensor::call(self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at Operators_2.cpp:1042:38
frame #19: 0x00007fffe74b677c libtorch_python.so`at::Tensor::add(this=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) const at TensorBody.h:1664:79
frame #20: 0x00007fffe75f1164 libtorch_python.so`operator(__closure=0x00007fffffffbaad, self=0x00007fffffffbbe0, other=0x00007fffffffbbd8, alpha=0x00007fffffffbbb0) at python_torch_functions_2.cpp:1400:39
frame #21: 0x00007fffe75f1777 libtorch_python.so`torch::autograd::THPVariable_add(self_=0x0000000000000000, args=0x00007ffd68f74b80, kwargs=0x0000000000000000) at python_torch_functions_2.cpp:1402:33
```
That's a lot of function calls! We're going to walk through the main pieces that are relevant to codegen and where they live. For each piece, I've listed the relevant frame numbers from the gdb stack trace.

> Tip: In `lldb`, if you are curious about the absolute path of the source files listed in the frame backtrace, for example the 12th frame, you can first switch to that frame with `f 12` and then show its source info with `so i`.

### (1) Python Bindings

> #21: torch::autograd::THPVariable_add

This is the first stop that we hit after going through the Python interpreter: the python bindings. This is the code that interfaces directly with CPython to bind our C++ functions to Python.

You can see a snippet of the function below: its job is basically to take all of the PyObjects that it was handed from CPython, parse them into actual C++ types (like `at::Tensor`), and call into the C++ API. It does that below by calling into the Tensor add method: `self.add(other, alpha)`.
```cpp
static PyObject * THPVariable_add(PyObject* self_, PyObject* args, PyObject* kwargs)
{
  HANDLE_TH_ERRORS
  static PythonArgParser parser({
    "add(Tensor input, Scalar alpha, Tensor other, *, Tensor out=None)|deprecated",
    "add(Tensor input, Tensor other, *, Scalar alpha=1, Tensor out=None)",
  }, /*traceable=*/true);

  ParsedArgs<4> parsed_args;
  auto _r = parser.parse(nullptr, args, kwargs, parsed_args);
  ...
  auto dispatch_add = [](const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) -> at::Tensor {
    pybind11::gil_scoped_release no_gil;
    return self.add(other, alpha);
  };
  return wrap(dispatch_add(_r.tensor(0), _r.tensor(1), _r.scalar(2)));
  ...
  Py_RETURN_NONE;
  END_HANDLE_TH_ERRORS
}
```

These are all codegen'd and live in `torch/csrc/autograd/generated/python_torch_functions_2.cpp`.
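As a hedged aside (not part of the generated bindings themselves): from Python you can also reach the same dispatcher operator directly through `torch.ops`, which is a handy way to confirm that `torch.add` and the underlying `aten::add.Tensor` operator are the same thing:

```python
# torch.add goes through the generated THPVariable_add binding;
# torch.ops.aten.add.Tensor calls the same registered dispatcher operator directly.
import torch

a = torch.tensor([1, 1])
b = torch.tensor([1, 1])
print(torch.add(a, b))                  # tensor([2, 2])
print(torch.ops.aten.add.Tensor(a, b))  # tensor([2, 2]) - same underlying op
```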
### (2) C++ API

> #18: at::\_ops::add_Tensor::call
>
> #19: at::Tensor::add

The next stop is the C++ method API, which is one of the top-level APIs for calling into the dispatcher. The dispatcher then looks at all of the arguments plus any thread-local state to figure out which kernel to dispatch to. https://github.com/pytorch/pytorch/wiki/PyTorch-dispatcher-walkthrough has some more details about the dispatcher key-calculation process.

In `build/aten/src/ATen/core/TensorBody.h`:

```cpp
// namespace at
inline at::Tensor Tensor::add(const at::Tensor & other, const at::Scalar & alpha) const {
    return at::_ops::add_Tensor::call(const_cast<Tensor&>(*this), other, alpha);
}
```

In `build/aten/src/ATen/Operators_2.cpp`:

```cpp
static C10_NOINLINE c10::TypedOperatorHandle<add_Tensor::schema> create_add_Tensor_typed_handle() {
  return c10::Dispatcher::singleton()
      .findSchemaOrThrow(add_Tensor::name, add_Tensor::overload_name)
      .typed<add_Tensor::schema>();
}

at::Tensor add_Tensor::call(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
    static auto op = create_add_Tensor_typed_handle();
    return op.call(self, other, alpha);
}
```
### (3) Autograd kernel

> #11: torch::autograd::VariableType::(anonymous namespace)::add_Tensor

After a bunch of dispatcher-related functions, the dispatcher eventually takes us to the autograd add kernel. The autograd kernel:

- saves some metadata for autograd
- re-invokes the dispatcher by calling `at::redispatch::add(ks & c10::after_autograd_keyset, self_, other_, alpha);`

In `torch/csrc/autograd/generated/VariableType_2.cpp`:

```cpp
// namespace torch::autograd::VariableType
at::Tensor add_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
  ...
}

// Register `add_Tensor` so that the dispatcher can find it
TORCH_LIBRARY_IMPL(aten, Autograd, m) {
  ...
  m.impl("add.Tensor", TORCH_FN(VariableType::add_Tensor));
  ...
}
```

The autograd kernel ends up calling back into the C++ API (by calling `at::redispatch::add`), which then calls back into the dispatcher and calculates the next kernel to dispatch to.
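To see the "saves some metadata for autograd" step from the Python side (a quick illustrative check, not part of the generated kernel), you can look at the `grad_fn` that the autograd kernel attaches to the output:

```python
# When an input requires grad, the autograd add kernel records an AddBackward0 node
# on the output before redispatching to the backend kernel.
import torch

a = torch.tensor([1.0, 1.0], requires_grad=True)
b = torch.tensor([1.0, 1.0])
c = torch.add(a, b)
print(c.grad_fn)  # <AddBackward0 object at 0x...>
```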
### (4) CPU kernel

> #0: at::native::structured_ufunc_add_CPU::impl
>
> #1: at::(anonymous namespace)::wrapper_CPU_add_Tensor

After a few more function hops through the dispatcher, we eventually dispatch to the CPU add kernel, which has to actually carry out the computation. The code for the CPU kernel (and the code that registers the kernel to the dispatcher) looks like this:

In `build/aten/src/ATen/RegisterCPU.cpp`:

```cpp
at::Tensor wrapper_CPU_add_Tensor(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
  structured_ufunc_add_CPU_functional op;
  op.meta(self, other, alpha);
  op.impl(self, other, alpha, op.outputs_[0]);
  return std::move(op.outputs_[0]);
}

TORCH_LIBRARY_IMPL(aten, CPU, m) {
  m.impl("add.Tensor", TORCH_FN(wrapper_CPU_add_Tensor));
}
```

This code looks a little funky; it calls into a `meta()` and an `impl()` function that are defined elsewhere. This is because add is implemented as a structured kernel - a new way of implementing operators in PyTorch.

That code is some scaffolding that contains a call to the hand-written "cpu add" kernel. The call to `op.impl()` corresponds directly to the add kernel written in `build/aten/src/ATen/UfuncCPU_add.cpp`:

```cpp
TORCH_IMPL_FUNC(ufunc_add_CPU)(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha, const at::Tensor & out) {
  add_stub(device_type(), *this, alpha);
}
```

(Note: there's a bit more indirection inside of the handwritten kernel before reaching the main part of the add kernel, which lives in `build/aten/src/ATen/UfuncCPUKernel_add.cpp`.)

```cpp
void add_kernel(TensorIteratorBase& iter, const at::Scalar & alpha) {
  AT_DISPATCH_SWITCH(iter.common_dtype(), "add_stub",
    ...
    AT_DISPATCH_CASE(at::ScalarType::Long,
      [&]() {
        auto _s_alpha = alpha.to<scalar_t>();
        auto _v_alpha = at::vec::Vectorized<scalar_t>(_s_alpha);
        cpu_kernel_vec(iter,
          [=](scalar_t self, scalar_t other) { return ufunc::add(self, other, _s_alpha); },
          [=](at::vec::Vectorized<scalar_t> self, at::vec::Vectorized<scalar_t> other) { return ufunc::add(self, other, _v_alpha); }
        );
      }
    )
    ...
  );
}
```

This code lives in the code-generated file `build/aten/src/ATen/UfuncCPUKernel_add.cpp`. So, the code above calls into our hand-written CPU add kernel, returns a new output tensor containing the result, and we're done!

## Takeaway

The main takeaway from the exercise above is that:

- A lot of stuff happens when you call an operator
- ...most of which is code-generated! A lot of this logic is _really_ similar across PyTorch's ~2000 operators, and ripe for abstracting over (through something like code generation).

Sometimes when you're working on / debugging a feature, it can be useful to know which bits of logic are codegen'd, and where that logic lives.
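One low-tech way to get a feel for what's code-generated is to simply list what lands in the build directory after an in-tree build. A small sketch (the paths assume an in-tree `python setup.py develop` build; adjust them for your setup):

```python
# Print a few of the code-generated registration files produced by the build.
# Assumes an in-tree source build; the build/ directory will not exist otherwise.
import glob

for path in sorted(glob.glob("build/aten/src/ATen/Register*.cpp")):
    print(path)
```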
# Structured Kernels

Another big part of the codegen is "structured kernels" - a new way of writing kernels in PyTorch, which uses some clever factoring plus a bunch of codegen to reduce the amount of boilerplate required when writing kernels. torch.add is implemented as a structured kernel, so we're going to walk through the bits of it related to structured kernels.

The process of implementing an operator as a structured kernel involves writing two functions:

- A "meta" function, which asserts that the inputs have the correct shape/dtype and figures out what size the output tensor should be.
- An "impl" function, which does the actual computation. There will be a separate impl() function for every backend (CPU, CUDA, XLA, etc.).

The codegen is responsible for taking these two functions and plugging them together in the right way to create all 3 variants of the operator for you (a Python view of the three variants is shown right after this list):

- at::add() (functional version)
- at::add\_() (inplace version)
- at::add_out() (out= version)
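Here's what the three generated variants look like from the Python side (a small usage example; the C++ names above correspond to `torch.add`, `Tensor.add_`, and the `out=` overload):

```python
# The three variants of add that the structured-kernel codegen produces.
import torch

a = torch.ones(3)
b = torch.ones(3)

c = torch.add(a, b)        # functional: allocates and returns a new tensor
a.add_(b)                  # inplace: writes the result into `a`
out = torch.empty(3)
torch.add(a, b, out=out)   # out=: writes the result into a preallocated tensor
print(c, a, out, sep="\n")
```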
Helpful reading: this presentation on structured kernels includes a diagram of the class hierarchy (which will be useful in the exercise further down):
https://drive.google.com/file/d/16qPvpCF4Jbh7ss2lCQMk5hmcyzJvUyQj/view?usp=sharing

See also: the structured kernels RFC, which contains a more detailed overview of what they are and what the codegen creates:
https://github.com/pytorch/rfcs/blob/rfc-0005/RFC-0005-structured-kernel-definitions.md

## Structured Kernel codegen output example: torch.add

The CPU kernel for the torch.add operator lives in https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/BinaryOps.cpp#L16, and has two components:
### Meta function

```cpp
// expands to structured_add_Tensor::meta() { ... }
TORCH_META_FUNC2(add, Tensor) (
  const Tensor& self, const Tensor& other, const Scalar& alpha
) {
  build_borrowing_binary_op(maybe_get_output(), self, other);
  native::alpha_check(dtype(), alpha);
}
```

### Impl function

```cpp
// expands to structured_add_out::impl() { ... }
TORCH_IMPL_FUNC(add_out) (
  const Tensor& self, const Tensor& other, const Scalar& alpha, const Tensor& result
) {
  add_stub(device_type(), *this, alpha);
  TORCH_INTERNAL_ASSERT(result.scalar_type() == output().dtype());
}
```
So, the code above implements the two functions `structured_add_Tensor::meta()` and `structured_add_out::impl()`, but where are they declared? The codegen creates the declarations for them.

In `NativeMetaFunctions.h`:

```cpp
// namespace at::meta
struct TORCH_API structured_add_Tensor : public TensorIteratorBase {
  void meta(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha);
};
```

In `NativeFunctions.h`:

```cpp
// namespace at::native
struct TORCH_API structured_add_out : public at::meta::structured_add_Tensor {
  void impl(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha, const at::Tensor & out);
};
```

You can see that the codegen generated declarations for the two functions, and we hand-implemented them ourselves in BinaryOps.cpp. But how does the codegen use them?

The code-generated logic that stitches them together lives in the code-generated file `RegisterCPU.cpp`, and looks like this:

```cpp
// functional version
at::Tensor wrapper_CPU_add_Tensor(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
  structured_ufunc_add_CPU_functional op;
  op.meta(self, other, alpha);
  op.impl(self, other, alpha, op.outputs_[0]);
  return std::move(op.outputs_[0]);
}

// inplace version
at::Tensor & wrapper_CPU_add__Tensor(at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {
  structured_ufunc_add_CPU_inplace op(self);
  op.meta(self, other, alpha);
  op.impl(self, other, alpha, op.outputs_[0]);
  if (op.proxy_outputs_[0].has_value()) op.outputs_[0].get().copy_(*op.proxy_outputs_[0]);
  return self;
}

// out= version
at::Tensor & wrapper_CPU_add_out_out(const at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha, at::Tensor & out) {
  structured_ufunc_add_CPU_out op(out);
  op.meta(self, other, alpha);
  op.impl(self, other, alpha, op.maybe_get_output(0));
  if (op.proxy_outputs_[0].has_value()) op.outputs_[0].get().copy_(*op.proxy_outputs_[0]);
  return out;
}

// registering the 3 kernels above to the dispatcher, under the CPU dispatch key
TORCH_LIBRARY_IMPL(aten, CPU, m) {
  ...
  m.impl("add.Tensor", TORCH_FN(wrapper_CPU_add_Tensor));
  m.impl("add.out", TORCH_FN(wrapper_CPU_add_out_out));
  m.impl("add_.Tensor", TORCH_FN(wrapper_CPU_add__Tensor));
}
```

This is the "final" output - the 3 operators that we needed. The codegen created 3 new kernels, each of which calls into our `meta()` and `impl()` functions. The only difference between the 3 is that they use different classes, each of which has a different implementation of `set_output()`. You can also find the definitions of all 3 of these classes in `RegisterCPU.cpp`, but below is the example for `structured_ufunc_add_CPU_functional`:

```cpp
struct structured_ufunc_add_CPU_functional final : public at::native::structured_ufunc_add_CPU {
  void set_output_strided(
      int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
      TensorOptions options, DimnameList names
  ) override {
    outputs_[output_idx] = create_out(sizes, strides, options);
    if (!names.empty()) {
      namedinference::propagate_names(outputs_[output_idx], names);
    }
    // super must happen after, so that downstream can use maybe_get_output
    // to retrieve the output
    at::native::structured_ufunc_add_CPU::set_output_raw_strided(output_idx, sizes, strides, options, names);
  }
  void set_output_raw_strided(
      int64_t output_idx, IntArrayRef sizes, IntArrayRef strides,
      TensorOptions options, DimnameList names
  ) override {
    outputs_[output_idx] = create_out(sizes, strides, options);
    if (!names.empty()) {
      namedinference::propagate_names(outputs_[output_idx], names);
    }
    // super must happen after, so that downstream can use maybe_get_output
    // to retrieve the output
    at::native::structured_ufunc_add_CPU::set_output_raw_strided(output_idx, sizes, strides, options, names);
  }
  const Tensor& maybe_get_output(int64_t output_idx) override {
    return outputs_[output_idx];
  }
  std::array<Tensor, 1> outputs_;
};
```

You can see that it has its own definition of `set_output_strided()` and `set_output_raw_strided()` - in this case, it's implementing the functional `at::Tensor::add` kernel, so it needs to allocate a new tensor as the output (it does that using `at::create_out()`).

That class corresponds to one of the leaves of the class hierarchy - a picture of the full class hierarchy can be found in the linked presentation (https://drive.google.com/file/d/16qPvpCF4Jbh7ss2lCQMk5hmcyzJvUyQj/view?usp=sharing).
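A quick hedged illustration from the Python side of why the different wrapper classes need different `set_output` behavior: the functional wrapper allocates a fresh output, while the out= wrapper writes into (and, if necessary, resizes) the tensor you hand it:

```python
# The functional variant allocates its own output; the out= variant reuses yours,
# resizing it when needed.
import torch

a = torch.ones(3)
b = torch.ones(3)

out = torch.empty(0)       # deliberately the wrong size
torch.add(a, b, out=out)   # set_output resizes `out` rather than allocating a new tensor
print(out.shape)           # torch.Size([3])
```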
## How do we populate the dispatch table?

The dispatcher has a registration API.

* See "Operator Registration" in http://blog.ezyang.com/2020/09/lets-talk-about-the-pytorch-dispatcher/
* Our codegen pipeline takes care of the work of calling the registration API, registering all of our different kernels to most of the important dispatch keys:
  * CPU
  * CUDA
  * Autograd
  * BackendSelect
* The API also includes the ability to define a fallback kernel for a dispatch key (for example, one that does nothing and just falls through to the next key), which is what keys like BackendSelect use for most operators. The same registration API is exposed to Python - see the sketch below.
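As a hedged sketch (the `my_ns` namespace and `my_add` operator are made up for illustration; this is not how ATen registers its own kernels), you can use `torch.library` to define an operator and register a CPU kernel for it yourself:

```python
# Toy use of the registration API from Python: define an operator schema and
# register a kernel for the CPU dispatch key, mirroring what TORCH_LIBRARY /
# TORCH_LIBRARY_IMPL do in C++.
import torch

lib = torch.library.Library("my_ns", "DEF")
lib.define("my_add(Tensor self, Tensor other) -> Tensor")

def my_add_cpu(self, other):
    # plain Python kernel used for the CPU key
    return self + other

lib.impl("my_add", my_add_cpu, "CPU")

print(torch.ops.my_ns.my_add(torch.ones(2), torch.ones(2)))  # tensor([2., 2.])
```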
## Boxing vs unboxing

Helpful resources:

* See "Unboxing" in http://blog.ezyang.com/2020/09/lets-talk-about-the-pytorch-dispatcher/
* See this wiki page, which has a really useful diagram: https://github.com/pytorch/pytorch/wiki/Boxing-and-Unboxing-in-the-PyTorch-Operator-Library

**Understanding boxed vs. unboxed representations**

Unboxed representation:

* Objects have a different layout depending on the data in question.
* This is what you expect from C++: each struct is a different size depending on its type.
* This is great for efficiency - your data is packed together tightly, and only takes up as much space as it needs!

An unboxed data representation has a downside though: you can't write a single function that works over all of your different objects!

Well, you sort of can with templates in C++:

```cpp
template<typename T>
void foo(T obj) {...}
```

In the above, `void foo(T)` is a function template that you call with different types. But if I call `foo("a"); foo(123);`, the compiler generates and stamps out two completely different implementations of foo() - one that looks like `void foo(const char*)`, and another that looks like `void foo(int)`! Templates are handy for avoiding code duplication, but we still end up producing a new specialized function for every different type that's passed into the template.

Contrast that to a boxed representation:

* Objects have a unified layout.
* In general: different programming languages may choose to use a boxed layout by default for all of their types, e.g. Java.
* In PyTorch: we have our own boxed layout implemented in C++. Some of our APIs shove values into these things [called IValues](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/ivalue.h) and toss them onto a stack.
* IValue is a union between Tensor, int64, float64, etc.

Having a boxed representation for types lets us write boxed functions in PyTorch (a toy Python analogue is sketched after this list):

* Boxed functions can be written once, and (if implemented correctly) work for all operators.
* Boxed functions in PyTorch have a very specific schema: `void my_boxed_f(const OperatorHandle& op, std::vector<c10::IValue> stack)`, defined here: https://github.com/pytorch/pytorch/blob/8216da1f23b893c074e76e8a9aa7127efbda4287/aten/src/ATen/core/boxing/KernelFunction.h#L104
* `OperatorHandle` is a class that represents an operator (e.g. `torch.add`, `torch.mul`), and the `std::vector<c10::IValue>` is a stack of IValues that are the inputs to that operator.
* The general idea when writing a boxed function: pop the inputs off of the stack, compute some output(s), and push them back onto the stack.
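Here is that toy Python analogue of the boxed calling convention (purely illustrative - the real thing is the C++ KernelFunction/IValue machinery linked above):

```python
# Toy analogue of a boxed function: the arguments arrive as an untyped stack,
# and a single generic function can service any operator.
import torch

def boxed_call(op, stack):
    # pop the inputs off of the stack, compute the output, push it back on
    args = list(stack)
    stack.clear()
    stack.append(op(*args))

stack = [torch.ones(2), torch.ones(2)]
boxed_call(torch.add, stack)
print(stack[0])  # tensor([2., 2.])
```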
**Example where boxing is used: the Batched fallback kernel**

* https://github.com/pytorch/pytorch/blob/2c554266108f1b556dd49f7c3c06c08f2bbd3cbe/aten/src/ATen/BatchedFallback.cpp#L238
* `void batchedTensorForLoopFallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {...}`
* Essentially, all of the arguments are IValues on the stack, and we have a handle to an operator in the dispatch table. The fallback tells us how to handle this operator.
* What BatchedFallback does is the following: for each sample in the batch, call op(sample). For example:

```py
x = torch.randn(N, 3)
y = torch.randn(N, 3)
torch.add(x, y)
```

* There can be multiple tensors present in `stack`.
* BatchedFallback takes all the IValues, converts some to Tensor, slices them in the batch dimension as necessary, and calls `op` multiple times.
* It ends up doing: `torch.stack([x[0] + y[0], x[1] + y[1], x[2] + y[2], ...])` (a small Python rendering of this is shown right after this list).
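A small Python rendering of what that per-sample loop amounts to for the example above (an illustration of the behavior, not the actual fallback code):

```python
# The batched fallback conceptually runs the op once per example and stacks the results.
import torch

N = 4
x = torch.randn(N, 3)
y = torch.randn(N, 3)

looped = torch.stack([torch.add(x[i], y[i]) for i in range(N)])
print(torch.allclose(looped, torch.add(x, y)))  # True
```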
**Benefits of boxing in PyTorch**

There are some benefits to writing batching logic as a boxed fallback like the above:

* Complexity decrease. This is kind of subjective, but arguably a single boxed kernel like the above is easier to maintain compared to the alternatives. If you want to write some functionality that works for every operator in PyTorch, some alternatives to writing a boxed fallback are:
  * Manually write 1000+ versions of your code, one for each operator (ouch).
  * Write fancy template metaprogramming logic to templatize over the operator and argument types (ouch).
  * Write codegen that generates all of the different kernels for you.
    * We actually do this in some cases: for autograd, and (currently) for tracing. This code is faster than a boxed fallback, but also requires work and careful design to make it maintainable.
* Binary size: we only have one function, instead of having separate specialized functions for every operator (and there are 1000+ operators).
  * This is especially important for the mobile use case: mobile cares a lot about having a small binary size!
  * Mobile internally also uses the Lite Interpreter, which executes ops in a boxed format. (I'm not an expert on what this looks like, though.)

**How is this related to the Dispatcher?**

So, how is this whole notion of boxed vs. unboxed kernels relevant to the dispatcher?

Well, suppose we have a boxed kernel for batching like we described above, and with batching turned on, I call `torch.sin(x)` on a CPU tensor. We expect the batching logic to run before we eventually hit the sin() kernel.

The dispatcher is responsible for going from

* `at::sin()` (normal, unboxed frontend C++ entry point)
* to `void batchedTensorForLoopFallback(const c10::OperatorHandle&, torch::jit::Stack*)` (BOXED kernel that performs batching - somehow all of the arguments to sin() need to be wrapped into a Stack!)
* to `at::native::sin()` (normal, UNBOXED cpu kernel for sin() - somehow the arguments need to be unboxed again so we can call this unboxed function!)

Where does all of this boxing and unboxing conversion logic happen? Since the dispatcher provides APIs for registering both unboxed and boxed kernels, it also needs to know how to convert arguments and operators between the unboxed and boxed worlds between invocations.

* If you're curious, some of the template magic that does that lives around here: https://github.com/pytorch/pytorch/blob/8216da1f23b893c074e76e8a9aa7127efbda4287/aten/src/ATen/core/boxing/impl/make_boxed_from_unboxed_functor.h#L521