pytorch/torch/_python_dispatcher.py
Brian Hirsh 1b7d7d9327 Reland: "free up dispatch key space (in C++)" (#74963)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74963

This is a re-land of D35192346 (9872a06d77) and D35192317 (a9216cde6c), which together form a diff that changes the internal representation of `DispatchKeySet` in pytorch core to increase the number of dispatch keys that we have available. See a more detailed description of the design in the original PR: https://github.com/pytorch/pytorch/pull/69633.

The original PR broke Milan workflows, which use a pytorch mobile build, and manifested as a memory corruption bug inside of `liboacrmerged.so`.

**Background: Existing Mobile Optimization**
Pytorch mobile builds have an existing optimization (here cc23725e89/c10/core/DispatchKey.h (L382) and here cc23725e89/aten/src/ATen/core/dispatch/OperatorEntry.h (L214)), which works as follows:

Every operator in pytorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator in pytorch (autograd, cpu, cuda, complex number support, etc).

In mobile builds, the size of that table is shrunk from 64 to 8 to save a bunch of space, because mobile doesn't end up using the functionality associated with most dispatch keys.

The dispatcher also has a notion of "fallback kernels", which are kernels that you can register to a particular dispatch key, but should be able to work for "any operator". The array of fallback kernels is defined here: cc23725e89/aten/src/ATen/core/dispatch/Dispatcher.h (L294).

The mobile-optimization currently does **not** extend to this array (it wouldn't be that useful anyway because there is only one array of fallback kernels globally - vs. there is a separate dispatch table of function pointers per operator). So the per-operator tables on mobile are size 8, while the fallback table is size 64.
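
To make the size argument concrete, here is a rough back-of-the-envelope sketch. It only counts function-pointer slots using the 64 and 8 figures quoted above; `NUM_OPERATORS` is a made-up number chosen for illustration, not a measurement of any real build.

```python
# Illustrative model only. The table sizes mirror the numbers quoted above;
# NUM_OPERATORS is a hypothetical operator count, not a real figure.
FULL_TABLE_SIZE = 64      # dispatch table slots per operator on a regular build
MOBILE_TABLE_SIZE = 8     # dispatch table slots per operator on a mobile build
FALLBACK_TABLE_SIZE = 64  # the single, global array of fallback kernels
NUM_OPERATORS = 2000      # hypothetical


def total_slots(per_operator_size, fallback_size, num_operators):
    """Count slots: one table per operator plus one global fallback array."""
    return per_operator_size * num_operators + fallback_size


regular = total_slots(FULL_TABLE_SIZE, FALLBACK_TABLE_SIZE, NUM_OPERATORS)
mobile = total_slots(MOBILE_TABLE_SIZE, FALLBACK_TABLE_SIZE, NUM_OPERATORS)
mobile_small_fallback = total_slots(MOBILE_TABLE_SIZE, MOBILE_TABLE_SIZE, NUM_OPERATORS)

# Slots saved by shrinking every per-operator table.
print(regular - mobile)
# Additional slots saved by also shrinking the one global fallback array.
print(mobile - mobile_small_fallback)
```

Under these assumptions the per-operator saving scales with the number of operators, while shrinking the fallback array saves a fixed 56 slots, which is why extending the optimization to it buys very little.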

**The Bug**
This PR makes it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and the original version incidentally shrank the size of the fallback array from 64 to 8 for mobile (that happened on this line: https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294).

That isn't a problem by itself (since mobile doesn't actually use any of the fallbacks that can no longer be stored). However, pytorch core will still register all of those fallback kernels on startup in mobile builds, even if they aren't used. When we tried to register one of those fallbacks on startup, it would try to dump the kernel somewhere in memory past the bounds of the (now smaller) array inside of the `Dispatcher` object, `backendFallbackKernels_`.
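
The failure mode can be sketched with a toy model. Everything below is hypothetical: `ToyDispatcher`, the key indices, and the kernel names are invented for illustration and are not the real `Dispatcher` API. The point is only that registrations whose key index no longer fits in the shrunken array have nowhere valid to go.

```python
class ToyDispatcher:
    """Hypothetical stand-in for the C++ Dispatcher; not the real API."""

    def __init__(self, fallback_table_size):
        # Models the fixed-size array of fallback kernels.
        self.backend_fallback_kernels = [None] * fallback_table_size

    def register_fallback(self, key_index, kernel_name):
        # This toy version checks bounds and raises. An unchecked C++ array
        # write would instead land somewhere past the end of the array.
        if not 0 <= key_index < len(self.backend_fallback_kernels):
            raise IndexError(f"key index {key_index} is outside the fallback table")
        self.backend_fallback_kernels[key_index] = kernel_name


def out_of_bounds_registrations(dispatcher, registrations):
    """Return the key indices whose registration would not fit in the table."""
    bad = []
    for key_index, kernel_name in registrations:
        try:
            dispatcher.register_fallback(key_index, kernel_name)
        except IndexError:
            bad.append(key_index)
    return bad


# Hypothetical startup registrations: (dispatch key index, kernel name).
startup_registrations = [(3, "fallback_a"), (12, "fallback_b"), (40, "fallback_c")]

print(out_of_bounds_registrations(ToyDispatcher(64), startup_registrations))
print(out_of_bounds_registrations(ToyDispatcher(8), startup_registrations))
```

With a 64-slot table every registration fits. With an 8-slot table, the registrations for indices 12 and 40 fall outside the array even though nothing on mobile would ever dispatch to them.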

**Why didn't this problem show up in OSS CI? Why didn't it break other internal mobile workflows aside from Milan?**

Ideally, this failure would show up as part of the OSS signal on GitHub, since we already have mobile OSS builds. Given that it was another memory corruption issue that only affected Milan (subset of mobile), I'm not sure what's specific about Milan's builds that caused it only to manifest there. dreiss I wonder if there's another flavor of mobile builds we could run in OSS CI that could potentially help catch this?

**The debugging experience was pretty difficult**

Debugging the Milan-specific failure was made difficult by the following:

(1) Lack of CI.
- The original Milan failure didn't surface on my original diff, because the Milan job(s) that failed weren't triggered to run on pytorch changes. There's probably a balance to strike here, since those jobs will only be useful if they aren't flaky, and if they can produce reliable failure logs for debugging.

(2) It's difficult to get a repro.
- My work laptop doesn't have the right specs to run the Milan development workflow (not enough disk space).
- There is an existing OnDemand workflow for Milan, but it appears to be relatively new, and after a bunch of help from MarcioPorto, we ran into issues forwarding the log output from Milan tests on the emulator back to the terminal (see the original discussion here: https://fb.workplace.com/groups/OnDemandFRL/permalink/1424937774645433/)

(3) Lack of stack-traces.
- Most Milan failures didn't include actionable stack traces. phding generously helped me debug by running my suggested patches locally, and reporting back if there were any failures. The failing test didn't include a stack trace though (just the line where the crash appeared), so I ended up making some educated guesses about what the issue was based on the area of the crash.
ghstack-source-id: 152688542

Test Plan: Confirmed with phding that the broken Milan workflow from the previous version of this diff is now passing.

Reviewed By: phding, albanD

Differential Revision: D35222806

fbshipit-source-id: 0ad115a0f768bc8ea5d4c203b2990254c7092d30
(cherry picked from commit 002b91966f11fd55ab3fa3801b636fa39a6dd12c)
2022-03-31 21:52:38 +00:00


import re
import torch._C as C


"""
PythonDispatcher class is a thin python-binding to C++ dispatcher and it
is designed to show how dispatcher precompute works. In particular,
it shows for a certain op `foo`, what the computed dispatch table looks
like after users register their kernels to certain dispatch keys.

In the real C++ dispatcher we support many dispatch keys for different
functionalities. For simplicity PythonDispatcher only supports dispatch
keys for a single example of each use case. These use cases are listed below:

- CPU/AutogradCPU: represents in-tree backends which we usually have dedicated inference &
    autograd kernel in pytorch core library.
    E.g. CPU, CUDA
- FPGA/AutogradOther: represents in-tree backends which we usually have backend specific
    inference kernels, but they share the same autograd kernel specified in AutogradOther.
    E.g. FPGA, SparseCsrCPU
- XLA/AutogradXLA: represents out-of-tree backends which we don't have either inference or autograd
    kernel defined in pytorch core library. Backend owner is responsible for registering both
    inference & autograd kernels in their extensions (e.g. torch-xla) for the operators they support.
    E.g. XLA, XPU, MLC
- CompositeExplicitAutograd: alias key mapped to inference kernels of all backends like CPU, CUDA, XLA etc.
    Kernels registered to this key MUST work for inference for all backends.
- Autograd: alias key mapped to autograd of all backends like AutogradCPU, AutogradXLA, AutogradOther.
    Kernels registered to this key MUST work for autograd for all backends.
- CompositeImplicitAutograd: alias key CompositeImplicitAutograd = CompositeExplicitAutograd + Autograd
    Kernels registered to this key MUST work for both inference + autograd for all backends.

Note we only allow registrations to alias keys inside pytorch core library. E.g.
you shouldn't register a CompositeImplicitAutograd or CompositeExplicitAutograd
kernel from torch-xla extension, instead you should upstream the kernel into
pytorch/pytorch repo so that it's available for all backends and continuously
tested even without the extension.

Usage:
  dispatcher = PythonDispatcher()
  dispatcher.register(["CPU", "XLA", "CompositeImplicitAutograd"])
  print(dispatcher.dispatchTable()) # This tells you exactly which kernel is used for certain backend.
  # For more debugging information
  # print(dispatcher.keys())
  # print(dispatcher.registrations())
  # print(dispatcher.rawRegistrations())
  # print(dispatcher.rawDispatchTable())

PythonDispatcher calls C++ dispatcher under the hood to precompute the dispatch table.
This file only provides the simplified API for developers, relevant test code is located in
test/test_dispatch.py
"""


class PythonDispatcher:
    namespace = "__test__"
    name = "foo"
    runtime_keys = [
        "CPU", "AutogradCPU",
        "FPGA", "AutogradOther",
        "XLA", "AutogradXLA",
        "Lazy", "AutogradLazy",
    ]
    alias_keys = [
        "CompositeExplicitAutograd",
        "Autograd",
        "CompositeImplicitAutograd",
    ]
    supported_keys = runtime_keys + alias_keys

    def __init__(self):
        C._dispatch_check_invariants(self.name)  # type: ignore[attr-defined]
        self.ref = C._dispatch_library("FRAGMENT", self.namespace, "")  # type: ignore[attr-defined]
        self.ref.def_("foo(Tensor x) -> Tensor")

    """
    Returns a list of dispatch keys supported by PythonDispatcher.
    You can register kernels to these keys.
    """
    def keys(self):
        return self.supported_keys

    """
    Register kernels to the target dispatchKeys.
    dispatchKeys(list[str]): a list of dispatch keys for which you want to register
      your own kernel. Note that you don't need to write the kernel yourself in
      this PythonDispatcher. E.g. for CPU key, a kernel (e.g. fn_CPU for CPU) is
      automatically generated and registered.
    """
    def register(self, dispatchKeys):
        # Overriding is not supported and triggers a warning in C++ dispatcher.
        if len(set(dispatchKeys)) != len(dispatchKeys):
            raise RuntimeError(f"Overriden is not allowed but found duplicates in {dispatchKeys}.")
        # We currently forbid this in codegen instead of C++ dispatcher.
        if 'CompositeImplicitAutograd' in dispatchKeys and 'CompositeExplicitAutograd' in dispatchKeys:
            raise RuntimeError("Registration to both CompositeImplicitAutograd and CompositeExplicitAutograd is not allowed.")
        for key in dispatchKeys:
            if key not in self.supported_keys:
                raise RuntimeError(f"{key} is not supported, please select a dispatch key in {self.supported_keys}.")
            self.ref.impl_t_t("foo", dispatch=key, debug="fn_" + key)

    """
    Helper function to format (key, kernel).
    """
    def _format_line(self, key, kernel):
        return "{:<15} {}\n".format(key, kernel)

    """
    Helper function to print a table header.
    """
    def _format_header(self, header):
        s = f"""
{header}
"""
        s += self._format_line("key", "kernel")
        s += "---------------------------\n"
        return s

    """
    Returns raw output of all registration info for debugging only.
    Use registrations() for a simplified version.
    """
    def rawRegistrations(self):
        return C._dispatch_dump("{}::{}".format(self.namespace, self.name))  # type: ignore[attr-defined]

    """
    Returns raw output of computed dispatch table for debugging only.
    Use dispatchTable() for a simplified version.
    """
    def rawDispatchTable(self):
        return C._dispatch_dump_table("{}::{}".format(self.namespace, self.name))  # type: ignore[attr-defined]

    """
    Returns a table (str) including all the registrations from users.
    Note this includes registrations to both runtime keys and alias keys.
    """
    def registrations(self):
        output = self._format_header("Registered Kernels")
        state = self.rawRegistrations()
        state_entries = state.split('\n')
        for line in state_entries:
            first = line.split(":")[0]
            if any(first.startswith(k) for k in self.supported_keys):
                kernel = line.split("::")[0].split(" ")[1]
                output += self._format_line(first, kernel)
        return output

    """
    Returns the computed dispatch table (str). Note this only includes
    runtime keys, registrations to alias keys have been decoded to their
    mapped runtime keys.
    """
    def dispatchTable(self):
        output = self._format_header("Computed Dispatch Table")
        table = self.rawDispatchTable()
        table_entries = table.split('\n')
        # Strip the "registered at <path>FallbackKernel.cpp..." location so that
        # fallback entries print in the same short form as regular kernels.
        regex = re.compile(r"registered at .*FallbackKernel\.cpp.*(\[)")
        for line in table_entries:
            k = line.split(":")[0]
            if k in self.runtime_keys:
                entry = regex.sub('[', line)
                output += self._format_line(k, entry.split(": ")[1])
        return output
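
The line filtering done by `dispatchTable()` can be tried without a PyTorch build. The sketch below reuses the same regex and splitting logic on a hand-written sample. The sample lines, the path in them, and the helper name `simplify_table` are assumptions made for illustration; the real text comes from the C++ dispatcher and may differ.

```python
import re

# Subset of runtime keys, enough for the example.
RUNTIME_KEYS = ["CPU", "AutogradCPU", "XLA"]

# Hypothetical sample of a raw dispatch table dump (format assumed).
SAMPLE_RAW_TABLE = "\n".join([
    "CPU: fn_CPU [kernel]",
    "AutogradCPU: fallthrough registered at some/path/VariableFallbackKernel.cpp:35 [backend fallback]",
    "Undefined: fn_Undefined [kernel]",
])


def simplify_table(raw_table, runtime_keys):
    """Keep only runtime-key rows and drop the fallback source location."""
    regex = re.compile(r"registered at .*FallbackKernel\.cpp.*(\[)")
    rows = []
    for line in raw_table.split("\n"):
        key = line.split(":")[0]
        if key in runtime_keys:
            # Collapse "registered at ...FallbackKernel.cpp:NN [" down to "[".
            entry = regex.sub("[", line)
            rows.append((key, entry.split(": ")[1]))
    return rows


for key, kernel in simplify_table(SAMPLE_RAW_TABLE, RUNTIME_KEYS):
    print("{:<15} {}".format(key, kernel))
```

Rows whose key is not a runtime key (here `Undefined`) are skipped, and the fallback row loses its source location so that it lines up with the regular kernel rows.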