Features for large-scale deployments
====================================

.. contents:: :local:

This note talks about several extension points and tricks that might be useful
when running PyTorch within a larger system or operating multiple systems using
PyTorch in a larger organization.

It doesn't cover the topic of deploying models to production; for that, check
:mod:`torch.jit` or one of the corresponding tutorials.

The note assumes that you either build PyTorch from source in your
organization or have the ability to statically link additional code to be loaded
when PyTorch is used. Therefore, many of the hooks are exposed as C++ APIs that
can be triggered once in a centralized place, e.g. in static initialization
code.

Fleet-wide operator profiling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyTorch comes with :mod:`torch.autograd.profiler`, which is capable of measuring the
time taken by individual operators on demand. One can use the same mechanism to run
"always on" measurements for any process running PyTorch. This can be useful for
gathering information about PyTorch workloads running in a given process or
across the entire set of machines.
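
The global callbacks described below complement this on-demand profiler. As a point of
reference, here is a minimal sketch of an on-demand measurement from Python; the tensor
sizes and the ``sort_by`` key are arbitrary choices for illustration:

.. code-block:: python

    import torch

    x = torch.randn(128, 128)

    # Collect per-operator timings only for the code under the context manager.
    with torch.autograd.profiler.profile() as prof:
        y = x.mm(x).relu()

    # Aggregate events by operator name and print a summary table.
    print(prof.key_averages().table(sort_by="self_cpu_time_total"))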

New callbacks for any operator invocation can be added with
``torch::addGlobalCallback``. Hooks will be called with a
``torch::RecordFunction`` struct that describes the invocation
context (e.g. `name`). If enabled, ``RecordFunction::inputs()`` contains the arguments
of the function represented as the ``torch::IValue`` variant type. Note that input
logging is relatively expensive and thus has to be enabled explicitly.

The operator callbacks also have access to the ``c10::ThreadLocalDebugInfo::get()``
interface, which returns a pointer to the struct holding the debug information.
The debug information can be set earlier by using the ``at::DebugInfoGuard`` object.
Debug information is propagated through the forward (including async ``fork``
tasks) and backward passes and can be useful for passing extra information
about the execution environment (e.g. a model id) from the higher layers of the
application down to the operator callbacks.

Invoking callbacks adds some overhead, so usually it's useful to just randomly
sample operator invocations. This can be enabled on a per-callback basis with an
optional sampling rate passed into ``torch::addGlobalCallback``.

Note that ``addGlobalCallback`` is not thread-safe and can be called only when no
PyTorch operator is running. Usually, it's a good idea to register the callbacks
once during initialization.

Here's an example:

.. code-block:: cpp

    // Called somewhere in the program beginning
    void init() {
      // Sample one in a hundred operator runs randomly
      addGlobalCallback(
        RecordFunctionCallback(
          &onFunctionEnter,
          &onFunctionExit)
        .needsInputs(true)
        .samplingProb(0.01)
      );
      // Note, to enable observers in the model calling thread,
      // call enableRecordFunction() in the thread before running a model
    }

    void onFunctionEnter(const RecordFunction& fn) {
      std::cerr << "Before function " << fn.name()
                << " with " << fn.inputs().size() << " inputs" << std::endl;
    }

    void onFunctionExit(const RecordFunction& fn) {
      std::cerr << "After function " << fn.name() << std::endl;
    }

API usage logging
^^^^^^^^^^^^^^^^^

When running in a broader ecosystem, for example in a managed job scheduler, it's
often useful to track which binaries invoke particular PyTorch APIs. Simple
instrumentation is injected at several important API points and triggers a given
callback. Because PyTorch is usually invoked in one-off Python scripts, the
callback fires only once per process for each of the APIs.

``c10::SetAPIUsageLogger`` can be used to register an API usage instrumentation
handler. The passed argument is an "api key" identifying the used point, for
example ``python.import`` for a PyTorch extension import or
``torch.script.compile`` if TorchScript compilation was triggered.

.. code-block:: cpp

    SetAPIUsageLogger([](const std::string& event_name) {
      std::cerr << "API was used: " << event_name << std::endl;
    });

Note for developers: new API trigger points can be added in code with
``C10_LOG_API_USAGE_ONCE("my_api")`` in C++ or
``torch._C._log_api_usage_once("my.api")`` in Python.
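
For example, a Python library built on top of PyTorch could mark its own entry points
so that fleet-wide dashboards can attribute usage to it. This is only a sketch; the
``my_library.my_feature`` key is a hypothetical name chosen for illustration:

.. code-block:: python

    import torch

    def my_feature():
        # Fires at most once per process for this key; subsequent calls are no-ops.
        torch._C._log_api_usage_once("my_library.my_feature")
        # ... actual implementation ...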

Attaching metadata to saved TorchScript models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TorchScript modules can be saved as an archive file that bundles serialized
parameters and module code as TorchScript (see :meth:`torch.jit.save`). It's
often convenient to bundle additional information together with the model, for
example, a description of the model producer or auxiliary artifacts.

This can be achieved by passing the ``_extra_files`` argument to
:meth:`torch.jit.save` and ``torch::jit::load`` to store and retrieve
arbitrary binary blobs during the saving process. Since TorchScript files are
regular ZIP archives, extra information gets stored as regular files inside the
archive's ``extra/`` directory.
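
From Python, a minimal round trip might look like the following; the module, file
names and payload below are illustrative:

.. code-block:: python

    import torch

    class MyModule(torch.nn.Module):
        def forward(self, x):
            return x + 1

    m = torch.jit.script(MyModule())

    # Store an arbitrary blob next to the serialized module code and parameters.
    torch.jit.save(m, "model.pt", _extra_files={"producer_info.json": '{"team": "vision"}'})

    # To read it back, pass a dict whose keys name the files to retrieve;
    # the values are populated in place during loading.
    files = {"producer_info.json": ""}
    loaded = torch.jit.load("model.pt", _extra_files=files)
    print(files["producer_info.json"])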

There's also a global hook allowing you to attach extra files to any TorchScript
archive produced in the current process. It might be useful to tag models with
producer metadata, akin to JPEG metadata produced by digital cameras. Example
usage might look like:

.. code-block:: cpp

    SetExportModuleExtraFilesHook([](const Module&) {
      ExtraFilesMap files;
      files["producer_info.json"] =
          std::string("{\"user\": \"") + getenv("USER") + "\"}";
      return files;
    });

Build environment considerations
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TorchScript's compilation needs to have access to the original Python files as
it uses Python's ``inspect.getsource`` call. In certain production environments
it might be necessary to explicitly deploy ``.py`` files along with the
precompiled ``.pyc`` files.
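
To see what compilation relies on, note that scripting a function is only possible if
its source can be recovered; the snippet below (pure standard library, shown only for
illustration) is essentially the lookup TorchScript performs:

.. code-block:: python

    import inspect

    def fn(x):
        return x + 1

    # torch.jit.script needs this call to succeed for every function it compiles.
    # If only the .pyc is deployed, it raises OSError because the source is missing.
    print(inspect.getsource(fn))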

Common extension points
^^^^^^^^^^^^^^^^^^^^^^^

PyTorch APIs are generally loosely coupled and it's easy to replace a component
with a specialized version. Common extension points include:

* Custom operators implemented in C++ - see `tutorial for more details <https://pytorch.org/tutorials/advanced/cpp_extension.html>`_.
* Custom data reading can often be integrated directly by invoking the corresponding python library. Existing functionality of :mod:`torch.utils.data` can be utilized by extending :class:`~torch.utils.data.Dataset` or :class:`~torch.utils.data.IterableDataset`, as in the sketch below.
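
As an illustration of the second point, a minimal map-style dataset only needs
``__len__`` and ``__getitem__``; the in-memory data source here is a stand-in for
whatever custom reading logic an organization already has:

.. code-block:: python

    import torch
    from torch.utils.data import DataLoader, Dataset

    class InMemoryDataset(Dataset):
        """Wraps a list of (feature, label) pairs in the Dataset interface."""

        def __init__(self, samples):
            self.samples = samples

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, index):
            feature, label = self.samples[index]
            return torch.tensor(feature), torch.tensor(label)

    dataset = InMemoryDataset([([1.0, 2.0], 0), ([3.0, 4.0], 1)])
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    for features, labels in loader:
        print(features.shape, labels)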