This refactor was prompted by challenges handling mixed int/float
operations in C++. A previous version of this patch
added overloads for each permutation of int/float, which was unwieldy
(https://github.com/pytorch/pytorch/pull/87722/). This PR takes a different
approach.
The general outline of the patch is to combine the C++ types SymIntNode
and SymFloatNode into a single type, SymNode. This type is erased: we
no longer know statically in C++ whether we have an int or a float, and have
to test with the is_int()/is_float() virtual methods. This has a number of
knock-on effects.
- We no longer have C++ classes to bind to Python. Instead, we take an
entirely new approach to our Python API, where we have a SymInt/SymFloat
class defined entirely in Python, which holds a SymNode (which corresponds
to the C++ SymNode). However, SymNode is not pybind11-bound; instead,
it lives as-is in Python, and is wrapped into C++ SymNode using PythonSymNode
when it goes into C++. This implies a userland rename.
In principle, it is also possible for the canonical implementation of SymNode
to be written in C++ and then bound to Python with pybind11 (we have
this code, although it is commented out). However, I did not implement
this, as we currently have no C++ implementations of SymNode.
Because we do return SymInt/SymFloat from C++ bindings, the C++ binding
code needs to know how to find these classes. Currently, this is done
just by manually importing torch and getting the attributes.
- Because SymInt/SymFloat are simple Python wrappers, __sym_dispatch__ now
takes SymInt/SymFloat, rather than SymNode, bringing it in line with how
__torch_dispatch__ works.
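To make the layering concrete, here is a minimal plain-Python sketch; the class and attribute names are illustrative stand-ins, not a transcript of the patch:
```
class SymNode:
    """Type-erased node (illustrative stand-in for the real SymNode)."""
    def __init__(self, expr, pytype):
        self.expr = expr
        self.pytype = pytype          # int or float, queried dynamically

    def is_int(self):
        return self.pytype is int

    def add(self, other):             # plain method; no __add__ here
        return SymNode(f"({self.expr} + {other.expr})", self.pytype)

class SymInt:
    """User-facing wrapper; only this layer gets magic methods."""
    def __init__(self, node):
        self.node = node

    def __add__(self, other):         # magic method delegates to the node
        return SymInt(self.node.add(other.node))

s = SymInt(SymNode("s0", int)) + SymInt(SymNode("s1", int))
print(s.node.expr)  # (s0 + s1)
```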
Some miscellaneous improvements:
- SymInt now has a constructor that takes SymNode. Note that this
constructor is ambiguous if you pass in a subclass of SymNode,
so an explicit downcast is necessary. This means toSymFloat/toSymInt
are no more. This is a mild optimization, as it means rvalue references
work automatically.
- We uniformly use the caster for c10::SymInt/SymFloat, rather than
going the long way via the SymIntNode/SymFloatNode.
- Removed some unnecessary toSymInt/toSymFloat calls in normalize_*
functions; I'm pretty sure this doesn't change behavior.
- guard_int is now a free function, since to guard on an int you cannot
assume the method exists; a free function can handle both int and SymInt
inputs (see the sketch after this list).
- We clean up the magic method definition code for SymInt/SymFloat/SymNode.
ONLY the user classes (SymInt/SymFloat) get magic methods; SymNode gets
plain methods; this is to help avoid confusion between the two types.
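To illustrate the guard_int point above, a hedged sketch of a free function handling both plain ints and SymInts; the SymInt stub and the node's guard_int() method are assumptions for exposition, not the real API:
```
# A free function works on both plain ints and SymInts, whereas a
# guard_int() *method* would only exist on SymInt.
class SymInt:
    def __init__(self, node):
        self.node = node

def guard_int(x):
    if isinstance(x, int):
        return x                  # already concrete; nothing to guard
    # Specialize the symbolic int to its backing value; the real
    # implementation also records a guard with the tracing system.
    return x.node.guard_int()     # assumed node method, illustrative

print(guard_int(5))  # works on a plain int with no method lookup needed
```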
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
cc @jansel @mlazos @soumith @voznesenskym @yanboliang @penguinwu @anijain2305
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87817
Approved by: https://github.com/albanD, https://github.com/anjali411
This reverts commit 978b46d7c96627e3b3553ad70ad21cb161d05f90.
Reverted https://github.com/pytorch/pytorch/pull/86488 on behalf of https://github.com/osalpekar due to Broke executorch builds internally with the following message: RuntimeError: Missing out variant for functional op: aten::split.Tensor(Tensor(a -> *) self, SymInt split_size, int dim=0) -> Tensor(a)[] . Make sure you have loaded your custom_ops_generated_lib
symintify split_with_sizes, dropout, fused_fake_obs_quant. meta for padding_2d ops
add meta_bernoulli_
meta kernel for at::gather
get pytorch_struct to pass: meta for scatter_add, fix backward
symintify split ops
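For context, a meta kernel like the at::gather one above computes output metadata (shape, dtype, device) without touching real data. A hedged sketch of what such a function might look like; the registration mechanism is omitted and the body is illustrative:
```
import torch

# Illustrative meta function for at::gather: the output has the shape
# of `index` and the dtype/device of `self`, with no data computed.
def meta_gather(self, dim, index, sparse_grad=False):
    return self.new_empty(index.shape)
```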
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86488
Approved by: https://github.com/ezyang
symintify split_with_sizes, dropout, fused_fake_obs_quant. meta for padding_2d ops
add meta_bernoulli_
meta kernel for at::gather
get pytorch_struct to pass: meta for scatter_add, fix backward
symintify split ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86334
Approved by: https://github.com/ezyang
- TensorGeometry supports symint
- check_size supports symint
- functorch batch rules: improved symint support
- Some operator support for symint in LTC
- More supported operations on SymInt and SymFloat
- More symint support in backwards formulas
This merge includes code contributions from bdhirsh and anjali411.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/86160
Approved by: https://github.com/Chillee
From the perspective of having valid sympy expressions for any given size/stride property, we can have tensors inherit SymInts from each other (in cases where the size expression is unchanged, which is a common case).
But we also use SymInts to let us build graph traces of our programs, and we need to be able to trace from a SymInt back to the tensor that it originated from in order to trace correct graphs. This change ensures each tensor starts with fresh SymInts.
- note: our policy has already been to use PySymIntNode objects to store pointers to proxy-tracer objects for use during tracing
- before making this change (to clone symints), sometimes we'd attempt to store more than one proxy-tracer object on the same symint and the last-stored one would clobber all the earlier ones. This would result in tracing the wrong graph in some cases.
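A toy illustration of the clobbering problem; the class and field names are made up for exposition:
```
# Two tensors sharing one symint node means the second trace overwrites
# the proxy pointer stored by the first.
class PySymIntNode:
    def __init__(self, expr):
        self.expr = expr
        self.proxy = None  # pointer to a proxy-tracer object

s0 = PySymIntNode("s0")
s0.proxy = "proxy_for_tensor_a"  # tensor A's size traced
s0.proxy = "proxy_for_tensor_b"  # tensor B reuses the node: A's proxy is clobbered

# With fresh (cloned) symints per tensor, each node keeps its own proxy:
a_size, b_size = PySymIntNode("s0"), PySymIntNode("s0")
a_size.proxy, b_size.proxy = "proxy_for_tensor_a", "proxy_for_tensor_b"
```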
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85878
Approved by: https://github.com/ezyang
- Support storing SymFloat in IValue
- Add SymFloat to JIT type system (erases to float)
- Printing support for SymFloat
- add/sub/mul/truediv operator support for SymFloat
- Support truediv on integers; it returns a SymFloat
- Support parsing SymFloat from Python object
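To illustrate the truediv rule, a plain-Python sketch (not the real classes) of int/int division producing a float-typed symbol, mirroring Python's int / int -> float:
```
class SymFloat:
    def __init__(self, expr):
        self.expr = expr

class SymInt:
    def __init__(self, expr):
        self.expr = expr

    def __truediv__(self, other):
        # Dividing two symbolic ints yields a symbolic float.
        return SymFloat(f"({self.expr}) / ({other.expr})")

result = SymInt("s0") / SymInt("s1")
print(type(result).__name__, result.expr)  # SymFloat (s0) / (s1)
```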
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85411
Approved by: https://github.com/albanD
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
From @ezyang's original PR:
There are a number of situations where we have non-backend kernels (e.g., CompositeImplicitAutograd, batching rules) that we would like to port to Python, but we have no way to integrate these ports with the overall system while otherwise using preexisting C++ registrations. This PR changes that by introducing a Python dispatcher (which can have its own kernels directly in Python), which can interpose over ordinary C++ dispatch. The ingredients:
We introduce a new PythonDispatcher dispatch key that has the same tenor as FuncTorchDynamicLayerFrontMode: it is triggered before every other dispatch key in the dispatch key set, shunting to a Python implementation.
The Python dispatcher is a per-interpreter global object that is enabled/disabled via the guard EnablePythonDispatcher/DisablePythonDispatcher. We don't make it compositional as I have no idea what a compositional version of this feature would look like. Because it is global, we don't need to memory manage it and so I use a simpler SafePyHandle (newly added) to control access to this pointer from non-Python C++. Like __torch_dispatch__, we use PyInterpreter to get to the Python interpreter to handle the dispatch.
I need to reimplement dispatch table computation logic in Python. To do this, I expose a lot more helper functions for doing computations on alias dispatch keys and similar. I also improve the pybind11 handling for DispatchKey so that you can either accept the pybind11 bound enum or a string; this simplifies our binding code. See https://github.com/pybind/pybind11/issues/483#issuecomment-1237418106 for how this works; the technique is generally useful.
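A pure-Python analogue of the pybind11 convenience described above; the enum values here are placeholders, not the real c10::DispatchKey:
```
from enum import Enum

class DispatchKey(Enum):      # placeholder values for exposition
    CPU = 0
    CUDA = 1
    CompositeImplicitAutograd = 2

def to_dispatch_key(key):
    # Accept either the bound enum or its string name and normalize,
    # as the improved bindings do for DispatchKey.
    if isinstance(key, str):
        return DispatchKey[key]
    return key

assert to_dispatch_key("CUDA") is to_dispatch_key(DispatchKey.CUDA)
```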
I need to be able to call backend fallbacks. I do this by permitting you to call at a dispatch key which doesn't have a kernel for the operator; if the kernel doesn't exist, we check the backend fallback table instead.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84826
Approved by: https://github.com/ezyang
### Description
Add oneDNN graph context manager API to be consistent with other fusers.
NNC and nvFuser can each be used in two ways: 1) via a function to enable/disable them, and 2) via a context manager; the latter is used extensively in libraries like Dynamo. Currently the oneDNN Graph fuser only has the former. To promote its usage, this PR adds a context manager for the oneDNN Graph fuser.
This PR should not affect any performance.
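A hedged sketch of what such a context manager can look like, built from the existing enable/disable function; the torch.jit.onednn_fusion_enabled() query helper is an assumption:
```
from contextlib import contextmanager
import torch

@contextmanager
def onednn_fusion(enabled=True):
    # Save, set, and restore the fuser state around the block.
    prev = torch.jit.onednn_fusion_enabled()  # assumed query helper
    torch.jit.enable_onednn_fusion(enabled)
    try:
        yield
    finally:
        torch.jit.enable_onednn_fusion(prev)

# Usage: scripted models run with the oneDNN Graph fuser inside the block.
with onednn_fusion(True):
    pass  # run scripted inference here
```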
### Testing
A unit test, `test_context_manager`, is added under `test/test_jit_llga_fuser.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82491
Approved by: https://github.com/malfet
Adds support for constant tensor tracking within FakeTensors. Copy-pasta'ing from `proxy_tensor.py` why this is useful:
```
# In some circumstances, we will be tracing in a situation where a tensor
# is *statically* known to be a constant (currently, this only happens if
# you run torch.tensor; deterministic factory functions like torch.arange
# don't get this treatment). When the tensor in question is small, it's
# helpful to do constant propagation in case we call item() (in which
# case we can return the constant value that is known, rather than give
# an error.)
```
This PR only attempts to add support for the tracing scenarios where we run each operation linearly - aot autograd, torchdynamo. It does not yet address how constant tensors should be handled as part of the persistent fx graph. Additionally, it does not yet attempt to de-duplicate or interact with ProxyMode's own constant tensor handling.
Edit: plan is to rely on functionalization for fx graph
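A small eager-mode illustration of the behavior constant tracking preserves under tracing; the tracing plumbing itself is omitted:
```
import torch

# torch.tensor produces a *statically* known constant, so even under
# fake-tensor tracing, constant propagation lets .item() return a real
# value instead of raising a data-dependent error.
threshold = torch.tensor(4)       # small constant tensor
if threshold.item() > 2:          # decidable thanks to constant tracking
    print("took the constant branch")
```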
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84387
Approved by: https://github.com/ezyang
Summary:
After inserting quant-dequant nodes in the graph, we need to:
1. Insert packed param creation and the quantized op.
2. Create a packed_params attribute in the top module. For this we need a
graph that is inlined except for the calculate_qparams method calls. But
those can be inlined too, so perhaps we need to make sure no other
CallMethods exist.
3. Insert SetAttr for the packed param.
4. Insert GetAttr for the packed param.
5. Use the GetAttr output for the quantized op where applicable, e.g.
linear_dynamic.
The above is added to the quantize_<method-name> method created in the
previous step. Once the above steps are done, clone the method into
quantized_<method-name>.
Modify quantize_<method-name>:
1. Remove all outputs from the method.
2. Run DCE.
3. Remove all inputs from the method except self.
Modify quantized_<method-name>:
1. Remove all packed_param SetAttr nodes.
2. Run DCE.
This should result in the removal of all nodes that generate packed params.
Test Plan: To be written
Differential Revision: [D38771416](https://our.internmc.facebook.com/intern/diff/D38771416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83571
Approved by: https://github.com/jerryzh168
Summary:
This diff adds a way to:
- Clone the previously observed method.
- Add calls to the observer's calculate_qparams methods.
- Extract the scale and zero point.
- Use them to insert quant-dequant nodes.
Now for the forward method we have:
- observe_forward
- quantize_forward
observe_forward is used post training to record observer statistics. In the
case of dynamic PTQ this requires running the method just once to
update weight observer statistics.
The quantize_forward method uses the observer statistics to calculate
quantization parameters and applies them via quant-dequant ops.
Subsequent diffs will replace dequant + op with their quantized op
counterparts, and replace quantize ops with the relevant packed params class
where possible.
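A hedged sketch of the two generated methods, written as ordinary Python for exposition; the real pass manipulates TorchScript IR, and the observer wiring here is illustrative:
```
import torch

class LinearModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
        self.weight_observer = torch.ao.quantization.MinMaxObserver()

    # observe_forward: run post training to record statistics.
    def observe_forward(self, x):
        self.weight_observer(self.linear.weight)
        return self.linear(x)

    # quantize_forward: turn recorded statistics into qparams and apply
    # them as a quant-dequant pair on the weight (assumes observe_forward
    # has already been run at least once).
    def quantize_forward(self, x):
        scale, zero_point = self.weight_observer.calculate_qparams()
        w_qdq = torch.fake_quantize_per_tensor_affine(
            self.linear.weight, scale.item(), int(zero_point.item()), 0, 255)
        return torch.nn.functional.linear(x, w_qdq, self.linear.bias)
```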
Test Plan:
To be written
Differential Revision: [D38771419](https://our.internmc.facebook.com/intern/diff/D38771419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83570
Approved by: https://github.com/jerryzh168
Summary:
To support on-device quantization, this diff introduces observer
insertion. Specifically, observers are inserted by adding a new method with
the prefix observe_.
The intent is that, post training, this method will be run to record
statistics.
Test Plan:
test_ondevice_quantization.py
Differential Revision: [D38771417](https://our.internmc.facebook.com/intern/diff/D38771417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83568
Approved by: https://github.com/jerryzh168
This allows you to directly call into the CompositeImplicitAutograd
implementation of an operator, *without* changing any aspects of the
dispatcher state. In particular, you can use this to recursively call
into a decomposition, dispatching back to your tensor subclass/mode
as desired.
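A hedged example of what a direct call can look like; the `.decompose(...)` accessor on an op overload is how I understand this to be exposed, but treat it as an assumption:
```
import torch

# Call the CompositeImplicitAutograd decomposition of an op directly,
# without changing dispatcher state. `.decompose` is assumed to return
# NotImplemented if the op has no such kernel.
a, b = torch.randn(2, 3), torch.randn(3, 4)
out = torch.ops.aten.matmul.default.decompose(a, b)
```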
Hypothetically, we should also make these available in the
decompositions dictionary, but I'm leaving this as future work as
enumerating these decompositions is annoying (as operators are lazily
registered.)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83075
Approved by: https://github.com/albanD
The new namespace `torch.ops.nvprims` is meant specifically for the nvFuser set of primitives. All `impl_nvfuser` attributes are removed from `torch.ops.prims` functions.
`NvfuserPrimsMode()` context manager can be used for automatic rewrite of `torch.ops.prims` calls to `torch.ops.nvprims` when possible.
The previous way to test whether a prim would be executable with nvFuser was to check `impl_nvfuser is not None`; now all functions in the `torch.ops.nvprims` namespace are supposed to have the `impl_nvfuser` attribute, and hence all are executable by nvFuser.
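A hedged usage sketch; the import path is an assumption inferred from the PR description:
```
import torch
from torch._prims.context import NvfuserPrimsMode  # import path assumed

x = torch.randn(4)
with NvfuserPrimsMode():
    # Inside the mode, calls to torch.ops.prims.* are rewritten to the
    # equivalent torch.ops.nvprims.* when an nvFuser implementation exists.
    y = torch.ops.prims.sin(x)
```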
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82155
Approved by: https://github.com/jjsjann123, https://github.com/ngimel
Done via
```
git grep -l 'SymbolicIntNode' | xargs sed -i 's/SymbolicIntNode/SymIntNodeImpl/g'
```
Reasoning for the change:
* Sym is shorter than Symbolic, and consistent with SymInt
* You usually will deal in shared_ptr<...>, so we're going to
reserve the shorter name (SymIntNode) for the shared pointer.
But I don't want to update the Python name, so afterwards I ran
```
git grep -l _C.SymIntNodeImpl | xargs sed -i 's/_C.SymIntNodeImpl/_C.SymIntNode/'
```
and manually fixed up the binding code
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82350
Approved by: https://github.com/Krovatkin
- toTypeInferredIValue will throw an error when given an empty container because it isn't able to tell what kind of container it is. Thus empty containers are ignored in addArgumentValue/s, overlaps, and is_alias_of.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81786
Approved by: https://github.com/davidberard98
- Modified the is_mutable(size_t index) overload to become is_mutable(const SchemaArgument& argument), since one might want to check the mutability of either an input or an output argument.
- Refactored all calls to the function to use this new overload
- Tested through is_mutable() tests in test_schema_info.cpp
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81784
Approved by: https://github.com/davidberard98
- Modified the is_mutable Python binding to accept a string instead of a string_view for better Python compatibility.
- Modified the argument-value-adding Python bindings to handle the input/self edge case caused by inconsistencies in how the first argument is named.
- Modified _is_alias_of and created the _contains_alias_of Python binding to accurately determine whether values alias, or contain an alias.
- Fixed the is_mutable implementation to cover all ops that have mutable optional arguments. (These are all the ops that have the optional arguments 'running_mean' and 'running_var', along with either 'train', 'training', or 'use_input_stats'.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81782
Approved by: https://github.com/davidberard98
Fix the torch.save _open_zipfile_writer optimization that uses a C++ stream when `f` is an os.PathLike.
This fastpath requires that we don't `open()` in Python if possible, so don't do it unconditionally.
Fix the PyTorchStreamWriter construction binding that takes a buffer object:
use py::memoryview instead of py::bytes, as the former doesn't copy the data.
Validated with a trivial benchmark that calls torch.save in a loop 20x with a 10M-element float32 tensor,
either on CPU or CUDA, saving to /dev/null.
Two variants were tried, 'str' and 'open':
In 'str' we pass the string "/dev/null" to torch.save.
In 'open' we pass `open("/dev/null", "wb")` to torch.save.
Timing in seconds.
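A reconstruction of that benchmark, hedged: the original script is not part of this message, so this simply follows its description.
```
import time
import torch

def bench(make_dest, device):
    # 10M-element float32 tensor, saved 20x to /dev/null.
    t = torch.randn(10_000_000, dtype=torch.float32, device=device)
    start = time.time()
    for _ in range(20):
        torch.save(t, make_dest())
    return time.time() - start

devices = ["cpu"] + (["cuda"] if torch.cuda.is_available() else [])
for device in devices:
    print(f"str-{device}  :: {bench(lambda: '/dev/null', device):.3f}")
    print(f"open-{device} :: {bench(lambda: open('/dev/null', 'wb'), device):.3f}")
```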
Before this patch:
str-cpu :: 0.757
open-cpu :: 0.757
str-cuda :: 1.367
open-cuda :: 1.366
After this patch:
str-cpu :: 0.256
open-cpu :: 0.251
str-cuda :: 0.896
open-cuda :: 0.834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80404
Approved by: https://github.com/jamesr66a