Summary:
tl;dr: rewrites the FX graph mode quantization observer insertion to be easier to understand and extend.
The key conceptual difference from before is:
* before: for each node, observers are always inserted at the output of the current node, even if they are only needed by the next node. This is hard to reason about.
* after: for each node, observers are inserted at the inputs (if needed, as determined by the dtype of the argument and the dtype of the current node) and at the output (if needed for the type of pattern and qconfig). No knowledge of future nodes is needed to insert observers for the current node (see the sketch below).
This allows us to significantly simplify various things:
* all new observers needed for a node are inserted together. This makes it easier to understand and debug things. We add an invariant that node X will never change any observers inserted by any preceding or subsequent node, so to debug an issue the user can just understand what is happening for node X, without having to understand what happens before or after it.
* all of the state tracking in activation_post_process_map and activation_post_process_indices is removed; instead, observers are looked up by graph traversal
* since there is no longer a need for overlapping graph passes which mutate each other's intermediate state, it is easier to understand what the rules are for inserting observers, and to create new rules in the future.
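A minimal sketch of the per-node rule described above (not the actual implementation; `needs_input_observer` and `needs_output_observer` are hypothetical placeholders for the dtype/pattern/qconfig checks the prepare pass performs):
```
import torch
import torch.fx as fx

def needs_input_observer(node: fx.Node, arg: fx.Node) -> bool:
    # hypothetical placeholder: compare the dtype produced by `arg`
    # with the dtype expected by `node`
    return node.op == "call_function"

def needs_output_observer(node: fx.Node) -> bool:
    # hypothetical placeholder: decided by the matched pattern and its qconfig
    return node.op == "call_function"

def plan_observers(gm: fx.GraphModule):
    plan = []
    for node in gm.graph.nodes:
        # all decisions are local to `node`: look at its inputs, then its output
        for arg in node.all_input_nodes:
            if needs_input_observer(node, arg):
                plan.append(("input_observer", node.name, arg.name))
        if needs_output_observer(node):
            plan.append(("output_observer", node.name))
    return plan

class M(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x + 1)

print(plan_observers(fx.symbolic_trace(M())))
```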
Test Plan:
```
# all OSS tests pass
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Differential Revision: D28241864
Reviewed By: jerryzh168
Pulled By: vkuzo
fbshipit-source-id: 950d58972d26362808564cc0a2dfb30413a3734d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57470
Removes the earlier hack where patterns originally matched
to BinaryOpQuantizeHandler were switched to CopyHandler. After this PR,
each pattern can only be matched to one type of QuantizeHandler, or
to nothing.
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D28152909
fbshipit-source-id: afc285e770bd7eb0518c90e3ee4874c421e78bbc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57393
Moves the decision about whether the output should be marked as
quantized based on the inputs to live
on the qhandler object. This allows us to remove
FixedQParamsOpQuantizeHandler from quantize.py, further reducing
the coupling between handler objects and the quantization pass.
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: astaff
Differential Revision: D28132414
fbshipit-source-id: 5c28524b47c00f618d3a38657376abae9e6ffe7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57388
It's a bit confusing to have this be a decorator. It's simpler to
just expose it as a function on qhandler.
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D28129411
fbshipit-source-id: f7316f285e8546c67e8d8cf753462b2c2abb2636
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57377
Moves the logic which determines
1. whether a pattern instance's output should be observed
2. whether a pattern instance's output should be marked as observed based on its inputs
3. whether to override the activation specified in the qconfig
from `quantize.py` to `quantization_patterns.py`. This makes
the code easier to read and reduces the coupling between `Quantizer`
and `QuantizeHandler` instances.
Note: there are some further cleanups which would be good after this one
- leaving those for future PRs.
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D28126896
fbshipit-source-id: 94c80a9c7307452783348d65b402acc84983e3f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54924
Previously we were producing torch.ops.quantize.cat, which takes the inputs, dequantizes them,
and requantizes them with new qparams. This PR changes that to produce torch.cat directly; torch.cat
will assume all inputs share the same qparams, and it will produce a quantized Tensor with
the same qparams as its inputs (because the previous PR makes sure all inputs and the output of cat share
the same observer/fakequant instance).
Using torch.cat is expected to be more efficient since it does not introduce extra quant/dequant ops.
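An illustrative sketch of the end-to-end flow this changes, using the FX quantization API of this era (signatures have since changed); the exact contents of the converted graph may differ by version:
```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class CatModel(torch.nn.Module):
    def forward(self, x, y):
        return torch.cat([x, y], dim=1)

m = CatModel().eval()
prepared = prepare_fx(m, {"": get_default_qconfig("fbgemm")})
# calibration: the inputs to cat share one observer, so they end up with the same qparams
prepared(torch.randn(1, 2, 4, 4), torch.randn(1, 2, 4, 4))
quantized = convert_fx(prepared)
# after this PR the converted graph is expected to call torch.cat directly
print(quantized.graph)
```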
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_cat
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27416528
fbshipit-source-id: 896c280abec2903c29d597c655729666583ff0dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56004
Added reference pattern support for GELU, softmax, and bmm for int dtypes. For GELU and softmax, this consisted of adding reference patterns to the default node handler for int dtypes. Note that the GELU and softmax patterns are not registered since they do not have a proper quantized kernel, which means they would either add unnecessary dequant and quant ops to the network or simply error. This can be circumvented with custom qconfig usage, as in test_gelu_reference.
bmm was added within binary ops, along with some significant changes to how that code is structured. Theoretically, the reference pattern used for bmm could be applied to other dtypes. This was not enabled because of issues relating to Line 1323 in quantize.py. In essence, the prepare step does not know whether an op will use a reference pattern or not, so for ops that are supported with one dtype via the reference pattern and another dtype normally, this has the potential to cause issues. This is difficult to get around without the is_reference flag being available in the prepare step or the discussed changes around separating
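For context, qconfig_dict supports assigning qconfigs per op type via the `object_type` key; a minimal sketch of that mechanism is below (the specific custom qconfig that test_gelu_reference uses to enable the GELU reference pattern is not reproduced here, and the API signatures are those of this era):
```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.gelu = torch.nn.GELU()

    def forward(self, x):
        return self.gelu(self.linear(x))

m = M().eval()
qconfig_dict = {
    "": get_default_qconfig("fbgemm"),
    # per-op-type control: keep GELU in fp32 since there is no quantized kernel
    "object_type": [(torch.nn.GELU, None)],
}
prepared = prepare_fx(m, qconfig_dict)
prepared(torch.randn(2, 8))  # calibrate
quantized = convert_fx(prepared)
```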
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_gelu_reference
python test/test_quantization.py TestQuantizeFxOps.test_gelu_normal
python test/test_quantization.py TestQuantizeFxOps.test_softmax_reference
python test/test_quantization.py TestQuantizeFxOps.test_softmax_normal
python test/test_quantization.py TestQuantizeFxOps.test_silu_reference
python test/test_quantization.py TestQuantizeFxOps.test_bmm_int_reference
python test/test_quantization.py TestQuantizeFxOps
python test/test_quantization.py TestFuseFx
python test/test_quantization.py TestQuantizeFx
python test/test_quantization.py TestQuantizeFxModels
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D27818340
fbshipit-source-id: de65be0797035463cd2d1b0e4677d1a87f69143c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55311
Before this PR, `F.conv1d` was matched by FX graph mode quant patterns
but the prepacking was happening inline. There was also a bug with
argument type mismatch.
This PR fixes both issues and adds a test. Thanks jerryzh168 for the
code tip.
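A minimal sketch of the functional conv1d flow this PR fixes, using the FX quantization API of this era (module and tensor shapes are illustrative):
```
import torch
import torch.nn.functional as F
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class FunctionalConv1d(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # weight shape: (out_channels, in_channels, kernel_size)
        self.weight = torch.nn.Parameter(torch.randn(4, 3, 3))
        self.bias = torch.nn.Parameter(torch.zeros(4))

    def forward(self, x):
        return F.conv1d(x, self.weight, self.bias)

m = FunctionalConv1d().eval()
prepared = prepare_fx(m, {"": get_default_qconfig("fbgemm")})
prepared(torch.randn(1, 3, 8))  # calibrate
quantized = convert_fx(prepared)  # prepacking now handled by the conv pattern, not inline
```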
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_functional_not_reference
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D27575422
fbshipit-source-id: 42301e23cb101a9e64e46800813bc771317e233e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55727
The number of dequantize ops for the fp16 reference pattern was incorrect before; this
PR fixes the problem.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27713390
fbshipit-source-id: 72b8d4cda0bdcea74abe27a76f918d1b47819b01
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55429
Previously we special-cased the copy operator in the normal observer insertion code; this PR splits the
special-case logic into a separate function and keeps the rest of the code clean.
Test Plan:
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27609972
fbshipit-source-id: 378f6aa70f18c0b477b62b6efe236648748aae7e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55388
Temporarily revert D27314678 (c57541ce06); it appears to cause a perf regression that makes quantization of some models take too long to complete tests.
Reviewed By: houseroad
Differential Revision: D27583809
fbshipit-source-id: e9c088ccbfd3bfb3a1d4c7eafee3eca29ee7717b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54644
Previously we special-cased the copy operator in the normal observer insertion code; this PR splits the
special-case logic into a separate function and keeps the rest of the code clean.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D27314678
fbshipit-source-id: d36870ceb3717bc01eaeaa6f3f1532ad562cbaf1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53586
Previously one value could only be quantized to one dtype. This PR adds support for quantizing one value
in the fx graph with multiple dtypes, e.g. first quantizing to int8 and then to float16.
We might do some follow-up PRs to clean up the hacks and refactor the code.
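A hedged sketch in the same spirit, using the `module_name` key of qconfig_dict so the same value feeds an int8 consumer and an fp16 consumer (the canonical example is test_multiple_qconfigs_single_value; the exact supported combinations and API signatures are those of this era):
```
import torch
from torch.quantization import default_qconfig, float16_dynamic_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class TwoBranches(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(4, 4)
        self.linear2 = torch.nn.Linear(4, 4)

    def forward(self, x):
        # the same value `x` is consumed under two different qconfigs
        return self.linear1(x) + self.linear2(x)

m = TwoBranches().eval()
qconfig_dict = {
    "module_name": [
        ("linear1", default_qconfig),          # int8 static
        ("linear2", float16_dynamic_qconfig),  # fp16 dynamic
    ],
}
prepared = prepare_fx(m, qconfig_dict)
prepared(torch.randn(2, 4))  # calibrate
quantized = convert_fx(prepared)
```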
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_multiple_qconfigs_single_value
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26912676
fbshipit-source-id: ae3653fd67f05870a3a9e808f491871826c555d5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53614
Ensures that every subclass of `QuantizeHandler` has a clear name. This
prevents ambiguous names like `Cat`, which looks like a module but is
really a quantize handler.
Test Plan:
```
python test/test_quantization.py TestQuantizeFx
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26914784
fbshipit-source-id: 6dca7e27975c09f422f8e36f1d2b709bf3eaaadf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53196
Before this PR, code patterns like this did not work:
```
x = some_quant_layer(x)
x = torch.stack([x, ...])
x = torch.sum(x, ...)
```
The reason this did not work is that `torch.sum` is treated as
"quantized" because of the newly added fp16 support, even though it is
not actually "quantized" for models where fp16 is not used. We may
need to adjust the concept of "quantized vs non-quantized" into a
"dtype" for the longer term fix.
The current PR is a hacky fix to unblock. We need to clean things
up before this is landable
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_quant_sum
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26783960
fbshipit-source-id: 3be7c3c1eaa2b8fcb99a105e1b0004c9ffd3a1c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53120
Currently there is a pattern which is not handled correctly by
FX graph mode quantization:
```
def forward(self, x):
    ndim = x.ndim
    # or add, mul, div, etc
    x = torch.sub(x, ndim)
    return x
```
The reason this does not work is as follows:
1. x.ndim becomes a getattr node
2. the real world type of x.ndim is an integer, but this is not known from the graph (yet)
3. binary ops such as `torch.sub` require quantization of inputs
4. the framework inserts an observer to observe the output of `ndim`
5. the observer fails because `ndim` is not a Tensor
For now, we hack in a band-aid to unblock some teams; none of this is meant
to land as-is. We will have to think of a better fix which is landable (TBD).
Test Plan:
```
python test/test_quantization.py TestQuantizeFx.test_getattr_with_nontensor_result
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26756180
fbshipit-source-id: c0e498766b22c23df74fbb5aaeaa237c4c944263
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53585
Previously fp16_static CopyNode would be marked as unquantized because of
an incorrect condition check of whether a Node is statically quantized or not.
This PR fixes that.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26912677
fbshipit-source-id: 4ddb538714c5ba2db28430de5e1cf2931baf1993
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50002
The last commit adds tests for 3d conv with the `SubModelFusion` and `SubModelWithoutFusion` classes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50003
Reviewed By: mrshenli
Differential Revision: D26325953
Pulled By: jerryzh168
fbshipit-source-id: 7406dd2721c0c4df477044d1b54a6c5e128a9034
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52651
Merging them for easier extensions to fp16 and more binary ops
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26600118
fbshipit-source-id: a1816e593cf3065afe87d2e6e44cdace13bf6aeb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52534
Currently linear_dynamic_fp16 has a signature that's tied to fbgemm/qnnpack.
We'll need to produce a pattern equivalent to linear_dynamic_fp16 to support extensions
to other backends
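For reference, this op is what FX graph mode produces when a Linear is quantized with the fp16 dynamic qconfig; a minimal sketch of that flow, using the API signatures of this era (the exact lowered form in the converted model is backend-dependent):
```
import torch
from torch.quantization import float16_dynamic_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.linear(x)

m = M().eval()
prepared = prepare_fx(m, {"": float16_dynamic_qconfig})
quantized = convert_fx(prepared)
# the converted model currently uses the fbgemm/qnnpack-backed dynamic fp16 linear
print(quantized.graph)
```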
Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_linear_dynamic_fp16
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26557726
fbshipit-source-id: 270c9f781f73c79416a092b7831294cabca84b0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52179
Rename debug to reference. We'll use this to produce a reference quantized model
that can be used as a common interface between PyTorch quantized models and backends.
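Illustrative usage after the rename, assuming the flag is exposed as `is_reference` on `convert_fx` (before this change it was the `debug` flag); API signatures are those of this era:
```
import copy
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.linear(x)

m = M().eval()
prepared = prepare_fx(m, {"": get_default_qconfig("fbgemm")})
prepared(torch.randn(2, 4))  # calibrate

# backend-specific quantized model
quantized = convert_fx(copy.deepcopy(prepared))
# reference quantized model: the common interface between PyTorch and backends
reference = convert_fx(copy.deepcopy(prepared), is_reference=True)
```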
Test Plan:
python test/test_quantization.py TestQuantizeFx
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D26424656
fbshipit-source-id: a0299b023f6ba7d98f5750724c517b0ecb987b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52413
TODO: We'll need to add this guard for other ops as well
(Note: this ignores all push blocking failures!)
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_mul_add_fp16_config
Imported from OSS
Reviewed By: supriyar
Differential Revision: D26503348
fbshipit-source-id: 5aaba518742a516cc3521fd5f23f1a264d2973e2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51259
Store the FQN of the module that is using the packed weights (the quantized op).
In the case of fusion, we update the scope mapping to store the module path of the fused node.
Test Plan:
python test/test_quantization.py test_packed_weight_fused_op
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26117964
fbshipit-source-id: 9d929997baafb1c91063dd9786a451b0040ae461
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51171
Following up on the previous PR, this PR registers the scale and zero_point for quantize_per_tensor
as buffers in the module.
Currently the dtype is still stored as an attribute (not registered as a buffer) since we can only register tensor types.
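A minimal sketch of the mechanism (not the generated code): registering the qparams as buffers makes them part of state_dict and updatable, while the dtype stays a plain attribute:
```
import torch

class QuantizeInput(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # buffers: show up in state_dict and can be updated after tracing
        self.register_buffer("scale_0", torch.tensor(0.05))
        self.register_buffer("zero_point_0", torch.tensor(128, dtype=torch.int64))
        # dtype is not a Tensor, so it stays a plain attribute
        self.dtype_0 = torch.quint8

    def forward(self, x):
        return torch.quantize_per_tensor(
            x, float(self.scale_0), int(self.zero_point_0), self.dtype_0
        )

m = QuantizeInput()
print(list(m.state_dict().keys()))  # ['scale_0', 'zero_point_0']
print(m(torch.randn(2, 2)))
```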
Test Plan:
python test/test_quantization.py test_qparams_buffers
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26092964
fbshipit-source-id: a54d914db7863402f2b5a3ba2c8ce8b27c18b47b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51166
Currently scale and zero_point values are stored as constant values in the graph.
This prevents these values from being updated in the graph and also does not enable saving
these values to state_dict.
After this PR we store scale/zero_point values for quantized ops as buffers in the root module
and create get_attr nodes for them in the graph.
We also use the FQN of the module where the quantized ops are present to name these attributes so
that they can be uniquely identified and mapped to quantized ops.
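A hedged FX sketch of the mechanism described above (hand-built graph, not the actual pass; the buffer names follow the FQN-based naming idea, and the `.item()` calls are only there to keep the sketch runnable across versions):
```
import torch
import torch.fx as fx

root = torch.nn.Module()
# qparams live as buffers on the root module, named after the owning module's FQN
root.register_buffer("layer1_scale_0", torch.tensor(0.1))
root.register_buffer("layer1_zero_point_0", torch.tensor(0, dtype=torch.int64))

graph = fx.Graph()
x = graph.placeholder("x")
scale = graph.get_attr("layer1_scale_0")       # get_attr instead of a baked-in constant
zp = graph.get_attr("layer1_zero_point_0")
scale_val = graph.call_method("item", (scale,))
zp_val = graph.call_method("item", (zp,))
q = graph.call_function(torch.quantize_per_tensor, (x, scale_val, zp_val, torch.quint8))
graph.output(graph.call_method("dequantize", (q,)))

gm = fx.GraphModule(root, graph)
print(gm.code)
print(gm(torch.randn(2, 2)))
```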
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qparams_buffers
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D26092965
fbshipit-source-id: b549b2d3dccb45c5d38415ce95a09c26f5bd590b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50058
This PR adds the support for {input/output}_quantized_idxs for standalone module.
If input_quantized_idxs = [] and output_quantized_idxs = [], the standalone module will expect float
input and produce float output, and it will quantize the input and dequantize the output internally.
If input_quantized_idxs = [0] and output_quantized_idxs = [0], the standalone module will expect quantized
input and produce quantized output; the input will be quantized in the parent module, and the output will be dequantized
in the parent module as well. This is similar to current quantized modules like nn.quantized.Conv2d.
For more details, please see the test case
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_standalone_module
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D25768910
fbshipit-source-id: 96c21a3456cf192c8f1400afa4e86273ee69197b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49717
Quantization of `ConvTranspose{n}d` is supported in Eager mode. This PR
adds the support for FX graph mode.
Note: this currently only works in `qnnpack` because per-channel weights
are not supported by quantized conv transpose. In a future PR we should throw
an error when someone tries to quantize a ConvTranspose model with per-channel
weight observers until this is fixed.
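An illustrative sketch, using the API signatures of this era (the `qnnpack` engine and its per-tensor weight observer are required, per the note above):
```
import torch
from torch.quantization import get_default_qconfig
from torch.quantization.quantize_fx import prepare_fx, convert_fx

# quantized conv transpose does not support per-channel weights,
# so use the qnnpack engine / qconfig (per-tensor weight observer)
torch.backends.quantized.engine = "qnnpack"

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.deconv = torch.nn.ConvTranspose2d(3, 3, kernel_size=3)

    def forward(self, x):
        return self.deconv(x)

m = M().eval()
prepared = prepare_fx(m, {"": get_default_qconfig("qnnpack")})
prepared(torch.randn(1, 3, 8, 8))  # calibrate
quantized = convert_fx(prepared)
```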
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_conv_transpose_1d
python test/test_quantization.py TestQuantizeFxOps.test_conv_transpose_2d
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D25674636
fbshipit-source-id: b6948156123ed55db77e6337bea10db956215ae6