**Summary**
Per the discussion in https://github.com/pytorch/pytorch/pull/123444, the `decomposed quant/dequant` patterns changed after https://github.com/pytorch/pytorch/pull/123445. By moving the optimization of `decomposed quant/dequant` from the Inductor decomposition phase into the lowering phase, we avoid depending on those pattern changes. This way, we can:
- Avoid the pattern matcher failure introduced in https://github.com/pytorch/pytorch/pull/123445
- Make the quantization pattern clearer in the pattern matcher phase, since the `quant/dequant` nodes have not yet been decomposed (see the sketch below).
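To illustrate the second point, a minimal sketch; the decomposition shown is only one possibility, not the exact op sequence Inductor produces:
```python
import torch

# One possible decomposition of dequantize_per_tensor (illustrative only): after
# decomposition, the pattern matcher has to match this whole chain of primitive
# ops, and any change to the decomposition breaks the registered pattern.
def dequantize_per_tensor_decomposed(x_int8: torch.Tensor, scale: float, zero_point: int) -> torch.Tensor:
    return (x_int8.to(torch.float32) - zero_point) * scale

# Before lowering, the same computation is still a single node such as
# torch.ops.quantized_decomposed.dequantize_per_tensor, which is a stable match target.
```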
**Changes in this PR**
- Move optimization of `decomposed quant/dequant` from inductor decomposition into lowering phase.
- Make corresponding changes in the quantization pattern matcher to ensure no BC breakage.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_q
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124041
Approved by: https://github.com/peterbell10, https://github.com/jgong5
Summary: This commit fixes the pattern matching for conv-bn
during QAT fusion where both weight and bias are quantized per
channel. Previously this failed because weights and biases used
the same example kwargs for their scales and zero points,
causing these qparams to be tied during pattern matching.
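A minimal sketch of the idea behind the fix; the names and values below are illustrative, not the actual example kwargs used by the matcher:
```python
import torch

# The example scale/zero_point tensors fed to the pattern for the weight and for
# the bias must be distinct objects (and ideally distinct values); reusing one
# pair for both caused the weight and bias qparams to be tied during matching.
weight_scale = torch.tensor([1.0])
weight_zero_point = torch.tensor([0], dtype=torch.int64)
bias_scale = torch.tensor([2.0])
bias_zero_point = torch.tensor([1], dtype=torch.int64)
```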
Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_qat_conv_bn_per_channel_weight_bias
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_qat_conv_bn_per_channel_weight_bias
Reviewers: jerryzh168, angelayi
Subscribers: jerryzh168, angelayi, supriyar
Differential Revision: [D56740694](https://our.internmc.facebook.com/intern/diff/D56740694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125208
Approved by: https://github.com/angelayi
Summary: Before in `move_exported_model_to_train/eval`, we only
switched the CPU versions of the batch norm op. This commit adds
support for the CUDA versions of the op too. Note that this fix
is temporary; we won't have to differentiate between these two
cases once we have batch norm consolidation.
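For context, a usage sketch of the helpers this commit touches; the import path is an assumption:
```python
import torch
from torch.ao.quantization import (
    move_exported_model_to_eval,
    move_exported_model_to_train,
)

def set_bn_mode(exported_model: torch.fx.GraphModule, train: bool) -> None:
    # Exported models cannot simply flip module.training; use the dedicated helpers,
    # which after this change also rewrite the CUDA batch norm op.
    if train:
        move_exported_model_to_train(exported_model)
    else:
        move_exported_model_to_eval(exported_model)
```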
Test Plan:
python test/test_quantization.py -k test_move_exported_model_bn
Reviewers: jerryzh168
Subscribers: jerryzh168, leslie-fang-intel, supriyar
Differential Revision: [D56070054](https://our.internmc.facebook.com/intern/diff/D56070054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123957
Approved by: https://github.com/jerryzh168
Summary: Avoid situation where the graph traversal finds a matmul node with a `get_attr` as its `args[0]`, and incorrectly propagate the `get_attr`'s meta to everything downstream.
Test Plan: CI
Differential Revision: D56219120
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124415
Approved by: https://github.com/jerryzh168
This commit enables float8_e5m2 and float8_e4m3fn dtypes in fx quantization and PT2E.
Motivation for using fp8 quantization instead of int8:
- inference works better when run in the same datatype the model was trained with,
- fp8 can handle outliers better, which is one of the problems in LLMs activations.
The numerical recipe we target is fp8 inference:
- bgemms/gemms running in float8_e4m3fn,
- Per-Tensor-Quantization/Scaling,
- amax observer for measurement with input_backoff and weight_backoff (see the sketch below).
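A minimal per-tensor scaling sketch consistent with the recipe above, assuming an amax-based scale; the `backoff` factor and helper name are illustrative, not part of this PR:
```python
import torch

def scale_to_fp8(x: torch.Tensor, backoff: float = 1.0, dtype: torch.dtype = torch.float8_e4m3fn):
    # Per-tensor scale derived from the observed amax; backoff > 1 leaves headroom
    # for unseen outliers (the exact backoff semantics are an assumption here).
    fp8_max = torch.finfo(dtype).max
    scale = (x.abs().amax() * backoff) / fp8_max
    return (x / scale).to(dtype), scale

x_fp8, scale = scale_to_fp8(torch.randn(16, 16))
x_dequant = x_fp8.to(torch.float32) * scale  # approximate reconstruction
```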
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123161
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Adds the following to allowed globals for the `weights_only` unpickler
- [x] `torch._utils._rebuild_qtensor` and qtensor-related types (see the example below)
- [x] `torch._utils._rebuild_parameter_with_state` (used when deserializing a parameter that has user-defined attributes like `Param.foo`)
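For example, a quantized tensor now round-trips through `weights_only` loading:
```python
import io
import torch

qt = torch.quantize_per_tensor(torch.randn(4), scale=0.1, zero_point=0, dtype=torch.qint8)
buf = io.BytesIO()
torch.save(qt, buf)
buf.seek(0)
# Previously this raised because torch._utils._rebuild_qtensor was not an allowed global.
loaded = torch.load(buf, weights_only=True)
```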
The remaining rebuild functions that have not been allowlisted are
- [x] `torch._utils._rebuild_wrapper_subclass` (allowlisted in above PR)
- [ ] `torch._utils._rebuild_device_tensor_from_numpy`
- [ ] `torch._utils._rebuild_xla_tensor` (legacy)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124330
Approved by: https://github.com/albanD
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.
Below is a before-and-after table showing the execution time of `ruff lint` and `ruff format` in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
Summary: https://github.com/pytorch/pytorch/pull/123452 added
backward support to this op by turning it into
CompositeImplicitAutograd, which meant it gets decomposed during
export/compile. However, this is not desirable behavior for the
PTQ case when we try to lower the model. This commit enables
QAT without breaking PTQ by refactoring the impl into a separate
op that does have backward support.
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai, zou3519
Subscribers: jerryzh168, digantdesai, zou3519, supriyar
Differential Revision: [D56192116](https://our.internmc.facebook.com/intern/diff/D56192116)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124178
Approved by: https://github.com/digantdesai
Summary: When running the backward for this op, we get the error:
```
RuntimeError: derivative for aten::aminmax is not implemented
```
This commit replaces this call with separate amin and amax
calls, which do have implemented derivatives.
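The change in a nutshell (a standalone illustration, not the op's actual code):
```python
import torch

x = torch.randn(8, 16, requires_grad=True)
# min_val, max_val = torch.aminmax(x, dim=-1)  # backward: "derivative for aten::aminmax is not implemented"
min_val = torch.amin(x, dim=-1, keepdim=True)
max_val = torch.amax(x, dim=-1, keepdim=True)
(min_val.sum() + max_val.sum()).backward()     # amin/amax have derivatives, so this works
```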
Test Plan:
python test/test_quantization.py -k test_decomposed_choose_qparams_per_token_asymmetric_backward
Reviewers: jerryzh168, digantdesai
Subscribers: jerryzh168, digantdesai, supriyar
Differential Revision: [D55805170](https://our.internmc.facebook.com/intern/diff/D55805170)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123452
Approved by: https://github.com/digantdesai, https://github.com/jerryzh168
**Summary**
Add `matmul` to the quantization recipes. Note that it is not a general recipe; it is tailored to meet accuracy criteria for specific models. The `matmul` recipe is disabled by default.
**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_attention_block
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122776
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
ghstack dependencies: #122775
**Summary**
Default recipes are enabled in `X86InductorQuantizer`, and requests have come in to customize recipes on top of these defaults:
- Avoid annotation propagation and restrict annotation to `conv`/`linear` only.
- Add `matmul` to the quantization recipes, noting that it's not a general recipe but is tailored to meet accuracy criteria for specific models.
To meet these requests, this PR introduces the interfaces `set_function_type_qconfig` and `set_module_type_qconfig`:
- `set_function_type_qconfig` accepts functional inputs such as `torch.nn.functional.linear` or `torch.matmul`; `set_module_type_qconfig` accepts `nn.Module` inputs such as `torch.nn.Conv2d`.
- To disable the recipe for an operator, the user can pass `None`: `quantizer.set_function_type_qconfig(op, None)`.
- To modify or extend the default recipe for an operator, the user can pass a custom config: `quantizer.set_function_type_qconfig(op, config)` (see the usage sketch below).
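A usage sketch of the new interface; the import path and the default-config helper are assumptions based on the current `X86InductorQuantizer` layout:
```python
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer

quantizer = X86InductorQuantizer()
quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
# Enable the (non-default) matmul recipe for the functional op.
quantizer.set_function_type_qconfig(
    torch.matmul, xiq.get_default_x86_inductor_quantization_config()
)
# Disable the default recipe for nn.Conv2d modules.
quantizer.set_module_type_qconfig(torch.nn.Conv2d, None)
```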
**Test Plan**
```
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_conv2d_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_linear_recipe
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_filter_maxpool2d_recipe
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122775
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
**Note**: This is a reopen of https://github.com/pytorch/pytorch/pull/122288, which was merged by `ghstack land` to its base (not main) by mistake.
**Description**
Add the qlinear_binary op for the X86Inductor backend of quantization PT2E. It only supports `add` and `add_relu` for now.
It uses post-op sum if the extra input has the same dtype as the output; otherwise, it uses binary add.
```
+-------------------+--------------+---------------+
| Extra input dtype | Output dtype | Post op |
+-------------------+--------------+---------------+
| Fp32/bf16 | fp32/bf16 | sum or add* |
+-------------------+--------------+---------------+
| Fp32/bf16 | int8 | add |
+-------------------+--------------+---------------+
| int8 | fp32/bf16 | not supported |
+-------------------+--------------+---------------+
| int8 | int8 | sum |
+-------------------+--------------+---------------+
*Use sum if extra input and output have the same dtype; otherwise use add.
```
**Test Plan**
python test_quantization.py -k test_qlinear_add_pt2e
python test_quantization.py -k test_qlinear_add_relu_pt2e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123144
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168
Summary:
Even with the changes in D55347133, it is still possible to OOM in the histogram observer, because the size of the allocated tensor also depends on *downsample_rate*.
For example, I still see OOM due to an attempt to allocate a 10 GB+ histogram tensor in a multi-task model.
To better handle the OOM issue, we guard the allocation with a *try-catch* clause.
Empirically, we cap the size of a single histogram tensor at 1 GB (see the sketch below).
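A hedged sketch of the guard described above; the names, the size check, and the fallback behavior are illustrative rather than the observer's actual code:
```python
import torch

MAX_HISTOGRAM_BYTES = 2**30  # empirical 1 GB cap

def _try_allocate_histogram(numel: int, dtype: torch.dtype = torch.float32):
    nbytes = numel * torch.finfo(dtype).bits // 8
    if nbytes > MAX_HISTOGRAM_BYTES:
        return None  # skip re-binning this step instead of allocating a huge tensor
    try:
        return torch.zeros(numel, dtype=dtype)
    except RuntimeError:  # out-of-memory surfaces as a RuntimeError
        return None
```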
Test Plan: Test the change for Multi-Task model (depth + segmentation)
Differential Revision: D55567292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123309
Approved by: https://github.com/jerryzh168
Summary: After we migrate to torch.export, we won't see ops like add_ and mul_ due to functionalization. We are rolling out pre-dispatch export, so for now we just skip those mutating ops in tests.
Test Plan: buck run mode/opt caffe2/test/quantization:test_quantization
Reviewed By: tugsbayasgalan
Differential Revision: D55442019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122863
Approved by: https://github.com/clee2000
Summary:
We add a new op, quantized.linear_unpacked_dynamic_fp16, which is essentially linear_dynamic_fp16 with a different (unpacked) weight/bias format.
This op packs on the fly for each call, taking a standard at::Tensor weight and bias.
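A hypothetical usage sketch; the argument order (input, weight, bias) and the dtypes below are assumptions, not taken from the PR:
```python
import torch

x = torch.randn(2, 4)
weight = torch.randn(3, 4)  # plain, unpacked at::Tensor weight
bias = torch.randn(3)
# Packing to the fp16 format happens on the fly inside the op on every call.
out = torch.ops.quantized.linear_unpacked_dynamic_fp16(x, weight, bias)
```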
Test Plan:
Included in commit.
test_quantized_op::test_unpacked_qlinear_dynamic_fp16
Differential Revision: [D55433203](https://our.internmc.facebook.com/intern/diff/D55433203)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122762
Approved by: https://github.com/jerryzh168
Summary:
Right now we don't insert additional observers (i.e., we share observers) if qspec.dtype and qspec.is_dynamic match exactly.
Since fixed qparams quantization spec and derived quantization spec do not currently have an is_dynamic field, observer sharing does not happen between them and quantization spec. In this PR we fix the issue by
adding is_dynamic to all quantization specs (see the illustration below).
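For illustration, a fixed-qparams spec with the new field spelled out; the import path is assumed and the values are the usual sigmoid-style fixed qparams:
```python
import torch
from torch.ao.quantization.quantizer import FixedQParamsQuantizationSpec

spec = FixedQParamsQuantizationSpec(
    dtype=torch.uint8,
    scale=1.0 / 256.0,
    zero_point=0,
    quant_min=0,
    quant_max=255,
    qscheme=torch.per_tensor_affine,
    is_dynamic=False,  # newly added, so observer sharing can compare all specs uniformly
)
```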
Note: SharedQuantizationSpec should probably be its own type in the future
TODO later:
(1). move all these fields (dtype, is_dynamic, quant_min, quant_max etc.) to QuantizationSpecBase,
(2). make SharedQuantizationSpec a separate type
(3). add quant_min/quant_max in observer sharing checking in pt2e/prepare.py
Test Plan:
python test/test_quantization.py -k test_fixed_qparams_qspec_observer_dedup
Differential Revision: [D55396546](https://our.internmc.facebook.com/intern/diff/D55396546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122734
Approved by: https://github.com/andrewor14
Summary:
also added some utils in xnnpack_quantizer_utils.py
* annotate_conv_transpose_bn_relu and annotate_conv_transpose_bn -> this is for QAT
* annotate_conv_transpose_relu
conv_transpose + bn weight fusion is performed automatically and cannot be disabled currently;
we can add support for disabling this fusion later if needed.
Test Plan:
python test/test_quantization.py -k test_conv_transpose_bn_fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122046
Approved by: https://github.com/andrewor14
This does not introduce a new test; it is covered by checking that all the classes we already have still behave as before now that they no longer explicitly disable torch_function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120632
Approved by: https://github.com/ezyang
Summary: Today we don't allow free functions as the traced callable in torch.export. As part of migrating capture_pre_autograd_graph usages to torch.export, we need to ban free functions in capture_pre_autograd_graph as well.
Test Plan: CI
Differential Revision: D54319597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120817
Approved by: https://github.com/zhxchen17, https://github.com/andrewor14
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
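A usage sketch of the utility; the import path is an assumption:
```python
import torch
from torch.ao.quantization import model_is_exported  # import path assumed

def set_eval(m: torch.nn.Module) -> None:
    # Exported GraphModules cannot use .eval(); pick the right API automatically.
    if model_is_exported(m):
        torch.ao.quantization.move_exported_model_to_eval(m)
    else:
        m.eval()
```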
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported
Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD