Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64981
This would have caused errors when observer.py was moved to ao.
see: D30391189
ghstack-source-id: 138118430
Test Plan:
buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_dynamic_quant_multi_uses (quantization.jit.test_quantize_jit.TestQuantizeDynamicJitPasses)'
buck test mode/opt //caffe2/test:quantization -- --exact 'caffe2/test:quantization - test_save_load_state_dict_script (quantization.core.test_workflow_module.TestObserver)'
Reviewed By: supriyar
Differential Revision: D30432008
fbshipit-source-id: 754727a89c78f6ceada6f8ff92c304f3953f38fc
Summary:
Fixes https://github.com/pytorch/pytorch/issues/63326
Currently `get_callable_args` has the side effect of mutating the input _PartialWrapper. When that input is one of the global defaults, there are all sorts of lifetime issues that crop up. (Details in the linked issue.) So far as I can tell, we only need to make a constructor which is module (and by extension device) aware, so making a fresh one should have the same effect without leaking the last call's module.
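A minimal sketch of the idea, simplified from the `_PartialWrapper`/`with_args` machinery in observer.py:
```
import functools

class _PartialWrapper:
    # simplified: observers use this to hand out factory-style constructors
    def __init__(self, p):
        self.p = p

    def __call__(self, *args, **keywords):
        return self.p(*args, **keywords)

def _with_args(cls_or_self, **kwargs):
    # return a fresh wrapper on every call instead of mutating a shared
    # global default, so the last caller's module/device never leaks
    return _PartialWrapper(functools.partial(cls_or_self, **kwargs))
```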
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63374
Test Plan: the repro in https://github.com/pytorch/pytorch/issues/63326 now reports no leaked Tensors, and all quantization tests pass locally.
Reviewed By: HDCharles
Differential Revision: D30359360
Pulled By: robieta
fbshipit-source-id: aef33261ac49952d8d90da868a57ab063dfc456e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62863
To make this consistent with other observers, add a reduce_range option that can be used to update quant_min/quant_max.
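For illustration, reduce_range conventionally drops one bit of the quantization range; a sketch assuming unsigned 8-bit defaults:
```
# sketch: effect of reduce_range on a quint8-style range
quant_min, quant_max = 0, 255
reduce_range = True
if reduce_range:
    # one fewer bit of range, guarding against overflow in some backends
    quant_min, quant_max = quant_min // 2, quant_max // 2  # -> 0, 127
```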
Test Plan:
python test/test_quantization.py test_fused_mod_reduce_range
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D30146602
fbshipit-source-id: a2015f095766f9c884611e9ab6942528bc9bc972
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62345
This PR updates the attribute names from min_vals to min_val. The motivation is to keep the attribute names consistent with the per-tensor observers, so that dependencies (like FusedMovingAvgObsFakeQuantize) don't need to differentiate between the two observer types to access the attributes.
It also adds some BC tests to make sure that observers saved earlier with min_vals/max_vals can still be loaded, depending on the state_dict version.
Note: Scriptability of the observers isn't fully supported yet, so we aren't testing for that in this PR.
Test Plan:
python test/test_quantization.py TestSerialization
Imported from OSS
Reviewed By: HDCharles
Differential Revision: D30003700
fbshipit-source-id: 20e673f1bb15e2b209551b6b9d5f8f3be3f85c0a
Summary:
This PR enables GPU-only quantization, best used with is_reference since there are not many GPU kernels for quantized ops as of now.
This PR mainly changes how qconfigs and their observer constructors operate once they are attached to a module's qconfig. The function add_module_to_qconfig_obs_ctr takes the observer constructors on the original qconfig and configures them so that, when invoked, the created observers will be on whatever device the module occupies. (Once observers are created, module.to(device) is already set up so that it moves any observers.) To do this, a new method and a few small changes were added to the _PartialWrapper class that our observers already use to create constructors, without changing the existing functionality. These changes work in concert with changes to the prepare flow, so that when the qconfigs are propagated to the modules (in quantize.py and qconfig_utils.py) they are configured using add_module_to_qconfig_obs_ctr; a rough sketch follows below.
Ideally this would work on other models too, but the is_reference support for a lot of modules isn't there yet; those tests should be added in a future PR.
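A rough sketch of the wiring described above; apart from the add_module_to_qconfig_obs_ctr name and the with_args convention, the internals here are assumptions:
```
import torch
from torch.quantization import QConfig

def add_module_to_qconfig_obs_ctr(qconfig, module):
    # rebuild the qconfig's observer constructors so that, when invoked
    # during prepare, the created observers land on the module's device
    if qconfig is None or module is None:
        return qconfig
    devices = {p.device for p in module.parameters()}
    device = next(iter(devices)) if devices else None

    def configure(obs_ctr):
        if device is None:
            return obs_ctr
        return obs_ctr.with_args(factory_kwargs={"device": device})

    return QConfig(activation=configure(qconfig.activation),
                   weight=configure(qconfig.weight))
```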
Test Plan:
python test/test_quantization.py TestQuantizeFxModels.test_static_gpu_convert_basic
python test/test_quantization.py TestQuantizeFxModels.test_switch_device_prepare_convert
python test/test_quantization.py TestQuantizeFxModels.test_prepare_serialize_switch_device_convert
python test/test_quantization.py TestQuantizeFx.test_qconfig_precedence
Reviewed By: vkuzo
Differential Revision: D29684114
fbshipit-source-id: 19fefb8e1998eaf212723e836276ccf39467f2e7
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61317
Add an overload to fake_quantize_per_tensor that accepts scale/zero_point as input. The reasons to do this are
* required for the fused observer + fake_quant operator on GPU, where the scale/zero_point will be calculated by the observer on device. Passing tensor inputs lets us access the scale/zero_point values directly in the CUDA kernel and avoid extra copies/mallocs (usage sketch below)
* enables us to pass in float as scale dtype and int32 as zero_point dtype (which is consistent with what the quantize call actually uses) https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/quantized/affine_quantizer_base.cpp#L52-L53
* overload consistent with `quantize_per_tensor.tensor_qparams`
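A usage sketch of the tensor-qparams overload, assuming it is reachable via `torch.fake_quantize_per_tensor_affine`:
```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, device=device)
# scale/zero_point passed as tensors stay on-device; no host round trip
scale = torch.tensor(0.1, dtype=torch.float, device=device)
zero_point = torch.tensor(0, dtype=torch.int32, device=device)
y = torch.fake_quantize_per_tensor_affine(x, scale, zero_point, 0, 255)
```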
ghstack-source-id: 133370216
Test Plan:
buck test mode/dev-nosan caffe2/test/:quantization -- test_backward_per_tensor_cachemask
buck test mode/dev-nosan caffe2/test/:quantization -- test_forward_per_tensor_cachemask
Reviewed By: raghuramank100
Differential Revision: D29552727
fbshipit-source-id: cbb9af40fc575ad27a29c646b760d5ee52cc923d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60883
As per this [comment](https://github.com/pytorch/pytorch/pull/59964#discussion_r659064270), I created a `reset_min_max_vals()` function inside the observers which will be called during input-weight equalization. This is so that we will not expose the implementation of the observers in the equalization code.
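A sketch of the new hook, assuming MinMaxObserver-style min_val/max_val buffers:
```
# inside MinMaxObserver (sketch)
def reset_min_max_vals(self):
    # reset observed ranges so equalization can re-observe from scratch,
    # without the equalization code reaching into observer internals
    self.min_val = torch.tensor(float("inf"))
    self.max_val = torch.tensor(float("-inf"))
```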
Test Plan:
`python test/test_quantization.py TestEqualizeFx`
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D29491848
fbshipit-source-id: 00e91959ceb3b4f3688175a1a7ba11823e929b2f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60386
During QAT we sometimes encounter errors with scripted models:
`RuntimeError: cannot resize variables that require grad`
For per-tensor cases we don't need to resize some buffers, so this PR removes the extra resize ops where applicable.
Test Plan: Imported from OSS
Reviewed By: vkuzo
Differential Revision: D29271905
fbshipit-source-id: 01a484a9559a3a4180490f9476d0cd3044ba0d1b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59953
The following modifications were made to the equalization
observers due to design changes:
- [InputEqualizationObserver] Replaced `calculate_qparams()` with `calculate_scaled_minmax()`, since we need to return the scaled min/max values to update the following input quantization observer (see the sketch after this list)
- [WeightEqualizationObserver] We no longer need a row observer, since this is taken care of by the following weight quantization observer
- [WeightEqualizationObserver] Following the previous comment, we no longer need to calculate the scaled qparam values. Instead, we use the equalization scale to later scale the weights, and the qparams are taken care of by the weight quantization observer.
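A loose sketch of what `calculate_scaled_minmax()` returns; the reshaping/reduction details in the real implementation may differ:
```
# inside InputEqualizationObserver (sketch)
def calculate_scaled_minmax(self):
    # scale the observed input range by the equalization scale, so the
    # following input quantization observer sees the equalized range
    min_scaled = torch.min(self.min_val * self.equalization_scale)
    max_scaled = torch.max(self.max_val * self.equalization_scale)
    return min_scaled, max_scaled
```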
Test Plan:
`python test/test_quantization.py
TestEqualizeFx.test_input_weight_eq_observer`
Imported from OSS
Reviewed By: supriyar
Differential Revision: D29135332
fbshipit-source-id: be7e468273c8b62fc183b1e1ec50f6bd6d8cf831
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57068
When training with the histogram observer enabled, we hit this runtime error:
```
torch/quantization/observer.py", line 942, in forward
self.bins)
self.histogram.resize_(combined_histogram.shape)
~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
self.histogram.copy_(combined_histogram)
self.min_val.resize_(combined_min.shape)
RuntimeError: cannot resize variables that require grad
```
Since the histogram observer only collects histogram statistics, it should not need gradients, so we detach the buffers via the `detach_()` method before resizing.
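A sketch of the fix applied to the snippet above:
```
# drop autograd tracking in place; the resize is then legal
self.histogram.detach_()
self.histogram.resize_(combined_histogram.shape)
self.histogram.copy_(combined_histogram)
```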
Test Plan:
- arc lint
- Train with histogram observer turned on, training finished successfully
f264139727
Reviewed By: supriyar
Differential Revision: D27147212
fbshipit-source-id: abed5b9c4570ffc6bb60e58e64791cfce66856cd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57067
Auto-format the code.
Test Plan: lint
Reviewed By: jerryzh168
Differential Revision: D27147213
fbshipit-source-id: 008871d276c8891b2411549e17617e5c27d16ee3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49800
Ensures that having a Tensor with 0 elements does not crash observers.
Note: it's illegal to pass Tensors with 0 elements to reductions such
as min and max, so we gate this out before the logic hits min/max.
This should not be hit often in practice, but it's coming up
during debugging of some RCNN models with test inputs.
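The gate amounts to an early return before any reduction runs, roughly:
```
# inside the observer's forward (sketch)
def forward(self, x_orig):
    if x_orig.numel() == 0:
        # min/max reductions reject empty tensors, so pass through
        return x_orig
    # ... existing min/max update logic ...
    return x_orig
```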
Test Plan:
```
python test/test_quantization.py TestObserver.test_zero_numel
```
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D25693230
fbshipit-source-id: d737559697c98bd923356edacba895835060bb38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48069
Also renamed float_qparam_dynamic_qconfig to float_qparam_weight_only_qconfig.
It's not used in user code yet, so we only need to update the tests.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D25010175
fbshipit-source-id: caa3eaa5358a8bc5c808bf5f64e6ebff3e0b61e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47514
Previously, the scale and zero_point were returned on the CPU even if the input tensor was on the GPU.
This is because `copy_()` doesn't respect the device when copying over the tensor.
Also fixed a bug where we were always setting the device to 'cuda' (irrespective of the device id) in the `calculate_qparams` function.
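A sketch of the corrected behavior, assuming the stats buffers carry the authoritative device:
```
# inside _calculate_qparams (sketch)
# derive the full device, including its index (e.g. cuda:1), from the
# observer's buffers instead of hardcoding 'cuda'
device = self.min_val.device
scale = torch.ones(self.min_val.shape, dtype=torch.float32, device=device)
zero_point = torch.zeros(self.min_val.shape, dtype=torch.int64, device=device)
```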
Test Plan:
python test/test_quantization.py TestObserver.test_observer_qparams_respects_device_affinity
Imported from OSS
Reviewed By: vkuzo
Differential Revision: D24800495
fbshipit-source-id: d7a76c59569842ed69029d0eb4fa9df63f87e28c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45752
Use the torch.quint4x2 dtype (added in the previous PR) to create 4-bit packed tensors.
These packed tensors can be directly consumed by the operator.
Serialization of the packed tensors is supported via a torchbind custom class.
Module support will follow in a later PR.
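A usage sketch, assuming per-channel quantization accepts the new dtype:
```
import torch

weight = torch.randn(10, 12)
scales = torch.ones(10)
zero_points = torch.zeros(10, dtype=torch.int64)
# packs two 4-bit values per byte along each row
qweight = torch.quantize_per_channel(
    weight, scales, zero_points, axis=0, dtype=torch.quint4x2)
```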
Test Plan:
python test/test_quantization.py TestEmbeddingBagOps
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D24120996
fbshipit-source-id: 2639353b3343ebc69e058b5ba237d3fc56728e1c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45343
The current default dynamic quant observer is not correct, since for dynamic quantization we don't accumulate min/max and we don't need to calculate qparams.
Test Plan: Imported from OSS
Reviewed By: supriyar
Differential Revision: D23933995
fbshipit-source-id: 3ff497c9f5f74c687e8e343ab9948d05ccbba09b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44846
The save function traverses the model state_dict to pick out the observer stats; the load function traverses the module hierarchy to load the state dict into module attributes, depending on the observer type.
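A round-trip sketch, assuming the helpers are exported as get_observer_state_dict/load_observer_state_dict:
```
import torch
import torch.quantization as tq

model = torch.nn.Sequential(torch.nn.Linear(4, 4))
model.qconfig = tq.default_qconfig
tq.prepare(model, inplace=True)
model(torch.randn(2, 4))                      # populate observer stats

obs_dict = tq.get_observer_state_dict(model)  # save: observer entries only

fresh = torch.nn.Sequential(torch.nn.Linear(4, 4))
fresh.qconfig = tq.default_qconfig
tq.prepare(fresh, inplace=True)
tq.load_observer_state_dict(fresh, obs_dict)  # load: restore by observer type
```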
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_save_observer_state_dict
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23746821
fbshipit-source-id: 05c571b62949a2833602d736a81924d77e7ade55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44956
Makes the buffers for HistogramObserver have the same shapes in the uninitialized and initialized states.
This is useful because the detectron2 checkpointer assumes
that these states will stay the same, so it removes the
need for manual hacks around the shapes changing.
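A sketch of the change, giving the buffers their final shapes at construction time:
```
# inside HistogramObserver.__init__ (sketch; self.bins set earlier)
self.register_buffer("histogram", torch.zeros(self.bins))
self.register_buffer("min_val", torch.tensor(float("inf")))
self.register_buffer("max_val", torch.tensor(float("-inf")))
```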
Test Plan:
```
python test/test_quantization.py TestObserver.test_histogram_observer_consistent_buffer_shape
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23785382
fbshipit-source-id: 1a83fd4f39b244b00747c368d5d305a07d877c92
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44749
Ensure an FX module is scriptable after calling prepare_qat on it
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_qat_and_script
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23718380
fbshipit-source-id: abf63ffb21e707f7def8f6c88246877f5aded58c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44537
Originally, the `min_val`, `max_val`, `min_vals`, `max_vals`
attributes of observers were Tensors but not buffers. They had custom
state_dict save/load code to ensure their state was saved.
At some point, these attributes became buffers, and the custom
save/load code remained. This introduced a subtle bug:
* create model A, move it to a device (cpu/cuda) and save its state_dict
* create model B, load its state dict.
* `min_val|min_vals|max_val|max_vals` would always be loaded to model A's device, even if the rest of model B was on a different device
* the above is inconsistent with how save/load on different devices is expected to work (see https://pytorch.org/tutorials/beginner/saving_loading_models.html#saving-loading-model-across-devices)
In practice, the case people would sometimes hit is:
* model A is on CPU, state dict is saved
* model B is created and moved to GPU, state_dict from model A is loaded
* assertions throw when operations are attempted across different devices
This PR fixes the behavior by removing the custom save/load where
possible and letting the default `nn.Module` save/load code handle
device assignment. We special-case `PerChannelMinMaxObserver` and its children to allow loading buffers of a different size, which is normal.
There are some followups to also enable this for HistogramObserver
and FakeQuantize, which can be done in separate PRs due to higher
complexity.
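A sketch of the per-channel special case; the real override also handles version metadata, elided here, and the buffer names follow the per-channel observers of this era:
```
# inside PerChannelMinMaxObserver (sketch)
def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
    # channel counts legitimately vary across checkpoints, so resize the
    # local buffers to the incoming shapes before the default copy runs
    for name in ("min_vals", "max_vals"):
        key = prefix + name
        if key in state_dict:
            getattr(self, name).resize_(state_dict[key].shape)
    super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)
```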
Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```
Imported from OSS
Reviewed By: raghuramank100
Differential Revision: D23644493
fbshipit-source-id: 0dbb6aa309ad569a91a663b9ee7e44644080032e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43789
Since it's a single element, resizing shouldn't be necessary; in some cases we may not be able to resize the buffers at all.
Test Plan: unit tests
Reviewed By: supriyar
Differential Revision: D23393108
fbshipit-source-id: 46cd7f73ed42a05093662213978a01ee726433eb
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43151
Using `torch.all` instead of `torch.sum` plus a length check.
It's unclear whether the perf increase (~5% for small inputs) is real, but it should be a net benefit, especially for inputs with more channels.
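Roughly the before/after of the check:
```
import torch

min_val = torch.tensor([0.0, -1.0])
max_val = torch.tensor([1.0, 2.0])
# before: reduce, then compare against the length
ok_old = torch.sum(min_val <= max_val) == len(min_val)
# after: a single fused reduction
ok_new = torch.all(min_val <= max_val)
```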
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170426
fbshipit-source-id: ee5c25eb93cee1430661128ac9458a9c525df8e5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43150
The current logic was expensive because it created tensors on CUDA.
Switching to clamp, since it works without needing to create tensors.
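Roughly the before/after; the exact 'before' shape is an assumption:
```
import torch

qmin, qmax = 0, 255
zero_point = torch.tensor([300.0])
# before (sketch): allocates scalar tensors just to compare against
zero_point = torch.min(torch.max(zero_point, torch.tensor(float(qmin))),
                       torch.tensor(float(qmax)))
# after: clamp takes Python scalars directly, so nothing is allocated
zero_point = torch.clamp(zero_point, qmin, qmax)
```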
Test Plan:
benchmarks
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170427
fbshipit-source-id: 6fe3a728e737aca9f6c2c4d518c6376738577e21
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43149
This value doesn't change, so make it a buffer and pay the cost of creating the tensor only once.
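A sketch of the change in the observer constructor:
```
# inside the observer __init__ (sketch): build the eps tensor once here
# rather than on every calculate_qparams call
self.register_buffer("eps", torch.tensor([torch.finfo(torch.float32).eps]))
```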
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23170428
fbshipit-source-id: 6b963951a573efcc5b5a57649c814590b448dd72
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42602
In this diff, clearer semantics and naming are introduced by splitting the original `init_dynamic_qrange` into two separate `Optional[int]` parameters, `qmin` and `qmax`, to avoid confusing these parameters with dynamic quantization.
The `qmin` and `qmax` parameters allow users to specify their own custom quantization range, enabling use cases such as lower-bit quantization.
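A usage sketch under the new naming; treat the exact keyword spelling as an assumption (later PyTorch releases spell these quant_min/quant_max):
```
import torch
from torch.quantization import MinMaxObserver

# a 4-bit range: 2**4 == 16 levels
obs = MinMaxObserver(dtype=torch.quint8, qmin=0, qmax=15)
obs(torch.randn(4, 4))
scale, zero_point = obs.calculate_qparams()
```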
Test Plan:
To assert the correctness and compatibility of the changes with existing observers, on a devvm, execute the following command to run the unit tests:
`buck test //caffe2/test:quantization -- observer`
Reviewed By: vkuzo, raghuramank100
Differential Revision: D22948334
fbshipit-source-id: 275bc8c9b5db4ba76fc2e79ed938376ea4f5a37c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42348
Use the dtype info in PlaceholderObserver to decide which ops to insert in the graph.
In the next PR we can delete NoopObserver.
Test Plan:
python test/test_quantization.py
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22859457
fbshipit-source-id: a5c618f22315534ebd9a2df77b14a0aece196989
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42221
Adds a new observer that emits a warning if the range of the tensor is beyond the fp16 range. This will later be used in graph mode quantization to insert fp16 cast ops into the graph.
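The check amounts to comparing against the fp16 limits, roughly:
```
import warnings
import torch

def check_fp16_range(x):
    fp16_max = torch.finfo(torch.float16).max  # 65504.0
    if x.abs().max() > fp16_max:
        warnings.warn("tensor range exceeds fp16; values would saturate on cast")

check_fp16_range(torch.tensor([1e5]))  # warns
```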
Test Plan:
python test/test_quantization.py TestObserver.test_fp16_observer
Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D22849222
fbshipit-source-id: a301281ce38ba4d4e7a009308400d34a08c113d2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41612
This change adds preliminary support for quantizing the EmbeddingBag operators. We currently support 4-bit and 8-bit quantization+packing of the weights.
To quantize these operators, specify the operator name in the `custom_op_name` field of the NoopObserver. Based on the op name (4bit or 8bit) we call the corresponding quantization functions.
Refer to the test plan for how to invoke the qconfig for the embedding_bag ops; a sketch also follows below.
Future versions of this will support 4-bit and 2-bit qtensors with native support for observing and quantizing them.
NB - This version assumes that the weights in the EmbeddingBag module reside on the same device.
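A qconfig sketch per the description above; the exact custom_op_name strings are assumptions:
```
import torch
from torch.quantization import QConfigDynamic, NoopObserver

# route EmbeddingBag weights through the 4-bit quantize+pack path
int4_qconfig = QConfigDynamic(
    activation=NoopObserver.with_args(dtype=torch.float,
                                      custom_op_name="embedding_bag_4bit"),
    weight=NoopObserver.with_args(custom_op_name="embedding_bag_4bit"),
)
```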
Test Plan:
python test/test_quantization.py TestQuantizeDynamicJitOps.test_embedding_bag
Imported from OSS
Reviewed By: vkuzo, jerryzh168
Differential Revision: D22609342
fbshipit-source-id: 23e33f44a451c26719e6e283e87fbf09b584c0e6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41815
**All are minor changes to enable better simulations.**
The constructors of MinMaxObserver, MovingAverageMinMaxObserver, PerChannelMinMaxObserver, and MovingAveragePerChannelMinMaxObserver are augmented so they can utilize the dynamic quantization range support in the _ObserverBase class.
In addition, minor adjustments are made to the enable_static_observation function, which allows the observer to update parameters without fake quantizing the output (for constructing baselines).
Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
```
buck test //caffe2/test:quantization -- observer
```
Reviewed By: z-a-f
Differential Revision: D22649128
fbshipit-source-id: 32393b706f9b69579dc2f644fb4859924d1f3773
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/41113
In this diff, the `ObserverBase` class is augmented with two additional optional arguments, qmin and qmax. Correspondingly, the calculation of qmin and qmax and of the related quantization parameters is modified to accommodate this additional flexibility when the number of bits for quantization is lower than 8 (the default).
Additional logic in the base class's `_calculate_qparams` function has also been modified to support a dynamic quantization range.
Test Plan:
To ensure this modification is still backward compatible with past usages, numerics are verified by running the quantization unit test suite, which contains various observer tests. The following command executes the test suite, which also verifies the observer numerics:
`buck test //caffe2/test:quantization -- observer`
This modified observer script can be tested within the experiments for lower bit fake quantization. Please see the following diffs for reference.
- Single Fake Quantizer: D22337447
- Single Conv Layer: D22338532
Reviewed By: z-a-f
Differential Revision: D22427134
fbshipit-source-id: f405e633289322078b0f4a417f54b684adff2549