Fix INT32 bias overflow in QOperator INT8 symmetric quantization by adjusting weight scale and requantizing (microsoft#25278)
### Overview
This PR introduces a critical fix for **QOperator INT8 symmetric
quantization** in ONNX Runtime. It addresses a situation where the
computed **bias scale** (`input_scale * weight_scale`) becomes too
small, leading to **int32 overflow** or **precision clipping** during
bias quantization.
### Problem
In symmetric quantization (i.e., zero_point = 0), the bias tensor is
quantized using a fixed-point scale:
**bias_scale = input_scale * weight_scale**
When this value is too small, the quantized bias values can exceed the
`int32` range, causing saturation or significant quantization error.
This was observed to cause **>51% accuracy loss** in some models.
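A minimal numeric sketch of the failure mode (the values are illustrative, not taken from a real model):

```python
import numpy as np

# Illustrative values: a very small input_scale * weight_scale
# pushes the quantized bias past the int32 range.
input_scale = 1e-4
weight_scale = 1e-5
bias_scale = input_scale * weight_scale  # ~1e-9

bias_float = np.float32(3.5)
q_bias = bias_float / bias_scale         # ~3.5e9

int32_max = np.iinfo(np.int32).max       # 2147483647
print(q_bias > int32_max)                # True: the bias would saturate
```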
### Solution
This PR adds two new functions to mitigate this:
---
#### 🔧 `_adjust_weight_scale_for_int32_bias(...)`
Located in `onnx_quantizer.py`, this function:
- **Inspects the float bias range** to compute the smallest valid bias
scale (based on int32 dynamic range)
- **Compares** this threshold against `input_scale * weight_scale`
- If it is too small, **scales up the weight scale** just enough to prevent
overflow
- Supports both per-tensor and per-channel weight quantization cases
This logic is **only triggered when**:
- The weight's zero point is exactly zero (i.e. symmetric)
- The weight data type is `INT8` or `INT16`
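The per-tensor case of this logic can be sketched as follows (a simplified, hypothetical version; the actual helper in `onnx_quantizer.py` also handles per-channel scales and the INT16 range):

```python
import numpy as np

def adjust_weight_scale_for_int32_bias(input_scale, weight_scale, bias_float):
    # Hypothetical per-tensor sketch of the adjustment logic; the real
    # helper also covers per-channel weight quantization.
    int32_max = np.iinfo(np.int32).max
    max_abs_bias = float(np.max(np.abs(bias_float)))
    if max_abs_bias == 0.0:
        return weight_scale  # nothing to protect against
    # Smallest bias scale for which |bias / scale| still fits in int32.
    min_bias_scale = max_abs_bias / int32_max
    bias_scale = input_scale * weight_scale
    if bias_scale < min_bias_scale:
        # Widen the weight scale by the shortfall ratio.
        weight_scale *= min_bias_scale / bias_scale
    return weight_scale
```

For example, with `input_scale = 1e-4`, `weight_scale = 1e-5`, and a maximum bias magnitude of 3.5, the weight scale grows so that the bias quantizes near the top of the int32 range instead of overflowing.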
---
#### 🔄 `_requantize_weight(...)`
After weight scale adjustment, this function:
- **Finds the original quantized weight** (`q_weight`), scale, and zero
point from the initializer list
- **Removes** the outdated quantized weight and scale
- **Re-quantizes** the original float weights using the new scale and
the same zero point
- **Re-inserts** them into the model to maintain consistency
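In spirit, the re-quantization step amounts to the following (a sketch only; the actual function operates on the model's initializer protos rather than bare arrays):

```python
import numpy as np

def requantize_weight(weight_float, new_scale, zero_point=0):
    # Sketch of re-quantizing float weights with the adjusted scale;
    # zero_point stays 0 in the symmetric case.
    q = np.round(weight_float / new_scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)
```

Because the adjustment only ever enlarges the weight scale, the re-quantized values shrink toward zero and remain within int8 range; the clip is just a guard.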
---
### Summary of Benefits
- ✅ Prevents int32 overflow or saturation during symmetric bias
quantization
- ✅ Ensures weight and bias quantization remain consistent
- ✅ Reduced quantization error from >51.4% to ~3% in test models
- ✅ Fix is limited in scope to QOperator + symmetric INT8/INT16 flow
(safe for other modes)
- ✅ Improves robustness of static quantization for hardware that
performs integer-only inference
---
### Code Location
- `onnxruntime/quantization/onnx_quantizer.py`
- `def _adjust_weight_scale_for_int32_bias(...)`
- `def _requantize_weight(...)`
- Integrated in `quantize_bias_static(...)`
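Putting the pieces together, the integration can be pictured roughly as below. This is a self-contained per-tensor sketch with illustrative names, not the actual ONNX Runtime code, and it assumes a nonzero weight tensor:

```python
import numpy as np

def quantize_with_bias_protection(weight_float, bias_float, input_scale):
    # Rough per-tensor sketch of the quantize_bias_static(...) flow
    # with the new overflow guard; all names here are illustrative.
    int32_max = np.iinfo(np.int32).max

    # 1. Symmetric int8 weight scale (assumes a nonzero weight tensor).
    weight_scale = float(np.max(np.abs(weight_float))) / 127.0

    # 2. Widen the weight scale if the bias scale would be too small.
    min_bias_scale = float(np.max(np.abs(bias_float))) / int32_max
    if input_scale * weight_scale < min_bias_scale:
        weight_scale = min_bias_scale / input_scale

    # 3. Re-quantize the weight with the (possibly adjusted) scale.
    q_weight = np.clip(np.round(weight_float / weight_scale),
                       -128, 127).astype(np.int8)

    # 4. Quantize the bias; it now fits in int32.
    bias_scale = input_scale * weight_scale
    q_bias = np.clip(np.round(bias_float / bias_scale),
                     -int32_max - 1, int32_max).astype(np.int32)
    return q_weight, q_bias, weight_scale
```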
---
Please let me know if you'd like additional test coverage or integration
points. Thanks!