Quantization Basics


Danyal Malik


June 20, 2024


Deep learning has a growing history of successes, but heavy models running on large graphics processing units (GPUs) are far from ideal when memory and compute are limited. A relatively new family of deep learning methods, called quantized neural networks, has appeared in answer to this problem.

How it works

Normally, higher-precision weights and activations are used in deep learning models, but quantized neural networks use lower-precision weights and activations. This reduces the memory and computation requirements of the model, making it faster and more efficient. Usually, floating point numbers are converted to integers, but there are many ways to quantize a neural network.
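As a rough illustration of the memory saving, the sketch below stores the same number of values as 32-bit floats and as 8-bit integers. The array size and the two dtypes are assumptions chosen for the example; the article does not prescribe them.

```python
import numpy as np

# Hypothetical layer with 1,000 weights, chosen only for illustration
n_weights = 1_000

weights_fp32 = np.zeros(n_weights, dtype=np.float32)  # 4 bytes per value
weights_int8 = np.zeros(n_weights, dtype=np.int8)     # 1 byte per value

print(weights_fp32.nbytes)  # 4000
print(weights_int8.nbytes)  # 1000
```

Under these assumptions the 8-bit copy takes a quarter of the memory of the 32-bit copy.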

Asymmetric quantization

In asymmetric quantization, a range of floating point numbers [A, B] is mapped to a range of integers [0, 2^N - 1]. The range of integers is determined by the number of bits N used to represent the integer. The range of floating point numbers is determined by the minimum and maximum values of the floating point numbers in the layer.

Figure A: Asymmetric Quantization
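The figure is not reproduced here, so the mapping is written out below. This is a reconstruction that follows the code later in the article; the symbols x (input value), x_q (quantized value), s (scale) and z (zero point) are notation assumed for this sketch and may differ from the original figure.

```latex
s = \frac{B - A}{2^N - 1}, \qquad
z = -\operatorname{round}\!\left(\frac{A}{s}\right), \qquad
x_q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s} + z\right),\; 0,\; 2^N - 1\right)
```

Here clamp saturates any value that falls outside the integer range to the nearest bound.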

Once we have quantized the numbers, we need a way to dequantize them. Dequantization is the process of converting the quantized numbers back to floating point numbers; it is the inverse of the quantization process. Because quantization rounds (and may clamp) the values, the round trip usually results in a loss of precision.

Figure B: Asymmetric Dequantization
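In the code later in the article, asymmetric dequantization is x_hat = s * (x_q - z), where s is the scale and z is the zero point. The sketch below walks one value through the round trip using numbers from the article's own example run (A = -40.1, B = 127.48, N = 8); the variable names are chosen for this illustration.

```python
# Range and bit width taken from the article's example run
A, B, N = -40.1, 127.48, 8

s = (B - A) / (2**N - 1)   # scale
z = -round(A / s)          # zero point

x = 89.74
x_q = min(max(round(x / s + z), 0), 2**N - 1)  # quantize, then clamp to [0, 2^N - 1]
x_hat = s * (x_q - z)                          # dequantize

print(x_q, round(x_hat, 2))  # 198 90.03
```

The recovered value differs from the original 89.74 by less than half a step (s / 2), which is the rounding error introduced by quantization.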

Symmetric quantization

In symmetric quantization, a range of floating point numbers [-A, A] is mapped to a range of integers [-2^(N-1), 2^(N-1) - 1]. The range of integers is determined by the number of bits N used to represent the integer. The range of floating point numbers is determined by the maximum absolute value of the floating point numbers in the layer.

Figure C: Symmetric Quantization
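The figure is not reproduced here, so the mapping is written out below as a reconstruction that follows the code later in the article. The symbols x, x_q and s (scale) are notation assumed for this sketch.

```latex
s = \frac{A}{2^{N-1} - 1}, \qquad
x_q = \operatorname{clamp}\!\left(\operatorname{round}\!\left(\frac{x}{s}\right),\; -2^{N-1},\; 2^{N-1} - 1\right)
```

There is no zero point: the floating point value 0 always maps to the integer 0. With this choice of scale, inputs in [-A, A] land in [-(2^(N-1) - 1), 2^(N-1) - 1], so the lowest integer -2^(N-1) is never produced.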

We can dequantize as follows.

Figure D: Symmetric Dequantization
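In the code later in the article, symmetric dequantization is simply x_hat = s * x_q. The sketch below walks one value through the round trip using numbers from the article's own example run (A = 127.48, N = 8); the variable names are chosen for this illustration.

```python
# Maximum absolute value and bit width taken from the article's example run
A, N = 127.48, 8

s = A / (2**(N - 1) - 1)   # scale

x = -35.99
x_q = min(max(round(x / s), -2**(N - 1)), 2**(N - 1) - 1)  # quantize, then clamp
x_hat = s * x_q                                            # dequantize

print(x_q, round(x_hat, 2))  # -36 -36.14
```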

Uniform vs Non-uniform quantization

So far, we have only discussed uniform quantization, where the range of floating point numbers is divided into equally spaced steps, one per integer. Non-uniform quantization is also possible, where the steps have different sizes across the range. We will not discuss non-uniform quantization in this article.
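A small sketch of what "uniform" means in practice: if every integer in the quantized range is dequantized, the resulting floating point values are evenly spaced, one scale apart. The bit width and scale below are hypothetical values picked for readability.

```python
import numpy as np

bits = 3     # hypothetical bit width, small enough to list every level
scale = 0.5  # hypothetical scale

# Every representable integer in the symmetric range, dequantized
levels = np.arange(-2**(bits - 1), 2**(bits - 1)) * scale
steps = np.diff(levels)

print(levels)
print(np.allclose(steps, scale))  # True
```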

Quantization Range Selection

The range is controlled by the A and B parameters mentioned above. There are many ways to select these. In asymmetric quantization, the simplest way is to select A = min(weights) and B = max(weights). Similarly, in symmetric quantization, we can select A = max(abs(weights)). There are more complex ways to select these parameters, but we will not discuss those in this article.
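The min/max selection described above can be sketched as follows. The helper names asymmetric_range and symmetric_range are hypothetical, and the input is a handful of values from the article's example tensor.

```python
import numpy as np

def asymmetric_range(weights: np.ndarray) -> tuple[float, float]:
    # Lower and upper end of the floating point range: min and max of the data
    return float(np.min(weights)), float(np.max(weights))

def symmetric_range(weights: np.ndarray) -> float:
    # Half-width of the range centred on zero: largest absolute value in the data
    return float(np.max(np.abs(weights)))

weights = np.array([-40.1, 0.0, 21.2, 127.48])

range_min, range_max = asymmetric_range(weights)
max_abs = symmetric_range(weights)

print(range_min, range_max)  # -40.1 127.48
print(max_abs)               # 127.48
```

Both selections are driven entirely by the extreme values in the data, so a single outlier widens the range and therefore the step size.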


Create a simple tensor with random items

import numpy as np

# Suppress scientific notation
np.set_printoptions(suppress=True)

# Generate randomly distributed parameters
params = np.random.uniform(low=-50, high=150, size=20)

# Make sure important values are at the beginning for better debugging
params[0] = params.max() + 1
params[1] = params.min() - 1
params[2] = 0

# Round each number to the second decimal place
params = np.round(params, 2)

# Print the parameters
print(params)
[127.48 -40.1    0.    89.74 124.38 -39.1  126.48  21.2  -35.99 124.16
   5.92  41.68  23.6  -26.4  -21.51 -20.6   94.49  85.07  70.11  76.91]

Define the quantization methods and quantize

def clamp(params_q: np.ndarray, lower_bound: int, upper_bound: int) -> np.ndarray:
    # Saturate values outside [lower_bound, upper_bound] (modifies params_q in place)
    params_q[params_q < lower_bound] = lower_bound
    params_q[params_q > upper_bound] = upper_bound
    return params_q

def asymmetric_quantization(params: np.ndarray, bits: int) -> tuple[np.ndarray, float, float]:
    # Calculate the scale and zero point
    alpha = np.max(params)
    beta = np.min(params)
    scale = (alpha - beta) / (2**bits-1)
    zero = -1*np.round(beta / scale)  # integer-valued, but stored as a float
    lower_bound, upper_bound = 0, 2**bits-1
    # Quantize the parameters
    quantized = clamp(np.round(params / scale + zero), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale, zero

def asymmetric_dequantize(params_q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    return (params_q - zero) * scale

def symmetric_dequantize(params_q: np.ndarray, scale: float) -> np.ndarray:
    return params_q * scale

def symmetric_quantization(params: np.ndarray, bits: int) -> tuple[np.ndarray, float]:
    # Calculate the scale
    alpha = np.max(np.abs(params))
    scale = alpha / (2**(bits-1)-1)
    lower_bound = -2**(bits-1)
    upper_bound = 2**(bits-1)-1
    # Quantize the parameters
    quantized = clamp(np.round(params / scale), lower_bound, upper_bound).astype(np.int32)
    return quantized, scale

def quantization_error(params: np.ndarray, params_q: np.ndarray) -> float:
    # Calculate the MSE between the original values and the dequantized values passed as params_q
    return np.mean((params - params_q)**2)

(asymmetric_q, asymmetric_scale, asymmetric_zero) = asymmetric_quantization(params, 8)
(symmetric_q, symmetric_scale) = symmetric_quantization(params, 8)

print(np.round(params, 2))
print()
print(f'Asymmetric s: {asymmetric_scale}, z: {asymmetric_zero}')
print(asymmetric_q)
print()
print(f'Symmetric s: {symmetric_scale}')
print(symmetric_q)
[127.48 -40.1    0.    89.74 124.38 -39.1  126.48  21.2  -35.99 124.16
   5.92  41.68  23.6  -26.4  -21.51 -20.6   94.49  85.07  70.11  76.91]

Asymmetric s: 0.6571764705882354, z: 61.0
[255   0  61 198 250   2 253  93   6 250  70 124  97  21  28  30 205 190
 168 178]

Symmetric s: 1.003779527559055
[127 -40   0  89 124 -39 126  21 -36 124   6  42  24 -26 -21 -21  94  85
  70  77]
# Dequantize the parameters back to floating point
params_deq_asymmetric = asymmetric_dequantize(asymmetric_q, asymmetric_scale, asymmetric_zero)
params_deq_symmetric = symmetric_dequantize(symmetric_q, symmetric_scale)

print(np.round(params, 2))
print()
print('Dequantize Asymmetric:')
print(np.round(params_deq_asymmetric, 2))
print()
print('Dequantize Symmetric:')
print(np.round(params_deq_symmetric, 2))
[127.48 -40.1    0.    89.74 124.38 -39.1  126.48  21.2  -35.99 124.16
   5.92  41.68  23.6  -26.4  -21.51 -20.6   94.49  85.07  70.11  76.91]

Dequantize Asymmetric:
[127.49 -40.09   0.    90.03 124.21 -38.77 126.18  21.03 -36.14 124.21
   5.91  41.4   23.66 -26.29 -21.69 -20.37  94.63  84.78  70.32  76.89]

Dequantize Symmetric:
[127.48 -40.15   0.    89.34 124.47 -39.15 126.48  21.08 -36.14 124.47
   6.02  42.16  24.09 -26.1  -21.08 -21.08  94.36  85.32  70.26  77.29]
# Calculate the quantization error
print(f'{"Asymmetric error: ":>20}{np.round(quantization_error(params, params_deq_asymmetric), 2)}')
print(f'{"Symmetric error: ":>20}{np.round(quantization_error(params, params_deq_symmetric), 2)}')
  Asymmetric error: 0.03
   Symmetric error: 0.08
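One plausible reading of the gap between the two errors is that the example tensor is skewed toward positive values, so symmetric quantization leaves part of its integer range unused and ends up with a coarser step (scale of about 1.00 versus about 0.66). The sketch below checks how much of the 8-bit range each scheme actually occupies, using the quantized arrays copied from the output above; the variable names asym_span and sym_span are chosen for this illustration.

```python
import numpy as np

# Quantized values copied from the article's printed output
asymmetric_q = np.array([255, 0, 61, 198, 250, 2, 253, 93, 6, 250,
                         70, 124, 97, 21, 28, 30, 205, 190, 168, 178])
symmetric_q = np.array([127, -40, 0, 89, 124, -39, 126, 21, -36, 124,
                        6, 42, 24, -26, -21, -21, 94, 85, 70, 77])

# Number of integer levels between the smallest and largest quantized value
asym_span = asymmetric_q.max() - asymmetric_q.min() + 1
sym_span = symmetric_q.max() - symmetric_q.min() + 1

print(asym_span, sym_span)  # 256 168
```

For this tensor, asymmetric quantization spreads the data over all 256 levels, while symmetric quantization uses only 168 of them.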


Umar Jamil’s Notebook

Joel Nicholas’ Blog Post