Applications of TinyML: Machine Learning on Mobile and Edge IoT Devices (2)


Why are 8-Bits Enough for ML?

When neural networks were first being developed, the biggest challenge was getting them to work at all! That meant that accuracy and speed during training were the top priorities. Using floating-point arithmetic was the easiest way to preserve accuracy, and GPUs were well-equipped to accelerate those calculations, so it’s natural that not much attention was paid to other numerical formats.

These days, we actually have a lot of models being deployed in commercial applications. The computation demands of training grow with the number of researchers, but the cycles needed for inference expand in proportion to the number of users. That means that inference efficiency has become a burning issue for a lot of teams in organizations that are deploying ML solutions, TinyML included.

That is where quantization comes in. It’s an umbrella term that covers a lot of different techniques for storing numbers and performing calculations on them in more compact formats than 32-bit floating point. We are going to focus on eight-bit fixed point.

Why does Quantization Work?

Neural networks are trained through stochastic gradient descent: applying many tiny nudges to the weights. These small increments typically need floating-point precision to work (though there are research efforts to use quantized representations here too); otherwise, you can get yourself into a pickle with such things as “vanishing gradients.” Recall that activation functions squash a large range of values into a rather confined numerical representation, so that large changes in the input values do not cause the network to have catastrophically different behavior.

Taking a pre-trained model and running inference is very different. One of the magical qualities of deep networks is that they tend to cope very well with high levels of noise in their inputs. If you think about recognizing an object in a photo you’ve just taken, the network has to ignore all the CCD noise, lighting changes, and other non-essential differences between it and the training examples it’s seen before, and focus on the important similarities instead. This ability means that they seem to treat low-precision calculations as just another source of noise, and still produce accurate results even with numerical formats that hold less information.

You can run many neural networks with eight-bit parameters and intermediate buffers (instead of full-precision 32-bit floating-point values) and suffer no noticeable loss in the final accuracy. OK, so fine, sometimes you might suffer a little loss in accuracy, but often the gains you get in terms of latency and memory bandwidth make it worthwhile.

Why Quantize?

Neural network models can take up a lot of space on disk, with the original AlexNet being over 200 MB in float format, for example. Almost all of that size is taken up with the weights since there are often many millions of these in a single model. Because they’re all slightly different floating-point numbers, simple compression formats like “zip” don’t compress them well. They are arranged in large layers though, and within each layer, the weights tend to be normally distributed within a certain range, for example, -3.0 to 6.0.

The simplest motivation for quantization is to shrink file sizes by storing the min and max for each layer and then compressing each float value to an eight-bit integer representing the closest real number in a linear set of 256 within the range. For example with the -3.0 to 6.0 range, a 0 byte would represent -3.0, a 255 would stand for 6.0, and 128 would represent about 1.5. This means you can get the benefit of a file on disk that’s shrunk by 75%, and then convert back to float after loading so that your existing floating-point code can work without any changes.
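To make the arithmetic concrete, here is a minimal sketch of that scheme for the -3.0 to 6.0 example (the function names and defaults are just for illustration):

```python
def quantize(value, range_min=-3.0, range_max=6.0):
    """Map a float in [range_min, range_max] to a byte 0..255."""
    scale = (range_max - range_min) / 255.0
    return int(round((value - range_min) / scale))

def dequantize(byte, range_min=-3.0, range_max=6.0):
    """Map a byte 0..255 back to the closest representable float."""
    scale = (range_max - range_min) / 255.0
    return range_min + byte * scale

print(quantize(-3.0))             # 0
print(quantize(6.0))              # 255
print(round(dequantize(128), 3))  # 1.518, i.e. about 1.5
```

Note that each round trip through `quantize` and `dequantize` snaps a value to the nearest of the 256 representable numbers; that rounding is exactly the "noise" the network has to tolerate.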

Another reason to quantize is to reduce the computational resources you need to do the inference calculations, by running them entirely with eight-bit inputs and outputs. This is a lot more difficult since it requires changes everywhere you do calculations, but offers a lot of potential rewards. Fetching eight-bit values requires only 25% of the memory bandwidth of floats, so you’ll make much better use of caches and avoid bottlenecking on RAM access. You can also typically use hardware-accelerated Single-Instruction Multiple-Data (SIMD) operations that do many more operations per clock cycle. In some cases, you’ll have a digital signal processor (DSP) chip available that can accelerate eight-bit calculations too, which can offer a lot of advantages.
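As a rough sketch of what those eight-bit calculations look like, the toy example below does an int8 dot product with int32 accumulation, the same pattern SIMD and DSP instructions implement in hardware (the scale values at the end are illustrative, not from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
x_q = rng.integers(-128, 128, size=1024, dtype=np.int8)  # quantized activations
w_q = rng.integers(-128, 128, size=1024, dtype=np.int8)  # quantized weights

# Widen before multiplying: an int8 * int8 product can reach +/-16384,
# and summing 1024 of those needs ~24 bits, so accumulate in int32.
acc = np.dot(x_q.astype(np.int32), w_q.astype(np.int32))

# Rescale once at the end using the per-tensor scales (illustrative values).
x_scale, w_scale = 0.05, 0.02
result_fp32 = acc * (x_scale * w_scale)
```

The key design point is that the expensive inner loop touches only 8-bit data, and the single floating-point rescale happens once per output value.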

Moving calculations over to eight-bit will help you run your models faster, and use less power, which is especially important on mobile devices. It also opens the door to a lot of embedded systems that can’t run floating-point code efficiently, so it can enable a lot of applications in the TinyML world.

Quantization

Quantization is an optimization that works by reducing the precision of the numbers used to represent a model's parameters, which by default are 32-bit floating point numbers. This results in a smaller model size, better portability and faster computation.

Post Training Quantization (PTQ)

First import the needed packages

# For Numpy
import matplotlib.pyplot as plt
import numpy as np
import pprint
import re
import sys
# For TensorFlow Lite (also uses some of the above)
import logging
logging.getLogger("tensorflow").setLevel(logging.DEBUG)
import tensorflow as tf
from tensorflow import keras
import pathlib

Exploring Post Training Quantization Algorithms in Python

Let us assume we have a weight array of size (256, 256).

weights = np.random.randn(256, 256)

In Post Training Quantization, we map the 32-bit floating point numbers to 8-bit integers. To do this, we need to find a very important value: the scale. The scale value is used to convert numbers back and forth between the two representations. For example, 32-bit floating point numbers can be constructed from 8-bit integers by the following formula:

FP32_Reconstructed_Value = Scale × Int8_Value


To make sure we can cover the complete weight distribution, the scale value needs to take into account the full range of weight values, which we can compute using the following formula. The denominator is 256 because that is the number of values that can be represented using 8 bits (2^8 = 256).

scale = (max(weights) - min(weights)) / 256

Now let's code this up!

We can then use this function to quantize our weights and then reconstruct them back to floating point format, and see what kinds of errors are introduced by this process. Our hope is that the errors are generally small, showing that this process does a good job of representing our weights in a more compact format. In general, a smaller scale tends to produce smaller errors, since we are not lumping as many numbers into the same bin.

def quantizeAndReconstruct(weights):
    """
    @param weights: np.ndarray

    This function computes the scale value to map fp32 values to int8.
    It returns a weight matrix in fp32 that is representable using 8 bits.
    """
    # Compute the range of the weights.
    max_weight = np.max(weights)
    min_weight = np.min(weights)
    weight_range = max_weight - min_weight  # avoid shadowing the builtin `range`

    max_int8 = 2**8

    # Compute the scale
    scale = weight_range / max_int8

    # Compute the midpoint
    midpoint = np.mean([max_weight, min_weight])

    # Next, we need to map the real fp32 values to the integers. For this, we make
    # use of the computed scale. By dividing the centered weight matrix by the
    # scale, its range becomes roughly (-128, 128). Now, we can simply round the
    # full precision numbers to the closest integers.
    centered_weights = weights - midpoint
    quantized_weights = np.rint(centered_weights / scale)

    # Now, we can reconstruct the values back to fp32.
    reconstructed_weights = scale * quantized_weights + midpoint
    return reconstructed_weights
reconstructed_weights = quantizeAndReconstruct(weights)
print("Original weight matrix\n", weights)
print("Weight Matrix after reconstruction\n", reconstructed_weights)
errors = reconstructed_weights-weights
max_error = np.max(errors)
print("Max Error  : ", max_error)
reconstructed_weights.shape

Output:

Original weight matrix
 [[-0.68503607 -0.24720313  1.79082141 ...  0.53586705  0.15627166
   0.93190925]
 [-0.28559666  0.74497075  1.14643476 ... -0.81286924  1.53595602
  -0.92288445]
 [ 1.94845683 -0.43615839 -1.09491501 ... -1.29511372 -1.67912247
   1.9637891 ]
 ...
 [-0.15498464 -0.28471354  0.34981028 ... -1.57595872  1.51135378
   1.21919072]
 [-0.50953124 -0.60033975  1.56282835 ...  0.48747877  0.78063144
   1.17327074]
 [-0.20390523 -2.8456798  -0.1404125  ... -1.79616565 -0.21115114
   0.28823105]]
Weight Matrix after reconstruction
 [[-0.68938805 -0.25824211  1.80509917 ...  0.54245749  0.14210769
   0.94280729]
 [-0.28903825  0.75803046  1.15838026 ... -0.8125726   1.52793392
  -0.93575716]
 [ 1.95907986 -0.44301894 -1.08973785 ... -1.30531082 -1.67486448
   1.95907986]
 ...
 [-0.16585369 -0.28903825  0.35768066 ... -1.58247606  1.49713778
   1.21997254]
 [-0.50461122 -0.59699963  1.55873006 ...  0.48086521  0.7888266
   1.15838026]
 [-0.19664983 -2.84511774 -0.13505755 ... -1.79804903 -0.19664983
   0.29608838]]
Max Error  :  0.015397858237824913
(256, 256)

The quantized representation should not have more than 256 unique floating-point values; let's do a sanity check.

# We can use np.unique to check the number of unique floating point numbers in the weight matrix.
np.unique(quantizeAndReconstruct(weights)).shape

Exploring Post Training Quantization using TFLite

Now that we know how PTQ works under the hood, let's move on to seeing the actual benefits in terms of memory and speed. In NumPy we were still representing our final weight matrix in full precision, so the memory occupied stayed the same. In TFLite, however, we only store the matrix in an 8-bit format.

Note: We do not, however, save a perfect factor of 4 in total memory usage, as we now also have to store the scale (and potentially other factors needed to properly convert the numbers).
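A quick back-of-the-envelope check of that overhead, assuming one float32 scale and one float32 zero point per tensor (a simplification; real TFLite files also store schema metadata):

```python
# Storage estimate for one (256, 256) weight matrix.
n_weights = 256 * 256

fp32_bytes = n_weights * 4          # 4 bytes per float32 weight
int8_bytes = n_weights * 1 + 4 + 4  # 1 byte per weight, plus scale and zero point

print(fp32_bytes / int8_bytes)  # prints a ratio just under 4
```

For large tensors the per-tensor bookkeeping is negligible, which is why the real files below still come out close to a 4x saving.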

# Load MNIST dataset
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the input image so that each pixel value is between 0 to 1.
train_images = train_images / 255.0
test_images = test_images / 255.0

# Define the model architecture
model = keras.Sequential([
  keras.layers.InputLayer(input_shape=(28, 28)),
  keras.layers.Reshape(target_shape=(28, 28, 1)),
  keras.layers.Conv2D(filters=64, kernel_size=(3, 3), activation=tf.nn.relu),
  keras.layers.MaxPooling2D(pool_size=(2, 2)),
  keras.layers.Flatten(),
  keras.layers.Dense(10)
])

# Train the digit classification model
model.compile(optimizer='adam',
              loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(
  train_images,
  train_labels,
  epochs=1,
  validation_data=(test_images, test_labels)
)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

tflite_models_dir = pathlib.Path("/content/mnist_tflite_models/")
tflite_models_dir.mkdir(exist_ok=True, parents=True)
tflite_model_file = tflite_models_dir/"mnist_model.tflite"
tflite_model_file.write_bytes(tflite_model)
# Convert the model using DEFAULT optimizations: https://github.com/tensorflow/tensorflow/blob/v2.4.1/tensorflow/lite/python/lite.py#L91-L130
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()
tflite_model_quant_file = tflite_models_dir / "mnist_model_quant.tflite"
tflite_model_quant_file.write_bytes(tflite_quant_model)

Output:
114024

!ls -lh /content/mnist_tflite_models/

Output:

total 544K
-rw-r--r-- 1 root root 112K Feb 12 17:53 mnist_model_quant.tflite
-rw-r--r-- 1 root root 429K Feb 12 17:53 mnist_model.tflite

Notice the size difference: the quantized model is smaller by a factor of ~4, as expected.

Software Installation to Inspect TFLite Models

Before we can inspect TF Lite files in detail we need to build and install software to read the file format. First we’ll build and install the Flatbuffer compiler, which takes in a schema definition and outputs Python files to read files with that format.

Note: This will take a few minutes to run.

%%bash

cd /content/
git clone https://github.com/google/flatbuffers
cd flatbuffers
git checkout 0dba63909fb2959994fec11c704c5d5ea45e8d83
cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release
make
cp flatc /usr/local/bin/
cd /content/
git clone --depth 1 https://github.com/tensorflow/tensorflow
flatc --python --gen-object-api tensorflow/tensorflow/lite/schema/schema_v3.fbs
pip install flatbuffers

The last two steps are Python rather than bash, so they go in a separate cell: to import the Python files we've just generated, we need to update the path variable and then import the generated Model module.

sys.path.append("/content/tflite/")
import Model

Then we define some utility functions that will help us convert the model into a dictionary that's easy to work with in Python.

def CamelCaseToSnakeCase(camel_case_input):
  """Converts an identifier in CamelCase to snake_case."""
  s1 = re.sub("(.)([A-Z][a-z]+)", r"\1_\2", camel_case_input)
  return re.sub("([a-z0-9])([A-Z])", r"\1_\2", s1).lower()

def FlatbufferToDict(fb, attribute_name=None):
  """Converts a hierarchy of FB objects into a nested dict."""
  if hasattr(fb, "__dict__"):
    result = {}
    for attribute_name in dir(fb):
      attribute = fb.__getattribute__(attribute_name)
      if not callable(attribute) and attribute_name[0] != "_":
        snake_name = CamelCaseToSnakeCase(attribute_name)
        result[snake_name] = FlatbufferToDict(attribute, snake_name)
    return result
  elif isinstance(fb, str):
    return fb
  elif attribute_name == "name" and fb is not None:
    result = ""
    for entry in fb:
      result += chr(FlatbufferToDict(entry))
    return result
  elif hasattr(fb, "__len__"):
    result = []
    for entry in fb:
      result.append(FlatbufferToDict(entry))
    return result
  else:
    return fb

def CreateDictFromFlatbuffer(buffer_data):
  model_obj = Model.Model.GetRootAsModel(buffer_data, 0)
  model = Model.ModelT.InitFromObj(model_obj)
  return FlatbufferToDict(model)

Visualizing TFLite model weight distributions

This example uses the Inception v3 model, dating back to 2015, but you can load in any TFLite model by updating the variables below to point at your own file.

MODEL_ARCHIVE_NAME = 'inception_v3_2015_2017_11_10.zip'
MODEL_ARCHIVE_URL = 'https://storage.googleapis.com/download.tensorflow.org/models/tflite/' + MODEL_ARCHIVE_NAME
MODEL_FILE_NAME = 'inceptionv3_non_slim_2015.tflite'
!curl -o {MODEL_ARCHIVE_NAME} {MODEL_ARCHIVE_URL}
!unzip {MODEL_ARCHIVE_NAME}
with open(MODEL_FILE_NAME, 'rb') as file:
  model_data = file.read()

Once we have the raw bytes of the file, we need to convert them into an understandable form. The utility functions and Python schema code we generated earlier will help us create a dictionary holding the file contents in a structured form.

model_dict = CreateDictFromFlatbuffer(model_data)

Now that we have the model file in a dictionary, we can examine its contents using standard Python commands. In this case we're interested in examining the tensors (arrays of values) in the first subgraph, so we're printing them out.

pprint.pprint(model_dict['subgraphs'][0]['tensors'])

Let's inspect the weight parameters of a typical convolution layer. Looking at the output above, we can see that the tensor with the name 'Conv2D' has a buffer index of 212. This index points to where the raw bytes of the trained weights are stored. From the tensor properties, we can see its type is '0', which corresponds to float32.

This means we have to cast the bytes into a numpy array using the frombuffer() function.

param_bytes = bytearray(model_dict['buffers'][212]['data'])
params = np.frombuffer(param_bytes, dtype=np.float32)

With the weights loaded into a numpy array, we can now use all the standard functionality to analyze them. To start, let's print out the minimum and maximum values to understand the range.

print("min:", params.min(), "max:", params.max())

This gives us the total range of the weight values, but how are those parameters distributed across that range?

plt.figure(figsize=(8,8))
plt.hist(params, 100)

Output:

(figure: histogram of the Conv2D weight values)

This shows a distribution that's heavily concentrated around zero, which explains why quantization can work quite well. With values so concentrated around zero, our scale can be quite small, and therefore it is much easier to do an accurate reconstruction, as we do not need to represent a large range of values!
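We can sketch this effect with the min/max scheme from earlier: a hypothetical weight array with one large outlier forces a much larger scale, and therefore larger reconstruction errors, than a tightly packed one.

```python
import numpy as np

def max_reconstruction_error(weights, levels=256):
    """Quantize with the min/max scheme from earlier, return the worst error."""
    scale = (weights.max() - weights.min()) / levels
    midpoint = (weights.max() + weights.min()) / 2.0
    quantized = np.rint((weights - midpoint) / scale)
    reconstructed = scale * quantized + midpoint
    return np.abs(reconstructed - weights).max()

rng = np.random.default_rng(0)
packed = rng.normal(0.0, 1.0, size=10_000)      # concentrated around zero
with_outlier = np.append(packed, 100.0)         # one stray large weight

print(max_reconstruction_error(packed))         # small
print(max_reconstruction_error(with_outlier))   # much larger
```

This is one reason outlier handling (clipping, per-channel scales) matters so much in practical quantization schemes.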

More Models to Explore

# Text Classification
!wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/text_classification/text_classification_v2.tflite

# Pose Estimation
!wget https://storage.googleapis.com/download.tensorflow.org/models/tflite/posenet_mobilenet_v1_100_257x257_multi_kpt_stripped.tflite
TEXT_CLASSIFICATION_MODEL_FILE_NAME = "text_classification_v2.tflite"
POSE_ESTIMATION_MODEL_FILE_NAME = "posenet_mobilenet_v1_100_257x257_multi_kpt_stripped.tflite"

with open(TEXT_CLASSIFICATION_MODEL_FILE_NAME, 'rb') as file:
  text_model_data = file.read()

with open(POSE_ESTIMATION_MODEL_FILE_NAME, 'rb') as file:
  pose_model_data = file.read()

def aggregate_all_weights(buffers):
    weights = []
    for i in range(len(buffers)):
        raw_data = buffers[i]['data']
        if raw_data is not None:
            param_bytes = bytearray(raw_data)
            params = np.frombuffer(param_bytes, dtype=np.float32)
            weights.extend(params.flatten().tolist())

    weights = np.asarray(weights)
    weights = weights[weights<50]
    weights = weights[weights>-50]

    return weights

Let's plot the distribution of the Text Classification model's weights on a log scale:

model_dict_temp = CreateDictFromFlatbuffer(text_model_data)
weights = aggregate_all_weights(model_dict_temp['buffers'])

plt.figure(figsize=(8,8))
plt.hist(weights, 256, log=True)

Output: (figure: log-scale histogram of the text classification model's weight distribution)

Let's plot the distribution of the PoseNet model's weights on a log scale:

model_dict_temp = CreateDictFromFlatbuffer(pose_model_data)
weights = aggregate_all_weights(model_dict_temp['buffers'][:-1])

plt.figure(figsize=(8,8))
plt.hist(weights, 256, log=True)

Output:

(figure: log-scale histogram of the PoseNet model's weight distribution)

Again we find that most model weights are closely packed around 0.


Conversion and Deployment

Now that we have talked about the difference between TensorFlow and TensorFlow Lite, let’s dive a little under the hood to understand how models are represented and how conversion is processed.

TensorFlow’s Computational Graph

TensorFlow represents models as a computational graph. To better understand what that means, let's use a simple example of the program shown below:

var1 = 6
var2 = 2
temp1 = var1 + var2
temp2 = var1 - var2
var3 = 4
temp3 = temp2 + var3
res = temp1/temp3

This program can be represented by the following computational graph, which specifies the relationships between constants, variables (often represented as tensors), and operations (ops):

(figure: computational graph of the program above)

The benefit of using such a computational graph is twofold:

  1. Optimizations such as parallelism can be applied to the graph after it is created and managed by a backend compiler (e.g., TensorFlow) instead of needing to be specified by the user. For example, in the case of distributed training, the graph structure can be used for efficient partitioning, and learnable weights can be averaged easily after each distributed run. As another example, it enables efficient auto-differentiation for faster training by keeping all of the variables in a structured format.
  2. Portability is increased, as the graph can be designed in a high-level language (like we do using Python in this course) and then converted into an efficient low-level representation in C/C++ or even assembly language by the backend compiler.

Both of these reasons become increasingly important as we increase the size of our models --- even for TinyML many models have thousands of weights and biases!
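To make the idea concrete, here is a minimal sketch of the toy program above represented as an explicit list of ops and evaluated by walking the graph. Real frameworks build a similar structure (with tensors, gradients, and optimization passes) behind the scenes; the names and layout here are purely illustrative.

```python
# Each node: (output name, op, (input names)). Nodes are listed in an
# order where every input is already computed (a topological order).
graph = [
    ("temp1", "add", ("var1", "var2")),
    ("temp2", "sub", ("var1", "var2")),
    ("temp3", "add", ("temp2", "var3")),
    ("res",   "div", ("temp1", "temp3")),
]

ops = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "div": lambda a, b: a / b,
}

# Constants feed the graph; evaluation just walks the node list.
values = {"var1": 6, "var2": 2, "var3": 4}
for name, op, (lhs, rhs) in graph:
    values[name] = ops[op](values[lhs], values[rhs])

print(values["res"])  # 1.0
```

Because the structure is explicit, a backend can reorder or parallelize independent nodes (here, temp1 and temp2) without the user specifying anything.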

If you’d like to learn more about this, we suggest checking out this great article, which we used as inspiration for this section of the reading.

Tensorflow Checkpoints

As you train your model, you will notice that TensorFlow saves a series of .ckpt files. These files save a snapshot of the trained model at different points during the training process. While it may seem like a lot of files at first (e.g., you will notice later that the /train directory gets quite full when you train your custom keyword spotting model in the next section), you do not need to understand their contents in detail. The most important thing to keep in mind is that the checkpoint files let you rebuild a snapshot of your model. At a high level:

  1. .ckpt-meta files contain the metagraph, i.e. the structure of your computation graph, without the values of the variables
  2. .ckpt-data contains the values for all the weights and biases, without the structure.
  3. .ckpt-index contains the mappings between the -data and -meta files that enable the model to be restored. As such, to restore a model in Python, you'll usually need all three files.

If you’d like to take Google’s hands-on crash course on all things checkpoint files, check out this Colab!

Freezing a Model

You may also notice some .pb files appear as well. These are in Google’s “Protocol Buffers” (protobuf) format. A protobuf can represent a complete model (both the metadata and the weights/biases). A .pb file is the output of “freezing” a model --- a complete representation of the model in a ready-to-run format!

Google differentiates between checkpoint files and frozen model files (which it refers to as “saved models”) as follows:

  • Frozen/saved models are designed to be deployed into production environments from that format
  • Checkpoint files are designed to be used to jumpstart future training

If you’d like to learn more about freezing/saving models you can check out this Colab made by Google.

Optimizing a Model (Converting to TensorFlow Lite)

As we described earlier in the course, the Converter can be used to turn TensorFlow models into quantized TensorFlow Lite models. We won’t go into detail about that process here, as we covered it earlier, but suffice it to say that the process often reduces the size of the model by a factor of 4 by quantizing from float32 to int8.

TensorFlow Lite models are stored in the FlatBuffer file format as .tflite files.

The primary difference between FlatBuffers and protobufs is that protobufs are designed to be converted into a secondary representation before use (requiring memory to be allocated and copied at runtime), while FlatBuffers are designed to be read in place directly from the serialized buffer. As such, FlatBuffers offer less flexibility in what they can represent, but they are an order of magnitude cheaper in terms of the code and memory required to create them, load them (i.e., they do not require any dynamic memory allocation), and run them, all of which is vitally important for TinyML!