Visual Wake Words
Visual Wake Words represents a common TinyML visual use case of identifying whether an object (or a person) is present in the image or not.
The Math Behind MobileNets' Efficient Computation
Standard Convolutions
To better understand how depthwise separable convolutions differ from standard convolutions, let's first re-examine standard convolutions. In particular, we are going to quantify the number of multiplication operations and parameters in a standard convolution.
Most color images are three-channel RGB images, where each channel holds a value representing how much red, green, or blue is in each pixel. As such, the filters (also often referred to as kernels) are not simply matrices but tensors, since they need to multiply all three channels at the same time. This tensor operation is shown below. In this example, we have a 3x3x3 kernel being convolved with a 9x9x3 image to produce a 7x7x1 output. The highlighted squares are the inputs and outputs of the last convolution operation.
Note that, in general, producing a single output value requires $D_K \cdot D_K \cdot M$ multiplications, where $D_K$ is the width and height of the kernel (here 3) and $M$ is the number of channels in the image (here 3). Then, to produce the full $D_G \times D_G$ output (here $7 \times 7$), we need to repeat these operations $D_G \cdot D_G$ times. Also, in general we don't just use a single filter, we use $N$ of them, where each filter is a concatenation of $M$ different kernels, one assigned to each channel of the input, and each filter produces one output channel. As such, the total number of multiplications will be:

$$N \cdot D_G^2 \cdot D_K^2 \cdot M$$
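To make the counting concrete, here is a minimal sketch in plain Python that tallies the multiplications for the example above (the choice of $N = 10$ filters is purely illustrative):

```python
# Multiplication count for a standard convolution, using the example above:
# a 9x9x3 input, 3x3x3 kernels, and N output filters.
D_K = 3   # kernel height/width
M = 3     # input channels
D_G = 7   # output height/width (9 - 3 + 1, no padding, stride 1)
N = 10    # number of filters (an arbitrary choice for illustration)

mults_per_output = D_K * D_K * M              # one output value
standard_mults = N * D_G * D_G * mults_per_output
print(standard_mults)                          # 13230 for this example
```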
Depthwise Separable Convolutions
Depthwise Separable Convolutions instead proceed in a two-step process. First, each channel is treated independently, as if it were a separate single-channel image, and a single-channel kernel is applied to each one, creating multiple outputs. This is referred to as a Depthwise Convolution. Next, a Pointwise Convolution is applied to those outputs using a $1 \times 1 \times M$ filter to compute the final output. This process can be seen below, where we again take our 9x9x3 image and apply three separate 3x3 kernels to it depthwise to produce a 7x7x3 output. We then take that output and apply a 1x1x3 pointwise filter to it to produce our final 7x7x1 output.
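If you'd like to see these shapes for yourself, here is a minimal Keras sketch (assuming TensorFlow is available, as in the course Colabs) that chains a depthwise convolution and a 1x1 pointwise convolution on a 9x9x3 input:

```python
import tensorflow as tf

# Sketch of the two-step process on a 9x9x3 input: a 3x3 depthwise convolution
# (one kernel per channel) followed by a 1x1 pointwise convolution that mixes
# the channels into a single output channel.
inputs = tf.keras.Input(shape=(9, 9, 3))
depthwise = tf.keras.layers.DepthwiseConv2D(kernel_size=3)(inputs)       # -> (7, 7, 3)
pointwise = tf.keras.layers.Conv2D(filters=1, kernel_size=1)(depthwise)  # -> (7, 7, 1)

model = tf.keras.Model(inputs, pointwise)
model.summary()  # shows the 9x9x3 -> 7x7x3 -> 7x7x1 shapes described above
```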
Note that now each depthwise convolution requires only $D_K \cdot D_K$ multiplications per output value, and we need $M$ kernels, one for each channel. To produce the full $D_G \times D_G$ output for every channel, we need to repeat these operations $D_G \cdot D_G$ times. Therefore we need $M \cdot D_G^2 \cdot D_K^2$ multiplications for that stage.
For the pointwise convolution, we now apply a $1 \times 1 \times M$ filter at each of the $D_G \cdot D_G$ output locations, costing $M$ multiplications per location. In general, just like with regular convolutions, we don't use a single filter but multiple filters; in Depthwise Separable Convolutions those multiple filters occur in the pointwise step. Therefore, if we have $N$ pointwise filters, we will need $N \cdot D_G^2 \cdot M$ total multiplications for this stage.
Summing the number of multiplications we need in both stages, we find that in total we need:

$$M \cdot D_G^2 \cdot D_K^2 + N \cdot D_G^2 \cdot M = M \cdot D_G^2 \cdot (D_K^2 + N)$$
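Continuing the plain-Python sketch with the same illustrative numbers ($N = 10$ again chosen only for illustration), the two stages add up as follows:

```python
# Multiplication count for a depthwise separable convolution on the same example.
D_K, M, D_G, N = 3, 3, 7, 10   # same illustrative values as before

depthwise_mults = M * D_G * D_G * D_K * D_K   # 3 * 49 * 9  = 1323
pointwise_mults = N * D_G * D_G * M           # 10 * 49 * 3 = 1470
total_mults = depthwise_mults + pointwise_mults
print(total_mults)                            # 2793, versus 13230 for the standard case
```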
Comparing the Two Kinds of Convolutions
We can compare the two kinds of convolutions through the ratio of the number of multiplications required for each. Placing standard convolutions in the denominator, we get:

$$\frac{M \cdot D_G^2 \cdot (D_K^2 + N)}{N \cdot D_G^2 \cdot D_K^2 \cdot M} = \frac{1}{N} + \frac{1}{D_K^2}$$
This means that the more filters we use and the larger the kernels are, the more multiplications we save. If we use our example from above, where $D_K = 3$, and we conservatively use only $N = 10$ filters, we find that the ratio becomes $\frac{1}{10} + \frac{1}{9} \approx 0.2111$, meaning that by using Depthwise Separable Convolutions we perform almost 5x fewer multiplication operations! This is far more efficient and can greatly improve latency.
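As a quick sanity check of this ratio in plain Python (using $D_K = 3$ and $N = 10$ from the example):

```python
# Ratio of depthwise separable to standard multiplications: 1/N + 1/D_K^2.
D_K, N = 3, 10                 # 3x3 kernels, N = 10 filters (illustrative)
ratio = 1 / N + 1 / D_K ** 2
print(round(ratio, 4))         # 0.2111
print(round(1 / ratio, 1))     # ~4.7x fewer multiplications
```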
Also, note that in the case of a standard convolution we have $N \cdot D_K^2 \cdot M$ learnable parameters across our various filters/kernels. In contrast, in the Depthwise Separable case we have $M \cdot D_K^2 + N \cdot M = M \cdot (D_K^2 + N)$. Again, if we take the ratio of the two we find that:

$$\frac{M \cdot (D_K^2 + N)}{N \cdot D_K^2 \cdot M} = \frac{1}{N} + \frac{1}{D_K^2}$$
This means that we also have a much smaller memory requirement as we have far fewer parameters to store!
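You can verify these parameter counts with a short Keras sketch (assuming TensorFlow; `SeparableConv2D` combines the depthwise and pointwise steps into a single layer, and biases are turned off here so the counts match the formulas above):

```python
import tensorflow as tf

# Compare learnable parameter counts for a standard vs. a depthwise separable
# convolution with the same shape: 3x3 kernels, 3 input channels, 10 filters.
inputs = tf.keras.Input(shape=(9, 9, 3))

standard = tf.keras.layers.Conv2D(10, 3, use_bias=False)
separable = tf.keras.layers.SeparableConv2D(10, 3, use_bias=False)

standard(inputs)   # calling the layers builds their weights
separable(inputs)

print(standard.count_params())   # 270 = N * D_K^2 * M     = 10 * 9 * 3
print(separable.count_params())  # 57  = D_K^2 * M + N * M = 9 * 3 + 10 * 3
```

Note that 57 / 270 is again about 0.2111, the same ratio as for the multiplications.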
There is a tradeoff, however: in improving our latency and memory needs, we have reduced the number of parameters available for learning. Thus our models are more limited in their expressiveness. This is usually sufficient for TinyML applications, but it is something to consider when using Depthwise Separable Convolutions in general!
Finally, if you'd like to read more detail about MobileNets, you can check out the paper describing them: "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications" (Howard et al., 2017).
Transfer Learning for VWW
Now that you have explored vision applications for TinyML, the MobileNet model, and transfer learning, we are going to put it all together and train a model to help fight COVID-19. We are going to detect if people are wearing masks or not!
In this Colab you will see, through a hands-on example, how to leverage a pre-trained model as a feature extractor, and how to add new layers and train them for classification on a new task. You will also be challenged to find the right number of epochs when using transfer learning. Hint: it's going to be a pretty small number! Assignment: Transfer Learning in Colab: colab.research.google.com/github/tiny…
Assignment Solution: You should have found that you only needed VERY FEW epochs to get the model to train very accurately. Depending on how lucky you were with initialization it could have been in the single digits! We found that by setting EPOCHS = 10 our models were always sufficiently trained, and in some cases we were even able to get EPOCHS = 2 to work!
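For reference, here is a minimal Keras sketch of the transfer learning pattern the Colab walks through. It is not the Colab's exact code; the base model, input resolution, classification head, and dataset names below are illustrative and may differ from what the assignment uses.

```python
import tensorflow as tf

# Freeze a pre-trained MobileNetV2 as a feature extractor and train a new head.
base = tf.keras.applications.MobileNetV2(
    input_shape=(96, 96, 3), include_top=False, weights="imagenet")
base.trainable = False  # keep the pre-trained features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # mask / no-mask
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)  # train_ds/val_ds: your datasets
```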
Common Myths and Pitfalls about Transfer Learning
Hopefully, you have now been convinced of the power of transfer learning. Transfer learning allows a model trained for one task to be fine-tuned for a different but related task. In doing this, we can repurpose models for new tasks, saving the energy that would otherwise be spent on computation. Furthermore, models with superior performance can be developed with a relatively small amount of data, which would normally cause overfitting in a network trained from scratch. Consequently, transfer learning provides a trifecta of benefits: reduced energy usage, faster convergence, and higher asymptotic accuracy. However, transfer learning can be difficult to implement and is not always guaranteed to be successful.
Task Similarity
Transfer learning is only viable when the task of interest is similar in scope to the original task. Image recognition is the archetypal example of this, wherein the convolutional filters close to the input can be ported to many imaging tasks due to their generality (e.g., spotting lines, edges, and shapes). The further into the convolutional neural network we get, the more task-specific the filters become and the less transferable they are.
Tasks that are too dissimilar may not receive any benefit from transfer learning. In fact, the use of transfer learning may even be detrimental to the final model performance. For example, performing transfer learning on a neural network used for language translation may work well if done between two similar languages, such as two Romance languages (e.g., French and Spanish), but may perform poorly if done between two dissimilar languages (e.g., Mandarin and French). This situation is commonly referred to as negative transfer.
Fragile Co-Adaptation
Transfer learning involves using the neuron weights from one trained network to initialize the neuron weights of a second network for a related task. Often, the weights of the first several layers are frozen so that they cannot be changed during training. The rationale for this is to prevent noise in the new dataset from detrimentally altering the earlier layers. However, freezing weights can itself be detrimental to model performance, as it can break the fragile co-adaptation of neurons between neighboring frozen and unfrozen layers: the neurons in the earlier layers are fixed, and the later layers alone are unable to adapt effectively to the new data. In practice, this is often solved by not freezing any layers and instead using a learning rate significantly smaller than the one used when training the original network. This procedure helps ensure the earlier layers are not altered significantly while reducing the possibility of breaking fragile co-adaptation.
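As a sketch of that alternative (assuming a Keras model assembled from pre-trained weights, such as the feature-extractor model above; the helper name and the 100x reduction factor are illustrative, not a prescribed recipe):

```python
import tensorflow as tf

# Leave every layer trainable and fine-tune with a much smaller learning rate
# instead of freezing the early layers.
def fine_tune(model, original_learning_rate=1e-3):
    for layer in model.layers:
        layer.trainable = True            # no frozen layers
    model.compile(
        # e.g. 100x smaller than the rate used for the original training run
        optimizer=tf.keras.optimizers.Adam(learning_rate=original_learning_rate / 100),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model
```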
Fixed Architecture
A key disadvantage of transfer learning is the restriction on neural architecture. Transfer learning is often performed from well-known models developed by commercial entities such as Inceptionv3, ResNet, and NASNet. These models are trained with specific neural architectures which are reflected in the model weights. Model layers cannot be modified, removed, or added without these changes propagating across the network. As such, if the network is altered, we cannot have confidence that the remaining parameters are still able to accurately model the data. This presents important limitations for model tuning and, more importantly for tiny machine learning applications, also for model size. These well-known architectures are large and often cannot be compressed sufficiently to fit within the hardware constraints of embedded systems. Thus, the use of transfer learning for tiny machine learning applications is fundamentally limited by existing model architectures unless the resources are available to train bespoke models for transfer learning using our own pre-defined architecture.
In summary, here is a general rule of thumb for when transfer learning works well, given a new dataset on which you wish to fine-tune your network.
| | New dataset is similar to original dataset | New dataset is less similar to original dataset |
|---|---|---|
| New dataset is small | Best case scenario for transfer learning | Not the best scenario for transfer learning |
| New dataset is large | Transfer learning will work | Training from scratch might yield better accuracy |
Despite some pitfalls, it is clear that transfer learning is a hugely powerful concept and, although difficult to implement in practice, presents many exciting opportunities for tiny machine learning applications.