Anomaly Detection in Industry
First, anomaly detection systems must be fast enough to keep up with the rate at which sensors produce data. Often the sensors produce too much data to be sent over a wireless network, so computation must be done locally. To accomplish this, the anomaly detection model must fit within the tight memory constraints of a microcontroller. Finally, an anomaly detection system must mitigate false negatives and false positives, which, depending on the application, can have dire consequences.
Data and Datasets
Anomaly detection is unique in that it is impractical to collect real examples of anomalous data. Collecting data on a failure is often too expensive since it usually involves intentionally breaking something. Additionally, anomalies are inherently rare and can take many different forms, so it is difficult to train anomaly detection models in the usual supervised manner.
Real and Synthetic Data
Oftentimes, data is the limiting factor and in turn dictates model performance. A scarcity of data points or features hinders the predictive power of our model. The solution seems obvious: collect more data. But this is not always feasible. So what can an ML engineer do?
Fortunately, with recent advances in deep generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs), very realistic synthetic data can be generated from a small number of known examples. This is achieved by approximately “learning” the data distribution and drawing random samples from the latent space of the generative model. This can be done on any type of data, allowing us to generate time-series information (such as voice or sensor data) as well as photorealistic images. Using these new techniques, ML engineers can now bootstrap machine learning with machine learning: that is, they can train models to generate more data with which to train other models!
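To make the sampling idea concrete, here is a toy sketch. It does not use a GAN or VAE; instead it fits a simple Gaussian to a handful of “real” sensor readings and draws new synthetic samples from it. A deep generative model learns a far richer distribution, but the core idea of learning a distribution and then sampling from it is the same. All names and numbers here are illustrative.

```python
import numpy as np

# Toy stand-in for a deep generative model: fit a Gaussian to a small set
# of "real" 2-feature sensor readings, then sample new synthetic points.
rng = np.random.default_rng(0)
real = rng.normal(loc=[1.0, -2.0], scale=[0.1, 0.3], size=(50, 2))  # 50 known examples

# "Learn" the data distribution (here: just mean and covariance).
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Draw 10x more synthetic samples than we collected.
synthetic = rng.multivariate_normal(mu, cov, size=500)
print(synthetic.shape)  # (500, 2)
```

A GAN or VAE replaces the Gaussian with a learned neural network, which is what allows the approach to scale to complex data such as audio or images.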
What is Synthetic Data?
Imagine a power plant engineer who wants to use data to prevent a nuclear meltdown. They clearly cannot generate sample nuclear meltdowns to obtain relevant data, but they still need to be able to somehow make predictions. So what can they do? Well, if they have a very small amount of data from previous historical meltdowns or a simulation, then they can possibly use one of these techniques to train a model to generate enough data that it becomes feasible to train an anomaly detection classifier using supervised learning. In this case, synthetic data is the only viable path forward.
The need to generate anomalous data is a commonplace situation in anomaly detection applications, since often the so-called “anomalies” are undesirable and can be extremely costly or damaging. Perhaps two of the most important applications of anomaly detection are in industrial processes and medical devices, where it can be used for predictive maintenance (fixing something before catastrophic failure or death).
Predictive maintenance will likely become an important feature of Industry 4.0, reducing costs by predicting component failures in advance and resolving the issue before it becomes more severe. An open-source dataset, referred to as MIMII (Malfunctioning Industrial Machine Investigation and Inspection), was created to aid with this specific application. The MIMII dataset focuses on sounds exhibited by various components and can be used as a starting point to generate synthetic anomalous data for a variety of industrial components, reducing the need for individuals or organizations to collect these data.
Similarly, in the medical device space, anomaly detection allows moderately malfunctioning devices to be detected and replaced before they pose life-threatening issues to the device wearer. The ECG5000 dataset was created for this purpose and provides 20 hours of ECG data, a type of data used to study heart health. The dataset also includes anomalous data corresponding to an individual with severe congestive heart failure. It is open-source, can be freely used by anyone working on medical devices, and can serve as a starting point to train an anomaly detection model or to create more anomalous data.
This may sound incredible: we can effectively train models by simulating new data based on our current data. In fact, you have already done this in this course! When you use data augmentation, you are effectively expanding your dataset with synthetic data. That said, there are concerns when relying heavily on synthetic data that should be highlighted.
Quality Concerns of Synthetic Data
Real-life datasets are often complex and multifaceted. For example, when recording sounds, the actual sound plays an important role, but so do the microphone, background noise, microphone orientation, and myriad other factors. Creating synthetic data that mimics this variability is infeasible, and currently there is no universally accepted method for generating high-quality synthetic data. One important reason for this infeasibility is that data must often go through multiple preprocessing stages before the generation of synthetic data. During these preprocessing stages, assumptions are implicitly made about the data which directly influence the synthetic data, meaning it is not purely based on the properties of the unprocessed data.
Similarly, if only a small amount of real-life data is collected, synthetic data generated using it may exhibit a high degree of bias. This can make it difficult to effectively extrapolate to other environments. Thus, it is still necessary to have access to as large and diverse a dataset as is possible given the contextual constraints to improve the generalizability of our synthetic data. The quality of the generated synthetic data must also be assessed, which can be problematic since the “quality” of a dataset can sometimes be very complex or specific, and not readily described by standard metrics.
Unsupervised Learning for Anomaly Detection (with K-Means)
Now that you’ve learned the basics of anomaly detection and the data-based challenges it imposes, we will explore anomaly detection through the lens of a traditional ML technique, K-means clustering. In this Colab you will visualize anomaly detection and unsupervised learning, and learn about and evaluate the strengths and weaknesses of K-means clustering. We will be using scikit-learn’s KMeans and TSNE implementations. colab.research.google.com/github/tiny…
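Before you open the Colab, here is a minimal sketch of the core idea: fit K-means on normal data only, then score a new point by its distance to the nearest centroid. The cluster count, threshold percentile, and toy data below are illustrative choices, not the Colab’s exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, size=(200, 2))   # "normal" sensor readings
anomaly = np.array([[8.0, 8.0]])               # a point far from the normal data

# Fit K-means on normal data only; the centroids summarize what "normal" looks like.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(normal)

# Anomaly score = distance to the nearest centroid.
def score(x):
    return km.transform(x).min(axis=1)

# Pick a threshold that tolerates ~1% false positives on the normal data.
threshold = np.percentile(score(normal), 99)
print(score(anomaly)[0] > threshold)  # True: far from every centroid
```

The strength of this approach is its simplicity; the weakness, as you will see in the Colab, is that distance-to-centroid scales poorly to high-dimensional, structured data.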
Autoencoder Model Architecture
You’ve previously been introduced to the basic concept of an autoencoder model. At its core, an autoencoder is trained to reconstruct data after compressing it into an embedding. You’ve learned how autoencoders can be applied to anomaly detection by training only on normal data and then using the reconstruction error to determine if a new input is similar to the normal training data. In this reading, we will examine autoencoders in more detail through a high-level overview describing the strengths of different types of autoencoders. While autoencoders have many uses, including denoising, compression, and data generation, in this reading we will continue to focus on time-series anomaly detection. If you’d like to examine more applications of autoencoders check out this article.
Types of Autoencoders
It turns out that there are many kinds of autoencoders used in the wild. In the rest of this article we will survey some of the most popular types. To see how to implement all of these different autoencoders in Keras, check out this article.
Fully Connected Autoencoder
A deep fully-connected autoencoder has multiple fully connected layers with some small embedding in the middle. You’ve already seen an example of these in the previous video. They are the most basic form of an autoencoder and are simple to implement and deploy.
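A fully connected autoencoder like the one described above can be sketched in a few lines of Keras. The 140-sample input matches the ECG5000 beat length used later in this section; the layer widths and bottleneck size here are illustrative, not prescribed.

```python
import tensorflow as tf

# A minimal fully connected autoencoder for 140-sample ECG windows.
# The Dense(8) layer is the small embedding ("bottleneck") in the middle.
inputs = tf.keras.Input(shape=(140,))
x = tf.keras.layers.Dense(32, activation="relu")(inputs)
embedding = tf.keras.layers.Dense(8, activation="relu")(x)    # the bottleneck
x = tf.keras.layers.Dense(32, activation="relu")(embedding)
outputs = tf.keras.layers.Dense(140, activation="sigmoid")(x)  # reconstruct the input

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mae")
# autoencoder.fit(normal_train, normal_train, ...)  # train on normal data only
```

Note that the model is trained to map its input back to itself; only the narrow embedding in the middle forces it to learn a compressed representation.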
Convolutional Autoencoder
A convolutional autoencoder operates under the same principle as the deep fully-connected autoencoder but uses convolutions for the encoder layers and transpose convolutions for the decoder layers. Due to the spatial nature of convolutions, these autoencoders are particularly useful for images and spectrograms (which are essentially images). Unfortunately, at this time, transpose convolutions are less commonly supported by tinyML frameworks, so deploying convolutional autoencoders to microcontrollers is more challenging. That said, hopefully they will become more accessible soon. In the meantime, if you would like to try using a convolutional autoencoder for anomaly detection in Colab, check out this Keras example.
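The encoder/decoder symmetry described above looks like this in Keras. The 28x28 single-channel input and the filter counts are illustrative (they would suit small spectrogram patches); the key structural point is that strided convolutions downsample and transpose convolutions upsample back to the input size.

```python
import tensorflow as tf

# Sketch of a convolutional autoencoder for 28x28 single-channel inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    # Encoder: strided convolutions halve the spatial resolution (28 -> 14 -> 7).
    tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(8, 3, strides=2, padding="same", activation="relu"),
    # Decoder: transpose convolutions upsample back (7 -> 14 -> 28).
    tf.keras.layers.Conv2DTranspose(8, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2DTranspose(16, 3, strides=2, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 3, padding="same", activation="sigmoid"),
])
```

The Conv2DTranspose layers in the decoder are exactly the operations that are often missing from tinyML inference frameworks, which is what makes deployment harder.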
LSTM Autoencoders
Long short-term memory (LSTM) networks are a type of recurrent neural network (RNN). They are useful for processing longer sequences while capturing the temporal structure of the data, like the time-series data we have explored thus far. This can lead to a more detailed analysis of longer sequences, which in some cases leads to higher accuracy. However, these RNN layers generally use more working memory as they maintain and modify a state over long sequences. This means they often consume more SRAM, which is a precious resource in tinyML. Additionally, it is less common for tinyML frameworks to support LSTMs, or RNNs in general. For more detail on LSTM autoencoders check out this article.
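A common LSTM autoencoder shape is sketched below: the encoder LSTM compresses the whole sequence into a single state vector, which is repeated once per timestep and decoded back into a sequence. The sequence length and unit counts are illustrative.

```python
import tensorflow as tf

# Sketch of an LSTM autoencoder for sequences of 100 timesteps x 1 feature.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(100, 1)),
    tf.keras.layers.LSTM(16),                         # encoder: sequence -> 16-dim embedding
    tf.keras.layers.RepeatVector(100),                # replicate embedding per timestep
    tf.keras.layers.LSTM(16, return_sequences=True),  # decoder: embedding -> sequence
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),  # per-step reconstruction
])
```

The per-timestep state carried by the two LSTM layers is what drives the extra SRAM usage mentioned above.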
Variational Autoencoders
Unlike the previous examples, the unique aspect of variational autoencoders is not that they use a different type of layer. Instead, variational autoencoders impose an additional constraint during training that forces the model to map the input onto a distribution: the single latent vector of a standard autoencoder is replaced by a mean vector and a standard deviation vector, from which a latent vector is sampled. For anomaly detection, we can use the reconstruction probability the model produces instead of the reconstruction error used in standard autoencoders, which can at times lead to better predictions.
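The mean-and-standard-deviation encoding described above is usually implemented with the “reparameterization trick”: the encoder outputs a mean and a log-variance, and the latent vector is sampled as z = mean + sigma * epsilon. The sketch below shows only the encoder side; the layer sizes, the 140-sample input, and the 2-dimensional latent space are illustrative choices.

```python
import tensorflow as tf

# Sampling layer: draws z from the distribution the encoder predicts.
class Sampling(tf.keras.layers.Layer):
    def call(self, inputs):
        mean, log_var = inputs
        eps = tf.random.normal(tf.shape(mean))
        return mean + tf.exp(0.5 * log_var) * eps  # z = mu + sigma * epsilon

latent_dim = 2  # 2-D so the latent space can be plotted directly

inputs = tf.keras.Input(shape=(140,))
h = tf.keras.layers.Dense(32, activation="relu")(inputs)
z_mean = tf.keras.layers.Dense(latent_dim)(h)      # predicted mean vector
z_log_var = tf.keras.layers.Dense(latent_dim)(h)   # predicted log-variance vector
z = Sampling()([z_mean, z_log_var])                # sampled latent vector

encoder = tf.keras.Model(inputs, [z_mean, z_log_var, z])
```

A full VAE adds a decoder plus a KL-divergence term in the loss, which is the training constraint that keeps the latent distribution close to a standard normal.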
Visualizing the Latent Space of Variational Autoencoders
There are multiple ways of visualizing the learned low-dimensional representation, or latent space, in an autoencoder. Due to the additional constraints imposed on the latent representation in variational autoencoders, visualizing the latent space is more straightforward than with traditional autoencoders.
When variational autoencoders are trained on images, a common technique to visualize the latent space is to randomly draw values from the learned distributions and use these values as inputs to the decoder (ignoring the encoder entirely). This produces a novel image sampled from the latent space. For example, after training on MNIST digits, the resultant images will usually resemble specific digits. Note that these generated images are not present in the training set; they are representations of what the network has learned about encoding digits. In fact, by running linearly-spaced sample values through the decoder, you will notice that the output images appear to interpolate between digits. We are thus visualizing samples along a continuous latent space (i.e., a manifold).
If your training images belong to specific classes, as MNIST digits do, you may also observe clustering within the latent space. You might expect each digit to be spatially separated from the others, and indeed this is the case. The variational autoencoder in this article (which inspired this reading) encodes MNIST digits in a two-dimensional latent space. When each test image from the MNIST dataset is fed through the encoder (the decoder is ignored) and the two-dimensional mean vector of the encoder’s output is plotted on the x-y plane, each digit forms a fairly well-separated cluster.
For variational autoencoder networks with larger latent spaces (more than 2 dimensions), common dimensionality reduction algorithms like PCA, t-SNE, or UMAP may help with visualization, though these are outside the scope of our discussion.
Unfortunately, variational autoencoders are more complicated to train and deploy, so in the remainder of this course we will focus on standard autoencoders. However, since you may run into these new and powerful techniques in the future, we wanted to make sure to introduce you to them now. Next, you will train a fully connected autoencoder to identify abnormal heart rhythms from ECG data and learn how to evaluate an anomaly detection model.
Training and Metrics
Now that you have been introduced to the basics of autoencoder model architectures, you will explore the process of training one to detect anomalous ECG data. This will get you familiar with the unique process of training only on normal data and allow you to play with the architecture of a fully connected autoencoder. Additionally, you will learn the two important metrics, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC).
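Both metrics can be computed directly from the model’s per-example reconstruction errors with scikit-learn. In the Colab the errors come from the trained autoencoder; the errors below are synthetic stand-ins, generated so that anomalies (label 1) reconstruct worse on average.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic reconstruction errors for illustration: anomalous beats (label 1)
# have higher reconstruction error than normal beats (label 0).
rng = np.random.default_rng(0)
errors = np.concatenate([rng.normal(0.02, 0.005, 100),   # normal beats
                         rng.normal(0.06, 0.010, 100)])  # anomalous beats
labels = np.concatenate([np.zeros(100), np.ones(100)])

# ROC curve: true positive rate vs. false positive rate over all thresholds.
fpr, tpr, thresholds = roc_curve(labels, errors)

# AUC: area under that curve; 1.0 is perfect, 0.5 is chance.
print(round(roc_auc_score(labels, errors), 3))  # close to 1.0 here, since the errors are well separated
```

Because the reconstruction error is a continuous score, the ROC curve lets you see the trade-off every possible threshold would give before you commit to one.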
In this Colab you will be using TensorFlow to define and train the autoencoder and SKLearn to evaluate its performance.
colab.research.google.com/github/tiny…
A Coding Assignment
Now that you know how to train an autoencoder for anomaly detection using only normal data, it’s time to test your skills. You will be repeating the ECG anomaly detection with two key differences:
- The training set now includes a low percentage of anomalous data. This reflects a real-world scenario where you don’t have access to experts to label input data and therefore cannot ensure that the training data contains only normal data. Since the anomalous portion of the training data is small, the model can still perform quite well as long as we don’t overfit.
- You will be tasked with determining the size of the autoencoder’s embedding layer and picking the error threshold to obtain the best performance given the application.
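One way to approach the threshold-picking part of the task: when the training data may contain a small fraction of anomalies, a high percentile of the training-set reconstruction errors is often a more robust threshold than the maximum, since a few anomalous examples won’t inflate it. The `train_errors` array below is a synthetic stand-in for the per-example errors your trained autoencoder would produce, and the 90th-percentile choice is an illustrative heuristic, not the assignment’s required answer.

```python
import numpy as np

# Synthetic stand-in for per-example reconstruction errors on the training set:
# ~90% normal (low error) and ~10% anomalous (higher error).
rng = np.random.default_rng(0)
train_errors = np.concatenate([rng.normal(0.02, 0.005, 90),   # normal examples
                               rng.normal(0.06, 0.010, 10)])  # anomalous examples

# Take a high percentile instead of the max, so the anomalous tail is ignored.
threshold = np.percentile(train_errors, 90)

def is_anomaly(err):
    return err > threshold

print(is_anomaly(0.07), is_anomaly(0.02))
```

You would still want to sweep this choice against the ROC curve on a labeled validation set, since the best percentile depends on how costly false negatives are for the application.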
colab.research.google.com/github/tiny…
Assignment Solution
The size of the encoding layer should be less than 8 and more than 0, as we need a bottleneck relative to the previous layer (which was size 8). We have found that EMBEDDING_SIZE = 2 seems to work best. The encoding layer size is a critical factor in how much information our model can encode. Remember, we want the encoding layer (and the model in general) to be large enough that it can encode the normal signals that represent 90% of our training data, but small enough that it can’t learn to reproduce the 10% of our training data that is anomalous. For this dataset, the model and encoding layer can be very small, but it’s important to remember that this is a very controlled and simple example.
The threshold depends on the training run and on the encoding layer size, but a value around threshold = 0.037 works best. When picking the threshold, the goal is to maximize the accuracy, precision, and recall. The ‘knee’ of the ROC curve is a good place to start, as it represents a good balance between the true positive and false positive rates. For this application, it might be more important to properly classify an abnormal rhythm as such than to maximize the overall accuracy.
Summary
In this section you explored the TinyML flow and data engineering in the context of anomaly detection applications, focusing on some unique challenges presented by unsupervised learning. Anomaly detection represents a common TinyML use case of classifying some time-series data as ‘normal’ or ‘abnormal’. Anomaly detection can be applied to data coming from a wide variety of sensors in a number of different application domains.
Anomaly Detection in Industry
You explored an example of anomaly detection in an industrial setting where you learned about the primary constraints of tinyML anomaly detection. First, anomaly detection systems must be fast enough to keep up with the rate at which sensors produce data. Often the sensors produce too much data to be sent over a wireless network, so computation must be done locally. To accomplish this, the anomaly detection model must fit within the tight memory constraints of a microcontroller. Finally, an anomaly detection system must mitigate false negatives and false positives, which, depending on the application, can have dire consequences.
Data and Datasets
Anomaly detection is unique in that it is impractical to collect real examples of anomalous data. Collecting data on a failure is often too expensive since it usually involves intentionally breaking something. Additionally, anomalies are inherently rare and can take many different forms, so it is difficult to train anomaly detection models in the usual supervised manner. That said, you explored the MIMII and ToyADMOS datasets, as well as the concept of generating synthetic data, to understand some of the processes used to try to collect an anomaly detection dataset.
Unsupervised Learning: K-Means and Autoencoders
To avoid some of the issues with supervised learning, you then explored a classical and a neural-network-based approach to unsupervised learning. For the classical approach, you explored anomaly detection through the lens of K-means clustering and learned about its advantages (ease of implementation and use) and disadvantages (scalability). You then learned about autoencoders and how they can be applied to many domains, among which is anomaly detection. Autoencoders are trained to reconstruct normal data after compressing it into an embedding. By comparing the error between the input and the output of the autoencoder, we are able to determine how similar the input is to our normal training data. We perform anomaly detection by selecting a threshold above which we classify the input as an anomaly. You also learned how the model architecture impacts the autoencoder and how metrics like ROC and AUC help us evaluate our performance. Critically, you learned that anomaly detection models often have low transferability and therefore have to be trained for their specific application.