Machine Learning (15): Anomaly Detection



Notes of Andrew Ng’s Machine Learning —— (15) Anomaly Detection

Density Estimation

Problem Motivation

Anomaly Detection is a type of machine learning problem.

Imagine that you're a manufacturer of aircraft engines. As your aircraft engines roll off the assembly line, you're doing quality assurance testing, and as part of that testing you measure features of each engine, like the heat generated, the vibrations, and so on. So if you have manufactured $m$ aircraft engines, you have a data set of $x^{(1)}$ through $x^{(m)}$ and can plot your data. Then, given a new engine $x_{test}$, you want to know: is $x_{test}$ anomalous? This problem is called Anomaly Detection.


Here is what we are going to do:

Dataset: $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$

Is $x_{test}$ anomalous?

To solve this problem, we will train a model for $p(x)$ (i.e. a model for the probability of $x$, where $x$ is the vector of features of, say, an aircraft engine). We're then going to say that for a new aircraft engine, if $p(x_{test})$ is less than some threshold $\epsilon$, we flag it as an anomaly:

$$
\begin{array}{l}
p(x_{test})<\epsilon \Rightarrow \text{anomaly}\\
p(x_{test})\ge\epsilon \Rightarrow \text{OK}
\end{array}
$$

This problem of estimating the distribution $p(x)$ is sometimes called the problem of density estimation.


Anomaly detection example

Fraud detection

  • $x^{(i)}$ = features of user $i$'s activities
  • Model $p(x)$ from data.
  • Identify unusual users by checking which have $p(x)<\epsilon$

Manufacturing / monitoring computers in a data center

  • $x^{(i)}$ = features of machine $i$

  • $x_1$ = memory use, $x_2$ = number of disk accesses/sec, $x_3$ = CPU load, $x_4$ = CPU load/network traffic ...

Gaussian Distribution

The Gaussian distribution is also called the normal distribution.

If $x$ is a Gaussian-distributed random variable with mean $\mu$ and variance $\sigma^2$, we write:

$$x \sim \mathcal{N}(\mu,\sigma^2)$$

In this case, the probability density of $x$ is:

$$p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

Here are some examples of Gaussian distributions:


We can see that the "center" of these plots is the value of $\mu$. As $\sigma$ increases, the plot becomes more and more "flat"; conversely, it is "thin & tall" if $\sigma$ is small.
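As a quick sanity check (a minimal Python sketch; `gaussian_pdf` is my own helper name, not from the course), we can evaluate the density formula directly and see how the peak height $\frac{1}{\sqrt{2\pi}\sigma}$ drops as $\sigma$ grows:

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# The peak height at x = mu is 1 / (sqrt(2*pi) * sigma):
peak_narrow = gaussian_pdf(0.0, 0.0, 1.0)  # sigma = 1: "thin & tall"
peak_wide = gaussian_pdf(0.0, 0.0, 4.0)    # sigma = 2: flatter and wider
```

Doubling $\sigma$ exactly halves the peak height, which matches the "flat" vs. "thin & tall" behaviour described above.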

Parameter estimation

Given a dataset $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$, we can estimate $\mu$ and $\sigma^2$ by:

$$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$
$$\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$$

P.S. This $\frac{1}{m}$ variance estimator is what people tend to use when handling a machine learning problem; it differs slightly from the unbiased $\frac{1}{m-1}$ estimator we normally use in statistics (for large $m$ the difference is negligible).
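These estimates are easy to compute (a minimal sketch; `estimate_gaussian` is a name I made up):

```python
def estimate_gaussian(xs):
    """Fit mu and sigma^2 by the maximum-likelihood estimates used in the
    lecture (note the 1/m factor, not the unbiased 1/(m-1))."""
    m = len(xs)
    mu = sum(xs) / m
    sigma2 = sum((x - mu) ** 2 for x in xs) / m
    return mu, sigma2

mu, sigma2 = estimate_gaussian([1.0, 2.0, 3.0, 4.0])
# mu = 10/4 = 2.5; sigma2 = (2.25 + 0.25 + 0.25 + 2.25)/4 = 1.25
```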

Anomaly detection algorithm

  1. Choose features $x_i$ that you think might be indicative of anomalous examples.

  2. Fit parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$ (written here in vectorized form):

    $$\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}$$
    $$\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2$$
  3. Given a new example $x$, compute $p(x)$:

    $$p(x)=\prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)$$
  4. Flag an anomaly if $p(x)<\epsilon$.
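The steps above can be put together as a short Python sketch (not the course's Octave code; all function names here are mine):

```python
import math

def fit_params(X):
    """Fit per-feature mu_j and sigma_j^2 on the training set.
    X is a list of m examples, each a list of n feature values."""
    m, n = len(X), len(X[0])
    mus = [sum(x[j] for x in X) / m for j in range(n)]
    sigma2s = [sum((x[j] - mus[j]) ** 2 for x in X) / m for j in range(n)]
    return mus, sigma2s

def p(x, mus, sigma2s):
    """p(x) = product over features of the univariate Gaussian densities."""
    prob = 1.0
    for xj, mu, s2 in zip(x, mus, sigma2s):
        prob *= math.exp(-(xj - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
    return prob

def is_anomaly(x, mus, sigma2s, epsilon):
    return p(x, mus, sigma2s) < epsilon

# Tiny illustration: points clustered near (1, 1), plus a far-away query.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0]]
mus, sigma2s = fit_params(X)
near = is_anomaly([1.0, 1.0], mus, sigma2s, 1e-3)  # high density: not flagged
far = is_anomaly([5.0, 5.0], mus, sigma2s, 1e-3)   # tiny density: flagged
```

A point near the cluster has large $p(x)$ and is not flagged, while a far-away point has essentially zero density and is flagged as an anomaly.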

Building an Anomaly Detection System

Developing and Evaluating an Anomaly Detection System

The importance of real-number evaluation

When developing a learning algorithm (choosing features, etc.), making decisions is much easier if we have a way of evaluating our learning algorithm.

Assume we have some labeled data of anomalous and non-anomalous examples ($y=0$ if normal, $y=1$ if anomalous).

  • Training set: $x^{(1)},x^{(2)},\cdots,x^{(m)}$ (assume normal examples/not anomalous)

  • Cross validation set: $\left(x_{cv}^{(1)},y_{cv}^{(1)}\right),\cdots,\left(x_{cv}^{(m_{cv})},y_{cv}^{(m_{cv})}\right)$

  • Test set: $\left(x_{test}^{(1)},y_{test}^{(1)}\right),\cdots,\left(x_{test}^{(m_{test})},y_{test}^{(m_{test})}\right)$

Aircraft engines motivating example

Say, what we have is:

  • 10000 good (normal) engines ($y=0$)
  • 20 flawed engines (anomalous) ($y=1$)

Then we are going to choose:

  • Training set: 6000 good engines
  • CV: 2000 good engines ($y=0$), 10 anomalous ($y=1$)
  • Test: 2000 good engines ($y=0$), 10 anomalous ($y=1$)

Algorithm evaluation

Fit model $p(x)$ on training set $\{x^{(1)},\cdots,x^{(m)}\}$

On a cross validation/test example xx, predict:

$$
y = \left\{\begin{array}{ll}
1 & \text{if } p(x) < \epsilon \quad \text{(anomaly)}\\
0 & \text{if } p(x) \ge \epsilon \quad \text{(normal)}
\end{array}\right.
$$

Possible evaluation metrics:

  • True positive, false positive, false negative, true negative
  • Precision/Recall
  • $F_1$-score

Can also use the cross-validation set to choose the parameter $\epsilon$.
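Choosing $\epsilon$ by $F_1$ on the CV set can be sketched like this (`f1_score` and `choose_epsilon` are hypothetical helper names, and the candidate densities/labels are made up for illustration):

```python
def f1_score(preds, labels):
    """F1 = 2 * precision * recall / (precision + recall)."""
    tp = sum(1 for p_, y in zip(preds, labels) if p_ == 1 and y == 1)
    fp = sum(1 for p_, y in zip(preds, labels) if p_ == 1 and y == 0)
    fn = sum(1 for p_, y in zip(preds, labels) if p_ == 0 and y == 1)
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def choose_epsilon(p_cv, y_cv, candidates):
    """Pick the epsilon with the best F1 on the cross-validation set."""
    best_eps, best_f1 = None, -1.0
    for eps in candidates:
        preds = [1 if p_ < eps else 0 for p_ in p_cv]
        f1 = f1_score(preds, y_cv)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Hypothetical CV densities: anomalies (y=1) tend to have small p(x).
p_cv = [0.40, 0.35, 0.30, 0.001, 0.002]
y_cv = [0, 0, 0, 1, 1]
eps, f1 = choose_epsilon(p_cv, y_cv, [0.005, 0.05, 0.5])
```

Here $\epsilon=0.005$ separates the two anomalies perfectly, while $\epsilon=0.5$ would also flag every normal example and hurt precision.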

Anomaly Detection vs. Supervised Learning

The lecture slides compare the two settings:

  • Use anomaly detection when you have a very small number of positive examples ($y=1$, commonly 0–20), a large number of negative examples, many different "types" of anomalies, and future anomalies may look nothing like the ones seen so far (e.g. fraud detection, manufacturing, monitoring machines in a data center).
  • Use supervised learning when you have large numbers of both positive and negative examples, enough positives for the algorithm to learn what they look like, and future positives are likely to resemble those in the training set (e.g. spam classification, weather prediction, cancer classification).

There are some more examples:

  • (Anomaly Detection) You run a power utility (supplying electricity to customers) and want to monitor your electric plants to see if any one of them might be behaving strangely.

  • (Supervised Learning) You run a power utility and want to predict tomorrow’s expected demand for electricity (so that you can plan to ramp up an appropriate amount of generation capacity).

  • (Anomaly Detection) A computer vision / security application, where you examine video images to see if anyone in your company’s parking lot is acting in an unusual way.

  • (Supervised Learning) A computer vision application, where you examine an image of a person entering your retail store to determine if the person is male or female.

Choosing What Features to Use

Preprocess non-Gaussian features

We can use `hist` in Octave to plot the histogram of our data. If we find that a feature is non-Gaussian, we can try applying a $\log(x+c)$ or $x^{\frac{1}{c}}$ transformation (try different $c$ for a better result) to make it look more Gaussian. For example:
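A minimal sketch of such a transform (in Python rather than Octave; `transform` is my own name): for data whose values double from one sample to the next, the raw mean sits far above the median, while the logged values become evenly spaced and symmetric:

```python
import math

def transform(xs, c=0.0):
    """Apply log(x + c) to make a right-skewed feature look more
    Gaussian (try different c, or powers x ** (1/c), for a better fit)."""
    return [math.log(x + c) for x in xs]

# Right-skewed data: each value doubles.
raw = [1.0, 2.0, 4.0, 8.0, 16.0]
logged = transform(raw)
# After the transform the values are evenly spaced (multiples of ln 2),
# so the mean equals the median: the skew is gone.
```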


Error analysis for anomaly detection

What we want is:

  • $p(x)$ large for normal examples $x$.
  • $p(x)$ small for anomalous examples $x$.

And the most common problem is:

$p(x)$ is comparable (say, both large) for normal and anomalous examples.

Suppose our anomaly detection algorithm is performing poorly and outputs a large value of $p(x)$ for many normal examples and for many anomalous examples in the cross-validation set. What is most likely to help is to come up with more features that distinguish the normal from the anomalous examples.


Multivariate Gaussian distribution

Multivariate Gaussian Distribution

What we still want to do is:

  • Given $x\in\mathbb{R}^n$.
  • Don't model $p(x_1),p(x_2),\cdots$ separately.
  • Model $p(x)$ all in one go.

The multivariate Gaussian makes this possible:

$$p(x;\mu,\Sigma)=\frac{\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^{n}|\Sigma|}}$$

P.S. $|\Sigma| = \det\Sigma$ is the determinant of $\Sigma$.

Where we need parameters:

  • $\mu\in\mathbb{R}^n$
  • $\Sigma\in\mathbb{R}^{n\times n}$ (the covariance matrix; in Octave, `Sigma = (1/m) * X' * X;` once the columns of `X` have been mean-centered)

Given a training set $\{x^{(1)},\cdots,x^{(m)}\}$, we can fit the parameters:

$$\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T$$

The lecture slides show many example contour plots: shrinking or growing the diagonal entries of $\Sigma$ narrows or widens the distribution along the corresponding axes, non-zero off-diagonal entries tilt the contours to capture correlations between features, and changing $\mu$ shifts the center of the peak.

Anomaly Detection using the Multivariate Gaussian Distribution

Using the Multivariate Gaussian Distribution

  1. Fit model $p(x)$ by setting:

    $$\mu=\frac{1}{m}\sum_{i=1}^m x^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T$$
  2. Given a new example $x$, compute:

    $$p(x)=\frac{\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^{n}|\Sigma|}}$$
  3. Flag an anomaly if $p(x)<\epsilon$.
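The fitting and density steps can be sketched in Python. To keep the linear algebra explicit I restrict to $n=2$ features and use the closed-form $2\times2$ determinant and inverse; real code would use a linear algebra library (the function names here are mine):

```python
import math

def fit_multivariate(X):
    """Fit mu and Sigma for n = 2 features.
    Sigma = (1/m) * sum of (x - mu)(x - mu)^T, a 2x2 matrix."""
    m = len(X)
    mu = [sum(x[j] for x in X) / m for j in range(2)]
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in X:
        d = [x[0] - mu[0], x[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j] / m
    return mu, s

def multivariate_pdf(x, mu, sigma):
    """p(x; mu, Sigma) for n = 2, using the closed-form 2x2 inverse."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    inv = [[sigma[1][1] / det, -sigma[0][1] / det],
           [-sigma[1][0] / det, sigma[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x - mu)^T Sigma^{-1} (x - mu)
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** 2 * det)
```

For the standard normal ($\mu=0$, $\Sigma=I$) the density at the origin is $\frac{1}{2\pi}$, which is a convenient spot check.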

Relationship to original model

Original model: $p(x)=p(x_1;\mu_1,\sigma_1^2)\times p(x_2;\mu_2,\sigma_2^2) \times \cdots \times p(x_n;\mu_n,\sigma_n^2)$


As we can see, the contours of the original model are always axis-aligned.

This model actually corresponds to a special case of a multivariate Gaussian distribution $p(x)=\frac{\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^{n}|\Sigma|}}$ where:

$$\Sigma=\left[\begin{array}{cccc}
\sigma_1^2 \\
& \sigma_2^2 \\
& & \ddots \\
& & & \sigma_n^2
\end{array}\right]$$
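This equivalence can be checked numerically: with a diagonal $\Sigma$, the determinant is the product of the $\sigma_j^2$ and the quadratic form splits feature by feature, so the multivariate density equals the product of univariate densities (a small sketch with made-up helper names):

```python
import math

def univariate(x, mu, s2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)

def diagonal_multivariate(x, mu, s2s):
    """Multivariate Gaussian with diagonal Sigma: the quadratic form
    and the determinant both decompose feature by feature."""
    n = len(x)
    q = sum((x[j] - mu[j]) ** 2 / s2s[j] for j in range(n))
    det = math.prod(s2s)
    return math.exp(-0.5 * q) / math.sqrt((2 * math.pi) ** n * det)

x, mu, s2s = [1.0, -0.5], [0.0, 0.0], [2.0, 0.5]
product = univariate(x[0], mu[0], s2s[0]) * univariate(x[1], mu[1], s2s[1])
joint = diagonal_multivariate(x, mu, s2s)
# The two agree: the original model is the diagonal-Sigma special case.
```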

Original model vs. Multivariate gaussian

Original model:

  • $p(x)=p(x_1;\mu_1,\sigma_1^2)\times \cdots \times p(x_n;\mu_n,\sigma_n^2)$
  • Manually create features to capture anomalies where $x_1,x_2$ take unusual combinations of values (e.g. $x_3=\frac{x_1}{x_2}$).
  • Computationally cheaper (alternatively, scales better to large $n$).
  • OK even if $m$ (training set size) is small.

Multivariate gaussian:

  • $p(x)=\frac{\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)}{\sqrt{(2\pi)^{n}|\Sigma|}}$
  • Automatically captures correlations between features (no need to create an $x_3=\frac{x_1}{x_2}$).
  • Computationally more expensive
  • Must have $m>n$ and no redundant features (i.e. no linearly dependent features), or else $\Sigma$ is non-invertible.