Classification


Applications

  • credit scoring
  • medical diagnosis
  • handwritten character recognition

Regression?

Why not regression

  • regression penalizes examples that are "too correct": points far on the correct side of the boundary still incur a large squared error and pull the boundary toward them (see the sketch after this list)
  • multiple classes: encoding them as 1, 2, 3 implies an ordering, but classes 1 and 2 are not necessarily closer than classes 1 and 3
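A minimal sketch of the first point, with made-up 1-D data (NumPy only): adding perfectly classified points far from the boundary still changes the least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(loc=-2.0, scale=1.0, size=(30, 1))   # class 1, target -1
x2 = rng.normal(loc=+2.0, scale=1.0, size=(30, 1))   # class 2, target +1
X = np.vstack([x1, x2])
y = np.concatenate([-np.ones(30), np.ones(30)])

def lstsq_boundary(X, y):
    # Fit y ≈ w*x + b by least squares; the decision boundary is w*x + b = 0.
    A = np.hstack([X, np.ones((len(X), 1))])
    (w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return -b / w

print("boundary:", lstsq_boundary(X, y))

# Add a few "too correct" class-2 points far to the right: they are already
# classified perfectly, yet the squared error they incur moves the boundary.
X2 = np.vstack([X, np.full((5, 1), 20.0)])
y2 = np.concatenate([y, np.ones(5)])
print("boundary with far-away correct points:", lstsq_boundary(X2, y2))
```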

Ideal alternatives

Count the number of misclassified training examples (a 0/1 loss). Gradient descent cannot be applied! The 0/1 loss is non-differentiable: it is piecewise constant in the parameters, with zero gradient almost everywhere (see the sketch below).
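A minimal sketch of that non-differentiability, with hypothetical 1-D data: nudging the parameters leaves the error count unchanged, so the gradient carries no signal.

```python
import numpy as np

def zero_one_loss(w, b, X, y):
    # y in {-1, +1}; predict with the sign of the linear function.
    pred = np.sign(X @ w + b)
    return np.sum(pred != y)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
w = np.array([1.0])
print(zero_one_loss(w, -1.5, X, y))         # 0 errors
print(zero_one_loss(w + 1e-6, -1.5, X, y))  # still 0: flat loss, zero gradient
```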

Two classes

Probabilistic generative model - Bayes' Theorem

$$P(C_1|x) = \frac{P(x|C_1) P(C_1)}{P(x|C_1) P(C_1) + P(x|C_2) P(C_2)}$$

  • estimate the probabilities from the training data
  • if we assume all dimensions are independent, this becomes the Naive Bayes classifier (see the sketch below)
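A minimal sketch of the rule above, assuming hypothetical 1-D class-conditional densities (shifted normals via scipy.stats.norm; any fitted density would do):

```python
import numpy as np
from scipy.stats import norm

def posterior_c1(x, p1, p2, prior1, prior2):
    # Bayes' theorem: P(C1|x) = P(x|C1)P(C1) / (P(x|C1)P(C1) + P(x|C2)P(C2))
    num = p1(x) * prior1
    return num / (num + p2(x) * prior2)

# Hypothetical class-conditional densities; Naive Bayes would additionally
# factorize each density over dimensions: P(x|C) = prod_i P(x_i|C).
p1 = lambda x: norm.pdf(x, loc=-1.0, scale=1.0)
p2 = lambda x: norm.pdf(x, loc=+1.0, scale=1.0)
print(posterior_c1(0.2, p1, p2, prior1=0.5, prior2=0.5))
```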

$P(C_1)$ & $P(C_2)$

Easily calculated from the training data as class frequencies, e.g. $P(C_1) = n_1 / (n_1 + n_2)$.

$P(x|C_1)$ & $P(x|C_2)$

  • Assume the points of each class are sampled from a Gaussian distribution (other distributions are fine too, e.g. a Bernoulli distribution for binary features)
  • Find the Gaussian distribution behind each class by maximum likelihood (see the sketch after the equations below)

$$L(\mu, \Sigma) = f_{\mu, \Sigma}(x^1)\, f_{\mu, \Sigma}(x^2) \cdots f_{\mu, \Sigma}(x^n)$$

$$\mu^*, \Sigma^* = \arg\max_{\mu, \Sigma} L(\mu, \Sigma)$$
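A minimal sketch of the maximizers, which for a Gaussian have a closed form: the empirical mean and the (1/n) empirical covariance of the class's points.

```python
import numpy as np

def gaussian_mle(X):
    # X: (n, d) array holding one class's points.
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / len(X)   # MLE uses 1/n, not 1/(n-1)
    return mu, Sigma
```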

Fitting a separate covariance matrix per class easily overfits, since each covariance matrix has (# features × # features) parameters -> share the same covariance matrix across classes and maximize the joint likelihood:

$$L(\mu^1, \mu^2, \Sigma) = f_{\mu^1, \Sigma}(x^1)\, f_{\mu^1, \Sigma}(x^2) \cdots f_{\mu^1, \Sigma}(x^{n_1}) \times f_{\mu^2, \Sigma}(x^{n_1+1})\, f_{\mu^2, \Sigma}(x^{n_1+2}) \cdots f_{\mu^2, \Sigma}(x^{n_1+n_2})$$
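Maximizing this joint likelihood still gives each class its own mean, but the shared $\Sigma$ works out to the count-weighted average of the per-class covariances. A minimal, self-contained sketch:

```python
import numpy as np

def gaussian_mle(X):
    # Per-class MLE (as in the sketch above): empirical mean, 1/n covariance.
    mu = X.mean(axis=0)
    centered = X - mu
    return mu, centered.T @ centered / len(X)

def shared_gaussian_mle(X1, X2):
    mu1, Sigma1 = gaussian_mle(X1)
    mu2, Sigma2 = gaussian_mle(X2)
    n1, n2 = len(X1), len(X2)
    # Maximizer of the joint likelihood: one mean per class, one shared
    # covariance equal to the count-weighted average of the class covariances.
    Sigma = (n1 * Sigma1 + n2 * Sigma2) / (n1 + n2)
    return mu1, mu2, Sigma
```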


  • Find the probability of a new point under each class from the estimated distributions, then plug into Bayes' theorem above

Finally

$$\begin{aligned} P(C_1|x) &= \sigma(w \cdot x + b), \\ \sigma(z) &= \frac{1}{1+\exp(-z)} \end{aligned}$$
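This happens because, with a shared $\Sigma$, the quadratic terms of the two Gaussians cancel in the log-odds, leaving a linear function of $x$. A minimal sketch of the resulting closed form (standard for this model; function names are mine):

```python
import numpy as np

def generative_to_linear(mu1, mu2, Sigma, n1, n2):
    # With shared Sigma, ln[P(x|C1)P(C1) / (P(x|C2)P(C2))] is linear in x.
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    b = (-0.5 * mu1 @ Sigma_inv @ mu1
         + 0.5 * mu2 @ Sigma_inv @ mu2
         + np.log(n1 / n2))
    return w, b

def sigmoid_posterior(x, w, b):
    # P(C1|x) = sigma(w . x + b)
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))
```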


-> How about we directly find w and b?