Logistic regression - discriminative


Step 1: Function set

Target: $P_{w, b}(C_1 | x)$

  • If target >= 0.5, output $C_1$
  • Else, output $C_2$

Based on previous knowledge in juejin.cn/post/722520…

  • $P_{w, b}(C_1 | x) = \sigma(z) = \frac{1}{1 + \exp(-z)}$
  • $z = w \cdot x + b = \sum_i w_i x_i + b$

Function set: $\{f_{w, b}(x) = P_{w, b}(C_1 | x), \forall w, \forall b\}$
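A minimal NumPy sketch of one function from this set (the names `sigmoid`, `f_wb`, and `classify` are illustrative, not from the source):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """P_{w,b}(C1 | x) for a feature vector x."""
    z = np.dot(w, x) + b              # z = sum_i w_i x_i + b
    return sigmoid(z)

def classify(x, w, b):
    """Output C1 if P(C1 | x) >= 0.5, else C2."""
    return "C1" if f_wb(x, w, b) >= 0.5 else "C2"
```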

Step 2: Goodness of a function

Training data: $\{(x^1, C_1), (x^2, C_1), (x^3, C_2), \dots, (x^N, C_1)\}$

  • Given w and b, the probability of generating the training data:

$L(w, b) = f_{w, b}(x^1)\, f_{w, b}(x^2)\, (1 - f_{w, b}(x^3)) \cdots f_{w, b}(x^N)$

  • Find the w and b that maximize this probability

$w^*, b^* = \arg\max_{w, b} L(w, b) = \arg\min_{w, b} \left(-\ln L(w, b)\right)$

  • Define $\hat{y}^n$: 1 for $C_1$, 0 for $C_2$

$-\ln L(w, b) = -\sum_n \left( \hat{y}^n \ln f_{w, b}(x^n) + (1 - \hat{y}^n) \ln \left(1 - f_{w, b}(x^n)\right) \right)$ -- the cross entropy between two Bernoulli distributions
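A hedged sketch of this loss in NumPy, assuming `X` stacks the feature vectors row-wise and `y_hat` holds the 0/1 labels (the names and the `eps` clipping are my additions):

```python
import numpy as np

def neg_log_likelihood(X, y_hat, w, b, eps=1e-12):
    """-ln L(w, b): sum of cross entropies between the Bernoulli
    target y_hat^n and the Bernoulli prediction f_{w,b}(x^n)."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f_{w,b}(x^n) for all n
    f = np.clip(f, eps, 1 - eps)             # avoid log(0)
    return -np.sum(y_hat * np.log(f) + (1 - y_hat) * np.log(1 - f))
```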

Step 3: Find the best function

Gradient descent:

  • $\frac{\partial (-\ln L(w, b))}{\partial w_i} = \sum_n -\left(\hat{y}^n - f_{w, b}(x^n)\right) x_i^n$ -- the larger the difference, the larger the update (see the sketch below)
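One gradient-descent step implementing this derivative might look as follows (the learning rate `eta` is an assumed hyperparameter):

```python
import numpy as np

def gradient_step(X, y_hat, w, b, eta=0.1):
    """One gradient-descent update for -ln L(w, b)."""
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predictions f_{w,b}(x^n)
    err = y_hat - f                          # (y_hat^n - f): larger difference, larger update
    w_new = w + eta * X.T @ err              # w_i <- w_i - eta * sum_n -(y_hat^n - f) x_i^n
    b_new = b + eta * np.sum(err)
    return w_new, b_new
```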

Discussions

Compare with linear regression

(figure: comparison of logistic and linear regression; the gradient-descent update $w_i \leftarrow w_i - \eta \sum_n -\left(\hat{y}^n - f_{w, b}(x^n)\right) x_i^n$ takes the same form in both, only the output range of $f_{w, b}$ differs)

Why not use square error as the loss function?

$\frac{\partial L}{\partial w_i} = \frac{\partial (f_{w, b}(x) - \hat{y})^2}{\partial w_i} = 2\, (f_{w, b}(x) - \hat{y})\, f_{w, b}(x)\, (1 - f_{w, b}(x))\, x_i$

When $\hat{y}^n = 1$ (the case $\hat{y}^n = 0$ is similar):

  • If $f_{w, b}(x^n) = 1$, close to target: $\frac{\partial L}{\partial w_i} = 0$
  • If $f_{w, b}(x^n) = 0$, far from target: $\frac{\partial L}{\partial w_i} = 0$ -- the update is slow

A small $\frac{\partial L}{\partial w_i}$ cannot tell whether we are close to or far from the target.
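A small numeric check of this point (values chosen purely for illustration): with $\hat{y} = 1$, the square-error gradient is tiny both near and far from the target, while the cross-entropy gradient stays large when far away.

```python
import numpy as np

f = np.array([0.999, 0.001])   # close to target, far from target (y_hat = 1)
x_i = 1.0

# Square error: 2 (f - 1) f (1 - f) x_i  ->  ~0 in BOTH cases
grad_se = 2 * (f - 1) * f * (1 - f) * x_i

# Cross entropy: -(1 - f) x_i  ->  large when far from target
grad_ce = -(1 - f) * x_i

print(grad_se)   # approx [-2.0e-06, -2.0e-03]: tiny even when far away
print(grad_ce)   # approx [-1.0e-03, -9.99e-01]: large when far away
```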

Discriminative vs. Generative

See the generative method in juejin.cn/post/722520…


Different functions because of different assumptions

  • In the generative model: distribution assumptions, naive Bayes, etc.

Usually the discriminative model performs better.


Limitation of logistic regression

The decision boundary is linear -> apply a feature transformation first

How to find a good transformation? -> cascade logistic regression models -- a neural network (see the sketch below)
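One common illustration of the feature-transformation idea (my choice of example, not prescribed here) is XOR-like data, which no single linear boundary separates until the inputs are mapped to distances from two anchor points:

```python
import numpy as np

# XOR-like data: no single linear boundary separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([1, 0, 0, 1])  # class C1 at (0,0) and (1,1)

def transform(x):
    """Hand-crafted features: distances to (0,0) and (1,1)."""
    return np.array([np.linalg.norm(x - [0, 0]),
                     np.linalg.norm(x - [1, 1])])

X_t = np.array([transform(x) for x in X])
# In the new space, C1 maps to (0, sqrt 2) and (sqrt 2, 0) while both
# C2 points map to (1, 1): the classes are now linearly separable.
print(X_t)
```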


Multi-class classification

Softmax: $y_i = P(C_i | x) = e^{z_i} / \sum_j e^{z_j}$, where $z_i = w_i \cdot x + b_i$

  • $0 < y_i < 1$
  • $\sum_i y_i = 1$

Cross entropy: $-\sum_i \hat{y}_i \ln y_i$

  • $\hat{y} = (\hat{y}_1, \hat{y}_2, \dots)$ -- the target
  • If $x \in C_1$, $\hat{y} = (1, 0, 0, \dots)$
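A minimal sketch of softmax plus its cross-entropy in NumPy (the max-subtraction and `eps` guard are standard numerical-stability details, not from the source):

```python
import numpy as np

def softmax(z):
    """y_i = e^{z_i} / sum_j e^{z_j}; each y_i in (0, 1), summing to 1."""
    e = np.exp(z - np.max(z))          # subtract max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, y, eps=1e-12):
    """- sum_i y_hat_i ln y_i, with y_hat the one-hot target."""
    return -np.sum(y_hat * np.log(y + eps))

z = np.array([2.0, 1.0, 0.1])        # z_i = w_i . x + b_i (example values)
y = softmax(z)
y_hat = np.array([1.0, 0.0, 0.0])    # x in C1
print(y, cross_entropy(y_hat, y))
```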