【Machine Learning】Formula Derivation and Proof: The Relationship Between Gaussian Bayes and Logistic Regression


Scenario 1: Specific Gaussian naive Bayes classifiers and logistic regression

Consider a specific class of Gaussian naive Bayes classifier where:

  • $y$ is a Boolean variable following a Bernoulli distribution with parameter $\pi=P(y=1)$, and thus $P(y=0)=1-\pi$.
  • $x=[x_1,\dots,x_D]^T$, with each feature $x_i$ a continuous random variable. For each $x_i$, $P(x_i|y=k)$ is a Gaussian distribution $N(\mu_{ik},\sigma_i)$. Note that $\sigma_i$ is the standard deviation of the Gaussian distribution, which does not depend on $k$.
  • For all $i \neq j$, $x_i$ and $x_j$ are conditionally independent given $y$ (the "naive" assumption).

Question: please show that the form of $P(y|x)$ implied by the above specific class of Gaussian naive Bayes classifiers is precisely the form used by a discriminative classifier, namely logistic regression.

Summary: for $i \neq j$, $x_i$ and $x_j$ are conditionally independent given $y$; $\sigma$ depends on $i$ but not on $k$.

Approach:

Try to transform the Bayes expression into the general form of logistic regression:

$$\begin{aligned} P(Y=1|X)&=\frac{1}{1+\exp(w_0+\sum_{i=1}^n w_i x_i)}\\ P(Y=0|X)&=\frac{\exp(w_0+\sum_{i=1}^n w_i x_i)}{1+\exp(w_0+\sum_{i=1}^n w_i x_i)} \end{aligned}$$

Answer:

From the assumptions of the specific Gaussian naive Bayes classifier above, together with Bayes' rule, we have:

$$\begin{aligned} P(Y=1|X) &=\frac{P(Y=1)P(X|Y=1)}{P(Y=1)P(X|Y=1)+P(Y=0)P(X|Y=0)}\\ &=\frac{1}{1+\frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)}}\\ &=\frac{1}{1+\exp\left(\ln\frac{P(Y=0)P(X|Y=0)}{P(Y=1)P(X|Y=1)}\right)} \end{aligned}$$

By the conditional independence of $x_i$ and $x_j$ given $y$, we obtain:

$$\begin{aligned} P(Y=1|X) &=\frac{1}{1+\exp\left(\ln\frac{P(Y=0)}{P(Y=1)}+\sum_i\ln\frac{P(x_i|Y=0)}{P(x_i|Y=1)}\right)}\\ &=\frac{1}{1+\exp\left(\ln\frac{1-\pi}{\pi}+\sum_i\ln\frac{P(x_i|Y=0)}{P(x_i|Y=1)}\right)} \end{aligned}$$

Then, since $P(x_i|Y=y_k)$ follows the Gaussian distribution $N(\mu_{ik},\sigma_i)$, we obtain:

$$\begin{aligned} \ln\frac{P(x_i|Y=0)}{P(x_i|Y=1)} &=\ln\frac{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(\frac{-(x_i-\mu_{i0})^2}{2\sigma_i^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_i^2}}\exp\left(\frac{-(x_i-\mu_{i1})^2}{2\sigma_i^2}\right)}\\ &=\frac{(x_i-\mu_{i1})^2-(x_i-\mu_{i0})^2}{2\sigma_i^2}\\ &=\frac{2x_i(\mu_{i0}-\mu_{i1})+\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}\\ &=\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}x_i+\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2} \end{aligned}$$

Therefore:

$$P(Y=1|X)=\frac{1}{1+\exp\left(\ln\frac{1-\pi}{\pi}+\sum_i\left(\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}x_i+\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}\right)\right)}$$

which is exactly the general form of logistic regression:

$$P(Y=1|X)=\frac{1}{1+\exp(w_0+\sum_{i=1}^n w_i x_i)}$$

where:

$$\begin{aligned} w_0&=\ln\frac{1-\pi}{\pi}+\sum_i\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}\\ w_i&=\frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2} \end{aligned}$$

Therefore this specific form of Gaussian naive Bayes classifier yields exactly the form used by logistic regression.
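The derived weights can be checked numerically. A minimal sketch (the parameter values below are made up for illustration): it computes $P(Y=1|x)$ directly by Bayes' rule and compares it with the logistic form using $w_0$ and $w_i$ from the formulas above.

```python
import numpy as np

# Hypothetical GNB parameters for D = 2 features (illustrative values only)
pi = 0.3                      # prior P(y = 1)
mu0 = np.array([1.0, -0.5])   # feature means given y = 0
mu1 = np.array([2.0, 0.5])    # feature means given y = 1
sigma = np.array([1.5, 0.8])  # per-feature std, shared by both classes

def gnb_posterior(x):
    """P(Y=1|x) computed directly via Bayes' rule."""
    def lik(mu):
        return np.prod(np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
                       / np.sqrt(2 * np.pi * sigma ** 2))
    return pi * lik(mu1) / (pi * lik(mu1) + (1 - pi) * lik(mu0))

# Logistic-regression weights implied by the derivation above
w0 = np.log((1 - pi) / pi) + np.sum((mu1 ** 2 - mu0 ** 2) / (2 * sigma ** 2))
w = (mu0 - mu1) / sigma ** 2

def lr_posterior(x):
    """P(Y=1|x) in logistic form with the derived weights."""
    return 1.0 / (1.0 + np.exp(w0 + w @ x))

x = np.array([0.7, 1.2])
assert np.isclose(gnb_posterior(x), lr_posterior(x))
```

The two posteriors agree at every input, which is exactly what the derivation claims.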

Scenario 2: General Gaussian naive Bayes classifiers and logistic regression

Remove the assumption that the standard deviation $\sigma_i$ of $P(x_i|y=k)$ does not depend on $k$. That is, for each $x_i$, $P(x_i|y=k)$ is a Gaussian distribution $N(\mu_{ik}, \sigma_{ik})$, where $i=1,\dots,D$ and $k=0,1$.

Question: is the new form of $P(y|x)$ implied by this more general Gaussian naive Bayes classifier still the form used by logistic regression? Derive the new form of $P(y|x)$ to prove your answer.

Summary: for $i \neq j$, $x_i$ and $x_j$ are conditionally independent given $y$; $\sigma$ depends on both $i$ and $k$.

Answer:

From Scenario 1, we have:

$$P(Y=1|X)=\frac{1}{1+\exp\left(\ln\frac{1-\pi}{\pi}+\sum_i\ln\frac{P(x_i|Y=0)}{P(x_i|Y=1)}\right)}$$

Then, since $P(x_i|Y=y_k)$ follows the Gaussian distribution $N(\mu_{ik},\sigma_{ik})$, we obtain:

$$\begin{aligned} \ln\frac{P(x_i|Y=0)}{P(x_i|Y=1)} &=\ln\frac{\frac{1}{\sqrt{2\pi\sigma_{i0}^2}}\exp\left(\frac{-(x_i-\mu_{i0})^2}{2\sigma_{i0}^2}\right)}{\frac{1}{\sqrt{2\pi\sigma_{i1}^2}}\exp\left(\frac{-(x_i-\mu_{i1})^2}{2\sigma_{i1}^2}\right)}\\ &=\ln\frac{1}{\sqrt{2\pi\sigma_{i0}^2}}-\frac{(x_i-\mu_{i0})^2}{2\sigma_{i0}^2}-\ln\frac{1}{\sqrt{2\pi\sigma_{i1}^2}}+\frac{(x_i-\mu_{i1})^2}{2\sigma_{i1}^2}\\ &=\ln\frac{\sigma_{i1}}{\sigma_{i0}}+\frac{(x_i-\mu_{i1})^2}{2\sigma_{i1}^2}-\frac{(x_i-\mu_{i0})^2}{2\sigma_{i0}^2}\\ &=\ln\frac{\sigma_{i1}}{\sigma_{i0}}+\frac{\sigma_{i0}^2\mu_{i1}^2-\sigma_{i1}^2\mu_{i0}^2}{2\sigma_{i0}^2\sigma_{i1}^2}+\frac{\mu_{i0}\sigma_{i1}^2-\mu_{i1}\sigma_{i0}^2}{\sigma_{i0}^2\sigma_{i1}^2}x_i+\frac{\sigma_{i0}^2-\sigma_{i1}^2}{2\sigma_{i0}^2\sigma_{i1}^2}x_i^2 \end{aligned}$$

The expression contains an $x_i^2$ term, which does not vanish when $\sigma_{i0} \neq \sigma_{i1}$, so the general Gaussian naive Bayes classifier does not take the form used by logistic regression.
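The surviving quadratic term can be verified numerically. A minimal sketch for a single feature (parameter values are made up for illustration): it checks that the log-odds matches the closed form above, including the nonzero $x_i^2$ coefficient.

```python
import numpy as np

# Hypothetical parameters for one feature; the std now depends on the class
pi = 0.5
mu0, mu1 = 0.0, 1.0
s0, s1 = 1.0, 2.0  # sigma_{i0} != sigma_{i1}

def log_odds(x):
    """ln[P(Y=0)P(x|Y=0) / (P(Y=1)P(x|Y=1))] -- the exponent in P(Y=1|x)."""
    def logpdf(mu, s):
        return -0.5 * np.log(2 * np.pi * s ** 2) - (x - mu) ** 2 / (2 * s ** 2)
    return np.log((1 - pi) / pi) + logpdf(mu0, s0) - logpdf(mu1, s1)

# Coefficients from the closed form derived above: constant, linear, quadratic
a0 = (np.log((1 - pi) / pi) + np.log(s1 / s0)
      + (s0 ** 2 * mu1 ** 2 - s1 ** 2 * mu0 ** 2) / (2 * s0 ** 2 * s1 ** 2))
a1 = (mu0 * s1 ** 2 - mu1 * s0 ** 2) / (s0 ** 2 * s1 ** 2)
a2 = (s0 ** 2 - s1 ** 2) / (2 * s0 ** 2 * s1 ** 2)

x = 1.7
assert np.isclose(log_odds(x), a0 + a1 * x + a2 * x ** 2)
assert not np.isclose(a2, 0.0)  # the quadratic term survives: not logistic-linear
```

Because $a_2 \neq 0$ whenever the two class variances differ, the log-odds is quadratic rather than linear in $x$.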

Scenario 3: Gaussian Bayes classifiers and logistic regression

Now, consider the following assumptions for our Gaussian Bayes classifiers (without "naive").

  • $y$ is a Boolean variable following a Bernoulli distribution with parameter $\pi=P(y=1)$, and thus $P(y=0)=1-\pi$.

  • $x=[x_1,x_2]^T$, i.e., we only consider two features for each sample, with each feature a continuous random variable. $x_1$ and $x_2$ are not conditionally independent given $y$. We assume $P(x_1,x_2|y=k)$ is a bivariate Gaussian distribution $N(\mu_{1k},\mu_{2k},\sigma_1,\sigma_2,\rho)$, where $\mu_{1k}$ and $\mu_{2k}$ are the means of $x_1$ and $x_2$. The density of the bivariate Gaussian distribution is:
    $$P(x_1,x_2|y=k)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{\sigma_2^2(x_1-\mu_{1k})^2+\sigma_1^2(x_2-\mu_{2k})^2-2\rho\sigma_1\sigma_2(x_1-\mu_{1k})(x_2-\mu_{2k})}{2(1-\rho^2)\sigma_1^2\sigma_2^2}\right]$$

Question: is the form of $P(y|x)$ implied by such not-so-naive Gaussian Bayes classifiers still the form used by logistic regression? Derive the form of $P(y|x)$ to prove your answer.

Summary: $x_1$ and $x_2$ are not conditionally independent given $y$; $\sigma$ depends on $i$ but not on $k$.

Answer:

From the assumptions of the non-naive Gaussian Bayes classifier above, together with Bayes' rule, we have:

$$P(Y=1|X)=\frac{1}{1+\exp\left(\ln\frac{1-\pi}{\pi}+\ln\frac{P(X|Y=0)}{P(X|Y=1)}\right)}$$

where $X=[x_1,x_2]^T$ follows the bivariate Gaussian distribution:

$$P(x_1,x_2|Y=k)=\frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{\sigma_2^2(x_1-\mu_{1k})^2+\sigma_1^2(x_2-\mu_{2k})^2-2\rho\sigma_1\sigma_2(x_1-\mu_{1k})(x_2-\mu_{2k})}{2(1-\rho^2)\sigma_1^2\sigma_2^2}\right]$$

Then, since $\sigma_1$, $\sigma_2$, and $\rho$ are shared by both classes, the $x_1^2$, $x_2^2$, and $x_1x_2$ terms cancel in the log ratio:

$$\begin{aligned} \ln\frac{P(X|Y=0)}{P(X|Y=1)} &=\ln\frac{P(x_1,x_2|Y=0)}{P(x_1,x_2|Y=1)}\\ &=\left[\frac{\mu_{10}-\mu_{11}}{(1-\rho^2)\sigma_1^2}+\frac{\rho(\mu_{21}-\mu_{20})}{(1-\rho^2)\sigma_1\sigma_2}\right]x_1\\ &\quad+\left[\frac{\mu_{20}-\mu_{21}}{(1-\rho^2)\sigma_2^2}+\frac{\rho(\mu_{11}-\mu_{10})}{(1-\rho^2)\sigma_1\sigma_2}\right]x_2\\ &\quad+\left[\frac{\mu_{11}^2-\mu_{10}^2}{2(1-\rho^2)\sigma_1^2}+\frac{\mu_{21}^2-\mu_{20}^2}{2(1-\rho^2)\sigma_2^2}+\frac{\rho(\mu_{10}\mu_{20}-\mu_{11}\mu_{21})}{(1-\rho^2)\sigma_1\sigma_2}\right] \end{aligned}$$

which is exactly the general form of logistic regression:

$$P(Y=1|X)=\frac{1}{1+\exp(w_0+\sum_{i=1}^2 w_i x_i)}$$

where:

$$\begin{aligned} w_0&=\ln\frac{1-\pi}{\pi}+\left[\frac{\mu_{11}^2-\mu_{10}^2}{2(1-\rho^2)\sigma_1^2}+\frac{\mu_{21}^2-\mu_{20}^2}{2(1-\rho^2)\sigma_2^2}+\frac{\rho(\mu_{10}\mu_{20}-\mu_{11}\mu_{21})}{(1-\rho^2)\sigma_1\sigma_2}\right]\\ w_1&=\frac{\mu_{10}-\mu_{11}}{(1-\rho^2)\sigma_1^2}+\frac{\rho(\mu_{21}-\mu_{20})}{(1-\rho^2)\sigma_1\sigma_2}\\ w_2&=\frac{\mu_{20}-\mu_{21}}{(1-\rho^2)\sigma_2^2}+\frac{\rho(\mu_{11}-\mu_{10})}{(1-\rho^2)\sigma_1\sigma_2} \end{aligned}$$

Therefore this non-naive Gaussian Bayes classifier, with its covariance shared across classes, also yields exactly the form used by logistic regression.
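This result, too, can be checked numerically. A minimal sketch (parameter values are made up for illustration): it evaluates $P(Y=1|x)$ via Bayes' rule with the shared-covariance bivariate Gaussians and compares it with the logistic form using $w_0$, $w_1$, $w_2$ from the formulas above.

```python
import numpy as np

# Hypothetical parameters; the covariance (sigma1, sigma2, rho) is shared by both classes
pi = 0.4
mu0 = np.array([0.0, 1.0])
mu1 = np.array([1.5, -0.5])
s1, s2, rho = 1.0, 2.0, 0.3
Sigma = np.array([[s1 ** 2, rho * s1 * s2],
                  [rho * s1 * s2, s2 ** 2]])
Sinv = np.linalg.inv(Sigma)

def bayes_posterior(x):
    """P(Y=1|x) via Bayes' rule with shared-covariance bivariate Gaussians."""
    def lik(mu):
        d = x - mu
        return np.exp(-0.5 * d @ Sinv @ d)  # normalizers cancel in the ratio
    return pi * lik(mu1) / (pi * lik(mu1) + (1 - pi) * lik(mu0))

# Weights from the closed form derived above
den = 1 - rho ** 2
w0 = (np.log((1 - pi) / pi)
      + (mu1[0] ** 2 - mu0[0] ** 2) / (2 * den * s1 ** 2)
      + (mu1[1] ** 2 - mu0[1] ** 2) / (2 * den * s2 ** 2)
      + rho * (mu0[0] * mu0[1] - mu1[0] * mu1[1]) / (den * s1 * s2))
w1 = (mu0[0] - mu1[0]) / (den * s1 ** 2) + rho * (mu1[1] - mu0[1]) / (den * s1 * s2)
w2 = (mu0[1] - mu1[1]) / (den * s2 ** 2) + rho * (mu1[0] - mu0[0]) / (den * s1 * s2)

x = np.array([0.3, -1.1])
assert np.isclose(bayes_posterior(x),
                  1.0 / (1.0 + np.exp(w0 + w1 * x[0] + w2 * x[1])))
```

Writing the density through the inverse covariance matrix is equivalent to the explicit $\sigma_1,\sigma_2,\rho$ form used in the derivation; the normalizing constants cancel because the covariance is identical for both classes.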