Following up on that article, here is a derivation of the formulas from another angle:
Variational Auto-Encoder (VAE)
Mathematically, we can model the latent variables and our observed data with a joint distribution $p(\boldsymbol{x}, \boldsymbol{z})$. A VAE learns a model that maximizes the likelihood $p(\boldsymbol{x})$ of all observations $\boldsymbol{x}$. There are two ways to manipulate this joint distribution to recover the likelihood $p(\boldsymbol{x})$ of the observations alone.

We can explicitly marginalize out the latent variable $\boldsymbol{z}$:
$$p(\boldsymbol{x})=\int p(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z} \tag{1}$$
Or we can use the chain rule of probability:
$$p(\boldsymbol{x})=\frac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})} \tag{2}$$
Unfortunately, we cannot use either of these two equations to compute the maximum likelihood directly: Equation 1 requires integrating out all latent variables $\boldsymbol{z}$, which is intractable for complex models, and Equation 2 requires access to the ground-truth posterior $p(\boldsymbol{z} \mid \boldsymbol{x})$, which we do not have.
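To see concretely why Equation 1 is intractable in practice, consider naive grid quadrature over $\boldsymbol{z}$: the number of evaluations of $p(\boldsymbol{x}, \boldsymbol{z})$ grows exponentially with the latent dimension (the grid resolution below is an arbitrary illustrative choice, not from the text):

```python
# Cost of marginalizing z by brute-force quadrature: with k grid points
# per latent dimension, a d-dimensional z needs k**d evaluations of p(x, z).
points_per_dim = 50  # illustrative grid resolution
for latent_dim in [1, 2, 10, 100]:
    evaluations = points_per_dim ** latent_dim
    print(f"d = {latent_dim:>3}: {evaluations} evaluations of p(x, z)")
```

Already at $d=10$ the count exceeds $10^{16}$, which is why we need a different route to the likelihood.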
However, using these two equations, we can derive a term called the Evidence Lower Bound (ELBO), which, as the name suggests, is a lower bound on the evidence. Here, the evidence is quantified as the log-likelihood of the observed data. Maximizing the ELBO then becomes a proxy objective for optimizing a latent-variable model; in the best case, when the ELBO is parameterized by a sufficiently powerful model and perfectly optimized, it becomes exactly equal to the evidence. Formally, the ELBO is:
$$\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$$
Derivation
First, deriving directly from Equation 1, we get:
$$\begin{aligned}
\log p(\boldsymbol{x}) & =\log \int p(\boldsymbol{x}, \boldsymbol{z})\, d\boldsymbol{z} & & \text{(Apply Equation 1)} \\
& =\log \int \frac{p(\boldsymbol{x}, \boldsymbol{z})\, q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\, d\boldsymbol{z} & & \text{(Multiply by } 1=\tfrac{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\text{)} \\
& =\log \mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Definition of Expectation)} \\
& \geq \mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Apply Jensen's Inequality)}
\end{aligned}$$
From this we can see that

$$\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right]$$

is indeed a lower bound. But since it was obtained by directly applying Jensen's inequality, this derivation does not tell us *why* it is a lower bound, i.e., what the gap between it and the evidence actually is. So let us now try the second derivation:
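Before moving on, the Jensen step can be checked numerically: for the concave $\log$, $\log \mathbb{E}[Y] \geq \mathbb{E}[\log Y]$. Below is a small Monte Carlo sketch; the toy Gaussian model, the sample count, and all distribution choices are my own illustrative assumptions, not part of the original derivation:

```python
import numpy as np

# Numeric sanity check of Jensen's inequality with Y = p(x, z) / q(z|x).
# Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1), and a hand-picked
# variational q(z|x) = N(0.5, 1). For this model the exact evidence is
# p(x) = N(x; 0, var=2).
rng = np.random.default_rng(0)

def log_normal_pdf(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

x = 1.0                              # a fixed observation
z = rng.normal(0.5, 1.0, 100_000)    # samples from q(z|x)

log_p_xz = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
log_q = log_normal_pdf(z, 0.5, 1.0)
log_ratio = log_p_xz - log_q

log_of_mean = np.log(np.mean(np.exp(log_ratio)))  # estimates log E[Y] = log p(x)
mean_of_log = np.mean(log_ratio)                  # estimates E[log Y] = the ELBO

print(log_of_mean, mean_of_log)
assert log_of_mean >= mean_of_log    # Jensen's inequality holds
```

The first quantity recovers $\log p(x)$ (analytically $\log \mathcal{N}(1; 0, 2) \approx -1.52$ here), while the second, smaller quantity is the ELBO; the gap between them is exactly the KL term that the second derivation below makes explicit.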
$$\begin{aligned}
\log p(\boldsymbol{x}) & =\log p(\boldsymbol{x}) \int q_\phi(\boldsymbol{z} \mid \boldsymbol{x})\, d\boldsymbol{z} & & \text{(Multiply by } 1=\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x})\, d\boldsymbol{z}\text{)} \\
& =\int q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \log p(\boldsymbol{x})\, d\boldsymbol{z} & & \text{(Bring evidence into integral)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}[\log p(\boldsymbol{x})] & & \text{(Definition of Expectation)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{p(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Apply Equation 2)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})\, q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{p(\boldsymbol{z} \mid \boldsymbol{x})\, q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Multiply by } 1=\tfrac{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\text{)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right]+\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}{p(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Split the Expectation)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right]+D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z} \mid \boldsymbol{x})\right) & & \text{(Definition of KL Divergence)} \quad ⭐ \\
& \geq \mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(KL Divergence always} \geq 0\text{)}
\end{aligned}$$
From this derivation, the ⭐ step clearly shows that the evidence equals the ELBO plus the KL divergence between the approximate posterior $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ and the true posterior $p(\boldsymbol{z} \mid \boldsymbol{x})$.

Now we can see that in the first derivation, Jensen's inequality simply dropped this KL divergence term.

So the ELBO really is a lower bound: the gap between the evidence and the ELBO is a strictly non-negative KL term, and therefore the value of the ELBO can never exceed the evidence.
Why seek to maximize the ELBO?
Having introduced the latent variable $\boldsymbol{z}$ we want to model, our goal is to learn the latent representation at this intermediate bottleneck.
In other words, we want to optimize the parameters of our variational posterior $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ so that it exactly matches the true posterior distribution $p(\boldsymbol{z} \mid \boldsymbol{x})$, which would be achieved by minimizing their KL divergence (ideally to zero). But we have no access to the true distribution $p(\boldsymbol{z} \mid \boldsymbol{x})$, so minimizing this KL term directly is hard.
Notice that at the ⭐ step, the evidence $\log p(\boldsymbol{x})$ is always a constant with respect to $\phi$, because it is computed by marginalizing all latent variables $\boldsymbol{z}$ out of the joint distribution $p(\boldsymbol{x}, \boldsymbol{z})$ and does not depend on $\phi$ at all. Since the ELBO and the KL term sum to this constant, any maximization of the ELBO with respect to $\phi$ implies an equal minimization of the KL term. Therefore, maximizing the ELBO serves as a proxy objective for learning to perfectly model the true latent posterior: the more we optimize the ELBO, the closer our approximate posterior gets to the true posterior.
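This argument can be verified end to end on a toy model where everything has a closed form. The conjugate-Gaussian model below is my own illustrative choice (not part of the original text): as $q$'s mean moves toward the true posterior mean, the ELBO rises and the KL gap shrinks, while their sum stays pinned at $\log p(x)$:

```python
import numpy as np

# Closed-form check of the ⭐ identity  log p(x) = ELBO + KL(q || p(z|x))
# on a toy conjugate model:  p(z) = N(0, 1),  p(x|z) = N(z, 1),
# so the true posterior is p(z|x) = N(x/2, 1/2) and p(x) = N(x; 0, var=2).
LOG2PI = np.log(2 * np.pi)

def elbo(x, mu, s2):
    """ELBO for q(z|x) = N(mu, s2), assembled term by term."""
    e_log_prior = -0.5 * LOG2PI - (mu**2 + s2) / 2        # E_q[log p(z)]
    e_log_lik = -0.5 * LOG2PI - ((x - mu) ** 2 + s2) / 2  # E_q[log p(x|z)]
    entropy = 0.5 * (LOG2PI + 1 + np.log(s2))             # E_q[-log q(z|x)]
    return e_log_prior + e_log_lik + entropy

def kl_gauss(mu1, s2_1, mu2, s2_2):
    """KL( N(mu1, s2_1) || N(mu2, s2_2) ) for 1-D Gaussians."""
    return 0.5 * (np.log(s2_2 / s2_1) + (s2_1 + (mu1 - mu2) ** 2) / s2_2 - 1)

x = 1.0
log_px = -0.5 * LOG2PI - 0.5 * np.log(2) - x**2 / 4  # log N(x; 0, var=2)

for mu in [2.0, 1.0, 0.5]:  # q's mean moving toward the true mean x/2
    gap = kl_gauss(mu, 0.5, x / 2, 0.5)  # KL(q || true posterior)
    print(f"mu={mu}: ELBO={elbo(x, mu, 0.5):.4f}, KL gap={gap:.4f}")
    assert abs(elbo(x, mu, 0.5) + gap - log_px) < 1e-9  # the ⭐ identity
```

At $\mu = x/2$ the variational posterior coincides with the true posterior, the KL gap hits zero, and the ELBO equals the evidence exactly.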
In the default formulation of the Variational Autoencoder (VAE), we maximize the ELBO directly. But let us dissect the ELBO term further:
$$\begin{aligned}
\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{x}, \boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\, p(\boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Chain Rule of Probability)} \\
& =\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right]+\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log \frac{p(\boldsymbol{z})}{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\right] & & \text{(Split the Expectation)} \\
& =\underbrace{\mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\left[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\right]}_{\text{reconstruction term}}-\underbrace{D_{\mathrm{KL}}\left(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\right)}_{\text{prior matching term}} & & \text{(Definition of KL Divergence)}
\end{aligned}$$
When we maximize this quantity, we are making the first term as large as possible and the second term as small as possible. This means the decoder $p_\theta(\boldsymbol{x} \mid \boldsymbol{z})$ learns to reconstruct $\boldsymbol{x}$ faithfully from the latent code (reconstruction term), while the encoder $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ is kept close to the prior $p(\boldsymbol{z})$ (prior matching term).
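As a concrete sketch of what these two terms look like in code, here is a minimal NumPy version of the ELBO for the common Gaussian-encoder / Bernoulli-decoder setup. The toy decoder, the shapes, and all numeric choices below are hypothetical, for illustration only:

```python
import numpy as np

# Minimal sketch of ELBO = reconstruction term - prior matching term,
# with q_phi(z|x) = N(mu, diag(sigma^2)) and prior p(z) = N(0, I).
rng = np.random.default_rng(0)

def kl_to_standard_normal(mu, log_var):
    # Closed-form D_KL( N(mu, sigma^2) || N(0, I) ), summed over latent dims:
    # the prior matching term.
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def bernoulli_log_likelihood(x, x_recon):
    # log p_theta(x|z) for binary data: the reconstruction term.
    eps = 1e-7  # numerical floor to keep the logs finite
    return np.sum(x * np.log(x_recon + eps) + (1 - x) * np.log(1 - x_recon + eps))

def elbo(x, mu, log_var, decode):
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients could flow through the sampling step in a real framework.
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    reconstruction = bernoulli_log_likelihood(x, decode(z))
    prior_matching = kl_to_standard_normal(mu, log_var)
    return reconstruction - prior_matching  # the quantity we maximize

# Toy usage with a hypothetical "decoder": squash z through a sigmoid.
x = (rng.random(4) > 0.5).astype(float)  # one binary 4-pixel "datapoint"
mu, log_var = np.zeros(2), np.zeros(2)   # stand-in encoder outputs
decode = lambda z: 1 / (1 + np.exp(-np.sum(z) * np.ones(4)))
print(elbo(x, mu, log_var, decode))
```

In a real VAE, `mu`, `log_var`, and `decode` are neural networks trained jointly, and deep-learning frameworks ship the same closed-form KL and reconstruction losses; the structure of the objective is exactly the two-term split derived above.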