CHAPTER 3 Stochastic Least-Squares Problems
3.1 poses the problem.
3.2 gives a simple first result.
3.3 gives a geometric interpretation.
3.4 introduces linear models.
3.5 gives the equivalence and duality relations between stochastic and deterministic least squares.
3.1 THE PROBLEM OF STOCHASTIC ESTIMATION
Given two jointly distributed random variables $x$ and $y$, where $y$ (the observation) is known and $x$ is unknown.

Problem 1: from an observation $y$ of $y$, form an estimate $\hat{x}$ of the corresponding value of $x$,

$\hat{x} = h(y)$,   (3.1.1)

or, more generally, estimate the random variable $x$ from the random variable $y$,

$\hat{x} = h(y)$.   (3.1.2)
This raises Problem 2: how should the function $h(\cdot)$ be chosen?

We seek an estimator satisfying an optimality criterion: the least-mean-squares criterion (analogous to the least-squares criterion of Ch. 2). The solution is the conditional expectation $\hat{x} = E[x \mid y]$ (see Sec. 3.A.1 for the derivation).

Computing it requires the joint probability distribution of $\{x, y\}$, which is hard to obtain. So we simplify: restrict $h(\cdot)$ to be a linear function. We may also point out that this is often a reasonable assumption; in particular, when $\{x, y\}$ are jointly Gaussian, the unconstrained least-mean-squares estimator turns out to be linear anyway.
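As support for the Gaussian claim (a standard fact, stated here without proof): for zero-mean jointly Gaussian $\{x, y\}$ with $R_y > 0$,

$E[x \mid y] = R_{xy} R_y^{-1}\, y$,

so the unconstrained l.m.s. estimator of Sec. 3.A coincides with the linear estimator derived below in Sec. 3.2.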
3.2 LINEAR LEAST-MEAN-SQUARES ESTIMATORS
For complex-valued, zero-mean random variables $x$ and $y$, we define the (cross-)covariance matrix $R_{xy} = E\,xy^*$ (the product-of-means term drops out because each mean is zero). Here ${}^*$ denotes complex conjugation for scalar random variables and complex-conjugate transposition (the so-called Hermitian transpose) for vector-valued random variables. The main reason for this definition is to ensure that $R_x = E\,xx^*$ is a nonnegative scalar when $x$ is a scalar, and a nonnegative-definite matrix when $x$ is a vector random variable.
3.2.1 The Fundamental Equations
Our goal is to estimate the value assumed by the random variable $x$, given that the random variables $\{y_i\}$ have assumed certain values $\{y_i\}$. We are interested in linear estimators of $x$, i.e., estimators obtained by linear operations on the $\{y_i\}$. Assume the estimate is formed with a (yet unknown) coefficient matrix $K_o$:

$\hat{x} = K_o\, y$.   (3.2.1)

The optimal estimate is the one that minimizes the error covariance matrix

$P(K) = E\,(x - Ky)(x - Ky)^*$.   (3.2.2)

Note: the expectation of a matrix is taken entrywise, i.e., it is the matrix of the expectations of the individual entries.
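For later reference, expanding the quadratic in (3.2.2) term by term (using linearity of expectation) gives

$P(K) = R_x - K R_{yx} - R_{xy} K^* + K R_y K^*$,

where $R_x = E\,xx^*$, $R_{xy} = E\,xy^* = R_{yx}^*$, and $R_y = E\,yy^*$.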
Theorem 3.2.1 (Optimal Linear L.M.S. Estimators) Given two complex zero-mean random variables $x$ and $y$, the l.l.m.s. estimator of $x$ given $y$, defined by (3.2.1)-(3.2.2), is given by any solution $K_o$ of the so-called normal equations

$K_o R_y = R_{xy}$,   (3.2.3)

where $R_y = E\,yy^*$ and $R_{xy} = E\,xy^*$. The corresponding minimum-mean-square-error matrix (or error covariance matrix) is

$P(K_o) = R_x - K_o R_{yx}$.   (3.2.4)
Proof: $K_o$ is a solution of the optimization problem (3.2.1)-(3.2.2) if, and only if, for all row vectors $a$, $aK_o$ is a minimum of

$E\,|ax - aKy|^2 = a\,P(K)\,a^*$.

Note that $a\,P(K)\,a^*$ is a scalar function of the complex-valued (row) vector quantity $aK$. Then (see App. A.6) differentiating $a\,P(K)\,a^*$ with respect to $aK$ and setting the derivative equal to zero at $aK = aK_o$ leads to the equations $a\,(K_o R_y - R_{xy}) = 0$ for every $a$, i.e., to the normal equations (3.2.3). The corresponding minimum-mean-square-error (or, m.m.s.e. for short) matrix is

$P(K_o) = R_x - K_o R_{yx}$,

where

$a\,P(K)\,a^* \geq a\,P(K_o)\,a^*$ for every $K$ and for every row vector $a$.

That is, $K_o$ also minimizes the mean-square error in the estimator of each component of the vector $x$.

The solution of the normal equations when $R_y$ is invertible is given by the following theorem.
Theorem 3.2.2 (Unique Solutions) Assume that $R_y > 0$ (a positive-definite square matrix, hence invertible). Then the optimum choice $K_o$ that minimizes $P(K)$ is given uniquely by

$K_o = R_{xy} R_y^{-1}$.

The m.m.s.e. (see (3.2.4)) can be written as

$P(K_o) = R_x - R_{xy} R_y^{-1} R_{yx}$,

which is always nonnegative-definite (being the covariance matrix of the error $x - K_o y$).
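As a sanity check on Theorem 3.2.2, here is a minimal numerical sketch (my own illustration, assuming NumPy; the partition sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a positive-definite joint covariance for (x, y), x in R^2, y in R^3.
A = rng.standard_normal((5, 5))
R = A @ A.T                       # joint covariance [[Rx, Rxy], [Ryx, Ry]]
Rx, Rxy = R[:2, :2], R[:2, 2:]
Ryx, Ry = R[2:, :2], R[2:, 2:]

Ko = Rxy @ np.linalg.inv(Ry)      # optimal gain: Ko = Rxy Ry^{-1}
P = Rx - Ko @ Ryx                 # m.m.s.e.: P(Ko) = Rx - Ko Ryx

assert np.allclose(Ko @ Ry, Rxy)                    # normal equations hold
assert np.all(np.linalg.eigvalsh(P) >= -1e-10)      # P is nonnegative-definite
```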
3.2.2 Stochastic Interpretation of Triangular Factorization (LDL*, UDU*, Schur Complements)
The result of the previous section can also be viewed through block matrices, triangular factorization, and Schur complements (an alternative, convenient route to the solution). Assuming $R_y > 0$,

$P(K_o) = R_x - R_{xy} R_y^{-1} R_{yx}$,

i.e., $P(K_o)$ is the Schur complement of $R_y$ in the joint covariance matrix

$R = E \begin{bmatrix} x \\ y \end{bmatrix} \begin{bmatrix} x^* & y^* \end{bmatrix} = \begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix}$.

This joint covariance matrix admits convenient block LDL* and UDU* factorizations. When $R_y > 0$, the UDU* decomposition is

$\begin{bmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{bmatrix} = \begin{bmatrix} I & K_o \\ 0 & I \end{bmatrix} \begin{bmatrix} P(K_o) & 0 \\ 0 & R_y \end{bmatrix} \begin{bmatrix} I & 0 \\ K_o^* & I \end{bmatrix}$,

which follows from the representation of the pair of correlated random variables $\{x, y\}$ in terms of the (obviously uncorrelated) pair $\{x - K_o y,\ y\}$:

$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} I & K_o \\ 0 & I \end{bmatrix} \begin{bmatrix} x - K_o y \\ y \end{bmatrix}$,

where $K_o = R_{xy} R_y^{-1}$.
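A quick numerical check of the block UDU* factorization (same kind of illustrative NumPy setup as above; in the real case ${}^*$ is plain transposition):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
R = A @ A.T                                   # joint covariance of (x, y)
Rx, Rxy, Ryx, Ry = R[:2, :2], R[:2, 2:], R[2:, :2], R[2:, 2:]

Ko = Rxy @ np.linalg.inv(Ry)
P = Rx - Ko @ Ryx                             # Schur complement of Ry in R

U = np.block([[np.eye(2), Ko], [np.zeros((3, 2)), np.eye(3)]])
D = np.block([[P, np.zeros((2, 3))], [np.zeros((3, 2)), Ry]])

assert np.allclose(U @ D @ U.T, R)            # R = U D U*
```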
The same picture extends to estimating one random variable from several: say, estimating $x$ from both $y$ and $z$. Partitioning the joint covariance of $\{x, y, z\}$ accordingly, one can verify that the covariance matrix of the error in estimating $x$ given both $y$ and $z$ is the Schur complement

$P = R_x - \begin{bmatrix} R_{xy} & R_{xz} \end{bmatrix} \begin{bmatrix} R_y & R_{yz} \\ R_{zy} & R_z \end{bmatrix}^{-1} \begin{bmatrix} R_{yx} \\ R_{zx} \end{bmatrix}$.

This result is useful for the problem of combining estimators based on different observations, which we take up in Sec. 3.4.3 and Prob. 3.23.
3.2.3 Singular Data Covariance Matrices
So far we have assumed $R_y > 0$; equivalently, $R_y$ is invertible, equivalently the entries of $y$ are linearly independent. Indeed, if the entries of $y$ are linearly independent, assuming $R_y$ singular leads to a contradiction: there would be a vector $a \neq 0$ with $a^* R_y a = E\,|a^* y|^2 = 0$, forcing $a^* y = 0$; hence under that assumption $R_y$ must be positive definite.

Suppose now that the entries of $y$ may be linearly dependent, so that $R_y$ is genuinely singular, and ask what the solutions of the normal equations look like. It turns out that $R_y > 0$ is not necessary: although the normal equations then admit many solutions, the optimal estimate and the minimum cost are unique, as the following theorem shows.
Theorem 3.2.3 (Non-unique Solutions) Even if $R_y$ is singular, the normal equations $K_o R_y = R_{xy}$ will be consistent, and there will be many solutions $K_o$. No matter which solution $K_o$ is used, the corresponding l.l.m.s. estimator $\hat{x} = K_o y$ will, however, be unique, and so of course will the m.m.s.e. $P(K_o)$.
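A sketch of the singular case (the pseudo-inverse choice below is my own; any solution of the normal equations yields the same estimate):

```python
import numpy as np

rng = np.random.default_rng(2)

# Make y have linearly dependent entries: y3 = y1 + y2, so Ry is singular.
B = rng.standard_normal((2, 4))
C = B @ B.T                                 # covariance of the basis (y1, y2)
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Ry = T @ C @ T.T                            # 3x3 covariance of y, rank 2
Rxy = rng.standard_normal((1, 2)) @ C @ T.T # consistent cross-covariance

# Two different solutions of the normal equations Ko Ry = Rxy:
Ko1 = Rxy @ np.linalg.pinv(Ry)              # minimum-norm solution
null = np.array([[1.0, 1.0, -1.0]])         # null vector of Ry (Ry @ null.T = 0)
Ko2 = Ko1 + 0.7 * null                      # another valid solution

assert np.allclose(Ko1 @ Ry, Rxy) and np.allclose(Ko2 @ Ry, Rxy)
# The gains differ, yet they agree on every realizable observation
# y = T @ (y1, y2), so the estimator Ko y itself is unique:
assert np.allclose(Ko1 @ T, Ko2 @ T)
```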
3.2.4 Nonzero-Mean Values and Centering
The discussion so far, and the normal equations, assumed that $x$ and $y$ have zero means. If the means $\bar{x} = E\,x$ and $\bar{y} = E\,y$ are nonzero, we center the variables, i.e., apply the affine change of variables $x - \bar{x}$, $y - \bar{y}$. The covariance and cross-covariance matrices are then

$R_y = E\,(y - \bar{y})(y - \bar{y})^*$, the covariance matrix of $y$,

$R_{xy} = E\,(x - \bar{x})(y - \bar{y})^*$, the cross-covariance matrix of $x$ and $y$,

and the optimal estimate is

$\hat{x} - \bar{x} = K_o\,(y - \bar{y})$, or, equivalently, $\hat{x} = \bar{x} + R_{xy} R_y^{-1} (y - \bar{y})$.

Strictly speaking, then, the linear least-mean-squares estimator of $x$ given $y$ is an affine function of $y$ rather than a linear one. However, it is easy to justify continuing to call $\hat{x}$ a linear function of $y$.
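A minimal sketch of the centered (affine) estimator using sample statistics in place of the exact moments (the function name and the sample-based setup are my own):

```python
import numpy as np

def affine_llmse(x_samples, y_samples, y_new):
    """x_hat = x_bar + Rxy Ry^{-1} (y_new - y_bar), with all moments
    replaced by their sample estimates."""
    x_bar = x_samples.mean(axis=0)
    y_bar = y_samples.mean(axis=0)
    Xc, Yc = x_samples - x_bar, y_samples - y_bar
    n = len(y_samples)
    Ry = Yc.T @ Yc / n               # sample covariance of y
    Rxy = Xc.T @ Yc / n              # sample cross-covariance of x and y
    return x_bar + Rxy @ np.linalg.solve(Ry, y_new - y_bar)
```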
3.2.5 Estimators for Complex-Valued Random Variables

There are two ways to handle complex-valued random variables: reduce the problem to real random variables, or work directly with the complex variables. Either way, the normal equations take the same form; the only difference is whether we work with the given vectors $\{x, y\}$ or with their extended (real and imaginary part) versions $\{x_R, x_I, y_R, y_I\}$.
3.3 A GEOMETRIC FORMULATION
3.3.1 The Orthogonality Condition

The normal equations (3.2.3) can be rewritten as

$E\,(x - K_o y)\,y^* = 0$,   (3.3.2)

which admits a geometric reading: the estimation error is orthogonal to the data, so $\hat{x} = K_o y$ is the projection of $x$ onto the linear space spanned by the entries of $y$.

Compare this with the geometric view of deterministic least squares in Ch. 2. Note that the projection spaces are different: in Figure 3.1 the entries of $K_o$ are the coefficients and the random variables $\{y_i\}$ span the space, whereas in Figure 2.1 the entries of the parameter estimate are the coefficients and the columns of the data matrix span the space.
This geometric viewpoint is stated as a lemma:

Lemma 3.3.1 (The Orthogonality Condition) The linear least-mean-squares estimator (l.l.m.s.e.) of a random variable $x$ given a set of other random variables $\{y_i\}$ is characterized by the fact that the error in the estimator is orthogonal to (i.e., uncorrelated with) each of the random variables used to form the estimator. Equivalently, the l.l.m.s.e. is the projection of $x$ onto $\mathcal{L}\{y\}$.

Projection onto the linear space (which we denote here by $\mathcal{L}\{y\}$) has the important properties of linearity and idempotence: the projection of a linear combination of random variables is the same linear combination of their projections, and projecting an element already in $\mathcal{L}\{y\}$ returns that element itself. These geometrically intuitive properties can be formally verified by using the explicit formula $\hat{x} = R_{xy} R_y^{-1} y$.
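The equivalence between the orthogonality condition and the normal equations is a one-line computation:

$E\,(x - K_o y)\,y^* = E\,xy^* - K_o\,E\,yy^* = R_{xy} - K_o R_y = 0 \iff K_o R_y = R_{xy}$.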
3.3.2 Examples
3.4 LINEAR MODELS
An extremely important special case, which will often arise in our analysis, occurs when $x$ and $y$ are linearly related, say as

$y = H x + v$,

where $H$ is a known matrix and $v$ is a zero-mean random-noise vector uncorrelated with $x$. Assume that $R_x$ and $R_v$ are known and also that $R_v > 0$. Then the l.l.m.s.e. and the corresponding m.m.s.e. can be written as

$\hat{x} = R_x H^* (H R_x H^* + R_v)^{-1}\, y$

and

$P = R_x - R_x H^* (H R_x H^* + R_v)^{-1} H R_x$.
These formulas will be encountered in many different contexts in later chapters.
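A small numerical sketch of the linear-model formulas (the dimensions and NumPy usage are my own illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

n, m = 2, 4                                  # x in R^n, y in R^m
H = rng.standard_normal((m, n))              # known model matrix
Rx = np.eye(n)                               # prior covariance of x
Rv = 0.5 * np.eye(m)                         # noise covariance, Rv > 0

S = H @ Rx @ H.T + Rv                        # Ry = H Rx H* + Rv
K = Rx @ H.T @ np.linalg.inv(S)              # gain: Rx H* (H Rx H* + Rv)^{-1}
P = Rx - K @ H @ Rx                          # m.m.s.e. covariance

# Draw one realization and form the estimate x_hat = K y.
x = rng.multivariate_normal(np.zeros(n), Rx)
v = rng.multivariate_normal(np.zeros(m), Rv)
y = H @ x + v
x_hat = K @ y
```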
3.4.1 Information Forms When $R_x > 0$ and $R_v > 0$

We may remark that formulas using inverses of covariance matrices are sometimes called information form results because, loosely speaking, the amount of information obtained by observing a random variable varies inversely with its variance.

Using the matrix inversion lemma,

$(A + BCD)^{-1} = A^{-1} - A^{-1} B\,(C^{-1} + D A^{-1} B)^{-1} D A^{-1}$,

the m.m.s.e. can be re-expressed in information form as

$P^{-1} = R_x^{-1} + H^* R_v^{-1} H$,

which leads to the nice formula

$\hat{x} = P H^* R_v^{-1}\, y$.
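The gain formula follows from a push-through identity (a short verification, added since the notes skip this step): multiplying out,

$H^* R_v^{-1} (H R_x H^* + R_v) = (R_x^{-1} + H^* R_v^{-1} H)\, R_x H^*$,

and multiplying by the appropriate inverses on both sides gives

$P H^* R_v^{-1} = (R_x^{-1} + H^* R_v^{-1} H)^{-1} H^* R_v^{-1} = R_x H^* (H R_x H^* + R_v)^{-1}$,

which is exactly the gain in the l.l.m.s.e. formula above.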
3.4.2 The Gauss-Markov Theorem
3.4.3 Combining Estimators

Suppose we wish to estimate an unknown random variable from several separate observations:
Lemma 3.4.1 (Combining Estimators) Let $y$ and $z$ be two separate observations of a zero-mean random variable $x$, such that

$y = H_1 x + v_1$ and $z = H_2 x + v_2$,

where $\{v_1, v_2\}$ are mutually uncorrelated zero-mean random variables, uncorrelated with $x$, with covariance matrices $R_1$ and $R_2$, respectively. Denote by $\hat{x}_1$ and $\hat{x}_2$ the l.l.m.s. estimators of $x$ given $y$ and given $z$, respectively, and likewise define the error covariance matrices $P_1$ and $P_2$. Then the l.l.m.s. estimator of $x$ given both $y$ and $z$ can be found as

$\hat{x} = P\,(P_1^{-1} \hat{x}_1 + P_2^{-1} \hat{x}_2)$,

where $P$, the corresponding error covariance matrix, is given by

$P^{-1} = P_1^{-1} + P_2^{-1} - R_x^{-1}$.
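A numerical sanity check of the combination formula (the setup dimensions are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2
Rx = np.eye(n)                                # prior covariance of x, Rx > 0
H1 = rng.standard_normal((3, n)); R1 = 0.4 * np.eye(3)
H2 = rng.standard_normal((4, n)); R2 = 0.6 * np.eye(4)

def llmse(H, Rv):
    """Gain and error covariance for the model obs = H x + v."""
    K = Rx @ H.T @ np.linalg.inv(H @ Rx @ H.T + Rv)
    return K, Rx - K @ H @ Rx

K1, P1 = llmse(H1, R1)
K2, P2 = llmse(H2, R2)

# Direct joint estimate from the stacked observation [y; z]:
Hs = np.vstack([H1, H2])
Rs = np.block([[R1, np.zeros((3, 4))], [np.zeros((4, 3)), R2]])
Ks, Ps = llmse(Hs, Rs)

# Lemma 3.4.1: P^{-1} = P1^{-1} + P2^{-1} - Rx^{-1}, and
# x_hat = P (P1^{-1} x_hat1 + P2^{-1} x_hat2).
Pinv = np.linalg.inv(P1) + np.linalg.inv(P2) - np.linalg.inv(Rx)
assert np.allclose(np.linalg.inv(Pinv), Ps)

y = rng.standard_normal(3); z = rng.standard_normal(4)   # any observed values
x1, x2 = K1 @ y, K2 @ z
x_comb = np.linalg.inv(Pinv) @ (np.linalg.inv(P1) @ x1 + np.linalg.inv(P2) @ x2)
assert np.allclose(x_comb, Ks @ np.concatenate([y, z]))
```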
3.5 EQUIVALENCE WITH DETERMINISTIC LEAST SQUARES
Appendix for Chapter 3
3.A LEAST-MEAN-SQUARES ESTIMATION
In this appendix we consider the more general problem of determining a possibly nonlinear function $h(\cdot)$ that provides the best estimate, in the least-mean-squares sense, of a random variable $x$ from observations of another random variable $y$.

Define the error random variable $\tilde{x} = x - h(y)$. The least-mean-squares criterion minimizes the "variance" of this error variable, i.e., the cost function (whose value is the (error) covariance matrix)

$P(h) = E\,\tilde{x}\tilde{x}^* = E\,(x - h(y))(x - h(y))^*$.
Theorem 3.A.1 (The Optimal Least-Mean-Squares Estimator) The optimal least-mean-squares (l.m.s.) estimator of a random variable $x$, given the value of another random variable $y$, is given by the conditional expectation

$\hat{x} = E[x \mid y]$.

In particular, if $x$ and $y$ are independent random variables, then the optimal estimator of $x$ is its mean, $\hat{x} = E[x] = \bar{x}$.
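A Monte-Carlo sketch contrasting the conditional-mean estimator with the best linear one (the nonlinear model $x = y^2 + w$ is my own illustration; here $x$ and $y$ are dependent but uncorrelated, so the linear estimate degenerates to the mean):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000

# Nonlinear dependence: x = y**2 + w, so E[x | y] = y**2 (w zero-mean, indep.).
y = rng.standard_normal(N)
w = 0.3 * rng.standard_normal(N)
x = y**2 + w

mse_cond = np.mean((x - y**2) ** 2)   # conditional-mean estimator, MSE ~ 0.09
# Here E[(x - E[x]) y] = E[y^3] = 0, so the l.l.m.s. estimate reduces to the
# mean E[x] = 1, with MSE ~ Var(x) ~ 2.09.
mse_lin = np.mean((x - 1.0) ** 2)

print(mse_cond, mse_lin)              # the conditional mean wins decisively
```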