一起养成写作习惯！这是我参与「掘金日新计划 · 4 月更文挑战」的第1天，点击查看活动详情。

RV coefficient是Pearson correlation coefficient平方的多元泛化
RV代表着R-Vector, 意思是向量版本的皮尔逊相关系数
RV系数取值介于0和1之间
它测量两组点的接近度, 每个点可以用矩阵表示
应用
- 常用于生物领域, 例如功能性神经成像
  - 测量两个受试者的脑部扫描相似性
  - 同一受试者不同时间扫描相似性
评价(来源知乎)
- 并没有把矩阵看作一种线性变换, 而只是作为高维数据的表现形式, 并从范数角度推出较为漂亮的表达式. 多用于生物数据的相关性计算
- 在高通量生物数据的处理中, RV correlation相关的方法被认为在低样本量时存在过高偏倚, 而Borzou A等推荐在小样本中通过矩阵分解消除这种影响(DOI: 10.1093/bioinformatics/btz281)

Notation

$\mathbf{Y} \in \mathbb{R}^{N \times D_x}$ : 原始的多维数组
$\mathbf{X} \in \mathbb{R}^{N\times D_y}$ : 预处理后的多维数组, 对 $\mathbf{Y}$ 的每列进行中心化
$\mathbf{S} \in \mathbb{R}^{N \times N}$ : 半正定矩阵, 使用 $\mathbf{S} = \mathbf{X}\mathbf{X}^{\top}$ 计算获得
- 是正方形的, 对称的, 并且它们的对角元素总是大于或等于零
$\operatorname{tr}(\cdot)$ : 矩阵的迹运算符
协方差:
- $\Sigma_{\mathbf{X}}=\operatorname{cov}(\mathbf{X}, \mathbf{X})=\mathbf{XX}^{\top}$ , $\mathbf{X}$ 要每行中心化
- $\Sigma_{\mathbf{XY}}=\operatorname{cov}(\mathbf{X}, \mathbf{Y})=\mathbf{XX}^\top\mathbf{YY}^\top$
Frobenius范数(又名 Hilbert-Schmidt 范数)

$||\Sigma_\mathbf{xy}|| *{\mathcal{F}}^2 = \sum_i \lambda_i^2 = \text{tr}\left( \Sigma*\mathbf{xy}^\top \Sigma_\mathbf{xy} \right)=\sum_i\sum_ja_{ij}^2$
$\operatorname{vec}(\cdot)$ 将一个矩阵转变为向量

RV系数计算公式

预处理多维矩阵, 每列进行中心化

$\mathbf{X}_i = \mathbf{Y}*i - \mu*{\mathbf{Y}_i}$
- $\mu \in \mathbb{R}^{1\times D}$
计算半正定矩阵

$\mathbf{S}_i=\mathbf{X}_i\mathbf{X}_i^\top$
计算RV系数

$RV=\frac{\operatorname{tr}(\mathbf{S}_i^\top\mathbf{S}_j)}{\sqrt{\operatorname{tr}(\mathbf{S}_i^\top\mathbf{S}_i)\operatorname{tr}(\mathbf{S}_j^\top\mathbf{S}_j)}}$

RV系数推导

Pearson相关系数的定义

$\rho_{X_i, X_j}=\frac{\operatorname{cov}(X_i, X_j)}{\sigma_{{X}*i} \sigma*{{X}_j}}$
RV系数借鉴了Pearson相关系数的的定义

$\begin{aligned} RV &= \frac{\operatorname{covv}\left(\mathbf{X}_i, \mathbf{X}_j\right)}{\sqrt{\operatorname{vav}(\mathbf{X}_i)\operatorname{vav}(\mathbf{X}*j)}}\\ &=\frac{\operatorname{tr}(\Sigma*{\mathbf{X}_i\mathbf{X}*j}\Sigma*{\mathbf{X}_j\mathbf{X}*i})}{\sqrt{\operatorname{tr}(\Sigma*{\mathbf{X}*i}^2)\operatorname{tr}(\Sigma*{\mathbf{X}*j}^2)}}\\ &=\frac{\left<\mathbf{S}* {*i},\mathbf{S}* {*j}\right>}{\| \mathbf{S}* {*i} \| \| \mathbf{S}* {_j} \|}\\ &=\frac{\operatorname{tr}(\mathbf{S}_i^\top\mathbf{S}_j)}{\sqrt{\operatorname{tr}(\mathbf{S}_i^\top\mathbf{S}_i)\operatorname{tr}(\mathbf{S}_j^\top\mathbf{S}_j)}} \end{aligned}$
- $\operatorname{covv}()$ 表示的是 $\mathbf{X}_i$ 和 $\mathbf{X}_j$ 之间的协方差
- $\operatorname{vav}()$ 表示的是 $\mathbf{X}_i$ 或 $\mathbf{X}_j$ 之间的方差
- 如果 $\mathbf{X}_i\in\mathbb{R}^{N \times 1}, \mathbf{X}_j\in\mathbb{R}^{N \times 1}$ , $RV=\rho^2$
- $0\le RV(\mathbf{X}_i,\mathbf{X}_j)\le1$
- 当且仅当 $\Sigma_{\mathbf{X}_i,\mathbf{X}_j}=0$ 时, $RV=0$ , 表明其没有线性关系不代表其独立
- $RV(\mathbf{X}_i, a\mathbf{BX}_i+\mathbf{c})=1$ , $\mathbf{B}$ 为正交矩阵, $a$ 为常数, $\mathbf{c}$ 为常数向量时成立. RV系数通过移位, 旋转和整体缩放后保持不变
代数的观点
- 分子对应了半正定矩阵之间的标量积, 因此给出了这组矩阵向量空间的结构
- 分母被称为Frobenius或Schur或Hilbert-Schmidt矩阵标量积
- RV系数是矩阵之间的余弦
- 这个矩阵空间结构代表着RV系数的数学性质

RV系数检验

对给定的值进行统计检验, 确定一个系数的值是否仅仅是偶然获得的
提出假设
- $H_0:RV=0$ , 两个矩阵没有线性相关性
- $H_1:RV>0$ , 两个矩阵具有线性相关性
置换检验
- 在这个框架中, 对矩阵 $\mathbf{X}$ 的每一列进行置换, 并用它来创建矩阵 $\mathbf{S}$ . 置换检验的分布的均值和方差可以直接通过 $\mathbf{S}$ 计算得到
- 计算四个量
  
  $\begin{aligned}\beta_{\mathbf{S}}&=\frac{\left(\sum_{\ell}^{L} \mathbf{*S} \lambda*{\ell}\right)^{2}}{\sum_{\ell}^{L} \mathbf{*S} \lambda*{\ell}^{2}}=\frac{\operatorname{tr}(\mathbf{S})^{2}}{\operatorname{tr}(\mathbf{S \cdot S})}\\\delta_{\mathbf{S}}&=\frac{\sum_{i}^{N} s_{i, i}^{2}}{\sum_{\ell}^{L} \mathbf{*S} \lambda*{\ell}^{2}}=\frac{\operatorname{tr}(\mathbf{S}^2)}{\operatorname{tr}(\mathbf{S \cdot S})}\\\alpha_{\mathbf{S}}&=N-1-\beta_{\mathbf{S}}\\C_{\mathbf{S}}&=\frac{(N-1)\left[N(N+1) \delta_{\mathbf{S}}-(N-1)\left(\beta_{\mathbf{S}}+2\right)\right]}{\alpha_{\mathbf{S}}(N-3)}\end{aligned}$
  - $\beta_s$ 称为sphericity index(球形指数), 用于估计一般线性模型的多变量测试的自由度, 其取决于 $\mathbf{S}$ 矩阵的特征值, 用 $_\mathbf{S}\lambda_l$ 表示
- 计算均值和方差
  
  $\begin{aligned}E\left(RV\right)&= \frac{\sqrt{\beta_{\mathbf{S}*i} \beta*{\mathbf{S}*j}}}{N-1}\\V\left(RV\right)&= \alpha*{\mathbf{S}*i} \alpha*{\mathbf{S}*i} \times \frac{2 N(N-1)+(N-3) C*{\mathbf{S}*i} C*{\mathbf{S}_i}}{N(N+1)(N-2)(N-1)^{3}}\end{aligned}$
- 计算Z分数
  
  $Z_{RV}=\frac{RV-E\left(RV\right)}{\sqrt{V\left(RV\right)}}$
  - 置换系数的抽样分布类似于正态分布(尽管通常不是), 使用Z标准来执行零假设检验或计算置信区间, 例如给准则可用于检验零假设, 即观测到的R值是由于偶然性造成的
- 使用Z分数获取置信度

modified-RV coefficient

RV系数在某些情况下(高维问题, n=20和p=q=100), 可能会出现接受原假设, 但是RV系数很大的情况, RV系数和样本上的数量有关系

特征数量分别为113和130, 左侧色带为min-max, 右侧色带为std
modified-RV coefficient是为了修改, 以减小在高维情况下的偏差
通过使用随机矩阵理论计算两个独立的正态随机矩阵X和Y在系数为零的情况下的期望值, 表明问题可以追溯到矩阵 $\mathbf{S}$ 的对角元素. 因此通过去除这些元素将低偏差
取值介于-1和1之间
对于固定的n值, 他们模拟了两个互不相关的矩阵, 并慢慢增加了两组之间的相关性. modified-RV在0到1之间变化, 而RV在0.85到0.99之间变化
modified-RV 公式

$\begin{aligned}\tilde{\mathbf{S}}_i &= \mathbf{S}_i - \operatorname{diag}(\mathbf{S}*i)\\RV*{mod} & = \frac{\operatorname{tr}(\tilde{\mathbf{S}}_i^\top\tilde{\mathbf{S}}_j)}{\sqrt{\operatorname{tr}(\tilde{\mathbf{S}}_i^\top\tilde{\mathbf{S}}_i)\operatorname{tr}(\tilde{\mathbf{S}}_j^\top\tilde{\mathbf{S}}_j)}}\\\end{aligned}$

例子

多个专家对红酒测试的例子
- 三位专家对红酒以九分制进行评价
- 一共评价六款来自同一批的红酒(不同的桶中酿造), 其中1,5,6是同一种橡木塞; 2,3,4是同一种橡木塞
该数据在The STATIS Method算出的RV有问题, 在RV Coefficient and Congruence Coefficient一节中expert1和expert3计算的RV值进行了更正

初始化矩阵

import numpy as np
y1 = np.array([[1, 6, 7], [5, 3, 2], [6, 1, 1], [7, 1, 2], [2, 5, 4], [3, 4, 4]])
y2 = np.array([[2, 5, 7, 6], [4, 4, 4, 2], [5, 2, 1, 1], [7, 2, 1, 2], [3, 5, 6, 5], [3, 5, 4, 5]])
y3 = np.array([[3, 6, 7], [4, 4, 3], [7, 1, 1], [2, 2, 2], [2, 6, 6], [1, 7, 5]])

对数据进行预处理

每列中心化, 即按照特征进行中心化(the mean of each column is zero)

cent_func = lambda x:  x-x.mean(0,keepdims=True)
x1 = cent_func(y1)
x2 = cent_func(y2)
x3 = cent_func(y3)

手动计算

trans_func = lambda x:  np.dot(x, x.T)
s1 = trans_func(x1)
s2 = trans_func(x2)
s3 = trans_func(x3)

rv_func = lambda x,y: np.trace(np.dot(x.T, y)) / np.sqrt(np.trace(np.dot(x.T, x)) * np.trace(np.dot(y.T, y)))
rv_func(s1, s2), rv_func(s2, s3), rv_func(s1, s3)

库计算

hoggorm

import hoggorm as ho
ho.RVcoeff([x1,x2,x3])

自己写的函数

from RV import RVcoeff
RVcoeff([y1,y2,y3])

验证对于一维矩阵RV系数为person系数的平方

import math
from scipy.stats import pearsonr
math.isclose(
    np.sqrt(ho.RVcoeff([x1[:,[0]], x3[:,[0]]])[0,1]),
    pearsonr(x1[:,0],x3[:,0])[0]
)

参考

知乎-有哪些度量两个矩阵相关性程度的方法？
Encyclopedia of Measurement and Statistics
- The STATIS Method
- RV Coefficient and Congruence Coefficient
A Unifying Tool for Linear Multivariate Statistical Methods: The RV- Coefficient - DOI: 10.2307/2347233
Le Traitement des Variables Vectorielles - DOI: 10.2307/2529140
Measuring multivariate association and beyond - DOI: 10.1214/16-SS116
Testing the significance of the RV coefficient - DOI: 10.1016/j.csda.2008.06.012
Matrix correlations for high-dimensional data: the modified RV-coefficient - DOI: 10.1093/bioinformatics/btn634
GitHub-RV Coefficient
StackExchange-Correlation between matrices in R
WiKi-RV coefficient
Python-hoggorm package

矩阵相似度的度量——RV系数(RV Coefficient)

Notation

RV系数计算公式

RV系数推导

RV系数检验

modified-RV coefficient

例子

参考