| Software | Version |
|---|---|
| Windows | 10 |
| Python | 3.6 |
| PyCharm | Professional 2021.1.1 |
| matplotlib | 3.4.3 |
| pandas | 1.3.3 |
| Anaconda | Anaconda3-2021.05 |
(Using f_regression from scikit-learn's feature_selection module) select the features with a large F-value or a small p-value.
We won't go deep into F and p here; interested readers can study the chapters on hypothesis testing in a statistics textbook. Below is a brief explanation of the F-statistic (F-value) and the p-value:
The methods based on the F-test estimate the degree of linear dependency between two random variables.
A p-value is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true.
F-test with f_regression:
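import numpy as np
from sklearn.feature_selection import f_regression
# 200 samples, 6 columns: random integers in [2, 1000) scaled by one random factor
Data_01 = np.random.rand() * np.random.randint(2, 1000, (200, 6))
f_regression(Data_01[:, 0:5], Data_01[:, 5])  # returns the F-values and their p-values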
(array([0.44398786, 1.03851596, 0.31914032, 0.15589646, 2.33403656]),
array([0.50597956, 0.30941044, 0.57276423, 0.693388 , 0.12816918]))
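How the F-value is computed:
# centered target column
Y1 = Data_01[:, 5] - np.mean(Data_01[:, 5], axis=0)
# Pearson correlation coefficient between each feature column and the target
cor_1 = [np.dot(Data_01[:, k] - np.mean(Data_01[:, k], axis=0), Y1)
         / (Data_01[:, k].std(0) * Data_01[:, 5].std(0) * len(Data_01[:, 5])) for k in range(5)]
# Under the normality assumption, the correlation between two variables is usually tested with an F-test, F(1, n-2)
F = ((np.array(cor_1) ** 2) / (1 - (np.array(cor_1) ** 2))) * (len(Data_01[:, 5]) - 2)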
F
array([0.44398786, 1.03851596, 0.31914032, 0.15589646, 2.33403656])
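How the p-value is computed:
from scipy.stats import f
p = []
for k in F:
    # upper-tail probability of the F(1, n-2) distribution at the observed F-value
    p.append(1 - f.cdf(k, 1, len(Data_01[:, 5]) - 2))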
p
[0.5059795615561491,
0.30941043833696547,
0.5727642294156681,
0.6933879956863616,
0.12816917757410518]
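The two manual steps above can be condensed into a vectorized check against f_regression. This is only a sketch: the seeded generator `rng`, the scaling by 100, and the names `F_manual` and `p_manual` are assumptions made for illustration, not part of the original walkthrough. It uses `f.sf` (the survival function), which equals `1 - f.cdf` but tends to be numerically safer for very small p-values.

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)  # fixed seed (assumption) so the run is repeatable
X = rng.random((200, 5)) * 100  # 5 illustrative feature columns
y = rng.random(200) * 100       # illustrative target column

n = X.shape[0]

# Pearson correlation of every feature column with the target
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# F statistic with (1, n - 2) degrees of freedom
F_manual = r ** 2 / (1 - r ** 2) * (n - 2)

# Upper-tail probability of F(1, n - 2) at each observed F-value
p_manual = f_dist.sf(F_manual, 1, n - 2)

# Compare with scikit-learn's result
F_sk, p_sk = f_regression(X, y)
print(np.allclose(F_manual, F_sk), np.allclose(p_manual, p_sk))
```

If both comparisons print True, the hand computation and f_regression agree up to floating-point error.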
# Using the F-test
from sklearn.datasets import load_diabetes
dia_data=load_diabetes()
f_regression(dia_data.data,dia_data.target)
(array([ 16.10137401, 0.81742349, 230.65376449, 106.52138379,
20.71056745, 13.74607917, 81.23965868, 100.06926441,
207.27209108, 75.3996832 ]),
array([7.05568615e-05, 3.66429295e-01, 3.46600645e-42, 1.64853275e-22,
6.92071179e-06, 2.35984810e-04, 6.16286470e-18, 2.30425328e-21,
8.82375416e-39, 7.58008327e-17]))
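Keep only the features whose p-value is less than or equal to the 0.05 significance level, because only then can the null hypothesis be rejected (for example, the null hypothesis that the feature column has no linear correlation with the target column, i.e. the response variable).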
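A minimal sketch of how this selection rule might be applied to the diabetes data. The names `alpha`, `mask`, and `X_selected` are illustrative assumptions. The second option uses scikit-learn's `SelectFpr`, which keeps features whose p-value is strictly below `alpha`; that only differs from the `<=` rule if a p-value equals 0.05 exactly.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFpr, f_regression

dia_data = load_diabetes()
X, y = dia_data.data, dia_data.target

# Option 1: threshold the p-values directly
F_values, p_values = f_regression(X, y)
alpha = 0.05  # significance level
mask = p_values <= alpha
X_selected = X[:, mask]

# Option 2: let SelectFpr apply the same kind of threshold
selector = SelectFpr(score_func=f_regression, alpha=alpha)
X_selected_fpr = selector.fit_transform(X, y)

print(np.flatnonzero(mask))                    # indices of the kept features
print(X_selected.shape, X_selected_fpr.shape)  # both selections should match
```

With the p-values listed above, only the second feature (p ≈ 0.366) exceeds the threshold, so the other nine columns are kept.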
F-test with f_classif:
from sklearn.feature_selection import f_classif
from sklearn.datasets import load_breast_cancer
cancer_data=load_breast_cancer()
f_classif(cancer_data.data,cancer_data.target)
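For a classification target, f_classif computes a one-way ANOVA F-value per feature: the feature's values are grouped by class label and the between-group variance is compared with the within-group variance. The sketch below checks the first feature against `scipy.stats.f_oneway`; the names `groups`, `F0`, and `p0` are assumptions introduced for illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

# One F-value and one p-value per feature column
F_values, p_values = f_classif(X, y)

# Recompute the first feature by hand: split its values by class label
# and run a one-way ANOVA on the resulting groups
groups = [X[y == label, 0] for label in np.unique(y)]
F0, p0 = f_oneway(*groups)

print(F_values[0], F0)
print(np.isclose(F_values[0], F0))
```

As with f_regression, features with a large F-value or a small p-value are the ones worth keeping.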