# Table of Contents
- Introduction
- Data
- Backward Elimination
- Forward Selection

# Introduction
- Backward elimination and forward selection are both commonly used variable-selection methods for statistical models.
- Backward elimination removes explanatory variables step by step. It starts from the full model containing all explanatory variables and works toward the best model by successively dropping variables whose effect on the response is not significant. At each step the importance of each variable is judged with a statistical test (e.g. its p-value) or an information criterion (e.g. AIC or BIC), and non-significant variables are removed one at a time until a stopping rule is met.
- Forward selection adds explanatory variables step by step. It starts from an empty model and successively adds variables that are significantly associated with the response until a stopping rule is met. At each step the candidate variables are evaluated with a statistical test or an information criterion, and the variable with a significant effect on the response is added.
- The two methods differ in the direction of the search: backward elimination starts from the full model and deletes non-significant variables, while forward selection starts from the empty model and adds significant variables. Both judge variables by their importance and their effect on the response, with the goal of building the best predictive model.
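Both procedures implemented below use AIC (the Akaike information criterion) to decide which variable to drop or add. For a fitted model with $k$ estimated parameters and maximized likelihood $\hat{L}$,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

so a smaller AIC is better, which is why the code below treats a decrease in AIC as an improvement.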
# Data

The data come from Exercise 5.9 of 《应用回归分析 R语言版》 (Applied Regression Analysis, R edition).
 | x1 | x2 | x3 | x4 | x5 | x6 | y |
---|---|---|---|---|---|---|---|
0 | 1018.4 | 1607.0 | 138.2 | 96259 | 2239.1 | 50760 | 1132.26 |
1 | 1258.9 | 1769.7 | 143.8 | 97542 | 2619.4 | 39370 | 1146.38 |
2 | 1359.4 | 1996.5 | 195.5 | 98705 | 2976.1 | 44530 | 1159.93 |
3 | 1545.6 | 2048.4 | 207.1 | 100072 | 3309.1 | 39790 | 1175.79 |
4 | 1761.6 | 2162.3 | 220.7 | 101654 | 3637.9 | 33130 | 1212.33 |
5 | 1960.8 | 2375.6 | 270.6 | 103008 | 4020.5 | 34710 | 1366.95 |
6 | 2295.5 | 2789.0 | 316.7 | 104357 | 4694.5 | 31890 | 1642.86 |
7 | 2541.6 | 3448.7 | 417.9 | 105851 | 5773.0 | 44370 | 2004.82 |
8 | 2763.9 | 3967.0 | 525.7 | 107507 | 6542.0 | 47140 | 2122.01 |
9 | 3204.3 | 4585.8 | 665.8 | 109300 | 7451.2 | 42090 | 2199.35 |
10 | 3831.0 | 5777.2 | 810.0 | 111026 | 9360.1 | 50870 | 2357.24 |
11 | 4228.0 | 6484.0 | 794.0 | 112704 | 10556.5 | 46990 | 2664.90 |
12 | 5017.0 | 6858.0 | 859.4 | 114333 | 11365.2 | 38470 | 2937.10 |
13 | 5288.6 | 8087.1 | 1015.1 | 115823 | 13145.9 | 55470 | 3149.48 |
14 | 5800.0 | 10284.5 | 1415.0 | 117171 | 15952.1 | 51330 | 3483.37 |
15 | 6882.1 | 14143.8 | 2284.7 | 118517 | 20182.1 | 48830 | 4348.95 |
16 | 9457.2 | 19359.6 | 3012.6 | 119850 | 26796.0 | 55040 | 5218.10 |
17 | 11993.0 | 24718.3 | 3819.6 | 121121 | 33635.0 | 45821 | 6242.20 |
18 | 13844.2 | 29082.6 | 4530.5 | 122389 | 40003.9 | 46989 | 7407.99 |
19 | 14211.2 | 32412.1 | 4810.6 | 123626 | 43579.4 | 53429 | 8651.14 |
20 | 14599.6 | 33429.8 | 5262.0 | 124810 | 46405.9 | 50145 | 9875.95 |
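The functions below expect this table to be available as a pandas DataFrame named `df` with columns x1 through x6 and y. A minimal loading sketch (the file name `data_5_9.csv` is only an assumption; any way of building the DataFrame works):

```python
import pandas as pd

# Hypothetical CSV export of the table above (columns x1..x6 and y);
# the file name is an assumption, not part of the original post.
df = pd.read_csv("data_5_9.csv")
print(df.head())
```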
# Backward Elimination
```python
import pandas as pd
from statsmodels.formula.api import ols


def backward_select(data, target):
    variate = set(data.columns)
    variate.remove(target)                  # drop the response from the candidate set
    selected = list(variate)                # start from the full model: all explanatory variables
    current_score, best_new_score = float('inf'), float('inf')
    while len(selected) > 1:
        aic_with_variate = []
        for candidate in selected:          # try removing each variable in turn
            remaining = list(set(selected) - set([candidate]))
            formula = "{}~{}".format(target, "+".join(remaining))
            print(formula, candidate)
            aic = ols(formula=formula, data=data).fit().aic   # AIC of the model without `candidate`
            aic_with_variate.append((aic, candidate))
        aic_with_variate.sort(reverse=True)                   # sort descending so the smallest AIC is last
        print(aic_with_variate)
        best_new_score, worst_candidate = aic_with_variate.pop()   # removal that gives the lowest AIC
        if current_score > best_new_score:  # removing it lowers the AIC: accept the removal
            selected.remove(worst_candidate)
            current_score = best_new_score
            print("AIC is {}, continuing!".format(current_score))
        else:                               # no removal lowers the AIC: stop
            print("Feature selection is over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))
    print("Final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model


# df is the DataFrame built from the data table above (columns x1..x6 and y)
backward_select(df, 'y')
```
The backward procedure runs as follows:
- At each step, drop one variable at a time and look at the resulting AIC; if the AIC decreases, that variable can be eliminated.
- Step 1: removing x4 gives the largest drop in AIC, so x4 is eliminated.
```
y~x2+x3+x1+x4+x6 x5
y~x5+x3+x1+x4+x6 x2
y~x5+x2+x1+x4+x6 x3
y~x5+x2+x3+x4+x6 x1
y~x5+x2+x3+x1+x6 x4
y~x5+x2+x3+x1+x4 x6
[(306.66624815818585, 'x5'), (298.9675225755362, 'x1'), (287.19181720701107, 'x2'), (285.11520627825007, 'x6'), (284.6944317142424, 'x3'), (283.87339775474504, 'x4')]
AIC is 283.87339775474504, continuing!
```
- Step 2: by the same reasoning, x3 is eliminated.
```
y~x2+x6+x3+x1 x5
y~x5+x3+x6+x1 x2
y~x2+x5+x6+x1 x3
y~x2+x5+x3+x6 x1
y~x2+x5+x3+x1 x6
[(312.61603496923055, 'x5'), (302.63595803303264, 'x1'), (288.46527935366834, 'x2'), (283.42893816218475, 'x6'), (282.7821873389373, 'x3')]
AIC is 282.7821873389373, continuing!
```
- Step 3: by the same reasoning, x6 is eliminated.
```
y~x2+x6+x1 x5
y~x5+x6+x1 x2
y~x2+x5+x6 x1
y~x2+x5+x1 x6
[(310.7505406826555, 'x5'), (300.6425836373816, 'x1'), (295.8750357317772, 'x2'), (281.9870784132091, 'x6')]
AIC is 281.9870784132091, continuing!
```
- Step 4: removing any of the remaining variables gives an AIC higher than the current best, so no variable is removed and selection stops.
- The final model is y~x5+x2+x1.

```
y~x2+x1 x5
y~x5+x1 x2
y~x2+x5 x1
[(309.1033264049531, 'x5'), (298.6602125882606, 'x1'), (293.88328749955855, 'x2')]
Feature selection is over!
Final formula is y~x5+x2+x1
```
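Since `backward_select` returns a fitted statsmodels result object, the selected model can be inspected directly. A small usage sketch (assuming `df` from the data section):

```python
model = backward_select(df, "y")
print(model.params)     # estimated coefficients of the selected model y ~ x5 + x2 + x1
print(model.summary())  # full OLS regression table for the selected model
```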
# Forward Selection

Forward selection proceeds analogously to backward elimination.
```python
import pandas as pd
from statsmodels.formula.api import ols   # load the OLS model


# forward stepwise selection by AIC
def forward_selection(data, target):
    variate = set(data.columns)            # all column names as a set
    variate.remove(target)                 # drop the response from the candidate set
    selected = []                          # start from the empty model
    current_score, best_new_score = float('inf'), float('inf')   # AIC: smaller is better, so start at infinity
    while variate:
        aic_with_variate = []
        for candidate in variate:          # try adding each remaining variable in turn
            formula = "{}~{}".format(target, "+".join(selected + [candidate]))
            print(formula, candidate)
            aic = ols(formula=formula, data=data).fit().aic   # AIC of the model with `candidate` added
            aic_with_variate.append((aic, candidate))
        aic_with_variate.sort(reverse=True)                   # sort descending so the smallest AIC is last
        best_new_score, best_candidate = aic_with_variate.pop()   # addition that gives the lowest AIC
        if current_score > best_new_score:   # adding it lowers the AIC: accept the addition
            variate.remove(best_candidate)   # do not consider this variable again
            selected.append(best_candidate)  # keep it in the model
            current_score = best_new_score
            print("aic is {},continuing!".format(current_score))
        else:                                # no addition lowers the AIC: stop
            print("for selection over!")
            break
    formula = "{}~{}".format(target, "+".join(selected))      # final model formula
    print("final formula is {}".format(formula))
    model = ols(formula=formula, data=data).fit()
    return model


forward_selection(df, 'y')
```
- Step 1: adding x5 gives the lowest AIC, so x5 is added.
```
y~x4 x4
y~x5 x5
y~x6 x6
y~x1 x1
y~x2 x2
y~x3 x3
best candidate: x5
aic is 298.99768031566737,continuing!
```
- Step 2: similarly, x1 is added.
```
y~x5+x4 x4
y~x5+x6 x6
y~x5+x1 x1
y~x5+x2 x2
y~x5+x3 x3
best candidate: x1
aic is 293.88328749955855,continuing!
```
- Step 3: similarly, x2 is added.
```
y~x5+x1+x4 x4
y~x5+x1+x6 x6
y~x5+x1+x2 x2
y~x5+x1+x3 x3
best candidate: x2
aic is 281.9870784132091,continuing!
```
- Step 4: the best remaining candidate is x6, but adding it gives an AIC higher than the previous step's, so selection stops.
- The final model is y~x5+x1+x2.
```
y~x5+x1+x2+x4 x4
y~x5+x1+x2+x6 x6
y~x5+x1+x2+x3 x3
best candidate: x6
for selection over!
final formula is y~x5+x1+x2
```
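Both procedures arrive at the same model, y~x1+x2+x5. As a quick cross-check, refitting that formula directly should reproduce the AIC reported in both traces:

```python
from statsmodels.formula.api import ols

# Refit the selected model directly; its AIC should match the
# 281.9870784132091 reported by both selection traces above.
final = ols("y~x1+x2+x5", data=df).fit()
print(final.aic)
```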